SoftTrust Tiancheng Explains the Data Catalog with Observability Capabilities

Mondo Health Updated on 2024-03-01

In recent years, data governance and data observability have been adopted by more and more enterprises. Modern data systems offer users a rich set of functions and precise services, allowing data to be stored and queried freely in many forms. But as functions multiply and massive, fast-changing data flows between systems, problems such as data drift, schema changes, and data delays can arise, ultimately making it difficult for these systems to operate normally. It is therefore important for enterprises to have a data catalog with observability capabilities: accurate and timely information about your data can help you gain an edge in fierce competition.

Data observability ensures the quality, reliability, and availability of data products and business decisions by collecting the health status of data in a system and monitoring, detecting, and resolving data problems in real time. It is worth noting that data observability also provides alerting on data behavior and anomaly thresholds, giving organizations a more timely and accurate picture of the true state of system operations.

The four pillars of observability

Data observability is the sum of the key capabilities that keep data running and improve its health. The SoftTrust Data Catalog has native data observability capabilities built on the following four pillars:

1. Metadata: External characteristics of the data

Metadata is usually defined as "data about data," but in SoftTrust's view, metadata is information about data that exists independently of the data itself. It includes attributes such as data volume (row count), data structure (schema), and data timeliness (freshness).

Benefits: Understanding the structure of data is critical to improving data reliability and reducing data downtime. Metadata, together with the data's internal metrics, can be used to identify data quality issues and give the organization accurate information about its data.
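As a sketch of how a catalog might track these external attributes, the following checks compare a table's recorded row count, schema, and freshness against expectations. The table name, schema, and thresholds here are hypothetical examples, not part of the SoftTrust product.

```python
from datetime import datetime, timedelta

# Hypothetical metadata snapshot for a table, as a data catalog might record it.
table_meta = {
    "name": "orders",
    "row_count": 1_250_000,
    "schema": {"order_id": "bigint", "amount": "decimal", "created_at": "timestamp"},
    "last_updated": datetime(2024, 2, 29, 23, 55),
}

def check_freshness(meta, now, max_age_hours=24):
    """Flag the table as stale if it has not been updated within the window."""
    age = now - meta["last_updated"]
    return age <= timedelta(hours=max_age_hours)

def check_schema(meta, expected_schema):
    """Detect schema drift: added, removed, or retyped columns."""
    return meta["schema"] == expected_schema

now = datetime(2024, 3, 1, 12, 0)
print(check_freshness(table_meta, now))  # True: updated about 12 hours ago
```

Because these checks use only metadata, they can run cheaply and frequently without scanning the underlying table.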

2. Lineage: Dependencies between data

In the data world, the main internal interaction is deriving one dataset from another: upstream data produces downstream data. This is what we commonly call data lineage. Lineage describes in detail the data processing logic between systems, between tables, between columns within tables, and between values within columns.

Benefits: Through data lineage, we can fully understand the relationships between data, allowing us to trace a data quality problem to its upstream root cause and assess its downstream impact.
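To make root-cause and impact analysis concrete, here is a minimal sketch of table-level lineage as a directed graph, traversed in both directions. The table names and edges are invented for illustration only.

```python
from collections import defaultdict

# Hypothetical table-level lineage: each edge points from upstream to downstream.
edges = [
    ("raw_orders", "clean_orders"),
    ("clean_orders", "daily_revenue"),
    ("clean_orders", "customer_ltv"),
    ("raw_customers", "customer_ltv"),
]

downstream = defaultdict(set)
upstream = defaultdict(set)
for src, dst in edges:
    downstream[src].add(dst)
    upstream[dst].add(src)

def reachable(start, graph):
    """Walk the lineage graph transitively from a starting table."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

# Root cause: everything upstream of a broken report.
print(reachable("customer_ltv", upstream))   # {'clean_orders', 'raw_orders', 'raw_customers'}
# Impact: everything downstream of a bad source table.
print(reachable("raw_orders", downstream))   # {'clean_orders', 'daily_revenue', 'customer_ltv'}
```

The same traversal works at column level; only the node granularity changes.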

3. Metrics: Internal characteristics of the data

The internal characteristics of data reflect intrinsic properties of the stored data itself, including summaries of the data's distribution, its mean, standard deviation, and skewness, and whether it contains sensitive attributes.

Advantages: Metrics such as completeness, accuracy, and the presence of sensitive information describe the data itself. Monitoring these metrics in real time makes it possible to detect anomalies, raise timely alerts, and minimize data corruption.
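A common way to turn such metrics into alerts is a z-score threshold against recent history. This is a generic sketch, not SoftTrust's specific detection method; the metric values and threshold below are hypothetical.

```python
import statistics

# Hypothetical daily row counts for a table over the past two weeks.
history = [1000, 1020, 980, 1010, 995, 1005, 990, 1015, 1002, 998, 1008, 992, 1011, 1003]

def is_anomalous(value, history, z_threshold=3.0):
    """Alert when a metric deviates more than z_threshold standard deviations
    from its historical mean."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return abs(value - mean) > z_threshold * stdev

print(is_anomalous(1005, history))  # normal day -> False
print(is_anomalous(120, history))   # sudden drop -> True, trigger an alert
```

The same pattern applies to any numeric metric the catalog tracks: null rates, distinct counts, or distribution skew.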

4. Logs: Connections between data and users

Beyond the metrics that describe a dataset's internal state, the metadata that describes its external characteristics, and the lineage between datasets, we also describe how data interacts with the outside world. We break these interactions down into machine-data interactions and data-user interactions.

Advantages: Machine-data interactions include data movement and transformation, typically carried out by ELT tools and dbt jobs.

Data-user interactions cover the ways people work with data: data engineering teams creating new models, stakeholders consulting decision dashboards, or data engineers building machine learning models. Capturing these interactions helps users better understand the data and make the right decisions with it.
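One simple way to surface data-user interactions is to aggregate warehouse query logs into per-table usage counts. The log format and table names below are invented for illustration; real audit logs vary by warehouse.

```python
import re
from collections import Counter

# Hypothetical query log lines from a warehouse audit log.
log_lines = [
    "2024-03-01 09:12:01 user=alice query=SELECT * FROM daily_revenue",
    "2024-03-01 09:15:44 user=bob   query=SELECT amount FROM orders WHERE amount > 0",
    "2024-03-01 10:02:10 user=alice query=SELECT * FROM daily_revenue",
]

# Naive extraction of the table name following FROM (real SQL parsing is harder).
table_pattern = re.compile(r"FROM\s+(\w+)", re.IGNORECASE)

usage = Counter()
for line in log_lines:
    match = table_pattern.search(line)
    if match:
        usage[match.group(1)] += 1

print(usage.most_common(1))  # [('daily_revenue', 2)]
```

Usage counts like these let a catalog rank tables by popularity and flag datasets that are heavily consumed but poorly monitored.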

In summary, by combining metadata (the external characteristics of data), lineage (the dependencies between data), metrics (the internal characteristics of data), and logs (the connections between data and users), we can fully grasp the state of our data at any point in time.
