Data quality management involves properly defining data standards and employing the right technology and resources to maintain data quality. Data quality management strategies and technologies form a fairly broad category and can be applied across the three stages of data quality management: before, during, and after the event.
Data quality management should put prevention first, taking "pre-event control as the core, meeting business needs as the goal" as the fundamental starting point and foothold of the work, and strengthen measures across pre-event prevention, in-process control, and post-event remediation, so as to achieve continuous improvement of enterprise data quality, as shown in the figure below.
Pre-event prevention strategies for data quality management
The Eastern Han dynasty historian Xun Yue, in "Shen Jian: Miscellaneous Words", described three techniques for offering loyal counsel: "The first is prevention, the second is rescue, the third is admonition. To stop a thing before it happens is prevention; to stop it as it unfolds is rescue; to censure it after it is done is admonition. Prevention is the best, rescue comes second, and admonition is the last resort."
Prevention is the best strategy for data quality management. The prevention of data quality management can start from three aspects: organizational personnel, standards and specifications, and system processes.
1. Strengthen organizational construction
Companies need to build a culture where more people are aware of the importance of data quality, which is inseparable from organizational mechanisms. Establishing an organizational system for data quality management, clarifying role responsibilities and assigning appropriate skills to each role, as well as strengthening the training and development of relevant personnel, are effective ways to ensure data quality.
1. Organizational role setting
When implementing data quality management, enterprises should consider setting up relevant data quality management roles under the overall organizational framework of data governance, and determine the division of their responsibilities in data quality management. Common organizational roles and their responsibilities are as follows.
Data Governance Committee: sets the tone for data quality and makes decisions about data infrastructure and processes. The committee meets regularly to drive the measurement and analysis of data quality across business units and to set new data quality objectives.
Data Analyst: responsible for root cause analysis of data problems, providing the basis for decisions on data quality solutions.
Data Administrator: responsible for managing data as a corporate asset and ensuring data quality, for example through regular data cleansing, deduplication, or resolution of other data issues.
2. Strengthen personnel training
The main cause of data inaccuracy is human error. Strengthening the training of relevant personnel and raising their data quality awareness can effectively reduce the occurrence of data quality problems.
Data quality management training is a win-win process.
For employees, training not only conveys the importance of data quality to business and management, but also imparts knowledge and skills in data management theory, technology, and tools. It ensures that upstream business personnel understand the impact of their data on downstream business and applications, so that they make fewer mistakes in their work and improve the efficiency and quality of their business processing.
For the enterprise, training publicizes and implements data standards, improves employees' data thinking and data awareness, and helps establish a data culture that supports the long-term stability of enterprise data governance.
In addition, enterprises should encourage employees to take part in professional qualification training, so that relevant personnel can study the data governance body of knowledge more systematically and improve their professional competence in data management.
2. Implement data standards
The effective implementation of data standards is a necessary condition for data quality management. Data standards include data model standards, master data and reference data standards, and indicator data standards.
1. Data model standard
Data model standards uniformly define the business definitions, business rules, data relationships, and data quality rules in the data model, and these standards and rules are managed centrally through metadata management tools. In the data quality management process, these standards can be mapped onto business processes and used as the basis for data quality assessment, so that data quality verification is grounded in evidence.
2. Master data and reference data standards
Master data and reference data standards include the classification, coding, and model standards for master data and reference data, and they guarantee that master data and reference data can be shared across departments and business systems. If these standards are not effectively implemented, the quality of master data will suffer: the data will become inconsistent, incomplete, and non-unique, which in turn affects business collaboration and decision support.
3. Indicator data standards
Indicator data is produced by processing and summarizing business data according to defined business rules, and indicator data standards mainly cover three aspects: business attributes, technical attributes, and management attributes. These standards unify the statistical scope, statistical dimensions, and calculation methods of analysis indicators; they are not only the basis of consensus among business departments, but also core construction content for data warehouse and BI projects, and they provide a basis for data quality audits of the data warehouse.
3. Institutional process guarantees
1. Data quality management process
Data quality management is a closed-loop management process that includes business requirements definition, data quality measurement, root cause analysis, implementation of improvement plans, and control of data quality, as shown in the following figure.
Business Requirements Definition
The author's consistent position is that enterprises should not govern data for the sake of governing data: behind data governance lie business and management goals, and the purpose of data quality management is to better meet business expectations.
First, map the organization's business goals to the data quality management strategy and plan.
Second, let business personnel participate deeply in, or even lead, data quality management; as the main users of data, business departments are best placed to define data quality parameters.
Third, define the business problem clearly, so that the root cause of the data quality problem can be analyzed and a more reasonable solution developed.
Data quality measurement
Data quality measurement designs evaluation dimensions and indicators around business needs, uses data quality management tools to evaluate the data quality of the relevant data sources, classifies data problems according to the measurement results, and analyzes the causes of those problems.
First, let an analysis of the business impact of data quality issues guide the measurement, and clearly define important parameters such as the scope and priority of the data to be measured.
Second, adopt a combination of top-down and bottom-up strategies to identify anomalies in the data. The top-down approach starts from business objectives and evaluates and measures the data sources in question; the bottom-up approach starts from data profiling, identifying data source issues and mapping them to their potential impact on business goals (see the profiling sketch below).
Third, produce a data quality assessment report with a clear list of data quality measurements.
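To illustrate the bottom-up, profiling-driven side of measurement, here is a minimal sketch that computes two common evaluation indicators, completeness and uniqueness, over a pandas DataFrame. The column names, sample data, and the 0.95 threshold are illustrative assumptions, not prescriptions from the text.

```python
# A minimal bottom-up profiling sketch; column names, sample data,
# and thresholds are illustrative assumptions.
import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column quality metrics: completeness and uniqueness."""
    return pd.DataFrame({
        "completeness": 1 - df.isna().mean(),   # share of non-null values
        "uniqueness": df.nunique() / len(df),   # distinct values / total rows
    })

customers = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "email": ["a@x.com", None, "b@x.com", "b@x.com"],
})
print(profile(customers))
# Columns whose completeness falls below an agreed threshold (say 0.95)
# would be flagged in the assessment report for root cause analysis.
```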
Root cause analysis
There are many reasons for data quality issues, but some of them are symptomatic rather than fundamental. To do a good job in data quality management, we should grasp the key factors that affect data quality, set up quality management points or quality control points, and start from the source of data to fundamentally solve data quality problems.
Implement improvement plans
There is no one-size-fits-all solution to ensure the accuracy and completeness of every type of data for every business in an enterprise. Enterprises need to define data quality rules and data quality metrics based on the root cause of data problems and the extent of the impact of data on the business, form a unique data quality improvement plan that meets the business needs of the enterprise, and take immediate action.
Control data quality
Data quality control means setting up a data quality "firewall" in the enterprise data environment to prevent bad data from being produced. Based on the root cause analysis of data problems and the corresponding handling strategies, measurement and monitoring programs are set up at the points where problems enter, and problems are prevented and controlled at the source or upstream of the data environment, so that bad data does not spread downstream, pollute subsequent storage, and harm the business.
2. Data quality management system
The data quality management system sets assessment KPIs and evaluates the data quality management of each business domain and department through special assessments and scoring. Based on the evaluation results, problem data is grouped into the corresponding classifications and quantified according to the weight of each classification (a scoring sketch follows below). Patterns in data quality problems are then summarized, data quality management tools are used to monitor and measure quality regularly, existing problems are discovered promptly, and the implementation of corrections is supervised.
The role of the data quality management system is to bind all parties to strengthen their data quality awareness, urge them to pay attention to data quality in their daily work, and ensure that when problems are found they can be traced to the source and proactively resolved.
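A minimal sketch of such weighted scoring is given below. The issue categories, weights, and penalty formula are hypothetical; a real assessment scheme would be agreed with the business domains being scored.

```python
# Hypothetical weighted scoring of data quality issues by classification.
# Categories, weights, and counts are illustrative assumptions.
ISSUE_WEIGHTS = {"critical": 5.0, "major": 2.0, "minor": 0.5}

def quality_score(issue_counts: dict, total_records: int, base: float = 100.0) -> float:
    """Deduct weighted, volume-normalized penalties from a perfect base score."""
    penalty = sum(ISSUE_WEIGHTS[cat] * n for cat, n in issue_counts.items())
    return max(0.0, base - penalty / total_records * 100)

# A department with 3 critical, 10 major, and 40 minor issues in 10,000 records:
print(quality_score({"critical": 3, "major": 10, "minor": 40}, total_records=10_000))
# -> 99.45; per-domain scores can then be ranked and tracked over time.
```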
In-process control strategies for data quality management
In-process control of data quality management refers to monitoring and managing data quality during the maintenance and use of data. By establishing a process control system for data quality, the quality of data creation, change, collection, cleansing, conversion, loading, analysis, and other links is kept under control.
1. Strengthen control of data sources
"Ask how the canal can be so clear: because living water flows from its source." Understanding the data is essential to enterprise data quality; controlling quality from the data source, so that data is "standardized on input and standardized on output", is the key to solving enterprise data quality problems. Enterprises can manage the quality of source data from the following aspects.
1. Maintain a good data dictionary
A data dictionary is an important tool for recording standard data and ensuring data quality. Data accumulates over time, and if it accumulates in informal data systems such as spreadsheets, there is a risk that this valuable data will be lost, for example when key employees leave. Establishing an enterprise-level data dictionary that identifies the enterprise's key data and clearly, precisely defines each data element eliminates the misunderstandings of data that can arise between departments and personnel, and saves enterprises considerable time and cost on IT projects.
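As a sketch of what "clearly and precisely defining each data element" can look like in practice, the snippet below models one dictionary entry as a Python dataclass. The fields and the sample element are illustrative assumptions, not a standard.

```python
# A minimal data dictionary entry, sketched as a dataclass.
# The fields shown are one common choice, not a mandated schema.
from dataclasses import dataclass, field

@dataclass
class DataElement:
    name: str                  # canonical element name
    definition: str            # precise business definition
    data_type: str             # e.g. "string", "date", "decimal(10,2)"
    owner: str                 # accountable data steward
    allowed_values: list = field(default_factory=list)  # empty = unconstrained

customer_status = DataElement(
    name="customer_status",
    definition="Lifecycle state of a customer account.",
    data_type="string",
    owner="CRM data steward",
    allowed_values=["prospect", "active", "dormant", "closed"],
)
```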
2. Automatic data entry
One of the root causes of poor data quality is human error, and errors are hard to avoid when data is entered manually. Businesses should therefore consider automating data entry to reduce human error. Wherever the system can automate an action, it is worth implementing, for example automatically matching customer information based on keywords and filling it into the form.
3. Automatic data verification
Preventing disease is better than curing it, and the same goes for data governance. Input data can be verified automatically against preset data quality rules, with a reminder issued, or the save refused, for data that does not meet the rules. Data quality verification rules include, but are not limited to, the following categories (see the sketch after this list):
Data type correctness: numbers, integers, text, dates, references, attachments, etc.
Data deduplication: exactly duplicated data items, suspected duplicate data items, etc.
Data field value range: maximum, minimum, acceptable, and unacceptable values.
Data classification rules: rules that determine which classification a data item belongs to, ensuring it is categorized correctly.
Unit correctness: ensuring the correct unit of measurement is used.
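The sketch below shows how several of these rule categories might be checked at entry time. The field names, range, and allowed units are illustrative assumptions.

```python
# Rule-based validation at data entry; field names, ranges, and units
# are illustrative assumptions. Failing records are rejected with reasons.
from datetime import date

def validate_order(record: dict) -> list[str]:
    """Return a list of rule violations; an empty list means the record passes."""
    errors = []
    if not isinstance(record.get("quantity"), int):           # type correctness
        errors.append("quantity must be an integer")
    elif not (1 <= record["quantity"] <= 10_000):             # field value range
        errors.append("quantity out of range [1, 10000]")
    if record.get("unit") not in {"piece", "kg", "litre"}:    # unit correctness
        errors.append("unknown unit of measurement")
    if not isinstance(record.get("order_date"), date):        # type correctness
        errors.append("order_date must be a date")
    return errors

bad = {"quantity": 0, "unit": "box", "order_date": "2024-01-01"}
print(validate_order(bad))  # three violations; the save would be refused
```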
4. Manual intervention review
Data quality review is an important means of controlling data quality at the source. Using a process-driven data management model, additions and changes to data are controlled: each operation must be reviewed manually, and the data takes effect only after passing review. For example, additions or changes to master data can be put under manual review to control data quality.
2. Strengthen the control of the circulation process
Data quality problems do not occur only at the source; they can arise at every link of data collection, storage, transmission, processing, and analysis, all the way to the end user. It is therefore necessary to guard data quality comprehensively across all processes in the data life cycle. Quality control strategies for the data flow process are as follows.
1. Data collection
During the data acquisition phase, the following quality control strategies can be employed:
Clarify the data collection requirements and form a confirmation form;
Standardize data collection processes and models;
Ensure that data sources provide accurate, timely, and complete data;
Broadcast data additions and changes to other applications in a timely manner as messages;
Ensure that the level of detail or granularity of data collection meets business needs;
Define the range of acceptable values for each object of the collected data;
Ensure that the data collection tools, collection methods, and collection processes have been validated.
2. Data storage
During the data storage phase, the following quality control strategies can be employed:
Choose an appropriate database system and design a reasonable data table;
Store data at the appropriate granularity;
Establish appropriate data retention schedules;
Establish appropriate data ownership and query permissions;
Clarify guidelines and methods for accessing and querying data.
3. Data transmission
During the data transfer phase, the following quality control strategies can be employed:
Specify data transfer boundaries or data transfer limits;
Ensure the timeliness, integrity and security of data transmission;
Ensure the reliability of the data transmission process and that data is not tampered with in transit (see the checksum sketch after this list);
Identify the impact of data transfer technologies and tools on data quality.
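One common technique for verifying that data has not been altered in transit (an illustrative choice, not one prescribed above) is to compare checksums computed before and after transfer:

```python
# Integrity check for transmitted data: compare SHA-256 checksums
# computed at the source and at the destination.
import hashlib

def sha256_of(payload: bytes) -> str:
    return hashlib.sha256(payload).hexdigest()

sent = b'{"customer_id": 42, "status": "active"}'
checksum_at_source = sha256_of(sent)

received = sent  # in practice: the bytes read on the receiving side
assert sha256_of(received) == checksum_at_source, "data altered in transit"
```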
4. Data processing
During the data processing phase, the following quality control strategies can be employed (a small cleaning sketch follows the list):
Handle data appropriately and ensure that data processing is in line with business objectives;
Handling of duplicate values;
Handling of missing values;
Handling of outliers;
Handling of inconsistent data.
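A minimal pandas pass over the four cases above might look like the following; the sample data, column names, and rules are illustrative assumptions.

```python
# One possible cleaning pass: duplicates, missing values, inconsistent
# encodings, and a crude outlier filter. Data and rules are illustrative.
import pandas as pd

df = pd.DataFrame({
    "customer": ["Acme", "Acme", "Beta", "Beta"],
    "country": ["CN", "CN", "us", "US"],
    "revenue": [100.0, 100.0, None, 250.0],
})

df = df.drop_duplicates()                                   # duplicate values
df["revenue"] = df["revenue"].fillna(df["revenue"].mean())  # missing values
df["country"] = df["country"].str.upper()                   # inconsistent data
df = df[df["revenue"].between(0, 1_000_000)]                # outlier filter
print(df)
```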
5. Data analysis
During the data analysis phase, the following quality control strategies can be employed:
Ensure that algorithms, formulas, and analytical systems for data analysis are effective and accurate;
Ensure that the data to be analyzed is complete and valid;
Analyze data in a reproducible manner;
Analyze data at an appropriate granularity;
Show appropriate data comparisons and relationships.
The above are the main strategies for in-process control of data quality management.
Post-event remediation strategies for data quality management
If pre-event prevention and in-process control are done well, will data quality problems disappear? Clearly not. No matter how many precautions we take and how strict our process controls are, some data problems will always slip through the cracks. Wherever a process involves human intervention, data quality problems will arise, and even setting human factors aside, they cannot be avoided entirely. To minimize data quality issues and mitigate their impact on the business, we need to identify them promptly and remediate them accordingly.
1. Regular quality monitoring
Regular quality monitoring, also known as regular data measurement, is a periodic re-evaluation of some non-critical data and data that are not suitable for continuous measurement, so as to provide a certain degree of assurance that the state of the data meets expectations.
Regularly monitoring the condition of the data provides a degree of assurance that it meets expectations, reveals data quality issues and changes in those issues, and supports the development of effective improvement measures. Regular quality monitoring is like a periodic physical examination: the health of the body is checked at intervals, and when some examination value changes significantly, the doctor knows which data is abnormal and takes appropriate measures accordingly.
The same is true for data: enterprise data should be given a comprehensive "physical examination" on a regular basis to find problems in time and achieve continuous improvement of data quality.
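A sketch of such a periodic check is given below, assuming agreed baseline thresholds and a pandas snapshot of the data; in practice it would run on a schedule (e.g. nightly) and feed an alerting channel.

```python
# Periodic quality check: re-measure key metrics and alert on regression.
# Thresholds, column names, and sample data are illustrative assumptions.
import pandas as pd

THRESHOLDS = {"completeness": 0.95, "uniqueness": 0.99}  # agreed baselines

def check(df: pd.DataFrame, key: str) -> list[str]:
    alerts = []
    completeness = float(1 - df[key].isna().mean())
    uniqueness = float(df[key].nunique() / len(df))
    if completeness < THRESHOLDS["completeness"]:
        alerts.append(f"{key}: completeness {completeness:.2%} below baseline")
    if uniqueness < THRESHOLDS["uniqueness"]:
        alerts.append(f"{key}: uniqueness {uniqueness:.2%} below baseline")
    return alerts

snapshot = pd.DataFrame({"customer_id": [1, 2, 2, None]})
for alert in check(snapshot, "customer_id"):
    print(alert)
```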
2. Data issue remediation
Although data quality control can to a large extent prevent the occurrence of bad data, in practice no quality control, however strict, can prevent 100% of data problems, and overly strict data quality control can even cause other data problems. Enterprises therefore need to carry out proactive data cleansing and remediation from time to time to correct existing data issues.
1. Clean up duplicate data
Duplicate data detected by data quality checks is processed manually or automatically, by deletion or merging. For example, of two identical duplicate records, delete one; if the duplicate records are not identical, merge them into a single record, or keep only the more complete and accurate one (see the sketch below).
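The sketch below illustrates both cases with pandas: exact duplicates are dropped, and suspected duplicates sharing a key are collapsed by taking the first non-null value per column. The key and columns are illustrative assumptions.

```python
# Merging non-identical duplicates that share a key. The "keep the first
# non-null value per column" rule is one illustrative merge policy.
import pandas as pd

df = pd.DataFrame({
    "customer_id": [7, 7],
    "email": ["a@x.com", None],
    "phone": [None, "555-0101"],
})

df = df.drop_duplicates()  # exact duplicates are simply dropped
merged = df.groupby("customer_id", as_index=False).first()
print(merged)  # one row for customer 7, with email and phone both preserved
```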
2. Clean up the derived data
Derived data is data computed from other data; for example, "profit margin" is calculated from "profit", so it is derived data. In general, storing derived data is redundant: it increases storage and maintenance costs as well as the risk of data errors. If for some reason the way the profit margin is calculated changes, the stored value has to be recalculated, which increases the chance of error. Derived data should therefore be cleaned up by storing the associated algorithms and formulas rather than the results.
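A small illustration of "store the formula, not the result": profit margin computed on demand, so a change to the calculation never leaves stale stored values. The formula itself is a simplified assumption.

```python
# Derived value computed on demand instead of being stored.
def profit_margin(profit: float, revenue: float) -> float:
    """Illustrative formula; adjust to the enterprise's accounting rules."""
    return profit / revenue if revenue else 0.0

print(f"{profit_margin(profit=30_000, revenue=120_000):.1%}")  # 25.0%
```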
3. Handling of missing values
The strategy for dealing with missing values is imputation, which can be manual or automatic. For missing values in "small data", manual imputation is generally used, for example in the integrity governance of master data. For missing values in big data, automatic imputation is generally used. There are three main automatic imputation methods (sketched in the code after this list):
Fix with contextual interpolation;
Fix with the average, maximum, or minimum value;
Fix with default values.
Of course, the most effective method is imputation with similar values, for example using machine learning algorithms to find similar values with which to repair the data.
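The sketch below runs the three automatic methods with pandas and adds a nearest-neighbour fix using scikit-learn's KNNImputer as one example of similar-value imputation; the sample values are illustrative.

```python
# The three automatic imputation methods, plus a similar-value fix.
# Sample values are illustrative assumptions.
import pandas as pd
from sklearn.impute import KNNImputer

s = pd.Series([10.0, None, 14.0, None, 18.0])
print(s.interpolate())     # contextual interpolation between neighbours
print(s.fillna(s.mean()))  # fill with the average value
print(s.fillna(0.0))       # fill with a default value

# Similar-value imputation: estimate missing fields from the most
# similar complete records (here: age and salary columns).
X = [[25.0, 50_000.0], [30.0, float("nan")], [31.0, 62_000.0], [45.0, 90_000.0]]
print(KNNImputer(n_neighbors=2).fit_transform(X))
```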
4. Outlier handling
The heart of outlier handling is finding the outliers. There are many ways to detect them, most of which use the following machine learning techniques (a statistical sketch follows the list):
Statistical-based anomaly detection;
Distance-based anomaly detection;
Density-based anomaly detection;
Cluster-based anomaly detection.
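As a sketch of the statistical approach, the snippet below flags values whose z-score exceeds a conventional threshold; the data and the threshold of 2 are illustrative assumptions.

```python
# Statistical anomaly detection with z-scores. The threshold of 2 is a
# common convention; data values are illustrative.
import numpy as np

values = np.array([102, 98, 101, 99, 100, 250])  # 250 looks anomalous
z = (values - values.mean()) / values.std()
print(values[np.abs(z) > 2])  # -> [250]
```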
3. Continuous improvement and optimization
Data quality management is a continuous virtuous cycle: ongoing measurement, analysis, investigation, and improvement raise the overall information quality of the enterprise. Through the continuous optimization and improvement of data quality management strategies, the enterprise can move from reacting to data problems, or even to emergency data failures, to proactively preventing and controlling data defects.
After data quality measurement, root cause analysis of data problems, and data quality problem repair, we can go back and evaluate whether the data model design is reasonable, whether there is room for optimization and improvement, whether the processes of data addition, change, collection, storage, transmission, processing, and analysis are standardized, and whether the preset quality rules and thresholds are reasonable. If there are unreasonable areas or room for optimization in the model and process, then implement those optimizations.
Post-event remediation is not the ideal approach to data quality management, so it is recommended to put prevention first, and to continuously find problems, improve methods, and raise quality through ongoing data quality measurement and exploration.
In closing
Data quality affects not only the success or failure of informatization construction; it is also a core element of enterprise business collaboration, management innovation, and decision support. In managing data quality, we should keep the principle of "garbage in, garbage out" firmly in mind, adhere to the strategy of "pre-event prevention, in-process control, post-event remediation", and continuously raise the level of enterprise data quality.
While there may not be a truly foolproof way to prevent all data quality issues, making data quality part of the "DNA" of an enterprise data environment will go a long way in gaining the trust of business users and leaders.