Recently, at the "2023 Data Asset Management Conference", the China Academy of Information and Communications Technology (hereinafter referred to as "CAICT") announced the results of its "Trusted Big Data" product evaluation for the second half of 2023. DTSphere EMR, the lakehouse platform software from Hangzhou Digital DreamWorks Science and Technology *** (hereinafter referred to as "Digital DreamWorks"), successfully passed CAICT's special evaluation of basic capabilities for cloud-native lakehouse integration. He Baohong, director of the Institute of Cloud Computing and Big Data at CAICT, presented the certificate to Digital DreamWorks.
Data, as a key factor of production, is helping enterprises rapidly improve their competitiveness. The data platform is the basic software that provides enterprises with data storage, computing, analysis, and other capabilities, and it is important infrastructure for supporting upper-layer data applications. To process both structured and unstructured data, enterprises generally deploy data warehouses and data lakes side by side. This solves the problem of heterogeneous data processing, but it brings high construction costs and complicated operation and maintenance work. In this context, the lakehouse data platform has become a new research direction. With the trend toward all-cloud data platforms, cloud-native lakehouse integrated data platforms have become a focus of the industry.
At this stage, the cloud-native lakehouse integrated data platform is still in an early stage of development because of its high technical threshold, which has led to large differences between solutions and unclear application models. In the first half of 2022, the TC601 WG1 Big Data Technology and Product Working Group of the Big Data Technology Standards Promotion Committee of China Communications Standards Association began organizing the development of the "Technical Requirements for Cloud-Native Lakehouse Data Platform" standard, and Digital DreamWorks was deeply involved in compiling its main content. The standard is divided into five capability domains: data lakehouse integration, lakehouse storage, lakehouse computing, lakehouse data governance, and other lakehouse capabilities. It aims to help big data product vendors and users evaluate the technical capabilities and R&D direction of cloud-native lakehouse integrated data platforms. In addition, drawing on its technical experience with the lakehouse architecture, Digital DreamWorks was also deeply involved in compiling the "Research Report on Lakehouse Technology and Industry (2023)".
Full support for a variety of scenario-oriented data services
Since 2015, the China Academy of Information and Communications Technology has carried out evaluation and testing work in big data, privacy computing, data security, and related areas. After 7 years of development, it has become the most influential evaluation and testing system in the industry. DTSphere EMR, the product evaluated this time, is a big data computing platform independently developed by Digital DreamWorks for the government and enterprise industries and built on a unified distributed computing engine. Relying on its high-performance integration of multi-source heterogeneous data, the lakehouse platform software collects real-time and batch data in an integrated way. Through unified distributed storage management and unified metadata management, it supports business needs for million-level high throughput, petabyte-scale data processing, and millisecond-level low latency. It helps customers quickly build multi-scenario data capabilities such as real-time data warehouses, data lakes, and lakehouses to accelerate business innovation.
Digital DreamWorks DTSphere EMR business architecture diagram.
Multi-source heterogeneous massive data collection.
The lakehouse collects structured, semi-structured, and unstructured data efficiently. This addresses three problems of traditional data warehouses at massive data volumes: slow loading, low query efficiency, and difficulty integrating multiple heterogeneous data sources for analysis.
Unified storage and separation of storage and computing.
By using low-cost object storage to achieve the high-efficiency storage function of the data lake, the data lakehouse can reduce the storage cost by more than 90%. It also solves the data silo problem caused by the common siloed development in the enterprise, reducing the cost and time of maintaining multiple data storage systems by 50%.
Unified and active metadata.
Active metadata discovery can not only extract the metadata of structured tables in the data warehouse, but also analyze files in specific formats in the data lake, and automatically generate metadata information. Through effective and centralized metadata management, the value of data assets can be increased.
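As a rough illustration of what "analyzing a file and automatically generating metadata" can mean, the following minimal sketch infers column types from CSV text. It is not DTSphere EMR's implementation: the function names, the metadata layout, and the sample columns are all hypothetical and chosen only for illustration.

```python
import csv
import io


def infer_type(values):
    """Return the narrowest type name that fits every non-empty value."""
    def fits(cast):
        try:
            for v in values:
                if v != "":
                    cast(v)
            return True
        except ValueError:
            return False

    if fits(int):
        return "int"
    if fits(float):
        return "float"
    return "string"


def discover_metadata(name, text):
    """Scan CSV text and build a metadata record (hypothetical layout)."""
    rows = list(csv.DictReader(io.StringIO(text)))
    columns = rows[0].keys() if rows else []
    return {
        "name": name,
        "row_count": len(rows),
        "columns": {c: infer_type([r[c] for r in rows]) for c in columns},
    }


# Hypothetical sample file content, used only to exercise the sketch.
sample = "mine_id,output_tons,shift\n1,120.5,day\n2,98,night\n"
metadata = discover_metadata("production.csv", sample)
print(metadata)
```

A real platform would also need to handle other file formats, sampling of large files, and schema evolution, none of which this sketch attempts.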
Multimodal data processing.
Multi-modal data processing provides multi-modal data collection, unified data lakehouse storage, cleaning and governance, integrated development, and unified external services.
Unified development of incremental data warehouses.
The use of unified metadata management can enhance the ability of multi-modal data management and support the unified storage and calculation of structured, textual, spatial, and other data. Lake-based warehouses are built for unified storage, and data development can be achieved without data migration.
Upsert and delete operations on data are supported, and changes take effect within minutes, which addresses the timeliness of data updates.
Data warehouse latency improves from T+1 to the second level, which greatly improves data freshness and, to a certain extent, reduces the share of cluster computing resources consumed.
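The upsert and delete support described above can be sketched, in a simplified form, as merging a batch of keyed changes into an existing table snapshot. This is an assumption-laden illustration rather than the product's storage engine: the `apply_changes` function, the `(op, row)` change format, and the `id` key are hypothetical.

```python
def apply_changes(table, changes, key="id"):
    """Merge a change batch into a table held as {key: row}.

    Each change is (op, row), where op is "upsert" or "delete".
    """
    merged = dict(table)  # copy so the input snapshot is left untouched
    for op, row in changes:
        if op == "upsert":
            # Insert a new row, or overlay the changed fields on the old one.
            merged[row[key]] = {**merged.get(row[key], {}), **row}
        elif op == "delete":
            merged.pop(row[key], None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return merged


table = {1: {"id": 1, "status": "running"}, 2: {"id": 2, "status": "idle"}}
changes = [
    ("upsert", {"id": 2, "status": "running"}),  # update an existing row
    ("upsert", {"id": 3, "status": "idle"}),     # insert a new row
    ("delete", {"id": 1}),                       # remove a row
]
result = apply_changes(table, changes)
print(result)
```

Because only the changed rows are processed, an incremental merge of this kind avoids recomputing the whole table for each update cycle.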
AI-driven data governance.
Intelligent governance provides intelligent recommendation of standard data items, intelligent judgment of data types, inference of named entities in data fields, and intelligent data cleaning and matching services.
End-to-end stream-batch integration.
Stream-batch integration combines real-time data processing and batch data processing to provide more efficient, lower-cost, and more scalable data processing. Its advantage is that a single set of logic serves both stream and batch computation, which is easier to develop and maintain and resolves the inconsistent metric definitions caused by the original two separate pipelines.
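The idea of one set of logic serving both modes can be sketched as a single function that is reused by a batch runner and a stream runner. This is a minimal illustration under stated assumptions, not the platform's engine: the function names and the `mine` and `tons` fields are hypothetical.

```python
from collections import defaultdict


def total_by_mine(records):
    """Shared business logic: running sum of output per mine."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec["mine"]] += rec["tons"]
        yield dict(totals)  # emit the running result after each record


def run_batch(records):
    """Batch mode: consume everything and keep only the final result."""
    result = {}
    for result in total_by_mine(records):
        pass
    return result


def run_stream(source):
    """Stream mode: hand back each incremental result as it is produced."""
    return total_by_mine(source)


data = [
    {"mine": "A", "tons": 10.0},
    {"mine": "B", "tons": 4.0},
    {"mine": "A", "tons": 2.5},
]
batch_result = run_batch(data)
stream_results = list(run_stream(iter(data)))
print(batch_result, stream_results[-1])
```

Since both runners call the same function, the last streaming result and the batch result agree by construction, which is the consistency property the paragraph above describes.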
Unified external services in multiple scenarios.
Through the integrated directory system, the lakehouse platform provides multi-scenario service capabilities, such as data aggregation, data push, data query, comparison, and subscription, to meet the data sharing and exchange needs of various business scenarios.
Digital DreamWorks DTSphere EMR has been applied in super-large energy enterprises such as Shandong Energy Group (hereinafter referred to as "Shanneng Group"). It has helped Shanneng Group build business scenarios such as production scheduling; production, transportation, storage, and marketing; and production suspension and personnel withdrawal in disastrous weather, and design and develop analysis display pages. It has completed the ingestion into the lake of data from Shanneng Group's unified construction systems, mine-end production systems, and external sources, covering more than 30 categories and 1000+ systems, with an average daily collection of 2.More than 500 million records and a cumulative 700+ billion records ingested into the lake. It has also published information resources, provided external interfaces whose APIs have been called a cumulative 10,000 times, and sorted out standard data elements and data dictionary entries.
According to third-party testing, with Digital DreamWorks DTSphere EMR, Shanneng Group achieved a minimum calculation time of 037 seconds for 4 million records, a peak data store read rate of 4.04 GB/s, a peak data store write rate of 24.14 MB/s, and an average data service interface response time of 41 ms.
Today, in addition to super-large energy groups such as Shanneng Group, DTSphere EMR is also used and highly recognized in the coal mining, emergency management, disaster mitigation, fire protection, and other industries. In the future, Digital DreamWorks will continue to use cutting-edge technologies to help customers achieve efficient global data management and promote the further release of the value of data elements.