This columnar file format is designed for the unique needs of the Internet of Things and aims to reduce network transfer and cloud computing resource consumption.
Translated from "TsFile: A Standard Format for IoT Time Series Data" by Susan Hall. The TsFile project has reached 1.0, and committers are working to make it a separate project within the Apache Software Foundation.
TsFile is a columnar file format designed for time series data. It offers advanced compression to minimize storage space, high-throughput reads and writes, and deep integration with processing and analysis tools such as the Apache projects Spark and Flink. As the Industrial Internet of Things grows, a single wind turbine, for example, can generate a huge amount of data. According to the project's GitHub page, "especially when IoT enters the Industrial Internet, smart devices generate an order of magnitude or two more data than consumer-facing IoT," and getting actionable insights becomes more complex. The page states that TsFile is designed to support "high-throughput ingestion of up to tens of millions of data points per second, with only sparse updates needed to correct low-quality data; tight data packaging and deep compression of long-term historical data; traditional sequential and conditional queries, complex exploratory queries, signal processing, data mining, and machine learning."
TsFile is the underlying storage file format for the Apache IoTDB time series database. IoTDB represents more than a decade of research at the School of Software at Tsinghua University in China. It became a top-level project of the Apache Software Foundation in 2020. "Prior to the advent of TsFile, time series data lacked a standard file format, complicating data collection and processing," Pengcheng Zheng, a spokesman for the project committee, said in an email.
"With TsFile, users can perform portable data offloading and loading in IoTDB, making the management and migration of underlying data more flexible. Even without a database, users can use the SDK to read data directly from a TsFile to implement lightweight data reading and writing scenarios."
Users can write data to a TsFile on an end device or gateway and then send it to the cloud, to IoTDB or another unified management system. TsFile is not a database per se, but a format that reduces network transmission and cloud computing resource consumption through compression and efficient storage.
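To make the bandwidth argument concrete, here is a minimal sketch in Python that compares a row-oriented JSON payload with a column-oriented, delta-encoded, compressed payload for the same readings. It does not use the TsFile SDK and does not reproduce TsFile's actual on-disk layout or encodings; the function names, the sample data, and the choice of delta encoding plus zlib are all assumptions made for illustration.

```python
import json
import struct
import zlib


def make_points(n=1000, start_ms=1_700_000_000_000, step_ms=1000):
    """Hypothetical sensor readings: (timestamp in ms, temperature)."""
    return [(start_ms + i * step_ms, 20.0 + (i % 10) * 0.5) for i in range(n)]


def encode_rows_as_json(points):
    """Row-oriented baseline: one JSON object per reading."""
    rows = [{"time": t, "temperature": v} for t, v in points]
    return json.dumps(rows).encode("utf-8")


def encode_columns(points):
    """Column-oriented payload: delta-encoded timestamps, then values."""
    times = [t for t, _ in points]
    values = [v for _, v in points]
    # Regularly sampled timestamps turn into a run of identical deltas,
    # which a general-purpose compressor shrinks very well.
    deltas = [t - prev for prev, t in zip([0] + times, times)]
    time_col = struct.pack(f"<{len(deltas)}q", *deltas)
    value_col = struct.pack(f"<{len(values)}d", *values)
    header = struct.pack("<I", len(points))  # number of points
    return header + zlib.compress(time_col + value_col)


def decode_columns(blob):
    """Inverse of encode_columns: rebuild the (timestamp, value) pairs."""
    (n,) = struct.unpack_from("<I", blob, 0)
    raw = zlib.decompress(blob[4:])
    deltas = struct.unpack_from(f"<{n}q", raw, 0)
    values = struct.unpack_from(f"<{n}d", raw, 8 * n)
    times, current = [], 0
    for delta in deltas:
        current += delta
        times.append(current)
    return list(zip(times, values))


if __name__ == "__main__":
    points = make_points()
    print("JSON bytes:    ", len(encode_rows_as_json(points)))
    print("columnar bytes:", len(encode_columns(points)))
```

The exact sizes depend on the data, but for regularly sampled readings like these the columnar payload is much smaller than the JSON one, which is the kind of saving the format aims for on constrained networks.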
TsFile can store time series from a single device or from multiple devices. When data from multiple devices is stored in one TsFile, each device has its own storage engine and is therefore physically isolated, as in a traditional database. Data is indexed along the time dimension to accelerate queries and enable fast filtering and retrieval of time series data.
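The idea of per-device isolation combined with a time index can be sketched in a few lines of Python. This is an in-memory illustration only, not TsFile's real index structure; the class name `DeviceSeriesStore`, its methods, and the device names are hypothetical.

```python
from bisect import bisect_left, bisect_right


class DeviceSeriesStore:
    """Keeps each device's series in its own time-sorted columns."""

    def __init__(self):
        self._times = {}   # device -> sorted list of timestamps
        self._values = {}  # device -> values aligned with the timestamps

    def append(self, device, timestamp, value):
        times = self._times.setdefault(device, [])
        values = self._values.setdefault(device, [])
        if times and timestamp <= times[-1]:
            # This sketch only accepts in-order writes per device.
            raise ValueError("timestamps must be strictly increasing")
        times.append(timestamp)
        values.append(value)

    def query(self, device, start, end):
        """Return (timestamp, value) pairs with start <= timestamp <= end."""
        times = self._times.get(device, [])
        values = self._values.get(device, [])
        # Binary search on the sorted time column avoids a full scan.
        lo = bisect_left(times, start)
        hi = bisect_right(times, end)
        return list(zip(times[lo:hi], values[lo:hi]))


if __name__ == "__main__":
    store = DeviceSeriesStore()
    for ts in range(0, 100, 10):
        store.append("turbine-1", ts, ts * 0.1)
        store.append("turbine-2", ts, ts * 0.2)
    print(store.query("turbine-1", 20, 40))
```

Because each device has its own sorted time column, a range query touches only that device's data and locates its boundaries by binary search rather than by scanning every point.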
Within IoTDB, TsFile supports both online transaction processing (OLTP) and online analytical processing (OLAP) workloads without the need to reload data into a different store.
The IoT-native data model organizes time series data from devices and sensors into log-structured merge trees, which accommodate delayed data arrival and suit write-intensive workloads. Data that arrives with a short delay is first cached in memtables and then flushed into TsFiles.
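The memtable-then-flush pattern described above can be illustrated with a small Python sketch. It is a simplified model of a log-structured write path, not IoTDB's implementation: the class name `SeriesWriter`, the `flush_threshold` parameter, and the use of in-memory lists in place of files are assumptions for illustration.

```python
class SeriesWriter:
    """Buffers points in a memtable and flushes them as sorted runs."""

    def __init__(self, flush_threshold=4):
        self.flush_threshold = flush_threshold
        self._memtable = {}      # timestamp -> value, arrival order is free
        self.flushed_files = []  # each entry stands in for one flushed file

    def write(self, timestamp, value):
        # Late points simply land in the memtable; ordering is fixed at flush.
        self._memtable[timestamp] = value
        if len(self._memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        if not self._memtable:
            return
        # Sort once at flush time so every flushed run is time-ordered.
        self.flushed_files.append(sorted(self._memtable.items()))
        self._memtable = {}

    def scan(self):
        """Merge flushed runs and the memtable; newer writes win on ties."""
        merged = {}
        for run in self.flushed_files:
            merged.update(run)
        merged.update(self._memtable)
        return sorted(merged.items())


if __name__ == "__main__":
    writer = SeriesWriter(flush_threshold=4)
    for ts, value in [(10, 1.0), (30, 3.0), (20, 2.0), (40, 4.0)]:
        writer.write(ts, value)   # 20 arrives after 30
    writer.write(25, 2.5)         # arrives after the first flush
    print(writer.flushed_files)
    print(writer.scan())
```

Writes stay cheap because out-of-order points are only sorted when the buffer is flushed, and a point that arrives after its neighbors were flushed is reconciled at read time by merging the runs.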
TsFile allows users to write data directly, with or without predefined schemas and filters, and the new version adds support for more data types and algorithms. TsFile was originally written in Java, but according to Zheng, demand is growing for implementations in other languages, such as C++, Go, and Rust. Its users typically work in scenarios where efficient data storage, fast access, and analysis are critical, such as the Internet of Things, intelligent control systems, financial analytics, and log analysis.
He noted that TsFile stands out for its focus on the unique needs of time series data.
In the past, companies commonly wrote time series data in a variety of user-defined file formats that lacked uniformity, or used general-purpose columnar file formats such as [Apache projects] Parquet and ORC. Without a standard, this complicated data collection and processing.
"TsFile offers benefits such as deep compression of long-term historical data, high throughput, and handling of infrequent updates. Its ability to integrate with IoTDB and other systems further highlights its strengths. Users can write TsFile data on an embedded device or gateway and then transfer the TsFile directly to the cloud without the need for a traditional ETL [extract, transform, load] process. In this way, the network transmission and computing resource requirements in the cloud are reduced."
In the future, the committee hopes to make TsFile a stand-alone project with its own SDK and easier-to-use documentation, add support for more languages, integrate more encoding and compression methods in TsFile, and provide more tools, such as visualization, parsing, and repair tools.
"However, these plans are not set in stone, as we are collaborating in the Apache Way, and each new insight that is discussed may help to modify and optimize them," Zheng said.
Susan Hall is the sponsor editor for The New Stack. Her job is to help sponsors attain the widest possible readership for their contributions. She has written for The New Stack since its early days, as well as for other publications.