MinIO optimizes small object storage and adds automatic extraction of tar files

Mondo Technology Updated on 2024-01-30

As object storage becomes the dominant storage category for cloud-native workloads, developers are turning to it for a growing number of use cases. This follows from the properties of modern object storage: performance, scalability, security, resiliency, and RESTful APIs well suited to Kubernetes. MinIO in particular embodies these attributes and can support a wide range of workloads in a variety of locations: on-premises, at the edge, or in private, public, or hybrid clouds.

Early object storage platforms were designed and built to archive large objects, often as targets for backup jobs. While these systems are well suited to high-bandwidth access to large files, they have struggled (and will continue to struggle) with workloads that involve many small file operations. Specifically, accessing each file's data requires first a request to a metadata server (which holds the mapping information and other settings) and then a request to physical storage. For large files, the latency introduced by one metadata access per file is almost negligible compared to the time it takes to load the entire file. For many small files, however, the metadata server access essentially doubles the latency of data access and becomes a bottleneck for the entire object storage system.
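To make the latency argument concrete, here is a rough back-of-the-envelope model in Python. The figures used (a 1 ms metadata lookup, a 1 ms time to first byte, 1 GiB/s of bandwidth) are illustrative assumptions, not measurements of any real system.

```python
# Toy latency model: one metadata lookup, then one data access per object.
# All numbers below are assumed for illustration only.

METADATA_LATENCY_S = 0.001         # assumed round trip to a metadata server
STORAGE_LATENCY_S = 0.001          # assumed time to first byte from storage
BANDWIDTH_BYTES_PER_S = 1024 ** 3  # assumed 1 GiB/s sustained transfer rate


def read_time(size_bytes: int, with_metadata_server: bool = True) -> float:
    """Estimated seconds to read one object of the given size."""
    transfer = STORAGE_LATENCY_S + size_bytes / BANDWIDTH_BYTES_PER_S
    lookup = METADATA_LATENCY_S if with_metadata_server else 0.0
    return lookup + transfer


def metadata_overhead(size_bytes: int) -> float:
    """Ratio of read time with a metadata lookup to read time without one."""
    return read_time(size_bytes, True) / read_time(size_bytes, False)


if __name__ == "__main__":
    for label, size in [("4 KiB", 4 * 1024), ("1 GiB", 1024 ** 3)]:
        print(f"{label}: {metadata_overhead(size):.3f}x")
```

Under these assumptions the 4 KiB read takes almost twice as long once the metadata lookup is included, while the 1 GiB read is barely affected, which is the asymmetry described above.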

As long as object storage was considered a secondary or archive tier, this was not a big deal. Then along came Amazon S3, which provided a set of APIs and endpoints that elevated object storage to the primary home of application data. Over time, the use cases for object storage have expanded, and today they are driving the need for higher performance and flexibility across different object sizes. Small objects are critical to support AI/ML/DL workloads, deduplication archives, backups, monitoring and log data, IoT applications, analytics databases, and more.

Not every object storage system is capable of extreme performance and resiliency across a wide range of object sizes and access patterns. Large numbers of small objects push traditional object storage systems to their limits, because they require low-latency read and write operations. Delivering thousands of concurrent object operations in a way that is strictly consistent, performance-optimized, and makes efficient use of physical storage is a challenging problem. The problem is compounded by having to serve metadata for more and more copies of files as they are copied, which places a heavier burden on the system. Systems designed and optimized for large objects cannot meet these requirements and provide a sub-optimal experience when forced into them.

Workloads that rely on large amounts of unstructured data, such as AI/ML/DL, illustrate the challenges of object storage. For example, an ML workload might look for anomalies in sensor data, inspecting millions or billions of small log files. That is a lot of metadata and file data access calls. It is not even a complex workload, yet its demands can overwhelm the metadata servers and cluster networks of many object storage systems, making it hard to act on the workload's results in real time.

Unlike other object storage solutions, MinIO does not rely on an external metadata database. When faced with many concurrent queries and operations across a large number of objects, a metadata database can become unresponsive. By eliminating this dependency, MinIO can process a large number of small objects faster.

MinIO continues to expand its leadership in the small object space, adding several features that provide higher performance and scalability for small object storage and retrieval. MinIO now includes optimized small object storage, with data stored inline with metadata, as well as the ability to upload and automatically extract tar files. Combining metadata and small object data can greatly improve performance because reads no longer incur the latency of switching back and forth between metadata and data. Automatic extraction of tar files makes working with many small objects more efficient: just archive the small data files together, upload the archive, and they are available to application workloads.

First, let's look at some of the difficulties inherent in storing and retrieving a large number of small objects. Then we can dive into how MinIO optimizes these operations and into the new handling of tar and zip files in the MinIO client and server.

Using a large number of small objects rather than a small number of large objects places different demands on an object storage system. Typically, a storage administrator must design and tune the storage system based on expected usage and object size, for example by adjusting the properties of blocks or caches to match typical read and write patterns.

In addition, small object workloads are more affected by metadata I/O than large object workloads. MinIO greatly alleviates this burden by eliminating reliance on external metadata databases. MinIO stores metadata and data directly on disk to provide higher performance and scalability.

Traditionally, each object was stored in MinIO as follows:

bucket/
    object_path/
        object-name/
            xl.meta
            part.1

This means that reading an object requires opening at least two files, and writing one requires writing at least two files and one folder. This overhead is negligible when working with a small number of large files, but it adds up when working with many parts of many objects.

Starting with the 2021-04-22 release of MinIO, the server can store object content as part of the xl.meta file that holds each object's metadata. Several factors determine when this is done, but typically objects smaller than 128 KiB may be stored inline with the metadata. This reduces the IOPS required to read and write them.
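The effect on file operations can be sketched with a small toy model in Python. It is not MinIO's code or on-disk format: the fixed 128 KiB cutoff and the function names are assumptions made for illustration, and the text above notes that the real decision depends on several factors.

```python
# Toy model of files opened per read; NOT MinIO's implementation.
# Assumption: a fixed cutoff, although the text says "typically" 128 KiB.
INLINE_THRESHOLD_BYTES = 128 * 1024


def is_inlined(size_bytes: int) -> bool:
    """Assume objects under the threshold live inside xl.meta."""
    return size_bytes < INLINE_THRESHOLD_BYTES


def files_opened_per_read(size_bytes: int) -> int:
    """One file (xl.meta) when inlined; xl.meta plus part.1 otherwise.

    Objects with several parts would open more files; this model
    only covers the single-part case.
    """
    return 1 if is_inlined(size_bytes) else 2


def total_opens(object_sizes: list[int]) -> int:
    """Total file opens needed to read every object once."""
    return sum(files_opened_per_read(size) for size in object_sizes)


if __name__ == "__main__":
    sizes = [4 * 1024] * 1_000_000       # one million 4 KiB objects
    separate_layout = 2 * len(sizes)     # xl.meta + part.1 for every object
    print(separate_layout, "opens before,", total_opens(sizes), "with inlining")
```

For a workload made up entirely of small objects, this model halves the number of files opened per read pass.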

Inlining is not enabled for all object sizes, because it affects how quickly an object can be mutated. Mutations occur, for example, when new tags are added or other properties change. To keep update times reasonable and avoid copying or rewriting megabytes of data, inlining is only applied to small objects.

This is transparent from the client, user, and application perspective. Everything is managed by MinIO, so the only thing you need to do to start optimizing your small object storage is to upgrade your MinIO server.

MinIO has also added the ability to automatically extract tar files after upload.

Getting many small unstructured data files onto object storage, where programs and users can access them, is not easy. In many workflows and environments, this can be the most time-consuming part of the process. Think of the millions of sensor logs required for ML analysis, or another common scenario: thousands of small Microsoft Excel or Word documents in a NAS migration. If you upload each file individually, you incur a lot of network overhead from setting up and tearing down a large number of connections and from issuing thousands of PutObject API calls at the same time. A common solution is to combine all the files into one large tar or zip archive, upload it, and then extract all the files.
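The bundling step can be sketched with Python's standard tarfile module. This is a minimal illustration rather than a prescribed workflow: the helper names and the sensor log file names are hypothetical, and the archive is built in memory only to keep the example self-contained.

```python
import io
import tarfile


def bundle_small_files(files: dict[str, bytes]) -> bytes:
    """Pack {name: content} into one uncompressed tar archive in memory."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)            # required so the payload is written
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def list_members(archive: bytes) -> list[str]:
    """Return the member names stored in a tar archive."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r") as tar:
        return tar.getnames()


if __name__ == "__main__":
    # Hypothetical sensor logs standing in for many small files.
    logs = {f"sensor-{i:04d}.log": f"reading={i}\n".encode() for i in range(1000)}
    archive = bundle_small_files(logs)
    print(len(list_members(archive)), "files packed into one archive")
```

One archive means one upload and one connection instead of a thousand, at the cost of needing an extraction step on the other side.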

With this new feature, users no longer need to start an upload and then come back later for extraction. Scripts that upload tar files can be simplified to a single upload-and-extract step. Applications and users can use the objects immediately without a separate extraction step, making their operations on object storage more efficient and timely. Revisiting the sensor data example, tar file auto-extraction enables real-time anomaly detection by exposing unstructured log data to workloads faster.

Here's an example of how automatic extraction works. Just download the latest versions of the MinIO server and MinIO client and install them. You can even try it out against our Play environment via mc.

Use any .tar file, optionally compressed using zstandard (recommended), lz4, gzip, or bzip2.

mc mb play/mybucket

mc cp .tar play/mybucket --disable-multipart --attr "x-amz-meta-snowball-auto-extract=true"

mc ls play/mybucket

That's it! You are now working with small objects using MinIO's simple, transparent, and powerful features. The equivalent API call is PutObjectExtract.
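For scripted uploads, the mc cp command shown above can be wrapped in a small Python helper. This is only a sketch: the function names and the archive file name are hypothetical, it uses only the flags that appear in the command above, and it assumes the mc on your PATH is the MinIO client (on some systems mc is a different program).

```python
import shutil
import subprocess


def auto_extract_upload_cmd(archive_path: str, target: str) -> list[str]:
    """Build the mc cp invocation with the auto-extract attribute set."""
    return [
        "mc", "cp", archive_path, target,
        "--disable-multipart",
        "--attr", "x-amz-meta-snowball-auto-extract=true",
    ]


def upload_and_extract(archive_path: str, target: str) -> bool:
    """Run the command if mc is installed; return True on success."""
    if shutil.which("mc") is None:
        return False  # mc not found on PATH, nothing was uploaded
    result = subprocess.run(auto_extract_upload_cmd(archive_path, target))
    return result.returncode == 0


if __name__ == "__main__":
    # "sensor-logs.tar" is a hypothetical file name used for illustration.
    print(" ".join(auto_extract_upload_cmd("sensor-logs.tar", "play/mybucket")))
```

Building the argument list separately from running it makes the command easy to log or inspect before anything is sent over the network.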

If you have any specific questions, please send us a message at [email protected] or join the conversation on Slack.
