Databases are a critical component of the world's most complex technology systems, and the way they are used has a significant impact on their performance, scalability, and consistency. In this article, we outline the most critical database topics to know during a system design interview.
At the most basic level, a database is responsible for storing and retrieving an application's data. Because a database is a program that manages access to a physical data store, it is sometimes referred to as a database management system (DBMS). Its main functions are to store, update, and delete data, to return data in response to queries, and to administer the underlying data store.
Databases must be reliable, efficient, and accurate. Achieving all of these at the same time is difficult, and there are several concepts you must be familiar with to design an efficient and successful database. Let's first look at the CAP theorem to understand the trade-offs at the heart of database design.
According to the CAP theorem, a distributed database can guarantee at most two of the following three properties at the same time:
Consistency: Every read returns the most recent write, no matter which node answers.
Availability: Every request sent to a working node receives a response, even if that response may not reflect the latest write.
Partition tolerance: The system continues to operate even if the network connection between some nodes is lost.
Because network partitions cannot be ruled out in a distributed system, the practical choice is between consistency and availability while a partition lasts. A system that keeps consistency and partition tolerance is called a CP database; one that keeps availability and partition tolerance is called an AP database.
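To make the CP/AP distinction concrete, here is a minimal sketch of how a single replica might answer a read while it is cut off from its peers. The `Replica` class, its `mode` labels, and the `Unavailable` exception are hypothetical names for this illustration, not the API of any real database.

```python
from dataclasses import dataclass
from typing import Tuple


class Unavailable(Exception):
    """Raised by a CP replica that cannot confirm it holds the latest value."""


@dataclass
class Replica:
    mode: str                  # "CP" or "AP" (labels assumed for this sketch)
    value: str                 # last value this replica has seen
    partitioned: bool = False  # True when cut off from the other replicas

    def read(self) -> str:
        if self.partitioned and self.mode == "CP":
            # Consistency first: refuse to answer rather than risk stale data.
            raise Unavailable("cannot reach peers to confirm latest value")
        # AP replica (or a healthy network): answer with the local value.
        return self.value


def demo() -> Tuple[str, str]:
    """Read from a partitioned CP replica and a partitioned AP replica."""
    cp = Replica(mode="CP", value="v1", partitioned=True)
    ap = Replica(mode="AP", value="v1", partitioned=True)
    try:
        cp_result = cp.read()
    except Unavailable:
        cp_result = "error"
    return cp_result, ap.read()
```

During the partition the CP replica gives up availability (it returns an error), while the AP replica stays available but may serve a value that is no longer the latest.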
As dataset sizes grow, scalability is becoming increasingly important in distributed clusters. Databases differ in how well they scale because of the different features they offer. Scaling can be divided into two categories:
Vertical scaling is the process of increasing the amount of compute and memory available on a single computer.
Horizontal scaling is the process of increasing the number of machines in a cluster.
Vertical scaling is simple, but the total capacity is capped by what a single machine can hold. Horizontal scaling, on the other hand, offers a larger overall compute and storage capacity, and machines can often be added dynamically without causing downtime. The main drawback is that relational databases, the most widely used database model, are difficult to scale horizontally.
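One common way to spread data across machines when scaling horizontally is hash-based sharding. The sketch below is an illustration under assumptions: the helper names `shard_for` and `place` are invented here, and real systems usually use more elaborate placement schemes.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index in [0, num_shards) using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def place(keys, num_shards):
    """Group keys by the shard (machine) that would store them."""
    shards = {i: [] for i in range(num_shards)}
    for key in keys:
        shards[shard_for(key, num_shards)].append(key)
    return shards
```

With this naive modulo scheme, changing `num_shards` remaps most keys, which is one reason re-sharding is costly. It also hints at why relational workloads are harder to distribute: a join between rows that land on different shards has to cross the network.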
A transaction is a "single unit of work" made up of a series of database operations. Either every operation in the transaction succeeds, or none of them takes effect. In this sense, transactions help maintain data integrity when a system component fails. The ACID properties formalize this:
Atomicity – A transaction is atomic: either all of its operations succeed, or all of them fail and are rolled back.
Consistency – When a transaction completes successfully, the database is left in a valid state with no schema or constraint violations.
Isolation – Multiple transactions can run at the same time without interfering with one another; each behaves as if it ran alone.
Durability – Once a transaction is committed, its changes are persisted to durable storage and survive crashes or power loss.
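Atomicity can be sketched with Python's built-in `sqlite3` module. The `accounts` table, the `CHECK` constraint, and the helper names below are assumptions made for this example. The transfer credits the receiver first and then debits the sender, so a failed debit shows the earlier credit being rolled back.

```python
import sqlite3


def make_db():
    """Create an in-memory database with two accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        " name TEXT PRIMARY KEY,"
        " balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
    )
    conn.commit()
    return conn


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one transaction; True if committed."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # Violates the CHECK constraint if src would go negative.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return {name: balance} for all accounts."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

If the debit fails, the credit that already ran inside the same transaction is undone as well, so the balances never reflect half of a transfer.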
By now, you've learned about some of the basic components related to database design. Let's explore the different types of databases in depth.
All databases require a data model that specifies the logical organization and principles of the data. A relational database uses the relational data model, which organizes data into tables with rows of data entries and columns of predefined data types. Relationships between tables are expressed with foreign keys: a foreign key column holds values that refer to the primary key column of another table.
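As a small illustration of a foreign key relationship, the following sketch uses Python's built-in `sqlite3` module. The `customers` and `orders` tables and the helper names are hypothetical; note that SQLite enforces foreign keys only after `PRAGMA foreign_keys = ON`.

```python
import sqlite3


def make_shop_db():
    """Create two related tables: orders.customer_id refers to customers.id."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
    conn.execute(
        "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE orders ("
        " id INTEGER PRIMARY KEY,"
        " customer_id INTEGER NOT NULL REFERENCES customers(id),"
        " item TEXT NOT NULL)"
    )
    conn.execute("INSERT INTO customers (id, name) VALUES (1, 'Ada')")
    conn.commit()
    return conn


def try_insert_order(conn, customer_id, item):
    """Insert an order; return False if it points at a missing customer."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO orders (customer_id, item) VALUES (?, ?)",
                (customer_id, item),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

An order for customer 1 is accepted, while an order for a customer id that does not exist is rejected by the database itself rather than by application code.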
On top of that, the relational data model imposes strict constraints on data values and relationships to ensure that they are always valid for the schema. To preserve this consistency, ACID transactions are almost always used. The database also supports indexes, which are sorted data structures (typically B-trees) that make lookups faster than a sequential scan of the table. However, because each index must be updated along with the primary table, the performance gains for reads come at the cost of slower writes.
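The effect of an index on how a query is executed can be observed with SQLite's `EXPLAIN QUERY PLAN`. This is a sketch using Python's built-in `sqlite3` module; the `users` table and the index name `idx_users_email` are invented for the example, and the exact wording of the plan text varies between SQLite versions.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return SQLite's plan descriptions for a query as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The fourth column of each row holds the human-readable detail.
    return " | ".join(row[3] for row in rows)


def plans_before_and_after_index():
    """Compare the plan for the same query without and with an index."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT)"
    )
    sql = "SELECT name FROM users WHERE email = ?"
    before = query_plan(conn, sql, ("a@example.com",))
    conn.execute("CREATE INDEX idx_users_email ON users (email)")
    after = query_plan(conn, sql, ("a@example.com",))
    return before, after
```

Before the index exists the plan describes a scan of the whole table; afterwards it describes a search that uses `idx_users_email`. Every insert or update on `users` now also has to maintain that index, which is the write cost mentioned above.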
A relational database is the best choice when there are many-to-many relationships between entries, when the data must follow a predefined schema, or when relationships between data must always be accurate. Its main disadvantage is that it is difficult to scale across distributed clusters: operations that span several nodes depend on network communication, which is slow and drags down database performance. Oracle, MySQL, and PostgreSQL are among the most widely used relational database technologies.
We'll outline the many non-relational database solutions available in the next section so that you can choose the best database approach for your system architecture.
Depending on the structure of the data, what model do you want to use to store it? This is a critical question for today's system designers when choosing a database. Non-relational databases are designed for particular needs such as horizontal scalability, schema flexibility, or specialized query capabilities. As you'll discover in this section, there are many different types of non-relational databases. We'll cover some of the most important things to keep in mind when creating a system.
Because non-relational databases address distinct use cases that prioritize availability and consistency differently, they are usually classified as either AP or CP databases. AP non-relational databases rely on eventual consistency: replicas converge to the same state over time, even though consistency is not guaranteed at the moment a write completes.
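Eventual consistency needs some rule for reconciling replicas that have diverged. The sketch below uses last-write-wins, which is only one possible strategy and is chosen here as an assumption for illustration; the `merge` helper and the `{key: (timestamp, value)}` layout are hypothetical.

```python
def merge(a, b):
    """Last-write-wins merge of two replica states {key: (timestamp, value)}."""
    merged = dict(a)
    for key, (timestamp, value) in b.items():
        # Keep whichever version carries the newer timestamp.
        if key not in merged or timestamp > merged[key][0]:
            merged[key] = (timestamp, value)
    return merged


# Two replicas that accepted different writes while out of contact.
replica_a = {"cart": (1, ["book"]), "theme": (5, "dark")}
replica_b = {"cart": (2, ["book", "pen"])}
```

Once the replicas exchange state and both apply `merge`, they hold the same data regardless of the order in which they synchronize. Until that exchange happens, a read from `replica_a` can still return the older cart.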
Let's get started!
A graph database models data as nodes and edges. Because it is built to represent data with many relationships, it is the type of non-relational database that most closely resembles the relational data model. Since the data is not stored in tables, the fundamental advantage of a graph database is that following an edge from one node to another does not require a join. As a result, graph databases are well suited to queries that traverse many edges, such as social network analysis. Neo4j and Azure Cosmos DB are among the industry's top graph databases.
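A multi-edge traversal of the kind used in social network analysis can be sketched with a breadth-first search over an adjacency list. The graph layout and the `within_hops` helper are assumptions for this illustration and do not reflect how any particular graph database stores or queries data.

```python
from collections import deque


def within_hops(graph, start, max_hops):
    """Return every node reachable from `start` in at most `max_hops` edges."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, distance = frontier.popleft()
        if distance == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, distance + 1))
    seen.discard(start)
    return seen


# Hypothetical friendship graph stored as an adjacency list.
friends = {
    "ann": ["bob"],
    "bob": ["ann", "cy"],
    "cy": ["bob", "dee"],
    "dee": ["cy"],
}
```

Each hop is a direct lookup of a node's neighbors. In a relational schema, the same "friends of friends" question would typically need one self-join per hop.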
Document stores typically hold simple JSON objects, each identified by a key. A document gathers the data about a single subject in one place, whereas a relational database might scatter that data across multiple tables. For example, a document store might keep all medical records for a specific patient in one document, so that a single read retrieves all the important information.
MongoDB, Couchbase, Firebase, and DynamoDB are among the industry's top document stores.
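The patient example can be sketched with plain JSON and an in-memory dictionary standing in for a collection. The `collection` dictionary, the helper names, and the patient fields are all made up for this illustration; a real document store adds querying, indexing, and persistence on top of this idea.

```python
import json

# Hypothetical in-memory "collection": document id -> serialized JSON document.
collection = {}


def save_document(doc_id, document):
    """Store a whole document under one key."""
    collection[doc_id] = json.dumps(document)


def load_document(doc_id):
    """Fetch a whole document with a single lookup."""
    return json.loads(collection[doc_id])


patient = {
    "name": "Jane Doe",
    "allergies": ["penicillin"],
    "visits": [
        {"date": "2023-01-10", "diagnosis": "flu"},
        {"date": "2023-06-02", "diagnosis": "sprained ankle"},
    ],
}
save_document("patient:1001", patient)
```

One call to `load_document("patient:1001")` returns the name, allergies, and visit history together, data that a relational design would likely split across patient, allergy, and visit tables.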
The term "column family" refers to a set of columns that are frequently retrieved together. A column family database represents data in tables like a relational database, but it stores each column family, rather than each row, together in files, and it does not enforce relational constraints. For data with a strong column family access pattern, this model improves performance by minimizing the amount of data that must be read. Columns can also be compressed to save space, since they tend to store repeated values, especially if the data is sparse.
Cassandra and HBase are the industry's top column family databases.
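The difference between row-oriented and column-oriented layout, and why columns compress well, can be sketched as follows. The sample rows and helper names are assumptions for this example, `to_columns` assumes every row has the same keys, and run-length encoding is used only as one simple compression scheme.

```python
from itertools import groupby


def to_columns(rows):
    """Pivot a list of row dicts into {column name: list of values}."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def run_length_encode(values):
    """Compress consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


# Hypothetical sensor readings, stored row by row.
rows = [
    {"id": 1, "city": "Pune", "temp": 31},
    {"id": 2, "city": "Pune", "temp": 30},
    {"id": 3, "city": "Delhi", "temp": 35},
]
```

A query that only needs `city` reads one list instead of every full row, and the repeated values in that list collapse into a handful of `(value, count)` pairs.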
Time series databases are optimized for data entries that need to be organized strictly by time. The key use case is storing real-time data streams from system monitors. Because time series workloads are write-heavy, these databases often include services that order incoming streams so that entries are appended in the correct order. The resulting datasets are also easy to partition by time period. InfluxDB and Prometheus are among the industry's top time series databases.
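Ordering entries on arrival and partitioning them by time period can be sketched with fixed-width buckets. The `Series` class and the one-hour bucket width are hypothetical choices for this illustration, not the storage format of any specific time series database.

```python
import bisect
from collections import defaultdict


class Series:
    """Toy time series: points grouped into fixed-width time buckets."""

    def __init__(self, bucket_seconds=3600):
        self.bucket_seconds = bucket_seconds
        self.buckets = defaultdict(list)  # bucket start -> sorted points

    def append(self, timestamp, value):
        """Insert a point, keeping its bucket sorted even if it arrives late."""
        start = timestamp - timestamp % self.bucket_seconds
        bisect.insort(self.buckets[start], (timestamp, value))

    def points(self):
        """Return all points in time order."""
        ordered = []
        for start in sorted(self.buckets):
            ordered.extend(self.buckets[start])
        return ordered
```

Because each bucket covers one period, old data can be dropped or archived a whole bucket at a time, and a query for a time range only needs to touch the buckets that overlap it.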
A key-value store is somewhat similar to a document store. However, because the key-value store doesn't know what's in the value, it only supports read, overwrite, and delete operations on whole values. There are no schemas, joins, or indexes; it's essentially a giant hash table. Because of this low overhead, it is easy to scale. Caching implementations benefit greatly from key-value storage. Redis and Memcached are among the industry's top key-value stores.
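The "giant hash table" idea, and its use as a cache, can be sketched in a few lines. `KeyValueStore` and `cached_lookup` are hypothetical names for this example; the lookup follows the common cache-aside pattern and, as a simplification, treats a stored `None` the same as a missing key.

```python
class KeyValueStore:
    """Toy key-value store: opaque values addressed only by key."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)  # None when the key is absent

    def put(self, key, value):
        self._data[key] = value  # overwrite the whole value

    def delete(self, key):
        self._data.pop(key, None)


def cached_lookup(cache, key, load):
    """Cache-aside read: try the cache first, fall back to `load` on a miss."""
    value = cache.get(key)
    if value is None:
        value = load(key)  # e.g. a slow query against the primary database
        cache.put(key, value)
    return value
```

The first lookup for a key pays the cost of `load`; repeated lookups are served from the store until the entry is deleted or overwritten.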
We may look for the following storage solutions based on our needs and how we would like to utilize or access our data:
Caching – When developing a read-intensive system like a large social network, we may end up caching large amounts of data, or even entire timelines, to meet low-latency requirements. Redis and Memcached are two common options here.
File system storage – If we want to create an asset delivery service and store image or audio files, we will need to use blob storage.
We have now learned about the various storage solutions and how to choose between different databases depending on our needs and the type of data we need to store. However, is that enough? Sometimes a single database cannot meet all of our needs. For example, in the case of Flipkart, order data must satisfy the ACID properties while also scaling the way a column family database does.
In this case, we can combine databases, such as MySQL + Cassandra. All information about ongoing orders, which must respect the ACID properties, is saved in MySQL. Once an order is complete, it is moved to Cassandra, which serves as permanent storage. In other words, data lives in the relational database for as long as ACID guarantees are needed and is then migrated to the column family database so that it can scale.
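The hand-off from the transactional store to the archive can be sketched as follows. This is an illustration under stated assumptions only: Python's built-in `sqlite3` stands in for MySQL, a plain dictionary stands in for Cassandra, and the `orders` table and helper names are invented.

```python
import sqlite3


def make_orders_db():
    """Stand-in for the relational store holding live orders."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT, status TEXT)"
    )
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, "book", "completed"), (2, "pen", "in_progress")],
    )
    conn.commit()
    return conn


def archive_completed(conn, archive):
    """Copy completed orders into `archive`, then delete them from the live table."""
    rows = conn.execute(
        "SELECT id, item, status FROM orders WHERE status = 'completed'"
    ).fetchall()
    for order_id, item, status in rows:
        # Write to the archive first so a crash never loses an order.
        archive[order_id] = {"item": item, "status": status}
    with conn:  # delete all archived rows in one transaction
        conn.executemany(
            "DELETE FROM orders WHERE id = ?", [(row[0],) for row in rows]
        )
    return len(rows)
```

Writing to the archive before deleting means a failure in between leaves a duplicate rather than a lost order, and re-running the job simply overwrites the same archive keys.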
Database systems are not created equal, and each has its own strengths and weaknesses. At a high level, the following items should be carefully evaluated, in this order:
Determine the type of data to be stored, queried, and updated (structured, unstructured, or semi-structured) and, if relevant, the best techniques for data modeling (e.g., schema vs. schemaless, relational, document, key-value, etc.).
Determine whether transactions and ACID guarantees are required.
Determine if replication and horizontal partitioning are necessary.
Identify any other database requirements and priorities, such as those mentioned earlier in this article.
Security, scalability, performance, vendor establishment, stability, and availability of the necessary expertise are all key considerations.
Identify any cost constraints and budgets that may exist.
As new technologies continue to emerge, the best solution will depend both on everything discussed here and on the leading options available at the time. In addition, software and data solutions can and often do use multiple data storage systems, so consider whether this is appropriate for your specific application. If so, choose the right database system for each part of the application as described above.
When choosing a database, you need to be clear about the size of the data, the structure of the data, the relationships between the data, and the importance of enforcing a schema and ensuring consistency. In large systems, one database may not be able to do everything you need, so multiple databases are needed. As they grow, relational databases can become expensive to scale and maintain, so parts of the system that don't require a strict schema or strong consistency guarantees can consider a non-relational database instead.