Data modeling strategy is a key skill tested during a system design interview. Your data modeling strategy signals to the interviewer whether you have understood the requirements and priorities correctly. There are no graphical illustrations or section-wise presentations, so excuse me if you find it boring!
Below are a few important things you need to consider when tasked with designing a system. Feel free to explore further on your own!
Clarify the data and classify it as shown below:
A general rule of thumb is that a table/collection shouldn't hold both mutable and immutable data together.
Main data -> Data that is written once at slower rates (less than 1,000 writes/sec), read many times, modified occasionally, and not something that gets deleted. Since the insertion rate is low, partitioning may not be required.
Audit/Transaction data -> Data that is written once, but at very high rates (more than 1,000 writes/sec), read occasionally, and almost never modified. Since the volume is high, we need to think about purging the data, and partitioning is very important.
Based on these four dimensions (write rate, read frequency, mutability, and retention), categorize all the data observed in the system. This will help you pick the right data stores, understand system interactions, and analyze tradeoffs.
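To make this concrete, here is a minimal sketch (PostgreSQL-style SQL, with hypothetical user_profile and user_activity_log tables) of keeping slow-changing main data and high-volume audit data apart, with partitioning only on the audit side:

```sql
-- Main data: written rarely, read often, occasionally updated -> no partitioning needed.
CREATE TABLE user_profile (
    user_id      BIGINT PRIMARY KEY,
    email        VARCHAR(255) NOT NULL,
    display_name VARCHAR(100),
    updated_at   TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);

-- Audit/transaction data: written at very high rates, read occasionally, never updated.
-- Partitioned by day so old data can be purged by simply dropping a partition.
CREATE TABLE user_activity_log (
    log_id     BIGINT NOT NULL,
    user_id    BIGINT NOT NULL,
    action     VARCHAR(50) NOT NULL,
    created_at TIMESTAMP NOT NULL
) PARTITION BY RANGE (created_at);

CREATE TABLE user_activity_log_2024_01_01
    PARTITION OF user_activity_log
    FOR VALUES FROM ('2024-01-01') TO ('2024-01-02');
```

Dropping an old partition is a cheap way to purge audit data without ever touching the main table.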
Identify the right shard key for your use case. Once picked, all the tables in the shard need to carry the shard key for lookups. Also, identify the likely queries first, before settling on a shard key: we should be able to group scans around the shard key instead of fanning out across all shards. In real-world scenarios, once a sharding pattern is picked it is difficult to change later without massive effort, so the shard partitioning approach should be maintainable. Cross-shard queries are an anti-pattern as well.
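As a sketch, assuming a hypothetical orders table sharded by customer_id: every table in the shard carries the shard key, and queries are grouped around it so the router can hit a single shard:

```sql
-- Every table in the shard carries the shard key (customer_id) so lookups can be routed.
CREATE TABLE orders (
    customer_id BIGINT NOT NULL,   -- shard key
    order_id    BIGINT NOT NULL,
    status      VARCHAR(20) NOT NULL,
    created_at  TIMESTAMP NOT NULL,
    PRIMARY KEY (customer_id, order_id)
);

-- Good: the router can send this query to exactly one shard.
SELECT order_id, status
FROM   orders
WHERE  customer_id = 42;

-- Anti-pattern: no shard key in the predicate, so this fans out to every shard.
SELECT order_id, status
FROM   orders
WHERE  status = 'PENDING';
```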
Creating secondary indexes that span the entire data model is an anti-pattern. Instead, we may use distributed secondary indexing with a dedicated, sharded database for secondary key lookups. This involves creating and managing secondary indexes in a way that accommodates the distributed nature of the database, so that queries based on secondary indexes can be executed efficiently across the distributed environment. The goal of distributed secondary indexing is to optimize query performance by allowing quicker access to data based on attributes other than the primary key, even in a distributed and potentially geographically dispersed system. This is particularly important in large-scale distributed databases, where traditional indexing approaches might face scalability challenges.
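One way to sketch this (all names hypothetical) is a dedicated, itself-sharded lookup store that maps the secondary attribute back to the shard key, so a secondary-key query becomes two targeted lookups instead of a scatter-gather across all shards:

```sql
-- Dedicated secondary-index store, sharded by email, mapping email -> customer_id (the main shard key).
CREATE TABLE customer_email_index (
    email       VARCHAR(255) NOT NULL,  -- secondary key; shard key of this index store
    customer_id BIGINT       NOT NULL,  -- shard key of the main data
    PRIMARY KEY (email)
);

-- Step 1: resolve the secondary key to the shard key (one targeted lookup).
SELECT customer_id FROM customer_email_index WHERE email = 'jane@example.com';

-- Step 2: use the shard key to query only the shard that owns the main data.
SELECT * FROM orders WHERE customer_id = 42;
```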
Analytical queries should not be executed on sharded workloads. A better approach is to replicate such data to a dedicated datastore meant for those queries. BigQuery tables, HDFS, Teradata, etc. exist exactly for that purpose.
When you choose BigQuery, for example, think about all the aspects: partitioning, parallel processing, performance stability, search/reporting ability, query behaviours, etc.

Replication can get complicated for cross-geo applications: writes in multiple geo locations (1), writes in one geo location with replication to multiple geo locations (2), or data located in only one geo location (3). Sometimes you take care of all these possibilities, sometimes only one of them. No conflict resolution is required in the last two cases, obviously. However, conflict resolution is required in the first case, where writes happen in multiple geo locations, to avoid data inconsistency. Every unique constraint or index is a point of potential conflict in bi-directional replication.
How can we prevent it? By setting up conflict-avoidance rules and error handling. This could be done with the help of an intermediary table along with a trigger, whose purpose is to verify whether the data is free of conflicts and can be replicated. I will leave the implementation to you, but you may have DB columns such as conflict_resolution_action, index_type, index_or_key_name, name_of_constraint, index_uniqueness, etc. When a data replication event is triggered, we check the incoming data against the rules that have been set up. If the data cannot be replicated, we mark the resolution action accordingly and proceed.
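A minimal sketch of such an intermediary rules table, using the column names above (everything else is hypothetical); a trigger or the replication job would consult it before applying an incoming row:

```sql
-- Rules describing which constraints/indexes can conflict during bi-directional replication
-- and what to do when they would be violated.
CREATE TABLE replication_conflict_rule (
    rule_id                    BIGINT PRIMARY KEY,
    index_or_key_name          VARCHAR(128) NOT NULL,
    name_of_constraint         VARCHAR(128),
    index_type                 VARCHAR(30),          -- e.g. PRIMARY, UNIQUE, NON-UNIQUE
    index_uniqueness           BOOLEAN NOT NULL,
    conflict_resolution_action VARCHAR(30) NOT NULL  -- e.g. SKIP, OVERWRITE, FLAG_FOR_REVIEW
);

-- Incoming replication events are checked against the rules before being applied;
-- rows that cannot be replicated are marked with the chosen resolution action.
CREATE TABLE replication_staging (
    event_id          BIGINT PRIMARY KEY,
    payload           TEXT NOT NULL,
    verified          BOOLEAN NOT NULL DEFAULT FALSE,
    resolution_action VARCHAR(30)
);
```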
When we talk about primary keys, we usually assume they are unique across the globe, but they can be only locally unique as well. For example, state_id can be 6 for California in country USA and for Rajasthan in country India. That doesn't mean both states are the same; we need to add the country_id to be able to distinguish them. So, when we use a locally unique primary key across the globe, we require a source identifier key (country_id in the previous example) to make any transactions. This requirement can impose additional challenges.
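A quick sketch of the composite key for the state example above (the ids are made up):

```sql
-- state_id alone is only locally unique; the composite key (country_id, state_id)
-- is what makes a row globally addressable.
CREATE TABLE state (
    country_id SMALLINT    NOT NULL,
    state_id   SMALLINT    NOT NULL,
    state_name VARCHAR(64) NOT NULL,
    PRIMARY KEY (country_id, state_id)
);

INSERT INTO state VALUES (1, 6, 'California'),  -- country_id 1 = USA (hypothetical ids)
                         (2, 6, 'Rajasthan');   -- country_id 2 = India

-- Every transaction must carry the source identifier (country_id) along with state_id.
SELECT state_name FROM state WHERE country_id = 1 AND state_id = 6;
```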
Avoid multi-round-trip queries (fetch, modify, and resend) and use a single-trip PL/SQL transaction query instead. We are talking about the regular way of writing JPA queries: one query to fetch the results, update the results in the application, and write them back to the original source. A single transaction query is the better option when there are multiple threads doing the same operation. Also, explore further ways to ensure concurrency safety at the application level.
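Here is a sketch of the difference, assuming a hypothetical account table: the multi-trip version reads the balance into the application, modifies it, and writes it back, which is unsafe under concurrency; the single-trip version pushes the modification into one atomic statement:

```sql
-- Multi round trip (anti-pattern under concurrency): fetch, modify in the application, resend.
--   SELECT balance FROM account WHERE account_id = 42;
--   ... application computes balance - 100 ...
--   UPDATE account SET balance = 900 WHERE account_id = 42;  -- may overwrite a concurrent update

-- Single-trip alternative: the database applies the change atomically in one statement.
UPDATE account
SET    balance = balance - 100
WHERE  account_id = 42
  AND  balance >= 100;   -- the guard condition doubles as a concurrency-safe business check
```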
When you have mutable main data and there is a requirement to track changes, you end up needing a history table. This history table is analogous to a regular audit table. Introducing the history table also introduces further requirements, such as purging: we need batch jobs, and the deletion work should be kept well away from the main history table's write path.
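A minimal sketch (hypothetical names) of an append-only history table, typically populated by a trigger or by application code, with purging done by a separate scheduled batch job instead of inline deletes:

```sql
-- History of changes to the mutable main table; append-only, like a regular audit table.
CREATE TABLE account_history (
    history_id  BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    account_id  BIGINT        NOT NULL,
    old_balance NUMERIC(14,2),
    new_balance NUMERIC(14,2),
    changed_at  TIMESTAMP     NOT NULL DEFAULT CURRENT_TIMESTAMP
);

-- Purging runs as a scheduled batch job, kept away from the online write path.
DELETE FROM account_history
WHERE changed_at < CURRENT_TIMESTAMP - INTERVAL '180 days';
```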
In the case of concurrent updates, row-level locks can become a bottleneck, especially when there are multiple updates within a single transaction. We can reduce this bottleneck by distributing concurrent updates across n rows instead of a single row. There is still a chance of collision, but it is heavily reduced. We can introduce a new column, account_mod_number, and query against it using a modulo operator. However, we can't blindly follow this approach at all times; rather, use it when you have concurrent updates within a single entity (for example, the same account_id or employee_id). Grouping the results by the entity would still give the expected results. The only downside is the data replication for each entity, but that's a tradeoff one can evaluate, and storage is cheap nowadays.
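Here is a sketch of the idea with a hypothetical account_balance_slice table: the balance of one account is spread across n rows keyed by account_mod_number, each writer picks one slice (e.g. via a modulo on some transaction number), and reads aggregate across the slices:

```sql
-- The single hot row is replaced by n slices per account; account_mod_number is 0..n-1.
CREATE TABLE account_balance_slice (
    account_id         BIGINT        NOT NULL,
    account_mod_number SMALLINT      NOT NULL,
    balance_delta      NUMERIC(14,2) NOT NULL DEFAULT 0,
    PRIMARY KEY (account_id, account_mod_number)
);

-- Concurrent writers land on different slices, so they rarely contend for the same row lock.
UPDATE account_balance_slice
SET    balance_delta = balance_delta + 100
WHERE  account_id = 42
  AND  account_mod_number = 3;   -- slice chosen by the application, e.g. txn_id % 8

-- Reads group by the entity and still give the expected result.
SELECT account_id, SUM(balance_delta) AS balance
FROM   account_balance_slice
WHERE  account_id = 42
GROUP  BY account_id;
```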
These are some data modeling checkpoints that you may want to focus on next time. I haven't gone into full implementations, but this is a good starting point for your exploration!