What to think about when designing, building, managing and operating data systems.


9 min read

This post draws extensively from a book I am currently reading, Designing Data-Intensive Applications by Martin Kleppmann, and from my experience managing data systems (primarily core data infrastructure).

Hey! What is even a Data System?

First, I would like to define a data system in my own words. Before that, though, some context: I used to think of data systems as databases only, or in a simpler sense, anything that serves as a database. That could be physical sheets of paper, the popular Microsoft Excel, or more complex databases like Oracle, MySQL and MSSQL. In essence, I only thought of data systems as systems that store data. As I have grown in my career, I have begun to realise that almost all systems are, in some way, data systems. To put out a definition, I would say a data system is any system, whether digital or analogue, that needs data, processes data and/or stores data. It could do one of these things or a combination of them. I say this because a system that needs data influences the choices made for every other system coupled with it.

A data system is a system that cares about the shape, size, type, and form of the data it needs to function as designed.

Fundamental concerns when designing data systems

There are certain things one must consider when building efficient systems. These concerns apply to almost all types of digital systems; in this blog, I discuss them in the context of data systems specifically. Anyone designing a data system would have questions such as:

  • What happens when we have double the traffic that we have now?

  • When something goes wrong, can we see when the fault happened, what caused it, who caused it, and can we re-create it?

  • How do we make sure the system can recover from failure, either manually or (preferably) automatically, without losing or corrupting data?

  • Can we make changes to the system in the future?

If you are building a similar system and have these questions in mind, you are thinking in the right direction. These are typical questions that can largely be grouped into three main buckets, which we will delve into shortly. That said, they are very broad questions and far from covering every consideration when designing a data system, or even managing an already-built one.

Back to the buckets 😅. The main areas of concern are reliability, scalability and maintainability.

Let's get cracking!!!!!

Reliability

When referring to people, being reliable is akin to being trustworthy. That is, do people trust you to do what you are tasked and expected to do, even with the numerous distractions life throws at us daily? The same is expected of the digital systems we use, even more so the mundane everyday ones. We all want our messaging apps to keep our chats no matter what happens to our phones or to the servers hosting them. Reliability has become an expectation in our society.

Reliability refers to the correct functioning of a system during its life span according to its design expectations. It also entails the ability of a system to tolerate and recover from faults and failures. Other aspects of reliability involve ensuring access to only authorised personnel and keeping the system secure.

Fault and failure are often used interchangeably; however, they differ: a fault is an issue that causes a component of the system to deviate from its expected behaviour, while a failure occurs when the system as a whole stops providing the required service. A failure is usually the result of several faults.

Typical categories of faults that occur in data systems are:

Hardware faults: These include faults such as disk crashes, power disruptions, cable wear, faulty RAM, etc. You will typically hear terms such as mean time between failures (MTBF) and mean time to failure (MTTF). IBM has a good blog post explaining both terms here.
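
To put rough numbers on those terms, here is a minimal sketch of how MTTF is commonly estimated as total operating time divided by the number of failures. The fleet size and failure count below are made up for illustration.

```python
# Rough illustration of estimating MTTF for a fleet of disks.
# The fleet size, hours and failure count below are made-up numbers.

operating_hours = 1_000 * 24 * 365   # 1,000 disks running for one year
failures = 25                        # disks that failed in that period

mttf_hours = operating_hours / failures
print(f"MTTF ≈ {mttf_hours:,.0f} hours per disk")   # ≈ 350,400 hours (~40 years)

# Even with an MTTF of decades per disk, a fleet of 1,000 disks at this
# rate still sees failures regularly, which is why large systems treat
# hardware faults as normal events rather than emergencies.
```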

Software errors: These are also referred to as bugs. Sometimes, software errors can trigger hardware faults, e.g. a bug at the kernel level could cause the hard disk to crash.

Human errors: The most common source of errors is humans. Even with the best intentions, we still make numerous mistakes. In cybersecurity, it is often said that humans are the weakest link, which shows how error-prone we are even though we design and build the most robust systems. Since human error is the most common source of faults, here are some tips on dealing with it:

  • Decouple components and aspects of the system where humans tend to make the most errors, e.g. make configuration modular instead of one big configuration file (see the sketch after this list).

  • Monitor, Monitor, Monitor. Collect telemetry data relevant to the state of a system in parts and as a whole.

  • Employ thorough and robust testing. Make use of unit tests and integration tests to make sure the system acts as intended. Netflix uses an interesting approach called Chaos Monkey, which deliberately introduces failures in production to test resilience.

  • Well-designed abstractions can ensure that we interact with systems appropriately by limiting what we can do directly, thereby reducing the room for human error.

  • Design to recover quickly. Make error logs understandable, and document commonly faced errors and their recovery steps.
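
To make the first tip concrete, here is a minimal sketch, assuming a hypothetical config/ directory of small JSON files, of keeping configuration modular so that a bad edit to one concern cannot break an unrelated part of the system.

```python
# Hypothetical sketch: load each concern from its own small config file
# instead of one large file that everyone edits.
# File names and keys are made up for illustration.
import json
from pathlib import Path

CONFIG_DIR = Path("config")

def load_config(name: str) -> dict:
    """Load a single, narrowly scoped config file (e.g. cache.json)."""
    with open(CONFIG_DIR / f"{name}.json") as f:
        return json.load(f)

# Each team or component touches only the file it owns, which limits the
# blast radius of a mistaken edit to that one concern.
db_cfg = load_config("database")   # e.g. {"host": "...", "pool_size": 10}
cache_cfg = load_config("cache")   # e.g. {"ttl_seconds": 300}
```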

Scalability

Scale can refer to different things depending on the context. Concerning data systems, we will focus on one fairly broad concept central to many of the questions commonly asked when scaling vertically (up or down) or horizontally (out or in). That central idea is defining your load, because the scalability of a system can be narrowed down to the question: "How will this system perform if we increase or decrease the load by a factor of X?"

Essentially, load is the amount of computational work done by a system. To define load appropriately, it must be described by some numbers referred to as load parameters. These parameters differ per system: they could be the number of users, requests per second, the size of each request, writes versus reads per second, the number of cache hits and misses, and so on. The idea I am trying to get across is that the right load parameters will often depend on the architecture of the system, and a good way to measure performance is by using percentiles rather than an average, because percentiles let you estimate the number of affected users intuitively.

Take, for instance, an application serving 1 million unique users, connecting to a database that processes most of its requests within 10ms to 1s. That is not a long time in isolation, but it matters when chaining different requests together to provide a single piece of functionality. If the median response time is 30ms and the 95th percentile is 50ms, we can say that roughly 50,000 (5%) of users are experiencing the slowest response times over a given period, while half of the user base sees response times of 30ms or below.
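
To see how numbers like these fall out, here is a small sketch that computes the median and 95th percentile from a batch of response times; the latencies are simulated sample data, not real measurements.

```python
# Minimal sketch: summarising response times with percentiles instead of
# an average. The latencies below are simulated, in milliseconds.
import random

random.seed(42)
# Simulate 10,000 request latencies: most are fast, a small tail is slow.
latencies_ms = [random.gauss(30, 8) for _ in range(9_500)] + \
               [random.uniform(50, 1_000) for _ in range(500)]

def percentile(values, p):
    """Return the p-th percentile (0-100) using the nearest-rank method."""
    ordered = sorted(values)
    k = max(0, round(p / 100 * len(ordered)) - 1)
    return ordered[k]

p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
print(f"median ≈ {p50:.0f} ms, p95 ≈ {p95:.0f} ms")
# If p95 is 50 ms, then 5% of requests (and, roughly, 5% of users over a
# period) are seeing responses slower than that.
```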

Ways to deal with varying load

When it comes to adapting to load, there is usually a dichotomy between increasing the capacity of your system by adding more machines (scaling horizontally) and adding resources such as CPU and memory to existing machines (scaling vertically). In practice, both approaches are often used within a single system, because horizontal scaling can become incredibly complex to manage once there are many nodes, especially for stateful systems like data systems.
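
As a toy illustration of that dichotomy, an autoscaling decision is often just a rule that compares a load parameter against a target and decides how many nodes you need. The metric names, thresholds and numbers below are made up; a real system would lean on its orchestrator's autoscaling policy.

```python
# Toy sketch of a scale-out decision based on average CPU utilisation.
# Metric names and thresholds are hypothetical.
from dataclasses import dataclass

@dataclass
class ClusterStats:
    nodes: int
    avg_cpu_percent: float        # average CPU utilisation across nodes
    requests_per_second: float

def desired_nodes(stats: ClusterStats, target_cpu: float = 60.0) -> int:
    """Scale out/in so average CPU sits near the target utilisation."""
    scaled = stats.nodes * (stats.avg_cpu_percent / target_cpu)
    return max(1, round(scaled))

current = ClusterStats(nodes=4, avg_cpu_percent=90.0, requests_per_second=1_200)
print(desired_nodes(current))   # 6 -> add two nodes to bring CPU back down
```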

I like to think of stateful applications as systems where all members must be aware of the current state of the system before acting on anything. Therefore, in a multi-node architecture where data is spread across several nodes, all nodes must have a way of knowing the current state of the entire system.

One key thing to note when weighing these approaches is the operability of the system. In choosing your architecture, it is essential to make sure that it is relatively easy to manage, optimise, make changes to, and keep resilient.

Maintainability

This aspect is so important that there is no point in building a system if it cannot be maintained. Just don't do it. In my experience, I spend a lot of time thinking about how to make any system I work on easier to maintain. In fact, most of the cost of software lies in maintaining it: integrating with new systems, adapting to changes in the environment and technology, fixing bugs, patching vulnerabilities, and so on.

Maintainability is a big word, so let us define it and deconstruct what it means for a system to be maintainable. IEEE Standard Glossary of Software Engineering Terminology defines maintainability as: "The ease with which a software system or component can be modified to correct faults, improve performance or other attributes, or adapt to a changed environment."

Making a system maintainable takes a lot of effort and planning. It is essential to think about maintenance as early as possible when designing a system. There is no one-size-fits-all method to do this, but there are principles that can guide you to achieving this. These design principles for data systems are:

  • Simplicity: This refers to building your system such that new engineers and operators find it easy to understand. Remember, today's system is potentially tomorrow's legacy system, and we all know we like a properly built legacy system.

  • Auditability: In my years working on data platforms, the most frequent requests were about who did what on the system and when they did it. Having fine-grained access control and visibility into what happens plays a vital role in understanding internal usage patterns, tracking and reproducing bugs, and so on (a minimal audit-record sketch follows this list).

  • Evolvability: No system is likely to stay the same forever; it must evolve with rapidly changing user needs, business needs and other factors such as technological advancement. Systems must be malleable enough to cater to these changing needs.

  • Operability: Systems must be easy to operate. Good abstractions can make a system easy to operate and keep running smoothly.
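
As a minimal sketch of the auditability point above, recording a structured, timestamped entry per action is often enough to answer "who did what, and when" later. The field names and log path here are hypothetical.

```python
# Minimal sketch of appending audit records so "who did what, and when"
# can be answered later. Field names and the file path are hypothetical.
import json
from datetime import datetime, timezone

AUDIT_LOG = "audit.log"

def record_action(actor: str, action: str, target: str) -> None:
    """Append one structured, timestamped audit entry per action."""
    entry = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "actor": actor,      # who
        "action": action,    # did what
        "target": target,    # to which object/table/config
    }
    with open(AUDIT_LOG, "a") as f:
        f.write(json.dumps(entry) + "\n")

record_action("alice", "ALTER_TABLE", "orders")
```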

The principles mentioned above may not be exhaustive, but they will cover most of your maintenance needs if you think through each of them carefully. It is also important to involve other stakeholders when designing systems.

If you have reached this part of the article, I must say a big thank you 🎉 for getting to the end. To recap, data systems are more than just databases and analytical systems; they extend to any system that cares about the data it receives and gives out. Because these systems are so prominent, they must be designed and built carefully. To that end, we discussed three main areas to really think about: reliability, scalability and maintainability. Reliability is concerned with making the system fault-tolerant, scalability deals with ensuring the system performs well even when the load changes rapidly, and maintainability refers to a system being simple to understand and able to evolve.

Feel free to drop a comment, feedback or question. It will be much appreciated.
