Debugging a Distributed DBMS: It is all about feelings
A distributed system is fancy talk for a system that can run on more than one computer. There are usually many instances of the same process, be it a database management system (DBMS) or something else, each running on its own server.
Debugging is a huge part of life as a software engineer, and learning to be good at it is essential. When I was starting out, I had only a blurry idea of the kinds of bugs that are out there, so when bugs were assigned to me I was unsure how difficult they were and what approach to take. As I became more senior, I developed a better map of the kinds of bugs that are out there and, with that, better intuition and clarity. My goal is to help other newbies form that mental map and make their onboarding into the distributed systems world of databases a bit easier.
In a distributed DBMS, each instance of the database process communicates with the other instances and retains some global state and invariants. Because it is a DBMS, the invariants are related to data and metadata. If one of them is violated, then we have a bug.
What kind of bugs does a DBMS have?
DBMS bugs can be broken down into the following categories:
Data corruption – Most databases these days aim for partition tolerance and store separate copies of the data so that they can sustain network failures. If the copies diverge and do not agree with each other, then we have a data inconsistency problem. That is usually a severe bug, treated with the highest priority. I have seen those bugs being assigned to junior developers, and if that is you, do not fret: it is probably a good sign that your company trusts you or wants you to learn a lot. From experience, these bugs are extremely hard to reproduce and trace.
Wrong data results – Even if your data is consistent, the execution logic of a query could still be wrong, and that leads to wrong results. Something like select count(*) from foo; returning 1000 when table foo has 2000 rows. Assuming no data corruption, those bugs are easy to reproduce and, therefore, on the easier side to tackle.
Memory leaks – The database process leaks memory over time, holding on to memory it no longer uses; yikes. Customers do not like that. It means the only way to regain the memory is to shut down and restart the process, which can be extremely costly in a non-cloud environment.
Hangs – The process just hangs; this could be the result of a deadlock or of waiting for resources to become available. These kinds of bugs are important to the end user but are actually quite easy to diagnose. Most of the time, attaching a debugger (in the case of a deadlock) or simply looking at the logs is enough.
Metadata inconsistencies – Every database usually has another database inside of it that stores metadata for the data. The metadata store keeps track of what kinds of objects exist in the database, for example, an SQL table, its columns and their dependencies. There could be bugs in this layer, much like in the data layer.
General system invariants – Violations of developer-enforced invariants, which are verified through asserts or various other scripts.
Usability bugs – Bugs whose fixes would enhance the user experience but are not necessary for correctness.
Performance bugs – This is the most ambiguous kind of bug and an extremely important one. Here is an example provided by my colleague Ryan. If you have 1000 rows in a table and you add 1 more and your join performance drops dramatically, that is a performance bug. Performance should be predictable and follow a smooth curve as the size of your data grows.
Regressions – Regressions can be correctness or performance-related. Correctness regressions usually come up when you run an old test on your new code and it fails by producing a different output. Those are usually easy to deal with. At that point, you have a full reproducer, and usually it is just a matter of attaching the debugger and figuring out what went wrong. Performance regressions come up when the performance of a test regresses, and those are trickier. You have to run perf or an equivalent performance tracker to figure out what it is that makes your performance suffer, optimize that area and check whether that fixed your bug.
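Two of these categories lend themselves to small automated checks, so here is a rough sketch of what such checks could look like. It is only an illustration under my own assumptions: the helper names (table_checksum, find_divergent_replicas, has_performance_cliff), the in-memory lists standing in for replicas, and the max_ratio threshold are all hypothetical and not taken from any real DBMS.

```python
import hashlib
from collections import Counter


def table_checksum(rows):
    """Order-independent checksum of one replica's rows (hypothetical helper)."""
    digest = hashlib.sha256()
    # Sort the textual form of each row so that row order does not matter.
    for text in sorted(repr(row) for row in rows):
        digest.update(text.encode("utf-8"))
        digest.update(b"\n")  # separator so adjacent rows cannot blur together
    return digest.hexdigest()


def find_divergent_replicas(replicas):
    """Return the names of replicas whose checksum differs from the most common one.

    `replicas` maps a replica name to its list of rows. Assumption: the most
    common checksum is treated as the correct copy, which a real system would
    have to justify.
    """
    checksums = {name: table_checksum(rows) for name, rows in replicas.items()}
    majority, _ = Counter(checksums.values()).most_common(1)[0]
    return sorted(name for name, value in checksums.items() if value != majority)


def has_performance_cliff(sizes, timings, max_ratio=3.0):
    """Flag a sudden jump in per-row cost between consecutive measurements.

    Assumption: a smooth curve means the per-row cost never grows by more
    than `max_ratio` from one data size to the next.
    """
    costs = [t / n for n, t in zip(sizes, timings)]
    return any(later > earlier * max_ratio for earlier, later in zip(costs, costs[1:]))
```

With three replicas where only one disagrees, find_divergent_replicas returns that replica's name. Ryan's example of 1000 rows versus 1001 rows with a dramatic slowdown would make has_performance_cliff return True.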
Now that we know what kind of bugs are out there, let’s examine the debugging process, which usually involves 7 steps.
1. Work on classifying the bug into one of the categories from above. That will help you understand it better and assign a priority to it, as well as a title for internal tracking.
2. If you have a reproducer, go to step 6.
3. Search for clues in the logs and the code.
4. Talk to people and ask for help. Contrary to popular belief, engineers like to be social.
5. Go to step 2.
6. Come up with a fix.
7. Celebrate!
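For readers who think in code, the seven steps above can be sketched as a small loop. This is just my own illustration: the function debug_bug, its callback parameters and the max_rounds cutoff are hypothetical names I made up, and real debugging is of course messier than a loop.

```python
def debug_bug(classify, has_reproducer, search_clues, ask_for_help,
              come_up_with_fix, max_rounds=10):
    """Walk through the seven debugging steps and return (category, trace)."""
    trace = []

    # Step 1: classify the bug into one of the categories.
    category = classify()
    trace.append("classify")

    for _ in range(max_rounds):
        # Step 2: with a reproducer in hand, jump ahead to step 6.
        if has_reproducer():
            break
        # Step 3: search for clues in the logs and the code.
        search_clues()
        trace.append("search clues")
        # Step 4: talk to people and ask for help.
        ask_for_help()
        trace.append("ask for help")
        # Step 5: go back to step 2 (the next loop iteration).
    else:
        # Assumption: give up after max_rounds instead of looping forever.
        raise RuntimeError("still no reproducer; time for a break")

    # Step 6: come up with a fix.
    come_up_with_fix()
    trace.append("fix")

    # Step 7: celebrate!
    trace.append("celebrate")
    return category, trace
```

The point of the sketch is that steps 2 to 5 form a loop whose only exit is a reproducer, which is why getting one feels like such a milestone.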
There you have it, some distributed systems basics for the newbie distributed systemser. If you liked this, comment below with your process and the kinds of bugs you deal with.
Debugging is often an emotional rollercoaster. I alternate between feeling in the flow, excited, happy and intelligent, and feeling disappointed, tired, not smart enough and sometimes burned out, especially when I have been staring at the same bug for hours. More on that later.
Debugging is something I do daily. Acknowledging and sharing my feelings with others is one of the things that has helped me the most in staying sane and learning to enjoy the process and my career as a software engineer. I am writing this post in the hope that I will help others feel less alone in their feelings.
Unlike what you would expect and what I was conditioned to believe, debugging is extremely emotional.
My experience is centered around hard distributed systems bugs because those take longer and usually capture a wider variety of feelings; however, I believe a lot of them are applicable to other scenarios.
Positive Feelings
After every step of progress, I usually feel productive and happy.
Talking to other people and getting different insights from them adds to the feeling of productivity. If I am feeling like I am making progress then I am usually happy.
Sharing my struggles with a co-worker makes me feel less alone.
Looking at the logs and finding some useful pieces of information makes me feel like a detective.
Having a new idea is the best feeling. I have always loved that part. It is a big reason why I love computer science.
Writing more tests for a component in an area that the original testing missed makes me feel like a professional who knows what they are doing. Making sure an area is fully covered test-wise is not the sexiest job, but it is very illuminating and needs to be done.
Getting a reproducer is awesome! It is so hard, yet it clearly marks your progress with your bug.
Simplifying my reproducer is fun; this is more the chill kind of work for me.
Working on getting the solution is also very enjoyable and stress-free. At the point when I have a reproducer, getting a solution requires designing a new algorithm and writing new code. I enjoy all aspects of that process and I take it as a little treat for spending all the time coming up with a reproducer.
I also feel very grateful for how it makes other aspects of my life feel. This is a weird one. For me, everyday tasks become a lot less overwhelming. For example, I have always dreaded filing my taxes, but not after I started working in distributed systems. I just remind myself that if I can fix distributed systems bugs, I can file my taxes.
Being able to share different technical parts of the debugging process with another engineer is very gratifying and enjoyable. I get to teach another person about a curious part of the code, share my process and way of thinking, and see the light bulb go on in their head too. Also, joking about how badly written a piece of code is, or how I missed something stupid the first time around, is fun.
And now on to some of the negative feelings
It is common to feel lost when dealing with a bug, just like going for a hike without a GPS or any prior knowledge of how far you are going.
Working for hours and maybe using the same approaches without making any progress. Unfortunately, this step is necessary for progress to happen.
Not asking for help when I am clearly stuck. This does build some character, but I usually try to avoid that and chat with other people. Finding the right time to do that has been one of my biggest improvements since I started working as a software engineer.
Not taking a break when I clearly need one because I need to make a deadline. This is just ineffective, and over the years, I have learned to recognize it and steer myself clear of that kind of dead end. Going to the gym, going for a walk, getting tea in the break room or having plans with friends that I just cannot cancel are all good breaks that leave me refreshed and ready to go back to work with a clear mind.
Sometimes, when a bug is really hard, and there is no one to ask for help, I feel ready to give up. That is also a good clue that I need a break. 99% of the time, I no longer feel that way when I come back to my bug.
Some general process tips I try to follow:
Usually, after a bug is assigned to me, I try to categorize it and decide what approach is best. At that point, I feel ready to tackle a new challenge and even excited. Equipped with a process I have developed over the years, I am still in known territory that feels familiar, which also makes the challenge ahead feel doable.
I also try to optimize my work environment: I work best when I am surrounded by other people. It is very motivating to have more people working next to me, so I try to be in the office most days. In addition to that, I do best when I can chat with fellow engineers and bounce possible ideas off of them. Even if the other person does not have much insight to offer, the mere act of bouncing ideas off of each other and sharing the difficulty of the bug with another person makes me feel less alone, less incompetent, and more productive and ultimately leads to faster solutions.
If I am feeling productive, even if working on something hard, I am overall in a good mood. So, I try to do things that make me productive often and prioritize taking breaks and sharing how I feel with people.
Overall, there are a lot of ups and downs: some days I hate debugging, and others I love it. At the beginning of my career, given a choice, I would always choose to develop a feature over debugging, and even now, four years later, I still do. Developing a feature is more fun and creative to me. However, debugging has made me a better developer, and at the end of the day, it is necessary. Learning to navigate my feelings throughout that process has been key!
How do you go about debugging? What kind of feelings does debugging bring up in you?
*Thanks to @danicaporobic for the Twitter comment on the types of bugs out there.