The Limits of Determinism

There is something a little bit magic about writing code. You start with nothing*, and after hacking around for a while, you end up with a set of executable instructions that can make a computer do something useful or interesting. This is even more true in recent years than when I started, with the explosion of cloud-based services that give individual developers (or small teams) the ability to distribute applications globally, with few obstacles, and at a downright reasonable price. I don't think there are too many fields where an individual can produce something valuable without needing approval, lots of money or supplies, or even to leave the house.

Another aspect of programming that I've grown to appreciate over the course of my career is the deterministic nature of software. Barring some circumstances so rare as to be hardly worth mentioning, a piece of software will perform exactly as it's told, repeatedly and without error. Granted, programs rarely do exactly what people want them to do. But that's our fault for not clearly understanding what we want to happen, or not correctly expressing those wishes to the machine. The machine doesn't care about project deadlines, political strife, or the ceaseless online debates between humans about the best way to solve a problem.

This attribute turns out to be essential when it comes to debugging, which is the process of finding out why a program isn't working the way we want. If a program always faithfully executes its instructions, and we can get it to do the wrong thing over and over again, we can eventually narrow in on why that wrong thing is happening. This process is much faster if we have access to the source code, and access to the actual machine where it's running to inspect other characteristics not directly related to our program, but which nonetheless affect its behavior (e.g., the operating system). It's also much, much faster if we have access to a search engine and Stack Overflow.

Now, I rarely toot my own horn. Maybe that's a result of learning a long time ago about the Dunning–Kruger effect, or spending time with tons of smart people who know way more than me. But if I may say so, I am good at debugging. A big reason for this is that I approach the task with a honed philosophy, which can be summed up simply: assume nothing. Suppose I'm working on a calculator application. I find that when one types "2+3", the output is "3". It's tempting to immediately dive into the code that performs the addition and look for bugs there, but I try to resist the urge. Instead, I first check that the calculator code actually sees the numbers "2" and "3". In my experience, a good number of bugs are found through these kinds of "sanity checks", which are easy to do and don't send one down cognitive rabbit holes. Helpdesk technicians the world over must agree with me, given their archetypal first question: "is the computer plugged in?"
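To make the calculator example concrete, here is a minimal Python sketch of that kind of sanity check. Everything in it is hypothetical and invented for illustration: `buggy_parse`, `correct_parse`, `calculate`, and `sees_expected_operands` are not from any real calculator, and the particular bug (dropping everything before the last "+") is just one assumed way "2+3" could come out as "3".

```python
def buggy_parse(expression):
    # Hypothetical bug: keeps only the text after the last "+",
    # so the first operand never reaches the addition code.
    return [int(expression.rsplit("+", 1)[-1])]


def correct_parse(expression):
    # Split on "+" and convert every piece to an integer.
    return [int(part) for part in expression.split("+")]


def calculate(expression, parse):
    # The "addition code" that is so tempting to debug first.
    return sum(parse(expression))


def sees_expected_operands(expression, parse, expected):
    # Sanity check: does the adder even receive the numbers we typed?
    return parse(expression) == expected


if __name__ == "__main__":
    print(calculate("2+3", buggy_parse))                       # wrong answer
    print(sees_expected_operands("2+3", buggy_parse, [2, 3]))  # check fails
    print(calculate("2+3", correct_parse))                     # right answer
```

In this sketch the addition itself (`sum`) is never at fault. The check on the operands points at the parser before any time is spent staring at the arithmetic.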

That's why I felt excited when my boss called me late one Wednesday night, as my wife and I were putting our kids to bed. One of our customers was having some problems with our product, and these problems were threatening to sink a large, important contract. Would I be willing to fly to Kansas City early the next morning to help figure out what the problems were, and how to fix them? After checking with my wife, I answered "sure!" Big mistake.

The next morning, I met up with some employees of our prospective partner, which I'll henceforth call BitCo. They showed me how one of their data pipelines was running far too slowly for their needs. I quickly figured out the reason for the poor performance. Unfortunately, our system didn't have an elegant, efficient way to handle the data from their source system, so we started to consider some workarounds. By the end of the day, we had one in place, which was simply an additional step that buffers some data to the local disk before doing the remaining processing steps. This wasn't an optimal solution, but it should have been good enough to get things started. And besides, we always work with customers to help them refine their approach over time. A snowy forecast in Chicago had thwarted my plan to fly home that same night, so I headed back to the BitCo office in the morning to wrap things up and see what other minor issues remained.

I arrived to find that the data pipeline had crashed overnight, which is bad. It was immediately apparent that the problem had to do with Kerberos†, at which point the color probably drained from my face. But we had seen almost this exact same thing before, so I was confident we could get to the bottom of it quickly. I had them check all the most obvious things, and started reading up on common pitfalls with Kerberos in Hadoop‡.

Unfortunately, there were some roadblocks. The crash happened only after a long period of time (up to 8 hours), so it was impossible to quickly test a potential fix. There were also a couple more impediments that fell squarely in the "human-imposed" category. First, because there were people actively using the system that BitCo was running, we couldn't poke, prod, and restart it in the same way I would to debug something on my own computer. Also, the set of machines where our application was running was actually owned by a third company, which I'll call Ma Bell. Ma Bell had very strict security procedures. I couldn't simply log into one of the Kerberos machines to look at log files. I could have asked BitCo to open a ticket with Ma Bell, who might have eventually relayed some useful information back to us. But this was far too uncertain, not to mention slow, to be of any use.

Fortunately, I had access to a secret weapon in the form of a coworker. He knows more about Kerberos than anybody at the company, and probably more than 95% of the people worldwide who even knew that word before reading this story. If you Google for "Kerberos Hadoop", he is well represented on the first page of results. These results included his finding and fixing a relatively obscure bug that we were sure was somehow involved in the current BitCo problem, since the symptoms were uncannily similar. He generously gave me his time from his home in Europe as I agonized over the problem for the next several days.

Despite his expertise, our collective brainstorming, my various attempts at extracting precious diagnostic information, and numerous speculative fixes, the stubborn crash remained. A personal fault of mine is that I have a hard time letting go of an unsolved problem, and this was the mother of all of them. I knew in the back of my mind that the solution was out there, somewhere, if only I could trace through dozens of source code files and correlate them to log lines and other pieces of information I'd gleaned. After all, the entire system - including the crash itself - was deterministic. This wasn't happening due to the motion of the planets or a butterfly beating its wings. It was happening for one of two reasons. Either there was a bug somewhere deep in the bowels of some software library (maybe one our company wrote, maybe not). Or there was something set up not quite correctly within the BitCo/Ma Bell system of computers, which I hadn't yet uncovered (Kerberos is very sensitive to slight misconfigurations, for good reason: these can break the carefully orchestrated security). I knew that I could - theoretically - solve the problem given enough time and attention.

But as the days dragged on with no demonstrable progress, I started to question whether that was actually true in practice. I am, after all, a human. Although I never quite came to resemble John Nash in A Beautiful Mind, my personal life and relationships suffered as I slogged through. My mind became mush; I couldn't even work out a way to catalog all the relevant information, much less make heads or tails of how to proceed.

Fortunately, the next week, another brilliant coworker came to the rescue. He flew in to do some training at BitCo, and managed to save the day by swapping the workaround I had put in place for a different one. We theorized this second approach stood a better chance of succeeding, because it used other techniques we knew to be working before I originally came on site. And sure enough, after the switch was made, the pipeline stopped crashing.

As the acute crisis started to fade, I relished the prospect of returning to my exciting and interesting "day job" project, which had languished thanks to this detour. But there was something I had to do first. I went back and implemented a feature in our system that would allow BitCo to more efficiently transfer the data into our application from their starting system. Naturally, I did this with a little help from a third coworker, who quite literally wrote a book on that system. In the end, it was short and elegant; fewer than a hundred lines that - in some alternate universe - saved us all a week of failed workarounds and pure agony. At a personal level, it was catharsis through code.

Once my change passed internal review and was merged into our master codebase, I reached out to my main contact at BitCo, who had been gracious and as helpful as he could throughout the ordeal. I told him the feature was ready, and all that was left was for me to hop in my time machine and go back two weeks. His response was immediate. "Take me with you."

* By using the word "nothing", I am admittedly glossing over a few minor things. For starters: the incalculably large library of open source code that many have toiled to create over decades, and the tremendous collective body of knowledge that all programmers and tinkerers who have ever lived helped build.

† Kerberos is an authentication system. In a nutshell, it allows people and software applications within an organization to prove their identity to each other. This is necessary, for example, to prevent Bob from reading Alice's email. It is the gold standard for having an acceptable level of security at many companies. Kerberos was originally developed at MIT, and I suspect your best chance at really understanding it is to get an education there.

‡ For the purposes of this post, I'll grossly simplify Hadoop this way: it's a complex distributed system that is both horrible and the best way people have come up with to handle large amounts of data.
