A Server Error Got Me a Free Meal
By Noam YadgarI was at the office, about to go out and grab lunch from a nearby restaurant. Like many companies, I work in one that uses a service to manage our budget for paid meals. Using a mobile app, I can generate a 5-digit code, which I give to the restaurant, and it automatically charges my account. Beautiful, simple, and effective.
I was standing in line, tapping my fancy code-generation button, and after a few seconds, I witnessed a (probably) rare event. The food app suddenly showed me that my order had been accepted and confirmed. Did someone hijack my code? At this point, I thought this was just a client bug. I regenerated a new code and went on with my order.
When I left the restaurant, I checked my order history and was surprised to see an order from a restaurant in a nearby city. I realized I’d probably experienced a race condition in its most glorious form. I believe I’ve regenerated another user’s code, and when the other user finished the order, the charge was on my account (probably made someone happy and confused).
The food app company did an excellent job detecting this issue. After about 3 minutes, I received a text telling me the order was canceled. Interestingly enough, I checked my account balance and saw that not only had they canceled this order, but they had also reverted my account balance to before any order was made that day (yay! Free meal!).
With a smirking face, I thought: “What a great excuse to write a system design article,” and ask what could’ve gone wrong. Could that be avoided? Is ACID guaranteed in distributed systems? If so, what’s the cost? Is it worth it? (I promise a juicy twist at the end).
What could’ve gone wrong?
Random code generator
Random codes are a common thing among MFA systems. MFA systems exist so that users can prove to a service that they aren’t hackers. If hackers steal your bank account credentials, they may have to steal your phone. MFA codes are simple because they are almost always linked to some unique identification (like a user ID, phone number, email address, and sometimes even MAC address). There’s nothing wrong with generating two identical codes.
The food app is more complicated. The code has a global scope and must be unique across the entire system. This is because, unlike MFA, there’s a third player in the food app.
sequenceDiagram Actor User participant Food App participant Restaurant User ->> Food App: Clicks on "Generate Code" button Food App -->> User: Returns code User ->> Restaurant: Shares code Restaurant ->> Food App: Requests transaction Food App -->> User: Notifies
Users are anonymous to the restaurant. For the restaurant, an active code is a way to open a channel to the food app company and request a transaction. They aren’t concerned with user identities. This property allows users to share a code with someone else.
Given that the food app provides 5-digit codes, there’s a maximum capacity of \(100,000\) ephemerally active codes at a given moment. We can generalize the chances of collisions with the simple expression \(P = 1 - \frac{N-n}{N}\). Where \(P\) is the probability of generating a duplicate code, \(N\) is the total number of codes in the system, and \(n\) is the number of currently active codes. The more active codes in such a system, the more likely we generate duplicates. But just because we generate duplicates doesn’t mean we must send them to users.
Making a simple code-generator
We can write a code-generator program that generates 5-digit pseudo-random codes. However, more is needed. We also want to give each generated code a life span of a few minutes. This way, we can look up this code and tell if it’s active (or taken). This feature introduces our program to a state. We may want to decouple our program from this state by a database to run multiple instances of this program and manage our generated codes at a shared location.
Even if our code generator can accept concurrent requests, we can still guarantee atomicity and consistency at the application level. In a single running instance, we can lock the critical part of the request and ensure codes are generated and stored synchronously. This will have a small impact on performance but guarantee consistency.
sequenceDiagram Actor User participant Code Generator participant Cache User ->> Code Generator: Clicks on "Generate Code" button loop exists == true Code Generator ->> Code Generator: Generates a random code Code Generator ->> Cache: exists = Cache.exists(code) end Code Generator -->> User: Returns code Code Generator ->> Cache: Cache.set(code, ttl)
Scaling out
At some point, our single instance code generator can no longer support the ever-growing traffic at peak hours, and we might want to deploy multiple instances of code generators. Without modifications, we can no longer guarantee consistency because our atomic process of a single instance is now happening concurrently across multiple instances.
sequenceDiagram Actor Alice Actor Bob participant Code Generator 1 participant Code Generator 2 participant Cache Alice ->> Code Generator 1: Clicks on "Generate Code" button Bob ->> Code Generator 2: Clicks on "Generate Code" button Code Generator 1 ->> Code Generator 1: "12345" Code Generator 1 ->> Cache: Cache.exists("12345") Code Generator 2 ->> Code Generator 2: "12345" Code Generator 2 ->> Cache: Cache.exists("12345") Cache -->> Code Generator 2: false Code Generator 2 -->> Bob: Returns "12345" Cache -->> Code Generator 1: false Code Generator 1 -->> Alice: Returns "12345" Code Generator 1 ->> Cache: Cache.set("12345") Code Generator 2 ->> Cache: Cache.set("12345") Note right of Cache: error: key already exists
We can preserve atomicity by moving the lock to the database level (traditionally in a transaction). We can send the code to the user only after successfully writing it into the database. It will, again, cost us in performance and increase network delays but evade collisions.
The effect of network delays and performance degradation can get even worse if the database is distributed across multiple nodes. (But that’s a topic for another article).
By now, you can already tell that there’s a trade-off between performance and consistency. This trade-off is common in digital services, and companies are continuously waging and prioritizing one trade at the expense of the other.
Let it fail and deal with it after
I promised a twist, so here it is. There’s a good chance that I’ve experienced this race condition because they let it happen by design. Most engineers will naturally look for the best bug-proof solution within the scope of the internal process, and this is an excellent quality to have as an engineer. Better engineers (some say architects) will also know how to wage trade-offs and translate business priorities into design decisions. I believe that the food app company prioritizes performance and predictability and is willing to tolerate a small percentage of system faults to fulfill these top priorities.
I’d claim that there’s a good chance that the code is being generated and sent to the user and only then verified against a database. This approach is much more predictable and responsive and relies on pure statistics. For example, if the food app company knows there are about \(1000\) active codes at a given moment (in peak hours), they can predict an error rate of only \(1\%\).
You can argue that they can dramatically lower this error rate by adding more numbers to the code. However, this code is for humans. Can you imagine customers reading 16-digit numbers to the person behind the counter? Maybe a 5-digit, base 64 number is better. In any case, they’re satisfied with 5-digit, base 10 numbers, and that’s okay.
Other suggestions might be:
- Use an atomic counter and not be random at all. The problem is that if users understand it, they can exploit it by generating a number and guessing another number near that number. It also means that we still have a lock in the database.
- Use the atomic counter from below circularly (
mod N
), and have the values represent an index to a pre-generated list of random numbers. This is an interesting idea. However, it adds a lot of complexity. Instances need to be notified (or constantly fetch) newly generated lists. - Have each instance generate codes with a unique “instance prefix.” That would be nice, but it would shorten the actual random part of the code.
About the error handling
This topic deserves an entire article, but let’s understand why they reverted my account as if I never made an order that day.
First, no matter what, the food app company will lose money in this error. While they can refund the charged user, they can’t charge the user who generated the duplicated code first because they can’t tell for sure (even if it’s very likely) that this user made the order.
I believe that the account balance of each user is being managed with transaction logs of some sort. Each transaction is appended as a log to the database, thus keeping a transactional history of the account. A good approach is to declare these logs immutable, meaning that we only append new ones and never delete or modify existing ones.
Maybe the engineer who wrote the error-handling logic for this case thought that the detection and handling of this case would happen in a fraction of a second, so it’s highly unlikely for later transactions to be added. Then, the process violates immutability and deletes logs up to a point before the wrong transaction.
It may not be the exact cause, but I believe some involvement in mistreating time-based snapshots and reverting to the wrong state occurred. You may be tempted to say that I’m right, and this is just poor error handling, and it seems pretty simple to append a new transaction with the reversed amount of the wrong transaction. But once again, this system’s complexity and business priorities might be such that the way it handled this issue has the expected outcome.
Summary
Our tendency to trust automated digital services was never higher than today. We trust servers and functions to execute our most sensitive money transfers, and we’re also on board with trusting AI assistants to manage our doctor appointments. And why wouldn’t we? Computers are better at accuracy, data, and speed when serving clients.
We also tend to forget that digital services can fail and make mistakes. Digital services are not pure (in a mathematical sense). They are inherently susceptible to unexpected side effects such as node crashes, network hangups, bugs, memory and CPU issues, synchronization issues, etc.
I’d be happy to end this article with a nice key takeaway: acknowledging the difference between a good service that is doing what it’s supposed to do and an excellent service that can also gracefully handle its faults in an uninterrupted or completely hidden way to users. To the food app company: your system is wonderful. Thank you for buying me lunch.