A Server Error Got Me a Free Meal

By Noam Yadgar at July 2, 2024

#system #design #engineering #analysis #system-design #architecture

I was at the office, about to go out and grab lunch from a nearby restaurant. Like many companies, I work in one that uses a service to manage our budget for paid meals. Using a mobile app, I can generate a 5-digit code, which I give to the restaurant, and it automatically charges my account. Beautiful, simple, and effective.

I was standing in line, tapping my fancy code-generation button, and after a few seconds, I witnessed a (probably) rare event. The food app suddenly showed me that my order had been accepted and confirmed. Did someone hijack my code? At this point, I thought this was just a client bug. I regenerated a new code and went on with my order.

When I left the restaurant, I checked my order history and was surprised to see an order from a restaurant in a nearby city. I realized I’d probably experienced a race condition in its most glorious form. I believe I’ve regenerated another user’s code, and when the other user finished the order, the charge was on my account (probably made someone happy and confused).

race_condition

The food app company did an excellent job detecting this issue. After about 3 minutes, I received a text telling me the order was canceled. Interestingly enough, I checked my account balance and saw that not only had they canceled this order, but they had also reverted my account balance to before any order was made that day (yay! Free meal!).

With a smirking face, I thought: “What a great excuse to write a system design article,” and ask what could’ve gone wrong. Could that be avoided? Is ACID guaranteed in distributed systems? If so, what’s the cost? Is it worth it? (I promise a juicy twist at the end).

What could’ve gone wrong?

Random code generator

Random codes are a common thing among MFA systems. MFA systems exist so that users can prove to a service that they aren’t hackers. If hackers steal your bank account credentials, they may have to steal your phone. MFA codes are simple because they are almost always linked to some unique identification (like a user ID, phone number, email address, and sometimes even MAC address). There’s nothing wrong with generating two identical codes.

The food app is more complicated. The code has a global scope and must be unique across the entire system. This is because, unlike MFA, there’s a third player in the food app.

sequenceDiagram
    Actor User
    participant Food App
    participant Restaurant

    User ->> Food App: Clicks on "Generate Code" button
    Food App -->> User: Returns code
    User ->> Restaurant: Shares code
    Restaurant ->> Food App: Requests transaction
    Food App -->> User: Notifies

Users are anonymous to the restaurant. For the restaurant, an active code is a way to open a channel to the food app company and request a transaction. They aren’t concerned with user identities. This property allows users to share a code with someone else.

Given that the food app provides 5-digit codes, there’s a maximum capacity of \(100,000\) ephemerally active codes at a given moment. We can generalize the chances of collisions with the simple expression \(P = 1 - \frac{N-n}{N}\). Where \(P\) is the probability of generating a duplicate code, \(N\) is the total number of codes in the system, and \(n\) is the number of currently active codes. The more active codes in such a system, the more likely we generate duplicates. But just because we generate duplicates doesn’t mean we must send them to users.

Making a simple code-generator

We can write a code-generator program that generates 5-digit pseudo-random codes. However, more is needed. We also want to give each generated code a life span of a few minutes. This way, we can look up this code and tell if it’s active (or taken). This feature introduces our program to a state. We may want to decouple our program from this state by a database to run multiple instances of this program and manage our generated codes at a shared location.

Even if our code generator can accept concurrent requests, we can still guarantee atomicity and consistency at the application level. In a single running instance, we can lock the critical part of the request and ensure codes are generated and stored synchronously. This will have a small impact on performance but guarantee consistency.

sequenceDiagram
    Actor User
    participant Code Generator
    participant Cache

    User ->> Code Generator: Clicks on "Generate Code" button
    loop exists == true
      Code Generator ->> Code Generator: Generates a random code
      Code Generator ->> Cache: exists = Cache.exists(code)
    end
    Code Generator -->> User: Returns code
    Code Generator ->> Cache: Cache.set(code, ttl)

Scaling out

At some point, our single instance code generator can no longer support the ever-growing traffic at peak hours, and we might want to deploy multiple instances of code generators. Without modifications, we can no longer guarantee consistency because our atomic process of a single instance is now happening concurrently across multiple instances.

sequenceDiagram
    Actor Alice
    Actor Bob
    participant Code Generator 1
    participant Code Generator 2
    participant Cache

    Alice ->> Code Generator 1: Clicks on "Generate Code" button
    Bob ->> Code Generator 2: Clicks on "Generate Code" button
    Code Generator 1 ->> Code Generator 1: "12345"
    Code Generator 1 ->> Cache: Cache.exists("12345")
    Code Generator 2 ->> Code Generator 2: "12345"
    Code Generator 2 ->> Cache: Cache.exists("12345")
    Cache -->> Code Generator 2: false
    Code Generator 2 -->> Bob: Returns "12345"
    Cache -->> Code Generator 1: false
    Code Generator 1 -->> Alice: Returns "12345"
    Code Generator 1 ->> Cache: Cache.set("12345")
    Code Generator 2 ->> Cache: Cache.set("12345")
    Note right of Cache: error: key already exists

We can preserve atomicity by moving the lock to the database level (traditionally in a transaction). We can send the code to the user only after successfully writing it into the database. It will, again, cost us in performance and increase network delays but evade collisions.

The effect of network delays and performance degradation can get even worse if the database is distributed across multiple nodes. (But that’s a topic for another article).

By now, you can already tell that there’s a trade-off between performance and consistency. This trade-off is common in digital services, and companies are continuously waging and prioritizing one trade at the expense of the other.

Let it fail and deal with it after

I promised a twist, so here it is. There’s a good chance that I’ve experienced this race condition because they let it happen by design. Most engineers will naturally look for the best bug-proof solution within the scope of the internal process, and this is an excellent quality to have as an engineer. Better engineers (some say architects) will also know how to wage trade-offs and translate business priorities into design decisions. I believe that the food app company prioritizes performance and predictability and is willing to tolerate a small percentage of system faults to fulfill these top priorities.

I’d claim that there’s a good chance that the code is being generated and sent to the user and only then verified against a database. This approach is much more predictable and responsive and relies on pure statistics. For example, if the food app company knows there are about \(1000\) active codes at a given moment (in peak hours), they can predict an error rate of only \(1\%\).

You can argue that they can dramatically lower this error rate by adding more numbers to the code. However, this code is for humans. Can you imagine customers reading 16-digit numbers to the person behind the counter? Maybe a 5-digit, base 64 number is better. In any case, they’re satisfied with 5-digit, base 10 numbers, and that’s okay.

Other suggestions might be:

Use an atomic counter and not be random at all. The problem is that if users understand it, they can exploit it by generating a number and guessing another number near that number. It also means that we still have a lock in the database.
Use the atomic counter from below circularly (mod N), and have the values represent an index to a pre-generated list of random numbers. This is an interesting idea. However, it adds a lot of complexity. Instances need to be notified (or constantly fetch) newly generated lists.
Have each instance generate codes with a unique “instance prefix.” That would be nice, but it would shorten the actual random part of the code.

About the error handling

This topic deserves an entire article, but let’s understand why they reverted my account as if I never made an order that day.

First, no matter what, the food app company will lose money in this error. While they can refund the charged user, they can’t charge the user who generated the duplicated code first because they can’t tell for sure (even if it’s very likely) that this user made the order.

I believe that the account balance of each user is being managed with transaction logs of some sort. Each transaction is appended as a log to the database, thus keeping a transactional history of the account. A good approach is to declare these logs immutable, meaning that we only append new ones and never delete or modify existing ones.

Maybe the engineer who wrote the error-handling logic for this case thought that the detection and handling of this case would happen in a fraction of a second, so it’s highly unlikely for later transactions to be added. Then, the process violates immutability and deletes logs up to a point before the wrong transaction.

It may not be the exact cause, but I believe some involvement in mistreating time-based snapshots and reverting to the wrong state occurred. You may be tempted to say that I’m right, and this is just poor error handling, and it seems pretty simple to append a new transaction with the reversed amount of the wrong transaction. But once again, this system’s complexity and business priorities might be such that the way it handled this issue has the expected outcome.

Summary

Our tendency to trust automated digital services was never higher than today. We trust servers and functions to execute our most sensitive money transfers, and we’re also on board with trusting AI assistants to manage our doctor appointments. And why wouldn’t we? Computers are better at accuracy, data, and speed when serving clients.

We also tend to forget that digital services can fail and make mistakes. Digital services are not pure (in a mathematical sense). They are inherently susceptible to unexpected side effects such as node crashes, network hangups, bugs, memory and CPU issues, synchronization issues, etc.

I’d be happy to end this article with a nice key takeaway: acknowledging the difference between a good service that is doing what it’s supposed to do and an excellent service that can also gracefully handle its faults in an uninterrupted or completely hidden way to users. To the food app company: your system is wonderful. Thank you for buying me lunch.

On this page (Top)