Hacker News
Antithesis of a One-in-a-Million Bug: Taming Demonic Nondeterminism (cockroachlabs.com)
138 points by eatonphil on March 22, 2024 | 29 comments


Antithesis co-founder here -- happy to answer any questions about the company, the technology, or this particular engagement!


I work on a distributed runtime system for heterogeneous supercomputers [1].

As an example of the sort of bug we regularly deal with, I am at this exact moment tracking down a freeze that occurs on 8,192 nodes of a supercomputer [2]. That means I'm using about 64,000 GPUs and about half a million CPU cores. The smallest node count at which I've seen my issue is 2,048 nodes, and at that scale it only happens about 10% of the time.

We've been debating internally whether Antithesis could help us or not. On the one hand, the fuzzing to explore the state space, and deterministic reproduction, are exactly what we want. On the other hand, we believe our state space is much larger than what you see in a typical distributed database. (And not just because of the sheer scale of things, but even on a single node we have state machines with on the order of hundreds to thousands of states in them.) Based on the post here and the "scenario" count explored in CouchDB, I'm not convinced you'd be able to handle us. :-)

I'd be curious what you think. Happy to discuss here, or contact info in profile.

[1]: https://legion.stanford.edu/

[2]: https://www.olcf.ornl.gov/frontier/


A drawback of our approach is absolutely that it is expensive to test extremely large volumes of data or compute this way. Even before you start running into physical limitations of our current platform, you will probably be complaining about your bills. :-)

Our advice on this is that there are actually a lot of things you can do to exercise behaviors that are usually only seen at massive scale. For example, if you run a distributed storage system, you can probably configure it to split and move shards at 1/1,000,000th of the production size. That might let us hit a tricky codepath much more cheaply. We have a lot more about this in our documentation, e.g. here: https://antithesis.com/docs/best_practices/optimizing.html#k... and here: https://antithesis.com/docs/best_practices/find_more_bugs.ht...
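
To make that concrete, here's a hypothetical sketch (illustrative only; every name is made up) of the kind of test-only configuration we mean: thresholds that normally trip at production scale, dialed way down so the interesting codepaths fire with a few kilobytes of data.

    // Hypothetical illustration: shrink the thresholds that normally only
    // trigger at production scale, so that split/rebalance/compaction
    // codepaths fire with tiny data volumes. All names are made-up stand-ins.
    package main

    import (
        "fmt"
        "time"
    )

    type StorageTestConfig struct {
        ShardSplitBytes   int64         // production might be hundreds of MiB
        RebalanceInterval time.Duration // production might be minutes
        CompactionTrigger int           // files accumulated before compacting
    }

    func tinyScaleConfig() StorageTestConfig {
        return StorageTestConfig{
            ShardSplitBytes:   64 << 10,        // split shards after ~64 KiB
            RebalanceInterval: 1 * time.Second, // rebalance aggressively
            CompactionTrigger: 2,               // compact after only two files
        }
    }

    func main() {
        fmt.Printf("%+v\n", tinyScaleConfig())
    }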

The other thing is just that the reason many bugs only happen at scale is that they're some kind of subtle distributed race, and you need a lot of nodes for one of the runners in the race to be slow enough that the other sometimes wins. But we can very easily and efficiently provoke these sorts of races by pausing individual threads or freezing nodes, etc. We actually pretty regularly hit issues with tiny deployments that our customers only see in their largest clusters (but no promises, this obviously depends on the details of the software).
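
As a toy illustration of why a pause is all it takes (this is not Antithesis code, just a sketch): a check-then-act bug that is rare under normal timing becomes nearly certain once one contender is frozen between the check and the act.

    // Toy illustration: a lease "acquire" that checks and sets in two separate
    // critical sections. Pausing one contender between the check and the set
    // (as a frozen node or stalled thread would be) lets both sides win.
    package main

    import (
        "fmt"
        "sync"
        "time"
    )

    type lease struct {
        mu     sync.Mutex
        holder string
    }

    func (l *lease) acquireIfFree(node string, pause time.Duration) bool {
        l.mu.Lock()
        free := l.holder == ""
        l.mu.Unlock()

        time.Sleep(pause) // stands in for a GC pause, page fault, or frozen node

        if free {
            l.mu.Lock()
            l.holder = node // may silently overwrite another node's lease
            l.mu.Unlock()
            return true
        }
        return false
    }

    func main() {
        l := &lease{}
        var wg sync.WaitGroup
        var mu sync.Mutex
        wins := 0
        contenders := []struct {
            name  string
            pause time.Duration
        }{
            {"node-1", 50 * time.Millisecond}, // artificially paused contender
            {"node-2", 0},
        }
        for _, c := range contenders {
            wg.Add(1)
            go func(name string, pause time.Duration) {
                defer wg.Done()
                if l.acquireIfFree(name, pause) {
                    mu.Lock()
                    wins++
                    mu.Unlock()
                }
            }(c.name, c.pause)
        }
        wg.Wait()
        // With the injected pause, both contenders believe they hold the lease:
        // the "at most one holder" invariant breaks without a large cluster.
        fmt.Printf("lease granted %d time(s); final holder=%q\n", wins, l.holder)
    }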


I lead a platform engineering team. We have a lot of challenges that I think antithesis could help with.

I’d like to try it. But…

I did not know the history of FoundationDB until last week—specifically about the Apple acquisition and the resulting termination of client support and downloads for 5 years.

So my question is, how can I trust the same founding team to not do that again? It’s one thing to get acquired by a company that would likely continue to support business users (Microsoft, IBM, AWS, even Oracle). It’s another to sell to acquirers that will likely shut down the public service (Facebook, Apple, Google).

If I invest the time, money, and effort to adopt antithesis, I won’t even have the security of “well, at least we downloaded the packages before they went offline.”

Maybe this is an unfair question. I think it’s great that you had a successful liquidity event with foundation. Yet I must manage risk.


I will let Dave and Nick chime in here if they want to, since they saw the Apple acquisition closer up than I did, but here are my views:

(1) Your beliefs about what happened to FoundationDB's customers are incorrect. Everybody who had a FoundationDB license when we were acquired either got a free perpetual license to the software at their current usage level (the free tier) or, in the case of our paid customers, was able to keep using it at an even larger scale going forward. None of our paying customers were screwed.

(2) We are not aiming for an acquisition or any other kind of early exit. Our previous successes enable us to be a little bit more risk-neutral this time.

(3) Even if we did vanish, what have you actually lost at this point? If your operational database gets yanked (which, I reemphasize, didn't actually happen to anybody), then you're screwed. If your exotic software testing technology gets yanked, then you're literally exactly where you are today. Doesn't seem like as big a risk to me.

Apologies if my answers here were not diplomatic or whatever, but I try to be very direct about this stuff.

EDIT:

Actually, let me add a (4): We have validated that people are willing to pay for Antithesis, and we are building a growing business. To the extent we're an attractive acquisition target, it's likely to be because of the strength of our business and an acquirer's desire to scale it out / make it more widely available, not simply to improve their own internal tech stack.

BTW, one example of an FDB customer that was still a small startup when we were acquired was Snowflake. They obviously had no problems continuing to grow and use FDB, as they still use FoundationDB as their core metadata storage today and have since they started working with us.


> If your exotic software testing technology gets yanked, then you're literally exactly where you are today.

As someone testing concurrent/distributed software, Antithesis is potentially useful to me. It would be a substantial investment to build test infrastructure around it, with a very big opportunity cost if it wasn't successful, or disappeared after an acquisition. If this exotic technology is more than a toy, I wouldn't be so cavalier about its long term prospects.


Well, that's the price of adopting very new technology. When risk appetite is lower, people can always go with established large vendors.


That's true. And it's also true that investing heavily would take time, and the more time goes on, the less likely we are to be acquired or disappear (or, if we are acquired, the more likely it is to be by somebody who wants to keep offering this, as I note above). In that sense, you have a lot of optionality/convexity on this bet.

In the meantime, you can use it right now to solve your problems without a ton of integration. We have customers at every level of the spectrum from "sending us unmodified output of their CI system" to "deeply integrating with our SDKs". If the former are seeing value, you can too, and your risk is genuinely minimal.


Question for you: what is the level of ROI for the simplified approach of just sending unmodified existing CI output vs. integrating with the SDKs?

Do you happen to know if hitting this bug required SDK integration?


This bug definitely didn’t involve SDK integration, because our SDK didn’t exist yet when we found it! I believe CockroachDB was running our instrumentor over their binaries though. So call it an in-between level of integration.


Thank you for explaining point 1. My brief googling of the issue led me to find mostly loud, angry voices, none of which mentioned these facts. This completely changes my evaluation.

Regarding point 3, I agree that the software would be in a higher quality state even if the testing framework disappears. However, fully adopting this framework likely means integrating and explaining it in our processes and compliance documentation (my company provides services to financial institutions).

I appreciate your directness. Thanks.


Your software sounds amazing and will probably make you a fortune. I wanted to check my understanding of how Antithesis fuzzes software. The way I understand it from reading the documentation is that you create a "workload", which is sort of analogous to a fuzzing harness, that will typically make random API calls, and Antithesis will pursue sequences of events that are more interesting as defined by coverage and also inject faults. So something like this could probably be pretty easily adapted to a workload: https://github.com/grpc/grpc/blob/86953f66948aaf49ecda56a0b9.... Do you use any interesting coverage metrics or just basic blocks?
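
Roughly, the kind of workload loop described above might look like the sketch below (a made-up key-value client stands in for the real system under test; a real integration would presumably report the check through the Antithesis SDK's assertions rather than a panic).

    // Sketch of a workload: hammer the system under test with random API calls
    // and check an invariant after each read. "Client" here is a made-up,
    // in-memory stand-in for a real client of the system being tested.
    package main

    import (
        "fmt"
        "log"
        "math/rand"
    )

    type Client struct{ data map[string]string }

    func NewClient() *Client                      { return &Client{data: map[string]string{}} }
    func (c *Client) Put(k, v string) error       { c.data[k] = v; return nil }
    func (c *Client) Get(k string) (string, bool) { v, ok := c.data[k]; return v, ok }
    func (c *Client) Delete(k string)             { delete(c.data, k) }

    func main() {
        c := NewClient()
        model := map[string]string{} // simple reference model to compare against

        for i := 0; i < 1_000_000; i++ {
            k := fmt.Sprintf("key-%d", rand.Intn(16))
            switch rand.Intn(3) {
            case 0: // random write
                v := fmt.Sprintf("val-%d", i)
                if err := c.Put(k, v); err == nil {
                    model[k] = v
                }
            case 1: // random delete
                c.Delete(k)
                delete(model, k)
            case 2: // random read, checked against the model
                got, ok := c.Get(k)
                want, wantOK := model[k]
                if ok != wantOK || got != want {
                    // A real Antithesis workload would report this through the
                    // SDK's assertions; a panic is enough for a sketch.
                    log.Panicf("divergence on %q: got (%q,%v), want (%q,%v)",
                        k, got, ok, want, wantOK)
                }
            }
        }
        fmt.Println("workload finished without detecting a divergence")
    }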


Your understanding of our approach is pretty much correct. As for interesting coverage metrics... stay tuned! We're going to write about this a lot in the future!


What is your target market? Is it database companies, cloud providers, etc. that implement a proper distributed architecture? If it is not limited to that, what value can Antithesis provide to a company that uses a small-scale service-oriented architecture with its own share of distributed complexity?


Database vendors are very natural customers, because they're distributed systems for whom correctness and uptime are paramount. But they're far from our only customers! We have tons of people who are doing exactly what you describe -- running a client-server architecture or a collection of microservices that need to be fault-tolerant.

One of our theses is that the vast majority of real world "distributed systems developers" would never describe themselves that way and don't read the same blogs that all the database people do. Nonetheless, these people are writing distributed systems, and share the pain that we've all experienced. One of the most important tasks ahead of us is precisely to reach this vast "distributed systems dark matter" and explain to them how it is that we can help their day-to-day jobs.


Are there plans to support debugging of replays? It seems like a really hard problem; I'm assuming that instrumenting could change the outcome of the run.

Is there literature that would be a good starting point on determinism/non-determinism in computing? I'd like to understand the sources of non-determinism better.


Yes, we are working on integrated debugging technology (what our customers mostly do these days is just run a conventional debugger inside the simulation, but that doesn't quite use our full power).

You're correct that tiny things like modifying the binary under test or attaching a debugger can change determinism and result in the bug slipping away. That's why... it's a good thing we've already built autonomous bug-searching technology? Since a deterministic hypervisor is also a hypervisor, we can generally rewind and do the intrusive debugging action at the "last possible moment", when it's least likely to cause the bug to disappear. Then if it still does, we simply fuzz onwards from that point and re-find the bug. This usually goes pretty quickly because we're in a timeline that's "close" to it, and we can use clues from the original repro to guide the fuzzing.


What does the tech stack look like?


CockroachDB is a very logical place to use Antithesis -- its original use case was FoundationDB!


To be clear — many of us worked at FoundationDB, and we were certainly inspired by the testing technology we had there. But Antithesis is built from scratch, and WAY more general and WAY more powerful than the FDB simulator.


Ah it is coding then. I was trying to work out what Demonic Nondeterminism might be and got increasingly excited about the possibilities. At the least - it would have included all the demon psych/religious backstory that we know about. I'd finally get some answers.


Antithesis are doing a great job at marketing their product and building hype. Their platform looks really interesting.


Thanks! But just you wait… the really insane stuff isn’t even public yet!


Uh.



Sorry about that; should be fixed now.


> Note that a logical clock can be effectively derived from the precise CPU hardware performance counter–retired conditional branches (RCB).

That's surprising, why is this preferred over RDTSC?
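
(For context, here is a rough sketch of what reading a hardware branch counter on Linux looks like, using Go's x/sys/unix bindings. The generic branch-instructions event is used here as a stand-in; a record/replay system would program the CPU-specific retired-conditional-branches event, whose count is a deterministic function of the instructions the program executed, unlike RDTSC, which reads a wall-clock timestamp that varies from run to run.)

    // Rough sketch: count hardware branch events around a region of code on
    // Linux via perf_event_open. The generic BRANCH_INSTRUCTIONS event is a
    // stand-in; a deterministic logical clock would program the CPU-specific
    // "retired conditional branches" event instead.
    package main

    import (
        "encoding/binary"
        "fmt"
        "log"
        "unsafe"

        "golang.org/x/sys/unix"
    )

    func main() {
        attr := unix.PerfEventAttr{
            Type:   unix.PERF_TYPE_HARDWARE,
            Config: unix.PERF_COUNT_HW_BRANCH_INSTRUCTIONS,
            Size:   uint32(unsafe.Sizeof(unix.PerfEventAttr{})),
            Bits:   unix.PerfBitDisabled | unix.PerfBitExcludeKernel | unix.PerfBitExcludeHv,
        }
        // Count events for the calling process, on any CPU.
        fd, err := unix.PerfEventOpen(&attr, 0, -1, -1, unix.PERF_FLAG_FD_CLOEXEC)
        if err != nil {
            log.Fatalf("perf_event_open: %v", err)
        }
        defer unix.Close(fd)

        unix.IoctlSetInt(fd, unix.PERF_EVENT_IOC_RESET, 0)
        unix.IoctlSetInt(fd, unix.PERF_EVENT_IOC_ENABLE, 0)

        // Region of interest: given the same inputs, the branch count here is a
        // function of the code path taken, not of how long it took to run.
        sum := 0
        for i := 0; i < 1000; i++ {
            if i%3 == 0 {
                sum += i
            }
        }

        unix.IoctlSetInt(fd, unix.PERF_EVENT_IOC_DISABLE, 0)

        buf := make([]byte, 8)
        if _, err := unix.Read(fd, buf); err != nil {
            log.Fatalf("read counter: %v", err)
        }
        fmt.Printf("sum=%d, branch events observed: %d\n", sum, binary.LittleEndian.Uint64(buf))
    }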


As time goes on the bugs get weirder in a software project …


Personal and professional Go developer here. I will never use or recommend that my company use something called cockroach. Call me petty, but fix the fucking name.



