Twitter thread on reactive and proactive solving of bugs in production

When bugs are real but the data isn’t.

Lessons from the trenches: Episode 2 - Replicating bugs in production is hard (without Snaplet)

It’s a familiar scenario to every developer: your application runs perfectly in development, but is a hot mess in production; a veritable bug-infested nightmare. Worse still, despite your best efforts, you can’t replicate the bugs locally. It's a situation so common it's practically a developer rite of passage.

A recent Twitter thread initiated by a developer named Rogerio Taques (@rogeriotaques) caught our collective eye because it neatly encapsulates both the frustration and some common approaches to solving this issue.

What we found interesting was that, as the thread developed, the responses broadly fell into two camps - reactive measures, such as debugging tools and logging, which help identify errors and bugs after they happen, and proactive measures, such as more closely aligning environments and using better test data. Obviously, we’ve got a horse (or in our case, a cat) in this race, but it was interesting to see real-world developers sharing how they solve this in their projects.

Reactive measures: debugging tools and logging.

Many of the responses could broadly be categorised as reactive measures - actions taken after a bug occurs in production that wasn’t picked up locally or in testing. Logging was a popular first point of departure for many, for instance:

Riemer van der Zee recommends logging via Sentry, DataDog or New Relic, all popular options for logging and for gaining insight into technical bugs in particular. However, they mention that for functional bugs (which are often caused by data issues) getting on a call with a user is a good solution. We agree - but it’s not always feasible or possible to hop on a call with a user to diagnose an issue.

Interestingly, Anthony Becket echoed a similar viewpoint: working with customer service or customer success on a call to go through the information manually and try to diagnose the issue. They mention that this can take months! Other reactive measures mentioned included attempting to replicate the issue while looking at logs, gathering additional debugging information, or, as a last resort, getting a dump of the environment, which they describe as “very difficult” for all the reasons production database dumps tend to be challenging (security, personal information, performance, etc.).

Logging is a critical part of any modern software stack, but when it comes to picking up data-related or functional bugs, logging alone may not always give a developer a view into the underlying problem. That’s why several of the developers who advocated for logging also mentioned the need for realistic test data and environment parity.
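To make that limitation concrete, here is a minimal Python sketch of structured logging for a data-related failure. It is only an illustration under our own assumptions: the `orders` logger, the `process_order` function, and the `context` field are hypothetical names, not something taken from the thread or from any particular logging product. The log entry records the shape of the offending record (field names and types) rather than its values, which tells you what kind of data caused the failure without writing personal information to your logs.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            # `context` is a hypothetical field passed in via `extra=`.
            "context": getattr(record, "context", {}),
        }
        return json.dumps(payload, sort_keys=True)


def describe_shape(row: dict) -> dict:
    """Summarise a record as field name -> type name, without its values."""
    return {key: type(value).__name__ for key, value in row.items()}


logger = logging.getLogger("orders")  # hypothetical logger name
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)


def process_order(order: dict) -> float:
    """Compute an order total, logging the record's shape if it fails."""
    try:
        return order["quantity"] * order["unit_price"]
    except (KeyError, TypeError):
        logger.error(
            "order total failed",
            extra={"context": describe_shape(order)},
        )
        raise
```

A log line like this can tell you that `quantity` arrived as `NoneType`, but it cannot tell you how that row came to exist or how many others look like it, which is where realistic test data comes in.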

Proactive measures: realistic test data and environment parity.

The best way to deal with bugs is to prevent them from occurring in the first place. That's where preventive measures come in: using realistic test data, and working in development environments that closely resemble production, a concept known as environment parity.

Simon Hamp neatly sums up both concepts here - get your local development environment to look like production, and use production-like data in that environment. They mention that a snapshot of production data can often help diagnose these kinds of issues, and quite rightly point out that truly being unable to replicate a bug is rare once you’re able to control all the variables involved.

Another developer, Kevin Vaghasiya, shares a similar approach: clone your entire production environment (data and all) and use that when attempting to reproduce the issue.

A development environment that is identical to production is the absolute gold standard here, as several developers pointed out, but the challenge is often around resources, access, security, and data privacy. Data, specifically, is a challenge. While containerization with Docker makes it much easier to create development environments that match production, data is still the missing piece of the puzzle.

Data is, of course, the real problem much of the time. As you may have suspected, Rogerio’s bug ended up being data-related.

With environment parity between production and development, and access to production data, solving data-related bugs becomes much easier. However, with the advent of data privacy legislation such as GDPR, and the real financial and reputational risk of exposing personally identifiable data, most software engineering teams are rightfully circumspect about making production data available for testing, despite the benefits.

This is, of course, where Snaplet comes in. We were quite happy to see Snaplet getting recommended specifically as a solution for this problem, because that’s the very problem we’re trying to solve.

Snaplet: realistic, production-like data.

Snaplet bridges the gap between your development and production environments by allowing you to create snapshots of your actual database, which are then anonymized and encrypted for safe usage in development. This approach addresses the limitation of traditional methods like seed scripts, which can be time-consuming to create and may not fully replicate the complex relationships and edge cases present in production data.
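To illustrate the general idea of anonymizing data while keeping its relationships intact, here is a minimal Python sketch. To be clear, this is not Snaplet's implementation or API: the `SECRET` key, the `pseudonymize` helper, and the `users` and `orders` row shapes are all hypothetical, chosen only to show one common technique (deterministic pseudonymization with a keyed hash). Because the same input always maps to the same replacement, rows that referred to the same person in production still refer to the same fake person in development.

```python
import hashlib
import hmac

# Hypothetical key for illustration; a real key would be kept out of source control.
SECRET = b"dev-only-secret"


def pseudonymize(value: str, field: str) -> str:
    """Map a sensitive value to a stable token using a keyed hash (HMAC-SHA256)."""
    message = f"{field}:{value}".encode("utf-8")
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()[:12]


def fake_email(email: str) -> str:
    """Replace an email address with a stable address on a reserved example domain."""
    return f"{pseudonymize(email, 'email')}@example.com"


def anonymize_user(row: dict) -> dict:
    """Return a copy of a hypothetical `users` row with personal fields replaced."""
    return {
        **row,
        "name": f"user_{pseudonymize(row['email'], 'name')}",
        "email": fake_email(row["email"]),
    }


def anonymize_order(row: dict) -> dict:
    """Return a copy of a hypothetical `orders` row with the customer email replaced."""
    return {**row, "customer_email": fake_email(row["customer_email"])}


users = [{"id": 1, "name": "Ada Example", "email": "ada@corp.test"}]
orders = [{"id": 10, "customer_email": "ada@corp.test", "total": 42.0}]

safe_users = [anonymize_user(u) for u in users]
safe_orders = [anonymize_order(o) for o in orders]

# The join between orders and users still works on the anonymized values.
assert safe_orders[0]["customer_email"] == safe_users[0]["email"]
```

Non-sensitive fields such as `id` and `total` pass through unchanged, so edge cases in the shape and distribution of the data survive even though the personal details do not.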

An automated way to import complex, real-world data into your development environment.

Unlike seed scripts or database dumps, Snaplet's data is already anonymized and encrypted, ensuring compliance with data protection regulations.

The reality is, no amount of logging or real-time debugging can replace the value of catching a bug before it ever reaches your users. While debugging tools and logging strategies have their place as part of software development best practice, they are often reactive solutions to problems that could have been prevented with more robust and realistic testing in the development phase.

Realistic, anonymized snapshots of your actual data.

Snaplet enables you to be proactive rather than reactive, catching bugs long before they reach your production environment. In a world where even seasoned developers are not immune to the challenges of real-world data, Snaplet offers a powerful, preventive solution that can save both time and sanity.

So the next time you find yourself nodding along to a Twitter thread full of developers sharing their production woes, remember that there are tools designed to prevent those late-night debugging sessions. With Snaplet, you can ensure that the bugs you catch are the bugs you keep—safely quarantined in your development environment, far away from your end users.

Jian Reis
January 24, 2022