Twitter thread on reactive and proactive solving of bugs in production

When bugs are real but the data isn’t.

Lessons from the trenches: Episode 2 - Replicating bugs in production is hard (without Snaplet)

It’s a familiar scenario to every developer: your application runs perfectly in development, but is a hot mess in production; a veritable bug-infested nightmare. Worse still, despite your best efforts, you can’t replicate the bugs locally. It's a situation so common it's practically a developer rite of passage.

A recent Twitter thread initiated by a developer named Rogerio Taques (@rogeriotaques) caught our collective eye because it neatly captures both the frustration and some common approaches to solving this issue.

I hate when there are bugs in production that cannot be reproduced locally! 😣

How do you address such issues?#buildinpublic
— Rogerio Taques (@rogeriotaques) October 8, 2023

What we found interesting was that, as the thread developed, responses to how to solve this challenge broadly fell into two camps: reactive measures, such as debugging tools and logging, to better identify errors and bugs after they happen, and proactive measures, such as more closely aligning environments and better test data. Obviously, we’ve got a horse (or in our case, a cat) in this race, but it was interesting to see real-world developers sharing how they solve this in their projects.

Reactive measures: debugging tools and logging.

Many of the responses could broadly be categorised as reactive measures - actions taken after a bug occurs in production that wasn’t picked up locally or in testing. Logging was a popular first point of departure for many, for instance:

For technical bugs you can get better insights with https://t.co/4xWq4naRIc datadog or new relic.

For functional bugs I always used teamviewer and get in a call with the user haha, but there are prob. better solutions for this.

Bug prevention is also another sport
— Riemer van der Zee (@RiemerZee) October 8, 2023

Riemer van der Zee recommends logging via Sentry, DataDog or New Relic, all popular options for logging and gaining insights, especially into technical bugs. However, they mention that for functional bugs (which are often caused by data issues) their approach has been to get on a call with the user, while admitting there are probably better solutions. We agree that a call can work, but it’s not always feasible or possible to hop on a call with a user to diagnose an issue.

Interestingly, Anthony Beckett echoed a similar viewpoint: working with customer service or customer success on a call to go through the information manually to try and diagnose the issue. They mention that this can take months! Some of the other reactive measures they mentioned include requesting logs from the live environment, getting someone to replicate the issue before pulling logs, using a logging branch to output additional debugging information, or, as a last resort, requesting a dump of the environment, which they describe as “very difficult”, for all the reasons production database dumps tend to be challenging (security, personal information, performance, etc.).

We have no access to production at all. Customer services can access environments though so I usually just set up a call and try go through it all and get as much information as I need. Can take months sometimes but eventually I do get a replication.
— Anthony Beckett (@_AnthonyBeckett) October 9, 2023

Yeah there's a few steps we also do like requesting logs from the live env, or getting someone to replicate it before getting logs etc. Last steps if were really stumped is a logging branch to output extra debug info or try and request a dump of the env which is very difficult
— Anthony Beckett (@_AnthonyBeckett) October 9, 2023

Logging is a critical part of any modern software stack, but when it comes to picking up bugs related to data or functional bugs, logging alone may not always be able to give a developer a view into the underlying problem. That’s why despite several developers advocating for logging specifically, they also mentioned the need for realistic test data, and environment parity.
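To illustrate the gap, here is a minimal Python sketch of one way to make logs more useful for data-related bugs: recording the shape of the record that failed (types and sizes) rather than its raw, possibly personal, values. The names `describe_shape` and `process_order` and the order fields are hypothetical and not taken from the thread or from any particular logging product.

```python
import json
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("orders")


def describe_shape(record: dict) -> dict:
    """Summarise types and sizes without logging the raw values."""
    shape = {}
    for key, value in record.items():
        if value is None:
            shape[key] = "null"
        elif isinstance(value, (str, list, dict)):
            shape[key] = f"{type(value).__name__}(len={len(value)})"
        else:
            shape[key] = type(value).__name__
    return shape


def process_order(order: dict) -> float:
    """Hypothetical business logic that assumes quantity is always a number."""
    try:
        return order["total_cents"] / order["quantity"]
    except Exception:
        # Log the traceback plus the shape of the offending record, then re-raise.
        shape = json.dumps(describe_shape(order), sort_keys=True)
        logger.exception("process_order failed; shape=%s", shape)
        raise


try:
    # A row that never shows up in hand-written test data: quantity is NULL.
    process_order({"total_cents": 1999, "quantity": None, "email": "a@b.c"})
except TypeError:
    pass  # the log entry now records that quantity was null
```

Even with this extra context, the log only tells you what the bad record looked like after the fact. Reproducing the failure locally still requires data with the same shape.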

Proactive measures: realistic test data and environment parity.

The best way to deal with bugs is to prevent them from occurring in the first place. That's where proactive measures come in: using realistic test data, and working in development environments that closely resemble production, a concept known as environment parity.

Is dev env identical/similar to prod (same OS, versions, config etc)? What can you do to make them more consistent?

Unexpected data is often the cause. Can you snapshot prod data and use locally?

Rare that a bug truly can't be reproduced - just need to get to the cause
— Simon (@simonhamp) October 9, 2023

Simon Hamp neatly sums up both concepts here: getting your local development environment to look like production, and using production-like data in that development environment. They mention that a snapshot of production data can often help diagnose these kinds of issues, and quite rightly point out that it is rare for a bug to be truly impossible to replicate once you’re able to control all the variables involved.
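As a rough sketch of the first question in that tweet ("same OS, versions, config etc"), the Python snippet below compares two environment descriptions and reports every setting that differs. The `parity_diff` helper and the sample values are hypothetical. In practice you would collect these facts from your own deployment tooling.

```python
def parity_diff(dev: dict, prod: dict) -> dict:
    """Return {setting: (dev_value, prod_value)} for every setting that differs."""
    keys = sorted(set(dev) | set(prod))
    return {
        key: (dev.get(key), prod.get(key))
        for key in keys
        if dev.get(key) != prod.get(key)
    }


# Hypothetical environment descriptions, for illustration only.
dev = {"os": "ubuntu-22.04", "python": "3.12", "postgres": "16", "tz": "Europe/Berlin"}
prod = {"os": "ubuntu-22.04", "python": "3.11", "postgres": "16", "tz": "UTC", "cache": "redis-7"}

for key, (dev_value, prod_value) in parity_diff(dev, prod).items():
    print(f"{key}: dev={dev_value!r} prod={prod_value!r}")
```

A setting that exists in only one environment shows up with `None` on the other side, which is often exactly the kind of drift that makes a bug appear in production only.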

Another developer, Kevin Vaghasiya, shares a similar approach: clone your entire production environment (data and all) and use that when attempting to reproduce the issue.

I clone production environment and try to reproduce on that.
— Kevin (@KevinVaghasiya5) October 9, 2023

A development environment that is identical to production is the absolute gold standard here, as several developers pointed out, but the challenge is often around resources, access, security, and data privacy. Data, specifically, is a challenge. While containerization with Docker makes it much easier to create development environments that match production, data is still the missing piece of the puzzle:

Having a local environment that closely matches your production environment is the best path forward, in the long run, imo.
Everything we build is dockerized, so that the only real difference between production and lower environments is the data.
— David (@Gritty_DW) October 9, 2023

Data is, of course, the real problem much of the time. As you may have suspected, Rogerio’s bug ended up being data-related:

Bug was data related, only happened once. It was difficult to isolate and find the root cause tho!
— Rogerio Taques (@rogeriotaques) October 9, 2023

With environment parity between production and development, and access to production data, solving data-related bugs becomes much easier. However, with the advent of data privacy legislation such as GDPR and the real financial and reputational risk of exposing personally-identifiable data, most software engineering teams are rightfully circumspect about making production data available for testing, despite the benefits.

This is, of course, where Snaplet comes in. We were quite happy to see Snaplet getting recommended specifically as a solution for this problem, because that’s the very problem we’re trying to solve:

You should totally check out @_snaplet https://t.co/qkhOFCh44M
— Stephen Gheysens (@segheysens) October 9, 2023

Snaplet: realistic, production-like data.

Snaplet bridges the gap between your development and production environments by allowing you to create snapshots of your actual database, which are then anonymized and encrypted for safe usage in development. This approach addresses the limitation of traditional methods like seed scripts, which can be time-consuming to create and may not fully replicate the complex relationships and edge cases present in production data.
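To make the idea of an anonymized snapshot more concrete, here is a minimal Python sketch of the general technique. It is not Snaplet's implementation or API. The salt, the function names, and the sample rows are all hypothetical. The key point it shows is that personal values are replaced deterministically, so relationships between tables and unusual values in non-personal columns survive.

```python
import hashlib

SALT = "dev-only-salt"  # hypothetical; a real setup would keep this secret


def pseudonymize(value: str, salt: str = SALT) -> str:
    """Map a value to a stable token: same input, same output."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:12]


def anonymize_user(row: dict) -> dict:
    """Replace personal fields, keep everything else untouched."""
    token = pseudonymize(row["email"])
    return {**row, "name": f"user_{token}", "email": f"{token}@example.com"}


def anonymize_order(row: dict) -> dict:
    """Use the same token so orders still point at the right user."""
    token = pseudonymize(row["customer_email"])
    return {**row, "customer_email": f"{token}@example.com"}


# Hypothetical rows standing in for a production snapshot.
users = [{"id": 1, "name": "Ada Lovelace", "email": "ada@corp.test", "plan": "pro"}]
orders = [{"id": 10, "customer_email": "ada@corp.test", "total_cents": 0}]

safe_users = [anonymize_user(u) for u in users]
safe_orders = [anonymize_order(o) for o in orders]
```

In this sketch the order with `total_cents` of 0, the sort of edge case a hand-written seed script tends to miss, is still present after anonymization, and it still joins to the same (now unrecognisable) user. A salted hash on its own is a simplification and should not be read as sufficient for regulatory compliance.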

An automated way to import complex, real-world data into your development environment.

Unlike seed scripts or database dumps, Snaplet's data is already anonymized and encrypted, ensuring compliance with data protection regulations.

The reality is, no amount of logging or real-time debugging can replace the value of catching a bug before it ever reaches your users. While debugging tools and logging strategies have their place as part of software development best practice, they are often reactive solutions to problems that could have been prevented with more robust and realistic testing in the development phase.

Realistic, anonymized snapshots of your actual data.

Snaplet enables you to be proactive rather than reactive, catching bugs long before they reach your production environment. In a world where even seasoned developers are not immune to the challenges of real-world data, Snaplet offers a powerful, preventive solution that can save both time and sanity.

So the next time you find yourself nodding along to a Twitter thread full of developers sharing their production woes, remember that there are tools designed to prevent those late-night debugging sessions. With Snaplet, you can ensure that the bugs you catch are the bugs you keep—safely quarantined in your development environment, far away from your end users.

Jian Reis

May 2, 2024