Snappy, the Snaplet cat mascot is worried about sensitive data and advises against copying production databases onto local development machines

Cautionary copying: The pitfalls of porting sensitive data from production

Why Snaplet's data anonymization tool is a better alternative to data dumps.

We've talked previously about why seed scripts suck. If you missed that, the TL;DR is that no one really wants to spend their day writing tedious SQL queries to get data to code against that's probably not particularly great anyway. They scale poorly, they're nearly impossible to keep up to date, and they can be hard to share among a team.

If you're not currently using Snaplet, you may be thinking: "Sure, seed scripts do suck, but that's why it's just easier to copy the data I need to code against from the production database. Duh!" If that's you, we have news for you: copying data from production sucks just about as much as using a seed script. Maybe more.

It's less painful than writing a seed script, maybe, but it's way more risky - for everyone involved. While the benefit of using production data in your development environment is that you have a perfect representation of what's actually happening for your users, it's not great for your database, it's not great for your users, and it's not great for your security and compliance.

Snappy, Snaplets cute cat mascot, struggles to choose between seed scripts and data dumps
Snappy needs mock data to use in their local development environment but doesn't want to do a data dump or write a seed script

Five reasons why you should avoid a data dump

Resource-intensive

First off, your database. Given that a typical development database may span into the millions or billions of rows across all the assorted tables, copying your production database is not a task that should be undertaken lightly. Beyond just the actual time taken, copying down a database is a resource-intensive exercise, potentially causing disk IO, CPU, and memory bottlenecks.

Database performance issues

Production databases are typically optimized to serve real-time requests from users, so copying the entire database can have a severe impact on the database's performance. Given the size of a typical, mature database, you could not only be waiting a long time, you could be subjecting your users to a degraded experience thanks to the strain you're putting on the database server. That's not great for them.

Copying unnecessary data

It's also largely unnecessary. While we're not disputing the benefits of having accurate and realistic data to use in your development environment, pulling an unwieldy copy of your entire database is probably overkill. The size of a production database can be many times larger than what's needed for a developer's development environment. Typically, developers only need a small sample (what others call a "subset") of data to work with, such as a few hundred or a few thousand records. Having to work with such a large amount of data actually just serves to slow you down, as it will almost certainly take longer to download and restore the data. (Spoiler alert: this is why we've included subsetting into Snaplet.)

Compliance issues

It's also risky. Very risky. Compliance regulations such as GDPR and HIPAA are no joke, and being in violation can be a big deal (to the tune of fines in the millions of dollars). Using the production data as is means storing sensitive and personally identifiable information in an unsecure environment. Considering the fines and reputation risk, it's better to avoid entirely, and is why access to production databases is typically tightly controlled and monitored.

Outdated data

Any gains achieved by using actual production data are largely nullified anyway by the fact that this data may change regularly and can become outdated quickly by its very nature. While using production data as a one-off to seed your development environment may be tempting, it'd quickly be out of date and its usefulness severely impacted. Scheduling regular database dumps only runs the risk of further compromising database performance while running production workloads, while doing nothing to mitigate the risk of personally identifiable information being routinely shared.

Snaplet copies a de-identified snapshot of your production database by anonymizing all personally identifiable data

This is where Snaplet elegantly enters the picture to solve for all of the above issues. Snaplet provides a solution that offers the best of using production data while also offering all the benefits of removing sensitive information.

Snaplet securely connects to your production database and creates a sample of that data, which we refer to as a database snapshot. This snapshot is then transformed to remove sensitive information and can be restored into your development database. You have complete control over the entire process, including defining the size of your database sample, and which data are and aren't captured and transformed, and how that data are transformed.

End result? You get high-quality, production-like data to work with, without the compliance and performance risks that come with copying the entire production database. Best of all, you can schedule your snapshot to be captured regularly, ensuring you're always working with up-to-date data. You can also easily share your snapshots with your teams, ensuring everyone on the team is working against the same data.

So, what are you waiting for? Stop copying your production database and let Snaplet show you how to seed your PostgreSQL database with up-to-date production-like data, so you can code, debug, and test with ease!

Jian Reis
January 24, 2022