Snappy, the Snaplet cat mascot wrestles with seed scripts and unmanageable data

Why seed scripts suck

Looking for a better alternative to seeding your local development database?

We often talk about why seed scripts suck here at Snaplet. Admittedly, we’re biased - we think Snaplet is a much better alternative to using seed scripts. But why? What is it about seed scripts that’s so bad, and why is Snaplet such a better option?

Let’s start at the beginning. Since time immemorial (or at least the emergence of relational databases in the 1970s, which is basically the same thing) seed scripts have been used by developers to populate databases with fake data for testing or development purposes. In a typical scenario, a developer will write a script that contains a series of SQL statements that insert data into tables. These statements will usually follow a specific order to ensure that the data is inserted correctly and that any foreign key constraints are satisfied.

Here is an example of a seed script that might be used to create fake data for a simple blog application:

INSERT INTO users (name, email, password) VALUES
('Jimmy', 'jimmy@example.com', 'password123'),
('Bob', 'bob@example.com', 'hunter2');

INSERT INTO posts (title, body, user_id) VALUES
('My First Post', 'Lorem ipsum dolor sit amet...', 1),
('My Second Post', 'Lorem ipsum dolor sit amet...', 2);

INSERT INTO comments (body, user_id, post_id) VALUES
('Great post, Jimmy!', 2, 1),
('I agree, Bob!', 1, 2);

This script would insert two users, two posts (one for each user), and two comments (one for each post). Note that the user IDs are hard-coded into the script, which means that if the structure of the database changes (eg if a new user is added), the script will need to be updated accordingly.

While this approach may work for very small databases and simple applications, as the size and complexity of the database grows, the constraints around seed scripts become more readily apparent. Traditionally, seeding larger and more complex databases typically use a sequence of seed scripts, executed in a specific order to ensure dependencies are met, and as the number of seed scripts increases, managing the order and dependencies can become maddeningly complex. This often leads to errors or conflicts when trying to execute the scripts, which can be time-consuming and frustrating to troubleshoot. It’s seed scripts all the way down!

Additionally, making changes to seed scripts can be a slow and cumbersome process, requiring the developer to make changes to the script, re-run the script, and then wait for the data to be updated in the database. Being able to quickly add, remove, or transform data isn’t really possible in a dynamic, real-time way using seed scripts. There’s only so many coffee breaks you can take while your seed script is running before you get caffeine jitters.

Seed scripts can also become difficult to maintain over time, especially if the database schema changes. If (or more likely, when) the schema changes, the seed script may no longer work as intended and will need to be updated to match the new schema. This can be a time-consuming and error-prone process, and as the database grows and changes over time, it can become increasingly difficult to ensure that the seed data is accurate and up-to-date. This is especially true where there’s a sequence of seed scripts that need to be updated.

As if that’s not bad enough, often the arcane knowledge around seed script maintenance is “tribal” in nature, and it’s not unheard of to have one wizened developer being responsible for maintaining a legacy of seed scripts that are largely inscrutable to anyone else. This can present quite a concentration risk for a team.

Finally, a big issue with seed scripts is that they’re always going to be an approximation of what a developer assumes the data may look like. Data-related bugs often occur because the seed script didn’t account for what the real world real-world data looks like. For example, names with special characters, or addresses entered in non-standard formats would probably not be anticipated in a seed script, but users routinely find ways to surprise developers with their data inputs!

All of these constraints can make seed scripts difficult to manage, especially as the size and complexity of the database grows, and the number of developers using the seed scripts grow. It’s entirely anecdotal, but in our shared experience it’s worth noting that writing and managing seed scripts isn’t particularly fun or satisfying for most developers. Generally speaking, developers enjoy writing actual production code and solving problems - wrangling seed scripts just to get data in your development environment is not a creative or engaging challenge, it’s just frustrating and tedious.

If you can relate to any of the points above, Snaplet offers the solution. Instead of having to write complex queries to seed your local database with data, you can simply connect to an existing source of data, be it your staging, QA or production environment. This secure, read-only connection allows Snaplet to introspect your schema, examine your data, and capture a ‘snapshot’ of that data.

A snapshot is basically a representation of that data you can use to seed a local development database, but importantly, you have the ability to manipulate the data to your needs. Specifically, you can capture a sample (what we call a subset) of the original data, making it possible to reduce a database with millions of rows into something much smaller, but still representative of the data.

Additionally, transformations configured during the snapshot capture process allow you to transform sensitive information like names, email addresses, credit card numbers and similar into safe, pseudonymized (Is this even a word? It is now!) versions. You’ll still get values that look like names, email addresses and credit card numbers, but they’ll be safe to work against and compliant with data protection policies. Added bonus: The values are deterministic, so a given input (“foo@bar.com”) will always produce the same output (“xas@sqw.com”), so you can write tests against your snapshots.

Snaplet also allows you to easily share this snapshot with your team. Using Snaplet Cloud, it’s possible to provide the exact same snapshot to other members on your team to code against. If your schema changes, or you add new tables to your database, capturing a new snapshot and sharing it with your team takes a few seconds. With Snaplet, everyone on the team can be safely working against the same data, all the time, without being blocked.

And lastly, Snaplet’s transformations allow you to manipulate the data you need much more easily than using a seed script. Using the configuration file, or the Snaplet Cloud web app, you can specify which data are transformed, how that data are transformed, and which data are captured as-is on a per-snapshot level. All of these configurations are done via JavaScript, so there’s no need to learn a new language. Modifying configurations can be done in real-time, and it’s even possible to seed new data into a snapshot when you’re developing a new feature.

Curious to see how Snaplet can make your life easier? Give it a try and don't let bad data block you ever again.

Jian Reis
January 24, 2022