Generate AI-driven production-like data in three easy steps with Snaplet seed

Snaplet seed's magic unveiled

A peek into the inner-workings of Snaplet’s AI-driven data generation tool.

Let's look at the tech behind Snaplet's seeding process. By breaking it down into clear steps, you'll understand how seeding works and how it can supercharge your development workflow.

Snaplet's seed process involves three key steps:

  1. Database introspection
  2. Client generation
  3. Data generation and connection

Let's explore these steps and see how they work together to streamline your development tasks.

[Illustration: the three steps behind Snaplet `seed`'s magic]

Database introspection

The journey begins with Snaplet carefully examining your database structure. This process results in the creation of a comprehensive dataModel.json file (located at node_modules/.snaplet/dataModel.json).

[Image: Snaplet examining your database structure]

This file is the foundation for Snaplet's data generation and connection capabilities, containing a detailed blueprint of your database – schemas, tables, columns, and constraints.
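
To give a sense of what that blueprint contains, here is an illustrative, heavily simplified excerpt. The exact structure is an internal detail, and the field names below are assumptions, not the documented format:

{
  "models": {
    "users": {
      "schemaName": "public",
      "tableName": "users",
      "fields": [
        { "name": "id", "type": "int4", "isRequired": true, "isId": true },
        { "name": "name", "type": "text", "isRequired": false }
      ]
    }
  }
}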

You have the power to customize this introspection process through the snaplet.config.ts file.

import { defineConfig } from "snaplet";

export default defineConfig({
  select: {
    hdb_catalog: false, // Exclude the 'hdb_catalog' schema
    public: {
      __prisma_migrations: false, // Exclude the database migrations table
    },
  },
});

This allows you to tailor the process to your project's needs, such as excluding certain schemas or adjusting naming conventions.

Client generation

Once the dataModel.json is created, Snaplet generates the client code and TypeScript types essential for data generation. These types help you spot potential issues with your input data ahead of time (e.g. a wrong data type or a missing required field).

It leverages an AI-powered prediction API to suggest appropriate data for your database columns, aiming to reduce your manual workload. This approach lets you focus on defining the most important data points while Snaplet handles realistic data for the rest.

All of this is generated in the modelDefault.js file:

[Image: Snaplet seed client generation]

The client code and types become readily available when you use the createSeedClient function, making integration into your workflow seamless.
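
Here is a minimal sketch of what that looks like in practice, assuming the generated client is importable from @snaplet/seed and a hypothetical users model with an email column:

import { createSeedClient } from "@snaplet/seed";

// The client is fully typed against your dataModel.json
const seed = await createSeedClient();

// Your editor autocompletes models, fields, and relations...
await seed.users((x) => x(3));

// ...and the generated types reject invalid input at compile time:
// await seed.users([{ email: 42 }]); // type error: 'email' expects a string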

Data generation and connection

In addition to its core data seeding features, the Snaplet client includes several utility functions that assist with script and database handling; a short usage sketch follows the list below.

  • $resetDatabase: This function clears all data from the models defined in your dataModel.json file. In the future, parameters will be available to truncate only specific tables.
  • $transaction: This function isolates the seed state, providing a secure environment to run plans without affecting your main database. This is particularly useful when running multiple plans in sequence. (Read more on this in our docs)
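
Here is a minimal sketch of how these utilities fit together. The exact $transaction callback signature shown is an assumption based on the description above; check the docs for the definitive API.

const seed = await createSeedClient();

// Start every run from a clean slate
await seed.$resetDatabase();

// Run a plan against isolated seed state so other plans aren't affected
// (assumes the callback receives a client exposing the same models)
await seed.$transaction(async (tx) => {
  await tx.users((x) => x(3));
});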

Snaplet automatically generates data using various strategies. However, you can provide custom instructions when you need specific values, or if the default generation methods aren't ideal. (Read more on custom instructions)
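
For example, you can pin the values you care about and let Snaplet generate the rest. A small sketch, again assuming a hypothetical users model with an email column:

// Explicit values for the fields that matter to your test;
// Snaplet fills in every field you leave out.
await seed.users([
  { email: "snappy@example.com" },
  { email: "peanut@example.com" },
]);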

You can also customize how Snaplet connects related data.

Process:

  1. Snaplet connects to the target database and updates necessary values (such as sequence values to maintain continuity).
  2. When you run a plan (e.g., await seedClient.<models>((x) => x(3))), Snaplet generates and connects data. (Read more)

Internal process overview:

Snaplet employs a recursive algorithm to orchestrate data generation. Imagine you want to seed 10 "profile" records, each linked to a corresponding "user". Snaplet first identifies the "root" of this plan ("user," in this case); a code sketch of this walkthrough follows the list below.

It then proceeds recursively: for each profile, it ensures the existence of a parent "user" record. Once the top-level parent is created, Snaplet generates scalar fields (like names, emails, etc.) for the "user".

Finally, it delves into generating related children – creating the "profile" records and linking them back to their respective "users".

  1. Snaplet prioritizes data creation, using either default methods or your custom instructions.
  2. Snaplet automatically connects data based on foreign keys and uniqueness constraints.
  3. Snaplet commits newly created rows to the global seedClient.$store and returns only the data generated within the current plan (excluding any pre-existing connected data).
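
Putting the walkthrough above into code, here is a sketch assuming hypothetical users and profiles models (each profile referencing a user), and assuming $store exposes one array per model:

// Seeding 10 profiles: Snaplet walks up to the root ("users"),
// creates a parent user per profile, fills in its scalar fields,
// then creates the profiles and links them back to their users.
const { profiles } = await seed.profiles((x) => x(10));

// The plan returns only the rows it generated...
console.log(profiles.length); // 10

// ...while everything created is also committed to the global store:
console.log(seed.$store.users.length);    // 10
console.log(seed.$store.profiles.length); // 10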

[Image: Snaplet seed's data generation and connection functions]

Data generation specifics

You may ask: “Why not just use Faker to generate fake data with a custom function?” Well, you could. But while Snaplet may seem very similar, it leverages deeper database insights, optimizations, and a touch of AI-powered analysis to generate realistic, production-like data.

  1. Database-specific finesse: Snaplet understands the nuances of specific database types. Consider a PostgreSQL smallint column: Snaplet recognizes its limits (values from -32768 to 32767) and generates data accordingly, whereas a generic faker library might produce values outside this range, causing database errors (see the sketch after this list). Another advantage is sequence value continuation, which lets you generate more data on top of existing data without worrying about colliding primary key values.
  2. Determinism for reliable testing: Snaplet's focus on deterministic data generation is crucial for testing. With Snaplet, you gain predictable test results. If your code remains the same, so will your seed data, allowing you to confidently isolate issues in your core application logic.
  3. AI-powered assistance: For unstructured data like JSON fields, Snaplet's AI capabilities analyze your database schema to suggest relevant data. Additionally, it understands fields like "birth_date," offering sensible date ranges rather than purely random dates.
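
To make the smallint point concrete, here is a plain TypeScript sketch of the difference between a naive generator and a range-aware one. This is illustrative only, not Snaplet's internals:

// PostgreSQL smallint accepts values from -32768 to 32767.
const SMALLINT_MIN = -32768;
const SMALLINT_MAX = 32767;

// A naive generator can overflow the column and trigger a database error:
const naive = () => Math.floor(Math.random() * 1_000_000);

// A range-aware generator always stays within the column's bounds:
const rangeAware = () =>
  Math.floor(Math.random() * (SMALLINT_MAX - SMALLINT_MIN + 1)) + SMALLINT_MIN;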

Data connection specifics

Snaplet's data connection process is designed to ensure data integrity and consistency. It automatically connects related data based on foreign keys and uniqueness constraints. This approach minimizes manual effort and reduces the risk of data inconsistencies.

Once again, this could all be done manually, but Snaplet's automation saves you time and effort, especially when dealing with complex data relationships.

A good example is when you have what's called a "join table" in your database schema. This pattern is commonly observed in schema definitions, as illustrated below:

CREATE TABLE users (
   id SERIAL PRIMARY KEY,
   name TEXT
);
CREATE TABLE groups (
   id SERIAL PRIMARY KEY,
   name TEXT
);
CREATE TABLE group_member (
   user_id INT REFERENCES users(id),
   group_id INT REFERENCES groups(id),
   role TEXT,
   PRIMARY KEY (user_id, group_id)
);

Here, we can see that several constraints apply; one of them is that a user can only be present once in a given group. This is enforced by the composite PRIMARY KEY on the group_member table. Snaplet automatically takes care of this for you, so you can be sure that your data will be consistent and correct.

If you have a pool of, let's say, 10 users and 2 groups, and you want to connect them, Snaplet will seamlessly handle this process. It will ensure that each user is placed in each group at most once. In cases where there aren't enough users to meet the constraint, Snaplet will warn you that the desired connection may not satisfy the database constraints.

Say, for instance, you have a pool of 10 users and 2 groups and want to distribute users across these groups, allowing some to be members of both groups and some of none, while ensuring no user is a member of the same group more than once. You can do this with the following code:

// Set up our pools
const { users } = await seed.users(x => x(10));
const { groups } = await seed.groups(x => x(2));

// Create 15 group members and connect them to the users and groups pool
await seed.groupMembers(x => x(15), { connect: { users, groups } });

Snaplet will take care of all the constraints for you. With 10 users and 2 groups, you can generate up to 20 group members, one per unique user and group pair. If you try to create more than that, we’ll warn you that you need a larger pool of users or groups to meet the specified constraints.

An additional advantage of using Snaplet is that if you add a new foreign key to your database, you won't necessarily have to update your seed script. Snaplet is designed to make a best attempt at connecting and creating the required relational data, minimizing the need for manual script updates.
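
As a hypothetical illustration, suppose you later add a teams table and a team_id foreign key to users. Your existing seed call can keep working unchanged, because Snaplet picks up the new relation on the next introspection:

// After running, e.g.:
//   ALTER TABLE users ADD COLUMN team_id INT REFERENCES teams(id);
// this existing plan still works: Snaplet notices the new relation
// and creates (or connects) the required parent "teams" rows itself.
await seed.users((x) => x(5));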

Conclusion: Snaplet's automation and flexibility

Snaplet's seeding process aims to strike a balance between automation and customization. Across the three steps described above, it analyzes your database structure to minimize manual effort while also giving you the flexibility to tailor data generation to your exact testing and development needs. Understanding the mechanics behind this process empowers you to fully harness Snaplet's capabilities, ultimately saving you valuable time and streamlining your development cycles.


Andrew Valleteau
March 4, 2024