Our work at Snaplet involves taking personally identifiable information (PII) and replacing it with something that resembles that information (or its shape) but is in fact entirely different. In other words, we automatically de-identify sensitive data by replacing it with alternative or 'mock' data. Email addresses are a good example of personally identifiable information. If the real email address was something like email@example.com, we would want to replace it with something distinct, for example firstname.lastname@example.org.
Most fake data libraries only generate sequences
Most of the fake data libraries we could find generate sequences of fake data. Take Faker, for example: each call to it returns a different value.
The challenge, though, is that we need to make sure we always replace the same input value with the same output value. So if the input value was email@example.com and the output value was 'Elva67@hotmail.com', we would always want it to be replaced with 'Elva67@hotmail.com'. We don't want it to sometimes be replaced with 'Lonny74@yahoo.com' and other times with 'Clara16@hotmail.com'.
The need for determinism
Deterministic data is better because it is predictable. When you use it in tests, your test always gets the same input, so you can set expectations or make assertions about what the outputs are going to be. The same holds when working with data in development, outside of tests.
So let's imagine that this was the code of a program.
You can see that we've got the same code at the top as the bottom. But the results we got for Faker between these two runs of the program are completely different. We got a different sequence of mock data values generated. It turns out that Faker has a way to give you the same sequence of values. So if we were to use faker.seed():
We first get those three values at the top. When we run the program again, you can see that we get those same values again at the bottom. This is good because we now have determinism, or a deterministic sequence of values. This is what we were using up until now to ensure that the same input values always got the same output values. So it is clear that with Faker it is possible to get a deterministic sequence of values. But for us, that was not good enough. What we were looking for is a mapping of values, instead of a sequence.
So once an input has been mapped to 'Marcella76@yahoo.com', we always want it to map to 'Marcella76@yahoo.com', and once another input has been mapped to 'Easter44@hotmail.com', we always want it to map to 'Easter44@hotmail.com', never to 'Marcella76@yahoo.com' or any other email address not originally associated with it.
The challenge comes in when your environment changes and your parameters change with it. Let's say the shape or structure of the data changes. Perhaps a column is added to a table, or the data is reordered, or the program's own code gets changed. In all these cases, we ended up using the sequence differently. The sequence didn't change, but the way we used it did. In these cases it became very difficult to ensure that the same input values always got the same output values. We had determinism, but not in a way that was actually useful for us.
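To make the problem concrete, here is a minimal sketch. It assumes a hypothetical seeded generator, createSeededEmailSequence, built on a simple linear congruential generator. It only stands in for a seeded fake-data library and is not how Faker works internally. The input addresses and the anonymizeInOrder helper are also made up for illustration.

```javascript
// Hypothetical stand-in for a seeded fake-data library: a linear congruential
// generator (Numerical Recipes constants). This is NOT Faker's implementation.
function createSeededEmailSequence(seed) {
  let state = seed >>> 0;
  return function nextEmail() {
    state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
    return `user${state}@example.com`;
  };
}

// Replace each input with the next value in the seeded sequence.
function anonymizeInOrder(inputs, seed) {
  const nextEmail = createSeededEmailSequence(seed);
  return new Map(inputs.map((input) => [input, nextEmail()]));
}

const SEED = 42;
const firstRun = anonymizeInOrder(['alice@example.com', 'bob@example.com'], SEED);
const sameOrderRun = anonymizeInOrder(['alice@example.com', 'bob@example.com'], SEED);
const reorderedRun = anonymizeInOrder(['bob@example.com', 'alice@example.com'], SEED);

// Same seed and same order: alice gets the same replacement.
console.log(firstRun.get('alice@example.com') === sameOrderRun.get('alice@example.com')); // true

// Same seed, different order: the sequence is unchanged, but bob now
// receives the value that alice received in the first run.
console.log(reorderedRun.get('bob@example.com') === firstRun.get('alice@example.com')); // true
console.log(reorderedRun.get('alice@example.com') === firstRun.get('alice@example.com')); // false
```

The generated sequence is identical in all three runs. Only the pairing between inputs and outputs moved, which is exactly the kind of determinism that does not help here.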
Copycat, Snaplet's own mock data library, is born
Copycat works slightly differently from Faker in that it takes an input, whereas Faker's API doesn't rely on an input value to generate its output. With Faker, you go from 'nothing' to 'something'. With Copycat, if you provide an input: > copycat.email('firstname.lastname@example.org'), it generates an output: 'Rodger.Bernier540@gmail.com'. And if you provide a different input: > copycat.email('email@example.com'), it gives you a different output: 'Imelda_Lakin707@yahoo.com'.
If you ask for that first input again: > copycat.email('firstname.lastname@example.org'), a second or a third time, it gives you the same output as before: 'Rodger.Bernier540@gmail.com'.
And if you then ask for the second input again: > copycat.email('email@example.com'), it again returns its corresponding output: 'Imelda_Lakin707@yahoo.com'.
Every time you ask for a specific input, it gives you its original corresponding output. It doesn't need to keep anything in memory to make this guarantee, and it doesn't rely on any state to generate data values.
So the distinction between Faker and Copycat is this: Faker generates a sequence of values, whereas Copycat maps values. For the same input values it will always give the same output values.
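One way to get a stateless mapping like this is to derive the output from a hash of the input. The sketch below illustrates that idea only. It is an assumption, not Copycat's actual implementation, and the names fnv1a, mapEmail, FIRST_NAMES and DOMAINS are hypothetical.

```javascript
// FNV-1a over UTF-16 code units: a simple, well-known 32-bit string hash.
// Chosen for this sketch only; the article does not say what Copycat uses.
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

// Made-up building blocks for the fake addresses.
const FIRST_NAMES = ['Ada', 'Grace', 'Linus', 'Margaret'];
const DOMAINS = ['example.com', 'example.org', 'example.net'];

// Pure function: the output depends only on the input, never on call order
// or on anything remembered from earlier calls.
function mapEmail(input) {
  const hash = fnv1a(input);
  const name = FIRST_NAMES[hash % FIRST_NAMES.length];
  const domain = DOMAINS[(hash >>> 8) % DOMAINS.length];
  const suffix = (hash >>> 16) % 1000;
  return `${name}${suffix}@${domain}`;
}

const first = mapEmail('email@example.com');
const other = mapEmail('firstname.lastname@example.org');
const again = mapEmail('email@example.com');

console.log(first, other);
console.log(first === again); // true
```

Because mapEmail holds no state, reordering the inputs or adding new ones cannot change the output for an existing input. The trade-off in a toy version like this is a small output space, so two different inputs can collide on the same fake address.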
Watch a demo
See Snaplet developer Justin van der Merwe explain this in his video here! Alternatively, check out our docs for a deeper dive into the API. This is our first open-source release and we really love collaboration. Visit https://github.com/snaplet/copycat to add your own contribution!
More about Snaplet
Working with Snaplet is easy, fast and safe. Snaplet is a data generation tool that populates your Postgres database and automatically de-identifies sensitive data. This makes testing and debugging simpler and more robust, especially when working with databases that contain private personal information. If you need production-like data (that is de-identified, deterministic and mapped) to code against, Snaplet is here to help. Go ahead and try it out for yourself.