As you know, our work at Snaplet involves taking personally identifiable information (PII) and replacing it with something that resembles that information (or its shape) but is in fact entirely different. A good example is email addresses. If the real email address was something like firstname.lastname@example.org, we would want to replace it with something distinct, for example email@example.com.
The libraries out there for generating fake data (from what we could find) mostly generate sequences of fake data. Let’s take Faker as an example. Below we can see how it changes:
The challenge here, though, is that we need to make sure that we're always replacing the same input value with the same output value. So if the input value was firstname.lastname@example.org and the output value was 'Elva67@hotmail.com', we would always want it to be replaced with 'Elva67@hotmail.com'. We don’t want it to sometimes be replaced with 'Lonny74@yahoo.com' and other times with 'Clara16@hotmail.com'.
Deterministic data is better because it makes things predictable. When you use it in tests, your test always gets the same input, which lets you set expectations or make assertions about what the outputs are going to be. The same applies when you work with data in development, outside of tests.
So let's imagine that this was the code of a program.
You can see that we've got the same code at the top as the bottom. But the results we got for Faker between these two runs of the program are completely different. We got a different sequence of values generated. It turns out that Faker has a way to give you the same sequence of values. So if we were to use faker.seed():
We first get those three values at the top. When we run the program again, you can see that we get those same values again at the bottom. This is good because we now have determinism, or a deterministic sequence of values. This is what we were using up until now to ensure that for the same input values, we always got the same output values. So it is clear that with Faker it is possible to get a deterministic sequence of values. What we're actually looking for, though, is a mapping of values instead of a sequence.
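To make the idea of a seeded sequence concrete, here is a minimal, dependency-free sketch. It does not use Faker and is not how Faker is implemented: the `mulberry32` generator, the `makeEmailSequence` helper, and the name and domain lists are all assumptions made up for illustration. It only shows that the same seed yields the same sequence of values on every run.

```javascript
// Illustrative only: a tiny seeded PRNG (mulberry32), not Faker's internals.
// The same seed always produces the same sequence of numbers in [0, 1).
function mulberry32(seed) {
  let state = seed >>> 0;
  return function next() {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Hypothetical building blocks for fake email addresses.
const NAMES = ["Elva", "Lonny", "Clara", "Marcella", "Easter"];
const DOMAINS = ["hotmail.com", "yahoo.com", "gmail.com"];

// Returns a function that yields the next fake email in a seeded sequence.
function makeEmailSequence(seed) {
  const next = mulberry32(seed);
  const pick = (items) => items[Math.floor(next() * items.length)];
  return () => `${pick(NAMES)}${Math.floor(next() * 100)}@${pick(DOMAINS)}`;
}

// Two "runs of the program" with the same seed.
const firstRun = makeEmailSequence(42);
const secondRun = makeEmailSequence(42);
const runA = [firstRun(), firstRun(), firstRun()];
const runB = [secondRun(), secondRun(), secondRun()];

console.log(runA);
console.log(runB); // same three values, in the same order, as runA
```

The values are reproducible, but each one is tied to its position in the sequence rather than to any particular input.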
So if a given input was mapped to 'Marcella76@yahoo.com', we always want it to be mapped to 'Marcella76@yahoo.com', and if another input was mapped to 'Easter44@hotmail.com', we always want it to be mapped to 'Easter44@hotmail.com', never to 'Marcella76@yahoo.com' or any other email address not originally associated with it.
The challenge comes in when your environment changes and all your parameters change with it. Let's say the shape or structure of the data changes. Perhaps a column is added to a table, or the data is reordered, or the code of the program itself changes. In all these different cases, we ended up using the sequence differently. The sequence didn’t change, but the way we used the sequence changed. In these cases it became very difficult to ensure that the same input values always got the same output values. We had determinism, but not in a way that was actually useful for us.
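A small sketch of that failure mode, under stated assumptions: `SEEDED_SEQUENCE` stands in for the output of a seeded generator (same values, same order, every run), `anonymizeInOrder` is a hypothetical helper, and the input addresses are invented for the example. The sequence itself never changes, yet adding one row ahead of the others shifts which fake value each real value receives.

```javascript
// Stand-in for a seeded generator: identical values in identical order on every run.
const SEEDED_SEQUENCE = [
  "Elva67@hotmail.com",
  "Lonny74@yahoo.com",
  "Clara16@hotmail.com",
];

// Hypothetical helper: hands out fake emails in the order the real ones are seen.
function anonymizeInOrder(realEmails) {
  const mapping = new Map();
  let cursor = 0;
  for (const email of realEmails) {
    mapping.set(email, SEEDED_SEQUENCE[cursor % SEEDED_SEQUENCE.length]);
    cursor += 1;
  }
  return mapping;
}

// First run: two rows (made-up addresses).
const before = anonymizeInOrder(["alice@example.org", "bob@example.org"]);

// Later run: a new row now comes first, so every position shifts by one.
const after = anonymizeInOrder([
  "zoe@example.org",
  "alice@example.org",
  "bob@example.org",
]);

console.log(before.get("alice@example.org")); // Elva67@hotmail.com
console.log(after.get("alice@example.org")); // Lonny74@yahoo.com
```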
This is why we decided to write Copycat. Copycat works slightly differently from Faker in that it takes in an input, whereas Faker's API doesn't rely on an input value to generate the output. With Faker, it goes from ‘nothing’ to ‘something’. With Copycat, if you were to provide an input: > copycat.email('email@example.com'), it will generate an output: 'Rodger.Bernier540@gmail.com'. And if you were to provide a different input: > copycat.email('firstname.lastname@example.org'), it will give you a different output: 'Imelda_Lakin707@yahoo.com'
If you were to ask it for that same input: > copycat.email('firstname.lastname@example.org') a second time, or a third time, it would give you the same output as before: 'Imelda_Lakin707@yahoo.com'
And then if we asked for the input we used first: > copycat.email('email@example.com'), it again provides the corresponding output: 'Rodger.Bernier540@gmail.com'.
Every time you ask for a specific input, it will give you its original corresponding output. It doesn't need to keep anything in memory to provide this guarantee, and it doesn't rely on any state to generate its values.
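One way to picture a stateless mapping like this is to derive the output from a hash of the input. The sketch below is an assumption for illustration only and is not Copycat's actual implementation: `fnv1a`, `mapEmail`, and the name and domain lists are hypothetical, and a toy version like this can map two different inputs to the same output.

```javascript
// Illustrative only: 32-bit FNV-1a hash of a string, returned as an unsigned integer.
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

// Hypothetical building blocks for fake email addresses.
const NAMES = ["Rodger.Bernier", "Imelda_Lakin", "Elva", "Lonny", "Clara"];
const DOMAINS = ["gmail.com", "yahoo.com", "hotmail.com"];

// Hypothetical mapper: the output depends only on the input string,
// so no state or lookup table has to be kept between calls.
function mapEmail(input) {
  const hash = fnv1a(input);
  const name = NAMES[hash % NAMES.length];
  const domain = DOMAINS[(hash >>> 8) % DOMAINS.length];
  const suffix = (hash >>> 16) % 1000;
  return `${name}${suffix}@${domain}`;
}

const inputs = ["email@example.com", "firstname.lastname@example.org"];

// Map the inputs in their original order, then in reverse order.
const forward = inputs.map((email) => mapEmail(email));
const backward = [...inputs].reverse().map((email) => mapEmail(email)).reverse();

console.log(forward);
console.log(backward); // identical to forward: order of calls does not matter
```

Because nothing depends on call order or on earlier calls, reordering rows or adding new ones leaves the existing input-to-output pairs untouched.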
So the distinction between Faker and Copycat is this: Faker generates a sequence of values, whereas Copycat maps values. For the same input values it will always give the same output values.
See Snaplet developer Justin van der Merwe explain it in his video here! Alternatively, check out our docs for a deeper dive into the API. This is our first open-source release and we really love collaboration. Visit https://github.com/snaplet/copycat to add your own contribution!
Working with Snaplet is easy, fast and safe. This makes testing and debugging simpler and more robust, especially when working with databases that contain sensitive personal information. To guide you through the process, Snappy will be there to give you hints and instructions. You can chat with us directly. We’re always happy to help.