I don't know exactly how this works, but I wanted to share my experience trying to anonymize data. Don't.
While you may be able to change or delete obvious PII, like names, every bit of real data in aggregate leads to revealing someone's identity. They are male? That's half the population. They also live in Seattle, are Hispanic, age 18-25? Down to a few hundred thousand. They use Firefox? That might be like 10 people.
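The shrinking-group effect is easy to demonstrate by counting how many records share each combination of quasi-identifiers. A minimal sketch, with made-up rows and attribute names (any group of size 1 is a re-identification):

```python
from collections import Counter

# Hypothetical "anonymized" rows: names deleted, quasi-identifiers kept.
records = [
    {"sex": "M", "city": "Seattle", "age_band": "18-25", "browser": "Firefox"},
    {"sex": "M", "city": "Seattle", "age_band": "18-25", "browser": "Chrome"},
    {"sex": "F", "city": "Seattle", "age_band": "18-25", "browser": "Chrome"},
    {"sex": "M", "city": "Portland", "age_band": "26-35", "browser": "Chrome"},
]

def group_sizes(rows, keys):
    """How many rows share each combination of the given attributes?"""
    return Counter(tuple(r[k] for k in keys) for r in rows)

# One attribute leaves large groups; each added attribute shrinks them.
by_sex = group_sizes(records, ["sex"])  # M: 3, F: 1
by_all = group_sizes(records, ["sex", "city", "age_band", "browser"])
unique_rows = sum(1 for size in by_all.values() if size == 1)
```

In this toy dataset, all four rows become unique once all four attributes are combined, which is exactly the "down to like 10 people" problem at real-world scale.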
This is why browser fingerprinting is so effective; it's also how ad targeting works.
Just stick with generating random fake data during development. Many web frameworks already have libraries for this. Django, for example, has factory_boy [0]. You just tell it what model to use, and the factory class will generate data based on your schema. You'll catch more issues this way anyway, because computers are better at making nonsensical data.
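The factory_boy API varies by ORM, but the core idea is small enough to sketch without one. A minimal, hand-rolled stand-in (field names and generators here are invented; factory_boy derives this from your model declaration):

```python
import random
import string

# Map each field of a hypothetical User model to a generator, then stamp
# out as many rows as you need. This is roughly what a factory automates.
FIELD_GENERATORS = {
    "username": lambda: "".join(random.choices(string.ascii_lowercase, k=8)),
    "age": lambda: random.randint(0, 120),  # deliberately includes edge cases
    "email": lambda: "".join(random.choices(string.ascii_lowercase, k=5))
                     + "@example.com",
}

def make_user():
    """Build one fake record by calling every field's generator."""
    return {field: gen() for field, gen in FIELD_GENERATORS.items()}

users = [make_user() for _ in range(100)]
```

Because the generators cover the whole legal range of each field (age 0, age 120, gibberish usernames), this surfaces validation bugs that a copy of well-behaved production data never would.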
Thanks for the comment, and I hear you on the anonymization. What we see is that customers will go through, categorize what is PII and what is not, and anonymize as needed. If not, they'll backfill with synthetic data. You can change the gender from male to something else, same with the city, etc.
It's really down to the use-case. If you're strictly doing development, then you'll probably want to use more synthetic data than anonymization. If you care about preserving the statistical characteristics of the data then you can use ML models like CTGAN to create net new data.
Definitely a balance between when do you anonymize vs. when do you create synthetic data.
Thanks for the reply, I don't mean to be discouraging! I totally believe people do this, I'm saying they shouldn't. There are other issues as well. Once production data is floating around different environments, it will be easy to lose track of. Then the first GDPR delete request comes in. Was this data synthetic? Was it real? I think Joe has a copy on his laptop, he's on vacation?
It gets messy. It also doesn't solve the main 'unsolvable' issue with production data: scale. It is difficult to test some changes locally because developers often don't have access to databases large enough that would show issues before getting to production. At a certain size, this is the #1 killer of deployments.
Yup - I worked on a data warehouse project that was subject to GDPR. The way we did it: no synthetic data generation at all, we just overwrote any PII fields with "DELETED". It's still possible to action a delete request, because the PKs and emails are the same as they are in production.
It's definitely possible to practice this while adhering to GDPR, but you do need to plan carefully, and synthetic data should only be used for local dev/testing, not data warehousing.
Fuzzing random data is fine for development environments, but it won't give you the same scale or statistical significance as production data. Without that you can't really ensure that a change will work reliably in production, without actually deploying to production. Canary deployments can only give you this assurance to a certain degree, and by definition would only cover a subset of the data, so having a traditional staging environment is still valuable.
Not only that, but a staging environment with prod-like data is also useful for running load, performance and E2E tests without actually impacting production servers or touching production data. In all of these cases, anonymizing production data is important as you want to minimize the risk of data leaks, while also avoiding issues that can happen when testing against real production data (e.g. sending emails to actual customers).
I don't totally understand this comment. Random data can get you more scale than production data, in that it can just be made up. All the load and E2E testing can be done with test data, no problem.
This idea of data being statistically significant has come up, but that's also easy to replicate with random data once you know the distributions of the data. In practice, those distributions rarely change, especially around demographic data. However, I don't think I've seen a case where this has been a problem. I'd be interested to learn about one.
The ideal scenario is that you're able to augment your existing data with more data that looks just like it. How much statistical fidelity matters really depends on the use-case. For load testing, it's probably not as important as it is for something like feature testing/debugging/analytical queries.
Even if you know the distribution of the data (which imo can be fairly difficult), replicating it can also be tricky. If you know that a gender column is split 30/70 male to female, how do you create the 30% male names? How about the female names? Are they the same name, or do you repeat names? Does it matter? In some cases it does and in others it doesn't.
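One way to approximate a known marginal distribution is weighted sampling. A quick sketch, where the 30/70 split and the name pools are made up for illustration:

```python
import random

random.seed(42)  # fixed seed so the illustration is reproducible

MALE_NAMES = ["James", "Luis", "Omar", "Kenji"]
FEMALE_NAMES = ["Maria", "Aisha", "Chen", "Sofia"]

def sample_person():
    # Match the known marginal: 30% male, 70% female.
    gender = random.choices(["male", "female"], weights=[0.3, 0.7])[0]
    pool = MALE_NAMES if gender == "male" else FEMALE_NAMES
    # Names repeat freely here; whether that's acceptable depends on the
    # use case (fine for load testing, wrong for uniqueness constraints).
    return {"gender": gender, "name": random.choice(pool)}

people = [sample_person() for _ in range(10_000)]
share_male = sum(p["gender"] == "male" for p in people) / len(people)
```

This reproduces the single-column marginal but, as the comment notes, says nothing about joint distributions or name uniqueness; that's where the simple approach runs out.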
What we've seen is that it's really use-case specific and there are some tools that can help but there isn't a complete tool set. That's what we're trying to build over time.
There are various reasons why you might want synthetic data. Anonymization is one of them - but the issue is which statistical relationships are preserved in the anonymizing process, while ensuring that fusion with other data sources is not going to unmask the real data hidden beneath.
So because you were a) too lazy to understand the concept of differential privacy, b) too lazy to see the value of using anonymized data, and c) able to come up with a trivial strawman, the whole concept of anonymization is unnecessary and should be replaced by oversimplified factories?
How ignorant and wrong-headed. I'd recommend learning more about the concepts of k-anonymity and differential privacy before you prematurely presume impossible a concept that others (including Google) have been able to use successfully.
Your professional laziness clearly has you in the wrong, and all you can muster is a snide, defensive "are you ok?"
I'd really advise you to work on that. There's a world of difference between "This is impossible" and "I can't be bothered to figure out how this useful thing others successfully do while managing billions of dollars of risk is possible." Otherwise, you risk professional stagnation.
It depends on the use case though. It's not just about developers testing locally.
One use case that many companies have is data warehousing. Here, you want to have real customer and order data, but anonymized to a degree where only the necessary data is exposed to business analysts and so on. I once worked on a project to do exactly that: clone production to the data warehouse, stripping out only things like contact details, but preserving things like emails, what the customer ordered, and other data like that.
Belated reply, sorry: This is the Correct Answer™.
Mid 2000s, I worked with electronic medical records. I eventually determined anon isn't worthwhile.
For starters, deanon will always beat anon. This statement is unambiguously true, per the research. Including the differential privacy stuff.
My efforts pivoted to adopting Translucent Database techniques. My current hill to die on is demanding that all PII be encrypted at rest, at the field level.
(It's basically applying paper password storing techniques to protecting PII. The book shows a handful of illuminating use cases. Super clever. No weird or new tech required.)
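One of the techniques from that book can be sketched in a few lines: store a salted one-way digest of the PII field instead of the cleartext, so the application can still match an identifier it already knows without the database ever holding the raw value. This is an illustrative sketch, not the book's exact scheme; the SSN-style field is hypothetical:

```python
import hashlib
import os

def protect(pii: str, salt: bytes) -> bytes:
    """One-way transform: the table stores this digest, never the raw PII."""
    return hashlib.pbkdf2_hmac("sha256", pii.encode(), salt, 100_000)

salt = os.urandom(16)  # per-record or per-deployment salt, kept out of the table
stored = protect("555-12-3456", salt)

# Later, the application can match a known identifier without reading it back:
assert protect("555-12-3456", salt) == stored
assert protect("555-99-9999", salt) != stored
```

A leaked copy of the table reveals nothing directly, and equality lookups still work; anything that needs the cleartext back would use reversible field-level encryption instead of a hash.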
So, how does one create synthetic relational data? Do you just crank out a list of synthetic customers, assign IDs, create between 0 and 3 synthetic orders per person, and between 0 and 3 order line entries per order?
This is somewhat framework dependent, but factory_boy supports connecting factories together via SubFactory. There's a real-world example I'm building [0]. See where "author = SubFactory(UserFactory)". I'd imagine there are similar ways to do this for Rails and others too.
It's a combination of creating a random number of records for foreign keys, e.g. for 1 customer, create between 2 and 5 transactions. We're working on giving you control over that, and on handling referential integrity with table constraints (foreign keys, unique constraints, etc.).
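The procedural approach described above can be sketched in plain Python. The schema and the 0-3/1-3 ranges are invented for illustration; a real tool would derive them from the foreign-key constraints:

```python
import random

def synth_relational(n_customers):
    """Generate customers, each with 0-3 orders of 1-3 line items,
    keeping foreign keys consistent by construction."""
    customers, orders, lines = [], [], []
    order_id = line_id = 0
    for customer_id in range(n_customers):
        customers.append({"id": customer_id})
        for _ in range(random.randint(0, 3)):
            orders.append({"id": order_id, "customer_id": customer_id})
            for _ in range(random.randint(1, 3)):
                lines.append({"id": line_id, "order_id": order_id,
                              "qty": random.randint(1, 5)})
                line_id += 1
            order_id += 1
    return customers, orders, lines

customers, orders, lines = synth_relational(100)
```

Because children are generated inside the loop for their parent, every foreign key points at a row that exists; the hard part a real tool adds is honoring unique constraints and cross-table distributions on top of that.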
ML-based approaches are typically not very good at this and struggle with things like referential integrity, so a more "procedural" or imperative approach is slightly better. The ideal is a combination of both.
Well… if you know the data is real, then knowing everything about that someone can seriously limit the number of people it could actually be, which makes that someone identifiable.
Like if you only know the full address, and you see that only one person lives at that address. Or you know the exact birthdate and the school that someone went to. Or the year of birth and the small shop that the person works for. And so on…
Keep production data in production.
[0]: https://factoryboy.readthedocs.io/en/stable/orms.html