You’ve probably heard of deepfakes in the last few years. If not, which side of the internet are you on?
Sites like thispersondoesnotexist.com generate what we call synthetic data: in the case of this site, images of people who don’t exist. The images are generated by the neural network behind the website, which was trained on real images to create fake ones.
In sum, synthetic data is “any production data applicable to a given situation that is not obtained by direct measurement” according to the McGraw-Hill Dictionary of Scientific and Technical Terms.
Research into data synthesis dates back to the 1930s, with the first work on synthesizing audio and voice. The rise of digitization in the 1970s paved the way for software synthesizers.
The first application of synthetic data generation to protect the privacy of the original data dates to 1993 and Donald Rubin, an emeritus professor of statistics at Harvard. He conceptualized the use of algorithms to create a fully synthetic version of the Decennial Census, thus anonymizing the original data and protecting people’s privacy while keeping the statistics of the original dataset intact.
During the 90s and early 2000s, the techniques to generate synthetic data diversified with algorithms such as the Bayesian bootstrap, the parametric posterior predictive distribution, and Sequential Regression Multivariate Imputation.
A significant jump in the usage and quality of synthetic data came in the 2010s, as neural networks were increasingly used for synthetic data generation and new architectures such as Generative Adversarial Networks (first presented in a 2014 paper) rose to popularity.
Generative Adversarial Networks (GANs) are an interesting concept. In a simplified way, we have two networks, a generator and a discriminator, both learning from the original data. The generator acts as an art forger: it starts by generating random pieces of data that do not resemble the original at all, but with time and learning they get better. The discriminator is like the police: its function is to distinguish real data from data produced by the generator. With time, our art forger (generator) and our police (discriminator) both become more effective, until the pieces of data that manage to fool the discriminator are almost indistinguishable from the original.
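To make the forger-versus-police game concrete, here is a minimal sketch in plain Python. It is a toy built on assumptions of mine and is not how CTGAN or any real library is implemented: the "real" data is a single number drawn from a normal distribution centered at 4, the generator is the line a * noise + b, and the discriminator is a logistic function of w * x + c. The names train_toy_gan, a, b, w, and c are purely illustrative. A setup this small tends to oscillate rather than settle, so treat it as a picture of the two alternating updates and nothing more.

```python
import math
import random


def sigmoid(s):
    """Numerically stable logistic function."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)


def train_toy_gan(steps=2000, lr=0.01, seed=0):
    """Train a 1-D toy GAN; real data is drawn from N(4, 1) (an assumption)."""
    rng = random.Random(seed)
    a, b = 1.0, 0.0  # generator (the forger): fake = a * noise + b
    w, c = 0.0, 0.0  # discriminator (the police): D(x) = sigmoid(w * x + c)
    for _ in range(steps):
        real = rng.gauss(4.0, 1.0)
        noise = rng.gauss(0.0, 1.0)
        fake = a * noise + b

        # Discriminator step: push D(real) towards 1 and D(fake) towards 0
        # by descending the gradient of -log D(real) - log(1 - D(fake)).
        d_real = sigmoid(w * real + c)
        d_fake = sigmoid(w * fake + c)
        w -= lr * ((d_real - 1.0) * real + d_fake * fake)
        c -= lr * ((d_real - 1.0) + d_fake)

        # Generator step: push D(fake) towards 1, i.e. fool the discriminator,
        # by descending the gradient of -log D(fake).
        d_fake = sigmoid(w * fake + c)
        grad_fake = (d_fake - 1.0) * w  # derivative with respect to `fake`
        a -= lr * grad_fake * noise
        b -= lr * grad_fake
    return a, b, w, c
```

Calling train_toy_gan() returns the four learned parameters; the generator’s offset b is the value to watch if you want to see the forger chasing the real data.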
GANs, because of their straightforward implementation, are amongst the most used methodologies for synthetic data generation, from videos to images and even simple tabular data.
Nowadays the ways of creating synthetic data have diversified, with the aforementioned GANs and LSTM neural networks as the top players. From tabular data to images to sounds and videos, synthetic data is here to stay.
Synthetic data has three main uses:
Being able to mimic real-life data has its challenges and benefits. While it may seem limitless in generating scenarios for testing and development, it’s important to remember that any synthetic model derived from data can only replicate specific properties of that data, meaning it will ultimately only be able to simulate general trends.
But that doesn’t leave synthetic data without its benefits. It allows us to:
Sounds like a perfect way to generate datasets, right? Not quite: it comes with challenges too. These are just a few:
As you can see synthetic data is on the rise and its importance is paramount. So let’s check a simple example of how you can learn to create a synthetic dataset and use this technique in your daily life as a data person. We’re gonna see an extremely simple example with tabular data as it is the most commonly used data in data science projects within companies.
We’re gonna go through a Conditional Tabular GAN (CTGAN) example and show you how you can create synthetic tabular data. We’re gonna use a default CTGAN structure for this example, but you can definitely tune the parameters of this neural network.
For this example, we’re gonna use the Pima Indians Diabetes Dataset from Kaggle. This dataset portrays a common case where data privacy is paramount. It gathers the health information of a group of women of Pima Indian heritage and is used to predict their diabetes risk.
Watch the video or carefully read the instructions below to create your own synthetic dataset. Good luck!
For this exercise, we’ll need to install the CTGAN, SDV, and Table Evaluator packages. To make sure Table Evaluator works properly, a specific version of seaborn, 0.11.1, should be installed using python -m pip install seaborn==0.11.1.
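Before going further, it can help to confirm what is actually installed. The sketch below uses only the standard library to look up package versions. The names in REQUIRED are the install names I am assuming for the packages mentioned above, and installed_versions is a helper made up for this post.

```python
from importlib.metadata import PackageNotFoundError, version

# Assumed install names for the packages used in this tutorial.
REQUIRED = ["ctgan", "sdv", "table_evaluator", "seaborn"]


def installed_versions(packages=REQUIRED):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver or 'not installed'}")
```

If seaborn shows up with a version other than 0.11.1, that is the first thing to fix.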
First, let’s import all necessary packages.
Then we’ll need to import the data.
At this stage, it’s important to validate the dataset and check if it has missing data as CTGAN fails if the dataset is not clean.
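A quick way to run that check with pandas is sketched here. It uses a tiny stand-in DataFrame so that it runs on its own; in the tutorial you would use the loaded dataset instead. The file name in the comment and the missing_report helper are assumptions for illustration, and dropping incomplete rows is just the simplest possible cleanup, not necessarily the best one for your data.

```python
import pandas as pd


def missing_report(df):
    """Return the number of missing values per column, largest first."""
    counts = df.isna().sum()
    return counts[counts > 0].sort_values(ascending=False)


# Tiny stand-in frame; in the tutorial this would be the loaded dataset,
# e.g. df = pd.read_csv("diabetes.csv")  # hypothetical file name
df = pd.DataFrame({"Glucose": [148.0, None, 183.0], "Outcome": [1, 0, 1]})

report = missing_report(df)  # columns that still contain missing values
clean = df.dropna()          # simplest fix: drop the incomplete rows
```

An empty report means the dataset is ready to hand over to the synthesizer.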
Finally, we can start the CTGAN implementation. First, we’ll need to declare which are the columns with discrete variables.
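If you are not sure which columns count as discrete, one rough heuristic is to flag anything that is non-numeric or has only a handful of distinct values. The helper below is a sketch under that assumption. The name guess_discrete_columns and the max_unique threshold are mine, and the result should always be reviewed by hand. BMI and Outcome are column names from the Kaggle version of the dataset, used here with made-up values.

```python
import pandas as pd


def guess_discrete_columns(df, max_unique=10):
    """List columns that are non-numeric or have at most max_unique values."""
    discrete = []
    for name in df.columns:
        is_numeric = pd.api.types.is_numeric_dtype(df[name])
        if not is_numeric or df[name].nunique() <= max_unique:
            discrete.append(name)
    return discrete


# Made-up rows for illustration only.
sample = pd.DataFrame({
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
    "Outcome": [1, 0, 1, 0, 1],
})
discrete_columns = guess_discrete_columns(sample, max_unique=2)
```

On this stand-in frame only the binary label column is flagged, which is what you would expect.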
Then we configure and run the synthesizer. You can control the number of epochs, the batch_size, and the dimensions of the generator and discriminator. The verbose option can be set to True if you want to follow the training of the CTGAN as it progresses.
After some time (CTGAN takes quite a while to train), generating synthetic data is as simple as asking the trained synthesizer for a sample of the size you want.
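Put together, configuring, training, and sampling could look roughly like the sketch below. It assumes a recent release of the ctgan package, where the synthesizer class is exposed as CTGAN (older releases used a different class name). The parameter values are placeholders of mine and not the settings used for the results in this post, and fit_and_sample is a wrapper invented for illustration.

```python
def fit_and_sample(data, discrete_columns, n_rows, epochs=300):
    """Fit a CTGAN on `data` and return `n_rows` synthetic rows."""
    # Imported lazily because ctgan pulls in PyTorch and is slow to load.
    from ctgan import CTGAN

    synthesizer = CTGAN(
        epochs=epochs,                  # number of passes over the data
        batch_size=500,                 # placeholder value, tune as needed
        generator_dim=(256, 256),       # hidden layer sizes of the generator
        discriminator_dim=(256, 256),   # hidden layer sizes of the discriminator
        verbose=True,                   # print the losses while training
    )
    synthesizer.fit(data, discrete_columns)
    return synthesizer.sample(n_rows)


# Example call (commented out because training takes a while):
# synthetic = fit_and_sample(df, ["Outcome"], n_rows=len(df))
```

Sampling as many rows as the original dataset makes the later comparison between real and fake data more straightforward.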
The final step is evaluating this synthetic dataset. Two tools are useful here: table_evaluator and sdv.evaluate. TableEvaluator is a library that evaluates how similar a synthesized dataset is to real data. In other words, it tries to give an indication of how real your fake data is.
In the PCA plot, for example, you can see that the fake data has grasped the overall trends of the real data and that the distributions are fairly similar.
To obtain an objective measure of how close the synthetic data is to the real data, we can use the SDV evaluate function. This function aggregates the results of all of its similarity metrics into a score from 0 to 1, where 0 is the worst and 1 is ideal. The results for our CTGAN, while not awful, still leave room for improvement, which could be gained by tuning the parameters of the synthesizer.
You can check the full code for this project here.
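To get a feeling for what a 0-to-1 similarity score means, here is a small standalone sketch of one common way to compare a single numeric column: one minus the two-sample Kolmogorov-Smirnov statistic, which is the largest gap between the two empirical distribution functions. This is my own illustration of the idea and not SDV’s implementation, and ks_complement is a made-up name.

```python
from bisect import bisect_right


def ks_complement(real, fake):
    """Return 1 minus the largest gap between the two empirical CDFs."""
    real_sorted = sorted(real)
    fake_sorted = sorted(fake)
    n, m = len(real_sorted), len(fake_sorted)
    max_gap = 0.0
    # Both CDFs are step functions, so checking every observed value
    # is enough to find the largest gap between them.
    for x in real_sorted + fake_sorted:
        cdf_real = bisect_right(real_sorted, x) / n
        cdf_fake = bisect_right(fake_sorted, x) / m
        max_gap = max(max_gap, abs(cdf_real - cdf_fake))
    return 1.0 - max_gap
```

Two identical columns score 1.0 and two columns with no overlap at all score 0.0, which matches the "0 is worst, 1 is ideal" reading above.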
I hope you enjoyed this mini intro to this very exciting area of Data Science!
Would you like to know more about the subject? Here’s an article about what is Data Science and its current challenge.
Till the next voyage in this galaxy of data!