Data Synthetization: enhanced GANs vs Copulas
Using case studies, I compare generative adversarial networks (GANs) with copulas to synthesize tabular data. I discuss back-end and front-end improvements to help GANs better replicate the correlation structure present in the real data. Likewise, I discuss methods to further improve copulas, including transforms, the use of separate copulas for each population segment, and parametric model-driven copulas compared to a data-driven parameter-free approach. I apply the techniques to real-life datasets, with full Python implementation. In the end, blending both methods leads to better results. Both methods eventually need an iterative gradient-descent technique to find an optimum in the parameter space. For GANs, I provide a detailed discussion of hyperparameters and fine-tuning options.
I show examples where GANs are superior to copulas, and the other way around. My GAN implementation also leads to fully replicable results — a feature usually absent in other GAN systems. This is particularly important given the high dependency on the initial configuration determined by a seed parameter: it also allows you to find the best synthetic data using multiple runs of GAN in a replicable setting. In the process, I introduce a new matrix correlation distance to evaluate the quality of the synthetic data, taking values between 0 and 1 where 0 is best, and leverage the TableEvaluator library. I also discuss feature clustering to improve the technique, to detect groups of features independent from each other, and apply a different model to each of them. In a medical data example to predict the risk of cancer, I use random forests to classify the real data, and compare the performance with results obtained on the synthetic data.
Read more and download the article, with full Python implementation, from here.