Generative AI Technology Breakthrough: Spectacular Performance of a New Synthesizer
I introduce NoGAN, a new alternative to standard data synthetization methods. It runs several orders of magnitude faster than training a generative adversarial network (GAN). In addition, the quality of the generated data is far superior to that of almost all other products available on the market.
Many evaluation metrics used to measure faithfulness have critical flaws, sometimes ranking a replication as excellent when it is actually a failure, because they rely on low-dimensional indicators. I fix this problem by using the full multivariate empirical cumulative distribution function (ECDF). As an added benefit, for both synthetization and evaluation, all feature types (categorical, ordinal, or continuous) are processed with a single formula, even in the presence of missing values.
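To make the evaluation idea concrete, here is a minimal sketch of a multivariate-ECDF comparison. This is not the author's implementation: the function names (`multivariate_ecdf`, `ks_distance`) and the choice of evaluation points (a random subset drawn from both samples) are my own assumptions, and features are assumed already encoded as numbers.

```python
import numpy as np

def multivariate_ecdf(data, points):
    """Evaluate the multivariate ECDF of `data` at each row of `points`:
    F(x) = fraction of rows of `data` that are <= x in every coordinate."""
    below = np.all(points[:, None, :] >= data[None, :, :], axis=2)
    return below.mean(axis=1)

def ks_distance(real, synth, n_eval=1000, seed=0):
    """Approximate KS distance between the multivariate ECDFs of two samples,
    maximized over a random set of evaluation points taken from both samples.
    (Illustrative sketch, not the author's exact evaluation code.)"""
    rng = np.random.default_rng(seed)
    pool = np.vstack([real, synth])
    idx = rng.choice(len(pool), size=min(n_eval, len(pool)), replace=False)
    pts = pool[idx]
    return np.max(np.abs(multivariate_ecdf(real, pts)
                         - multivariate_ecdf(synth, pts)))
```

A small distance indicates that the synthetic sample reproduces the full joint distribution, not just low-dimensional summaries such as marginals or pairwise correlations.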
In a real-life case study involving tabular data, the synthetic data was generated in under 5 seconds, versus 10 minutes with a GAN, and produced much better results, as verified via cross-validation. Thanks to the very fast implementation, the hyperparameters can be fine-tuned automatically and efficiently. I also discuss next steps to further improve the speed and the faithfulness of the generated data, as well as applications beyond synthetization.
The superiority of NoGAN is substantial. It allows for exact replication of the real data when the bins are granular enough, and this is achieved with barely any penalty in running time or memory requirements: the final number of bins never exceeds the number of observations. GAN is not capable of such performance, making NoGAN a game changer.

The loss function is the Kolmogorov-Smirnov (KS) distance between the multivariate ECDFs computed on the real and synthetic data. However, no gradient descent algorithm is involved, which contributes to the speed and stability of the method regardless of the type of features. In particular, the method is not subject to mode collapse or divergence. Likewise, unlike GAN, there is no discriminator model. The method also yields fully replicable results and a simple parallel implementation.
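The bin-based idea described above can be sketched in a few lines. This is an illustrative reconstruction, not the author's code: the function `nogan_sketch`, the use of quantile bin edges, and the uniform draw inside each bin are my own assumptions about one plausible way to realize the approach on continuous features.

```python
import numpy as np

def nogan_sketch(real, n_synth, n_bins=20, seed=0):
    """Illustrative bin-based synthesizer (assumed design, not the author's):
    1. partition each feature into quantile bins,
    2. count how many real rows fall in each multivariate bin,
    3. sample occupied bins in proportion to those counts,
    4. draw each synthetic value uniformly inside its bin."""
    rng = np.random.default_rng(seed)
    n, d = real.shape
    qs = np.linspace(0, 1, n_bins + 1)
    edges = [np.quantile(real[:, j], qs) for j in range(d)]
    # Multivariate bin index of each real row
    keys = np.stack(
        [np.clip(np.searchsorted(edges[j], real[:, j], side="right") - 1,
                 0, n_bins - 1) for j in range(d)], axis=1)
    uniq, counts = np.unique(keys, axis=0, return_counts=True)
    # Sample bins proportionally to their real-data occupancy
    pick = rng.choice(len(uniq), size=n_synth, p=counts / counts.sum())
    synth = np.empty((n_synth, d))
    for j in range(d):
        lo = edges[j][uniq[pick, j]]
        hi = edges[j][uniq[pick, j] + 1]
        synth[:, j] = rng.uniform(lo, hi)
    return synth
```

With more bins, each occupied bin shrinks toward a single observation, which illustrates the exact-replication limit mentioned above; since only occupied bins are stored, the bin count is bounded by the number of observations.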
To read more and access the technical document (16 pages) with full Python implementation and case studies, follow this link.