GenAI-Evaluation: New Open Source Python Library Now Available

Vincent Granville
2 min read · Sep 22, 2023


Tested on multiple public data sets with my own NoGAN synthesizers (1000x faster and consistently better than solutions offered by synthetic data companies), this Python library implements the best evaluation metric to compare your synthetic data with the real data it is supposed to mimic.

A few highlights:

  • First implementation of the multivariate Kolmogorov-Smirnov distance in any dimension, for categorical or numerical features, or a mix of both.
  • Fast, returning results in a few seconds. The distance ranges from zero (best fit) to one (worst fit), so it is easy to interpret.
  • Outperforms all other evaluation metrics currently implemented by vendors. Will correctly identify poor synthetizations even on the very challenging “circle dataset” pictured below.
  • Adjusted for the number of features (the dimension). Produces a comparison scatterplot that is easy to interpret regardless of dimension; see the second illustration below.
  • Also returns the multivariate ECDF (empirical cumulative distribution function) for your datasets, synthetic and real. This generalizes the unidimensional ECDF function available in Python to any dimension, and is based on multivariate quantiles.
  • Free and easy to install with “pip install genai-evaluation”.
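
To give a feel for what a multivariate Kolmogorov-Smirnov distance measures, here is a minimal NumPy sketch. It is not the code inside genai-evaluation: the function names `ecdf_at` and `ks_distance` are hypothetical, the query points are simply random rows drawn from the pooled data rather than multivariate quantiles, there is no adjustment for the dimension, and only numerical features are handled.

```python
import numpy as np

def ecdf_at(data, points):
    """Multivariate ECDF of `data`, evaluated at each row of `points`.

    For a query point z, the ECDF is the fraction of rows x in `data`
    such that x[j] <= z[j] for every feature j.
    """
    # Broadcast to (n_points, n_rows, n_features), then reduce over features.
    below = (data[None, :, :] <= points[:, None, :]).all(axis=2)
    return below.mean(axis=1)

def ks_distance(real, synth, n_points=500, seed=0):
    """Largest absolute gap between the two ECDFs over sampled query points.

    Illustrative assumption: query points are random rows of the pooled data.
    The result is always between 0 (identical ECDFs) and 1.
    """
    rng = np.random.default_rng(seed)
    pooled = np.vstack([real, synth])
    size = min(n_points, len(pooled))
    points = pooled[rng.choice(len(pooled), size=size, replace=False)]
    gaps = np.abs(ecdf_at(real, points) - ecdf_at(synth, points))
    return float(gaps.max())

# Toy data: a faithful "synthetic" sample and a shifted (poor) one.
rng = np.random.default_rng(42)
real = rng.normal(size=(1000, 3))
good = rng.normal(size=(1000, 3))
bad = rng.normal(loc=1.0, size=(1000, 3))

print("real vs good:", ks_distance(real, good))
print("real vs bad: ", ks_distance(real, bad))
```

The poor synthetization should score noticeably closer to one than the faithful one. Because the ECDF is computed jointly over all features, a distance of this kind can react to broken dependencies between features, not just to mismatched marginals.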

To learn more, see the use cases; for documentation, visit the reference page.

First illustration: circle dataset

Circle dataset: poor evaluation metrics rate all synthetizations as excellent

Second illustration: evaluation scatterplot, Telecom dataset

High dimensional data: evaluation scatterplot with easy interpretation, three ways


Written by Vincent Granville

Founder, MLtechniques.com. Machine learning scientist. Co-founder of Data Science Central (acquired by Tech Target).
