Synthetizing the Insurance Dataset Using Copulas: Towards Better Synthetization

Vincent Granville
2 min read · Dec 18, 2022

In the context of synthetic data generation, I’ve been asked a few times to provide a case study focusing on real-life tabular data used in the finance or health industry. This article fills that gap. The purpose is to generate a synthetic copy of the real dataset, preserving its correlation structure and the statistical distributions of its features. I went one step further and compared my results with those obtained from one of the best-known vendors in this market: Mostly.ai.

I was able to reverse-engineer the technique that they use, and I share all the details in this article. It is actually a lot easier than most people think. Indeed, the core of the method relies on a few lines of Python code, calling four classic functions from the NumPy and SciPy libraries.
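To give a feel for how short such code can be, here is a minimal sketch of a Gaussian copula synthesizer. It is an illustration under stated assumptions, not the vendor's code and not necessarily the four functions mentioned above: the function name `gaussian_copula_synth` is hypothetical, all features are assumed to be continuous, and there must be at least two of them.

```python
import numpy as np
from scipy import stats

def gaussian_copula_synth(real, n_synth, seed=0):
    """Sketch: draw n_synth rows that mimic `real` (2-D array, continuous features)."""
    rng = np.random.default_rng(seed)
    real = np.asarray(real, dtype=float)
    n, d = real.shape

    # 1. Turn each feature into normal scores through its ranks (empirical CDF).
    u = stats.rankdata(real, axis=0) / (n + 1)   # values strictly inside (0, 1)
    z = stats.norm.ppf(u)

    # 2. Estimate the correlation matrix of the normal scores.
    corr = np.corrcoef(z, rowvar=False)

    # 3. Sample correlated Gaussian vectors and map them to uniforms.
    g = rng.multivariate_normal(np.zeros(d), corr, size=n_synth)
    u_new = stats.norm.cdf(g)

    # 4. Map the uniforms back through each feature's empirical quantile function.
    #    np.quantile interpolates, so values stay inside the observed range.
    return np.column_stack(
        [np.quantile(real[:, j], u_new[:, j]) for j in range(d)]
    )
```

Calling `gaussian_copula_synth(real, 1000)` on a numeric array with one row per observation returns 1000 synthetic rows with the same columns. Categorical features would need separate handling, which this sketch leaves out.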

Comparing real data with two synthetic copies

Automatically detecting large homogeneous groups (called nodes in decision trees) and using a separate copula for each node is an ensemble technique not unlike boosted trees. In the insurance dataset, I picked these groups manually. Either way (manual or automated), it leads to better performance.
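The per-group idea can be sketched as follows. This is only an illustration: the helper names `copula_sample` and `synth_by_group` are hypothetical, the group labels are assumed to be given (picked manually or produced by a decision tree), and each group is assumed to have at least two continuous features and enough rows to estimate a correlation matrix.

```python
import numpy as np
from scipy import stats

def copula_sample(real, n_synth, rng):
    """Gaussian copula sample for one homogeneous group (continuous features)."""
    n, d = real.shape
    z = stats.norm.ppf(stats.rankdata(real, axis=0) / (n + 1))  # normal scores
    corr = np.corrcoef(z, rowvar=False)
    u = stats.norm.cdf(rng.multivariate_normal(np.zeros(d), corr, size=n_synth))
    # Back to the original scale via each feature's empirical quantiles.
    return np.column_stack([np.quantile(real[:, j], u[:, j]) for j in range(d)])

def synth_by_group(features, groups, seed=0):
    """Fit one copula per group label and keep each group's original size."""
    rng = np.random.default_rng(seed)
    features = np.asarray(features, dtype=float)
    groups = np.asarray(groups)
    synth_rows, synth_labels = [], []
    for label in np.unique(groups):
        block = features[groups == label]          # rows belonging to this group
        synth_rows.append(copula_sample(block, len(block), rng))
        synth_labels.append(np.full(len(block), label))
    return np.vstack(synth_rows), np.concatenate(synth_labels)
```

Fitting each group separately lets the correlation structure differ from one group to the next, which a single global copula would average out.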

Testing how close your synthetic data is to the real dataset using Hellinger or similar distances is not a good idea: by that measure, the best synthetic dataset is an exact replica of your real data, which amounts to overfitting. Instead, you might want to favor synthetized observations with summary statistics (including the shape of the distribution in high dimensions) closely matching those in the real dataset, but with the worst (rather than best) Hellinger score. This allows you to create richer synthetic data, including atypical observations not found in your training set. With copula-based methods, there are two ways to generate observations outside the observed range while keeping the structure present in the real data: extrapolating the empirical quantile functions (as opposed to interpolating only), or adding uncorrelated white noise to each feature (in the real or synthetic data).
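To make the two ingredients of this argument concrete, here is a small sketch of a histogram-based Hellinger distance for one feature, together with a white-noise perturbation. The function names, the number of bins, and the noise level (a fraction of each feature's standard deviation) are assumptions chosen for illustration, not values taken from the full article.

```python
import numpy as np

def hellinger(x, y, bins=20):
    """Hellinger distance between two 1-D samples, using shared histogram bins."""
    lo, hi = min(x.min(), y.min()), max(x.max(), y.max())
    p, _ = np.histogram(x, bins=bins, range=(lo, hi))
    q, _ = np.histogram(y, bins=bins, range=(lo, hi))
    p, q = p / p.sum(), q / q.sum()              # counts -> probabilities
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def add_white_noise(data, scale=0.05, seed=0):
    """Add independent Gaussian noise to every feature.

    The noise standard deviation is `scale` times each feature's own
    standard deviation (an arbitrary choice for this sketch).
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    return data + rng.normal(0.0, scale * data.std(axis=0), size=data.shape)
```

An exact copy of a feature has a Hellinger distance of zero from the original, which is why the score alone rewards replicas. A lightly perturbed copy scores slightly worse, yet it can contain values beyond the observed minimum and maximum.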

Read the full article with Python implementation, here.

Written by Vincent Granville

Founder, MLtechniques.com. Machine learning scientist. Co-founder of Data Science Central (acquired by TechTarget).
