Generated Data vs Monte-Carlo Simulations: What are the Differences?

Vincent Granville
2 min read · Aug 1, 2023

I sometimes get asked this question: could you use simulations instead of synthetizations? Below is my answer, which also covers some particular aspects of data synthetization that set it apart from other techniques.

Simulations do not simulate joint distributions

Sure, if all your features behave like a mixture of multivariate normal distributions, you can use GMMs (Gaussian mixture models) for synthetization. This is akin to Monte-Carlo simulation. The parameters of the mixture (the mean vector and covariance matrix attached to each Gaussian distribution, one per cluster, and the mixture proportions) can be estimated using the EM algorithm. The number of clusters is usually chosen separately, for instance by comparing fits with a criterion such as BIC. The method is subject to model identifiability issues, but it will work.
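To make the GMM approach concrete, here is a minimal sketch using scikit-learn's `GaussianMixture`. The toy dataset, the range of candidate cluster counts, and the use of BIC to pick among them are assumptions made for illustration; they are not taken from the full article.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Toy "real" data: two Gaussian clusters in 2D (made up for illustration).
real = np.vstack([
    rng.multivariate_normal([0, 0], [[1.0, 0.6], [0.6, 1.0]], size=600),
    rng.multivariate_normal([5, 5], [[1.0, -0.4], [-0.4, 1.0]], size=400),
])

# For a fixed number of clusters, EM estimates the means, covariance
# matrices and mixture proportions. The number of clusters itself is
# picked here by minimizing BIC over a few candidates (an assumption).
candidates = [
    GaussianMixture(n_components=k, covariance_type="full", random_state=0).fit(real)
    for k in range(1, 5)
]
best = min(candidates, key=lambda model: model.bic(real))

# Synthetization step: draw new rows from the fitted mixture.
synthetic, cluster_labels = best.sample(1000)

print("clusters selected:", best.n_components)
print("mixture proportions:", np.round(best.weights_, 3))
```

Sampling from the fitted mixture is exactly a Monte-Carlo simulation of the estimated model, which is why the two approaches coincide in this special case.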

If the interdependence structure among the features is essentially linear, in other words well captured by the correlation matrix, you can decorrelate the features using a linear transform such as PCA to remove cross-correlations, then sample each feature separately using standard simulation techniques, and finally apply the inverse transform to add the correlations back. This is similar to what the copula method accomplishes. Each decorrelated feature can be modeled using a parametric metalog distribution, which is flexible enough to fit a wide variety of shapes, akin to Monte-Carlo simulations.
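The decorrelate, sample, and recorrelate steps can be sketched with NumPy alone. Note the assumptions: the toy data is made up, PCA is done by hand via an eigendecomposition of the covariance matrix, and each decorrelated feature is sampled from its empirical quantile function as a simple stand-in for a fitted metalog distribution.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy "real" data: 3 non-Gaussian features with linear dependencies
# (made up for illustration).
latent = rng.exponential(size=(2000, 3))
mixing = np.array([[1.0, 0.0, 0.0],
                   [0.8, 0.5, 0.0],
                   [-0.3, 0.4, 0.7]])
real = latent @ mixing.T

# Step 1: decorrelate with PCA (center, then rotate onto the
# eigenvectors of the covariance matrix).
mean = real.mean(axis=0)
cov = np.cov(real - mean, rowvar=False)
_, eigvecs = np.linalg.eigh(cov)
scores = (real - mean) @ eigvecs  # columns are uncorrelated

# Step 2: sample each decorrelated feature separately, using
# inverse-transform sampling from its empirical quantile function
# (a stand-in for a fitted metalog distribution).
n_synth = 2000
u = rng.uniform(size=(n_synth, scores.shape[1]))
synth_scores = np.column_stack([
    np.quantile(scores[:, j], u[:, j]) for j in range(scores.shape[1])
])

# Step 3: apply the inverse transform to put the correlations back.
synthetic = synth_scores @ eigvecs.T + mean

print(np.round(np.corrcoef(real, rowvar=False), 2))
print(np.round(np.corrcoef(synthetic, rowvar=False), 2))
```

The two printed correlation matrices should be close. Keep in mind that PCA only removes linear dependence: the decorrelated features are uncorrelated but not necessarily independent, so any nonlinear structure in the real data is lost when they are sampled separately.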

Read the full article here, including my answers to the following questions:

  • Dealing with a mix of categorical, ordinal, and continuous features
  • Do Gaussian copulas work on non-Gaussian observations?
  • My simulations do as well as synthetizations, how so?
  • Sensitivity to changes in the real data

Written by Vincent Granville

Founder, MLtechniques.com. Machine learning scientist. Co-founder of Data Science Central (acquired by Tech Target).
