Generating Synthetic Data using Synthetic Data Vault (SDV)

Anuj Agarwal
3 min read · Mar 7, 2023

The Synthetic Data Vault (SDV) project is an open-source library that provides a set of tools for generating synthetic data. Here are the general steps to generate synthetic data using SDV:

  1. Install SDV: Start by installing the SDV package. You can do this using pip, the Python package manager. Simply run the command “pip install sdv” in your command line or terminal.
  2. Load your dataset: Once you have SDV installed, you’ll need to load your dataset. SDV models work on tabular data held in memory, typically a pandas DataFrame, so load the data with pandas first, for example from a CSV file, an Excel sheet, or a SQL query.
  3. Choose a model: SDV provides several models for generating synthetic data. The most commonly used models are the Gaussian Copula Model and the CTGAN model. Choose the model that best fits your data.
  4. Train the model: Once you’ve chosen your model, you’ll need to train it on your dataset. SDV provides a simple API for training models. Simply call the “fit” method on your model, passing in your dataset.
  5. Generate synthetic data: After training your model, you can generate synthetic data using the “sample” method. This method generates a specified number of samples that are similar to your original dataset.
  6. Validate the synthetic data: Once you’ve generated your synthetic data, you’ll need to validate it to ensure that it’s accurate and representative of your original data. SDV provides several tools for validating synthetic data, including goodness-of-fit tests and visualizations.
  7. Save the synthetic data: Finally, you’ll need to save your synthetic data in a format that you can use. The sampled data comes back as a pandas DataFrame, so you can write it out with pandas, for example to CSV, Excel, or a SQL database.

These are the general steps for generating synthetic data using the Synthetic Data Vault project. Remember to choose the appropriate model for your dataset, validate the synthetic data, and save it in a format that you can use.
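To make the validation step more concrete, here is a minimal sketch of one common goodness-of-fit check, the two-sample Kolmogorov-Smirnov (KS) statistic, written with only the Python standard library. It does not use SDV's own evaluation tools; the function name ks_statistic and the toy "real", "good", and "bad" samples are assumptions made up for illustration. A value near 0 means the two columns have similar distributions, and a value near 1 means they barely overlap.

```python
import bisect
import random


def ks_statistic(real, synthetic):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    real_sorted = sorted(real)
    syn_sorted = sorted(synthetic)
    max_gap = 0.0
    # The largest gap always occurs at one of the observed values.
    for x in real_sorted + syn_sorted:
        cdf_real = bisect.bisect_right(real_sorted, x) / len(real_sorted)
        cdf_syn = bisect.bisect_right(syn_sorted, x) / len(syn_sorted)
        max_gap = max(max_gap, abs(cdf_real - cdf_syn))
    return max_gap


# Toy data (made up): one "real" column and two candidate synthetic columns.
rng = random.Random(0)
real_ages = [rng.gauss(50, 10) for _ in range(1000)]
good_synthetic = [rng.gauss(50, 10) for _ in range(1000)]  # same distribution
bad_synthetic = [rng.gauss(70, 10) for _ in range(1000)]   # shifted distribution

ks_good = ks_statistic(real_ages, good_synthetic)
ks_bad = ks_statistic(real_ages, bad_synthetic)
print(f"KS (good synthetic): {ks_good:.3f}")
print(f"KS (bad synthetic):  {ks_bad:.3f}")
```

In practice you would run a check like this per numeric column and look more closely at any column whose statistic is noticeably larger than the others.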

Synthetic Data Vault (SDV) maintains data relationships by using generative models that learn the underlying probability distributions of the data. These models can capture complex relationships between features, such as correlations between variables or conditional dependencies.

For example, if you have a dataset with two columns, “age” and “income”, the Gaussian Copula Model used by SDV can learn the joint distribution of these variables, including any correlations between them. When generating synthetic data, SDV can then use this learned distribution to generate new age and income values that maintain the correlation between the two variables.
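The idea behind a Gaussian copula can be sketched in a few lines of standard-library Python. This is an illustrative toy and not SDV's implementation: the function names (to_normal_scores, fit_copula, sample_copula), the two-column restriction, and the made-up age and income data are all assumptions. The sketch maps each column to normal scores through its ranks, measures the correlation of those scores, then samples correlated normal values and maps them back to the original value ranges.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def to_normal_scores(values):
    """Map each value to a standard-normal score based on its rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        # rank / (n + 1) keeps the probability strictly inside (0, 1)
        scores[idx] = STD_NORMAL.inv_cdf(rank / (n + 1))
    return scores


def fit_copula(ages, incomes):
    """Learn the correlation of the normal scores and keep the sorted columns."""
    rho = pearson(to_normal_scores(ages), to_normal_scores(incomes))
    return {"rho": rho, "ages": sorted(ages), "incomes": sorted(incomes)}


def empirical_quantile(sorted_values, u):
    """Pick the observed value at probability u of the empirical distribution."""
    idx = min(int(u * len(sorted_values)), len(sorted_values) - 1)
    return sorted_values[idx]


def sample_copula(model, num_rows, rng):
    """Draw correlated normal pairs and map them back to observed values."""
    rho = model["rho"]
    noise_scale = max(0.0, 1 - rho ** 2) ** 0.5
    rows = []
    for _ in range(num_rows):
        z1 = rng.gauss(0, 1)
        z2 = rho * z1 + noise_scale * rng.gauss(0, 1)
        age = empirical_quantile(model["ages"], STD_NORMAL.cdf(z1))
        income = empirical_quantile(model["incomes"], STD_NORMAL.cdf(z2))
        rows.append((age, income))
    return rows


# Toy data (made up): income rises with age, plus noise.
rng = random.Random(0)
ages = [rng.uniform(20, 65) for _ in range(500)]
incomes = [800 * a + rng.gauss(0, 5000) for a in ages]

model = fit_copula(ages, incomes)
synthetic = sample_copula(model, 1000, rng)
syn_ages = [row[0] for row in synthetic]
syn_incomes = [row[1] for row in synthetic]
print(f"real correlation:      {pearson(ages, incomes):.2f}")
print(f"synthetic correlation: {pearson(syn_ages, syn_incomes):.2f}")
```

Because the synthetic values are drawn from each column's own observed values, the per-column distributions are preserved, while the shared normal scores carry the correlation between the two columns.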

In terms of user control, SDV provides several ways for users to customize the generation of synthetic data. For example, users can specify which columns or features to generate synthetic data for, and which columns to leave unchanged. Users can also set constraints on the data generation process, such as specifying minimum and maximum values for certain features.
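SDV has its own constraints API, and its details vary between versions, so it is not reproduced here. The sketch below only illustrates the general idea of enforcing minimum and maximum values by rejecting generated rows that fall outside the allowed ranges. The sample_rows function is a hypothetical stand-in for a fitted model, and the column names and ranges are assumptions.

```python
import random


def sample_rows(num_rows, rng):
    """Hypothetical stand-in for a fitted model's sampling step."""
    return [
        {"age": round(rng.gauss(40, 15)), "income": round(rng.gauss(50000, 20000))}
        for _ in range(num_rows)
    ]


def satisfies_ranges(row, ranges):
    """True if every constrained column lies inside its (minimum, maximum) range."""
    return all(low <= row[col] <= high for col, (low, high) in ranges.items())


def sample_with_ranges(num_rows, ranges, rng, max_batches=100):
    """Keep sampling batches and discard invalid rows until enough rows are valid."""
    accepted = []
    for _ in range(max_batches):
        batch = sample_rows(num_rows, rng)
        accepted.extend(row for row in batch if satisfies_ranges(row, ranges))
        if len(accepted) >= num_rows:
            return accepted[:num_rows]
    raise RuntimeError("Not enough valid rows; the constraints may be too strict")


rng = random.Random(42)
ranges = {"age": (18, 90), "income": (0, 200000)}  # assumed business rules
constrained = sample_with_ranges(1000, ranges, rng)
print(constrained[:3])
```

Rejecting rows keeps the remaining data consistent with what the model learned, but it becomes slow when the constraints exclude most of the generated rows, which is why the sketch caps the number of batches.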

You can also control which features are used to generate synthetic data by selecting a subset of columns before fitting the model. This can be useful when dealing with high-dimensional datasets where not all features are relevant to the analysis.

In short, SDV uses generative models to maintain data relationships and gives users several ways to customize the generation of synthetic data, including feature selection and constraints on the data generation process.

Sample code using Synthetic Data Vault (SDV) to generate synthetic data for an e-commerce company with user details, order details, and payment details:

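import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

# Load the e-commerce dataset
ecommerce_data = pd.read_csv('ecommerce_data.csv')

# Detect the column types from the DataFrame
# (assumes the SDV 1.x single-table API)
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(ecommerce_data)

# Create a Gaussian copula synthesizer from the metadata
synthesizer = GaussianCopulaSynthesizer(metadata)

# Fit the model to the data
synthesizer.fit(ecommerce_data)

# Generate 1000 new rows of data
synthetic_data = synthesizer.sample(num_rows=1000)

# Save the synthetic data to a new CSV file
synthetic_data.to_csv('synthetic_ecommerce_data.csv', index=False)

In this code, we first load the e-commerce dataset from a CSV file using pandas. We then let SDV detect the column types into a SingleTableMetadata object, create a GaussianCopulaSynthesizer from that metadata, and fit it to the data using the fit() method. Finally, we generate 1000 new rows of data using the sample() method and save the resulting synthetic data to a new CSV file using the to_csv() method. This assumes the SDV 1.x single-table API; class names differ in other releases, so check the documentation for the version you have installed.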

The ecommerce_data.csv file used in this example includes user details, order details, and payment details for an e-commerce company, with each row representing a single order transaction. The user_id, order_id, and payment_id columns are unique identifiers for each user, order, and payment respectively.
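As a rough illustration of that layout, the sketch below parses a tiny in-memory CSV and checks the identifier columns. Only user_id, order_id, and payment_id come from the description above; the other columns (order_total, payment_method) and every value are made up for illustration and are not the original file.

```python
import csv
import io

# Hypothetical sample rows: only the three *_id column names come from the
# article; the remaining columns and all values are invented.
SAMPLE_CSV = """user_id,order_id,payment_id,order_total,payment_method
U001,O1001,P5001,59.99,credit_card
U002,O1002,P5002,120.50,paypal
U001,O1003,P5003,15.00,credit_card
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# Each row is one order transaction, so order and payment IDs should not repeat.
order_ids = [row["order_id"] for row in rows]
payment_ids = [row["payment_id"] for row in rows]
orders_unique = len(set(order_ids)) == len(order_ids)
payments_unique = len(set(payment_ids)) == len(payment_ids)

# A user can place several orders, so user_id may appear on more than one row.
distinct_users = {row["user_id"] for row in rows}

print(f"{len(rows)} orders from {len(distinct_users)} users")
```

A quick check like this before fitting can catch duplicated identifiers early.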

Written by Anuj Agarwal

Director - Technology at Natwest. Product Manager and Technologist who loves to solve problems with innovative technological solutions.
