Synthetic data: Challenges and Sample code in Python

Mar 7, 2023

Synthetic data is artificially generated data that is designed to resemble real-world data while protecting the privacy of individuals. Synthetic data can be generated using various methods, such as generative models, rule-based algorithms, or a combination of both.

Synthetic data is important for several reasons:

Privacy: Synthetic data can help protect the privacy of individuals by replacing their personal information with fake but realistic data. This allows organizations to share or publish data without compromising the privacy of individuals.
Cost and accessibility: Generating synthetic data is often less expensive and more accessible than collecting real data. Synthetic data can be generated on demand, allowing organizations to test and validate algorithms, models, and applications without the need for real data.
Diversity and coverage: Synthetic data can provide greater diversity and coverage of the data space than real data. Synthetic data can simulate scenarios that are difficult or impossible to observe in real-world data, allowing organizations to explore new use cases and applications.
Accuracy: Synthetic data can be used to improve the accuracy and quality of real-world data. Synthetic data can be used to augment real data by filling gaps or correcting errors, providing a more complete and accurate view of the data.

Synthetic data is an important tool for organizations that need access to realistic data but cannot use real data for privacy or other reasons. By generating synthetic data, organizations can develop and test algorithms and models, conduct simulations and experiments, and gain insights into complex data spaces.

How to generate synthetic data:

Generating synthetic data involves creating artificial data points that mimic the statistical properties of real-world data. Here are some common methods for generating synthetic data:

Simulation: Use a mathematical model or algorithm to simulate data based on the known properties of the real data.
Sampling: Randomly sample from the existing data set to create a new data set that has similar characteristics.
Transformation: Transform the existing data set by applying mathematical operations such as scaling, rotation, or skewing to generate a new data set.
Generative Adversarial Networks (GANs): Use a deep learning technique that involves two neural networks: a generator that produces synthetic data and a discriminator that evaluates whether a given sample is real or generated. The generator learns from the discriminator's feedback and gradually produces more realistic synthetic data.
Variational Autoencoders (VAEs): Like GANs, VAEs use deep learning to generate synthetic data. A VAE is trained to encode real data into a lower-dimensional space and then decode it back to its original form. Synthetic data is then generated by sampling points from the lower-dimensional space and decoding them.
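The three simpler methods above (simulation, sampling, and transformation) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the `real` array is a stand-in for a real numeric column, the normal distribution is an assumed model, and the scaling factor and noise level are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # seeded so the sketch is repeatable

# Stand-in for a real numeric column (hypothetical order totals)
real = rng.normal(loc=100.0, scale=20.0, size=1_000)

# Simulation: draw new values from a model fitted to the real data
# (assumes a normal distribution is a reasonable fit)
simulated = rng.normal(loc=real.mean(), scale=real.std(), size=real.size)

# Sampling: resample the real values with replacement (bootstrap)
resampled = rng.choice(real, size=real.size, replace=True)

# Transformation: rescale the real values and add small random noise
transformed = real * 1.05 + rng.normal(loc=0.0, scale=1.0, size=real.size)

for label, values in [("simulated", simulated),
                      ("resampled", resampled),
                      ("transformed", transformed)]:
    print(f"{label}: mean={values.mean():.1f}, std={values.std():.1f}")
```

Note that plain resampling reuses real values unchanged, so on its own it does not protect privacy; it is shown here only to contrast the three approaches.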

When generating synthetic data, it is important to ensure that the statistical properties of the synthetic data match those of the real data as closely as possible, so that the synthetic data can be used effectively for tasks such as testing machine learning algorithms or preserving privacy of sensitive data.
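One rough way to check how closely a synthetic column matches a real one is to compare summary statistics. The helper below, `compare_columns`, is a hypothetical name introduced for this sketch, and the choice of mean, standard deviation, and a few quantiles is an assumption; a fuller evaluation would also look at correlations between columns and at formal distribution tests.

```python
import numpy as np

def compare_columns(real, synthetic, quantiles=(0.05, 0.25, 0.5, 0.75, 0.95)):
    """Return absolute differences in mean, std, and selected quantiles."""
    real = np.asarray(real, dtype=float)
    synthetic = np.asarray(synthetic, dtype=float)
    report = {
        "mean_diff": abs(real.mean() - synthetic.mean()),
        "std_diff": abs(real.std() - synthetic.std()),
    }
    for q in quantiles:
        key = f"q{round(q * 100)}_diff"
        report[key] = abs(np.quantile(real, q) - np.quantile(synthetic, q))
    return report

rng = np.random.default_rng(seed=0)
real = rng.normal(100, 20, size=5_000)   # stand-in for a real column
good = rng.normal(100, 20, size=5_000)   # synthetic data from a matching model
bad = rng.uniform(0, 200, size=5_000)    # synthetic data from a poor model

print(compare_columns(real, good))
print(compare_columns(real, bad))
```

Small differences across all entries suggest the synthetic column follows the real distribution; the poorly matched column shows a much larger gap in standard deviation and in the outer quantiles.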

Sample code that generates fully synthetic customer, order, and payment data with Faker:

import pandas as pd
import numpy as np
from faker import Faker

# Set up Faker for generating fake data
fake = Faker()

# Generate customer data
num_customers = 1000
customers = pd.DataFrame({
    'customer_id': np.arange(num_customers),
    'name': [fake.name() for i in range(num_customers)],
    'email': [fake.email() for i in range(num_customers)],
    'address': [fake.address() for i in range(num_customers)]
})

# Generate orders data
num_orders = 5000
order_dates = pd.date_range(start='2022-01-01', end='2022-12-31', freq='D')
orders = pd.DataFrame({
    'order_id': np.arange(num_orders),
    'customer_id': np.random.choice(np.arange(num_customers), size=num_orders),
    'order_date': np.random.choice(order_dates, size=num_orders),
    'order_total': np.random.normal(loc=100, scale=20, size=num_orders)
})

# Generate payment data: one payment per order, built in a vectorized way
# (DataFrame.append was removed in pandas 2.0, so rows are not appended in a loop)
payment_delays = pd.to_timedelta(np.random.randint(1, 31, size=num_orders), unit='D')  # 1 to 30 days
payments = pd.DataFrame({
    'payment_id': np.arange(num_orders),
    'order_id': orders['order_id'],
    'payment_date': orders['order_date'] + payment_delays,
    'payment_amount': orders['order_total'] * np.random.uniform(0.8, 1.2, size=num_orders),
    'payment_method': [fake.credit_card_provider() for _ in range(num_orders)]
})

# Join customer, order, and payment data
customer_order = customers.merge(orders, on='customer_id')
customer_order_payment = customer_order.merge(payments, on='order_id')

This code generates synthetic data that preserves the relationships between customer, order, and payment data.

Another example: generating synthetic order and payment data for real customers.


import pandas as pd
import numpy as np
from faker import Faker

# Load real customer data
customer_data = pd.read_csv('customer_data.csv')

# Set up Faker for generating fake data
fake = Faker()

# Generate user data
num_users = len(customer_data)
users = pd.DataFrame({
    'user_id': np.arange(num_users),
    'username': [fake.user_name() for i in range(num_users)],
    'password': [fake.password() for i in range(num_users)],
    'email': customer_data['email'],
    'name': customer_data['name'],
    'address': customer_data['address']
})

# Generate order data
num_orders = 10000
order_dates = pd.date_range(start='2022-01-01', end='2022-12-31', freq='D')
orders = pd.DataFrame({
    'order_id': np.arange(num_orders),
    'user_id': np.random.choice(np.arange(num_users), size=num_orders),
    'order_date': np.random.choice(order_dates, size=num_orders),
    'order_total': np.random.normal(loc=100, scale=20, size=num_orders)
})

# Generate payment data: one payment per order, built in a vectorized way
# (DataFrame.append was removed in pandas 2.0, so rows are not appended in a loop)
payment_delays = pd.to_timedelta(np.random.randint(1, 31, size=num_orders), unit='D')  # 1 to 30 days
payments = pd.DataFrame({
    'payment_id': np.arange(num_orders),
    'order_id': orders['order_id'],
    'payment_date': orders['order_date'] + payment_delays,
    'payment_amount': orders['order_total'] * np.random.uniform(0.8, 1.2, size=num_orders),
    'payment_method': [fake.credit_card_provider() for _ in range(num_orders)]
})

# Join user, order, and payment data
user_order = users.merge(orders, on='user_id')
user_order_payment = user_order.merge(payments, on='order_id')

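In this code, we first load the real customer data from a CSV file. We then build the user table by keeping the email, name, and address from the real customer data and using Faker to generate fake usernames and passwords. Because the real email, name, and address are copied unchanged, this table still contains personal information and should be handled like the real data.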

Next, we generate synthetic order data by randomly assigning user IDs to each order and randomly selecting order dates and order totals. We then generate synthetic payment data by iterating over each order and generating a payment record that corresponds to that order. We use a random offset between 1 and 30 days to simulate the delay between the order and the payment.

Finally, we join the user, order, and payment data to create a data frame that contains all three entities. This ensures that the relationships between these entities are preserved in the synthetic data.
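Whether the relationships really are preserved can be checked directly. The sketch below uses tiny hand-written tables as stand-ins for the generated ones (the values are hypothetical) and relies on the `validate` argument of pandas `merge`, which raises an error when the expected cardinality does not hold.

```python
import pandas as pd

# Tiny hand-written tables standing in for the generated ones (hypothetical values)
users = pd.DataFrame({"user_id": [0, 1, 2]})
orders = pd.DataFrame({"order_id": [10, 11, 12, 13], "user_id": [0, 0, 2, 1]})
payments = pd.DataFrame({"payment_id": [0, 1, 2, 3], "order_id": [10, 11, 12, 13]})

# Every order must reference an existing user
orphan_orders = orders[~orders["user_id"].isin(users["user_id"])]

# Every payment must reference an existing order
orphan_payments = payments[~payments["order_id"].isin(orders["order_id"])]

# validate= makes merge raise if the expected cardinality does not hold:
# one user to many orders, and exactly one payment per order
user_order = users.merge(orders, on="user_id", validate="one_to_many")
joined = user_order.merge(payments, on="order_id", validate="one_to_one")

print(len(orphan_orders), len(orphan_payments), len(joined))
```

If both orphan counts are zero and the joined table has as many rows as there are orders, every order and payment is linked to a valid parent record.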

Written by Anuj Agarwal
