Lecture 2

Presenter

Name: Rajshri Jain
Topic: Synthetic Data Generation and Model Evaluation
Description: Rajshri Jain explores the full lifecycle of synthetic data in modern AI systems, from generation techniques to rigorous evaluation frameworks. The session covers why synthetic data is increasingly critical for privacy-preserving learning, data augmentation, and testing rare or edge-case scenarios, as well as practical methods for creating high-quality synthetic datasets using generative models, rule-based approaches, and hybrid techniques. Beyond generation, Rajshri focuses on how to evaluate synthetic data effectively—measuring fidelity, diversity, bias, and downstream model utility—to ensure that synthetic datasets improve, rather than degrade, real-world model performance. Attendees will learn how to integrate synthetic data and evaluation pipelines into production ML workflows to accelerate iteration while maintaining trust and reliability.

When and why to use synthetic data for privacy, data scarcity, and coverage of rare or critical edge cases.
Practical approaches for generating synthetic datasets using generative, rule-based, and hybrid methods.
How to evaluate synthetic data using metrics that capture fidelity, diversity, bias, and impact on downstream model performance.