Ramblings - Dataset Distillation

What is Dataset Distillation

Dataset Distillation, also called Dataset Condensation in much of the later literature, studies the following problem:

Can we replace a large real training set with a much smaller synthetic set such that training on the small set gives nearly the same downstream performance?

The synthetic dataset is not restricted to being a subset of the original data. The inputs, labels, and sometimes even training schedules can be optimized directly. This makes dataset distillation more general than coreset selection or subset selection.

At a high level, the goal is to compress the task-relevant information in a dataset rather than merely compressing pixels or storing a few representative examples.

Paper map last expanded: 2026-04-26.

Core entry points:

Broad Formal Problem Setup

Let the real dataset be

$$\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$$

and let the synthetic distilled dataset be

$$\mathcal{S} = \{(\tilde{x}_j, \tilde{y}_j)\}_{j=1}^{M}, \qquad M \ll N.$$

Let a learning algorithm train model parameters $\theta$ from initialization $\theta_0$ using some optimizer, augmentation policy, and training schedule. The broad dataset distillation objective is:

$$\min_{\mathcal{S}} \; \mathbb{E}_{\theta_0 \sim p(\theta_0)} \left[ \mathcal{L}_{\mathrm{test}}\!\left( \theta(\mathcal{S}; \theta_0) \right) \right]$$

where $\theta(\mathcal{S}; \theta_0)$ denotes the parameters obtained after training on the synthetic set from initialization $\theta_0$.
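
A minimal sketch of this bilevel loop in PyTorch, assuming a toy MLP, random stand-in data, and a short unrolled inner loop (shapes, step counts, and learning rates are illustrative, not taken from any particular paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.func import functional_call

N_CLASSES, DIM, M = 10, 64, 100  # illustrative task size

# The synthetic inputs are the optimization variables; labels are fixed here for simplicity.
syn_x = torch.randn(M, DIM, requires_grad=True)
syn_y = torch.arange(M) % N_CLASSES

def make_model():
    return nn.Sequential(nn.Linear(DIM, 128), nn.ReLU(), nn.Linear(128, N_CLASSES))

def inner_train(model, x, y, steps=5, lr=0.1):
    """Unrolled SGD on the synthetic set, kept differentiable w.r.t. syn_x."""
    params = {k: v.detach().clone().requires_grad_(True) for k, v in model.named_parameters()}
    for _ in range(steps):
        loss = F.cross_entropy(functional_call(model, params, (x,)), y)
        grads = torch.autograd.grad(loss, list(params.values()), create_graph=True)
        params = {k: v - lr * g for (k, v), g in zip(params.items(), grads)}
    return params

outer_opt = torch.optim.Adam([syn_x], lr=1e-2)
for outer_step in range(1000):
    model = make_model()                          # sample a fresh initialization theta_0
    real_x = torch.randn(256, DIM)                # stand-in for a real-data batch
    real_y = torch.randint(0, N_CLASSES, (256,))

    trained = inner_train(model, syn_x, syn_y)    # theta(S; theta_0)
    outer_loss = F.cross_entropy(functional_call(model, trained, (real_x,)), real_y)

    outer_opt.zero_grad()
    outer_loss.backward()                         # backprop through the unrolled inner loop
    outer_opt.step()
```

The expensive part is backpropagating through the unrolled inner loop; the later matching-based families exist largely to avoid exactly this.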

Papers that are especially useful for formalizing the objective:

Why should we care?

Dataset distillation matters for at least three reasons.

First, it improves efficiency. If a tiny synthetic dataset can stand in for a large real dataset, then training, ablations, architecture search, and hyperparameter sweeps become much cheaper.

Second, it is useful in communication-constrained, storage-constrained, privacy-sensitive, or continual-learning settings where replay buffers or dataset transmission matter.

Third, it is an intriguing scientific problem. A successful distilled dataset is evidence that the original dataset contains a large amount of redundancy. Distillation therefore gives a way to ask: what information does a model really need in order to learn a task? It is also a way of understanding which aspects of the learning dynamics are worth preserving.

Motivation and use-case papers:

The Evolution of the Field

Origin - The Original Bilevel Optimization Approach

Introduced in [@wangDatasetDistillation2020a], which frames distillation as bilevel optimization: an inner loop trains a model on the synthetic set, and an outer loop backpropagates through that unrolled training to update the synthetic examples.

Key papers:

Gradient Matching

Gradient matching optimizes the synthetic set so that the gradients it induces stay close to those induced by real data. It is the bridge between the original bilevel formulation and more scalable matching objectives.
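
A hedged sketch of one gradient-matching step, using a per-parameter cosine distance between the two gradient sets (the distance choice and the single-batch setup are illustrative assumptions; `model`, `real_x`, `syn_x`, etc. are placeholders):

```python
import torch
import torch.nn.functional as F

def gradient_matching_loss(model, real_x, real_y, syn_x, syn_y):
    """Distance between the gradients induced by a real batch and by the synthetic set."""
    params = [p for p in model.parameters() if p.requires_grad]

    # Real-data gradients act as fixed targets.
    g_real = torch.autograd.grad(F.cross_entropy(model(real_x), real_y), params)
    g_real = [g.detach() for g in g_real]

    # Synthetic-data gradients keep the graph so the loss can be minimized w.r.t. syn_x.
    g_syn = torch.autograd.grad(F.cross_entropy(model(syn_x), syn_y), params, create_graph=True)

    # Sum of per-parameter cosine distances.
    loss = 0.0
    for gr, gs in zip(g_real, g_syn):
        loss = loss + 1.0 - F.cosine_similarity(gr.flatten(), gs.flatten(), dim=0)
    return loss
```

Typically this loss is minimized with respect to the synthetic inputs while the network is periodically re-initialized or briefly trained on the synthetic set, so the match holds over a distribution of parameters rather than a single snapshot.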

Distribution Matching

Distribution matching avoids differentiating through long training trajectories by matching statistics or distributions in learned feature spaces. This family is usually cheaper and often more scalable, but its quality depends heavily on the embedding space and diversity constraints.
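
A minimal sketch of the idea, matching per-class mean embeddings under some feature extractor `embed` (the mean-only statistic and the single extractor are simplifying assumptions; methods in this family typically sample many random or partially trained embedders and may match richer statistics):

```python
import torch

def distribution_matching_loss(embed, real_x, real_y, syn_x, syn_y, n_classes):
    """Squared distance between per-class mean embeddings of real and synthetic data."""
    f_real = embed(real_x).detach()   # real features are fixed targets
    f_syn = embed(syn_x)              # keeps the graph back to syn_x

    loss = 0.0
    for c in range(n_classes):
        r_mask, s_mask = real_y == c, syn_y == c
        if r_mask.any() and s_mask.any():
            loss = loss + ((f_real[r_mask].mean(0) - f_syn[s_mask].mean(0)) ** 2).sum()
    return loss
```

Because nothing is unrolled, each update needs only forward passes, which is the main source of the scalability advantage mentioned above.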

Kernel and other theoretically grounded methods

Kernel methods are useful because they give a more analyzable version of DD: the model-training map is replaced by kernel ridge regression, often under a neural tangent kernel.
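
A hedged sketch of the kernel view: the inner training problem is solved in closed form by kernel ridge regression, so the outer objective becomes an ordinary differentiable function of the synthetic set (the RBF kernel below is a purely illustrative stand-in for a neural tangent kernel):

```python
import torch

def rbf_kernel(a, b, gamma=0.1):
    # (n, d) x (m, d) -> (n, m); stand-in for an NTK or other model-induced kernel
    return torch.exp(-gamma * torch.cdist(a, b) ** 2)

def krr_distillation_loss(syn_x, syn_y_onehot, real_x, real_y_onehot, reg=1e-3):
    """Fit kernel ridge regression on the synthetic set in closed form, evaluate on real data."""
    k_ss = rbf_kernel(syn_x, syn_x)
    k_rs = rbf_kernel(real_x, syn_x)
    # alpha = (K_ss + reg * I)^{-1} Y_syn  -- the "inner training" in closed form
    alpha = torch.linalg.solve(k_ss + reg * torch.eye(len(syn_x)), syn_y_onehot)
    preds = k_rs @ alpha
    return ((preds - real_y_onehot) ** 2).mean()
```

Minimizing this with respect to `syn_x` (and optionally the synthetic labels) avoids any unrolled training, which is part of what makes this family comparatively easy to analyze.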

Trajectory Matching

Trajectory matching asks the synthetic data to reproduce where real-data training moves the model parameters, usually using expert trajectories precomputed on the full data.
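
A minimal sketch, assuming expert parameter checkpoints `expert_start` and `expert_end` (dicts keyed like `model.named_parameters()`) saved while training on the real data: the student takes a few differentiable steps on the synthetic set from the earlier checkpoint and is penalized for missing the later one. The normalization and step counts only loosely follow the usual recipe and should be read as assumptions:

```python
import torch
import torch.nn.functional as F
from torch.func import functional_call

def trajectory_matching_loss(model, expert_start, expert_end, syn_x, syn_y,
                             student_steps=10, lr=0.01):
    """Match the parameter displacement of a short synthetic-data rollout
    to the displacement of an expert trajectory trained on real data."""
    # Start the student at the expert's earlier checkpoint.
    params = {k: v.detach().clone().requires_grad_(True) for k, v in expert_start.items()}

    for _ in range(student_steps):
        loss = F.cross_entropy(functional_call(model, params, (syn_x,)), syn_y)
        grads = torch.autograd.grad(loss, list(params.values()), create_graph=True)
        params = {k: v - lr * g for (k, v), g in zip(params.items(), grads)}

    # Squared distance to the expert's later checkpoint, normalized by how far the expert moved.
    num = sum(((params[k] - expert_end[k]) ** 2).sum() for k in params)
    den = sum(((expert_start[k] - expert_end[k]) ** 2).sum() for k in params) + 1e-8
    return num / den
```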

Synthetic Data Parameterization - Latent Space, Generative Priors

This line asks what the synthetic object should be. Instead of raw pixels, DD can optimize labels, latent codes, bases/factors, neural fields, generators, diffusion priors, or quantized representations.
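
As a small illustration of the parameterization question, a sketch in which the optimized object is a set of latent codes pushed through a frozen decoder rather than raw pixels (the decoder below is an untrained stand-in for a pretrained generator; any of the matching objectives above can consume the decoded images):

```python
import torch
import torch.nn as nn

LATENT_DIM, M = 32, 100  # illustrative sizes

# Frozen decoder standing in for a pretrained generator (GAN, VAE, or diffusion decoder).
decoder = nn.Sequential(nn.Linear(LATENT_DIM, 256), nn.ReLU(), nn.Linear(256, 3 * 32 * 32))
for p in decoder.parameters():
    p.requires_grad_(False)

# The distilled "dataset" is now M latent codes -- far fewer parameters than M raw images.
latents = torch.randn(M, LATENT_DIM, requires_grad=True)
opt = torch.optim.Adam([latents], lr=1e-2)

def synthetic_images():
    # Whatever matching loss is used, its gradient flows through the frozen decoder
    # back to the latent codes.
    return decoder(latents).view(M, 3, 32, 32)
```

The generator acts as an image prior and shrinks the number of optimized parameters; how much that helps depends on how well the generator's manifold covers the target task.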

Pixel/label/factor parameterizations:

Generative priors:

Moving beyond Image Classification - Different Tasks and Modalities

The original benchmark culture was small image classification. The field now includes language, graphs, video, time series, speech, medical imaging, federated/continual learning, RL, point clouds, object detection, retrieval, recommender systems, and multimodal settings.

Self-supervised, pretrained, and multimodal:

Language and text:

Graphs:

Video, time series, speech, and other structured data:

Applications beyond standard supervised classification:

Continual, federated, medical, and data-sharing contexts:

DD in Different Distributions - Long-Tailed, etc.

This cluster covers non-IID data, class imbalance, long-tailed labels, bias/fairness, privacy, robustness, noisy labels, domain shift, and adversarial/backdoor settings. These matter because a distilled set can amplify dataset/model biases even when average accuracy is high.

Long-tailed and imbalanced data:

Bias, fairness, calibration, and robustness:

Privacy, backdoors, and data leakage:

Noisy labels, domain shift, and non-IID data:

Understanding the behavior of Dataset Distillation

Does the Loss Landscape differ for models trained on Real Data vs Distilled Data?

Direct "loss landscape" papers are still sparse, but several lines provide indirect evidence: trajectory matching, curvature matching, flat/stable trajectory objectives, calibration, and architecture-transfer work.

What does Dataset Distillation Learn and Encode?

This is the interpretability/science question: does DD encode prototypes, discriminative shortcuts, training dynamics, class boundaries, model-specific priors, or teacher knowledge?

How far is DD explained theoretically?

Partially. Kernel and convexified settings are much better understood than deep, finite-width, augmentation-heavy, generative, or foundation-model settings. Existing theory explains fragments of DD: approximation, compression, kernels, quantization, and some offline-RL/supervised guarantees.

Open Problems, Research Questions and Future Direction

Open problem clusters and useful starting papers: