Ramblings - Dataset Distillation

What is Dataset Distillation

Dataset Distillation, also called Dataset Condensation in much of the later literature, studies the following problem:

Can we replace a large real training set with a much smaller synthetic set such that training on the small set gives nearly the same downstream performance?

The synthetic dataset is not restricted to being a subset of the original data. The inputs, labels, and sometimes even training schedules can be optimized directly. This makes dataset distillation more general than coreset selection or subset selection.

At a high level, the goal is to compress the task-relevant information in a dataset rather than merely compressing pixels or storing a few representative examples.

Paper map last expanded: 2026-04-26.

Core entry points:

Broad Formal Problem Setup

Let the real dataset be

$$\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$$

and let the synthetic distilled dataset be

$$\mathcal{S} = \{(\tilde{x}_j, \tilde{y}_j)\}_{j=1}^{M}, \qquad M \ll N.$$

Let a learning algorithm train model parameters $\theta$ from initialization $\theta_0$ using some optimizer, augmentation policy, and training schedule. The broad dataset distillation objective is:

$$\min_{\mathcal{S}} \; \mathbb{E}_{\theta_0 \sim p(\theta_0)} \left[ \mathcal{L}_{\mathrm{test}}\!\left( \theta(\mathcal{S}; \theta_0) \right) \right]$$

where $\theta(\mathcal{S}; \theta_0)$ denotes the parameters obtained after training on the synthetic set from initialization $\theta_0$.
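
A minimal sketch of this bilevel loop in PyTorch, assuming a toy MLP, random stand-in data, and a short unrolled inner loop (shapes, step counts, and learning rates are illustrative, not taken from any particular paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.func import functional_call

N_CLASSES, DIM, M = 10, 64, 100  # illustrative task size

# The synthetic inputs are the optimization variables; labels are fixed here for simplicity.
syn_x = torch.randn(M, DIM, requires_grad=True)
syn_y = torch.arange(M) % N_CLASSES

def make_model():
    return nn.Sequential(nn.Linear(DIM, 128), nn.ReLU(), nn.Linear(128, N_CLASSES))

def inner_train(model, x, y, steps=5, lr=0.1):
    """Unrolled SGD on the synthetic set, kept differentiable w.r.t. syn_x."""
    params = {k: v.detach().clone().requires_grad_(True) for k, v in model.named_parameters()}
    for _ in range(steps):
        loss = F.cross_entropy(functional_call(model, params, (x,)), y)
        grads = torch.autograd.grad(loss, list(params.values()), create_graph=True)
        params = {k: v - lr * g for (k, v), g in zip(params.items(), grads)}
    return params

outer_opt = torch.optim.Adam([syn_x], lr=1e-2)
for outer_step in range(1000):
    model = make_model()                          # sample a fresh initialization theta_0
    real_x = torch.randn(256, DIM)                # stand-in for a real-data batch
    real_y = torch.randint(0, N_CLASSES, (256,))

    trained = inner_train(model, syn_x, syn_y)    # theta(S; theta_0)
    outer_loss = F.cross_entropy(functional_call(model, trained, (real_x,)), real_y)

    outer_opt.zero_grad()
    outer_loss.backward()                         # backprop through the unrolled inner loop
    outer_opt.step()
```

The expensive part is backpropagating through the unrolled inner loop; the later matching-based families exist largely to avoid exactly this.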

Papers that are especially useful for formalizing the objective:

Why should we care?

Dataset distillation matters for at least three reasons.

First, it improves efficiency. If a tiny synthetic dataset can stand in for a large real dataset, then training, ablations, architecture search, and hyperparameter sweeps become much cheaper.

Second, it is useful in communication-constrained, storage-constrained, privacy-sensitive, or continual-learning settings where replay buffers or dataset transmission matter.

Third, it is an intriguing scientific problem. A successful distilled dataset is evidence that the original dataset contains a large amount of redundancy. Distillation therefore gives a way to ask: what information does a model really need in order to learn a task? It is also a way of understanding which aspects of the learning dynamics are worth preserving.

Motivation and use-case papers:

The Evolution of the Field

Origin - The Original Bilevel Optimization Approach

Introduced in [@wangDatasetDistillation2020a], which frames distillation as bilevel optimization: an inner loop trains a model on the synthetic set, and an outer loop backpropagates through that unrolled training to update the synthetic examples.

Key papers:

Gradient Matching

Gradient matching optimizes the synthetic set so that the gradients it induces stay close to those induced by real data. It is the bridge between the original bilevel formulation and more scalable matching objectives.
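
A hedged sketch of one gradient-matching step, using a per-parameter cosine distance between the two gradient sets (the distance choice and the single-batch setup are illustrative assumptions; `model`, `real_x`, `syn_x`, etc. are placeholders):

```python
import torch
import torch.nn.functional as F

def gradient_matching_loss(model, real_x, real_y, syn_x, syn_y):
    """Distance between the gradients induced by a real batch and by the synthetic set."""
    params = [p for p in model.parameters() if p.requires_grad]

    # Real-data gradients act as fixed targets.
    g_real = torch.autograd.grad(F.cross_entropy(model(real_x), real_y), params)
    g_real = [g.detach() for g in g_real]

    # Synthetic-data gradients keep the graph so the loss can be minimized w.r.t. syn_x.
    g_syn = torch.autograd.grad(F.cross_entropy(model(syn_x), syn_y), params, create_graph=True)

    # Sum of per-parameter cosine distances.
    loss = 0.0
    for gr, gs in zip(g_real, g_syn):
        loss = loss + 1.0 - F.cosine_similarity(gr.flatten(), gs.flatten(), dim=0)
    return loss
```

Typically this loss is minimized with respect to the synthetic inputs while the network is periodically re-initialized or briefly trained on the synthetic set, so the match holds over a distribution of parameters rather than a single snapshot.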

Distribution Matching

Distribution matching avoids differentiating through long training trajectories by matching statistics or distributions in learned feature spaces. This family is usually cheaper and often more scalable, but its quality depends heavily on the embedding space and diversity constraints.
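
A minimal sketch of the idea, matching per-class mean embeddings under some feature extractor `embed` (the mean-only statistic and the single extractor are simplifying assumptions; methods in this family typically sample many random or partially trained embedders and may match richer statistics):

```python
import torch

def distribution_matching_loss(embed, real_x, real_y, syn_x, syn_y, n_classes):
    """Squared distance between per-class mean embeddings of real and synthetic data."""
    f_real = embed(real_x).detach()   # real features are fixed targets
    f_syn = embed(syn_x)              # keeps the graph back to syn_x

    loss = 0.0
    for c in range(n_classes):
        r_mask, s_mask = real_y == c, syn_y == c
        if r_mask.any() and s_mask.any():
            loss = loss + ((f_real[r_mask].mean(0) - f_syn[s_mask].mean(0)) ** 2).sum()
    return loss
```

Because nothing is unrolled, each update needs only forward passes, which is the main source of the scalability advantage mentioned above.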

Kernel and other theoretically grounded methods

Kernel methods are useful because they give a more analyzable version of DD: the model-training map is replaced by kernel ridge regression, often under a neural tangent kernel.
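
A hedged sketch of the kernel view: the inner training problem is solved in closed form by kernel ridge regression, so the outer objective becomes an ordinary differentiable function of the synthetic set (the RBF kernel below is a purely illustrative stand-in for a neural tangent kernel):

```python
import torch

def rbf_kernel(a, b, gamma=0.1):
    # (n, d) x (m, d) -> (n, m); stand-in for an NTK or other model-induced kernel
    return torch.exp(-gamma * torch.cdist(a, b) ** 2)

def krr_distillation_loss(syn_x, syn_y_onehot, real_x, real_y_onehot, reg=1e-3):
    """Fit kernel ridge regression on the synthetic set in closed form, evaluate on real data."""
    k_ss = rbf_kernel(syn_x, syn_x)
    k_rs = rbf_kernel(real_x, syn_x)
    # alpha = (K_ss + reg * I)^{-1} Y_syn  -- the "inner training" in closed form
    alpha = torch.linalg.solve(k_ss + reg * torch.eye(len(syn_x)), syn_y_onehot)
    preds = k_rs @ alpha
    return ((preds - real_y_onehot) ** 2).mean()
```

Minimizing this with respect to `syn_x` (and optionally the synthetic labels) avoids any unrolled training, which is part of what makes this family comparatively easy to analyze.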

Trajectory Matching

Trajectory matching asks the synthetic data to reproduce where real-data training moves the model parameters, usually using expert trajectories precomputed on the full data.
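
A minimal sketch, assuming expert parameter checkpoints `expert_start` and `expert_end` (dicts keyed like `model.named_parameters()`) saved while training on the real data: the student takes a few differentiable steps on the synthetic set from the earlier checkpoint and is penalized for missing the later one. The normalization and step counts only loosely follow the usual recipe and should be read as assumptions:

```python
import torch
import torch.nn.functional as F
from torch.func import functional_call

def trajectory_matching_loss(model, expert_start, expert_end, syn_x, syn_y,
                             student_steps=10, lr=0.01):
    """Match the parameter displacement of a short synthetic-data rollout
    to the displacement of an expert trajectory trained on real data."""
    # Start the student at the expert's earlier checkpoint.
    params = {k: v.detach().clone().requires_grad_(True) for k, v in expert_start.items()}

    for _ in range(student_steps):
        loss = F.cross_entropy(functional_call(model, params, (syn_x,)), syn_y)
        grads = torch.autograd.grad(loss, list(params.values()), create_graph=True)
        params = {k: v - lr * g for (k, v), g in zip(params.items(), grads)}

    # Squared distance to the expert's later checkpoint, normalized by how far the expert moved.
    num = sum(((params[k] - expert_end[k]) ** 2).sum() for k in params)
    den = sum(((expert_start[k] - expert_end[k]) ** 2).sum() for k in params) + 1e-8
    return num / den
```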

Synthetic Data Parameterization - Latent Space, Generative Priors

This line asks what the synthetic object should be. Instead of raw pixels, DD can optimize labels, latent codes, bases/factors, neural fields, generators, diffusion priors, or quantized representations.
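
As a small illustration of the parameterization question, a sketch in which the optimized object is a set of latent codes pushed through a frozen decoder rather than raw pixels (the decoder below is an untrained stand-in for a pretrained generator; any of the matching objectives above can consume the decoded images):

```python
import torch
import torch.nn as nn

LATENT_DIM, M = 32, 100  # illustrative sizes

# Frozen decoder standing in for a pretrained generator (GAN, VAE, or diffusion decoder).
decoder = nn.Sequential(nn.Linear(LATENT_DIM, 256), nn.ReLU(), nn.Linear(256, 3 * 32 * 32))
for p in decoder.parameters():
    p.requires_grad_(False)

# The distilled "dataset" is now M latent codes -- far fewer parameters than M raw images.
latents = torch.randn(M, LATENT_DIM, requires_grad=True)
opt = torch.optim.Adam([latents], lr=1e-2)

def synthetic_images():
    # Whatever matching loss is used, its gradient flows through the frozen decoder
    # back to the latent codes.
    return decoder(latents).view(M, 3, 32, 32)
```

The generator acts as an image prior and shrinks the number of optimized parameters; how much that helps depends on how well the generator's manifold covers the target task.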

Pixel/label/factor parameterizations:

Generative priors:

Moving beyond Image Classification - Different Tasks and Modalities

The original benchmark culture was small image classification. The field now includes language, graphs, video, time series, speech, medical imaging, federated/continual learning, RL, point clouds, object detection, retrieval, recommender systems, and multimodal settings.

Self-supervised, pretrained, and multimodal:

Language and text:

Graphs:

Video, time series, speech, and other structured data:

Applications beyond standard supervised classification:

Continual, federated, medical, and data-sharing contexts:

DD in Different Distributions - Long-Tailed, etc.

This cluster covers non-IID data, class imbalance, long-tailed labels, bias/fairness, privacy, robustness, noisy labels, domain shift, and adversarial/backdoor settings. These matter because a distilled set can amplify dataset/model biases even when average accuracy is high.

Long-tailed and imbalanced data:

Bias, fairness, calibration, and robustness:

Privacy, backdoors, and data leakage:

Noisy labels, domain shift, and non-IID data:

Understanding the behavior of Dataset Distillation

Does the Loss Landscape differ for models trained on Real Data vs Distilled Data?

Direct "loss landscape" papers are still sparse, but several lines provide indirect evidence: trajectory matching, curvature matching, flat/stable trajectory objectives, calibration, and architecture-transfer work.

What does Dataset Distillation Learn and Encode?

This is the interpretability/science question: does DD encode prototypes, discriminative shortcuts, training dynamics, class boundaries, model-specific priors, or teacher knowledge?

How far is DD explained theoretically?

Partially. Kernel and convexified settings are much better understood than deep, finite-width, augmentation-heavy, generative, or foundation-model settings. Existing theory explains fragments of DD: approximation, compression, kernels, quantization, and some offline-RL/supervised guarantees.

Open Problems, Research Questions and Future Direction

Open problem clusters and useful starting papers: