Ramblings - Dataset Distillation
What is Dataset Distillation?
Dataset Distillation, also called Dataset Condensation in much of the later literature, studies the following problem:
Can we replace a large real training set with a much smaller synthetic set such that training on the small set gives nearly the same downstream performance?
The synthetic dataset is not restricted to being a subset of the original data. The inputs, labels, and sometimes even training schedules can be optimized directly. This makes dataset distillation more general than coreset selection or subset selection.
At a high level, the goal is to compress the task-relevant information in a dataset rather than merely compressing pixels or storing a few representative examples.
Paper map last expanded: 2026-04-26.
Core entry points:
- Dataset Distillation - Wang, Zhu, Torralba, and Efros; introduced the task and optimized synthetic examples through unrolled training.
- Dataset Condensation with Gradient Matching - Zhao, Mopuri, and Bilen; reframed the problem around matching gradients from real and synthetic batches.
- Dataset Condensation with Distribution Matching - Zhao and Bilen; made condensation cheaper by matching feature distributions rather than full optimization dynamics.
- Dataset Distillation by Matching Training Trajectories - Cazenavette et al.; popularized matching long-range parameter trajectories from expert networks.
- DC-BENCH: Dataset Condensation Benchmark - Cui, Wang, Si, and Hsieh; standardized evaluation and exposed sensitivity to architecture, augmentation, and protocol.
- A Comprehensive Survey of Dataset Distillation - Lei and Tao; broad survey covering frameworks, algorithms, factorization, applications, and limitations.
- Dataset Distillation: A Comprehensive Review - Yu, Liu, and Wang; taxonomy, algorithmic framework, theoretical connections, and challenges.
- The Evolution of Dataset Distillation: Toward Scalable and Generalizable Solutions - Liu et al.; recent survey emphasizing scalability, generalization, and foundation-model-era directions.
Broad Formal Problem Setup
Let the real dataset be $\mathcal{T} = \{(x_i, y_i)\}_{i=1}^{N}$ and let the synthetic distilled dataset be $\mathcal{S} = \{(\tilde{x}_j, \tilde{y}_j)\}_{j=1}^{M}$ with $M \ll N$. Let a learning algorithm $\mathrm{alg}$ train model parameters $\theta_{\mathcal{S}} = \mathrm{alg}(\mathcal{S}, \theta_0)$, where $\theta_0$ is a (possibly random) initialization. The canonical objective is the bilevel problem
$$
\mathcal{S}^{\star} \in \arg\min_{\mathcal{S}} \; \mathbb{E}_{\theta_0}\big[\mathcal{L}_{\mathcal{T}}\big(\mathrm{alg}(\mathcal{S}, \theta_0)\big)\big],
\qquad
\mathcal{L}_{\mathcal{T}}(\theta) = \frac{1}{N}\sum_{(x, y) \in \mathcal{T}} \ell\big(f_\theta(x), y\big),
$$
i.e. choose the small synthetic set so that models trained on it still minimize the loss on the real data.
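To make the objective concrete, a minimal sketch of evaluating one candidate distilled set is shown below: train a fresh model on $\mathcal{S}$ for a few steps and measure the loss on real data. The tiny MLP, the CIFAR-10-like tensor shapes, and the step counts are illustrative assumptions, not any particular paper's protocol.

```python
# A minimal sketch of the outer objective: train a fresh model on the synthetic
# set S, then measure loss on the real set T. The tiny MLP, the CIFAR-10-like
# shapes, and the step counts are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_model(dim=3 * 32 * 32, num_classes=10):
    return nn.Sequential(nn.Flatten(), nn.Linear(dim, 128), nn.ReLU(),
                         nn.Linear(128, num_classes))

def evaluate_distilled_set(x_syn, y_syn, x_real, y_real, inner_steps=50, lr=0.01):
    """One sample of the outer objective: loss on T after training on S."""
    model = make_model()
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(inner_steps):                  # inner loop: fit the synthetic set
        opt.zero_grad()
        F.cross_entropy(model(x_syn), y_syn).backward()
        opt.step()
    with torch.no_grad():                         # outer loss: measured on real data
        return F.cross_entropy(model(x_real), y_real).item()

# Random placeholder tensors standing in for CIFAR-10-sized data.
x_syn, y_syn = torch.randn(10, 3, 32, 32), torch.arange(10)   # 1 image per class
x_real, y_real = torch.randn(256, 3, 32, 32), torch.randint(0, 10, (256,))
print(evaluate_distilled_set(x_syn, y_syn, x_real, y_real))
```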
Papers that are especially useful for formalizing the objective:
- Dataset Distillation - canonical bilevel/unrolled optimization setup.
- Flexible Dataset Distillation: Learn Labels Instead of Images - label distillation and more flexible meta-learning formulation.
- Soft-Label Dataset Distillation and Text Dataset Distillation - soft labels and early extension beyond image-only hard-label distillation.
- Dataset Meta-Learning from Kernel Ridge-Regression - KIP/kernel inducing point view.
- Dataset Distillation with Infinitely Wide Convolutional Networks - infinite-width convolutional kernel formulation.
- Efficient Dataset Distillation Using Random Feature Approximation - approximate-kernel view for making kernel distillation cheaper.
- Dataset Distillation with Convexified Implicit Gradients - implicit-gradient and convexified approximation to the bilevel problem.
- Provable and Efficient Dataset Distillation for Kernel Ridge Regression - theoretical guarantees for KRR distillation.
- Dataset Distillation as Pushforward Optimal Quantization - distributional/optimal-quantization framing.
- Dataset Distillation as Data Compression: A Rate-Utility Perspective - rate-utility view of what compression should preserve.
Why should we care?
Dataset distillation matters for at least four reasons.
First, it improves efficiency. If a tiny synthetic dataset can stand in for a large real dataset, then training, ablations, architecture search, and hyperparameter sweeps become much cheaper.
Second, it is useful in communication-constrained, storage-constrained, privacy-sensitive, or continual-learning settings where replay buffers or dataset transmission matter.
Third, it is an intriguing scientific problem. A successful distilled dataset is evidence that the original dataset contains a large amount of redundancy, so distillation gives a way to ask what information a model really needs in order to learn a task. Fourth, it is a lens on learning dynamics: it forces us to decide which aspects of training are actually worth preserving.
Motivation and use-case papers:
- Dataset Condensation with Differentiable Siamese Augmentation - augmentation-aware condensation, important because evaluation can hinge on train/test augmentation alignment.
- DC-BENCH: Dataset Condensation Benchmark - benchmarking and protocol sensitivity.
- Scaling Up Dataset Distillation to ImageNet-1K with Constant Memory - constant-memory trajectory matching at ImageNet scale.
- Squeeze, Recover and Relabel: Dataset Condensation at ImageNet Scale From A New Perspective - decoupled large-scale pipeline using recovery/relabeling.
- Dataset Distillation via Curriculum Data Synthesis in Large Data Era - curriculum data synthesis for large-scale condensation.
- Calibrated Dataset Condensation for Faster Hyperparameter Search - hyperparameter search as a concrete efficiency application.
- On the Diversity and Realism of Distilled Dataset: An Efficient Dataset Distillation Paradigm - realism/diversity matters for practical use and transfer.
- Self-Supervised Dataset Distillation for Transfer Learning - transfer-oriented DD rather than only training-from-scratch classification.
- Dataset Distillation via Knowledge Distillation: Towards Efficient Self-Supervised Pre-Training of Deep Networks - DD as a route to efficient self-supervised pre-training.
- Dataset Distillation for Pre-Trained Self-Supervised Vision Models - distillation for linear probes over modern pretrained SSL vision backbones.
The Evolution of the Field
Origin - The Original Bilevel Optimization Approach
Introduced in [@wangDatasetDistillation2020a]. A minimal unrolled meta-gradient sketch follows the paper list below.
Key papers:
- Dataset Distillation - original unrolled bilevel optimization of synthetic images, labels, and learning rates.
- Soft-Label Dataset Distillation and Text Dataset Distillation - early soft-label formulation and text extension.
- Flexible Dataset Distillation: Learn Labels Instead of Images - shows that optimizing labels on real images can outperform optimizing synthetic pixels in some settings.
- Medical Dataset Distillation / Soft-Label Anonymous Gastric X-ray Image Distillation - early real-world/medical-data-sharing direction.
- Optimizing Millions of Hyperparameters by Implicit Differentiation - not DD-specific, but important background for implicit differentiation in bilevel data optimization.
- Dataset Distillation with Convexified Implicit Gradients - revisits bilevel DD with implicit gradients and convexified approximations.
- Dataset Distillation with Stochastic Neural Networks - studies stochasticity/dropout-style randomness inside distillation.
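A minimal sketch of the unrolled meta-gradient at the heart of this formulation, assuming a single linear model and toy tensor shapes in place of the small networks used in the papers: a few differentiable SGD steps on the synthetic data are followed by a real-data loss whose gradient flows back through the unroll into the synthetic pixels. The original method additionally optimizes the learning rates (see the bullet above).

```python
# A minimal sketch of the unrolled meta-gradient: take a few differentiable SGD
# steps on the synthetic data, evaluate the loss on real data, and backpropagate
# through the whole unroll into the synthetic pixels. The single linear model,
# the step counts, and the learning rates are illustrative assumptions.
import torch
import torch.nn.functional as F

dim, num_classes = 3 * 32 * 32, 10
x_syn = torch.randn(num_classes, dim, requires_grad=True)   # learnable synthetic inputs
y_syn = torch.arange(num_classes)
x_real = torch.randn(256, dim)
y_real = torch.randint(0, num_classes, (256,))

outer_opt = torch.optim.Adam([x_syn], lr=0.1)
inner_lr, inner_steps = 0.01, 5

for outer_step in range(100):
    # Fresh random initialization each outer step, kept differentiable so that the
    # inner SGD updates stay inside the autograd graph.
    w = (0.01 * torch.randn(dim, num_classes)).requires_grad_()
    b = torch.zeros(num_classes, requires_grad=True)
    for _ in range(inner_steps):                             # unrolled inner loop on S
        inner_loss = F.cross_entropy(x_syn @ w + b, y_syn)
        gw, gb = torch.autograd.grad(inner_loss, (w, b), create_graph=True)
        w, b = w - inner_lr * gw, b - inner_lr * gb
    outer_loss = F.cross_entropy(x_real @ w + b, y_real)     # loss on real data
    outer_opt.zero_grad()
    outer_loss.backward()                                    # backprop through the unroll
    outer_opt.step()
```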
Gradient Matching
Gradient matching optimizes the synthetic set so that it induces parameter gradients close to those induced by real data. It is the bridge between the original bilevel formulation and more scalable matching objectives; a minimal per-step sketch follows the paper list below.
- Dataset Condensation with Gradient Matching - first major gradient-matching formulation.
- Dataset Condensation with Differentiable Siamese Augmentation - differentiable augmentation during condensation and evaluation.
- Dataset Condensation with Contrastive Signals - adds contrastive structure to the synthetic data learning signal.
- Loss-Curvature Matching for Dataset Selection and Condensation - matches second-order/loss-curvature information rather than only gradients.
- Delving into Effective Gradient Matching for Dataset Condensation - analyzes and improves practical gradient matching.
- DREAM: Efficient Dataset Distillation by Representative Matching - representative matching as an efficient alternative within the matching family.
- DREAM+: Efficient Dataset Distillation by Bidirectional Representative Matching - strengthens representative matching bidirectionally.
- Accelerating Dataset Distillation via Model Augmentation - model augmentation to improve generalization of gradient-based distilled sets.
- Teddy: Efficient Large-Scale Dataset Distillation via Taylor-Approximated Matching - Taylor-approximated matching for large-scale settings.
- Calibrated Dataset Condensation for Faster Hyperparameter Search - calibrates condensed sets for downstream hyperparameter search.
- Towards Adversarially Robust Dataset Distillation by Curvature Regularization - curvature regularization for robustness.
- Synthetic Text Generation for Training Large Language Models via Gradient Matching - gradient matching applied to text/LLM training data.
- Dataset Distillation for Pre-Trained Self-Supervised Vision Models - linear-gradient matching over frozen SSL representations.
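A minimal per-step sketch of the gradient-matching objective, assuming a tiny MLP in place of the usual ConvNet, a cosine layer-wise distance, and toy batch sizes: the synthetic batch is updated so that the gradient it induces on a freshly initialized network lines up with the gradient induced by a real batch. Real implementations typically match per class and alternate with updates of the network itself.

```python
# A minimal per-step sketch of gradient matching: for a freshly initialized network,
# push the gradient induced by the synthetic batch toward the gradient induced by a
# real batch. The tiny MLP (in place of the usual ConvNet), the cosine distance, and
# the sizes are illustrative assumptions; real implementations typically match per
# class and interleave updates of the network itself.
import torch
import torch.nn as nn
import torch.nn.functional as F

def grad_distance(g_syn, g_real):
    """Sum of (1 - cosine similarity) over parameter tensors."""
    total = 0.0
    for gs, gr in zip(g_syn, g_real):
        total = total + (1 - F.cosine_similarity(gs.flatten(), gr.flatten(), dim=0))
    return total

dim, num_classes = 3 * 32 * 32, 10
x_syn = torch.randn(num_classes, dim, requires_grad=True)
y_syn = torch.arange(num_classes)
syn_opt = torch.optim.SGD([x_syn], lr=0.1)

for it in range(100):
    model = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, num_classes))
    params = list(model.parameters())
    x_real = torch.randn(256, dim)                        # stand-in for a real batch
    y_real = torch.randint(0, num_classes, (256,))

    g_real = torch.autograd.grad(F.cross_entropy(model(x_real), y_real), params)
    g_real = [g.detach() for g in g_real]
    g_syn = torch.autograd.grad(F.cross_entropy(model(x_syn), y_syn), params,
                                create_graph=True)        # keep graph to reach x_syn
    loss = grad_distance(g_syn, g_real)
    syn_opt.zero_grad()
    loss.backward()
    syn_opt.step()
```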
Distribution Matching
Distribution matching avoids differentiating through long training trajectories by matching statistics or distributions in learned feature spaces. This family is usually cheaper and often more scalable, but its quality depends heavily on the embedding space and on diversity constraints. A minimal class-wise mean-matching sketch follows the paper list below.
- CAFE: Learning to Condense Dataset by Aligning Features - feature alignment with discriminative constraints.
- Dataset Condensation with Distribution Matching - class-wise feature distribution matching over sampled networks.
- Improved Distribution Matching for Dataset Condensation - fixes feature imbalance and unvalidated embeddings in naive DM.
- DataDAM: Efficient Dataset Distillation with Attention Matching - matches attention/discriminative regions.
- M3D: Dataset Condensation by Minimizing Maximum Mean Discrepancy - explicit MMD-based distribution matching.
- Exploiting Inter-sample and Inter-feature Relations in Dataset Distillation - matches relational structure across samples and features.
- Dataset Condensation with Latent Quantile Matching - quantile-based feature distribution matching.
- DANCE: Dual-View Distribution Alignment for Dataset Condensation - aligns distributions from complementary views.
- Diversified Semantic Distribution Matching for Dataset Distillation - improves semantic diversity in distribution matching.
- Decomposed Distribution Matching in Dataset Condensation - separates content/style distribution issues.
- Dataset Distillation with Neural Characteristic Function: A Minmax Perspective - characteristic-function/minmax view of distribution matching.
- OPTICAL: Leveraging Optimal Transport for Contribution Allocation in Dataset Distillation - optimal transport for contribution allocation.
- Dataset Distillation via the Wasserstein Metric - Wasserstein/OT framing.
- Diversity-Enhanced Distribution Alignment for Dataset Distillation - explicitly handles diversity in distribution alignment.
- Hyperbolic Dataset Distillation - studies DD in hyperbolic representation space.
- TGDD: Trajectory Guided Dataset Distillation with Balanced Distribution - combines trajectory guidance with distribution balancing.
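A minimal sketch of distribution matching, assuming a randomly initialized MLP embedding and toy sizes: class-wise mean embeddings of the synthetic images are pulled toward those of real images. MMD, quantile, or characteristic-function variants replace the mean difference with richer statistics.

```python
# A minimal sketch of distribution matching: embed real and synthetic images with a
# randomly initialized feature extractor and pull class-wise mean embeddings together.
# The embedding network and the sizes are illustrative assumptions; MMD, quantile, or
# characteristic-function variants replace the mean difference with richer statistics.
import torch
import torch.nn as nn

dim, num_classes, ipc = 3 * 32 * 32, 10, 1
x_syn = torch.randn(num_classes * ipc, dim, requires_grad=True)
y_syn = torch.arange(num_classes).repeat_interleave(ipc)
syn_opt = torch.optim.SGD([x_syn], lr=0.1)

for it in range(100):
    embed = nn.Sequential(nn.Linear(dim, 256), nn.ReLU())   # random embedding space
    x_real = torch.randn(512, dim)                           # stand-in for real data
    y_real = torch.randint(0, num_classes, (512,))

    loss = 0.0
    for c in range(num_classes):
        real_c = x_real[y_real == c]
        if len(real_c) == 0:                                 # class absent from this batch
            continue
        mu_real = embed(real_c).mean(dim=0).detach()
        mu_syn = embed(x_syn[y_syn == c]).mean(dim=0)
        loss = loss + ((mu_syn - mu_real) ** 2).sum()
    syn_opt.zero_grad()
    loss.backward()
    syn_opt.step()
```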
Kernel and other theoretically grounded methods
Kernel methods are useful because they give a more analyzable version of DD: the model-training map is replaced by kernel ridge regression or neural tangent kernels. A minimal KIP-style sketch follows the paper list below.
- Dataset Meta-Learning from Kernel Ridge-Regression - Kernel Inducing Points (KIP).
- Dataset Distillation with Infinitely Wide Convolutional Networks - distributed kernel meta-learning with infinite-width ConvNets.
- Dataset Distillation using Neural Feature Regression - neural feature regression as a cheaper surrogate.
- Efficient Dataset Distillation Using Random Feature Approximation - random features to approximate expensive kernel computations.
- Dataset Distillation with Convexified Implicit Gradients - finite-width NTK-inspired convexification.
- Provable and Efficient Dataset Distillation for Kernel Ridge Regression - guarantees for KRR distillation.
- A Theoretical Study of Dataset Distillation - theoretical analysis of what distilled sets can approximate.
- On the Size and Approximation Error of Distilled Sets - size/error tradeoffs for distilled sets.
- Dataset Distillation as Pushforward Optimal Quantization - DD as optimal quantization after pushing data through the learning map.
- Dataset Distillation as Data Compression: A Rate-Utility Perspective - compression-theoretic formulation.
- Algorithmic Guarantees for Distilling Supervised and Offline RL Datasets - guarantees beyond standard supervised classification.
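A minimal KIP-style sketch, assuming a generic RBF kernel in place of the NTK or infinite-width convolutional kernels used in the papers: kernel ridge regression is solved in closed form on the synthetic set and the resulting prediction error on real data is minimized with respect to the synthetic inputs.

```python
# A minimal KIP-style sketch: solve kernel ridge regression on the synthetic set in
# closed form and minimize the resulting prediction error on real data with respect
# to the synthetic inputs. The RBF kernel stands in for the NTK / infinite-width
# convolutional kernels used in the papers; ridge strength and sizes are assumptions.
import torch
import torch.nn.functional as F

def rbf_kernel(a, b, gamma=1e-3):
    return torch.exp(-gamma * torch.cdist(a, b) ** 2)

dim, num_classes = 3 * 32 * 32, 10
x_syn = torch.randn(num_classes, dim, requires_grad=True)
y_syn = F.one_hot(torch.arange(num_classes)).float()          # labels kept fixed here
x_real = torch.randn(256, dim)
y_real = F.one_hot(torch.randint(0, num_classes, (256,)), num_classes).float()

opt = torch.optim.Adam([x_syn], lr=0.1)
ridge = 1e-3

for it in range(100):
    K_ss = rbf_kernel(x_syn, x_syn)
    K_rs = rbf_kernel(x_real, x_syn)
    alpha = torch.linalg.solve(K_ss + ridge * torch.eye(len(x_syn)), y_syn)
    loss = ((K_rs @ alpha - y_real) ** 2).mean()              # KRR error on real data
    opt.zero_grad()
    loss.backward()
    opt.step()
```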
Trajectory Matching
Trajectory matching asks the synthetic data to reproduce where real-data training moves the model parameters, usually using expert trajectories precomputed on the full data. A minimal MTT-style sketch follows the paper list below.
- Dataset Distillation by Matching Training Trajectories - MTT; matches long-range expert parameter trajectories.
- Minimizing the Accumulated Trajectory Error to Improve Dataset Distillation - accumulated trajectory error instead of endpoint-only error.
- Scaling Up Dataset Distillation to ImageNet-1K with Constant Memory - TESLA-style constant-memory scaling.
- Sequential Subset Matching for Dataset Distillation - sequentially matches subsets of the expert trajectory.
- Towards Lossless Dataset Distillation via Difficulty-Aligned Trajectory Matching - aligns matching difficulty along the trajectory.
- SelMatch: Effectively Scaling Up Dataset Distillation via Selection-Based Initialization and Partial Updates by Trajectory Matching - selection-based initialization and partial update strategy.
- Dataset Distillation by Automatic Training Trajectories - automatic trajectory generation/selection.
- Neural Spectral Decomposition for Dataset Distillation - spectral analysis/parameterization connected to trajectory behavior.
- Prioritize Alignment in Dataset Distillation - alignment prioritization for better generalization.
- Towards Stable and Storage-efficient Dataset Distillation: Matching Convexified Trajectory - more stable/storage-efficient trajectory matching.
- Robust Dataset Distillation by Matching Adversarial Trajectories - adversarial trajectory matching for robustness.
- Contrastive Learning-Enhanced Trajectory Matching for Dataset Distillation - contrastive learning signal on top of trajectory matching.
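A minimal MTT-style sketch, assuming a linear student and random placeholder "expert" checkpoints (in practice these are saved while training on the full real dataset): the student starts from an expert checkpoint, takes a few differentiable steps on the synthetic data, and is matched to a later expert checkpoint, normalized by how far the expert moved.

```python
# A minimal MTT-style sketch: start a student from an expert checkpoint theta_t, take
# a few differentiable steps on the synthetic data, and match the result to a later
# expert checkpoint theta_{t+M}, normalized by how far the expert moved. The expert
# checkpoints below are random placeholders; in practice they are saved while training
# on the full real dataset. The linear model and step counts are assumptions.
import torch
import torch.nn.functional as F

dim, num_classes = 3 * 32 * 32, 10
x_syn = torch.randn(num_classes, dim, requires_grad=True)
y_syn = torch.arange(num_classes)
syn_opt = torch.optim.SGD([x_syn], lr=0.1)
student_lr, student_steps = 0.01, 5

def flat(params):
    return torch.cat([p.flatten() for p in params])

for it in range(100):
    # Placeholder expert checkpoints (theta_t, theta_{t+M}) for a linear model.
    theta_start = [0.01 * torch.randn(dim, num_classes), torch.zeros(num_classes)]
    theta_target = [p + 0.01 * torch.randn_like(p) for p in theta_start]

    # Student parameters start at theta_t and stay differentiable w.r.t. x_syn.
    w, b = [p.clone().requires_grad_() for p in theta_start]
    for _ in range(student_steps):
        inner_loss = F.cross_entropy(x_syn @ w + b, y_syn)
        gw, gb = torch.autograd.grad(inner_loss, (w, b), create_graph=True)
        w, b = w - student_lr * gw, b - student_lr * gb

    num = (flat([w, b]) - flat(theta_target)).pow(2).sum()
    den = (flat(theta_start) - flat(theta_target)).pow(2).sum()
    match_loss = num / den                       # normalized parameter-matching loss
    syn_opt.zero_grad()
    match_loss.backward()
    syn_opt.step()
```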
Synthetic Data Parameterization - Latent Space, Generative Priors
This line asks what the synthetic object should be. Instead of raw pixels, DD can optimize labels, latent codes, bases/factors, neural fields, generators, diffusion priors, or quantized representations.
Pixel/label/factor parameterizations (a minimal factorized-parameterization sketch follows this list):
- Flexible Dataset Distillation: Learn Labels Instead of Images - label distillation.
- Soft-Label Dataset Distillation and Text Dataset Distillation - soft-label DD.
- Dataset Condensation via Efficient Synthetic-Data Parameterization - IDC; parameterizes synthetic data more efficiently.
- Remember the Past: Distilling Datasets into Addressable Memories for Neural Networks - addressable memory parameterization.
- Dataset Distillation via Factorization - HaBa-style factorized synthetic data.
- Dataset Condensation with Latent Space Knowledge Factorization and Sharing - latent factorization/sharing.
- Slimmable Dataset Condensation - one condensed set for multiple compression ratios.
- Few-Shot Dataset Distillation via Translative Pre-Training - translative pretraining for few-shot DD.
- Sparse Parameterization for Epitomic Dataset Distillation - sparse/epitomic representation.
- Frequency Domain-based Dataset Distillation - frequency-domain parameterization.
- Leveraging Hierarchical Feature Sharing for Efficient Dataset Condensation - hierarchical sharing across synthetic data.
- FYI: Flip Your Images for Dataset Distillation - simple transformation/parameterization improvement.
- Color-Oriented Redundancy Reduction in Dataset Distillation - color redundancy as a compression target.
- Distilling Dataset into Neural Field - neural field representation.
- Beyond Pixels: Efficient Dataset Distillation via Sparse Gaussian Representation - sparse Gaussian representation.
- Dataset Condensation with Color Compensation - color compensation for improved condensed images.
- Heavy Labels Out! Dataset Distillation with Label Space Lightening - label-space compression.
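A minimal sketch of a factorized parameterization, assuming a linear decoder and toy sizes: synthetic images are stored as a small shared dictionary of bases plus per-image coefficients, which cuts the number of stored parameters well below raw pixels. HaBa and related methods use learned non-linear "hallucinator" decoders instead of the linear combination used here.

```python
# A minimal sketch of a factorized synthetic-data parameterization: store a small
# shared dictionary of bases plus per-image coefficients and materialize synthetic
# images on the fly. The linear decoder and all sizes are illustrative assumptions;
# HaBa and related methods use learned non-linear "hallucinator" decoders.
import torch

dim = 3 * 32 * 32
num_images, num_bases = 100, 16

bases = torch.randn(num_bases, dim, requires_grad=True)          # shared dictionary
codes = torch.randn(num_images, num_bases, requires_grad=True)   # per-image coefficients

def decode():
    """Materialize the synthetic images as linear combinations of the bases."""
    return codes @ bases                                          # (num_images, dim)

x_syn = decode()   # fed to any matching objective above; gradients reach bases and codes

raw_params = num_images * dim
factored_params = bases.numel() + codes.numel()
print(f"raw pixels: {raw_params:,}  factorized: {factored_params:,}")
```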
Generative priors (a minimal latent-optimization sketch follows this list):
- Synthesizing Informative Training Samples with GAN - GANs as sample generators.
- Generalizing Dataset Distillation via Deep Generative Prior - GLaD; latent optimization through a pretrained generator.
- DiM: Distilling Dataset into Generative Model - distills into a generative model.
- Dataset Condensation via Generative Model - generator-based condensation.
- Efficient Dataset Distillation via Minimax Diffusion - diffusion-based minimax distillation.
- D4M: Dataset Distillation via Disentangled Diffusion Model - disentangled diffusion latent DD.
- Generative Dataset Distillation Based on Diffusion Model - diffusion-based generative DD.
- Influence-Guided Diffusion for Dataset Distillation - influence-guided diffusion sampling.
- Taming Diffusion for Dataset Distillation with High Representativeness - diffusion representativeness constraints.
- MGD3: Mode-Guided Dataset Distillation using Diffusion Models - mode-guided diffusion DD.
- CaO2: Rectifying Inconsistencies in Diffusion-Based Dataset Distillation - fixes diffusion-DD inconsistency.
- Dataset Distillation via Vision-Language Category Prototype - uses vision-language category prototypes.
- Unlocking Dataset Distillation with Diffusion Models - diffusion models as dataset distillation priors.
- Diffusion Models as Dataset Distillation Priors - ICLR 2026 diffusion-prior framing.
- CoDA: From Text-to-Image Diffusion Models to Training-Free Dataset Distillation - training-free T2I diffusion DD.
- ManifoldGD: Training-Free Hierarchical Manifold Guidance for Diffusion-Based Dataset Distillation - training-free manifold guidance.
- IMS3: Breaking Distributional Aggregation in Diffusion-Based Dataset Distillation - addresses aggregation artifacts in diffusion DD.
- EVLF: Early Vision-Language Fusion for Generative Dataset Distillation - vision-language fusion for generative DD.
- Learnability-Guided Diffusion for Dataset Distillation - learnability-guided diffusion sampling.
- HIERAMP: Coarse-to-Fine Autoregressive Amplification for Generative Dataset Distillation - autoregressive generative DD.
- Path-Guided Flow Matching for Dataset Distillation - flow-matching generative DD.
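A minimal sketch of optimizing through a generative prior, assuming a tiny random MLP as a stand-in for the pretrained GAN or diffusion decoder and a stub matching loss: the learnable objects are latent codes, and any matching objective from the earlier sections can be evaluated on the decoded images so that optimization stays on the generator's manifold.

```python
# A minimal sketch of distillation through a generative prior: the learnable objects
# are latent codes decoded by a frozen generator, so the synthetic images stay on the
# generator's manifold. The tiny random MLP below is only a placeholder for the
# pretrained GAN or diffusion decoder used in practice, and matching_loss is a stub
# for any objective from the earlier sections.
import torch
import torch.nn as nn

latent_dim, dim = 128, 3 * 32 * 32

generator = nn.Sequential(nn.Linear(latent_dim, 512), nn.ReLU(), nn.Linear(512, dim))
for p in generator.parameters():
    p.requires_grad_(False)                        # the prior stays frozen

z_syn = torch.randn(10, latent_dim, requires_grad=True)   # learnable latent codes
opt = torch.optim.Adam([z_syn], lr=0.05)

def matching_loss(x_syn):
    # Placeholder: in practice this would be gradient, distribution, or trajectory
    # matching evaluated on the decoded images.
    return (x_syn ** 2).mean()

for it in range(100):
    x_syn = generator(z_syn)                       # decode latents into synthetic images
    loss = matching_loss(x_syn)
    opt.zero_grad()
    loss.backward()
    opt.step()
```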
Moving beyond Image Classification - Different Tasks and Modalities
The original benchmark culture was small image classification. The field now includes language, graphs, video, time series, speech, medical imaging, federated/continual learning, RL, point clouds, object detection, retrieval, recommender systems, and multimodal settings.
Self-supervised, pretrained, and multimodal:
- Self-Supervised Dataset Distillation for Transfer Learning
- Efficiency for Free: Ideal Data Are Transportable Representations
- Self-supervised Dataset Distillation: A Good Compression Is All You Need
- Dataset Distillation via Knowledge Distillation: Towards Efficient Self-Supervised Pre-Training of Deep Networks
- Boost Self-Supervised Dataset Distillation via Parameterization, Predefined Augmentation, and Approximation
- Dataset Distillation for Pre-Trained Self-Supervised Vision Models
- Vision-Language Dataset Distillation
- Low-Rank Similarity Mining for Multimodal Dataset Distillation
- Audio-Visual Dataset Distillation
- Beyond Modality Collapse: Representations Blending for Multimodal Dataset Distillation
- Efficient Multimodal Dataset Distillation via Generative Models
- CovMatch: Cross-Covariance Guided Multimodal Dataset Distillation with Trainable Text Encoder
- ImageBindDC: Compressing Multi-modal Data with ImageBind-based Condensation
- Multimodal Dataset Distillation Made Simple by Prototype-Guided Data Synthesis
- Multimodal Dataset Distillation via Phased Teacher Models
Language and text:
- Soft-Label Dataset Distillation and Text Dataset Distillation
- Data Distillation for Text Classification
- Dataset Distillation with Attention Labels for Fine-tuning BERT
- DiLM: Distilling Dataset into Language Model for Text-level Dataset Distillation
- Textual Dataset Distillation via Language Model Embedding
- UniDetox: Universal Detoxification of Large Language Models via Dataset Distillation
- Knowledge Hierarchy Guided Biological-Medical Dataset Distillation for Domain LLM Training
- Synthetic Text Generation for Training Large Language Models via Gradient Matching
- CondenseLM: LLMs-driven Text Dataset Condensation via Reward Matching
Graphs:
- Graph Condensation for Graph Neural Networks
- Condensing Graphs via One-Step Gradient Matching
- Graph Condensation via Receptive Field Distribution Matching
- Kernel Ridge Regression-Based Graph Dataset Distillation
- Structure-free Graph Condensation: From Large-scale Graphs to Condensed Graph-free Data
- Does Graph Distillation See Like Vision Dataset Counterpart?
- Mirage: Model-Agnostic Graph Distillation for Graph Classification
- Graph Distillation with Eigenbasis Matching
- Navigating Complexity: Toward Lossless Graph Condensation via Expanding Window Matching
- Graph Data Condensation via Self-expressive Graph Structure Reconstruction
- A Survey on Graph Condensation
Video, time series, speech, and other structured data:
- Dancing with Still Images: Video Distillation via Static-Dynamic Disentanglement
- Video Set Distillation: Information Diversification and Temporal Densification
- A Large-Scale Study on Video Action Dataset Condensation
- Condensing Action Segmentation Datasets via Generative Network Inversion
- Latent Video Dataset Distillation
- Distill Video Datasets into Images
- PRISM: Video Dataset Condensation with Progressive Refinement and Insertion for Sparse Motion
- New Properties of the Data Distillation Method When Working With Tabular Data
- Dataset Condensation for Time Series Classification via Dual Domain Matching
- CondTSF: One-line Plugin of Dataset Condensation for Time Series Forecasting
- Less is More: Efficient Time Series Dataset Condensation via Two-fold Modal Matching
- DDTime: Dataset Distillation with Spectral Alignment and Information Bottleneck for Time-Series Forecasting
- ShapeCond: Fast Shapelet-Guided Dataset Condensation for Time Series Classification
- Dataset-Distillation Generative Model for Speech Emotion Recognition
Applications beyond standard supervised classification:
- Dataset Distillation for Offline Reinforcement Learning
- Offline Behavior Distillation
- Algorithmic Guarantees for Distilling Supervised and Offline RL Datasets
- Fetch and Forge: Efficient Dataset Condensation for Object Detection
- OD3: Optimization-free Dataset Distillation for Object Detection
- Point Cloud Dataset Distillation
- Dataset Distillation of 3D Point Clouds via Distribution Matching
- Toward Dataset Distillation for Regression Problems
- Towards Efficient Deep Hashing Retrieval: Condensing Your Data via Feature-Embedding Matching
- GSDD: Generative Space Dataset Distillation for Image Super-resolution
- Distilled Datamodel with Reverse Gradient Matching
- Dataset Condensation Driven Machine Unlearning
- EEG-DLite: Dataset Distillation for Efficient Large EEG Model Training
- ConceptCaps: a Distilled Concept Dataset for Interpretability in Music Models
- Towards Realistic Remote Sensing Dataset Distillation with Discriminative Prototype-guided Diffusion
Continual, federated, medical, and data-sharing contexts:
- Reducing Catastrophic Forgetting with Learning on Synthetic Data
- Condensed Composite Memory Continual Learning
- Distilled Replay: Overcoming Forgetting through Synthetic Samples
- Sample Condensation in Online Continual Learning
- An Efficient Dataset Condensation Plugin and Its Application to Continual Learning
- Summarizing Stream Data for Memory-Restricted Online Continual Learning
- Federated Learning via Synthetic Data
- Distilled One-Shot Federated Learning
- Meta Knowledge Condensation for Federated Learning
- FedDM: Iterative Distribution Matching for Communication-Efficient Federated Learning
- Federated Learning via Decentralized Dataset Distillation in Resource-Constrained Edge Environments
- Soft-Label Anonymous Gastric X-ray Image Distillation
- Image Distillation for Safe Data Sharing in Histopathology
- Dataset Distillation in Medical Imaging: A Feasibility Study
- Dataset Distillation for Histopathology Image Classification
- Progressive Trajectory Matching for Medical Dataset Distillation
- High-Order Progressive Trajectory Matching for Medical Image Dataset Distillation
DD in Different Distributions - Long-Tailed, etc.
This cluster covers non-IID data, class imbalance, long-tailed labels, bias/fairness, privacy, robustness, noisy labels, domain shift, and adversarial/backdoor settings. These matter because a distilled set can amplify dataset/model biases even when average accuracy is high.
Long-tailed and imbalanced data:
- Distilling Long-tailed Datasets
- Rectifying Soft-Label Entangled Bias in Long-Tailed Dataset Distillation
- Rethinking Long-tailed Dataset Distillation: A Uni-Level Framework with Unbiased Recovery and Relabeling
- TGDD: Trajectory Guided Dataset Distillation with Balanced Distribution
Bias, fairness, calibration, and robustness:
- Mitigating Bias in Dataset Distillation
- FairDD: Fair Dataset Distillation
- Rethinking Data Distillation: Do Not Overlook Calibration
- Towards Trustworthy Dataset Distillation
- Can We Achieve Robustness from Data Alone?
- Towards Robust Dataset Learning
- Towards Adversarially Robust Dataset Distillation by Curvature Regularization
- Group Distributionally Robust Dataset Distillation with Risk Minimization
- ROME is Forged in Adversity: Robust Distilled Datasets via Information Bottleneck
- Robust Dataset Distillation by Matching Adversarial Trajectories
- BEARD: Benchmarking the Adversarial Robustness for Dataset Distillation
- DD-RobustBench: An Adversarial Robustness Benchmark for Dataset Distillation
Privacy, backdoors, and data leakage:
- Privacy for Free: How does Dataset Condensation Help Privacy?
- Private Set Generation with Discriminative Information
- No Free Lunch in "Privacy for Free: How does Dataset Condensation Help Privacy"
- Backdoor Attacks Against Dataset Distillation
- Differentially Private Kernel Inducing Points for Privacy-preserving Data Distillation
- Understanding Reconstruction Attacks with the Neural Tangent Kernel and Dataset Distillation
- Rethinking Backdoor Attacks on Dataset Distillation: A Kernel Method Perspective
- Differentially Private Dataset Condensation
- Improving Noise Efficiency in Privacy-preserving Dataset Distillation
- Dataset Distillation for Memorized Data: Soft Labels can Leak Held-Out Teacher Knowledge
- SNEAKDOOR: Stealthy Backdoor Attacks against Distribution Matching-based Dataset Condensation
- Poisoned Distillation: Injecting Backdoors into Distilled Datasets Without Raw Data Access
Noisy labels, domain shift, and non-IID data:
- Dataset Distillers Are Good Label Denoisers In the Wild
- Robust Dataset Condensation using Supervised Contrastive Learning
- Multi-Source Domain Adaptation Meets Dataset Distillation through Dataset Dictionary Learning
- Large Scale Dataset Distillation with Domain Shift
- Federated Learning via Decentralized Dataset Distillation in Resource-Constrained Edge Environments
- DCFL: Non-IID Awareness Dataset Condensation Aided Federated Learning
Understanding the behavior of Dataset Distillation
Does the Loss Landscape differ for models trained on Real Data vs Distilled Data?
Direct "loss landscape" papers are still sparse, but several lines provide indirect evidence: trajectory matching, curvature matching, flat/stable trajectory objectives, calibration, and architecture-transfer work.
- Dataset Distillation by Matching Training Trajectories - models trained on distilled data are explicitly pushed toward real-data parameter states.
- Minimizing the Accumulated Trajectory Error to Improve Dataset Distillation - asks whether matching the whole training path matters more than final parameters.
- Loss-Curvature Matching for Dataset Selection and Condensation - connects condensed data to local curvature of the training objective.
- Towards Lossless Dataset Distillation via Difficulty-Aligned Trajectory Matching - suggests that which trajectory segments are matched changes optimization behavior.
- Towards Stable and Storage-efficient Dataset Distillation: Matching Convexified Trajectory - stability-oriented trajectory matching.
- What is Dataset Distillation Learning? - studies what patterns DD learns and how distilled images differ from natural images.
- Towards Mitigating Architecture Overfitting in Dataset Distillation - architecture-specific optimization landscapes can be encoded in synthetic data.
- Improve Cross-Architecture Generalization on Dataset Distillation - uses model pools to reduce architecture-specific bias.
- Boosting the Cross-Architecture Generalization of Dataset Distillation through an Empirical Study - empirical analysis of architecture-transfer failure.
- Rethinking Data Distillation: Do Not Overlook Calibration - calibration changes how distilled-data-trained models behave beyond accuracy.
What does Dataset Distillation Learn and Encode?
This is the interpretability/science question: does DD encode prototypes, discriminative shortcuts, training dynamics, class boundaries, model-specific priors, or teacher knowledge?
- Dataset Distillation with Infinitely Wide Convolutional Networks - includes early analysis of how distilled sets differ from natural data.
- Generalizing Dataset Distillation via Deep Generative Prior - suggests synthetic data quality improves when constrained by natural-image generative priors.
- On the Diversity and Realism of Distilled Dataset - studies realism/diversity as properties of useful distilled sets.
- What is Dataset Distillation Learning? - direct analysis of the information and patterns encoded by DD.
- Neural Spectral Decomposition for Dataset Distillation - spectral view of learned synthetic data.
- Frequency Domain-based Dataset Distillation - frequency content as a lens on what DD stores.
- Dataset Distillation Efficiently Encodes Low-Dimensional Representations from Gradient-Based Learning of Non-Linear Tasks - evidence for low-dimensional representation encoding.
- Understanding Dataset Distillation via Spectral Filtering - analyzes DD through spectral filtering.
- Dataset Distillation for Pre-Trained Self-Supervised Vision Models - distilled images expose model-specific representation biases and spurious correlations.
- Dataset Distillation for Memorized Data: Soft Labels can Leak Held-Out Teacher Knowledge - soft labels may encode memorized/teacher-held information.
- Rethinking Dataset Distillation: Hard Truths about Soft Labels - recent critique of soft-label behavior and evaluation assumptions.
How far is DD explained theoretically?
Partially. Kernel and convexified settings are much better understood than deep, finite-width, augmentation-heavy, generative, or foundation-model settings. Existing theory explains fragments of DD: approximation, compression, kernels, quantization, and some offline-RL/supervised guarantees.
- Dataset Meta-Learning from Kernel Ridge-Regression - strong theoretical starting point via KIP.
- Dataset Distillation with Infinitely Wide Convolutional Networks - NTK/infinite-width analysis.
- Efficient Dataset Distillation Using Random Feature Approximation - approximate-kernel theory/practice.
- Dataset Distillation with Convexified Implicit Gradients - convexified approximation for implicit gradients.
- On the Size and Approximation Error of Distilled Sets - distilled set size and approximation error.
- A Theoretical Study of Dataset Distillation - theoretical characterization of DD.
- Provable and Efficient Dataset Distillation for Kernel Ridge Regression - provable KRR distillation.
- M3D: Dataset Condensation by Minimizing Maximum Mean Discrepancy - MMD framing gives a statistical-distance lens.
- A Discrepancy-Based Perspective on Dataset Condensation - discrepancy-theoretic view.
- Dataset Distillation as Pushforward Optimal Quantization - pushforward quantization perspective.
- Dataset Distillation as Data Compression: A Rate-Utility Perspective - compression theory and utility tradeoffs.
- Algorithmic Guarantees for Distilling Supervised and Offline RL Datasets - guarantees for supervised and offline-RL dataset distillation.
Open Problems, Research Questions and Future Direction
Open problem clusters and useful starting papers:
- Cross-architecture generalization - GLaD, Towards Mitigating Architecture Overfitting in Dataset Distillation, Improve Cross-Architecture Generalization on Dataset Distillation, Boosting Cross-Architecture Generalization, Prioritize Alignment in Dataset Distillation, Dataset Distillation for Pre-Trained Self-Supervised Vision Models, and PRISM: Diversifying Dataset Distillation by Decoupling Architectural Priors.
- Scaling laws, compression limits, and evaluation protocols - DC-BENCH, TESLA, SRe2L, Curriculum Data Synthesis, Elucidating the Design Space of Dataset Condensation, DD-Ranking: Rethinking the Evaluation of Dataset Distillation, Rectified Decoupled Dataset Distillation, and Dataset Distillation as Data Compression.
- What should be matched? - gradients, features, distributions, trajectories, curvature, attention, quantiles, optimal transport, spectra, and teacher outputs are all active candidates. Compare Gradient Matching, Distribution Matching, MTT, DataDAM, M3D, Neural Characteristic Function, OPTICAL, and Linear Gradient Matching.
- Foundation-model-era DD - how to distill for pretrained SSL, CLIP/VLMs, LLMs, and multimodal models rather than training small ConvNets from scratch. Start with Vision-Language Dataset Distillation, Self-Supervised Dataset Distillation for Transfer Learning, Dataset Distillation via Knowledge Distillation, Dataset Distillation for Pre-Trained Self-Supervised Vision Models, Synthetic Text Generation for Training Large Language Models via Gradient Matching, and Multimodal Dataset Distillation Made Simple by Prototype-Guided Data Synthesis.
- Generative priors vs direct optimization - diffusion, GAN, flow, and autoregressive priors improve realism and scalability, but can introduce generator bias. Start with GLaD, D4M, Taming Diffusion, Unlocking Dataset Distillation with Diffusion Models, CoDA, Learnability-Guided Diffusion, and Path-Guided Flow Matching.
- Distribution shift, bias, fairness, privacy, and robustness - important because DD can look good on average accuracy while leaking, biasing, backdooring, or failing under shift. Start with Mitigating Bias, FairDD, Distilling Long-tailed Datasets, Privacy for Free, No Free Lunch in Privacy for Free, Differentially Private Dataset Condensation, BEARD, and DD-RobustBench.
- Beyond static image classification - graph, text, video, RL, time series, object detection, point cloud, medical, and federated DD each changes the definition of "small synthetic dataset." Start with Graph Condensation, DiLM, Video Set Distillation, Dataset Distillation for Offline Reinforcement Learning, Dataset Condensation for Time Series Classification, Fetch and Forge, Point Cloud Dataset Distillation, Image Distillation for Safe Data Sharing in Histopathology, and FedDM.
- Interpretability of distilled data - DD can be used as a microscope for the training algorithm, but it is unclear whether it reveals dataset semantics, architecture priors, shortcut features, or teacher leakage. Start with What is Dataset Distillation Learning?, Dataset Distillation for Pre-Trained Self-Supervised Vision Models, Understanding Dataset Distillation via Spectral Filtering, Dataset Distillation Efficiently Encodes Low-Dimensional Representations, Dataset Distillation for Memorized Data, and Rethinking Dataset Distillation: Hard Truths about Soft Labels.