ICML23 - Synthetic Data for Model Selection

Preface

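The paper asks: can synthetic data be used for model selection? That is, instead of holding out a validation set, all labeled data is used for training, and models are selected using synthetic data generated from the training set.

Here, "model selection" means choosing among multiple models trained on the same training set (different network architectures, different hyperparameters, and so on).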

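The paper is organized as follows:

It first presents a theoretical result that carries the main insight;
It then uses extensive experiments to show that selecting models with synthetic data is effective.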

Synthetic Data for Model Selection

The paper first defines a quantity $\Delta \epsilon$ , as follows:

Lemma 3.1. Let $\Delta \epsilon$ denote the risk difference between two hypotheses, $h_1, h_2 \in \mathcal{H}$ , measured over a probability distribution $\mathcal{D}=\langle\Omega, \mu\rangle$ , i.e., $\Delta \epsilon=$ $\epsilon\left(h_2\right)-\epsilon\left(h_1\right)$ . Let $f$ denote the labeling function. Let $\Omega_1=\left\{\mathbf{x} \in \Omega \mid h_1(\mathbf{x}) \neq f(\mathbf{x}) \wedge h_2(\mathbf{x})=f(\mathbf{x})\right\}$ and $\Omega_2=$ $\left\{\mathbf{x} \in \Omega \mid h_2(\mathbf{x}) \neq f(\mathbf{x}) \wedge h_1(\mathbf{x})=f(\mathbf{x})\right\}$ . Then,
$\Delta \epsilon=\int_{\Omega_2} \mu(\mathbf{x}) d \mathbf{x}-\int_{\Omega_1} \mu(\mathbf{x}) d \mathbf{x} .$

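Put simply, given two models $h_1$ and $h_2$ and the ground truth $f$ on the task distribution $\mathcal{D}$ , $\Delta \epsilon$ is "the risk (error rate) of $h_2$ on $\mathcal{D}$ minus the risk of $h_1$ on $\mathcal{D}$ ", or equivalently the accuracy of $h_1$ minus the accuracy of $h_2$ . Since $\epsilon$ is a risk, $\Delta \epsilon\geq 0$ means $h_1$ is at least as accurate as $h_2$ , so $h_1$ should be selected.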
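The decomposition in Lemma 3.1 can be checked empirically. The sketch below is illustrative only: the 3-class labels, the `noisy_copy` helper, and the flip rates are hypothetical stand-ins for real classifiers, and sample averages replace the integrals over $\mu$ .

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
f = rng.integers(0, 3, size=n)  # ground-truth labels (hypothetical 3-class task)

def noisy_copy(labels, flip_rate, rng):
    """Simulate a classifier: replace a fraction of labels with random classes."""
    out = labels.copy()
    flip = rng.random(labels.size) < flip_rate
    out[flip] = rng.integers(0, 3, size=int(flip.sum()))
    return out

h1 = noisy_copy(f, 0.10, rng)  # predictions of the more accurate model
h2 = noisy_copy(f, 0.30, rng)  # predictions of the less accurate model

# Direct definition: risk of h2 minus risk of h1.
delta_eps = np.mean(h2 != f) - np.mean(h1 != f)

# Lemma 3.1: only the regions where exactly one model is wrong matter.
omega1 = (h1 != f) & (h2 == f)  # h1 wrong, h2 right
omega2 = (h2 != f) & (h1 == f)  # h2 wrong, h1 right
delta_eps_lemma = omega2.mean() - omega1.mean()

print(delta_eps, delta_eps_lemma)
```

The two printed values agree up to floating-point rounding, because samples where both models are wrong cancel out of the difference. A positive value means $h_2$ has the higher risk.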

From this quantity, the following theorem can be derived:

Theorem 3.2. Let $\Delta \epsilon_r$ and $\Delta \epsilon_s$ denote the risk difference between two hypotheses, $h_1, h_2 \in \mathcal{H}$ , measured over the real and the synthetic probability distributions $\mathcal{D}_r=\left(\Omega, \mu_r\right)$ and $\mathcal{D}_s=\left(\Omega, \mu_s\right)$ , respectively, i.e., $\Delta \epsilon_r=$ $\epsilon_r\left(h_2\right)-\epsilon_r\left(h_1\right)$ and $\Delta \epsilon_s=\epsilon_s\left(h_2\right)-\epsilon_s\left(h_1\right)$ . Let $f$ denote the labeling function. Then, for any $h_1, h_2 \in \mathcal{H}$ :
$\Delta \epsilon_s-\Delta \epsilon_r \leq \delta_{h_1 \oplus h_2}(\mu_r, \mu_s),$ where $\delta_{h_1 \oplus h_2}$ is the total variation computed over the subset of the domain $\Omega$ , where the hypotheses $h_1$ and $h_2$ do not agree.

The proof is as follows:
$\begin{aligned} \Delta \epsilon_s- \Delta \epsilon_r & = \int_{\Omega_2} \mu_s(\mathbf{x}) d \mathbf{x}-\int_{\Omega_1} \mu_s(\mathbf{x}) d \mathbf{x} -\int_{\Omega_2} \mu_r(\mathbf{x}) d \mathbf{x}+\int_{\Omega_1} \mu_r(\mathbf{x}) d \mathbf{x} \\ &= \int_{\Omega_2} \left(\mu_s(\mathbf{x})-\mu_r(\mathbf{x})\right) d \mathbf{x}-\int_{\Omega_1} \left(\mu_s(\mathbf{x})-\mu_r(\mathbf{x})\right) d \mathbf{x} \\ & \leq \int_{\Omega_2}\left|\mu_s(\mathbf{x})-\mu_r(\mathbf{x})\right| d \mathbf{x}+\int_{\Omega_1}\left|\mu_s(\mathbf{x})-\mu_r(\mathbf{x})\right| d \mathbf{x} \\ &= \int_{\Omega_1 \cup \Omega_2}\left|\mu_s(\mathbf{x})-\mu_r(\mathbf{x})\right| d \mathbf{x} \\ & \leq \delta_{h_1 \oplus h_2}(\mu_r, \mu_s) \end{aligned}$
The fourth line uses the fact that $\Omega_1$ and $\Omega_2$ are disjoint, and the last step uses the fact that $\Omega_1 \cup \Omega_2$ is contained in the region where $h_1$ and $h_2$ disagree.

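The theorem relates $\Delta \epsilon_r$ (the performance ranking of $h_1$ and $h_2$ on the real data distribution) to $\Delta \epsilon_s$ (their ranking on the synthetic data distribution), and states that: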

In the paper's words: The ability to use synthetic data for ranking models depends only on the probability density gap between the synthetic and real distribution in the area of disagreement, $\delta_{h_1 \oplus h_2}(\mu_r, \mu_s)$ .
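The bound in Theorem 3.2 can be checked numerically on a small discrete domain. This is a sketch under stated assumptions: the domain size, the randomly drawn distributions and labelings are all hypothetical, and $\delta_{h_1 \oplus h_2}$ is taken to be the sum of $|\mu_s-\mu_r|$ over the disagreement region, which is the form the last step of the proof requires (the paper's exact normalization of total variation is not restated in this post).

```python
import numpy as np

rng = np.random.default_rng(1)
K = 50                               # size of a hypothetical discrete domain
mu_r = rng.dirichlet(np.ones(K))     # "real" distribution over the K points
mu_s = rng.dirichlet(np.ones(K))     # "synthetic" distribution over the same points
f = rng.integers(0, 3, size=K)       # labeling function
h1 = rng.integers(0, 3, size=K)      # predictions of hypothesis h1
h2 = rng.integers(0, 3, size=K)      # predictions of hypothesis h2

def risk_difference(mu, h1, h2, f):
    """Risk of h2 minus risk of h1 under the distribution mu."""
    return mu[h2 != f].sum() - mu[h1 != f].sum()

delta_eps_r = risk_difference(mu_r, h1, h2, f)
delta_eps_s = risk_difference(mu_s, h1, h2, f)

# Density gap restricted to the points where h1 and h2 disagree.
disagree = h1 != h2
bound = np.abs(mu_s - mu_r)[disagree].sum()

holds = delta_eps_s - delta_eps_r <= bound + 1e-12
print(delta_eps_s - delta_eps_r, bound, holds)
```

Points where the two hypotheses agree contribute nothing to either risk difference, which is why only the disagreement region enters the bound.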

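The theorem yields the following corollary:

When $\Delta \epsilon_s\geq \delta(\mu_r,\mu_s)$ , we obtain $\Delta \epsilon_r\geq 0$ , where $\delta(\mu_r,\mu_s)$ is the total variation between the real and synthetic distributions. The reason is that the disagreement region is a subset of $\Omega$ , so $\delta_{h_1 \oplus h_2}(\mu_r, \mu_s)\leq \delta(\mu_r,\mu_s)$ , and the theorem then gives $\Delta \epsilon_r\geq \Delta \epsilon_s-\delta(\mu_r,\mu_s)\geq 0$ .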

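In other words, as long as $\Delta \epsilon_s\geq \delta(\mu_r,\mu_s)$ , the model rankings on the real and synthetic distributions agree. That is:

If the accuracy gap between $h_1$ and $h_2$ on the synthetic distribution is at least as large as the gap between the synthetic and real distributions, then model selection using the synthetic distribution is valid.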
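As a quick worked example with hypothetical numbers (not taken from the paper), suppose $\epsilon_s(h_1)=0.10$ , $\epsilon_s(h_2)=0.25$ and $\delta(\mu_r,\mu_s)=0.10$ , and assume $\delta_{h_1 \oplus h_2}(\mu_r, \mu_s)\leq \delta(\mu_r,\mu_s)$ because the disagreement region is a subset of $\Omega$ . Then:
$\Delta \epsilon_s=\epsilon_s(h_2)-\epsilon_s(h_1)=0.25-0.10=0.15 \geq 0.10=\delta(\mu_r,\mu_s),$
$\Delta \epsilon_r \geq \Delta \epsilon_s-\delta_{h_1 \oplus h_2}(\mu_r, \mu_s) \geq \Delta \epsilon_s-\delta(\mu_r,\mu_s)=0.15-0.10=0.05 \geq 0 .$
So $h_1$ , which has the lower risk on the synthetic distribution, also has a risk no higher than that of $h_2$ on the real distribution. If the synthetic gap were only $0.05$ , the same bound would give $\Delta \epsilon_r \geq -0.05$ , which no longer fixes the sign of the real ranking.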

Synthetic Dataset Calibration

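To bring the synthetic distribution closer to the real one, the experimental section of the paper proposes a synthetic dataset calibration method. A set of models is chosen; first, their empirical losses on each class of the training data are computed, denoted $\hat{\epsilon}_r^c$ for class $c$ ; then, the models' prediction losses on each synthetic sample are collected in $\mathbf{Q}_c$ (0 for a correct prediction, 1 for an incorrect one).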

Each synthetic sample is then assigned a weight, and the sample weights are obtained by solving:
$\mathbf{w}_c=\underset{\mathbf{w}}{\operatorname{argmin}}\left\{\left\|\hat{\epsilon}_r^c-\mathbf{Q}_c{ }^T \mathbf{w}\right\|_2^2+\lambda\|\mathbf{w}\|_2^2\right\}.$
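The objective above is a ridge regression, so it has a closed-form minimizer: setting the gradient to zero gives $(\mathbf{Q}_c \mathbf{Q}_c^T+\lambda \mathbf{I})\mathbf{w}=\mathbf{Q}_c \hat{\epsilon}_r^c$ . The sketch below solves that linear system. It makes several assumptions: $\mathbf{Q}_c$ is stored with one row per synthetic sample and one column per model, the data is random and hypothetical, and no constraint (such as non-negativity) is placed on the weights. The function name `calibrate_weights` is illustrative, and the paper's actual solver may differ.

```python
import numpy as np

def calibrate_weights(Q_c, eps_r_c, lam):
    """Closed-form minimizer of ||eps_r_c - Q_c^T w||^2 + lam * ||w||^2.

    Q_c:     (N, M) array of 0/1 losses, N synthetic samples by M models (assumed layout).
    eps_r_c: (M,) per-model empirical loss on real data for class c.
    lam:     ridge regularization strength, must be > 0.
    """
    n = Q_c.shape[0]
    A = Q_c @ Q_c.T + lam * np.eye(n)   # (N, N), positive definite for lam > 0
    return np.linalg.solve(A, Q_c @ eps_r_c)

# Hypothetical data: 200 synthetic samples of one class, 8 candidate models.
rng = np.random.default_rng(2)
N, M = 200, 8
Q_c = (rng.random((N, M)) < 0.2).astype(float)
eps_r_c = rng.uniform(0.1, 0.3, size=M)

w_c = calibrate_weights(Q_c, eps_r_c, lam=1.0)
calibrated_loss = Q_c.T @ w_c           # weighted synthetic loss per model
print(np.round(eps_r_c, 3))
print(np.round(calibrated_loss, 3))
```

When there are far more synthetic samples than models, the equivalent form $\mathbf{w}=\mathbf{Q}_c(\mathbf{Q}_c^T \mathbf{Q}_c+\lambda \mathbf{I})^{-1}\hat{\epsilon}_r^c$ only requires solving an $M \times M$ system and would be cheaper.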

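A more common approach is to weight the data and then directly minimize the distribution gap between the weighted synthetic data and the training data; it is unclear how that would compare with the method above.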

Experiments

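The paper is primarily experimental, and interested readers can go straight to the original for details. A few of the main experimental results are listed here.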

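"Selecting models with synthetic data" vs. "selecting models with a validation set": the paper reports that the former works better.
When the training set is small, the error on synthetic data correlates more strongly with the error on the test set:
- The paper's analysis: a small training set makes $\Delta \epsilon_s$ larger, so $\Delta \epsilon_s\geq \delta(\mu_r,\mu_s)$ is more easily satisfied.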