


Natural language understanding comprises a wide range of diverse tasks such as textual entailment, question answering, semantic similarity assessment, and document classification. Although large unlabeled text corpora are abundant, labeled data for learning these specific tasks is scarce, making it challenging for discriminatively trained models to perform adequately. We demonstrate that large gains on these tasks can be realized by generative pre-training of a language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task. In contrast to previous approaches, we make use of task-aware input transformations during fine-tuning to achieve effective transfer while requiring minimal changes to the model architecture. We demonstrate the effectiveness of our approach on a wide range of benchmarks for natural language understanding. Our general task-agnostic model outperforms discriminatively trained models that use architectures specifically crafted for each task, significantly improving upon the state of the art in 9 out of the 12 tasks studied. For instance, we achieve absolute improvements of 8.9% on commonsense reasoning (Stories Cloze Test), 5.7% on question answering (RACE), and 1.5% on textual entailment (MultiNLI).






The ability to learn effectively from raw text is crucial to alleviating the dependence on supervised learning in natural language processing (NLP). Most deep learning methods require substantial amounts of manually labeled data, which restricts their applicability in many domains that suffer from a dearth of annotated resources [61]. In these situations, models that can leverage linguistic information from unlabeled data provide a valuable alternative to gathering more annotation, which can be time-consuming and expensive. Further, even in cases where considerable supervision is available, learning good representations in an unsupervised fashion can provide a significant performance boost. The most compelling evidence for this so far has been the extensive use of pretrained word embeddings [10, 39, 42] to improve performance on a range of NLP tasks [8, 11, 26, 45].



Leveraging more than word-level information from unlabeled text, however, is challenging for two main reasons. First, it is unclear what type of optimization objectives are most effective at learning text representations that are useful for transfer. Recent research has looked at various objectives such as language modeling [44], machine translation [38], and discourse coherence [22], with each method outperforming the others on different tasks. Second, there is no consensus on the most effective way to transfer these learned representations to the target task. Existing techniques involve a combination of making task-specific changes to the model architecture [43, 44], using intricate learning schemes [21], and adding auxiliary learning objectives [50]. These uncertainties have made it difficult to develop effective semi-supervised learning approaches for language processing.





In this paper, we explore a semi-supervised approach for language understanding tasks using a combination of unsupervised pre-training and supervised fine-tuning. Our goal is to learn a universal representation that transfers with little adaptation to a wide range of tasks. We assume access to a large corpus of unlabeled text and several datasets with manually annotated training examples (target tasks). Our setup does not require these target tasks to be in the same domain as the unlabeled corpus. We employ a two-stage training procedure. First, we use a language modeling objective on the unlabeled data to learn the initial parameters of a neural network model. Subsequently, we adapt these parameters to a target task using the corresponding supervised objective.



For our model architecture, we use the Transformer [62], which has been shown to perform strongly on various tasks such as machine translation [62], document generation [34], and syntactic parsing [29]. This model choice provides us with a more structured memory for handling long-term dependencies in text, compared to alternatives like recurrent networks, resulting in robust transfer performance across diverse tasks. During transfer, we utilize task-specific input adaptations derived from traversal-style approaches [52], which process structured text input as a single contiguous sequence of tokens. As we demonstrate in our experiments, these adaptations enable us to fine-tune effectively with minimal changes to the architecture of the pre-trained model.
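The idea of turning structured input into one contiguous token sequence can be illustrated with a small sketch. The marker strings `<s>`, `$`, and `<e>` and the function name `to_single_sequence` are hypothetical choices made for this example; the text above only states that structured input is processed as a single sequence.

```python
def to_single_sequence(segments, start="<s>", delim="$", end="<e>"):
    """Flatten several token lists into one sequence.

    The start, delimiter, and end markers are illustrative placeholders,
    not tokens taken from the source text.
    """
    tokens = [start]
    for i, segment in enumerate(segments):
        if i > 0:
            tokens.append(delim)  # separate consecutive segments
        tokens.extend(segment)
    tokens.append(end)
    return tokens


# Toy entailment-style pair: two sentences become one sequence.
premise = "a man is playing a guitar".split()
hypothesis = "a person makes music".split()
sequence = to_single_sequence([premise, hypothesis])
print(sequence)
```

Because the result is an ordinary token sequence, it can be fed to the pre-trained model without adding task-specific layers for pairing the two sentences.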





We evaluate our approach on four types of language understanding tasks: natural language inference, question answering, semantic similarity, and text classification. Our general task-agnostic model outperforms discriminatively trained models that employ architectures specifically crafted for each task, significantly improving upon the state of the art in 9 out of the 12 tasks studied. For instance, we achieve absolute improvements of 8.9% on commonsense reasoning (Stories Cloze Test) [40], 5.7% on question answering (RACE) [30], 1.5% on textual entailment (MultiNLI) [66], and 5.5% on the recently introduced GLUE multi-task benchmark [64]. We also analyze zero-shot behaviors of the pre-trained model in four different settings and demonstrate that it acquires useful linguistic knowledge for downstream tasks.



Related Work

Semi-supervised learning for NLP

Our work broadly falls under the category of semi-supervised learning for natural language. This paradigm has attracted significant interest, with applications to tasks like sequence labeling [24, 33, 57] or text classification [41, 70]. The earliest approaches used unlabeled data to compute word-level or phrase-level statistics, which were then used as features in a supervised model [33]. Over the last few years, researchers have demonstrated the benefits of using word embeddings [11, 39, 42], which are trained on unlabeled corpora, to improve performance on a variety of tasks [8, 11, 26, 45]. These approaches, however, mainly transfer word-level information, whereas we aim to capture higher-level semantics.

Recent approaches have investigated learning and utilizing more than word-level semantics from unlabeled data. Phrase-level or sentence-level embeddings, which can be trained using an unlabeled corpus, have been used to encode text into suitable vector representations for various target tasks [28, 32, 1, 36, 22, 12, 56, 31].




Unsupervised pre-training

Unsupervised pre-training is a special case of semi-supervised learning where the goal is to find a good initialization point instead of modifying the supervised learning objective. Early works explored the use of the technique in image classification [20, 49, 63] and regression tasks [3]. Subsequent research [15] demonstrated that pre-training acts as a regularization scheme, enabling better generalization in deep neural networks. In recent work, the method has been used to help train deep neural networks on various tasks like image classification [69], speech recognition [68], entity disambiguation [17] and machine translation [48].

The closest line of work to ours involves pre-training a neural network using a language modeling objective and then fine-tuning it on a target task with supervision. Dai et al. [13] and Howard and Ruder [21] follow this method to improve text classification. However, although the pre-training phase helps capture some linguistic information, their usage of LSTM models restricts their prediction ability to a short range. In contrast, our choice of transformer networks allows us to capture longer-range linguistic structure, as demonstrated in our experiments. Further, we also demonstrate the effectiveness of our model on a wider range of tasks including natural language inference, paraphrase detection, and story completion. Other approaches [43, 44, 38] use hidden representations from a pre-trained language or machine translation model as auxiliary features while training a supervised model on the target task. This involves a substantial number of new parameters for each separate target task, whereas we require minimal changes to our model architecture during transfer.




Auxiliary training objectives

Adding auxiliary unsupervised training objectives is an alternative form of semi-supervised learning. Early work by Collobert and Weston [10] used a wide variety of auxiliary NLP tasks such as POS tagging, chunking, named entity recognition, and language modeling to improve semantic role labeling. More recently, Rei [50] added an auxiliary language modeling objective to their target task objective and demonstrated performance gains on sequence labeling tasks. Our experiments also use an auxiliary objective, but as we show, unsupervised pre-training already learns several linguistic aspects relevant to target tasks.




Unsupervised pre-training

Given an unsupervised corpus of tokens $\mathcal{U} = \{u_1, \ldots, u_n\}$, we use a standard language modeling objective to maximize the following likelihood:

$$L_1(\mathcal{U}) = \sum_i \log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta) \tag{1}$$

where $k$ is the size of the context window, and the conditional probability $P$ is modeled using a neural network with parameters $\Theta$. These parameters are trained using stochastic gradient descent [51].
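As a rough illustration of a windowed language modeling objective, the sketch below sums log P(u_i | previous k tokens) over a toy corpus. It swaps the neural network for a smoothed count-based estimator purely to keep the example self-contained; the function names, the add-alpha smoothing, and the toy corpus are assumptions of this example, not part of the method described above.

```python
import math
from collections import Counter, defaultdict


def train_count_model(tokens, k):
    """Count how often each token follows each context of up to k tokens."""
    counts = defaultdict(Counter)
    for i, token in enumerate(tokens):
        context = tuple(tokens[max(0, i - k):i])
        counts[context][token] += 1
    return counts


def cond_prob(counts, context, token, vocab_size, alpha=1.0):
    """Add-alpha smoothed estimate of P(token | context) (a stand-in model)."""
    seen = counts.get(context, Counter())
    return (seen[token] + alpha) / (sum(seen.values()) + alpha * vocab_size)


def lm_log_likelihood(tokens, k, counts, vocab_size):
    """Sum over i of log P(u_i | u_{i-k}, ..., u_{i-1})."""
    total = 0.0
    for i, token in enumerate(tokens):
        context = tuple(tokens[max(0, i - k):i])
        total += math.log(cond_prob(counts, context, token, vocab_size))
    return total


corpus = "the cat sat on the mat the cat sat on the rug".split()
vocab_size = len(set(corpus))
counts = train_count_model(corpus, k=2)
fitted = lm_log_likelihood(corpus, 2, counts, vocab_size)
uniform = len(corpus) * math.log(1.0 / vocab_size)
print(fitted, uniform)
```

A model that has learned anything about the corpus should score a higher log-likelihood than the uniform baseline, which is the quantity that training tries to push upward.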

In our experiments, we use a multi-layer Transformer decoder [34] for the language model, which is a variant of the transformer [62]. This model applies a multi-headed self-attention operation over the input context tokens followed by position-wise feedforward layers to produce an output distribution over target tokens:

$$
\begin{aligned}
h_0 &= U W_e + W_p \\
h_l &= \mathrm{transformer\_block}(h_{l-1}) \quad \forall\, l \in [1, n] \\
P(u) &= \mathrm{softmax}(h_n W_e^{T})
\end{aligned}
\tag{2}
$$

where $U = (u_{-k}, \ldots, u_{-1})$ is the context vector of tokens, $n$ is the number of layers, $W_e$ is the token embedding matrix, and $W_p$ is the position embedding matrix.
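The data flow from token and position embeddings to an output distribution can be sketched in plain Python. This is only a shape-level illustration: `stand_in_block` is a hypothetical placeholder that returns its input unchanged instead of applying masked self-attention and feed-forward layers, and the tiny dimensions and random weights are made up for the example. Reusing the embedding matrix for the output projection is the choice made in this sketch.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax(row):
    top = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def stand_in_block(h):
    """Placeholder for a real transformer block; returns h unchanged."""
    return h


def forward(token_ids, W_e, W_p, n_layers):
    # Input layer: token embedding plus position embedding per position.
    h = [[e + p for e, p in zip(W_e[t], W_p[pos])]
         for pos, t in enumerate(token_ids)]
    for _ in range(n_layers):
        h = stand_in_block(h)
    # Output layer: project back onto the vocabulary with W_e transposed.
    logits = matmul(h, transpose(W_e))
    return [softmax(row) for row in logits]


random.seed(0)
vocab, d_model, max_len = 5, 4, 8
W_e = [[random.gauss(0, 0.1) for _ in range(d_model)] for _ in range(vocab)]
W_p = [[random.gauss(0, 0.1) for _ in range(d_model)] for _ in range(max_len)]
probs = forward([0, 3, 1], W_e, W_p, n_layers=2)
print(len(probs), len(probs[0]))
```

Each row of `probs` is a distribution over the vocabulary for one input position; a real model would replace `stand_in_block` with attention and feed-forward computation.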









Supervised fine-tuning

After training the model with the objective in Eq. 1, we adapt the parameters to the supervised target task. We assume a labeled dataset $\mathcal{C}$, where each instance consists of a sequence of input tokens, $x^1, \ldots, x^m$, along with a label $y$. The inputs are passed through our pre-trained model to obtain the final transformer block's activation $h_l^m$, which is then fed into an added linear output layer with parameters $W_y$ to predict $y$:

$$P(y \mid x^1, \ldots, x^m) = \mathrm{softmax}(h_l^m W_y) \tag{3}$$

This gives us the following objective to maximize:

$$L_2(\mathcal{C}) = \sum_{(x, y)} \log P(y \mid x^1, \ldots, x^m) \tag{4}$$

We additionally found that including language modeling as an auxiliary objective during fine-tuning helped learning by (a) improving generalization of the supervised model, and (b) accelerating convergence. This is in line with prior work [50, 43], which also observed improved performance with such an auxiliary objective. Specifically, we optimize the following objective (with weight λ):

$$L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \cdot L_1(\mathcal{C}) \tag{5}$$

Overall, the only extra parameters we require during fine-tuning are Wy, and embeddings for delimiter tokens (described below in Section 3.3).
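The fine-tuning objective can be sketched numerically: a linear layer plus softmax on the final activation gives the label distribution, the supervised term sums the log-probabilities of the correct labels, and the auxiliary language modeling term is added with weight λ. The toy activations, the weights in `W_y`, the placeholder value for the language modeling term, and `lam=0.5` are illustrative assumptions, not values taken from the text above.

```python
import math


def softmax(row):
    top = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def predict(h_last, W_y):
    """Label distribution: softmax of the final activation times W_y."""
    logits = [sum(h * w for h, w in zip(h_last, col)) for col in zip(*W_y)]
    return softmax(logits)


def supervised_objective(examples, W_y):
    """Sum over (activation, label) pairs of log P(label | input)."""
    return sum(math.log(predict(h, W_y)[y]) for h, y in examples)


def combined_objective(l2, l1, lam):
    """Supervised term plus lambda times the language modeling term."""
    return l2 + lam * l1


# Toy setup: 2-dimensional activations, 2 classes (all values hypothetical).
W_y = [[2.0, -2.0],
       [-2.0, 2.0]]
examples = [([1.0, 0.0], 0), ([0.0, 1.0], 1)]

l2 = supervised_objective(examples, W_y)
l1 = -3.0  # placeholder for the language modeling log-likelihood on the same data
l3 = combined_objective(l2, l1, lam=0.5)
print(l2, l3)
```

Only `W_y` is new in this sketch; everything that produces the activations would come from the pre-trained model.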







Task-specific input transformations







Model specifications

Our model largely follows the original transformer work [62]. We trained a 12-layer decoder-only transformer with masked self-attention heads (768 dimensional states and 12 attention heads). For the position-wise feed-forward networks, we used 3072 dimensional inner states.
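The stated sizes can be collected into a small configuration object to check how they relate. The class name `ModelSpec` and the weight-count formula are assumptions of this sketch: it assumes the usual layout of four `d_model × d_model` attention projections (query, key, value, output) and two feed-forward matrices per block, and it ignores biases, layer normalization, and embeddings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    n_layers: int = 12   # decoder blocks
    d_model: int = 768   # state dimension
    n_heads: int = 12    # attention heads
    d_ff: int = 3072     # feed-forward inner dimension

    @property
    def head_dim(self) -> int:
        """Per-head dimension when d_model is split evenly across heads."""
        return self.d_model // self.n_heads

    def block_weights(self) -> int:
        """Weight-matrix entries in one block (assumed layout, no biases)."""
        attention = 4 * self.d_model * self.d_model  # Q, K, V, output projections
        feed_forward = 2 * self.d_model * self.d_ff  # expand, then project back
        return attention + feed_forward

    def total_block_weights(self) -> int:
        return self.n_layers * self.block_weights()


spec = ModelSpec()
print(spec.head_dim, spec.block_weights(), spec.total_block_weights())
```

Under these assumptions each head works on 64 dimensions, and the feed-forward inner size is four times the state size.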




Zero-shot Behaviors

We would like to better understand why language model pre-training of transformers is effective. One hypothesis is that the underlying generative model learns to perform many of the tasks we evaluate on in order to improve its language modeling capability, and that the more structured attentional memory of the transformer assists in transfer compared to LSTMs. We designed a series of heuristic solutions that use the underlying generative model to perform tasks without supervised fine-tuning. We visualize the effectiveness of these heuristic solutions over the course of generative pre-training in Fig. 2 (right). We observe that the performance of these heuristics is stable and steadily increases over training, suggesting that generative pre-training supports the learning of a wide variety of task-relevant functionality. We also observe that the LSTM exhibits higher variance in its zero-shot performance, suggesting that the inductive bias of the Transformer architecture assists in transfer.




We introduced a framework for achieving strong natural language understanding with a single task-agnostic model through generative pre-training and discriminative fine-tuning. By pre-training on a diverse corpus with long stretches of contiguous text, our model acquires significant world knowledge and the ability to process long-range dependencies, which are then successfully transferred to solving discriminative tasks such as question answering, semantic similarity assessment, entailment determination, and text classification, improving the state of the art on 9 of the 12 datasets we study. Using unsupervised (pre-)training to boost performance on discriminative tasks has long been an important goal of machine learning research. Our work suggests that achieving significant performance gains is indeed possible, and offers hints as to what models (Transformers) and datasets (text with long-range dependencies) work best with this approach. We hope that this will help enable new research into unsupervised learning, for both natural language understanding and other domains, further improving our understanding of how and when unsupervised learning works.





