If you are unfamiliar with the original Transformer model, please refer to the paper: [1706.03762] Attention Is All You Need (arxiv.org)
Additionally, for a code-annotated guide to the original Transformer model, you can consult The Annotated Transformer (harvard.edu)
All models available on Hugging Face can be found here: Hugging Face – On a mission to solve NLP, one commit at a time.
The configuration details for these models are available in the documentation: Pretrained models — transformers 4.0.0 documentation (huggingface.co)
All models on Hugging Face fall into one of the following categories:
An overview:
Autoregressive models are pre-trained on the standard language modeling task: predicting the next token given all previously read tokens. In other words, the sequence is read from left to right. They correspond to the decoder of the original Transformer model. While these models can be fine-tuned to achieve excellent results on many tasks, their most natural application is text generation, since both pre-training and generation proceed from left to right. A typical example is GPT.
Autoencoding models are pre-trained by corrupting the input tokens in some way and learning to reconstruct the original sequence. They correspond to the encoder of the original Transformer model in the sense that they can see the entire input sequence at once. Although they can be fine-tuned to achieve excellent results on many tasks (including text generation), their most natural applications are sequence classification and token classification. A typical example is BERT.
Sequence-to-sequence models use both the encoder and the decoder of the original Transformer and frame NLP tasks as sequence-to-sequence problems. They can be fine-tuned for many tasks, but their most natural applications are translation, summarization, and reading comprehension. The original Transformer model is an example of this type (built for translation). A typical example is T5.
Multimodal models combine text input with other modalities, such as images, and are usually tailored to a specific task.
Retrieval-based models use document retrieval during pre-training or inference; the author is not yet familiar with them.
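As a rough illustration of how the first three categories map onto the transformers library, the sketch below loads one commonly used checkpoint per category through the corresponding Auto class (AutoModelForCausalLM, AutoModelForMaskedLM, AutoModelForSeq2SeqLM, as documented for transformers 4.0.0); the checkpoint names are only examples.

from transformers import (
    AutoModelForCausalLM,   # autoregressive (decoder-only), e.g. GPT-2
    AutoModelForMaskedLM,   # autoencoding (encoder-only), e.g. BERT
    AutoModelForSeq2SeqLM,  # sequence-to-sequence (encoder-decoder), e.g. T5
)

# Autoregressive: predicts the next token from left to right.
gpt2 = AutoModelForCausalLM.from_pretrained("gpt2")

# Autoencoding: reconstructs masked tokens using bidirectional context.
bert = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Sequence-to-sequence: encodes an input sequence, decodes an output sequence.
t5 = AutoModelForSeq2SeqLM.from_pretrained("t5-small")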
Here is a brief introduction to the typical examples of each type of model.
Autoregressive models
Original GPT:
The first autoregressive model based on the transformer architecture, pre-trained on the Book Corpus dataset.
Model Source Code Explanation: https://huggingface.co/transformers/model_doc/gpt.html
GPT-2:
A larger and improved version of GPT, pre-trained on WebText.
Paper: Language Models are Unsupervised Multitask Learners (d4mucfpksywv.cloudfront.net)
Model Source Code Explanation: OpenAI GPT2 — transformers 4.0.0 documentation (huggingface.co)
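Since text generation is the most natural use of these autoregressive models, here is a minimal usage sketch with the text-generation pipeline from the transformers library and the publicly released gpt2 checkpoint.

from transformers import pipeline

# GPT-2 generates text strictly left to right: each new token is predicted
# from the tokens that precede it.
generator = pipeline("text-generation", model="gpt2")
result = generator("Hugging Face provides pre-trained", max_length=30, do_sample=True)
print(result[0]["generated_text"])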
CTRL:
The author is not yet familiar with this model.
Paper: [1909.05858] CTRL: A Conditional Transformer Language Model for Controllable Generation (arxiv.org)
Model Source Code Explanation: CTRL — transformers 4.0.0 documentation (huggingface.co)
Transformer-XL:
Similar to a standard GPT-style model, but it introduces a recurrence mechanism across consecutive segments: the hidden states computed for the previous segment are cached and reused as extra context for the current one (conceptually like an RNN carrying state between steps), combined with relative positional encodings. Compared to the original Transformer, Transformer-XL avoids fragmenting long sequences into independent chunks and can therefore maintain context over much longer passages (see the sketch after this entry).
Paper: Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context (arxiv.org)
Model Source Code Explanation: Transformer XL - transformers 4.0.0 documentation
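A minimal sketch of the segment-level recurrence, assuming the TransfoXLLMHeadModel API from the transformers version documented here (4.0.0), which accepts cached hidden states via mems and returns updated ones; transfo-xl-wt103 is the checkpoint released with the paper.

import torch
from transformers import TransfoXLTokenizer, TransfoXLLMHeadModel

tokenizer = TransfoXLTokenizer.from_pretrained("transfo-xl-wt103")
model = TransfoXLLMHeadModel.from_pretrained("transfo-xl-wt103")
model.eval()

segments = [
    "The quick brown fox jumps over the lazy dog .",
    "It then runs into the forest and disappears .",
]

mems = None  # no cached context before the first segment
with torch.no_grad():
    for text in segments:
        input_ids = torch.tensor([tokenizer.encode(text)])
        # Hidden states of the previous segment are passed back in as `mems`,
        # so the current segment can attend to context beyond its own boundary.
        outputs = model(input_ids, mems=mems, return_dict=True)
        mems = outputs.mems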
======================================================
Autoencoding models
BERT:
The input is corrupted with random token masking, and the model is pre-trained on two objectives: predicting the masked tokens (masked language modeling) and predicting whether two sequences follow each other in the original text (next sentence prediction).
Paper: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (arxiv.org)
Model source code analysis: BERT - transformers 4.0.0 documentation
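A minimal sketch of the masked-language-modeling objective in action, using the fill-mask pipeline from the transformers library; bert-base-uncased is just one commonly used checkpoint.

from transformers import pipeline

# The fill-mask pipeline predicts the token hidden behind [MASK]
# using both the left and the right context (bidirectional attention).
unmasker = pipeline("fill-mask", model="bert-base-uncased")
predictions = unmasker("The capital of France is [MASK].")

for p in predictions:
    print(p["token_str"], p["score"])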
RoBERTa:
Similar to BERT, but with improved pre-training techniques:
# Dynamic masking: tokens are masked differently on each pass over the data, rather than once during preprocessing (see the sketch after this list).
# Drops the next-sentence-prediction (NSP) objective and trains on full-length sequences instead of truncated pairs.
# Trained on a larger dataset.
# Uses byte-level BPE, building subword units from bytes rather than Unicode characters, so any input can be encoded without unknown tokens.
Paper: RoBERTa: A Robustly Optimized BERT Pretraining Approach (arxiv.org)
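As a rough illustration of dynamic masking, the sketch below uses DataCollatorForLanguageModeling from the transformers library, which re-samples the masked positions every time a batch is built, so the same sentence gets a different mask on every pass; roberta-base is just an example checkpoint.

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

encoded = tokenizer("RoBERTa masks tokens dynamically during training.")
examples = [{"input_ids": encoded["input_ids"]}]

# Each call re-samples which positions get the mask token, so the same
# example is corrupted differently on every epoch.
for _ in range(3):
    batch = collator(examples)
    print(batch["input_ids"][0].tolist())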
DistilBERT:
A distilled version of BERT: a smaller student model trained, via knowledge distillation, to reproduce the behavior of a full BERT teacher, making it smaller and faster while retaining most of BERT's accuracy.
Paper: DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter (arxiv.org)
Model source code analysis: https://huggingface.co/transformers/master/model_doc/distilbert.html
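A minimal sketch of the soft-target distillation loss at the heart of this approach (not the authors' full training recipe, which also combines a masked-language-modeling loss and a hidden-state alignment loss): the student is trained to match the teacher's softened output distribution.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # KL divergence between the softened teacher and student distributions.
    t = temperature
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_soft_student = F.log_softmax(student_logits / t, dim=-1)
    # Scale by t**2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (t ** 2)

# Toy example with random logits over a vocabulary of 10 tokens.
student = torch.randn(4, 10)
teacher = torch.randn(4, 10)
print(distillation_loss(student, teacher).item())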
======================================================
Sequence-to-sequence models
BART:
The corrupted input is fed to the encoder, while the decoder reconstructs the original, uncorrupted tokens; like a standard Transformer decoder, it uses a causal mask during training so that it cannot attend to future tokens.
During pre-training, the encoder input is corrupted with a combination of the following transformations:
# Random token masking (as in BERT)
# Random token deletion
# Replacing a span of k contiguous tokens with a single mask token (text infilling)
# Shuffling the order of sentences
# Rotating the document so that it begins at a randomly chosen token
Model source code analysis: BART - transformers 4.0.0 documentation
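Since summarization is one of the tasks BART is best suited for, here is a minimal usage sketch with the summarization pipeline; facebook/bart-large-cnn is a widely used BART checkpoint fine-tuned on the CNN/DailyMail summarization dataset.

from transformers import pipeline

# BART encodes the whole article, then its decoder generates the summary
# token by token, left to right.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

article = (
    "The Transformer architecture has become the dominant approach in NLP. "
    "Pre-trained models such as BERT, GPT and BART are fine-tuned on "
    "downstream tasks like classification, generation and summarization."
)
print(summarizer(article, max_length=40, min_length=10)[0]["summary_text"])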
T5:
T5 (Text-To-Text Transfer Transformer) unifies all natural language processing (NLP) tasks into a single text-to-text framework. Instead of designing a different input format for each task, T5 treats every task as a text transformation problem: the input is a short task instruction combined with the source text, and the output is plain text. For a Chinese-to-English translation task, for example, the input would be structured as "translate Chinese to English: 我喜欢跑步" and the model would output the translation, "I like running". (The released T5 checkpoints were actually pre-trained on English-to-German/French/Romanian translation among other tasks, but the format is the same; see the sketch after this entry.)
Paper: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (arxiv.org)
Model source code analysis: T5 - transformers 4.0.0 documentation
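A minimal sketch of the text-to-text format, assuming the t5-small checkpoint and the AutoModelForSeq2SeqLM / generate API from the transformers library; the English-to-German prefix is one of the tasks the released checkpoints were pre-trained on.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# The task is expressed entirely in the input text: a prefix plus the source.
inputs = tokenizer("translate English to German: I like running.", return_tensors="pt")
output_ids = model.generate(**inputs, max_length=40)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))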
======================================================
Multimodal models
MMBT:
A transformer-based model designed for multimodal settings, which integrates textual and visual information to make predictions.
Paper: Supervised Multimodal Bitransformers for Classifying Images and Text (arxiv.org)
Model source code analysis: None