If you are unfamiliar with the original Transformer model, please refer to the paper: [1706.03762] Attention Is All You Need (arxiv.org)
Additionally, for a code-annotated guide to the original Transformer model, you can consult The Annotated Transformer (harvard.edu)
All models available on Hugging Face can be found here: Hugging Face – On a mission to solve NLP, one commit at a time.
The configuration details for these models are available in the documentation: Pretrained models — transformers 4.0.0 documentation (huggingface.co)
All models on Hugging Face fall into one of the following categories:
An overview:
Autoregressive models are pre-trained on the standard language modeling task: predicting the next token given all previously read tokens. In other words, the sequence is read from left to right. They correspond to the decoder of the original Transformer model. While these models can be fine-tuned to achieve excellent results on many tasks, their most natural application is text generation, since both pre-training and generation proceed from left to right. A typical example is GPT.
Autoencoding models are pre-trained by corrupting the input tokens in some way and learning to reconstruct the original sequence. They correspond to the encoder of the original Transformer model in the sense that they can see the entire input sequence at once. Although they can be fine-tuned to achieve excellent results on many tasks (including text generation), their most natural applications are sequence classification and token classification. A typical example is BERT.
Sequence-to-sequence models use both the encoder and the decoder of the original Transformer and frame NLP tasks as sequence-to-sequence problems. They can be fine-tuned for many tasks, but their most natural applications are translation, summarization, and reading comprehension. The original Transformer model is an example of this type (built for translation). A typical example is T5.
Multimodal models combine text input with other modalities, such as images, and are usually tailored to a specific task.
Retrieval-based models use document retrieval during pre-training or inference; the author is not yet familiar with them.
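As a rough illustration of how the first three categories map onto the transformers library, the sketch below loads one commonly used checkpoint per category through the corresponding Auto class (AutoModelForCausalLM, AutoModelForMaskedLM, AutoModelForSeq2SeqLM, as documented for transformers 4.0.0); the checkpoint names are only examples.

from transformers import (
    AutoModelForCausalLM,   # autoregressive (decoder-only), e.g. GPT-2
    AutoModelForMaskedLM,   # autoencoding (encoder-only), e.g. BERT
    AutoModelForSeq2SeqLM,  # sequence-to-sequence (encoder-decoder), e.g. T5
)

# Autoregressive: predicts the next token from left to right.
gpt2 = AutoModelForCausalLM.from_pretrained("gpt2")

# Autoencoding: reconstructs masked tokens using bidirectional context.
bert = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Sequence-to-sequence: encodes an input sequence, decodes an output sequence.
t5 = AutoModelForSeq2SeqLM.from_pretrained("t5-small")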
Here is a brief introduction to the typical examples of each type of model.
Autoregressive models
Original GPT:
The first autoregressive model based on the transformer architecture, pre-trained on the Book Corpus dataset.
Model Source Code Explanation: https://huggingface.co/transformers/model_doc/gpt.html
GPT-2:
A larger and improved version of GPT, pre-trained on WebText.
Paper: Language Models are Unsupervised Multitask Learners (d4mucfpksywv.cloudfront.net)
Model Source Code Explanation: OpenAI GPT2 — transformers 4.0.0 documentation (huggingface.co)
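Since text generation is the most natural use of these autoregressive models, here is a minimal usage sketch with the text-generation pipeline from the transformers library and the publicly released gpt2 checkpoint.

from transformers import pipeline

# GPT-2 generates text strictly left to right: each new token is predicted
# from the tokens that precede it.
generator = pipeline("text-generation", model="gpt2")
result = generator("Hugging Face provides pre-trained", max_length=30, do_sample=True)
print(result[0]["generated_text"])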
CTRL:
The author is not yet familiar with this model.
Paper: [1909.05858] CTRL: A Conditional Transformer Language Model for Controllable Generation (arxiv.org)
Model Source Code Explanation: CTRL — transformers 4.0.0 documentation (huggingface.co)
Transformer-XL:
Similar to a standard GPT-style model, but it introduces a recurrence mechanism across consecutive segments: the hidden states computed for the previous segment are cached and reused as extra context for the current one (conceptually like an RNN carrying state between steps), combined with relative positional encodings. Compared to the original Transformer, Transformer-XL avoids fragmenting long sequences into independent chunks and can therefore maintain context over much longer passages (see the sketch after this entry).
Paper: Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context (arxiv.org)
Model Source Code Explanation: Transformer XL - transformers 4.0.0 documentation
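A minimal sketch of the segment-level recurrence, assuming the TransfoXLLMHeadModel API from the transformers version documented here (4.0.0), which accepts cached hidden states via mems and returns updated ones; transfo-xl-wt103 is the checkpoint released with the paper.

import torch
from transformers import TransfoXLTokenizer, TransfoXLLMHeadModel

tokenizer = TransfoXLTokenizer.from_pretrained("transfo-xl-wt103")
model = TransfoXLLMHeadModel.from_pretrained("transfo-xl-wt103")
model.eval()

segments = [
    "The quick brown fox jumps over the lazy dog .",
    "It then runs into the forest and disappears .",
]

mems = None  # no cached context before the first segment
with torch.no_grad():
    for text in segments:
        input_ids = torch.tensor([tokenizer.encode(text)])
        # Hidden states of the previous segment are passed back in as `mems`,
        # so the current segment can attend to context beyond its own boundary.
        outputs = model(input_ids, mems=mems, return_dict=True)
        mems = outputs.mems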
======================================================
Autoencoding models
BERT:
The input is corrupted with random token masking, and the model is pre-trained on two objectives: predicting the masked tokens (masked language modeling) and predicting whether two sequences follow each other in the original text (next sentence prediction).
Paper: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (arxiv.org)
Model source code analysis: BERT - transformers 4.0.0 documentation
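A minimal sketch of the masked-language-modeling objective in action, using the fill-mask pipeline from the transformers library; bert-base-uncased is just one commonly used checkpoint.

from transformers import pipeline

# The fill-mask pipeline predicts the token hidden behind [MASK]
# using both the left and the right context (bidirectional attention).
unmasker = pipeline("fill-mask", model="bert-base-uncased")
predictions = unmasker("The capital of France is [MASK].")

for p in predictions:
    print(p["token_str"], p["score"])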
RoBERTa:
Similar to BERT, but with improved pre-training techniques:
# Dynamic masking: tokens are masked differently on each pass over the data, rather than once during preprocessing (see the sketch after this list).
# Drops the next-sentence-prediction (NSP) objective and trains on full-length sequences instead of truncated pairs.
# Trained on a larger dataset.
# Uses byte-level BPE, building subword units from bytes rather than Unicode characters, so any input can be encoded without unknown tokens.
Paper: RoBERTa: A Robustly Optimized BERT Pretraining Approach (arxiv.org)
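As a rough illustration of dynamic masking, the sketch below uses DataCollatorForLanguageModeling from the transformers library, which re-samples the masked positions every time a batch is built, so the same sentence gets a different mask on every pass; roberta-base is just an example checkpoint.

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

encoded = tokenizer("RoBERTa masks tokens dynamically during training.")
examples = [{"input_ids": encoded["input_ids"]}]

# Each call re-samples which positions get the mask token, so the same
# example is corrupted differently on every epoch.
for _ in range(3):
    batch = collator(examples)
    print(batch["input_ids"][0].tolist())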
DistilBERT:
A distilled version of BERT: a smaller student model trained, via knowledge distillation, to reproduce the behavior of a full BERT teacher, making it smaller and faster while retaining most of BERT's accuracy.
Paper: DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter (arxiv.org)
Model source code analysis: https://huggingface.co/transformers/master/model_doc/distilbert.html
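A minimal sketch of the soft-target distillation loss at the heart of this approach (not the authors' full training recipe, which also combines a masked-language-modeling loss and a hidden-state alignment loss): the student is trained to match the teacher's softened output distribution.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # KL divergence between the softened teacher and student distributions.
    t = temperature
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_soft_student = F.log_softmax(student_logits / t, dim=-1)
    # Scale by t**2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (t ** 2)

# Toy example with random logits over a vocabulary of 10 tokens.
student = torch.randn(4, 10)
teacher = torch.randn(4, 10)
print(distillation_loss(student, teacher).item())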
======================================================
Sequence-to-sequence models
BART:
The corrupted input is fed to the encoder, while the decoder reconstructs the original, uncorrupted tokens; like a standard Transformer decoder, it uses a causal mask during training so that it cannot attend to future tokens.
During pre-training, the encoder input is corrupted with a combination of the following transformations:
# Random token masking (as in BERT)
# Random token deletion
# Replacing a span of k contiguous tokens with a single mask token (text infilling)
# Shuffling the order of sentences
# Rotating the document so that it begins at a randomly chosen token
Model source code analysis: BART - transformers 4.0.0 documentation
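Since summarization is one of the tasks BART is best suited for, here is a minimal usage sketch with the summarization pipeline; facebook/bart-large-cnn is a widely used BART checkpoint fine-tuned on the CNN/DailyMail summarization dataset.

from transformers import pipeline

# BART encodes the whole article, then its decoder generates the summary
# token by token, left to right.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

article = (
    "The Transformer architecture has become the dominant approach in NLP. "
    "Pre-trained models such as BERT, GPT and BART are fine-tuned on "
    "downstream tasks like classification, generation and summarization."
)
print(summarizer(article, max_length=40, min_length=10)[0]["summary_text"])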
T5:
T5 (Text-To-Text Transfer Transformer) unifies all natural language processing (NLP) tasks into a single text-to-text framework. Instead of designing a different input format for each task, T5 treats every task as a text transformation problem: the input is a short task instruction combined with the source text, and the output is plain text. For a Chinese-to-English translation task, for example, the input would be structured as "translate Chinese to English: 我喜欢跑步" and the model would output the translation, "I like running". (The released T5 checkpoints were actually pre-trained on English-to-German/French/Romanian translation among other tasks, but the format is the same; see the sketch after this entry.)
Paper: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (arxiv.org)
Model source code analysis: T5 - transformers 4.0.0 documentation
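A minimal sketch of the text-to-text format, assuming the t5-small checkpoint and the AutoModelForSeq2SeqLM / generate API from the transformers library; the English-to-German prefix is one of the tasks the released checkpoints were pre-trained on.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# The task is expressed entirely in the input text: a prefix plus the source.
inputs = tokenizer("translate English to German: I like running.", return_tensors="pt")
output_ids = model.generate(**inputs, max_length=40)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))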
======================================================
Multimodal models
MMBT:
A transformer-based model designed for multimodal settings, which integrates textual and visual information to make predictions.
Paper: Supervised Multimodal Bitransformers for Classifying Images and Text (arxiv.org)
Model source code analysis: None