Using the transformers library

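transformers is a Python library developed by Hugging Face for using and training pretrained Transformer models in natural language processing (NLP) tasks. It bundles tokenizers, model classes, and configuration utilities that make it easier to process text data and build NLP models. The library is widely used for NLP tasks such as text classification, named entity recognition, question answering, and text generation.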

1. pipeline in transformers

pipeline offers a convenient way to combine the tokenizer, the model's forward pass, and post-processing into a single callable.

from transformers import pipeline

model_id = "cardiffnlp/twitter-roberta-base-sentiment-latest"

pipe = pipeline("sentiment-analysis", model=model_id, tokenizer=model_id)
pipe("You're a dumbass")

Task types available for pipeline include:

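audio-classification: audio classification
automatic-speech-recognition: automatic speech recognition
conversational: conversation
depth-estimation: depth estimation
document-question-answering: document question answering
feature-extraction: feature extraction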
…

For the full list, see https://huggingface.co/docs/transformers/main_classes/pipelines

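The example above specifies the tokenizer explicitly, but it can be omitted. A model and its tokenizer are tightly coupled, and by default pipeline selects the matching tokenizer automatically.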

from transformers import pipeline

model_id = "cardiffnlp/twitter-roberta-base-sentiment-latest"

pipe = pipeline("sentiment-analysis", model=model_id)
pipe("You're a dumbass")

Using pipeline lets us focus on the task itself instead of the details of the model, the tokenizer, and so on.
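To make the three stages concrete, here is a toy sketch in plain Python of what a pipeline chains together: a tokenizer step, a model step, and a post-processing step. Everything in it (the vocabulary, the scoring rule, and the function names) is made up for illustration and is not how transformers implements pipeline.

```python
import math

# Hypothetical toy components; none of this is transformers code.
VOCAB = {"<unk>": 0, "good": 1, "great": 2, "bad": 3, "awful": 4}
LABELS = ["negative", "positive"]


def preprocess(text):
    """Tokenizer stage: lowercase, split on whitespace, map words to ids."""
    return [VOCAB.get(word, VOCAB["<unk>"]) for word in text.lower().split()]


def forward(input_ids):
    """Model stage: return one raw score (logit) per label."""
    positive = sum(1.0 for i in input_ids if i in (1, 2))
    negative = sum(1.0 for i in input_ids if i in (3, 4))
    return [negative, positive]


def postprocess(logits):
    """Post-processing stage: softmax over the logits, then pick the best label."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": LABELS[best], "score": probs[best]}


def toy_pipeline(text):
    """Chain the three stages so the caller only passes in raw text."""
    return postprocess(forward(preprocess(text)))


result = toy_pipeline("a great movie")
print(result)
```

The caller sees only text in and a labeled score out, which is the convenience the real pipeline provides.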

2. Model classes in transformers

2.1 About Auto Classes

transformers implements a large number of model classes: BertModel for BERT, BartModel for BART, GPT2Model for GPT-2, and so on.

To spare users from having to find the right class for each model, the AutoModel class automatically selects the appropriate model class based on the model type.

AutoConfig, AutoTokenizer, and others follow the same design; together they are called Auto Classes. For details, see https://huggingface.co/docs/transformers/model_doc/auto .
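The dispatch idea behind Auto Classes can be pictured as a lookup from a model type to a concrete class. The sketch below is a simplified illustration in plain Python, assuming the choice is driven by a `model_type` field in the model's configuration; `ToyAutoModel`, `ToyBertModel`, `ToyGPT2Model`, and `MODEL_REGISTRY` are hypothetical names, not library classes.

```python
# Hypothetical stand-ins for model classes; not the real transformers classes.
class ToyBertModel:
    def __init__(self, config):
        self.config = config


class ToyGPT2Model:
    def __init__(self, config):
        self.config = config


# Registry from the "model_type" field of a config to the concrete class.
MODEL_REGISTRY = {"bert": ToyBertModel, "gpt2": ToyGPT2Model}


class ToyAutoModel:
    """Picks the concrete class from the config so the user does not have to."""

    @classmethod
    def from_config(cls, config):
        model_type = config["model_type"]
        try:
            model_class = MODEL_REGISTRY[model_type]
        except KeyError:
            raise ValueError(f"Unknown model type: {model_type!r}") from None
        return model_class(config)


model = ToyAutoModel.from_config({"model_type": "gpt2", "n_layer": 12})
print(type(model).__name__)
```

Supporting a new architecture in this scheme only means adding one registry entry; calling code stays unchanged.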

2.2 Loading a model with AutoModel

from transformers import AutoModel

model_name = "LinkSoul/Chinese-Llama-2-7b"
model = AutoModel.from_pretrained(model_name)

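However, AutoModel loads only the base model, without a task-specific head, so it is not suited to generating text with methods such as generate().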

2.3 Using the AutoModelFor classes

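The AutoModelFor classes are task-specific counterparts of AutoModel (they are separate classes rather than subclasses of it). Each one automatically selects the appropriate model class for its task and loads the corresponding configuration. They include: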

AutoModelForCausalLM: autoregressive (causal) language models
AutoModelForMaskedLM: masked language models
AutoModelForSeq2SeqLM: sequence-to-sequence models
AutoModelForQuestionAnswering: question answering models
AutoModelForTokenClassification: token classification models
AutoModelForSequenceClassification: sequence classification models
AutoModelForMultipleChoice: multiple choice models
…

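AutoModel compared with the AutoModelForXXX classes:

AutoModel provides the basic capabilities; AutoModelForXXX adds capabilities specific to a task type
AutoModel loads the base model, whose outputs are hidden states; AutoModelForXXX puts a task-specific head (for example a language-modeling or classification head) on top of it. Whether a model is an encoder, a decoder, or both depends on its architecture, not on which Auto class loads it
AutoModel is used for text encoding and feature extraction; AutoModelForXXX is used for task-specific training and inference, such as text generation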

from transformers import AutoModelForCausalLM

model_name = "LinkSoul/Chinese-Llama-2-7b"
model = AutoModelForCausalLM.from_pretrained(model_name)
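One way to picture the difference between a base model and a task-specific model is as a backbone with an extra head on top. The following is a toy sketch in plain Python with made-up classes and fixed weights (`ToyBackbone`, `ToyForSequenceClassification`); it only illustrates the shape of the outputs and is not the library's implementation.

```python
# Toy sketch; the classes and numbers are hypothetical, not transformers code.
class ToyBackbone:
    """Plays the role of the base model: token ids in, hidden states out."""

    def __call__(self, input_ids):
        # One tiny 2-dimensional "hidden state" per token.
        return [[float(i), float(i) * 0.5] for i in input_ids]


class ToyForSequenceClassification:
    """Backbone plus a task head that turns hidden states into class logits."""

    def __init__(self, backbone, num_labels=2):
        self.backbone = backbone
        # Fixed weights stand in for a trained linear layer (num_labels x 2).
        self.weights = [[0.1 * (k + 1), 0.2 * k] for k in range(num_labels)]

    def __call__(self, input_ids):
        hidden = self.backbone(input_ids)
        # Mean-pool over tokens, then apply the linear head.
        pooled = [sum(col) / len(hidden) for col in zip(*hidden)]
        return [sum(w * x for w, x in zip(row, pooled)) for row in self.weights]


backbone = ToyBackbone()
classifier = ToyForSequenceClassification(backbone, num_labels=3)

hidden_states = backbone([1, 2, 3])  # one vector per token
logits = classifier([1, 2, 3])       # one score per label
print(len(hidden_states), len(logits))
```

The backbone alone yields one vector per token, useful as features; the task model reduces those to one score per label, which is what a downstream task consumes.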

2.4 Saving the model and tokenizer

Model

save_directory = "/Volumes/Data/HuggingFace/"
model.save_pretrained(save_directory + "model")

ls /Volumes/Data/HuggingFace/model

config.json       pytorch_model.bin

Tokenizer

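save_directory = "/Volumes/Data/HuggingFace/"
# Assumes `tokenizer` has already been loaded, e.g. with AutoTokenizer.from_pretrained (see section 3).
tokenizer.save_pretrained(save_directory + "tokenizer")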

ls /Volumes/Data/HuggingFace/tokenizer

merges.txt              tokenizer.json          vocab.json
special_tokens_map.json tokenizer_config.json

3. Tokenizers in transformers

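The tokenizer's job is to convert between raw input and the format the model understands. The conversion runs in two directions:

Converting input text into a format the model can understand
Converting the model's output into a format humans can understand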
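The two directions can be sketched with a toy word-level tokenizer in plain Python. Real tokenizers in transformers work on subword vocabularies and add special tokens; `ToyTokenizer` and its tiny vocabulary are hypothetical and only show the encode/decode round trip.

```python
# Hypothetical toy tokenizer; real tokenizers use subword vocabularies.
class ToyTokenizer:
    def __init__(self, words):
        self.unk = "<unk>"
        # Id 0 is reserved for unknown words; the rest are sorted for stable ids.
        self.id_to_token = [self.unk] + sorted(set(words))
        self.token_to_id = {tok: i for i, tok in enumerate(self.id_to_token)}

    def encode(self, text):
        """Text -> ids the model can consume."""
        return [self.token_to_id.get(word, 0) for word in text.split()]

    def decode(self, ids):
        """Ids produced by the model -> readable text."""
        return " ".join(self.id_to_token[i] for i in ids)


toy_tokenizer = ToyTokenizer(["i", "love", "beijing"])
ids = toy_tokenizer.encode("i love beijing")
print(ids)
print(toy_tokenizer.decode(ids))
```

A word outside the vocabulary maps to the unknown id, which is why a tokenizer only makes sense together with the vocabulary it was built from.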

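AutoTokenizer automatically selects the appropriate tokenizer based on the model type. Note that a pretrained model and its tokenizer are meant to be used as a pair. If you use the cardiffnlp/twitter-roberta-base-sentiment-latest model, you should also use the cardiffnlp/twitter-roberta-base-sentiment-latest tokenizer; otherwise the token ids will not match the vocabulary the model was trained on, and results will be poor.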

from transformers import AutoTokenizer

model_id = "cardiffnlp/twitter-roberta-base-sentiment-latest"
tokenizer = AutoTokenizer.from_pretrained(model_id)

3.1 A single sentence

text = "我爱北京天安门"
tokenizer(text)

{'input_ids': [0, 47876, 3602, 36714, 23133, 15389, 48418, 6800, 46499, 11582, 49429, 47089, 23171, 49117, 11423, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

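3.2 Multiple sentences

Passing a list of sentences to the tokenizer returns one list of input_ids and one attention_mask per sentence. Without padding, the two sequences in the output below have different lengths (16 and 15 tokens):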

{'input_ids': [[0, 47876, 3602, 36714, 23133, 15389, 48418, 6800, 46499, 11582, 49429, 47089, 23171, 49117, 11423, 2], [0, 49429, 47089, 23171, 49117, 11423, 48827, 47983, 10278, 41907, 711, 15264, 47658, 6382, 2]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}

3.3 Tokenizer parameters

text = "我爱北京天安门"
tokenizer(text,
          padding=True,
          truncation=True,
          max_length=512)

{'input_ids': [0, 47876, 3602, 36714, 23133, 15389, 48418, 6800, 46499, 11582, 49429, 47089, 23171, 49117, 11423, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

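padding: whether to pad; if True, every sequence in a batch is padded to the length of the longest one
truncation: whether to truncate; if True, sequences longer than max_length are cut down to max_length
max_length: the maximum sequence length used for truncation (and the target length for padding when padding="max_length")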

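In the output, input_ids holds the token ids produced by tokenization, and attention_mask is the attention mask, which marks which positions are real input (1) and which are padding (0).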
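With a single sentence there is nothing to pad, so the effect of these parameters is easier to see on a batch. Below is a minimal sketch in plain Python of truncating, padding to the longest sequence, and building the matching attention mask. `pad_and_truncate` is a hypothetical helper, the pad id of 1 is an arbitrary choice, and unlike a real tokenizer this toy version does not preserve special tokens when truncating.

```python
def pad_and_truncate(batch, max_length, pad_id=1):
    """Truncate to max_length, then pad every sequence to the longest one left."""
    # Naive truncation: a real tokenizer would keep the closing special token.
    truncated = [ids[:max_length] for ids in batch]
    target = max(len(ids) for ids in truncated)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = target - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = [[0, 11, 12, 13, 14, 2], [0, 21, 22, 2]]
encoded = pad_and_truncate(batch, max_length=5)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

After this step every row has the same length, so the batch can be stacked into a single tensor, and the mask tells the model which positions to ignore.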

4. Model configuration classes in transformers

The model configuration holds the model's hyperparameters, such as the hidden size and the number of attention heads of a BERT model.

from transformers import AutoConfig

model_id = "cardiffnlp/twitter-roberta-base-sentiment-latest"
config = AutoConfig.from_pretrained(model_id)

Modifying the model configuration

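The model below has 12 attention heads; here we change the value to 8. The head count has to divide the hidden size (768 for this model) evenly, so 8 works, while a value like 11 would make model creation fail.

from transformers import AutoConfig

model_id = "cardiffnlp/twitter-roberta-base-sentiment-latest"
# Keyword arguments passed to from_pretrained override the loaded config values.
my_config = AutoConfig.from_pretrained(model_id, num_attention_heads=8)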

Creating a model from a configuration

from transformers import AutoModel

my_model = AutoModel.from_config(my_config)

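In this way you can change the model's hyperparameters and experiment with their effect. Note that a model created with from_config has randomly initialized weights; the pretrained weights are not loaded.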
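Not every edited value yields a model that can be built. In standard multi-head attention the hidden size is split evenly across the heads, so the number of heads has to divide it. The sketch below shows that check in plain Python; `head_dim` is a hypothetical helper rather than a transformers function, and the hidden size of 768 is an assumption about RoBERTa-base style models that should be confirmed against `config.hidden_size`.

```python
def head_dim(hidden_size, num_attention_heads):
    """Per-head width; the hidden size must split evenly across the heads."""
    if hidden_size % num_attention_heads != 0:
        raise ValueError(
            f"hidden size {hidden_size} is not a multiple of "
            f"{num_attention_heads} attention heads"
        )
    return hidden_size // num_attention_heads


HIDDEN_SIZE = 768  # assumed hidden size of a RoBERTa-base style model

for heads in (8, 11, 12, 16):
    try:
        print(heads, "heads ->", head_dim(HIDDEN_SIZE, heads), "dims per head")
    except ValueError as err:
        print(heads, "heads -> rejected:", err)
```

Running a check like this before calling from_config is a cheap way to catch an invalid combination early.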

5. Summary

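This post introduced pipeline, the model classes, the tokenizers, and the model configuration classes in transformers. pipeline combines the tokenizer, the model's forward pass, and post-processing into one convenient call. AutoModel selects the appropriate model class based on the model type, AutoTokenizer selects the appropriate tokenizer, and AutoConfig selects the appropriate configuration class.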