HuggingFace Model and Dataset Operations


HuggingFace provides a complete ecosystem for AI researchers and developers through services such as shared models, datasets, and hosted Spaces. This article introduces how to work with HuggingFace models and datasets.

1. Model Operations and Usage

1.1 Customize the Storage Directory

export HF_HOME=/Volumes/Data/HuggingFace

Otherwise, everything is cached under ~/.cache/huggingface by default.
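To verify where files will land, you can print the resolved cache root from Python (a minimal check; it assumes a recent huggingface_hub, which exposes this path as constants.HF_HOME):

from huggingface_hub import constants

# Should print /Volumes/Data/HuggingFace once HF_HOME is exported.
print(constants.HF_HOME)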

1.2 Downloading a Model

Method 1: download from the model page

Open https://huggingface.co/LinkSoul/Chinese-Llama-2-7b/tree/main and click the download icon next to each file in the file list.

Method 2: download with Git LFS

After installing git-lfs, run:

git lfs install

Then clone the model locally:

git clone https://huggingface.co/LinkSoul/Chinese-Llama-2-7b

Method 3: download with huggingface-hub

pip install huggingface_hub

from huggingface_hub import snapshot_download
snapshot_download(repo_id="LinkSoul/Chinese-Llama-2-7b")
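snapshot_download also takes optional arguments such as local_dir and allow_patterns to control where files go and which ones are fetched; a quick sketch (the file patterns are illustrative, adjust them to the files the repo actually contains):

from huggingface_hub import snapshot_download

# Download selected files into an explicit directory instead of the cache.
snapshot_download(
    repo_id="LinkSoul/Chinese-Llama-2-7b",
    local_dir="/Volumes/Data/HuggingFace/Chinese-Llama-2-7b",
    allow_patterns=["*.json", "*.bin"],
)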

Method 4: download on the fly when loading with transformers

pip install transformers

from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("LinkSoul/Chinese-Llama-2-7b")
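The download is cached under HF_HOME, so subsequent calls reuse it. A single call can also point at a different cache via the optional cache_dir argument:

from transformers import AutoModelForCausalLM

# Override the cache location for this call only.
model = AutoModelForCausalLM.from_pretrained(
    "LinkSoul/Chinese-Llama-2-7b",
    cache_dir="/Volumes/Data/HuggingFace",
)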

1.3 Model Operations

  • Load a model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("LinkSoul/Chinese-Llama-2-7b")
  • Save a model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("LinkSoul/Chinese-Llama-2-7b")
model.save_pretrained("/Volumes/Data/HuggingFace/Chinese-Llama-2-7b-v2")
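A directory written by save_pretrained can be passed straight back to from_pretrained, loading the local copy instead of downloading from the Hub:

from transformers import AutoModelForCausalLM

# Load from the locally saved directory.
model = AutoModelForCausalLM.from_pretrained("/Volumes/Data/HuggingFace/Chinese-Llama-2-7b-v2")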

1.4 Using a Model

  • Install dependencies
pip install transformers torch
  • Use the model
from transformers import EncoderDecoderModel, AutoTokenizer

model_id = "raynardj/wenyanwen-chinese-translate-to-ancient"
model = EncoderDecoderModel.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

def chat(text):
    # Tokenize the input, generate a translation, and decode it back to text.
    input_ids = tokenizer.encode(text, return_tensors='pt')
    output = model.generate(input_ids, max_length=40)
    return tokenizer.decode(output[0], skip_special_tokens=True)

print(chat("你好"))
汝 好
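chat() works on any string; a quick sanity check over a couple of inputs (the sample sentences are arbitrary):

# Translate a few modern Chinese sentences to Classical Chinese.
for text in ["你好", "今天天气很好"]:
    print(chat(text))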

2. Dataset Operations and Usage

2.1 Downloading a Dataset

  • Install datasets
pip install datasets
  • Download a dataset

Start IPython:

ipython

In [1]: import datasets
In [2]: remote_datasets = datasets.load_dataset("fka/awesome-chatgpt-prompts")

The dataset is downloaded to $HF_HOME/datasets. As with models, datasets can also be downloaded from the page or via Git LFS; we won't repeat those steps here.
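load_dataset also accepts optional arguments such as split and cache_dir, useful when you only need one split or want an explicit cache location:

import datasets

# Fetch only the train split, caching under an explicit directory.
train = datasets.load_dataset(
    "fka/awesome-chatgpt-prompts",
    split="train",
    cache_dir="/Volumes/Data/HuggingFace/datasets",
)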

  • Inspect the storage directory
tree -L 2 $HF_HOME/datasets

/Volumes/Data/HuggingFace/datasets
├── _Volumes_Data_HuggingFace_datasets_fka___awesome-chatgpt-prompts_default-18237255be23cc62_0.0.0_eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d.lock
├── downloads
│   ├── 7528ed6bf521cf4a58ed283bfa5ba864e12c7203ad53ea3495ba45326e30768a
│   ├── 7528ed6bf521cf4a58ed283bfa5ba864e12c7203ad53ea3495ba45326e30768a.json
│   ├── 7528ed6bf521cf4a58ed283bfa5ba864e12c7203ad53ea3495ba45326e30768a.lock
│   ├── f41fd13f9d4e803c35d9543c56b1d887676f17d84d10e3a428ad1e46bcce6c78.8fbabec58cee4e6f69e20f509619af34f2b4ed0052c2c39ca0d73a47e1035a8b
│   ├── f41fd13f9d4e803c35d9543c56b1d887676f17d84d10e3a428ad1e46bcce6c78.8fbabec58cee4e6f69e20f509619af34f2b4ed0052c2c39ca0d73a47e1035a8b.json
│   └── f41fd13f9d4e803c35d9543c56b1d887676f17d84d10e3a428ad1e46bcce6c78.8fbabec58cee4e6f69e20f509619af34f2b4ed0052c2c39ca0d73a47e1035a8b.lock
└── fka___awesome-chatgpt-prompts
    └── default-18237255be23cc62

Note that the on-disk layout is not simply fka/awesome-chatgpt-prompts, so datasets.load_from_disk("fka/awesome-chatgpt-prompts") will not work here. load_from_disk expects a directory produced by save_to_disk (see section 2.2); raw files downloaded from the page or via Git LFS should instead be loaded by format with load_dataset.
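For example, this dataset repo ships its data as a single CSV file, so a Git LFS clone can be loaded by format (the local path below is an assumption based on where you cloned the repo):

import datasets

# Load the raw CSV from a local clone of the dataset repo.
local = datasets.load_dataset("csv", data_files="awesome-chatgpt-prompts/prompts.csv")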

2.2 Dataset Operations

  • Inspect the dataset
In [3]: remote_datasets

DatasetDict({
    train: Dataset({
        features: ['act', 'prompt'],
        num_rows: 153
    })
})

You can see there are 153 rows in total, stored in two fields: act and prompt.

  • View a single row
In [4]: remote_datasets["train"][0]
Out[4]: {'act': 'Linux Terminal',
 'prompt': 'I want you to act as a linux terminal. I will type commands and you will reply with what the terminal should show. I want you to only reply with the terminal output inside one unique code block, and nothing else. do not write explanations. do not type commands unless I instruct you to do so. when i need to tell you something in english, i will do so by putting text inside curly brackets {like this}. my first command is pwd'}
  • Select a subset of rows (select(range(10)) takes the first 10 rows)
In [5]: remote_datasets["train"].select(range(10))
Out[5]:
Dataset({
    features: ['act', 'prompt'],
    num_rows: 10
})
  • Rename a column
In [5]: new_datasets = remote_datasets.rename_column("act", "actor")
In [6]: new_datasets
Out[6]:
DatasetDict({
    train: Dataset({
        features: ['actor', 'prompt'],
        num_rows: 153
    })
})
  • Filter rows with filter
In [7]: new_datasets.filter(lambda x: "Linux" in x["actor"])
Out[7]:
DatasetDict({
    train: Dataset({
        features: ['actor', 'prompt'],
        num_rows: 1
    })
})
  • Transform rows with map
In [8]: new_datasets.map(lambda x: {"actor": x["actor"].upper(), "prompt": x["prompt"]})["train"][0]
Out[8]:
{'actor': 'LINUX TERMINAL',
 'prompt': 'I want you to act as a linux terminal. I will type commands and you will reply with what the terminal should show. I want you to only reply with the terminal output inside one unique code block, and nothing else. do not write explanations. do not type commands unless I instruct you to do so. when i need to tell you something in english, i will do so by putting text inside curly brackets {like this}. my first command is pwd'}
  • Sort with sort
In [9]: new_datasets.sort("actor")["train"][0]
Out[9]:
{'actor': 'AI Assisted Doctor',
 'prompt': 'I want you to act as an AI assisted doctor. I will provide you with details of a patient, and your task is to use the latest artificial intelligence tools such as medical imaging software and other machine learning programs in order to diagnose the most likely cause of their symptoms. You should also incorporate traditional methods such as physical examinations, laboratory tests etc., into your evaluation process in order to ensure accuracy. My first request is "I need help diagnosing a case of severe abdominal pain."'}
  • Shuffle with shuffle
In [10]: new_datasets.shuffle(seed=42)["train"][0]
Out[10]:
{'actor': 'Tech Reviewer:',
 'prompt': 'I want you to act as a tech reviewer. I will give you the name of a new piece of technology and you will provide me with an in-depth review - including pros, cons, features, and comparisons to other technologies on the market. My first suggestion request is "I am reviewing iPhone 11 Pro Max".'}
  • Randomly sample rows (shuffle, then select)
In [11]: new_datasets.shuffle(seed=42)["train"].select(range(10))
Out[11]:
Dataset({
    features: ['actor', 'prompt'],
    num_rows: 10
})
  • Export the dataset
new_datasets.save_to_disk("fka_awesome-chatgpt-prompts_2")

The data is saved to fka_awesome-chatgpt-prompts_2 under the IPython working directory; see the snippet below for reloading it.
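Because that directory is produced by save_to_disk, it can be reloaded later with load_from_disk:

import datasets

# Reload the exported dataset from disk.
restored = datasets.load_from_disk("fka_awesome-chatgpt-prompts_2")
print(restored)  # the same DatasetDict as new_datasets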

