Running LLMs on a Single Machine: Hands-On Notes

With ChatGPT's explosive rise to fame, LLMs instantly became a hot direction for both investors and academia. Training and serving large models is enormously expensive, beyond what even some research institutions can afford, but that has not dampened enthusiasm: academic institutions and large companies keep publishing research plans and results, and models that can be trained cheaply, or even run on a single machine, have started to appear.

These notes record my experience running LLMs that can work on a single machine, including LLaMA and ChatGLM-6B.

LLaMA-7B

Facebook (Meta) also released LLaMA, a model I consider not yet fully mature, and promptly took first place among open-source large models. Follow-up work has sprung up like mushrooms after rain, including ports that let LLaMA run on a single machine. Here I record the hands-on experience for reference.

Downloading the LLaMA models

Download links shared around the internet; the full set is over 200 GB. Take what you need, however you can get it.

For the 7B model...

aria2c --select-file 21-23,25,26 'magnet:?xt=urn:btih:b8287ebfa04f879b048d4d4404108cf3e8014352&dn=LLaMA'

Hugging Face mirror: https://huggingface.co/nyanko7/LLaMA-7B/tree/main

For the 13B model...

aria2c --select-file 1-4,25,26 'magnet:?xt=urn:btih:b8287ebfa04f879b048d4d4404108cf3e8014352&dn=LLaMA'

For the 30B model...

aria2c --select-file 5-10,25,26 'magnet:?xt=urn:btih:b8287ebfa04f879b048d4d4404108cf3e8014352&dn=LLaMA'

For the 65B model...

aria2c --select-file 11-20,25,26 'magnet:?xt=urn:btih:b8287ebfa04f879b048d4d4404108cf3e8014352&dn=LLaMA'

And for everything...

aria2c 'magnet:?xt=urn:btih:b8287ebfa04f879b048d4d4404108cf3e8014352&dn=LLaMA'

PyTorch-format model downloads:

 7B  https://www.123pan.com/s/sKd9-bBJc.html
13B  https://www.123pan.com/s/sKd9-yJJc.html
30B  https://www.123pan.com/s/sKd9-CMJc.html
65B  https://www.123pan.com/s/sKd9-8JJc.html
Smalllint  https://www.123pan.com/s/sKd9-sIJc.html

llama.cpp

https://github.com/ggerganov/llama.cpp is a C/C++ implementation of LLaMA inference that can run on a single machine. Here I try it on a Mac with an M1 chip and 16 GB of RAM.

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
make

Under Apple clang 12 the build fails; upgrading the Command Line Tools to clang 14 fixes it:

sudo rm -rf /Library/Developer/CommandLineTools   # remove the old Command Line Tools
sudo xcode-select --install                       # reinstall to get the newer clang

The errors under clang 12 look like this:

% make -j
I llama.cpp build info:
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -pthread -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread
I LDFLAGS:   -framework Accelerate
I CC:       Apple clang version 12.0.5 (clang-1205.0.22.9)
I CXX:      Apple clang version 12.0.5 (clang-1205.0.22.9)

cc  -I.              -O3 -DNDEBUG -std=c11   -fPIC -pthread -DGGML_USE_ACCELERATE   -c ggml.c -o ggml.o
c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread -c utils.cpp -o utils.o
ggml.c:1364:25: error: implicit declaration of function 'vdotq_s32' is invalid in C99 [-Werror,-Wimplicit-function-declaration]
        int32x4_t p_0 = vdotq_s32(vdupq_n_s32(0), v0_0ls, v1_0ls);
                        ^
ggml.c:1364:19: error: initializing 'int32x4_t' (vector of 4 'int32_t' values) with an expression of incompatible type 'int'
        int32x4_t p_0 = vdotq_s32(vdupq_n_s32(0), v0_0ls, v1_0ls);
                  ^     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
ggml.c:1365:19: error: initializing 'int32x4_t' (vector of 4 'int32_t' values) with an expression of incompatible type 'int'
        int32x4_t p_1 = vdotq_s32(vdupq_n_s32(0), v0_1ls, v1_1ls);
                  ^     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
ggml.c:1367:13: error: assigning to 'int32x4_t' (vector of 4 'int32_t' values) from incompatible type 'int'
        p_0 = vdotq_s32(p_0, v0_0hs, v1_0hs);
            ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
ggml.c:1368:13: error: assigning to 'int32x4_t' (vector of 4 'int32_t' values) from incompatible type 'int'
        p_1 = vdotq_s32(p_1, v0_1hs, v1_1hs);
            ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
5 errors generated.
make: *** [ggml.o] Error 1
make: *** Waiting for unfinished jobs....

Model preprocessing

Place the 7B weights and the tokenizer under the models folder:

pip install torch numpy sentencepiece        # dependencies for the conversion script
cp -r ~/Download/LLaMA/7B ./models/
cp ~/Download/LLaMA/tokenizer.model ./models/
python convert-pth-to-ggml.py models/7B/ 1   # 1 = output the ggml file in float16
./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin 2   # 2 = q4_0 quantization type

Running the model

./main -m ./models/7B/ggml-model-q4_0.bin -t 8 -n 128 -p 'The age of sun'

Here -t sets the number of threads and -n the number of tokens to generate. The -i flag switches to interactive mode, and -r sets a reverse prompt: when the model emits that string, generation stops and control returns to the user, which is what makes chat-style sessions possible.
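
For example, running ./main -m ./models/7B/ggml-model-q4_0.bin -t 8 -n 256 -i -r "User:" -p "User: Hello" starts a chat-style session that pauses for input each time the model emits "User:" (usage follows the project README at the time of writing; exact flags may vary between llama.cpp versions).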

The model appears to have been trained mainly on English corpora; the quality of its Chinese answers is mediocre.

Aside: Whisper.cpp

The same author has also written a C/C++ implementation of Whisper, so I tested it along the way. In my tests the base model transcribes poorly, while the medium model is clearly better.

git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp
bash ./models/download-ggml-model.sh base   # fetch the ggml-format "base" model
make
./main -m models/ggml-base.bin -l auto -f test.wav   # -l auto: detect the spoken language automatically

Cross-language support is not great, and input must be 16 kHz WAV; other formats can be converted with ffmpeg:

ffmpeg -i test.m4a -ar 16k test.wav

Whisper.cpp can also use OpenAI's models downloaded from Hugging Face:

# the conversion script needs a checkout of the original openai/whisper repo
git clone https://github.com/openai/whisper

# clone an HF model (this is just an example)
git clone https://huggingface.co/openai/whisper-medium

# convert the model to ggml
python3 ./whisper.cpp/models/convert-h5-to-ggml.py ./whisper-medium/ ./whisper .

Testing live from the microphone:

make stream
./stream -m models/ggml-base.bin -t 8 --step 500 --length 5000 -l zh   # --step/--length are in milliseconds

Testing voice-command detection:

make command
./command -m models/ggml-base.bin -t 8 -ac 768 -l zh   # -ac 768 shrinks the audio context for faster response

ChatGLM-6B

Note: ChatGLM-6B does not support the Darwin platform; the experiments below were verified on Linux x86_64.

pip install protobuf==3.20.0 transformers==4.26.1 icetk cpm_kernels   # versions pinned by the ChatGLM-6B README

from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).bfloat16()
response, history = model.chat(tokenizer, "你好", history=[])
print(response)
response, history = model.chat(tokenizer, "晚上睡不着应该怎么办", history=history)
print(response)
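
The chat call above returns only the finished answer; the same remote code also exposes a streaming interface, which is nicer for interactive use. A minimal sketch, assuming the stream_chat method of the THUDM/chatglm-6b checkpoint, which yields partial responses as generation progresses:

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).bfloat16()

# each iteration yields the full response generated so far plus the updated history;
# the official web demo redraws this in place to create a typing effect
for response, history in model.stream_chat(tokenizer, "你好", history=[]):
    print(response)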

Loading the model via its Hugging Face ID "THUDM/chatglm-6b" downloads it to the local cache on first use and also pulls updates automatically when the remote repository changes. To avoid automatic updates, load it from a local path instead:

tokenizer = AutoTokenizer.from_pretrained("/home/work/.cache/huggingface/models-chatglm-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("/home/work/.cache/huggingface/models-chatglm-6b", trust_remote_code=True).bfloat16()

References

https://openai.wiki/llama-model-download.html

https://www.bilibili.com/read/cv22383652