How to get the latest tokenizer vocabulary (o200k_base) of the GPT-4o model with Python

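At last night's OpenAI Spring Update 2024 event, OpenAI announced the new GPT-4o model. What caught my attention most is that the tokenizer's vocabulary has roughly doubled in size. In theory, the same text should now encode into fewer tokens, so overall text processing should get faster and API costs should drop noticeably. That made me wonder which tokens they actually added. This post shows how to dump the contents of that vocabulary with Python.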


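First, OpenAI maintains an open-source package called tiktoken, and a new version was pushed yesterday. GPT-4 uses the cl100k_base vocabulary, while the new GPT-4o uses the brand-new o200k_base vocabulary, which is about twice as large. We can call tiktoken from Python to dump the vocabulary and see which tokens were added. What I am most curious about, though, is how many "Chinese" tokens were added, and whether any of them are Traditional Chinese.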
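To make the effect of a larger vocabulary concrete, the sketch below counts how many tokens the same text needs under each encoding. It is a minimal illustration and not part of the original project: `token_counts` and `demo` are hypothetical helper names, and the Chinese sample sentence is arbitrary. It relies only on `tiktoken.get_encoding` and `Encoding.encode`; `demo()` needs tiktoken installed and fetches the vocabulary files on first use.

```python
from typing import Callable, Dict


def token_counts(text: str, encoders: Dict[str, Callable[[str], list]]) -> Dict[str, int]:
    """Return how many tokens each named encoder produces for the same text."""
    return {name: len(encode(text)) for name, encode in encoders.items()}


def demo() -> None:
    """Compare cl100k_base and o200k_base on an arbitrary sample sentence."""
    import tiktoken  # imported here so token_counts() works without the package

    encoders = {
        name: tiktoken.get_encoding(name).encode
        for name in ("cl100k_base", "o200k_base")
    }
    sample = "今天天氣很好，我們一起去公園散步吧！"  # arbitrary sample text (assumption)
    for name, count in token_counts(sample, encoders).items():
        print(f"{name}: {count} tokens")
```

Calling `demo()` prints one line per encoding. A lower count for the same text is what would translate into lower API cost, since usage is billed per token.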

Building a small program

Here is the full process I followed to set up this Python project:

Create the project directory

mkdir gpt4o-tokenizer && cd gpt4o-tokenizer

Create a Python virtual environment

echo .venv > .gitignore

python -m venv .venv
.\.venv\Scripts\Activate.ps1

Install the required packages
```
pip install -U tiktoken langdetect
```
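Because a new tiktoken release was pushed alongside GPT-4o, an older copy that is already installed may not know `o200k_base` yet. The sketch below checks which versions are actually present, using only the standard-library `importlib.metadata`; `installed_versions` is a hypothetical helper name.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if missing."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(["tiktoken", "langdetect"]).items():
        print(f"{name}: {version or 'not installed'}")
```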

Write the main source code (main.py)

I mainly use langdetect to detect the language of each token and sort the tokens into separate files by language.

"""
This script enumerates the vocabulary of the 'o200k_base' tokenizer from the 'tiktoken' library.
It iterates over the token IDs in the tokenizer and decodes each token into a string.
Then, it uses the 'langdetect' library to detect the language of the decoded token.
The token ID, the detected language, and the token itself are printed to the console.
Additionally, the token is appended to a file named based on the detected language.
The output files are stored in the 'output' directory.
"""
import os
import sys
import codecs
import tiktoken
import langdetect

# Force UTF-8 on stdout so tokens print correctly on consoles with another default encoding.
sys.stdout = codecs.getwriter("utf-8")(sys.stdout.detach())

tokenizer = tiktoken.get_encoding("o200k_base")

output_dir = "output"
os.makedirs(output_dir, exist_ok=True)

# Walk the token IDs below the end-of-text special token.
for i in range(tokenizer.eot_token - 1):
    term: str = tokenizer.decode([i])
    lang: str = "unknown"

    try:
        lang = langdetect.detect(term)
    except langdetect.lang_detect_exception.LangDetectException:
        # Detection failed for this token; keep the "unknown" label.
        pass

    print(f"{i:06d} {lang} {term}")

    # Append mode: remove the output directory before re-running to avoid duplicate entries.
    with open(f"{output_dir}/o200k_base_{lang}.txt", "a", encoding="utf-8") as file:
        file.write(f"{i:06d} {term}\n")
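One caveat with decoding tokens one at a time: a single token is a byte sequence and does not have to be valid UTF-8 by itself, for example when a multi-byte Chinese character is split across tokens. `Encoding.decode` substitutes U+FFFD for such bytes by default, so those entries show up as replacement characters in the output. The sketch below is an alternative under stated assumptions, not the author's code: `is_complete_utf8`, `describe_token`, and `dump_vocab` are hypothetical helpers that work on the raw bytes returned by `decode_single_token_bytes`, which is assumed to raise `KeyError` for IDs that are not in the vocabulary, as its docstring states.

```python
from typing import Iterator, Tuple


def is_complete_utf8(raw: bytes) -> bool:
    """Return True if the bytes form valid UTF-8 on their own."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def describe_token(raw: bytes) -> str:
    """Render a token as text when possible, otherwise as a bytes literal."""
    return raw.decode("utf-8") if is_complete_utf8(raw) else repr(raw)


def dump_vocab(encoding_name: str = "o200k_base") -> Iterator[Tuple[int, str]]:
    """Yield (token_id, rendered_token) for every ID the encoding knows."""
    import tiktoken  # third-party; only needed once the generator is consumed

    enc = tiktoken.get_encoding(encoding_name)
    for token_id in range(enc.n_vocab):
        try:
            raw = enc.decode_single_token_bytes(token_id)
        except KeyError:
            continue  # unused ID inside the encoding's range
        yield token_id, describe_token(raw)
```

Iterating `dump_vocab()` also avoids hard-coding the upper bound of the ID range, at the cost of including special tokens in the listing.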

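If you develop Python applications in Visual Studio Code, consider installing my Python Extension Pack for AI Developers, which bundles most of the extensions you need for Python development.

Run the program and collect the full output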
```
python main.py
```

Wait for the program to finish, and you will find all the vocabulary files in the output directory!
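To turn those files into per-language totals, one option is to count the entries in each file. This is a sketch under an assumption about the file format: every entry starts a line with a six-digit token ID followed by a space, as written by main.py. Tokens that contain newlines span several lines, so counting ID prefixes is safer than counting lines. `count_entries` and `tally_directory` are hypothetical helper names.

```python
import re
from pathlib import Path
from typing import Dict

# Entry prefix as written by main.py: six digits and a space at the start of a line.
ENTRY_PREFIX = re.compile(r"^\d{6} ", re.MULTILINE)


def count_entries(path: Path) -> int:
    """Count entries in one output file by their six-digit ID prefix."""
    text = path.read_text(encoding="utf-8")
    return len(ENTRY_PREFIX.findall(text))


def tally_directory(directory: str = "output") -> Dict[str, int]:
    """Return {file name: entry count} for every .txt file, largest first."""
    counts = {p.name: count_entries(p) for p in Path(directory).glob("*.txt")}
    return dict(sorted(counts.items(), key=lambda item: item[1], reverse=True))
```

For example, `tally_directory("output")` returns a dictionary ordered from the largest bucket to the smallest.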

I have put the full results in the https://github.com/doggy8088/gpt4o-tokenizer repo, so feel free to dig through them! 😊

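Analyzing the results

Now for the interesting part: how many "Chinese" tokens are there? Still pitifully few! 😅

langdetect is not very accurate on this data, probably because individual tokens are so short. Many tokens that should be English are detected as other languages. The same happens with Chinese: many tokens that should be Traditional or Simplified Chinese end up classified as Japanese or Korean, so the numbers below may be off.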

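Traditional Chinese (zh-tw): only 55 tokens (this number is not really accurate)
Simplified Chinese (zh-cn): only 3,750 tokens (including quite a few outrageous entries)
Korean (ko): only 5,992 tokens (which actually include many Chinese characters)
Japanese (ja): only 883 tokens
English (en): 13,881 tokens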
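The Korean and Japanese buckets above contain Chinese characters, which suggests that statistical detection struggles with strings this short. One way to cross-check the buckets, offered here as a swapped-in technique rather than what the author used, is to label each token by the Unicode script of its characters with the standard-library `unicodedata` module. `script_of` and `classify_by_script` are hypothetical helpers. Han characters are reported as `han` because script alone cannot tell Chinese from Japanese kanji, and treating Han mixed with kana as Japanese is an assumption of this sketch.

```python
import unicodedata


def script_of(char: str) -> str:
    """Map one character to a coarse script label based on its Unicode name."""
    name = unicodedata.name(char, "")
    if name.startswith(("CJK UNIFIED IDEOGRAPH", "CJK COMPATIBILITY IDEOGRAPH")):
        return "han"
    if name.startswith(("HIRAGANA", "KATAKANA")):
        return "kana"
    if name.startswith("HANGUL"):
        return "hangul"
    if name.startswith("LATIN"):
        return "latin"
    return "other"


def classify_by_script(term: str) -> str:
    """Label a token by the scripts it contains, ignoring 'other' characters."""
    scripts = {script_of(ch) for ch in term} - {"other"}
    if not scripts:
        return "other"
    if scripts == {"han", "kana"}:
        return "kana"  # Han mixed with kana is treated as Japanese text here
    if len(scripts) == 1:
        return scripts.pop()
    return "mixed"
```

With this approach a token made only of Han characters can never land in a Hangul bucket, although it still says nothing about Traditional versus Simplified.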

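So stop fantasizing about how good GPT-4o's "Chinese" is. It is not going to happen; improving your English matters more right now! 😅

The vocabulary also suggests that GPT-4o's grasp of English should be far stronger than its grasp of other languages, so I think you should write your prompts in English whenever you can! 😊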

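Re-analyzing the Chinese token count with hanzidentifier

Because langdetect's language detection was not good enough, I switched to the hanzidentifier package to count the Chinese tokens. However, it still cannot precisely tell apart the vocabulary commonly used in Taiwan from that used in mainland China, so some "Simplified" terms ended up in the "Traditional" bucket.

The new version of the code: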

"""
This script enumerates the vocabulary of the 'o200k_base' tokenizer from the 'tiktoken' library.
It iterates over the token IDs in the tokenizer and decodes each token into a string.
Then, it uses the 'hanzidentifier' library to check whether the decoded token contains
Chinese characters and whether they are Simplified or Traditional.
The token ID, the resulting label, and the token itself are printed to the console.
Additionally, the token is appended to a file named based on the label.
The output files are stored in the 'o200k_base-hanzidentifier' directory.
"""
import os
import sys
import codecs
import tiktoken
import hanzidentifier

# Force UTF-8 on stdout so tokens print correctly on consoles with another default encoding.
sys.stdout = codecs.getwriter("utf-8")(sys.stdout.detach())

ENCODING_NAME = "o200k_base"
DETECT_METHOD = "hanzidentifier"

tokenizer = tiktoken.get_encoding(ENCODING_NAME)

output_dir = f"{ENCODING_NAME}-{DETECT_METHOD}"
os.makedirs(output_dir, exist_ok=True)

# Walk the token IDs below the end-of-text special token.
for i in range(tokenizer.eot_token - 1):
    term: str = tokenizer.decode([i])
    lang: str = "others"

    if hanzidentifier.has_chinese(term):
        lang = "zh"

        if hanzidentifier.is_simplified(term):
            lang = "zh-cn"

        # Checked last, so a term that passes both checks is labeled zh-tw.
        if hanzidentifier.is_traditional(term):
            lang = "zh-tw"

    print(f"{i:06d} {lang} {term}")

    # Append mode: remove the output directory before re-running to avoid duplicate entries.
    with open(f"{output_dir}/{lang}.txt", "a", encoding="utf-8") as file:
        file.write(f"{i:06d} {term}\n")

The new numbers:

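Traditional Chinese (zh-tw): only 4,413 tokens (cl100k_base has only 537 Traditional Chinese tokens) (this includes some terms commonly used in mainland China)
Simplified Chinese (zh-cn): only 3,191 tokens (cl100k_base has only 340 Simplified Chinese tokens)
Others (others): 192,391 tokens (covering all non-Chinese languages and special symbols)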
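Part of the overlap between the two Chinese buckets may come from the order of the checks rather than from regional vocabulary alone. According to the hanzidentifier documentation, `is_simplified` and `is_traditional` both return true for text whose characters are valid in both systems, and in the script above the Traditional check runs last, so such tokens are counted as zh-tw. The sketch below separates those cases. It is an illustration, not the author's method: the labels `zh-both` and `zh-mixed` and the helper names `label_from_flags` and `classify` are hypothetical.

```python
def label_from_flags(has_chinese: bool, simplified: bool, traditional: bool) -> str:
    """Turn the three hanzidentifier checks into one of five bucket labels."""
    if not has_chinese:
        return "others"
    if simplified and traditional:
        return "zh-both"   # characters shared by both writing systems
    if traditional:
        return "zh-tw"
    if simplified:
        return "zh-cn"
    return "zh-mixed"      # contains Chinese, but fits neither system alone


def classify(term: str) -> str:
    """Classify one decoded token with hanzidentifier (third-party package)."""
    import hanzidentifier  # imported lazily; requires `pip install hanzidentifier`

    return label_from_flags(
        hanzidentifier.has_chinese(term),
        hanzidentifier.is_simplified(term),
        hanzidentifier.is_traditional(term),
    )
```

Swapping `classify(term)` into the loop would write shared-character tokens to their own file, which should make the zh-tw and zh-cn totals easier to interpret.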

In short, GPT-4o still has very few Chinese tokens, but it does have far more than the cl100k_base vocabulary used by GPT-4! 👍

The Will Will Web

A record of what Will has learned and the technical notes he shares from the online world