In this tutorial, we'll learn to create a custom tokenizer using the tiktoken library. The process involves loading a pre-trained tokenizer model, defining both base and special tokens, initializing the tokenizer with a specific regular expression for token splitting, and testing its functionality by encoding and decoding some sample text. This setup is essential for NLP tasks requiring precise control over text tokenization.
from pathlib import Path
import json

import tiktoken
from tiktoken.load import load_tiktoken_bpe
Here, we import the libraries needed for text processing and tokenizer handling. We use Path from pathlib for easy file path management, while tiktoken and load_tiktoken_bpe let us load and work with a Byte Pair Encoding tokenizer.
tokenizer_path = "tokenizer.model"  # placeholder: point this at your BPE model file
num_reserved_special_tokens = 256

mergeable_ranks = load_tiktoken_bpe(tokenizer_path)
num_base_tokens = len(mergeable_ranks)
special_tokens = [
    "<|begin_of_text|>",
    "<|end_of_text|>",
    "<|reserved_special_token_0|>",
    "<|reserved_special_token_1|>",
    "<|finetune_right_pad_id|>",
    "<|step_id|>",
    "<|start_header_id|>",
    "<|end_header_id|>",
    "<|eom_id|>",
    "<|eot_id|>",
    "<|python_tag|>",
]
Here, we set the path to the tokenizer model and reserve 256 special-token slots. We then load the mergeable ranks, which form the base vocabulary, calculate the number of base tokens, and define a list of special tokens for marking text boundaries and other reserved purposes.
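To make the data shapes concrete, here is a minimal sketch with a toy vocabulary (the byte strings and ranks are illustrative, not from a real model file): load_tiktoken_bpe returns a dict mapping byte sequences to integer ranks, and the base token count is simply its length.

```python
# Toy stand-in for the dict returned by load_tiktoken_bpe: byte sequences
# mapped to integer ranks (a real vocabulary holds on the order of 100k entries).
toy_mergeable_ranks = {b"H": 0, b"e": 1, b"l": 2, b"o": 3, b"He": 4, b"llo": 5}

# The number of base tokens is just the size of that mapping.
toy_num_base_tokens = len(toy_mergeable_ranks)
print(toy_num_base_tokens)  # 6
```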
reserved_tokens = [
    f"<|reserved_special_token_{2 + i}|>"
    for i in range(num_reserved_special_tokens - len(special_tokens))
]
special_tokens = special_tokens + reserved_tokens
tokenizer = tiktoken.Encoding(
    name=Path(tokenizer_path).name,
    pat_str=r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+",
    mergeable_ranks=mergeable_ranks,
    special_tokens={token: len(mergeable_ranks) + i for i, token in enumerate(special_tokens)},
)
Now, we dynamically create additional reserved tokens until the count reaches 256, then append them to the predefined special tokens list. We initialize the tokenizer using tiktoken.Encoding with the specified regular expression for splitting text, the loaded mergeable ranks as the base vocabulary, and a mapping that assigns each special token a unique ID immediately after the base vocabulary.
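The fill-and-map step can be sketched in isolation with small numbers (the target of 8 special slots and the base vocabulary size of 6 are illustrative, standing in for 256 and the real vocabulary size):

```python
# Toy version of the reserved-token fill and ID assignment described above.
target_special = 8   # illustrative stand-in for num_reserved_special_tokens
base_size = 6        # illustrative stand-in for num_base_tokens
specials = ["<|begin_of_text|>", "<|end_of_text|>"]

# Generate reserved placeholders until the special list reaches the target.
specials += [
    f"<|reserved_special_token_{2 + i}|>"
    for i in range(target_special - len(specials))
]

# Special tokens receive IDs immediately after the base vocabulary.
special_ids = {tok: base_size + i for i, tok in enumerate(specials)}
print(len(specials))                                # 8
print(special_ids["<|begin_of_text|>"])             # 6
print(special_ids["<|reserved_special_token_7|>"])  # 13
```

The same arithmetic in the real code guarantees that special-token IDs never collide with base-vocabulary IDs.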
# Test the tokenizer with a sample text
#————————————————————————-
sample_text = "Hello, this is a test of the updated tokenizer!"
encoded = tokenizer.encode(sample_text)
decoded = tokenizer.decode(encoded)

print("Sample Text:", sample_text)
print("Encoded Tokens:", encoded)
print("Decoded Text:", decoded)
We test the tokenizer by encoding a sample text into token IDs and then decoding those IDs back into text. We print the original text, the encoded tokens, and the decoded text to confirm that the tokenizer works correctly.
Here, we encode the string "Hello" into its corresponding token IDs using the tokenizer's encode method.
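Under the hood, encode first chunks the input with the pat_str regular expression before any BPE merges apply. The real pattern relies on Unicode classes (\p{L}, \p{N}) that the stdlib re module lacks, so the sketch below uses a simplified, ASCII-only, lowercase-contractions-only approximation to show the kind of pre-splitting it performs:

```python
import re

# Simplified ASCII approximation of the split pattern passed as pat_str
# (the real one uses \p{L}/\p{N}, which need the third-party regex module).
pat = re.compile(
    r"'(?:s|t|re|ve|m|ll|d)"        # common contractions (lowercase only here)
    r"|[^\r\na-zA-Z0-9]?[a-zA-Z]+"  # words, optionally led by one non-alphanumeric char
    r"|[0-9]{1,3}"                  # digits in groups of at most three
    r"| ?[^\sa-zA-Z0-9]+[\r\n]*"    # punctuation runs
    r"|\s*[\r\n]+|\s+(?!\S)|\s+"    # whitespace handling
)
print(pat.findall("Hello, world! It's 2024."))
# ['Hello', ',', ' world', '!', ' It', "'s", ' ', '202', '4', '.']
```

Note how leading spaces stay attached to words, contractions split off, and numbers break into groups of at most three digits; these pre-split chunks are what the BPE merges then operate on.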
In conclusion, this tutorial showed how to set up a custom BPE tokenizer using the tiktoken library. You saw how to load a pre-trained tokenizer model, define both base and special tokens, and initialize the tokenizer with a specific regular expression for token splitting. Finally, you verified the tokenizer's functionality by encoding and decoding sample text. This setup is a fundamental step for any NLP project that requires customized text processing and tokenization.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among audiences.