Tokenization

Splitting text into the smallest meaningful units — usually words or characters.

Tokenization is the process of dividing text into tokens (typically words). Different tokenizers apply different rules to hyphens, contractions, URLs, and emoji, which is why word counts for the same text often differ between tools.
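
As a rough illustration, here is a minimal sketch using only Python's standard library; the sample text and the regex rules are invented for this example and are not taken from any particular tool. The two tokenizers below disagree on the count for the same sentence:

```python
import re

text = "It's a state-of-the-art demo: visit https://example.com 🙂"

# Tokenizer 1: naive whitespace split. Punctuation stays attached
# to neighboring words ("demo:" is one token).
whitespace_tokens = text.split()

# Tokenizer 2: a simple regex. Alternation order matters: the URL
# rule comes first so the whole URL is kept as a single token.
pattern = re.compile(
    r"https?://\S+"         # a URL as one token
    r"|\w+(?:['-]\w+)*"     # a word, keeping contractions and hyphens together
    r"|[^\w\s]"             # any other lone symbol: punctuation, emoji
)
regex_tokens = pattern.findall(text)

print(len(whitespace_tokens), whitespace_tokens)  # 7 tokens
print(len(regex_tokens), regex_tokens)            # 8 tokens
```

Even on this one sentence the counts disagree (7 vs. 8): the whitespace splitter keeps "demo:" as a single token, while the regex splits off the colon and the emoji as tokens of their own. Applied to a whole document, small rule differences like these compound into the diverging word counts described above.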