
Uses Regex instead of fancy-regex - 6x speedup #331

Open
wants to merge 1 commit into main

Conversation

Majdoddin

@Majdoddin Majdoddin commented Aug 5, 2024

This PR fulfills the wish expressed in the current code to use the faster Regex.

Before tokenization, the text is split into pieces according to regular-expression patterns. This PR drops a lookahead part of the pattern (the part that catches whitespace) and handles the whitespace with scripting instead, with provably identical output.
This makes it possible to use the linear-time Regex crate instead of fancy-regex, since Regex does not support lookahead, resulting in a 14x speedup of pattern matching. As pattern matching currently accounts for about 90% of the encoding runtime, the total runtime improves 6x.

Although fancy_regex delegates to Regex when the pattern has no special features, it is still about 10% slower in tests, so we use Regex directly.
This improvement applies to pattern matching of the ordinary-text parts. Special tokens are still caught with fancy_regex.
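To illustrate the equivalence claim, here is a minimal sketch in Python rather than the PR's actual Rust (Python's `re` supports lookahead, so both variants can be compared side by side). The pattern is a simplified stand-in for the real gpt2/o200k patterns, not the one used in tiktoken:

```python
import re

# Simplified GPT-2-style pattern: a word (with optional leading space),
# punctuation (with optional leading space), then the whitespace branches.
WITH_LOOKAHEAD = re.compile(r" ?\w+| ?[^\s\w]+|\s+(?!\S)|\s+")
NO_LOOKAHEAD = re.compile(r" ?\w+| ?[^\s\w]+|\s+")

def split_lookahead(text):
    """Split using the original pattern with the \\s+(?!\\S) branch."""
    return WITH_LOOKAHEAD.findall(text)

def split_scripted(text):
    """Split with the lookahead-free pattern, emulating \\s+(?!\\S) in
    script: when a whitespace run of length > 1 is followed by a
    non-whitespace character, give the last whitespace char back so it can
    become the leading space of the next piece."""
    pieces, pos = [], 0
    while pos < len(text):
        m = NO_LOOKAHEAD.match(text, pos)
        end = m.end()
        if m.group().isspace() and end < len(text) and end - pos > 1:
            end -= 1
        pieces.append(text[pos:end])
        pos = end
    return pieces
```

For example, both variants split `"\n\n0"` into `["\n", "\n", "0"]`; the scripted version never needs to backtrack a lookahead, which is what keeps the matching linear-time.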

Tests
For encoding o200k_base (used by model GPT-4o)

Text                    Number of tokens   Current runtime   PR runtime
wikitext-103 (100 MB)   22,138,325         18.94 s           4.94 s
Linux code (100 MB)     36,119,543         30.28 s           4.59 s

…a 6x speedup. To make the

regex patterns compatible with Regex, drops part of the patterns for whitespaces, and handles the
whitespaces with scripting instead of regex. Still with the exact same output. _encode_native calls
_encode_ordinary_native_impl directly (_encode_ordinary_native is a wrapper of
_encode_ordinary_native_impl now).
@Bigheem

Bigheem commented Aug 31, 2024

@bigheemseafood

tmm1 pushed a commit to anysphere/tiktoken-rs that referenced this pull request Nov 9, 2024
Based on openai#331

Uses Regex in _encode_ordinary_native instead of fancy-regex, to get a 6x speedup. To make the
regex patterns compatible with Regex, drops part of the patterns for whitespaces, and handles the
whitespaces with scripting instead of regex. Still with the exact same output. _encode_native calls
_encode_ordinary_native_impl directly (_encode_ordinary_native is a wrapper of
_encode_ordinary_native_impl now).
@tmm1

tmm1 commented Nov 9, 2024

Thanks for your work on this!

I noticed this code block, which sounds like it would need to change along with these regexes:

tiktoken/src/lib.rs

Lines 405 to 409 in 6352764

// For example, with gpt2, the use of \s+(?!\S) means that "\n\n" could
// develop a split, e.g. "\n\n0" splits into "\n"+"\n"+"0", making "\n" a possible token.
// Here is a quick and dirty fix:
// This isn't right if we ever remove \s+(?!\S)
if unstable_bytes.len() > 1 {
