```python
# Tokenizers provides ultra-fast implementations of most current tokenizers:
>>> from tokenizers import (ByteLevelBPETokenizer,
...                         BPETokenizer,
...                         SentencePieceBPETokenizer,
...                         BertWordPieceTokenizer)

# Ultra-fast => they can encode 1GB of text in ~20sec on a standard server's CPU

# Tokenizers can be easily instantiated from standard files
>>> tokenizer = BertWordPieceTokenizer("bert-base-uncased-vocab.txt", lowercase=True)
Tokenizer(vocabulary_size=30522, model=BertWordPiece, add_special_tokens=True, unk_token=[UNK],
          sep_token=[SEP], cls_token=[CLS], clean_text=True, handle_chinese_chars=True,
          strip_accents=True, lowercase=True, wordpieces_prefix=##)

# Tokenizers provide exhaustive outputs: tokens, mapping to original string, attention/special token masks.
# They also handle model's max input lengths as well as padding (to directly encode in padded batches)
>>> output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
Encoding(num_tokens=13, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing, original_str, normalized_str])

>>> print(output.ids, output.tokens, output.offsets)
[101, 7592, 1010, 1061, 1005, 2035, 999, 2129, 2024, 2017, 100, 1029, 102]
['[CLS]', 'hello', ',', 'y', "'", 'all', '!', 'how', 'are', 'you', '[UNK]', '?', '[SEP]']
[(0, 0), (0, 5), (5, 6), (7, 8), (8, 9), (9, 12), (12, 13), (14, 17), (18, 21), (22, 25), (26, 27),
 (28, 29), (0, 0)]

# Here is an example using the offsets mapping to retrieve the string corresponding to the 10th token:
>>> output.original_str[output.offsets[10]]
'😁'
```
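The comments in the snippet above mention that these tokenizers also handle a model's max input length and padding when encoding batches. A minimal sketch of what that looks like, assuming the `enable_truncation`, `enable_padding`, and `encode_batch` methods of the `tokenizers` package and the same `bert-base-uncased-vocab.txt` vocabulary file as above:

```python
from tokenizers import BertWordPieceTokenizer

# Instantiate from the same standard vocab file used above
tokenizer = BertWordPieceTokenizer("bert-base-uncased-vocab.txt", lowercase=True)

# Cap every sequence at the model's max input length and pad shorter ones
tokenizer.enable_truncation(max_length=512)
tokenizer.enable_padding(pad_token="[PAD]")

# Encode a whole batch in one call; each Encoding comes back padded to the batch length
encodings = tokenizer.encode_batch(["Hello, y'all!", "How are you 😁 ?"])
for enc in encodings:
    print(enc.tokens, enc.attention_mask)
```

The attention mask distinguishes real tokens from padding, so the encoded batch can be passed to a model directly.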
Main features of the `tokenizers` library:

- Train new vocabularies and tokenize, using today's most used tokenizers (see the training sketch after this list).
- Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
- Easy to use, but also extremely versatile.
- Designed for research and production.
- Normalization comes with alignment tracking: it's always possible to get the part of the original sentence that corresponds to a given token.
- Does all the pre-processing: truncate, pad, and add the special tokens your model needs.
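As noted in the first item above, the library can train new vocabularies from raw text in a few lines. A minimal sketch, assuming the `ByteLevelBPETokenizer` implementation shipped with the package and a hypothetical local corpus file `data.txt`:

```python
from tokenizers import ByteLevelBPETokenizer

# Train a byte-level BPE vocabulary from scratch on plain-text files
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["data.txt"],  # hypothetical training corpus
    vocab_size=30_000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# Write vocab.json / merges.txt so the trained tokenizer can be reloaded later
tokenizer.save_model(".")

print(tokenizer.encode("Hello, y'all!").tokens)
```

The other implementations imported at the top of the issue expose analogous `train` methods, so the same pattern should carry over to WordPiece and SentencePiece-style vocabularies.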
https://github.com/huggingface/tokenizers