prepare_flyvec_data()
Our tokenization works with "lines" -- that is, sequences of text that can contain multiple sentences, paragraphs, and newlines. To keep the units cohesive, we want to split these down to the sentence and word level.
line = """
Various prior work has demonstrated 100 weaknesses in these models — even highly accurate ones — including reliance on non-salient regions
or on background information only. Explanation methods help identify these pitfalls by providing explanations for model predictions, enabling humans to identify the features on which a model decision is based. However, these methods provide explanations on the image level making it challenging to understand global model behavior or dataset limitations."""
We first need to check that the line contains actual content and is not a binary string (for example, a null-byte-laden file identifier).
def is_good_line(line):
    """Check that the line is longer than one character and contains no null bytes"""
    return (len(line) > 1) and ("\x00" not in line)
is_good_line(line)
assert is_good_line(line)
assert not is_good_line("\x00\x0033-thegreatdivide.txt\x00")
assert not is_good_line("")
Split a text into sentences using a regex pattern; a sketch of such a splitter is shown below.
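The exact pattern used by line2sentences is not reproduced here; a minimal sketch, assuming sentences end with ., !, or ? followed by whitespace, could look like this:
import re

def line2sentences(line):
    """Split a line into sentences on ., !, or ? followed by whitespace."""
    line = line.replace("\n", " ")
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", line) if s.strip()]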
sentences = line2sentences(line); sentences
Once we have a sentence, we want to strip all punctuation and non-ASCII (unicode) characters
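The helpers below are only sketches of what isascii and strip_punc_unicode could look like; the actual implementations may differ in detail:
import string

def isascii(s):
    """True if every character in the string is ASCII."""
    return all(ord(c) < 128 for c in s)

def strip_punc_unicode(s):
    """Drop punctuation and any non-ASCII characters from a sentence."""
    return "".join(c for c in s if isascii(c) and c not in string.punctuation)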
assert isascii("Hello!")
assert not isascii("Ĉ")
assert isascii("")
proc_sentences = [strip_punc_unicode(s) for s in sentences]; proc_sentences
And collapse any instances of multiple consecutive spaces
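A minimal sketch of such a helper, assuming runs of whitespace should collapse to a single space:
import re

def remove_multiple_spaces(s):
    """Collapse runs of whitespace into a single space."""
    return re.sub(r"\s+", " ", s).strip()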
proc_sentences = [remove_multiple_spaces(s) for s in proc_sentences]; proc_sentences
Before we build our tokens, we define a 'number' as any ASCII token that contains a digit.
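A sketch of an isnum helper matching that definition (the real implementation may differ):
def isnum(tok):
    """True for any ASCII token that contains at least one digit."""
    return isascii(tok) and any(c.isdigit() for c in tok)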
Compiling all these steps into a single function
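The definition of process_line is not shown above; a minimal sketch that chains the helpers, assuming it returns one token list per sentence (details such as lowercasing are omitted here), could be:
def process_line(line):
    """Turn a raw line into a list of sentences, each a list of word tokens."""
    if not is_good_line(line):
        return []
    sentences = line2sentences(line)
    cleaned = [remove_multiple_spaces(strip_punc_unicode(s)) for s in sentences]
    return [s.split() for s in cleaned]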
tokens = process_line(line); print(tokens[0])
def process_tok(x, num_tok="xxNUMxx", stop_tok="xxSTOPxx", stopwords=[]):
    """Process a token by replacing numbers and stopwords with the desired special tokens"""
    if isnum(x):
        return num_tok
    elif x in stopwords:
        return stop_tok
    return x.strip()
test_eq(process_tok(" "), "")
test_eq(process_tok("abc88"), "xxNUMxx")
test_eq(process_tok("993"), "xxNUMxx")
test_eq(process_tok("the", stopwords=["the", "a", "but"]), "xxSTOPxx")
test_eq(process_tok(" lotsofspace "), "lotsofspace")
[process_tok(t, stopwords=["the", "in", "on", "or", "has"]) for t in tokens[0]]
And now we can convert an entire file to tokens (naively loading everything into memory)
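The file-level function is not reproduced here; a naive sketch, using a hypothetical process_file name and the helpers defined above, might look like this:
def process_file(fname, stopwords=[]):
    """Naively read an entire file into memory and tokenize every valid line."""
    with open(fname, encoding="utf-8", errors="ignore") as f:
        lines = f.readlines()
    sentences = []
    for line in lines:
        for sent in process_line(line):
            sentences.append([process_tok(t, stopwords=stopwords) for t in sent])
    return sentences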
The GensimTokenizer is a simple wrapper around gensim's Dictionary and Phraser classes that aligns them with our simple tokenization rules. You can use the model for converting between tokens and ids as follows:
vocab = get_model_dir() / "tokenizer/gensim1_patched.dict"
tok = GensimTokenizer.from_file(vocab)
tokens = ["apple", "pie", "is", "delicious"]
ids = tok.tokens2ids(tokens); ids
tok.ids2tokens(ids)
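Internally, the token/id conversion can be delegated to the gensim Dictionary; a stripped-down, illustrative sketch (the real GensimTokenizer also handles the Phraser and loading from file) could be:
from gensim.corpora import Dictionary

class MinimalTokenizer:
    """Illustrative only; the real GensimTokenizer wraps more functionality."""
    def __init__(self, dictionary: Dictionary):
        self.dictionary = dictionary

    def tokens2ids(self, tokens):
        # doc2idx maps out-of-vocabulary tokens to -1 by default
        return self.dictionary.doc2idx(tokens)

    def ids2tokens(self, ids):
        return [self.dictionary[i] for i in ids]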
There are several different views into the vocabulary of the model.
tok.token_vocab[:5]
tok.vocab[:5]
d = tok.dictionary;
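The underlying gensim Dictionary can also be queried directly, for example:
len(d)                        # vocabulary size
list(d.token2id.items())[:5]  # (token, id) pairs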