A simple tokenizer for concepts using Gensim
prepare_flyvec_data()

Preprocessing functions

We want tokens to represent simple concepts, so we will enforce lowercase ASCII and split primarily on spaces.

Our tokenization works with "lines" -- that is, sequences of text that can contain multiple sentences, paragraphs, and newlines. For consistency, we want to split these down to the sentence and word level.

line = """
Various prior work has demonstrated 100 weaknesses in these models — even highly accurate ones — including reliance on non-salient regions 
 or on background information only. Explanation methods help identify these pitfalls by providing explanations for model predictions, enabling humans to identify the features on which a model decision is based. However, these methods provide explanations on the image level making it challenging to understand global model behavior or dataset limitations."""

We first need to check that the line contains actual content and is not a binary string acting as a file identifier (note the null bytes in the example below).

def is_good_line(line):
    """Check if the line is valid"""
    return (len(line) > 1) and ("\x00" not in line)
is_good_line(line)
assert is_good_line(line)
assert not is_good_line("\x00\x0033-thegreatdivide.txt\x00")
assert not is_good_line("")

Split a text into sentences according to a regex pattern

line2sentences[source]

line2sentences(line)

Convert a line into sentences.

sentences = line2sentences(line); sentences
['various prior work has demonstrated 100 weaknesses in these models — even highly accurate ones — including reliance on non-salient regions   or on background information only.',
 'explanation methods help identify these pitfalls by providing explanations for model predictions, enabling humans to identify the features on which a model decision is based.',
 'however, these methods provide explanations on the image level making it challenging to understand global model behavior or dataset limitations.']
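
The implementation of line2sentences is not shown here. A minimal sketch that reproduces the output above, assuming we lowercase the line, flatten newlines, and split on sentence-ending punctuation with a regex (the actual pattern used by the library may differ), is:

import re

SENT_SPLIT_PAT = re.compile(r"(?<=[.!?])\s+")  # assumed pattern: split after ., !, or ?

def line2sentences(line):
    """Lowercase a line, flatten newlines, and split it into sentences"""
    flat = line.strip().lower().replace("\n", " ")
    return [s.strip() for s in SENT_SPLIT_PAT.split(flat) if s.strip()]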

Once we have a sentence, we want to strip all punctuation and non-ASCII (unicode) characters

isascii[source]

isascii(s:str)

Determine if s is an entirely ASCII string. Used for backward compatibility with Python < 3.7, where str.isascii is not available

assert isascii("Hello!")
assert not isascii("Ĉ")
assert isascii("")
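
On Python 3.7+ this could simply delegate to str.isascii; a sketch of the fallback behaviour (the real implementation may differ) is:

def isascii(s: str):
    """Determine if s is an entirely ASCII string (fallback for Python < 3.7)"""
    try:
        s.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False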

strip_punc_unicode[source]

strip_punc_unicode(line)

Strip all punctuation and non-ASCII characters from the line

proc_sentences = [strip_punc_unicode(s) for s in sentences]; proc_sentences
['various prior work has demonstrated 100 weaknesses in these models  even highly accurate ones  including reliance on nonsalient regions   or on background information only',
 'explanation methods help identify these pitfalls by providing explanations for model predictions enabling humans to identify the features on which a model decision is based',
 'however these methods provide explanations on the image level making it challenging to understand global model behavior or dataset limitations']
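
A sketch of strip_punc_unicode that is consistent with the output above, assuming it drops string.punctuation characters and any non-ASCII character (reusing isascii from above), is:

import string

def strip_punc_unicode(line):
    """Strip all punctuation and non-ASCII characters from the line"""
    no_punc = line.translate(str.maketrans("", "", string.punctuation))
    return "".join(ch for ch in no_punc if isascii(ch))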

And collapse any runs of multiple spaces into a single space

remove_multiple_spaces[source]

remove_multiple_spaces(sentence)

proc_sentences = [remove_multiple_spaces(s) for s in proc_sentences]; proc_sentences
['various prior work has demonstrated 100 weaknesses in these models even highly accurate ones including reliance on nonsalient regions or on background information only',
 'explanation methods help identify these pitfalls by providing explanations for model predictions enabling humans to identify the features on which a model decision is based',
 'however these methods provide explanations on the image level making it challenging to understand global model behavior or dataset limitations']
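
A minimal sketch of remove_multiple_spaces, assuming it simply collapses whitespace runs with a regex:

import re

def remove_multiple_spaces(sentence):
    """Collapse runs of whitespace into a single space"""
    return re.sub(r"\s+", " ", sentence).strip()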

Before producing our tokens, we define a 'number' as any ASCII token that contains a digit

isnum[source]

isnum(token)
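
Following that definition, a possible sketch of isnum (the real implementation may differ) is:

def isnum(token):
    """Return True if token is an ASCII token containing at least one digit"""
    return isascii(token) and any(ch.isdigit() for ch in token)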

Compiling all these steps into a single function

process_line[source]

process_line(line)

Compose all transformations to process a line into tokens as desired

tokens = process_line(line); print(tokens[0])
['various', 'prior', 'work', 'has', 'demonstrated', '<NUM>', 'weaknesses', 'in', 'these', 'models', 'even', 'highly', 'accurate', 'ones', 'including', 'reliance', 'on', 'nonsalient', 'regions', 'or', 'on', 'background', 'information', 'only']
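A sketch of how process_line could compose the helpers above; the num_tok default used here mirrors the <NUM> placeholder visible in the output, but the actual composition may differ:

def process_line(line, num_tok="<NUM>"):
    """Split a line into cleaned sentences of space-separated tokens,
    replacing number-like tokens with num_tok"""
    if not is_good_line(line):
        return []
    sentences = [remove_multiple_spaces(strip_punc_unicode(s)) for s in line2sentences(line)]
    return [[num_tok if isnum(t) else t for t in s.split()] for s in sentences]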
def process_tok(x, num_tok="xxNUMxx", stop_tok="xxSTOPxx", stopwords=[]):
    """Process a token by replacing numbers and stop tokens with the desired special tokens"""
    if isnum(x):
        return num_tok
    elif x in stopwords:
        return stop_tok
    return x.strip()
test_eq(process_tok(" "), "")
test_eq(process_tok("abc88"), "xxNUMxx")
test_eq(process_tok("993"), "xxNUMxx")
test_eq(process_tok("the", stopwords=["the", "a", "but"]), "xxSTOPxx")
test_eq(process_tok("   lotsofspace "), "lotsofspace")
[process_tok(t, stopwords=["the", "in", "on", "or", "has"]) for t in tokens[0]]
['various',
 'prior',
 'work',
 'xxSTOPxx',
 'demonstrated',
 '<NUM>',
 'weaknesses',
 'xxSTOPxx',
 'these',
 'models',
 'even',
 'highly',
 'accurate',
 'ones',
 'including',
 'reliance',
 'xxSTOPxx',
 'nonsalient',
 'regions',
 'xxSTOPxx',
 'xxSTOPxx',
 'background',
 'information',
 'only']

And now we can convert an entire file to tokens (naively loading everything into memory)

file2tokens[source]

file2tokens(fname)

Convert a file of text into tokenized sentences
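
A sketch of file2tokens, assuming the whole file is read into memory and each valid line is handed to process_line:

def file2tokens(fname):
    """Convert a file of text into tokenized sentences (everything held in memory)"""
    with open(fname, encoding="utf-8", errors="ignore") as f:
        return [sent for line in f.readlines()
                if is_good_line(line)
                for sent in process_line(line)]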

The Tokenizer

Collecting all the helper functions under a single class

class GensimTokenizer[source]

GensimTokenizer(dictionary, phraser=None, patch_dict={'<UNK>': 0, '<NUM>': 1})

The GensimTokenizer is a simple wrapper around gensim's Dictionary and Phraser classes that aligns them with our simple tokenization rules. You can use the model for converting between tokens and ids as follows:

vocab = get_model_dir() / "tokenizer/gensim1_patched.dict"
tok = GensimTokenizer.from_file(vocab)
tokens = ["apple", "pie", "is", "delicious"]
ids = tok.tokens2ids(tokens); ids
[2563, 17862, 17, 8073]
tok.ids2tokens(ids)
['apple', 'pie', 'is', 'delicious']
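
The id conversions above can be understood as thin wrappers around gensim's Dictionary. A hedged sketch of that core follows; loading the saved dictionary with Dictionary.load is an assumption, and the real class carries more functionality (for example the optional Phraser and get_model_dir shown above):

from gensim.corpora import Dictionary

class GensimTokenizer:
    """Simple wrapper around a gensim Dictionary (and optional Phraser)"""
    def __init__(self, dictionary, phraser=None, patch_dict={'<UNK>': 0, '<NUM>': 1}):
        self.dictionary, self.phraser, self.patch_dict = dictionary, phraser, patch_dict

    @classmethod
    def from_file(cls, fname):
        return cls(Dictionary.load(str(fname)))

    def tokens2ids(self, tokens):
        # Unknown tokens fall back to the patched <UNK> id
        unk = self.patch_dict['<UNK>']
        return [self.dictionary.token2id.get(t, unk) for t in tokens]

    def ids2tokens(self, ids):
        return [self.dictionary[i] for i in ids]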

There are several different views into the vocabulary of the model.

tok.token_vocab[:5]
['properties', 'a', 'among', 'and', 'any']
tok.vocab[:5]
[2, 3, 4, 5, 6]
d = tok.dictionary;
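
Since tok.dictionary is a plain gensim Dictionary, its standard API is also available directly, for example:

d.token2id["apple"]                           # raw token -> id lookup
d.doc2bow("apple pie is delicious".split())   # (id, count) bag-of-words pairs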