Functions that use a pretrained FlyVec model to create sparse binary representations

Helpers

softmax[source]

softmax(x:array, beta=1.0)

Take the softmax of the 1-D vector x with inverse temperature beta. Returns a vector of the same length as x.
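A minimal NumPy sketch of such a temperature-scaled softmax (the library's actual implementation may differ in details such as numerical-stability handling):

```python
import numpy as np

def softmax(x: np.ndarray, beta: float = 1.0) -> np.ndarray:
    """Softmax of 1-D vector x with inverse temperature beta."""
    z = beta * x
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()       # normalize so the output sums to 1

probs = softmax(np.array([1.0, 2.0, 3.0]), beta=1.0)
```

Larger beta sharpens the distribution toward the argmax; beta=0 yields the uniform distribution.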

normalize_synapses[source]

normalize_synapses(syn:array, prec=1e-32, p=2)

Normalize the synapses

Args:
    syn: The matrix of learned synapses
    prec: Small constant to prevent division by zero
    p: The order of the p-norm

Returns: Normalized array of the given synapses
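The normalization can be sketched as follows; this assumes row-wise p-norm normalization with `prec` added to the denominator as a guard, which may differ from the library's exact formulation:

```python
import numpy as np

def normalize_synapses(syn: np.ndarray, prec: float = 1e-32, p: int = 2) -> np.ndarray:
    """Divide each row of syn by its p-norm; prec prevents division by zero."""
    norms = np.linalg.norm(syn, ord=p, axis=1, keepdims=True)
    return syn / (norms + prec)

W = normalize_synapses(np.array([[3.0, 4.0], [0.0, 0.0]]))
```

An all-zero row stays all-zero instead of producing NaNs, thanks to the `prec` term.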

class FlyVec[source]

FlyVec(synapse_file:Union[Path, str], tokenizer_file:Union[Path, str], stopword_file:Union[Path, str, NoneType]=None, phrases_file:Union[Path, str, NoneType]=None, normalize_synapses:bool=True)

A class wrapper around a tokenizer, stop words, and synapse weights for hashing words

Simply run FlyVec.load() to download the pretrained model and use it as desired:

model = FlyVec.load()
hsh = model.get_sparse_embedding("hello"); hsh
{'token': 'hello',
 'id': 5483,
 'embedding': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
        0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0,
        0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0,
        0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0,
        0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1,
        0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0,
        0, 0, 0, 1], dtype=int8)}

hsh['embedding'] is non-zero for the top hash_length most activated neurons in our model
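The hashing step can be illustrated in plain NumPy: keep only the hash_length largest activations and set those positions to 1. This is a sketch of the idea, not FlyVec's internal code; `hash_length=50` mirrors the value passed explicitly in the tests below.

```python
import numpy as np

def top_k_hash(activations: np.ndarray, hash_length: int = 50) -> np.ndarray:
    """Binary vector with 1s at the positions of the hash_length largest activations."""
    hashed = np.zeros(activations.shape[0], dtype=np.int8)
    top = np.argsort(activations)[-hash_length:]  # indices of the largest activations
    hashed[top] = 1
    return hashed

h = top_k_hash(np.random.rand(400), hash_length=50)
```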

If you provide multiple words in the input string, FlyVec will provide the word vector for the first word only:

hsh2 = model.get_sparse_embedding("hello world");
assert np.all(hsh2['embedding'] == hsh['embedding'])
_f = lambda x: model.get_sparse_embedding(x)
test_eq(_f("hello")['embedding'], model.get_sparse_embedding("hello", 50)['embedding'])
test_eq(_f("hello")['token'], "hello")
assert np.all(_f("BOXNAFS")['embedding'] == 0), "Expected unknown embedding to be all zero"
test_eq(_f("HELLO")['embedding'], _f("hello")['embedding'])
test_eq(_f("not a single token")['embedding'], _f("not")['embedding'])
test_fail(lambda: _f(""), contains="empty string")
test_eq(_f("NotARealWord")['embedding'], _f("<UNK>")['embedding'])