how to get embeddings
1. tokenizer
we first need to split the text into tokens before the model can embed it (and before we can store those embeddings in a vector DB)
for example for a sentence like
I like playing counter strike
first we need a tokenizer (in this case BertTokenizer)
its dictionary (vocabulary) is built with the WordPiece algorithm
for our sentence the dictionary contains
[I, like, playing, counter, strike]
every word that appears in the dictionary is used directly; otherwise we split it into even smaller parts (subwords)
for
unlike
we generate something like
["un", "##like"]
(the "##" prefix marks a piece that continues the previous one)
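The splitting above can be sketched as greedy longest-match-first WordPiece tokenization. This is a minimal illustration with a tiny made-up vocabulary, not the real BERT vocab (which has ~30k entries):

```python
# Tiny hypothetical vocabulary; real WordPiece vocabs are learned from a corpus.
VOCAB = {"i", "like", "playing", "counter", "strike", "un", "##like"}

def wordpiece_tokenize(word, vocab=VOCAB):
    """Split a single word into subword pieces via greedy longest-match."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        cur = None
        while start < end:
            piece = word[start:end]
            if start > 0:            # continuation pieces carry the '##' prefix
                piece = "##" + piece
            if piece in vocab:       # take the longest piece found in the vocab
                cur = piece
                break
            end -= 1
        if cur is None:              # no piece matched: word is unknown
            return ["[UNK]"]
        pieces.append(cur)
        start = end
    return pieces

print(wordpiece_tokenize("unlike"))   # ['un', '##like']
print(wordpiece_tokenize("like"))     # ['like']
```

"unlike" itself is not in this toy vocab, so it falls back to the pieces "un" + "##like"; a word with no matching pieces at all maps to [UNK].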
we also have some special tokens
[[CLS], [SEP], [PAD]]
then we map each token to its ID in the dictionary
we also build an attention mask of 0/1 values that indicates whether each position is a [PAD] (0) or a real token (1)
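The ID mapping and mask can be sketched like this. The vocab and the ID values are made up for illustration (only [PAD]=0 and the [CLS]/[SEP] IDs 101/102 match real BERT conventions):

```python
# Hypothetical token-to-ID dictionary; a real BertTokenizer loads ~30k entries.
VOCAB = {"[PAD]": 0, "[CLS]": 101, "[SEP]": 102,
         "i": 1, "like": 2, "playing": 3, "counter": 4, "strike": 5}

def encode(tokens, max_len=8, vocab=VOCAB):
    """Wrap tokens in [CLS]/[SEP], map to IDs, pad, and build the 0/1 mask."""
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    ids = [vocab[t] for t in tokens]
    mask = [1] * len(ids)              # 1 = real token
    while len(ids) < max_len:
        ids.append(vocab["[PAD]"])     # 0 = padding
        mask.append(0)
    return ids, mask

ids, mask = encode(["i", "like", "playing", "counter", "strike"])
print(ids)   # [101, 1, 2, 3, 4, 5, 102, 0]
print(mask)  # [1, 1, 1, 1, 1, 1, 1, 0]
```

The mask lets the model ignore the [PAD] positions when computing attention over a batch of padded sentences.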