how to get embeddings
1. tokenizer
we first need to split the text into tokens before the model can embed it (and before we can store those embeddings in a vector DB)
for example for a sentence like
I like playing counter strike
first we need a tokenizer (in this case BertTokenizer)
its dictionary (vocabulary) is built with the WordPiece algorithm
for our sentence the dictionary contains
[I, like, playing, counter, strike]
every word that appears in the dictionary is used directly; otherwise we split it into even smaller parts (subwords)
for
unlike
we generate something like
["un", "##like"]
(the "##" prefix marks a piece that continues the previous one)
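The splitting above can be sketched as greedy longest-match-first WordPiece tokenization. This is a minimal illustration with a tiny made-up vocabulary, not the real BERT vocab (which has ~30k entries):

```python
# Tiny hypothetical vocabulary; real WordPiece vocabs are learned from a corpus.
VOCAB = {"i", "like", "playing", "counter", "strike", "un", "##like"}

def wordpiece_tokenize(word, vocab=VOCAB):
    """Split a single word into subword pieces via greedy longest-match."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        cur = None
        while start < end:
            piece = word[start:end]
            if start > 0:            # continuation pieces carry the '##' prefix
                piece = "##" + piece
            if piece in vocab:       # take the longest piece found in the vocab
                cur = piece
                break
            end -= 1
        if cur is None:              # no piece matched: word is unknown
            return ["[UNK]"]
        pieces.append(cur)
        start = end
    return pieces

print(wordpiece_tokenize("unlike"))   # ['un', '##like']
print(wordpiece_tokenize("like"))     # ['like']
```

"unlike" itself is not in this toy vocab, so it falls back to the pieces "un" + "##like"; a word with no matching pieces at all maps to [UNK].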
we also have some special tokens
[[CLS], [SEP], [PAD]]
then we map each token to its ID in the dictionary
we also build an attention mask of 0/1 values that indicates whether each position is a [PAD] (0) or a real token (1)
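The ID mapping and mask can be sketched like this. The vocab and the ID values are made up for illustration (only [PAD]=0 and the [CLS]/[SEP] IDs 101/102 match real BERT conventions):

```python
# Hypothetical token-to-ID dictionary; a real BertTokenizer loads ~30k entries.
VOCAB = {"[PAD]": 0, "[CLS]": 101, "[SEP]": 102,
         "i": 1, "like": 2, "playing": 3, "counter": 4, "strike": 5}

def encode(tokens, max_len=8, vocab=VOCAB):
    """Wrap tokens in [CLS]/[SEP], map to IDs, pad, and build the 0/1 mask."""
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    ids = [vocab[t] for t in tokens]
    mask = [1] * len(ids)              # 1 = real token
    while len(ids) < max_len:
        ids.append(vocab["[PAD]"])     # 0 = padding
        mask.append(0)
    return ids, mask

ids, mask = encode(["i", "like", "playing", "counter", "strike"])
print(ids)   # [101, 1, 2, 3, 4, 5, 102, 0]
print(mask)  # [1, 1, 1, 1, 1, 1, 1, 0]
```

The mask lets the model ignore the [PAD] positions when computing attention over a batch of padded sentences.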