Vocab vectors using complete pretrained-embedding? · pytorch/text#446

6 comments · 0 reactions · 0 assignees · Python · 822 forks

enhancement, help wanted

Repository metrics

Stars: 3,396
PR merge metrics: No merged PRs in 30d

Description

I am new to PyTorch and NLP, and I ran into a question while building a model.

Since my training dataset is not very big, its vocab is relatively small (around 5,000 tokens). However, I want to handle arbitrary user input, which may contain words outside this vocabulary.

The problem is that, in the model I trained, the embedding layer's weight is built from the field's vocab vectors, not from the whole set of word2vec pretrained embeddings. So I cannot modify it after training is done.

Is there a better approach to this? Thanks in advance!
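One possible direction, sketched here in plain PyTorch rather than through any torchtext API: after training, append rows to the embedding matrix for tokens that the pretrained table knows but the training vocab does not. The helper name `extend_embedding`, the `stoi` dict (standing in for the vocab's token-to-index mapping) and the `pretrained` dict (standing in for a loaded word2vec table) are all assumptions made for illustration.

```python
import torch
import torch.nn as nn


def extend_embedding(embedding, stoi, pretrained):
    """Append rows for tokens found in `pretrained` but missing from `stoi`.

    embedding:  trained nn.Embedding whose rows follow the indices in `stoi`
    stoi:       dict mapping token -> row index (hypothetical stand-in for the vocab)
    pretrained: dict mapping token -> 1-D tensor of size embedding_dim
    """
    new_stoi = dict(stoi)
    new_rows = []
    for token, vector in pretrained.items():
        if token not in new_stoi:
            # New tokens get the next free index after the trained rows.
            new_stoi[token] = embedding.num_embeddings + len(new_rows)
            new_rows.append(vector)
    if not new_rows:
        return embedding, new_stoi
    # Trained rows keep their indices; pretrained rows are stacked underneath.
    weight = torch.cat([embedding.weight.detach(), torch.stack(new_rows)], dim=0)
    extended = nn.Embedding.from_pretrained(
        weight, freeze=True, padding_idx=embedding.padding_idx
    )
    return extended, new_stoi


# Toy data standing in for a real vocabulary and a word2vec table.
stoi = {"<unk>": 0, "cat": 1, "dog": 2}
trained = nn.Embedding(len(stoi), 4)
pretrained = {"dog": torch.ones(4), "bird": torch.full((4,), 2.0)}

extended, new_stoi = extend_embedding(trained, stoi, pretrained)
```

The extended layer can then replace the model's embedding layer, with `new_stoi` used to numericalize user input. One caveat: if the embedding weights were updated during training, the trained rows may have drifted away from the pretrained space, so the appended vectors are only directly comparable when the embedding was kept frozen.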

Contributor guide

Research direction: Investigate how PyTorch's torchtext handles vocabulary and embedding layers. Look into existing discussion on this issue (e.g., comments on GitHub issue #446) and evaluate if a utility function for loading full pretrained embeddings is feasible. Consider adding a method to allow users to specify a pretrained embedding source that maps tokens not in the training vocab to their vectors from a larger set.
Tech stack: python
Domain: machine learning
Issue type: Feature
Difficulty: 3
Estimated time: Half day
Activity status: Stale
Clarity: Clear
Prerequisites: PyTorch basics, word embeddings
Newbie friendliness: 50
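To make the research direction above more concrete, here is a rough sketch of the lookup behaviour such a utility might offer: use the trained row when the token is in the training vocab, fall back to a vector from the larger pretrained set when it is not, and use the unknown-token row otherwise. This is not an existing torchtext function; `lookup_with_fallback`, `stoi`, `pretrained` and the `<unk>` token name are hypothetical and chosen only for illustration.

```python
import torch
import torch.nn as nn


def lookup_with_fallback(tokens, stoi, embedding, pretrained, unk_token="<unk>"):
    """Return a (len(tokens), dim) tensor of vectors for `tokens`.

    stoi:       dict mapping token -> row index in `embedding` (training vocab)
    pretrained: dict mapping token -> 1-D tensor of size embedding_dim
    """
    rows = []
    for token in tokens:
        if token in stoi:
            # In-vocabulary: use the row learned (or loaded) during training.
            rows.append(embedding.weight[stoi[token]].detach())
        elif token in pretrained:
            # Out-of-vocabulary but covered by the larger pretrained set.
            rows.append(pretrained[token])
        else:
            # Unknown everywhere: fall back to the <unk> row.
            rows.append(embedding.weight[stoi[unk_token]].detach())
    if not rows:
        return embedding.weight.new_zeros((0, embedding.embedding_dim))
    return torch.stack(rows)


# Toy data standing in for a real vocabulary and pretrained table.
stoi = {"<unk>": 0, "cat": 1}
embedding = nn.Embedding(len(stoi), 3)
pretrained = {"bird": torch.full((3,), 2.0)}

vectors = lookup_with_fallback(["cat", "bird", "zzz"], stoi, embedding, pretrained)
```

The returned tensor could be fed to the layers that normally consume the embedding output. Unlike growing the embedding matrix, this keeps the trained layer untouched, at the cost of a Python-level loop per token.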
