ValueError: Cannot Reshape Array Of Size 3800 Into Shape (1,200)
Solution 1:
It looks like the intent of your word_vector() method is to take a list of words and, with respect to a given Word2Vec model, return the average of all those words' vectors (when present).

To do that, you shouldn't need any explicit re-shaping of vectors – or even specification of size – because that's forced by what the model already provides. You could use utility methods from numpy to simplify the code a lot. For example, the gensim n_similarity() method, as part of its comparison of two lists-of-words, already does an averaging much like what you're trying, and you can look at its source as a model.

So, while I haven't tested this code, I think your word_vector() method could essentially be replaced with:
import numpy as np

def average_words_vectors(tokens, wv_model):
    # keep only vectors for words the model knows, avoiding KeyError
    vectors = [wv_model[word] for word in tokens if word in wv_model]
    return np.array(vectors).mean(axis=0)
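As a quick sanity check of this averaging approach, a plain dict of word-to-vector mappings can stand in for the model, since the function only needs `in` membership tests and `[]` lookups (the words and vectors below are invented for illustration):

```python
import numpy as np

def average_words_vectors(tokens, wv_model):
    # keep only vectors for words the model knows, avoiding KeyError
    vectors = [wv_model[word] for word in tokens if word in wv_model]
    return np.array(vectors).mean(axis=0)

# toy stand-in for a trained Word2Vec model (vectors chosen arbitrarily)
toy_model = {
    "hot":  np.array([1.0, 0.0]),
    "cold": np.array([0.0, 1.0]),
}

avg = average_words_vectors(["hot", "cold", "unknown-word"], toy_model)
print(avg)  # the unknown word is skipped; result is the mean of the two known vectors
```

Note the out-of-vocabulary word is silently dropped rather than raising a KeyError, which is exactly the behavior the list-comprehension filter provides.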
(It's sometimes the case that it makes sense to work with vectors that have been normalized to unit length – as the linked gensim code does, via applying gensim.matutils.unitvec() to the average. I haven't done that here, as your method hadn't taken that step – but it is something to consider.)
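If you did want unit-length output, the normalization gensim.matutils.unitvec() performs on a dense vector amounts to dividing by its L2 norm, which is easy to sketch with plain numpy (the input vector here is arbitrary):

```python
import numpy as np

def to_unit_length(vec):
    # divide by the L2 norm; zero vectors are returned unchanged
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

u = to_unit_length(np.array([3.0, 4.0]))
print(u)                  # [0.6 0.8]
print(np.linalg.norm(u))  # 1.0
```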
Separate observations about your Word2Vec
training code:
- Typically, words with just 1, 2, or a few occurrences don't get good vectors (due to the limited number & variety of examples), but they do interfere with the improvement of other more-common-word vectors. That's why the default is min_count=5. So just be aware: your surviving vectors may get better if you use the default (or even larger) value here, discarding more of the rarer words.

- The dimensions of a "dense embedding" like word2vec vectors aren't really "independent variables" (or standalone, individually-interpretable "features") as implied by your code comment, even though they may seem that way as separate values/slots in the data. For example, you can't pick one dimension out and conclude, "that's the foo-ness of this sample" (like 'coldness' or 'hardness' or 'positiveness' etc.). Rather, any of those human-describable meanings tend to be other directions in the combined space, not perfectly aligned with any of the individual dimensions. You can sort-of tease those out by comparing vectors, and downstream ML algorithms can make use of those complicated/entangled multi-dimensional interactions. But if you think of each dimension as its own "feature" – in any way other than "yes, it's technically a single number associated with the item" – you may be prone to misinterpreting the vector space.