
ValueError: Cannot Reshape Array of Size 3800 Into Shape (1,200)

I am trying to apply word embeddings to tweets. I was trying to create a vector for each tweet by taking the average of the vectors of the words present in the tweet, as follows: def

Solution 1:

It looks like the intent of your word_vector() method is to take a list of words, and then with respect to a given Word2Vec model, return the average of all those words' vectors (when present).

To do that, you shouldn't need to do any explicit re-shaping of vectors – or even specification of size, because that's determined by what the model already provides. You could use utility methods from numpy to simplify the code a lot. For example, the gensim n_similarity() method, as part of its comparison of two lists-of-words, already does an averaging much like what you're trying, and you can look at its source as a model:

https://github.com/RaRe-Technologies/gensim/blob/f97d0e793faa57877a2bbedc15c287835463eaa9/gensim/models/keyedvectors.py#L996

So, while I haven't tested this code, I think your word_vector() method could be essentially replaced with:

import numpy as np

def average_words_vectors(tokens, wv_model):
    vectors = [wv_model[word] for word in tokens
               if word in wv_model]  # avoiding KeyError
    return np.array(vectors).mean(axis=0)
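As a quick sanity check of the averaging behavior, note that the function only relies on membership tests and item lookup, so a plain dict of word-to-vector mappings can stand in for a trained model for illustration (the toy vectors below are made up, not from any real model):

```python
import numpy as np

def average_words_vectors(tokens, wv_model):
    vectors = [wv_model[word] for word in tokens
               if word in wv_model]  # avoiding KeyError
    return np.array(vectors).mean(axis=0)

# Toy stand-in for a trained model: word -> 3-d vector.
toy_model = {
    "good": np.array([1.0, 0.0, 0.0]),
    "tweet": np.array([0.0, 1.0, 0.0]),
}

# "unknownword" is silently skipped, just as an out-of-vocabulary
# word would be skipped with a real model.
avg = average_words_vectors(["good", "tweet", "unknownword"], toy_model)
print(avg)  # [0.5 0.5 0. ]
```

The returned vector always has the model's dimensionality, regardless of how many words survive the filter – which is exactly why no reshape is needed.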

(It sometimes makes sense to work with vectors that have been normalized to unit length – as the linked gensim code does, by applying gensim.matutils.unitvec() to the average. I haven't done that here, since your method didn't take that step – but it is something to consider.)
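If you do want unit-length averages, the extra step is small. Here is a sketch in plain numpy of what that normalization does (gensim.matutils.unitvec performs the equivalent scaling):

```python
import numpy as np

def unit_normalize(vec):
    # Scale a vector to unit length; a zero vector is returned unchanged.
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

avg = np.array([3.0, 4.0])
print(unit_normalize(avg))  # [0.6 0.8]
```

Unit-normalizing makes dot products between averaged vectors behave like cosine similarities, which is often what downstream comparisons assume.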

Separate observations about your Word2Vec training code:

  • typically words with just 1, 2, or a few occurrences don't get good vectors (due to limited number & variety of examples), but do interfere with the improvement of other more-common-word vectors. That's why the default is min_count=5. So just be aware: your surviving vectors may get better if you use a default (or even larger) value here, discarding more of the rarer words.
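Before retraining with a higher min_count, it can help to see how much of your vocabulary a given threshold would discard. A quick frequency check needs only plain Python (the tweets below are a made-up toy corpus, and the threshold is lowered to 2 so the toy example shows both outcomes; Word2Vec's actual default is 5):

```python
from collections import Counter

tweets = [
    ["great", "day", "today"],
    ["great", "coffee", "today"],
    ["great", "great", "day"],
]

counts = Counter(word for tweet in tweets for word in tweet)
min_count = 2  # lowered for this toy corpus; Word2Vec defaults to 5

kept = {w for w, c in counts.items() if c >= min_count}
dropped = {w for w, c in counts.items() if c < min_count}
print("kept:", kept)       # words with enough examples to train well
print("dropped:", dropped) # rare words the model would discard
```

Running this on your real corpus with min_count = 5 tells you in advance how much of the vocabulary survives, before committing to a full training run.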

  • the dimensions of a "dense embedding" like word2vec vectors aren't really "independent variables" (or standalone, individually-interpretable "features") as implied by your code comment, even though they may seem that way as separate values/slots in the data. For example, you can't pick one dimension out and conclude, "that's the foo-ness of this sample" (like 'coldness' or 'hardness' or 'positiveness' etc). Rather, any of those human-describable meanings tend to be other directions in the combined space, not perfectly aligned with any of the individual dimensions. You can sort-of tease those out by comparing vectors, and downstream ML algorithms can make use of those complicated/entangled multi-dimensional interactions. But if you think of each dimension as its own "feature" – in any way other than yes, it's technically a single number associated with the item – you may be prone to misinterpreting the vector space.
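To make the "directions, not dimensions" point concrete, here is a toy sketch with entirely made-up 3-d vectors: a hypothetical "positiveness" direction is derived as the difference between two reference words, and other words are scored by projecting onto that direction – no single raw dimension carries the meaning by itself:

```python
import numpy as np

# Made-up toy vectors; real embeddings would have 100+ dimensions.
vecs = {
    "good":  np.array([0.9, 0.3, 0.1]),
    "bad":   np.array([0.1, 0.4, 0.8]),
    "happy": np.array([0.8, 0.2, 0.2]),
    "sad":   np.array([0.2, 0.3, 0.7]),
}

# A hypothetical "positiveness" direction: the good-minus-bad axis,
# scaled to unit length. It mixes all three raw dimensions.
direction = vecs["good"] - vecs["bad"]
direction = direction / np.linalg.norm(direction)

for word in ("happy", "sad"):
    score = vecs[word] @ direction  # projection onto the direction
    print(word, round(score, 3))
```

In this sketch "happy" projects higher than "sad" along the derived axis, even though no single coordinate of either vector encodes positiveness on its own – which is the sense in which meanings live in directions of the combined space.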
