How To Load Sentences Into Python Gensim?
I am trying to use the word2vec module from the gensim natural language processing library in Python. The docs say to initialize the model with: from gensim.models import word2vec; model = W
Solution 1:
Word2Vec expects a list of tokenized, utf-8 sentences (you can also stream the data from disk instead of holding it all in memory). Make sure each sentence is utf-8, and split it into words:
from gensim.models import word2vec

sentences = ["the quick brown fox jumps over the lazy dogs",
             "Then a cop quizzed Mick Jagger's ex-wives briefly."]
# In Python 3, strings are already unicode, so no .encode() is needed.
# min_count is lowered to 1 here: no word appears 5 times in this tiny corpus,
# so min_count=5 would leave the vocabulary empty.
word2vec.Word2Vec([s.lower().split() for s in sentences],
                  size=100, window=5, min_count=1, workers=4)
Solution 2:
As alKid pointed out, make it utf-8.

Two additional things you might have to worry about:
- The input is too large to fit in memory, so you have to load it from a file.
- Removing stop words from the sentences.

Instead of loading a big list into memory, you can stream the sentences from disk with an iterator:
import nltk
import gensim

class FileToSent(object):
    def __init__(self, filename):
        self.filename = filename
        self.stop = set(nltk.corpus.stopwords.words('english'))

    def __iter__(self):
        # One sentence per line: lowercase, split on whitespace,
        # and drop stop words. (unicode() was Python 2 only; in
        # Python 3, open() already yields str.)
        for line in open(self.filename, 'r', encoding='utf-8'):
            yield [w for w in line.lower().split() if w not in self.stop]
And then,
sentences = FileToSent('sentence_file.txt')
model = gensim.models.Word2Vec(sentences=sentences, window=5, min_count=5, workers=4, hs=1)
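The reason FileToSent is a class with __iter__ rather than a one-shot generator is that Word2Vec scans the corpus several times: once to build the vocabulary, then once per training epoch, so the iterable must be restartable. A simplified sketch (stop-word removal omitted, and a temporary file standing in for sentence_file.txt):

```python
import tempfile

class FileToSent(object):
    def __init__(self, filename):
        self.filename = filename

    def __iter__(self):
        # Re-opens the file on every iteration, so each pass starts fresh.
        with open(self.filename, encoding='utf-8') as fh:
            for line in fh:
                yield line.lower().split()

# Write a two-sentence demo corpus to a temporary file.
with tempfile.NamedTemporaryFile('w', suffix='.txt',
                                 delete=False, encoding='utf-8') as fh:
    fh.write("The quick brown fox\njumps over the lazy dog\n")
    path = fh.name

corpus = FileToSent(path)
first_pass = list(corpus)
second_pass = list(corpus)   # a plain generator would be exhausted here
print(first_pass == second_pass)   # → True
```

If you passed a generator instead, the vocabulary scan would consume it and training would then see an empty corpus.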