
Frequency Of Ngrams (strings) In Tokenized Text

I have a set of unique ngrams (a list called ngramlist) and ngram-tokenized text (a list called ngrams). I want to construct a new vector, freqlist, where each element of freqlist is the frequency with which the corresponding ngram from ngramlist occurs in ngrams.

Solution 1:

You can probably optimize this a bit by pre-computing some quantities and using a Counter. This will be especially useful if most of the elements in ngramlist are contained in ngrams.

freqlist = [
    sum(int(ngram == ngram_candidate) for ngram_candidate in ngrams) / len(ngrams)
    for ngram in ngramlist
]

You certainly don't need to iterate over ngrams every single time you check an ngram. One pass over ngrams will make this algorithm O(n + m) instead of the O(n * m) nested version you have now (with n = len(ngrams) and m = len(ngramlist)). Remember, shorter code is not necessarily better or more efficient code:

from collections import Counter
...

counter = Counter(ngrams)
size = len(ngrams)
freqlist = [counter.get(ngram, 0) / size for ngram in ngramlist]
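For a quick sanity check, here is the same Counter-based computation on toy data (the ngrams below are hypothetical, not from the question):

from collections import Counter

# Toy data to show the output shape: ('a', 'b') appears twice out
# of four ngrams, and ('x', 'y') never appears.
ngrams = [('a', 'b'), ('b', 'c'), ('a', 'b'), ('c', 'd')]
ngramlist = [('a', 'b'), ('x', 'y')]

counter = Counter(ngrams)
size = len(ngrams)
freqlist = [counter.get(ngram, 0) / size for ngram in ngramlist]
print(freqlist)  # [0.5, 0.0]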

To use this with your DataFrame, write a proper function with def instead of a lambda:

def count_ngrams(ngrams):
    counter = Counter(ngrams)
    size = len(ngrams)
    freqlist = [counter.get(ngram, 0) / size for ngram in ngramlist]
    return freqlist

df['freqlist'] = df['ngrams-3'].map(count_ngrams)
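Note that count_ngrams reads ngramlist from the enclosing scope. If you'd rather avoid the global, a small refactor (a sketch with the same behavior; count_ngrams_in is a hypothetical name) passes the vocabulary in explicitly:

# Same computation, but the vocabulary is an explicit parameter.
def count_ngrams_in(ngrams, ngramlist):
    counter = Counter(ngrams)
    size = len(ngrams)
    return [counter.get(ngram, 0) / size for ngram in ngramlist]

df['freqlist'] = df['ngrams-3'].map(lambda x: count_ngrams_in(x, ngramlist))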

Solution 2:

Firstly, don't shadow your imported functions by reusing their names as variables: keep ngrams as the name of the function, and use something else for the variable.

import time
from functools import partial
from itertools import chain
from collections import Counter
import wikipedia

import pandas as pd

from nltk import word_tokenize
from nltk.util import ngrams

Next, the steps before the line you're asking about in the original question might be a little inefficient. You can clean them up, make them easier to read, and time them like this:

# Downloading the articles.
titles = ['New York City','Moscow','Beijing']
start = time.time()
df = pd.DataFrame({'article':[wikipedia.page(title).content for title in titles]})
end = time.time()
print('Downloading wikipedia articles took', end-start, 'seconds')

And then:

# Tokenizing the articles
start = time.time()
df['tokens'] = df['article'].apply(word_tokenize)
end = time.time()
print('Tokenizing articles took', end-start, 'seconds')

Then:

# Extracting trigrams.
trigrams = partial(ngrams, n=3)
start = time.time()
# There's no need to flatten them to strings, you could just use list().
df['trigrams'] = df['tokens'].apply(lambda x: list(trigrams(x)))
end = time.time()
print('Extracting trigrams took', end-start, 'seconds')

Finally, on to the last line:

# Instead of a set, we use a Counter here because
# we can use an intersection between Counter objects later.
# See https://stackoverflow.com/questions/44012479/intersection-of-two-counters
all_trigrams = Counter(chain(*df['trigrams']))
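If the & operator on Counters is unfamiliar: it keeps only the keys present in both counters, each with the minimum of the two counts. A minimal toy illustration (made-up trigrams, not from the articles):

from collections import Counter

# '&' on Counters keeps keys present in both, with the minimum count.
a = Counter({('new', 'york', 'city'): 3, ('in', 'the', 'city'): 1})
b = Counter({('new', 'york', 'city'): 5, ('red', 'square', 'in'): 2})
print(a & b)  # Counter({('new', 'york', 'city'): 3})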

# More often than not, you don't need to keep all the
# zeros in the vectors (aka a dense vector);
# you could actually get the non-zero sparse vectors
# as a dict, as such:
df['trigrams_count'] = df['trigrams'].apply(lambda x: Counter(x) & all_trigrams)

# Now to normalize the count, simply do:
def featurize(list_of_ngrams):
    nonzero_features = Counter(list_of_ngrams) & all_trigrams
    total = len(list_of_ngrams)
    return {ng:count/total for ng, count in nonzero_features.items()}

df['trigrams_count_normalize'] = df['trigrams'].apply(featurize)
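To inspect the result, you could peek at a few of the normalized features for one article (a usage sketch; the actual trigrams and values depend on the downloaded text):

# Print a few (trigram, normalized count) pairs for the first article.
first_features = df['trigrams_count_normalize'].iloc[0]
for trigram, freq in list(first_features.items())[:5]:
    print(trigram, freq)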
