Prefixspan Sequence Extraction Misunderstanding

May 26, 2024

I have a list of tuples of size three that represent windowed sequences. What I need is to use PySpark to get the third element of a tuple, given the first two.

Solution 1:

First let's look at your input:

rdd.count()

1

As you can see, you created a dataset with only one sequence. It can be described as:

<(abc)(bcd)(cde)(def)(efg)(fgh)(abc)(def)(abc)(bcd)(fgh)(def)(bcd)>

So the patterns you get are indeed correct given the input. For example:

FreqSequence(sequence=[[u'c'], [u'd'], [u'b']], freq=1)

corresponds to:

...(abc)(def)(abc)...
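To see why this pattern is reported, it helps to spell out the containment rule: a pattern occurs in a sequence when its itemsets can be matched, in order, to itemsets of the sequence that contain them. The following is a minimal plain-Python sketch of that check. The helper `contains_pattern` is hypothetical and not part of PySpark.

```python
def contains_pattern(sequence, pattern):
    """Return True if `pattern` occurs in `sequence` as an ordered subsequence.

    Both arguments are lists of itemsets (lists of items). Each pattern
    itemset must be a subset of a sequence itemset that comes strictly
    after the one matched by the previous pattern itemset.
    """
    pos = 0
    for itemset in pattern:
        wanted = set(itemset)
        # Advance to the next sequence itemset that contains all wanted items.
        while pos < len(sequence) and not wanted <= set(sequence[pos]):
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next pattern itemset must match strictly later
    return True


# The single sequence from the question, one itemset per window.
single_sequence = [list(w) for w in
                   ["abc", "bcd", "cde", "def", "efg", "fgh", "abc",
                    "def", "abc", "bcd", "fgh", "def", "bcd"]]

print(contains_pattern(single_sequence, [["c"], ["d"], ["b"]]))  # True
```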

If each element of the dataset is meant to represent an individual sequence, the data could have the following shape:

rdd = sc.parallelize([
    [['a'], ['b'], ['c']], [['b'], ['c'], ['d']], [['c'], ['d'], ['e']],
    [['d'], ['e'], ['f']], [['e'], ['f'], ['g']], [['f'], ['g'], ['h']],
    [['a'], ['b'], ['c']], [['d'], ['e'], ['f']], [['a'], ['b'], ['c']],
    [['b'], ['c'], ['d']], [['f'], ['g'], ['h']], [['d'], ['e'], ['f']],
    [['b'], ['c'], ['d']]
])

rdd.count()

13

rdd.first()

[['a'], ['b'], ['c']]

where:

Each element is a list of lists (one sequence).
Each inner list is an itemset, i.e. the set of items present at the given position.
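If the windows start out as plain 3-tuples, as the question describes, they can be reshaped into this structure before being parallelized. Here is a small sketch, with `windows` as an assumed name for the original list (only the first three tuples are shown):

```python
# Assumed input: the windowed tuples from the question (first three only).
windows = [("a", "b", "c"), ("b", "c", "d"), ("c", "d", "e")]

# Wrap every item in its own single-element itemset, so each tuple
# becomes one sequence of three itemsets.
sequences = [[[item] for item in window] for window in windows]

print(sequences[0])  # [['a'], ['b'], ['c']]
```

The result can then be passed to `sc.parallelize(sequences)`.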

With data structured like this:

from pyspark.mllib.fpm import PrefixSpan
model = PrefixSpan.train(rdd, minSupport=0.2, maxPatternLength=3)
model.freqSequences().top(5, key=lambda x: len(x.sequence))

[FreqSequence(sequence=[['d'], ['e'], ['f']], freq=3),
 FreqSequence(sequence=[['b'], ['c'], ['d']], freq=3),
 FreqSequence(sequence=[['a'], ['b'], ['c']], freq=3),
 FreqSequence(sequence=[['f'], ['g']], freq=3),
 FreqSequence(sequence=[['d'], ['f']], freq=3)]
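The frequencies above can be checked by hand. With a minimum support of 0.2 and 13 sequences, a pattern has to occur in at least 3 of them (0.2 × 13 = 2.6). The plain-Python sketch below recounts a couple of patterns. `support` is a hypothetical helper, and it only handles the single-item itemsets used here.

```python
import math

# The 13 windows from the rdd above, one string per sequence.
windows = ["abc", "bcd", "cde", "def", "efg", "fgh", "abc",
           "def", "abc", "bcd", "fgh", "def", "bcd"]

def support(pattern, sequences):
    """Count the sequences that contain `pattern` as an ordered subsequence."""
    count = 0
    for seq in sequences:
        it = iter(seq)
        # `item in it` consumes the iterator, which enforces the ordering.
        if all(item in it for item in pattern):
            count += 1
    return count

min_count = math.ceil(0.2 * len(windows))
print(min_count)               # 3
print(support("df", windows))  # 3
print(support("bc", windows))  # 6
```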

model.freqSequences().top(5, key=lambda x: x.freq)

[FreqSequence(sequence=[['d']], freq=7),
 FreqSequence(sequence=[['c']], freq=7),
 FreqSequence(sequence=[['f']], freq=6),
 FreqSequence(sequence=[['b']], freq=6),
 FreqSequence(sequence=[['b'], ['c']], freq=6)]
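To answer the original question (given the first two elements, find the third), one option is to filter the length-3 patterns by their prefix. The sketch below works on plain `(sequence, freq)` pairs copied from the output above. `next_items` is a hypothetical helper, not a PySpark API.

```python
# Length-3 patterns copied from the freqSequences() output above.
patterns = [
    ([["d"], ["e"], ["f"]], 3),
    ([["b"], ["c"], ["d"]], 3),
    ([["a"], ["b"], ["c"]], 3),
]

def next_items(patterns, prefix):
    """Return (itemset, freq) for patterns that extend `prefix` by one itemset."""
    n = len(prefix)
    matches = [(seq[n], freq) for seq, freq in patterns
               if len(seq) == n + 1 and seq[:n] == prefix]
    # Most frequent continuation first.
    return sorted(matches, key=lambda m: m[1], reverse=True)

print(next_items(patterns, [["a"], ["b"]]))  # [(['c'], 3)]
```

With a trained model, the pairs could be built from `model.freqSequences().collect()`, since each `FreqSequence` exposes `sequence` and `freq`.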
