
Text Extraction Using Python Lxml Looping Issue

Here is a part of my XML file...

Solution 1:

You are building the sentence once for every a:rPr found in the loop, rather than once per a:p paragraph.

Here's an example of what you should do instead:

test.xml:

<body xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main" xmlns:p="http://schemas.openxmlformats.org/presentationml/2006/main">
    <a:p>
        <a:pPr lvl="2">
            <a:spcBef><a:spcPts val="200"/></a:spcBef>
        </a:pPr>
        <a:r><a:rPr lang="en-US" sz="1400" dirty="0" smtClean="0"/><a:t>The</a:t></a:r>
        <a:r><a:rPr lang="en-US" sz="1400" dirty="0"/><a:t>world</a:t></a:r>
        <a:r><a:rPr lang="en-US" sz="1400" dirty="0" smtClean="0"/><a:t>is small</a:t></a:r>
    </a:p>
    <a:p>
        <a:pPr lvl="2">
            <a:spcBef><a:spcPts val="200"/></a:spcBef>
        </a:pPr>
        <a:r><a:rPr lang="en-US" sz="1400" dirty="0" smtClean="0" b="0"/><a:t>The</a:t></a:r>
        <a:r><a:rPr lang="en-US" sz="1400" dirty="0" b="0"/><a:t>world</a:t></a:r>
        <a:r><a:rPr lang="en-US" sz="1400" dirty="0" smtClean="0" b="0"/><a:t>is too big</a:t></a:r>
    </a:p>
</body>

test.py:

from lxml import etree


tree = etree.parse('test.xml')
NAMESPACES = {'p': 'http://schemas.openxmlformats.org/presentationml/2006/main',
              'a': 'http://schemas.openxmlformats.org/drawingml/2006/main'}

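# Select every paragraph (a:p) directly under the root element.
path = tree.xpath('/body/a:p', namespaces=NAMESPACES)

for outer_item in path:
    # Collect the text of each run (a:r) that belongs to this paragraph.
    parts = []
    for item in outer_item.xpath('./a:r/a:rPr', namespaces=NAMESPACES):
        parts.append(item.getparent().xpath('./a:t/text()', namespaces=NAMESPACES)[0])

    # Print one sentence per paragraph, after all of its runs are collected.
    print(" ".join(parts))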

output:

The world is small

The world is too big

So the idea is to loop over the a:p items, collect the text of each run into parts, and print the joined result once each a:p has been processed. I've removed the if statement for clarity.
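As a rough sketch of the same idea in a reusable form, the per-paragraph loop could be wrapped in a function that returns one sentence per a:p. The function name paragraph_sentences, the inline SAMPLE string, and the fallback import are assumptions made for this illustration and are not part of the original answer. The sketch also skips runs that have no a:t text, a case the [0] index above would not survive.

```python
try:
    from lxml import etree  # the library used in the answer
except ImportError:
    # Assumption: the standard library offers the same calls used below.
    import xml.etree.ElementTree as etree

NAMESPACES = {'a': 'http://schemas.openxmlformats.org/drawingml/2006/main'}

# Hypothetical trimmed-down sample; the second paragraph has a run without a:t.
SAMPLE = (
    '<body xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">'
    '<a:p>'
    '<a:r><a:rPr lang="en-US"/><a:t>The</a:t></a:r>'
    '<a:r><a:rPr lang="en-US"/><a:t>world</a:t></a:r>'
    '<a:r><a:rPr lang="en-US"/><a:t>is small</a:t></a:r>'
    '</a:p>'
    '<a:p>'
    '<a:r><a:rPr lang="en-US"/><a:t>The</a:t></a:r>'
    '<a:r><a:rPr lang="en-US"/></a:r>'
    '<a:r><a:rPr lang="en-US"/><a:t>world is too big</a:t></a:r>'
    '</a:p>'
    '</body>'
)


def paragraph_sentences(root):
    """Return one joined string per a:p child of root (hypothetical helper)."""
    sentences = []
    for paragraph in root.findall('a:p', namespaces=NAMESPACES):
        # Gather the text of every a:t inside the paragraph's runs,
        # ignoring a:t elements that are empty.
        parts = [t.text
                 for t in paragraph.findall('a:r/a:t', namespaces=NAMESPACES)
                 if t.text]
        sentences.append(" ".join(parts))
    return sentences


root = etree.fromstring(SAMPLE)
for sentence in paragraph_sentences(root):
    print(sentence)
```

Returning a list instead of printing inside the loop keeps the "one sentence per a:p" rule in a single place, so any filtering condition can be added to the list comprehension without changing where the sentence is assembled.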

Hope that helps.
