Text Extraction Using Python Lxml Looping Issue
Here is a part of my xml file.. - - - &l
Solution 1:
You are building the sentence once for every a:rPr found in the loop, instead of once per a:p.
Here's an example of what you should do instead:
test.xml:

    <body xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main" xmlns:p="http://schemas.openxmlformats.org/presentationml/2006/main">
      <a:p>
        <a:pPr lvl="2">
          <a:spcBef><a:spcPts val="200"/></a:spcBef>
        </a:pPr>
        <a:r><a:rPr lang="en-US" sz="1400" dirty="0" smtClean="0"/><a:t>The</a:t></a:r>
        <a:r><a:rPr lang="en-US" sz="1400" dirty="0"/><a:t>world</a:t></a:r>
        <a:r><a:rPr lang="en-US" sz="1400" dirty="0" smtClean="0"/><a:t>is small</a:t></a:r>
      </a:p>
      <a:p>
        <a:pPr lvl="2">
          <a:spcBef><a:spcPts val="200"/></a:spcBef>
        </a:pPr>
        <a:r><a:rPr lang="en-US" sz="1400" dirty="0" smtClean="0" b="0"/><a:t>The</a:t></a:r>
        <a:r><a:rPr lang="en-US" sz="1400" dirty="0" b="0"/><a:t>world</a:t></a:r>
        <a:r><a:rPr lang="en-US" sz="1400" dirty="0" smtClean="0" b="0"/><a:t>is too big</a:t></a:r>
      </a:p>
    </body>
test.py:

    from lxml import etree

    tree = etree.parse('test.xml')
    NAMESPACES = {'p': 'http://schemas.openxmlformats.org/presentationml/2006/main',
                  'a': 'http://schemas.openxmlformats.org/drawingml/2006/main'}

    # One iteration per paragraph (a:p), so one sentence per paragraph.
    path = tree.xpath('/body/a:p', namespaces=NAMESPACES)
    for outer_item in path:
        parts = []
        # For each a:rPr in this paragraph, take the text of its sibling a:t.
        for item in outer_item.xpath('./a:r/a:rPr', namespaces=NAMESPACES):
            parts.append(item.getparent().xpath('./a:t/text()', namespaces=NAMESPACES)[0])
        # Print only after the whole paragraph has been collected.
        print(" ".join(parts))
output:
The world is small
The world is too big
So, just loop over the a:p items, collect the text of each one into parts, and print it once that a:p has been processed. I've removed the if statement for clarity.
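If you want to try the per-paragraph idea without creating files, here is a minimal self-contained sketch. It is a variation on the script above, not the original code: the XML is a trimmed copy of test.xml inlined as a string, paragraph_sentences is a hypothetical helper name, and it uses findall instead of xpath so that it can fall back to the standard library's xml.etree.ElementTree when lxml is not installed. It also reads the a:t elements directly rather than going through a:rPr and getparent().

```python
try:
    from lxml import etree  # preferred, as in the answer above
except ImportError:
    import xml.etree.ElementTree as etree  # stdlib fallback with the same findall API

NAMESPACES = {
    'p': 'http://schemas.openxmlformats.org/presentationml/2006/main',
    'a': 'http://schemas.openxmlformats.org/drawingml/2006/main',
}

# Trimmed-down copy of test.xml, inlined so the sketch is self-contained.
XML = b"""<body xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main"
      xmlns:p="http://schemas.openxmlformats.org/presentationml/2006/main">
  <a:p>
    <a:r><a:rPr lang="en-US"/><a:t>The</a:t></a:r>
    <a:r><a:rPr lang="en-US"/><a:t>world</a:t></a:r>
    <a:r><a:rPr lang="en-US"/><a:t>is small</a:t></a:r>
  </a:p>
  <a:p>
    <a:r><a:rPr lang="en-US" b="0"/><a:t>The</a:t></a:r>
    <a:r><a:rPr lang="en-US" b="0"/><a:t>world</a:t></a:r>
    <a:r><a:rPr lang="en-US" b="0"/><a:t>is too big</a:t></a:r>
  </a:p>
</body>"""


def paragraph_sentences(xml_bytes):
    """Return one joined sentence per a:p element (hypothetical helper)."""
    root = etree.fromstring(xml_bytes)
    sentences = []
    for paragraph in root.findall('a:p', namespaces=NAMESPACES):
        # Collect the text of every a:t inside this paragraph's runs.
        parts = [t.text
                 for t in paragraph.findall('a:r/a:t', namespaces=NAMESPACES)
                 if t.text]
        # Join only after the whole paragraph has been collected.
        sentences.append(" ".join(parts))
    return sentences


if __name__ == "__main__":
    for sentence in paragraph_sentences(XML):
        print(sentence)
```

Returning a list instead of printing inside the loop makes it easier to reuse the sentences later, and the list is rebuilt for each a:p, so text from one paragraph never leaks into the next.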
Hope that helps.