Python 3.3.2 - Finding Image Sources In HTML
I need to locate and extract image sources from an HTML file. For example, it might contain an <img> tag or an <image> tag.

Examples
Sample Text
Note the rather difficult edge cases in the first line: a decoy src inside the quoted onmouseover value, which also contains a > character.
<img onmouseover=' src="NotTheDroidsYouAreLookingFor.png" ; if (x > 3) { funRotate(src); } ' src="http://another.example/picture.png">
<imagesomethingrandom class="logo" src="http://example.site/imagesomethingrandom.jpg">
<image class="logo" src="http://example.site/logo.jpg">
<img src="http://another.example/DoubleQuoted.png">
<image src='http://another.example/SingleQuoted.png'>
<img src=http://another.example/NotQuoted.png>
Python Code
#!/usr/bin/python
import re

string = """<img onmouseover=' src="NotTheDroidsYouAreLookingFor.png" ; if (x > 3) { funRotate(src); } ' src="http://another.example/picture.png">
<imagesomethingrandom class="logo" src="http://example.site/imagesomethingrandom.jpg">
<image class="logo" src="http://example.site/logo.jpg">
<img src="http://another.example/DoubleQuoted.png">
<image src='http://another.example/SingleQuoted.png'>
<img src=http://another.example/NotQuoted.png>
"""
# The lookahead finds the real src attribute while skipping over quoted attribute values
regex = r"""<ima?ge?(?=\s|>)(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\ssrc=(['"]?)(.*?)\1(?:\s|>))(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>"""
intCount = 0
# Python 3 requires print() as a function
for matchObj in re.finditer(regex, string, re.M | re.I | re.S):
    print(" ")
    print("[", intCount, "][ 0 ] : ", matchObj.group(0))
    print("[", intCount, "][ 1 ] : ", matchObj.group(1))
    print("[", intCount, "][ 2 ] : ", matchObj.group(2))
    intCount += 1
Capture Groups
Group 0 gets the entire image or img tag
Group 1 gets the quote character surrounding the src attribute value, if there is one
Group 2 gets the src attribute value
Output
[ 0 ][ 0 ] : <img onmouseover=' src="NotTheDroidsYouAreLookingFor.png" ; if (x > 3) { funRotate(src); } ' src="http://another.example/picture.png">
[ 0 ][ 1 ] : "
[ 0 ][ 2 ] : http://another.example/picture.png
[ 1 ][ 0 ] : <image class="logo" src="http://example.site/logo.jpg">
[ 1 ][ 1 ] : "
[ 1 ][ 2 ] : http://example.site/logo.jpg
[ 2 ][ 0 ] : <img src="http://another.example/DoubleQuoted.png">
[ 2 ][ 1 ] : "
[ 2 ][ 2 ] : http://another.example/DoubleQuoted.png
[ 3 ][ 0 ] : <image src='http://another.example/SingleQuoted.png'>
[ 3 ][ 1 ] : '
[ 3 ][ 2 ] : http://another.example/SingleQuoted.png
[ 4 ][ 0 ] : <img src=http://another.example/NotQuoted.png>
[ 4 ][ 1 ] :
[ 4 ][ 2 ] : http://another.example/NotQuoted.png
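If only the src values are needed, the loop above can be wrapped in a small helper that returns capture group 2 for each match. This is a minimal sketch using the same pattern; the names IMG_SRC and find_image_sources are made up for illustration and are not part of the original answer.

```python
import re

# Same pattern as in the answer above, compiled once.
# IMG_SRC and find_image_sources are hypothetical names used for illustration.
IMG_SRC = re.compile(
    r"""<ima?ge?(?=\s|>)(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\ssrc=(['"]?)(.*?)\1(?:\s|>))(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>""",
    re.I | re.S,
)


def find_image_sources(html):
    """Return the src value (capture group 2) of every img/image tag."""
    return [m.group(2) for m in IMG_SRC.finditer(html)]


sample = """<img src="http://another.example/DoubleQuoted.png">
<image src='http://another.example/SingleQuoted.png'>
<img src=http://another.example/NotQuoted.png>
<imagesomethingrandom src="http://example.site/imagesomethingrandom.jpg">
"""

for src in find_image_sources(sample):
    print(src)
# http://another.example/DoubleQuoted.png
# http://another.example/SingleQuoted.png
# http://another.example/NotQuoted.png
```

The imagesomethingrandom tag is skipped because the tag name must be followed by whitespace or >.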
Solution 2:
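Try BeautifulSoup. Just write:
from bs4 import BeautifulSoup
# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(theHTMLtext, 'html.parser')
imagesElements = soup.find_all('img')
# Pull the src out of each element (None if the attribute is missing)
imageSources = [img.get('src') for img in imagesElements]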
Solution 3:
And an altered version, which only handles double-quoted src values:
<ima?ge? # using optional letters, we match both tags in one expression
\s+ # require at least one whitespace character (newlines are valid too);
# this prevents matching <imgbutnotreally> tags
[^>]*? # skip any other attributes without leaving the tag; not greedy (performance)
\bsrc="([^"]+) # match a word boundary, then capture all characters of the src value
The same expression on one line:
<ima?ge?\s+[^>]*?\bsrc="([^"]+)
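To see what this shorter expression does and does not catch, it can be compiled with re.VERBOSE, which lets the commented layout be used directly, and run against lines from the sample text of the first answer. This is only a sketch; the name SIMPLE_IMG_SRC is made up for illustration, and \bsrc is assumed to be the intended form of the last part of the pattern.

```python
import re

# re.VERBOSE ignores layout whitespace and "#" comments inside the pattern.
# SIMPLE_IMG_SRC is a hypothetical name used for illustration.
SIMPLE_IMG_SRC = re.compile(r'''
    <ima?ge?         # "<img" or "<image"
    \s+              # at least one whitespace character
    [^>]*?           # other attributes, lazily, stopping at the first ">"
    \bsrc="([^"]+)   # capture a double-quoted src value
''', re.VERBOSE | re.IGNORECASE)

plain = '<image class="logo" src="http://example.site/logo.jpg">'
single = "<img src='http://another.example/SingleQuoted.png'>"
tricky = """<img onmouseover=' src="NotTheDroidsYouAreLookingFor.png" ; if (x > 3) { funRotate(src); } ' src="http://another.example/picture.png">"""

print(SIMPLE_IMG_SRC.findall(plain))   # ['http://example.site/logo.jpg']
print(SIMPLE_IMG_SRC.findall(single))  # []
print(SIMPLE_IMG_SRC.findall(tricky))  # ['NotTheDroidsYouAreLookingFor.png']
```

The shorter pattern is easier to read, but it misses single-quoted and unquoted values and is fooled by the decoy src inside the onmouseover attribute, which is the case the longer expression in the first answer was written to handle.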
Solution 4:
To find the image sources in an HTML document using BeautifulSoup:
from bs4 import BeautifulSoup
# The HTML must be a string
html = '<img src="http://another.example/picture.png">'
soup = BeautifulSoup(html, 'html.parser')
# src=True keeps only the <img> tags that actually have a src attribute
url_picture = []
for img in soup.find_all('img', src=True):
    url_picture.append(img['src'])
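If installing bs4 is not an option, the standard library's html.parser module can do a similar job. None of the answers above use it, so this is only a sketch of an alternative; the class name ImgSrcCollector is made up for illustration, and treating <image> like <img> is an assumption carried over from the regex answers.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> and <image> start tag.

    Hypothetical helper class, not part of any library.
    """

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names;
        # attrs is a list of (name, value) pairs.
        if tag in ("img", "image"):
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


html = """<img src="http://another.example/DoubleQuoted.png">
<image src='http://another.example/SingleQuoted.png'>
<img src=http://another.example/NotQuoted.png>
<imagesomethingrandom src="http://example.site/imagesomethingrandom.jpg">
"""

collector = ImgSrcCollector()
collector.feed(html)
for src in collector.sources:
    print(src)
```

Because the parser works on tags and attributes rather than raw text, the quoting style of the src value does not need special handling here.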