Yes, I know. But my PC is behind a proxy, so I just found a tutorial about urllib2 and BeautifulSoup. Wait, requests also supports proxies. Thanks for pointing that out.
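For reference, here is a minimal sketch of the proxy setup with requests instead of urllib2. The proxy address and URL are just the placeholders from the urllib2 version; requests takes a proxy mapping per call, so there is no opener-install step.

```python
import requests

# hypothetical proxy address, same placeholder as in the urllib2 version
proxies = {"http": "http://192.168.1.1:8080"}

def fetch(url):
    # pass the proxy mapping directly to the request
    return requests.get(url, proxies=proxies).text
```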
Problem solved, still using urllib2 and bs4. I redirect the results to a *.txt file; the next step will be to write a function that calls a download manager program and runs it automatically.
Python version : 2.7.10
OS Platform : Windows 7
# coding:utf-8
import urllib2
from bs4 import BeautifulSoup

# declare proxy configuration
proxy = urllib2.ProxyHandler({'http': 'http://192.168.1.1:8080'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)

# specify the url
quote_page = "http://web_you_wanna_crawl"
page = urllib2.urlopen(quote_page)
soup = BeautifulSoup(page, "html.parser")

# create an empty list of links
link = []
for i in soup.find_all('a'):
    j = i.get('href')
    # get() returns None when an <a> tag has no href, so guard first;
    # endswith avoids the regex pitfall that '.' matches any character
    if j and j.endswith('.rar'):
        print j
        link.append(j)

with open('linkdownload.txt', 'wb') as file:
    for item in link:
        # j is unicode, so encode before writing to the binary file
        file.write(("%s\n" % item).encode('utf-8'))
Your solution gives an error. Could you check it?
print(link.attrs['href'])
AttributeError: 'ResultSet' object has no attribute 'attrs'
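That error happens because find_all() returns a ResultSet, which is list-like and has no .attrs of its own; .attrs lives on each individual Tag inside it. A minimal sketch with a made-up HTML snippet:

```python
from bs4 import BeautifulSoup

# tiny hypothetical document, just to demonstrate the ResultSet behaviour
html = '<a href="file1.rar">one</a><a href="page.html">two</a>'
soup = BeautifulSoup(html, "html.parser")

# find_all returns a ResultSet; calling .attrs on it raises AttributeError
links = soup.find_all('a')

# access attrs on each Tag inside the ResultSet instead
hrefs = [tag.attrs['href'] for tag in links]
print(hrefs)
```

So either loop over the ResultSet as in the working code above, or use find(), which returns a single Tag.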