Wednesday, December 17, 2014

Scraping Web Sites using Python

I was trying to scrape some content from the web, and there are so many videos online on how to use Python for this that I thought I should write something about it. Here you go!

1) Install Sublime to get the cool hacker look for writing Python code.
2) Install Python (3.4 seems to be the latest at the time of writing this blog).
3) Create a Python file and import the requests library:
        import requests
4) Decide what you want to parse the response with. E.g.
         (i) For HTML/XML content, you could use the BeautifulSoup4 library.
                  from bs4 import BeautifulSoup
        (ii) For JSON, you can use the json library.
                  import json
       (iii) For other kinds of content, you could use simple regular expressions.
                  import re

   Check out each of these libraries' usage to better understand what can be done with them. I really struggled with the BeautifulSoup library and ended up writing regular expressions instead.
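
   To give a feel for options (ii) and (iii), here is a minimal sketch that parses two small strings. The sample payloads, the variable names and the regular expression are all made up for illustration; in a real script the text would come from the response body instead. Regular expressions like this only cope with very regular markup, which is where BeautifulSoup tends to be the more robust choice.

```python
import json
import re

# Hypothetical sample payloads standing in for a real response body.
json_text = '{"title": "Example", "tags": ["python", "scraping"]}'
html_text = '<a href="/page1">One</a> <a href="/page2">Two</a>'

# (ii) JSON: json.loads turns the text into plain dicts and lists.
data = json.loads(json_text)
print(data["title"])

# (iii) Regular expressions: capture the href and the link text of each
# simple anchor tag. This assumes the exact attribute layout shown above.
links = re.findall(r'<a\s+href="([^"]+)">([^<]*)</a>', html_text)
print(links)
```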


5) To prepare your request, all you need to build is a URL; you can also add headers like this.
    headers = {'Content-Type':'text/plain;charset=utf-8','Accept': '*/*'}
    url = r"http://www.xyz.com/blahblah.aspx?parameters"

6) Send the request as follows and capture the response in an object.
     resp = requests.get(url,headers=headers)
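
Steps 5 and 6 can be put together roughly like this. It is only a sketch: the helper names build_request and fetch, the "q" query parameter and the 10-second timeout are my own assumptions, and the URL is the placeholder from above. The first helper prepares the request without sending it, which is handy for checking the final URL and headers before hitting the site.

```python
import requests


def build_request(url, headers, params=None):
    """Prepare a GET request without sending it, so it can be inspected."""
    return requests.Request("GET", url, headers=headers, params=params).prepare()


def fetch(url, headers, timeout=10):
    """Send the GET request and return the response body as text."""
    resp = requests.get(url, headers=headers, timeout=timeout)
    resp.raise_for_status()  # raise an exception on 4xx/5xx status codes
    return resp.text


# Placeholder URL from the post; "q" is a hypothetical query parameter.
prepared = build_request(
    "http://www.xyz.com/blahblah.aspx",
    {"Accept": "*/*"},
    params={"q": "python"},
)
print(prepared.url)
```

Calling fetch(url, headers) with a real URL then returns the page text, ready to hand to whichever parser you picked in step 4.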

Looks simple, right?
Setting up Python is a bit of a pain initially, though. I simply followed the instructions available online and used pip install to install the libraries (for example, pip install requests beautifulsoup4). Also, Python has some problems printing text that the console's encoding cannot represent, which typically shows up as a UnicodeEncodeError. There are plenty of posts about encoding errors seen with the print command.
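
One way around the print errors, as a rough sketch: replace any character the target encoding cannot represent before printing. The helper name safe_text is hypothetical, and falling back to "?" is just one choice; you may prefer to fix the console encoding instead.

```python
import sys


def safe_text(text, encoding=None):
    """Return text with characters the encoding cannot represent replaced by '?'."""
    # Fall back to the console encoding, or UTF-8 if it is unknown.
    encoding = encoding or sys.stdout.encoding or "utf-8"
    return text.encode(encoding, errors="replace").decode(encoding)


# Forcing ASCII here to show the replacement on any machine.
cleaned = safe_text("caf\u00e9 \u2603", "ascii")
print(cleaned)
```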


Cheers, and happy scraping!
