
Web Scraping in Python

15/05/18 · 15 min read · Naren Allam

Web Scraping is the process of data extraction from various websites.

Multiple Techniques For Scraping

1. Requests: HTTP Library in Python

To scrape a page or a website, we first need the content of the HTML page in an HTTP response object. The requests library for Python is pretty handy and very easy to use; it uses urllib3 under the hood. I like requests because it's simple and keeps the code readable.

PYTHON

# Example showing how to use the requests library
import requests

res = requests.get("https://www.rossum.io/")  # fetch the HTML page
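Before parsing anything, it is worth checking that the request actually succeeded. Here is a minimal sketch of inspecting the response object (the URL is just the example site used above):

PYTHON

import requests

res = requests.get("https://www.rossum.io/")  # fetch the HTML page
res.raise_for_status()                        # raise an HTTPError on 4xx/5xx

print(res.status_code)                   # e.g. 200
print(res.headers.get("Content-Type"))   # e.g. text/html; charset=utf-8
print(res.text[:200])                    # first 200 characters of the HTML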

2. BeautifulSoup

BeautifulSoup is a very powerful Python library that helps you extract data from a page. It's easy to use and has a wide range of APIs for pulling data out. We use the requests library to fetch an HTML page and then use BeautifulSoup to parse it. In this example, we fetch the page title and all the links on the page.

PYTHON

from bs4 import BeautifulSoup
import requests

res = requests.get("https://www.rossum.io/")   # fetch the HTML page
soup = BeautifulSoup(res.text, "html.parser")  # parse the HTML page

print("Webpage Title:", soup.title.string)     # text of the <title> tag
print("Fetch All Links:", soup.find_all('a'))  # all <a> tags on the page
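find_all returns Tag objects, so you can read attributes off each match rather than printing the whole tags. A short sketch that prints only the link targets (same example URL):

PYTHON

from bs4 import BeautifulSoup
import requests

res = requests.get("https://www.rossum.io/")
soup = BeautifulSoup(res.text, "html.parser")

# Tag.get() returns None when the attribute is missing,
# so anchors without an href are skipped safely
for link in soup.find_all("a"):
    href = link.get("href")
    if href:
        print(href)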

3. Python Scrapy Framework

Scrapy is faster than BeautifulSoup. Scrapy is a web crawling framework in which developers write spiders: classes that define how a certain site (or a group of sites) will be scraped.
Its biggest feature is that it is built on Twisted, an asynchronous networking library, so Scrapy uses non-blocking (asynchronous) code for concurrency, which makes spiders perform very well.
Moreover, it is a framework for writing scrapers, as opposed to BeautifulSoup, which is just a library for parsing HTML pages.

Here is a simple example of how to use Scrapy. Install it via pip; Scrapy then gives you an interactive shell after fetching and parsing a page.

SHELL

$ pip install scrapy   # install Scrapy
$ scrapy shell https://www.rossum.io/

In [1]: response.xpath("//a").extract()  # fetch all <a> elements
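The shell also accepts CSS selectors, which are often shorter than XPath. For instance, to pull out just the link targets or the title text in the same session:

SHELL

In [2]: response.css("a::attr(href)").extract()      # just the href values
In [3]: response.css("title::text").extract_first()  # the page title text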

Now let's write a custom spider to parse a website.

PYTHON

import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://www.rossum.io/']

    def parse(self, response):
        # yield one item per post title found on the page
        for title in response.css('h2.entry-title'):
            yield {'title': title.css('a ::text').extract_first()}

Save this as myspider.py; the command to run it is shown below, after the walkthrough.

That’s it. Your first custom spider is created. Now let’s understand the code.

  • name: The name of the spider; in this case, it is “blogspider”.

  • start_urls: A list of URLs where the spider will begin to crawl from.

  • parse(self, response): This function is called whenever the crawler successfully crawls a URL. Remember the response object from the Scrapy shell earlier? It is the same response object that is passed to parse().

  • When you run this (see the command below), Scrapy fetches the start URLs, selects every h2 element with the entry-title class, and extracts the anchor text from each. Alternatively, you can write your extraction logic in the parse method, or create a separate class for extraction and call it from parse.
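
To actually run the spider and keep what it scrapes, use scrapy runspider together with Scrapy's feed export: the -o flag writes every yielded dict to a file (titles.json is just an illustrative name):

SHELL

$ scrapy runspider myspider.py -o titles.json   # run the spider, export items as JSON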

Conclusion

  • We’ve seen the basics of scraping, the available frameworks, how to crawl, and the do’s and don’ts of scraping.

  • Follow the target site’s rules while scraping. Don’t give it a reason to block your spider.

  • Maintaining data and spiders at scale is difficult. Use Docker/Kubernetes and a public cloud provider such as AWS to scale your web-scraping backend easily.

  • Always respect the rules of the websites you plan to crawl, and if APIs are available, always use them first (a robots.txt check is sketched below).
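
One concrete way to honor those rules is to check the site’s robots.txt before crawling. A minimal sketch using Python’s built-in urllib.robotparser (the URL is just the example site from above):

PYTHON

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.rossum.io/robots.txt")
rp.read()   # download and parse robots.txt

# can_fetch reports whether the given user agent may crawl the URL
if rp.can_fetch("*", "https://www.rossum.io/"):
    print("Allowed to crawl")
else:
    print("Disallowed by robots.txt")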