15/05/18 — Naren Allam
Web scraping is the process of extracting data from websites.
To scrape a page or a website, we first need the content of the HTML page in an HTTP response object. The requests library for Python is pretty handy and very easy to use; it builds on urllib3 under the hood. I like requests because the code stays short and readable.
BeautifulSoup is a powerful Python library for extracting data from a page. It's easy to use and offers a wide range of APIs for pulling data out of HTML. We use the requests library to fetch an HTML page and then use BeautifulSoup to parse it. In this example, we can easily fetch the page title and all links on the page.
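The combination can be sketched roughly as follows. To keep the sketch self-contained, a small inline HTML string stands in for a fetched page; the commented-out requests call shows where a real fetch would go (the URL is a placeholder).

```python
from bs4 import BeautifulSoup

# In practice you would fetch the page first, e.g.:
#   import requests
#   html = requests.get("https://example.com/").text
# A small inline document stands in for the fetched page here.
html = """
<html><head><title>Sample Page</title></head>
<body>
  <a href="https://example.com/a">First</a>
  <a href="https://example.com/b">Second</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# The page title, as a plain string.
print(soup.title.string)

# Every link on the page.
links = [a.get("href") for a in soup.find_all("a")]
print(links)
```

`find_all("a")` returns the anchor tags in document order, and `.get("href")` reads each tag's attribute, so the same two lines work unchanged on a page fetched over the network.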
Scrapy is typically faster than a requests-plus-BeautifulSoup pipeline.
Scrapy is a web crawling framework in which developers write spiders that define how a certain site (or a group of sites) will be scraped.
Its biggest feature is that it is built on Twisted, an asynchronous networking library, so Scrapy issues requests with non-blocking (asynchronous) code. A spider can therefore fetch many pages concurrently, which makes its performance very good.
Moreover, it is a full framework for writing scrapers, as opposed to BeautifulSoup, which is just a library for parsing HTML pages.
Here is a simple example of how to use Scrapy. Install Scrapy via pip; Scrapy then gives you an interactive shell for experimenting with a parsed website.
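A minimal setup sketch (the URL is a placeholder; any page works):

```shell
# Install Scrapy, then open its interactive shell on a page.
pip install scrapy
scrapy shell "https://example.com/"

# Inside the shell, `response` holds the fetched page, e.g.:
#   response.css("title::text").get()
```

The shell is handy for trying selectors interactively before committing them to a spider.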
Now let's write a custom spider to parse a website.
That’s it. Your first custom spider is created. Now let’s understand the code.
When you run this, Scrapy requests each start URL, selects every h2 element with the entry-title class, and extracts the associated text from it. Alternatively, you can write your extraction logic in the parse method, or create a separate class for extraction and call its object from the parse method.
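A standalone spider file can be run without creating a full Scrapy project (the filename and output path here are placeholders):

```shell
# Run a single spider file directly and write the yielded
# items to a JSON file.
scrapy runspider post_titles.py -o titles.json
```

Inside a generated project you would use `scrapy crawl <spider-name>` instead; `runspider` is the quickest way to try a one-off script.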