Step-by-step guide

Web Scraping With No Effort. Python: BeautifulSoup, Grequests.

How to build a web scraper with BeautifulSoup and asynchronous HTTP requests (Grequests)

Galina Blokh · Published in Geek Culture · 4 min read · Jan 13, 2021


Photo by Artem Sapegin on Unsplash

Introduction.

This is my first tutorial about web scraping. In it, I explain (with full code examples) how to create a web scraper using the BeautifulSoup and Grequests Python libraries.

Suppose you have an NLP task: collect text data from a recipe website and build a binary classifier that separates ingredients from instructions. Let's scrape the data from the recipe site https://www.loveandlemons.com/. For this purpose, we will use two popular, beginner-friendly libraries: BeautifulSoup and Grequests.

Definitions.

BeautifulSoup is an open-source, completely free library that makes it easy to scrape information from web pages. It sits on top of an HTML or XML parser and provides Pythonic idioms for iterating, searching, and modifying the parse tree. BeautifulSoup cannot send web requests itself, which is why we will use the Grequests module. BeautifulSoup also does not ship its own document parser; we have to choose one, such as ‘html.parser’, html5lib, an XML parser, and a few others.
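For instance, the parser is chosen when the soup object is constructed. A tiny self-contained illustration (not from the original article) using the built-in html.parser:

    from bs4 import BeautifulSoup

    # The second argument selects the parser that BeautifulSoup sits on top of.
    soup = BeautifulSoup("<p class='ingredient'>2 lemons</p>", "html.parser")
    print(soup.find(class_="ingredient").get_text())  # -> 2 lemons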

Grequests is also an open-source, free library; it lets you use Requests with Gevent to make asynchronous HTTP requests. Most tutorials use the Requests module and implement asynchronous requests by hand. Grequests comes with asynchronous methods built in, which is very convenient.

Package installation.

It is not hard to do:

  1. Open the interpreter settings in your IDE (I use PyCharm, but installing a new package works the same way in any IDE)
  2. Press plus to add a new package/library
  3. Enter the name of the package you want to install
  4. Press Install Package.
Pic.1 How to install the Grequests package in PyCharm using an Anaconda virtual environment

With all the packages installed, we can plan the steps for collecting data from the website:

  1. Collect all links with recipes into a list
  2. Collect the data from each page
  3. Save the data into a file

Step 1. Collect all links.

This part is easy. First, construct a request list and use grequests.map() to send the requests in parallel. Then get back a list of responses.

How to use grequests
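The original embedded snippet is not reproduced here, but a minimal sketch of this step could look like the following; the index URLs are only an assumption for illustration:

    import grequests

    # Pages that list recipes; the exact URLs are an assumption for illustration.
    urls = [
        "https://www.loveandlemons.com/recipes/",
        "https://www.loveandlemons.com/recipes/page/2/",
    ]

    # Construct the request list, then send all requests in parallel.
    req_list = (grequests.get(url) for url in urls)
    res_list = grequests.map(req_list)  # a list of Response objects (None on failure)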

Since we now have a list of responses, we can iterate over it with bs4 and html.parser, and collect all recipe links from each bs4 object using a link pattern. In this way, we get a list of links.

Parsing the results of all the requests by iterating through the res_list with bs4 and link pattern
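A rough sketch of that loop; the href pattern used to recognize recipe links here is an assumption and may differ from the pattern in the original gist:

    from bs4 import BeautifulSoup

    links = []
    for res in res_list:
        if res is None:  # failed requests come back from grequests.map() as None
            continue
        soup = BeautifulSoup(res.text, "html.parser")
        # The link pattern is an assumption: keep hrefs that point at the recipe site.
        for a in soup.find_all("a", href=True):
            if a["href"].startswith("https://www.loveandlemons.com/"):
                links.append(a["href"])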

That finishes step 1, in just a few lines of code plus imports. Easy!

Step 2. Collect the data from each page.

First, define a default dictionary to store the data we are collecting. Then repeat the two lines from step 1: construct a request list and get responses for all the recipe pages. Pay attention: turn the links into a set first to remove duplicates:

Get all recipe responses at one time
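A minimal sketch, assuming the links list from step 1 (the variable names are illustrative):

    from collections import defaultdict

    import grequests

    data = defaultdict(list)                    # will hold the collected text
    unique_links = set(links)                   # a set removes duplicate links
    req_list = (grequests.get(link) for link in unique_links)
    recipe_responses = grequests.map(req_list)  # one response per recipe page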

Then iterate over the responses, applying the bs4 search to each response for elements of class “ingredient” and class “instructions”. Wrap the parsing in a try-except statement, because an AttributeError is raised when we try to parse a None object:

Data search in each response with bs4
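A sketch of the search, continuing with recipe_responses from the previous sketch; the class names “ingredient” and “instructions” are the ones named in the article, while the variable names are mine:

    from bs4 import BeautifulSoup

    for res in recipe_responses:
        try:
            soup = BeautifulSoup(res.text, "html.parser")
            # Class names "ingredient" and "instructions" come from the page HTML.
            ingredient_tags = soup.find_all(class_="ingredient")
            instruction_tags = soup.find_all(class_="instructions")
        except AttributeError:
            # Failed requests are None, and parsing None raises an AttributeError.
            continue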

The picture below shows the target class search in the web page HTML; it marks the place where you take the keys (the class names) for the target search:

When you have caught all the text elements (in the same for-loop), transform the results into lists of text and put them into the default dictionary. That's all, step 2 is over:

Using a list comprehension to extract text data from a parsed response
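Continuing inside the same for-loop, the extraction and storage might look like this; the keys ‘Recipe’ and ‘INSTRUCTIONS’ match the columns shown later in Pic.2, and data is the default dictionary from the earlier sketch:

        # Still inside the for-loop from the previous sketch:
        # turn every matched tag into plain text and store it in the default dict.
        data["Recipe"].append([tag.get_text(strip=True) for tag in ingredient_tags])
        data["INSTRUCTIONS"].append([tag.get_text(strip=True) for tag in instruction_tags])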

Step 3. Save the data into a file.

This is the quickest and easiest step. Pickle dumps data much faster than CSV and takes less space. Here we use the pickle format, but you are free to use any format you like.

Save data into pickle file
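One possible way is to build a pandas DataFrame from the dictionary and pickle it; the file name recipes.pkl is my assumption:

    import pandas as pd

    # Build a DataFrame from the default dictionary and dump it to a pickle file.
    df = pd.DataFrame(data)
    df.to_pickle("recipes.pkl")  # the file name is an assumption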

Try opening this file with pandas: you'll see a data set with two columns, ‘Recipe’ and ‘INSTRUCTIONS’. Each of the 1009 rows in the DataFrame holds data from one of 1009 unique recipe pages, and each cell contains a list of paragraphs.

Pic.2 Data set loaded from pickle file
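For reference, loading the file back with pandas might look like this (using the file name assumed above; the shape reflects the 1009 recipe pages mentioned in the article):

    import pandas as pd

    df = pd.read_pickle("recipes.pkl")
    print(df.shape)    # (1009, 2) in the article's run
    print(df.columns)  # Index(['Recipe', 'INSTRUCTIONS'], dtype='object')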

Conclusions.

Well, we've done a big job in a short time. The next steps toward the NLP goal would be: preprocess the data, split it into paragraphs, label each text paragraph (for text classification), transform the text into sequences, and build the model. That might be the subject of a follow-up article.

In this tutorial, you learned how quick and easy web scraping is with asynchronous HTTP requests. It takes little time and is intuitive. BeautifulSoup and Grequests are very convenient and powerful tools for data science purposes. I hope this knowledge saves you some googling time and helps you avoid getting stuck on errors.

The full project is on GitHub.

Galina Blokh
Passionate about technologies, love challenges, talented NLP Data Scientist in EPAM. https://www.linkedin.com/in/galina-blokh/