Getting Started with Scraping
This exercise uses the Python programming language and the BeautifulSoup Python library for pulling data out of HTML pages. We’ll write code in a Python file in a text editor, and then we’ll run the script using the Terminal.
It is helpful to have knowledge of html and css concepts before you learn to scrape a web page. These are basic concepts that all students who have taken the first few modules of a web design class should know.
- table – html tag that defines a table on an html page
- tr – html tag for a table row
- td – html tag for table data (a single cell)
- div – a section of a Web page. It can be identified further with classes or ids.
- Basic understanding of how html is styled with attributes and inline and external css.
You will need to be able to look at html code and identify selectors (ids and classes) that will help the script find the data.
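To see how a script can walk through html and pull out the cells you identify, here is a minimal sketch using Python 3’s built-in html.parser module. The div class “results” and the film data are made up for illustration; the actual exercise uses BeautifulSoup instead, which makes this much easier.

```python
from html.parser import HTMLParser

# Hypothetical sample html: a table inside a div with a class, like the
# structures you will look for when scraping.
SAMPLE = """
<div class="results">
  <table>
    <tr><td>Argo</td><td>2012</td></tr>
    <tr><td>Gladiator</td><td>2000</td></tr>
  </table>
</div>
"""

class CellCollector(HTMLParser):
    """Collects the text of every td tag it encounters."""
    def __init__(self):
        super().__init__()
        self.in_td = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_td = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False

    def handle_data(self, data):
        # Only keep text that appears inside a td cell.
        if self.in_td:
            self.cells.append(data.strip())

parser = CellCollector()
parser.feed(SAMPLE)
print(parser.cells)  # ['Argo', '2012', 'Gladiator', '2000']
```

Notice that the scraper’s job is simply to recognize the tags (tr, td) and attributes (classes, ids) you identified by reading the page source.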
A few things about Python:
- Python is whitespace sensitive, so indentation matters. When you paste the sample code, make sure it is indented in an identical manner.
- Do not end the lines with a semicolon or other punctuation.
- If you want to comment any lines of the code, precede each line with a # sign. Comments let you add helpful instructions or descriptions without affecting the script’s ability to run, and they let you temporarily remove lines of code for testing.
- If you get any error messages in the Terminal, read them. They will help you troubleshoot any problems.
These syntactical rules may not make sense right now, but will become clearer as we work through the exercises.
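Here is a short illustration of the rules above (the film names are just made-up sample data):

```python
# Lines that begin with "#" are comments; Python ignores them when it runs.
films = ["Argo", "Gladiator", "Chicago"]

shouted = []
for film in films:
    # Everything indented under the "for" line is part of the loop body.
    shouted.append(film.upper())

# shouted.append("EXTRA")  <- commenting out a line disables it temporarily
print(shouted)  # no semicolon is needed at the end of a line
```

If you change the indentation of the line inside the loop, Python will raise an IndentationError, which is exactly the kind of Terminal error message worth reading carefully.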
Python Code
Check your Python version. Make sure you have Python installed and are using at least version 2.7.6. On a Mac, you can find the Terminal under Applications > Utilities. In the Terminal, type the following command. The $ indicates the Terminal prompt, so don’t type that.
$ python --version
Also, make sure you have the pip program installed. This allows you to install the Python libraries you will use during this lesson. Type this command into the Terminal:
$ pip --version
If you don’t have pip, run:
$ sudo easy_install pip
The command “sudo” means “super-user do,” and it will prompt you for your computer’s password. Install the following now with pip:
$ sudo pip install BeautifulSoup
$ sudo pip install Requests
With your Terminal open and a new document ready in your text editor, use the Terminal to cd to a folder on your computer. Let’s just mkdir from your home directory to make a folder called “scraper”. Then cd into “scraper”.
$ mkdir scraper
$ cd scraper
Now in your text editor, save a file named scrape.py in that folder. Your Terminal will then be in the appropriate directory to run the file.
Hint: if you need to change your Finder to your “Home” folder, use Cmd-Shift-H.
You will use the following command to run the program:
$ python scrape.py
Enter the following code in the scrape.py file you created. It will scrape the list of Academy-Award-winning films from Wikipedia. Once the file is saved, run it from the scraper directory with $ python scrape.py.
import csv
import requests
from BeautifulSoup import BeautifulSoup
url = 'https://en.wikipedia.org/wiki/List_of_Academy_Award-winning_films'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html)
table = soup.find('div', attrs={'class': 'mw-content-ltr'})
list_of_rows = []
for row in table.findAll('tr'):
    list_of_cells = []
    for cell in row.findAll('td'):
        text = cell.text
        list_of_cells.append(text.encode('utf-8'))
    list_of_rows.append(list_of_cells)

outfile = open("./film.csv", "wb")
writer = csv.writer(outfile)
writer.writerows(list_of_rows)
outfile.close()
Let’s take a look at what is happening:
- First, we import several libraries that we will need. “import csv” lets us write a csv file at the end of the script, “import requests” lets us fetch the page from the url, and BeautifulSoup lets us parse and select the html.
- Next, we establish the url of the site we are scraping. We put it in a variable named “url,” so we can use it in the script. The next two lines create variables for the response and html. We will also use html in the script.
- The next two lines use BeautifulSoup to access the html (DOM) and make a selection.
- The “find” line is the key to this exercise. You have to go into the code and find a CSS selector that identifies where in the code you want to start scraping. The data is usually stored in a table (although it could be in a list). Find a containing element for the table or list and identify either a class or id. In this case, there was a parent div that had a class “mw-content-ltr”. The code includes that information as an attr (attribute).
- Next we create an empty array named list_of_rows. This will contain all our rows of data until we are ready to write it to a file.
- Next there are nested for loops that go through each row of the table, extract the text from each cell and append it to the list_of_cells array. That is then appended to the list_of_rows array. That happens for every row of the table.
- The last few lines simply write the data from the final list_of_rows array to a file; in this case, we named it film.csv. After you run the script, you will find the file in the same folder as your script.
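The row-collecting and csv-writing steps above can be sketched on their own, without any scraping, by standing in made-up rows for the scraped table (the filename and film data here are hypothetical):

```python
import csv

# Hypothetical stand-in for a scraped table: each inner list is one tr row,
# each string one td cell.
table_rows = [
    ["Wings", "1927/28", "2", "2"],
    ["Gigi", "1958", "9", "9"],
]

# Nested loops, just like the scraper: outer loop per row, inner loop per cell.
list_of_rows = []
for row in table_rows:
    list_of_cells = []
    for cell in row:
        list_of_cells.append(cell.strip())
    list_of_rows.append(list_of_cells)

# Write every collected row to a csv file in the current folder.
with open("films_demo.csv", "w", newline="") as outfile:
    writer = csv.writer(outfile)
    writer.writerows(list_of_rows)

# Read it back to confirm what was written.
with open("films_demo.csv") as f:
    print(f.read())
```

The with-open form shown here closes the file automatically; the scraper above does the same job with an explicit open and close.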
From the terminal, run:
$ python scrape.py
After a few seconds, the prompt should reappear. Check your folder for the csv and open it in Excel. You have created your first scraper!
Exercise: Modify the scraper above to scrape the contents of this page, a list of the Top 100 Newspapers. http://247wallst.com/media/2017/01/24/americas-100-largest-newspapers/
Hint: You have to find a containing selector for the table you wish to scrape and modify the “find” statement. And, you have to change the url.
Also, you can rerun a command in Terminal by using the up arrow/down arrow keys. This allows you to scroll through the last several statements.
Resources
- You can find code for a multi-page scraper on the Using Python for Scraping page on CodeActually.com.
- The Ten Best Data Scraping and Web Scraping Tools for 2019
- Chrome Scraper Extension
- Try Google Sheets IMPORTHTML function