Lab: BeautifulSoup

The goal of this lab is to learn how to scrape web pages using BeautifulSoup.

Introduction

BeautifulSoup is a Python library that parses HTML files and allows you to extract information from them. HTML is the format used to represent web pages.

If you are interested in some question, whether it be related to finance, meteorology, sports, etc., there are almost certainly web sites that provide access to data that can be used to explore your question and gain insight.

Sometimes, you are lucky and find a website that both has the data you want and helpfully provides it in a machine-readable format. For instance, it may have a mechanism that allows you to submit queries and receive CSV or JSON files in return. A website that offers this service is said to “provide an API.”

Sadly, most websites are not so helpful. To gather data from them, you will instead need to go rogue and “scrape” them – submit requests as if they were coming from a web browser being operated by a human, but save the raw HTML output and interpret it programmatically in a Python script instead. When this is necessary, you will need to be able to parse the HTML code for the web page. And when you need to parse HTML, BeautifulSoup is the library to use.

Unfortunately, web pages are designed to present information to humans; web designers rely on the ability of humans to make sense of the data, no matter how it may be formatted. BeautifulSoup cannot analyze an entire web page and understand where the data is; it can only help you extract the data once you know where to find it. Therefore, applying BeautifulSoup to a scraping task involves:

  1. inspecting the source code of the web page in a text editor to infer its structure
  2. using information about the structure to write code that pulls the data out, employing BeautifulSoup

Installation

If you are completing this lab on a CSIL machine, the required software is likely already present. But, if you are working on your VM, perform the following step to upgrade your modules to working versions (the versions pre-installed on the VM appear to be buggy):

sudo pip3 install --upgrade beautifulsoup4 html5lib

If you experience import issues on a CSIL machine, try:

pip3 install --user --upgrade beautifulsoup4 html5lib

References

Here are links to reference material on HTML and BeautifulSoup that you may find useful during the course of this lab:

Example 1: Aviation Weather Observations

If you are interested in working with weather data, you may find METAR data to be useful.

Pilots rely on accurate weather information to operate aircraft safely. For instance, wind speeds and directions affect landing technique, because the pilot must compensate for the fact that the wind may be blowing the aircraft off course. And, the altitude of clouds and presence or absence of fog determines whether a pilot can safely approach an airport and visually identify the location of the runway without flying dangerously close to the ground.

Therefore, airports around the country publish hourly weather observations, called METARs. These reports are formatted as plain text in a highly abbreviated syntax that any trained pilot knows how to decode. Because they follow a standard format, they are suitable for use in programs as well. Unfortunately, the web page that provides METARs places them as a single line of text amidst other elements, such as a request form and various weather-related links. We want to be able to extract just the weather observation from this cluttered page.

Step 1: Examining the Page

Start by going to the web site that provides this data.

On the right side of the page, there is a form field labelled “IDs” and a button named “Get METAR data”. In the IDs field, you need to fill in the ICAO code of a US airport. The ICAO code is simply the three-letter code you are probably used to (known as an IATA code), preceded by a K. For instance, the ICAO code for Midway is KMDW, and the ICAO code for O’Hare is KORD. Please enter a single ICAO code for an airport of your choosing, leave the other options at their defaults, and press the button to get the current result.

If you entered a valid airport code, you should get back a page that contains a line like:

KMDW 082353Z 17008KT 10SM SCT110 BKN180 BKN250 M09/M18 A3056
RMK AO2 SLP372 60000 T10891178 11067 21100 58016

Interpreting a METAR is obviously outside the scope of the course, but, for what it’s worth: this line gives the ICAO code, then the date and time of the observation (in the UTC time zone), then the wind direction and speed, followed by the visibility (in miles), information on the current cloud cover, the temperature and dewpoint in Celsius, then the barometric pressure, followed by other remarks.

You could simply visit this web site by hand, and then copy and paste the pertinent line into a file. But if you wanted to collect these observations automatically every hour, or for hundreds or thousands of airports, doing so by hand quickly becomes untenable.

Let’s now explore the technical details of this web page.

First, take a look at the URL of the web page. It should look something like:

https://www.aviationweather.gov/metar/data?ids=kmdw&format=raw&date=0&hours=0

(Your URL may be different, depending upon what airport you used and whether you typed the ICAO code in lower- or upper-case letters.)

We have already learned something useful: for this web site, the URL encodes the specific airport being queried. So, if we wanted to write a Python function that generates a URL for a given airport, we do not need to somehow fill in and submit the form on the web site and see which page it sends us to; all we need to do is concatenate a couple of strings together.
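In fact, that operation is simple enough to capture in a couple of lines. Here is a minimal sketch (the helper name metar_url is hypothetical; the URL structure is the one observed above):

# Hypothetical helper: build the METAR query URL for a given ICAO code,
# following the URL structure observed above.
def metar_url(icao):
    return ("https://www.aviationweather.gov/metar/data?ids="
            + icao + "&format=raw&date=0&hours=0")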

Now, use your web browser’s File menu to save a copy of the web page to disk. Open this file with a text editor to view the raw HTML formatting of the web page.

You can start by scanning through the file to get a sense of what it looks like in general. However, because it is fairly large and most of the content is irrelevant, you will quickly want to home in on the portion of interest.

Referring back to your web browser window, determine the first few characters of the information we want to extract from the web page. For instance, in the example above, they are KMDW 08. Use your text editor’s Find function to locate this text within the web page source.

Now, examine the HTML tags in the vicinity of this text. Look for distinctive landmarks that would help you programmatically identify the location of the weather observation without searching for the actual contents of the observation itself.

There are different ways to approach this task; here is what I noticed. Before the observation, there is a paragraph tag with the attribute clear="both". I searched for other paragraph tags with the same attribute in the page, but didn’t find any. Within this paragraph is a human-readable date, and then the paragraph is closed with a </p> tag.

Following the paragraph is a newline, then a comment reading “Data starts here”. Next is a newline, then the actual observation, wrapped in a <code> tag.

Step 2: Parsing the HTML

I could then locate the weather observation using this approach:

  1. Find all paragraph (p) tags with attribute clear="both". (There is only one.)
  2. This gives me a list of length one. Select the first entry in this list.
  3. I could then navigate the tree of HTML, starting with this paragraph tag, to get to the next sibling. Critically, this skips over everything nested within the paragraph tag, including the bold (strong) tag and the human-readable date.
  4. It turns out that the next sibling is the newline character after the closing of the paragraph tag. Not what I want. I’ll move on to the next sibling of that.
  5. That entry turns out to be the comment (“Data starts here”). I’ll move on to the next sibling once more.
  6. Another newline. Let’s go one sibling further.
  7. I get a <code> tag at this point. I can use .text on this tag to pull out its contents.

Using a Python interpreter, try writing lines of code to go through this process. Here are fragments of code that you will find useful:

import bs4
html = open(filename).read()
soup = bs4.BeautifulSoup(html)
tag_list = soup.find_all("p", clear="both")
...
tag = tag.next_sibling
...
data = tag.text
...

Note that when you try the line that initializes the soup object, you will get a warning message, along with advice on how to improve your code by adding an extra parameter (the name of a parser, such as "html5lib") to the function call. Go ahead and follow this advice.

Combine these snippets, the step-by-step process for getting the weather observations listed above, and your knowledge of Python to write a series of lines of code that successfully retrieve the weather observation string.
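If you get stuck, here is one possible sketch of the whole sequence, assuming the saved page lives in a file named metar.html (a hypothetical filename) and that the structure is as described above:

import bs4

# Read the saved copy of the page and parse it. Naming a parser
# explicitly ("html5lib") is the extra parameter that silences the
# warning mentioned above.
html = open("metar.html").read()
soup = bs4.BeautifulSoup(html, "html5lib")

# Steps 1-2: there is exactly one paragraph with clear="both".
tag = soup.find_all("p", clear="both")[0]

# Steps 3-7: hop across the siblings of that paragraph.
tag = tag.next_sibling   # the newline after </p>
tag = tag.next_sibling   # the "Data starts here" comment
tag = tag.next_sibling   # another newline
tag = tag.next_sibling   # the <code> tag

observation = tag.text
print(observation)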

If you looked over the HTML file and made the observation that you could just search for the <code> tag and go directly to where you want to be, instead of performing the additional navigation described above, you are correct. At some point in the recent past, the web designer added this tag; previously, you did have to follow the more indirect process described above. I’ve left the more indirect version in this lab, however, because it demonstrates more of the techniques you will need in typical situations. When a more direct approach is available, though, feel free to use it.

Step 3: Automating Queries for Any Airport

Now, here is a code snippet that loads HTML from a URL:

import urllib3
import certifi

# Create a pool manager that verifies HTTPS certificates against the
# certifi certificate bundle.
pm = urllib3.PoolManager(
       cert_reqs='CERT_REQUIRED',
       ca_certs=certifi.where())
...
# Issue a GET request for the URL and keep the raw bytes of the response.
html = pm.urlopen(url=myurl, method="GET").data

Take these snippets, combine them with your BeautifulSoup-based parsing code and some additional code that builds a URL through string concatenation, and create a single function current_weather that takes one parameter – an airport code – and returns a string containing the latest weather observation from that airport.
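If you would like to check your work against an outline, here is a hedged sketch of how the pieces might fit together; it assumes the URL format and page structure observed earlier, which may of course change:

import bs4
import urllib3
import certifi

def current_weather(icao):
    # Build the query URL by string concatenation.
    url = ("https://www.aviationweather.gov/metar/data?ids="
           + icao + "&format=raw&date=0&hours=0")

    # Fetch the page over HTTPS, verifying the server's certificate.
    pm = urllib3.PoolManager(cert_reqs='CERT_REQUIRED',
                             ca_certs=certifi.where())
    html = pm.urlopen(url=url, method="GET").data

    # Parse and navigate to the <code> tag, as in Step 2.
    soup = bs4.BeautifulSoup(html, "html5lib")
    tag = soup.find_all("p", clear="both")[0]
    for _ in range(4):    # newline, comment, newline, <code>
        tag = tag.next_sibling
    return tag.text

A call such as current_weather("KMDW") should then return the latest observation for Midway.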

Once you have this working, take a moment to understand the power of the function you just wrote: although you only inspected the format of the web page for one airport weather observation at one time, you found characteristic landmarks in the formatting of this page that will allow your function to work for any airport at any time, and to be used millions of times, if desired.

Example 2: Climate Data

This data is no longer available.

Example 3: Chicago ‘L’ Lines

The CTA maintains web pages for each of the ‘L’ lines in the city. For instance, here is the one for the Red Line.

Although these pages are valuable, they were even more valuable in the past – during a recent redesign of the site, the CTA reduced the amount of information that can be seen at a glance. Fortunately, it is often (though not always) possible to view older versions of web pages using the Internet Archive Wayback Machine. Try entering the same URL into that system and exploring back in time. You should be able to find a version of the site that has more detailed information in its chart. In particular, this version is much better than the current one.

Scroll down to the section titled “Route Diagram and Guide” and have a look. This section seems to be chock full of detail about this transit line: a list of stations, information about wheelchair accessibility and parking, and rail and bus transfers. (This part of the web site is the one that has seen the most dramatic decrease in information density recently, if you compare the two versions.)

If you were doing a project on public transportation in Chicago, this page (and the corresponding ones for the other lines) seems like it would be a treasure trove of data. Unfortunately, much of it appears to be graphical. For instance, you can transfer to the Yellow and Purple lines at Howard, but short of writing code to interpret an image (a very complicated task), determining this in your code rather than by visual inspection seems out of reach. Similarly, while there are wheelchair and parking icons for some stations, they are icons, not text.

The HTML tag for images, however, allows a web site designer to provide an alternative textual representation of an image. This is accomplished by adding an alt attribute to an img tag. For various reasons, it is considered good practice to do this, and a well-designed web site will follow this convention.

Save this page (the archived version) and open it in your text editor. Search for this table and take a look at the image tags. We’re in luck!

When we are navigating the tree of elements that results from parsing a web page with BeautifulSoup, we can retrieve the value of a tag’s attribute with the syntax tag["attr_name"]. For instance, if t is an img tag, we could use t["alt"] to retrieve its alt text.
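For instance, a quick way to survey all the alt text on the page (a sketch, assuming soup holds the parsed page) is:

# Print the alt text of every image that provides one.
for img in soup.find_all("img"):
    if img.has_attr("alt"):
        print(img["alt"])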

This table seems to be a little harder to distinctively identify. But there is a nearby a tag that might serve as a landmark based on one of its attributes.

To take advantage of this, here’s what I tried:

soup.find_all("a", name="map")

Please try it yourself.

This didn’t work: Python gave me (and, most likely, you) an error message. What happened here?

It turns out that the first parameter of find_all, the one that stores the name of the tag itself (in this case a), is itself named name. This conflicts with our attempt to specify the value of an attribute named name.

This is similar to, but distinct from, the requirement to write class_ instead of class when matching against an attribute named class, because class is a Python reserved word.
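As an aside, attribute matching on class then looks something like this (the class name here is made up for illustration):

soup.find_all("div", class_="stations")   # hypothetical class name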

There is a workaround, however: you can make a dictionary of attributes to look for, and just have a single entry:

soup.find_all("a", attrs={"name":"map"})

Once you have found this a tag, you can use .parent and .next_sibling to get to the table. Examine the nesting of tags carefully to understand what the tree of nested tags looks like in this vicinity.
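As a starting point, the navigation might look something like the following sketch; it uses find_next_sibling, a bs4 convenience that skips over intervening whitespace, and it assumes the table is a sibling of the a tag’s parent, which you should verify against the actual nesting:

# Locate the landmark <a name="map"> tag, climb to its parent, and then
# search forward among the siblings for the table itself.
anchor = soup.find_all("a", attrs={"name": "map"})[0]
table = anchor.parent.find_next_sibling("table")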

Write code to scrape the table and turn it into a useful in-memory representation. Here is one reasonable choice (a code sketch follows the example below):

  • There is a list of stations in order down the line, in the order presented on the page. (For instance, for the Red Line, this is north to south, from Howard to 95th/Dan Ryan.)
  • Each entry in this list is a dictionary. It has a key for the name, a key for the “amenities” (accessible, parking), a key for the URL for the station page, a key for the ‘L’ transfers (colored lines), and a key for the other connections (bus and Metra rail lines). The values associated with each of these keys could simply be text strings or lists of strings retrieved from the table, with processing to remove HTML coding. Here is an example for the Howard station, for instance:
{'L-transfers': ['transfer to yellow line', 'transfer to purple line'],
 'amenities': ['accessible station', 'automobile parking available'],
 'connections': '\n            CTA Buses #22, #97, #147, #151, #201, #205, #206\n            Pace Buses #215, #290\n            ',
 'URL': '/web/20170211053109/http://www.transitchicago.com/travel_information/station.aspx?StopId=71',
 'name': 'Howard'}
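As promised above, here is a hedged sketch of code that could build such a list. It assumes that each station occupies one table row, that the first link in a row points to the station page, and that the connections text sits in the row’s last cell; every one of those selectors is a guess to be checked against the real HTML:

def scrape_stations(table):
    # Walk the table rows and build one dictionary per station.
    stations = []
    for row in table.find_all("tr"):
        link = row.find("a")
        if link is None or not link.has_attr("href"):
            continue   # skip header or decorative rows
        # Collect the alt text of the row's icons.
        alts = [img["alt"] for img in row.find_all("img")
                if img.has_attr("alt")]
        stations.append({
            "name": link.text.strip(),
            "URL": link["href"],
            "L-transfers": [a for a in alts if "transfer" in a],
            "amenities": [a for a in alts if "transfer" not in a],
            # Guess: the connections text lives in the row's last cell.
            "connections": row.find_all("td")[-1].text,
        })
    return stations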

Conclusion

For any web page that holds data you want, you should start by seeing if the web site offers JSON or CSV data. But if it doesn’t, you have the option of retrieving HTML and using BeautifulSoup.

Doing so involves examining the HTML source code for the page, finding landmarks that allow you to programmatically locate the data of interest within the page, and then extracting it. Working through the page involves using navigation facilities like find_all, parent, and next_sibling; extracting data involves unwrapping the tags around it to get to the text, while respecting and maintaining the structure of the data.

This process can be painstaking, but in the end, working with HTML is manageable, and BeautifulSoup makes it considerably easier.

The experience you have gained in this lab is very likely to help you with the upcoming programming assignment, and in the course project.