Page Scraping: the Good, the Bad, and the Ugly

You're either wondering what page scraping means or why on earth you would do such a thing instead of using an API like a sane person. Both will be covered in today's post as we explore page scraping with Python's delightful Beautiful Soup package.


My dad's big on investing as a personal hobby. When I was a kid, he was usually watching Bloomberg or researching investment options on his computer. Naturally, when I grew up he wanted to teach me how to invest wisely, especially once I had the money saved up to do so. Not long after landing my first real job as a DevOps Engineer, I had saved more than enough to cover the costs of single living as a 20-something while still keeping a healthy emergency fund.

What's This Got to Do With Page Scraping?

During an hour-long sit-down with my dad while I was visiting home, he showed me the sites he refers to when researching stocks and ETFs he's interested in. Looking up each symbol and cross-referencing it across the sites he used was time-consuming, not to mention there are an enormous number of ETFs and stocks out there. I figured there had to be a way to get the data programmatically and cut out the time spent interacting with each site's frontend manually. Unfortunately, none of the three sites my dad referred me to offered an API I could use (not a free one anyway, and I'm cheap). Hence the need for page scraping.

Wait What's an API?

An API, or Application Programming Interface, is, generally speaking, a set of methods for communicating between various components of software. The functions an API provides can be used as "building blocks" by a programmer to make software. For example, companies like Google give interested developers access to various APIs so they can integrate components like maps or text-to-speech into their apps or websites. The way to obtain the data a developer wants is clearly defined, and they can expect it not to change suddenly without notice. If there is a breaking change, it's announced well in advance so developers can plan to upgrade their code to the new API version without breaking their site or app.

Then What's Page Scraping?

Simply put, page scraping is a way to extract data from a page's HTML (Hypertext Markup Language) source code. In case you weren't already aware, all internet browsers support viewing a site or page's HTML code. It's also possible to obtain the HTML for a page from the command line via curl. With page scraping, this HTML is then parsed for specific data encased in HTML tags.
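To make that concrete, here's a minimal sketch using nothing but Python's standard library: walk a page's tags looking for the data you want. The HTML here is invented and inlined so the example stays self-contained.

```python
from html.parser import HTMLParser

# A toy page resembling what a quote site might serve; the markup and class
# names are invented for illustration.
HTML = """
<html><body>
  <div class="quote"><span class="symbol">VTI</span>
  <span class="price">220.15</span></div>
</body></html>
"""

class PriceScraper(HTMLParser):
    """Collects the text of any tag whose class attribute is 'price'."""
    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened
        if dict(attrs).get("class") == "price":
            self._in_price = True

    def handle_data(self, data):
        # Text between tags arrives here; grab it if we're inside a price tag
        if self._in_price:
            self.prices.append(data.strip())
            self._in_price = False

scraper = PriceScraper()
scraper.feed(HTML)
print(scraper.prices)  # → ['220.15']
```

This works, but you can already see why a dedicated library is nicer: tracking parser state by hand gets tedious fast.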

Why Page Scraping is Bad

Page scraping is less feature-rich than an API, since it depends on what data actually appears in the HTML code. It's also very brittle: oftentimes the data you want lives in a specific HTML element with a specific name or id, but the HTML of a page is subject to change at any given time. How often have you visited a site you frequent only to find that the layout changed, new features were added, or the information you wanted was harder to get to? It might not happen often, but when it does, it can cause unexpected breakage in code that relies on page scraping to gather the data it needs to work.
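Here's a minimal sketch (with invented class names) of what that breakage looks like in practice: when a site renames an element, BeautifulSoup's find simply returns None, and any call chained onto it raises an AttributeError.

```python
from bs4 import BeautifulSoup

old_html = '<div class="rank">Strong Buy</div>'
new_html = '<div class="rank-v2">Strong Buy</div>'  # same data, renamed class

def get_rank(html):
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find(class_="rank")
    # Guard against the element disappearing; chaining .getText() directly
    # onto find() would raise AttributeError once the class is renamed.
    return tag.getText() if tag is not None else None

print(get_rank(old_html))  # → Strong Buy
print(get_rank(new_html))  # → None
```

Checking for None before using the result at least turns a crash into a handled failure, but the scraper is still broken until someone updates the selector.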

Back to the Story

Putting it all together, I knew I'd have to rely on page scraping to get the data I wanted. I knew the three sites I'd need to use in order to research ETFs: Morningstar, Zacks, and ETFdb. Morningstar had an API, but it required you to purchase Morningstar Premium. At $24 a month or $199 a year, it was a hard no for me. As for Zacks, it seems they have an API nowadays, but you need to contact them for information, which likely means it isn't free. Lastly, ETFdb didn't and still doesn't seem to have an official API at all. Thankfully, Python has a nice package to make page scraping a little less painful.

BeautifulSoup to the Rescue

I'd heard good things about how BeautifulSoup makes page scraping easy. You can install it with pip as follows:

pip install beautifulsoup4

To use the package, just add the following import line to your Python script:

from bs4 import BeautifulSoup

As you can see from the Python snippet below, as long as you know which HTML element you're looking for in the page, getting the data is pretty straightforward, especially when pairing BeautifulSoup with the Python Requests package:

import logging

import requests
from bs4 import BeautifulSoup

LOGGER = logging.getLogger(__name__)

# The actual Zacks quote URL is omitted here; ZACKS_URL stands in for the
# page the tool scraped, with the ETF symbol appended.
page = requests.get(ZACKS_URL + symbol)

if page.status_code != 200:
    LOGGER.debug("Zacks is returning status code %s", page.status_code)
    return None

soup = BeautifulSoup(page.text, 'html.parser')
rating_text = soup.find(class_='zr_rankbox').getText()

What Could Go Wrong?

While I was able to make a decent CLI tool and PyPI package that I successfully used when first deciding which ETFs to invest in, there inevitably was breakage less than a year later. When I recently ran the tool again to see whether my ETF picks were still highly rated, it errored out. Checking the logs, I saw that ETFdb had changed not only the way the data is presented, but also how it's obtained via the data URL.


Before the change, the data URL took a human-readable category name (the base URL and remaining query parameters are omitted here):

category = "technology"
page = requests.get((
        "cond={%22by_category%22:" + category + "}&"
        "limit=" + str(limit) +
        ...))

After the change, the same request required an opaque data structure instead:

category = '["Etfdb::EtfType",1996,null,false,false]'
page = requests.get((
        "cond={by_type:" + category + "}&"
        "limit=" + str(limit) +
        ...))

So instead of using a human-readable name to refer to the category of ETF, at some point they changed it to an opaque data structure. Supporting this would require mapping every imaginable ETF category (of which there are literally dozens) to its equivalent, unintuitive data structure. Not only is this gross, but it's also time-consuming to set up that mapping.
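One way to cope would be a hand-built lookup table from the old friendly names to the new opaque structures. A sketch of that idea follows; only the "technology" pairing comes from the snippets above, and every other entry would have to be reverse-engineered the same way.

```python
# Maps human-readable ETFdb categories to the opaque structures the site now
# expects. Only "technology" is an observed pairing; the rest are left as an
# exercise in tedium.
CATEGORY_MAP = {
    "technology": '["Etfdb::EtfType",1996,null,false,false]',
    # "financials": ...,  # dozens more entries like this, each found by hand
}

def encode_category(name):
    """Translate a friendly category name, failing early if it's unmapped."""
    try:
        return CATEGORY_MAP[name]
    except KeyError:
        raise ValueError(f"No mapping for ETF category: {name!r}")

print(encode_category("technology"))
```

Failing early on an unmapped category beats silently sending a bad request, but it doesn't change the underlying problem: the mapping itself has to be maintained by hand and re-verified every time the site changes.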


BeautifulSoup combined with Requests can make page scraping pretty straightforward. However, if you have a free API at your disposal, I highly suggest using it to save yourself breakage and headaches later on.
