Start Web Scraping Coronavirus Data Using Node.js

Jonathan Lao
6 min read · Jun 23, 2020

Inspired by this post on how to begin web scraping coronavirus data using Python and Selenium, I thought it would be fun to explore that same exercise using Node.js, Cheerio, and Puppeteer.

Getting Started

So let’s say you want to track some relevant statistics about Covid-19 cases in your home country. One comprehensive source of coronavirus data is worldometers.info, which has a nifty table containing every country:

Source: https://www.worldometers.info/coronavirus/

Being the developer that you are, you want to grab the data and use it for some cool project you’re working on. Unfortunately, according to their FAQ, they do not provide an API. Alas, this is where some web scraping basics come to the rescue!

Grab a web page using Puppeteer

Puppeteer is a Node.js library that allows you to simulate just about everything that you can do manually in a browser — just by using code! While you would normally fetch a website’s HTML page and display it by typing a URL into a browser’s address bar, we will be doing that by using a few lines of JavaScript instead.

After adding the puppeteer library, we:

  1. Create a browser (Puppeteer uses Chromium by default).
  2. Create a page for that browser (a new page is like an empty tab).
  3. On that new page, navigate to any web page we want.
  4. Fetch the page’s HTML using the content() method.
  5. Print the HTML. Don’t worry, we’ll be doing more interesting stuff in a moment.
  6. Don’t forget to close() your browser when you’re finished; otherwise the Node process will hang, waiting on the open browser.

Puppeteer leans on the async/await syntax because loading a web page usually takes a few seconds, and we typically don’t want that to block us from doing other work. Note, however, that because we await each step in order, this example behaves as though everything ran synchronously.

Parse HTML using Cheerio

Now that we have the HTML file, we need to be able to extract the specific row of data we are interested in. Cheerio is a Node.js library designed to parse markup on the server. It implements a subset of core jQuery, so the syntax will be familiar to many.

Let’s say you’re interested in Taiwan. If you’re particularly skilled at reading markup, you can output the HTML to a file, then try finding the data you’re interested in by simply using Ctrl+F for “Taiwan.” For the rest of us, you can use Chrome DevTools to locate the source of the data row:

Predictably, the data you’re interested in is found in a <table> and you will need to locate a particular <tr> in this table.

  1. We narrow the HTML page down to just the table, using its id selector, #main_table_countries_today.
  2. You may be tempted to just grab the Taiwan row by its row id, but the row order is not stable. By default, the table is sorted by total cases, which may change from day to day. Instead, we specifically look for the word “Taiwan,” so we first select every <a> tag in the table.
  3. Then we use filter() to find the one whose text matches “Taiwan.” This yields the <a> element for the Taiwan link.
  4. We want all the cells relating to Taiwan, which live in the row enclosing that link. We use closest("tr") to walk up to the entire row.
  5. Lastly, we iterate through every <td> in that row, and grab each of the values.

Look at the numbers!

After a little data massaging, the end result looks like:

As of June 23rd, 2020

Awesome! Now we can start using that sweet, sweet data for our own lucrative purposes. But let’s take a step back for a moment. Worldometer is a data aggregation site, so they must be getting their data from somewhere. The closer we can get to the source of truth, the more reliable our data can be. We won’t be flying to Taiwan, taking surveys and administering tests anytime soon. So let’s take this one step at a time. Sure enough, if you click into Taiwan, you’ll see that Worldometer lists Taiwan’s CDC website as their source. And sure enough, here is what’s on display:

Source (As of June 22, 2020): https://sites.google.com/cdc.gov.tw/2019-ncov/taiwan

If we compare that to the statistics we just scraped, the total confirmed, recovered, and deaths are exactly the same. But the total number tested is higher on the CDC website, presumably because it is more up-to-date. It also includes some other fun numbers, like the number excluded (which I have determined to mean the number of tests that came back negative), as well as the day-over-day (“new from yesterday”) test numbers.

Can we do better?

It’s a good thing we just learned how to web scrape; we can apply what we just learned to the CDC website. But if you take a look at the full HTML that Puppeteer outputs, you will see this for the Tested section:

The div is empty! Where’s the number? After some digging, you can see that it’s populated dynamically using some JavaScript:

It looks like that data is actually populated by a request to https://covid19dashboard.cdc.gov.tw/dash3. If we paste that into our web browser (a real one, though you could do it with Puppeteer if you’re feeling adventurous), you’ll see this:

Now, I bet you weren’t expecting to need to learn Mandarin for this tutorial. I suppose this is what we get for choosing Taiwan.

Looking past the unfamiliar characters, we can see that the numbers match what we see on the CDC website! So using your favorite request sending library, and a healthy dose of copy-and-paste on those Chinese characters, let’s call that API and get the numbers using code:

Final Results

Hey! I thought this was a tutorial about web scraping. Isn’t using an API cheating?

Yes, if we had led with the API call, this would be a pretty lousy tutorial on Node.js web scraping. But there is a bigger lesson about web scraping in general. Developers are lazy; why would we spend time scraping data that we can easily retrieve from an API call? Moreover, the proper API call is closer to the source of truth and is therefore more likely to be up-to-date. Web scraping is useful when the data is not nicely packaged for you in a single API call, which happens frequently. But when an API is available, it will save you lots of time and headache.

But enough lecturing, here’s the final numbers from both Worldometer and Taiwan’s CDC:

As of June 22, 2020

Future Improvements

Ha! We’ve got the most accurate testing numbers in Taiwan without the hassle of creating a web scraper. Mission accomplished!

What we’ve got is probably sufficient. It’s simple and gets us pretty accurate numbers from the front page of the Taiwan CDC, surely a reliable source. But what if we wanted to flex our dev brains just a bit longer? The front page also links to the most recent press releases. Most are related to the coronavirus (some are general health updates). These releases can sometimes provide even more up-to-the-minute information about the ongoing situation.

From a dev perspective, this is clearly more complex. We would need to visit the front page of the CDC, navigate to the most recent press releases, determine which are updates about the coronavirus, parse through the freeform text to get the data we want, and compare it with the info we are already receiving from the API. It’s definitely very doable using some more features provided by Puppeteer and some basic natural language processing.
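As a rough sketch of the first steps, we could collect the front page’s links with Puppeteer and filter them by coronavirus-related keywords (the URL, the bare 'a' selector, and the keywords are all assumptions that would need checking against the real markup in DevTools):

```javascript
// Hypothetical helper: decide whether a press-release title looks
// coronavirus-related. The keywords are assumptions based on 2020-era
// titles ("武漢肺炎" and "COVID" were common markers).
function isCovidRelease(title) {
  return /COVID|coronavirus|武漢肺炎|嚴重特殊傳染性肺炎/i.test(title);
}

// With a Puppeteer page, collect every link on the CDC front page and
// keep the ones that pass the filter.
async function covidPressReleases(page) {
  await page.goto('https://www.cdc.gov.tw/');
  const links = await page.$$eval('a', (anchors) =>
    anchors.map((a) => ({ title: a.textContent.trim(), href: a.href }))
  );
  return links.filter((l) => isCovidRelease(l.title));
}
```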

We could build a more sophisticated web scraper to do this. But maybe we’re being too short-sighted. What if the CDC gets its data from third-party testing agencies and various hospital databases? Maybe we can hire a translator, make some phone calls, and get to the bottom of where this data really comes from.

But that doesn’t seem worth our time right now. Let’s leave that for another day.

Final source code can be found here.
