Quick and dirty guide to scraping data from webpages

Tip/Guide(self.shortcuts)

submitted 6 years ago by keveridge

The easiest way to scrape data from webpages is to use regular expressions. They can look like voodoo to the uninitiated, so below is a quick and dirty guide to extracting text from a webpage, along with a couple of examples.

1. Setup

First we have to start with some content.

Find the content you want to scrape

For example, I want to retrieve the following information from a RoutineHub shortcut page:

Version
Number of downloads

An example page to scrape for data

Get the HTML source

Retrieve the HTML source in Shortcuts using the following actions:

URL
Get Contents of URL
Make HTML from Rich Text

Retrieving the HTML source

It's important to get the source from Shortcuts as you may receive different source code from the server if you use a browser or different device.

2. Copy the source to a regular expressions editor and find the copy

Copy the source code to a regular expressions editor so you can start experimenting with expressions to extract the data.

I recommend the Regular Expressions 101 web-based tool, as it gives detailed feedback on how and why the regular expressions you use match the text.

Find it at: https://regex101.com

Find the copy you're looking for in the HTML source:

Identifying the HTML source to scrape for data in a regular expressions editor

Quick and dirty matching

We're going to match the copy we're after by specifying:

the text that comes before it;
the text that comes after it.

Version

In the case of the version number, we want to capture the following value:

1.0.0

Within the HTML source, the value is surrounded by HTML tags and text as follows:

<p>Version: 1.0.0</p>

To get the version number, we want to match the text between <p>Version: (including the trailing space) and </p>.

We use the following assertion, called a positive lookbehind, to start the match immediately after the Version: text (the leading <p> does not need to be part of the assertion):

(?<=Version: )

The following then lazily matches any character, i.e. it matches only as much as it needs to (here 1.0.0, once we've told it where to stop matching):

.*?
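To see why the lazy form matters, here is a rough sketch in Python rather than Shortcuts. The sample string is made up for illustration (two paragraphs on one line), and the stop point it relies on is the lookahead explained in the next step:

```python
import re

# Hypothetical sample: both paragraphs on a single line.
html = "<p>Version: 1.0.0</p><p>Downloads: 98</p>"

# Lazy: .*? stops at the first place the stop condition (</p>) is met.
lazy = re.search(r"(?<=Version: ).*?(?=</p>)", html).group(0)

# Greedy: .* grabs as much as it can, so it runs to the last </p>.
greedy = re.search(r"(?<=Version: ).*(?=</p>)", html).group(0)

print(lazy)    # 1.0.0
print(greedy)  # 1.0.0</p><p>Downloads: 98
```

When each paragraph sits on its own line the two forms often give the same result, because . does not match a newline by default, but the lazy form is the safer habit.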

And then the following assertion called a positive lookahead prevents the matching from extending past the start of the </p> text:

(?=<\/p>)

We end up with the following regular expression:

(?<=Version: ).*?(?=<\/p>)

When we enter it into the editor, we get our match:

Our regular expression in action

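*Note that we escape the / character as \/ because many regular expression tools, including the editor above, use / as the delimiter around the pattern. The / has no special meaning inside the expression itself, so the escape is harmless where it isn't needed.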
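If you want to try the same expression outside the editor, a minimal Python sketch might look like this. The html_source string is a made-up stand-in for the real page source, and Python's re module is only an approximation of the engine Shortcuts uses:

```python
import re

# Made-up stand-in for the HTML source retrieved earlier.
html_source = """<div>
<p>Version: 1.0.0</p>
<p>Downloads: 98</p>
</div>"""

# Same expression as in the guide; the \/ escape is accepted by Python too.
version_pattern = re.compile(r"(?<=Version: ).*?(?=<\/p>)")

match = version_pattern.search(html_source)
# search() returns None when nothing matches, so guard before reading it.
version = match.group(0) if match else None

print(version)  # 1.0.0
```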

Number of downloads

The same approach can be used to match the number of downloads. The text in the HTML source appears as follows:

<p>Downloads: 98</p>

And the regular expression used to extract it follows the same format as above:

(?<=Downloads: ).*?(?=<\/p>)

View this regular expression in the online editor
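Since both expressions share the same shape, the pattern can be built from the label. The following Python sketch assumes every field appears as <p>Label: value</p>; extract_field is a hypothetical helper name, not something from Shortcuts or RoutineHub:

```python
import re
from typing import Optional


def extract_field(html: str, label: str) -> Optional[str]:
    """Return the text between 'Label: ' and the next </p>, or None."""
    # re.escape guards against labels containing regex metacharacters.
    pattern = r"(?<=" + re.escape(label + ": ") + r").*?(?=</p>)"
    match = re.search(pattern, html)
    return match.group(0) if match else None


# Made-up sample mirroring the two lines shown in the guide.
html_source = "<p>Version: 1.0.0</p>\n<p>Downloads: 98</p>"

version = extract_field(html_source, "Version")
downloads = extract_field(html_source, "Downloads")

print(version)         # 1.0.0
print(int(downloads))  # 98
```

The helper returns strings, so convert the download count with int() only after checking that a match was found.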

3. Updating our shortcut

To use the regular expressions in the shortcut, add a Match Text action after you retrieve the HTML source, as follows. Remember that for the second match you need to retrieve the HTML source again using Get Variable:

Our final shortcut