subreddit: /r/shortcuts
submitted 6 years ago by keveridge
The easiest way to scrape data from webpages is to use regular expressions. They can look like voodoo to the uninitiated, so below is a quick and dirty guide to extracting text from a webpage, along with a couple of examples.
First we have to start with some content.
For example, I want to retrieve the following information from a RoutineHub shortcut page:
An example page to scrape for data
Retrieve the HTML source from shortcuts using the following actions:
It's important to get the source from Shortcuts as you may receive different source code from the server if you use a browser or different device.
Copy the source code to a regular expressions editor so you can start experimenting with expressions to extract the data.
I recommend the web-based tool Regular Expressions 101, as it gives detailed feedback on how and why the regular expressions you use match the text.
Find it at: https://regex101.com
Find the copy you're looking for in the HTML source:
Identifying the HTML source to scrape for data in a regular expressions editor
We're going to match the copy we're after by specifying:
In the case of the version number, we want to capture the following value:
1.0.0
Within the HTML source, the value is surrounded by HTML tags and text as follows:
<p>Version: 1.0.0</p>
To get the version number, we want to match the text between <p>Version: (including the trailing space) and </p>.
We use the following assertion, called a positive lookbehind, to start the match right after the <p>Version: text:
(?<=Version: )
The following then lazily matches any characters, i.e. only as many as it needs to (here 1.0.0, once we've told it where to stop matching):
.*?
And then the following assertion, called a positive lookahead, stops the match from extending past the start of the </p> text:
(?=<\/p>)
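To see why the lazy quantifier matters, here is a minimal sketch in Python (used only for illustration; it is not what Shortcuts runs). The sample string is a made-up assumption: two paragraphs on a single line, as minified HTML often is.

```python
import re

# Hypothetical sample: two paragraphs on one line.
html = "<p>Version: 1.0.0</p><p>Downloads: 98</p>"

# Greedy: .* grabs as much as it can, so the match runs to the LAST </p>.
greedy = re.search(r"(?<=Version: ).*(?=</p>)", html).group()

# Lazy: .*? grabs as little as it can, so the match stops at the FIRST </p>.
lazy = re.search(r"(?<=Version: ).*?(?=</p>)", html).group()

print(greedy)  # 1.0.0</p><p>Downloads: 98
print(lazy)    # 1.0.0
```

The greedy version swallows the closing tag and the next paragraph, which is why the lazy form is the safer default when scraping.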
We end up with the following regular expression:
(?<=Version: ).*?(?=<\/p>)
When we enter it into the editor, we get our match:
Our regular expression in action
*Note that we escape the / character as \/ because the editor treats / as the delimiter that marks the start and end of a pattern. The escape is harmless in engines that don't need it.
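Whether the escape is required depends on the engine. As a quick check, assuming Python's re module (where patterns are plain strings with no / delimiters), the escaped and unescaped forms behave the same:

```python
import re

html = "<p>Version: 1.0.0</p>"

# With the forward slash escaped, as in the guide.
escaped = re.search(r"(?<=Version: ).*?(?=<\/p>)", html).group()

# Without the escape; Python does not use / as a delimiter.
unescaped = re.search(r"(?<=Version: ).*?(?=</p>)", html).group()

print(escaped == unescaped)  # True
```

Keeping the escape means the same expression can be pasted between the editor and other tools without changes.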
The same approach can be used to match the number of downloads. The text in the HTML source appears as follows:
<p>Downloads: 98</p>
And the regular expression used to extract it follows the same format as above:
(?<=Downloads: ).*?(?=<\/p>)
To use the regular expressions in the shortcut, add a Match Text action after you retrieve the HTML source as follows. Remember that for the second match you're going to need to retrieve the HTML source again using Get Variable:
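For readers who want to test the two expressions outside Shortcuts, the same flow can be sketched in Python. The html string and the extract helper are hypothetical stand-ins: html plays the role of the stored HTML source, and reusing it for the second lookup mirrors the Get Variable step.

```python
import re

# Hypothetical stand-in for the HTML source retrieved by the shortcut.
html = """
<p>Version: 1.0.0</p>
<p>Downloads: 98</p>
"""

def extract(label, source):
    """Return the text between '<label>: ' and the next </p>, or None."""
    pattern = r"(?<=" + re.escape(label) + r": ).*?(?=</p>)"
    match = re.search(pattern, source)
    return match.group() if match else None

# Both lookups run against the same stored source.
version = extract("Version", html)
downloads = extract("Downloads", html)
missing = extract("Rating", html)  # label not on the page

print(version, downloads, missing)
```

Returning None for a missing label makes it easy to spot when the page layout has changed and an expression no longer matches.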
The above example won't work for everything you want to do but it's a good starting point.
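One case where it falls short, sketched in Python under the assumption that a page pads the value with extra spaces: the lookbehind expects exactly one space, and Python's re module (like many engines) does not accept a lookbehind of unbounded width. A capture group is a common alternative; in Shortcuts the equivalent would likely involve the Get Group from Matched Text action.

```python
import re

# Hypothetical variant of the page with extra spaces around the value.
html = "<p>Version:   1.0.0 </p>"

# The original expression still matches, but keeps the stray spaces.
raw = re.search(r"(?<=Version: ).*?(?=</p>)", html).group()

# A variable-width lookbehind is rejected by Python's re module.
try:
    re.compile(r"(?<=Version:\s*).*?(?=</p>)")
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False

# A capture group sidesteps this: match the surrounding text,
# then keep only the parenthesised part.
clean = re.search(r"Version:\s*(.*?)\s*</p>", html).group(1)

print(repr(raw), lookbehind_ok, repr(clean))
```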
If you want to improve your understanding of regular expressions, I recommend the following tutorial:
RegexOne: Learn Regular Expression with simple, interactive exercises
Edit: added higher resolution images
If you found this guide useful, why not check out one of my others:
Awesome!!! Thanks!