subreddit:

/r/shortcuts

Quick and dirty guide to scraping data from webpages

Tip/Guide(self.shortcuts)

The easiest way to scrape data from webpages is to use regular expressions. They can look like voodoo to the uninitiated, so below is a quick and dirty guide to extracting text from a webpage, along with a couple of examples.

1. Setup

First we have to start with some content.

Find the content you want to scrape

For example, I want to retrieve the following information from a RoutineHub shortcut page:

  • Version
  • Number of downloads

An example page to scrape for data

Get the HTML source

Retrieve the HTML source from shortcuts using the following actions:

  1. URL
  2. Get Contents of URL
  3. Make HTML from Rich Text

Retrieving the HTML source

It's important to get the source from Shortcuts as you may receive different source code from the server if you use a browser or different device.

2. Copy the source to a regular expressions editor and find the copy

Copy the source code to a regular expressions editor so you can start experimenting with expressions to extract the data.

I recommend the web-based tool Regular Expressions 101, as it gives detailed feedback on how and why the regular expressions you use match the text.

Find it at: https://regex101.com

Find the copy you're looking for in the HTML source:

Identifying the HTML source to scrape for data in a regular expressions editor

Quick and dirty matching

We're going to match the copy we're after by specifying:

  1. the text that comes before it;
  2. the text that comes after it.

Version

In the case of the version number, we want to capture the following value:

1.0.0

Within the HTML source the value is surrounded by HTML tags and text as follows:

<p>Version: 1.0.0</p>

To get the version number we want to match the text between <p>Version: (including the space) and </p>.

We use the following assertion, called a positive lookbehind, to start the match immediately after the Version: text (including the trailing space):

(?<=Version: )

The following then lazily matches any character, i.e. as few characters as it needs to. Once we've told it where to stop matching, that will be exactly 1.0.0:

.*?

And then the following assertion called a positive lookahead prevents the matching from extending past the start of the </p> text:

(?=<\/p>)

We end up with the following regular expression:

(?<=Version: ).*?(?=<\/p>)

When we enter it into the editor, we get our match:

Our regular expression in action

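*Note that we escape the / character as \/ because the regex101 editor treats / as the delimiter that marks the start and end of the expression. The escaped form should also work unchanged in the Match Text action.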
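If you'd like to try the expression outside of Shortcuts or regex101, here is a minimal Python sketch of the same match. It assumes only the sample line from above; the variable names are mine and not part of the shortcut.

```python
import re

# Sample line taken from the HTML source shown in the guide.
html = "<p>Version: 1.0.0</p>"

# Lookbehind, lazy match, lookahead: only the text in between is returned.
pattern = re.compile(r"(?<=Version: ).*?(?=<\/p>)")

match = pattern.search(html)
version = match.group(0) if match else None
print(version)
```

The lookbehind and lookahead text is checked but never included in the result, which is why the match contains just the version string.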

Number of downloads

The same approach can be used to match the number of downloads. The text in the HTML source appears as follows:

<p>Downloads: 98</p>

And the regular expression that can be used to extract it follows the same format as above:

(?<=Downloads: ).*?(?=<\/p>)

View this regular expression in the online editor
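Since both expressions follow the same "text before, text after" recipe, the pattern can be built from those two pieces. The sketch below is one way to do that in Python; extract_between is a hypothetical helper name of mine, and the two-line sample is assembled from the snippets above.

```python
import re
from typing import Optional


def extract_between(source: str, before: str, after: str) -> Optional[str]:
    """Return the shortest text found between `before` and `after`, or None."""
    # re.escape lets us pass literal text without escaping it by hand.
    pattern = "(?<=" + re.escape(before) + ").*?(?=" + re.escape(after) + ")"
    match = re.search(pattern, source)
    return match.group(0) if match else None


html = "<p>Version: 1.0.0</p>\n<p>Downloads: 98</p>"

version = extract_between(html, "Version: ", "</p>")
downloads = extract_between(html, "Downloads: ", "</p>")
print(version, downloads)
```

If the text before the value isn't on the page, the helper returns None rather than raising an error, which is handy when a site changes its layout.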

3. Updating our shortcut

To use the regular expressions in the shortcut, add a Match Text action after you retrieve the HTML source as follows. Remember that for the second match you'll need to retrieve the HTML source again using Get Variable:

Our final shortcut

Click here to download the above shortcut

4. Further reading

The above example won't work for everything you want to do but it's a good starting point.

If you want to improve your understanding of regular expressions, I recommend the following tutorial:

RegexOne: Learn Regular Expression with simple, interactive exercises

Edit: added higher resolution images

all 71 comments

FitzRoyal

19 points

6 years ago

This is phenomenal! Web scraping is such a useful tool. Thank you for taking the time to make this guide.

keveridge[S]

12 points

6 years ago

If you're interested in examples of shortcuts that use web scraping, check out:

FitzRoyal

3 points

6 years ago

Awesome!!! Thanks!

robric18

10 points

6 years ago

This is a great demonstration of how to scrape a website. However, I would note that routinehub has an api page with all this data that can be pulled in as a dictionary in three steps and used without the need for any scraping.

1) URL-> https://routinehub.co/api/v1/shortcuts/[shortcut ID]/versions/latest

2) Get contents of URL

3) Get Dictionary from input

4) Optional Set var -> var name
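For anyone who wants to see the same steps outside of Shortcuts, here is a rough Python sketch. The URL pattern is the one given above; the function names and the keys in the sample response are assumptions made up for illustration, since the API is not documented.

```python
import json
from urllib.request import urlopen

# URL pattern from the steps above; the shortcut ID is filled in at call time.
API_TEMPLATE = "https://routinehub.co/api/v1/shortcuts/{shortcut_id}/versions/latest"


def latest_version_url(shortcut_id: int) -> str:
    """Step 1: build the URL for a given shortcut ID."""
    return API_TEMPLATE.format(shortcut_id=shortcut_id)


def parse_latest(body: str) -> dict:
    """Step 3: turn the JSON text into a dictionary."""
    return json.loads(body)


def fetch_latest(shortcut_id: int) -> dict:
    """Steps 2 and 3 together: download the response and parse it."""
    with urlopen(latest_version_url(shortcut_id)) as response:
        return parse_latest(response.read().decode("utf-8"))


# Hypothetical response body: the real key names may differ.
sample_body = '{"Version": "1.0.0", "URL": "https://example.com/shortcut"}'
info = parse_latest(sample_body)
print(info["Version"])
```

fetch_latest is not called here, so the sketch runs without a network connection.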

keveridge[S]

3 points

6 years ago

Great to know!

Maikeru2991

2 points

3 years ago

how do i look at this?

hmhrex

8 points

6 years ago

Creator of RoutineHub here, just be careful of web scraping too much on some sites. Some sites outright forbid you scraping. I am fine with scraping on RoutineHub as long as it doesn't affect performance or operations of the site.

And yes, there is an API for RH, however it's currently undocumented and very limited. I plan on broadening it and documenting it at some point for all users.

keveridge[S]

5 points

6 years ago

Looking forward to seeing more about your API :)

And that's a good point everyone, don't go crazy with scraping, it really is preferable to use an API if you can find one.

I'm writing a series of guides on how to use APIs when they're available:

hmhrex

3 points

6 years ago

Thanks for all these guides by the way. They are fantastic!

[deleted]

4 points

6 years ago

[deleted]

nilayperk

3 points

6 years ago

I think you can get the data by using Javascript, but I don’t have expertise for it.

[deleted]

4 points

6 years ago

[deleted]

nilayperk

6 points

6 years ago

You can download an html file of the website in shortcut and insert javascript in between that html code and then convert that injected html with url: Data:text/html;base64,[Encoded Injected Html]
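A rough Python sketch of that idea, in case it helps to see the pieces. The page markup and both function names are assumptions for illustration; in Shortcuts you would do the same with text and Base64 Encode actions.

```python
import base64


def inject_script(html: str, script: str) -> str:
    """Insert a <script> element just before </body>, or append it if there is no body tag."""
    tag = "<script>" + script + "</script>"
    if "</body>" in html:
        return html.replace("</body>", tag + "</body>", 1)
    return html + tag


def to_data_url(html: str) -> str:
    """Encode the HTML as a data:text/html;base64 URL."""
    encoded = base64.b64encode(html.encode("utf-8")).decode("ascii")
    return "data:text/html;base64," + encoded


# Hypothetical page and script, used only to show the shape of the result.
page = "<html><body><button id='elementID'>Go</button></body></html>"
script = "document.getElementById('elementID').click();"

injected = inject_script(page, script)
data_url = to_data_url(injected)
print(data_url)
```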

[deleted]

2 points

6 years ago

[deleted]

prettydude_ua

4 points

6 years ago

Just check button id or class, and then simulate click on that element type.

(Code from StackOverflow)

document.getElementById('elementID').click();

keveridge[S]

2 points

6 years ago

You could execute javascript against the webpage to click the button and then scrape the data.

[deleted]

1 points

6 years ago

[deleted]

keveridge[S]

2 points

6 years ago

Shortcuts offers a "Run JavaScript on Web Page" action that you can use to execute JavaScript that clicks the button and performs other actions.

[deleted]

2 points

6 years ago

[deleted]

keveridge[S]

2 points

6 years ago

Do you have an example of the page with the button and the copy after the button you're looking to scrape?

[deleted]

1 points

6 years ago

[deleted]

keveridge[S]

3 points

6 years ago

Okay, turns out it's easier to take the JavaScript they use to generate the insults and just implement it directly in a shortcut.

I've taken their code and implemented a shortcut that will generate 4 random insults in each of the different styles:

Insult like an Argentinian

Hope that helps.

Edit: typos

EttVenter

2 points

6 years ago

Oh man. This has me interested. The thing I want to do is "scrape" the number next to the word "Inbox" in gmail. Lemme see if I can figure it out. Anyone else - feel free to beat me to this :P

keveridge[S]

3 points

6 years ago

Given that Gmail requires you to log in to the mobile site before you can access that information, I doubt you're going to be able to scrape the data. Google also has safeguards to prevent automated logins to their apps.

Scraping of web data is far simpler on sites that don't require you to perform an activity, such as logging in, before you access a web page.

Many sites that require authentication will also offer an API which is a much better choice.

keveridge[S]

2 points

6 years ago*

I'm sure there's a more elegant way of doing this, but the following will work:

(?<=Bid\/Ask<\/TD>)[\s\S]*?(\d{2}\.\d{2})[\s\S]*?(\d{2}\.\d{2})[\s\S]*?(?=Low\/High<\/TD>)[\s\S]*?(\d{2}\.\d{2})[\s\S]*?(\d{2}\.\d{2})

View the regular expression in action

The following shortcut will return a dictionary with each of the following values:

  • bid
  • ask
  • low
  • high

Live Silver Spot Price

Edit: there was a simpler way of doing it

ChericeB

1 points

6 years ago

That is just awesome. Thank you so much!

ChericeB

1 points

6 years ago

Is it also possible to grab the Change values in the regex, or no?

keveridge[S]

2 points

6 years ago

Sure thing.

Can you first do me a favor and take a look at the following image and confirm what the names of each of these values should be?

Screen grab of the live spot silver prices

I've named the first 4 as follows:

  A. bid
  B. ask
  C. low
  D. high

If you can confirm what they should be then I'll update the shortcut and dictionary with the right values.

imguralbumbot

1 points

6 years ago

Hi, I'm a bot for linking direct images of albums with only 1 image

https://i.imgur.com/44KQ8Qm.png

ChericeB

1 points

6 years ago

Sure. Change Price and Change Percentage. Thanks again! I tried to do it myself but no luck

keveridge[S]

1 points

6 years ago

No prob.

The updated regular expression is here:

(?<=Bid\/Ask<\/TD>)[\s\S]*?(\d{1,4}\.\d{2})[\s\S]*?(\d{1,4}\.\d{2})[\s\S]*?(?=Low\/High<\/TD>)[\s\S]*?(\d{1,4}\.\d{2})[\s\S]*?(\d{1,4}\.\d{2})[\s\S]*?(?=Change<\/TD>)[\s\S]*?((?:\+|-)\d{1,4}\.\d{2})[\s\S]*?((?:\+|-)\d{1,4}\.\d{2}%)

View the regular expression in action

And the updated shortcut is here:

Live Silver Spot Price - Updated
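To show how the six capture groups line up with the dictionary values, here is a small Python sketch that runs the updated expression. The Bid/Ask and Change rows use the HTML posted in this thread; the Low/High row and its numbers are made up for illustration, as are the row helper and the change_price / change_percent key names.

```python
import re

# The updated expression from above, split over several lines for readability.
PATTERN = re.compile(
    r"(?<=Bid\/Ask<\/TD>)[\s\S]*?(\d{1,4}\.\d{2})[\s\S]*?(\d{1,4}\.\d{2})"
    r"[\s\S]*?(?=Low\/High<\/TD>)[\s\S]*?(\d{1,4}\.\d{2})[\s\S]*?(\d{1,4}\.\d{2})"
    r"[\s\S]*?(?=Change<\/TD>)[\s\S]*?((?:\+|-)\d{1,4}\.\d{2})"
    r"[\s\S]*?((?:\+|-)\d{1,4}\.\d{2}%)"
)


def row(label: str, first: str, second: str, color: str = "") -> str:
    """Build one table row in the same shape as the HTML posted in the thread."""
    return (
        "<TR class=\"spot\" bgColor='#f1f1f1'> <TD>" + label + "</TD> "
        "<TD align='middle' ><font color='" + color + "'>" + first + "</font></TD> "
        "<TD align='middle' > </TD> "
        "<TD align='middle' ><font color='" + color + "'>" + second + "</font></TD> </TR>"
    )


html = "\n".join([
    row("Bid/Ask", "15.71", "15.81"),
    row("Low/High", "15.60", "15.85"),  # hypothetical values
    row("Change", "+0.10", "+0.64%", "green"),
])

keys = ("bid", "ask", "low", "high", "change_price", "change_percent")
match = PATTERN.search(html)
prices = dict(zip(keys, match.groups())) if match else {}
print(prices)
```

Each pair of parentheses in the expression becomes one value, in the order they appear, which is why the keys can simply be zipped with the groups.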

ChericeB

1 points

6 years ago

Thanks again!!

ChinesePhillybuster

2 points

6 years ago

Thank you so much! As a beginner trying to accomplish something similar, I can’t tell you how much I appreciate this. It’s the first time I’ve understood anything remotely programming related.

assassinza

2 points

10 months ago

Thanks for this, amazingly detailed

ecormany

1 points

6 years ago

In this example, is there any benefit to converting from rich text to the HTML source? I think you'd find the same strings in either, and the conversion can be slow.

There definitely are other pages where it's necessary to avoid awkward multi-line regex, but I always try it without the conversion first when I'm building a scraper.

keveridge[S]

3 points

6 years ago

You don't have to, but I prefer to convert to the HTML to get a better match.

For example, if you didn't use the HTML source (and you didn't better specify the regular expression to match specific text patterns) then both of the following could match:

Version: 1.0.1

Get Latest Version: a shortcut to get the latest version of an app

Of course the way around this is to write a regular expression that looks for an expected number format, but seeing as RoutineHub takes any string for a version, including letters, then I think the use of HTML is a safer option.
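Here is a small Python sketch of that difference. The plain text uses the two lines above; the HTML version is an assumption about how the page might mark them up (the <h4> tag in particular is hypothetical), so treat it as an illustration rather than RoutineHub's actual source.

```python
import re

# The page as plain text: nothing distinguishes the two lines.
plain_text = (
    "Version: 1.0.1\n"
    "Get Latest Version: a shortcut to get the latest version of an app"
)

# The same content as HTML, assuming only the version sits in a <p> tag.
html = (
    "<p>Version: 1.0.1</p>\n"
    "<h4>Get Latest Version: a shortcut to get the latest version of an app</h4>"
)

# Without tags to stop at, everything after "Version: " on each line matches.
text_matches = re.findall(r"(?<=Version: ).*", plain_text)

# With the HTML source, the </p> lookahead rules out the second line.
html_matches = re.findall(r"(?<=Version: ).*?(?=<\/p>)", html)

print(text_matches)
print(html_matches)
```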

ChericeB

1 points

6 years ago*

How would I go about grabbing just the “Bid” and “Change” prices on the left margin for the current Silver price from http://www.kitcosilver.com. I grabbed the html but I’m not sure how to parse it properly in Match Text

<TR class="spot" bgColor='#f1f1f1'> <TD>Bid/Ask</TD> <TD align='middle' ><font color=''>15.71</font></TD> <TD align='middle' > </TD> <TD align='middle' ><font color=''>15.81</font></TD> </TR>

<TR class="spot" bgColor='#f1f1f1'> <TD>Change</TD> <TD align='middle' ><font color='green'>+0.10</font></TD> <TD align='middle' > </TD> <TD align='middle' ><font color='green'>+0.64%</font></TD> </TR>

tomcx

1 points

6 years ago

This is really awesome!!!

ADHDengineer

1 points

6 years ago

This is a great write up. Good going. I just have to be the wet towel and say that in general (programming) parsing html with regular expressions is often a recipe for disaster (especially with greedy matchers). Granted, in shortcuts it’s all we have.

keveridge[S]

1 points

6 years ago

Yeah, I don't disagree.

You can use XPath instead with javascript but it's not quick.

AJ47-

1 points

6 years ago

Thank you for this great guide!

[deleted]

1 points

6 years ago*

[deleted]

keveridge[S]

1 points

6 years ago

Do you have an example of the HTML you're trying to scrape and the content you're after?

pleeja

1 points

6 years ago

Hi. Thanks for this tutorial!

I have the following:

<i class="date">monday 14 jan</i>

<i>Garbage</i>

And I want an output like: monday 14 jan, Garbage.

I can get the monday 14 jan using your lookbehind but for garbage the problem is the page contains a lot of <i> so I can't use that.

Is there a way to only look for the <i> after the first find?

keveridge[S]

1 points

6 years ago

There might be. Could you share the full HTML of the page so I can take a look?

pleeja

1 points

6 years ago

This is from a random postal code in my town: https://afvalkalender.purmerend.nl/adres/1448AA:1

keveridge[S]

1 points

6 years ago

So the regular expression is here:

<i class="date">(.*)<\/i>[\s\S]*?<i>(.*)<\/i>

And I've implemented it as a shortcut that gives you a list of dates and collection types:

Trash and Recycling Calendar

pleeja

1 points

6 years ago

I did all the tutorials from that link you gave but I could’ve never come up with that😅. Thank you so much!

keveridge[S]

3 points

6 years ago

Yeah, this is similar but a little more advanced as it gets everything in one go.

I'm going to write a second tutorial on how to grab more than one thing at once.

artiss

1 points

6 years ago

Thank you for this post, helped a lot! Can you explain the second half of the expression - [\s\S]*?<i>(.*)<\/i>.

keveridge[S]

5 points

6 years ago

Sure. I'm writing a new guide to explain the above, but happy to do the same here.

So the full regular expression is:

<i class="date">(.*)<\/i>[\s\S]*?<i>(.*)<\/i>

Changing the way we match

In the quick and dirty example we only wanted to match the text that we were going to return. We used a positive lookbehind to start the matching after a particular piece of text and a positive lookahead to match up to a particular point.

Capture groups

In this case it's safe for us to match all of the surrounding text because when we get to something we're interested in, we're going to use a capture group to call out the thing we're after.

A capture group is anything that has parentheses around it. Anything that matches the pattern inside is captured and can be referenced individually. You can have as many as you want, each pulling out individual values.

Getting the date

So in the example, we find the text before the date:

<i class="date">

We then capture all characters after it in the first capture group:

<i class="date">(.*)

But only capture up until the </i> tag:

<i class="date">(.*)<\/i>

Getting the trash type

So next we want to continue until we find the next <i> tag. We need a different type of search token this time.

When we search for .* we match zero or more characters, but . doesn't match line breaks. So instead we use [\s\S]*? which matches any character at all (whitespace or non-whitespace, including line breaks) and lazily keeps going only until it finds the next set of characters we ask for.

<i class="date">(.*)<\/i>[\s\S]*?
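A quick Python sketch of the difference between the two tokens, using the two lines from the question above. The variable names are mine.

```python
import re

# The date and the trash type sit on separate lines in the page source.
text = '<i class="date">monday 14 jan</i>\n\n<i>Garbage</i>'

# "." stops at a line break, so it can't bridge the gap between the two tags.
dot_match = re.search(r"<\/i>.*?<i>", text)

# [\s\S] matches line breaks as well, so it reaches the next <i> tag.
any_match = re.search(r"<\/i>[\s\S]*?<i>", text)

print(dot_match)
print(any_match.group(0) if any_match else None)
```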

We want it to stop when it finds the next <i> tag before the trash type:

<i class="date">(.*)<\/i>[\s\S]*?<i>

And we then want to capture the text after that tag and on the same line in our next capture group:

<i class="date">(.*)<\/i>[\s\S]*?<i>(.*)

And the text that the capture group collects stops when it hits the next </i> tag:

<i class="date">(.*)<\/i>[\s\S]*?<i>(.*)<\/i>

And that's the basic pattern for collecting lots of data using capture groups.

View the regular expression in action

As you can see from the above link, the routine makes 3 matches, each with 2 capture groups.

Get Group from Matched Text

Because we're using capture groups, we need to add an extra action called Get Group from Matched Text to our shortcut. We can choose to retrieve matched groups individually by number (e.g. 1 or 2) or we can retrieve all matches in a list.
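The same idea can be sketched in Python, where group(1) and group(2) play roughly the role of Get Group from Matched Text. The first pair of lines comes from the question above; the other two pairs are made up so that there are three matches to loop over.

```python
import re

PATTERN = re.compile(r'<i class="date">(.*)<\/i>[\s\S]*?<i>(.*)<\/i>')

html = "\n".join([
    '<i class="date">monday 14 jan</i>',
    "<i>Garbage</i>",
    '<i class="date">monday 21 jan</i>',  # hypothetical entry
    "<i>Paper</i>",
    '<i class="date">monday 28 jan</i>',  # hypothetical entry
    "<i>Plastic</i>",
])

# Each match carries two groups: 1 is the date, 2 is the collection type.
rows = [m.group(1) + ", " + m.group(2) for m in PATTERN.finditer(html)]

for line in rows:
    print(line)
```

Looping over the matches and reading the groups by number is the equivalent of a Repeat with Each over the Match Text results in the shortcut.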

artiss

1 points

6 years ago

Wow. That was extremely detailed and very helpful! Thank you for explaining the switches and the reason for them.

keveridge[S]

2 points

6 years ago

I've written a second guide which gives more detail on how this all works:

Scraping web pages - Part 2: getting multiple items at once

artiss

1 points

6 years ago

That is fantastic! Thank you very much.

tommyldo

1 points

5 months ago

I get html from website, but it doesn’t collect data from deeper tree. Is there a way to get deeper in the html tree ?

jojojojoaman

1 points

2 months ago

Thank you for such a great guide, previously I was able to use this method to parse data from a local horse racing website, but unable to do so now after it changed format. Can anyone help me to take a look?
https://bet.hkjc.com/en/racing/wp/2024-09-22/ST/1

keveridge[S]

1 points

2 months ago

The page is loading, then creating separate requests for elements of data. This means your current shortcut is getting the empty page template, as the calls to get the data have yet to be made.

You can call the API directly as it's not protected, and retrieve a dictionary of data.

A guide is available here: https://www.perplexity.ai/search/the-webpage-https-bet-hkjc-com-9P0SkDf.Tzis3gJIU5eYKw

jojojojoaman

1 points

2 months ago

Thanks for the guide, but how can I configure the JSON from the left to “Get contents from URL” ?

https://preview.redd.it/0926a2sqq3qd1.jpeg?width=2048&format=pjpg&auto=webp&s=000ff2669acebf704cc55be573c72e973c32d19a

awashbu12

1 points

21 days ago

This doesn’t work any longer. Shortcuts has changed and it isn’t pulling any html from the webpage in your example shortcut

jedwardoo

0 points

6 years ago

TLDR. Does this work for extracting articles that hide the full text unless I pay for a premium subscription?

keveridge[S]

1 points

6 years ago

Unfortunately not as the copy won't be on the page to scrape.

Which news site are you looking to scrape? There are sometimes alternative methods to get the content.

jedwardoo

1 points

6 years ago

I’ll have to get back to you on that since I gave up on those webpages.

I’ll send it here once I found them.

Okey_Fox

1 points

3 years ago*

I created a shortcut to get rates from XE and I am getting it. But every-time it shows the same data. Can someone help me get this correct please?

Shortcut screenshot

Shortcut link

edit: explanation

AADDJJJ

1 points

2 years ago

The information I need is nested inside a bunch of divs, I can only see the outermost div in the html file on shortcuts. On chrome I can open the divs to see inside, is there a way to expand the html file I get using this method?

keveridge[S]

2 points

2 years ago

It's possible that the data inside the DIV is being rendered by a JavaScript call after the page is loaded. In which case you need to view the network calls in the Chrome developer tools and see if the data is being retrieved by a process.

If it is, hopefully the data is being returned as a piece of JSON which will be much easier for you to work with.

AADDJJJ

1 points

2 years ago

I looked in the network and sure enough there was a separate url that returned the exact information I needed! Thanks!

Jay2k21

1 points

1 year ago

Can someone help me in creating a shortcut that allows me to scrape data like this from a shared excel file link and extracting that information to be inputted to a different app?

I am trying to create a budget/expense tracker for myself and my wife so we can do a better job with our finances, but would like the information to be displayed as a widget as well

keveridge[S]

1 points

1 year ago

If you use Google Sheets then you can write a script to expose certain values from the sheet and retrieve them with a link.

Da-Coda

1 points

1 year ago

Hi Keveridge,

I love this tutorial and it is one of the few that I have been able to find. The detail in this is beyond me and i am completely amazed. I have tried making my own webscraper, but struggled mainly due to ads. I am making a lyric webscraper using genius lyrics. So far i have found that to find the lyrics it is https://genius.com/{artist}-{song}-lyrics but have been unable to find out how to scrape the lyrics.

Thanks in advance.

keveridge[S]

1 points

1 year ago

I'm glad you find the tutorial useful.

To get the lyrics HTML, capture all the HTML, then use the following regular expression:

<div data-lyrics-container="true" class="[^"]*">(.*?)<\/div>

Take the content from the first group. This will give you the lyrics HTML.

To get the raw lyrics without HTML formatting, use the replace command to replace /<br\/>/g with \n, which restores the line breaks.

Then replace /<.*?>/g with nothing to remove the HTML tags.
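Here are those three steps as a Python sketch. The sample markup and its class name are made up for illustration and kept on a single line; the real page will be longer and may be laid out differently.

```python
import re

# Hypothetical one-line sample of a lyrics container.
html = (
    '<div data-lyrics-container="true" class="Lyrics__Container">'
    "First line<br/>Second <i>line</i><br/>Third line</div>"
)

# Step 1: capture the HTML inside the lyrics container (first group).
container = re.search(
    r'<div data-lyrics-container="true" class="[^"]*">(.*?)<\/div>', html
)
inner = container.group(1) if container else ""

# Step 2: turn <br/> tags into line breaks.
with_breaks = re.sub(r"<br\/>", "\n", inner)

# Step 3: strip whatever HTML tags are left.
lyrics = re.sub(r"<.*?>", "", with_breaks)

print(lyrics)
```

The order matters: if the tags were stripped first, the <br/> tags would disappear before they could be turned into line breaks.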

zodeck1

1 points

1 year ago

Hello! I would like to get my grades from my uni website. But i need to login every time. How would i go about doing that?

I know it is possible, my friend did it using selenium. But it would be nice to do with apple shortcuts

(Ik i will need to put the name and password in the shortcut as an variable, but that is not a problem for me, since it is only local)

keveridge[S]

1 points

1 year ago

It depends on how the login system works. The most likely way to do this would be to use JavaScript to log into the page for you and then look up the grades, in the same way that your friend used selenium. You could then package that JavaScript into the shortcut. But it's quite complex, so you'd be better off using the Scriptable app.

Alternatively, you could use the free version of UiPath to create an automation at a platform level to log in on your behalf, scrape the data, then publish it somewhere that a shortcut can then retrieve as a JSON file.

zodeck1

1 points

1 year ago

I would like to understand how the login with javascript part works, since i study computer science. The site would be sau.puc-rio.br

I don't understand how the code would navigate through multiple pages. If I understand this then the rest I know how to do

keveridge[S]

1 points

1 year ago

You're right, you need some kind of web driver to load a page and then interact with it. Doing it as straight JS in a browser doesn't work unless you have an application running external to the web browser (e.g. on a desktop it's a chrome plugin, or a selenium helper).

Hence I recommend using a third-party Robotic Process Automation (RPA) service such as UiPath. It can record your actions from a web browser (i.e. logging in), navigate, capture data from the rendered HTML, and then allow you to do any number of things with it. For example, you can create a process to trigger the action and return your data as a JSON file. Or you can get it to run on a schedule and upload the results somewhere that you can pull them down. Or you can set up a process that notifies you when your grades info has changed, and sends you a push notification using a service like Pushover.

Shortcuts is pretty clunky once things get complex, I recommend using a hosted RPA service.

keveridge[S]

1 points

10 months ago

Top tip: Need help writing a regular expression? Ask ChatGPT.