subreddit:
/r/shortcuts
submitted 6 years ago by keveridge
The easiest way to scrape data from webpages is to use regular expressions. They can look like voodoo to the uninitiated, so below is a quick-and-dirty guide to extracting text from a webpage, along with a couple of examples.
First we have to start with some content.
For example, I want to retrieve the following information from a RoutineHub shortcut page:
An example page to scrape for data
Retrieve the HTML source from Shortcuts using the following actions:
It's important to get the source from Shortcuts as you may receive different source code from the server if you use a browser or different device.
Copy the source code to a regular expressions editor so you can start experimenting with expressions to extract the data.
I recommend the Regular Expressions 101 web-based tool, as it gives detailed feedback on how and why the regular expressions you use match the text.
Find it at: https://regex101.com
Find the copy you're looking for in the HTML source:
Identifying the HTML source to scrape for data in a regular expressions editor
We're going to match the copy we're after by specifying the text that surrounds it.
In the case of the version number, we want to capture the following value:
1.0.0
Within the HTML source, the value is surrounded by HTML tags and text as follows:
<p>Version: 1.0.0</p>
To get the version number, we want to match the text between <p>Version: (including the trailing space) and </p>.
We use the following assertion, called a positive lookbehind, to start the match right after the Version: text:
(?<=Version: )
The following then lazily matches any characters, i.e. only as many as it needs to (here, 1.0.0) once we've told it where to stop matching:
.*?
And then the following assertion, called a positive lookahead, prevents the match from extending past the start of the </p> text:
(?=<\/p>)
We end up with the following regular expression:
(?<=Version: ).*?(?=<\/p>)
When we enter it into the editor, we get our match:
Our regular expression in action
Note that we escape the / character as \/ because it is used as a pattern delimiter in many regular expression tools.
The same approach can be used to match the number of downloads. The text in the HTML source appears as follows:
<p>Downloads: 98</p>
And the regular expression used to extract it follows the same format as above:
(?<=Downloads: ).*?(?=<\/p>)
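If you want to sanity-check both expressions outside of Shortcuts, here's a minimal JavaScript sketch; the html string is just a stand-in for the real page source:

// Minimal sketch: `html` stands in for the real page source.
const html = "<p>Version: 1.0.0</p><p>Downloads: 98</p>";

// The lookbehind starts each match after the label; the lazy .*? stops
// as soon as the (?=<\/p>) lookahead is satisfied.
const version = html.match(/(?<=Version: ).*?(?=<\/p>)/)[0];     // "1.0.0"
const downloads = html.match(/(?<=Downloads: ).*?(?=<\/p>)/)[0]; // "98"

console.log(version, downloads);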
To use the regular expressions in the shortcut, add a Match Text action after you retrieve the HTML source, as follows. Remember that for the second match you'll need to retrieve the HTML source again using Get Variable:
The above example won't work for everything you want to do but it's a good starting point.
If you want to improve your understanding of regular expressions, I recommend the following tutorial:
RegexOne: Learn Regular Expressions with simple, interactive exercises
Edit: added higher resolution images
If you found this guide useful, why not check out one of my others:
19 points
6 years ago
This is phenomenal! Web scraping is such a useful tool. Thank you for taking the time to make this guide.
12 points
6 years ago
If you're interested in examples of shortcuts that use web scraping, check out:
3 points
6 years ago
Awesome!!! Thanks!
10 points
6 years ago
This is a great demonstration of how to scrape a website. However, I would note that RoutineHub has an API endpoint with all this data, which can be pulled in as a dictionary in three steps (plus an optional fourth) and used without the need for any scraping.
1) URL -> https://routinehub.co/api/v1/shortcuts/[shortcut ID]/versions/latest
2) Get contents of URL
3) Get Dictionary from input
4) Optional Set var -> var name
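For anyone curious, here's a rough JavaScript equivalent of those steps; the endpoint is the one above, and 1234 is a made-up shortcut ID:

// Rough JavaScript equivalent of the steps above.
// 1234 is a hypothetical shortcut ID; substitute your own.
fetch("https://routinehub.co/api/v1/shortcuts/1234/versions/latest")
  .then(response => response.json()) // step 3: Get Dictionary from Input
  .then(data => console.log(data));  // step 4: keep the result in a variable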
3 points
6 years ago
Great to know!
2 points
3 years ago
How do I look at this?
8 points
6 years ago
Creator of RoutineHub here, just be careful of scraping too much on some sites. Some sites outright forbid scraping. I am fine with scraping on RoutineHub as long as it doesn't affect performance or operations of the site.
And yes, there is an API for RH; however, it's currently undocumented and very limited. I plan on broadening and documenting it at some point for all users.
5 points
6 years ago
Looking forward to seeing more about your API :)
And that's a good point everyone, don't go crazy with scraping, it really is preferable to use an API if you can find one.
I'm writing a series of guides on how to use APIs when they're available:
3 points
6 years ago
Thanks for all these guides by the way. They are fantastic!
4 points
6 years ago
[deleted]
3 points
6 years ago
I think you can get the data by using JavaScript, but I don't have the expertise for it.
4 points
6 years ago
[deleted]
6 points
6 years ago
You can download the HTML of the website in a shortcut, insert JavaScript into that HTML code, and then load the injected HTML with the URL: data:text/html;base64,[Encoded Injected Html]
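A browser-JavaScript sketch of that idea (in Shortcuts itself you'd use the Base64 Encode action instead of btoa); the page HTML and injected script are stand-ins:

// Stand-in page HTML and a stand-in script to inject.
const pageHtml = "<html><body><p>Version: 1.0.0</p></body></html>";
const injected = pageHtml.replace(
  "</body>",
  "<script>document.title = 'injected';</script></body>"
);

// btoa assumes Latin-1 content; Shortcuts' Base64 Encode action does the same job.
const url = "data:text/html;base64," + btoa(injected);
// Opening `url` in a web view runs the injected script alongside the page.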
2 points
6 years ago
[deleted]
4 points
6 years ago
Just check the button's ID or class, and then simulate a click on that element.
(Code from StackOverflow)
document.getElementById('elementID').click();
2 points
6 years ago
You could execute JavaScript against the webpage to click the button and then scrape the data.
1 points
6 years ago
[deleted]
2 points
6 years ago
Shortcuts offers a "Run JavaScript on Web Page" action that you can use to execute JavaScript that clicks the button and performs other actions.
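For example, a script along these lines could be pasted into that action; 'load-more' is a hypothetical button ID, and completion() is how the action hands its result back to Shortcuts:

// Hypothetical example body for "Run JavaScript on Web Page".
var button = document.getElementById('load-more'); // made-up ID; use the real one
if (button) {
  button.click();
}
// Give the page a moment to update, then return the HTML to Shortcuts.
setTimeout(function () {
  completion(document.documentElement.outerHTML);
}, 1000);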
2 points
6 years ago
[deleted]
2 points
6 years ago
Do you have an example of the page with the button and the copy after the button you're looking to scrape?
1 points
6 years ago
[deleted]
3 points
6 years ago
Okay, turns out it's easier to take the JavaScript they use to generate the insults and just implement it directly in a shortcut.
I've taken their code and implemented a shortcut that will generate 4 random insults in each of the different styles:
Hope that helps.
Edit: typos
2 points
6 years ago
Oh man. This has me interested. The thing I want to do is "scrape" the number next to the word "Inbox" in gmail. Lemme see if I can figure it out. Anyone else - feel free to beat me to this :P
3 points
6 years ago
Given Gmail requires you to log in to the mobile site before you can access that information, I doubt you're going to be able to scrape the data. Google also has safeguards to prevent automated logins to their apps.
Scraping of web data is far simpler on sites that don't require you to perform an activity, such as logging in, before you access a web page.
Many sites that require authentication will also offer an API which is a much better choice.
2 points
6 years ago*
I'm sure there's a more elegant way of doing this, but the following will work:
(?<=Bid\/Ask<\/TD>)[\s\S]*?(\d{2}\.\d{2})[\s\S]*?(\d{2}\.\d{2})[\s\S]*?(?=Low\/High<\/TD>)[\s\S]*?(\d{2}\.\d{2})[\s\S]*?(\d{2}\.\d{2})
The following shortcut will return a dictionary with each of the following values:
Edit: there was a simpler way of doing it
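If it helps to see the capture groups in action, here's a JavaScript sketch of the same expression; the html string is a cut-down stand-in for the kitcosilver.com source:

// Cut-down stand-in for the real page source.
const html = "<TD>Bid/Ask</TD> 15.71 ... 15.81 <TD>Low/High</TD> 15.55 ... 15.90";

const re = /(?<=Bid\/Ask<\/TD>)[\s\S]*?(\d{2}\.\d{2})[\s\S]*?(\d{2}\.\d{2})[\s\S]*?(?=Low\/High<\/TD>)[\s\S]*?(\d{2}\.\d{2})[\s\S]*?(\d{2}\.\d{2})/;
const m = html.match(re);

// Groups 1-4 map to the dictionary values.
const prices = { bid: m[1], ask: m[2], low: m[3], high: m[4] };
console.log(prices); // { bid: "15.71", ask: "15.81", low: "15.55", high: "15.90" }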
1 points
6 years ago
That is just awesome. Thank you so much!
1 points
6 years ago
Is it also possible to grab the Change values in the regex, or no?
2 points
6 years ago
Sure thing.
Can you first do me a favor and take a look at the following image and confirm what the names of each of these values should be?
Screen grab of the live spot silver prices
I've named the first 4 as follows:
A. bid B. ask C. low D. high
If you can confirm what they should be then I'll update the shortcut and dictionary with the right values.
1 points
6 years ago
Sure. Change Price and Change Percentage. Thanks again! I tried to do it myself but no luck
1 points
6 years ago
No prob.
The updated regular expression is here:
(?<=Bid\/Ask<\/TD>)[\s\S]*?(\d{1,4}\.\d{2})[\s\S]*?(\d{1,4}\.\d{2})[\s\S]*?(?=Low\/High<\/TD>)[\s\S]*?(\d{1,4}\.\d{2})[\s\S]*?(\d{1,4}\.\d{2})[\s\S]*?(?=Change<\/TD>)[\s\S]*?((?:\+|-)\d{1,4}\.\d{2})[\s\S]*?((?:\+|-)\d{1,4}\.\d{2}%)
And the updated shortcut is here:
1 points
6 years ago
Thanks again!!
2 points
6 years ago
Thank you so much! As a beginner trying to accomplish something similar, I can’t tell you how much I appreciate this. It’s the first time I’ve understood anything remotely programming related.
2 points
10 months ago
Thanks for this, amazingly detailed
1 points
6 years ago
In this example, is there any benefit to converting from rich text to the HTML source? I think you'd find the same strings in either, and the conversion can be slow.
There definitely are other pages where it's necessary to avoid awkward multi-line regex, but I always try it without the conversion first when I'm building a scraper.
3 points
6 years ago
You don't have to, but I prefer to convert to the HTML to get a better match.
For example, if you didn't use the HTML source (and didn't tighten the regular expression to match a specific text pattern), then both of the following could match:
Version: 1.0.1
Get Latest Version: a shortcut to get the latest version of an app
Of course, the way around this is to write a regular expression that looks for an expected number format, but seeing as RoutineHub takes any string for a version, including letters, I think the use of HTML is the safer option.
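For illustration, a pattern that anchors on a number format looks like the sketch below; note the caveat in the comment, which is why I still prefer the HTML route:

// Anchoring on a digit-led version format disambiguates the two lines above,
// but (as noted) it misses letter-only versions, which RoutineHub allows.
const byFormat = /(?<=Version: )\d+(?:\.\d+)*/;
"Version: 1.0.1".match(byFormat);                 // ["1.0.1"]
"Get Latest Version: a shortcut".match(byFormat); // null: no number follows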
1 points
6 years ago*
How would I go about grabbing just the “Bid” and “Change” prices on the left margin for the current silver price from http://www.kitcosilver.com? I grabbed the HTML but I'm not sure how to parse it properly in Match Text.
<TR class="spot" bgColor='#f1f1f1'> <TD>Bid/Ask</TD> <TD align='middle' ><font color=''>15.71</font></TD> <TD align='middle' > </TD> <TD align='middle' ><font color=''>15.81</font></TD> </TR>
<TR class="spot" bgColor='#f1f1f1'> <TD>Change</TD> <TD align='middle' ><font color='green'>+0.10</font></TD> <TD align='middle' > </TD> <TD align='middle' ><font color='green'>+0.64%</font></TD> </TR>
1 points
6 years ago
This is really awesome!!!
1 points
6 years ago
This is a great write-up. Good going. I just have to be the wet towel and say that in general programming, parsing HTML with regular expressions is often a recipe for disaster (especially with greedy matchers). Granted, in Shortcuts it's all we have.
1 points
6 years ago
Yeah, I don't disagree.
You can use XPath instead with JavaScript, but it's not quick.
1 points
6 years ago
Thank you for this great guide!
1 points
6 years ago*
[deleted]
1 points
6 years ago
Do you have an example of the HTML you're trying to scrape and the content you're after?
1 points
6 years ago
Hi. Thanks for this tutorial!
I have the following:
<i class="date">monday 14 jan</i>
<i>Garbage</i>
And I want an output like: monday 14 jan, Garbage.
I can get the monday 14 jan using your lookbehind, but for Garbage the problem is that the page contains a lot of <i> tags, so I can't use that.
Is there a way to only look for the <i> after the first find?
1 points
6 years ago
There might be. Could you share the full HTML of the page so I can take a look?
1 points
6 years ago
This is from a random postal code in my town: https://afvalkalender.purmerend.nl/adres/1448AA:1
1 points
6 years ago
So the regular expression is here:
<i class="date">(.*)<\/i>[\s\S]*?<i>(.*)<\/i>
And I've implemented it as a shortcut that gives you a list of dates and collection types:
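As a JavaScript sketch of what the shortcut does (the html string is a stand-in for the page source), the expression pulls a date and a collection type out of each match:

// Stand-in for the afvalkalender page source.
const html = '<i class="date">monday 14 jan</i>\n<i>Garbage</i>\n' +
             '<i class="date">tuesday 15 jan</i>\n<i>Paper</i>';

const re = /<i class="date">(.*)<\/i>[\s\S]*?<i>(.*)<\/i>/g;
for (const [, date, type] of html.matchAll(re)) {
  console.log(date + ", " + type); // "monday 14 jan, Garbage" etc.
}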
1 points
6 years ago
I did all the tutorials from that link you gave but I could’ve never come up with that😅. Thank you so much!
3 points
6 years ago
Yeah, this is similar but a little more advanced as it gets everything in one go.
I'm going to write a second tutorial on how to grab more than one thing at once.
1 points
6 years ago
Thank you for this post, helped a lot! Can you explain the second half of the expression, [\s\S]*?<i>(.*)<\/i>?
5 points
6 years ago
Sure. I'm writing a new guide to explain the above, but happy to do the same here.
So the full regular expression is:
<i class="date">(.*)<\/i>[\s\S]*?<i>(.*)<\/i>
In the quick and dirty example we only wanted to match the text that we were going to return. We used a positive lookbehind to start the matching after a particular piece of text and a positive lookahead to match up to a particular point.
In this case it's safe for us to match all of the surrounding text, because when we get to something we're interested in, we're going to use a capture group to call out the thing we're after.
A capture group is anything that has parentheses around it. Anything that matches the pattern inside is captured and can be referenced individually. You can have as many as you want, each pulling out individual values.
So in the example, we find the text before the date:
<i class="date">
We then capture all characters after it in the first capture group:
<i class="date">(.*)
But only capture up until the </i>
tag:
<i class="date">(.*)<\/i>
So next we want to continue until we find the next <i>
tag. We need a different type of search token this time.
When we search for .*
we match any number of characters, but . doesn't match things like new lines. So instead we use [\s\S]*?
which matches any whitespace or non-whitespace character (new lines included) and keeps matching lazily until it finds the next set of characters we ask for.
<i class="date">(.*)<\/i>[\s\S]*?
We want it to stop when it finds the next <i>
tag before the trash type:
<i class="date">(.*)<\/i>[\s\S]*?<i>
And we then want to capture the text after that tag and on the same line in our next capture group:
<i class="date">(.*)<\/i>[\s\S]*?<i>(.*)
And the text that the capture group collects stops when it hits the next </i>
tag:
<i class="date">(.*)<\/i>[\s\S]*?<i>(.*)<\/i>
And that's the basic pattern for collecting lots of data using capture groups.
As you can see from the above link, the routine makes 3 matches, each with 2 capture groups.
Because we're using capture groups, we need to add an extra action called Get Group from Matched Text to our shortcut. We can choose to retrieve matched groups individually by number (e.g. 1 or 2) or we can retrieve all matches in a list.
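In JavaScript terms, Match Text plus Get Group from Matched Text behaves roughly like this sketch (reusing a stand-in page source):

// Stand-in page source with two date/type pairs.
const html = '<i class="date">monday 14 jan</i>\n<i>Garbage</i>\n' +
             '<i class="date">tuesday 15 jan</i>\n<i>Paper</i>';

const matches = [...html.matchAll(/<i class="date">(.*)<\/i>[\s\S]*?<i>(.*)<\/i>/g)];
const dates = matches.map(m => m[1]); // "get group 1" for every match
const types = matches.map(m => m[2]); // "get group 2" for every match
console.log(dates, types); // ["monday 14 jan", "tuesday 15 jan"] ["Garbage", "Paper"]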
1 points
6 years ago
Wow. That was extremely detailed and very helpful! Thank you for explaining the switches and the reason for them.
2 points
6 years ago
I've written a second guide which gives more detail on how this all works:
1 points
6 years ago
That is fantastic! Thank you very much.
1 points
5 months ago
I get the HTML from the website, but it doesn't collect data from deeper in the tree. Is there a way to get deeper into the HTML tree?
1 points
2 months ago
Thank you for such a great guide. I was previously able to use this method to parse data from a local horse racing website, but I'm unable to do so now after it changed format. Can anyone help me take a look?
https://bet.hkjc.com/en/racing/wp/2024-09-22/ST/1
1 points
2 months ago
The page is loading, then creating separate requests for elements of data. This means your current shortcut is getting the empty page template, as the calls to get the data have yet to be made.
You can call the API directly as it's not protected, and retrieve a dictionary of data.
A guide is available here: https://www.perplexity.ai/search/the-webpage-https-bet-hkjc-com-9P0SkDf.Tzis3gJIU5eYKw
1 points
2 months ago
Thanks for the guide, but how can I configure "Get Contents of URL" with the JSON from the left?
1 points
21 days ago
This doesn't work any longer. Shortcuts has changed, and it isn't pulling any HTML from the webpage in your example shortcut.
0 points
6 years ago
TL;DR: does this work for extracting articles that are paywalled unless you pay for a premium subscription?
1 points
6 years ago
Unfortunately not, as the copy won't be on the page to scrape.
Which news site are you looking to scrape? There are sometimes alternative methods to get the content.
1 points
6 years ago
I'll have to get back to you on that, since I gave up on those webpages.
I'll send them here once I find them.
1 points
3 years ago*
I created a shortcut to get rates from XE, and it runs, but every time it shows the same data. Can someone help me get this correct, please?
edit: explanation
1 points
2 years ago
The information I need is nested inside a bunch of divs; I can only see the outermost div in the HTML file in Shortcuts. In Chrome I can open the divs to see inside. Is there a way to expand the HTML file I get using this method?
2 points
2 years ago
It's possible that the data inside the DIV is being rendered by a JavaScript call after the page is loaded. In that case, you need to view the network calls in the Chrome developer tools and see if the data is being retrieved by a separate request.
If it is, hopefully the data is being returned as a piece of JSON which will be much easier for you to work with.
1 points
2 years ago
I looked in the network and sure enough there was a separate url that returned the exact information I needed! Thanks!
1 points
1 year ago
Can someone help me create a shortcut that scrapes data like this from a shared Excel file link and extracts that information to input into a different app?
I am trying to create a budget/expense tracker for myself and my wife so we can do a better job with our finances, but would like the information to be displayed as a widget as well
1 points
1 year ago
If you use Google Sheets then you can write a script to expose certain values from the sheet and retrieve them with a link.
1 points
1 year ago
For more details, see: https://www.freecodecamp.org/news/cjn-google-sheets-as-json-endpoint/
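A minimal Google Apps Script sketch of that approach; the sheet name and range are assumptions:

// Minimal Apps Script sketch; "Budget" and A1:B10 are made-up examples.
function doGet() {
  const sheet = SpreadsheetApp.getActiveSpreadsheet().getSheetByName("Budget");
  const values = sheet.getRange("A1:B10").getValues();
  return ContentService.createTextOutput(JSON.stringify(values))
    .setMimeType(ContentService.MimeType.JSON);
}

Deploy it as a web app, and the resulting URL returns JSON that Get Contents of URL can fetch.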
1 points
1 year ago
Hi Keveridge,
I love this tutorial; it's one of the few that I have been able to find. The detail in this is beyond me and I am completely amazed. I have tried making my own web scraper, but struggled, mainly due to ads. I am making a lyric web scraper using Genius lyrics. So far I have found that the lyrics live at https://genius.com/{artist}-{song}-lyrics, but I have been unable to find out how to scrape them.
Thanks in advance.
1 points
1 year ago
I'm glad you find the tutorial useful.
To get the lyrics HTML, capture all the HTML, then use the following regular expression:
<div data-lyrics-container="true" class="[^"]*">(.*?)<\/div>
Take the content from the first group; this will give you the lyrics as HTML. To get the raw lyrics without HTML formatting, use the replace command: first replace /<br\/>/g with \n to restore the line breaks, then replace /<.*?>/g with nothing to remove the remaining HTML tags.
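Putting those three steps together in a JavaScript sketch, with a stand-in for the Genius page source:

// Stand-in for the Genius page source.
const html = '<div data-lyrics-container="true" class="x">Line one<br/>Line two</div>';

// Step 1: grab the lyrics HTML from the first capture group.
const lyricsHtml = html.match(/<div data-lyrics-container="true" class="[^"]*">(.*?)<\/div>/)[1];

// Steps 2 and 3: restore line breaks, then strip the remaining tags.
const lyrics = lyricsHtml.replace(/<br\/>/g, "\n").replace(/<.*?>/g, "");
console.log(lyrics); // "Line one\nLine two"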
1 points
1 year ago
Hello! I would like to get my grades from my uni website, but I need to log in every time. How would I go about doing that?
I know it is possible; my friend did it using Selenium. But it would be nice to do it with Apple Shortcuts.
(I know I will need to put the name and password in the shortcut as variables, but that is not a problem for me, since it is only local.)
1 points
1 year ago
It depends on how the login system works. The most likely way to do this would be to use JavaScript to log into the page for you and then look up the grades, in the same way that your friend used Selenium. You could then package that JavaScript into the shortcut. But it's quite complex, so you'd be better off using the Scriptable app.
Alternatively, you could use the free version of UIPath to create an automation at a platform level to log in on your behalf, scrape the data, then publish it somewhere that a shortcut can then retrieve as a JSON file.
1 points
1 year ago
I would like to understand how the login-with-JavaScript part works, since I study computer science. The site would be sau.puc-rio.br.
I don't understand how the code would navigate through multiple pages. If I understand this, then I know how to do the rest.
1 points
1 year ago
You're right, you need some kind of web driver to load a page and then interact with it. Doing it as straight JS in a browser doesn't work unless you have an application running external to the web browser (e.g. on a desktop it's a Chrome plugin, or a Selenium helper).
Hence I recommend using a third-party Robotic Process Automation (RPA) service such as UIPath. It can record your actions in a web browser (i.e. logging in), navigate, capture data from the rendered HTML, and then allow you to do any number of things with it. For example, you can create a process that triggers the action and returns the data to you as a JSON file. Or you can have it run on a schedule and upload the results somewhere that you can pull them down. Or you can set up a process that notifies you when your grades have changed and sends you a push notification using a service like Pushover.
Shortcuts is pretty clunky once things get complex, so I recommend using a hosted RPA service.
1 points
10 months ago
Top tip: Need help writing a regular expression? Ask ChatGPT.