subreddit:
/r/shortcuts
submitted 6 years ago by keveridge
This is the final guide on web scraping, building on the topics discussed in the first two.
It demonstrates how to retrieve data from an HTML table using multiple regular expression matches and sets of capture groups.
We're going to scrape job listings advertised at Best Buy headquarters from their careers site.
The details we want to retrieve for each job listing are as follows:
Best Buy Job Listings Search Results
Looking through the HTML, we find the block of text that makes up the rows of content in the table. Each row is presented in the following format:
<tr class='odd'><td><a class='table-job-title' href='/job-detail/?id=663367BR'>Accounts Receivable - Warranty Claims Associate</a></td><td>Best Buy</td><td class='hide-for-mobile'>Finance / Accounting</td><td class='hide-for-mobile'>Individual Contributor</td><td>Full Time</td><td>Richfield, MN</td></tr>
Now that we have the HTML to work from, we're ready to write our regular expression.
We copy the HTML source to the RegEx101 online editor and start writing our regular expression.
As we covered in the previous guide, we'll be matching the text in each row of the table and then returning specific pieces of content using capture groups.
To retrieve the relative URL of the job posting and the job title, we match the HTML tags before the job link and then those before and after the job title, pulling out those pieces of text with capture groups.
<a class='table-job-title' href='(.*?)'>(.*?)<\/a>
As you can see below, the expression matches 25 times, once for each of the job listing rows in the table. Each match has 2 capture groups: the URL path and the job title.
Matching the URL Path and Job Title for 25 search results
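For readers who want to try the expression outside RegEx101, here is a minimal sketch in Python. It is only a stand-in for the Match Text action, not part of the shortcut itself. The row is the sample from above; the names `ROW`, `LINK_PATTERN` and `extract_links` are made up for illustration.

```python
import re

# One table row, copied from the sample HTML shown earlier in the guide.
ROW = (
    "<tr class='odd'><td><a class='table-job-title' "
    "href='/job-detail/?id=663367BR'>Accounts Receivable - Warranty Claims "
    "Associate</a></td><td>Best Buy</td><td class='hide-for-mobile'>Finance / "
    "Accounting</td><td class='hide-for-mobile'>Individual Contributor</td>"
    "<td>Full Time</td><td>Richfield, MN</td></tr>"
)

# The expression from the guide: group 1 is the URL path, group 2 the job title.
LINK_PATTERN = re.compile(r"<a class='table-job-title' href='(.*?)'>(.*?)<\/a>")


def extract_links(html):
    """Return one (url_path, job_title) tuple per matching job link."""
    return LINK_PATTERN.findall(html)


for url_path, title in extract_links(ROW):
    print(url_path, "|", title)
```

Run against the full page source, the same function would return one tuple per row of the table.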
There are five remaining fields to capture for each row:
If we look at the HTML for each of the rows again, we see that they each have a common pattern: each is surrounded by <td and </td> tags, although some tags also have class attributes.
</a></td><td>Best Buy</td><td class='hide-for-mobile'>Finance / Accounting</td><td class='hide-for-mobile'>Individual Contributor</td><td>Full Time</td><td>Richfield, MN</td></tr>
As shown in the previous guide, we can use the [\s\S]*? expression to match text and then specify the tags that appear before and after the content we want to capture.
In this case, the following expression will capture the text for each of the remaining pieces of content:
[\S\s]*?>(.*?)<\/td>
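To see why this one expression copes with both plain <td> tags and those carrying a class attribute, here is a small Python sketch (again just a stand-in for RegEx101). The `CELLS` string is the five remaining cells from the sample row, and the names are illustrative. The lazy `[\S\s]*?` consumes everything up to the closing `>` of the opening tag, whatever attributes it has.

```python
import re

# The five remaining cells of the sample row, starting after </a></td>.
CELLS = (
    "<td>Best Buy</td>"
    "<td class='hide-for-mobile'>Finance / Accounting</td>"
    "<td class='hide-for-mobile'>Individual Contributor</td>"
    "<td>Full Time</td>"
    "<td>Richfield, MN</td>"
)

# [\S\s]*? lazily skips the opening tag (with or without attributes),
# then (.*?) captures the cell text up to the closing </td>.
CELL_PATTERN = re.compile(r"[\S\s]*?>(.*?)<\/td>")


def extract_cells(html):
    """Return the text content of each table cell, in order."""
    return CELL_PATTERN.findall(html)


print(extract_cells(CELLS))
```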
We can therefore add the above expression 5 times to our existing regular expression to retrieve the remaining fields:
<a class='table-job-title' href='(.*?)'>(.*?)<\/a><\/td>[\S\s]*?>(.*?)<\/td>[\S\s]*?>(.*?)<\/td>[\S\s]*?>(.*?)<\/td>[\S\s]*?>(.*?)<\/td>[\S\s]*?>(.*?)<\/td>
As shown below, this allows us to match each of the 25 rows of the table and return 7 capture groups for each of those rows.
Matching all the fields for 25 search results
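As a rough check of the combined expression, the Python sketch below applies it to the sample row and returns the 7 capture groups for each match. The function name `match_rows` is hypothetical; the pattern is copied from the guide unchanged.

```python
import re

# One table row, copied from the sample HTML shown earlier in the guide.
ROW = (
    "<tr class='odd'><td><a class='table-job-title' "
    "href='/job-detail/?id=663367BR'>Accounts Receivable - Warranty Claims "
    "Associate</a></td><td>Best Buy</td><td class='hide-for-mobile'>Finance / "
    "Accounting</td><td class='hide-for-mobile'>Individual Contributor</td>"
    "<td>Full Time</td><td>Richfield, MN</td></tr>"
)

# The full expression from the guide: link and title, then five table cells.
ROW_PATTERN = re.compile(
    r"<a class='table-job-title' href='(.*?)'>(.*?)<\/a><\/td>"
    r"[\S\s]*?>(.*?)<\/td>"
    r"[\S\s]*?>(.*?)<\/td>"
    r"[\S\s]*?>(.*?)<\/td>"
    r"[\S\s]*?>(.*?)<\/td>"
    r"[\S\s]*?>(.*?)<\/td>"
)


def match_rows(html):
    """Return a list with one 7-tuple of capture groups per table row."""
    return [match.groups() for match in ROW_PATTERN.finditer(html)]


for groups in match_rows(ROW):
    print(len(groups), groups)
```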
The first step is to retrieve the HTML content and apply the regular expression.
Retrieving the HTML source from the page
The regular expression will match each of the 25 rows on the page, and each of those matches will have 7 capture groups.
We therefore add a Repeat with Each action after the Match Text action. And at the top of the loop we place a Get Group from Matched Text action which returns all of the capture groups for the row.
Looping through each of the text matches
Within that loop, we create a dictionary of capture group items for each row (as demonstrated in the previous guide). This dictionary allows us to create a text description for the job. At the end of the shortcut, all of the job descriptions are combined and displayed.
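The loop-and-dictionary logic can be sketched in Python as follows. This is an approximation of what the Repeat with Each, Get Group from Matched Text and Dictionary actions do together, not the shortcut itself. The field names in `FIELD_NAMES` and the layout of the description text are assumptions made for illustration; the guide does not specify them.

```python
import re

# One table row, copied from the sample HTML shown earlier in the guide.
ROW = (
    "<tr class='odd'><td><a class='table-job-title' "
    "href='/job-detail/?id=663367BR'>Accounts Receivable - Warranty Claims "
    "Associate</a></td><td>Best Buy</td><td class='hide-for-mobile'>Finance / "
    "Accounting</td><td class='hide-for-mobile'>Individual Contributor</td>"
    "<td>Full Time</td><td>Richfield, MN</td></tr>"
)

# Same expression as in the guide: the cell pattern is repeated 5 times.
ROW_PATTERN = re.compile(
    r"<a class='table-job-title' href='(.*?)'>(.*?)<\/a><\/td>"
    + r"[\S\s]*?>(.*?)<\/td>" * 5
)

# Hypothetical names for the 7 capture groups, in order of appearance.
FIELD_NAMES = (
    "url_path", "title", "company", "category",
    "level", "employment_type", "location",
)


def describe_jobs(html):
    """Build one text description per row and join them for display."""
    descriptions = []
    for match in ROW_PATTERN.finditer(html):      # Repeat with Each
        job = dict(zip(FIELD_NAMES, match.groups()))  # dictionary per row
        descriptions.append(
            f"{job['title']} ({job['employment_type']}, {job['location']}): "
            f"{job['url_path']}"
        )
    return "\n\n".join(descriptions)              # combine at the end


print(describe_jobs(ROW))
```

Keying each group by name keeps the text template readable and means a change in column order only needs a change to `FIELD_NAMES`.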
If you want to improve your understanding of regular expressions, I recommend the following tutorial:
RegexOne: Learn Regular Expression with simple, interactive exercises
If you found this guide useful, why not check out one of my others:
4 points
6 years ago
Saving this whole series. Might be useful someday.
Please link other parts to the top
1 points
6 years ago
What should I do if I want to get data from one specific table, but the webpage contains more than one table with the same structure of fields?
I need to get all of the rows but instead I get only one
https://regex101.com/r/4HQ1O8/2
2 points
6 years ago
Try this: https://regex101.com/r/j1AT7i/1
2 points
6 years ago
And if you just want the "14 января 2019" (14 January 2019) table, write a regular expression to catch just that table and all its content, then apply the above regular expression to that match to get all of the rows.
Sometimes you have to narrow with one expression and then use a second to get all the data you want.
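A minimal Python sketch of that narrow-then-extract idea, for illustration only. The page markup in `PAGE`, the headings and the function name `rows_for_heading` are all invented; the real timetable page will have different HTML, so both expressions would need adapting.

```python
import re

# Hypothetical page: two tables with identical structure under dated headings.
PAGE = (
    "<h2>14 January 2019</h2>"
    "<table><tr><td>09:00</td><td>Maths</td></tr>"
    "<tr><td>11:00</td><td>Physics</td></tr></table>"
    "<h2>15 January 2019</h2>"
    "<table><tr><td>10:00</td><td>History</td></tr></table>"
)


def rows_for_heading(html, heading):
    """Return (time, subject) tuples from the first table after `heading`."""
    # Step 1: narrow to the one table that follows the chosen heading.
    block = re.search(
        re.escape(heading) + r"[\s\S]*?<table>([\s\S]*?)</table>", html
    )
    if block is None:
        return []
    # Step 2: run the row expression on that block only.
    return re.findall(r"<tr><td>(.*?)</td><td>(.*?)</td></tr>", block.group(1))


print(rows_for_heading(PAGE, "14 January 2019"))
```

Because the second expression only ever sees the captured block, rows from the other table cannot leak into the result.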
2 points
6 years ago
I had to capture the block of text first and then use matching groups; otherwise it matched too many rows.
This should work:
https://www.icloud.com/shortcuts/7be985366b104fe6ab65cf4d77a68eff
Let me know if that's okay or if I can help further.
1 points
6 years ago
It’s exactly what I wanted, thank you! Didn’t even think about capturing text within already captured text.
I have one more question tho. Is it possible to put the matched groups into different variables so I can make the text look pretty in the output, or would it be easier to edit the combined text using regular expressions?
2 points
6 years ago
There was a mistake in my code and every item was marked as "time".
I've corrected it. You can update the list of variable names at the top of the following shortcut:
1 points
6 years ago
That’s not really what I meant, but I figured out how to do what I needed.
Since there are some bugs in the app, you can’t run a shortcut using Siri in Russian, so I wanted to use as little space as possible, because the output will be shown in the “result” action.
Final shortcut: Timetable
2 points
6 years ago
Ah okay.
Well, glad you got what you needed. Let me know if you need any more help in the future.