get-library-hours
This is a script I use as a Bash/Zsh function to retrieve the hours of my local branch library on the command line.
It uses Perl's powerful regex capabilities, along with curl to retrieve the web page and html2text to convert the HTML to text suitable for viewing on the command line.
Technical Notes
curl is used to retrieve HTML content from the web:
# use the silent switch -s to hide progress/error output
curl -s "https://sfpl.org/locations"
In this instance we are slurping the entire file. That is, the regex is matched against the whole page as a single string rather than against individual lines. Slurping the whole file requires -0777.
curl -s "https://sfpl.org/locations" | perl -0777
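In addition, Perl's -n and -e switches are required. -n wraps the code in a loop over the input, one record at a time (normally one line; with -0777 the whole input is a single record, so the loop runs once), and -e executes code given directly on the command line.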
To print a match of a string:
curl -s <URL> | perl -0777 -ne 'print "$&\n" if /<STRING TO MATCH>/s'
To match this string: <div class="field field--name-field-short-name field__item">Parkside</div>
do:
# note that / must be escaped: \/
curl -s "https://sfpl.org/locations" | perl -0777 -ne 'print "$&\n" if /<div class="field field--name-field-short-name field__item">Parkside<\/div>/s'
This will print the match exactly:
<div class="field field--name-field-short-name field__item">Parkside</div>
When doing complex matching, it's good to confirm that the string actually matched, which tells you that your regular expression is correct.
But now we want to extract specific text from the page. We need something that will occur between match expressions. We are going to need wildcards and to capture text.
In the case of the library page, if you look at the raw HTML you will see that it contains multiple branches. Thus I wanted to match the Parkside branch specifically. But if you look at the page, the branch hours occur later. They occur right after this div:
<div class="office-hours">
And then they end before <div class="location__visit-branch">
So first we need to match everything from <div class="field field--name-field-short-name field__item">Parkside</div> until <div class="office-hours">. Then we want to capture everything after that until (but not including) <div class="location__visit-branch">.
To keep this simple and be able to verify that our script is matching correctly, let's start with the first match, from <div class="field field--name-field-short-name field__item">Parkside</div> until <div class="office-hours">.
In order to tell Perl to include everything between these two strings we use what is called non-greedy matching. You may notice that <div class="office-hours"> occurs multiple times in the HTML, once for each branch. With a greedy match, the pattern would run all the way to the last occurrence of this string, pulling in every branch in between. But we only want the first occurrence, the one that occurs right after our initial string.
A normal, greedy match would simply be .*. The . matches any character (including newlines, thanks to the /s modifier), and * means 0 or more occurrences.
To convert this to a non-greedy match, append ? to the quantifier: .*?
curl -s "https://sfpl.org/locations" | perl -0777 -ne 'print "$&\n" if /<div class="field field--name-field-short-name field__item">Parkside<\/div>.*?<div class="office-hours">/s'
Next we need to capture everything after this text up to but not including <div class="location__visit-branch">. Text is captured using parentheses (). Inside the parentheses we do another non-greedy match.
Finally, we need to tell Perl to only print the captured string, so we change print "$&\n" to print "$1\n":
curl -s "https://sfpl.org/locations" | perl -0777 -ne 'print "$1\n" if /<div class="field field--name-field-short-name field__item">Parkside<\/div>.*?<div class="office-hours">(.*?)<div class="location__visit-branch">/s'
Excellent! We have the exact HTML from the page we want, so, finally, we can use html2text to render it as plain text:
curl -s "https://sfpl.org/locations" | perl -0777 -ne 'print "$1\n" if /<div class="field field--name-field-short-name field__item">Parkside<\/div>.*?<div class="office-hours">(.*?)<div class="location__visit-branch">/s' | html2text
If we ever want to retrieve the hours for a different branch, we can simply change the branch name, e.g. from Parkside to Portola.