The goal of polite
is to promote responsible web
etiquette.
“bow and scrape” (verb):
To make a deep bow with the right leg drawn back (thus scraping the floor), left hand pressed across the abdomen, right arm held aside.
(idiomatic, by extension) To behave in a servile, obsequious, or excessively polite manner. [1]
Source: Wiktionary, The free dictionary
The package’s two main functions bow
and
scrape
define and realize a web harvesting session.
bow
is used to introduce the client to the host and ask for
permission to scrape (by inquiring against the host’s
robots.txt
file), while scrape
is the main
function for retrieving data from the remote server. Once the connection
is established, there’s no need to bow
again. Rather, in
order to adjust a scraping URL the user can simply nod
to
the new path, which updates the session’s URL, making sure that the new
location can be negotiated against robots.txt
.
The three pillars of a polite session
are
seeking permission, taking slowly and never asking
twice.
The package builds on awesome toolkits for defining and managing http
sessions (httr
and rvest
), declaring the user
agent string and investigating site policies (robotstxt
),
and utilizing rate-limiting and response caching
(ratelimitr
and memoise
).
You can install polite
from CRAN with:
install.packages("polite")
Development version of the package can be installed from Github with:
install.packages("remotes")
::install_github("dmi3kno/polite") remotes
This is a basic example which shows how to retrieve the list of
semi-soft cheeses from www.cheese.com. Here, we authenticate a session
and then scrape the page with specified parameters. Behind the scenes
polite
retrieves robots.txt
, checks the URL
and user agent string against it, caches the call to
robots.txt
and to the web page and enforces rate
limiting.
library(polite)
library(rvest)
<- bow("https://www.cheese.com/by_type", force = TRUE)
session <- scrape(session, query=list(t="semi-soft", per_page=100)) %>%
result html_node("#main-body") %>%
html_nodes("h3") %>%
html_text()
head(result)
#> [1] "3-Cheese Italian Blend" "Abbaye de Citeaux"
#> [3] "Abbaye du Mont des Cats" "Adelost"
#> [5] "ADL Brick Cheese" "Ailsa Craig"
You can build your own functions that incorporate bow
,
scrape
(and, if required, nod
). Here we will
extend our inquiry into cheeses and will download all cheese names and
URLs to their information pages. Let’s retrieve the number of pages per
letter in the alphabetical list, keeping the number of results per page
to 100 to minimize number of web requests.
library(polite)
library(rvest)
library(purrr)
library(dplyr)
<- bow("https://www.cheese.com/alphabetical")
session
# this is only to illustrate the example.
<- letters[1:3] # delete this line to scrape all letters
letters
<- map(letters, ~scrape(session, query = list(per_page=100,i=.x)) )
responses <- map(responses, ~html_nodes(.x, "#id_page li") %>%
results html_text(trim = TRUE) %>%
as.numeric() %>%
tail(1) ) %>%
map(~pluck(.x, 1, .default=1))
<- tibble(letter = rep.int(letters, times=unlist(results)),
pages_df pages = unlist(map(results, ~seq.int(from=1, to=.x))))
pages_df#> # A tibble: 6 × 2
#> letter pages
#> <chr> <int>
#> 1 a 1
#> 2 b 1
#> 3 b 2
#> 4 c 1
#> 5 c 2
#> 6 c 3
Now that we know how many pages to retrieve from each letter page,
let’s rotate over letter pages and retrieve cheese names and underlying
links to cheese details. We will need to write a helper function. Our
session is still valid and we don’t need to nod
again,
because we will not be modifying a page URL, only its parameters (note
that the field url
is missing from scrape
function).
<- function(letter, pages){
get_cheese_page <- scrape(session, query=list(per_page=100,i=letter,page=pages)) %>%
lnks html_nodes("h3 a")
tibble(name=lnks %>% html_text(),
link=lnks %>% html_attr("href"))
}
<- pages_df %>% pmap_df(get_cheese_page)
df
df#> # A tibble: 518 × 2
#> name link
#> <chr> <chr>
#> 1 Abbaye de Belloc /abbaye-de-belloc/
#> 2 Abbaye de Belval /abbaye-de-belval/
#> 3 Abbaye de Citeaux /abbaye-de-citeaux/
#> 4 Abbaye de Tamié /tamie/
#> 5 Abbaye de Timadeuc /abbaye-de-timadeuc/
#> 6 Abbaye du Mont des Cats /abbaye-du-mont-des-cats/
#> 7 Abbot’s Gold /abbots-gold/
#> 8 Abertam /abertam/
#> 9 Abondance /abondance/
#> 10 Acapella /acapella/
#> # … with 508 more rows
Bob Rudis is one the vocal proponents of an online etiquette in the R
community. If you have never seen his robots.txt file, you should
definitely check it out! Lets
look at his blog. We don’t know how many
pages will the gallery return, so we keep going until there’s no more
“Older posts” button. Note that I first bow
to the host and
then simply nod
to the current scraping page inside the
while
loop.
library(polite)
library(rvest)
<- data.frame()
hrbrmstr_posts <- "https://rud.is/b/"
url <- bow(url)
session
while(!is.na(url)){
# make it verbose
message("Scraping ", url)
# nod and scrape
<- nod(session, url) %>%
current_page scrape(verbose=TRUE)
# extract post titles
<- current_page %>%
hrbrmstr_posts html_nodes(".entry-title a") %>%
::html_attrs_dfr() %>%
politerbind(hrbrmstr_posts)
# see if there's "Older posts" button
<- current_page %>%
url html_node(".nav-previous a") %>%
html_attr("href")
# end while loop
}
::as_tibble(hrbrmstr_posts)
tibble#> # A tibble: 578 x3
We organize the data into the tidy format and append it to our empty data frame. At the end we will discover that Bob has written over 570 blog articles, which I very much recommend anyone to check out.
If you are developing a package which accesses the web,
polite
can be used either as a template, or as a
backend for your polite web session.
Just before its ascension to CRAN, the package acquired new
functionality for helping package developers get started on creating
polite web tools for the users. Any modern package developer is probably
familiar with excellent usethis
package
by Rstudio team. usethis
is a collection of scripts for
automating package development workflow. Many usethis
functions automating repetitive tasks start with prefix
use_
indicating that what followed will be adopted and
“used” by the package user developes. For details about
use_
family of functions, see package
documentation.
{polite}
has one usethis-like function called
polite::use_manners()
.
::use_manners() polite
When called within the analysis (or package) directory, it creates a
new file called R/polite-scrape.R
(creating R
directory if necessary) and populates it with template functions for
creating polite web-scraping session. The functions provided by
polite::use_manners()
are drop-in replacements for two of
the most popular tools in web-accessing R ecosystem:
read_html()
and download.file()
. The only
difference is that these functions have polite_
prefix. In
all other respects they should have look and feel of the original,
i.e. in most cases you should be able to simply replace calls to
read_html()
with polite_read_html()
and
download.file
with polite_download_file()
and
your code should work (provided you scrape from a url
,
which it the first required argument in both functions).
Recent addition to polite package is a purrr
-like
adverb politely()
which can make any web-accessing function
“polite” by wrapping it with a code which delivers on four pillars of
polite session:
Introduce Yourself, Seek Permission, Take Slowly and Never Ask Twice.
Adverbs can be useful, when a user (package developer) wants to
“delegate” polite session handling to external package, without
modifying the existing code. The only thing user needs to do is wrap
existing verb with politely()
and use the new function
instead of the original.
Let’s say you wanted to use httr::GET
for accessing
certain API, such as musicbrainz
and extract certain data
from a deeply nested list, returned by the server. Your originally
developed code looks like this:
library(magrittr)
#>
#> Attaching package: 'magrittr'
#> The following object is masked from 'package:purrr':
#>
#> set_names
library(httr)
library(xml2)
library(purrr)
<- GET("https://musicbrainz.org/ws/2/artist/",
beatles_res query=list(query="Beatles", limit=10),
::accept("application/json"))
httrif(!is.null(beatles_res)) beatles_lst <- httr::content(beatles_res, type = "application/json")
str(beatles_lst, max.level = 2)
#> List of 4
#> $ created: chr "2022-08-03T01:25:54.433Z"
#> $ count : int 169
#> $ offset : int 0
#> $ artists:List of 10
#> ..$ :List of 12
#> ..$ :List of 12
#> ..$ :List of 10
#> ..$ :List of 7
#> ..$ :List of 8
#> ..$ :List of 6
#> ..$ :List of 9
#> ..$ :List of 5
#> ..$ :List of 6
#> ..$ :List of 6
This code does not comply with polite
principles. It
does not provide human-readable user-agent string, it does not consult
robots.txt
about permissions. It is possible to run this
code in the loop and (accidentally) overwhelm the server with requests.
It does not cache the results, so if this code is re-run again, data
will be re-queried.
You could write your own infastructure for handling useragent,
robots.txt, rate limiting and memoisation, or you could simply use an
adverb politely()
which does all of these things for
you.
Here’s an example from using colormind.io API. We will need a couple of service functions to convert colors between HEX and RGB and to prepare a json required by the service.
<- function(r,g,b,a) {grDevices::rgb(r, g, b, a, maxColorValue = 255)}
rgba2hex
<- function(x, alpha=TRUE){t(grDevices::col2rgb(x, alpha = alpha))}
hex2rgba
<- function(x, model){
prepare_colormind_query <- list(model=model)
lst
if(!is.null(x)){
<- utils::head(c(x, rep(NA_character_, times=4)), 5) # pad it with NAs
x <- hex2rgba(x)
x_mat <- lapply(seq_len(nrow(x_mat)), function(i) if(x_mat[i,4]==0) "N" else x_mat[i,1:3])
x_lst <- c(list(input=x_lst), lst)
lst
}::toJSON(lst, auto_unbox = TRUE)
jsonlite }
Now all we have to do is to “wrap” existing function in the
politely
adverb. Then call the new function insted of
original. You dont need to change anything other than a function
name.
<- politely(httr::GET, verbose=TRUE)
polite_GET
#res <- httr::GET("http://colormind.io/list") # was
<- polite_GET("http://colormind.io/list") # now
res #> Fetching robots.txt
#> rt_robotstxt_http_getter: normal http get
#> Warning in request_handler_handler(request = request, handler = on_not_found, :
#> Event: on_not_found
#> Warning in request_handler_handler(request = request, handler =
#> on_file_type_mismatch, : Event: on_file_type_mismatch
#> Warning in request_handler_handler(request = request, handler =
#> on_suspect_content, : Event: on_suspect_content
#>
#> New copy robots.txt was fetched from http://colormind.io/robots.txt
#> Total of 0 crawl delay rule(s) defined for this host.
#> Your rate will be set to 1 request every 5 second(s).
#> Pausing...
#> Scraping: http://colormind.io/list
#> Setting useragent: polite R (4.2.1 x86_64-pc-linux-gnu x86_64 linux-gnu) bot
::fromJSON(httr::content(res, as = "text"))$result
jsonlite#> [1] "ui" "default" "the_wind_rises"
#> [4] "lego_movie" "stellar_photography" "game_of_thrones"
The backend functionality of polite
can be used for
any function as long as it has url
argument (or
the first argument is a url). Here’s an example of polite POST created
with adverb politely
.
<- politely(POST, verbose=TRUE)
polite_POST
<-c(NA, "lightseagreen", NA, "coral", NA)
clue_colors
<- prepare_colormind_query(clue_colors, "default")
req
#res <- httr::POST(url='http://colormind.io/api/', body = req) #was
<- polite_POST(url='http://colormind.io/api/', body = req) #now
res #> Fetching robots.txt
#> rt_robotstxt_http_getter: cached http get
#> Warning in request_handler_handler(request = request, handler = on_not_found, :
#> Event: on_not_found
#> Warning in request_handler_handler(request = request, handler =
#> on_file_type_mismatch, : Event: on_file_type_mismatch
#> Warning in request_handler_handler(request = request, handler =
#> on_suspect_content, : Event: on_suspect_content
#>
#> Found the cached version of robots.txt for http://colormind.io/robots.txt
#> Total of 0 crawl delay rule(s) defined for this host.
#> Your rate will be set to 1 request every 5 second(s).
#> Pausing...
#> Scraping: http://colormind.io/api/
#> Setting useragent: polite R (4.2.1 x86_64-pc-linux-gnu x86_64 linux-gnu) bot
<- httr::content(res, as = "text")
res_json <- jsonlite::fromJSON(res_json)$result
res_mcol <- rgba2hex(res_mcol)
colrs ::show_col(colrs, ncol = 5) scales
Musicbrainz
API allows querying data on artists, releases, labels and all things
music. API endpoint, unfortunately, is Disallowed in
robots.txt
, but it is completely legal to access for small
size requests. Mass querying is easier using a datadump, with
musicbrainz published periodically. We can create polite GET and turn
off robots.txt
validation.
library(polite)
<- politely(GET, verbose=TRUE, robots = FALSE) # turn off robotstxt checking
polite_GET_nrt
<- polite_GET_nrt("https://musicbrainz.org/ws/2/artist/",
beatles_lst query=list(query="Beatles", limit=10),
::accept("application/json")) %>%
httr::content(type = "application/json")
httr#> Pausing...
#> Scraping: https://musicbrainz.org/ws/2/artist/
#> Setting useragent: polite R (4.2.1 x86_64-pc-linux-gnu x86_64 linux-gnu) bot
str(beatles_lst, max.level = 2)
#> List of 4
#> $ created: chr "2022-08-03T01:25:54.433Z"
#> $ count : int 169
#> $ offset : int 0
#> $ artists:List of 10
#> ..$ :List of 12
#> ..$ :List of 12
#> ..$ :List of 10
#> ..$ :List of 7
#> ..$ :List of 8
#> ..$ :List of 6
#> ..$ :List of 9
#> ..$ :List of 5
#> ..$ :List of 6
#> ..$ :List of 6
Lets parse the response
options(knitr.kable.NA = '')
%>%
beatles_lst extract2("artists") %>%
::tibble(id=map_chr(.,"id", .default=NA_character_),
{tibblematch_pct=map_int(.,"score", .default=NA_character_),
type=map_chr(.,"type", .default=NA_character_),
name=map_chr(., "name", .default=NA_character_),
country=map_chr(., "country", .default=NA_character_),
lifespan_begin=map_chr(., c("life-span", "begin"),.default=NA_character_),
lifespan_end=map_chr(., c("life-span", "end"),.default=NA_character_)
)%>% knitr::kable(col.names = c(id="Musicbrainz ID", match_pct="Match, %",
} type="Type", name="Name of artist",
country="Country", lifespan_begin="Career begun",
lifespan_end="Career ended"))
Musicbrainz ID | Match, % | Type | Name of artist | Country | Career begun | Career ended |
---|---|---|---|---|---|---|
b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d | 100 | Group | The Beatles | 1957-03 | 1970-04-10 | |
5e685f9e-83bb-423c-acfa-487e34f15ffd | 78 | Group | The Tape-beatles | US | 1986-12 | |
1019b551-eba7-4e7c-bc7d-eb427ef54df2 | 75 | Group | Blues Beatles | BR | ||
5a45e8c5-e8e5-4f05-9429-6dd00f0ab50b | 75 | Group | Instrumental Beatles | |||
e897e5fc-2707-49c8-8605-be82b4664dc5 | 74 | Group | Sex Beatles | |||
74e70126-def2-4b76-a001-ed3b96080e24 | 74 | Powdered Beatles | ||||
bc569a61-dd62-4758-86c6-e99dcb1fdda6 | 74 | Tokyo Beatles | JP | |||
3133aeb8-9982-4e11-a8ff-5477996a80bf | 74 | Beatles Chillout | ||||
5d25dbfb-7558-45dc-83dd-6d1176090974 | 74 | Daft Beatles | ||||
bdf09e36-2b82-44ef-8402-35c1250d81e0 | 74 | Zyklon Beatles |
Package logo uses elements of a free image by pngtree.com
[1] Wiktionary (2018), The free dictionary, retrieved from https://en.wiktionary.org/wiki/bow_and_scrape