
Scraping Javascript-Rendered Content In R From A Webpage Without Unique URL

I want to scrape historical results of South African LOTTO draws (especially Total Pool Size, Total Sales, etc.) from the South African National Lottery website. By default one see

Solution 1:

You are right: the content on the page is updated by JavaScript via an AJAX request. The server returns a JSON string in response to an HTTP POST request. With POST requests, the server's response is determined not only by the URL you request but also by the body of the message you send to the server. In this case, the body is a simple form with three fields: gameName, which is always LOTTO; isAjax, which is always true; and drawNumber, which is the field you want to vary.

If you are using httr, you specify these fields as a named list in the body parameter of the POST function.
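As a minimal sketch of what that named list looks like, the three form fields can be built for a single draw as below. The helper name make_lotto_body is hypothetical and only for illustration, and the request itself is left commented out because it needs network access; encode = "form" is an assumption about how the server expects the fields to be sent.

```r
# Hypothetical helper: builds the three form fields described above for one draw.
make_lotto_body <- function(draw_number) {
  list(
    gameName   = "LOTTO",                     # always LOTTO for this game
    drawNumber = as.character(draw_number),   # the field that varies per request
    isAjax     = "true"                       # always true
  )
}

body <- make_lotto_body(2009)
str(body)

# The request would then look roughly like this (not run here; `url` is the
# results page address and encode = "form" is an assumption):
# res <- httr::POST(url, body = body, encode = "form")
```

Keeping the body in its own function makes it easy to loop over many draw numbers without repeating the constant fields.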

Once you have the response for each draw, you will want to parse the JSON into an R-friendly format such as a list or data frame, using a library such as jsonlite. From looking at the structure of this particular JSON, it makes most sense to extract the component $data$drawDetails and make that a one-row data frame. This will allow you to bind several draws together into a single data frame.
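To make the parsing step concrete, here is a small offline sketch. The two JSON strings are made-up stand-ins that only mimic the assumed nesting (data, then drawDetails) with two of the fields; the real response has many more fields, and whether its values arrive as strings or numbers is an assumption here. The helper name to_row is hypothetical.

```r
library(jsonlite)

# Made-up responses in the assumed shape {"data": {"drawDetails": {...}}}
json_a <- '{"data": {"drawDetails": {"drawNumber": "2009", "totalSales": "9879655"}}}'
json_b <- '{"data": {"drawDetails": {"drawNumber": "2010", "totalSales": "11696905"}}}'

# Hypothetical helper: turn one JSON string into a one-row data frame.
to_row <- function(json_text) {
  details <- fromJSON(json_text)$data$drawDetails
  as.data.frame(details, stringsAsFactors = FALSE)
}

# Bind the one-row data frames into a single table, one row per draw.
draws <- do.call(rbind, lapply(list(json_a, json_b), to_row))
print(draws)
```

Because every draw yields the same set of columns, rbind can stack the rows directly.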

Here is a function that does all that for you:

lotto_details <- function(draw_numbers)
{
 do.call("rbind", lapply(draw_numbers, function(x)
 {
   res <- httr::POST(paste0("https://www.nationallottery.co.za/index.php",
                            "?task=results.redirectPageURL&",
                            "Itemid=265&option=com_weaver&",
                            "controller=lotto-history"),
                     body = list(gameName = "LOTTO", drawNumber = x, isAjax = "true"))
   as.data.frame(jsonlite::fromJSON(httr::content(res, "text"))$data$drawDetails)
 }))
}

Which you use like this:

lotto_details(2009:2012)
#>   drawNumber   drawDate nextDrawDate ball1 ball2 ball3 ball4 ball5 ball6
#> 1       2009 2020/04/01   2020/04/04    51    15     7    32    42    45
#> 2       2010 2020/04/04   2020/04/08    43     4    21    24    10     3
#> 3       2011 2020/04/08   2020/04/11    42    43     8    18     2    29
#> 4       2012 2020/04/11   2020/04/15    48     6    43    41    25    45
#>   bonusBall div1Winners div1Payout div2Winners div2Payout div3Winners
#> 1         1           0          0           0          0          21
#> 2        22           0          0           0          0          31
#> 3        34           0          0           0          0          21
#> 4        38           1 10546013.8           0          0          28
#>   div3Payout div4Winners div4Payout div5Winners div5Payout div6Winners
#> 1     8455.3          60     2348.7        1252        189        1786
#> 2     6004.3          71     2080.6        1808      137.3        2352
#> 3     8584.5          60     2384.6        1405      171.1        2079
#> 4     7676.4          62     2751.4        1389      206.3        1872
#>   div6Payout div7Winners div7Payout div8Winners div8Payout rolloverAmount
#> 1      115.2       24664         50       19711         20     3809758.17
#> 2       91.7       35790         50       25981         20     5966533.86
#> 3      100.5       27674         50       21895         20     8055430.87
#> 4        133       28003         50       20651         20              0
#>   rolloverNumber totalPrizePool totalSales estimatedJackpot
#> 1              2     6198036.67    9879655          6000000
#> 2              3     9073426.56   11696905          8000000
#> 3              4    10649716.37   10406895         10000000
#> 4              0     13280236.5   11610950          2000000
#>   guaranteedJackpot drawMachine ballSet    status winners millionairs
#> 1                 0        RNG2     RNG published   47494           0
#> 2                 0        RNG2     RNG published   66033           0
#> 3                 0        RNG2     RNG published   53134           0
#> 4                 0        RNG2     RNG published   52006           1
#>   gpwinners wcwinners ncwinners ecwinners mpwinners lpwinners fswinners
#> 1     47494         0         0         0         0         0         0
#> 2     66033         0         0         0         0         0         0
#> 3     53134         0         0         0         0         0         0
#> 4     52006         0         0         0         0         0         0
#>   kznwinners nwwinners
#> 1          0         0
#> 2          0         0
#> 3          0         0
#> 4          0         0


Solution 2:

The question already has a satisfactory answer (see above) that I've accepted. I simultaneously arrived at a nearly identical solution; I add it here only because it explicitly covers the full range of available draw numbers and will automatically detect the most recent draw number so that the code can be run 'as is' in the future, provided the National Lottery website design remains the same.

theurl <- "https://www.nationallottery.co.za/index.php?task=results.redirectPageURL&Itemid=265&option=com_weaver&controller=lotto-history"
x <- rvest::html_text(xml2::read_html(theurl))
preceding_string <- "LOTTO, LOTTO PLUS 1 AND LOTTO PLUS 2 DRAW "
drawnums <- as.integer(vapply(gregexpr(preceding_string, x)[[1]] + nchar(preceding_string), 
              function(k) substr(x, start = k, stop = k + 3), NA_character_))
drawnumrange <- 1506:max(drawnums)
response <- lapply(drawnumrange, function(d) httr::POST(url = theurl, 
                body = list(gameName = "LOTTO", drawNumber = as.character(d), isAjax = 
                "true"), encode = "form"))
jsondat <- lapply(response, function(r) jsonlite::parse_json(httr::content(r, "text", encoding = "UTF-8"))$data$drawDetails)
lottotable <- as.data.frame(do.call(rbind, jsondat))
numericcols <- c(1, 4:32, 36:37)
lottotable[numericcols] <- sapply(lottotable[numericcols], as.numeric)
xlsx::write.xlsx2(lottotable[1:37], "lottotable.xlsx", row.names = FALSE)
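The step that detects the most recent draw number can be tried offline. The sketch below uses a made-up string in place of the page text that rvest::html_text() would return, and, like the code above, it assumes that draw numbers have exactly four digits and directly follow the heading text.

```r
# Made-up stand-in for the page text; the real page will contain much more.
x <- paste("LOTTO, LOTTO PLUS 1 AND LOTTO PLUS 2 DRAW 2012 results",
           "LOTTO, LOTTO PLUS 1 AND LOTTO PLUS 2 DRAW 2011 results")
preceding_string <- "LOTTO, LOTTO PLUS 1 AND LOTTO PLUS 2 DRAW "

# Position of the first character after each occurrence of the heading text.
starts <- gregexpr(preceding_string, x, fixed = TRUE)[[1]] + nchar(preceding_string)

# Take the four characters at each position and convert them to integers.
drawnums <- as.integer(substring(x, starts, starts + 3))
latest <- max(drawnums)
print(latest)
```

Using fixed = TRUE treats the heading as a literal string rather than a regular expression, which avoids surprises if the heading ever contains special characters.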
