Scraping and Manipulating Tricky Web Tables

The Purpose

With this tutorial, I show how HTML code can be manipulated into a table based on location of text and formatting. This is important when pulling data on a regular basis. With the right approach, the web scraping and manipulation process can be automated through a loop.

The Problem

Not all webpages can be conveniently scraped. For example: the statcast leaderboards on BaseballSavant. As shown below, the simple method that can be boiled down to a few lines of code does not seem to work.

suppressMessages(library(plyr))
suppressMessages(library(dplyr))
suppressMessages(library(xml2))
suppressMessages(library(rvest))
suppressMessages(library(RCurl))
suppressMessages(library(stringr))
suppressMessages(library(tidyr))

### Make link variable
url1 <- read.csv(url("https://baseballsavant.mlb.com/statcast_leaderboard?year=2018&abs=0&player_type=resp_batter_id")) 

file1 <- read_html(url1)

## Error in UseMethod("read_xml"): no applicable method for 'read_xml' applied to an object of class "data.frame"

The error code shows that the error occurs after trying to read the file as html. Fortunately, we still have options.

One option would be to take the link that has been loaded into the variable url1 and extract the rows that contain the data. As a quick example, here’s a few rows of what might be extracted:

url1[5814:5820,]

## [1] {name:Kendrick, Howie attempts:124          max_hit_speed:109.6  
## [4] min_hit_speed:42.3    avg_hit_speed:90.2    fbld:95.2            
## [7] gb:86.6              
## 7579 Levels:  - -)} ... z={tablet:k(d+[Tablet|iPad

From here, we would begin the data manipulation process to turn this into a table.

Before I get too far into this method, I would like to note that manual extraction like this would likely require updating regularly, as webpages change over time. This eliminates the convenience of simply running an automated script, since copy + pasting the full table would arguably take a similar amount of time. For this reason, we look to another method.

The Solution

Upon further inspection of the webpage’s HTML, we see that the entire table is on a row that starts with “var leaderboard_data” (CTRL+F, search for a player’s name). This makes data extraction much easier, as this will likely be the label of the table despite how the rest of the webpage changes over time. Now we can use this to our advantage.

### Retrieve webpage
webpage <- getURL("https://baseballsavant.mlb.com/statcast_leaderboard?year=2018&abs=0&player_type=resp_batter_id")
### Read webpage to variable and close the connection
webpage <- readLines(tc <- textConnection(webpage)); close(tc)

### Grab line number that contains table based on label
linenum <- grep("var leaderboard_data", webpage)
# Remove "L" from number and convert to numeric
linenum <- as.numeric(gsub("L", "", linenum))

### Extract that line into a variable
sub <- webpage[linenum]

### Replace the square brackets and remove space at the end
sub2 <- sub(".+?\\[", "", sub)
sub2 <- trimws(sub2, which = "both")
names(sub2) <- "Col1"

This is where things start to get tricky. We have most of the data cleaned, but now we’re stuck with the variables in a crazy format. The split function can help by splitting the data into rows. Looking at the data, we can see that each player is preceded by the text “name”. For the sake of brevity, I will just show a small portion and follow up by splitting the data.

glimpse(sub2)

##  Named chr "{\"name\":\"Stanton, Giancarlo\",\"attempts\":\"341\",\"max_hit_speed\":\"121.7\",\"min_hit_speed\":\"35.0\",\""| __truncated__
##  - attr(*, "names")= chr "Col1"

### Split into new rows based on name
splitdf <- str_split(sub2, "name") 
splitdf <- str_split(unlist(splitdf), ",")
### Remove first row containing bracket
splitdf <- splitdf[-1]

head(splitdf[1][[1]])

## [1] "\":\"Stanton"                " Giancarlo\""               
## [3] "\"attempts\":\"341\""        "\"max_hit_speed\":\"121.7\""
## [5] "\"min_hit_speed\":\"35.0\""  "\"avg_hit_speed\":\"94.5\""

The Loop

So now that we have each player in its own section, we can finish cleaning the data and convert it into a data frame. Since we have a lot of variables within each list, we should make a loop that goes over each player and puts it into a data frame.

### Extract one player then convert to data frame
sub1 <- splitdf[1]
sub2 <- as.data.frame(sub1)
### Label column
colnames(sub2)[1] <- "Col1"

### Separate into multiple columns so the variable names correspond to their respective values
sub3 <- separate(data = sub2, col = "Col1", into = c("varName", "varNum"), sep = ":")
sub3$varName <- gsub("\"", "", sub3$varName)
sub3$varNum <- gsub("\"", "", sub3$varNum)

### Remove rows that don't contain relevant data (we don't need the name, just the MLBid)
sub4 <- subset(sub3, !is.na(varName) & varName != "" & !is.na(varName) & varNum != "" & varName != "rowId")
### Take column names
my_colnames <- c(sub4$varName)

### Transpose the data so it is now in long format and convert back to data frame
transp <- sub4 %>%
    select(varNum) %>%
    t() %>%
    as.data.frame()

### Add column names and remove row names
colnames(transp) <- my_colnames
rownames(transp) <- NULL

Now that we have code that works for a single player, we can convert it into a loop format that goes over every row of the large list and adds each player’s information to a data frame. Don’t worry about the player names for now. I will show an online resource that you can call directly into R and merge with just a couple lines of simple code.

# counter
ct <- 1
df <- data.frame()

# loop
for(i in 1:length(splitdf)){
  sub1 <- splitdf[[i]]
  sub2 <- as.data.frame(sub1)
  names(sub2) <- "Col1"
  
  sub3 <- separate(data = sub2, col = "Col1", into = c("varName", "varNum"), sep = ":")
  sub3$varName <- gsub("\"", "", sub3$varName)
  sub3$varNum <- gsub("\"", "", sub3$varNum)
  
  sub4 <- subset(sub3, !is.na(varName) & varName != "" & !is.na(varName) & varNum != "" & varName != "rowId")
  my_colnames <- c(sub4$varName)
  
  transp <- sub4 %>%
    select(varNum) %>%
    t() %>%
    as.data.frame()
  
  colnames(transp) <- my_colnames
  rownames(transp) <- NULL
  
  df <- rbind(transp, df)
  
  # keep track of where the loop is at
  print(ct)
  ct <- ct + 1
}

head(df)

##   attempts max_hit_speed min_hit_speed avg_hit_speed fbld   gb
## 1        1          39.1          39.1          39.1 null 39.1
## 2        2          41.0          41.0          41.0 null 41.0
## 3        1          41.0          41.0          41.0 null 41.0
## 4        1          41.0          41.0          41.0 null 41.0
## 5        1          41.0          41.0          41.0 null 41.0
## 6        1          41.0          41.0          41.0 null 41.0
##   max_distance avg_distance avg_hr_distance player_id    player_type
## 1            2            2            null    607968 resp_batter_id
## 2         null         null            null    592644 resp_batter_id
## 3         null         null            null    518397 resp_batter_id
## 4         null         null            null    571893 resp_batter_id
## 5         null         null            null    656546 resp_batter_id
## 6         null         null            null    596001 resp_batter_id
##   season resp_batter_id barrels brl_percent brl_pa ev95plus ev95percent
## 1   2018         607968       0         0.0    0.0        0         0.0
## 2   2018         592644       0         0.0    0.0        0         0.0
## 3   2018         518397       0         0.0    0.0        0         0.0
## 4   2018         571893       0         0.0    0.0        0         0.0
## 5   2018         656546       0         0.0    0.0        0         0.0
## 6   2018         596001       0         0.0    0.0        0         0.0

Before the loop, we create an empty data frame and counter to see where we’re at incase the loop takes a while. Essentially what this code does is loop over every player’s element (represented by i) until it reaches the end.

The code should look nearly identical to the code above with a couple exceptions: first, we index the ith element of splitdf. So for the first time the loop occurs, i will be 1. It will be 2 on the second time, and so forth. The second difference is that we added an rbind command. This binds the rows of the transposed data frame to the empty data frame and the once-empty data frame will be added to until the loop is complete.

Our last step is to merge this data frame with the player names. This following file contains names and various id’s of all MLB players from this season. The wide range of id’s and names to choose from allows us to merge and scrape data from the respective websites with less hassle.

### Read file from website
playerNames <- read.csv("http://crunchtimebaseball.com/master.csv") %>%
  select(mlb_id, mlb_name)

### Merge player names with MLB ids
finished <- merge(playerNames, df, by.x = "mlb_id", by.y = "player_id")

head(finished, 4)

##   mlb_id        mlb_name attempts max_hit_speed min_hit_speed
## 1 112526   Bartolo Colon        1          76.9          76.9
## 2 134181   Adrian Beltre      263         107.6          48.1
## 3 400085   Ichiro Suzuki       37         100.1          41.0
## 4 400121 Victor Martinez      369         105.3          48.1
##   avg_hit_speed fbld   gb max_distance avg_distance avg_hr_distance
## 1          76.9 null 76.9            4            4            null
## 2          88.0 91.1 85.3          420          173             399
## 3          80.2 85.4 77.6          351          126            null
## 4          87.2 90.2 85.2          401          180             389
##      player_type season resp_batter_id barrels brl_percent brl_pa ev95plus
## 1 resp_batter_id   2018         112526       0         0.0    0.0        0
## 2 resp_batter_id   2018         134181      11         4.2    2.9       94
## 3 resp_batter_id   2018         400085       0         0.0    0.0        1
## 4 resp_batter_id   2018         400121      14         3.8    3.2      106
##   ev95percent
## 1         0.0
## 2        35.7
## 3         2.7
## 4        28.7

And now you have a clean data table! The full code below will allow you to retrieve updated stats with a single click!

suppressMessages(library(plyr))
suppressMessages(library(dplyr))
suppressMessages(library(xml2))
suppressMessages(library(rvest))
suppressMessages(library(RCurl))
suppressMessages(library(stringr))
suppressMessages(library(tidyr))

webpage <- getURL("https://baseballsavant.mlb.com/statcast_leaderboard?year=2018&abs=0&player_type=resp_batter_id")
### Read webpage to variable and close the connection
webpage <- readLines(tc <- textConnection(webpage)); close(tc)

### Grab line number that contains table based on label
linenum <- grep("var leaderboard_data", webpage)
# Remove "L" from number and convert to numeric
linenum <- as.numeric(gsub("L", "", linenum))

### Extract that line into a variable
sub <- webpage[linenum]

### Replace the square brackets and remove space at the end
sub2 <- sub(".+?\\[", "", sub)
sub2 <- trimws(sub2, which = "both")
names(sub2) <- "Col1"

splitdf <- str_split(sub2, "name") 
splitdf <- str_split(unlist(splitdf), ",")
### Remove first row containing bracket
splitdf <- splitdf[-1]

# counter
ct <- 1
df <- data.frame()

# loop
for(i in 1:length(splitdf)){
  sub1 <- splitdf[[i]]
  sub2 <- as.data.frame(sub1)
  names(sub2) <- "Col1"
  
  sub3 <- separate(data = sub2, col = "Col1", into = c("varName", "varNum"), sep = ":")
  sub3$varName <- gsub("\"", "", sub3$varName)
  sub3$varNum <- gsub("\"", "", sub3$varNum)
  
  sub4 <- subset(sub3, !is.na(varName) & varName != "" & !is.na(varName) & varNum != "" & varName != "rowId")
  my_colnames <- c(sub4$varName)
  
  transp <- sub4 %>%
    select(varNum) %>%
    t() %>%
    as.data.frame()
  
  colnames(transp) <- my_colnames
  rownames(transp) <- NULL
  
  df <- rbind(transp, df)
  
  # keep track of where the loop is at
  print(ct)
  ct <- ct + 1
}

### Read player names file from website
playerNames <- read.csv("http://crunchtimebaseball.com/master.csv") %>%
  select(mlb_id, mlb_name)

### Merge player names with MLB ids
finished <- merge(playerNames, df, by.x = "mlb_id", by.y = "player_id")

Scraping and Manipulating Tricky Web Tables

Andrew Tamalunas

August 15, 2018

The Purpose

The Problem

The Solution

The Loop