With this tutorial, I show how HTML code can be manipulated into a table based on location of text and formatting. This is important when pulling data on a regular basis. With the right approach, the web scraping and manipulation process can be automated through a loop.
Not all webpages can be conveniently scraped. For example: the statcast leaderboards on BaseballSavant. As shown below, the simple method that can be boiled down to a few lines of code does not seem to work.
suppressMessages(library(plyr))
suppressMessages(library(dplyr))
suppressMessages(library(xml2))
suppressMessages(library(rvest))
suppressMessages(library(RCurl))
suppressMessages(library(stringr))
suppressMessages(library(tidyr))
### Make link variable
url1 <- read.csv(url("https://baseballsavant.mlb.com/statcast_leaderboard?year=2018&abs=0&player_type=resp_batter_id"))
file1 <- read_html(url1)
## Error in UseMethod("read_xml"): no applicable method for 'read_xml' applied to an object of class "data.frame"
The error code shows that the error occurs after trying to read the file as html. Fortunately, we still have options.
One option would be to take the link that has been loaded into the variable url1 and extract the rows that contain the data. As a quick example, here’s a few rows of what might be extracted:
url1[5814:5820,]
## [1] {name:Kendrick, Howie attempts:124 max_hit_speed:109.6
## [4] min_hit_speed:42.3 avg_hit_speed:90.2 fbld:95.2
## [7] gb:86.6
## 7579 Levels: - -)} ... z={tablet:k(d+[Tablet|iPad
From here, we would begin the data manipulation process to turn this into a table.
Before I get too far into this method, I would like to note that manual extraction like this would likely require updating regularly, as webpages change over time. This eliminates the convenience of simply running an automated script, since copy + pasting the full table would arguably take a similar amount of time. For this reason, we look to another method.
Upon further inspection of the webpage’s HTML, we see that the entire table is on a row that starts with “var leaderboard_data” (CTRL+F, search for a player’s name). This makes data extraction much easier, as this will likely be the label of the table despite how the rest of the webpage changes over time. Now we can use this to our advantage.
### Retrieve webpage
webpage <- getURL("https://baseballsavant.mlb.com/statcast_leaderboard?year=2018&abs=0&player_type=resp_batter_id")
### Read webpage to variable and close the connection
webpage <- readLines(tc <- textConnection(webpage)); close(tc)
### Grab line number that contains table based on label
linenum <- grep("var leaderboard_data", webpage)
# Remove "L" from number and convert to numeric
linenum <- as.numeric(gsub("L", "", linenum))
### Extract that line into a variable
sub <- webpage[linenum]
### Replace the square brackets and remove space at the end
sub2 <- sub(".+?\\[", "", sub)
sub2 <- trimws(sub2, which = "both")
names(sub2) <- "Col1"
This is where things start to get tricky. We have most of the data cleaned, but now we’re stuck with the variables in a crazy format. The split function can help by splitting the data into rows. Looking at the data, we can see that each player is preceded by the text “name”. For the sake of brevity, I will just show a small portion and follow up by splitting the data.
glimpse(sub2)
## Named chr "{\"name\":\"Stanton, Giancarlo\",\"attempts\":\"341\",\"max_hit_speed\":\"121.7\",\"min_hit_speed\":\"35.0\",\""| __truncated__
## - attr(*, "names")= chr "Col1"
### Split into new rows based on name
splitdf <- str_split(sub2, "name")
splitdf <- str_split(unlist(splitdf), ",")
### Remove first row containing bracket
splitdf <- splitdf[-1]
head(splitdf[1][[1]])
## [1] "\":\"Stanton" " Giancarlo\""
## [3] "\"attempts\":\"341\"" "\"max_hit_speed\":\"121.7\""
## [5] "\"min_hit_speed\":\"35.0\"" "\"avg_hit_speed\":\"94.5\""
So now that we have each player in its own section, we can finish cleaning the data and convert it into a data frame. Since we have a lot of variables within each list, we should make a loop that goes over each player and puts it into a data frame.
### Extract one player then convert to data frame
sub1 <- splitdf[1]
sub2 <- as.data.frame(sub1)
### Label column
colnames(sub2)[1] <- "Col1"
### Separate into multiple columns so the variable names correspond to their respective values
sub3 <- separate(data = sub2, col = "Col1", into = c("varName", "varNum"), sep = ":")
sub3$varName <- gsub("\"", "", sub3$varName)
sub3$varNum <- gsub("\"", "", sub3$varNum)
### Remove rows that don't contain relevant data (we don't need the name, just the MLBid)
sub4 <- subset(sub3, !is.na(varName) & varName != "" & !is.na(varName) & varNum != "" & varName != "rowId")
### Take column names
my_colnames <- c(sub4$varName)
### Transpose the data so it is now in long format and convert back to data frame
transp <- sub4 %>%
select(varNum) %>%
t() %>%
as.data.frame()
### Add column names and remove row names
colnames(transp) <- my_colnames
rownames(transp) <- NULL
Now that we have code that works for a single player, we can convert it into a loop format that goes over every row of the large list and adds each player’s information to a data frame. Don’t worry about the player names for now. I will show an online resource that you can call directly into R and merge with just a couple lines of simple code.
# counter
ct <- 1
df <- data.frame()
# loop
for(i in 1:length(splitdf)){
sub1 <- splitdf[[i]]
sub2 <- as.data.frame(sub1)
names(sub2) <- "Col1"
sub3 <- separate(data = sub2, col = "Col1", into = c("varName", "varNum"), sep = ":")
sub3$varName <- gsub("\"", "", sub3$varName)
sub3$varNum <- gsub("\"", "", sub3$varNum)
sub4 <- subset(sub3, !is.na(varName) & varName != "" & !is.na(varName) & varNum != "" & varName != "rowId")
my_colnames <- c(sub4$varName)
transp <- sub4 %>%
select(varNum) %>%
t() %>%
as.data.frame()
colnames(transp) <- my_colnames
rownames(transp) <- NULL
df <- rbind(transp, df)
# keep track of where the loop is at
print(ct)
ct <- ct + 1
}
head(df)
## attempts max_hit_speed min_hit_speed avg_hit_speed fbld gb
## 1 1 39.1 39.1 39.1 null 39.1
## 2 2 41.0 41.0 41.0 null 41.0
## 3 1 41.0 41.0 41.0 null 41.0
## 4 1 41.0 41.0 41.0 null 41.0
## 5 1 41.0 41.0 41.0 null 41.0
## 6 1 41.0 41.0 41.0 null 41.0
## max_distance avg_distance avg_hr_distance player_id player_type
## 1 2 2 null 607968 resp_batter_id
## 2 null null null 592644 resp_batter_id
## 3 null null null 518397 resp_batter_id
## 4 null null null 571893 resp_batter_id
## 5 null null null 656546 resp_batter_id
## 6 null null null 596001 resp_batter_id
## season resp_batter_id barrels brl_percent brl_pa ev95plus ev95percent
## 1 2018 607968 0 0.0 0.0 0 0.0
## 2 2018 592644 0 0.0 0.0 0 0.0
## 3 2018 518397 0 0.0 0.0 0 0.0
## 4 2018 571893 0 0.0 0.0 0 0.0
## 5 2018 656546 0 0.0 0.0 0 0.0
## 6 2018 596001 0 0.0 0.0 0 0.0
Before the loop, we create an empty data frame and counter to see where we’re at incase the loop takes a while. Essentially what this code does is loop over every player’s element (represented by i) until it reaches the end.
The code should look nearly identical to the code above with a couple exceptions: first, we index the ith element of splitdf. So for the first time the loop occurs, i will be 1. It will be 2 on the second time, and so forth. The second difference is that we added an rbind command. This binds the rows of the transposed data frame to the empty data frame and the once-empty data frame will be added to until the loop is complete.
Our last step is to merge this data frame with the player names. This following file contains names and various id’s of all MLB players from this season. The wide range of id’s and names to choose from allows us to merge and scrape data from the respective websites with less hassle.
### Read file from website
playerNames <- read.csv("http://crunchtimebaseball.com/master.csv") %>%
select(mlb_id, mlb_name)
### Merge player names with MLB ids
finished <- merge(playerNames, df, by.x = "mlb_id", by.y = "player_id")
head(finished, 4)
## mlb_id mlb_name attempts max_hit_speed min_hit_speed
## 1 112526 Bartolo Colon 1 76.9 76.9
## 2 134181 Adrian Beltre 263 107.6 48.1
## 3 400085 Ichiro Suzuki 37 100.1 41.0
## 4 400121 Victor Martinez 369 105.3 48.1
## avg_hit_speed fbld gb max_distance avg_distance avg_hr_distance
## 1 76.9 null 76.9 4 4 null
## 2 88.0 91.1 85.3 420 173 399
## 3 80.2 85.4 77.6 351 126 null
## 4 87.2 90.2 85.2 401 180 389
## player_type season resp_batter_id barrels brl_percent brl_pa ev95plus
## 1 resp_batter_id 2018 112526 0 0.0 0.0 0
## 2 resp_batter_id 2018 134181 11 4.2 2.9 94
## 3 resp_batter_id 2018 400085 0 0.0 0.0 1
## 4 resp_batter_id 2018 400121 14 3.8 3.2 106
## ev95percent
## 1 0.0
## 2 35.7
## 3 2.7
## 4 28.7
suppressMessages(library(plyr))
suppressMessages(library(dplyr))
suppressMessages(library(xml2))
suppressMessages(library(rvest))
suppressMessages(library(RCurl))
suppressMessages(library(stringr))
suppressMessages(library(tidyr))
webpage <- getURL("https://baseballsavant.mlb.com/statcast_leaderboard?year=2018&abs=0&player_type=resp_batter_id")
### Read webpage to variable and close the connection
webpage <- readLines(tc <- textConnection(webpage)); close(tc)
### Grab line number that contains table based on label
linenum <- grep("var leaderboard_data", webpage)
# Remove "L" from number and convert to numeric
linenum <- as.numeric(gsub("L", "", linenum))
### Extract that line into a variable
sub <- webpage[linenum]
### Replace the square brackets and remove space at the end
sub2 <- sub(".+?\\[", "", sub)
sub2 <- trimws(sub2, which = "both")
names(sub2) <- "Col1"
splitdf <- str_split(sub2, "name")
splitdf <- str_split(unlist(splitdf), ",")
### Remove first row containing bracket
splitdf <- splitdf[-1]
# counter
ct <- 1
df <- data.frame()
# loop
for(i in 1:length(splitdf)){
sub1 <- splitdf[[i]]
sub2 <- as.data.frame(sub1)
names(sub2) <- "Col1"
sub3 <- separate(data = sub2, col = "Col1", into = c("varName", "varNum"), sep = ":")
sub3$varName <- gsub("\"", "", sub3$varName)
sub3$varNum <- gsub("\"", "", sub3$varNum)
sub4 <- subset(sub3, !is.na(varName) & varName != "" & !is.na(varName) & varNum != "" & varName != "rowId")
my_colnames <- c(sub4$varName)
transp <- sub4 %>%
select(varNum) %>%
t() %>%
as.data.frame()
colnames(transp) <- my_colnames
rownames(transp) <- NULL
df <- rbind(transp, df)
# keep track of where the loop is at
print(ct)
ct <- ct + 1
}
### Read player names file from website
playerNames <- read.csv("http://crunchtimebaseball.com/master.csv") %>%
select(mlb_id, mlb_name)
### Merge player names with MLB ids
finished <- merge(playerNames, df, by.x = "mlb_id", by.y = "player_id")