Andrew Tamalunas
August 30, 2018
Scraping Fangraphs is relatively easy. The data is in a convenient table form and can be obtained through simple web scraping methods.
suppressMessages(library(dplyr))
suppressMessages(library(rvest))
### Load data from webpage
url <- "https://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=30&type=2&season=2018&month=0&season1=2018&ind=0&team=0&rost=0&age=0&filter=&players=0&page=1_1000"
l1 <- read_html(url)
l1 <- html_nodes(l1, 'table')
### Extract table from html and remove 'bad' rows
fangraphs <- html_table(l1, fill = TRUE)[[12]]
fangraphs <- fangraphs[-c(1,3),]
head(fangraphs)
## X1 X2 X3 X4 X5 X6 X7 X8 X9
## 2 # Name Team BABIP GB/FB LD% GB% FB% IFFB%
## 4 1 Zach Eflin Phillies .250 5.00 0.0 % 83.3 % 16.7 % 0.0 %
## 5 2 Brent Suter Brewers .211 5.50 27.8 % 61.1 % 11.1 % 0.0 %
## 6 3 Tyler Austin - - - .303 1.19 23.3 % 41.7 % 35.0 % 5.6 %
## 7 4 Francisco Arcia Angels .324 1.88 37.8 % 40.5 % 21.6 % 0.0 %
## 8 5 Derek Fisher Astros .257 1.73 21.1 % 50.0 % 28.9 % 0.0 %
## X10 X11 X12 X13 X14 X15 X16 X17 X18 X19 X20
## 2 HR/FB IFH IFH% BUH BUH% Pull% Cent% Oppo% Soft% Med% Hard%
## 4 50.0 % 1 10.0 % 0 0.0 % 42.1 % 26.3 % 31.6 % 26.3 % 52.6 % 21.1 %
## 5 50.0 % 2 18.2 % 0 0.0 % 23.8 % 19.1 % 57.1 % 28.6 % 33.3 % 38.1 %
## 6 38.9 % 0 0.0 % 0 0.0 % 41.8 % 32.0 % 26.2 % 13.6 % 48.5 % 37.9 %
## 7 37.5 % 0 0.0 % 0 0.0 % 43.2 % 35.1 % 21.6 % 18.9 % 32.4 % 48.7 %
## 8 36.4 % 1 5.3 % 1 100.0 % 46.2 % 53.9 % 0.0 % 12.8 % 46.2 % 41.0 %
For the purposes of this R Markdown document, I removed the rows containing gibberish up front, because the head command gives a much less informative picture than viewing the full data frame. From here on, essentially all of the work is data manipulation.
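The hard-coded position 12 used above reflects the page layout at the time of writing and may shift if the site changes. Below is a minimal sketch of one way to pick the table by its contents instead. It assumes that `tables` stands in for the list returned by html_table(l1, fill = TRUE); the helper name `find_table` and the toy data are hypothetical and are not part of rvest.

```r
# Hypothetical helper: return the position of the first table whose
# column names or cells contain every label in `wanted` (NA if none do).
find_table <- function(tables, wanted) {
  hits <- vapply(tables, function(tbl) {
    cells <- c(names(tbl), unlist(lapply(tbl, as.character), use.names = FALSE))
    all(wanted %in% cells)
  }, logical(1))
  which(hits)[1]
}

# Toy stand-in for the list that html_table() returns (assumption)
tables <- list(
  data.frame(X1 = c("Login", "Register"), stringsAsFactors = FALSE),
  data.frame(X1 = c("#", "1"), X2 = c("Name", "Zach Eflin"),
             X3 = c("Team", "Phillies"), X4 = c("BABIP", ".250"),
             stringsAsFactors = FALSE)
)

idx <- find_table(tables, c("Name", "Team", "BABIP"))
fangraphs_toy <- tables[[idx]]
```

With the real page, the same call could replace the fixed index, provided the labels passed in `wanted` appear only in the leaderboard table.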
The column names now sit in the first row, so we can promote that row to the header. Tables from the web often contain symbols that are awkward to handle in R, so we replace those symbols in the names first and then drop the row that held them.
# Extract column names
columnNames <- as.list(fangraphs[1,])
# Take care of symbols in column names
columnNames <- gsub("%", ".p", columnNames)
columnNames <- gsub("/", "per", columnNames)
# Rename data frame and remove row with column names
colnames(fangraphs) <- columnNames
fangraphs <- fangraphs[-1,]
head(fangraphs)
## # Name Team BABIP GBperFB LD.p GB.p FB.p IFFB.p
## 4 1 Zach Eflin Phillies .250 5.00 0.0 % 83.3 % 16.7 % 0.0 %
## 5 2 Brent Suter Brewers .211 5.50 27.8 % 61.1 % 11.1 % 0.0 %
## 6 3 Tyler Austin - - - .303 1.19 23.3 % 41.7 % 35.0 % 5.6 %
## 7 4 Francisco Arcia Angels .324 1.88 37.8 % 40.5 % 21.6 % 0.0 %
## 8 5 Derek Fisher Astros .257 1.73 21.1 % 50.0 % 28.9 % 0.0 %
## 9 6 J.D. Martinez Red Sox .388 1.43 23.6 % 45.0 % 31.4 % 2.7 %
## HRperFB IFH IFH.p BUH BUH.p Pull.p Cent.p Oppo.p Soft.p Med.p Hard.p
## 4 50.0 % 1 10.0 % 0 0.0 % 42.1 % 26.3 % 31.6 % 26.3 % 52.6 % 21.1 %
## 5 50.0 % 2 18.2 % 0 0.0 % 23.8 % 19.1 % 57.1 % 28.6 % 33.3 % 38.1 %
## 6 38.9 % 0 0.0 % 0 0.0 % 41.8 % 32.0 % 26.2 % 13.6 % 48.5 % 37.9 %
## 7 37.5 % 0 0.0 % 0 0.0 % 43.2 % 35.1 % 21.6 % 18.9 % 32.4 % 48.7 %
## 8 36.4 % 1 5.3 % 1 100.0 % 46.2 % 53.9 % 0.0 % 12.8 % 46.2 % 41.0 %
## 9 33.6 % 14 8.6 % 0 0.0 % 40.3 % 30.0 % 29.7 % 10.0 % 43.9 % 46.1 %
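The header promotion above can also be wrapped in a small reusable function. This is only a sketch on a toy data frame: the name `promote_header` and the toy values are hypothetical, and the symbol replacements simply mirror the two gsub calls used above.

```r
# Hypothetical helper: use the first row as column names, replacing
# "%" with ".p" and "/" with "per", then drop that row.
promote_header <- function(df) {
  header <- as.character(unlist(df[1, ], use.names = FALSE))
  header <- gsub("%", ".p", header, fixed = TRUE)
  header <- gsub("/", "per", header, fixed = TRUE)
  colnames(df) <- header
  df <- df[-1, , drop = FALSE]
  rownames(df) <- NULL  # renumber the remaining rows from 1
  df
}

# Toy table with the header stuck in row 1, as in the scraped data
toy <- data.frame(
  X1 = c("Name", "Zach Eflin", "Brent Suter"),
  X2 = c("GB/FB", "5.00", "5.50"),
  X3 = c("LD%", "0.0 %", "27.8 %"),
  stringsAsFactors = FALSE
)

toy <- promote_header(toy)
names(toy)
```

On the toy table the resulting names are Name, GBperFB and LD.p. Resetting the row names is a design choice; the code above keeps the original row numbers instead.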
Some of the columns still contain percent signs and are stored as character strings. This is another common problem when working with web tables in R. Fortunately, it is also easy to fix.
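# Strip the trailing " %" from every column
fangraphs[] <- sapply(fangraphs, function(x) gsub(" %","",x))
# Convert the stat columns, BABIP (4) through Hard.p (20), to numeric
fangraphs[4:20] <- sapply(fangraphs[4:20],as.numeric)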
head(fangraphs)
## # Name Team BABIP GBperFB LD.p GB.p FB.p IFFB.p HRperFB
## 4 1 Zach Eflin Phillies 0.250 5.00 0.0 83.3 16.7 0.0 50.0
## 5 2 Brent Suter Brewers 0.211 5.50 27.8 61.1 11.1 0.0 50.0
## 6 3 Tyler Austin - - - 0.303 1.19 23.3 41.7 35.0 5.6 38.9
## 7 4 Francisco Arcia Angels 0.324 1.88 37.8 40.5 21.6 0.0 37.5
## 8 5 Derek Fisher Astros 0.257 1.73 21.1 50.0 28.9 0.0 36.4
## 9 6 J.D. Martinez Red Sox 0.388 1.43 23.6 45.0 31.4 2.7 33.6
## IFH IFH.p BUH BUH.p Pull.p Cent.p Oppo.p Soft.p Med.p Hard.p
## 4 1 10.0 0 0 42.1 26.3 31.6 26.3 52.6 21.1
## 5 2 18.2 0 0 23.8 19.1 57.1 28.6 33.3 38.1
## 6 0 0.0 0 0 41.8 32.0 26.2 13.6 48.5 37.9
## 7 0 0.0 0 0 43.2 35.1 21.6 18.9 32.4 48.7
## 8 1 5.3 1 100 46.2 53.9 0.0 12.8 46.2 41.0
## 9 14 8.6 0 0 40.3 30.0 29.7 10.0 43.9 46.1
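The conversion above relies on knowing which column positions hold numbers. As an alternative, here is a minimal base R sketch that converts a column only when all of its values parse as numbers, so no positions need to be hard-coded. The helper name `to_numeric_if_possible` and the toy data are hypothetical.

```r
# Hypothetical helper: strip "%" and convert a column to numeric only
# when every non-missing value parses as a number; otherwise leave it alone.
to_numeric_if_possible <- function(x) {
  cleaned <- trimws(gsub("%", "", x, fixed = TRUE))
  parsed <- suppressWarnings(as.numeric(cleaned))
  if (all(!is.na(parsed) | is.na(x))) parsed else x
}

# Toy table in the same shape as the scraped data
toy <- data.frame(
  Name = c("Zach Eflin", "Brent Suter"),
  BABIP = c(".250", ".211"),
  LD.p = c("0.0 %", "27.8 %"),
  stringsAsFactors = FALSE
)

# lapply keeps each column's own type, unlike sapply, which returns a matrix
toy[] <- lapply(toy, to_numeric_if_possible)
str(toy)
```

Text columns such as Name, or a Team entry like "- - -", fail the parse check and stay as character. A column containing empty strings would also stay as character under this rule, which may or may not be what you want.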
Enjoy your data and support the website! Here is the final script:
suppressMessages(library(dplyr))
suppressMessages(library(rvest))
### Load data from webpage
url <- "https://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=30&type=2&season=2018&month=0&season1=2018&ind=0&team=0&rost=0&age=0&filter=&players=0&page=1_1000"
l1 <- read_html(url)
l1 <- html_nodes(l1, 'table')
### Extract table from html and remove 'bad' rows
fangraphs <- html_table(l1, fill = TRUE)[[12]]
fangraphs <- fangraphs[-c(1,3),]
# Extract column names
columnNames <- as.list(fangraphs[1,])
# Take care of symbols in column names
columnNames <- gsub("%", ".p", columnNames)
columnNames <- gsub("/", "per", columnNames)
# Rename data frame and remove row with column names
colnames(fangraphs) <- columnNames
fangraphs <- fangraphs[-1,]
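# Strip the trailing " %" from every column
fangraphs[] <- sapply(fangraphs, function(x) gsub(" %","",x))
# Convert the stat columns, BABIP (4) through Hard.p (20), to numeric
fangraphs[4:20] <- sapply(fangraphs[4:20],as.numeric)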