Scraping Fangraphs Example

Andrew Tamalunas
August 30, 2018

R Markdown

Scraping Fangraphs is relatively easy. The leaderboard data sits in an HTML table, so it can be pulled into R with a few lines of rvest code.

suppressMessages(library(dplyr))
suppressMessages(library(rvest))

### Load data from webpage

url <- "https://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=30&type=2&season=2018&month=0&season1=2018&ind=0&team=0&rost=0&age=0&filter=&players=0&page=1_1000"

l1 <- read_html(url)
l1 <- html_nodes(l1, 'table')

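### Extract table from html and remove 'bad' rows
# Table 12 was the leaderboard when this was written; the index depends on the page layout
fangraphs <- html_table(l1, fill = TRUE)[[12]]
# Rows 1 and 3 are not data; row 2 holds the column names
fangraphs <- fangraphs[-c(1,3),]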

head(fangraphs)
##   X1              X2       X3    X4    X5     X6     X7     X8    X9
## 2  #            Name     Team BABIP GB/FB    LD%    GB%    FB% IFFB%
## 4  1      Zach Eflin Phillies  .250  5.00  0.0 % 83.3 % 16.7 % 0.0 %
## 5  2     Brent Suter  Brewers  .211  5.50 27.8 % 61.1 % 11.1 % 0.0 %
## 6  3    Tyler Austin    - - -  .303  1.19 23.3 % 41.7 % 35.0 % 5.6 %
## 7  4 Francisco Arcia   Angels  .324  1.88 37.8 % 40.5 % 21.6 % 0.0 %
## 8  5    Derek Fisher   Astros  .257  1.73 21.1 % 50.0 % 28.9 % 0.0 %
##      X10 X11    X12 X13     X14    X15    X16    X17    X18    X19    X20
## 2  HR/FB IFH   IFH% BUH    BUH%  Pull%  Cent%  Oppo%  Soft%   Med%  Hard%
## 4 50.0 %   1 10.0 %   0   0.0 % 42.1 % 26.3 % 31.6 % 26.3 % 52.6 % 21.1 %
## 5 50.0 %   2 18.2 %   0   0.0 % 23.8 % 19.1 % 57.1 % 28.6 % 33.3 % 38.1 %
## 6 38.9 %   0  0.0 %   0   0.0 % 41.8 % 32.0 % 26.2 % 13.6 % 48.5 % 37.9 %
## 7 37.5 %   0  0.0 %   0   0.0 % 43.2 % 35.1 % 21.6 % 18.9 % 32.4 % 48.7 %
## 8 36.4 %   1  5.3 %   1 100.0 % 46.2 % 53.9 %  0.0 % 12.8 % 46.2 % 41.0 %

For this R Markdown document, I removed the rows containing gibberish (rows 1 and 3 of the scraped table), because with them in place head() gives a far less informative picture than viewing the full data frame. From here, essentially all of the remaining work is data manipulation.
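The index in `[[12]]` depends on how the page was laid out when it was scraped, so it may stop pointing at the leaderboard if the site changes. Below is a minimal sketch of picking the table by its contents instead, assuming the leaderboard is the only table whose cells include both "Name" and "BABIP". The helper `find_table` and the small `tables` list are hypothetical and stand in for the list that `html_table()` returns.

```r
# Hypothetical stand-in for the list returned by html_table(l1, fill = TRUE)
tables <- list(
  data.frame(X1 = "Menu", X2 = "Login", stringsAsFactors = FALSE),
  data.frame(
    X1 = c("#", "1"),
    X2 = c("Name", "Player A"),
    X3 = c("BABIP", ".250"),
    stringsAsFactors = FALSE
  )
)

# Hypothetical helper: return the first table whose cells contain every label
find_table <- function(tables, labels) {
  has_labels <- vapply(
    tables,
    function(tbl) all(labels %in% unlist(lapply(tbl, as.character))),
    logical(1)
  )
  if (!any(has_labels)) {
    stop("No table contains all of: ", paste(labels, collapse = ", "))
  }
  tables[[which(has_labels)[1]]]
}

leaderboard <- find_table(tables, c("Name", "BABIP"))
head(leaderboard)
```

If no table matches, the helper stops with an error rather than silently returning the wrong table, which is usually the more useful failure when a page layout changes.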

The column names now sit in the first row of the data frame, so we can promote them to the header. Tables from the web often contain symbols such as % and /, which make the columns awkward to reference in R. We replace these symbols in the names, assign the names to the data frame, then drop the row that held them.

# Extract column names
columnNames <- as.list(fangraphs[1,])
# Take care of symbols in column names
columnNames <- gsub("%", ".p", columnNames)
columnNames <- gsub("/", "per", columnNames)

# Rename data frame and remove row with column names
colnames(fangraphs) <- columnNames
fangraphs <- fangraphs[-1,]

head(fangraphs)
##   #            Name     Team BABIP GBperFB   LD.p   GB.p   FB.p IFFB.p
## 4 1      Zach Eflin Phillies  .250    5.00  0.0 % 83.3 % 16.7 %  0.0 %
## 5 2     Brent Suter  Brewers  .211    5.50 27.8 % 61.1 % 11.1 %  0.0 %
## 6 3    Tyler Austin    - - -  .303    1.19 23.3 % 41.7 % 35.0 %  5.6 %
## 7 4 Francisco Arcia   Angels  .324    1.88 37.8 % 40.5 % 21.6 %  0.0 %
## 8 5    Derek Fisher   Astros  .257    1.73 21.1 % 50.0 % 28.9 %  0.0 %
## 9 6   J.D. Martinez  Red Sox  .388    1.43 23.6 % 45.0 % 31.4 %  2.7 %
##   HRperFB IFH  IFH.p BUH   BUH.p Pull.p Cent.p Oppo.p Soft.p  Med.p Hard.p
## 4  50.0 %   1 10.0 %   0   0.0 % 42.1 % 26.3 % 31.6 % 26.3 % 52.6 % 21.1 %
## 5  50.0 %   2 18.2 %   0   0.0 % 23.8 % 19.1 % 57.1 % 28.6 % 33.3 % 38.1 %
## 6  38.9 %   0  0.0 %   0   0.0 % 41.8 % 32.0 % 26.2 % 13.6 % 48.5 % 37.9 %
## 7  37.5 %   0  0.0 %   0   0.0 % 43.2 % 35.1 % 21.6 % 18.9 % 32.4 % 48.7 %
## 8  36.4 %   1  5.3 %   1 100.0 % 46.2 % 53.9 %  0.0 % 12.8 % 46.2 % 41.0 %
## 9  33.6 %  14  8.6 %   0   0.0 % 40.3 % 30.0 % 29.7 % 10.0 % 43.9 % 46.1 %
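To see why the symbols are worth replacing, here is a small sketch on a hypothetical one-row data frame (`raw` is made up for illustration). Names containing % or / are not syntactic in R, so they must be quoted with backticks every time they are used. Base R's `make.names()` would also fix this, but it replaces each invalid character with a dot and so loses the distinction between the two symbols.

```r
# Hypothetical one-row table with raw, Fangraphs-style column names
raw <- data.frame(a = "27.8 %", b = "5.50", stringsAsFactors = FALSE)
colnames(raw) <- c("LD%", "GB/FB")

# Non-syntactic names need backticks
raw$`LD%`

# The same substitutions used in the walkthrough
clean <- raw
colnames(clean) <- gsub("/", "per", gsub("%", ".p", colnames(clean)))
clean$LD.p

# Base R alternative: every invalid character becomes "."
generic <- make.names(colnames(raw))
generic
```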

The values in some columns still carry percent signs, and the stat columns are stored as character rather than numeric. This is another common problem when working with web tables in R, and it is just as easy to fix.

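# Strip the trailing " %" from every value
fangraphs[] <- lapply(fangraphs, function(x) gsub(" %", "", x))
# Convert every stat column (BABIP through Hard.p) to numeric
fangraphs[4:ncol(fangraphs)] <- lapply(fangraphs[4:ncol(fangraphs)], as.numeric)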

head(fangraphs)
##   #            Name     Team BABIP GBperFB LD.p GB.p FB.p IFFB.p HRperFB
## 4 1      Zach Eflin Phillies 0.250    5.00  0.0 83.3 16.7    0.0    50.0
## 5 2     Brent Suter  Brewers 0.211    5.50 27.8 61.1 11.1    0.0    50.0
## 6 3    Tyler Austin    - - - 0.303    1.19 23.3 41.7 35.0    5.6    38.9
## 7 4 Francisco Arcia   Angels 0.324    1.88 37.8 40.5 21.6    0.0    37.5
## 8 5    Derek Fisher   Astros 0.257    1.73 21.1 50.0 28.9    0.0    36.4
## 9 6   J.D. Martinez  Red Sox 0.388    1.43 23.6 45.0 31.4    2.7    33.6
##   IFH IFH.p BUH BUH.p Pull.p Cent.p Oppo.p Soft.p Med.p Hard.p
## 4   1  10.0   0     0   42.1   26.3   31.6   26.3  52.6   21.1
## 5   2  18.2   0     0   23.8   19.1   57.1   28.6  33.3   38.1
## 6   0   0.0   0     0   41.8   32.0   26.2   13.6  48.5   37.9
## 7   0   0.0   0     0   43.2   35.1   21.6   18.9  32.4   48.7
## 8   1   5.3   1   100   46.2   53.9    0.0   12.8  46.2   41.0
## 9  14   8.6   0     0   40.3   30.0   29.7   10.0  43.9   46.1
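Selecting the stat columns by position works only while the column order stays the same. A minimal sketch of an alternative, assuming that any column whose values all parse as numbers should become numeric: the helper `to_numeric_if_possible` and the miniature `df` are hypothetical, written only to illustrate the idea.

```r
# Hypothetical miniature of the scraped table: every column is character
df <- data.frame(
  Name  = c("Player A", "Player B"),
  Team  = c("Phillies", "- - -"),
  BABIP = c(".250", ".303"),
  LD.p  = c("0.0 %", "23.3 %"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: strip a trailing "%" and convert the column only if
# every non-missing value parses as a number; otherwise return it unchanged
to_numeric_if_possible <- function(x) {
  stripped <- sub(" ?%$", "", x)
  parsed <- suppressWarnings(as.numeric(stripped))
  if (all(!is.na(parsed) | is.na(x))) parsed else x
}

df[] <- lapply(df, to_numeric_if_possible)
sapply(df, class)
```

Checking `sapply(df, class)` after a conversion like this is a quick way to confirm that no stat column was left behind as character.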

Enjoy your data and support the website! Here is the final script:

suppressMessages(library(dplyr))
suppressMessages(library(rvest))

### Load data from webpage

url <- "https://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=30&type=2&season=2018&month=0&season1=2018&ind=0&team=0&rost=0&age=0&filter=&players=0&page=1_1000"

l1 <- read_html(url)
l1 <- html_nodes(l1, 'table')

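### Extract table from html and remove 'bad' rows
# Table 12 was the leaderboard when this was written; the index depends on the page layout
fangraphs <- html_table(l1, fill = TRUE)[[12]]
# Rows 1 and 3 are not data; row 2 holds the column names
fangraphs <- fangraphs[-c(1,3),]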

# Extract column names
columnNames <- as.list(fangraphs[1,])
# Take care of symbols in column names
columnNames <- gsub("%", ".p", columnNames)
columnNames <- gsub("/", "per", columnNames)

# Rename data frame and remove row with column names
colnames(fangraphs) <- columnNames
fangraphs <- fangraphs[-1,]

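# Strip the trailing " %" from every value
fangraphs[] <- lapply(fangraphs, function(x) gsub(" %", "", x))
# Convert every stat column (BABIP through Hard.p) to numeric
fangraphs[4:ncol(fangraphs)] <- lapply(fangraphs[4:ncol(fangraphs)], as.numeric)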