December 12, 2015

Final Project

Importing Data from Wikipedia (XML)

url <- "https://en.wikipedia.org/wiki/List_of_Major_League_Baseball_postseason_teams"
url_source <- readLines(url, encoding = "UTF-8")
playoffs <- data.frame(readHTMLTable(url_source, stringsAsFactors = F, 
                                     header = T)[2])

names(playoffs) <- c("abbr","team","est","appearances")
##                abbr                 team
## 1               ARI Arizona Diamondbacks
## 2               ATL        Boston Braves
## 3  Milwaukee Braves            1953–1965
## 4    Atlanta Braves         1966–present
## 5               BAL     St. Louis Browns
## 6 Baltimore Orioles         1954–present
##                                                                                                                 est
## 1                                                                                                      1998–present
## 2                                                                                                         1903–1952
## 3                                                                                                        1957, 1958
## 4 1969, 1982, 1991, 1992, 1993, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2010, 2012*, 2013
## 5                                                                                                         1903–1953
## 6                                            1966, 1969, 1970, 1971, 1973, 1974, 1979, 1983, 1996, 1997, 2012, 2014
##                    appearances
## 1 1999, 2001, 2002, 2007, 2011
## 2                   1914, 1948
## 3                         <NA>
## 4                         <NA>
## 5                         1944
## 6                         <NA>

The Fix

  • I was unsure how to proceed, and wanted to use a method other than regular expressions

  • I got some help on stack overflow (first time asking a question) and was impressed with the quality and speed of the feedback. I learned a lot just going through the various methods presented.

  • I decided to go with the one that was most efficient and it introduced me to the data.table package.

data.table

  • This code performs a test to see if there are more than 3 characters and if there are, it shifts the values of the columns based on references leaving an NA value in the abbr column.
library(data.table)
library(zoo)
setDT(playoffs)[nchar(abbr) > 3, `:=` (abbr = NA,
                                            team = abbr,
                                            est = team,
                                            appearances = est)
                ]
#Fill in the NA values
playoffs <- playoffs %>% 
  fill(abbr)

New Wiki Data

head(playoffs)
##    abbr                 team          est
## 1:  ARI Arizona Diamondbacks 1998–present
## 2:  ATL        Boston Braves    1903–1952
## 3:  ATL     Milwaukee Braves    1953–1965
## 4:  ATL       Atlanta Braves 1966–present
## 5:  BAL     St. Louis Browns    1903–1953
## 6:  BAL    Baltimore Orioles 1954–present
##                                                                                                          appearances
## 1:                                                                                      1999, 2001, 2002, 2007, 2011
## 2:                                                                                                        1914, 1948
## 3:                                                                                                        1957, 1958
## 4: 1969, 1982, 1991, 1992, 1993, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2010, 2012*, 2013
## 5:                                                                                                              1944
## 6:                                            1966, 1969, 1970, 1971, 1973, 1974, 1979, 1983, 1996, 1997, 2012, 2014

Stringr & Cleanup

#create a list of playoff appearances
split <- str_split(playoffs$appearances, ", ")

#had some issues here putting the data into a data frame when the lists 
#vary in size this code got me on track
max <- max(sapply(split,length))
out <- do.call(rbind, lapply(split,function(x) x[1:max]))

#Create a data frame with playoff appearance add an ID to both dataframes
playoffs2 <- as.data.frame(out) %>%
  mutate(ID = 1:41)
  
playoffs <- playoffs %>% 
  select(abbr, team, est) %>% 
  mutate(ID = 1:41)

Cleanup Continued

#merge dataframes together using ID
combined <- merge(playoffs, playoffs2, by = 'ID', all.x = T)

#Use the gather function to tidy up the data, get rid of the arbitrary n, sort,
#and get rid of rows with NA
ptidy <- gather(combined, "n", "year", 5:56) %>% 
  select(-n) %>% 
  arrange(abbr) %>% 
  na.omit() 

#remove any * values which indicate that the team lost in the wild card game
ptidy$year <- gsub("\\*", "", ptidy$year)  

#Filter the data so that it is more recent 
ptidy2 <- ptidy %>% 
  select(-team, -est, -ID) %>% 
  mutate(playoff = 1) %>% 
  filter(year >= 1950)

Import Team Statistics

#Bring in raw data for team stats
url <- "https://raw.githubusercontent.com/bkreis84/FINAL/master/teams.csv"
teamdata <- read.csv(url, na.strings = "")

#bring in column header csv and add apply them to the dataframe
namesgit <- "https://raw.githubusercontent.com/bkreis84/FINAL/master/names2.csv"
names <- read.csv(namesgit, header = F, stringsAsFactors = F)
names(teamdata) <- c(names)

#Convert factors to numeric values using a function
as.numeric.factor <- function(x) {as.numeric(levels(x))[x]}
teamdata[, c("SO","SB","CS","HBP","SF","DP")] <- 
  sapply(teamdata[, c("SO","SB","CS","HBP","SF","DP")], as.numeric.factor)

#Make it possible to add the WS win value by converting it to a number
teamdata$WSWin <- str_replace_all(teamdata$WSWin, "Y", "1")
teamdata$WSWin <- str_replace_all(teamdata$WSWin, "N", "0")
teamdata$WSWin <- as.integer(teamdata$WSWin)

Cleanup of Team Stats

#Data dates back to the 1870s, we will look at more recent info though.
teamdata <- teamdata %>% 
  filter(yearID >= 1950) %>% 
  arrange(franchID) %>% 
  select(-ID,-teamID, -divID:-Ghome, -DivWin:-LgWin, -name:-teamIDretro)
  
#Need to match up the franchise IDs which are different from the two sources.
puniq <- sort(unique(ptidy$abbr))
tuniq <- sort(unique(teamdata$franchID))
unique <- data.frame(puniq,tuniq)
View(unique)

#My main ID will be the abbreviations so I needed to make them uniform.
#For Example:
teamdata$franchID <- str_replace_all(teamdata$franchID, "(FLA)", replacement = "MIA")

Merge and Import Salary

#Merge using 2 variables
new <- merge(teamdata, ptidy2, by.x=c("franchID", "yearID"), 
             by.y=c("abbr", "year"), all.x = T)

salurl <- "https://raw.githubusercontent.com/bkreis84/FINAL/master/salaries.csv"
salaries <- read.csv(salurl, stringsAsFactors = F)

names(salaries) <- c("id", "year", "abbr", "league", "player", "salary")

nuniq <- sort(unique(new$franchID))
suniq <- sort(unique(salaries$abbr))
compare <- data.frame(cbind(nuniq,suniq))

#Make the IDs uniform. Example:
salaries$abbr <- str_replace_all(salaries$abbr, "(ANA)", 
                                 replacement = "LAA")

Merging

#Because the salary data is listed by player, we need to obtain team totals
salaries2 <- salaries %>% 
  group_by(abbr, year) %>% 
  summarise(sum = sum(salary))

#Time to merge in the salary data
#There will be a number of NA values since our salary data begins with 1980.
all <- merge(new, salaries2, by.x = c("yearID", "franchID"),
               by.y = c("year", "abbr"), all.x = T)

#Convert na values for whether the team made the playoffs to 0 
all$playoff[is.na(all$playoff)] <- 0

Calculate Additional Stats

#BABIP is batting average on balls in play. 
#As far as team stats go, a high number is considered lucky 
all <- all %>% 
  mutate(winperc = W/(W + L)) %>% 
  mutate(BABIP = (H-HR)/(AB-SO-HR+SF)) %>% 
  mutate(OBP = (H + BB + HBP)/(AB + BB + SF + HBP)) %>% 
  mutate(SLG = (H - `2B` - `3B` - HR + `2B` * 2 + `3B` * 3 + HR * 4)/ AB) %>% 
  mutate(OPS = OBP + SLG) %>% 
  mutate(TB = (H - `2B` - `3B` - HR + `2B` * 2 + `3B` * 3 + HR * 4))

#BsR is a stat that predicts the number of runs a team "should" have scored
#based on the types of hits and number of walks.
all <- all %>% 
  mutate(BsR = (((H + BB - HR) * ((1.4*TB - .6*H -3*HR +0.1*BB)*1.02)) /
           (((1.4*TB - .6*H -3*HR +.1*BB)*1.02) + AB - H)) + HR)

runs <- all %>% 
  select(R, BsR) %>% 
  mutate (dif = R - BsR)

Accuracy of BsR

summary(runs$dif)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -82.3200 -15.7300  -0.6748  -0.5433  14.7200  84.1000
  • The BsR is extremely close when look at the combined historical stats with a mean difference between BsR and actual runs of -0.5 since 1950.
  • Of course this statistical test was tweaked until it was this close.
  • BsR forms the basis for many other sabermetric stats for evaluating how different players would impact a teams ability to score runs/win games.
library(ggplot2)  

Hits and Win Percentage

- There are years where the number of hits is lower than expected. The league was on strike for part of the season, so We will remove these years.

Home Runs per AB over time

year.totals <- final %>% 
  group_by(yearID) %>% 
  summarise(HR = sum(HR), AB = sum(AB))

year.totals <- year.totals %>% 
  mutate(HRperAB = HR/AB)

Home Runs per AB

plot(ggplot(year.totals, aes(x=yearID, y=HRperAB)) + geom_point())

- As we would expect the homerun rate has increased over time; however, if our data continued it likely decreased slightly over the last 5 seasons.

Playoff Appearances

Correlation tests on multiple variables

cortests <- final %>% 
  select(winperc, R, H, HR, SB, ERA, SV, FP, BABIP:BsR) %>% 
  na.omit()

list <- list()
list[[1]] <- cortests
corframe <- data.frame(lapply(list,cor))
corframe <- corframe[1,]
corframe <- corframe[,-1]

Correlation

head(corframe)
##                 R         H        HR        SB        ERA        SV
## winperc 0.5217586 0.3768335 0.3278563 0.1439385 -0.5068631 0.4649519
##                FP     BABIP       OBP       SLG       OPS        TB
## winperc 0.3180074 0.2356444 0.4911769 0.4133872 0.4597814 0.3942229
##               BsR
## winperc 0.4655911
  • Other than runs, the highest correlations were found in ERA and OBP (On base percentage)

ERA and Win Percentage

OBP and Win Percentage

Multiple Regression

## 
## Call:
## lm(formula = winperc ~ OBP + OPS + ERA + HR + FP + SV + SB, data = final)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.080182 -0.017632 -0.000218  0.017430  0.091094 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.148e+00  2.227e-01  -5.154 2.93e-07 ***
## OBP          1.339e+00  1.502e-01   8.914  < 2e-16 ***
## OPS          6.404e-01  7.278e-02   8.799  < 2e-16 ***
## ERA         -1.009e-01  1.437e-03 -70.193  < 2e-16 ***
## HR           1.984e-04  4.435e-05   4.474 8.34e-06 ***
## FP           1.101e+00  2.315e-01   4.754 2.21e-06 ***
## SV           1.157e-03  9.357e-05  12.368  < 2e-16 ***
## SB           2.893e-05  1.871e-05   1.546    0.122    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02654 on 1322 degrees of freedom
##   (64 observations deleted due to missingness)
## Multiple R-squared:  0.865,  Adjusted R-squared:  0.8643 
## F-statistic:  1210 on 7 and 1322 DF,  p-value: < 2.2e-16
##                  Estimate   Std. Error    t value     Pr(>|t|)
## (Intercept) -1.148063e+00 2.227451e-01  -5.154158 2.934978e-07
## OBP          1.338589e+00 1.501740e-01   8.913585 1.603115e-18
## OPS          6.403833e-01 7.278167e-02   8.798689 4.239437e-18
## ERA         -1.008566e-01 1.436836e-03 -70.193499 0.000000e+00
## HR           1.984134e-04 4.434938e-05   4.473872 8.341209e-06
## FP           1.100746e+00 2.315459e-01   4.753900 2.213613e-06
## SV           1.157262e-03 9.357249e-05  12.367540 2.521040e-33
## SB           2.893322e-05 1.871220e-05   1.546222 1.222902e-01

Multiple Regression Salary

## 
## Call:
## lm(formula = winperc ~ OBP + ERA + HR + FP + SV + sum, data = salary)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.062529 -0.017624 -0.000626  0.014797  0.082091 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.845e+00  3.366e-01  -5.482 5.86e-08 ***
## OBP          2.432e+00  8.306e-02  29.275  < 2e-16 ***
## ERA         -8.251e-02  2.055e-03 -40.143  < 2e-16 ***
## HR           4.744e-04  3.194e-05  14.852  < 2e-16 ***
## FP           1.758e+00  3.463e-01   5.078 4.89e-07 ***
## SV           2.286e-03  1.440e-04  15.874  < 2e-16 ***
## sum         -5.832e-11  3.187e-11  -1.830   0.0677 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02442 on 703 degrees of freedom
## Multiple R-squared:  0.8762, Adjusted R-squared:  0.8752 
## F-statistic: 829.5 on 6 and 703 DF,  p-value: < 2.2e-16

Salary

## [1] 0.230496

Conclusions

  • Salary is not as significantly related to winning as I originally anticipated.
  • OBP and ERA have a stronger correlation to win percentage.
  • In hindsight, I wish I had looked at individual player statistics, as that seems to be the main purpose of Sabermetrics and would lead to more interesting analysis. Plus, it would help me with my fantasy baseball teams!
  • I might do some BABIP analysis to see what players are most likely for a bounce back if they are under a certain age.