# Read in data
baseball = read.csv("baseball.csv")
str(baseball)
## 'data.frame':    1232 obs. of  15 variables:
##  $ Team        : chr  "ARI" "ATL" "BAL" "BOS" ...
##  $ League      : chr  "NL" "NL" "AL" "AL" ...
##  $ Year        : int  2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
##  $ RS          : int  734 700 712 734 613 748 669 667 758 726 ...
##  $ RA          : int  688 600 705 806 759 676 588 845 890 670 ...
##  $ W           : int  81 94 93 69 61 85 97 68 64 88 ...
##  $ OBP         : num  0.328 0.32 0.311 0.315 0.302 0.318 0.315 0.324 0.33 0.335 ...
##  $ SLG         : num  0.418 0.389 0.417 0.415 0.378 0.422 0.411 0.381 0.436 0.422 ...
##  $ BA          : num  0.259 0.247 0.247 0.26 0.24 0.255 0.251 0.251 0.274 0.268 ...
##  $ Playoffs    : int  0 1 1 0 0 0 1 0 0 1 ...
##  $ RankSeason  : int  NA 4 5 NA NA NA 2 NA NA 6 ...
##  $ RankPlayoffs: int  NA 5 4 NA NA NA 4 NA NA 2 ...
##  $ G           : int  162 162 162 162 162 162 162 162 162 162 ...
##  $ OOBP        : num  0.317 0.306 0.315 0.331 0.335 0.319 0.305 0.336 0.357 0.314 ...
##  $ OSLG        : num  0.415 0.378 0.403 0.428 0.424 0.405 0.39 0.43 0.47 0.402 ...

MLB Glossary

1.Team: A code for the name of the team 2.League: The Major League Baseball league the team belongs to, either AL (American League) or NL (National League) 3. Year: The year of the corresponding record 4. RS: The number of runs scored by the team in that year 5. RA: The number of runs allowed by the team in that year 6. W: The number of regular season wins by the team in that year 7. OBP: The on-base percentage of the team in that year 8. SLG: The slugging percentage of the team in that year 9. BA: The batting average of the team in that year 10.Playoffs: Whether the team made the playoffs in that year (1 for yes, 0 for no) 11. RankSeason: Among the playoff teams in that year, the ranking of their regular season records (1 is best) 12. RankPlayoffs: Among the playoff teams in that year, how well they fared in the playoffs. The team winning the World Series gets a RankPlayoffs of 1. 13. G: The number of games a team played in that year 14. OOBP: The team’s opponents’ on-base percentage in that year 15. OSLG: The team’s opponents’ slugging percentage in that year

#Each row in the baseball dataset represents a team in a particular year.How many team/year pairs are there in the whole dataset?
nrow(baseball)
## [1] 1232

The number of row is 1232 but there are many teams repited

length(table(baseball$Year))
## [1] 47

We have data from 47 years

baseball = subset(baseball, Playoffs == 1)
nrow(baseball)
## [1] 244

Limiting to Teams Making the Playoffs Because we’re only analyzing teams that made the playoffs, we can use the subset() function to replace baseball with a data frame limited to teams that made the playoffs (so our subsetted data frame should still be called “baseball”). How many team/year pairs are included in the new dataset?

baseball = subset(baseball, Playoffs == 1)
nrow(baseball)
## [1] 244

The teams that have made it to playoff is 244

#Through the years, different numbers of teams have been invited to the playoffs.
table(baseball$Year)
## 
## 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1973 1974 1975 1976 1977 1978 
##    2    2    2    2    2    2    2    4    4    4    4    4    4    4    4    4 
## 1979 1980 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1996 1997 
##    4    4    4    4    4    4    4    4    4    4    4    4    4    4    8    8 
## 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 
##    8    8    8    8    8    8    8    8    8    8    8    8    8    8   10
table(table(baseball$Year))
## 
##  2  4  8 10 
##  7 23 16  1
PlayoffTable = table(baseball$Year)

#You can output the table with the following command:

PlayoffTable
## 
## 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1973 1974 1975 1976 1977 1978 
##    2    2    2    2    2    2    2    4    4    4    4    4    4    4    4    4 
## 1979 1980 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1996 1997 
##    4    4    4    4    4    4    4    4    4    4    4    4    4    4    8    8 
## 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 
##    8    8    8    8    8    8    8    8    8    8    8    8    8    8   10
str(names(PlayoffTable))
##  chr [1:47] "1962" "1963" "1964" "1965" "1966" "1967" "1968" "1969" "1970" ...
 #Which function call returns the number of playoff teams in 1990 and 2001?
PlayoffTable[c("1990", "2001")]
## 
## 1990 2001 
##    4    8
#Putting it all together, we want to look up the number of teams in the playoffs for each team/year pair in the dataset, and store it as a new variable named NumCompetitors in the baseball data frame.
baseball$NumCompetitors = PlayoffTable[as.character(baseball$Year)]
baseball$NumCompetitors
##   [1] 10 10 10 10 10 10 10 10 10 10  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8
##  [26]  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8
##  [51]  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8
##  [76]  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8
## [101]  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8
## [126]  8  8  8  8  8  8  8  8  8  8  8  8  8  4  4  4  4  4  4  4  4  4  4  4  4
## [151]  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4
## [176]  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4
## [201]  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4
## [226]  4  4  4  4  4  2  2  2  2  2  2  2  2  2  2  2  2  2  2
#How many playoff team/year pairs are there in our dataset from years where 8 teams were invited to the playoffs?
table(baseball$NumCompetitors)
## 
##   2   4   8  10 
##  14  92 128  10
#How many observations do we have in our dataset where a team did NOT win the World Series?
baseball$WorldSeries = as.numeric(baseball$RankPlayoffs == 1)
table(baseball$WorldSeries)
## 
##   0   1 
## 197  47

Bivariate Models for Predicting World Series Winner When we’re not sure which of our variables are useful in predicting a particular outcome, it’s often helpful to build bivariate models, which are models that predict the outcome using a single independent variable. Which of the following variables is a significant predictor of the WorldSeries variable in a bivariate logistic regression model? To determine significance, remember to look at the stars in the summary output of the model. We’ll define an independent variable as significant if there is at least one star at the end of the coefficients row for that variable (this is equivalent to the probability column having a value smaller than 0.05). Note that you have to build 12 models to answer this question! Use the entire dataset baseball to build the models

#Which of the following variables is a significant predictor of the WorldSeries variable in a bivariate logistic regression model?
#Varibales to use as predictors for each bivariate model(Year, RS, RA, W, OBP, SLG, BA, RankSeason, OOBP,OSLG, NumCompetitors, League)
model1<-glm(WorldSeries~Year, data=baseball, family="binomial")
summary(model1)
## 
## Call:
## glm(formula = WorldSeries ~ Year, family = "binomial", data = baseball)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept) 72.23602   22.64409    3.19  0.00142 **
## Year        -0.03700    0.01138   -3.25  0.00115 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 239.12  on 243  degrees of freedom
## Residual deviance: 228.35  on 242  degrees of freedom
## AIC: 232.35
## 
## Number of Fisher Scoring iterations: 4
model2<-glm(WorldSeries~RS, data=baseball, family="binomial")
summary(model2)
## 
## Call:
## glm(formula = WorldSeries ~ RS, family = "binomial", data = baseball)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept)  0.661226   1.636494   0.404    0.686
## RS          -0.002681   0.002098  -1.278    0.201
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 239.12  on 243  degrees of freedom
## Residual deviance: 237.45  on 242  degrees of freedom
## AIC: 241.45
## 
## Number of Fisher Scoring iterations: 4
model3<-glm(WorldSeries~RA, data=baseball, family="binomial")
summary(model3)
## 
## Call:
## glm(formula = WorldSeries ~ RA, family = "binomial", data = baseball)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  1.888174   1.483831   1.272   0.2032  
## RA          -0.005053   0.002273  -2.223   0.0262 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 239.12  on 243  degrees of freedom
## Residual deviance: 233.88  on 242  degrees of freedom
## AIC: 237.88
## 
## Number of Fisher Scoring iterations: 4
model4<-glm(WorldSeries~W, data=baseball, family="binomial")
summary(model4)
## 
## Call:
## glm(formula = WorldSeries ~ W, family = "binomial", data = baseball)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -6.85568    2.87620  -2.384   0.0171 *
## W            0.05671    0.02988   1.898   0.0577 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 239.12  on 243  degrees of freedom
## Residual deviance: 235.51  on 242  degrees of freedom
## AIC: 239.51
## 
## Number of Fisher Scoring iterations: 4
model5<-glm(WorldSeries~OBP, data=baseball, family="binomial")
summary(model5)
## 
## Call:
## glm(formula = WorldSeries ~ OBP, family = "binomial", data = baseball)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)    2.741      3.989   0.687    0.492
## OBP          -12.402     11.865  -1.045    0.296
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 239.12  on 243  degrees of freedom
## Residual deviance: 238.02  on 242  degrees of freedom
## AIC: 242.02
## 
## Number of Fisher Scoring iterations: 4
model6<-glm(WorldSeries~SLG, data=baseball, family="binomial")
summary(model6)
## 
## Call:
## glm(formula = WorldSeries ~ SLG, family = "binomial", data = baseball)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)    3.200      2.358   1.357   0.1748  
## SLG          -11.130      5.689  -1.956   0.0504 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 239.12  on 243  degrees of freedom
## Residual deviance: 235.23  on 242  degrees of freedom
## AIC: 239.23
## 
## Number of Fisher Scoring iterations: 4
model7<-glm(WorldSeries~BA, data=baseball, family="binomial")
summary(model7)
## 
## Call:
## glm(formula = WorldSeries ~ BA, family = "binomial", data = baseball)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -0.6392     3.8988  -0.164    0.870
## BA           -2.9765    14.6123  -0.204    0.839
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 239.12  on 243  degrees of freedom
## Residual deviance: 239.08  on 242  degrees of freedom
## AIC: 243.08
## 
## Number of Fisher Scoring iterations: 4
model8<-glm(WorldSeries~RankSeason, data=baseball, family="binomial")
summary(model8) 
## 
## Call:
## glm(formula = WorldSeries ~ RankSeason, family = "binomial", 
##     data = baseball)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  -0.8256     0.3268  -2.527   0.0115 *
## RankSeason   -0.2069     0.1027  -2.016   0.0438 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 239.12  on 243  degrees of freedom
## Residual deviance: 234.75  on 242  degrees of freedom
## AIC: 238.75
## 
## Number of Fisher Scoring iterations: 4
model9<-glm(WorldSeries~OOBP, data=baseball, family="binomial")
summary(model9)
## 
## Call:
## glm(formula = WorldSeries ~ OOBP, family = "binomial", data = baseball)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -0.9306     8.3728  -0.111    0.912
## OOBP         -3.2233    26.0587  -0.124    0.902
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 84.926  on 113  degrees of freedom
## Residual deviance: 84.910  on 112  degrees of freedom
##   (130 observations deleted due to missingness)
## AIC: 88.91
## 
## Number of Fisher Scoring iterations: 4
model10<-glm(WorldSeries~OSLG, data=baseball, family="binomial")
summary(model10)
## 
## Call:
## glm(formula = WorldSeries ~ OSLG, family = "binomial", data = baseball)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.08725    6.07285  -0.014    0.989
## OSLG        -4.65992   15.06881  -0.309    0.757
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 84.926  on 113  degrees of freedom
## Residual deviance: 84.830  on 112  degrees of freedom
##   (130 observations deleted due to missingness)
## AIC: 88.83
## 
## Number of Fisher Scoring iterations: 4
model11<-glm(WorldSeries~NumCompetitors, data=baseball, family="binomial")
summary(model11)
## 
## Call:
## glm(formula = WorldSeries ~ NumCompetitors, family = "binomial", 
##     data = baseball)
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     0.03868    0.43750   0.088 0.929559    
## NumCompetitors -0.25220    0.07422  -3.398 0.000678 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 239.12  on 243  degrees of freedom
## Residual deviance: 226.96  on 242  degrees of freedom
## AIC: 230.96
## 
## Number of Fisher Scoring iterations: 4
model12<-glm(WorldSeries~League, data=baseball, family="binomial")
summary(model12)
## 
## Call:
## glm(formula = WorldSeries ~ League, family = "binomial", data = baseball)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.3558     0.2243  -6.045  1.5e-09 ***
## LeagueNL     -0.1583     0.3252  -0.487    0.626    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 239.12  on 243  degrees of freedom
## Residual deviance: 238.88  on 242  degrees of freedom
## AIC: 242.88
## 
## Number of Fisher Scoring iterations: 4
#We can compute all pair-wise correlations between these variables with:
cor(baseball[c("Year", "RA", "RankSeason", "NumCompetitors")])
##                     Year        RA RankSeason NumCompetitors
## Year           1.0000000 0.4762422  0.3852191      0.9139548
## RA             0.4762422 1.0000000  0.3991413      0.5136769
## RankSeason     0.3852191 0.3991413  1.0000000      0.4247393
## NumCompetitors 0.9139548 0.5136769  0.4247393      1.0000000
model13 = glm(WorldSeries ~ Year + RA, data=baseball, family=binomial)
summary(model13)
## 
## Call:
## glm(formula = WorldSeries ~ Year + RA, family = binomial, data = baseball)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept) 63.610741  25.654830   2.479   0.0132 *
## Year        -0.032084   0.013323  -2.408   0.0160 *
## RA          -0.001766   0.002585  -0.683   0.4945  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 239.12  on 243  degrees of freedom
## Residual deviance: 227.88  on 241  degrees of freedom
## AIC: 233.88
## 
## Number of Fisher Scoring iterations: 4
#We can compute all pair-wise correlations between these variables with:
cor(baseball[c("Year", "RA", "RankSeason", "NumCompetitors")])
##                     Year        RA RankSeason NumCompetitors
## Year           1.0000000 0.4762422  0.3852191      0.9139548
## RA             0.4762422 1.0000000  0.3991413      0.5136769
## RankSeason     0.3852191 0.3991413  1.0000000      0.4247393
## NumCompetitors 0.9139548 0.5136769  0.4247393      1.0000000
model15 = glm(WorldSeries ~ Year + NumCompetitors, data=baseball, family=binomial)
summary(model15)
## 
## Call:
## glm(formula = WorldSeries ~ Year + NumCompetitors, family = binomial, 
##     data = baseball)
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)
## (Intercept)    13.350467  53.481896   0.250    0.803
## Year           -0.006802   0.027328  -0.249    0.803
## NumCompetitors -0.212610   0.175520  -1.211    0.226
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 239.12  on 243  degrees of freedom
## Residual deviance: 226.90  on 241  degrees of freedom
## AIC: 232.9
## 
## Number of Fisher Scoring iterations: 4
model16 = glm(WorldSeries ~ RA + RankSeason, data=baseball, family=binomial)
summary(model16)
## 
## Call:
## glm(formula = WorldSeries ~ RA + RankSeason, family = binomial, 
##     data = baseball)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept)  1.487461   1.506143   0.988    0.323
## RA          -0.003815   0.002441  -1.563    0.118
## RankSeason  -0.140824   0.110908  -1.270    0.204
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 239.12  on 243  degrees of freedom
## Residual deviance: 232.22  on 241  degrees of freedom
## AIC: 238.22
## 
## Number of Fisher Scoring iterations: 4
model17 = glm(WorldSeries ~ RA + NumCompetitors, data=baseball, family=binomial)
summary(model17)
## 
## Call:
## glm(formula = WorldSeries ~ RA + NumCompetitors, family = binomial, 
##     data = baseball)
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)   
## (Intercept)     0.716895   1.528736   0.469  0.63911   
## RA             -0.001233   0.002661  -0.463  0.64313   
## NumCompetitors -0.229385   0.088399  -2.595  0.00946 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 239.12  on 243  degrees of freedom
## Residual deviance: 226.74  on 241  degrees of freedom
## AIC: 232.74
## 
## Number of Fisher Scoring iterations: 4
model18 = glm(WorldSeries ~ RankSeason + NumCompetitors, data=baseball, family=binomial)
summary(model18)
## 
## Call:
## glm(formula = WorldSeries ~ RankSeason + NumCompetitors, family = binomial, 
##     data = baseball)
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)   
## (Intercept)     0.12277    0.45737   0.268  0.78837   
## RankSeason     -0.07697    0.11711  -0.657  0.51102   
## NumCompetitors -0.22784    0.08201  -2.778  0.00546 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 239.12  on 243  degrees of freedom
## Residual deviance: 226.52  on 241  degrees of freedom
## AIC: 232.52
## 
## Number of Fisher Scoring iterations: 4

None of the models with two independent variables had both variables significant, so none seem promising as compared to a simple bivariate model. Indeed the model with the lowest AIC value is the model with just NumCompetitors as the independent variable. This seems to confirm the claim made by Billy Beane in Moneyball that all that matters in the Playoffs is luck, since NumCompetitors has nothing to do with the quality of the teams! I agree wit Billy Beane because I do not know much about baseball.