# Read in data
baseball = read.csv("baseball.csv")
str(baseball)
## 'data.frame': 1232 obs. of 15 variables:
## $ Team : chr "ARI" "ATL" "BAL" "BOS" ...
## $ League : chr "NL" "NL" "AL" "AL" ...
## $ Year : int 2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
## $ RS : int 734 700 712 734 613 748 669 667 758 726 ...
## $ RA : int 688 600 705 806 759 676 588 845 890 670 ...
## $ W : int 81 94 93 69 61 85 97 68 64 88 ...
## $ OBP : num 0.328 0.32 0.311 0.315 0.302 0.318 0.315 0.324 0.33 0.335 ...
## $ SLG : num 0.418 0.389 0.417 0.415 0.378 0.422 0.411 0.381 0.436 0.422 ...
## $ BA : num 0.259 0.247 0.247 0.26 0.24 0.255 0.251 0.251 0.274 0.268 ...
## $ Playoffs : int 0 1 1 0 0 0 1 0 0 1 ...
## $ RankSeason : int NA 4 5 NA NA NA 2 NA NA 6 ...
## $ RankPlayoffs: int NA 5 4 NA NA NA 4 NA NA 2 ...
## $ G : int 162 162 162 162 162 162 162 162 162 162 ...
## $ OOBP : num 0.317 0.306 0.315 0.331 0.335 0.319 0.305 0.336 0.357 0.314 ...
## $ OSLG : num 0.415 0.378 0.403 0.428 0.424 0.405 0.39 0.43 0.47 0.402 ...
# Each row in the baseball dataset represents a team in a particular year. How many team/year pairs are there in the whole dataset?
nrow(baseball)
## [1] 1232
#Though the dataset contains data from 1962 until 2012, we removed several years with shorter-than-usual seasons. Using the table() function, identify the total number of years included in this dataset.
table(baseball$Year)
##
## 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1973 1974 1975 1976 1977 1978
## 20 20 20 20 20 20 20 24 24 24 24 24 24 24 26 26
## 1979 1980 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1996 1997
## 26 26 26 26 26 26 26 26 26 26 26 26 26 28 28 28
## 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012
## 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30
#We can count the number of years in the table, or use the command length(table(baseball$Year)) directly to get the answer.
length(table(baseball$Year))
## [1] 47
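As a cross-check, the number of distinct years can also be counted without building the frequency table first. The following is a minimal sketch on a hypothetical toy data frame (`toy` and its values are made up for illustration and are not taken from baseball.csv):

```r
# Hypothetical toy data: three distinct years, some of them repeated
toy <- data.frame(Year = c(1962, 1962, 1963, 1965, 1965, 1965))

# Number of distinct years via the frequency table
n_from_table <- length(table(toy$Year))

# Same count via unique(), which skips the tabulation step
n_from_unique <- length(unique(toy$Year))

n_from_table == n_from_unique
```

The two approaches agree when the column has no missing values. By default table() leaves out NA while unique() keeps it, so the counts can differ by one if Year contains NA.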
Limiting to Teams Making the Playoffs
Because we’re only analyzing teams that made the playoffs, we use the subset() function to replace baseball with a data frame limited to those teams (so the subsetted data frame is still called “baseball”). How many team/year pairs are included in the new dataset?
baseball = subset(baseball, Playoffs == 1)
nrow(baseball)
## [1] 244
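For comparison, here is a small sketch of how subset() relates to plain bracket indexing. The data frame `toy` is hypothetical, and the NA is included only to show how the two approaches treat a missing flag:

```r
# Hypothetical toy data; the NA mimics a missing playoff flag
toy <- data.frame(Team = c("A", "B", "C", "D"),
                  Playoffs = c(1, 0, NA, 1))

# subset() keeps only rows where the condition is TRUE, so the NA row is dropped
kept_subset <- subset(toy, Playoffs == 1)

# Bracket indexing needs which() to drop the NA row in the same way
kept_bracket <- toy[which(toy$Playoffs == 1), ]

nrow(kept_subset)
nrow(kept_bracket)
```

In the baseball data the Playoffs column shown by str() is coded 0/1, so either form should select the same rows there; subset() is simply the shorter spelling.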
#Through the years, different numbers of teams have been invited to the playoffs.
table(baseball$Year)
##
## 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1973 1974 1975 1976 1977 1978
## 2 2 2 2 2 2 2 4 4 4 4 4 4 4 4 4
## 1979 1980 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1996 1997
## 4 4 4 4 4 4 4 4 4 4 4 4 4 4 8 8
## 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012
## 8 8 8 8 8 8 8 8 8 8 8 8 8 8 10
Adding an Important Predictor
It’s much harder to win the World Series if there are 10 teams competing for the championship versus just two. Therefore, we will add the predictor variable NumCompetitors to the baseball data frame. NumCompetitors will contain the total number of teams making the playoffs in the year of a particular team/year pair. For instance, NumCompetitors should be 2 for the 1962 New York Yankees, but it should be 8 for the 1998 Boston Red Sox.
#We start by storing the output of the table() function that counts the number of playoff teams from each year:
PlayoffTable = table(baseball$Year)
#You can output the table with the following command:
PlayoffTable
##
## 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1973 1974 1975 1976 1977 1978
## 2 2 2 2 2 2 2 4 4 4 4 4 4 4 4 4
## 1979 1980 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1996 1997
## 4 4 4 4 4 4 4 4 4 4 4 4 4 4 8 8
## 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012
## 8 8 8 8 8 8 8 8 8 8 8 8 8 8 10
str(names(PlayoffTable))
## chr [1:47] "1962" "1963" "1964" "1965" "1966" "1967" "1968" "1969" "1970" ...
#Which function call returns the number of playoff teams in 1990 and 2001?
PlayoffTable[c("1990", "2001")]
##
## 1990 2001
## 4 8
#Putting it all together, we want to look up the number of teams in the playoffs for each team/year pair in the dataset, and store it as a new variable named NumCompetitors in the baseball data frame.
baseball$NumCompetitors = PlayoffTable[as.character(baseball$Year)]
baseball$NumCompetitors
## [1] 10 10 10 10 10 10 10 10 10 10 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
## [26] 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
## [51] 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
## [76] 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
## [101] 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
## [126] 8 8 8 8 8 8 8 8 8 8 8 8 8 4 4 4 4 4 4 4 4 4 4 4 4
## [151] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [176] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [201] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [226] 4 4 4 4 4 2 2 2 2 2 2 2 2 2 2 2 2 2 2
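The lookup works because the names of a table are character strings (as str(names(PlayoffTable)) shows), so the numeric Year has to be converted with as.character() before it is used as an index. A minimal sketch of the same pattern on hypothetical toy data (`toy` and `counts` are illustration-only names):

```r
# Hypothetical toy data: two playoff teams in 1962, four in 1990
toy <- data.frame(Team = c("A", "B", "C", "D", "E", "F"),
                  Year = c(1962, 1962, 1990, 1990, 1990, 1990))

# Frequency table; its names are the strings "1962" and "1990"
counts <- table(toy$Year)

# Index by name: each row picks up the count for its own year
by_name <- counts[as.character(toy$Year)]

# as.numeric() drops the table class and names, leaving a plain numeric column
toy$NumCompetitors <- as.numeric(by_name)
toy$NumCompetitors
```

Wrapping the lookup in as.numeric() is an optional extra that is not in the original code; it only ensures the new column is a plain numeric vector rather than a table object.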
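# Create a binary WorldSeries variable that is 1 when the team won the World Series (RankPlayoffs == 1) and 0 otherwise. How many observations do we have in our dataset where a team did NOT win the World Series?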
baseball$WorldSeries = as.numeric(baseball$RankPlayoffs == 1)
table(baseball$WorldSeries)
##
## 0 1
## 197 47
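The two counts add up to the 244 playoff rows, which indicates that RankPlayoffs has no missing values in the subset. In the full dataset RankPlayoffs is NA for teams that missed the playoffs, and a comparison with NA yields NA rather than 0. A small sketch of that behaviour, using a hypothetical vector `ranks`:

```r
# Hypothetical playoff ranks; NA stands for a team that missed the playoffs
ranks <- c(1, 3, NA, 2, 1)

# TRUE/FALSE/NA comparison converted to 1/0/NA
flag <- as.numeric(ranks == 1)
flag

# useNA = "ifany" makes any missing values visible in the tabulation
table(flag, useNA = "ifany")
```

This is one reason to build the indicator after subsetting to playoff teams rather than before.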
#Which of the following variables is a significant predictor of the WorldSeries variable in a bivariate logistic regression model?
# Variables to use as predictors, one per bivariate model: Year, RS, RA, W, OBP, SLG, BA, RankSeason, OOBP, OSLG, NumCompetitors, League
# NOTE: For ease of reading, only the models with a result worth commenting on are shown here; a bivariate model was fit and inspected for every variable in the list above.
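Fitting one model per predictor by hand is repetitive, so the full set of bivariate fits could also be produced in a loop. The sketch below runs on simulated stand-in data (`sim`, `x1`, `x2`, `x3` and `y` are hypothetical names, not columns of the baseball data) and collects the p-value of each slope:

```r
# Simulated stand-in data (hypothetical; not the baseball data)
set.seed(1)
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))

# Binary outcome that depends on x1 only
sim$y <- rbinom(n, 1, plogis(-1 + 1.5 * sim$x1))

predictors <- c("x1", "x2", "x3")

# Fit one bivariate logistic regression per predictor and keep the
# p-value of the slope (second row of the coefficient table)
p_values <- sapply(predictors, function(v) {
  fit <- glm(reformulate(v, response = "y"), data = sim, family = "binomial")
  coef(summary(fit))[2, "Pr(>|z|)"]
})

round(p_values, 4)
```

For the analysis here, `predictors` would hold the variable names listed above, `"y"` would become `"WorldSeries"`, and `sim` would become `baseball`. The individual summary() outputs are still worth reading, since a single p-value hides the sign and size of the coefficient.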
model1<-glm(WorldSeries~Year, data=baseball, family="binomial")
summary(model1)
##
## Call:
## glm(formula = WorldSeries ~ Year, family = "binomial", data = baseball)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 72.23602 22.64409 3.19 0.00142 **
## Year -0.03700 0.01138 -3.25 0.00115 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 239.12 on 243 degrees of freedom
## Residual deviance: 228.35 on 242 degrees of freedom
## AIC: 232.35
##
## Number of Fisher Scoring iterations: 4
model2<-glm(WorldSeries~RS, data=baseball, family="binomial")
summary(model2)
##
## Call:
## glm(formula = WorldSeries ~ RS, family = "binomial", data = baseball)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.661226 1.636494 0.404 0.686
## RS -0.002681 0.002098 -1.278 0.201
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 239.12 on 243 degrees of freedom
## Residual deviance: 237.45 on 242 degrees of freedom
## AIC: 241.45
##
## Number of Fisher Scoring iterations: 4
model3<-glm(WorldSeries~RA, data=baseball, family="binomial")
summary(model3)
##
## Call:
## glm(formula = WorldSeries ~ RA, family = "binomial", data = baseball)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.888174 1.483831 1.272 0.2032
## RA -0.005053 0.002273 -2.223 0.0262 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 239.12 on 243 degrees of freedom
## Residual deviance: 233.88 on 242 degrees of freedom
## AIC: 237.88
##
## Number of Fisher Scoring iterations: 4
model11<-glm(WorldSeries~NumCompetitors, data=baseball, family="binomial")
summary(model11)
##
## Call:
## glm(formula = WorldSeries ~ NumCompetitors, family = "binomial",
## data = baseball)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.03868 0.43750 0.088 0.929559
## NumCompetitors -0.25220 0.07422 -3.398 0.000678 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 239.12 on 243 degrees of freedom
## Residual deviance: 226.96 on 242 degrees of freedom
## AIC: 230.96
##
## Number of Fisher Scoring iterations: 4
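To read the NumCompetitors coefficient on a more familiar scale, it can be exponentiated into an odds ratio and turned into fitted probabilities. The sketch below copies the two estimates from the summary(model11) output above; the variable names (`b0`, `b1`, `competitors`) and the chosen field sizes are illustrative assumptions:

```r
# Coefficients copied from the summary(model11) output
b0 <- 0.03868    # intercept
b1 <- -0.25220   # slope for NumCompetitors

# Multiplicative change in the odds of winning for one extra playoff team
odds_ratio <- exp(b1)

# Fitted probability of winning the World Series for a few field sizes
competitors <- c(2, 4, 8, 10)
prob <- plogis(b0 + b1 * competitors)

round(odds_ratio, 3)
round(data.frame(competitors, prob), 3)
```

An odds ratio of about 0.78 means each additional playoff team lowers a team's odds of winning by roughly a fifth under this model, which matches the intuition behind adding NumCompetitors in the first place.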