# read in the data
baseball <- read.csv("C:/Users/duway/Downloads/baseball.csv")
str(baseball)
'data.frame': 1232 obs. of 15 variables:
$ Team : chr "ARI" "ATL" "BAL" "BOS" ...
$ League : chr "NL" "NL" "AL" "AL" ...
$ Year : int 2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
$ RS : int 734 700 712 734 613 748 669 667 758 726 ...
$ RA : int 688 600 705 806 759 676 588 845 890 670 ...
$ W : int 81 94 93 69 61 85 97 68 64 88 ...
$ OBP : num 0.328 0.32 0.311 0.315 0.302 0.318 0.315 0.324 0.33 0.335 ...
$ SLG : num 0.418 0.389 0.417 0.415 0.378 0.422 0.411 0.381 0.436 0.422 ...
$ BA : num 0.259 0.247 0.247 0.26 0.24 0.255 0.251 0.251 0.274 0.268 ...
$ Playoffs : int 0 1 1 0 0 0 1 0 0 1 ...
$ RankSeason : int NA 4 5 NA NA NA 2 NA NA 6 ...
$ RankPlayoffs: int NA 5 4 NA NA NA 4 NA NA 2 ...
$ G : int 162 162 162 162 162 162 162 162 162 162 ...
$ OOBP : num 0.317 0.306 0.315 0.331 0.335 0.319 0.305 0.336 0.357 0.314 ...
$ OSLG : num 0.415 0.378 0.403 0.428 0.424 0.405 0.39 0.43 0.47 0.402 ...
1.Team: A code for the name of the team 2.League: The Major League Baseball league the team belongs to, either AL (American League) or NL (National League) 3.Year: The year of the corresponding record 4.RS: The number of runs scored by the team in that year 5.RA: The number of runs allowed by the team in that year 6.W: The number of regular season wins by the team in that year 7.OBP: The on-base percentage of the team in that year 8.SLG: The slugging percentage of the team in that year 9.BA: The batting average of the team in that year 10.Playoffs: Whether the team made the playoffs in that year (1 for yes, 0 for no) 11.RankSeason: Among the playoff teams in that year, the ranking of their regular season records (1 is best) 12.RankPlayoffs: Among the playoff teams in that year, how well they fared in the playoffs. The team winning the World Series gets a RankPlayoffs of 1. 13.G: The number of games a team played in that year 14.OOBP: The team’s opponents’ on-base percentage in that year 15.OSLG: The team’s opponents’ slugging percentage in that year
# Count the number of unique team/year pairs
unique_years <- unique(baseball[, c("Team", "Year")])
num_years <- nrow(unique_years)
num_years
[1] 1232
There are 1232 unique team/year pairs in the dataset.
#Though the dataset contains data from 1962 until 2012, we removed several years with shorter-than-usual seasons. Using the table() function, identify the total number of years included in this dataset.
# Count the number of unique years
table(baseball$Year)
1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1973 1974 1975 1976
20 20 20 20 20 20 20 24 24 24 24 24 24 24
1977 1978 1979 1980 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991
26 26 26 26 26 26 26 26 26 26 26 26 26 26
1992 1993 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007
26 28 28 28 30 30 30 30 30 30 30 30 30 30
2008 2009 2010 2011 2012
30 30 30 30 30
# Count the number of unique years
unique_years <- unique(baseball$Year)
num_years <- length(unique_years)
num_years
[1] 47
The dataset contains 47 unique years, ranging from 1962 to 2012.Using the unique() function, we can see that the dataset contains data from 1962 to 2012, with a total of 47 unique years.
# Subset the data to include only teams that made the playoffs
baseball <- subset(baseball, Playoffs == 1)
nrow(baseball)
[1] 244
For the given subset data, there are 244 unique instances of teams making the playoff.
# Count the number of unique teams in the postseason
table(table(baseball$Year))
2 4 8 10
7 23 16 1
#Adding an Important Predictor It’s much harder to win the World Series if there are 10 teams competing for the championship versus just two. Therefore, we will add the predictor variable NumCompetitors to the baseball data frame. NumCompetitors will contain the number of total teams making the playoffs in the year of a particular team/year pair. For instance, NumCompetitors should be 2 for the 1962 New York Yankees, but it should be 8 for the 1998 Boston Red Sox.
# Start by storing the ouput of the table() function that counts the number of the playooff teams from each year
playoff_teams <- table(baseball$Year)
playoff_teams
1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1973 1974 1975 1976
2 2 2 2 2 2 2 4 4 4 4 4 4 4
1977 1978 1979 1980 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991
4 4 4 4 4 4 4 4 4 4 4 4 4 4
1992 1993 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007
4 4 8 8 8 8 8 8 8 8 8 8 8 8
2008 2009 2010 2011 2012
8 8 8 8 10
str(names(playoff_teams))
chr [1:47] "1962" "1963" "1964" "1965" "1966" "1967" "1968" ...
# Get the number of playoff teams in 1990 and 2001
playoff_teams[c("1990", "2001")]
1990 2001
4 8
The number of playoff teams in 1990 is 4 and in 2001 is 8.
#Putting it all together, we want to look up the number of teams in the playoffs for each team/year pair in the dataset, and store it as a new variable named NumCompetitors in the baseball data frame.
baseball$NumCompetitors = playoff_teams[as.character(baseball$Year)]
baseball$NumCompetitors
[1] 10 10 10 10 10 10 10 10 10 10 8 8 8 8 8 8 8 8 8 8 8
[22] 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
[43] 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
[64] 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
[85] 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
[106] 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
[127] 8 8 8 8 8 8 8 8 8 8 8 8 4 4 4 4 4 4 4 4 4
[148] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
[169] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
[190] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
[211] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 2
[232] 2 2 2 2 2 2 2 2 2 2 2 2 2
#How many playoff team/year pairs are there in our dataset from years where 8 teams were invited to the playoffs?
table(baseball$NumCompetitors)
2 4 8 10
14 92 128 10
There are 128 playoff team/year pairs in the dataset from years where 8 teams were invited to the playoffs.
#Bivariate Models for Predicting World Series Winner In this problem, we seek to predict whether a team won the World Series; in our dataset this is denoted with a RankPlayoffs value of 1. Add a variable named WorldSeries to the baseball data frame, by typing the following command in your R console:baseballWorldSeries=as.numeric(baseballRankPlayoffs == 1)WorldSeries takes value 1 if a team won the World Series in the indicated year and a 0 otherwise.
#How many observations do we have in our dataset where a team did NOT win the World Series?
baseball$WorldSeries = as.numeric(baseball$RankPlayoffs == 1)
table(baseball$WorldSeries)
0 1
197 47
There are 197 observations in the dataset where a team did not win the World Series (WorldSeries = 0).
#Bivariate Models for Predicting World Series Winner When we’re not sure which of our variables are useful in predicting a particular outcome, it’s often helpful to build bivariate models, which are models that predict the outcome using a single independent variable. Which of the following variables is a significant predictor of the WorldSeries variable in a bivariate logistic regression model? To determine significance, remember to look at the stars in the summary output of the model. We’ll define an independent variable as significant if there is at least one star at the end of the coefficients row for that variable (this is equivalent to the probability column having a value smaller than 0.05). Note that you have to build 12 models to answer this question! Use the entire dataset baseball to build the models.
#Which of the following variables is a significant predictor of the WorldSeries variable in a bivariate logistic regression model?
#Varibales to use as predictors for each bivariate model(Year, RS, RA, W, OBP, SLG, BA, RankSeason, OOBP,OSLG, NumCompetitors, League)
model1<-glm(WorldSeries~Year, data=baseball, family="binomial")
summary(model1)
Call:
glm(formula = WorldSeries ~ Year, family = "binomial", data = baseball)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 72.23602 22.64409 3.19 0.00142 **
Year -0.03700 0.01138 -3.25 0.00115 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 239.12 on 243 degrees of freedom
Residual deviance: 228.35 on 242 degrees of freedom
AIC: 232.35
Number of Fisher Scoring iterations: 4
The Year variable is significant in predicting the WorldSeries variable, as indicated by the stars in the summary output of the model. The p-value for Year is less than 0.05, indicating that it is a significant predictor of the outcome.
model2<-glm(WorldSeries~RS, data=baseball, family="binomial")
summary(model2)
Call:
glm(formula = WorldSeries ~ RS, family = "binomial", data = baseball)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.661226 1.636494 0.404 0.686
RS -0.002681 0.002098 -1.278 0.201
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 239.12 on 243 degrees of freedom
Residual deviance: 237.45 on 242 degrees of freedom
AIC: 241.45
Number of Fisher Scoring iterations: 4
The RS variable is not significant in predicting the WorldSeries variable, The p-value for RS is greater than 0.05, indicating that it is not a significant predictor of the outcome.
model3<-glm(WorldSeries~RA, data=baseball, family="binomial")
summary(model3)
Call:
glm(formula = WorldSeries ~ RA, family = "binomial", data = baseball)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.888174 1.483831 1.272 0.2032
RA -0.005053 0.002273 -2.223 0.0262 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 239.12 on 243 degrees of freedom
Residual deviance: 233.88 on 242 degrees of freedom
AIC: 237.88
Number of Fisher Scoring iterations: 4
The RA variable is significant in predicting the WorldSeries variable.The p-value for RA is less than 0.05, indicating that it is a significant predictor of the outcome.
model4<-glm(WorldSeries~W, data=baseball, family="binomial")
summary(model4)
Call:
glm(formula = WorldSeries ~ W, family = "binomial", data = baseball)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.85568 2.87620 -2.384 0.0171 *
W 0.05671 0.02988 1.898 0.0577 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 239.12 on 243 degrees of freedom
Residual deviance: 235.51 on 242 degrees of freedom
AIC: 239.51
Number of Fisher Scoring iterations: 4
The W variable is not significant in predicting the WorldSeries variable.The p-value for W is 0.05, indicating that it is not a significant predictor of the outcome.
model5<-glm(WorldSeries~OBP, data=baseball, family="binomial")
summary(model5)
Call:
glm(formula = WorldSeries ~ OBP, family = "binomial", data = baseball)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.741 3.989 0.687 0.492
OBP -12.402 11.865 -1.045 0.296
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 239.12 on 243 degrees of freedom
Residual deviance: 238.02 on 242 degrees of freedom
AIC: 242.02
Number of Fisher Scoring iterations: 4
The OBP variable is not significant in predicting the WorldSeries variable.The p-value for OBP is greater than 0.05, indicating that it is not a significant predictor of the outcome.
model6<-glm(WorldSeries~SLG, data=baseball, family="binomial")
summary(model6)
Call:
glm(formula = WorldSeries ~ SLG, family = "binomial", data = baseball)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.200 2.358 1.357 0.1748
SLG -11.130 5.689 -1.956 0.0504 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 239.12 on 243 degrees of freedom
Residual deviance: 235.23 on 242 degrees of freedom
AIC: 239.23
Number of Fisher Scoring iterations: 4
The SLG variable is not significant in predicting the WorldSeries variable.The p-value for SLG is greater than 0.05, indicating that it is not a significant predictor of the outcome.
model7<-glm(WorldSeries~BA, data=baseball, family="binomial")
summary(model7)
Call:
glm(formula = WorldSeries ~ BA, family = "binomial", data = baseball)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.6392 3.8988 -0.164 0.870
BA -2.9765 14.6123 -0.204 0.839
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 239.12 on 243 degrees of freedom
Residual deviance: 239.08 on 242 degrees of freedom
AIC: 243.08
Number of Fisher Scoring iterations: 4
The BA variable is not significant in predicting the WorldSeries variable.The p-value for BA is greater than 0.05, indicating that it is not a significant predictor of the outcome.
model8<-glm(WorldSeries~RankSeason, data=baseball, family="binomial")
summary(model8)
Call:
glm(formula = WorldSeries ~ RankSeason, family = "binomial",
data = baseball)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.8256 0.3268 -2.527 0.0115 *
RankSeason -0.2069 0.1027 -2.016 0.0438 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 239.12 on 243 degrees of freedom
Residual deviance: 234.75 on 242 degrees of freedom
AIC: 238.75
Number of Fisher Scoring iterations: 4
The RankSeason variable is significant in predicting the WorldSeries variable.The p-value for RankSeason is greater than 0.05, indicating that it is a significant predictor of the outcome.
model9<-glm(WorldSeries~OOBP, data=baseball, family="binomial")
summary(model9)
Call:
glm(formula = WorldSeries ~ OOBP, family = "binomial", data = baseball)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.9306 8.3728 -0.111 0.912
OOBP -3.2233 26.0587 -0.124 0.902
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 84.926 on 113 degrees of freedom
Residual deviance: 84.910 on 112 degrees of freedom
(130 observations deleted due to missingness)
AIC: 88.91
Number of Fisher Scoring iterations: 4
The OOBP variable is not significant in predicting the WorldSeries variable.The p-value for OOBP is greater than 0.05, indicating that it is not a significant predictor of the outcome.
model10<-glm(WorldSeries~OSLG, data=baseball, family="binomial")
summary(model10)
Call:
glm(formula = WorldSeries ~ OSLG, family = "binomial", data = baseball)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.08725 6.07285 -0.014 0.989
OSLG -4.65992 15.06881 -0.309 0.757
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 84.926 on 113 degrees of freedom
Residual deviance: 84.830 on 112 degrees of freedom
(130 observations deleted due to missingness)
AIC: 88.83
Number of Fisher Scoring iterations: 4
The OSLG variable is not significant in predicting the WorldSeries variable.The p-value for OSLG is greater than 0.05, indicating that it is not a significant predictor of the outcome.
model11<-glm(WorldSeries~NumCompetitors, data=baseball, family="binomial")
summary(model11)
Call:
glm(formula = WorldSeries ~ NumCompetitors, family = "binomial",
data = baseball)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.03868 0.43750 0.088 0.929559
NumCompetitors -0.25220 0.07422 -3.398 0.000678 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 239.12 on 243 degrees of freedom
Residual deviance: 226.96 on 242 degrees of freedom
AIC: 230.96
Number of Fisher Scoring iterations: 4
The NumCompetitors variable is significant in predicting the WorldSeries variable.The p-value for NumCompetitors is less than 0.05, indicating that it is a significant predictor of the outcome.
model12<-glm(WorldSeries~League, data=baseball, family="binomial")
summary(model12)
Call:
glm(formula = WorldSeries ~ League, family = "binomial", data = baseball)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.3558 0.2243 -6.045 1.5e-09 ***
LeagueNL -0.1583 0.3252 -0.487 0.626
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 239.12 on 243 degrees of freedom
Residual deviance: 238.88 on 242 degrees of freedom
AIC: 242.88
Number of Fisher Scoring iterations: 4
The League variable is not significant in predicting the WorldSeries variable.The p-value for League is greater than 0.05, indicating that it is not a significant predictor of the outcome.
#In short, the significant predictors are Year, RA, and NumCompetitors.Multivariate Models for Predicting World Series Winner.In this section, we’ll consider multivariate models that combine the variables we found to be significant in bivariate models. Build a model using all of the variables that you found to be significant in the bivariate models. How many variables are significant in the combined model?
#How many variables are significant in the combined model?
LogModel = glm(WorldSeries ~ Year + RA + RankSeason + NumCompetitors, data=baseball, family=binomial)
summary(LogModel)
Call:
glm(formula = WorldSeries ~ Year + RA + RankSeason + NumCompetitors,
family = binomial, data = baseball)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 12.5874376 53.6474210 0.235 0.814
Year -0.0061425 0.0274665 -0.224 0.823
RA -0.0008238 0.0027391 -0.301 0.764
RankSeason -0.0685046 0.1203459 -0.569 0.569
NumCompetitors -0.1794264 0.1815933 -0.988 0.323
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 239.12 on 243 degrees of freedom
Residual deviance: 226.37 on 239 degrees of freedom
AIC: 236.37
Number of Fisher Scoring iterations: 4
Looking at summary(LogModel), we can see that none of the variables are significant in the multivariate model. Often, variables that were significant in bivariate models are no longer significant in multivariate analysis due to correlation between the variables. Which of the following variable pairs have a high degree of correlation (a correlation greater than 0.8 or less than -0.8)?
#We can compute all pair-wise correlations between these variables with:
cor(baseball[c("Year", "RA", "RankSeason", "NumCompetitors")])
Year RA RankSeason NumCompetitors
Year 1.0000000 0.4762422 0.3852191 0.9139548
RA 0.4762422 1.0000000 0.3991413 0.5136769
RankSeason 0.3852191 0.3991413 1.0000000 0.4247393
NumCompetitors 0.9139548 0.5136769 0.4247393 1.0000000
#While every pair was at least moderately correlated, the only strongly correlated pair was Year/NumCompetitors, with correlation coefficient 0.914.Let us build all six of the two variable models listed in the previous problem. Together with the four bivariate models that were significant, we should have 10 different logistic regression models to analyze. Which model has the best AIC value (the minimum AIC value)?
#The two-variable models can be built with the following commands:
model13 = glm(WorldSeries ~ Year + RA, data=baseball, family=binomial)
summary(model13)
Call:
glm(formula = WorldSeries ~ Year + RA, family = binomial, data = baseball)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 63.610741 25.654830 2.479 0.0132 *
Year -0.032084 0.013323 -2.408 0.0160 *
RA -0.001766 0.002585 -0.683 0.4945
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 239.12 on 243 degrees of freedom
Residual deviance: 227.88 on 241 degrees of freedom
AIC: 233.88
Number of Fisher Scoring iterations: 4
model14 = glm(WorldSeries ~ Year + RankSeason, data=baseball, family=binomial)
summary(model14)
Call:
glm(formula = WorldSeries ~ Year + RankSeason, family = binomial,
data = baseball)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 63.64855 24.37063 2.612 0.00901 **
Year -0.03254 0.01231 -2.643 0.00822 **
RankSeason -0.10064 0.11352 -0.887 0.37534
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 239.12 on 243 degrees of freedom
Residual deviance: 227.55 on 241 degrees of freedom
AIC: 233.55
Number of Fisher Scoring iterations: 4
model15 = glm(WorldSeries ~ Year + NumCompetitors, data=baseball, family=binomial)
summary(model15)
Call:
glm(formula = WorldSeries ~ Year + NumCompetitors, family = binomial,
data = baseball)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 13.350467 53.481896 0.250 0.803
Year -0.006802 0.027328 -0.249 0.803
NumCompetitors -0.212610 0.175520 -1.211 0.226
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 239.12 on 243 degrees of freedom
Residual deviance: 226.90 on 241 degrees of freedom
AIC: 232.9
Number of Fisher Scoring iterations: 4
model16 = glm(WorldSeries ~ RA + RankSeason, data=baseball, family=binomial)
summary(model16)
Call:
glm(formula = WorldSeries ~ RA + RankSeason, family = binomial,
data = baseball)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.487461 1.506143 0.988 0.323
RA -0.003815 0.002441 -1.563 0.118
RankSeason -0.140824 0.110908 -1.270 0.204
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 239.12 on 243 degrees of freedom
Residual deviance: 232.22 on 241 degrees of freedom
AIC: 238.22
Number of Fisher Scoring iterations: 4
model17 = glm(WorldSeries ~ RA + NumCompetitors, data=baseball, family=binomial)
summary(model17)
Call:
glm(formula = WorldSeries ~ RA + NumCompetitors, family = binomial,
data = baseball)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.716895 1.528736 0.469 0.63911
RA -0.001233 0.002661 -0.463 0.64313
NumCompetitors -0.229385 0.088399 -2.595 0.00946 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 239.12 on 243 degrees of freedom
Residual deviance: 226.74 on 241 degrees of freedom
AIC: 232.74
Number of Fisher Scoring iterations: 4
model18 = glm(WorldSeries ~ RankSeason + NumCompetitors, data=baseball, family=binomial)
summary(model18)
Call:
glm(formula = WorldSeries ~ RankSeason + NumCompetitors, family = binomial,
data = baseball)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.12277 0.45737 0.268 0.78837
RankSeason -0.07697 0.11711 -0.657 0.51102
NumCompetitors -0.22784 0.08201 -2.778 0.00546 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 239.12 on 243 degrees of freedom
Residual deviance: 226.52 on 241 degrees of freedom
AIC: 232.52
Number of Fisher Scoring iterations: 4
The model with the best AIC value is model 18, which include NumCompetitors as predictor. The AIC value for this model is lower than the AIC values for all other models, indicating that it is the best-fitting model among those considered.
From the professor: None of the models with two independent variables had both variables significant, so none seem promising as compared to a simple bivariate model. Indeed the model with the lowest AIC value is the model with just NumCompetitors as the independent variable. This seems to confirm the claim made by Billy Beane in Moneyball that all that matters in the Playoffs is luck, since NumCompetitors has nothing to do with the quality of the teams!