Frank Vega
IN CLASS ACTIVITY 15
#creating the dataframe
baseball = read.csv("baseball.csv")
str(baseball)
## 'data.frame': 1232 obs. of 15 variables:
## $ Team : chr "ARI" "ATL" "BAL" "BOS" ...
## $ League : chr "NL" "NL" "AL" "AL" ...
## $ Year : int 2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
## $ RS : int 734 700 712 734 613 748 669 667 758 726 ...
## $ RA : int 688 600 705 806 759 676 588 845 890 670 ...
## $ W : int 81 94 93 69 61 85 97 68 64 88 ...
## $ OBP : num 0.328 0.32 0.311 0.315 0.302 0.318 0.315 0.324 0.33 0.335 ...
## $ SLG : num 0.418 0.389 0.417 0.415 0.378 0.422 0.411 0.381 0.436 0.422 ...
## $ BA : num 0.259 0.247 0.247 0.26 0.24 0.255 0.251 0.251 0.274 0.268 ...
## $ Playoffs : int 0 1 1 0 0 0 1 0 0 1 ...
## $ RankSeason : int NA 4 5 NA NA NA 2 NA NA 6 ...
## $ RankPlayoffs: int NA 5 4 NA NA NA 4 NA NA 2 ...
## $ G : int 162 162 162 162 162 162 162 162 162 162 ...
## $ OOBP : num 0.317 0.306 0.315 0.331 0.335 0.319 0.305 0.336 0.357 0.314 ...
## $ OSLG : num 0.415 0.378 0.403 0.428 0.424 0.405 0.39 0.43 0.47 0.402 ...
#The data frame has 1232 observations (rows) and 15 variables (columns).
#The variables are either numeric or character and are common attributes in baseball analytics.
#######INDEX###########
#Team - Team name
#League - League in which the team plays
#Year - Year of the season
#RS - Runs scored
#RA - Runs allowed
#W - Wins
#OBP - On-base percentage
#SLG - Slugging percentage
#BA - Batting average
#Playoffs - Indicator of whether the team made the playoffs (0 = did not qualify, 1 = qualified)
#RankSeason - Season ranking
#RankPlayoffs - Playoff ranking
#G - Games played
#OOBP - Opponent's on-base percentage
#OSLG - Opponent's slugging percentage
#Each row in the baseball dataset represents a team in a particular year. How many team/year pairs are there in the whole dataset?
#-------There are 1232 team/year pairs, which is the number of rows/observations in the data frame
#
nrow(baseball)
## [1] 1232
#Though the dataset contains data from 1962 until 2012, we removed several years with shorter-than-usual seasons.
#Using the table() function, identify the total number of years included in this dataset.
table(baseball$Year)
##
## 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1973 1974 1975 1976 1977 1978
## 20 20 20 20 20 20 20 24 24 24 24 24 24 24 26 26
## 1979 1980 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1996 1997
## 26 26 26 26 26 26 26 26 26 26 26 26 26 28 28 28
## 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012
## 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30
#Bivariate Models for Predicting World Series Winner
#When we’re not sure which of our variables are useful in predicting a particular outcome, it’s often helpful to build bivariate models, which are models that predict the outcome using a single independent variable.
#Which of the following variables is a significant predictor of the WorldSeries variable in a bivariate logistic regression model? To determine significance, remember to look at the stars in the summary output of the model. We’ll define an independent variable as significant if there is at least one star at the end of the coefficients row for that variable (this is equivalent to the probability column having a value smaller than 0.05). Note that you have to build 12 models to answer this question! Use the entire dataset baseball to build the models
#The significant variables according to the models we ran below are Year, RA, RankSeason and NumCompetitors.
#Each of these variables has a p-value below 0.05 in its bivariate model, so it is a statistically significant predictor of WorldSeries.
#The baseball dataset contains 47 years (1972, 1981, 1994, and 1995 are missing). We can count the number of years in the table, or use the command length(table(baseball$Year)) directly to get the answer.
length(table(baseball$Year))
## [1] 47
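As a cross-check on the count of years, the same number can be obtained with unique(). The sketch below uses a small made-up vector named years (an assumption for illustration, not the real Year column).

```r
# Toy vector of season years (made-up values, not the real baseball data)
years <- c(1962, 1962, 1963, 1963, 1964, 1964, 1964)

# Count the distinct years two ways
n_from_table  <- length(table(years))   # one table cell per distinct year
n_from_unique <- length(unique(years))  # distinct values directly

n_from_table
n_from_unique
```

Both approaches should agree; table() also gives the per-year counts, which is why it is the more useful call here.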
Limiting to Teams Making the Playoffs
Because we’re only analyzing teams that made the playoffs, we can use the subset() function to replace baseball with a data frame limited to teams that made the playoffs (so our subsetted data frame should still be called “baseball”). How many team/year pairs are included in the new dataset?
#There are 244 team/year pairs in the new dataset, because we kept only the rows where Playoffs == 1.
#The subset contains the teams that qualified for the playoffs.
baseball = subset(baseball, Playoffs == 1)
nrow(baseball)
## [1] 244
#Through the years, different numbers of teams have been invited to the playoffs.
table(baseball$Year)
##
## 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1973 1974 1975 1976 1977 1978
## 2 2 2 2 2 2 2 4 4 4 4 4 4 4 4 4
## 1979 1980 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1996 1997
## 4 4 4 4 4 4 4 4 4 4 4 4 4 4 8 8
## 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012
## 8 8 8 8 8 8 8 8 8 8 8 8 8 8 10
#table(baseball$Year) above is a frequency table of the Year column: the number of playoff teams in each year.
#Applying table() a second time counts how many years had each number of playoff teams, which shows how the playoff field size is distributed across years.
table(table(baseball$Year))
##
## 2 4 8 10
## 7 23 16 1
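The nested table() call can be easier to follow on a tiny example. The sketch below uses a made-up vector playoff_years (an assumption for illustration) and then checks that the counts reported above add up to the 244 playoff teams found earlier.

```r
# Toy playoff data: one entry per playoff team, value = season year (made up)
playoff_years <- c(1962, 1962, 1963, 1963, 1969, 1969, 1969, 1969)

teams_per_year <- table(playoff_years)   # number of playoff teams in each year
years_per_size <- table(teams_per_year)  # number of years with each field size

teams_per_year
years_per_size

# Consistency check using the counts printed in this report:
# 7 years with 2 teams, 23 with 4, 16 with 8, and 1 with 10
total_playoff_teams <- sum(c(2, 4, 8, 10) * c(7, 23, 16, 1))
total_playoff_teams
```

The total matches the number of rows returned by nrow(baseball) after subsetting.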
Adding an Important Predictor
It’s much harder to win the World Series if there are 10 teams competing for the championship versus just two. Therefore, we will add the predictor variable NumCompetitors to the baseball data frame. NumCompetitors will contain the number of total teams making the playoffs in the year of a particular team/year pair. For instance, NumCompetitors should be 2 for the 1962 New York Yankees, but it should be 8 for the 1998 Boston Red Sox.
#We start by storing the output of the table() function that counts the number of playoff teams from each year:
PlayoffTable = table(baseball$Year)
#You can output the table with the following command:
PlayoffTable
##
## 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1973 1974 1975 1976 1977 1978
## 2 2 2 2 2 2 2 4 4 4 4 4 4 4 4 4
## 1979 1980 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1996 1997
## 4 4 4 4 4 4 4 4 4 4 4 4 4 4 8 8
## 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012
## 8 8 8 8 8 8 8 8 8 8 8 8 8 8 10
#str() shows that the names of PlayoffTable are stored as character strings, not numbers
str(names(PlayoffTable))
## chr [1:47] "1962" "1963" "1964" "1965" "1966" "1967" "1968" "1969" "1970" ...
#Which function call returns the number of playoff teams in 1990 and 2001?
PlayoffTable[c("1990", "2001")]
##
## 1990 2001
## 4 8
#Putting it all together, we want to look up the number of teams in the playoffs for each team/year pair in the dataset, and store it as a new variable named NumCompetitors in the baseball data frame.
baseball$NumCompetitors = PlayoffTable[as.character(baseball$Year)]
baseball$NumCompetitors
## [1] 10 10 10 10 10 10 10 10 10 10 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
## [26] 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
## [51] 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
## [76] 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
## [101] 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
## [126] 8 8 8 8 8 8 8 8 8 8 8 8 8 4 4 4 4 4 4 4 4 4 4 4 4
## [151] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [176] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [201] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [226] 4 4 4 4 4 2 2 2 2 2 2 2 2 2 2 2 2 2 2
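The lookup works because the table is indexed by name (the year as a string) rather than by position. Here is a minimal sketch on a made-up data frame called toy (an assumption for illustration). Wrapping the result in as.vector() is an optional extra step, not used above, that stores the new column as a plain vector instead of a table object.

```r
# Toy data frame of playoff teams (made-up rows)
toy <- data.frame(
  Team = c("A", "B", "C", "D", "E", "F"),
  Year = c(1962, 1962, 1998, 1998, 1998, 1998)
)

playoff_table <- table(toy$Year)  # names are the strings "1962" and "1998"

# Index by the year converted to character so the lookup is by name.
# Indexing with the number 1998 would ask for position 1998, which does not exist.
toy$NumCompetitors <- as.vector(playoff_table[as.character(toy$Year)])

toy
```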
#How many playoff team/year pairs are there in our dataset from years where 8 teams were invited to the playoffs?
#----There are a total of 128 team/year pairs from years where 8 teams were invited to the playoffs
table(baseball$NumCompetitors)
##
## 2 4 8 10
## 14 92 128 10
#How many observations do we have in our dataset where a team did NOT win the World Series?
#We have a total of 197 observations in which a team did not win the World Series
baseball$WorldSeries = as.numeric(baseball$RankPlayoffs == 1)
table(baseball$WorldSeries)
##
## 0 1
## 197 47
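The WorldSeries variable is a logical comparison converted to 0/1. The sketch below shows the conversion on a made-up vector rank_playoffs (an assumption for illustration) and checks the reported counts against the earlier totals.

```r
# Made-up playoff ranks for six teams
rank_playoffs <- c(1, 3, 2, 1, 4, 5)

# TRUE/FALSE becomes 1/0: 1 marks a World Series winner
world_series <- as.numeric(rank_playoffs == 1)
world_series
table(world_series)

# Consistency check with the counts printed in this report:
# 47 winners (one per year in the data) plus 197 non-winners
total_rows <- 197 + 47
total_rows
```

The total of 244 agrees with the number of playoff team/year pairs, and 47 winners agrees with the 47 years in the dataset.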
NumCompetitors is significant
RankSeason is significant
RA is a significant predictor
Year is a significant predictor
In these four cases the p-value is less than 0.05, which is considered statistically significant, so we reject the null hypothesis that the coefficient is zero.
Variables to use as predictors for each bivariate model: Year, RS, RA, W, OBP, SLG, BA, RankSeason, OOBP, OSLG, NumCompetitors, League
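The twelve models in this report are fitted one at a time. As an alternative, the same pattern can be written as a loop. The sketch below is only an illustration on simulated data: the data frame sim and its columns x1, x2, grp and y are assumptions, not the baseball variables.

```r
# Simulated stand-in for the playoff data (random values, NOT the real dataset)
set.seed(1)
n <- 200
sim <- data.frame(
  x1  = rnorm(n),
  x2  = rnorm(n),
  grp = factor(sample(c("AL", "NL"), n, replace = TRUE))
)
sim$y <- rbinom(n, 1, plogis(-1 + 1.2 * sim$x1))

predictors <- c("x1", "x2", "grp")

# Fit one logistic regression per predictor and keep the p-value of its coefficient
p_values <- sapply(predictors, function(v) {
  fit <- glm(reformulate(v, response = "y"), data = sim, family = binomial)
  coef(summary(fit))[2, "Pr(>|z|)"]  # row 1 is the intercept, row 2 the predictor
})

round(p_values, 4)
names(p_values)[p_values < 0.05]  # predictors significant at the 0.05 level
```

With the real data, predictors would be the twelve variable names listed above and the response would be WorldSeries.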
model1<-glm(WorldSeries~Year, data=baseball, family="binomial")
summary(model1)
##
## Call:
## glm(formula = WorldSeries ~ Year, family = "binomial", data = baseball)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0297 -0.6797 -0.5435 -0.4648 2.1504
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 72.23602 22.64409 3.19 0.00142 **
## Year -0.03700 0.01138 -3.25 0.00115 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 239.12 on 243 degrees of freedom
## Residual deviance: 228.35 on 242 degrees of freedom
## AIC: 232.35
##
## Number of Fisher Scoring iterations: 4
Year is a significant predictor!
model2<-glm(WorldSeries~RS, data=baseball, family="binomial")
summary(model2)
##
## Call:
## glm(formula = WorldSeries ~ RS, family = "binomial", data = baseball)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.8254 -0.6819 -0.6363 -0.5561 2.0308
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.661226 1.636494 0.404 0.686
## RS -0.002681 0.002098 -1.278 0.201
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 239.12 on 243 degrees of freedom
## Residual deviance: 237.45 on 242 degrees of freedom
## AIC: 241.45
##
## Number of Fisher Scoring iterations: 4
RS is not significant
model3<-glm(WorldSeries~RA, data=baseball, family="binomial")
summary(model3)
##
## Call:
## glm(formula = WorldSeries ~ RA, family = "binomial", data = baseball)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9749 -0.6883 -0.6118 -0.4746 2.1577
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.888174 1.483831 1.272 0.2032
## RA -0.005053 0.002273 -2.223 0.0262 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 239.12 on 243 degrees of freedom
## Residual deviance: 233.88 on 242 degrees of freedom
## AIC: 237.88
##
## Number of Fisher Scoring iterations: 4
RA is a significant predictor!
model4<-glm(WorldSeries~W, data=baseball, family="binomial")
summary(model4)
##
## Call:
## glm(formula = WorldSeries ~ W, family = "binomial", data = baseball)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0623 -0.6777 -0.6117 -0.5367 2.1254
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.85568 2.87620 -2.384 0.0171 *
## W 0.05671 0.02988 1.898 0.0577 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 239.12 on 243 degrees of freedom
## Residual deviance: 235.51 on 242 degrees of freedom
## AIC: 239.51
##
## Number of Fisher Scoring iterations: 4
W is not significant (p = 0.0577, above the 0.05 cutoff)
model5<-glm(WorldSeries~OBP, data=baseball, family="binomial")
summary(model5)
##
## Call:
## glm(formula = WorldSeries ~ OBP, family = "binomial", data = baseball)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.8071 -0.6749 -0.6365 -0.5797 1.9753
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.741 3.989 0.687 0.492
## OBP -12.402 11.865 -1.045 0.296
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 239.12 on 243 degrees of freedom
## Residual deviance: 238.02 on 242 degrees of freedom
## AIC: 242.02
##
## Number of Fisher Scoring iterations: 4
OBP is not significant
model6<-glm(WorldSeries~SLG, data=baseball, family="binomial")
summary(model6)
##
## Call:
## glm(formula = WorldSeries ~ SLG, family = "binomial", data = baseball)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9498 -0.6953 -0.6088 -0.5197 2.1136
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.200 2.358 1.357 0.1748
## SLG -11.130 5.689 -1.956 0.0504 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 239.12 on 243 degrees of freedom
## Residual deviance: 235.23 on 242 degrees of freedom
## AIC: 239.23
##
## Number of Fisher Scoring iterations: 4
SLG is not significant (p = 0.0504, just above the 0.05 cutoff)
model7<-glm(WorldSeries~BA, data=baseball, family="binomial")
summary(model7)
##
## Call:
## glm(formula = WorldSeries ~ BA, family = "binomial", data = baseball)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.6797 -0.6592 -0.6513 -0.6389 1.8431
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.6392 3.8988 -0.164 0.870
## BA -2.9765 14.6123 -0.204 0.839
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 239.12 on 243 degrees of freedom
## Residual deviance: 239.08 on 242 degrees of freedom
## AIC: 243.08
##
## Number of Fisher Scoring iterations: 4
BA is not significant
model8<-glm(WorldSeries~RankSeason, data=baseball, family="binomial")
summary(model8)
##
## Call:
## glm(formula = WorldSeries ~ RankSeason, family = "binomial",
## data = baseball)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.7805 -0.7131 -0.5918 -0.4882 2.1781
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.8256 0.3268 -2.527 0.0115 *
## RankSeason -0.2069 0.1027 -2.016 0.0438 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 239.12 on 243 degrees of freedom
## Residual deviance: 234.75 on 242 degrees of freedom
## AIC: 238.75
##
## Number of Fisher Scoring iterations: 4
RankSeason is significant!
model9<-glm(WorldSeries~OOBP, data=baseball, family="binomial")
summary(model9)
##
## Call:
## glm(formula = WorldSeries ~ OOBP, family = "binomial", data = baseball)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.5318 -0.5176 -0.5106 -0.5023 2.0697
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.9306 8.3728 -0.111 0.912
## OOBP -3.2233 26.0587 -0.124 0.902
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 84.926 on 113 degrees of freedom
## Residual deviance: 84.910 on 112 degrees of freedom
## (130 observations deleted due to missingness)
## AIC: 88.91
##
## Number of Fisher Scoring iterations: 4
OOBP is not significant
model10<-glm(WorldSeries~OSLG, data=baseball, family="binomial")
summary(model10)
##
## Call:
## glm(formula = WorldSeries ~ OSLG, family = "binomial", data = baseball)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.5610 -0.5209 -0.5088 -0.4902 2.1268
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.08725 6.07285 -0.014 0.989
## OSLG -4.65992 15.06881 -0.309 0.757
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 84.926 on 113 degrees of freedom
## Residual deviance: 84.830 on 112 degrees of freedom
## (130 observations deleted due to missingness)
## AIC: 88.83
##
## Number of Fisher Scoring iterations: 4
OSLG is not significant
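The OOBP and OSLG summaries report "130 observations deleted due to missingness", so those two models were fitted on 244 - 130 = 114 rows. Their AIC values (88.91 and 88.83) are therefore not comparable with the AIC values of the models fitted on all 244 rows. The sketch below shows how glm() drops rows with missing values, using a made-up data frame toy (an assumption for illustration).

```r
# Toy data with two missing predictor values (made up)
toy <- data.frame(
  y = c(0, 1, 0, 1, 0, 1, 0, 1, 1, 0),
  x = c(0.2, 0.5, NA, 0.9, 0.1, NA, 0.4, 0.3, 0.8, 0.6)
)

n_missing <- sum(is.na(toy$x))  # rows that glm() will drop
fit <- glm(y ~ x, data = toy, family = binomial)
n_used <- nobs(fit)             # rows actually used in the fit

n_missing
n_used
```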
model11<-glm(WorldSeries~NumCompetitors, data=baseball, family="binomial")
summary(model11)
##
## Call:
## glm(formula = WorldSeries ~ NumCompetitors, family = "binomial",
## data = baseball)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9871 -0.8017 -0.5089 -0.5089 2.2643
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.03868 0.43750 0.088 0.929559
## NumCompetitors -0.25220 0.07422 -3.398 0.000678 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 239.12 on 243 degrees of freedom
## Residual deviance: 226.96 on 242 degrees of freedom
## AIC: 230.96
##
## Number of Fisher Scoring iterations: 4
NumCompetitors is significant!
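To see what the negative NumCompetitors coefficient means, the fitted log-odds can be converted to probabilities with the inverse logit. This is a sketch that copies the two coefficients from the summary(model11) output; the variable names intercept, slope and prob_win are introduced here for illustration.

```r
# Coefficients copied from the summary(model11) output
intercept <- 0.03868
slope     <- -0.25220

num_competitors <- c(2, 4, 8, 10)

# Inverse logit: probability = 1 / (1 + exp(-(intercept + slope * x)))
prob_win <- plogis(intercept + slope * num_competitors)

data.frame(NumCompetitors = num_competitors,
           PredictedProb  = round(prob_win, 3))
```

The predicted probability of winning the World Series falls as the number of playoff teams grows, which is the direction we would expect.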
model12<-glm(WorldSeries~League, data=baseball, family="binomial")
summary(model12)
##
## Call:
## glm(formula = WorldSeries ~ League, family = "binomial", data = baseball)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.6772 -0.6772 -0.6306 -0.6306 1.8509
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.3558 0.2243 -6.045 1.5e-09 ***
## LeagueNL -0.1583 0.3252 -0.487 0.626
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 239.12 on 243 degrees of freedom
## Residual deviance: 238.88 on 242 degrees of freedom
## AIC: 242.88
##
## Number of Fisher Scoring iterations: 4
League is not significant
In short, the significant predictors are Year, RA, RankSeason, and NumCompetitors.
Multivariate Models for Predicting World Series Winner
In this section, we’ll consider multivariate models that combine the variables we found to be significant in bivariate models. Build a model using all of the variables that you found to be significant in the bivariate models. How many variables are significant in the combined model?
#How many variables are significant in the combined model?
LogModel = glm(WorldSeries ~ Year + RA + RankSeason + NumCompetitors, data=baseball, family=binomial)
summary(LogModel)
##
## Call:
## glm(formula = WorldSeries ~ Year + RA + RankSeason + NumCompetitors,
## family = binomial, data = baseball)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0336 -0.7689 -0.5139 -0.4583 2.2195
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 12.5874376 53.6474210 0.235 0.814
## Year -0.0061425 0.0274665 -0.224 0.823
## RA -0.0008238 0.0027391 -0.301 0.764
## RankSeason -0.0685046 0.1203459 -0.569 0.569
## NumCompetitors -0.1794264 0.1815933 -0.988 0.323
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 239.12 on 243 degrees of freedom
## Residual deviance: 226.37 on 239 degrees of freedom
## AIC: 236.37
##
## Number of Fisher Scoring iterations: 4
Looking at summary(LogModel), we can see that none of the variables are significant in the multivariate model. Often, variables that were significant in bivariate models are no longer significant in multivariate analysis due to correlation between the variables. Which of the following variable pairs have a high degree of correlation (a correlation greater than 0.8 or less than -0.8)?
#These 4 variables were the significant ones in the bivariate analysis
#We can compute all pair-wise correlations between these variables with:
cor(baseball[c("Year", "RA", "RankSeason", "NumCompetitors")])
## Year RA RankSeason NumCompetitors
## Year 1.0000000 0.4762422 0.3852191 0.9139548
## RA 0.4762422 1.0000000 0.3991413 0.5136769
## RankSeason 0.3852191 0.3991413 1.0000000 0.4247393
## NumCompetitors 0.9139548 0.5136769 0.4247393 1.0000000
While every pair was at least moderately correlated, the only strongly correlated pair was Year/NumCompetitors, with correlation coefficient 0.914.
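Instead of reading the correlation matrix by eye, the pairs beyond the 0.8 threshold can be picked out in code. The sketch below runs on simulated data: the data frame toy and its columns x1, x2, x3 are assumptions, with x2 built on purpose to be strongly correlated with x1.

```r
# Simulated data (NOT the baseball data)
set.seed(42)
shared <- rnorm(100)
toy <- data.frame(
  x1 = shared,
  x2 = shared + rnorm(100, sd = 0.1),  # strongly correlated with x1 by design
  x3 = rnorm(100)                      # independent of x1 and x2
)

cm <- cor(toy)

# Keep each pair once (upper triangle) and flag |correlation| > 0.8
high <- which(abs(cm) > 0.8 & upper.tri(cm), arr.ind = TRUE)

data.frame(
  var1 = rownames(cm)[high[, "row"]],
  var2 = colnames(cm)[high[, "col"]],
  cor  = cm[high]
)
```

Applied to the four baseball variables, the same filter would return only the Year/NumCompetitors pair.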
Let us build all six of the two variable models listed in the previous problem. Together with the four bivariate models that were significant, we should have 10 different logistic regression models to analyze. Which model has the best AIC value (the minimum AIC value)?
#The two-variable models can be built with the following commands:
model13 = glm(WorldSeries ~ Year + RA, data=baseball, family=binomial)
summary(model13)
##
## Call:
## glm(formula = WorldSeries ~ Year + RA, family = binomial, data = baseball)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0402 -0.6878 -0.5298 -0.4785 2.1370
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 63.610741 25.654830 2.479 0.0132 *
## Year -0.032084 0.013323 -2.408 0.0160 *
## RA -0.001766 0.002585 -0.683 0.4945
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 239.12 on 243 degrees of freedom
## Residual deviance: 227.88 on 241 degrees of freedom
## AIC: 233.88
##
## Number of Fisher Scoring iterations: 4
#In the next model (Year + RankSeason), Year is significant but RankSeason is not; we need both to be significant for the two-variable model to be worthwhile
model14 = glm(WorldSeries ~ Year + RankSeason, data=baseball, family=binomial)
summary(model14)
##
## Call:
## glm(formula = WorldSeries ~ Year + RankSeason, family = binomial,
## data = baseball)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0560 -0.6957 -0.5379 -0.4528 2.2673
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 63.64855 24.37063 2.612 0.00901 **
## Year -0.03254 0.01231 -2.643 0.00822 **
## RankSeason -0.10064 0.11352 -0.887 0.37534
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 239.12 on 243 degrees of freedom
## Residual deviance: 227.55 on 241 degrees of freedom
## AIC: 233.55
##
## Number of Fisher Scoring iterations: 4
#In the next model (Year + NumCompetitors), neither variable is significant
model15 = glm(WorldSeries ~ Year + NumCompetitors, data=baseball, family=binomial)
summary(model15)
##
## Call:
## glm(formula = WorldSeries ~ Year + NumCompetitors, family = binomial,
## data = baseball)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0050 -0.7823 -0.5115 -0.4970 2.2552
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 13.350467 53.481896 0.250 0.803
## Year -0.006802 0.027328 -0.249 0.803
## NumCompetitors -0.212610 0.175520 -1.211 0.226
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 239.12 on 243 degrees of freedom
## Residual deviance: 226.90 on 241 degrees of freedom
## AIC: 232.9
##
## Number of Fisher Scoring iterations: 4
#In the next model (RA + RankSeason), neither variable is significant
model16 = glm(WorldSeries ~ RA + RankSeason, data=baseball, family=binomial)
summary(model16)
##
## Call:
## glm(formula = WorldSeries ~ RA + RankSeason, family = binomial,
## data = baseball)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9374 -0.6933 -0.5936 -0.4564 2.1979
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.487461 1.506143 0.988 0.323
## RA -0.003815 0.002441 -1.563 0.118
## RankSeason -0.140824 0.110908 -1.270 0.204
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 239.12 on 243 degrees of freedom
## Residual deviance: 232.22 on 241 degrees of freedom
## AIC: 238.22
##
## Number of Fisher Scoring iterations: 4
#NumCompetitors is the only significant variable in the next model (RA + NumCompetitors)
model17 = glm(WorldSeries ~ RA + NumCompetitors, data=baseball, family=binomial)
summary(model17)
##
## Call:
## glm(formula = WorldSeries ~ RA + NumCompetitors, family = binomial,
## data = baseball)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0433 -0.7826 -0.5133 -0.4701 2.2208
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.716895 1.528736 0.469 0.63911
## RA -0.001233 0.002661 -0.463 0.64313
## NumCompetitors -0.229385 0.088399 -2.595 0.00946 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 239.12 on 243 degrees of freedom
## Residual deviance: 226.74 on 241 degrees of freedom
## AIC: 232.74
##
## Number of Fisher Scoring iterations: 4
#NumCompetitors is again significant in the next model (RankSeason + NumCompetitors), but once again the second variable is not
model18 = glm(WorldSeries ~ RankSeason + NumCompetitors, data=baseball, family=binomial)
summary(model18)
##
## Call:
## glm(formula = WorldSeries ~ RankSeason + NumCompetitors, family = binomial,
## data = baseball)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0090 -0.7592 -0.5204 -0.4501 2.2562
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.12277 0.45737 0.268 0.78837
## RankSeason -0.07697 0.11711 -0.657 0.51102
## NumCompetitors -0.22784 0.08201 -2.778 0.00546 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 239.12 on 243 degrees of freedom
## Residual deviance: 226.52 on 241 degrees of freedom
## AIC: 232.52
##
## Number of Fisher Scoring iterations: 4
None of the models with two independent variables had both variables significant, so none seem promising as compared to a simple bivariate model. Indeed the model with the lowest AIC value is the model with just NumCompetitors as the independent variable. This seems to confirm the claim made by Billy Beane in Moneyball that all that matters in the Playoffs is luck, since NumCompetitors has nothing to do with the quality of the teams!
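The AIC comparison can be checked by collecting the ten AIC values in one vector. The numbers below are copied from the model summaries in this report; the vector name aic and its labels are introduced here for illustration.

```r
# AIC values copied from the model summaries in this report
aic <- c(
  "Year"                        = 232.35,
  "RA"                          = 237.88,
  "RankSeason"                  = 238.75,
  "NumCompetitors"              = 230.96,
  "Year + RA"                   = 233.88,
  "Year + RankSeason"           = 233.55,
  "Year + NumCompetitors"       = 232.90,
  "RA + RankSeason"             = 238.22,
  "RA + NumCompetitors"         = 232.74,
  "RankSeason + NumCompetitors" = 232.52
)

sort(aic)                          # smallest (best) AIC first
best_model <- names(which.min(aic))
best_model
```

When the model objects are still in the workspace, calling AIC() on them returns the same values without retyping.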