# This code reads a CSV file named "baseball.csv" into a data frame called baseball and then prints its structure using str() function.
# baseball = read.csv("baseball.csv"): Reads the CSV file "baseball.csv" from the current working directory (or a specified path) into a data frame named baseball. read.csv() reads comma-separated files in R.
# str(baseball): Prints the structure of the baseball data frame: the column names, data types, and first few values of each column.
# After running these commands, R will display the structure of your baseball data frame, showing you the column names, data types, and a preview of the data contained within it.
baseball = read.csv("baseball.csv")
str(baseball)
'data.frame': 1232 obs. of 15 variables:
$ Team : chr "ARI" "ATL" "BAL" "BOS" ...
$ League : chr "NL" "NL" "AL" "AL" ...
$ Year : int 2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
$ RS : int 734 700 712 734 613 748 669 667 758 726 ...
$ RA : int 688 600 705 806 759 676 588 845 890 670 ...
$ W : int 81 94 93 69 61 85 97 68 64 88 ...
$ OBP : num 0.328 0.32 0.311 0.315 0.302 0.318 0.315 0.324 0.33 0.335 ...
$ SLG : num 0.418 0.389 0.417 0.415 0.378 0.422 0.411 0.381 0.436 0.422 ...
$ BA : num 0.259 0.247 0.247 0.26 0.24 0.255 0.251 0.251 0.274 0.268 ...
$ Playoffs : int 0 1 1 0 0 0 1 0 0 1 ...
$ RankSeason : int NA 4 5 NA NA NA 2 NA NA 6 ...
$ RankPlayoffs: int NA 5 4 NA NA NA 4 NA NA 2 ...
$ G : int 162 162 162 162 162 162 162 162 162 162 ...
$ OOBP : num 0.317 0.306 0.315 0.331 0.335 0.319 0.305 0.336 0.357 0.314 ...
$ OSLG : num 0.415 0.378 0.403 0.428 0.424 0.405 0.39 0.43 0.47 0.402 ...
# The dataset contains 1,232 observations (rows) and 15 variables (columns).
1. Team: A code for the name of the team
2. League: The Major League Baseball league the team belongs to, either AL (American League) or NL (National League)
3. Year: The year of the corresponding record
4. RS: The number of runs scored by the team in that year
5. RA: The number of runs allowed by the team in that year
6. W: The number of regular season wins by the team in that year
7. OBP: The on-base percentage of the team in that year
8. SLG: The slugging percentage of the team in that year
9. BA: The batting average of the team in that year
10. Playoffs: Whether the team made the playoffs in that year (1 for yes, 0 for no)
11. RankSeason: Among the playoff teams in that year, the ranking of their regular season records (1 is best)
12. RankPlayoffs: Among the playoff teams in that year, how well they fared in the playoffs. The team winning the World Series gets a RankPlayoffs of 1.
13. G: The number of games a team played in that year
14. OOBP: The team’s opponents’ on-base percentage in that year
15. OSLG: The team’s opponents’ slugging percentage in that year
#Each row in the baseball dataset represents a team in a particular year. How many team/year pairs are there in the whole dataset?
# nrow(baseball) returns the number of rows in the baseball data frame, i.e. the number of team/year pairs.
# In this case, 1232.
nrow(baseball)
[1] 1232
#Though the dataset contains data from 1962 until 2012, we removed several years with shorter-than-usual seasons. Using the table() function, identify the total number of years included in this dataset.
# baseball$Year selects the column named "Year" from the baseball data frame.
# table() function computes the frequency of each unique value in the selected column.
# The output is the frequency table of the unique values (years) found in the "Year" column of your baseball dataset.
# Years: The years range from 1962 to 2012.
# Counts: The number under each year is how many observations (teams) the dataset has for that year.
# 1962 to 1968 have 20 observations each.
# 1969 to 1976 have 24 observations each.
# 1977 to 1992 have 26 observations each.
# 1993, 1996, and 1997 have 28 observations each.
# 1998 to 2012 have 30 observations each.
table(baseball$Year)
1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1973 1974 1975 1976 1977 1978 1979 1980 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006
20 20 20 20 20 20 20 24 24 24 24 24 24 24 26 26 26 26 26 26 26 26 26 26 26 26 26 26 26 28 28 28 30 30 30 30 30 30 30 30 30
2007 2008 2009 2010 2011 2012
30 30 30 30 30 30
#The baseball dataset contains 47 years (1972, 1981, 1994, and 1995 are missing). We can count the number of years in the table, or use the command length(table(baseball$Year)) directly to get the answer.
# table(baseball$Year): This function generates a frequency table of the unique values found in the "Year" column of your baseball dataset. It counts how many times each unique year appears.
# length(): This function computes the number of elements in an object. When applied to the result of table(baseball$Year), it counts how many unique years are listed in the table.
# There are 47 unique years represented in the baseball dataset, ranging from 1962 to 2012.
length(table(baseball$Year))
[1] 47
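As a cross-check on the counting step, here is a minimal sketch on a small made-up vector (the name `years` and its values are hypothetical, not taken from baseball.csv). It compares length(table(...)) with length(unique(...)), which should agree whenever the column has no missing values.

```r
# Toy vector standing in for baseball$Year (values are made up for illustration)
years <- c(1962, 1962, 1963, 1963, 1963, 1965)

# Approach used in the text: one table entry per distinct year
n_from_table <- length(table(years))

# Alternative: drop duplicates first, then count
n_from_unique <- length(unique(years))

print(c(table = n_from_table, unique = n_from_unique))
```

The two can differ when the column contains NA: table() drops NA by default, while unique() keeps it as a value.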
Limiting to Teams Making the Playoffs
Because we’re only analyzing teams that made the playoffs, we can use the subset() function to replace baseball with a data frame limited to teams that made the playoffs (so our subsetted data frame should still be called “baseball”). How many team/year pairs are included in the new dataset?
# baseball = subset(baseball, Playoffs == 1): This line creates a new data frame baseball that contains only the rows where Playoffs is equal to 1.
baseball = subset(baseball, Playoffs == 1)
# nrow(baseball) now gives the number of rows where Playoffs equals 1, i.e. the number of team/year pairs that made the playoffs. In this case, 244.
nrow(baseball)
[1] 244
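The filtering step can also be written with logical indexing. The sketch below uses a made-up data frame (`toy` and its team codes are hypothetical) to show that both forms keep the same rows.

```r
# Toy data frame (made-up values) with the same column name used in the text
toy <- data.frame(
  Team     = c("AAA", "BBB", "CCC", "DDD"),
  Playoffs = c(0, 1, 1, 0)
)

# subset() keeps rows where the condition is TRUE
playoff_a <- subset(toy, Playoffs == 1)

# Logical indexing with the same condition
playoff_b <- toy[toy$Playoffs == 1, ]

print(nrow(playoff_a))
print(playoff_a$Team)
print(playoff_b$Team)
```

One difference worth knowing: subset() silently drops rows where the condition is NA, whereas logical indexing returns a row of NAs for each of them.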
#Through the years, different numbers of teams have been invited to the playoffs.
# The table(baseball$Year) function creates a frequency table that counts the occurrences of each unique value in the "Year" column of the baseball data frame.
# The output of table(baseball$Year) will provide a count of how many observations (teams) correspond to each unique year in your subsetted data frame.
table(baseball$Year)
1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1973 1974 1975 1976 1977 1978 1979 1980 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006
2 2 2 2 2 2 2 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 8 8 8 8 8 8 8 8 8 8 8
2007 2008 2009 2010 2011 2012
8 8 8 8 8 10
The number of teams in the postseason has changed.
# table(baseball$Year): The inner table() counts how many playoff teams appear in each year.
# The outer table() then counts how many years share each playoff size: 7 years had 2 playoff teams, 23 years had 4, 16 years had 8, and 1 year had 10.
table(table(baseball$Year))
2 4 8 10
7 23 16 1
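To make the nested call easier to follow, here is a small sketch on made-up data (the vector `years` is hypothetical): two years with 2 rows each and one year with 4 rows.

```r
# Made-up years: two years with 2 rows each, one year with 4 rows
years <- c(rep(1962, 2), rep(1963, 2), rep(1969, 4))

per_year <- table(years)      # rows per year
sizes    <- table(per_year)   # how many years share each row count

print(per_year)
print(sizes)
```

Here `sizes` reports that two years have a count of 2 and one year has a count of 4, which is the same reading applied to the playoff output in the text.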
Adding an Important Predictor
It’s much harder to win the World Series if there are 10 teams competing for the championship versus just two. Therefore, we will add the predictor variable NumCompetitors to the baseball data frame. NumCompetitors will contain the number of total teams making the playoffs in the year of a particular team/year pair. For instance, NumCompetitors should be 2 for the 1962 New York Yankees, but it should be 8 for the 1998 Boston Red Sox.
#We start by storing the output of the table() function that counts the number of playoff teams from each year:
# This line creates a frequency table (PlayoffTable) that counts the occurrences of each unique value (year) in the "Year" column of the baseball data frame. It calculates how many teams made the playoffs in each year.
PlayoffTable = table(baseball$Year)
#You can output the table with the following command:
# PlayoffTable in R will display the contents of the PlayoffTable object, which is the frequency table created in the previous step.
PlayoffTable
1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1973 1974 1975 1976 1977 1978 1979 1980 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006
2 2 2 2 2 2 2 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 8 8 8 8 8 8 8 8 8 8 8
2007 2008 2009 2010 2011 2012
8 8 8 8 8 10
# The names() function retrieves the names (or keys) of the elements in an object. For a named vector like PlayoffTable, names(PlayoffTable) will return the names associated with each element in the vector.
# The str() function in R is used to compactly display the internal structure of an R object. When applied to names(PlayoffTable), it will provide information about the structure of the names within the PlayoffTable.
str(names(PlayoffTable))
chr [1:47] "1962" "1963" "1964" "1965" "1966" "1967" "1968" "1969" "1970" "1971" "1973" "1974" "1975" "1976" "1977" "1978" "1979" "1980" "1982" "1983" "1984" "1985" "1986" "1987" "1988" "1989" "1990" ...
#Which function call returns the number of playoff teams in 1990 and 2001?
# 4 for 1990 and 8 for 2001
# This code snippet subsets the PlayoffTable vector to include only the values corresponding to the years "1990" and "2001".
PlayoffTable[c("1990", "2001")]
1990 2001
4 8
#Putting it all together, we want to look up the number of teams in the playoffs for each team/year pair in the dataset, and store it as a new variable named NumCompetitors in the baseball data frame.
# baseball$NumCompetitors = PlayoffTable[as.character(baseball$Year)] efficiently assigns playoff competitor counts to each row in baseball based on the corresponding year using indexing with PlayoffTable.
baseball$NumCompetitors = PlayoffTable[as.character(baseball$Year)]
baseball$NumCompetitors
[1] 10 10 10 10 10 10 10 10 10 10 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
[69] 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
[137] 8 8 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
[205] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 2 2 2 2 2 2 2 2 2 2 2 2 2 2
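The name-based lookup can be tried on a small made-up data frame (`toy` is hypothetical). The as.numeric() wrapper is an optional addition, not part of the original command; it stores a plain numeric column instead of a table-class object.

```r
# Toy playoff rows (made-up): one row per team/year pair
toy <- data.frame(Year = c(2012, 2012, 2012, 1962, 1962))

counts <- table(toy$Year)   # names are the years as character strings

# Index by name: as.character() turns 2012 into "2012", so the lookup is by
# label rather than by position
toy$NumCompetitors <- as.numeric(counts[as.character(toy$Year)])

print(toy)
```

Without as.character(), an index such as counts[2012] would be read as position 2012 in the table and return NA.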
#How many playoff team/year pairs are there in our dataset from years where 8 teams were invited to the playoffs?
# table(baseball$NumCompetitors) counts how many team/year pairs fall under each playoff size. There are 128 team/year pairs from years in which 8 teams were invited to the playoffs.
table(baseball$NumCompetitors)
2 4 8 10
14 92 128 10
Bivariate Models for Predicting World Series Winner
In this problem, we seek to predict whether a team won the World Series; in our dataset this is denoted with a RankPlayoffs value of 1. Add a variable named WorldSeries to the baseball data frame, by typing the following command in your R console:
baseball$WorldSeries = as.numeric(baseball$RankPlayoffs == 1)
WorldSeries takes value 1 if a team won the World Series in the indicated year and a 0 otherwise.
#How many observations do we have in our dataset where a team did NOT win the World Series?
# baseball$WorldSeries = as.numeric(baseball$RankPlayoffs == 1) creates a binary indicator for winning the World Series, and table(baseball$WorldSeries) summarizes it.
# 197 team/year pairs have a value of 0, meaning they did not win the World Series.
# 47 team/year pairs have a value of 1, meaning they won the World Series.
baseball$WorldSeries = as.numeric(baseball$RankPlayoffs == 1)
table(baseball$WorldSeries)
0 1
197 47
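A short sketch of how the indicator behaves, using a made-up vector (`rank_playoffs` is hypothetical) that includes a missing rank. In the playoff-only data frame every row has a rank, so the NA case only matters if the command is run before subsetting.

```r
# Made-up playoff ranks; NA stands for a team with no playoff rank
rank_playoffs <- c(1, 3, NA, 2, 1)

# The comparison gives TRUE/FALSE/NA; as.numeric() maps these to 1/0/NA
world_series <- as.numeric(rank_playoffs == 1)

print(world_series)
print(table(world_series, useNA = "ifany"))
```

table() hides NA values unless useNA is set, so the counts can look complete even when some rows are missing.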
Bivariate Models for Predicting World Series Winner
When we’re not sure which of our variables are useful in predicting a particular outcome, it’s often helpful to build bivariate models, which are models that predict the outcome using a single independent variable. Which of the following variables is a significant predictor of the WorldSeries variable in a bivariate logistic regression model? To determine significance, remember to look at the stars in the summary output of the model. We’ll define an independent variable as significant if there is at least one star at the end of the coefficients row for that variable (this is equivalent to the probability column having a value smaller than 0.05). Note that you have to build 12 models to answer this question! Use the entire dataset baseball to build the models.
#Which of the following variables is a significant predictor of the WorldSeries variable in a bivariate logistic regression model?
#Variables to use as predictors for each bivariate model (Year, RS, RA, W, OBP, SLG, BA, RankSeason, OOBP, OSLG, NumCompetitors, League)
# The glm() function with family = "binomial" fits a logistic regression model.
# summary() reports the model coefficients, their significance, and fit statistics.
# Year is a statistically significant predictor of the log-odds of a team winning the World Series (p = 0.00115).
# As Year increases, the log-odds of WorldSeries decrease: a given playoff team has become less likely to win the World Series over time.
# Note that the number of teams going to the playoffs has increased over the years, so the Year effect is confounded with the size of the playoff field.
model1<-glm(WorldSeries~Year, data=baseball, family="binomial")
summary(model1)
Call:
glm(formula = WorldSeries ~ Year, family = "binomial", data = baseball)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 72.23602 22.64409 3.19 0.00142 **
Year -0.03700 0.01138 -3.25 0.00115 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 239.12 on 243 degrees of freedom
Residual deviance: 228.35 on 242 degrees of freedom
AIC: 232.35
Number of Fisher Scoring iterations: 4
Year is a significant predictor!
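To read the Year coefficient on a more familiar scale, it can be exponentiated into an odds ratio. This is a minimal sketch that copies the estimate from the summary(model1) output; the variable names are illustrative only.

```r
# Coefficient for Year copied from the summary(model1) output
beta_year <- -0.03700

odds_ratio_1yr  <- exp(beta_year)        # multiplicative change in odds per year
odds_ratio_10yr <- exp(10 * beta_year)   # same change accumulated over a decade

print(round(c(one_year = odds_ratio_1yr, ten_years = odds_ratio_10yr), 3))
```

An odds ratio below 1 means the odds of winning the World Series shrink as Year increases, which matches the negative sign of the coefficient.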
# The logistic regression model model2 predicts WorldSeries (whether a team won the World Series) from RS (runs scored).
# RS is not a significant predictor of the log-odds of winning the World Series (p = 0.201).
model2<-glm(WorldSeries~RS, data=baseball, family="binomial")
summary(model2)
Call:
glm(formula = WorldSeries ~ RS, family = "binomial", data = baseball)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.661226 1.636494 0.404 0.686
RS -0.002681 0.002098 -1.278 0.201
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 239.12 on 243 degrees of freedom
Residual deviance: 237.45 on 242 degrees of freedom
AIC: 241.45
Number of Fisher Scoring iterations: 4
RS is not significant
# The logistic regression model model3 predicts WorldSeries (whether a team won the World Series) from RA (runs allowed).
# RA is a significant predictor of the log-odds of winning the World Series (p = 0.0262).
model3<-glm(WorldSeries~RA, data=baseball, family="binomial")
summary(model3)
Call:
glm(formula = WorldSeries ~ RA, family = "binomial", data = baseball)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.888174 1.483831 1.272 0.2032
RA -0.005053 0.002273 -2.223 0.0262 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 239.12 on 243 degrees of freedom
Residual deviance: 233.88 on 242 degrees of freedom
AIC: 237.88
Number of Fisher Scoring iterations: 4
RA is a significant predictor!
# The logistic regression model model4 predicts WorldSeries (whether a team won the World Series) from W (wins).
# W falls just short of significance at the 0.05 level (p = 0.0577), so by our definition it is not a significant predictor.
model4<-glm(WorldSeries~W, data=baseball, family="binomial")
summary(model4)
Call:
glm(formula = WorldSeries ~ W, family = "binomial", data = baseball)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.85568 2.87620 -2.384 0.0171 *
W 0.05671 0.02988 1.898 0.0577 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 239.12 on 243 degrees of freedom
Residual deviance: 235.51 on 242 degrees of freedom
AIC: 239.51
Number of Fisher Scoring iterations: 4
W is not significant
# The logistic regression model model5 predicts WorldSeries (whether a team won the World Series) from OBP (on-base percentage).
# OBP is not a significant predictor of the log-odds of winning the World Series (p = 0.296).
model5<-glm(WorldSeries~OBP, data=baseball, family="binomial")
summary(model5)
Call:
glm(formula = WorldSeries ~ OBP, family = "binomial", data = baseball)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.741 3.989 0.687 0.492
OBP -12.402 11.865 -1.045 0.296
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 239.12 on 243 degrees of freedom
Residual deviance: 238.02 on 242 degrees of freedom
AIC: 242.02
Number of Fisher Scoring iterations: 4
OBP is not significant
# The logistic regression model model6 predicts WorldSeries (whether a team won the World Series) from SLG (slugging percentage).
# SLG falls just short of significance at the 0.05 level (p = 0.0504), so by our definition it is not a significant predictor.
model6<-glm(WorldSeries~SLG, data=baseball, family="binomial")
summary(model6)
Call:
glm(formula = WorldSeries ~ SLG, family = "binomial", data = baseball)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.200 2.358 1.357 0.1748
SLG -11.130 5.689 -1.956 0.0504 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 239.12 on 243 degrees of freedom
Residual deviance: 235.23 on 242 degrees of freedom
AIC: 239.23
Number of Fisher Scoring iterations: 4
SLG is not significant
# The logistic regression model model7 predicts WorldSeries (whether a team won the World Series) from BA (batting average).
# BA is not a significant predictor of the log-odds of winning the World Series (p = 0.839).
model7<-glm(WorldSeries~BA, data=baseball, family="binomial")
summary(model7)
Call:
glm(formula = WorldSeries ~ BA, family = "binomial", data = baseball)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.6392 3.8988 -0.164 0.870
BA -2.9765 14.6123 -0.204 0.839
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 239.12 on 243 degrees of freedom
Residual deviance: 239.08 on 242 degrees of freedom
AIC: 243.08
Number of Fisher Scoring iterations: 4
BA is not significant
# The logistic regression model model8 predicts WorldSeries (whether a team won the World Series) from RankSeason.
# RankSeason is a significant predictor of the log-odds of winning the World Series (p = 0.0438).
model8<-glm(WorldSeries~RankSeason, data=baseball, family="binomial")
summary(model8)
Call:
glm(formula = WorldSeries ~ RankSeason, family = "binomial",
data = baseball)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.8256 0.3268 -2.527 0.0115 *
RankSeason -0.2069 0.1027 -2.016 0.0438 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 239.12 on 243 degrees of freedom
Residual deviance: 234.75 on 242 degrees of freedom
AIC: 238.75
Number of Fisher Scoring iterations: 4
RankSeason is significant!
# The logistic regression model model9 predicts WorldSeries (whether a team won the World Series) from OOBP (opponents’ on-base percentage).
# OOBP is not a significant predictor of the log-odds of winning the World Series (p = 0.902). Note that 130 observations were dropped because OOBP is missing.
model9<-glm(WorldSeries~OOBP, data=baseball, family="binomial")
summary(model9)
Call:
glm(formula = WorldSeries ~ OOBP, family = "binomial", data = baseball)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.9306 8.3728 -0.111 0.912
OOBP -3.2233 26.0587 -0.124 0.902
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 84.926 on 113 degrees of freedom
Residual deviance: 84.910 on 112 degrees of freedom
(130 observations deleted due to missingness)
AIC: 88.91
Number of Fisher Scoring iterations: 4
OOBP is not significant
# The logistic regression model model10 predicts WorldSeries (whether a team won the World Series) from OSLG (opponents’ slugging percentage).
# OSLG is not a significant predictor of the log-odds of winning the World Series (p = 0.757). Note that 130 observations were dropped because OSLG is missing.
model10<-glm(WorldSeries~OSLG, data=baseball, family="binomial")
summary(model10)
Call:
glm(formula = WorldSeries ~ OSLG, family = "binomial", data = baseball)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.08725 6.07285 -0.014 0.989
OSLG -4.65992 15.06881 -0.309 0.757
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 84.926 on 113 degrees of freedom
Residual deviance: 84.830 on 112 degrees of freedom
(130 observations deleted due to missingness)
AIC: 88.83
Number of Fisher Scoring iterations: 4
OSLG is not significant
# The logistic regression model model11 predicts WorldSeries (whether a team won the World Series) from NumCompetitors.
# NumCompetitors is a highly significant predictor of the log-odds of winning the World Series (p = 0.000678).
model11<-glm(WorldSeries~NumCompetitors, data=baseball, family="binomial")
summary(model11)
Call:
glm(formula = WorldSeries ~ NumCompetitors, family = "binomial",
data = baseball)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.03868 0.43750 0.088 0.929559
NumCompetitors -0.25220 0.07422 -3.398 0.000678 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 239.12 on 243 degrees of freedom
Residual deviance: 226.96 on 242 degrees of freedom
AIC: 230.96
Number of Fisher Scoring iterations: 4
NumCompetitors is significant!
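The fitted coefficients can be turned into predicted probabilities with the inverse logit. The sketch below copies the two estimates from the summary(model11) output and evaluates them at the four playoff sizes seen in the data; the variable names are illustrative.

```r
# Coefficients copied from the summary(model11) output
intercept <- 0.03868
slope     <- -0.25220

num_competitors <- c(2, 4, 8, 10)

# Linear predictor (log-odds), then the inverse logit via plogis()
log_odds <- intercept + slope * num_competitors
prob     <- plogis(log_odds)

print(round(data.frame(num_competitors, log_odds, prob), 3))
```

The predicted probability falls as the playoff field grows, and at 8 teams it comes out at about 0.12, close to the 1-in-8 chance a team would have if the winner were picked at random. With the fitted object available, predict(model11, newdata = data.frame(NumCompetitors = num_competitors), type = "response") should give the same values up to rounding of the copied coefficients.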
# The logistic regression model model12 predicts WorldSeries (whether a team won the World Series) from League.
# League is not a significant predictor of the log-odds of winning the World Series (p = 0.626).
model12<-glm(WorldSeries~League, data=baseball, family="binomial")
summary(model12)
Call:
glm(formula = WorldSeries ~ League, family = "binomial", data = baseball)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.3558 0.2243 -6.045 1.5e-09 ***
LeagueNL -0.1583 0.3252 -0.487 0.626
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 239.12 on 243 degrees of freedom
Residual deviance: 238.88 on 242 degrees of freedom
AIC: 242.88
Number of Fisher Scoring iterations: 4
League is not significant
In short, the significant predictors are Year, RA, RankSeason, and NumCompetitors.
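Fitting twelve single-predictor models by hand is repetitive, so the same pattern can be written as a loop. This is a sketch on simulated data, not the baseball dataset: the data frame `sim` and the columns `x1`, `x2`, `g`, and `y` are all hypothetical.

```r
set.seed(1)  # simulated data only; not the baseball dataset
n <- 200
sim <- data.frame(
  x1 = rnorm(n),
  x2 = rnorm(n),
  g  = sample(c("A", "B"), n, replace = TRUE)
)
sim$y <- rbinom(n, 1, plogis(-1 + 0.8 * sim$x1))

predictors <- c("x1", "x2", "g")

# Fit one single-predictor logistic regression per variable and keep the
# p-value from the second coefficient row (the predictor, not the intercept)
p_values <- sapply(predictors, function(v) {
  fit <- glm(reformulate(v, response = "y"), data = sim, family = binomial)
  coef(summary(fit))[2, "Pr(>|z|)"]
})

print(round(p_values, 4))
print(names(p_values)[p_values < 0.05])
```

For the real data, `predictors` would hold the twelve variable names listed in the text and the response would be "WorldSeries". Taking row 2 only works for numeric predictors and two-level factors such as League; a factor with more levels produces several coefficient rows.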
Multivariate Models for Predicting World Series Winner
In this section, we’ll consider multivariate models that combine the variables we found to be significant in bivariate models. Build a model using all of the variables that you found to be significant in the bivariate models. How many variables are significant in the combined model?
#How many variables are significant in the combined model?
# The logistic regression model LogModel predicts WorldSeries (whether a team won the World Series) from Year, RA, RankSeason, and NumCompetitors.
# None of the variables is a significant predictor of the log-odds of winning the World Series in the combined model.
LogModel = glm(WorldSeries ~ Year + RA + RankSeason + NumCompetitors, data=baseball, family=binomial)
summary(LogModel)
Call:
glm(formula = WorldSeries ~ Year + RA + RankSeason + NumCompetitors,
family = binomial, data = baseball)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 12.5874376 53.6474210 0.235 0.814
Year -0.0061425 0.0274665 -0.224 0.823
RA -0.0008238 0.0027391 -0.301 0.764
RankSeason -0.0685046 0.1203459 -0.569 0.569
NumCompetitors -0.1794264 0.1815933 -0.988 0.323
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 239.12 on 243 degrees of freedom
Residual deviance: 226.37 on 239 degrees of freedom
AIC: 236.37
Number of Fisher Scoring iterations: 4
Looking at summary(LogModel), we can see that none of the variables are significant in the multivariate model. Often, variables that were significant in bivariate models are no longer significant in multivariate analysis due to correlation between the variables. Which of the following variable pairs have a high degree of correlation (a correlation greater than 0.8 or less than -0.8)?
#We can compute all pair-wise correlations between these variables with:
# The diagonal elements (top-left to bottom-right) represent the correlation of each variable with itself, which is always 1.
# Off-diagonal elements represent the correlation coefficients between pairs of variables. These coefficients range between -1 (perfect negative correlation) and 1 (perfect positive correlation).
# The closer the coefficient is to 1 or -1, the stronger the correlation between the variables. A coefficient close to 0 indicates little to no linear relationship.
cor(baseball[c("Year", "RA", "RankSeason", "NumCompetitors")])
Year RA RankSeason NumCompetitors
Year 1.0000000 0.4762422 0.3852191 0.9139548
RA 0.4762422 1.0000000 0.3991413 0.5136769
RankSeason 0.3852191 0.3991413 1.0000000 0.4247393
NumCompetitors 0.9139548 0.5136769 0.4247393 1.0000000
While every pair was at least moderately correlated, the only strongly correlated pair was Year/NumCompetitors, with correlation coefficient 0.914.
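Scanning a correlation matrix by eye gets harder as variables are added, so the threshold check can be automated. The sketch below runs on simulated data (the data frame `sim` and its columns `a`, `b`, `c` are hypothetical), with `b` built to track `a` closely.

```r
set.seed(2)  # simulated data only
a <- rnorm(100)
sim <- data.frame(a = a, b = a + rnorm(100, sd = 0.1), c = rnorm(100))

cm <- cor(sim)

# Keep each pair once (upper triangle) and flag |r| > 0.8
high <- which(abs(cm) > 0.8 & upper.tri(cm), arr.ind = TRUE)
flagged <- data.frame(
  var1 = rownames(cm)[high[, "row"]],
  var2 = colnames(cm)[high[, "col"]],
  r    = cm[high]
)

print(flagged)
```

Restricting to upper.tri() excludes the diagonal (always 1) and avoids listing each pair twice.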
Let us build all six of the two variable models listed in the previous problem. Together with the four bivariate models that were significant, we should have 10 different logistic regression models to analyze. Which model has the best AIC value (the minimum AIC value)?
#The two-variable models can be built with the following commands:
# The logistic regression model model13 predicts WorldSeries (whether a team won the World Series) from Year and RA.
# Year is a significant predictor of the log-odds of winning the World Series (p = 0.0160).
# RA is not a significant predictor (p = 0.4945).
model13 = glm(WorldSeries ~ Year + RA, data=baseball, family=binomial)
summary(model13)
Call:
glm(formula = WorldSeries ~ Year + RA, family = binomial, data = baseball)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 63.610741 25.654830 2.479 0.0132 *
Year -0.032084 0.013323 -2.408 0.0160 *
RA -0.001766 0.002585 -0.683 0.4945
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 239.12 on 243 degrees of freedom
Residual deviance: 227.88 on 241 degrees of freedom
AIC: 233.88
Number of Fisher Scoring iterations: 4
# The logistic regression model model14 predicts WorldSeries (whether a team won the World Series) from Year and RankSeason.
# Year is a significant predictor of the log-odds of winning the World Series (p = 0.00822).
# RankSeason is not a significant predictor (p = 0.37534).
model14 = glm(WorldSeries ~ Year + RankSeason, data=baseball, family=binomial)
summary(model14)
Call:
glm(formula = WorldSeries ~ Year + RankSeason, family = binomial,
data = baseball)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 63.64855 24.37063 2.612 0.00901 **
Year -0.03254 0.01231 -2.643 0.00822 **
RankSeason -0.10064 0.11352 -0.887 0.37534
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 239.12 on 243 degrees of freedom
Residual deviance: 227.55 on 241 degrees of freedom
AIC: 233.55
Number of Fisher Scoring iterations: 4
# The logistic regression model model15 predicts WorldSeries (whether a team won the World Series) from Year and NumCompetitors.
# Neither variable is a significant predictor of the log-odds of winning the World Series; the two are highly correlated with each other (0.914).
model15 = glm(WorldSeries ~ Year + NumCompetitors, data=baseball, family=binomial)
summary(model15)
Call:
glm(formula = WorldSeries ~ Year + NumCompetitors, family = binomial,
data = baseball)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 13.350467 53.481896 0.250 0.803
Year -0.006802 0.027328 -0.249 0.803
NumCompetitors -0.212610 0.175520 -1.211 0.226
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 239.12 on 243 degrees of freedom
Residual deviance: 226.90 on 241 degrees of freedom
AIC: 232.9
Number of Fisher Scoring iterations: 4
# The logistic regression model model16 predicts WorldSeries (whether a team won the World Series) from RA and RankSeason.
# Neither variable is a significant predictor of the log-odds of winning the World Series.
model16 = glm(WorldSeries ~ RA + RankSeason, data=baseball, family=binomial)
summary(model16)
Call:
glm(formula = WorldSeries ~ RA + RankSeason, family = binomial,
data = baseball)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.487461 1.506143 0.988 0.323
RA -0.003815 0.002441 -1.563 0.118
RankSeason -0.140824 0.110908 -1.270 0.204
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 239.12 on 243 degrees of freedom
Residual deviance: 232.22 on 241 degrees of freedom
AIC: 238.22
Number of Fisher Scoring iterations: 4
# The logistic regression model model17 predicts WorldSeries (whether a team won the World Series) from RA and NumCompetitors.
# RA is not a significant predictor of the log-odds of winning the World Series (p = 0.64313).
# NumCompetitors is a significant predictor (p = 0.00946).
model17 = glm(WorldSeries ~ RA + NumCompetitors, data=baseball, family=binomial)
summary(model17)
Call:
glm(formula = WorldSeries ~ RA + NumCompetitors, family = binomial,
data = baseball)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.716895 1.528736 0.469 0.63911
RA -0.001233 0.002661 -0.463 0.64313
NumCompetitors -0.229385 0.088399 -2.595 0.00946 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 239.12 on 243 degrees of freedom
Residual deviance: 226.74 on 241 degrees of freedom
AIC: 232.74
Number of Fisher Scoring iterations: 4
# The logistic regression model model18 predicts WorldSeries (whether a team won the World Series) from RankSeason and NumCompetitors.
# RankSeason is not a significant predictor of the log-odds of winning the World Series (p = 0.51102).
# NumCompetitors is a significant predictor (p = 0.00546).
model18 = glm(WorldSeries ~ RankSeason + NumCompetitors, data=baseball, family=binomial)
summary(model18)
Call:
glm(formula = WorldSeries ~ RankSeason + NumCompetitors, family = binomial,
data = baseball)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.12277 0.45737 0.268 0.78837
RankSeason -0.07697 0.11711 -0.657 0.51102
NumCompetitors -0.22784 0.08201 -2.778 0.00546 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 239.12 on 243 degrees of freedom
Residual deviance: 226.52 on 241 degrees of freedom
AIC: 232.52
Number of Fisher Scoring iterations: 4
None of the models with two independent variables had both variables significant, so none seem promising as compared to a simple bivariate model. Indeed the model with the lowest AIC value is the model with just NumCompetitors as the independent variable. This seems to confirm the claim made by Billy Beane in Moneyball that all that matters in the Playoffs is luck, since NumCompetitors has nothing to do with the quality of the teams!
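Rather than reading AIC off each summary, the fitted models can be collected in a list and compared in one step. This is a sketch on simulated data, not the baseball dataset: `sim`, `x1`, `x2`, and `y` are hypothetical names.

```r
set.seed(3)  # simulated data only
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- rbinom(n, 1, plogis(-1 + 0.8 * sim$x1))

# Candidate models kept in a named list so the names label the results
models <- list(
  x1    = glm(y ~ x1,      data = sim, family = binomial),
  x2    = glm(y ~ x2,      data = sim, family = binomial),
  x1_x2 = glm(y ~ x1 + x2, data = sim, family = binomial)
)

# AIC() extracts the same value printed by summary()
aics <- sapply(models, AIC)

print(sort(aics))
print(names(which.min(aics)))
```

AIC values are only comparable between models fitted to the same observations. That holds for the ten models compared here, but not for the OOBP and OSLG models, which dropped 130 rows with missing values.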