# Read in data
baseball = read.csv("baseball.csv")
str(baseball)

## 'data.frame':    1232 obs. of  15 variables:
##  $ Team        : chr  "ARI" "ATL" "BAL" "BOS" ...
##  $ League      : chr  "NL" "NL" "AL" "AL" ...
##  $ Year        : int  2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
##  $ RS          : int  734 700 712 734 613 748 669 667 758 726 ...
##  $ RA          : int  688 600 705 806 759 676 588 845 890 670 ...
##  $ W           : int  81 94 93 69 61 85 97 68 64 88 ...
##  $ OBP         : num  0.328 0.32 0.311 0.315 0.302 0.318 0.315 0.324 0.33 0.335 ...
##  $ SLG         : num  0.418 0.389 0.417 0.415 0.378 0.422 0.411 0.381 0.436 0.422 ...
##  $ BA          : num  0.259 0.247 0.247 0.26 0.24 0.255 0.251 0.251 0.274 0.268 ...
##  $ Playoffs    : int  0 1 1 0 0 0 1 0 0 1 ...
##  $ RankSeason  : int  NA 4 5 NA NA NA 2 NA NA 6 ...
##  $ RankPlayoffs: int  NA 5 4 NA NA NA 4 NA NA 2 ...
##  $ G           : int  162 162 162 162 162 162 162 162 162 162 ...
##  $ OOBP        : num  0.317 0.306 0.315 0.331 0.335 0.319 0.305 0.336 0.357 0.314 ...
##  $ OSLG        : num  0.415 0.378 0.403 0.428 0.424 0.405 0.39 0.43 0.47 0.402 ...

#Each row in the baseball dataset represents a team in a particular year. How many team/year pairs are there in the whole dataset?

nrow(baseball)

## [1] 1232

There are 1,232 pairs within the entire dataset.

#Though the dataset contains data from 1962 until 2012, we removed several years with shorter-than-usual seasons. Using the table() function, identify the total number of years included in this dataset.
table(baseball$Year)

## 
## 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1973 1974 1975 1976 1977 1978 
##   20   20   20   20   20   20   20   24   24   24   24   24   24   24   26   26 
## 1979 1980 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1996 1997 
##   26   26   26   26   26   26   26   26   26   26   26   26   26   28   28   28 
## 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 
##   30   30   30   30   30   30   30   30   30   30   30   30   30   30   30

The baseball dataset contains 47 years. The years 1972, 1981, 1994, and 1995 are missing; these are the years that were removed for their shorter-than-usual seasons.

#We can count the number of years in the table, or use the command length(table(baseball$Year)) directly to get the answer.
length(table(baseball$Year))

## [1] 47
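As a minimal sketch of the same idea, the gaps can also be listed programmatically rather than read off the table by eye. The vector years below is a hypothetical stand-in for baseball$Year, so the values are illustrative only.

```r
# Hypothetical stand-in for baseball$Year (not the real data)
years <- c(1970, 1970, 1971, 1973, 1973, 1974)

# Number of distinct years; gives the same count as length(table(years))
n_years <- length(unique(years))

# Years inside the observed range that never appear in the data
missing_years <- setdiff(min(years):max(years), unique(years))

n_years
missing_years
```

On the real data, the same setdiff() call with baseball$Year in place of years should list the years that were dropped.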

Limiting to Teams Making the Playoffs

Because we’re only analyzing teams that made the playoffs, we can use the subset() function to replace baseball with a data frame limited to teams that made the playoffs (so our subset data frame should still be called “baseball”).

How many team/year pairs are included in the new dataset?

There are 244 pairs in the new dataset, as the code below shows.

baseball = subset(baseball, Playoffs == 1)
nrow(baseball)

## [1] 244
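For comparison, here is a small sketch of how subset() relates to plain bracket indexing. The data frame toy and its contents are made up for illustration; the point is that subset() silently drops rows where the condition is NA, which which() reproduces with bracket indexing.

```r
# Hypothetical toy data frame (not the baseball data)
toy <- data.frame(
  Team = c("A", "B", "C", "D"),
  Playoffs = c(1, 0, 1, NA)
)

# subset() keeps rows where the condition is TRUE and drops FALSE and NA rows
by_subset <- subset(toy, Playoffs == 1)

# Equivalent bracket indexing; which() discards the NA comparison
by_index <- toy[which(toy$Playoffs == 1), ]

nrow(by_subset)
```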

#Through the years, different numbers of teams have been invited to the playoffs.

table(baseball$Year)

## 
## 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1973 1974 1975 1976 1977 1978 
##    2    2    2    2    2    2    2    4    4    4    4    4    4    4    4    4 
## 1979 1980 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1996 1997 
##    4    4    4    4    4    4    4    4    4    4    4    4    4    4    8    8 
## 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 
##    8    8    8    8    8    8    8    8    8    8    8    8    8    8   10

Adding an Important Predictor

It’s much harder to win the World Series if there are 10 teams competing for the championship versus just two. Therefore, we will add the predictor variable NumCompetitors to the baseball data frame. NumCompetitors will contain the number of total teams making the playoffs in the year of a particular team/year pair. For instance, NumCompetitors should be 2 for the 1962 New York Yankees, but it should be 8 for the 1998 Boston Red Sox.

#We start by storing the output of the table() function that counts the number of playoff teams from each year:

PlayoffTable = table(baseball$Year)

#You can output the table with the following command:

PlayoffTable

## 
## 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1973 1974 1975 1976 1977 1978 
##    2    2    2    2    2    2    2    4    4    4    4    4    4    4    4    4 
## 1979 1980 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1996 1997 
##    4    4    4    4    4    4    4    4    4    4    4    4    4    4    8    8 
## 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 
##    8    8    8    8    8    8    8    8    8    8    8    8    8    8   10

str(names(PlayoffTable))

##  chr [1:47] "1962" "1963" "1964" "1965" "1966" "1967" "1968" "1969" "1970" ...

#Which function call returns the number of playoff teams in 1990 and 2001?
PlayoffTable[c("1990", "2001")]

## 
## 1990 2001 
##    4    8

#Putting it all together, we want to look up the number of teams in the playoffs for each team/year pair in the dataset, and store it as a new variable named NumCompetitors in the baseball data frame.

baseball$NumCompetitors = PlayoffTable[as.character(baseball$Year)]
baseball$NumCompetitors

##   [1] 10 10 10 10 10 10 10 10 10 10  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8
##  [26]  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8
##  [51]  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8
##  [76]  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8
## [101]  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8
## [126]  8  8  8  8  8  8  8  8  8  8  8  8  8  4  4  4  4  4  4  4  4  4  4  4  4
## [151]  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4
## [176]  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4
## [201]  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4
## [226]  4  4  4  4  4  2  2  2  2  2  2  2  2  2  2  2  2  2  2
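The lookup above works because a table can be indexed by name. The following sketch shows the mechanism on a tiny made-up vector; years, counts, and num_competitors are hypothetical names used only for this illustration. The as.character() call matters: indexing with the number 1962 would ask for position 1962 in the table, not for the entry named "1962".

```r
# Hypothetical playoff years, one entry per team/year pair
years <- c(1962, 1962, 1998, 1998, 1998, 2012)

# Count how many rows fall in each year; the names are the years as strings
counts <- table(years)

# Look up each row's year by name, then drop the table attributes
# to get a plain numeric vector
num_competitors <- as.numeric(counts[as.character(years)])

num_competitors
```

Wrapping the result in as.numeric() is optional, but it stores an ordinary numeric column rather than a table object in the data frame.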

#How many observations do we have in our dataset where a team did NOT win the World Series?

baseball$WorldSeries = as.numeric(baseball$RankPlayoffs == 1)
table(baseball$WorldSeries)

## 
##   0   1 
## 197  47

There are a total of 197 observations where a team didn’t win the World Series, as shown by the code block above.
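A brief sketch of how the indicator is built: comparing with == yields a logical vector, and as.numeric() turns TRUE/FALSE into 1/0. The vector rank_playoffs is a made-up example; it includes an NA to show that missing ranks would stay missing, which is not a concern here because every playoff team in the subset has a rank.

```r
# Hypothetical playoff ranks (not the real data); NA marks a missing rank
rank_playoffs <- c(1, 4, 2, NA)

# TRUE/FALSE/NA becomes 1/0/NA
world_series <- as.numeric(rank_playoffs == 1)

# useNA = "ifany" makes any missing values visible in the tabulation
table(world_series, useNA = "ifany")
```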

#Which of the following variables is a significant predictor of the WorldSeries variable in a bivariate logistic regression model?
#Variables to use as predictors for each bivariate model: Year, RS, RA, W, OBP, SLG, BA, RankSeason, OOBP, OSLG, NumCompetitors, League

# NOTE: For ease of reading, I only included the models whose results I wanted to comment on, but all 18 models were run and reviewed.

model1<-glm(WorldSeries~Year, data=baseball, family="binomial")
summary(model1)

## 
## Call:
## glm(formula = WorldSeries ~ Year, family = "binomial", data = baseball)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept) 72.23602   22.64409    3.19  0.00142 **
## Year        -0.03700    0.01138   -3.25  0.00115 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 239.12  on 243  degrees of freedom
## Residual deviance: 228.35  on 242  degrees of freedom
## AIC: 232.35
## 
## Number of Fisher Scoring iterations: 4

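In this case, “Year” is a significant predictor (p = 0.00115). Its negative coefficient means the odds of a playoff team winning the World Series are lower in later years.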

model2<-glm(WorldSeries~RS, data=baseball, family="binomial")
summary(model2)

## 
## Call:
## glm(formula = WorldSeries ~ RS, family = "binomial", data = baseball)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept)  0.661226   1.636494   0.404    0.686
## RS          -0.002681   0.002098  -1.278    0.201
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 239.12  on 243  degrees of freedom
## Residual deviance: 237.45  on 242  degrees of freedom
## AIC: 241.45
## 
## Number of Fisher Scoring iterations: 4

As we can see from the code block above, RS (runs scored) is not significant (p = 0.201).

model3<-glm(WorldSeries~RA, data=baseball, family="binomial")
summary(model3)

## 
## Call:
## glm(formula = WorldSeries ~ RA, family = "binomial", data = baseball)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  1.888174   1.483831   1.272   0.2032  
## RA          -0.005053   0.002273  -2.223   0.0262 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 239.12  on 243  degrees of freedom
## Residual deviance: 233.88  on 242  degrees of freedom
## AIC: 237.88
## 
## Number of Fisher Scoring iterations: 4

In the code block above, we can see that RA (runs allowed) is significant at the 5% level (p = 0.0262).

model11<-glm(WorldSeries~NumCompetitors, data=baseball, family="binomial")
summary(model11)

## 
## Call:
## glm(formula = WorldSeries ~ NumCompetitors, family = "binomial", 
##     data = baseball)
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     0.03868    0.43750   0.088 0.929559    
## NumCompetitors -0.25220    0.07422  -3.398 0.000678 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 239.12  on 243  degrees of freedom
## Residual deviance: 226.96  on 242  degrees of freedom
## AIC: 230.96
## 
## Number of Fisher Scoring iterations: 4
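To make the size of this effect concrete, the fitted coefficients can be turned into predicted probabilities with the logistic function. The sketch below plugs the intercept and slope from the summary above into plogis(); the helper name p_win is an assumption introduced for this example.

```r
# Coefficients copied from the NumCompetitors model summary above
intercept <- 0.03868
slope <- -0.25220

# Hypothetical helper: predicted probability of winning the World Series
# for a playoff field of n teams, P = 1 / (1 + exp(-(intercept + slope * n)))
p_win <- function(n) plogis(intercept + slope * n)

p_two <- p_win(2)   # playoff field of 2 teams
p_ten <- p_win(10)  # playoff field of 10 teams

round(c(p_two, p_ten), 3)
```

Under this model, a team in a two-team playoff has a predicted probability of about 0.39, while a team in a ten-team playoff has about 0.08.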

The three statistically significant predictors are “Year”, “Runs Allowed”, and “Number of Competitors”. If we stopped here, the only team statistic that appears relevant to World Series chances would be runs allowed, whose negative coefficient means that teams allowing fewer runs have higher odds of winning. This could suggest that a strong pitching staff matters more than a strong offensive lineup.

None of the models with two independent variables had both variables significant, so none looks more promising than a simple bivariate model. Indeed, the model with the lowest AIC value (230.96) is the one with NumCompetitors as its only independent variable.
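As a rough sketch of how a batch of bivariate models could be fit and compared without writing each call by hand, the loop below fits one logistic regression per predictor and collects its p-value and AIC. The data frame toy and its columns x1, x2, and y are simulated placeholders, not the baseball data, so the numbers it produces say nothing about the results above.

```r
set.seed(1)

# Simulated placeholder data: y depends on x1 but not on x2
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- rbinom(n, 1, plogis(-1 + toy$x1))

predictors <- c("x1", "x2")

# Fit y ~ predictor for each name and keep the slope p-value and the AIC
results <- do.call(rbind, lapply(predictors, function(p) {
  fit <- glm(reformulate(p, response = "y"), data = toy, family = "binomial")
  data.frame(
    predictor = p,
    p_value = coef(summary(fit))[p, "Pr(>|z|)"],
    AIC = AIC(fit)
  )
}))

# Lowest AIC first
results[order(results$AIC), ]
```

With the real data, predictors would hold the column names listed earlier and the response would be "WorldSeries". A categorical predictor such as League would need its coefficient row looked up by the dummy-variable name rather than by the column name.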
