Module 3 Activity

getwd()
## [1] "/cloud/project"
# watch video #2

#load dataset
baseball = read.csv("baseball.csv")
str(baseball)
## 'data.frame':    1232 obs. of  15 variables:
##  $ Team        : chr  "ARI" "ATL" "BAL" "BOS" ...
##  $ League      : chr  "NL" "NL" "AL" "AL" ...
##  $ Year        : int  2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
##  $ RS          : int  734 700 712 734 613 748 669 667 758 726 ...
##  $ RA          : int  688 600 705 806 759 676 588 845 890 670 ...
##  $ W           : int  81 94 93 69 61 85 97 68 64 88 ...
##  $ OBP         : num  0.328 0.32 0.311 0.315 0.302 0.318 0.315 0.324 0.33 0.335 ...
##  $ SLG         : num  0.418 0.389 0.417 0.415 0.378 0.422 0.411 0.381 0.436 0.422 ...
##  $ BA          : num  0.259 0.247 0.247 0.26 0.24 0.255 0.251 0.251 0.274 0.268 ...
##  $ Playoffs    : int  0 1 1 0 0 0 1 0 0 1 ...
##  $ RankSeason  : int  NA 4 5 NA NA NA 2 NA NA 6 ...
##  $ RankPlayoffs: int  NA 5 4 NA NA NA 4 NA NA 2 ...
##  $ G           : int  162 162 162 162 162 162 162 162 162 162 ...
##  $ OOBP        : num  0.317 0.306 0.315 0.331 0.335 0.319 0.305 0.336 0.357 0.314 ...
##  $ OSLG        : num  0.415 0.378 0.403 0.428 0.424 0.405 0.39 0.43 0.47 0.402 ...
#Subsetting to include only the years before 2002 (we're calling it moneyball--based on the movie)

# str = "structure"

moneyball = subset(baseball, Year < 2002)
str(moneyball)
## 'data.frame':    902 obs. of  15 variables:
##  $ Team        : chr  "ANA" "ARI" "ATL" "BAL" ...
##  $ League      : chr  "AL" "NL" "NL" "AL" ...
##  $ Year        : int  2001 2001 2001 2001 2001 2001 2001 2001 2001 2001 ...
##  $ RS          : int  691 818 729 687 772 777 798 735 897 923 ...
##  $ RA          : int  730 677 643 829 745 701 795 850 821 906 ...
##  $ W           : int  75 92 88 63 82 88 83 66 91 73 ...
##  $ OBP         : num  0.327 0.341 0.324 0.319 0.334 0.336 0.334 0.324 0.35 0.354 ...
##  $ SLG         : num  0.405 0.442 0.412 0.38 0.439 0.43 0.451 0.419 0.458 0.483 ...
##  $ BA          : num  0.261 0.267 0.26 0.248 0.266 0.261 0.268 0.262 0.278 0.292 ...
##  $ Playoffs    : int  0 1 1 0 0 0 0 0 1 0 ...
##  $ RankSeason  : int  NA 5 7 NA NA NA NA NA 6 NA ...
##  $ RankPlayoffs: int  NA 1 3 NA NA NA NA NA 4 NA ...
##  $ G           : int  162 162 162 162 161 162 162 162 162 162 ...
##  $ OOBP        : num  0.331 0.311 0.314 0.337 0.329 0.321 0.334 0.341 0.341 0.35 ...
##  $ OSLG        : num  0.412 0.404 0.384 0.439 0.393 0.398 0.427 0.455 0.417 0.48 ...
#Difference in RUNS (runs difference)
# RS = runs scored
# RA = runs allowed

moneyball$RD = moneyball$RS - moneyball$RA
str(moneyball)
## 'data.frame':    902 obs. of  16 variables:
##  $ Team        : chr  "ANA" "ARI" "ATL" "BAL" ...
##  $ League      : chr  "AL" "NL" "NL" "AL" ...
##  $ Year        : int  2001 2001 2001 2001 2001 2001 2001 2001 2001 2001 ...
##  $ RS          : int  691 818 729 687 772 777 798 735 897 923 ...
##  $ RA          : int  730 677 643 829 745 701 795 850 821 906 ...
##  $ W           : int  75 92 88 63 82 88 83 66 91 73 ...
##  $ OBP         : num  0.327 0.341 0.324 0.319 0.334 0.336 0.334 0.324 0.35 0.354 ...
##  $ SLG         : num  0.405 0.442 0.412 0.38 0.439 0.43 0.451 0.419 0.458 0.483 ...
##  $ BA          : num  0.261 0.267 0.26 0.248 0.266 0.261 0.268 0.262 0.278 0.292 ...
##  $ Playoffs    : int  0 1 1 0 0 0 0 0 1 0 ...
##  $ RankSeason  : int  NA 5 7 NA NA NA NA NA 6 NA ...
##  $ RankPlayoffs: int  NA 1 3 NA NA NA NA NA 4 NA ...
##  $ G           : int  162 162 162 162 161 162 162 162 162 162 ...
##  $ OOBP        : num  0.331 0.311 0.314 0.337 0.329 0.321 0.334 0.341 0.341 0.35 ...
##  $ OSLG        : num  0.412 0.404 0.384 0.439 0.393 0.398 0.427 0.455 0.417 0.48 ...
##  $ RD          : int  -39 141 86 -142 27 76 3 -115 76 17 ...
#Scatterplot to look for a linear relationship

plot(moneyball$RD, moneyball$W)

# Linea regression model to predict future wins based on this set "moneyball"

WinsReg = lm(W ~ RD, data=moneyball)
summary(WinsReg)
## 
## Call:
## lm(formula = W ~ RD, data = moneyball)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.2662  -2.6509   0.1234   2.9364  11.6570 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 80.881375   0.131157  616.67   <2e-16 ***
## RD           0.105766   0.001297   81.55   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.939 on 900 degrees of freedom
## Multiple R-squared:  0.8808, Adjusted R-squared:  0.8807 
## F-statistic:  6651 on 1 and 900 DF,  p-value: < 2.2e-16
# IS THE VARIABLE SIGNIFICANT? USE P-VALUE

# p-value less than 0.05
# p-value small, we rehject the null
# runs difference DOES significantly affect the wins 

#TO COMPUTE: USE THE ESTIMATE

# +10 runs diffrence means +1 win 
# (estimate 0.105766)
# Regression model to predict runs scored
RunsReg = lm(RS ~ OBP + SLG + BA, data=moneyball)
summary(RunsReg)
## 
## Call:
## lm(formula = RS ~ OBP + SLG + BA, data = moneyball)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -70.941 -17.247  -0.621  16.754  90.998 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -788.46      19.70 -40.029  < 2e-16 ***
## OBP          2917.42     110.47  26.410  < 2e-16 ***
## SLG          1637.93      45.99  35.612  < 2e-16 ***
## BA           -368.97     130.58  -2.826  0.00482 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24.69 on 898 degrees of freedom
## Multiple R-squared:  0.9302, Adjusted R-squared:   0.93 
## F-statistic:  3989 on 3 and 898 DF,  p-value: < 2.2e-16
#Significance of variable? (p-value)

# p-value smaller than 0.05 yes
# reject null
# yes the variable is significant

#Compute with estimate

# 3 different variables
#high correlations between OBP and batting average, means no need to have both, remove one

# HOW DO WE KNOW THIS??
# BA has a NEGATIVE IMPACT (estimate) on runs scored, which makes NO SENSE logically
# it also highly relates to OBP we know this based on COMMON SENSE
# so we remove BA due to multi-colliniarity
# 2nd Regression model to predict runs scored (WITH BA REMOVED DUE TO CORRELATION)
RunsReg = lm(RS ~ OBP + SLG + BA, data=moneyball)
summary(RunsReg)
## 
## Call:
## lm(formula = RS ~ OBP + SLG + BA, data = moneyball)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -70.941 -17.247  -0.621  16.754  90.998 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -788.46      19.70 -40.029  < 2e-16 ***
## OBP          2917.42     110.47  26.410  < 2e-16 ***
## SLG          1637.93      45.99  35.612  < 2e-16 ***
## BA           -368.97     130.58  -2.826  0.00482 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24.69 on 898 degrees of freedom
## Multiple R-squared:  0.9302, Adjusted R-squared:   0.93 
## F-statistic:  3989 on 3 and 898 DF,  p-value: < 2.2e-16
# Regression model to predict runs allowed
RunsAllowedReg = lm(RA ~ OOBP + OSLG, data=moneyball)
summary(RunsAllowedReg)
## 
## Call:
## lm(formula = RA ~ OOBP + OSLG, data = moneyball)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -82.397 -15.178  -0.129  17.679  60.955 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -837.38      60.26 -13.897  < 2e-16 ***
## OOBP         2913.60     291.97   9.979 4.46e-16 ***
## OSLG         1514.29     175.43   8.632 2.55e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.67 on 87 degrees of freedom
##   (812 observations deleted due to missingness)
## Multiple R-squared:  0.9073, Adjusted R-squared:  0.9052 
## F-statistic: 425.8 on 2 and 87 DF,  p-value: < 2.2e-16
# install.packages("car")
# library(car)

# HIT NO ON RE INSTALL AND WAIT A MIN OR TWO
# vif(RunsReg)

In Class Activity 7 Question

If a baseball team scores 763 runs and allows 614 runs, how many games do we expect the team to win?

Using the linear regression model constructed during the lecture, enter the number of games we expect the team to win:

Numberofwins = 80.88 +0.106 * (763-614)
Numberofwins
## [1] 96.674
# A team with 763 RS and 614 RA is expected to win 96 games

In Class Activity 8 Question

Exercise 1

If a baseball team’s OBP is 0.361 and SLG is 0.409, how many runs do we expect the team to score?

Using the linear regression model constructed during the lecture (the one that uses OBP and SLG as independent variables), find the number of runs we expect the team to score:

Expectedruns = -804.63 + 2737.77 * 0.361 + 1584.29 * 0.409
Expectedruns
## [1] 831.6796
# A team with OBP 0.361 and SLG 0.409 is expected to make 831 runs

Exercise 2

If a baseball team’s opponents OBP (OOBP) is 0.267 and opponents SLG (OSLG) is 0.392, how many runs do we expect the team to allow?

Using the linear regression model discussed during the lecture (the one on the last slide of the previous video), find the number of runs we expect the team

Expectedrunsallowed = -837.38 + 2913.60 * 0.267 + 1514.29 * 0.392
Expectedrunsallowed
## [1] 534.1529
# The team is expected to make 534 runs