So our first question is: how many games does a team need to win in the regular season to make it to the playoffs?
Let’s start by loading our data.
baseball<-read.csv("C:\\Users\\aman96\\Desktop\\the analytics edge\\unit 2\\baseball.csv", header = TRUE)
The first six rows of the data:
head(baseball)
## Team League Year RS RA W OBP SLG BA Playoffs RankSeason
## 1 ARI NL 2012 734 688 81 0.328 0.418 0.259 0 NA
## 2 ATL NL 2012 700 600 94 0.320 0.389 0.247 1 4
## 3 BAL AL 2012 712 705 93 0.311 0.417 0.247 1 5
## 4 BOS AL 2012 734 806 69 0.315 0.415 0.260 0 NA
## 5 CHC NL 2012 613 759 61 0.302 0.378 0.240 0 NA
## 6 CHW AL 2012 748 676 85 0.318 0.422 0.255 0 NA
## RankPlayoffs G OOBP OSLG
## 1 NA 162 0.317 0.415
## 2 5 162 0.306 0.378
## 3 4 162 0.315 0.403
## 4 NA 162 0.331 0.428
## 5 NA 162 0.335 0.424
## 6 NA 162 0.319 0.405
We can look at the structure of our data with the str function and get a summary of each variable with the summary function.
str(baseball)
## 'data.frame': 1232 obs. of 15 variables:
## $ Team : Factor w/ 39 levels "ANA","ARI","ATL",..: 2 3 4 5 7 8 9 10 11 12 ...
## $ League : Factor w/ 2 levels "AL","NL": 2 2 1 1 2 1 2 1 2 1 ...
## $ Year : int 2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
## $ RS : int 734 700 712 734 613 748 669 667 758 726 ...
## $ RA : int 688 600 705 806 759 676 588 845 890 670 ...
## $ W : int 81 94 93 69 61 85 97 68 64 88 ...
## $ OBP : num 0.328 0.32 0.311 0.315 0.302 0.318 0.315 0.324 0.33 0.335 ...
## $ SLG : num 0.418 0.389 0.417 0.415 0.378 0.422 0.411 0.381 0.436 0.422 ...
## $ BA : num 0.259 0.247 0.247 0.26 0.24 0.255 0.251 0.251 0.274 0.268 ...
## $ Playoffs : int 0 1 1 0 0 0 1 0 0 1 ...
## $ RankSeason : int NA 4 5 NA NA NA 2 NA NA 6 ...
## $ RankPlayoffs: int NA 5 4 NA NA NA 4 NA NA 2 ...
## $ G : int 162 162 162 162 162 162 162 162 162 162 ...
## $ OOBP : num 0.317 0.306 0.315 0.331 0.335 0.319 0.305 0.336 0.357 0.314 ...
## $ OSLG : num 0.415 0.378 0.403 0.428 0.424 0.405 0.39 0.43 0.47 0.402 ...
summary(baseball)
## Team League Year RS RA
## BAL : 47 AL:616 Min. :1962 Min. : 463.0 Min. : 472.0
## BOS : 47 NL:616 1st Qu.:1977 1st Qu.: 652.0 1st Qu.: 649.8
## CHC : 47 Median :1989 Median : 711.0 Median : 709.0
## CHW : 47 Mean :1989 Mean : 715.1 Mean : 715.1
## CIN : 47 3rd Qu.:2002 3rd Qu.: 775.0 3rd Qu.: 774.2
## CLE : 47 Max. :2012 Max. :1009.0 Max. :1103.0
## (Other):950
## W OBP SLG BA
## Min. : 40.0 Min. :0.2770 Min. :0.3010 Min. :0.2140
## 1st Qu.: 73.0 1st Qu.:0.3170 1st Qu.:0.3750 1st Qu.:0.2510
## Median : 81.0 Median :0.3260 Median :0.3960 Median :0.2600
## Mean : 80.9 Mean :0.3263 Mean :0.3973 Mean :0.2593
## 3rd Qu.: 89.0 3rd Qu.:0.3370 3rd Qu.:0.4210 3rd Qu.:0.2680
## Max. :116.0 Max. :0.3730 Max. :0.4910 Max. :0.2940
##
## Playoffs RankSeason RankPlayoffs G
## Min. :0.0000 Min. :1.000 Min. :1.000 Min. :158.0
## 1st Qu.:0.0000 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:162.0
## Median :0.0000 Median :3.000 Median :3.000 Median :162.0
## Mean :0.1981 Mean :3.123 Mean :2.717 Mean :161.9
## 3rd Qu.:0.0000 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:162.0
## Max. :1.0000 Max. :8.000 Max. :5.000 Max. :165.0
## NA's :988 NA's :988
## OOBP OSLG
## Min. :0.2940 Min. :0.3460
## 1st Qu.:0.3210 1st Qu.:0.4010
## Median :0.3310 Median :0.4190
## Mean :0.3323 Mean :0.4197
## 3rd Qu.:0.3430 3rd Qu.:0.4380
## Max. :0.3840 Max. :0.4990
## NA's :812 NA's :812
This data set includes an observation for every team and year pair from 1962 to 2012, for all seasons with 162 games. The G column varies slightly around that figure: its summary ranges from 158 to 165 games.
Next, we subset our data to include only the years before 2002.
moneyball = subset(baseball, Year < 2002)
str(moneyball)
## 'data.frame': 902 obs. of 15 variables:
## $ Team : Factor w/ 39 levels "ANA","ARI","ATL",..: 1 2 3 4 5 7 8 9 10 11 ...
## $ League : Factor w/ 2 levels "AL","NL": 1 2 2 1 1 2 1 2 1 2 ...
## $ Year : int 2001 2001 2001 2001 2001 2001 2001 2001 2001 2001 ...
## $ RS : int 691 818 729 687 772 777 798 735 897 923 ...
## $ RA : int 730 677 643 829 745 701 795 850 821 906 ...
## $ W : int 75 92 88 63 82 88 83 66 91 73 ...
## $ OBP : num 0.327 0.341 0.324 0.319 0.334 0.336 0.334 0.324 0.35 0.354 ...
## $ SLG : num 0.405 0.442 0.412 0.38 0.439 0.43 0.451 0.419 0.458 0.483 ...
## $ BA : num 0.261 0.267 0.26 0.248 0.266 0.261 0.268 0.262 0.278 0.292 ...
## $ Playoffs : int 0 1 1 0 0 0 0 0 1 0 ...
## $ RankSeason : int NA 5 7 NA NA NA NA NA 6 NA ...
## $ RankPlayoffs: int NA 1 3 NA NA NA NA NA 4 NA ...
## $ G : int 162 162 162 162 161 162 162 162 162 162 ...
## $ OOBP : num 0.331 0.311 0.314 0.337 0.329 0.321 0.334 0.341 0.341 0.35 ...
## $ OSLG : num 0.412 0.404 0.384 0.439 0.393 0.398 0.427 0.455 0.417 0.48 ...
If we take a look at the structure of moneyball, we can see that we now have 902 observations of the same 15 variables.
So we want to build a linear regression equation to predict wins using the difference between runs scored and runs allowed.
To make this a little easier, we will create a new variable, moneyball$RD (run difference).
moneyball$RD = moneyball$RS - moneyball$RA
str(moneyball)
## 'data.frame': 902 obs. of 16 variables:
## $ Team : Factor w/ 39 levels "ANA","ARI","ATL",..: 1 2 3 4 5 7 8 9 10 11 ...
## $ League : Factor w/ 2 levels "AL","NL": 1 2 2 1 1 2 1 2 1 2 ...
## $ Year : int 2001 2001 2001 2001 2001 2001 2001 2001 2001 2001 ...
## $ RS : int 691 818 729 687 772 777 798 735 897 923 ...
## $ RA : int 730 677 643 829 745 701 795 850 821 906 ...
## $ W : int 75 92 88 63 82 88 83 66 91 73 ...
## $ OBP : num 0.327 0.341 0.324 0.319 0.334 0.336 0.334 0.324 0.35 0.354 ...
## $ SLG : num 0.405 0.442 0.412 0.38 0.439 0.43 0.451 0.419 0.458 0.483 ...
## $ BA : num 0.261 0.267 0.26 0.248 0.266 0.261 0.268 0.262 0.278 0.292 ...
## $ Playoffs : int 0 1 1 0 0 0 0 0 1 0 ...
## $ RankSeason : int NA 5 7 NA NA NA NA NA 6 NA ...
## $ RankPlayoffs: int NA 1 3 NA NA NA NA NA 4 NA ...
## $ G : int 162 162 162 162 161 162 162 162 162 162 ...
## $ OOBP : num 0.331 0.311 0.314 0.337 0.329 0.321 0.334 0.341 0.341 0.35 ...
## $ OSLG : num 0.412 0.404 0.384 0.439 0.393 0.398 0.427 0.455 0.417 0.48 ...
## $ RD : int -39 141 86 -142 27 76 3 -115 76 17 ...
Let’s visually check to see if there’s a linear relationship between Run Difference and Wins. We’ll do that by creating a scatter plot with the plot function.
plot(moneyball$RD, moneyball$W, xlab="Run Difference", ylab="Wins", pch=19, col="red", main="RD Vs Wins")
The plot shows a very strong linear relationship between these two variables.
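As a rough numeric companion to the scatter plot, the sketch below rebuilds RD from the ten RS and RA values printed in the str(moneyball) output above and computes its correlation with W. It uses only those ten rows, so the result is illustrative and will not match the full 902-row data set; the variable names rs, ra, w, and rd are our own.

```r
# First ten rows of moneyball, copied from the str() output above.
rs <- c(691, 818, 729, 687, 772, 777, 798, 735, 897, 923)  # runs scored
ra <- c(730, 677, 643, 829, 745, 701, 795, 850, 821, 906)  # runs allowed
w  <- c(75, 92, 88, 63, 82, 88, 83, 66, 91, 73)            # wins

# Run difference, computed the same way as moneyball$RD.
rd <- rs - ra

# Pearson correlation between run difference and wins for these ten rows.
r <- cor(rd, w)
print(rd)
print(r)
```

On the full data, the equivalent check would be cor(moneyball$RD, moneyball$W).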
Now let’s build a linear regression model that predicts wins from run difference.
WinsReg = lm(W ~ RD, data=moneyball)
summary(WinsReg)
##
## Call:
## lm(formula = W ~ RD, data = moneyball)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.2662 -2.6509 0.1234 2.9364 11.6570
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 80.881375 0.131157 616.67 <2e-16 ***
## RD 0.105766 0.001297 81.55 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.939 on 900 degrees of freedom
## Multiple R-squared: 0.8808, Adjusted R-squared: 0.8807
## F-statistic: 6651 on 1 and 900 DF, p-value: < 2.2e-16
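If we look at the summary of our regression, we see that RD is highly significant (three stars), and the R-squared of our model is 0.88. The fitted equation is W = 80.88 + 0.1058 × RD: a team with a run difference of zero is expected to win about 81 games, and each additional run of difference is worth roughly 0.106 wins.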
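To make the fitted line concrete, here is a small sketch that plugs numbers into it. The coefficients are copied from the summary above; the helper names predict_wins and rd_for_wins are our own, and the 95-win target is purely a hypothetical threshold chosen for illustration, not a number derived from this data.

```r
# Coefficients copied from summary(WinsReg) above.
intercept <- 80.881375
slope     <- 0.105766

# Expected wins for a given run difference.
predict_wins <- function(rd) intercept + slope * rd

# Run difference needed to reach a given number of expected wins.
rd_for_wins <- function(wins) (wins - intercept) / slope

# First row of moneyball (ANA, 2001): RD = -39, actual W = 75.
pred <- predict_wins(-39)      # about 76.8 expected wins

# Hypothetical target of 95 wins (an assumed threshold, not from the data).
rd_needed <- rd_for_wins(95)   # about 133.5 runs

print(c(pred = pred, rd_needed = rd_needed))
```

With the model object in memory, predict(WinsReg, newdata = data.frame(RD = -39)) should give the same prediction up to rounding.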
Now we need to know how many runs a team will score, which we’ll show can be predicted with batting statistics, and how many runs a team will allow, which we’ll show can be predicted using fielding and pitching statistics. Let’s start by creating a linear regression model to predict runs scored.
We want to see if we can use linear regression to predict runs scored (RS) using three hitting statistics: on-base percentage (OBP), slugging percentage (SLG), and batting average (BA).
RunsReg = lm(RS ~ OBP + SLG + BA, data=moneyball)
summary(RunsReg)
##
## Call:
## lm(formula = RS ~ OBP + SLG + BA, data = moneyball)
##
## Residuals:
## Min 1Q Median 3Q Max
## -70.941 -17.247 -0.621 16.754 90.998
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -788.46 19.70 -40.029 < 2e-16 ***
## OBP 2917.42 110.47 26.410 < 2e-16 ***
## SLG 1637.93 45.99 35.612 < 2e-16 ***
## BA -368.97 130.58 -2.826 0.00482 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.69 on 898 degrees of freedom
## Multiple R-squared: 0.9302, Adjusted R-squared: 0.93
## F-statistic: 3989 on 3 and 898 DF, p-value: < 2.2e-16
If we take a look at the summary of our regression equation, we can see that all of our independent variables are significant, and our R-squared is 0.93. But if we look at our coefficients, we can see that the coefficient for batting average is negative.
This implies that, all else being equal, a team with a lower batting average will score more runs, which is a little counterintuitive. What’s going on here is a case of multicollinearity. These three hitting statistics are highly correlated, so it’s hard to interpret the coefficients of our model. Let’s try removing batting average, the variable with the least significance, to see what happens to our model.
RunsReg = lm(RS ~ OBP + SLG, data=moneyball)
summary(RunsReg)
##
## Call:
## lm(formula = RS ~ OBP + SLG, data = moneyball)
##
## Residuals:
## Min 1Q Median 3Q Max
## -70.838 -17.174 -1.108 16.770 90.036
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -804.63 18.92 -42.53 <2e-16 ***
## OBP 2737.77 90.68 30.19 <2e-16 ***
## SLG 1584.91 42.16 37.60 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.79 on 899 degrees of freedom
## Multiple R-squared: 0.9296, Adjusted R-squared: 0.9294
## F-statistic: 5934 on 2 and 899 DF, p-value: < 2.2e-16
We can see that our independent variables are still very significant, the coefficients are both positive as we expect, and our R-squared is still about 0.93. So this model is simpler, with only two independent variables, and has about the same R-squared. Overall, it is a better model.
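The multicollinearity mentioned earlier can be checked directly with a correlation matrix. The sketch below uses only the ten OBP, SLG, and BA values printed in the str(moneyball) output, so it is a small illustration rather than the full-sample result; the data frame name hitting is our own.

```r
# First ten rows of the three hitting statistics, copied from str(moneyball).
hitting <- data.frame(
  OBP = c(0.327, 0.341, 0.324, 0.319, 0.334, 0.336, 0.334, 0.324, 0.350, 0.354),
  SLG = c(0.405, 0.442, 0.412, 0.380, 0.439, 0.430, 0.451, 0.419, 0.458, 0.483),
  BA  = c(0.261, 0.267, 0.260, 0.248, 0.266, 0.261, 0.268, 0.262, 0.278, 0.292)
)

# Pairwise Pearson correlations; off-diagonal values near 1 indicate
# predictors that carry overlapping information.
cm <- cor(hitting)
print(round(cm, 3))
```

On the full data, cor(moneyball[, c("OBP", "SLG", "BA")]) gives the corresponding matrix.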
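So by using linear regression, we’re able to verify the claims made in Moneyball: batting average is overvalued (dropping it leaves the R-squared at about 0.93), on-base percentage is the most important (OBP and SLG are on similar scales, and OBP has the larger coefficient, about 2738 versus 1585), and slugging percentage is also important for predicting runs scored.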
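As a quick illustration of how the final model is used, the sketch below plugs one team's statistics into the fitted equation. The coefficients come from the two-variable summary above and the inputs are the first row of moneyball (ANA, 2001); the function name predict_runs and the other variable names are our own.

```r
# Coefficients copied from summary(RunsReg) for RS ~ OBP + SLG.
b0    <- -804.63
b_obp <- 2737.77
b_slg <- 1584.91

# Expected runs scored for given on-base and slugging percentages.
predict_runs <- function(obp, slg) b0 + b_obp * obp + b_slg * slg

# ANA 2001: OBP = 0.327, SLG = 0.405, actual RS = 691.
pred <- predict_runs(0.327, 0.405)   # about 732.5 runs
residual_runs <- 691 - pred          # actual minus predicted

# Effect of a 0.010 increase in each statistic, holding the other fixed.
gain_obp <- b_obp * 0.010   # about 27.4 runs
gain_slg <- b_slg * 0.010   # about 15.8 runs

print(c(pred = pred, residual = residual_runs,
        gain_obp = gain_obp, gain_slg = gain_slg))
```

The prediction misses the actual total by roughly 40 runs, which is within the range of residuals reported in the summary.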