Linear Regression Model on the Baseball Dataset

So our first question is: how many games does a team need to win in the regular season to make it to the playoffs?

Let’s start by loading our data.

baseball<-read.csv("C:\\Users\\aman96\\Desktop\\the analytics edge\\unit 2\\baseball.csv", header = TRUE)
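The absolute Windows path above only resolves on the author's machine. Here is a minimal sketch of a more portable load. It assumes you keep the CSV in a folder of your own choosing; a temporary directory stands in for that folder, and two rows copied from the head() output below are written out so the block runs on its own. The names data_dir, csv_path and baseball_demo are illustrative only. stringsAsFactors = TRUE is set because, since R 4.0.0, read.csv leaves text columns as character by default, whereas the str() output in this post shows Team and League as factors.

```r
# Stand-in for the real file: two rows copied from the head() output in this post.
data_dir <- tempdir()                      # assumption: replace with your own folder
csv_path <- file.path(data_dir, "baseball.csv")
writeLines(c(
  "Team,League,Year,RS,RA,W",
  "ARI,NL,2012,734,688,81",
  "ATL,NL,2012,700,600,94"
), csv_path)

# file.path() joins the pieces with a separator that works on any OS.
# stringsAsFactors = TRUE keeps Team and League as factors in R >= 4.0.0.
baseball_demo <- read.csv(csv_path, header = TRUE, stringsAsFactors = TRUE)
str(baseball_demo)
```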

The first six rows of the data:

head(baseball)

##   Team League Year  RS  RA  W   OBP   SLG    BA Playoffs RankSeason
## 1  ARI     NL 2012 734 688 81 0.328 0.418 0.259        0         NA
## 2  ATL     NL 2012 700 600 94 0.320 0.389 0.247        1          4
## 3  BAL     AL 2012 712 705 93 0.311 0.417 0.247        1          5
## 4  BOS     AL 2012 734 806 69 0.315 0.415 0.260        0         NA
## 5  CHC     NL 2012 613 759 61 0.302 0.378 0.240        0         NA
## 6  CHW     AL 2012 748 676 85 0.318 0.422 0.255        0         NA
##   RankPlayoffs   G  OOBP  OSLG
## 1           NA 162 0.317 0.415
## 2            5 162 0.306 0.378
## 3            4 162 0.315 0.403
## 4           NA 162 0.331 0.428
## 5           NA 162 0.335 0.424
## 6           NA 162 0.319 0.405

We can look at the structure of our data with the str function and get summary statistics for each column with the summary function.

str(baseball)

## 'data.frame':    1232 obs. of  15 variables:
##  $ Team        : Factor w/ 39 levels "ANA","ARI","ATL",..: 2 3 4 5 7 8 9 10 11 12 ...
##  $ League      : Factor w/ 2 levels "AL","NL": 2 2 1 1 2 1 2 1 2 1 ...
##  $ Year        : int  2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
##  $ RS          : int  734 700 712 734 613 748 669 667 758 726 ...
##  $ RA          : int  688 600 705 806 759 676 588 845 890 670 ...
##  $ W           : int  81 94 93 69 61 85 97 68 64 88 ...
##  $ OBP         : num  0.328 0.32 0.311 0.315 0.302 0.318 0.315 0.324 0.33 0.335 ...
##  $ SLG         : num  0.418 0.389 0.417 0.415 0.378 0.422 0.411 0.381 0.436 0.422 ...
##  $ BA          : num  0.259 0.247 0.247 0.26 0.24 0.255 0.251 0.251 0.274 0.268 ...
##  $ Playoffs    : int  0 1 1 0 0 0 1 0 0 1 ...
##  $ RankSeason  : int  NA 4 5 NA NA NA 2 NA NA 6 ...
##  $ RankPlayoffs: int  NA 5 4 NA NA NA 4 NA NA 2 ...
##  $ G           : int  162 162 162 162 162 162 162 162 162 162 ...
##  $ OOBP        : num  0.317 0.306 0.315 0.331 0.335 0.319 0.305 0.336 0.357 0.314 ...
##  $ OSLG        : num  0.415 0.378 0.403 0.428 0.424 0.405 0.39 0.43 0.47 0.402 ...

summary(baseball)

##       Team     League        Year            RS               RA        
##  BAL    : 47   AL:616   Min.   :1962   Min.   : 463.0   Min.   : 472.0  
##  BOS    : 47   NL:616   1st Qu.:1977   1st Qu.: 652.0   1st Qu.: 649.8  
##  CHC    : 47            Median :1989   Median : 711.0   Median : 709.0  
##  CHW    : 47            Mean   :1989   Mean   : 715.1   Mean   : 715.1  
##  CIN    : 47            3rd Qu.:2002   3rd Qu.: 775.0   3rd Qu.: 774.2  
##  CLE    : 47            Max.   :2012   Max.   :1009.0   Max.   :1103.0  
##  (Other):950                                                            
##        W              OBP              SLG               BA        
##  Min.   : 40.0   Min.   :0.2770   Min.   :0.3010   Min.   :0.2140  
##  1st Qu.: 73.0   1st Qu.:0.3170   1st Qu.:0.3750   1st Qu.:0.2510  
##  Median : 81.0   Median :0.3260   Median :0.3960   Median :0.2600  
##  Mean   : 80.9   Mean   :0.3263   Mean   :0.3973   Mean   :0.2593  
##  3rd Qu.: 89.0   3rd Qu.:0.3370   3rd Qu.:0.4210   3rd Qu.:0.2680  
##  Max.   :116.0   Max.   :0.3730   Max.   :0.4910   Max.   :0.2940  
##                                                                    
##     Playoffs        RankSeason     RankPlayoffs         G        
##  Min.   :0.0000   Min.   :1.000   Min.   :1.000   Min.   :158.0  
##  1st Qu.:0.0000   1st Qu.:2.000   1st Qu.:2.000   1st Qu.:162.0  
##  Median :0.0000   Median :3.000   Median :3.000   Median :162.0  
##  Mean   :0.1981   Mean   :3.123   Mean   :2.717   Mean   :161.9  
##  3rd Qu.:0.0000   3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:162.0  
##  Max.   :1.0000   Max.   :8.000   Max.   :5.000   Max.   :165.0  
##                   NA's   :988     NA's   :988                    
##       OOBP             OSLG       
##  Min.   :0.2940   Min.   :0.3460  
##  1st Qu.:0.3210   1st Qu.:0.4010  
##  Median :0.3310   Median :0.4190  
##  Mean   :0.3323   Mean   :0.4197  
##  3rd Qu.:0.3430   3rd Qu.:0.4380  
##  Max.   :0.3840   Max.   :0.4990  
##  NA's   :812      NA's   :812

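This data set includes an observation for every team and year pair from 1962 to 2012, for all seasons with a 162-game schedule. The summary above shows G ranging from 158 to 165, so not every team played exactly 162 games.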

Next, we subset our data to include only the years before 2002.

moneyball = subset(baseball, Year < 2002)
str(moneyball)

## 'data.frame':    902 obs. of  15 variables:
##  $ Team        : Factor w/ 39 levels "ANA","ARI","ATL",..: 1 2 3 4 5 7 8 9 10 11 ...
##  $ League      : Factor w/ 2 levels "AL","NL": 1 2 2 1 1 2 1 2 1 2 ...
##  $ Year        : int  2001 2001 2001 2001 2001 2001 2001 2001 2001 2001 ...
##  $ RS          : int  691 818 729 687 772 777 798 735 897 923 ...
##  $ RA          : int  730 677 643 829 745 701 795 850 821 906 ...
##  $ W           : int  75 92 88 63 82 88 83 66 91 73 ...
##  $ OBP         : num  0.327 0.341 0.324 0.319 0.334 0.336 0.334 0.324 0.35 0.354 ...
##  $ SLG         : num  0.405 0.442 0.412 0.38 0.439 0.43 0.451 0.419 0.458 0.483 ...
##  $ BA          : num  0.261 0.267 0.26 0.248 0.266 0.261 0.268 0.262 0.278 0.292 ...
##  $ Playoffs    : int  0 1 1 0 0 0 0 0 1 0 ...
##  $ RankSeason  : int  NA 5 7 NA NA NA NA NA 6 NA ...
##  $ RankPlayoffs: int  NA 1 3 NA NA NA NA NA 4 NA ...
##  $ G           : int  162 162 162 162 161 162 162 162 162 162 ...
##  $ OOBP        : num  0.331 0.311 0.314 0.337 0.329 0.321 0.334 0.341 0.341 0.35 ...
##  $ OSLG        : num  0.412 0.404 0.384 0.439 0.393 0.398 0.427 0.455 0.417 0.48 ...

If we take a look at the structure of moneyball, we can see that we now have 902 observations of the same 15 variables.
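As a quick illustration of what subset is doing, the sketch below applies the same filter with bracket indexing. It uses a small stand-in data frame (demo, a hypothetical name) built from three 2012 rows and three 2001 rows shown in the output above, so it runs without the CSV.

```r
# Small stand-in: three 2012 rows and three 2001 rows copied from the output above.
demo <- data.frame(
  Team = c("ARI", "ATL", "BAL", "ANA", "ARI", "ATL"),
  Year = c(2012, 2012, 2012, 2001, 2001, 2001),
  W    = c(81, 94, 93, 75, 92, 88)
)

by_subset  <- subset(demo, Year < 2002)   # the approach used in this post
by_bracket <- demo[demo$Year < 2002, ]    # equivalent logical row indexing

# Both approaches keep the same three 2001 rows.
isTRUE(all.equal(by_subset, by_bracket))
nrow(by_subset)
```

The two forms differ only when the condition evaluates to NA: subset drops those rows, while bracket indexing returns them as rows of NA. Year has no missing values here, so either works.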

So we want to build a linear regression equation to predict wins using the difference between runs scored and runs allowed.

To make this a little easier, we will create a new variable, moneyball$RD (run difference).

moneyball$RD = moneyball$RS - moneyball$RA
str(moneyball)

## 'data.frame':    902 obs. of  16 variables:
##  $ Team        : Factor w/ 39 levels "ANA","ARI","ATL",..: 1 2 3 4 5 7 8 9 10 11 ...
##  $ League      : Factor w/ 2 levels "AL","NL": 1 2 2 1 1 2 1 2 1 2 ...
##  $ Year        : int  2001 2001 2001 2001 2001 2001 2001 2001 2001 2001 ...
##  $ RS          : int  691 818 729 687 772 777 798 735 897 923 ...
##  $ RA          : int  730 677 643 829 745 701 795 850 821 906 ...
##  $ W           : int  75 92 88 63 82 88 83 66 91 73 ...
##  $ OBP         : num  0.327 0.341 0.324 0.319 0.334 0.336 0.334 0.324 0.35 0.354 ...
##  $ SLG         : num  0.405 0.442 0.412 0.38 0.439 0.43 0.451 0.419 0.458 0.483 ...
##  $ BA          : num  0.261 0.267 0.26 0.248 0.266 0.261 0.268 0.262 0.278 0.292 ...
##  $ Playoffs    : int  0 1 1 0 0 0 0 0 1 0 ...
##  $ RankSeason  : int  NA 5 7 NA NA NA NA NA 6 NA ...
##  $ RankPlayoffs: int  NA 1 3 NA NA NA NA NA 4 NA ...
##  $ G           : int  162 162 162 162 161 162 162 162 162 162 ...
##  $ OOBP        : num  0.331 0.311 0.314 0.337 0.329 0.321 0.334 0.341 0.341 0.35 ...
##  $ OSLG        : num  0.412 0.404 0.384 0.439 0.393 0.398 0.427 0.455 0.417 0.48 ...
##  $ RD          : int  -39 141 86 -142 27 76 3 -115 76 17 ...

Let’s visually check to see if there’s a linear relationship between Run Difference and Wins. We’ll do that by creating a scatter plot with the plot function.

plot(moneyball$RD, moneyball$W, xlab="Run Difference", ylab="Wins", pch=19, col="red", main="RD Vs Wins")

The plot shows a very strong linear relationship between these two variables.
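A scatter plot is a visual check; the strength of the relationship can also be put into a number with the Pearson correlation. The sketch below is illustrative only: it uses the first ten 2001 rows copied from the str(moneyball) output above rather than all 902 observations, so the values it prints will not match the full data. It also shows that, with a single predictor, the squared correlation equals the R-squared of the corresponding regression.

```r
# First ten 2001 rows of RS, RA and W, copied from the str(moneyball) output above.
demo <- data.frame(
  RS = c(691, 818, 729, 687, 772, 777, 798, 735, 897, 923),
  RA = c(730, 677, 643, 829, 745, 701, 795, 850, 821, 906),
  W  = c(75, 92, 88, 63, 82, 88, 83, 66, 91, 73)
)
demo$RD <- demo$RS - demo$RA              # same definition as moneyball$RD

r   <- cor(demo$RD, demo$W)               # Pearson correlation
fit <- lm(W ~ RD, data = demo)            # one-predictor regression on the sample

# For a regression with a single predictor, R-squared is the squared correlation.
all.equal(r^2, summary(fit)$r.squared)
```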

Now let’s build a linear regression model, where we will predict Wins on the basis of Run difference.

Regression model to predict wins

WinsReg = lm(W ~ RD, data=moneyball)
summary(WinsReg)

## 
## Call:
## lm(formula = W ~ RD, data = moneyball)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.2662  -2.6509   0.1234   2.9364  11.6570 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 80.881375   0.131157  616.67   <2e-16 ***
## RD           0.105766   0.001297   81.55   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.939 on 900 degrees of freedom
## Multiple R-squared:  0.8808, Adjusted R-squared:  0.8807 
## F-statistic:  6651 on 1 and 900 DF,  p-value: < 2.2e-16

If we look at the summary of our regression, we see that RD is highly significant (three stars, p-value below 2e-16), and the R-squared of our model is 0.88, meaning run difference alone explains about 88% of the variation in wins.
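The fitted equation can be read directly off the coefficient table: W = 80.8814 + 0.1058 * RD. The sketch below rearranges it to find the run difference needed for a given number of wins. The target of 95 wins is an assumption chosen for illustration, not a figure derived in this post; swap in whatever threshold you want to test.

```r
# Coefficients copied from the summary(WinsReg) output above.
intercept <- 80.881375
slope     <- 0.105766

# Hypothetical target, not taken from this post's data.
target_wins <- 95

# W = intercept + slope * RD  =>  RD = (W - intercept) / slope
rd_needed <- (target_wins - intercept) / slope
rd_needed          # roughly 133.5 runs

# The slope also says how many runs of difference one extra win corresponds to.
runs_per_win <- 1 / slope
runs_per_win       # roughly 9.5 runs per win
```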

Now we need to know how many runs a team will score, which we’ll show can be predicted with batting statistics, and how many runs a team will allow, which we’ll show can be predicted using fielding and pitching statistics. Let’s start by creating a linear regression model to predict runs scored.

We want to see if we can use linear regression to predict runs scored, RS, using three hitting statistics: on-base percentage (OBP), slugging percentage (SLG) and batting average (BA).

RunsReg = lm(RS ~ OBP + SLG + BA, data=moneyball)
summary(RunsReg)

## 
## Call:
## lm(formula = RS ~ OBP + SLG + BA, data = moneyball)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -70.941 -17.247  -0.621  16.754  90.998 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -788.46      19.70 -40.029  < 2e-16 ***
## OBP          2917.42     110.47  26.410  < 2e-16 ***
## SLG          1637.93      45.99  35.612  < 2e-16 ***
## BA           -368.97     130.58  -2.826  0.00482 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24.69 on 898 degrees of freedom
## Multiple R-squared:  0.9302, Adjusted R-squared:   0.93 
## F-statistic:  3989 on 3 and 898 DF,  p-value: < 2.2e-16

If we take a look at the summary of our regression equation, we can see that all of our independent variables are significant, and our R-squared is 0.93. But if we look at our coefficients, we can see that the coefficient for batting average is negative.
This implies that, all else being equal, a team with a lower batting average will score more runs, which is a little counterintuitive. What’s going on here is a case of multicollinearity. These three hitting statistics are highly correlated, so it’s hard to interpret the coefficients of our model. Let’s try removing batting average, the variable with the least significance, to see what happens to our model.
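One way to check the multicollinearity claim is to look at the pairwise correlations between the predictors and at a variance inflation factor (VIF). The sketch below computes both by hand with base R. It is illustrative only: it uses the first ten 2001 rows copied from the str(moneyball) output above, so the numbers it prints are not the values for the full 902 observations.

```r
# First ten 2001 rows of the hitting statistics, copied from the str(moneyball) output.
demo <- data.frame(
  OBP = c(0.327, 0.341, 0.324, 0.319, 0.334, 0.336, 0.334, 0.324, 0.350, 0.354),
  SLG = c(0.405, 0.442, 0.412, 0.380, 0.439, 0.430, 0.451, 0.419, 0.458, 0.483),
  BA  = c(0.261, 0.267, 0.260, 0.248, 0.266, 0.261, 0.268, 0.262, 0.278, 0.292)
)

# Pairwise correlations between the three hitting statistics.
cor_matrix <- cor(demo)
round(cor_matrix, 2)

# VIF for BA: 1 / (1 - R^2), where R^2 comes from regressing BA on the other predictors.
r2_ba  <- summary(lm(BA ~ OBP + SLG, data = demo))$r.squared
vif_ba <- 1 / (1 - r2_ba)
vif_ba
```

A VIF of 1 means a predictor is uncorrelated with the others; larger values mean its coefficient is estimated less precisely because the other predictors carry much of the same information.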

RunsReg = lm(RS ~ OBP + SLG, data=moneyball)
summary(RunsReg)

## 
## Call:
## lm(formula = RS ~ OBP + SLG, data = moneyball)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -70.838 -17.174  -1.108  16.770  90.036 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -804.63      18.92  -42.53   <2e-16 ***
## OBP          2737.77      90.68   30.19   <2e-16 ***
## SLG          1584.91      42.16   37.60   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24.79 on 899 degrees of freedom
## Multiple R-squared:  0.9296, Adjusted R-squared:  0.9294 
## F-statistic:  5934 on 2 and 899 DF,  p-value: < 2.2e-16

We can see that our independent variables are still very significant, the coefficients are both positive as we expect, and our R-squared is still about 0.93 (0.9296, compared with 0.9302 before). So this model is simpler, with only two independent variables, and has about the same R-squared. Overall, it is a better model.
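To see the simplified model at work, the sketch below plugs one team's OBP and SLG into the fitted equation and compares the prediction with the runs that team actually scored. The coefficients and the team's statistics are copied from the output in this post; the variable names are illustrative.

```r
# Coefficients copied from the two-variable summary(RunsReg) output above.
b0    <- -804.63
b_obp <- 2737.77
b_slg <- 1584.91

# Second 2001 row from the str(moneyball) output: OBP, SLG and actual runs scored.
obp       <- 0.341
slg       <- 0.442
actual_rs <- 818

# RS = b0 + b_obp * OBP + b_slg * SLG
predicted_rs <- b0 + b_obp * obp + b_slg * slg
predicted_rs               # about 829.5
predicted_rs - actual_rs   # about 11.5 runs too high
```

The miss of roughly 11.5 runs is well inside the model's residual standard error of 24.79.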

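So by using linear regression, we’re able to support the claims made in Moneyball: batting average is overvalued, on-base percentage is the most important statistic, and slugging percentage is also important for predicting runs scored. Because OBP and SLG are measured on a similar scale, the larger coefficient on OBP (2737.77 versus 1584.91) suggests that it matters more.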
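The runs-allowed side mentioned earlier is not fitted in this post. The sketch below shows what such a model could look like, assuming opponents' on-base percentage (OOBP) and opponents' slugging percentage (OSLG) are used as the predictors. That choice of predictors is an assumption, and the fit uses only the first ten 2001 rows copied from the str(moneyball) output, so the coefficients it prints are not estimates for the full data.

```r
# First ten 2001 rows of RA, OOBP and OSLG, copied from the str(moneyball) output.
demo <- data.frame(
  RA   = c(730, 677, 643, 829, 745, 701, 795, 850, 821, 906),
  OOBP = c(0.331, 0.311, 0.314, 0.337, 0.329, 0.321, 0.334, 0.341, 0.341, 0.350),
  OSLG = c(0.412, 0.404, 0.384, 0.439, 0.393, 0.398, 0.427, 0.455, 0.417, 0.480)
)

# Assumed model form: runs allowed as a linear function of OOBP and OSLG.
# On the full data, lm() drops rows with missing OOBP or OSLG by default.
ra_fit <- lm(RA ~ OOBP + OSLG, data = demo)
summary(ra_fit)
```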