Data & Decisions

Lab Assignment #6

Agata Braja, Anan Ogawa

November 8, 2024

Read the data file and show basic structure

Moneyball <- read.csv("Moneyball.csv")
str(Moneyball)

## 'data.frame':    1232 obs. of  15 variables:
##  $ Team        : chr  "ARI" "ATL" "BAL" "BOS" ...
##  $ League      : chr  "NL" "NL" "AL" "AL" ...
##  $ Year        : int  2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
##  $ RS          : int  734 700 712 734 613 748 669 667 758 726 ...
##  $ RA          : int  688 600 705 806 759 676 588 845 890 670 ...
##  $ W           : int  81 94 93 69 61 85 97 68 64 88 ...
##  $ OBP         : num  0.328 0.32 0.311 0.315 0.302 0.318 0.315 0.324 0.33 0.335 ...
##  $ SLG         : num  0.418 0.389 0.417 0.415 0.378 0.422 0.411 0.381 0.436 0.422 ...
##  $ BA          : num  0.259 0.247 0.247 0.26 0.24 0.255 0.251 0.251 0.274 0.268 ...
##  $ Playoffs    : int  0 1 1 0 0 0 1 0 0 1 ...
##  $ RankSeason  : int  NA 4 5 NA NA NA 2 NA NA 6 ...
##  $ RankPlayoffs: int  NA 5 4 NA NA NA 4 NA NA 2 ...
##  $ G           : int  162 162 162 162 162 162 162 162 162 162 ...
##  $ OOBP        : num  0.317 0.306 0.315 0.331 0.335 0.319 0.305 0.336 0.357 0.314 ...
##  $ OSLG        : num  0.415 0.378 0.403 0.428 0.424 0.405 0.39 0.43 0.47 0.402 ...

summary(Moneyball)

##      Team              League               Year            RS        
##  Length:1232        Length:1232        Min.   :1962   Min.   : 463.0  
##  Class :character   Class :character   1st Qu.:1977   1st Qu.: 652.0  
##  Mode  :character   Mode  :character   Median :1989   Median : 711.0  
##                                        Mean   :1989   Mean   : 715.1  
##                                        3rd Qu.:2002   3rd Qu.: 775.0  
##                                        Max.   :2012   Max.   :1009.0  
##                                                                       
##        RA               W              OBP              SLG        
##  Min.   : 472.0   Min.   : 40.0   Min.   :0.2770   Min.   :0.3010  
##  1st Qu.: 649.8   1st Qu.: 73.0   1st Qu.:0.3170   1st Qu.:0.3750  
##  Median : 709.0   Median : 81.0   Median :0.3260   Median :0.3960  
##  Mean   : 715.1   Mean   : 80.9   Mean   :0.3263   Mean   :0.3973  
##  3rd Qu.: 774.2   3rd Qu.: 89.0   3rd Qu.:0.3370   3rd Qu.:0.4210  
##  Max.   :1103.0   Max.   :116.0   Max.   :0.3730   Max.   :0.4910  
##                                                                    
##        BA            Playoffs        RankSeason     RankPlayoffs  
##  Min.   :0.2140   Min.   :0.0000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:0.2510   1st Qu.:0.0000   1st Qu.:2.000   1st Qu.:2.000  
##  Median :0.2600   Median :0.0000   Median :3.000   Median :3.000  
##  Mean   :0.2593   Mean   :0.1981   Mean   :3.123   Mean   :2.717  
##  3rd Qu.:0.2680   3rd Qu.:0.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :0.2940   Max.   :1.0000   Max.   :8.000   Max.   :5.000  
##                                    NA's   :988     NA's   :988    
##        G              OOBP             OSLG       
##  Min.   :158.0   Min.   :0.2940   Min.   :0.3460  
##  1st Qu.:162.0   1st Qu.:0.3210   1st Qu.:0.4010  
##  Median :162.0   Median :0.3310   Median :0.4190  
##  Mean   :161.9   Mean   :0.3323   Mean   :0.4197  
##  3rd Qu.:162.0   3rd Qu.:0.3430   3rd Qu.:0.4380  
##  Max.   :165.0   Max.   :0.3840   Max.   :0.4990  
##                  NA's   :812      NA's   :812

Part 1: More Moneyball

Make a new variable called RD, for “Runs Differential”, which equals Runs Scored minus Runs Against. Run a bivariate regression of # of Wins on Runs Differential. Report the regression output, and also make a line fit plot, and plot the residuals versus the predicted Y.

NOTE: Here are two ways to plot the residuals versus the predicted Y values (with the predicted Y values on the X axis and the residuals on the Y axis, for a regression named “reg”). These graphs should have the same properties as the plots of the residuals versus X described in the module, but they are easier to produce in many cases - especially if there are many X variables.

Method 1: plot(reg$fitted.values, reg$residuals); abline(h = 0)
Method 2: plot(reg, 1)

# Create the Runs Differential (RD) variable
Moneyball$RD <- Moneyball$RS - Moneyball$RA  # RS = Runs Scored, RA = Runs Against
# Run a bivariate regression of Wins on Runs Differential
reg <- lm(W ~ RD, data = Moneyball)
summary(reg)  # Prints the regression output

## 
## Call:
## lm(formula = W ~ RD, data = Moneyball)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.3767  -2.7765   0.0571   2.8022  12.8235 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 80.904221   0.113335  713.85   <2e-16 ***
## RD           0.104548   0.001103   94.78   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.978 on 1230 degrees of freedom
## Multiple R-squared:  0.8796, Adjusted R-squared:  0.8795 
## F-statistic:  8983 on 1 and 1230 DF,  p-value: < 2.2e-16

# Plot the line fit
plot(Moneyball$RD, Moneyball$W, main = "Wins vs Runs Differential",
     xlab = "Runs Differential (RD)", ylab = "Wins")
abline(reg, col = "blue", lwd = 3)

# Plot residuals versus predicted Y (Wins)
# Method 1
plot(reg$fitted.values, reg$residuals, main = "Residuals vs Fitted Values",
     xlab = "Fitted Values (Predicted Wins)", ylab = "Residuals")
abline(h = 0)

# Method 2
plot(reg, which = 1)

(a) What is the interpretation of the coefficient on Runs Differential?

The coefficient for RD is approximately 0.1045. This means that for each additional unit increase in Runs Differential (i.e., each additional run scored relative to runs allowed), the number of wins is expected to increase by about 0.1045 wins. In other words, if a team scores 10 more runs than it allows (a Runs Differential of +10), we would expect the team to win approximately 1.045 more games (0.1045 × 10).
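As a quick arithmetic check on this interpretation, the sketch below plugs the reported coefficients into the fitted line and compares predicted wins at two run differentials that are 10 runs apart. The helper name `predict_wins` is ours for illustration and is not part of the assignment.

```r
# Coefficients copied from the summary(reg) output above
intercept <- 80.904221
rd_coefficient <- 0.104548

# Hypothetical helper: predicted wins for a given Runs Differential
predict_wins <- function(rd) intercept + rd_coefficient * rd

# Difference in predicted wins between RD = +10 and RD = 0
wins_gain <- predict_wins(10) - predict_wins(0)
cat("Extra predicted wins for +10 RD:", wins_gain, "\n")
```

The intercept has its own reading here: a team with RD = 0 is predicted to win about 80.9 games, roughly half of a 162-game season.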

(b) Is Runs Differential statistically significant at the 5% level?

Yes. The p-value on the RD coefficient is below 2e-16, far smaller than 0.05, so we reject the null hypothesis that the true coefficient is zero. A slope this large relative to its standard error would be very unlikely to arise by chance if Runs Differential had no relationship with Wins.
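The same conclusion can be reproduced from the estimate and standard error alone. This is a minimal sketch using the rounded values printed in the regression output, so the t statistic matches the reported 94.78 only up to rounding.

```r
# Values copied from the summary(reg) output above
estimate <- 0.104548
std_error <- 0.001103
df_resid <- 1230

t_stat <- estimate / std_error                  # test statistic for H0: slope = 0
t_crit <- qt(0.975, df = df_resid)              # two-sided 5% critical value
p_value <- 2 * pt(-abs(t_stat), df = df_resid)  # two-sided p-value

cat("t statistic:", t_stat, " critical value:", t_crit, "\n")
cat("Reject H0 at the 5% level:", abs(t_stat) > t_crit, "\n")
```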

(c) Construct a 95% confidence interval for the true coefficient on Runs Differential.

confint(reg, level = 0.95)

##                  2.5 %     97.5 %
## (Intercept) 80.6818704 81.1265712
## RD           0.1023841  0.1067123

The 95% confidence interval for the true coefficient on Runs Differential (RD) is [0.1024, 0.1067]. With 95% confidence, each additional run of Runs Differential is associated with between about 0.1024 and 0.1067 additional wins. The interval excludes zero, which agrees with the significance result in part (b).
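The interval reported by `confint()` can be rebuilt by hand as estimate ± critical value × standard error. The sketch below uses the rounded values from the regression output, so it agrees with `confint()` only to about five decimal places.

```r
# Values copied from the summary(reg) output above
estimate <- 0.104548
std_error <- 0.001103
df_resid <- 1230

# 97.5th percentile of the t distribution gives a two-sided 95% interval
t_crit <- qt(0.975, df = df_resid)
ci <- estimate + c(-1, 1) * t_crit * std_error

cat("95% CI for the RD coefficient:", ci, "\n")
```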

(d) What is the R2 in this regression? What is the interpretation of the R2 in words?

summary(reg)

## 
## Call:
## lm(formula = W ~ RD, data = Moneyball)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.3767  -2.7765   0.0571   2.8022  12.8235 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 80.904221   0.113335  713.85   <2e-16 ***
## RD           0.104548   0.001103   94.78   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.978 on 1230 degrees of freedom
## Multiple R-squared:  0.8796, Adjusted R-squared:  0.8795 
## F-statistic:  8983 on 1 and 1230 DF,  p-value: < 2.2e-16

The R-squared value, 0.8796, indicates that approximately 87.96% of the variance in the number of Wins (W) is explained by the Runs Differential (RD) in this regression model. In other words, RD is a strong predictor of Wins, as it accounts for most of the variability in the Wins data.
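To make "share of variance explained" concrete, the sketch below computes R-squared from its definition, 1 minus the residual sum of squares over the total sum of squares, and compares it with the value `lm()` reports. To stay self-contained it uses only the ten 2012 rows printed in the `str()` output above, so its R-squared will differ from the full-sample 0.8796. The names `subset10` and `fit10` are ours.

```r
# First ten rows (2012 season) as printed in the str() output above
rs <- c(734, 700, 712, 734, 613, 748, 669, 667, 758, 726)
ra <- c(688, 600, 705, 806, 759, 676, 588, 845, 890, 670)
w  <- c(81, 94, 93, 69, 61, 85, 97, 68, 64, 88)
subset10 <- data.frame(W = w, RD = rs - ra)

fit10 <- lm(W ~ RD, data = subset10)

ss_resid <- sum(residuals(fit10)^2)                 # variation left unexplained
ss_total <- sum((subset10$W - mean(subset10$W))^2)  # total variation in Wins
r2_manual <- 1 - ss_resid / ss_total

cat("R-squared by hand:  ", r2_manual, "\n")
cat("R-squared from lm():", summary(fit10)$r.squared, "\n")
```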

(e) Describe the residuals in this regression. Are there any patterns? Is the mean zero for every value of predicted Y? Is the spread constant for every value of predicted Y?

Mean Around Zero: The residuals should ideally have a mean close to zero for each predicted value. In the residuals vs. fitted plot, the points are centered around the zero line, indicating that the residuals have an average close to zero across the range of predicted values, which is a desirable property.

Constant Spread: The spread of residuals should be roughly constant across all predicted values (homoscedasticity). From the plots, the residuals appear to have a relatively constant spread across the fitted values, although there may be slightly more variance in the center. This generally supports the assumption of constant variance.

No Patterns: Ideally, the residuals should be randomly scattered around the zero line without any distinct patterns, suggesting that the model has captured the systematic part of the relationship between Runs Differential and Wins. The residuals appear to be fairly randomly distributed, although there might be minor clustering, which can sometimes indicate a slight model misspecification or an omitted variable. However, overall, the residuals do not show a strong pattern, indicating a reasonably good fit.

In summary, the residuals seem to meet the assumptions of linear regression reasonably well, with no major patterns, a mean around zero, and a relatively constant spread across predicted values.
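The visual checks above can be backed by a rough numeric check: split the observations by predicted wins and compare the mean and standard deviation of the residuals in each group. The sketch below does this on the ten 2012 rows from the `str()` output, which is far too few to draw conclusions from. With the full data, one would apply the same steps to `reg` and use more groups. The variable names are ours.

```r
# First ten rows (2012 season) as printed in the str() output above
rs <- c(734, 700, 712, 734, 613, 748, 669, 667, 758, 726)
ra <- c(688, 600, 705, 806, 759, 676, 588, 845, 890, 670)
w  <- c(81, 94, 93, 69, 61, 85, 97, 68, 64, 88)
rd <- rs - ra

fit10 <- lm(w ~ rd)
res <- residuals(fit10)
fit <- fitted(fit10)

# Group observations into the lower and upper half of predicted wins
half <- ifelse(fit <= median(fit), "lower", "upper")

print(tapply(res, half, mean))  # group means should both be near zero
print(tapply(res, half, sd))    # group spreads should be similar
cat("Overall mean residual:", mean(res), "\n")
```

The overall mean of the residuals is zero by construction in any least-squares fit with an intercept, so only the group-level comparison is informative.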

(f) Billy Beane and Paul DePodesta want to make the playoffs. They believe that making the playoffs will require at least 95 wins. How many more runs do the Oakland A’s need to score than their opponents, in order for your regression model to predict that they will win 95 games?

# Given values from the regression output
intercept <- 80.904221
rd_coefficient <- 0.104548
target_wins <- 95

# Solve 95 = intercept + rd_coefficient * RD for RD
# to get the required Runs Differential
required_rd <- (target_wins - intercept) / rd_coefficient
cat("Required Runs Differential to achieve 95 wins:", required_rd, "\n")

## Required Runs Differential to achieve 95 wins: 134.8259

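The model predicts 95 wins at a Runs Differential of about 135, so over the season the Oakland A’s would need to score roughly 135 more runs than they allow. This is a demanding target. It is also only a point prediction: with a residual standard error of about 4 wins, a team at exactly this Runs Differential could still finish a few wins above or below 95.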
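Typing coefficients in by hand invites rounding and copy errors. An alternative is to pull them from the fitted model with `coef()` and confirm the answer with `predict()`. The sketch below shows the pattern on the ten 2012 rows from the `str()` output, so its required RD will not equal the full-sample 134.83. With the full data, the same lines would use `reg` in place of `fit10`. The names `subset10`, `fit10`, and `required_rd10` are ours.

```r
# First ten rows (2012 season) as printed in the str() output above
rs <- c(734, 700, 712, 734, 613, 748, 669, 667, 758, 726)
ra <- c(688, 600, 705, 806, 759, 676, 588, 845, 890, 670)
w  <- c(81, 94, 93, 69, 61, 85, 97, 68, 64, 88)
subset10 <- data.frame(W = w, RD = rs - ra)

fit10 <- lm(W ~ RD, data = subset10)
b <- coef(fit10)  # named vector with "(Intercept)" and "RD"

# Invert the fitted line: target = intercept + slope * RD
target_wins <- 95
required_rd10 <- unname((target_wins - b["(Intercept)"]) / b["RD"])

# Sanity check: the model should predict the target at this RD
check <- predict(fit10, newdata = data.frame(RD = required_rd10))
cat("Required RD (ten-row fit):", required_rd10, "\n")
cat("Predicted wins at that RD:", check, "\n")
```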

(g) Compute the average Runs Against for the Oakland A’s from 2000 to 2010 (including both end years). Add this to the number that you found above in part (f) for the Run Differential required to win 95 games. The number you obtain is an estimate of the number of Runs Scored Oakland would need, to reach the playoffs during this period.

Average RA: The Oakland A’s allowed an average of 701.55 runs per season over the 2000–2010 period.

Estimated RS Needed: To reach the target of 95 wins, the A’s would need to score approximately 836.37 runs. This is calculated by adding the required RD (134.82) to the average RA (701.55).

moneyball <- read.csv("Moneyball.csv")
# Filter the data for Oakland A's from 2000 to 2010
oakland_data <- subset(moneyball, Team == "OAK" & Year >= 2000 & Year <= 2010)

# Average Runs Against (RA) for Oakland from 2000 to 2010
average_ra <- mean(oakland_data$RA)
cat("Average RA: ", average_ra, "\n")

## Average RA:  701.5455

# Runs Differential from part (f) calculated previously
required_rd <- 134.82  

# Estimated Runs Scored needed
required_rs <- average_ra + required_rd
cat("Estimated Runs Scored needed to reach 95 wins:", required_rs,"\n")

## Estimated Runs Scored needed to reach 95 wins: 836.3655

Run a bivariate regression of Runs Scored on On Base Percentage. Report the regression output.

What is the interpretation of the coefficient on On Base Percentage?

# Run a bivariate regression of Runs Scored (RS) on On Base Percentage (OBP) 
reg_OBP <- lm(RS ~ OBP, data = Moneyball)
summary(reg_OBP)  # This will provide the regression output

## 
## Call:
## lm(formula = RS ~ OBP, data = Moneyball)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -122.129  -27.110    1.284   26.441  135.265 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1076.6       24.7  -43.59   <2e-16 ***
## OBP           5490.4       75.6   72.62   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39.82 on 1230 degrees of freedom
## Multiple R-squared:  0.8109, Adjusted R-squared:  0.8107 
## F-statistic:  5274 on 1 and 1230 DF,  p-value: < 2.2e-16

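The coefficient on On Base Percentage (OBP) is 5490.4. Taken literally, it is the expected change in Runs Scored (RS) for a one-unit increase in OBP, but OBP is a proportion that ranges only from 0.277 to 0.373 in this data, so a one-unit change is not meaningful. A more useful reading is per 0.01 of OBP: an increase of 0.01 (one percentage point) is associated with about 54.9 more runs scored over the season (5490.4 × 0.01). The coefficient is positive and highly significant, which marks OBP as a strong predictor of scoring.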
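To see the size of the coefficient on a realistic scale, the sketch below evaluates the fitted line at the median OBP (0.326) and the third-quartile OBP (0.337) from the `summary(Moneyball)` output. The helper name `predict_rs` is ours.

```r
# Coefficients copied from the summary(reg_OBP) output above
intercept <- -1076.6
obp_coefficient <- 5490.4

# Hypothetical helper: predicted Runs Scored for a given OBP
predict_rs <- function(obp) intercept + obp_coefficient * obp

rs_median <- predict_rs(0.326)  # median OBP in the data
rs_q3     <- predict_rs(0.337)  # third-quartile OBP in the data

cat("Predicted RS at median OBP:      ", rs_median, "\n")
cat("Predicted RS at 3rd-quartile OBP:", rs_q3, "\n")
cat("Difference in predicted RS:      ", rs_q3 - rs_median, "\n")
```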

What is the R2? What is the interpretation of the R2 in words?

The R-squared value is 0.8109, which means that 81.09% of the variability in Runs Scored (RS) is explained by On Base Percentage (OBP). This indicates a strong relationship between OBP and RS, suggesting that OBP is a significant predictor of Runs Scored.

What On Base Percentage would Oakland need to have, in order that your regression models would predict that they would make the playoffs? (Hint: use your answer to 1(g) above to help answer this question.)

# Given values from the regression output
intercept <- -1076.6
obp_coefficient <- 5490.4
target_rs <- 836.37

# Calculate the required OBP
required_obp <- (target_rs - intercept) / obp_coefficient
cat("Required On Base Percentage (OBP) to achieve 836.37 Runs Scored:", required_obp, "\n")

## Required On Base Percentage (OBP) to achieve 836.37 Runs Scored: 0.3484209

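An OBP of 0.348 means that Oakland’s batters would need to reach base about 34.8% of the time. That is above the third quartile of OBP in the data (0.337) but below the maximum (0.373), so the target is high but within the range teams have achieved.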
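The whole chain, from a target number of wins to a required OBP, can be written as one function, which makes it easy to try other targets. This is a sketch built from the coefficients and the average RA reported above. The function name `required_obp_for` and the variable names are ours.

```r
# Coefficients and average copied from the outputs above
wins_intercept <- 80.904221  # W ~ RD intercept
wins_slope     <- 0.104548   # W ~ RD slope
rs_intercept   <- -1076.6    # RS ~ OBP intercept
rs_slope       <- 5490.4     # RS ~ OBP slope
average_ra     <- 701.5455   # Oakland's mean RA, 2000 to 2010

# Hypothetical helper: target wins -> required RD -> required RS -> required OBP
required_obp_for <- function(target_wins) {
  required_rd <- (target_wins - wins_intercept) / wins_slope
  required_rs <- average_ra + required_rd
  (required_rs - rs_intercept) / rs_slope
}

obp_95 <- required_obp_for(95)
cat("Required OBP for 95 predicted wins:", obp_95, "\n")
```

Because the function starts from the unrounded intermediate values, its result can differ from the hand calculation in the last few decimal places.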

Part 2: Regression Project Proposal

Please Note: In most cases your regression project group has been split into two groups for this assignment. You should write up your project proposal together (Part 2) with your whole regression project group. Each assignment group should include that same proposal with their assignment 6.

With your regression project group, write up a one page project proposal describing your project. The purpose of this proposal is for me to provide you with some advance feedback on your project. The project proposal should be a one page document with the following information: the names of all the project members, the project title, a concise statement of what question(s) is(are) to be addressed, to whom these questions are of interest, and a summary of the data to be used. Be sure to describe what information is contained in a typical data point from the data to be used. Most importantly, you must make it clear what your main left hand side (Y) variable is: the data you wish to explain. You should also describe the most important right hand side (X) variables in your data.