November 8, 2024
Read the data file and show basic structure
Moneyball <- read.csv("Moneyball.csv")
str(Moneyball)
## 'data.frame': 1232 obs. of 15 variables:
## $ Team : chr "ARI" "ATL" "BAL" "BOS" ...
## $ League : chr "NL" "NL" "AL" "AL" ...
## $ Year : int 2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
## $ RS : int 734 700 712 734 613 748 669 667 758 726 ...
## $ RA : int 688 600 705 806 759 676 588 845 890 670 ...
## $ W : int 81 94 93 69 61 85 97 68 64 88 ...
## $ OBP : num 0.328 0.32 0.311 0.315 0.302 0.318 0.315 0.324 0.33 0.335 ...
## $ SLG : num 0.418 0.389 0.417 0.415 0.378 0.422 0.411 0.381 0.436 0.422 ...
## $ BA : num 0.259 0.247 0.247 0.26 0.24 0.255 0.251 0.251 0.274 0.268 ...
## $ Playoffs : int 0 1 1 0 0 0 1 0 0 1 ...
## $ RankSeason : int NA 4 5 NA NA NA 2 NA NA 6 ...
## $ RankPlayoffs: int NA 5 4 NA NA NA 4 NA NA 2 ...
## $ G : int 162 162 162 162 162 162 162 162 162 162 ...
## $ OOBP : num 0.317 0.306 0.315 0.331 0.335 0.319 0.305 0.336 0.357 0.314 ...
## $ OSLG : num 0.415 0.378 0.403 0.428 0.424 0.405 0.39 0.43 0.47 0.402 ...
summary(Moneyball)
## Team League Year RS
## Length:1232 Length:1232 Min. :1962 Min. : 463.0
## Class :character Class :character 1st Qu.:1977 1st Qu.: 652.0
## Mode :character Mode :character Median :1989 Median : 711.0
## Mean :1989 Mean : 715.1
## 3rd Qu.:2002 3rd Qu.: 775.0
## Max. :2012 Max. :1009.0
##
## RA W OBP SLG
## Min. : 472.0 Min. : 40.0 Min. :0.2770 Min. :0.3010
## 1st Qu.: 649.8 1st Qu.: 73.0 1st Qu.:0.3170 1st Qu.:0.3750
## Median : 709.0 Median : 81.0 Median :0.3260 Median :0.3960
## Mean : 715.1 Mean : 80.9 Mean :0.3263 Mean :0.3973
## 3rd Qu.: 774.2 3rd Qu.: 89.0 3rd Qu.:0.3370 3rd Qu.:0.4210
## Max. :1103.0 Max. :116.0 Max. :0.3730 Max. :0.4910
##
## BA Playoffs RankSeason RankPlayoffs
## Min. :0.2140 Min. :0.0000 Min. :1.000 Min. :1.000
## 1st Qu.:0.2510 1st Qu.:0.0000 1st Qu.:2.000 1st Qu.:2.000
## Median :0.2600 Median :0.0000 Median :3.000 Median :3.000
## Mean :0.2593 Mean :0.1981 Mean :3.123 Mean :2.717
## 3rd Qu.:0.2680 3rd Qu.:0.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :0.2940 Max. :1.0000 Max. :8.000 Max. :5.000
## NA's :988 NA's :988
## G OOBP OSLG
## Min. :158.0 Min. :0.2940 Min. :0.3460
## 1st Qu.:162.0 1st Qu.:0.3210 1st Qu.:0.4010
## Median :162.0 Median :0.3310 Median :0.4190
## Mean :161.9 Mean :0.3323 Mean :0.4197
## 3rd Qu.:162.0 3rd Qu.:0.3430 3rd Qu.:0.4380
## Max. :165.0 Max. :0.3840 Max. :0.4990
## NA's :812 NA's :812
NOTE: Here are two ways to plot the residuals versus the predicted Y values (with the predicted Y values on the X axis and the residuals on the Y axis, for a regression named “reg”). These graphs should have the same properties as the plots of the residuals versus X described in the module, but they are often easier to produce, especially when there are many X variables.
Method 1: plot(reg$fitted.values, reg$residuals) followed by abline(h=0)
Method 2: plot(reg,1)
#Create Runs Differential (RD) variable
Moneyball$RD <- Moneyball$RS - Moneyball$RA # RS = Runs Scored, RA = Runs Against
#Run a bivariate regression of Wins on Runs Differential
reg <- lm(W ~ RD, data = Moneyball)
summary(reg) # This will give the regression output
##
## Call:
## lm(formula = W ~ RD, data = Moneyball)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.3767 -2.7765 0.0571 2.8022 12.8235
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 80.904221 0.113335 713.85 <2e-16 ***
## RD 0.104548 0.001103 94.78 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.978 on 1230 degrees of freedom
## Multiple R-squared: 0.8796, Adjusted R-squared: 0.8795
## F-statistic: 8983 on 1 and 1230 DF, p-value: < 2.2e-16
# Plot the line fit
plot(Moneyball$RD, Moneyball$W, main = "Wins vs Runs Differential",
xlab = "Runs Differential (RD)", ylab = "Wins")
abline(reg, col = "blue", lwd = 3)
# Plot residuals versus predicted Y (Wins)
# Method 1
plot(reg$fitted.values, reg$residuals, main = "Residuals vs Fitted Values",
xlab = "Fitted Values (Predicted Wins)", ylab = "Residuals")
abline(h = 0)
# Method 2
plot(reg, which = 1)
(a) What is the interpretation of the coefficient on Runs Differential?
The coefficient on RD is approximately 0.1045. Each one-run increase in Runs Differential (one more run scored relative to runs allowed over the season) is associated with an expected increase of about 0.1045 wins. Equivalently, a team whose Runs Differential is 10 runs higher is expected to win about 1.045 more games (0.1045 × 10), so roughly ten runs of differential are worth one win.
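The arithmetic behind this interpretation can be checked directly. This is a minimal sketch that uses the RD coefficient as printed in the regression output; the variable names are illustrative and not part of the original analysis.

```r
# Slope on RD, copied from the regression output (rounded to six decimals)
rd_coefficient <- 0.104548

# Expected change in wins for a 10-run improvement in Runs Differential
extra_wins_per_10_runs <- rd_coefficient * 10

# Runs of differential associated with one additional expected win
runs_per_win <- 1 / rd_coefficient

cat("Extra wins per 10 runs of RD:", extra_wins_per_10_runs, "\n")
cat("Runs of RD per extra win:", round(runs_per_win, 2), "\n")
```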
(b) Is Runs Differential statistically significant at the 5% level?
Yes. The p-value on RD is below 2e-16, far smaller than 0.05, so the coefficient is statistically significant at the 5% level (and at any conventional level). We reject the null hypothesis that the true coefficient on Runs Differential is zero.
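The same conclusion can be reached by comparing the t statistic with the 5% critical value. The sketch below assumes the rounded estimate, standard error, and residual degrees of freedom shown in the regression output; the variable names are illustrative.

```r
# Values copied from the regression output (rounded as printed)
estimate  <- 0.104548
std_error <- 0.001103
df_resid  <- 1230

# t statistic for the null hypothesis that the RD coefficient is zero
t_stat <- estimate / std_error

# Two-sided 5% critical value from the t distribution
t_crit <- qt(0.975, df = df_resid)

# Two-sided p-value; for a t statistic this large it may print as 0
p_value <- 2 * pt(-abs(t_stat), df = df_resid)

significant_at_5pct <- abs(t_stat) > t_crit
cat("t statistic:", round(t_stat, 2), " critical value:", round(t_crit, 3), "\n")
cat("Significant at the 5% level:", significant_at_5pct, "\n")
```

Because the standard error is rounded in the printed output, the t statistic computed this way can differ slightly from the one R reports.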
(c) Construct a 95% confidence interval for the true coefficient on Runs Differential.
confint(reg, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 80.6818704 81.1265712
## RD 0.1023841 0.1067123
The 95% confidence interval for the true coefficient on Runs Differential (RD) is [0.1023841, 0.1067123]. With 95% confidence, the true effect of a one-run increase in Runs Differential on the number of wins is between approximately 0.1024 and 0.1067 wins. The interval excludes zero, which agrees with the significance result in part (b).
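The interval from confint() can be reproduced by hand as estimate ± critical value × standard error. This is a sketch that assumes the rounded values from the regression output, so it matches the confint() result only to about six decimal places; the variable names are illustrative.

```r
# Values copied from the regression output (rounded as printed)
estimate  <- 0.104548
std_error <- 0.001103
df_resid  <- 1230

# Critical value for a two-sided 95% interval
t_crit <- qt(0.975, df = df_resid)

# Confidence interval: estimate plus or minus the margin of error
margin <- t_crit * std_error
ci <- c(lower = estimate - margin, upper = estimate + margin)
print(round(ci, 4))
```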
(d) What is the R-squared in this regression? What is the interpretation of the R-squared in words?
summary(reg)
##
## Call:
## lm(formula = W ~ RD, data = Moneyball)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.3767 -2.7765 0.0571 2.8022 12.8235
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 80.904221 0.113335 713.85 <2e-16 ***
## RD 0.104548 0.001103 94.78 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.978 on 1230 degrees of freedom
## Multiple R-squared: 0.8796, Adjusted R-squared: 0.8795
## F-statistic: 8983 on 1 and 1230 DF, p-value: < 2.2e-16
The R-squared value, 0.8796, indicates that approximately 87.96% of the variance in the number of Wins (W) is explained by the Runs Differential (RD) in this regression model. In other words, RD is a strong predictor of Wins, as it accounts for most of the variability in the Wins data.
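To make the "share of variance explained" reading concrete, the sketch below computes R-squared three ways on a small excerpt: the ten 2012 rows visible in the str() output at the top of this report. Because it uses only ten teams, its R-squared will differ from the full-sample value of 0.8796; the object names (sample_2012, reg_sample) are illustrative.

```r
# Ten 2012 rows copied from the str() output (an excerpt, not the full data)
sample_2012 <- data.frame(
  RS = c(734, 700, 712, 734, 613, 748, 669, 667, 758, 726),
  RA = c(688, 600, 705, 806, 759, 676, 588, 845, 890, 670),
  W  = c(81, 94, 93, 69, 61, 85, 97, 68, 64, 88)
)
sample_2012$RD <- sample_2012$RS - sample_2012$RA

reg_sample <- lm(W ~ RD, data = sample_2012)

# 1. R-squared as reported by summary()
r2_reported <- summary(reg_sample)$r.squared

# 2. One minus (residual sum of squares / total sum of squares)
rss <- sum(resid(reg_sample)^2)
tss <- sum((sample_2012$W - mean(sample_2012$W))^2)
r2_manual <- 1 - rss / tss

# 3. Squared correlation (valid for a regression with one X variable)
r2_from_cor <- cor(sample_2012$W, sample_2012$RD)^2

print(c(reported = r2_reported, manual = r2_manual, from_cor = r2_from_cor))
```

All three numbers agree, which is why R-squared in a bivariate regression can be read either as explained variance or as the squared correlation between W and RD.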
(e) Describe the residuals in this regression. Are there any patterns? Is the mean zero for every value of predicted Y? Is the spread constant for every value of predicted Y?
Mean Around Zero: The residuals should ideally have a mean close to zero for each predicted value. In the residuals vs. fitted plot, the points are centered around the zero line, indicating that the residuals have an average close to zero across the range of predicted values, which is a desirable property.
Constant Spread: The spread of residuals should be roughly constant across all predicted values (homoscedasticity). From the plots, the residuals appear to have a relatively constant spread across the fitted values, although there may be slightly more variance in the center. This generally supports the assumption of constant variance.
No Patterns: Ideally, the residuals should be randomly scattered around the zero line without any distinct patterns, suggesting that the model has captured the systematic part of the relationship between Runs Differential and Wins. The residuals appear to be fairly randomly distributed, although there might be minor clustering, which can sometimes indicate a slight model misspecification or an omitted variable. However, overall, the residuals do not show a strong pattern, indicating a reasonably good fit.
In summary, the residuals seem to meet the assumptions of linear regression reasonably well, with no major patterns, a mean around zero, and a relatively constant spread across predicted values.
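The visual checks above can be supplemented with a rough numeric summary: split the fitted values into groups and compare the mean and standard deviation of the residuals in each group. The sketch below does this on the ten 2012 rows from the str() output, so it only illustrates the mechanics; with the full data, more groups (for example quartiles of the fitted values) would be more informative. The object names are illustrative.

```r
# Ten 2012 rows copied from the str() output (an excerpt, not the full data)
sample_2012 <- data.frame(
  RS = c(734, 700, 712, 734, 613, 748, 669, 667, 758, 726),
  RA = c(688, 600, 705, 806, 759, 676, 588, 845, 890, 670),
  W  = c(81, 94, 93, 69, 61, 85, 97, 68, 64, 88)
)
sample_2012$RD <- sample_2012$RS - sample_2012$RA
reg_sample <- lm(W ~ RD, data = sample_2012)

fit <- fitted(reg_sample)
res <- resid(reg_sample)

# Split observations into the lower and upper half of predicted wins
fitted_half <- cut(fit,
                   breaks = quantile(fit, probs = c(0, 0.5, 1)),
                   include.lowest = TRUE,
                   labels = c("lower half", "upper half"))

# Mean near zero in each group supports the zero-mean property;
# similar standard deviations support constant spread
mean_by_half <- tapply(res, fitted_half, mean)
sd_by_half   <- tapply(res, fitted_half, sd)
print(rbind(mean = mean_by_half, sd = sd_by_half))
```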
(f) Billy Beane and Paul DePodesta want to make the playoffs. They believe that making the playoffs will require at least 95 wins. How many more runs do the Oakland A’s need to score than their opponents, in order for your regression model to predict that they will win 95 games?
# Given values from the regression output
intercept <- 80.904221
rd_coefficient <- 0.104548
target_wins <- 95
# Solve 95 = intercept + rd_coefficient * RD for RD
# Calculate the required Runs Differential (RD)
required_rd <- (target_wins - intercept) / rd_coefficient
cat("Required Runs Differential to achieve 95 wins:", required_rd, "\n")
## Required Runs Differential to achieve 95 wins: 134.8259
The model predicts 95 wins when the Runs Differential is about 135, so over the season the Oakland A’s need to score roughly 135 more runs than they allow. This is the point prediction from the fitted line, not a guarantee of 95 wins, and a differential this large is difficult to achieve.
(g) Compute the average Runs Against for the Oakland A’s from 2000 to 2010 (including both end years). Add this to the number that you found above in part (f) for the Run Differential required to win 95 games. The number you obtain is an estimate of the number of Runs Scored Oakland would need, to reach the playoffs during this period.
Average RA: The Oakland A’s allowed an average of 701.55 runs per season over the 2000–2010 period.
Estimated RS needed: To reach the target of 95 wins, the A’s would need to score approximately 836.37 runs. This is the required RD from part (f) (134.83) added to the average RA (701.55).
moneyball <- read.csv("Moneyball.csv")
# Filter the data for Oakland A's from 2000 to 2010
oakland_data <- subset(moneyball, Team == "OAK" & Year >= 2000 & Year <= 2010)
# Average Runs Against (RA) for Oakland from 2000 to 2010
average_ra <- mean(oakland_data$RA)
cat("Average RA: ", average_ra, "\n")
## Average RA: 701.5455
# Runs Differential from part (f), as printed there
required_rd <- 134.8259
# Estimated Runs Scored needed
required_rs <- average_ra + required_rd
cat("Estimated Runs Scored needed to reach 95 wins:", required_rs, "\n")
## Estimated Runs Scored needed to reach 95 wins: 836.3714
# Run a bivariate regression of Runs Scored (RS) on On Base Percentage (OBP)
reg_OBP <- lm(RS ~ OBP, data = Moneyball)
summary(reg_OBP) # This will provide the regression output
##
## Call:
## lm(formula = RS ~ OBP, data = Moneyball)
##
## Residuals:
## Min 1Q Median 3Q Max
## -122.129 -27.110 1.284 26.441 135.265
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1076.6 24.7 -43.59 <2e-16 ***
## OBP 5490.4 75.6 72.62 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39.82 on 1230 degrees of freedom
## Multiple R-squared: 0.8109, Adjusted R-squared: 0.8107
## F-statistic: 5274 on 1 and 1230 DF, p-value: < 2.2e-16
The coefficient on On Base Percentage (OBP) is 5490.4. Because OBP is a proportion between 0 and 1, a literal one-unit increase is not meaningful; a more useful scale is 0.010 (ten points of OBP), which is associated with about 54.9 more Runs Scored over a season (5490.4 × 0.010). The coefficient is positive and highly significant (p-value below 2e-16), so higher OBP is strongly associated with more Runs Scored.
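A short sketch of how the OBP coefficient translates into runs, using the rounded intercept and slope from the regression output. It also evaluates the fitted line at the sample mean OBP from summary(Moneyball) (0.3263), which should land close to the mean Runs Scored (715.1) because a least-squares line passes through the point of means. The function name predict_rs is hypothetical.

```r
# Coefficients copied from the RS ~ OBP regression output (rounded as printed)
obp_intercept   <- -1076.6
obp_coefficient <- 5490.4

# Hypothetical helper: predicted Runs Scored for a given OBP
predict_rs <- function(obp) {
  obp_intercept + obp_coefficient * obp
}

# Change in predicted RS for a 0.010 (ten-point) increase in OBP
rs_per_10_points <- obp_coefficient * 0.010

# Prediction at the sample mean OBP reported by summary(Moneyball)
rs_at_mean_obp <- predict_rs(0.3263)

cat("Extra runs per 0.010 of OBP:", rs_per_10_points, "\n")
cat("Predicted RS at mean OBP:", round(rs_at_mean_obp, 1), "\n")
```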
The R-squared value is 0.8109, which means that 81.09% of the variability in Runs Scored (RS) is explained by On Base Percentage (OBP). This indicates a strong linear relationship between OBP and RS, so OBP is a strong predictor of Runs Scored.
# Given values from the regression output
intercept <- -1076.6
obp_coefficient <- 5490.4
target_rs <- 836.37
# Calculate the required OBP
required_obp <- (target_rs - intercept) / obp_coefficient
cat("Required On Base Percentage (OBP) to achieve 836.37 Runs Scored:", required_obp, "\n")
## Required On Base Percentage (OBP) to achieve 836.37 Runs Scored: 0.3484209
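An OBP of 0.348 means that, for this model to predict about 836 Runs Scored, Oakland’s batters would need to reach base about 34.8% of the time. For comparison, the sample mean OBP is 0.3263 and the third quartile is 0.3370, so this target is well above a typical team-season.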
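The full chain of reasoning (target wins, then required Runs Differential, then required Runs Scored, then required OBP) can be collected in one place. This is a sketch under the assumptions used in this report: rounded coefficients from the two regression outputs and the 2000–2010 Oakland average RA as printed. The function name required_obp_for_wins and its arguments are hypothetical.

```r
# Hypothetical helper chaining the two fitted lines from this report
required_obp_for_wins <- function(target_wins,
                                  average_ra,
                                  w_intercept = 80.904221,   # W ~ RD intercept
                                  w_slope     = 0.104548,    # W ~ RD slope
                                  rs_intercept = -1076.6,    # RS ~ OBP intercept
                                  rs_slope     = 5490.4) {   # RS ~ OBP slope
  # Step 1: Runs Differential at which the model predicts target_wins
  required_rd <- (target_wins - w_intercept) / w_slope
  # Step 2: Runs Scored needed, given the average Runs Against
  required_rs <- average_ra + required_rd
  # Step 3: OBP at which the model predicts that many Runs Scored
  required_obp <- (required_rs - rs_intercept) / rs_slope
  c(RD = required_rd, RS = required_rs, OBP = required_obp)
}

# Oakland's 2000-2010 average RA as printed earlier in the report
result <- required_obp_for_wins(target_wins = 95, average_ra = 701.5455)
print(round(result, 4))
```

Writing the steps as one function makes it easy to try other targets, for example a different win threshold or a different average RA, without re-typing intermediate values.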
Please Note: In most cases your regression project group has been split into two groups for this assignment. You should write up your project proposal together (Part 2) with your whole regression project group. Each assignment group should include that same proposal with their assignment 6.
With your regression project group, write up a one page project proposal describing your project. The purpose of this proposal is for me to provide you with some advance feedback on your project. The project proposal should be a one page document with the following information: the names of all the project members, the project title, a concise statement of what question(s) will be addressed, to whom these questions are of interest, and a summary of the data to be used. Be sure to describe what information is contained in a typical data point from the data to be used. Most importantly, you must make it clear what your main left hand side (Y) variable is, that is, the data you wish to explain. You should also describe the most important right hand side (X) variables in your data.