download.file("http://www.openintro.org/stat/data/mlb11.RData", destfile = "mlb11.RData")
load("mlb11.RData")

Guided Exercises

Exercise 1: What type of plot would you use to display the relationship between runs and one of the other numerical variables? I would use a scatterplot, since both variables are numerical. I would then fit a linear model and decide whether a line describes the relationship accurately.

Plot this relationship using the variable at_bats as the predictor. Does the relationship look linear? The relationship is positive and roughly linear, but weak. Several outliers are visible, and some of them have considerable leverage.
If you knew a team’s at_bats, would you be comfortable using a linear model to predict the number of runs? I would be suspicious of the model’s accuracy and would use a different variable if possible.
# Scatterplot of runs against at-bats (mlb11 covers the 2011 MLB teams, not a single team)
plot(x = mlb11$at_bats, y = mlb11$runs, xlab = "At-bats", ylab = "Runs", main = "Runs vs. At-bats, 2011 MLB Teams")
# Linear model between variables "at_bats" and "runs"
ml <- lm(runs ~ at_bats, data = mlb11)
# Add the fitted line to the scatterplot
abline(ml)

# Correlation between runs and hits, shown for comparison.
# For runs and at_bats, the model's R-squared of 0.3729 implies a correlation of about 0.61.
cor(mlb11$runs, mlb11$hits)
## [1] 0.8012108
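
In a simple linear regression with one predictor, the correlation coefficient and the model's R-squared are linked: R-squared is the square of the correlation. The short sketch below checks this using only the R-squared values printed in the model summaries of this report; the variable names are illustrative and not part of the lab.

```r
# R-squared values copied from the summary() output in this report
r2_hits    <- 0.6419  # runs ~ hits
r2_at_bats <- 0.3729  # runs ~ at_bats

# With a single predictor, |r| = sqrt(R-squared)
r_hits    <- sqrt(r2_hits)     # about 0.80, matching cor(mlb11$runs, mlb11$hits)
r_at_bats <- sqrt(r2_at_bats)  # about 0.61

print(c(hits = r_hits, at_bats = r_at_bats))
```

Both slopes are positive in the fitted models, so the positive square root is the appropriate sign here.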

Exercise 2: Looking at your plot from the previous exercise, describe the relationship between these two variables. Make sure to discuss the form, direction, and strength of the relationship as well as any unusual observations. Based on the scatterplot, the relationship can be described as linear and positive but, as mentioned earlier, weak. There are several outliers, and most of the points lie below 5,650 at-bats. Some of the outliers may also have considerable leverage over the fitted line, which reduces the accuracy of the model.

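Exercise 3: Using plot_ss, choose a line that does a good job of minimizing the sum of squares. Run the function several times. What was the smallest sum of squares that you got? SS = 70638.75 for the plot of runs against hits shown here (the runs ~ at_bats fit reported under On Your Own gives 123721.9).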

plot_ss(x = mlb11$hits, y = mlb11$runs, showSquares = TRUE)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##   -375.5600       0.7589  
## 
## Sum of Squares:  70638.75
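
The quantity that plot_ss reports can be reproduced by hand. The sketch below uses a small made-up dataset (not mlb11) and a hypothetical helper `sum_sq()` to show that the least squares line from `lm()` gives a smaller sum of squared residuals than an arbitrary guessed line.

```r
# Hypothetical toy data, for illustration only (not taken from mlb11)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Sum of squared residuals for a line with the given intercept and slope
sum_sq <- function(intercept, slope) {
  sum((y - (intercept + slope * x))^2)
}

fit <- lm(y ~ x)
b <- unname(coef(fit))          # b[1] = intercept, b[2] = slope

ss_least_squares <- sum_sq(b[1], b[2])
ss_guess <- sum_sq(0, 2.2)      # an arbitrary guessed line

print(c(least_squares = ss_least_squares, guess = ss_guess))
```

Any line other than the least squares line gives a sum of squares at least as large, which is why repeated clicking in plot_ss cannot beat the value returned by `lm()`.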

Exercise 4: Fit a new model that uses homeruns to predict runs. Using the estimates from the R output, write the equation of the regression line. What does the slope tell us in the context of the relationship between success of a team and its home runs? Least squares regression line: \(\widehat{\text{runs}} = 415.2389 + 1.8345 \times \text{homeruns}\). The slope tells us that each additional home run is associated with an increase of about 1.83 runs on average (rise/run = runs/home runs). Teams that hit more home runs therefore tend to score more runs, which is one measure of a team's success.

mh <-  lm(runs ~ homeruns, data = mlb11)
summary(mh)
## 
## Call:
## lm(formula = runs ~ homeruns, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -91.615 -33.410   3.231  24.292 104.631 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 415.2389    41.6779   9.963 1.04e-10 ***
## homeruns      1.8345     0.2677   6.854 1.90e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 51.29 on 28 degrees of freedom
## Multiple R-squared:  0.6266, Adjusted R-squared:  0.6132 
## F-statistic: 46.98 on 1 and 28 DF,  p-value: 1.9e-07
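
To make the slope interpretation concrete, the sketch below turns the fitted equation into a small function using the coefficients printed above. The function name `predict_runs()` and the home run totals 150 and 151 are illustrative choices, not values from the lab.

```r
# Coefficients copied from summary(mh) above
b0 <- 415.2389  # intercept
b1 <- 1.8345    # slope for homeruns

# Predicted runs for a given number of home runs
predict_runs <- function(homeruns) b0 + b1 * homeruns

# One extra home run changes the prediction by exactly the slope
slope_check <- predict_runs(151) - predict_runs(150)
print(slope_check)
```

With the model object available, `predict(mh, newdata = data.frame(homeruns = 150))` should give the same prediction up to rounding of the printed coefficients.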

Exercise 5: If a team manager saw the least squares regression line and not the actual data, how many runs would he or she predict for a team with 5,579 at-bats? Is this an overestimate or an underestimate, and by how much? In other words, what is the residual for this prediction? The model predicts about 728 runs. In the real data, the team with 5,579 at-bats (the Philadelphia Phillies) scored 713 runs. The model therefore overestimates, and the residual is \(e = -15.3166\).

# Prediction from the least squares regression line for runs ~ at_bats
Runs <- -2789.2429 + (0.6305 * 5579); Runs
## [1] 728.3166
# Residual: observed minus predicted
Residual <- mlb11[16, 2] - Runs; Residual # mlb11[16, 2] is the runs entry for the team with 5,579 at-bats
## [1] -15.3166
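
Written out as a worked equation, using the coefficients of the runs ~ at_bats fit and the observed 713 runs, the calculation is:

\[
\hat{y} = -2789.2429 + 0.6305 \times 5579 = 728.3166, \qquad e = y - \hat{y} = 713 - 728.3166 = -15.3166
\]

A negative residual means the observed value lies below the fitted line, so the prediction is an overestimate.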

Exercise 6. Is there any apparent pattern in the residuals plot? What does this indicate about the linearity of the relationship between runs and at-bats? There are three extreme outliers, but most of the points are randomly distributed. However, more data points lie on the left side of the plot, and the points form a funnel-like shape. A linear model could still be used, but there might be a more accurate option for this dataset. The Q-Q plot supports this: toward the upper end of the line, the points separate noticeably from it.

#Residuals Plot
plot(ml$residuals ~ mlb11$at_bats, xlab= "At-bats", ylab= "Residuals", main= "Residuals Plot")
abline(h = 0, lty = 3)  # adds a horizontal dashed line at y = 0 

#Histogram
hist(ml$residuals, xlab= "Residuals", main= "Histogram of Residuals")

# Normal probability (Q-Q) plot of the residuals
qqnorm(ml$residuals)
qqline(ml$residuals)  # adds diagonal line to the normal prob plot

Exercise 7. Based on the histogram and the normal probability plot, does the nearly normal residuals condition appear to be met? The histogram suggests a possibly bimodal distribution, so on that evidence the nearly normal residuals condition is not met. The normal Q-Q plot shows the same feature: the residuals sit farther from the line when the theoretical quantiles approach 0 and again when they approach about 1.5. In addition, the last point in the plot appears to have enough leverage to pull the line away from a position that would make the linear model more accurate.

Exercise 8: Based on the plot in (1), does the constant variability condition appear to be met? In the Scale-Location plot, the line is not completely horizontal, but the residuals are spread fairly evenly around it, so the condition appears to be reasonably met.

par(mfrow=c(2,2))
plot(ml)
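
When only one of the four diagnostic panels is needed, such as the Scale-Location plot used for the constant variability check, the `which` argument of `plot()` for `lm` objects selects it. The sketch below uses R's built-in `cars` dataset as a stand-in for mlb11 so that it runs on its own; with the lab data, the same call would be applied to the at_bats model.

```r
# Stand-in model on the built-in cars data (an assumption, not the lab's data)
fit <- lm(dist ~ speed, data = cars)

# Reset to a single panel, then draw only the Scale-Location plot.
# which = 1: Residuals vs Fitted, 2: Normal Q-Q,
# which = 3: Scale-Location,      5: Residuals vs Leverage
par(mfrow = c(1, 1))
plot(fit, which = 3)
```

A roughly horizontal trend line with evenly spread points supports the constant variability condition, while a clear upward or downward trend suggests the spread of the residuals changes with the fitted values.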

On Your Own


1. Choose another traditional variable from mlb11 that you think might be a good predictor of runs. Produce a scatterplot of the two variables and fit a linear model. At a glance, does there seem to be a linear relationship? I chose hits. Based on the plot and the fitted line, the data show a positive linear relationship of moderate strength, consistent with the correlation of about 0.80.

# Correlation coefficient to test possible relationship
cor(mlb11$runs, mlb11$hits)
## [1] 0.8012108
# Scatterplot of runs against hits (mlb11 covers the 2011 MLB teams, not a single team)
plot(x = mlb11$hits, y = mlb11$runs, xlab = "Hits", ylab = "Runs", main = "Runs vs. Hits, 2011 MLB Teams")
# Linear model between variables "hits" and "runs"
m1 <- lm(runs ~ hits, data = mlb11)
#Linear model in scatterplot
abline(m1)

plot_ss(x = mlb11$hits, y = mlb11$runs, showSquares = TRUE)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##   -375.5600       0.7589  
## 
## Sum of Squares:  70638.75


2. How does this relationship compare to the relationship between runs and at_bats? Use the R^2 values from the two model summaries to compare. Does your variable seem to predict runs better than at_bats? How can you tell? The relationship between “runs” and “at_bats” is weaker than the one between “runs” and “hits”, based on the adjusted \(R^2\) values of 0.3505 and 0.6292 respectively. The variable “hits” therefore seems to predict runs better. This is because \(R^2\) measures the proportion of the variability in runs that the model explains, and the summary of each linear model reports this value.

# Linear model between the variables "runs" and "at_bats"
m2 <- lm(runs ~ at_bats, data = mlb11)
plot_ss(x = mlb11$at_bats, y = mlb11$runs, showSquares = TRUE)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##  -2789.2429       0.6305  
## 
## Sum of Squares:  123721.9
# Summary for the two linear models
summary(m1)
## 
## Call:
## lm(formula = runs ~ hits, data = mlb11)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -103.718  -27.179   -5.233   19.322  140.693 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -375.5600   151.1806  -2.484   0.0192 *  
## hits           0.7589     0.1071   7.085 1.04e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 50.23 on 28 degrees of freedom
## Multiple R-squared:  0.6419, Adjusted R-squared:  0.6292 
## F-statistic:  50.2 on 1 and 28 DF,  p-value: 1.043e-07
summary(m2)
## 
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -125.58  -47.05  -16.59   54.40  176.87 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2789.2429   853.6957  -3.267 0.002871 ** 
## at_bats         0.6305     0.1545   4.080 0.000339 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 66.47 on 28 degrees of freedom
## Multiple R-squared:  0.3729, Adjusted R-squared:  0.3505 
## F-statistic: 16.65 on 1 and 28 DF,  p-value: 0.0003388
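
The adjusted R-squared printed in each summary can be recovered from the multiple R-squared, the number of observations, and the number of predictors. The sketch below checks this for the two models above; the helper name `adj_r2()` is illustrative, and n = 30 is inferred from the 28 residual degrees of freedom plus the two estimated coefficients.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p
adj_r2 <- function(r2, n, p = 1) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Multiple R-squared values copied from summary(m2) and summary(m1)
adj_at_bats <- adj_r2(0.3729, n = 30)  # about 0.3505
adj_hits    <- adj_r2(0.6419, n = 30)  # about 0.629

print(c(at_bats = adj_at_bats, hits = adj_hits))
```

Small differences in the last digit can appear because the printed R-squared values are already rounded.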


3. Now that you can summarize the linear relationship between two variables, investigate the relationships between runs and each of the other five traditional variables. Which variable best predicts runs? Support your conclusion using the graphical and numerical methods we’ve discussed (for the sake of conciseness, only include output for the best variable, not all five). Of the five variables tested for this question, “batting average” has the strongest relationship with “runs”. This conclusion is based on the scatterplot with the fitted line, which shows a moderate positive linear relationship, and on the adjusted \(R^2 = 0.6438\), the highest among the traditional variables.

# Linear model between "runs" and "batting average" and summary
m3 <- lm(runs ~ bat_avg, data = mlb11)
summary(m3)
## 
## Call:
## lm(formula = runs ~ bat_avg, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -94.676 -26.303  -5.496  28.482 131.113 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -642.8      183.1  -3.511  0.00153 ** 
## bat_avg       5242.2      717.3   7.308 5.88e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 49.23 on 28 degrees of freedom
## Multiple R-squared:  0.6561, Adjusted R-squared:  0.6438 
## F-statistic: 53.41 on 1 and 28 DF,  p-value: 5.877e-08
plot(x = mlb11$bat_avg, y = mlb11$runs, xlab = "Batting Average", ylab = "Runs", main = "Runs vs. Batting Average, 2011 MLB Teams")
abline(m3)

# Linear model between "runs" and "strikeouts" and summary
m4 <- lm(runs ~ strikeouts, data = mlb11)
summary(m4)
plot(x = mlb11$strikeouts, y = mlb11$runs, xlab = "Strikeouts", ylab = "Runs", main = "Runs vs. Strikeouts, 2011 MLB Teams")
abline(m4)
plot_ss(x = mlb11$strikeouts, y = mlb11$runs, showSquares = TRUE)
# Linear model between "runs" and "stolen bases" and summary
m5 <- lm(runs ~ stolen_bases, data = mlb11)
summary(m5)
plot(x = mlb11$stolen_bases, y = mlb11$runs, xlab = "Stolen Bases", ylab = "Runs", main = "Runs vs. Stolen Bases, 2011 MLB Teams")
abline(m5)
# Use stolen_bases here (the original call repeated strikeouts)
plot_ss(x = mlb11$stolen_bases, y = mlb11$runs, showSquares = TRUE)
# Linear model between "runs" and "wins" and summary
m6 <- lm(runs ~ wins, data = mlb11)
summary(m6)
plot(x = mlb11$wins, y = mlb11$runs, xlab = "Wins", ylab = "Runs", main = "Runs vs. Wins, 2011 MLB Teams")
abline(m6)
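
Fitting one model per predictor by hand invites copy-and-paste slips, so a loop over the predictor names can be a safer way to compare them. The sketch below uses R's built-in `mtcars` dataset as a stand-in so that it runs on its own; with the lab data, the response would be "runs", the data would be mlb11, and the predictor names would be the column names of the traditional variables.

```r
# Stand-in data: mtcars with mpg as the response (an assumption, not the lab's data)
predictors <- c("wt", "hp", "qsec")

# Fit response ~ predictor for each name and collect the adjusted R-squared
adj_r2_by_var <- sapply(predictors, function(v) {
  fit <- lm(reformulate(v, response = "mpg"), data = mtcars)
  summary(fit)$adj.r.squared
})

# Rank the predictors from strongest to weakest
print(sort(adj_r2_by_var, decreasing = TRUE))
best <- names(which.max(adj_r2_by_var))
print(best)
```

The numerical ranking narrows the search, but the scatterplot and diagnostics for the top candidate still need to be checked before settling on it.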


4. Now examine the three newer variables. These are the statistics used by the author of Moneyball to predict a team’s success. In general, are they more or less effective at predicting runs than the old variables? Explain using appropriate graphical and numerical evidence. Of all ten variables we’ve analyzed, which seems to be the best predictor of runs? Using the limited (or not so limited) information you know about these baseball statistics, does your result make sense? The adjusted \(R^2\) values are higher for the new variables than for the traditional ones, so the new variables are more effective at predicting runs. None of the old variables has an adjusted \(R^2\) greater than 0.6438, while every new variable has an adjusted \(R^2\) of at least 0.8437. In addition, the fitted lines follow the data more closely, with no clear outliers, compared to the old variables. This makes sense: for example, “on-base plus slugging” combines how often a hitter reaches base with how much power the hitter hits for, and both contribute to scoring runs. Thus, the most effective variable of all ten is “on-base plus slugging”, because of its adjusted \(R^2 = 0.9326\) and the strong positive linear relationship with “runs” shown in the plot.

# Linear model between "runs" and "on-base percentage" and summary
m7 <- lm(runs ~ new_onbase, data = mlb11)
summary(m7)
## 
## Call:
## lm(formula = runs ~ new_onbase, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -58.270 -18.335   3.249  19.520  69.002 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1118.4      144.5  -7.741 1.97e-08 ***
## new_onbase    5654.3      450.5  12.552 5.12e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 32.61 on 28 degrees of freedom
## Multiple R-squared:  0.8491, Adjusted R-squared:  0.8437 
## F-statistic: 157.6 on 1 and 28 DF,  p-value: 5.116e-13
plot(x = mlb11$new_onbase, y = mlb11$runs, xlab = "On-base Percentage", ylab = "Runs", main = "Runs vs. On-base Percentage, 2011 MLB Teams")
abline(m7)

# Linear model between "runs" and "slugging percentage" and summary
m8 <- lm(runs ~ new_slug, data = mlb11)
summary(m8)
## 
## Call:
## lm(formula = runs ~ new_slug, data = mlb11)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -45.41 -18.66  -0.91  16.29  52.29 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -375.80      68.71   -5.47 7.70e-06 ***
## new_slug     2681.33     171.83   15.61 2.42e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26.96 on 28 degrees of freedom
## Multiple R-squared:  0.8969, Adjusted R-squared:  0.8932 
## F-statistic: 243.5 on 1 and 28 DF,  p-value: 2.42e-15
plot(x = mlb11$new_slug, y = mlb11$runs, xlab = "Slugging Percentage", ylab = "Runs", main = "Runs vs. Slugging Percentage, 2011 MLB Teams")
abline(m8)

# Linear model between "runs" and "on-base plus slugging" and summary
m9 <- lm(runs ~ new_obs, data = mlb11)
summary(m9)
## 
## Call:
## lm(formula = runs ~ new_obs, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -43.456 -13.690   1.165  13.935  41.156 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -686.61      68.93  -9.962 1.05e-10 ***
## new_obs      1919.36      95.70  20.057  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.41 on 28 degrees of freedom
## Multiple R-squared:  0.9349, Adjusted R-squared:  0.9326 
## F-statistic: 402.3 on 1 and 28 DF,  p-value: < 2.2e-16
plot(x = mlb11$new_obs, y = mlb11$runs, xlab = "On-base Plus Slugging", ylab = "Runs", main = "Runs vs. On-base Plus Slugging, 2011 MLB Teams")
abline(m9)


5. Check the model diagnostics for the regression model with the variable you decided was the best predictor for runs.

Residuals vs. Fitted (Linearity): The data points are fairly evenly spread around the horizontal line. However, the trend line bends slightly at the beginning and the end. The linearity condition is therefore considered met, with some caution about the final conclusions.
Normal Q-Q (Normality): The residuals follow the line along the whole plot. Therefore, the model meets the normality condition.
Scale-Location (Constant Variance): The residuals are spread fairly evenly around the trend line. The line slopes downward toward the end but has no dramatic slope and stays close to horizontal. Therefore, the equal variance condition is met.

Residuals vs. Leverage (Possible outliers with significant leverage): The plot labels two potentially influential observations (Observations #70 and #29). However, all the values are within the Cook’s distance lines. Removing these observations could be considered to improve the fit, but since none of them crosses the Cook’s distance lines, this condition is met too.

# Diagnosis for normality
hist(m9$residuals, xlab= "Residuals", main= "Residuals Distribution")

par(mfrow=c(2,2))

plot(m9)
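
The Residuals vs Leverage panel can be complemented with the numerical Cook's distances from `cooks.distance()`. The sketch below uses R's built-in `cars` dataset as a stand-in so that it runs on its own; with the lab data, the same calls would be applied to the on-base plus slugging model. The 4/n cutoff is only a common rule of thumb and is an assumption here, not something prescribed by the lab.

```r
# Stand-in model on the built-in cars data (an assumption, not the lab's data)
fit <- lm(dist ~ speed, data = cars)

# Cook's distance for every observation
cd <- cooks.distance(fit)

# Flag observations above the 4/n rule-of-thumb cutoff (heuristic only)
cutoff <- 4 / nrow(cars)
flagged <- which(cd > cutoff)

print(flagged)
print(round(max(cd), 3))
```

Flagged observations deserve a closer look rather than automatic removal; refitting the model without them and comparing the coefficients shows how much they actually change the fit.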

Documentation Statement:

  • On 13 May 2020, I used the files uploaded on BlackBoard for the milestone.
  • On 14 May 2020, Capt Forbes answered questions regarding the diagnostic plots and advised me to use the article cited in the references (University of Virginia Library) to understand the diagnostic plots better. He also showed me the code for the diagnostic plots. In addition, he explained to me how to consider the conditions and evaluate them better.

References: