The Data

## Loading the data

download.file("http://www.openintro.org/stat/data/mlb11.RData", destfile = "mlb11.RData")
load("mlb11.RData")

Exercise 1

What type of plot would you use to display the relationship between runs and one of the other numerical variables? Plot this relationship using the variable at_bats as the predictor. Does the relationship look linear? If you knew a team’s at_bats, would you be comfortable using a linear model to predict the number of runs?

I would use a scatterplot to display the relationship between runs and another numerical variable. The relationship between runs and at_bats does not look clearly linear: the points trend upward but are widely scattered. I wouldn’t be comfortable using a linear model to predict the number of runs from at_bats alone.

#code for plotting the scatterplot with runs and at_bats

plot(mlb11$runs ~ mlb11$at_bats, main="Relationship between runs and at bats")

## trying this part of the code to find the correlation coefficient even though it doesn't seem linear
cor(mlb11$runs, mlb11$at_bats)
## [1] 0.610627
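
As a quick cross-check, in a simple linear regression with one predictor the squared correlation equals the model's R-squared. The sketch below only squares the value printed by cor() above; the variable names r and r_squared are mine, not part of the lab.

```r
# Correlation between runs and at_bats, copied from the cor() output above
r <- 0.610627

# In simple linear regression, R-squared is the square of the correlation
r_squared <- r^2

# Rounded to four decimals for comparison with the summary(m1) output
round(r_squared, 4)
```

The rounded value should agree with the Multiple R-squared that summary(m1) reports in the linear model section.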

Sum of squared residuals

Exercise 2

Looking at your plot from the previous exercise, describe the relationship between these two variables. Make sure to discuss the form, direction, and strength of the relationship as well as any unusual observations.

The relationship is positive and only moderately strong: as at-bats increase, runs also tend to increase, but the increase is gradual and the points are widely scattered as they trend toward the upper right corner of the graph. There are unusual observations, such as the team with around 5,520 at-bats and around 860 runs, which is far from the regression line.

## finding the best line that follows their association.

plot_ss(x = mlb11$at_bats, y = mlb11$runs)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##  -2789.2429       0.6305  
## 
## Sum of Squares:  123721.9
## showing the regression line together with the squared residuals

plot_ss(x = mlb11$at_bats, y = mlb11$runs, showSquares = TRUE)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##  -2789.2429       0.6305  
## 
## Sum of Squares:  123721.9

Exercise 3

Using plot_ss, choose a line that does a good job of minimizing the sum of squares. Run the function several times. What was the smallest sum of squares that you got? How does it compare to your neighbors?

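The smallest sum of squares I got is 123721.9. The coefficients that plot_ss reports for this line are the same ones lm() gives in the next section, so this is the least squares line. I can’t really compare it to any neighbors.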

The linear model

# lm() fits the least squares regression line directly.

m1 <- lm(runs ~ at_bats, data = mlb11)

#showing what m1 summary looks like

summary(m1)
## 
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -125.58  -47.05  -16.59   54.40  176.87 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2789.2429   853.6957  -3.267 0.002871 ** 
## at_bats         0.6305     0.1545   4.080 0.000339 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 66.47 on 28 degrees of freedom
## Multiple R-squared:  0.3729, Adjusted R-squared:  0.3505 
## F-statistic: 16.65 on 1 and 28 DF,  p-value: 0.0003388
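
The residual standard error in this summary can be tied back to the sum of squares that plot_ss printed earlier. This is a rough check using only the numbers shown in the output; the names sse, df_resid, and rse are my own, and the 28 degrees of freedom come from the summary line above.

```r
# Sum of squared residuals reported by plot_ss for the least squares line
sse <- 123721.9

# Residual degrees of freedom from summary(m1): 30 teams minus 2 coefficients
df_resid <- 28

# Residual standard error = sqrt(SSE / residual degrees of freedom)
rse <- sqrt(sse / df_resid)
round(rse, 2)
```

If the two outputs describe the same line, the result should match the "Residual standard error" printed by summary(m1) up to rounding.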

Exercise 4

Fit a new model that uses homeruns to predict runs. Using the estimates from the R output, write the equation of the regression line. What does the slope tell us in the context of the relationship between success of a team and its home runs?

Based on the estimates, the equation of the regression line is predicted runs = 415.2389 + 1.8345 * homeruns. The slope is positive: each additional home run is associated with about 1.83 more runs on average. Teams that hit more home runs tend to score more runs, and so tend to be more successful.

m2 <-  lm(runs ~ homeruns, data = mlb11)

summary(m2)
## 
## Call:
## lm(formula = runs ~ homeruns, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -91.615 -33.410   3.231  24.292 104.631 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 415.2389    41.6779   9.963 1.04e-10 ***
## homeruns      1.8345     0.2677   6.854 1.90e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 51.29 on 28 degrees of freedom
## Multiple R-squared:  0.6266, Adjusted R-squared:  0.6132 
## F-statistic: 46.98 on 1 and 28 DF,  p-value: 1.9e-07
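
To make the slope concrete, the fitted equation can be written as a small function. This is only a sketch built from the coefficients printed above; predict_runs_from_hr is a hypothetical helper name, and the home run totals 150 and 160 are made-up inputs for illustration, not values taken from mlb11.

```r
# Coefficients copied from the summary(m2) output
intercept_hr <- 415.2389
slope_hr <- 1.8345

# Hypothetical helper: predicted runs for a given number of home runs
predict_runs_from_hr <- function(homeruns) {
  intercept_hr + slope_hr * homeruns
}

# Illustrative inputs only: two teams that differ by 10 home runs
difference <- predict_runs_from_hr(160) - predict_runs_from_hr(150)
difference  # equals 10 * slope_hr, whatever the starting total is
```

The difference depends only on the slope, which is why the slope is read as the change in predicted runs per additional home run.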

Prediction and prediction errors

## scatterplot with the least squares line on top of it.

plot(mlb11$runs ~ mlb11$at_bats)
abline(m1)

Exercise 5

If a team manager saw the least squares regression line and not the actual data, how many runs would he or she predict for a team with 5,578 at-bats? Is this an overestimate or an underestimate, and by how much? In other words, what is the residual for this prediction?

The team manager would predict about 728 runs for a team with 5,578 at-bats (-2789.2429 + 0.6305 * 5578 is about 727.7). Compared with an actual data point at around the same number of at-bats, this is an overestimate of around 14 runs, so the residual (observed minus predicted) is around -14.
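
The prediction can also be computed directly from the coefficients in summary(m1) instead of being read off the plot. This is a minimal sketch; the names intercept_ab, slope_ab, and residual_for are mine, and residual_for is a hypothetical helper that takes an observed run total supplied by the reader.

```r
# Coefficients copied from the summary(m1) output
intercept_ab <- -2789.2429
slope_ab <- 0.6305

# Predicted runs for a team with 5,578 at-bats
predicted_runs <- intercept_ab + slope_ab * 5578
predicted_runs

# Hypothetical helper: residual = observed - predicted.
# A negative residual means the line overestimates the observed value.
residual_for <- function(observed_runs) {
  observed_runs - predicted_runs
}
```

With the fitted model still in memory, predict(m1, newdata = data.frame(at_bats = 5578)) should give the same prediction up to rounding of the coefficients.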

Model diagnostics

# residual plot against at_bats to check linearity

plot(m1$residuals ~ mlb11$at_bats)
abline(h = 0, lty = 3)  # adds a horizontal dashed line at y = 0

Exercise 6

Is there any apparent pattern in the residuals plot? What does this indicate about the linearity of the relationship between runs and at-bats?

There is no apparent pattern in the residuals plot. This indicates that the relationship is linear.

#showing histogram
hist(m1$residuals)

qqnorm(m1$residuals)
qqline(m1$residuals)  # adds diagonal line to the normal prob plot

Exercise 7

Based on the histogram and the normal probability plot, does the nearly normal residuals condition appear to be met?

Yes, the nearly normal residuals condition appears to be met.

Exercise 8

Based on the plot in (1), does the constant variability condition appear to be met?

Yes, the constant variability condition appears to be met, even with the outliers.

On Your Own

1. Choose another traditional variable from mlb11 that you think might be a good predictor of runs. Produce a scatterplot of the two variables and fit a linear model. At a glance, does there seem to be a linear relationship?

At a glance, the scatterplot of hits and runs seems to show a linear relationship: as one goes up, the other also goes up.

##Another good indicator for runs would be hits

plot(mlb11$runs ~ mlb11$hits, main="Relationship between hits and runs")

2. How does this relationship compare to the relationship between runs and at_bats? Use the R2 values from the two model summaries to compare. Does your variable seem to predict runs better than at_bats? How can you tell?

The R^2 value for the runs and at_bats model is 0.3729, and the R^2 value for the runs and hits model is 0.6419. Hits predicts runs better than at_bats because its R^2 is higher: the hits model explains about 64% of the variability in runs, compared with about 37% for at_bats.

#to show the R squared values 
m3 <- lm(runs ~ hits, data = mlb11)
summary(m3)
## 
## Call:
## lm(formula = runs ~ hits, data = mlb11)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -103.718  -27.179   -5.233   19.322  140.693 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -375.5600   151.1806  -2.484   0.0192 *  
## hits           0.7589     0.1071   7.085 1.04e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 50.23 on 28 degrees of freedom
## Multiple R-squared:  0.6419, Adjusted R-squared:  0.6292 
## F-statistic:  50.2 on 1 and 28 DF,  p-value: 1.043e-07

3. Now that you can summarize the linear relationship between two variables, investigate the relationships between runs and each of the other five traditional variables. Which variable best predicts runs? Support your conclusion using the graphical and numerical methods we’ve discussed (for the sake of conciseness, only include output for the best variable, not all five).

m3 <- lm(runs ~ hits, data = mlb11)
summary(m3)
## 
## Call:
## lm(formula = runs ~ hits, data = mlb11)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -103.718  -27.179   -5.233   19.322  140.693 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -375.5600   151.1806  -2.484   0.0192 *  
## hits           0.7589     0.1071   7.085 1.04e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 50.23 on 28 degrees of freedom
## Multiple R-squared:  0.6419, Adjusted R-squared:  0.6292 
## F-statistic:  50.2 on 1 and 28 DF,  p-value: 1.043e-07
plot(mlb11$runs~ mlb11$hits)
abline(m3)

plot(m3$residuals ~ mlb11$hits)
abline(h = 0, lty = 3)

The variable hits is the best predictor of runs among the models fit above, with R^2 = 0.6419 (compared with 0.6266 for homeruns and 0.3729 for at_bats). The scatterplot with the fitted line and the residual plot also support hits as the better predictor.

4. Now examine the three newer variables. These are the statistics used by the author of Moneyball to predict a team’s success. In general, are they more or less effective at predicting runs than the old variables? Explain using appropriate graphical and numerical evidence. Of all ten variables we’ve analyzed, which seems to be the best predictor of runs? Using the limited (or not so limited) information you know about these baseball statistics, does your result make sense?

They are more effective at predicting runs than the older variables: all three have higher R^2 values (0.8491 for new_onbase, 0.8969 for new_slug, and 0.9349 for new_obs) than any of the traditional models above. Based on the graphs and the summaries, the best predictor is new_obs. A quick Google search shows that this is on-base plus slugging, and with my limited knowledge of baseball that does seem to make sense.

###NEW ON BASE
onbase <- lm(runs ~ new_onbase, data = mlb11)
summary(onbase)
## 
## Call:
## lm(formula = runs ~ new_onbase, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -58.270 -18.335   3.249  19.520  69.002 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1118.4      144.5  -7.741 1.97e-08 ***
## new_onbase    5654.3      450.5  12.552 5.12e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 32.61 on 28 degrees of freedom
## Multiple R-squared:  0.8491, Adjusted R-squared:  0.8437 
## F-statistic: 157.6 on 1 and 28 DF,  p-value: 5.116e-13
plot(mlb11$runs~ mlb11$new_onbase)
abline(onbase)

plot(onbase$residuals ~ mlb11$new_onbase)
abline(h = 0, lty = 3)

###NEW SLUGS

slug<- lm(runs ~ new_slug, data = mlb11)
summary (slug)
## 
## Call:
## lm(formula = runs ~ new_slug, data = mlb11)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -45.41 -18.66  -0.91  16.29  52.29 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -375.80      68.71   -5.47 7.70e-06 ***
## new_slug     2681.33     171.83   15.61 2.42e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26.96 on 28 degrees of freedom
## Multiple R-squared:  0.8969, Adjusted R-squared:  0.8932 
## F-statistic: 243.5 on 1 and 28 DF,  p-value: 2.42e-15
plot(mlb11$runs~ mlb11$new_slug)
abline(slug)

plot(slug$residuals ~ mlb11$new_slug)
abline(h = 0, lty = 3)

### NEW OBS

obs <- lm (runs~new_obs, data=mlb11)
summary(obs)
## 
## Call:
## lm(formula = runs ~ new_obs, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -43.456 -13.690   1.165  13.935  41.156 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -686.61      68.93  -9.962 1.05e-10 ***
## new_obs      1919.36      95.70  20.057  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.41 on 28 degrees of freedom
## Multiple R-squared:  0.9349, Adjusted R-squared:  0.9326 
## F-statistic: 402.3 on 1 and 28 DF,  p-value: < 2.2e-16
plot(mlb11$runs~ mlb11$new_obs)
abline(obs)

plot(obs$residuals ~ mlb11$new_obs)
abline(h = 0, lty = 3)
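
To put the numerical evidence side by side, the R-squared values printed in the summaries above can be collected into one named vector and sorted. The values are copied from the output in this document; the vector name r_squared is my own.

```r
# Multiple R-squared values copied from the model summaries in this document
r_squared <- c(
  at_bats    = 0.3729,
  homeruns   = 0.6266,
  hits       = 0.6419,
  new_onbase = 0.8491,
  new_slug   = 0.8969,
  new_obs    = 0.9349
)

# Largest R-squared first
sort(r_squared, decreasing = TRUE)

# Name of the predictor with the highest R-squared
best_predictor <- names(which.max(r_squared))
best_predictor
```

With the fitted models in memory, the same numbers can be pulled out without retyping them, for example summary(m1)$r.squared.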

5. Check the model diagnostics for the regression model with the variable you decided was the best predictor for runs.

The models in question 4 showed that new_obs was the best predictor of runs. Its residual plot is produced there by plot(obs$residuals ~ mlb11$new_obs).
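
The remaining diagnostics for this model could be produced with the same three plots used for m1 earlier (residuals against the predictor, a histogram, and a normal probability plot). The sketch below wraps them in a helper; check_diagnostics is a hypothetical function of my own, not part of the lab, and the call at the bottom only runs if mlb11 has already been loaded.

```r
# Hypothetical helper: draws the three diagnostic plots for a fitted lm model
check_diagnostics <- function(model, predictor, label) {
  res <- residuals(model)

  # Put the three plots side by side, then restore the old layout on exit
  old_par <- par(mfrow = c(1, 3))
  on.exit(par(old_par))

  # Linearity and constant variability: residuals against the predictor
  plot(res ~ predictor, xlab = label, ylab = "Residuals")
  abline(h = 0, lty = 3)  # horizontal dashed line at y = 0

  # Nearly normal residuals: histogram and normal probability plot
  hist(res, main = "Histogram of residuals", xlab = "Residuals")
  qqnorm(res)
  qqline(res)

  invisible(res)
}

# Only runs when the lab data is available in the workspace
if (exists("mlb11")) {
  obs <- lm(runs ~ new_obs, data = mlb11)
  check_diagnostics(obs, mlb11$new_obs, "new_obs")
}
```

The plots would then be read the same way as in Exercises 6 through 8: no pattern around the zero line, a roughly symmetric histogram, and points close to the line in the normal probability plot.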