library(tidyverse)
library(openintro)
library(statsr)

In this lab we’ll be looking at data from all 30 Major League Baseball teams and examining the linear relationship between runs scored in a season and a number of other player statistics. Our aim will be to summarize these relationships both graphically and numerically in order to find which variable, if any, helps us best predict a team’s runs scored in a season.

#load the data

download.file("http://www.openintro.org/stat/data/mlb11.RData", destfile = "mlb11.RData")
load("mlb11.RData")

In addition to runs scored, there are seven traditionally used variables in the data set: at-bats, hits, home runs, batting average, strikeouts, stolen bases, and wins. There are also three newer variables: on-base percentage, slugging percentage, and on-base plus slugging. For the first portion of the analysis we’ll consider the seven traditional variables. At the end of the lab, you’ll work with the newer variables on your own.

summary(mlb11)
##                    team         runs          at_bats          hits     
##  Arizona Diamondbacks: 1   Min.   :556.0   Min.   :5417   Min.   :1263  
##  Atlanta Braves      : 1   1st Qu.:629.0   1st Qu.:5448   1st Qu.:1348  
##  Baltimore Orioles   : 1   Median :705.5   Median :5516   Median :1394  
##  Boston Red Sox      : 1   Mean   :693.6   Mean   :5524   Mean   :1409  
##  Chicago Cubs        : 1   3rd Qu.:734.0   3rd Qu.:5575   3rd Qu.:1441  
##  Chicago White Sox   : 1   Max.   :875.0   Max.   :5710   Max.   :1600  
##  (Other)             :24                                                
##     homeruns        bat_avg         strikeouts    stolen_bases   
##  Min.   : 91.0   Min.   :0.2330   Min.   : 930   Min.   : 49.00  
##  1st Qu.:118.0   1st Qu.:0.2447   1st Qu.:1085   1st Qu.: 89.75  
##  Median :154.0   Median :0.2530   Median :1140   Median :107.00  
##  Mean   :151.7   Mean   :0.2549   Mean   :1150   Mean   :109.30  
##  3rd Qu.:172.8   3rd Qu.:0.2602   3rd Qu.:1248   3rd Qu.:130.75  
##  Max.   :222.0   Max.   :0.2830   Max.   :1323   Max.   :170.00  
##                                                                  
##       wins          new_onbase        new_slug         new_obs      
##  Min.   : 56.00   Min.   :0.2920   Min.   :0.3480   Min.   :0.6400  
##  1st Qu.: 72.00   1st Qu.:0.3110   1st Qu.:0.3770   1st Qu.:0.6920  
##  Median : 80.00   Median :0.3185   Median :0.3985   Median :0.7160  
##  Mean   : 80.97   Mean   :0.3205   Mean   :0.3988   Mean   :0.7191  
##  3rd Qu.: 90.00   3rd Qu.:0.3282   3rd Qu.:0.4130   3rd Qu.:0.7382  
##  Max.   :102.00   Max.   :0.3490   Max.   :0.4610   Max.   :0.8100  
## 

Exercise 1

What type of plot would you use to display the relationship between runs and one of the other numerical variables? Plot this relationship using the variable at_bats as the predictor. Does the relationship look linear? If you knew a team’s at_bats, would you be comfortable using a linear model to predict the number of runs?

Answer

  1. I would use a scatterplot to display the relationship between runs and another numerical variable such as at_bats.

  2. See the plot below of runs against at_bats.

  3. The plot shows a positive, roughly linear trend, but with a lot of variation around it, so a linear model based on at_bats alone would give only rough predictions of runs.

#scatterplot 

plot(mlb11$runs ~ mlb11$at_bats,
     main = "Relationship Between Runs and At_Bats",
     ylab = "Runs", 
     xlab = "At_Bats")

#If the relationship looks linear, we can quantify the strength of the relationship with the correlation coefficient.
cor(mlb11$runs, mlb11$at_bats)
## [1] 0.610627

Sum of squared residuals

Think back to the way that we described the distribution of a single variable. Recall that we discussed characteristics such as center, spread, and shape. It’s also useful to be able to describe the relationship of two numerical variables, such as runs and at_bats above.

Exercise 2

Looking at your plot from the previous exercise, describe the relationship between these two variables. Make sure to discuss the form, direction, and strength of the relationship as well as any unusual observations.

Answer

The relationship is positive and roughly linear: as at_bats increase, runs tend to increase. The correlation coefficient of 0.611 indicates a moderately strong relationship. It falls well short of 1, the value that would indicate a perfect linear relationship with every point lying exactly on a line. So, while the relationship is positive, it is only moderately strong, and the points scatter considerably around the best-fit line.

#Just as we used the mean and standard deviation to summarize a single variable, we can summarize the relationship between these two variables by finding the line that best follows their association.
#Use the following interactive function to select the line that you think does the best job of going through the cloud of points.



plot_ss(x = mlb11$at_bats, y = mlb11$runs, showSquares = TRUE)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##  -2789.2429       0.6305  
## 
## Sum of Squares:  123721.9
#The most common way to do linear regression is to select the line that minimizes the sum of squared residuals. Setting showSquares = TRUE draws the squared residuals on the plot.
#Note that the output from the plot_ss function provides you with the slope and intercept of your line as well as the sum of squares.

After running this command, you’ll be prompted to click two points on the plot to define a line. Once you’ve done that, the line you specified will be shown in black and the residuals in blue. Note that there are 30 residuals, one for each of the 30 observations. Recall that the residuals are the difference between the observed values and the values predicted by the line:

\[ e_i = y_i - \hat{y}_i \]
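The residual definition above can be checked by hand on a tiny example. The sketch below uses made-up toy values (not taken from mlb11) and a hypothetical helper named `sum_sq` to compare the sum of squared residuals of a guessed line with that of the least squares line from `lm`.

```r
# Hypothetical toy data (NOT from mlb11), small enough to verify by hand
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Sum of squared residuals for the line y-hat = intercept + slope * x
sum_sq <- function(intercept, slope, x, y) {
  residuals <- y - (intercept + slope * x)  # e_i = y_i - y-hat_i
  sum(residuals^2)
}

# A guessed line, similar to one drawn by clicking two points in plot_ss
ss_guess <- sum_sq(intercept = 0.5, slope = 1.8, x = x, y = y)

# The least squares line found by lm()
fit <- lm(y ~ x)
ss_ls <- sum_sq(coef(fit)[[1]], coef(fit)[[2]], x = x, y = y)

# The least squares value is never larger than that of any other line
c(guess = ss_guess, least_squares = ss_ls)
```

No line can produce a smaller sum of squares than the one `lm` returns, which is why trial and error with plot_ss can only approach that value from above.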

Exercise 3

Using plot_ss, choose a line that does a good job of minimizing the sum of squares. Run the function several times. What was the smallest sum of squares that you got? How does it compare to your neighbors?

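Answer I ran the function several times, and the sum of squares stayed at 123721.9 on every run. The reported intercept (-2789.2429) and slope (0.6305) are identical to the least squares coefficients that lm returns for runs ~ at_bats. This suggests that no clicked points were registered (plausibly because the document was knitted non-interactively) and that the function fit a line to all 30 observations instead. If so, 123721.9 is the least squares minimum, and any hand-picked line would give a larger sum of squares.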

plot_ss(x = mlb11$at_bats, y = mlb11$runs, showSquares = TRUE)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##  -2789.2429       0.6305  
## 
## Sum of Squares:  123721.9

The linear model

It is rather cumbersome to try to get the correct least squares line, i.e. the line that minimizes the sum of squared residuals, through trial and error. Instead we can use the lm function in R to fit the linear model (a.k.a. regression line).

m1 <- lm(runs ~ at_bats, data = mlb11)  # lm takes a formula of the form y ~ x: here, model runs as a function of at_bats
                                        # The data argument tells R to look in the mlb11 data frame for runs and at_bats.
                                        # lm returns an object that holds everything about the fitted model; summary() displays it.
summary(m1)
## 
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -125.58  -47.05  -16.59   54.40  176.87 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2789.2429   853.6957  -3.267 0.002871 ** 
## at_bats         0.6305     0.1545   4.080 0.000339 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 66.47 on 28 degrees of freedom
## Multiple R-squared:  0.3729, Adjusted R-squared:  0.3505 
## F-statistic: 16.65 on 1 and 28 DF,  p-value: 0.0003388

Let’s consider this output piece by piece. First, the formula used to describe the model is shown at the top. After the formula comes the five-number summary of the residuals. The “Coefficients” table shown next is key; its first column displays the linear model’s y-intercept and the coefficient of at_bats. With this table, we can write down the least squares regression line for the linear model:

\[ \hat{y} = -2789.2429 + 0.6305 \times \text{at\_bats} \]

One last piece of information we will discuss from the summary output is the Multiple R-squared, or more simply, $R^2$. The $R^2$ value represents the proportion of variability in the response variable that is explained by the explanatory variable. For this model, 37.3% of the variability in runs is explained by at-bats.
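In a regression with a single predictor, $R^2$ equals the square of the correlation coefficient, and it can also be written as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below first checks this against the correlation reported earlier for runs and at_bats, then confirms the identity on R's built-in `cars` data, which stands in for mlb11 here only so that the example runs without a download.

```r
# Correlation between runs and at_bats reported earlier in this lab
r_at_bats <- 0.610627
round(r_at_bats^2, 4)  # 0.3729, matching the Multiple R-squared above

# Same identity on R's built-in cars data (a stand-in, not baseball data)
fit <- lm(dist ~ speed, data = cars)

r2_summary <- summary(fit)$r.squared            # reported by summary()
r2_cor     <- cor(cars$dist, cars$speed)^2      # squared correlation

ss_res <- sum(resid(fit)^2)                     # residual sum of squares
ss_tot <- sum((cars$dist - mean(cars$dist))^2)  # total sum of squares
r2_ss  <- 1 - ss_res / ss_tot                   # proportion explained

c(summary = r2_summary, correlation = r2_cor, sums_of_squares = r2_ss)
```

All three values agree, which is why the correlation and $R^2$ always rank single-predictor models of the same response in the same order.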

Exercise 4

Fit a new model that uses homeruns to predict runs. Using the estimates from the R output, write the equation of the regression line. What does the slope tell us in the context of the relationship between success of a team and its home runs?

Answer The relationship is positive and roughly linear, with a fairly strong correlation coefficient of 0.7915577.
The slope is 1.8345: for each additional home run, the model predicts about 1.83 more runs on average. Teams that hit more home runs tend to score more runs.

Equation of the Regression Line

The “Coefficients” table shown below is key; its first column displays the linear model’s y-intercept and the coefficient of homeruns. With this table, we can write down the least squares regression line for the linear model:

\[ \hat{y} = 415.2389 + 1.8345 \times \text{homeruns} \]
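To make the slope interpretation concrete, the equation above can be turned into a small function. This is only an illustrative sketch: the coefficients are copied from the equation, the function name `predict_runs_from_hr` is hypothetical, and the home run totals of 150 and 160 are arbitrary example inputs.

```r
# Coefficients copied from the regression equation for runs ~ homeruns
intercept <- 415.2389
slope     <- 1.8345

# Hypothetical helper: predicted runs for a given number of home runs
predict_runs_from_hr <- function(homeruns) {
  intercept + slope * homeruns
}

# Predicted runs for a team with 150 home runs (an arbitrary example value)
runs_150 <- predict_runs_from_hr(150)

# Ten additional home runs change the prediction by 10 * slope
extra_runs <- predict_runs_from_hr(160) - predict_runs_from_hr(150)

c(runs_at_150 = runs_150, gain_from_10_more = extra_runs)
```

The gain depends only on the slope, not on the starting point, which is what a constant slope means for a linear model.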

#plot of homeruns to predict runs

plot_ss(x = mlb11$homeruns, y = mlb11$runs, showSquares = TRUE)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##     415.239        1.835  
## 
## Sum of Squares:  73671.99
lm(formula = runs ~ homeruns, data = mlb11)
## 
## Call:
## lm(formula = runs ~ homeruns, data = mlb11)
## 
## Coefficients:
## (Intercept)     homeruns  
##     415.239        1.835
# Model runs as a function of homeruns; the data argument tells R to look in mlb11 for both variables.
# The fit is stored as m2 so that m1 (runs ~ at_bats) stays available for the sections that follow.
m2 <- lm(runs ~ homeruns, data = mlb11)
summary(m2)
## 
## Call:
## lm(formula = runs ~ homeruns, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -91.615 -33.410   3.231  24.292 104.631 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 415.2389    41.6779   9.963 1.04e-10 ***
## homeruns      1.8345     0.2677   6.854 1.90e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 51.29 on 28 degrees of freedom
## Multiple R-squared:  0.6266, Adjusted R-squared:  0.6132 
## F-statistic: 46.98 on 1 and 28 DF,  p-value: 1.9e-07
#If the relationship looks linear, we can quantify the strength of the relationship with the correlation coefficient.
cor(mlb11$runs, mlb11$homeruns)
## [1] 0.7915577

Prediction and prediction errors

Let’s create a scatterplot with the least squares line laid on top.

#Extrapolation

#The function abline plots a line based on its slope and intercept. Here we use a shortcut by passing a fitted model, which contains both parameter estimates.
#This line can be used to predict y at any value of x. Predicting for values of x beyond the range of the observed data is called extrapolation and is not usually recommended.
#Predictions made within the range of the data are more reliable. They are also used to compute the residuals.

m1 <- lm(runs ~ at_bats, data = mlb11)  # refit runs ~ at_bats so the line drawn matches this plot
plot(mlb11$runs ~ mlb11$at_bats)
abline(m1)
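One simple guard against extrapolation is to compare a requested x value with the observed range before predicting. The sketch below takes the minimum and maximum of at_bats from the summary(mlb11) output near the top of this lab; the helper name `is_extrapolation` is hypothetical.

```r
# Observed range of at_bats, copied from the summary(mlb11) output
at_bats_range <- c(min = 5417, max = 5710)

# Hypothetical helper: TRUE when x lies outside the observed range
is_extrapolation <- function(x, observed_range) {
  x < observed_range[["min"]] | x > observed_range[["max"]]
}

# 5578 is inside the observed range; 6000 is beyond it
is_extrapolation(c(5578, 6000), at_bats_range)
```

A prediction at 5,578 at-bats is therefore an interpolation, while one at 6,000 would rely on the line holding outside the data.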

Exercise 5

If a team manager saw the least squares regression line and not the actual data, how many runs would he or she predict for a team with 5,578 at-bats? Is this an overestimate or an underestimate, and by how much? In other words, what is the residual for this prediction?

Answer The least squares regression line from the linear model section is $\hat{y} = -2789.2429 + 0.6305 \times \text{at\_bats}$. For at_bats = 5,578, the predicted value is -2789.2429 + 0.6305 × 5578 = 727.6861, so a team manager who saw only the line would predict about 728 runs.

The closest team in the data table has 5,579 at-bats and 713 runs. The residual is the observed value minus the predicted value, 713 - 728 = -15, so the line overestimates that team's runs by about 15.
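The arithmetic in this answer can be reproduced directly from the coefficients. The sketch below hard-codes the intercept and slope from the summary output and the observed data point quoted above (5,579 at-bats, 713 runs); the helper name `predict_runs` is hypothetical.

```r
# Coefficients copied from summary(m1) for runs ~ at_bats
intercept <- -2789.2429
slope     <- 0.6305

# Hypothetical helper: predicted runs for a given number of at-bats
predict_runs <- function(at_bats) {
  intercept + slope * at_bats
}

# Prediction the manager would make from the line alone
predicted_5578 <- predict_runs(5578)

# Residual for the nearby observed team: observed minus predicted
observed_runs  <- 713
residual_5579  <- observed_runs - predict_runs(5579)

c(predicted = predicted_5578, residual = residual_5579)
```

A negative residual means the line overestimates. With the fitted model object available, `predict(m1, newdata = data.frame(at_bats = 5578))` should give the same prediction up to rounding of the coefficients.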

The equation comes from the model fit earlier with m1 <- lm(runs ~ at_bats, data = mlb11) and inspected with summary(m1).

#look at the data and pull a close data point 
view(mlb11)    #in the table, there is an at_bat of 5,579 and a run of 713

Model diagnostics

To assess whether the linear model is reliable, we need to check for (1) linearity, (2) nearly normal residuals, and (3) constant variability.

Linearity: You already checked if the relationship between runs and at-bats is linear using a scatterplot. We should also verify this condition with a plot of the residuals vs. at-bats. Recall that any code following a # is intended to be a comment that helps understand the code but is ignored by R.

#To assess whether the linear model is reliable, we need to check for (1) linearity, (2) nearly normal residuals, and (3) constant variability.
#Item 1: Linearity: the scatterplot above was the first check; confirm it with a plot of the residuals vs. at-bats.

plot(m1$residuals ~ mlb11$at_bats)
abline(h = 0, lty = 3)  # adds a horizontal dashed line at y = 0

Exercise 6


Is there any apparent pattern in the residuals plot? What does this indicate about the linearity of the relationship between runs and at-bats?

Answer There is no apparent pattern in the residuals plot.

This is consistent with a linear relationship between runs and at-bats.
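For contrast, it helps to see what a pattern in the residuals looks like when the linearity condition fails. The sketch below uses deliberately curved, made-up data (y equal to x squared, not baseball data) and fits a straight line to it.

```r
# Made-up curved data: y grows as x squared, so a straight line is the wrong form
x <- 1:10
y <- x^2

# Fit a straight line anyway
fit <- lm(y ~ x)
res <- resid(fit)

# Residuals are positive at both ends and negative in the middle: a U shape
round(unname(res), 2)

# Sign pattern along x makes the curvature obvious
sign(round(unname(res), 8))
```

Plotting `res` against `x` with a dashed line at zero would show the same U shape. A residual plot with no such systematic pattern is what supports the linearity condition.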

#Nearly normal residuals: To check this condition, we can look at a histogram
hist(m1$residuals)

#or a normal probability plot of the residuals.
qqnorm(m1$residuals)
qqline(m1$residuals)  # adds diagonal line to the normal prob plot

Exercise 7

Based on the histogram and the normal probability plot, does the nearly normal residuals condition appear to be met?

Answer

Yes. Based on the histogram and the normal probability plot, the residuals appear nearly normal, with only a few points departing at the extremes, so the condition appears to be met.

Exercise 8

Based on the plot in (1), does the constant variability condition appear to be met?

Answer

The constant variability condition is met when the spread of the residuals around 0 is roughly the same across the whole range of at_bats. Based on the plot in (1), the condition appears to be met.

ON YOUR OWN 1

Choose another traditional variable from mlb11 that you think might be a good predictor of runs. Produce a scatterplot of the two variables and fit a linear model. At a glance, does there seem to be a linear relationship?

Answer

Traditional variable selected: bat_avg. The scatterplot with the fitted line is below.

Yes, there seems to be a linear relationship. The trend is positive, and the correlation coefficient of 0.8099859 indicates a strong linear association between batting average and runs.

#scatterplot 

plot(mlb11$runs ~ mlb11$bat_avg,
     main = "Relationship Between Runs and Bat_Avg",
     ylab = "Runs", 
     xlab = "Bat_Avg")


#fit a simple linear regression model
model <- lm(runs ~ bat_avg, data = mlb11)

#add the fitted regression line to the scatterplot
abline(model)

#If the relationship looks linear, we can quantify the strength of the relationship with the correlation coefficient.
cor(mlb11$runs, mlb11$bat_avg)
## [1] 0.8099859

ON YOUR OWN 2

How does this relationship compare to the relationship between runs and at_bats? Use the $R^2$ values from the two model summaries to compare. Does your variable seem to predict runs better than at_bats? How can you tell?

Answer

$R^2$ for at_bats: 0.3729

$R^2$ for bat_avg: 0.6561

Based on the values of $R^2$, the relationship between runs and bat_avg is stronger than that between runs and at_bats: $R^2$ is 0.6561 for the model using bat_avg and 0.3729 for the model using at_bats. The sum of squared residuals tells the same story. It is 123721.9 for the model using at_bats and 67849.52 for the model using bat_avg (see the R Console output of plot_ss below). With the higher $R^2$ and the smaller sum of squared residuals, bat_avg is the better predictor of runs.

#Runs and AT_BATS
plot_ss(x = mlb11$at_bats, y = mlb11$runs)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##  -2789.2429       0.6305  
## 
## Sum of Squares:  123721.9

The linear model

Use the lm function in R to fit the linear model (a.k.a. regression line).

m1 <- lm(runs ~ at_bats, data = mlb11)  # model runs as a function of at_bats; R looks in mlb11 for both variables
summary(m1)
## 
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -125.58  -47.05  -16.59   54.40  176.87 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2789.2429   853.6957  -3.267 0.002871 ** 
## at_bats         0.6305     0.1545   4.080 0.000339 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 66.47 on 28 degrees of freedom
## Multiple R-squared:  0.3729, Adjusted R-squared:  0.3505 
## F-statistic: 16.65 on 1 and 28 DF,  p-value: 0.0003388
#Runs and BAT_AVG
plot_ss(x = mlb11$bat_avg, y = mlb11$runs)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##      -642.8       5242.2  
## 
## Sum of Squares:  67849.52
m1 <- lm(runs ~ bat_avg, data = mlb11)  # model runs as a function of bat_avg; R looks in mlb11 for both variables
summary(m1)
## 
## Call:
## lm(formula = runs ~ bat_avg, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -94.676 -26.303  -5.496  28.482 131.113 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -642.8      183.1  -3.511  0.00153 ** 
## bat_avg       5242.2      717.3   7.308 5.88e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 49.23 on 28 degrees of freedom
## Multiple R-squared:  0.6561, Adjusted R-squared:  0.6438 
## F-statistic: 53.41 on 1 and 28 DF,  p-value: 5.877e-08

ON YOUR OWN 3

Now that you can summarize the linear relationship between two variables, investigate the relationships between runs and each of the other five traditional variables. Which variable best predicts runs? Support your conclusion using the graphical and numerical methods we’ve discussed (for the sake of conciseness, only include output for the best variable, not all five).

Answer

  1. hits: $R^2$ = 0.6419, sum of squared residuals = 70638.75

  2. homeruns: $R^2$ = 0.6266, sum of squared residuals = 73671.99

  3. strikeouts: $R^2$ = 0.1694, sum of squared residuals = 163870.1

  4. stolen_bases: $R^2$ = 0.002914, sum of squared residuals = 196706.3

  5. wins: $R^2$ = 0.361, sum of squared residuals = 126068.4

Among these five variables, hits and homeruns are close: $R^2$ is 0.64 for hits and 0.63 for homeruns. Hits is the better predictor on both measures, with the higher $R^2$ and the smaller sum of squared residuals (70,639 vs. 73,672). A smaller sum of squared residuals means a better fit, so stolen_bases, with the largest value (196,706) and an $R^2$ near zero, is the weakest predictor of runs.

Overall, among the traditional variables, bat_avg appears to be the best predictor ($R^2$ = 0.6561), followed by hits.
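Because every model here has the same response (runs), the total sum of squares is the same for all of them, and $R^2 = 1 - SS_{res} / SS_{tot}$. Ranking by $R^2$ and ranking by sum of squared residuals must therefore agree. The sketch below checks this using only the values reported in this lab; the data frame name `fits` and its column names are illustrative.

```r
# R-squared and sum of squared residuals as reported in this lab
fits <- data.frame(
  predictor = c("at_bats", "bat_avg", "hits", "homeruns",
                "strikeouts", "stolen_bases", "wins"),
  r_squared = c(0.3729, 0.6561, 0.6419, 0.6266, 0.1694, 0.002914, 0.361),
  ss_resid  = c(123721.9, 67849.52, 70638.75, 73671.99,
                163870.1, 196706.3, 126068.4)
)

# Total sum of squares implied by each model: SS_res / (1 - R^2)
fits$implied_ss_total <- fits$ss_resid / (1 - fits$r_squared)

# Rank from best to worst by each criterion
by_ss <- fits$predictor[order(fits$ss_resid)]
by_r2 <- fits$predictor[order(-fits$r_squared)]

# Smallest sum of squared residuals first
fits[order(fits$ss_resid), ]
```

The implied totals agree to within rounding of the reported $R^2$ values, and the two rankings are identical, so the two criteria cannot point to different best predictors.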

# all plots together
trad1=lm(runs~hits,data=mlb11)
trad2=lm(runs~homeruns,data=mlb11)
trad3=lm(runs~strikeouts,data=mlb11)
trad4=lm(runs~stolen_bases, data=mlb11)
trad5=lm(runs~wins,data=mlb11)
par(mfrow=c(2,3))
plot(mlb11$hits,mlb11$runs,xlab="Hits",ylab="Runs",main="Hits Vs Runs") 
abline(trad1)
plot(mlb11$homeruns,mlb11$runs,xlab="Homeruns",ylab="Runs",main="Homeruns Vs Runs") 
abline(trad2)
plot(mlb11$strikeouts,mlb11$runs,xlab="Strikeouts",ylab="Runs",main="Strikeouts Vs Runs") 
abline(trad3)
plot(mlb11$stolen_bases,mlb11$runs,xlab="Stolen Bases",ylab="Runs",main="Stolen Bases Vs Runs") 
abline(trad4)
plot(mlb11$wins,mlb11$runs,xlab="Wins",ylab="Runs",main="Wins Vs Runs") 
abline(trad5)
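Fitting one model per predictor by hand gets repetitive, so a short helper can collect $R^2$ for several predictors at once. The sketch below is one possible approach, not the lab's own method: the function name `r2_by_predictor` is hypothetical, and R's built-in `mtcars` data stands in for mlb11 so that the example runs without a download.

```r
# Hypothetical helper: R-squared of a one-predictor model for each predictor
r2_by_predictor <- function(data, response, predictors) {
  sapply(predictors, function(p) {
    # reformulate() builds the formula response ~ p from character strings
    fit <- lm(reformulate(p, response = response), data = data)
    summary(fit)$r.squared
  })
}

# Stand-in example on built-in data: which variable best predicts mpg?
r2 <- r2_by_predictor(mtcars, "mpg", c("wt", "hp", "qsec"))

# Largest R-squared first
sort(r2, decreasing = TRUE)
```

With mlb11 loaded, the equivalent call would be `r2_by_predictor(mlb11, "runs", c("hits", "homeruns", "strikeouts", "stolen_bases", "wins"))`.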

#1.  Runs and HITS
plot_ss(x = mlb11$hits, y = mlb11$runs)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##   -375.5600       0.7589  
## 
## Sum of Squares:  70638.75
##1.  Runs and HITS
m1 <- lm(runs ~ hits, data = mlb11)  # model runs as a function of hits; R looks in mlb11 for both variables
summary(m1)
## 
## Call:
## lm(formula = runs ~ hits, data = mlb11)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -103.718  -27.179   -5.233   19.322  140.693 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -375.5600   151.1806  -2.484   0.0192 *  
## hits           0.7589     0.1071   7.085 1.04e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 50.23 on 28 degrees of freedom
## Multiple R-squared:  0.6419, Adjusted R-squared:  0.6292 
## F-statistic:  50.2 on 1 and 28 DF,  p-value: 1.043e-07
#2.  Runs and HOMERUNS
plot_ss(x = mlb11$homeruns, y = mlb11$runs)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##     415.239        1.835  
## 
## Sum of Squares:  73671.99
##2.  Runs and HOMERUNS
m1 <- lm(runs ~ homeruns, data = mlb11)  # model runs as a function of homeruns; R looks in mlb11 for both variables
summary(m1)
## 
## Call:
## lm(formula = runs ~ homeruns, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -91.615 -33.410   3.231  24.292 104.631 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 415.2389    41.6779   9.963 1.04e-10 ***
## homeruns      1.8345     0.2677   6.854 1.90e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 51.29 on 28 degrees of freedom
## Multiple R-squared:  0.6266, Adjusted R-squared:  0.6132 
## F-statistic: 46.98 on 1 and 28 DF,  p-value: 1.9e-07
#3.  Runs and STRIKEOUTS
plot_ss(x = mlb11$strikeouts, y = mlb11$runs)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##   1054.7342      -0.3141  
## 
## Sum of Squares:  163870.1
##3.  Runs and STRIKEOUTS
m1 <- lm(runs ~ strikeouts, data = mlb11)  # model runs as a function of strikeouts; R looks in mlb11 for both variables
summary(m1)
## 
## Call:
## lm(formula = runs ~ strikeouts, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -132.27  -46.95  -11.92   55.14  169.76 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1054.7342   151.7890   6.949 1.49e-07 ***
## strikeouts    -0.3141     0.1315  -2.389   0.0239 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 76.5 on 28 degrees of freedom
## Multiple R-squared:  0.1694, Adjusted R-squared:  0.1397 
## F-statistic: 5.709 on 1 and 28 DF,  p-value: 0.02386
#4.  Runs and STOLEN BASES
plot_ss(x = mlb11$stolen_bases, y = mlb11$runs)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##    677.3074       0.1491  
## 
## Sum of Squares:  196706.3
##4.  Runs and STOLEN BASES
m1 <- lm(runs ~ stolen_bases, data = mlb11)  # model runs as a function of stolen_bases; R looks in mlb11 for both variables
summary(m1)
## 
## Call:
## lm(formula = runs ~ stolen_bases, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -139.94  -62.87   10.01   38.54  182.49 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  677.3074    58.9751  11.485 4.17e-12 ***
## stolen_bases   0.1491     0.5211   0.286    0.777    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 83.82 on 28 degrees of freedom
## Multiple R-squared:  0.002914,   Adjusted R-squared:  -0.0327 
## F-statistic: 0.08183 on 1 and 28 DF,  p-value: 0.7769
#5.  Runs and Wins
plot_ss(x = mlb11$wins, y = mlb11$runs)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##     342.121        4.341  
## 
## Sum of Squares:  126068.4
##5.  Runs and WINS
m1 <- lm(runs ~ wins, data = mlb11)  # model runs as a function of wins; R looks in mlb11 for both variables
summary(m1)
## 
## Call:
## lm(formula = runs ~ wins, data = mlb11)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -145.450  -47.506   -7.482   47.346  142.186 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  342.121     89.223   3.834 0.000654 ***
## wins           4.341      1.092   3.977 0.000447 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 67.1 on 28 degrees of freedom
## Multiple R-squared:  0.361,  Adjusted R-squared:  0.3381 
## F-statistic: 15.82 on 1 and 28 DF,  p-value: 0.0004469

ON YOUR OWN 4

Now examine the three newer variables. These are the statistics used by the author of Moneyball to predict a team’s success. In general, are they more or less effective at predicting runs than the old variables? Explain using appropriate graphical and numerical evidence. Of all ten variables we’ve analyzed, which seems to be the best predictor of runs? Using the limited (or not so limited) information you know about these baseball statistics, does your result make sense?

Answer

The new variables, new_onbase, new_slug, and new_obs, have higher $R^2$ values and smaller sums of squared residuals than any of the traditional variables, so they are more effective predictors of runs. Of all ten variables we’ve analyzed, new_obs is the best predictor of runs ($R^2$ = 0.9349), followed by new_slug and new_onbase. This seems plausible, since on-base plus slugging combines how often batters reach base with how many bases they gain, and both contribute directly to scoring.

NEW VARIABLES

  1. new_onbase: R2 = 0.8491, sum of squared residuals = 29768.7

  2. new_slug: R2 = 0.8969, sum of squared residuals = 20345.54

  3. new_obs: R2 = 0.9349, sum of squared residuals = 12837.66

TRADITIONAL VARIABLES (SEE ON YOUR OWN 3 FOR WORK)

  1. hits: R2 = 0.6419, sum of squared residuals = 70638.75

  2. homeruns: R2 = 0.6266, sum of squared residuals = 73671.99

  3. strikeouts: R2 = 0.1694, sum of squared residuals = 163870.1

  4. stolen_bases: R2 = 0.002914, sum of squared residuals = 196706.3

  5. wins: R2 = 0.361, sum of squared residuals = 126068.4

# all plots together
new1=lm(runs~new_onbase,data=mlb11)
new2=lm(runs~new_slug,data=mlb11)
new3=lm(runs~new_obs,data=mlb11)
par(mfrow=c(1,3))
plot(mlb11$new_onbase,mlb11$runs,xlab="New On Base",ylab="Runs",main="New On Base Vs Runs") 
abline(new1)
plot(mlb11$new_slug,mlb11$runs,xlab="New Slug",ylab="Runs",main="New Slug Vs Runs") 
abline(new2)
plot(mlb11$new_obs,mlb11$runs,xlab="New Obs",ylab="Runs",main="New Obs Vs Runs") 
abline(new3)

summary(new1)
## 
## Call:
## lm(formula = runs ~ new_onbase, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -58.270 -18.335   3.249  19.520  69.002 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1118.4      144.5  -7.741 1.97e-08 ***
## new_onbase    5654.3      450.5  12.552 5.12e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 32.61 on 28 degrees of freedom
## Multiple R-squared:  0.8491, Adjusted R-squared:  0.8437 
## F-statistic: 157.6 on 1 and 28 DF,  p-value: 5.116e-13
summary(new2)
## 
## Call:
## lm(formula = runs ~ new_slug, data = mlb11)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -45.41 -18.66  -0.91  16.29  52.29 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -375.80      68.71   -5.47 7.70e-06 ***
## new_slug     2681.33     171.83   15.61 2.42e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26.96 on 28 degrees of freedom
## Multiple R-squared:  0.8969, Adjusted R-squared:  0.8932 
## F-statistic: 243.5 on 1 and 28 DF,  p-value: 2.42e-15
summary(new3)
## 
## Call:
## lm(formula = runs ~ new_obs, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -43.456 -13.690   1.165  13.935  41.156 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -686.61      68.93  -9.962 1.05e-10 ***
## new_obs      1919.36      95.70  20.057  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.41 on 28 degrees of freedom
## Multiple R-squared:  0.9349, Adjusted R-squared:  0.9326 
## F-statistic: 402.3 on 1 and 28 DF,  p-value: < 2.2e-16
#1.  Runs and new_onbase
plot_ss(x = mlb11$new_onbase, y = mlb11$runs)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##       -1118         5654  
## 
## Sum of Squares:  29768.7
##1.  Runs and new_onbase
m1 <- lm(runs ~ new_onbase, data = mlb11)  #lm is a formula that takes the form y ~ x. Here it can be read that we want to make a linear model of runs as a function of new_onbase
                                        #The second argument specifies that R should look in the mlb11 data frame to find the runs and new_onbase variables.
                                        #The output of lm is an object that contains all of the information we need about the linear model that was just fit. We can access this 
                                              #information using the summary function.
summary(m1)
## 
## Call:
## lm(formula = runs ~ new_onbase, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -58.270 -18.335   3.249  19.520  69.002 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1118.4      144.5  -7.741 1.97e-08 ***
## new_onbase    5654.3      450.5  12.552 5.12e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 32.61 on 28 degrees of freedom
## Multiple R-squared:  0.8491, Adjusted R-squared:  0.8437 
## F-statistic: 157.6 on 1 and 28 DF,  p-value: 5.116e-13
#1.  Runs and new_slug
plot_ss(x = mlb11$new_slug, y = mlb11$runs)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##      -375.8       2681.3  
## 
## Sum of Squares:  20345.54
##1.  Runs and new_slug
m1 <- lm(runs ~ new_slug, data = mlb11)  #The first argument of lm is a formula of the form y ~ x. Here it can be read that we want to make a linear model of runs as a function of new_slug
                                        #The second argument specifies that R should look in the mlb11 data frame to find the runs and new_slug variables.
                                        #The output of lm is an object that contains all of the information we need about the linear model that was just fit. We can access this 
                                              #information using the summary function.
summary(m1)
## 
## Call:
## lm(formula = runs ~ new_slug, data = mlb11)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -45.41 -18.66  -0.91  16.29  52.29 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -375.80      68.71   -5.47 7.70e-06 ***
## new_slug     2681.33     171.83   15.61 2.42e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26.96 on 28 degrees of freedom
## Multiple R-squared:  0.8969, Adjusted R-squared:  0.8932 
## F-statistic: 243.5 on 1 and 28 DF,  p-value: 2.42e-15
#1.  Runs and new_obs
plot_ss(x = mlb11$new_obs, y = mlb11$runs)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##      -686.6       1919.4  
## 
## Sum of Squares:  12837.66
##1.  Runs and new_obs
m1 <- lm(runs ~ new_obs, data = mlb11)  #lm is a formula that takes the form y ~ x. Here it can be read that we want to make a linear model of runs as a function of new_obs
                                        #The second argument specifies that R should look in the mlb11 data frame to find the runs and new_obs variables.
                                        #The output of lm is an object that contains all of the information we need about the linear model that was just fit. We can access this 
                                              #information using the summary function.
summary(m1)
## 
## Call:
## lm(formula = runs ~ new_obs, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -43.456 -13.690   1.165  13.935  41.156 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -686.61      68.93  -9.962 1.05e-10 ***
## new_obs      1919.36      95.70  20.057  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.41 on 28 degrees of freedom
## Multiple R-squared:  0.9349, Adjusted R-squared:  0.9326 
## F-statistic: 402.3 on 1 and 28 DF,  p-value: < 2.2e-16

ON YOUR OWN 5

Check the model diagnostics for the regression model with the variable you decided was the best predictor for runs.

Answer

As noted in On Your Own 4, new_obs is the best variable for predicting runs. Its R2 value of 0.935 (rounded to three decimal places) is higher than that of every other model, and its sum of squared residuals, 12837.66, is lower than that of every other model.

  1. R2 - new_obs: 0.9349
  2. Sum of squared residuals - new_obs: 12837.66
new3=lm(runs~new_obs,data=mlb11)
par(mfrow=c(1,1))
plot(mlb11$new_obs,mlb11$runs,xlab="New Obs",ylab="Runs",main="New Obs Vs Runs") 
abline(new3)

#1.  Runs and new_obs
plot_ss(x = mlb11$new_obs, y = mlb11$runs)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##      -686.6       1919.4  
## 
## Sum of Squares:  12837.66
---
title: "Intro to Linear Regression - Week 11"
author: "Aaryn Zimmerman"
date: "`r Sys.Date()`"
output: openintro::lab_report
---

```{r load-packages, message=FALSE}
library(tidyverse)
library(openintro)
library(statsr)

```
In this lab we’ll be looking at data from all 30 Major League Baseball teams and examining the linear relationship between runs scored in a season and a number of other player statistics. Our aim will be to summarize these relationships both graphically and numerically in order to find which variable, if any, helps us best predict a team’s runs scored in a season.

```{r}
#load the data

download.file("http://www.openintro.org/stat/data/mlb11.RData", destfile = "mlb11.RData")
load("mlb11.RData")


```
In addition to runs scored, there are seven traditionally used variables in the data set: at-bats, hits, home runs, batting average, strikeouts, stolen bases, and wins. There are also three newer variables: on-base percentage, slugging percentage, and on-base plus slugging. For the first portion of the analysis we’ll consider the seven traditional variables. At the end of the lab, you’ll work with the newer variables on your own.

```{r}
summary(mlb11)
```


### Exercise 1

What type of plot would you use to display the relationship between runs and one of the other numerical variables? Plot this relationship using the variable at_bats as the predictor. Does the relationship look linear? If you knew a team’s at_bats, would you be comfortable using a linear model to predict the number of runs?


***Answer***
1.  I would use a scatterplot to display the relationship between runs and one of the other numerical variables such as at_bats.
2.  See Plot below between runs and at_bats
3.  The trend does show some linear relationship between runs and at_bats, but there appears to be a lot of variation, so the linear model doesn't appear to be very useful as an accurate predictor in this case.



```{r code-chunk-label}
#scatterplot 

plot(mlb11$runs ~ mlb11$at_bats,
     main = "Relationship Between Runs and At_Bats",
     ylab = "Runs", 
     xlab = "At_Bats")
```


```{r}
#If the relationship looks linear, we can quantify the strength of the relationship with the correlation coefficient.
cor(mlb11$runs, mlb11$at_bats)


```
Sum of squared residuals
Think back to the way that we described the distribution of a single variable. Recall that we discussed characteristics such as center, spread, and shape. It’s also useful to be able to describe the relationship of two numerical variables, such as runs and at_bats above.


### Exercise 2
Looking at your plot from the previous exercise, describe the relationship between these two variables. Make sure to discuss the form, direction, and strength of the relationship as well as any unusual observations.


***Answer***

As at_bats increase, runs tend to increase, so there is a positive, roughly linear relationship between the two variables. The correlation coefficient of 0.611 indicates a moderately strong relationship. A correlation of 1 would mean that every point falls exactly on a straight line (not that every at-bat produces a run), and this relationship is well short of that: there is considerable scatter around the best-fit line.
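
As a side note on reading the correlation coefficient, here is a minimal sketch using made-up numbers (the vectors `x_demo` and `y_demo` are illustrative assumptions and are not part of mlb11). It shows that a correlation of 1 only means the points fall exactly on a line; the slope of that line can be any value.

```{r}
# Made-up data lying exactly on the line y = 2 + 3x (not from mlb11).
x_demo <- c(1, 2, 3, 4, 5)
y_demo <- 2 + 3 * x_demo

# The correlation is 1 because the points are perfectly linear ...
cor(x_demo, y_demo)

# ... even though the slope of the fitted line is 3, not 1.
coef(lm(y_demo ~ x_demo))
```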


```{r}


#Just as we used the mean and standard deviation to summarize a single variable, we can summarize the relationship between these two variables by finding the line that best follows their #association. Use the following interactive function to select the line that you think does the best job of going through the cloud of points.



plot_ss(x = mlb11$at_bats, y = mlb11$runs, showSquares = TRUE)

#showSquares = TRUE   #The most common way to do linear regression is to select the line that minimizes the sum of squared residuals. To visualize the squared residuals, run this line
                     #Note that the output from the plot_ss function provides you with the slope and intercept of your line as well as the sum of squares.
```


After running this command, you’ll be prompted to click two points on the plot to define a line. Once you’ve done that, the line you specified will be shown in black and the residuals in blue. Note that there are 30 residuals, one for each of the 30 observations. Recall that the residuals are the difference between the observed values and the values predicted by the line:

$$e_i = y_i - \hat{y}_i$$
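
The residual definition can be checked directly in R. The sketch below is only an illustration: `residuals_by_hand` is a hypothetical helper name introduced here, and the guarded block assumes mlb11 has been loaded as at the start of this lab.

```{r}
# Hypothetical helper: residuals as observed minus fitted values.
residuals_by_hand <- function(model) {
  observed <- model.response(model.frame(model))  # the y_i
  fitted_values <- fitted(model)                  # the y-hat_i
  observed - fitted_values
}

if (exists("mlb11")) {
  fit_at_bats <- lm(runs ~ at_bats, data = mlb11)
  e <- residuals_by_hand(fit_at_bats)
  print(length(e))  # one residual per team
  # Should agree with the residuals stored by lm.
  print(all.equal(unname(e), unname(residuals(fit_at_bats))))
  print(sum(e^2))   # sum of squared residuals for the least squares line
}
```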




### Exercise 3
Using plot_ss, choose a line that does a good job of minimizing the sum of squares. Run the function several times. What was the smallest sum of squares that you got? How does it compare to your neighbors?


***Answer***
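I ran the function several times, but my sum of squares did not change: it stayed at 123721.9 on every run. This most likely happened because the document was knitted non-interactively, so no points could be clicked and plot_ss fell back to the least squares line, whose sum of squares is the smallest possible.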


```{r}


plot_ss(x = mlb11$at_bats, y = mlb11$runs, showSquares = TRUE)
                     #The most common way to do linear regression is to select the line that minimizes the sum of squared residuals. To visualize the squared residuals, run this line
                     #Note that the output from the plot_ss function provides you with the slope and intercept of your line as well as the sum of squares.


```

The linear model
It is rather cumbersome to try to get the correct least squares line, i.e. the line that minimizes the sum of squared residuals, through trial and error. Instead we can use the lm function in R to fit the linear model (a.k.a. regression line).

```{r}
m1 <- lm(runs ~ at_bats, data = mlb11)  #lm is a formula that takes the form y ~ x. Here it can be read that we want to make a linear model of runs as a function of at_bats
                                        #The second argument specifies that R should look in the mlb11 data frame to find the runs and at_bats variables.
                                        #The output of lm is an object that contains all of the information we need about the linear model that was just fit. We can access this 
                                              #information using the summary function.
summary(m1)
```
Let’s consider this output piece by piece:

1.  The formula used to describe the model.
2.  A five-number summary of the residuals.
3.  The “Coefficients” table shown next is key; its first column displays the linear model’s y-intercept and the coefficient of at_bats. With this table, we can write down the least squares regression line for the linear model:

$$\hat{y} = -2789.2429 + 0.6305 \times \text{at\_bats}$$

One last piece of information we will discuss from the summary output is the Multiple R-squared, or more simply, R2. The R2 value represents the proportion of variability in the response variable that is explained by the explanatory variable. For this model, 37.3% of the variability in runs is explained by at-bats.
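
To see where R2 comes from, it can be rebuilt from the residuals. This is a sketch rather than part of the original lab: `r_squared_by_hand` is a hypothetical helper name, and the guarded block assumes mlb11 is loaded. For a model with one predictor, the result should also equal the squared correlation.

```{r}
# Hypothetical helper: R-squared as 1 - (residual SS / total SS).
r_squared_by_hand <- function(model) {
  y <- model.response(model.frame(model))
  ss_residual <- sum(residuals(model)^2)  # variability left unexplained
  ss_total <- sum((y - mean(y))^2)        # total variability in the response
  1 - ss_residual / ss_total
}

if (exists("mlb11")) {
  fit_at_bats <- lm(runs ~ at_bats, data = mlb11)
  print(r_squared_by_hand(fit_at_bats))      # computed by hand
  print(summary(fit_at_bats)$r.squared)      # reported by summary()
  print(cor(mlb11$runs, mlb11$at_bats)^2)    # squared correlation
}
```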





### Exercise 4

Fit a new model that uses homeruns to predict runs. Using the estimates from the R output, write the equation of the regression line. What does the slope tell us in the context of the relationship between success of a team and its home runs?


***Answer***
The slope is positive, so the fitted line is increasing, and the correlation coefficient of 0.7916 indicates a fairly strong linear relationship.  
In terms of the slope (rise/run) of the line, each additional home run is associated with an increase of about 1.8345 runs, on average.  


Equation of the Regression Line
The “Coefficients” table shown below is key; its first column displays the linear model’s y-intercept and the coefficient of homeruns. With this table, we can write down the least squares regression line for the linear model:

$$\hat{y} = 415.2389 + 1.8345 \times \text{homeruns}$$
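
The slope can also be read off the fitted model programmatically. In this sketch, `predicted_change` is a hypothetical helper name (not from the lab), and the guarded block assumes mlb11 is loaded; it multiplies the slope by a chosen increase in the predictor.

```{r}
# Hypothetical helper: predicted change in the response when the
# predictor increases by `increase` units (slope times increase).
predicted_change <- function(model, predictor, increase = 1) {
  unname(coef(model)[predictor]) * increase
}

if (exists("mlb11")) {
  fit_homeruns <- lm(runs ~ homeruns, data = mlb11)
  print(coef(fit_homeruns))                               # intercept and slope
  print(predicted_change(fit_homeruns, "homeruns", 10))   # 10 more home runs
}
```

With the slope of 1.8345 reported above, ten additional home runs correspond to roughly 18 more predicted runs.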



```{r}
#plot of homeruns to predict runs

plot_ss(x = mlb11$homeruns, y = mlb11$runs, showSquares = TRUE)
lm(formula = runs ~ homeruns, data = mlb11)


```

```{r}

##make a linear model of runs as a function of homeruns
#R should look in the mlb11 data frame to find the runs and homeruns variables.
#The output of lm is an object that contains all of the information we need about the linear model that was just fit. We can access this 
                                              #information using the summary function.
m1 <- lm(runs ~ homeruns, data = mlb11)
summary(m1)
```
```{r}
#If the relationship looks linear, we can quantify the strength of the relationship with the correlation coefficient.
cor(mlb11$runs, mlb11$homeruns)
```

Prediction and prediction errors
Let’s create a scatterplot with the least squares line laid on top.

```{r}
#Extrapolation

#The function abline plots a line based on its slope and intercept. Here, we used a shortcut by providing the model m1, which contains both parameter estimates. This line can be used to #predict y at any value of x. When predictions are made for values of x that are beyond the range of the observed data, it is referred to as extrapolation and is not usually recommended. #However, predictions made within the range of the data are more reliable. They’re also used to compute the residuals.


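# m1 was last fit with homeruns in Exercise 4, so refit it with at_bats
# before drawing the line and running the diagnostics that follow.
m1 <- lm(runs ~ at_bats, data = mlb11)
plot(mlb11$runs ~ mlb11$at_bats)
abline(m1)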

```






### Exercise 5

If a team manager saw the least squares regression line and not the actual data, how many runs would he or she predict for a team with 5,578 at-bats? Is this an overestimate or an underestimate, and by how much? In other words, what is the residual for this prediction?


***Answer***
We know from the linear model section above that the least squares regression line is $\hat{y} = -2789.2429 + 0.6305 \times \text{at\_bats}$.
With at_bats = 5,578, the predicted value is 727.6861. Without the actual data, a team manager would predict about 728 runs for a team with 5,578 at-bats.

The closest observation in the data table has 5,579 at-bats and 713 runs. Compared with that team, the prediction is an overestimate of about 15 runs: the residual (observed minus predicted) is roughly 713 − 728 = −15.
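
The same prediction can be obtained with `predict()` instead of hand arithmetic. This is a sketch under stated assumptions: `predict_runs` is a hypothetical helper name, and the guarded block assumes mlb11 is loaded with the `team`, `at_bats`, and `runs` columns shown in the summary at the top of the lab.

```{r}
# Hypothetical helper: predicted runs for given at-bats from a fitted model
# whose only predictor is named at_bats.
predict_runs <- function(model, at_bats) {
  predict(model, newdata = data.frame(at_bats = at_bats))
}

if (exists("mlb11")) {
  fit_at_bats <- lm(runs ~ at_bats, data = mlb11)
  print(predict_runs(fit_at_bats, 5578))  # prediction for 5,578 at-bats

  # Closest observed team, and its residual (observed minus predicted).
  nearest <- mlb11[which.min(abs(mlb11$at_bats - 5578)),
                   c("team", "at_bats", "runs")]
  print(nearest)
  print(nearest$runs - predict_runs(fit_at_bats, nearest$at_bats))
}
```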



We ran the following code to get the equation:
m1 <- lm(runs ~ at_bats, data = mlb11)  #lm is a formula that takes the form y ~ x. Here it can be read that we want to make a linear model of runs as a function of at_bats
                                        #The second argument specifies that R should look in the mlb11 data frame to find the runs and at_bats variables.
                                        #The output of lm is an object that contains all of the information we need about the linear model that was just fit. We can access this 
                                              #information using the summary function.
summary(m1)

```{r}
#look at the data and pull a close data point 
view(mlb11)    #in the table, there is an at_bat of 5,579 and a run of 713


```


Model diagnostics
To assess whether the linear model is reliable, we need to check for (1) linearity, (2) nearly normal residuals, and (3) constant variability.

Linearity: You already checked if the relationship between runs and at-bats is linear using a scatterplot. We should also verify this condition with a plot of the residuals vs. at-bats. Recall that any code following a # is intended to be a comment that helps understand the code but is ignored by R.

```{r}

#To assess whether the linear model is reliable, we need to check for (1) linearity, (2) nearly normal residuals, and (3) constant variability.
#ITEM 1:  Linearity:  DID ALL OF THIS ABOVE
#ITEM 1:  Linearity:  Check if relationship between runs and at-bats is linear by verifying condition with a plot of the residuals vs. at-bats.  

plot(m1$residuals ~ mlb11$at_bats)
abline(h = 0, lty = 3)  # adds a horizontal dashed line at y = 0
```

### Exercise 6


Is there any apparent pattern in the residuals plot? What does this indicate about the linearity of the relationship between runs and at-bats?

***Answer***
There is no apparent pattern in the residuals plot.

This indicates the relationship between runs and at-bats is linear.

```{r}
#Nearly normal residuals: To check this condition, we can look at a histogram
hist(m1$residuals)

```
```{r}
#or a normal probability plot of the residuals.
qqnorm(m1$residuals)
qqline(m1$residuals)  # adds diagonal line to the normal prob plot


```

### Exercise 7
Based on the histogram and the normal probability plot, does the nearly normal residuals condition appear to be met?

***Answer***

Yes, based on the histogram and the normal probability plot, the residuals appear to be nearly normally distributed, with very few extreme values.

### Exercise 8
Based on the plot in (1), does the constant variability condition appear to be met?

***Answer***

The constant variability condition is met when the spread of the residuals around zero is roughly the same across the whole range of at-bats. Yes, the constant variability condition appears to be met based on the residuals plot in (1).
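
The three diagnostic checks can be drawn side by side for any fitted model. This is a minimal sketch, not the lab's own code: `plot_diagnostics` is a hypothetical helper name, it plots residuals against fitted values rather than against at-bats (for a single predictor the pattern is the same), and the guarded call assumes mlb11 is loaded.

```{r}
# Hypothetical helper: residual plot, histogram, and normal probability plot.
plot_diagnostics <- function(model) {
  old_par <- par(mfrow = c(1, 3))
  on.exit(par(old_par))  # restore the previous plot layout afterwards

  # (1) linearity and (3) constant variability
  plot(fitted(model), residuals(model),
       xlab = "Fitted values", ylab = "Residuals",
       main = "Residuals vs fitted")
  abline(h = 0, lty = 3)

  # (2) nearly normal residuals
  hist(residuals(model), main = "Histogram of residuals", xlab = "Residuals")
  qqnorm(residuals(model))
  qqline(residuals(model))

  invisible(model)
}

if (exists("mlb11")) {
  plot_diagnostics(lm(runs ~ at_bats, data = mlb11))
}
```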




### ON YOUR OWN 1
Choose another traditional variable from mlb11 that you think might be a good predictor of runs. Produce a scatterplot of the two variables and fit a linear model. At a glance, does there seem to be a linear relationship?


***Answer***
Scatterplot with a linear model:  Traditional variable selected = bat_avg
Does there seem to be a linear relationship:  Yes. The relationship is positive, and the correlation coefficient of 0.8099 indicates a strong linear relationship between batting average and runs.


```{r}
#scatterplot 

plot(mlb11$runs ~ mlb11$bat_avg,
     main = "Relationship Between Runs and Bat_Avg",
     ylab = "Runs", 
     xlab = "Bat_Avg")


#fit a simple linear regression model (column names are looked up in mlb11)
model <- lm(runs ~ bat_avg, data = mlb11)

#add the fitted regression line to the scatterplot
abline(model)

```

```{r}
#If the relationship looks linear, we can quantify the strength of the relationship with the correlation coefficient.
cor(mlb11$runs, mlb11$bat_avg)
```

### ON YOUR OWN 2

How does this relationship compare to the relationship between runs and at_bats? Use the R2 values from the two model summaries to compare. Does your variable seem to predict runs better than at_bats? How can you tell?


***Answer***

R2 - at_bats:  0.3729
R2 - bat_avg:  0.6561



Based on the values of R2, the relationship between runs and bat_avg is stronger than that between runs and at_bats. The R2 value for the linear model using at_bats is 0.3729, while for the linear model using bat_avg it is 0.6561. The sum of squared residuals tells the same story: it is 123721.9 for the at_bats model and 67849.52 for the bat_avg model (see the R Console output of plot_ss for each). Because bat_avg explains more of the variability in runs and leaves smaller residuals, it is a better predictor of runs than at_bats.




```{r}
#Runs and AT_BATS
plot_ss(x = mlb11$at_bats, y = mlb11$runs)
```
The linear model
Use the lm function in R to fit the linear model (a.k.a. regression line).

```{r}
m1 <- lm(runs ~ at_bats, data = mlb11)  #lm is a formula that takes the form y ~ x. Here it can be read that we want to make a linear model of runs as a function of at_bats
                                        #The second argument specifies that R should look in the mlb11 data frame to find the runs and at_bats variables.
                                        #The output of lm is an object that contains all of the information we need about the linear model that was just fit. We can access this 
                                              #information using the summary function.
summary(m1)
```
```{r}

#BAT_AVG
plot_ss(x = mlb11$bat_avg, y = mlb11$runs)



```

```{r}
m1 <- lm(runs ~ bat_avg, data = mlb11)  #The first argument of lm is a formula of the form y ~ x. Here it can be read that we want to make a linear model of runs as a function of bat_avg
                                        #The second argument specifies that R should look in the mlb11 data frame to find the runs and bat_avg variables.
                                        #The output of lm is an object that contains all of the information we need about the linear model that was just fit. We can access this 
                                              #information using the summary function.
summary(m1)
```

### ON YOUR OWN 3


Now that you can summarize the linear relationship between two variables, investigate the relationships between runs and each of the other five traditional variables. Which variable best predicts runs? Support your conclusion using the graphical and numerical methods we’ve discussed (for the sake of conciseness, only include output for the best variable, not all five).



***Answer***

1.  hits: R2 = 0.6419, sum of squared residuals = 70638.75
2.  homeruns: R2 = 0.6266, sum of squared residuals = 73671.99
3.  strikeouts: R2 = 0.1694, sum of squared residuals = 163870.1
4.  stolen_bases: R2 = 0.002914, sum of squared residuals = 196706.3
5.  wins: R2 = 0.361, sum of squared residuals = 126068.4


Based on the values of R2, hits (0.64) and homeruns (0.63) are very close, with hits slightly ahead. The sum of squared residuals agrees: a smaller value means a better fit, and hits (70,639) is lower than homeruns (73,672). By the same measure, stolen_bases is the weakest predictor of runs, with the largest sum of squared residuals (196,706) and an R2 close to zero.

Overall, in looking at the traditional variables, bat_avg appears to be the best predictor (R2 = 0.6561, from On Your Own 2), followed by hits.
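
Rather than fitting each model separately, the comparison can be collected into one table. This is a sketch and not the method used above: `compare_predictors` is a hypothetical helper name, and the guarded call assumes mlb11 is loaded with the column names used elsewhere in this lab.

```{r}
# Hypothetical helper: fit response ~ predictor for each predictor and
# report R-squared and the sum of squared residuals, best fit first.
compare_predictors <- function(data, response, predictors) {
  rows <- lapply(predictors, function(p) {
    fit <- lm(reformulate(p, response = response), data = data)
    data.frame(predictor = p,
               r_squared = summary(fit)$r.squared,
               ss_residual = sum(residuals(fit)^2))
  })
  result <- do.call(rbind, rows)
  result[order(-result$r_squared), ]
}

if (exists("mlb11")) {
  traditional <- c("at_bats", "hits", "homeruns", "bat_avg",
                   "strikeouts", "stolen_bases", "wins")
  print(compare_predictors(mlb11, "runs", traditional))
}
```

Because every model has the same response, ranking by highest R2 and ranking by lowest sum of squared residuals give the same order.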


```{r}
# all plots together
trad1=lm(runs~hits,data=mlb11)
trad2=lm(runs~homeruns,data=mlb11)
trad3=lm(runs~strikeouts,data=mlb11)
trad4=lm(runs~stolen_bases, data=mlb11)
trad5=lm(runs~wins,data=mlb11)
par(mfrow=c(2,3))
plot(mlb11$hits,mlb11$runs,xlab="Hits",ylab="Runs",main="Hits Vs Runs") 
abline(trad1)
plot(mlb11$homeruns,mlb11$runs,xlab="Homeruns",ylab="Runs",main="Homeruns Vs Runs") 
abline(trad2)
plot(mlb11$strikeouts,mlb11$runs,xlab="Strikeouts",ylab="Runs",main="Strikeouts Vs Runs") 
abline(trad3)
plot(mlb11$stolen_bases,mlb11$runs,xlab="Stolen Bases",ylab="Runs",main="Stolen Bases Vs Runs") 
abline(trad4)
plot(mlb11$wins,mlb11$runs,xlab="Wins",ylab="Runs",main="Wins Vs Runs") 
abline(trad5)
```





```{r}
#1.  Runs and HITS
plot_ss(x = mlb11$hits, y = mlb11$runs)
```

```{r}
##1.  Runs and HITS
m1 <- lm(runs ~ hits, data = mlb11)  #lm is a formula that takes the form y ~ x. Here it can be read that we want to make a linear model of runs as a function of hits
                                        #The second argument specifies that R should look in the mlb11 data frame to find the runs and hits variables.
                                        #The output of lm is an object that contains all of the information we need about the linear model that was just fit. We can access this 
                                              #information using the summary function.
summary(m1)
```
```{r}
#2.  Runs and HOMERUNS
plot_ss(x = mlb11$homeruns, y = mlb11$runs)
```

```{r}
##2.  Runs and HOMERUNS
m1 <- lm(runs ~ homeruns, data = mlb11)  #lm is a formula that takes the form y ~ x. Here it can be read that we want to make a linear model of runs as a function of homeruns
                                        #The second argument specifies that R should look in the mlb11 data frame to find the runs and homeruns variables.
                                        #The output of lm is an object that contains all of the information we need about the linear model that was just fit. We can access this 
                                              #information using the summary function.
summary(m1)
```

```{r}
#3.  Runs and STRIKEOUTS
plot_ss(x = mlb11$strikeouts, y = mlb11$runs)
```

```{r}
##3.  Runs and STRIKEOUTS
m1 <- lm(runs ~ strikeouts, data = mlb11)  #lm is a formula that takes the form y ~ x. Here it can be read that we want to make a linear model of runs as a function of strikeouts
                                        #The second argument specifies that R should look in the mlb11 data frame to find the runs and strikeouts variables.
                                        #The output of lm is an object that contains all of the information we need about the linear model that was just fit. We can access this 
                                              #information using the summary function.
summary(m1)
```

```{r}
#4.  Runs and STOLEN BASES
plot_ss(x = mlb11$stolen_bases, y = mlb11$runs)
```


```{r}
##4.  Runs and STOLEN BASES
m1 <- lm(runs ~ stolen_bases, data = mlb11)  #lm is a formula that takes the form y ~ x. Here it can be read that we want to make a linear model of runs as a function of stolen bases
                                        #The second argument specifies that R should look in the mlb11 data frame to find the runs and stolen bases variables.
                                        #The output of lm is an object that contains all of the information we need about the linear model that was just fit. We can access this 
                                              #information using the summary function.
summary(m1)
```


```{r}
#5.  Runs and Wins
plot_ss(x = mlb11$wins, y = mlb11$runs)
```

```{r}
##5.  Runs and WINS
m1 <- lm(runs ~ wins, data = mlb11)  #lm is a formula that takes the form y ~ x. Here it can be read that we want to make a linear model of runs as a function of wins
                                        #The second argument specifies that R should look in the mlb11 data frame to find the runs and wins variables.
                                        #The output of lm is an object that contains all of the information we need about the linear model that was just fit. We can access this 
                                              #information using the summary function.
summary(m1)
```


### ON YOUR OWN 4

Now examine the three newer variables. These are the statistics used by the author of Moneyball to predict a team’s success. In general, are they more or less effective at predicting runs than the old variables? Explain using appropriate graphical and numerical evidence. Of all ten variables we’ve analyzed, which seems to be the best predictor of runs? Using the limited (or not so limited) information you know about these baseball statistics, does your result make sense?


***Answer***

The new variables, new_onbase, new_slug, and new_obs, have higher R2 values than the traditional variables, and their sums of squared residuals are smaller. Both measures indicate that the new variables are more effective predictors of runs than the traditional variables. Of all ten variables we've analyzed, new_obs is the best predictor of runs, followed by new_slug and new_onbase.

NEW VARIABLES

1.  new_onbase: R-squared = 0.8491; sum of squared residuals = 29768.7
2.  new_slug: R-squared = 0.8969; sum of squared residuals = 20345.54
3.  new_obs: R-squared = 0.9349; sum of squared residuals = 12837.66



TRADITIONAL VARIABLES (SEE ON YOUR OWN 3 FOR WORK)

1.  hits: R-squared = 0.6419; sum of squared residuals = 70638.75
2.  homeruns: R-squared = 0.6266; sum of squared residuals = 73671.99
3.  strikeouts: R-squared = 0.1694; sum of squared residuals = 163870.1
4.  stolen_bases: R-squared = 0.002914; sum of squared residuals = 196706.3
5.  wins: R-squared = 0.361; sum of squared residuals = 126068.4
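
The figures listed above were read off one model at a time. A minimal sketch of collecting the same two measures in a single table, sorted by R-squared, could look like the chunk below. `compare_predictors` is a hypothetical helper, not part of the lab or of any package, and the chunk assumes `mlb11` has already been loaded.

```{r}
# Sketch only: `compare_predictors` is a hypothetical helper name.
# For each predictor it fits response ~ predictor and records the
# R-squared and the sum of squared residuals.
compare_predictors <- function(data, response, predictors) {
  rows <- lapply(predictors, function(p) {
    fit <- lm(reformulate(p, response = response), data = data)
    data.frame(
      predictor = p,
      r_squared = summary(fit)$r.squared,
      sse       = sum(residuals(fit)^2),
      stringsAsFactors = FALSE
    )
  })
  out <- do.call(rbind, rows)
  out[order(-out$r_squared), ]   # best-fitting predictor first
}

# Assumes mlb11 is already loaded; skipped otherwise.
if (exists("mlb11")) {
  print(compare_predictors(
    mlb11, "runs",
    c("hits", "homeruns", "strikeouts", "stolen_bases", "wins",
      "new_onbase", "new_slug", "new_obs")
  ))
}
```

Because every model has the same response, ranking by R-squared and ranking by sum of squared residuals give the same order.
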




```{r}
# all plots together
new1 <- lm(runs ~ new_onbase, data = mlb11)
new2 <- lm(runs ~ new_slug, data = mlb11)
new3 <- lm(runs ~ new_obs, data = mlb11)
par(mfrow = c(1, 3))
plot(mlb11$new_onbase, mlb11$runs, xlab = "New On Base", ylab = "Runs", main = "New On Base Vs Runs")
abline(new1)
plot(mlb11$new_slug, mlb11$runs, xlab = "New Slug", ylab = "Runs", main = "New Slug Vs Runs")
abline(new2)
plot(mlb11$new_obs, mlb11$runs, xlab = "New Obs", ylab = "Runs", main = "New Obs Vs Runs")
abline(new3)
```


```{r}
summary(new1)
summary(new2)
summary(new3)

```










```{r}

#1.  Runs and new_onbase
plot_ss(x = mlb11$new_onbase, y = mlb11$runs)


```


```{r}
##1.  Runs and new_onbase
m1 <- lm(runs ~ new_onbase, data = mlb11)  #lm takes a formula of the form y ~ x as its first argument. Here it reads: make a linear model of runs as a function of new_onbase.
                                        #The second argument tells R to look in the mlb11 data frame for the runs and new_onbase variables.
                                        #The output of lm is an object that contains all of the information we need about the linear model that was just fit. We can access this 
                                              #information using the summary function.
summary(m1)
```


```{r}

#2.  Runs and new_slug
plot_ss(x = mlb11$new_slug, y = mlb11$runs)
```


```{r}
##2.  Runs and new_slug
m1 <- lm(runs ~ new_slug, data = mlb11)  #lm takes a formula of the form y ~ x as its first argument. Here it reads: make a linear model of runs as a function of new_slug.
                                        #The second argument tells R to look in the mlb11 data frame for the runs and new_slug variables.
                                        #The output of lm is an object that contains all of the information we need about the linear model that was just fit. We can access this 
                                              #information using the summary function.
summary(m1)
```
```{r}
#3.  Runs and new_obs
plot_ss(x = mlb11$new_obs, y = mlb11$runs)
```
```{r}
##3.  Runs and new_obs
m1 <- lm(runs ~ new_obs, data = mlb11)  #lm takes a formula of the form y ~ x as its first argument. Here it reads: make a linear model of runs as a function of new_obs.
                                        #The second argument tells R to look in the mlb11 data frame for the runs and new_obs variables.
                                        #The output of lm is an object that contains all of the information we need about the linear model that was just fit. We can access this 
                                              #information using the summary function.
summary(m1)
```

### ON YOUR OWN 5
Check the model diagnostics for the regression model with the variable you decided was the best predictor for runs.

***Answer***

As noted in On Your Own 4, new_obs is the best variable for predicting runs. Its R-squared value of 0.935 (rounded to three decimal places) is higher than that of every other model, and its sum of squared residuals, 12837.66, is lower than that of every other model.

new_obs: R-squared = 0.9349; sum of squared residuals = 12837.66



```{r}

new3 <- lm(runs ~ new_obs, data = mlb11)
par(mfrow = c(1, 1))
plot(mlb11$new_obs, mlb11$runs, xlab = "New Obs", ylab = "Runs", main = "New Obs Vs Runs")
abline(new3)
```



```{r}
#1.  Runs and new_obs
plot_ss(x = mlb11$new_obs, y = mlb11$runs)
```
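
The plots above show how well the line fits, but the model diagnostics themselves (linearity, nearly normal residuals, and constant variability) are judged from the residuals. A minimal sketch of the three usual residual plots follows. `check_diagnostics` is a hypothetical helper, not part of the lab or of any package, and the chunk assumes `mlb11` has already been loaded.

```{r}
# Sketch only: `check_diagnostics` is a hypothetical helper name.
# It draws three residual plots side by side for a fitted lm object
# and returns the residuals invisibly.
check_diagnostics <- function(fit) {
  res <- residuals(fit)
  old_par <- par(mfrow = c(1, 3))
  on.exit(par(old_par))          # restore the previous plot layout

  # Linearity and constant variability: look for no curve and no fan shape
  plot(fitted(fit), res,
       xlab = "Fitted values", ylab = "Residuals",
       main = "Residuals vs Fitted")
  abline(h = 0, lty = 3)

  # Nearly normal residuals: roughly symmetric, single-peaked histogram
  hist(res, xlab = "Residuals", main = "Histogram of Residuals")

  # Nearly normal residuals: points close to the reference line
  qqnorm(res)
  qqline(res)

  invisible(res)
}

# Assumes mlb11 is already loaded; skipped otherwise.
if (exists("mlb11")) {
  check_diagnostics(lm(runs ~ new_obs, data = mlb11))
}
```

With only 30 teams, small departures in the histogram or the normal probability plot are expected, so the thing to look for is a clear pattern rather than a perfect shape.
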

