Introduction to linear regression

In this lab we’ll be looking at data from all 30 Major League Baseball teams and examining the linear relationship between runs scored in a season and a number of other player statistics. Our aim will be to summarize these relationships both graphically and numerically in order to find which variable, if any, helps us best predict a team’s runs scored in a season.

The data (2011 season)

download.file("http://www.openintro.org/stat/data/mlb11.RData", destfile = "mlb11.RData")

load("mlb11.RData")

The variables consist of:

There are seven traditionally used variables in the data set: at-bats, hits, home runs, batting average, strikeouts, stolen bases, and wins. There are also three newer variables: on-base percentage, slugging percentage, and on-base plus slugging. For the first portion of the analysis we’ll consider the seven traditional variables.

Exercise 1

Q. What type of plot would you use to display the relationship between runs and one of the other numerical variables?

A. A scatterplot is the best choice for displaying the relationship between two numerical variables.

Q. Plot this relationship using the variable at_bats as the predictor. Does the relationship look linear? If you knew a team’s at_bats, would you be comfortable using a linear model to predict the number of runs?

A. Runs tend to increase as at_bats increases. The points are fairly scattered, so it is hard to be sure from the plot alone, but the relationship looks roughly linear with moderate strength.

cor(mlb11$runs, mlb11$at_bats)

## [1] 0.610627

plot(mlb11$runs ~ mlb11$at_bats, main = "Runs and at_bats", ylab = "Runs", xlab = "at_bats")

Sum of squared residuals

Note:

In describing a distribution of a single variable we must consider characteristics such as center, spread, and shape.

Exercise 2

Looking at your plot from the previous exercise, describe the relationship between these two variables. Make sure to discuss the form, direction, and strength of the relationship as well as any unusual observations.

Just as we used the mean and standard deviation to summarize a single variable, we can summarize the relationship between these two variables by finding the line that best follows their association.

Q. Use the following interactive function to select the line that you think does the best job of going through the cloud of points.

plot_ss(x = mlb11$at_bats, y = mlb11$runs)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##  -2789.2429       0.6305  
## 
## Sum of Squares:  123721.9

The most common way to do linear regression is to select the line that minimizes the sum of squared residuals. To visualize the squared residuals, you can rerun the plot command and add the argument showSquares = TRUE.

plot_ss(x = mlb11$at_bats, y = mlb11$runs, showSquares = TRUE)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##  -2789.2429       0.6305  
## 
## Sum of Squares:  123721.9
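To make the idea of a sum of squared residuals concrete, here is a minimal sketch on a small made-up data set (the x and y values and the helper ssr are hypothetical, not part of mlb11 or of plot_ss). It compares a hand-picked line with the least squares line returned by lm.

```r
# Hypothetical toy data, used only to illustrate the idea (not from mlb11)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Sum of squared residuals for a line with the given intercept and slope
ssr <- function(intercept, slope, x, y) {
  sum((y - (intercept + slope * x))^2)
}

# Least squares line fitted by lm
fit <- lm(y ~ x)
b <- unname(coef(fit))

ssr_ls <- ssr(b[1], b[2], x, y)   # sum of squares for the least squares line
ssr_guess <- ssr(0, 2.1, x, y)    # sum of squares for a hand-picked line

# The least squares line never has a larger sum of squares than any other line
ssr_ls <= ssr_guess
```

No hand-picked line can beat the least squares line on this criterion, which is why repeated attempts with plot_ss can approach but not go below the minimum.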

Exercise 3

Q. Using plot_ss, choose a line that does a good job of minimizing the sum of squares. Run the function several times. What was the smallest sum of squares that you got? How does it compare to your neighbors?

A. The code above was run a total of 8 times; the smallest sum of squares it generated was 123,721.9. Neighbors?

The linear model

We can use the lm function in R to fit the linear model (a.k.a. regression line).

m1 <- lm(runs ~ at_bats, data = mlb11)
m1

## 
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
## 
## Coefficients:
## (Intercept)      at_bats  
##  -2789.2429       0.6305

# The output of lm is an object that contains all of the information we need about the linear model

summary(m1)

## 
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -125.58  -47.05  -16.59   54.40  176.87 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2789.2429   853.6957  -3.267 0.002871 ** 
## at_bats         0.6305     0.1545   4.080 0.000339 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 66.47 on 28 degrees of freedom
## Multiple R-squared:  0.3729, Adjusted R-squared:  0.3505 
## F-statistic: 16.65 on 1 and 28 DF,  p-value: 0.0003388

Formula for the linear model (m1):

ŷ = −2789.2429 + 0.6305 × at_bats

The output also reports the R-squared (R^2) value, which represents the proportion of variability in the response variable that is explained by the explanatory variable.

Proportion of variability (m1):

R-squared = 37.3% of the variability in runs is explained by at_bats.
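In a regression with a single predictor, R-squared equals the square of the correlation coefficient. The short check below uses only the correlation reported earlier for runs and at_bats; the variable names r and r_squared are illustrative.

```r
# Correlation between runs and at_bats, as reported by cor() earlier
r <- 0.610627

# In simple linear regression, R-squared is the squared correlation
r_squared <- r^2
round(r_squared, 4)
# [1] 0.3729
```

This matches the Multiple R-squared value of 0.3729 in summary(m1).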

Exercise 4

Q. Fit a new model that uses homeruns to predict runs. Using the estimates from the R output, write the equation of the regression line. What does the slope tell us in the context of the relationship between success of a team and its home runs?

m2 <- lm(runs ~ homeruns, data = mlb11)
m2

## 
## Call:
## lm(formula = runs ~ homeruns, data = mlb11)
## 
## Coefficients:
## (Intercept)     homeruns  
##     415.239        1.835

summary(m2)

## 
## Call:
## lm(formula = runs ~ homeruns, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -91.615 -33.410   3.231  24.292 104.631 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 415.2389    41.6779   9.963 1.04e-10 ***
## homeruns      1.8345     0.2677   6.854 1.90e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 51.29 on 28 degrees of freedom
## Multiple R-squared:  0.6266, Adjusted R-squared:  0.6132 
## F-statistic: 46.98 on 1 and 28 DF,  p-value: 1.9e-07

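Formula for the linear model (m2):

ŷ = 415.239 + 1.835 × homeruns

The slope tells us that each additional home run is associated with about 1.83 more runs scored over the season, on average, so teams that hit more home runs tend to score more runs.

Proportion of variability (m2):

R-squared = 62.66% of the variability in runs is explained by homeruns.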

Prediction and prediction errors

# Scatterplot with the least squares line laid on top (runs and at_bats)

plot(mlb11$runs ~ mlb11$at_bats)
abline(m1)

Note:

This line can be used to predict y at any value of x. When predictions are made for values of x that are beyond the range of the observed data, it is referred to as extrapolation and is not usually recommended. However, predictions made within the range of the data are more reliable. They’re also used to compute the residuals.

Exercise 5

Q. If a team manager saw the least squares regression line and not the actual data, how many runs would he or she predict for a team with 5,578 at-bats?

atbat1<-5578
mr1<-(-2789.2429+0.6305* atbat1)
mr1

## [1] 727.6861

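Q. Is this an overestimate or an underestimate, and by how much? In other words, what is the residual for this prediction?

A. The closest observation is the Philadelphia Phillies, with 713 runs and 5,579 at-bats.

Compared with the observed 713 runs, the predicted value of about 727.7 is an overestimate by roughly 15 runs (residual ≈ 713 − 727.7 = −14.7).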
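The residual is the observed value minus the predicted value, so a negative residual means the model overestimated. The sketch below reproduces the arithmetic from the coefficients reported for m1; the variable names are illustrative, and the Phillies figures are the ones quoted above.

```r
# Coefficients of m1 as reported in the summary output
b0 <- -2789.2429
b1 <- 0.6305

predicted <- b0 + b1 * 5578   # predicted runs at 5,578 at-bats
observed <- 713               # Phillies' runs, as quoted above

# residual = observed - predicted; negative means an overestimate
residual <- observed - predicted
round(residual, 2)
# [1] -14.69

# With mlb11 loaded and m1 fitted, a similar prediction can be obtained with
# predict(m1, newdata = data.frame(at_bats = 5578))
```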

Model Diagnostics:

Note (to assess whether the linear model is reliable):

We need to check for:

linearity
nearly normal residuals
constant variability

Linearity

Linearity: You already checked if the relationship between runs and at-bats is linear using a scatterplot. We should also verify this condition with a plot of the residuals vs. at-bats. Recall that any code following a # is intended to be a comment that helps understand the code but is ignored by R.

plot(m1$residuals ~ mlb11$at_bats)
abline(h = 0, lty = 3) # adds a horizontal dashed line at y = 0

Exercise 6

Q. Is there any apparent pattern in the residuals plot? What does this indicate about the linearity of the relationship between runs and at-bats?

A. There is no apparent pattern around the dashed line, though the points are somewhat more concentrated on one side of the plot than the other. This suggests that the relationship between runs and at-bats is reasonably linear.

Nearly normal residuals:

Nearly normal residuals: To check this condition, we can look at a histogram.

hist(m1$residuals)

# normal probability plot of residuals

qqnorm(m1$residuals)
qqline(m1$residuals)

Exercise 7

Q. Based on the histogram and the normal probability plot, does the nearly normal residuals condition appear to be met?

A. Both the histogram and the normal probability plot suggest that the nearly normal residuals condition is roughly met.

Constant Variability:

Exercise 8

Q. Based on the plot in (1), does the constant variability condition appear to be met?
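As a rough numeric complement to the residual plot, one can compare the spread of the residuals in the lower and upper halves of the predictor. The sketch below uses simulated data (not mlb11), and the split at the median is an assumption made for illustration rather than a formal test.

```r
# Simulated data, purely illustrative (not mlb11)
set.seed(1)
x <- 1:30
y <- 5 + 2 * x + rnorm(30, sd = 3)
fit <- lm(y ~ x)

# Split the observations at the median of the predictor
lower_half <- x <= median(x)

# Standard deviation of the residuals in each half
sd_low <- sd(residuals(fit)[lower_half])
sd_high <- sd(residuals(fit)[!lower_half])

# A ratio far from 1 would hint at non-constant variability
spread_ratio <- sd_high / sd_low
spread_ratio
```

The same comparison could be applied to m1 by replacing x with mlb11$at_bats and fit with m1, though the residual plot remains the primary check.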

On your own.

Question 1:

Choose another traditional variable from mlb11 that you think might be a good predictor of runs. Produce a scatter plot of the two variables and fit a linear model. At a glance, does there seem to be a linear relationship?

New variable: batting average (bat_avg). There appears to be a linear relationship, with a strong positive correlation (r ≈ 0.81).

Formula: ŷ = −642.8 + 5242.2 × bat_avg

Scatter plot:

# quantify the strength of the relationship
cor(mlb11$runs, mlb11$bat_avg)

## [1] 0.8099859

plot_ss(x = mlb11$bat_avg, y = mlb11$runs)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##      -642.8       5242.2  
## 
## Sum of Squares:  67849.52

plot_ss(x = mlb11$bat_avg, y = mlb11$runs, showSquares = TRUE)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##      -642.8       5242.2  
## 
## Sum of Squares:  67849.52

Question 2:

How does this relationship compare to the relationship between runs and at_bats? Use the R2 values from the two model summaries to compare. Does your variable seem to predict runs better than at_bats? How can you tell?

m3<-lm(runs ~ bat_avg, data = mlb11)
summary(m3)

## 
## Call:
## lm(formula = runs ~ bat_avg, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -94.676 -26.303  -5.496  28.482 131.113 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -642.8      183.1  -3.511  0.00153 ** 
## bat_avg       5242.2      717.3   7.308 5.88e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 49.23 on 28 degrees of freedom
## Multiple R-squared:  0.6561, Adjusted R-squared:  0.6438 
## F-statistic: 53.41 on 1 and 28 DF,  p-value: 5.877e-08

Formula: ŷ = −642.8 + 5242.2 × bat_avg

R^2: For this model, 65.61% of the variability in runs is explained by bat_avg.

The at_bats model explains only about 37% of the variability in runs, substantially less than the roughly 66% explained by bat_avg. Batting average therefore predicts runs better than at_bats.

In fact, batting average is the best predictor of runs among the variables examined so far (at_bats, homeruns, and bat_avg).

Question 3:

Now that you can summarize the linear relationship between two variables, investigate the relationships between runs and each of the other five traditional variables. Which variable best predicts runs? Support your conclusion using the graphical and numerical methods we’ve discussed (for the sake of conciseness, only include output for the best variable, not all five).

Model Diagnostics:

Linearity:

plot(m3$residuals ~ mlb11$bat_avg)  # residuals of the bat_avg model (m3), not m1
abline(h = 0, lty = 3)  # adds a horizontal dashed line at y = 0

Nearly normal residuals/normal plot of the residuals.

hist(m3$residuals)

#normal plot

qqnorm(m3$residuals)
qqline(m3$residuals)  # adds diagonal line to the normal prob plot

A. Based on the histogram and the normal plot, the nearly normal residuals condition appears to be met. Further, the constant variability condition appears to be met.

Three newer variables:

Now examine the three newer variables. These are the statistics used by the author of Moneyball to predict a team's success.

1. new_slug
2. new_onbase
3. new_obs

Find the strength of the linear relationship.

new_slug

#new_slug
cor(mlb11$runs, mlb11$new_slug)

## [1] 0.9470324

m4<-lm(runs ~ new_slug, data = mlb11)
summary(m4)

## 
## Call:
## lm(formula = runs ~ new_slug, data = mlb11)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -45.41 -18.66  -0.91  16.29  52.29 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -375.80      68.71   -5.47 7.70e-06 ***
## new_slug     2681.33     171.83   15.61 2.42e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26.96 on 28 degrees of freedom
## Multiple R-squared:  0.8969, Adjusted R-squared:  0.8932 
## F-statistic: 243.5 on 1 and 28 DF,  p-value: 2.42e-15

new_onbase

#new_onbase
cor(mlb11$runs, mlb11$new_onbase)

## [1] 0.9214691

m5<-lm(runs ~ new_onbase , data = mlb11)
summary(m5)

## 
## Call:
## lm(formula = runs ~ new_onbase, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -58.270 -18.335   3.249  19.520  69.002 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1118.4      144.5  -7.741 1.97e-08 ***
## new_onbase    5654.3      450.5  12.552 5.12e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 32.61 on 28 degrees of freedom
## Multiple R-squared:  0.8491, Adjusted R-squared:  0.8437 
## F-statistic: 157.6 on 1 and 28 DF,  p-value: 5.116e-13

new_obs

#new_obs
cor(mlb11$runs, mlb11$new_obs)

## [1] 0.9669163

m6<-lm(runs ~ new_obs, data = mlb11)
summary(m6)

## 
## Call:
## lm(formula = runs ~ new_obs, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -43.456 -13.690   1.165  13.935  41.156 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -686.61      68.93  -9.962 1.05e-10 ***
## new_obs      1919.36      95.70  20.057  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.41 on 28 degrees of freedom
## Multiple R-squared:  0.9349, Adjusted R-squared:  0.9326 
## F-statistic: 402.3 on 1 and 28 DF,  p-value: < 2.2e-16

1. In general, are they more or less effective at predicting runs than the old variables?

Based on the R-squared values and the correlation coefficients, the newer variables are much better predictors of runs than the old variables. Their R-squared values are 0.8969 (new_slug), 0.8491 (new_onbase), and 0.9349 (new_obs), compared with 0.3729 (at_bats), 0.6266 (homeruns), and 0.6561 (bat_avg).

2. Explain using appropriate graphical and numerical evidence. Of all ten variables we’ve analyzed, which seems to be the best predictor of runs?

The variables that best predict runs are the three newer ones, and of those new_obs has the highest R-squared, which makes it the best predictor of runs among the variables examined in the data set.
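The comparison can be laid out side by side by collecting the R-squared values reported in the model summaries and ranking them. The vector below contains only values printed earlier in this lab; the names r_squared, ranked, and best are illustrative.

```r
# Multiple R-squared values copied from the summary() outputs above
r_squared <- c(
  at_bats    = 0.3729,
  homeruns   = 0.6266,
  bat_avg    = 0.6561,
  new_slug   = 0.8969,
  new_onbase = 0.8491,
  new_obs    = 0.9349
)

# Rank the predictors from most to least variability explained
ranked <- sort(r_squared, decreasing = TRUE)
ranked

# Name of the predictor with the highest R-squared
best <- names(ranked)[1]
best
# [1] "new_obs"
```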

3. Using the limited (or not so limited) information you know about these baseball statistics, does your result make sense?

Question 4:

Check the model diagnostics for the regression model with the variable you decided was the best predictor for runs.

#Linearity:
plot(m6$residuals ~ mlb11$new_obs)  # residuals of the new_obs model (m6), not m1
abline(h = 0, lty = 3)  # adds a horizontal dashed line at y = 0

hist(m6$residuals)

#normal plot

qqnorm(m6$residuals)
qqline(m6$residuals)  # adds diagonal line to the normal prob plot

Linear regression lab

M.Demelash

11/10/2021
