Introduction to linear regression

load("more/mlb11.RData")

What type of plot would you use to display the relationship between runs and one of the other numerical variables? Plot this relationship using the variable at_bats as the predictor. Does the relationship look linear? If you knew a team’s at_bats, would you be comfortable using a linear model to predict the number of runs?

A scatterplot. Based on the plot below, the relationship looks linear, so I would be comfortable using a linear model to predict the number of runs.

plot(x = mlb11$at_bats, y = mlb11$runs)

If the relationship looks linear, we can quantify the strength of the relationship with the correlation coefficient.

cor(mlb11$runs, mlb11$at_bats)

## [1] 0.610627
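
As a small worked sketch of how this correlation relates to the R-squared that `summary()` reports: for a least squares line with a single predictor, R-squared equals the squared correlation. The variable names below are illustrative, and the value of `r` is copied from the output above.

```r
# Correlation between runs and at_bats, copied from the cor() output above.
r <- 0.610627

# For a one-predictor least squares fit, R-squared is the squared correlation.
r_squared <- r^2
round(r_squared, 4)
```

Under that identity, at_bats would explain roughly 37% of the variability in runs.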

Looking at your plot from the previous exercise, describe the relationship between these two variables. Make sure to discuss the form, direction, and strength of the relationship as well as any unusual observations.

There is a moderately strong, positive linear relationship between the two variables (r ≈ 0.61): as at_bats increases, the number of runs scored also tends to increase. There are a few unusual observations; for example, the team with 5510 at-bats scored 860 runs. I have highlighted in the picture a few points that look like outliers.

Using plot_ss, choose a line that does a good job of minimizing the sum of squares. Run the function several times. What was the smallest sum of squares that you got? How does it compare to your neighbors?

I ran the function many times. The smallest sum of squares I got was 28800. As seen in the picture I attached, the highlighted points affect the value calculated.

m1 <- lm(runs ~ at_bats, data = mlb11)
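
The idea behind choosing a line by eye and then calling `lm()` can be sketched on a tiny data set: the least squares line should never have a larger sum of squared residuals than a hand-picked line. The data and the helper `sum_sq()` below are invented for illustration only; they are not part of the lab or of mlb11.

```r
# Toy data, invented purely for illustration (not from mlb11).
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Sum of squared residuals for a line with the given intercept and slope.
sum_sq <- function(intercept, slope) {
  sum((y - (intercept + slope * x))^2)
}

# Least squares fit, and its sum of squares computed with the helper.
fit <- lm(y ~ x)
ss_least_squares <- sum_sq(coef(fit)[1], coef(fit)[2])

# A plausible line chosen by eye, as one might do by clicking in plot_ss.
ss_hand_picked <- sum_sq(0, 2.1)

# The least squares line cannot do worse than any other line.
ss_least_squares <= ss_hand_picked
```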

Fit a new model that uses homeruns to predict runs. Using the estimates from the R output, write the equation of the regression line. What does the slope tell us in the context of the relationship between success of a team and its home runs?

plot_ss(x = mlb11$homeruns, y = mlb11$runs)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##     415.239        1.835  
## 
## Sum of Squares:  73671.99

hr_run <- lm(runs ~ homeruns, data = mlb11)
summary(hr_run)

## 
## Call:
## lm(formula = runs ~ homeruns, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -91.615 -33.410   3.231  24.292 104.631 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 415.2389    41.6779   9.963 1.04e-10 ***
## homeruns      1.8345     0.2677   6.854 1.90e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 51.29 on 28 degrees of freedom
## Multiple R-squared:  0.6266, Adjusted R-squared:  0.6132 
## F-statistic: 46.98 on 1 and 28 DF,  p-value: 1.9e-07

\[ \widehat{runs} = 415.2389 + 1.8345 \times homeruns \] The relationship between homeruns and runs is a positive linear one. On average, each additional home run is associated with about 1.83 more runs scored.
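
To make the slope interpretation concrete, here is a minimal sketch that evaluates the fitted line by hand. The coefficients are copied from the `summary(hr_run)` output above; the function name `predict_runs()` and the home run totals 150 and 151 are hypothetical, chosen only for illustration.

```r
# Coefficients copied from the summary(hr_run) output above.
intercept <- 415.2389
slope <- 1.8345

# Predicted runs for a given number of home runs.
predict_runs <- function(homeruns) intercept + slope * homeruns

# 150 and 151 are hypothetical home run totals, used only as an example.
predict_runs(150)

# One extra home run changes the prediction by exactly the slope.
predict_runs(151) - predict_runs(150)
```

With the fitted model object available, `predict(hr_run, newdata = data.frame(homeruns = 150))` should give the same prediction up to rounding of the coefficients.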

If a team manager saw the least squares regression line and not the actual data, how many runs would he or she predict for a team with 5,578 at-bats? Is this an overestimate or an underestimate, and by how much? In other words, what is the residual for this prediction?

I would say the manager expects a little over 700 runs. The predicted value is -2789.2429 + 0.6305 × 5578 ≈ 727.69. The residual for this point is negative: the observed value lies below the regression line, so the prediction is an overestimate.
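
The arithmetic behind the prediction and the sign of the residual can be sketched as follows. The coefficients are the ones quoted in the answer above. The observed value is a placeholder invented for illustration, not a number read from mlb11; the real value would have to be looked up in the data.

```r
# Coefficients of the runs ~ at_bats line, as quoted in the answer above.
intercept <- -2789.2429
slope <- 0.6305

# Prediction for a team with 5,578 at-bats.
at_bats <- 5578
predicted_runs <- intercept + slope * at_bats
round(predicted_runs, 2)

# HYPOTHETICAL observed value, for illustration only (not taken from mlb11).
observed_runs <- 700

# Residual = observed - predicted; a negative residual means the line
# overestimates the observed value.
residual <- observed_runs - predicted_runs
residual < 0
```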

Is there any apparent pattern in the residuals plot? What does this indicate about the linearity of the relationship between runs and at-bats?

There is no apparent pattern in the residual plot. The points are spread evenly above and below zero without any visible structure, so the plot is consistent with a linear relationship between runs and at-bats.

Based on the histogram and the normal probability plot, does the nearly normal residuals condition appear to be met?

Yes. Based on the histogram and the normal Q-Q plot, the residuals appear nearly normal: the points in the Q-Q plot all lie close to the line.

Constant variability:

Based on the plot in (1), does the constant variability condition appear to be met?

Yes, except for one or two points. In the residual plot, the points are spread evenly above and below zero.

On Your Own

1 Choose another traditional variable from mlb11 that you think might be a good predictor of runs. Produce a scatterplot of the two variables and fit a linear model. At a glance, does there seem to be a linear relationship?

    I have selected hits as the variable. 
    The relationship between hits and runs looks linear, as seen in the plots below.

plot(x = mlb11$hits, y = mlb11$runs)

plot_ss(x = mlb11$hits, y = mlb11$runs)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##   -375.5600       0.7589  
## 
## Sum of Squares:  70638.75

onbase_run <- lm(runs ~ hits, data = mlb11)
summary(onbase_run)

## 
## Call:
## lm(formula = runs ~ hits, data = mlb11)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -103.718  -27.179   -5.233   19.322  140.693 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -375.5600   151.1806  -2.484   0.0192 *  
## hits           0.7589     0.1071   7.085 1.04e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 50.23 on 28 degrees of freedom
## Multiple R-squared:  0.6419, Adjusted R-squared:  0.6292 
## F-statistic:  50.2 on 1 and 28 DF,  p-value: 1.043e-07

2 How does this relationship compare to the relationship between runs and at_bats? Use the R\(^2\) values from the two model summaries to compare. Does your variable seem to predict runs better than at_bats? How can you tell?

  The relationship between hits and runs is stronger than the relationship between at_bats and runs. 
  Hits is better at predicting runs than at_bats: its R$^2$ is higher, so it explains more of the variability in runs.
  R$^2$ runs ~ hits = 0.6419
  R$^2$ runs ~ at_bats = 0.3729 (the square of the correlation 0.610627 computed earlier)

3 Now that you can summarize the linear relationship between two variables, investigate the relationships between runs and each of the other five traditional variables. Which variable best predicts runs? Support your conclusion using the graphical and numerical methods we’ve discussed (for the sake of conciseness, only include output for the best variable, not all five).

    Based on the R-squared values, bat_avg is the best predictor of runs, followed closely by hits. 
    
    at_bats multiple R-squared:  0.3729
    hits multiple R-squared:  0.6419
    homeruns multiple R-squared:  0.6266
    bat_avg multiple R-squared:  0.6561
    strikeouts multiple R-squared:  0.1694
    stolen_bases multiple R-squared:  0.002914
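
Rather than fitting each model by hand, the R-squared values can be collected in one pass. The sketch below uses a simulated data frame with made-up column names (`pred_a`, `pred_b`, `response`) so that it runs on its own; it is an assumption about how one might organize the comparison, not the code used to produce the values above.

```r
# Simulated stand-in data; the column names are hypothetical, not from mlb11.
set.seed(1)
n <- 30
toy <- data.frame(pred_a = rnorm(n), pred_b = rnorm(n))
toy$response <- 2 * toy$pred_a + rnorm(n)  # response depends on pred_a only

# Fit one single-predictor model per variable and keep its R-squared.
predictors <- c("pred_a", "pred_b")
r_squared <- sapply(predictors, function(v) {
  fit <- lm(reformulate(v, response = "response"), data = toy)
  summary(fit)$r.squared
})

# Largest R-squared first.
sort(r_squared, decreasing = TRUE)
```

For the lab data, the same pattern should work with `data = mlb11`, `response = "runs"`, and `predictors` set to the column names of interest.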

plot(x = mlb11$hits, y = mlb11$runs)

onbase_run <- lm(runs ~ hits, data = mlb11)
summary(onbase_run)

## 
## Call:
## lm(formula = runs ~ hits, data = mlb11)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -103.718  -27.179   -5.233   19.322  140.693 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -375.5600   151.1806  -2.484   0.0192 *  
## hits           0.7589     0.1071   7.085 1.04e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 50.23 on 28 degrees of freedom
## Multiple R-squared:  0.6419, Adjusted R-squared:  0.6292 
## F-statistic:  50.2 on 1 and 28 DF,  p-value: 1.043e-07

plot(onbase_run$residuals ~ mlb11$hits)
abline(h = 0, lty = 3)

4 Now examine the three newer variables. These are the statistics used by the author of Moneyball to predict a team’s success. In general, are they more or less effective at predicting runs than the old variables? Explain using appropriate graphical and numerical evidence. Of all ten variables we’ve analyzed, which seems to be the best predictor of runs? Using the limited (or not so limited) information you know about these baseball statistics, does your result make sense?

  new_obs is the best predictor of runs based on the R-squared values. This makes sense: a team with a high on-base percentage and good sluggers will score more runs. 
  new_onbase multiple R-squared:   0.8491
  new_slug multiple R-squared: 0.8969
  new_obs multiple R-squared:   0.9349

plot(x = mlb11$new_obs, y = mlb11$runs)

onbase_run <- lm(runs ~ new_obs, data = mlb11)
summary(onbase_run)

## 
## Call:
## lm(formula = runs ~ new_obs, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -43.456 -13.690   1.165  13.935  41.156 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -686.61      68.93  -9.962 1.05e-10 ***
## new_obs      1919.36      95.70  20.057  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.41 on 28 degrees of freedom
## Multiple R-squared:  0.9349, Adjusted R-squared:  0.9326 
## F-statistic: 402.3 on 1 and 28 DF,  p-value: < 2.2e-16

plot(onbase_run$residuals ~ mlb11$new_obs)
abline(h = 0, lty = 3)
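
Because new_obs is measured on a small scale, the slope of 1919.36 is easier to read per 0.01 change than per whole unit. A minimal sketch, using the coefficients from the summary output above; the function name and the new_obs value of 0.750 are hypothetical and used only for illustration.

```r
# Coefficients copied from the runs ~ new_obs summary output above.
intercept <- -686.61
slope <- 1919.36

# Predicted runs for a given new_obs value.
predict_runs_from_obs <- function(new_obs) intercept + slope * new_obs

# 0.750 is a hypothetical new_obs value, chosen only as an example.
predict_runs_from_obs(0.750)

# Change in predicted runs for a 0.01 increase in new_obs.
slope * 0.01
```

Read this way, a 0.01 increase in new_obs is associated with about 19 more predicted runs.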

5 Check the model diagnostics for the regression model with the variable you decided was the best predictor for runs.

plot(x = mlb11$hits, y = mlb11$runs)

plot_ss(x = mlb11$hits, y = mlb11$runs)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##   -375.5600       0.7589  
## 
## Sum of Squares:  70638.75

onbase_run <- lm(runs ~ hits, data = mlb11)
plot(onbase_run$residuals ~ mlb11$hits)
abline(h = 0, lty = 3)

hist(onbase_run$residuals)

qqnorm(onbase_run$residuals)
qqline(onbase_run$residuals)

    Constant variability: Based on the residuals plot, the points are evenly spread above and below zero without any apparent pattern.
    Normality: Based on the histogram and Q-Q plot, the residuals are nearly normally distributed. 
    Linearity: Based on the scatterplot, the relationship between the variables is a positive linear one. The residuals plot doesn’t show any pattern.

This is a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported license. This lab was adapted for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel from a lab written by the faculty and TAs of UCLA Statistics.


Joby John - DATA 607
