page 16

1. Take the Galton dataset and find the mean, standard deviation and correlation between the parental and child heights.

  child parent
1  61.7   70.5
2  61.7   68.5
3  61.7   65.5
4  61.7   64.5
5  61.7   64.0
6  62.2   67.5
     child           parent     
 Min.   :61.70   Min.   :64.00  
 1st Qu.:66.20   1st Qu.:67.50  
 Median :68.20   Median :68.50  
 Mean   :68.09   Mean   :68.31  
 3rd Qu.:70.20   3rd Qu.:69.50  
 Max.   :73.70   Max.   :73.00  
Mean of Child heights: 68.08847 
Mean of Parental Heights: 68.30819 
Standard Deviation of Child Heights: 2.517941 
Standard Deviation of Parental Heights: 1.787333 
           child    parent
child  1.0000000 0.4587624
parent 0.4587624 1.0000000
Correlation between parental and child heights: 0.4587624 

2. Center the parent and child variables and verify that the centered variable means are 0.

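Centering subtracts each variable's mean from its values. The first ten values of each variable, before centering, are:

First ten parent heights (before centering): 70.5 68.5 65.5 64.5 64 67.5 67.5 67.5 66.5 66.5 
First ten child heights (before centering): 61.7 61.7 61.7 61.7 61.7 62.2 62.2 62.2 62.2 62.2 

After subtracting the mean, the centered variable has a mean of zero up to floating-point error:

The centered variable means: 9.775954e-16 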

3. Rescale the parent and child variables and verify that the scaled variable standard deviations are 1.

To rescale the parent and child variables, we divide each one by its own standard deviation. Then we have

Standard deviation of a scaled parent variable: 1 
Standard deviation of a scaled child variable: 1 

4. Normalize the parental and child heights. Verify that the normalized variables have mean 0 and standard deviation 1 and take the correlation between them.

First ten parent heights (before normalizing): 70.5 68.5 65.5 64.5 64 67.5 67.5 67.5 66.5 66.5 
First ten child heights (before normalizing): 61.7 61.7 61.7 61.7 61.7 62.2 62.2 62.2 62.2 62.2 
The normalized parental height has a mean: 5.501733e-16 
The normalized parental height has a standard deviation: 1 
The normalized child height has a mean: 2.183943e-16 
The normalized child height has a standard deviation: 1 

Notice that the means of xn and yn (the normalized parent and child heights) are 0 up to machine precision, on the order of 1e-16, and the standard deviation of both is 1.

Correlation between normalized parental and child heights: 0.4587624 
Correlation between parental and child heights: 0.4587624 

The correlation between the normalized parental and child heights is about 0.459. This is identical to the correlation between the raw parental and child heights, because centering and scaling the variables has no impact on the correlation: the correlation is a unit-free quantity.
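
The centering, scaling and normalizing steps from problems 2 to 4 can be sketched as follows. This is a minimal sketch, not the code that produced the output above. To keep it runnable without the UsingR package, it uses the built-in mtcars data as a stand-in (x = hp, y = mpg), which is an assumption of this example; for the exercise itself, x and y would be the parent and child columns of the Galton data.

```r
# Stand-in data (assumption): the exercise uses the Galton parent and child heights.
x <- mtcars$hp
y <- mtcars$mpg

# Centering: subtract the mean, so the centered mean is 0 up to rounding error.
xc <- x - mean(x)
yc <- y - mean(y)

# Scaling: divide by the standard deviation, so the scaled sd is 1.
xs <- x / sd(x)
ys <- y / sd(y)

# Normalizing: center first, then scale.
xn <- (x - mean(x)) / sd(x)
yn <- (y - mean(y)) / sd(y)

cat("Centered means:", mean(xc), mean(yc), "\n")
cat("Scaled sds:", sd(xs), sd(ys), "\n")
cat("Normalized means:", mean(xn), mean(yn), "\n")
cat("Normalized sds:", sd(xn), sd(yn), "\n")

# cor(x, y) returns a single number; cor() on a two-column data frame
# returns a 2 x 2 matrix, which cat() would print as four numbers.
cat("Correlation (raw):", cor(x, y), "\n")
cat("Correlation (normalized):", cor(xn, yn), "\n")
```

The two correlations agree because normalizing only shifts and rescales each variable.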

page 21

1. Install and load the package UsingR and load the father.son data with data(father.son). Get the linear regression fit where the son’s height is the outcome and the father’s height is the predictor. Give the intercept and the slope, plot the data and overlay the fitted regression line.

First, we define y as the son’s height and x as the father’s height, then we get the linear regression.

     (Intercept)  fheight
[1,]     33.8866 0.514093
[2,]     33.8866 0.514093
Intercept: 33.8866 
Slope: 0.514093 

The figure above shows the father’s and son’s height data, and the red line indicates the fitted regression line.

2. Center the father and son variables and refit the model omitting the intercept. Verify that the slope estimate is the same as the linear regression fit from problem 1.

             Estimate Std. Error  t value     Pr(>|t|)
(Intercept) 33.886604 1.83235382 18.49348 1.604044e-66
fheight      0.514093 0.02704874 19.00618 1.121268e-69

We see that the father’s height coefficient from the linear regression model is 0.514. Now, let’s refit the model with the centered variables, where xc = fheight - mean(fheight) is the centered father’s height and yc = sheight - mean(sheight) is the centered son’s height. Then we get

Slope from the centered variables: 0.514093 

Alternatively, we can fit a linear model with yc as the outcome and xc as the predictor, omitting the intercept:

      xc 
0.514093 

We see that we get 0.514 exactly as before.
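
The equality of the two slopes can be checked with a short sketch. It uses the built-in mtcars data as a stand-in (an assumption made so the block runs without UsingR); the exercise itself uses fheight and sheight from father.son. The closed-form expression sum(xc * yc) / sum(xc^2) is the standard least-squares slope for regression through the origin.

```r
# Stand-in data (assumption): the exercise uses father.son fheight and sheight.
x <- mtcars$hp
y <- mtcars$mpg

# Slope from the ordinary fit that includes an intercept.
slope_full <- unname(coef(lm(y ~ x))[2])

# Center both variables, then regress through the origin.
xc <- x - mean(x)
yc <- y - mean(y)
slope_formula <- sum(xc * yc) / sum(xc^2)        # closed-form slope through the origin
slope_lm <- unname(coef(lm(yc ~ xc - 1))[1])     # same fit via lm, intercept omitted

cat("Slope with intercept:", slope_full, "\n")
cat("Slope, centered, closed form:", slope_formula, "\n")
cat("Slope, centered, lm without intercept:", slope_lm, "\n")
```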

3. Normalize the father and son data and see that the fitted slope is the correlation.

First, we define xn as the normalized father’s height and yn as the normalized son’s height. Then, we have

Fitted Slope: 0.5013383 

The result above shows that the fitted slope for the normalized father’s and son’s heights is about 0.5013. If we take the correlation between xn and yn, we get

Correlation between Normalized Variables: 0.5013383 

Note that if we first normalize both the father’s and the son’s height variables, the fitted slope coefficient is exactly the correlation.
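
A minimal sketch of this check follows, again on the built-in mtcars data as a stand-in (an assumption; the exercise uses father.son). It also shows the equivalent route of rescaling the raw slope by sd(x) / sd(y).

```r
# Stand-in data (assumption): the exercise uses father.son fheight and sheight.
x <- mtcars$hp
y <- mtcars$mpg

# Normalize both variables: mean 0, standard deviation 1.
xn <- (x - mean(x)) / sd(x)
yn <- (y - mean(y)) / sd(y)

# Slope of the regression of yn on xn.
slope_n <- unname(coef(lm(yn ~ xn))[2])

# Equivalent route: the raw slope is cor * sd(y) / sd(x), so undo the sd ratio.
slope_raw <- unname(coef(lm(y ~ x))[2])
slope_rescaled <- slope_raw * sd(x) / sd(y)

cat("Fitted slope (normalized data):", slope_n, "\n")
cat("Raw slope rescaled by sd(x)/sd(y):", slope_rescaled, "\n")
cat("Correlation:", cor(x, y), "\n")
```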

4. Go back to the linear regression line from Problem 1. If a father’s height was 63 inches, what would you predict the son’s height to be?

To predict from a linear model, we can use the predict function, which gives

       1 
66.27447 

Alternatively, we can take the two fitted coefficients and compute the predicted son’s height directly as intercept + slope * father’s height. We get the same answer:

(Intercept)     fheight 
  33.886604    0.514093 
Predicted son's height: 66.27447 
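
The second route can be reproduced from the printed coefficients alone. The sketch below hard-codes the two coefficient values shown above rather than refitting the model, so it does not need the father.son data; the variable names are illustrative.

```r
# Coefficients copied from the fitted model output above.
b0 <- 33.886604   # intercept
b1 <- 0.514093    # slope for fheight

father <- 63      # father's height in inches
pred <- b0 + b1 * father

cat("Predicted son's height:", pred, "\n")
```

Up to rounding of the printed coefficients, this matches the 66.27447 reported above.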

5. Consider a data set where the standard deviation of the outcome variable is double that of the predictor. Also, the variables have a correlation of 0.3. If you fit a linear regression model, what would be the estimate of the slope?

Let y be the outcome and x the predictor. Then sd(y) = 2 sd(x), so the ratio of standard deviations is sd(y) / sd(x) = 2, and the correlation is cor(y, x) = 0.3. Recall that the slope estimate is cor(y, x) * sd(y) / sd(x) = 0.3(2) = 0.6.

Slope Estimate: 0.6 

6. Consider the previous problem. The outcome variable has a mean of 1 and the predictor has a mean of 0.5. What would be the intercept?

The formula for the intercept is β0 = mean(y) - β1*mean(x). With mean(y) = 1, mean(x) = 0.5 and β1 = 0.6 from problem 5, we get β0 = 1 - 0.6(0.5) = 1 - 0.3 = 0.7.

Intercept: 0.7 
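
The arithmetic in problems 5 and 6 can be written as a small sketch that works from the summary statistics alone. The variable names are illustrative.

```r
# Summary statistics given in problems 5 and 6.
r <- 0.3          # correlation between outcome and predictor
sd_ratio <- 2     # sd(y) / sd(x)
mean_y <- 1       # mean of the outcome
mean_x <- 0.5     # mean of the predictor

# Slope = cor(y, x) * sd(y) / sd(x); intercept = mean(y) - slope * mean(x).
slope <- r * sd_ratio
intercept <- mean_y - slope * mean_x

cat("Slope Estimate:", slope, "\n")
cat("Intercept:", intercept, "\n")
```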

7. True or false, if the predictor variable has mean 0, the estimated intercept from linear regression will be the mean of the outcome?

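True. The estimated intercept is β₀^ = mean(y) - β₁^ * mean(x), where y is the outcome and x is the predictor. If the predictor has a mean of 0, the second term vanishes and β₀^ = mean(y). In the father.son setting, with β₀^ = mean(sheight) - β₁^ * mean(fheight), a father’s height variable with mean 0 would give an intercept equal to mean(sheight), the mean of the son’s height.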

8. Consider problem 5 again. What would be the estimated slope if the predictor and outcome were reversed?

The slope estimate for y regressed on x is cor(y, x) * sd(y) / sd(x). If we reverse the roles of the regressor and the outcome, the correlation does not change, but the ratio of standard deviations becomes sd(x) / sd(y) = 1/2. The slope is then 0.3(1/2) = 0.15:

Slope Estimate: 0.15 
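
As a quick check of the reversed slope, the sketch below computes both slopes from the given correlation and standard deviation ratio. It also shows that the product of the two slopes equals the squared correlation, which follows directly from the two formulas. The variable names are illustrative.

```r
r <- 0.3              # correlation, unchanged when the roles are swapped
sd_y_over_sd_x <- 2   # sd(y) / sd(x)

slope_y_on_x <- r * sd_y_over_sd_x   # outcome y, predictor x
slope_x_on_y <- r / sd_y_over_sd_x   # outcome x, predictor y

cat("Slope of y on x:", slope_y_on_x, "\n")
cat("Slope of x on y:", slope_x_on_y, "\n")
cat("Product of slopes:", slope_y_on_x * slope_x_on_y, " r^2:", r^2, "\n")
```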

page 32

1. Fit a linear regression model to the father.son dataset with the father as the predictor and the son as the outcome. Give a p-value for the slope coefficient and perform the relevant hypothesis test.

             Estimate Std. Error  t value     Pr(>|t|)
(Intercept) 33.886604 1.83235382 18.49348 1.604044e-66
fheight      0.514093 0.02704874 19.00618 1.121268e-69
P-value for the Slope Coefficient (Father's Height): 1.121268e-69 
Reject the null hypothesis: There is a significant relationship between father's height and son's height.

2. Refer to question 1. Interpret both parameters. Recenter for the intercept if necessary.

From the previous problem, the slope estimate is 0.514093. This implies that for every 1-inch increase in the father’s height, we estimate a 0.514-inch increase in the son’s height. The intercept is estimated at about 34 inches: this is the estimated son’s height for a father’s height of 0 inches, which is not a meaningful value. Now, let’s recenter the father’s height. The coefficient table from problem 1 is shown again for comparison, followed by the recentered intercept:

             Estimate Std. Error  t value     Pr(>|t|)
(Intercept) 33.886604 1.83235382 18.49348 1.604044e-66
fheight      0.514093 0.02704874 19.00618 1.121268e-69
Centered Intercept (Estimated Son's Height when Father's Height is at its Mean): 68.68407 

Notice that the slope does not change, because centering (shifting) the regressor has no impact on the slope. The recentered intercept of about 68.7 inches is the estimated son’s height at the average father’s height.

3. Refer to question 1. Predict the son’s height if the father’s height is 80 inches. Would you recommend this prediction? Why or why not?

Predicted Son's Height when Father's Height is 80 inches: 75.01405 inches
    fheight         sheight     
 Min.   :59.01   Min.   :58.51  
 1st Qu.:65.79   1st Qu.:66.93  
 Median :67.77   Median :68.62  
 Mean   :67.69   Mean   :68.68  
 3rd Qu.:69.60   3rd Qu.:70.47  
 Max.   :75.43   Max.   :78.36  

We see that the maximum father’s height in the data is about 75.4 inches, so a father’s height of 80 inches lies beyond the data we have actually observed: we are predicting for a much taller person than we typically see. If we had to give a prediction, this would still be our best one, but we must be fully aware that it is an extrapolation beyond the observed data.
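
One way to make the extrapolation explicit is to flag any input outside the observed range before trusting the prediction. The sketch below hard-codes the coefficients and the fheight range printed above; `predict_son` is a hypothetical helper introduced only for illustration.

```r
# Coefficients and observed range copied from the output above.
b0 <- 33.886604               # intercept
b1 <- 0.514093                # slope for fheight
f_range <- c(59.01, 75.43)    # min and max of fheight from the summary

# Hypothetical helper: returns the prediction and whether the input
# lies outside the observed range of father's heights.
predict_son <- function(father) {
  list(
    prediction = b0 + b1 * father,
    extrapolated = father < f_range[1] | father > f_range[2]
  )
}

res <- predict_son(80)
cat("Predicted son's height:", res$prediction, "\n")
cat("Outside observed father's heights:", res$extrapolated, "\n")
```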

4. Load the mtcars dataset. Fit a linear regression with miles per gallon as the outcome and horsepower as the predictor. Interpret your coefficients, recenter for the intercept if necessary.

                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
                    Estimate Std. Error   t value     Pr(>|t|)
(Intercept)      20.09062500  0.6828817 29.420360 1.101810e-23
I(hp - mean(hp)) -0.06822828  0.0101193 -6.742389 1.787835e-07
Intercept (β₀): 20.09062 
Slope (β₁) for Horsepower (hp): -0.06822828 

Based on the output, the slope for horsepower is about -0.07, so each 1-unit increase in horsepower is associated with a decrease of about 0.07 miles per gallon. The relationship is highly statistically significant: if we test the hypothesis that the slope coefficient is 0, we reject the null hypothesis. Because we recentered horsepower, the intercept of about 20 is interpreted as the estimated miles per gallon for a car with average horsepower.
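
The effect of recentering can be seen by fitting both versions of the model. The sketch below uses the built-in mtcars data; the object names `fit` and `fit_c` are illustrative.

```r
# Uncentered fit: the intercept is the estimated mpg at hp = 0.
fit <- lm(mpg ~ hp, data = mtcars)

# Centered fit: the intercept is the estimated mpg at the mean horsepower.
fit_c <- lm(mpg ~ I(hp - mean(hp)), data = mtcars)

print(summary(fit_c)$coefficients)

cat("Uncentered intercept:", coef(fit)[1], "\n")
cat("Centered intercept:", coef(fit_c)[1], "\n")
cat("Mean of mpg:", mean(mtcars$mpg), "\n")
cat("Slopes (uncentered, centered):", coef(fit)[2], coef(fit_c)[2], "\n")
```

The centered intercept equals the mean of mpg, and the slope is the same in both fits.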

5. Refer to question 4. Overlay the fit onto a scatterplot.
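
A minimal sketch of one way to draw this overlay with base R graphics follows. The axis labels, point style and line color are choices made for this example.

```r
# Fit the model from question 4.
fit <- lm(mpg ~ hp, data = mtcars)

# Scatterplot of the data with horsepower on the horizontal axis.
plot(mtcars$hp, mtcars$mpg,
     xlab = "Horsepower", ylab = "Miles per gallon", pch = 19)

# Overlay the fitted regression line.
abline(fit, col = "red", lwd = 2)
```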

6. Refer to question 4. Test the hypothesis of no linear relationship between horsepower and miles per gallon.

               Estimate Std. Error   t value     Pr(>|t|)
(Intercept) 30.09886054  1.6339210 18.421246 6.642736e-18
hp          -0.06822828  0.0101193 -6.742389 1.787835e-07
P-value: 1.787835e-07 
Conclusion: Reject the null hypothesis (H0). There is a significant linear relationship between horsepower and miles per gallon.

We see that the p-value for horsepower is 1.787835e-07. Comparing that to our α = 0.05 level of significance, we reject the null hypothesis that the coefficient of horsepower is 0 in favor of the alternative that it is nonzero. In this case, there is strong evidence that it is nonzero.
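
The p-value in the table can be reproduced by hand from the estimate and its standard error. The sketch below computes the t statistic and the two-sided p-value on n - 2 degrees of freedom and compares them with the lm output; the variable names are illustrative.

```r
fit <- lm(mpg ~ hp, data = mtcars)
coefs <- summary(fit)$coefficients

est <- coefs["hp", "Estimate"]
se <- coefs["hp", "Std. Error"]

# t statistic for H0: slope = 0, and its degrees of freedom (n - 2 = 30).
t_stat <- est / se
dof <- fit$df.residual

# Two-sided p-value from the t distribution.
p_manual <- 2 * pt(-abs(t_stat), dof)
p_lm <- coefs["hp", "Pr(>|t|)"]

alpha <- 0.05
cat("t statistic:", t_stat, "\n")
cat("P-value (by hand):", p_manual, "\n")
cat("P-value (from lm):", p_lm, "\n")
cat("Reject H0 at alpha = 0.05:", p_lm < alpha, "\n")
```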

7. Refer to question 4. Predict the miles per gallon for a horsepower of 111.

Predicted Miles per Gallon (mpg) for Horsepower 111: 22.52552 

Thus, the model predicts that at a horsepower of 111, the miles per gallon is about 22.5.
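
This prediction can be obtained with predict or directly from the coefficients, as in the following sketch on the built-in mtcars data. The object names are illustrative.

```r
fit <- lm(mpg ~ hp, data = mtcars)

# Route 1: predict() with a new data frame whose column name matches the predictor.
pred <- predict(fit, newdata = data.frame(hp = 111))

# Route 2: intercept + slope * 111, using the fitted coefficients.
manual <- unname(coef(fit)[1] + coef(fit)[2] * 111)

cat("Predicted mpg (predict):", pred, "\n")
cat("Predicted mpg (coefficients):", manual, "\n")
```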

page 45

1. Fit a linear regression model to the father.son dataset with the father as the predictor and the son as the outcome. Plot the son’s height (horizontal axis) versus the residuals (vertical axis).

2. Refer to question 1. Directly estimate the residual variance and compare this estimate to the output of lm.

Estimated Residual Variance: 5.931292 
Residual Variance from lm: 5.936804 

3. Refer to question 1. Give the R squared for this model.

R-squared (R²) Value: 0.2513401 

4. Load the mtcars dataset. Fit a linear regression with miles per gallon as the outcome and horsepower as the predictor. Plot horsepower versus the residuals.
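
A minimal sketch of this residual plot with base R graphics follows. The labels and the reference line at zero are choices made for this example.

```r
fit <- lm(mpg ~ hp, data = mtcars)
e <- resid(fit)

# Horsepower on the horizontal axis, residuals on the vertical axis.
plot(mtcars$hp, e, xlab = "Horsepower", ylab = "Residuals", pch = 19)

# Reference line at zero: least-squares residuals sum to zero.
abline(h = 0, col = "red", lwd = 2)
```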

5. Refer to question 4. Directly estimate the residual variance and compare this estimate to the output of lm.

Estimated Residual Variance: 14.44111 
Residual Variance from lm: 14.92248 
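
The two numbers above differ by a factor of 31/30, which is what we would expect if the direct estimate divides the residual sum of squares by n - 1 (as var does) while lm divides by n - 2. That the direct estimate was computed this way is an assumption, since the original code is not shown. The sketch below computes all three quantities on mtcars.

```r
fit <- lm(mpg ~ hp, data = mtcars)
e <- resid(fit)
n <- length(e)

# var() divides by n - 1; the residuals have mean 0, so this is sum(e^2) / (n - 1).
var_n1 <- var(e)

# Dividing by n - 2 accounts for the two estimated coefficients.
var_n2 <- sum(e^2) / (n - 2)

# Residual variance reported by lm (squared residual standard error).
var_lm <- summary(fit)$sigma^2

cat("var(residuals), denominator n - 1:", var_n1, "\n")
cat("sum(e^2) / (n - 2):", var_n2, "\n")
cat("Residual variance from lm:", var_lm, "\n")
```

Dividing by n - 2 reproduces the lm value exactly. The father.son figures in question 2 show the same kind of small gap.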

6. Refer to question 4. Give the R squared for this model.

R-squared (R²) Value: 0.6024373
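
R squared can be computed directly as one minus the ratio of the residual sum of squares to the total sum of squares. In simple linear regression it also equals the squared correlation between predictor and outcome. The sketch below checks both against the lm output on mtcars; the variable names are illustrative.

```r
fit <- lm(mpg ~ hp, data = mtcars)
y <- mtcars$mpg

sse <- sum(resid(fit)^2)        # residual sum of squares
sst <- sum((y - mean(y))^2)     # total sum of squares

r2_direct <- 1 - sse / sst
r2_cor <- cor(mtcars$hp, y)^2   # squared correlation
r2_lm <- summary(fit)$r.squared

cat("R-squared (direct):", r2_direct, "\n")
cat("R-squared (squared correlation):", r2_cor, "\n")
cat("R-squared (from lm):", r2_lm, "\n")
```

The same identity links the father.son figures reported earlier: the correlation of 0.5013383 squared is about 0.2513, the R squared given in question 3.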