page 16
child parent
1 61.7 70.5
2 61.7 68.5
3 61.7 65.5
4 61.7 64.5
5 61.7 64.0
6 62.2 67.5
child parent
Min. :61.70 Min. :64.00
1st Qu.:66.20 1st Qu.:67.50
Median :68.20 Median :68.50
Mean :68.09 Mean :68.31
3rd Qu.:70.20 3rd Qu.:69.50
Max. :73.70 Max. :73.00
Mean of Child heights: 68.08847
Mean of Parental Heights: 68.30819
Standard Deviation of Child Heights: 2.517941
Standard Deviation of Parental Heights: 1.787333
child parent
child 1.0000000 0.4587624
parent 0.4587624 1.0000000
Correlation between parental and child heights: 0.4587624
Parent heights before centering (first ten): 70.5 68.5 65.5 64.5 64 67.5 67.5 67.5 66.5 66.5
Child heights before centering (first ten): 61.7 61.7 61.7 61.7 61.7 62.2 62.2 62.2 62.2 62.2
After subtracting its mean, the centered variable has a mean of essentially zero:
The centered variable means: 9.775954e-16
To rescale the parent and child variables, we divide each centered variable by its standard deviation:
Standard deviation of a scaled parent variable: 1
Standard deviation of a scaled child variable: 1
Parental heights before normalization (first ten): 70.5 68.5 65.5 64.5 64 67.5 67.5 67.5 66.5 66.5
Child heights before normalization (first ten): 61.7 61.7 61.7 61.7 61.7 62.2 62.2 62.2 62.2 62.2
The normalized parental height has a mean: 5.501733e-16
The normalized parental height has a standard deviation: 1
The normalized child height has a mean: 2.183943e-16
The normalized child height has a standard deviation: 1
Notice that the means of xn and yn are 0 to within machine precision, and the standard deviation of both is 1.
Correlation between normalized parental and child heights: 0.4587624
Correlation between parental and child heights: 0.4587624
Now, the correlation between the normalized parental and child heights is 0.459. Notice that this is identical to the correlation between the original parental and child heights, because centering and scaling the variables has no impact on the value of the correlation: the correlation is a unit-free quantity.
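The invariance can be checked with a short Python sketch. This is not the R code that produced the output above; the height values are made up for illustration (they are not the Galton data), and the helper names `correlation` and `normalize` are our own.

```python
from statistics import mean, stdev

def correlation(x, y):
    """Pearson correlation from the sample covariance and standard deviations."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (stdev(x) * stdev(y))

def normalize(v):
    """Center to mean 0, then scale to standard deviation 1."""
    m, s = mean(v), stdev(v)
    return [(a - m) / s for a in v]

# Illustrative heights in inches (assumed values, not the Galton data).
parent = [70.5, 68.5, 65.5, 64.5, 64.0, 67.5, 69.0, 71.0]
child = [66.2, 67.1, 64.9, 63.8, 65.0, 68.3, 69.4, 70.1]

xn, yn = normalize(parent), normalize(child)

r_raw = correlation(parent, child)   # correlation of the original variables
r_norm = correlation(xn, yn)         # correlation after centering and scaling

print("mean and sd of xn:", mean(xn), stdev(xn))
print("correlations agree:", abs(r_raw - r_norm) < 1e-12)
```

The means of the normalized variables come out as tiny nonzero numbers for the same reason the output above shows values such as 5.5e-16: floating-point rounding.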
page 21
First, we define y as the son’s height and x as the father’s height, and then fit the linear regression.
(Intercept) fheight
[1,] 33.8866 0.514093
[2,] 33.8866 0.514093
Intercept: 33.8866
Slope: 0.514093
The figure above shows the father’s and son’s height data, and the red line indicates the fitted regression line.
Estimate Std. Error t value Pr(>|t|)
(Intercept) 33.886604 1.83235382 18.49348 1.604044e-66
fheight 0.514093 0.02704874 19.00618 1.121268e-69
We see that the coefficient on the father’s height from the linear regression model is 0.514. Now, let’s refit the model with centered variables, where xc (centered father) = fheight - mean(fheight) and yc (centered son) = sheight - mean(sheight). Then we get
centered father's height: 0.514093
Alternatively, we can fit a linear model with yc as the outcome and xc as the predictor:
xc
0.514093
We see that we get 0.514 exactly as before.
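A minimal Python sketch of why the slope survives centering, assuming made-up father and son heights (not the father.son data) and a hand-written least-squares helper `ols` of our own:

```python
from statistics import mean

def ols(x, y):
    """Return (intercept, slope) of the least-squares line y = b0 + b1 * x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    b1 = sxy / sxx
    return my - b1 * mx, b1

# Illustrative heights in inches (assumed values for this sketch).
father = [65.0, 63.3, 66.1, 65.8, 61.1, 68.0, 70.4, 64.7]
son = [62.8, 63.2, 66.3, 65.9, 62.3, 67.2, 69.1, 66.0]

# Center both variables at their own means.
xc = [a - mean(father) for a in father]
yc = [b - mean(son) for b in son]

b0, b1 = ols(father, son)        # fit on the original scale
b0_c, b1_c = ols(xc, yc)         # fit on the centered scale

# With centered data the slope is also the regression-through-the-origin ratio.
slope_origin = sum(a * b for a, b in zip(xc, yc)) / sum(a * a for a in xc)

print("slopes:", b1, b1_c, slope_origin)
print("centered intercept:", b0_c)
```

All three slopes agree, and the intercept of the centered fit is zero up to rounding, since both centered variables have mean zero.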
First, we define xn as the normalized father’s height and yn as the normalized son’s height. Then, we have
Fitted Slope: 0.5013383
The result above shows that the fitted slope for the normalized father’s and son’s heights is about 0.5013. If we take the correlation between xn and yn, we get
Correlation between Normalized Variables: 0.5013383
Note that if we normalize both the father’s and son’s height variables first, the fitted slope coefficient is exactly the correlation.
If we want to predict using a linear model, we can use the predict function, which gives
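This follows from the general slope formula. As a short worked equation, writing x_n and y_n for the normalized variables (each with standard deviation 1) and assuming the usual least-squares slope estimate:

```latex
\hat\beta_1 = \operatorname{cor}(y, x)\,\frac{\operatorname{sd}(y)}{\operatorname{sd}(x)}
\quad\Longrightarrow\quad
\hat\beta_1^{(n)} = \operatorname{cor}(y_n, x_n)\cdot\frac{1}{1}
                  = \operatorname{cor}(y_n, x_n)
                  = \operatorname{cor}(y, x)
```

The last equality holds because centering and scaling do not change the correlation, which is why the fitted slope and the correlation both print as 0.5013383.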
1
66.27447
Alternatively, we can extract the two fitted coefficients and compute the predicted son’s height from the father’s height directly; we get the same answer:
(Intercept) fheight
33.886604 0.514093
Predicted son's height: 66.27447
what would be the estimate of the slope?
Let y be our outcome and x our predictor. Then sd(y) = 2 sd(x) implies that the ratio of standard deviations is sd(y)/sd(x) = 2, and the correlation coefficient is cor(y,x) = 0.3. Recall that the slope estimate is cor(y,x) * sd(y)/sd(x) = 0.3(2) = 0.6.
Slope Estimate: 0.6
Since the outcome variable has a mean of 1 and the predictor has a mean of 0.5, the formula for the intercept gives β0 = mean(y) - β1*mean(x) = 1 - 0.6(0.5) = 1 - 0.3 = 0.7.
Intercept: 0.7
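The same arithmetic can be written as a small Python sketch. The function names are our own, and since only the ratio sd(y)/sd(x) = 2 is given, the absolute value sd_x = 1.0 below is an arbitrary assumption.

```python
def slope_from_summary(r, sd_y, sd_x):
    """Least-squares slope from the correlation and the two standard deviations."""
    return r * sd_y / sd_x

def intercept_from_summary(mean_y, mean_x, slope):
    """Least-squares intercept: the fitted line passes through (mean_x, mean_y)."""
    return mean_y - slope * mean_x

r = 0.3                    # cor(y, x)
sd_x = 1.0                 # assumed scale; only the ratio sd_y / sd_x matters
sd_y = 2.0 * sd_x          # sd(y) = 2 sd(x)
mean_y, mean_x = 1.0, 0.5  # means used in the intercept calculation

slope = slope_from_summary(r, sd_y, sd_x)
intercept = intercept_from_summary(mean_y, mean_x, slope)

print(f"Slope Estimate: {slope:.1f}")
print(f"Intercept: {intercept:.1f}")
```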
The intercept estimate is β̂₀ = mean(sheight) - β̂₁*mean(fheight). Since the predictor variable has a mean of 0, β̂₀ = mean(sheight). Thus, it is true that the estimated intercept is the mean of the outcome, that is, the mean of the son’s height.
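Written as a one-line derivation, with \bar{x} and \bar{y} standing for the predictor and outcome means (notation introduced here for brevity):

```latex
\hat\beta_0 = \bar{y} - \hat\beta_1\,\bar{x},
\qquad
\bar{x} = 0 \;\Longrightarrow\; \hat\beta_0 = \bar{y} - \hat\beta_1 \cdot 0 = \bar{y}
```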
The slope estimate is cor(y,x) * sd(y)/sd(x). If we reverse the regressor and outcome, the correlation of y and x does not change, but the ratio of standard deviations becomes sd(x)/sd(y) = 1/2. Then the slope is 0.3(1/2) = 0.15:
Slope Estimate: 0.15
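As a worked equation, using the subscripts y|x and x|y (our own labels) for the regression of y on x and of x on y:

```latex
\hat\beta_{x \mid y} = \operatorname{cor}(y, x)\,\frac{\operatorname{sd}(x)}{\operatorname{sd}(y)}
                     = 0.3 \times \tfrac{1}{2} = 0.15,
\qquad
\hat\beta_{y \mid x}\,\hat\beta_{x \mid y} = 0.6 \times 0.15 = 0.09 = 0.3^2
```

The product of the two slopes is the squared correlation, which is a quick consistency check on both answers.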
page 32
hypothesis test.
Estimate Std. Error t value Pr(>|t|)
(Intercept) 33.886604 1.83235382 18.49348 1.604044e-66
fheight 0.514093 0.02704874 19.00618 1.121268e-69
P-value for the Slope Coefficient (Father's Height): 1.121268e-69
Reject the null hypothesis: There is a significant relationship between father's height and son's height.
From the previous problem, the slope estimate is 0.514093, which implies that for every 1 inch increase in the father’s height we estimate a 0.514 inch increase in the son’s height. The intercept was estimated at about 34; this is the estimated son’s height for a father’s height of 0 inches. Now, let’s recenter the father’s height:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 33.886604 1.83235382 18.49348 1.604044e-66
fheight 0.514093 0.02704874 19.00618 1.121268e-69
Centered Intercept (Estimated Son's Height when Father's Height is at its Mean): 68.68407
Notice that the slope did not change, because centering (shifting) the regressor has no impact on the slope. The intercept is now about 69 inches, which is the estimated son’s height at the average father’s height.
Predicted Son's Height when Father's Height is 80 inches: 75.01405 inches
fheight sheight
Min. :59.01 Min. :58.51
1st Qu.:65.79 1st Qu.:66.93
Median :67.77 Median :68.62
Mean :67.69 Mean :68.68
3rd Qu.:69.60 3rd Qu.:70.47
Max. :75.43 Max. :78.36
We see that the maximum father’s height is about 75 inches, so at 80 inches we are predicting beyond the data that we have actually observed, for a much taller person than we typically observe. If we had to give a prediction, this would still be our best prediction, but we must be fully aware that it is an extrapolation beyond the observed data.
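A small Python sketch of the prediction together with a range check, using the fitted coefficients and the fheight minimum and maximum reported above. The helper names `predict_son` and `is_extrapolation` are our own and not part of the original analysis.

```python
# Coefficients and observed range copied from the output above.
INTERCEPT = 33.886604
SLOPE = 0.514093
FHEIGHT_MIN, FHEIGHT_MAX = 59.01, 75.43

def predict_son(father_height):
    """Predicted son's height (inches) from the fitted line."""
    return INTERCEPT + SLOPE * father_height

def is_extrapolation(father_height):
    """True when the father's height lies outside the observed range."""
    return not (FHEIGHT_MIN <= father_height <= FHEIGHT_MAX)

pred_80 = predict_son(80)
print(f"Predicted son's height at 80 inches: {pred_80:.5f}")
print("Extrapolating beyond the data:", is_extrapolation(80))
```

Because the coefficients are rounded as printed, the result can differ from the reported 75.01405 in the last digit.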
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Estimate Std. Error t value Pr(>|t|)
(Intercept) 20.09062500 0.6828817 29.420360 1.101810e-23
I(hp - mean(hp)) -0.06822828 0.0101193 -6.742389 1.787835e-07
Intercept (β₀): 20.09062
Slope (β₁) for Horsepower (hp): -0.06822828
Based on the output, the slope for horsepower is about -0.07, which means that for every 1 unit increase in horsepower we estimate about a 0.07 decrease in miles per gallon. The relationship is highly statistically significant: if we test the hypothesis that the slope coefficient is 0, we reject the null hypothesis. Note that we centered horsepower, so the intercept is now about 20, interpreted as the estimated miles per gallon for a car with average horsepower.
Estimate Std. Error t value Pr(>|t|)
(Intercept) 30.09886054 1.6339210 18.421246 6.642736e-18
hp -0.06822828 0.0101193 -6.742389 1.787835e-07
P-value: 1.787835e-07
Conclusion: Reject the null hypothesis (H0). There is a significant linear relationship between horsepower and miles per gallon.
We see that the p-value for horsepower is 1.787835e-07. Comparing that to our α=0.05 level of significance, we reject the null hypothesis that the coefficient on horsepower is 0 against the alternative that it is nonzero. In this case, there’s strong evidence to suggest that it’s nonzero.
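As a quick check on the coefficient table, the t value is the estimate divided by its standard error, and the decision compares the reported p-value with α. The sketch below only reuses numbers printed above; it does not recompute the p-value itself.

```python
# Values copied from the hp row of the coefficient table above.
estimate = -0.06822828
std_error = 0.0101193
p_value = 1.787835e-07
alpha = 0.05

# The t statistic for H0: slope = 0 is estimate / standard error.
t_value = estimate / std_error

# Reject H0 when the p-value falls below the significance level.
reject_null = p_value < alpha

print(f"t value: {t_value:.4f}")
print("Reject H0:", reject_null)
```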
Predicted Miles per Gallon (mpg) for Horsepower 111: 22.52552
Thus, the model predicts that at a horsepower of 111, the miles per gallon are about 22.5.
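The prediction can be reproduced from either coefficient table. In the sketch below the mean horsepower is not read from the data; it is backed out from the two reported intercepts, so treat `mean_hp` as an implied value rather than a quoted one.

```python
# Coefficients copied from the two tables above.
slope = -0.06822828
intercept_uncentered = 30.09886054   # model mpg ~ hp
intercept_centered = 20.09062500     # model mpg ~ I(hp - mean(hp))

hp_new = 111

# Uncentered model: plug hp straight into the line.
pred_uncentered = intercept_uncentered + slope * hp_new

# The two intercepts differ by slope * mean(hp), which lets us back out the mean.
mean_hp = (intercept_centered - intercept_uncentered) / slope

# Centered model: plug in the deviation of hp from its mean.
pred_centered = intercept_centered + slope * (hp_new - mean_hp)

print(f"Predicted mpg at hp = {hp_new}: {pred_uncentered:.5f}")
print(f"Same prediction from the centered fit: {pred_centered:.5f}")
```

Both parameterizations describe the same fitted line, so the two predictions agree.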
page 45
axis).
Estimated Residual Variance: 5.931292
Residual Variance from lm: 5.936804
R-squared (R²) Value: 0.2513401
Estimated Residual Variance: 14.44111
Residual Variance from lm: 14.92248
R-squared (R²) Value: 0.6024373
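For reference, here is a minimal Python sketch of how a residual variance and an R² value can be computed by hand for a simple linear regression. The data are made up, the n - 2 divisor is the usual degrees-of-freedom choice for a line with an intercept (we assume it is what the lm-based figure uses), and none of this is the code behind the numbers above.

```python
from statistics import mean

# Illustrative predictor and outcome values (assumed, not mtcars or father.son).
x = [110.0, 93.0, 175.0, 105.0, 245.0, 62.0, 95.0, 123.0]
y = [21.0, 22.8, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2]

n = len(x)
mx, my = mean(x), mean(y)
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
sxx = sum((a - mx) ** 2 for a in x)
syy = sum((b - my) ** 2 for b in y)

# Least-squares fit.
b1 = sxy / sxx
b0 = my - b1 * mx

# Residuals and their sum of squares.
residuals = [b - (b0 + b1 * a) for a, b in zip(x, y)]
sse = sum(e * e for e in residuals)

resid_var = sse / (n - 2)   # residual variance with n - 2 degrees of freedom
r_squared = 1 - sse / syy   # share of outcome variation explained by the line
r = sxy / (sxx * syy) ** 0.5

print("residual variance:", resid_var)
print("R squared:", r_squared, "correlation squared:", r * r)
```

In simple linear regression R² equals the squared correlation, which matches the first R² value above: 0.5013383² ≈ 0.2513.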