library(UsingR)
Loading required package: MASS
Loading required package: HistData
Loading required package: Hmisc
Attaching package: 'Hmisc'
The following objects are masked from 'package:base':
format.pval, units
library(ggplot2)
data("Galton")
mean(Galton$parent)
[1] 68.30819
mean(Galton$child)
[1] 68.08847
sd(Galton$parent)
[1] 1.787333
sd(Galton$child)
[1] 2.517941
The mean for the child’s height is 68.08847 with a standard deviation of 2.517941. On the other hand, the mean for the parent’s height is 68.30819 with a standard deviation of 1.787333
cor(Galton)
parent child
parent 1.0000000 0.4587624
child 0.4587624 1.0000000
The correlation between the parent and the child is the same with the correlation between the child and the parent.
In centering variables we subtract the mean from data points. If the mean of this data is 0, then the variables are being centered. Here we let x and y be the parent and child variables respectively. The variable xc is the data that contains the child variables minus the child’s mean. Similary, the variable xy is the data that contains the parent variables minus the parent’s mean.
x<-Galton$parent
y<-Galton$child
xc<- x-mean(x)
xy<- y-mean(y)
mean(xc)
[1] 9.775954e-16
mean(xy)
[1] 4.817867e-16
Clearly, the mean of xc is 9.775954e-16 and the mean of xy is 4.817867e-16 which is approximately equal to zero.
To rescale a variable, we divide the variables by its standard deviation. The standard deviation of the derived data is 1.
ys<- y/sd(y)
xs<- x/sd(x)
sd(ys)
[1] 1
sd(xs)
[1] 1
Thus, it is true that the scaled variable standard deviations are 1.
Normalizing parental and child heights is dividing the standard deviation from the difference of each height and their corresponding mean. Here, we let xn be the normalized data for child height and yn is the normalized data for parents heights.
xn<-(x-mean(x))/sd(x)
yn<-(y-mean(y))/sd(y)
mean(xn)
[1] 5.501733e-16
mean(yn)
[1] 2.183943e-16
sd(xn)
[1] 1
sd(yn)
[1] 1
cor(xn, yn)
[1] 0.4587624
cor(yn, xn)
[1] 0.4587624
The normalized parent’s height and child’s height is 2.183943e-16 and 5.501733e-16 respectively which is approximately equal to 0 and with standard deviation equal to 1. The correlation between parent and child and between child and parent is the same which is equal to 0.4587624 since centerinig and scaling has no impact on the value of the correlation.
Having son’s height (s) as the outcome we obtain the line Y=ß_0 + ß_1X. Here Y is the son’s height(s) and X is the father’s height(f), ß_1=cor(Y,X)(sd(Y)/sd(X)) and ß_0= mean(Y)-ß_1(mean(X)). ß_1 is the slope and ß_0 is the intercept. Thus, we have
data("father.son")
fit<-lm(sheight~fheight, father.son)
f<-father.son$fheight
s<-father.son$sheight
beta1<- cor(s,f)* sd(s)/sd(f)
beta0<- mean(s) - beta1 * mean(f)
rbind(coef(fit), c(beta0,beta1))
(Intercept) fheight
[1,] 33.8866 0.514093
[2,] 33.8866 0.514093
plot<-ggplot(father.son, aes(x=fheight, y=sheight))
plot<- plot + geom_point()
plot<- plot + geom_smooth(method = lm, se = FALSE, lwd=2)
plot
`geom_smooth()` using formula = 'y ~ x'
From Problem 1, we have father’s height coefficient equal to 0.51409.
summary(fit)
Call:
lm(formula = sheight ~ fheight, data = father.son)
Residuals:
Min 1Q Median 3Q Max
-8.8772 -1.5144 -0.0079 1.6285 8.9685
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 33.88660 1.83235 18.49 <2e-16 ***
fheight 0.51409 0.02705 19.01 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.437 on 1076 degrees of freedom
Multiple R-squared: 0.2513, Adjusted R-squared: 0.2506
F-statistic: 361.2 on 1 and 1076 DF, p-value: < 2.2e-16
Centering the data creates a slope coefficient equal to ß_1=sum(XY)/(X^2). We can also verify using lm(Y-centered~X-centered-1). Subtracting 1 mean omitting the intercept.
fc<- f - mean(f)
sc<- s - mean(s)
sum(fc * sc)/sum(fc^2)
[1] 0.514093
lm(sc~fc-1)
Call:
lm(formula = sc ~ fc - 1)
Coefficients:
fc
0.5141
Thus,the slope estimate is the same as the linear regression fit from problem 1.
Normalizing the father and son data, we have
fn<-(f-mean(f))/sd(f)
sn<-(s-mean(s))/sd(s)
and getting the fitted slope and correlation, we have
lm(sn~fn)
Call:
lm(formula = sn ~ fn)
Coefficients:
(Intercept) fn
1.820e-15 5.013e-01
cor(sn,fn)
[1] 0.5013383
fitted slope=0.5013=correlation.
From our linear model in problem 1, which we named as fit, we have the son’s height as the outcome and father’s height as predictor. We get the estimated value for slope and intercept and other necessary information by getting the summary of fit. To predict the son’s height given that the father’s height is 63 inches, we use the funtion predict and use the linear model fit. We have
predict(fit, newdata = data.frame(fheight=63))
1
66.27447
We can also predict using the coefficients directly and using the equation Y=ß_0 + ß_1X where X=63
Y=beta0+beta1*63
Y
[1] 66.27447
Here, the outcome variable is s and the predictor variable is f and sd(s)=2sd(f). Hence we have (sd(s)/sd(f))=2 and note that cor(s,f)=0.3. We have the formula for the estimated slope equal to ß_1=cor(s,f)(sd(s)/sd(f)).Thus, the estimated slope is equal to
ß_1 <- 0.3*2
ß_1
[1] 0.6
From problem 1, the formula for the intercept is ß_0= mean(s)-ß_1(mean(f)). Hence the intercept is equal to
ß_0 <- 1-(0.6*0.5)
ß_0
[1] 0.7
Yes. Getting the intercept, we have the formula ß_0=mean(s)-ß_1(mean(f)) where s is the outcome and f is the predictor. If we have mean(f)=0, then the equation becomes ß_0=mean(s). Therefore, the estimated intercept from linear regression will be the mean of the outcome.
If the standard deviation of the predictor variable is double that of the outcome then we have sd(f)=2sd(s). Here, the outcome variable is s and the predictor variable is f and cor(s,f)=0.3.The estimated slope is equal to ß_1=cor(s,f)(sd(s)/sd(f) with sd(s)/sd(f)=1/2, thus we have
ß_1<- 0.3*(1/2)
ß_1
[1] 0.15
Normalizing each set of weights, the center of the axes falls in the mean of the data and the standard deviation is 1 for both axis. In this case the slope of the regression line is the correlation since the slope is equal to the correlation times the ratio of the standard deviations. Now if we wanted to predict the second weight given the first it would be multiplying the given weight by the correlation. In this case, we multiply 0.75 to the given weight in the first scale to get a good estimate of the other weight in the second scale.
Note that we have normalized the data and the correlation is the slope of the data since we have a standard deviation equal to 1. From previous problem, we just have to multiply the correlation to the 1st scale to predict the 2nd scale. That is,
2*0.75
[1] 1.5
First we get the slope of the data which is equal to the correlation times the ratio of the standard deviations and note that in this case, the wife’s guess is the outcome and the husband’s guess is the predictor. Hence we have a slope equal to
0.2*8/10
[1] 0.16
Since the data have been centered, we multiply the slope to the predictor’s value to get the estimate the wife’s guess since our intercept here is equal to 0. Thus we have
0.16*30
[1] 4.8
fit<-lm(sheight~fheight, father.son)
summary(fit)$coef
Estimate Std. Error t value Pr(>|t|)
(Intercept) 33.886604 1.83235382 18.49348 1.604044e-66
fheight 0.514093 0.02704874 19.00618 1.121268e-69
Recall that the Son’s height is equal to ß_0 + ß_1(Father’s height). Thus we have our hypotheses, H_0: ß_1=0 and H_1: ß_1!=0. From the summary table, we have the p-value for the slope coefficient equal to 1.121268e-69. Therefore, we reject our null hypothesis and conclude the alternative hypothesis, that is H_1: ß_1!=0. This implies that there an amount of linear relationship between the father’s height and the son’s height.
From the summary table above, we have a slope estimate equal to 0.514. This implies that in every one inch decrease in the father’s height, we estimate a 0.514 increase in the son’s height. Now the intercept is equal to 33.89. This is the estimated son’s height for every 0 height of the father. Since there is no such 0 height, we conclude that this is not a relevant intercept. Now, we recenter the father’s height.
fit2<-lm(sheight~I(fheight-mean(fheight)), father.son)
summary(fit2)$coef
Estimate Std. Error t value Pr(>|t|)
(Intercept) 68.684070 0.07421078 925.52689 0.000000e+00
I(fheight - mean(fheight)) 0.514093 0.02704874 19.00618 1.121268e-69
Note that we have the same slope since centering does not affect the slope. Now since the intercept is equal to 68.68, this implies that the estimated son’s height is 68.68 at the average father’s height.
Using predict function, we have
predict(fit, newdata = data.frame(fheight=80))
1
75.01405
summary(father.son)
fheight sheight
Min. :59.01 Min. :58.51
1st Qu.:65.79 1st Qu.:66.93
Median :67.77 Median :68.62
Mean :67.69 Mean :68.68
3rd Qu.:69.60 3rd Qu.:70.47
Max. :75.43 Max. :78.36
Getting the summary of the data, we have the maximum height of the father equal to 75.43. Since we predict the son’s height using 80 in height of the father, the result is not reliable since we choose a value above the maximum value of father’s height.
data("mtcars")
fit3<-lm(mpg~hp, mtcars)
summary(fit3)$coef
Estimate Std. Error t value Pr(>|t|)
(Intercept) 30.09886054 1.6339210 18.421246 6.642736e-18
hp -0.06822828 0.0101193 -6.742389 1.787835e-07
From the table above, the estimate is -0.06822828. This implies that for every one unit increase of horsepower we get -0.06822828 decrease in miles per gallon. Additionally, this relationship is statistically highly significant given the p-value. The intercept is the estimate of the miles per gallon for a zero horsepower, that is 30.09886054 which is not a meaningful estimate. Hence we should recenter the horsepower.
fit4<-lm(mpg~I(hp-mean(hp)), mtcars)
summary(fit4)$coef
Estimate Std. Error t value Pr(>|t|)
(Intercept) 20.09062500 0.6828817 29.420360 1.101810e-23
I(hp - mean(hp)) -0.06822828 0.0101193 -6.742389 1.787835e-07
Thus, we get now a meaningful estimate for horsepower.
plot2<- ggplot(mtcars, aes(x=hp, y=mpg))
plot2<- plot2 + geom_point(cex=5, alpha= .5)
plot2<- plot2 + geom_smooth(method = lm, se=FALSE, lwd=2)
plot2
`geom_smooth()` using formula = 'y ~ x'
From question number 4, getting the summary of fit3 we get the p-value for the horsepower equal to 1.787835e-07. Given this p-value and using alpha=0.05, we reject our null hypothesis that the coefficient in front of hp is zero. Therefore, there is a linear relationship between horsepower and miles per gallon.
From question number 4, getting the summary of fit3 we get a slope intercept equal to 30.09886 and a slope of -0.06823. Using the regression line equation Y=ß_0 + ß_1X where ß_0=30.09886 ß_1= -0.06823 and X=111 we have,
Y<- 30.09886 -0.06823*111
Y
[1] 22.52533
data("father.son")
fit<-lm(sheight~fheight, father.son)
plot(father.son$sheight, resid(fit))
abline(v = 0, col="red")
#### 2. Refer to question 1. Directly estimate the residual variance and
compare this estimate to the output of lm. Note that the residual
variance is estimating the sum of the residuals squared divided by n-p
where p is the number of regression paramater. We have
sum(resid(fit)^2)/(nrow(father.son)-2)
[1] 5.936804
Comparing it to the output lm
summary(fit)$sigma^2
[1] 5.936804
and obviously, they are equal.
summary(fit)$r.squared
[1] 0.2513401
data("mtcars")
fit3<-lm(mpg~hp, mtcars)
plot(mtcars$hp, resid(fit3))
abline(v = 0, col="red")
sum(resid(fit3)^2)/(nrow(mtcars)-2)
[1] 14.92248
summary(fit3)$sigma^2
[1] 14.92248
Hence, the residual variance is equal to the variance output in lm.
summary(fit3)$r.squared
[1] 0.6024373