PROBLEM SET # 2

PAGE 16

1. Take the Galton dataset and find the mean, standard deviation and correlation between the parental and child heights.

library(UsingR)

  Loading required package: MASS

  Loading required package: HistData

  Loading required package: Hmisc

  
  Attaching package: 'Hmisc'

  The following objects are masked from 'package:base':
  
      format.pval, units

library(ggplot2)

data("Galton")

mean(Galton$parent)

  [1] 68.30819

mean(Galton$child)

  [1] 68.08847

sd(Galton$parent)

  [1] 1.787333

sd(Galton$child)

  [1] 2.517941

The mean for the child’s height is 68.08847 with a standard deviation of 2.517941. On the other hand, the mean for the parent’s height is 68.30819 with a standard deviation of 1.787333

cor(Galton)

            parent     child
  parent 1.0000000 0.4587624
  child  0.4587624 1.0000000

The correlation between the parent and the child is the same with the correlation between the child and the parent.

2. Center the parent and child variables and verify that the centered variable means are 0.

In centering variables we subtract the mean from data points. If the mean of this data is 0, then the variables are being centered. Here we let x and y be the parent and child variables respectively. The variable xc is the data that contains the child variables minus the child’s mean. Similary, the variable xy is the data that contains the parent variables minus the parent’s mean.

x<-Galton$parent
y<-Galton$child
xc<- x-mean(x)
xy<- y-mean(y)
mean(xc)

  [1] 9.775954e-16

mean(xy)

  [1] 4.817867e-16

Clearly, the mean of xc is 9.775954e-16 and the mean of xy is 4.817867e-16 which is approximately equal to zero.

3.Rescale the parent and child variables and verify that the scaled variable standard deviations are 1.

To rescale a variable, we divide the variables by its standard deviation. The standard deviation of the derived data is 1.

ys<- y/sd(y)
xs<- x/sd(x)
sd(ys)

  [1] 1

sd(xs)

  [1] 1

Thus, it is true that the scaled variable standard deviations are 1.

4. Normalize the parental and child heights. Verify that the normalized variables have mean 0 and standard deviation 1 and take the correlation between them.

Normalizing parental and child heights is dividing the standard deviation from the difference of each height and their corresponding mean. Here, we let xn be the normalized data for child height and yn is the normalized data for parents heights.

xn<-(x-mean(x))/sd(x)
yn<-(y-mean(y))/sd(y)
mean(xn)

  [1] 5.501733e-16

mean(yn)

  [1] 2.183943e-16

sd(xn)

  [1] 1

sd(yn)

  [1] 1

cor(xn, yn)

  [1] 0.4587624

cor(yn, xn)

  [1] 0.4587624

The normalized parent’s height and child’s height is 2.183943e-16 and 5.501733e-16 respectively which is approximately equal to 0 and with standard deviation equal to 1. The correlation between parent and child and between child and parent is the same which is equal to 0.4587624 since centerinig and scaling has no impact on the value of the correlation.

PAGE 21 & 22

1. Install and load the package UsingR and load the father.son data with data(father.son). Get the linear regression fit where the son’s height is the outcome and the father’s height is the predictor. Give the intercept and the slope, plot the data and overlay the fitted regression line.

Having son’s height (s) as the outcome we obtain the line Y=ß_0 + ß_1X. Here Y is the son’s height(s) and X is the father’s height(f), ß_1=cor(Y,X)(sd(Y)/sd(X)) and ß_0= mean(Y)-ß_1(mean(X)). ß_1 is the slope and ß_0 is the intercept. Thus, we have

data("father.son")
fit<-lm(sheight~fheight, father.son)
f<-father.son$fheight
s<-father.son$sheight
beta1<- cor(s,f)* sd(s)/sd(f)
beta0<- mean(s) - beta1 * mean(f)
rbind(coef(fit), c(beta0,beta1))

       (Intercept)  fheight
  [1,]     33.8866 0.514093
  [2,]     33.8866 0.514093

plot<-ggplot(father.son, aes(x=fheight, y=sheight))
plot<- plot + geom_point()
plot<- plot + geom_smooth(method = lm, se = FALSE, lwd=2)
plot

  `geom_smooth()` using formula = 'y ~ x'

2. Refer to problem 1. Center the father and son variables and refit the model omitting the intercept. Verify that the slope estimate is the same as the linear regression fit from problem 1.

From Problem 1, we have father’s height coefficient equal to 0.51409.

summary(fit)

  
  Call:
  lm(formula = sheight ~ fheight, data = father.son)
  
  Residuals:
      Min      1Q  Median      3Q     Max 
  -8.8772 -1.5144 -0.0079  1.6285  8.9685 
  
  Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
  (Intercept) 33.88660    1.83235   18.49   <2e-16 ***
  fheight      0.51409    0.02705   19.01   <2e-16 ***
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  
  Residual standard error: 2.437 on 1076 degrees of freedom
  Multiple R-squared:  0.2513,  Adjusted R-squared:  0.2506 
  F-statistic: 361.2 on 1 and 1076 DF,  p-value: < 2.2e-16

Centering the data creates a slope coefficient equal to ß_1=sum(XY)/(X^2). We can also verify using lm(Y-centered~X-centered-1). Subtracting 1 mean omitting the intercept.

fc<- f - mean(f)
sc<- s - mean(s)
sum(fc * sc)/sum(fc^2)

  [1] 0.514093

lm(sc~fc-1)

  
  Call:
  lm(formula = sc ~ fc - 1)
  
  Coefficients:
      fc  
  0.5141

Thus,the slope estimate is the same as the linear regression fit from problem 1.

3. Refer to problem 1. Normalize the father and son data and see that the fitted slope is the correlation.

Normalizing the father and son data, we have

fn<-(f-mean(f))/sd(f)
sn<-(s-mean(s))/sd(s)

and getting the fitted slope and correlation, we have

lm(sn~fn)

  
  Call:
  lm(formula = sn ~ fn)
  
  Coefficients:
  (Intercept)           fn  
    1.820e-15    5.013e-01

cor(sn,fn)

  [1] 0.5013383

fitted slope=0.5013=correlation.

4. Go back to the linear regression line from Problem 1. If a father’s height was 63 inches, what would you predict the son’s height to be?

From our linear model in problem 1, which we named as fit, we have the son’s height as the outcome and father’s height as predictor. We get the estimated value for slope and intercept and other necessary information by getting the summary of fit. To predict the son’s height given that the father’s height is 63 inches, we use the funtion predict and use the linear model fit. We have

predict(fit, newdata = data.frame(fheight=63))

         1 
  66.27447

We can also predict using the coefficients directly and using the equation Y=ß_0 + ß_1X where X=63

Y=beta0+beta1*63
Y

  [1] 66.27447

5. Consider a data set where the standard deviation of the outcome variable is double that of the predictor. Also, the variables have a correlation of 0.3. If you fit a linear regression model, what would be the estimate of the slope?

Here, the outcome variable is s and the predictor variable is f and sd(s)=2sd(f). Hence we have (sd(s)/sd(f))=2 and note that cor(s,f)=0.3. We have the formula for the estimated slope equal to ß_1=cor(s,f)(sd(s)/sd(f)).Thus, the estimated slope is equal to

ß_1 <- 0.3*2
ß_1

  [1] 0.6

6. Consider the previous problem. The outcome variable has a mean of 1 and the predictor has a mean of 0.5. What would be the intercept?

From problem 1, the formula for the intercept is ß_0= mean(s)-ß_1(mean(f)). Hence the intercept is equal to

ß_0 <- 1-(0.6*0.5)
ß_0

  [1] 0.7

7. True or false, if the predictor variable has mean 0, the estimated intercept from linear regression will be the mean of the outcome?

Yes. Getting the intercept, we have the formula ß_0=mean(s)-ß_1(mean(f)) where s is the outcome and f is the predictor. If we have mean(f)=0, then the equation becomes ß_0=mean(s). Therefore, the estimated intercept from linear regression will be the mean of the outcome.

8. Consider problem 5 again. What would be the estimated slope if the predictor and outcome were reversed?

If the standard deviation of the predictor variable is double that of the outcome then we have sd(f)=2sd(s). Here, the outcome variable is s and the predictor variable is f and cor(s,f)=0.3.The estimated slope is equal to ß_1=cor(s,f)(sd(s)/sd(f) with sd(s)/sd(f)=1/2, thus we have

ß_1<- 0.3*(1/2)
ß_1

  [1] 0.15

PAGE 26

1. You have two noisy scales and a bunch of people that you’d like to weigh. You weigh each person on both scales. The correlation was 0.75. If you normalized each set of weights, what would you have to multiply the weight on one scale to get a good estimate of the weight on the other scale?

Normalizing each set of weights, the center of the axes falls in the mean of the data and the standard deviation is 1 for both axis. In this case the slope of the regression line is the correlation since the slope is equal to the correlation times the ratio of the standard deviations. Now if we wanted to predict the second weight given the first it would be multiplying the given weight by the correlation. In this case, we multiply 0.75 to the given weight in the first scale to get a good estimate of the other weight in the second scale.

2. Consider the previous problem. Someone’s weight was 2 standard deviations above the mean of the group on the first scale. How many standard deviations above the mean would you estimate them to be on the second?

Note that we have normalized the data and the correlation is the slope of the data since we have a standard deviation equal to 1. From previous problem, we just have to multiply the correlation to the 1st scale to predict the 2nd scale. That is,

2*0.75

  [1] 1.5

3. You ask a collection of husbands and wives to guess how many jellybeans are in a jar. The correlation is 0.2. The standard deviation for the husbands is 10 beans while the standard deviation for wives is 8 beans. Assume that the data were centered so that 0 is the mean for each. The centered guess for a husband was 30 beans (above the mean). What would be your best estimate of the wife’s guess?

First we get the slope of the data which is equal to the correlation times the ratio of the standard deviations and note that in this case, the wife’s guess is the outcome and the husband’s guess is the predictor. Hence we have a slope equal to

0.2*8/10

  [1] 0.16

Since the data have been centered, we multiply the slope to the predictor’s value to get the estimate the wife’s guess since our intercept here is equal to 0. Thus we have

0.16*30

  [1] 4.8

PAGE 32 and 33

1. Fit a linear regression model to the father.son dataset with the father as the predictor and the son as the outcome. Give a p-value for the slope coefficient and perform the relevant hypothesis test.

fit<-lm(sheight~fheight, father.son)
summary(fit)$coef

               Estimate Std. Error  t value     Pr(>|t|)
  (Intercept) 33.886604 1.83235382 18.49348 1.604044e-66
  fheight      0.514093 0.02704874 19.00618 1.121268e-69

Recall that the Son’s height is equal to ß_0 + ß_1(Father’s height). Thus we have our hypotheses, H_0: ß_1=0 and H_1: ß_1!=0. From the summary table, we have the p-value for the slope coefficient equal to 1.121268e-69. Therefore, we reject our null hypothesis and conclude the alternative hypothesis, that is H_1: ß_1!=0. This implies that there an amount of linear relationship between the father’s height and the son’s height.

2. Refer to question 1. Interpret both parameters. Recenter for the intercept if necessary.

From the summary table above, we have a slope estimate equal to 0.514. This implies that in every one inch decrease in the father’s height, we estimate a 0.514 increase in the son’s height. Now the intercept is equal to 33.89. This is the estimated son’s height for every 0 height of the father. Since there is no such 0 height, we conclude that this is not a relevant intercept. Now, we recenter the father’s height.

fit2<-lm(sheight~I(fheight-mean(fheight)), father.son)
summary(fit2)$coef

                              Estimate Std. Error   t value     Pr(>|t|)
  (Intercept)                68.684070 0.07421078 925.52689 0.000000e+00
  I(fheight - mean(fheight))  0.514093 0.02704874  19.00618 1.121268e-69

Note that we have the same slope since centering does not affect the slope. Now since the intercept is equal to 68.68, this implies that the estimated son’s height is 68.68 at the average father’s height.

3. Refer to question 1. Predict the son’s height if the father’s height is 80 inches. Would you recommend this prediction? Why or why not?

Using predict function, we have

predict(fit, newdata = data.frame(fheight=80))

         1 
  75.01405

summary(father.son)

      fheight         sheight     
   Min.   :59.01   Min.   :58.51  
   1st Qu.:65.79   1st Qu.:66.93  
   Median :67.77   Median :68.62  
   Mean   :67.69   Mean   :68.68  
   3rd Qu.:69.60   3rd Qu.:70.47  
   Max.   :75.43   Max.   :78.36

Getting the summary of the data, we have the maximum height of the father equal to 75.43. Since we predict the son’s height using 80 in height of the father, the result is not reliable since we choose a value above the maximum value of father’s height.

4. Load the mtcars dataset. Fit a linear regression with miles per gallon as the outcome and horsepower as the predictor. Interpret your coefficients, recenter for the intercept if necessary.

data("mtcars")
fit3<-lm(mpg~hp, mtcars)
summary(fit3)$coef

                 Estimate Std. Error   t value     Pr(>|t|)
  (Intercept) 30.09886054  1.6339210 18.421246 6.642736e-18
  hp          -0.06822828  0.0101193 -6.742389 1.787835e-07

From the table above, the estimate is -0.06822828. This implies that for every one unit increase of horsepower we get -0.06822828 decrease in miles per gallon. Additionally, this relationship is statistically highly significant given the p-value. The intercept is the estimate of the miles per gallon for a zero horsepower, that is 30.09886054 which is not a meaningful estimate. Hence we should recenter the horsepower.

fit4<-lm(mpg~I(hp-mean(hp)), mtcars)
summary(fit4)$coef

                      Estimate Std. Error   t value     Pr(>|t|)
  (Intercept)      20.09062500  0.6828817 29.420360 1.101810e-23
  I(hp - mean(hp)) -0.06822828  0.0101193 -6.742389 1.787835e-07

Thus, we get now a meaningful estimate for horsepower.

5. Refer to question 4. Overlay the fit onto a scatterplot.

plot2<- ggplot(mtcars, aes(x=hp, y=mpg))
plot2<- plot2 + geom_point(cex=5, alpha= .5)
plot2<- plot2 + geom_smooth(method = lm, se=FALSE, lwd=2)
plot2

  `geom_smooth()` using formula = 'y ~ x'

6. Refer to question 4. Test the hypothesis of no linear relationship between horsepower and miles per gallon.

From question number 4, getting the summary of fit3 we get the p-value for the horsepower equal to 1.787835e-07. Given this p-value and using alpha=0.05, we reject our null hypothesis that the coefficient in front of hp is zero. Therefore, there is a linear relationship between horsepower and miles per gallon.

7. Refer to question 4. Predict the miles per gallon for a horsepower of 111.

From question number 4, getting the summary of fit3 we get a slope intercept equal to 30.09886 and a slope of -0.06823. Using the regression line equation Y=ß_0 + ß_1X where ß_0=30.09886 ß_1= -0.06823 and X=111 we have,

Y<- 30.09886 -0.06823*111
Y

  [1] 22.52533

PAGE 45

1. Fit a linear regression model to the father.son dataset with the father as the predictor and the son as the outcome. Plot the son’s height (horizontal axis) versus the residuals (vertical axis).

data("father.son")
fit<-lm(sheight~fheight, father.son)
plot(father.son$sheight, resid(fit))
abline(v = 0, col="red")

#### 2. Refer to question 1. Directly estimate the residual variance and compare this estimate to the output of lm. Note that the residual variance is estimating the sum of the residuals squared divided by n-p where p is the number of regression paramater. We have

sum(resid(fit)^2)/(nrow(father.son)-2)

  [1] 5.936804

Comparing it to the output lm

summary(fit)$sigma^2

  [1] 5.936804

and obviously, they are equal.

3. Refer to question 1. Give the R squared for this model.

summary(fit)$r.squared

  [1] 0.2513401

4. Load the mtcars dataset. Fit a linear regression with miles per gallon as the outcome and horsepower as the predictor. Plot horsepower versus the residuals.

data("mtcars")
fit3<-lm(mpg~hp, mtcars)
plot(mtcars$hp, resid(fit3))
abline(v = 0, col="red")

5. Refer to question 4. Directly estimate the residual variance and compare this estimate to the output of lm.

sum(resid(fit3)^2)/(nrow(mtcars)-2)

  [1] 14.92248

summary(fit3)$sigma^2

  [1] 14.92248

Hence, the residual variance is equal to the variance output in lm.

6. Refer to question 4. Give the R squared for this model.

summary(fit3)$r.squared

  [1] 0.6024373