Stats 516 Homework 8

1

galton <- read.csv("galton (1).csv")

1. (a)

head(galton)

##   X family father mother sex height nkids
## 1 1      1   78.5   67.0   M   73.2     4
## 2 2      1   78.5   67.0   F   69.2     4
## 3 3      1   78.5   67.0   F   69.0     4
## 4 4      1   78.5   67.0   F   69.0     4
## 5 5      2   75.5   66.5   M   73.5     4
## 6 6      2   75.5   66.5   M   72.5     4

galton.s <- subset ( galton, sex == "F" )

1. (b)

library(ggplot2)
plot1 <- ggplot ( galton.s, aes ( x=mother, y=height ) )
plot1 + geom_point()

1. (c)

plot2 <- ggplot ( galton.s, aes ( x=mother, y=height ) )
plot2 + geom_jitter()

1. (d)

plot2 + geom_jitter() + geom_abline ( intercept=0, slope=1 )

The points are scattered around the line y=x, with most of them lying above it (daughter taller than mother).
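The visual impression of "most points above the line" can be backed by a number: the share of daughters taller than their mothers. The sketch below uses a simulated stand-in data frame (`toy`, a hypothetical name with made-up values, not Galton's data); on the real data the same expression would be `mean(galton.s$height > galton.s$mother)`.

```r
# Simulated stand-in for galton.s (assumed values, not Galton's data)
set.seed(1)
toy <- data.frame(mother = rnorm(200, mean = 64, sd = 2.3))
toy$height <- 43 + 0.33 * toy$mother + rnorm(200, sd = 2.2)

# Share of points above the line y = x, i.e. daughters taller than mothers
prop_above <- mean(toy$height > toy$mother)
prop_above
```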

1. (e)

mod.g <- lm ( height ~ mother, data = galton.s )
coef(mod.g)

## (Intercept)      mother 
##  43.1554585   0.3265523

plot2 + geom_jitter() +
geom_abline ( intercept=0, slope=1 ) +
geom_abline ( intercept=coef(mod.g)[1], slope=coef(mod.g)[2] )

The best-fitting line fits the data better than the line y=x: the points are more evenly spread on both sides of it. It also passes closer to the average daughter height at the extremes of mother height (58-60 and 68-70 inches).
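One way to make "fits better" concrete is to compare the sum of squared errors of the two lines. Because y=x is itself a straight line, the least-squares line can never have a larger sum of squared errors, and its residuals average to zero. The following is a minimal sketch on simulated data (`toy` is a hypothetical stand-in, not Galton's data).

```r
# Simulated stand-in for galton.s (assumed values, not Galton's data)
set.seed(2)
toy <- data.frame(mother = rnorm(300, mean = 64, sd = 2.3))
toy$height <- 43 + 0.33 * toy$mother + rnorm(300, sd = 2.2)

fit <- lm(height ~ mother, data = toy)

# Residuals from the least-squares line average to zero
mean_res_fit <- mean(resid(fit))

# Residuals measured from the line y = x instead
res_identity <- toy$height - toy$mother

# Sum of squared errors for each line
sse_fit      <- sum(resid(fit)^2)
sse_identity <- sum(res_identity^2)
c(sse_fit = sse_fit, sse_identity = sse_identity)
```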

1. (f)

setwd("/Users/nathan_hall/Downloads")
pearson <- read.csv("pearson.csv")

1. (g)

plot4 <- ggplot ( pearson, aes ( x=Mheight, y=Dheight ) )
mod.p <- lm ( Dheight ~ Mheight, data = pearson )
coef(mod.p)

## (Intercept)     Mheight 
##   29.917437    0.541747

plot4 + geom_jitter() +
geom_abline ( intercept=0, slope=1 ) +
geom_abline ( intercept=coef(mod.p)[1], slope=coef(mod.p)[2] )

1. (h)

summary(mod.g)

## 
## Call:
## lm(formula = height ~ mother, data = galton.s)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.8814 -1.5446  0.0983  1.4452  6.7717 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 43.15546    3.05709  14.117  < 2e-16 ***
## mother       0.32655    0.04761   6.859 2.42e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.253 on 431 degrees of freedom
## Multiple R-squared:  0.09841,    Adjusted R-squared:  0.09631 
## F-statistic: 47.04 on 1 and 431 DF,  p-value: 2.421e-11

summary(mod.p)

## 
## Call:
## lm(formula = Dheight ~ Mheight, data = pearson)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.397 -1.529  0.036  1.492  9.053 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 29.91744    1.62247   18.44   <2e-16 ***
## Mheight      0.54175    0.02596   20.87   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.266 on 1373 degrees of freedom
## Multiple R-squared:  0.2408, Adjusted R-squared:  0.2402 
## F-statistic: 435.5 on 1 and 1373 DF,  p-value: < 2.2e-16

Both models agree in direction: each estimates a positive but fairly weak linear relationship between mother’s height and daughter’s height. The model fit to Pearson’s data explains more of the variation (Adj. R-Squared = 0.2402) than the model fit to Galton’s data (Adj. R-Squared = 0.09631), and its slope is steeper (0.54 vs. 0.33).
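The two summaries can also be compared numerically. In simple linear regression the correlation is the square root of the multiple R-squared (with the sign of the slope), and a rough z-statistic for the difference in slopes follows from the reported standard errors. The sketch below uses only the numbers printed above; treating the two samples as independent is an assumption.

```r
# Values copied from the two summary() outputs
r2    <- c(galton = 0.09841, pearson = 0.2408)    # multiple R-squared
slope <- c(galton = 0.32655, pearson = 0.54175)
se    <- c(galton = 0.04761, pearson = 0.02596)

# Both slopes are positive, so the correlation is the positive root
r <- sqrt(r2)

# Approximate z-statistic for the difference in slopes
# (assumes the Galton and Pearson samples are independent)
z <- unname((slope["pearson"] - slope["galton"]) / sqrt(sum(se^2)))

r
z
```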

1. (i)

plot_5 <- ggplot ( pearson, aes ( x=Mheight, y=Dheight ) )
plot_5 + geom_point() +
geom_abline ( intercept = coef(mod.p)[1], slope = coef(mod.p)[2] ) +
# overlay daughters only (galton.s), since mod.g was fit to daughters
geom_point ( data = galton.s,
aes ( x=mother, y=height ),
shape=2, color="red" ) +
geom_abline ( intercept = coef(mod.g)[1], slope = coef(mod.g)[2],
linetype=2, color="red" )

Here, it is more apparent that the two datasets differ slightly, but it is fair to say that they roughly agree with each other. Galton’s regression line (dashed, red) is less steep than Pearson’s (0.33 vs. 0.54), and the two lines cross at a mother’s height of about 61.5 inches, so Galton’s model predicts slightly taller daughters for shorter mothers and slightly shorter daughters for taller mothers. Only Galton’s daughters (galton.s) are overlaid, because including the sons would overstate the heights.
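The mother's height at which the two fitted lines give the same prediction follows from setting the two equations equal. A minimal sketch using the coefficients printed in 1(e) and 1(g):

```r
# Coefficients copied from coef(mod.g) and coef(mod.p)
a_g <- 43.1554585; b_g <- 0.3265523   # Galton: intercept, slope
a_p <- 29.917437;  b_p <- 0.541747    # Pearson: intercept, slope

# Solve a_g + b_g * x = a_p + b_p * x for x
x_cross <- (a_g - a_p) / (b_p - b_g)

# Predicted daughter height at the crossing point
y_cross <- a_g + b_g * x_cross

c(x_cross = x_cross, y_cross = y_cross)
```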

2

2. (a)

mauna <- read.csv("mauna.csv")
plotm1 <- ggplot ( mauna, aes ( x=month, y=CO2 ) )
plotm1 + geom_jitter()

The general trend is that as the month increases (i.e. as time goes on), CO2 levels rise at a relatively constant rate.

2. (b)

mod.m <- lm ( CO2 ~ month, data = mauna )
plotm1 + geom_jitter() +
geom_abline ( intercept=coef(mod.m)[1], slope=coef(mod.m)[2], color='red' )

summary(mod.m)

## 
## Call:
## lm(formula = CO2 ~ month, data = mauna)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.0399 -1.9476 -0.0017  1.9113  6.5149 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3.115e+02  2.424e-01  1284.9   <2e-16 ***
## month       1.090e-01  8.958e-04   121.6   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.618 on 466 degrees of freedom
## Multiple R-squared:  0.9695, Adjusted R-squared:  0.9694 
## F-statistic: 1.479e+04 on 1 and 466 DF,  p-value: < 2.2e-16

The adjusted R-squared value is 0.9694. The residual standard error is 2.618 on 466 degrees of freedom. This residual standard error matches the scatter seen above and below the line in the plot: the residuals range from -6.04 to 6.51, so all points fall within roughly 2.5 residual standard errors of the line. This appears to be an excellent linear model, given the very high R-squared value, the very low p-value, and the close visual agreement between the regression line and the data.
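The residual standard error reported by `summary()` is the square root of the residual sum of squares divided by the residual degrees of freedom. The sketch below checks this on simulated data (`toy` is a hypothetical stand-in with assumed values, not the Mauna Loa series).

```r
# Simulated linear trend (assumed values, not the mauna data)
set.seed(3)
toy <- data.frame(month = 1:120)
toy$CO2 <- 311 + 0.11 * toy$month + rnorm(120, sd = 2.5)

fit <- lm(CO2 ~ month, data = toy)

# Residual standard error by hand: sqrt(RSS / residual df)
rss          <- sum(resid(fit)^2)
rse_manual   <- sqrt(rss / df.residual(fit))
rse_reported <- summary(fit)$sigma

# Share of points within two residual standard errors of the line
within2 <- mean(abs(resid(fit)) <= 2 * rse_manual)

c(manual = rse_manual, reported = rse_reported, within2 = within2)
```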

2. (c)

CO2.res = resid(mod.m)
plotm2 <- ggplot ( mauna, aes ( x=month, y=CO2.res ) )
plotm2 + geom_jitter()

The residual vs. month plot shows a clear pattern: the residuals are mostly positive in the early months, mostly negative in the middle months, and mostly positive again in the late months. The shape of this plot looks almost quadratic.

# add the squared term as a column of mauna rather than using attach()
mauna$month2 <- mauna$month^2
mod.m2 <- lm ( CO2 ~ month + month2, data = mauna )
summary(mod.m2)

## 
## Call:
## lm(formula = CO2 ~ month + month2, data = mauna)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.0195 -1.7120  0.2144  1.7957  4.8345 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3.148e+02  3.039e-01 1035.65   <2e-16 ***
## month       6.739e-02  2.993e-03   22.52   <2e-16 ***
## month2      8.862e-05  6.179e-06   14.34   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.182 on 465 degrees of freedom
## Multiple R-squared:  0.9788, Adjusted R-squared:  0.9787 
## F-statistic: 1.075e+04 on 2 and 465 DF,  p-value: < 2.2e-16

The adjusted R-squared value of this model is 0.9787 (up from 0.9694), and the residual standard error is 2.182 on 465 degrees of freedom (down from 2.618). The month^2 term is highly significant (t = 14.34), so adding it has improved the model.

CO2.res = resid(mod.m2)
plotm2 <- ggplot ( mauna, aes ( x=month, y=CO2.res ) )
plotm2 + geom_jitter()

Here, as depicted by the residual vs. month plot, the residuals are much more evenly distributed above and below 0.0, thus providing further evidence that the second model is better than the first model.
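The comparison of the linear and quadratic models can also be made formally with a nested-model F-test. The sketch below uses simulated data (`toy`, hypothetical values) and writes the squared term as `I(month^2)` inside the formula, an alternative to creating a separate `month2` variable.

```r
# Simulated series with mild curvature (assumed values, not the mauna data)
set.seed(4)
toy <- data.frame(month = 1:200)
toy$CO2 <- 315 + 0.07 * toy$month + 9e-5 * toy$month^2 + rnorm(200, sd = 2)

fit1 <- lm(CO2 ~ month, data = toy)               # linear only
fit2 <- lm(CO2 ~ month + I(month^2), data = toy)  # linear + quadratic

# Nested-model F-test: does the quadratic term reduce the RSS enough?
cmp <- anova(fit1, fit2)
cmp
```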

2. (d)

mod.m3 <- lm ( CO2 ~ monthname, data = mauna )
CO2.res2 = resid(mod.m3)
plotm3 <- ggplot ( mauna, aes ( x=monthname, y=CO2.res2 ) )
plotm3 + geom_jitter()

In this residual vs. monthname plot, the residuals are evenly distributed above and below zero, but they are very large in magnitude, which indicates a very weak model.

2. (e)

mauna$monthname <- factor ( mauna$monthname,
levels = c ( "Jan", "Feb", "Mar", "Apr", "May", "Jun",
"Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
) )
mod.m4 <- lm ( CO2 ~ monthname, data = mauna )
plotm4 <- ggplot ( mauna, aes ( x=monthname, y=CO2 ) )
plotm4 + geom_jitter()

There is no clear pattern to be seen in the plot. The “monthname” variable takes only 12 distinct values (one per calendar month), whereas the “month” variable takes a different value for every observation, so many data points from different years are stacked above each month name and the plot appears very scattered. Therefore there is no linear pattern to be seen.

summary(mod.m4)

## 
## Call:
## lm(formula = CO2 ~ monthname, data = mauna)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -21.752 -14.014  -1.674  14.126  27.645 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  336.4308     2.4069 139.777   <2e-16 ***
## monthnameFeb   0.7726     3.4039   0.227    0.821    
## monthnameMar   1.6238     3.4039   0.477    0.634    
## monthnameApr   2.8636     3.4039   0.841    0.401    
## monthnameMay   3.4513     3.4039   1.014    0.311    
## monthnameJun   2.8974     3.4039   0.851    0.395    
## monthnameJul   1.4856     3.4039   0.436    0.663    
## monthnameAug  -0.4728     3.4039  -0.139    0.890    
## monthnameSep  -2.1879     3.4039  -0.643    0.521    
## monthnameOct  -2.2615     3.4039  -0.664    0.507    
## monthnameNov  -0.9628     3.4039  -0.283    0.777    
## monthnameDec   0.2638     3.4039   0.078    0.938    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.03 on 456 degrees of freedom
## Multiple R-squared:  0.01507,    Adjusted R-squared:  -0.008691 
## F-statistic: 0.6342 on 11 and 456 DF,  p-value: 0.7998

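The new adjusted R-squared value is -0.008691, and the residual standard error is 15.03 on 456 degrees of freedom. This is clearly not a good model: the adjusted R-squared is essentially zero, the residual standard error is far higher than in the earlier models, and the overall F-test (p-value = 0.7998) gives no evidence that mean CO2 differs between calendar months when the long-term trend is ignored.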

CO2.res2 = resid(mod.m4)
plotm5 <- ggplot ( mauna, aes ( x=monthname, y=CO2.res2 ) )
plotm5 + geom_jitter()

Arranging the months in chronological order changes only the layout of the x-axis, not the fit. In the re-made plot the residual values are still extremely large (although they are still evenly distributed above and below zero), indicating a very weak model.
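That re-leveling a factor leaves the fit unchanged can be checked directly: the coefficients are reported against a different baseline month, but the fitted values and R-squared are identical. A minimal sketch on simulated data (`toy` and its columns are hypothetical; `month.abb` is base R's built-in vector of month abbreviations):

```r
# Simulated ten-year monthly series (assumed values, not the mauna data)
set.seed(5)
toy <- data.frame(month = 1:120, monthname = rep(month.abb, times = 10))
toy$CO2 <- 311 + 0.11 * toy$month + rnorm(120, sd = 1)

# Default factor levels are alphabetical (Apr, Aug, Dec, ...)
fit_alpha <- lm(CO2 ~ factor(monthname), data = toy)

# Same factor with chronological levels (Jan, Feb, Mar, ...)
toy$monthname_ord <- factor(toy$monthname, levels = month.abb)
fit_chrono <- lm(CO2 ~ monthname_ord, data = toy)

# Fitted values and R-squared do not depend on the level order
same_fit <- isTRUE(all.equal(unname(fitted(fit_alpha)),
                             unname(fitted(fit_chrono))))
same_r2  <- isTRUE(all.equal(summary(fit_alpha)$r.squared,
                             summary(fit_chrono)$r.squared))
c(same_fit = same_fit, same_r2 = same_r2)
```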