Lecture 7 Homework

Import data

Galton.original = read.csv("C:/Users/Will/OneDrive/Documents/School/375T Predictive Analytics/HW3/Galton.csv")
Galton = Galton.original
library(boot)

a)

Mothers = Galton$Mother
Heights = Galton$Height
n = length(Heights)

for (i in 1:n)
{
  if (Galton$Gender[i] == "female")
  {
    Heights[i] = 1.08*Heights[i]  
  }
  
}
Mothers = Mothers * 1.08

Parents = (Mothers + Galton$Father)/2

heightModel = lm(Heights ~ Parents)
summary(heightModel)

## 
## Call:
## lm(formula = Heights ~ Parents)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5007 -1.4864  0.0957  1.5136  9.1281 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 19.82606    2.82206   7.025 4.12e-12 ***
## Parents      0.71392    0.04076  17.513  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.244 on 931 degrees of freedom
## Multiple R-squared:  0.2478, Adjusted R-squared:  0.247 
## F-statistic: 306.7 on 1 and 931 DF,  p-value: < 2.2e-16

beta0 = coef(heightModel)[1]
beta1 = coef(heightModel)[2]

We come to a slope = 0.71392 with a very low p-value. Our interpretation is that we expect children to grow to about 0.71 inches in height for every inch of their parents.

b)

t.95 = qt(.975, 2)
Parents.mean = mean(Parents)
Parents.meanSq = sum((Parents - Parents.mean)^2)

x0 = 65
yHat = beta0 + beta1*x0
stddev.y = sqrt(1+(1/n)+((x0-Parents.mean)^2)/(Parents.meanSq))
LowerBound = yHat - t.95*stddev.y
UpperBound = yHat + t.95*stddev.y

LowerBound

## (Intercept) 
##    61.91342

yHat

## (Intercept) 
##    66.23091

UpperBound

## (Intercept) 
##    70.54841

x0 = 76
yHat = beta0 + beta1*x0
stddev.y = sqrt(1+(1/n)+((x0-Parents.mean)^2)/(Parents.meanSq))
LowerBound = yHat - t.95*stddev.y
UpperBound = yHat + t.95*stddev.y

LowerBound

## (Intercept) 
##    69.74644

yHat

## (Intercept) 
##    74.08404

UpperBound

## (Intercept) 
##    78.42165

For the boy whose parents are an average height of 65 inches, the boy’s real height is expected to lie within the interval (64.91, 70.55) inches 95% of the time

For the boy whose parents are an average height of 76 inches, we expect the boy’s real height to lie within the interval (69.75, 78.42) 95% of the time.

c)

plot(heightModel)

The residuals do not contradict our a linear model. We see virtually equal spread above and below zero throughout the chart. The QQ plot reinforces the normality of our distribution. There is nothing to suggest that our linear model is a poor fit.

d)

r = n%%2
N = n-r

randomN = sample(n,r)
NewHeights = Heights[-randomN]
NewParents = Parents[-randomN]

NewGalton = do.call(rbind, Map(data.frame, Heights = NewHeights, Parents =NewParents))

## Here we are dropping one row in order to make the data equally divisible by two. This will not severely
## impact our analysis because our sample size is so large

NewHeights = NewGalton$Heights
NewParents = NewGalton$Parents

train = sample(N, N/2)

lm.fit = lm(NewHeights~NewParents, subset = train)
MSE = numeric(3)

for (i in 1:3)
{
  lm.fit = lm(NewHeights~poly(NewParents, i), subset = train) 
  MSE[i] = mean((NewHeights-predict(lm.fit))[-train]^2)
}

MSE

## [1] 8.625476 8.582509 8.563070

The three test errors are presented in order for polynomials 1, 2, 3. We should expect there to be a noticable drop in MSE from a degree 1 polynomial to a degree 2, and little change after that. This is not the result that we obtained. Rather, there appears to be littel difference between any of the MSE’s This could be due to operator error or because our sample size is so large.

e)

MSE = numeric(3)
for (i in 1:3)
NewGalton = cbind(Heights, Parents)
NewGalton = data.frame(NewGalton)

for (i in 1:3)
{
  glm.fit = glm(Heights~poly(Parents, i), data = NewGalton)
  MSE[i] = cv.glm(NewGalton, glm.fit)$delta[1]
}


MSE

## [1] 5.044283 5.039813 5.044239

LOOCV approach gives us smaller MSE’s than the Validation set approach as expected. However, like our validation approach problem, our MSE did not significantly differ from degree 1 to 2.

f)

k = 5

r = n%%5
N = n-r

randomN = sample(n,r)
NewHeights = Heights[-randomN]
NewParents = Parents[-randomN]

NewGalton = do.call(rbind, Map(data.frame, Heights = NewHeights, Parents =NewParents))

## Here we are dropping one row in order to make the data equally divisible by five. This will not severely
## impact our analysis because our sample size is so large

NewHeights = NewGalton$Heights
NewParents = NewGalton$Parents

train = sample(N, N/k)

lm.fit = lm(NewHeights~NewParents, subset = train)
MSE = numeric(3)

for (i in 1:3)
{
  lm.fit = lm(NewHeights~poly(NewParents, i), subset = train) 
  MSE[i] = mean((NewHeights-predict(lm.fit, NewGalton, K = k))[-train]^2)
}
MSE

## [1] 5.027581 5.031942 5.047440

The k-fold test error is greater LOOCV, but smaller than the validation set approach, as we expect it would be. It suffers from the same problem of polynomials erros being similar or larger than the linear model. The consistency leads me to believe that this an operator error that I cannot identify.

Lecture 7 Homework

William Chance - wtc464

February 12, 2019

Import data

a)

b)

c)

d)

e)

f)