Galton.original = read.csv("C:/Users/Will/OneDrive/Documents/School/375T Predictive Analytics/HW3/Galton.csv")
Galton = Galton.original
library(boot)
Mothers = Galton$Mother
Heights = Galton$Height
n = length(Heights)
for (i in 1:n)
{
if (Galton$Gender[i] == "female")
{
Heights[i] = 1.08*Heights[i]
}
}
Mothers = Mothers * 1.08
Parents = (Mothers + Galton$Father)/2
heightModel = lm(Heights ~ Parents)
summary(heightModel)
##
## Call:
## lm(formula = Heights ~ Parents)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5007 -1.4864 0.0957 1.5136 9.1281
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.82606 2.82206 7.025 4.12e-12 ***
## Parents 0.71392 0.04076 17.513 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.244 on 931 degrees of freedom
## Multiple R-squared: 0.2478, Adjusted R-squared: 0.247
## F-statistic: 306.7 on 1 and 931 DF, p-value: < 2.2e-16
beta0 = coef(heightModel)[1]
beta1 = coef(heightModel)[2]
We come to a slope = 0.71392 with a very low p-value. Our interpretation is that we expect children to grow to about 0.71 inches in height for every inch of their parents.
t.95 = qt(.975, 2)
Parents.mean = mean(Parents)
Parents.meanSq = sum((Parents - Parents.mean)^2)
x0 = 65
yHat = beta0 + beta1*x0
stddev.y = sqrt(1+(1/n)+((x0-Parents.mean)^2)/(Parents.meanSq))
LowerBound = yHat - t.95*stddev.y
UpperBound = yHat + t.95*stddev.y
LowerBound
## (Intercept)
## 61.91342
yHat
## (Intercept)
## 66.23091
UpperBound
## (Intercept)
## 70.54841
x0 = 76
yHat = beta0 + beta1*x0
stddev.y = sqrt(1+(1/n)+((x0-Parents.mean)^2)/(Parents.meanSq))
LowerBound = yHat - t.95*stddev.y
UpperBound = yHat + t.95*stddev.y
LowerBound
## (Intercept)
## 69.74644
yHat
## (Intercept)
## 74.08404
UpperBound
## (Intercept)
## 78.42165
For the boy whose parents are an average height of 65 inches, the boy’s real height is expected to lie within the interval (64.91, 70.55) inches 95% of the time
For the boy whose parents are an average height of 76 inches, we expect the boy’s real height to lie within the interval (69.75, 78.42) 95% of the time.
plot(heightModel)
The residuals do not contradict our a linear model. We see virtually equal spread above and below zero throughout the chart. The QQ plot reinforces the normality of our distribution. There is nothing to suggest that our linear model is a poor fit.
r = n%%2
N = n-r
randomN = sample(n,r)
NewHeights = Heights[-randomN]
NewParents = Parents[-randomN]
NewGalton = do.call(rbind, Map(data.frame, Heights = NewHeights, Parents =NewParents))
## Here we are dropping one row in order to make the data equally divisible by two. This will not severely
## impact our analysis because our sample size is so large
NewHeights = NewGalton$Heights
NewParents = NewGalton$Parents
train = sample(N, N/2)
lm.fit = lm(NewHeights~NewParents, subset = train)
MSE = numeric(3)
for (i in 1:3)
{
lm.fit = lm(NewHeights~poly(NewParents, i), subset = train)
MSE[i] = mean((NewHeights-predict(lm.fit))[-train]^2)
}
MSE
## [1] 8.625476 8.582509 8.563070
The three test errors are presented in order for polynomials 1, 2, 3. We should expect there to be a noticable drop in MSE from a degree 1 polynomial to a degree 2, and little change after that. This is not the result that we obtained. Rather, there appears to be littel difference between any of the MSE’s This could be due to operator error or because our sample size is so large.
MSE = numeric(3)
for (i in 1:3)
NewGalton = cbind(Heights, Parents)
NewGalton = data.frame(NewGalton)
for (i in 1:3)
{
glm.fit = glm(Heights~poly(Parents, i), data = NewGalton)
MSE[i] = cv.glm(NewGalton, glm.fit)$delta[1]
}
MSE
## [1] 5.044283 5.039813 5.044239
LOOCV approach gives us smaller MSE’s than the Validation set approach as expected. However, like our validation approach problem, our MSE did not significantly differ from degree 1 to 2.
k = 5
r = n%%5
N = n-r
randomN = sample(n,r)
NewHeights = Heights[-randomN]
NewParents = Parents[-randomN]
NewGalton = do.call(rbind, Map(data.frame, Heights = NewHeights, Parents =NewParents))
## Here we are dropping one row in order to make the data equally divisible by five. This will not severely
## impact our analysis because our sample size is so large
NewHeights = NewGalton$Heights
NewParents = NewGalton$Parents
train = sample(N, N/k)
lm.fit = lm(NewHeights~NewParents, subset = train)
MSE = numeric(3)
for (i in 1:3)
{
lm.fit = lm(NewHeights~poly(NewParents, i), subset = train)
MSE[i] = mean((NewHeights-predict(lm.fit, NewGalton, K = k))[-train]^2)
}
MSE
## [1] 5.027581 5.031942 5.047440
The k-fold test error is greater LOOCV, but smaller than the validation set approach, as we expect it would be. It suffers from the same problem of polynomials erros being similar or larger than the linear model. The consistency leads me to believe that this an operator error that I cannot identify.