Tuning With Cross Validation

First, we import the dataset

library(readr)
Galton <- read_csv("~/Desktop/Galton.csv")

## Parsed with column specification:
## cols(
##   Gender = col_character(),
##   Family = col_double(),
##   Height = col_double(),
##   Father = col_double(),
##   Mother = col_double()
## )

We access the heights of the mothers in the dataset and multiply it by 1.08

Galton$Mother<- (Galton$Mother) * 1.08

Here, we multiply the heights of all females by 1.08

for (i in 1:nrow(Galton))
{
  if (Galton$Gender[i]=="female")
  {
    Galton$Height[i]<-1.08*Galton$Height[i]
  }
}

Here, a for loop is used to average the heights of father and mother in each row

parentHeight<-vector()

for(i in 1:nrow(Galton))
{
parentHeight[i]<-(Galton$Father[i] + Galton$Mother[i])/2
}

For every 10 inches the parent grows, a child is expected to grow, on average, 7.13 inches.

reg<-lm(Galton$Height~parentHeight)
summary(reg)

## 
## Call:
## lm(formula = Galton$Height ~ parentHeight)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5007 -1.4864  0.0957  1.5136  9.1281 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  19.82606    2.82206   7.025 4.12e-12 ***
## parentHeight  0.71392    0.04076  17.513  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.244 on 931 degrees of freedom
## Multiple R-squared:  0.2478, Adjusted R-squared:  0.247 
## F-statistic: 306.7 on 1 and 931 DF,  p-value: < 2.2e-16

Predicted heights are found for the adult height of a boy whose average height is 65 inches and 76 inches respectively. Upper and lower bounds of a 95% confidence interval is also found using the predict function and are displayed below.

newdata=data.frame(parentHeight=65)
predict(reg, newdata, interval="predict")

##        fit      lwr      upr
## 1 66.23091 61.81268 70.64915

newdata2=data.frame(parentHeight=76)
predict(reg, newdata2,interval="predict")

##        fit      lwr      upr
## 1 74.08404 69.64523 78.52285

As we can see from the scatterplot, we can see that it passes the linearity test due to its homoskedacity. Since the residual plot for the children's heights shows a linear relationship, this does not pass the independence test. The equal spread test is also passed since the plot is homoskedastic. Looking at the qqplot, we can see that this fails the normality test for approximately the upper 5% and the lower 5% of the data. There is a seemingly straight line for the qqplot in the middle 90% of the data which indicates normality however.

plot(resid(reg))

plot(resid(reg)~Galton$Height)

qqplot(Galton$Height, parentHeight)

Here, the validation set approach is used on the Galton dataset. The test error that results from predicting adult child's height with polynomials 1 to 3 is shown below. As one can see, the polynomial with degree 2 has the lowest mean square error.

set.seed(7)
train=sample(Galton$Height,466)
lm.fit=lm(Galton$Height~poly(parentHeight,3),subset=train)
summary(lm.fit)

## 
## Call:
## lm(formula = Galton$Height ~ poly(parentHeight, 3), subset = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.7944 -0.4275 -0.2275  0.0000  4.7056 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             -165.96      24.03  -6.906 1.66e-11 ***
## poly(parentHeight, 3)1  9595.93     991.11   9.682  < 2e-16 ***
## poly(parentHeight, 3)2 -6662.33     698.90  -9.533  < 2e-16 ***
## poly(parentHeight, 3)3  2902.37     308.78   9.399  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.025 on 462 degrees of freedom
## Multiple R-squared:  0.8372, Adjusted R-squared:  0.8362 
## F-statistic: 792.1 on 3 and 462 DF,  p-value: < 2.2e-16

mean((Galton$Height-predict(lm.fit,Galton))[-train]^2)

## [1] 213744.3

lm.fit1=lm(Galton$Height~poly(parentHeight,1),subset=train)
lm.fit2=lm(Galton$Height~poly(parentHeight,2),subset=train)
mean((Galton$Height-predict(lm.fit1,Galton))[-train]^2)

## [1] 18.65345

mean((Galton$Height-predict(lm.fit2,Galton))[-train]^2)

## [1] 175.499

set.seed(7)
library(boot)
library(ISLR)
Galton$parentHeight<-parentHeight
glm.fit=glm(Height~parentHeight,data=Galton)
cv.error=cv.glm(Galton,glm.fit)
cv.error$delta[1]

## [1] 5.044283

glm.fit2=glm(Height~poly(parentHeight,2),data=Galton)
cv.error2=cv.glm(Galton,glm.fit2)
cv.error2$delta[1]

## [1] 5.039813

glm.fit3=glm(Height~poly(parentHeight,3),data=Galton)
cv.error3=cv.glm(Galton,glm.fit3)
cv.error3$delta[1]

## [1] 5.044239

The validation set approach is shown below. The polynomial with degree 2 has the lowest mean squared error.

cv.errork <- cv.glm(Galton, glm.fit, K=5)
cv.error$delta[1]

## [1] 5.044283

cv.error2k<-cv.glm(Galton, glm.fit2, K=5)
cv.error2k$delta[1]

## [1] 5.065259

cv.error3k<-cv.glm(Galton, glm.fit3, K=5)
cv.error3k$delta[1]

## [1] 5.047099

The polynomial with degree 1 has the lowest mean square error as we can see below. These values are very close to the errors for the validation set approach and the difference varies for each polynomial.