Homework 2 Due: September 20, 2019

1. Chapter 3, Question 3

Suppose we have a data set with five predictors, X1 = GPA, X2 = IQ, X3 = Gender (1 for Female and 0 for Male), X4 = Interaction between GPA and IQ, and X5 = Interaction between GPA and Gender. The response is starting salary after graduation (in thousands of dollars). Suppose we use least squares to fit the model, and get \(\hat\beta_0 = 50\), \(\hat\beta_1 = 20\), \(\hat\beta_2 = 0.07\), \(\hat\beta_3 = 35\), \(\hat\beta_4 = 0.01\), \(\hat\beta_5 = -10\).
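Written out, the fitted model is

\[\hat{y} = 50 + 20\,\text{GPA} + 0.07\,\text{IQ} + 35\,\text{Gender} + 0.01\,(\text{GPA} \times \text{IQ}) - 10\,(\text{GPA} \times \text{Gender}).\]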

(a)

Which answer is correct, and why?

  1. For a fixed value of IQ and GPA, males earn more on average than females.

  2. For a fixed value of IQ and GPA, females earn more on average than males.

  3. For a fixed value of IQ and GPA, males earn more on average than females provided that the GPA is high enough.

  4. For a fixed value of IQ and GPA, females earn more on average than males provided that the GPA is high enough.

I believe that iii. is correct. Holding IQ and GPA fixed, the female minus male difference in average starting salary is \(35 - 10 \times \text{GPA}\) (in thousands): females earn more on average when GPA is below 3.5, but once GPA exceeds 3.5 the GPA/Gender interaction term outweighs the +35, and males have the higher average starting salary.

(b)

Predict the salary of a female with IQ of 110 and a GPA of 4.0.

library(tidyverse)
library(reshape2)
response <- 50 + 20*4.0 + 0.07*110 + 35*1 + 0.01*110*4.0 + -10*4.0*1; response
## [1] 137.1

(c)

True or false: Since the coefficient for the GPA/IQ interaction term is very small, there is very little evidence of an interaction effect. Justify your answer.

It depends! The size of a coefficient by itself says nothing about the strength of evidence; that depends on its standard error (and therefore on the sample size), so I would want to see a p-value for \(\hat\beta_4\). Note also that the GPA × IQ product takes large values, so even a small coefficient can translate into a meaningful effect. That said, judging only from the magnitudes of the other coefficients, I would guess there is little evidence of an interaction effect, so tentatively true.

2. Chapter 3, Question 8

This question involves the use of simple linear regression on the Auto data set.

(a)

Use the lm() function to perform a simple linear regression with mpg as the response and horsepower as the predictor. Use the summary() function to print the results. Comment on the output. For example:

fit_lm <- lm(mpg~hp, data = mtcars)
summary(fit_lm)
## 
## Call:
## lm(formula = mpg ~ hp, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.7121 -2.1122 -0.8854  1.5819  8.2360 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 30.09886    1.63392  18.421  < 2e-16 ***
## hp          -0.06823    0.01012  -6.742 1.79e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.863 on 30 degrees of freedom
## Multiple R-squared:  0.6024, Adjusted R-squared:  0.5892 
## F-statistic: 45.46 on 1 and 30 DF,  p-value: 1.788e-07
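For reference, the question refers to the Auto data from the ISLR package rather than mtcars; a minimal sketch of the same fit on that data set, assuming ISLR is installed (output not shown):

library(ISLR)
# Simple linear regression of mpg on horsepower in the Auto data
fit_auto <- lm(mpg ~ horsepower, data = Auto)
summary(fit_auto)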

i.

Is there a relationship between the predictor and the response? Yes: the coefficient on hp is highly significant (p ≈ 1.8e-07) and negative, so there is a relationship between horsepower and mpg.

ii.

How strong is the relationship between the predictor and the response? Fairly strong: the p-value is tiny and the model explains about 60% of the variance in mpg (R² ≈ 0.60).

iii.

Is the relationship between the predictor and the response positive or negative? Negative: the estimated slope is −0.068, so predicted mpg decreases as horsepower increases.

iv.

What is the predicted mpg associated with a horsepower of 98? What are the associated 95 % confidence and prediction intervals?

str(fit_lm$coefficients)
##  Named num [1:2] 30.0989 -0.0682
##  - attr(*, "names")= chr [1:2] "(Intercept)" "hp"
# The slope is the second coefficient; using coefficients[1] twice here was a bug.
MPG_out <- fit_lm$coefficients[1] + fit_lm$coefficients[2] * 98; MPG_out
## (Intercept) 
##    23.41249
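The question also asks for the 95% confidence and prediction intervals; a minimal sketch using predict() (output not shown):

# 95% confidence interval for the mean mpg at hp = 98
predict(fit_lm, newdata = data.frame(hp = 98), interval = "confidence", level = 0.95)
# 95% prediction interval for an individual car with hp = 98
predict(fit_lm, newdata = data.frame(hp = 98), interval = "prediction", level = 0.95)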

(b)

Plot the response and the predictor. Use the abline() function to display the least squares regression line.

plot(mtcars$hp, mtcars$mpg, pch=16, col=rainbow(12))
abline(fit_lm, lwd=3)

(c)

Use the plot() function to produce diagnostic plots of the least squares regression fit. Comment on any problems you see with the fit.

plot(fit_lm)

A straight line is not the best fit: the residuals-vs-fitted plot shows curvature, so a quadratic model might fit better. The Maserati Bora also has high leverage and appears to be an outlier.

3. Chapter 3, Question

Undergrad

4. Chapter 3, Question 13

In this exercise you will create some simulated data and will fit simple linear regression models to it. Make sure to use set.seed(1) prior to starting part (a) to ensure consistent results.

(a)

Using the rnorm() function, create a vector, x, containing 100 observations drawn from a N(0,1) distribution. This represents a feature, X.

set.seed(1)
x <- rnorm(100,0,1)

(b)

Using the rnorm() function, create a vector, eps, containing 100 observations drawn from a N(0,0.25) distribution i.e. a normal distribution with mean zero and variance 0.25.

eps <- rnorm(100, mean=0, sd=sqrt(.25))

(c)

Using x and eps, generate a vector y according to the model

\[Y = -1 + 0.5X + \epsilon \qquad (3.39)\]

What is the length of the vector y? What are the values of \(\beta_0\) and \(\beta_1\) in this linear model?

Y <- -1 + .5*x + eps
length(Y)
## [1] 100

\(\beta_0 = -1\) and \(\beta_1 = 0.5\)

(d)

Create a scatterplot displaying the relationship between x and y. Comment on what you observe.

dat <- data.frame(x,Y)
dat$Type <-rep("Sim",100)
p <- ggplot(data = dat, aes(x=x, y=Y)) + geom_point()
p

There is a positive relationship between x and Y. At first glance it appears to be linear.

(e)

Fit a least squares linear model to predict y using x. Comment on the model obtained. How do \(\hat\beta_0\) and \(\hat\beta_1\) compare to \(\beta_0\) and \(\beta_1\)?

fit_sim <- lm(Y~x)
sum <- summary(fit_sim)
# Approximate 95% CIs for the intercept and slope, saved for part (j);
# qnorm(.975) is a normal approximation to the t quantile (negligible difference at n = 100)
CI_original_b0 <- fit_sim$coefficients[1] + c(-1,1)*qnorm(.975)*sum$coefficients[3]
CI_original_b1 <- fit_sim$coefficients[2] + c(-1,1)*qnorm(.975)*sum$coefficients[4]

The estimates \(\hat\beta_0\) and \(\hat\beta_1\) are quite close to the true values of −1 and 0.5.
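For reference, confint() gives the exact t-based intervals; the normal approximation above should be very close (output not shown):

confint(fit_sim, level = 0.95)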

(f)

Display the least squares line on the scatterplot obtained in (d). Draw the population regression line on the plot, in a different color. Use the legend() command to create an appropriate legend.

# Population regression line evaluated on a grid, stacked with the simulated data.
# cbind() with a character column coerces everything to character, so unfactor()
# from the varhandle package converts x and Y back to numeric.
x_pop <- seq(from = -2, to = 2, length = 100)
Y_pop <- -1 + .5*x_pop
data <- cbind(x_pop, Y_pop, rep("Population", 100)) %>% as.data.frame()
names(data) <- c("x", "Y", "Type")

library(varhandle)
data$x <- unfactor(data$x)
data$Y <- unfactor(data$Y)
data2 <- rbind(dat, data) %>% as.data.frame()
data2$Type <- factor(data2$Type)

ggplot(data2, aes(x=x,y=Y, col=Type, group=Type)) + geom_point(alpha=.5)+ geom_smooth(method = "lm", alpha=.1, se=F)
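For completeness, a base-R sketch that uses abline() and legend() as the question suggests (figure not shown):

plot(x, Y, pch = 16)
abline(fit_sim, col = "blue", lwd = 2)           # least squares line
abline(a = -1, b = 0.5, col = "red", lwd = 2)    # population line: Y = -1 + 0.5X
legend("topleft", legend = c("Least squares", "Population"),
       col = c("blue", "red"), lwd = 2)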

(g)

Now fit a polynomial regression model that predicts y using x and \(x^2\). Is there evidence that the quadratic term improves the model fit? Explain your answer.

library(car)
fit_sim2 <- lm(Y~x+I(x^2))
summary(fit_sim2)
## 
## Call:
## lm(formula = Y ~ x + I(x^2))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.98252 -0.31270 -0.06441  0.29014  1.13500 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.97164    0.05883 -16.517  < 2e-16 ***
## x            0.50858    0.05399   9.420  2.4e-15 ***
## I(x^2)      -0.05946    0.04238  -1.403    0.164    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.479 on 97 degrees of freedom
## Multiple R-squared:  0.4779, Adjusted R-squared:  0.4672 
## F-statistic:  44.4 on 2 and 97 DF,  p-value: 2.038e-14
Anova(fit_sim2)
## Anova Table (Type II tests)
## 
## Response: Y
##            Sum Sq Df F value    Pr(>F)    
## x         20.3597  1 88.7301 2.403e-15 ***
## I(x^2)     0.4516  1  1.9682    0.1638    
## Residuals 22.2573 97                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(fit_sim,fit_sim2)
## Analysis of Variance Table
## 
## Model 1: Y ~ x
## Model 2: Y ~ x + I(x^2)
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     98 22.709                           
## 2     97 22.257  1   0.45163 1.9682 0.1638

Looking at the ANOVA table, adding the \(x^2\) term does not significantly improve the fit (p-value ≈ 0.16 > 0.05). This matches the model summary, where the I(x^2) coefficient is not significant, and is what we would expect since the data were generated from a purely linear model.

(h)

Repeat (a)–(f) after modifying the data generation process in such a way that there is less noise in the data. The model (3.39) should remain the same. You can do this by decreasing the variance of the normal distribution used to generate the error term ε in (b). Describe your results.

set.seed(1)
x <- rnorm(100,0,1)

eps <- rnorm(100, mean=0, sd=sqrt(.01))
Y <- -1 + .5*x + eps

dat <- data.frame(x,Y)
dat$Type <-rep("Sim",100)
p <- ggplot(data = dat, aes(x=x, y=Y)) + geom_point()
p

Fit a least squares linear model to predict y using x. Comment on the model obtained. How do \(\hat\beta_0\) and \(\hat\beta_1\) compare to \(\beta_0\) and \(\beta_1\)?

fit_sim <- lm(Y~x)
fit_sim$coefficients
## (Intercept)           x 
##   -1.003769    0.499894
sum <- summary(fit_sim)
#CI for later
CI_lessNoise_b0 <- fit_sim$coefficients[1] + c(-1,1)*qnorm(.975)*sum$coefficients[3]
CI_lessNoise_b1 <- fit_sim$coefficients[2] + c(-1,1)*qnorm(.975)*sum$coefficients[4]

Display the least squares line on the scatterplot obtained in (d). Draw the population regression line on the plot, in a different color. Use the legend() command to create an appropriate legend.

Y_pop <- -1 + .5*x
data <- cbind(x,Y_pop,rep("Population", 100)) %>% as.data.frame()
names(data) <- c("x", "Y", "Type")

library(varhandle)
data$x <- unfactor(data$x)
data$Y <- unfactor(data$Y)
data2 <- rbind(dat,data) %>% as.data.frame()


ggplot(data2, aes(x=x,y=Y, col=Type, group=Type)) + geom_point(alpha=.5)+ geom_smooth(method = "lm", alpha=.1, se=F)

The points are much tighter around the line, and the least squares line is nearly indistinguishable from the population regression line; the estimates (−1.004 and 0.500) are very close to the true values.

(i)

Repeat (a)–(f) after modifying the data generation process in such a way that there is more noise in the data. The model (3.39) should remain the same. You can do this by increasing the variance of the normal distribution used to generate the error term ε in (b). Describe your results.

set.seed(1)
x <- rnorm(100,0,1)

eps <- rnorm(100, mean=0, sd=sqrt(5))
Y <- -1 + .5*x + eps

dat <- data.frame(x,Y)
dat$Type <-rep("Sim",100)
p <- ggplot(data = dat, aes(x=x, y=Y)) + geom_point()
p

Fit a least squares linear model to predict y using x. Comment on the model obtained. How do \(\hat\beta_0\) and \(\hat\beta_1\) compare to \(\beta_0\) and \(\beta_1\)?

fit_sim <- lm(Y~x)
fit_sim$coefficients
## (Intercept)           x 
##  -1.0842832   0.4976289
sum <- summary(fit_sim)
#CI for later
CI_Noisy_b0 <- fit_sim$coefficients[1] + c(-1,1)*qnorm(.975)*sum$coefficients[3]
CI_Noisy_b1 <- fit_sim$coefficients[2] + c(-1,1)*qnorm(.975)*sum$coefficients[4]

Display the least squares line on the scatterplot obtained in (d). Draw the population regression line on the plot, in a different color. Use the legend() command to create an appropriate legend.

Y_pop <- -1 + .5*x
data <- cbind(x,Y_pop,rep("Population", 100)) %>% as.data.frame()
names(data) <- c("x", "Y", "Type")

library(varhandle)
data$x <- unfactor(data$x)
data$Y <- unfactor(data$Y)
data2 <- rbind(dat,data) %>% as.data.frame()


ggplot(data2, aes(x=x,y=Y, col=Type, group=Type)) + geom_point(alpha=.5)+ geom_smooth(method = "lm", alpha=.1, se=F)
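With the larger error variance the scatter is much wider. The coefficient estimates (−1.084 and 0.498) are still reasonably close to the true values, but the least squares line deviates more visibly from the population regression line.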

(j)

What are the confidence intervals for β0 and β1 based on the original data set, the noisier data set, and the less noisy data set? Comment on your results.

Original <- c(CI_original_b0, CI_original_b1) 
Less_Noisy <- c(CI_lessNoise_b0,CI_lessNoise_b1) 
Noisy <- c(CI_Noisy_b0,CI_Noisy_b1)
Original
## [1] -1.1138921 -0.9238005  0.3938993  0.6050404
Less_Noisy
## [1] -1.0227784 -0.9847601  0.4787799  0.5210081
Noisy
## [1] -1.50934099 -0.65922550  0.02550305  0.96975476

As expected, the amount of noise determines the width of the confidence intervals: the noisiest data set gives the widest intervals and the least noisy data set the narrowest. All of the intervals contain the true values \(\beta_0 = -1\) and \(\beta_1 = 0.5\).

5. Chapter 3, Question 15

This problem involves the Boston data set, which we saw in the lab for this chapter. We will now try to predict per capita crime rate using the other variables in this data set. In other words, per capita crime rate is the response, and the other variables are the predictors.

(a)

For each predictor, fit a simple linear regression model to predict the response. Describe your results. In which of the models is there a statistically significant association between the predictor and the response? Create some plots to back up your assertions.

library(MASS)
# For each predictor (every column except crim), fit crim ~ x and record the
# slope estimate and whether its p-value is below alpha
out <- apply(Boston[-1],2,function(x, alpha = 0.05){
 fit <- lm(crim~x, data=Boston)
 sum <- summary(fit)
 return (c(sum$coefficients[2], sum$coefficients[8] < alpha))
})

out[2,] == 1   # TRUE where the slope's p-value is below 0.05
##      zn   indus    chas     nox      rm     age     dis     rad     tax 
##    TRUE    TRUE   FALSE    TRUE    TRUE    TRUE    TRUE    TRUE    TRUE 
## ptratio   black   lstat    medv 
##    TRUE    TRUE    TRUE    TRUE

All predictors except “chas” are significant in their own simple linear models.

# One scatterplot of crim against each predictor, with its simple regression line
apply(Boston[-1],2,function(x){
 ggplot(data=Boston, aes(x=x, y=crim)) + geom_point() + geom_smooth(method = "lm")
})
## (one scatterplot of crim against each predictor: zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, black, lstat, medv)

(b)

Fit a multiple regression model to predict the response using all of the predictors. Describe your results. For which predictors can we reject the null hypothesis H0 : βj = 0?

lm <- lm(crim~., data=Boston)
sum <- summary(lm)
sum
## 
## Call:
## lm(formula = crim ~ ., data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.924 -2.120 -0.353  1.019 75.051 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  17.033228   7.234903   2.354 0.018949 *  
## zn            0.044855   0.018734   2.394 0.017025 *  
## indus        -0.063855   0.083407  -0.766 0.444294    
## chas         -0.749134   1.180147  -0.635 0.525867    
## nox         -10.313535   5.275536  -1.955 0.051152 .  
## rm            0.430131   0.612830   0.702 0.483089    
## age           0.001452   0.017925   0.081 0.935488    
## dis          -0.987176   0.281817  -3.503 0.000502 ***
## rad           0.588209   0.088049   6.680 6.46e-11 ***
## tax          -0.003780   0.005156  -0.733 0.463793    
## ptratio      -0.271081   0.186450  -1.454 0.146611    
## black        -0.007538   0.003673  -2.052 0.040702 *  
## lstat         0.126211   0.075725   1.667 0.096208 .  
## medv         -0.198887   0.060516  -3.287 0.001087 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.439 on 492 degrees of freedom
## Multiple R-squared:  0.454,  Adjusted R-squared:  0.4396 
## F-statistic: 31.47 on 13 and 492 DF,  p-value: < 2.2e-16

zn, dis, rad, black, and medv are significant in the full model.

(c)

How do your results from (a) compare to your results from (b)? Create a plot displaying the univariate regression coefficients from (a) on the x-axis, and the multiple regression coefficients from (b) on the y-axis. That is, each predictor is displayed as a single point in the plot. Its coefficient in a simple linear regression model is shown on the x-axis, and its coefficient estimate in the multiple linear regression model is shown on the y-axis.

df_betas <- cbind(out[1,], sum$coefficients[-1,1]) %>% as.data.frame()
colnames(df_betas) <- c("Simple", "Multiple")
df_betas
##              Simple      Multiple
## zn      -0.07393498   0.044855215
## indus    0.50977633  -0.063854824
## chas    -1.89277655  -0.749133611
## nox     31.24853120 -10.313534912
## rm      -2.68405122   0.430130506
## age      0.10778623   0.001451643
## dis     -1.55090168  -0.987175726
## rad      0.61791093   0.588208591
## tax      0.02974225  -0.003780016
## ptratio  1.15198279  -0.271080558
## black   -0.03627964  -0.007537505
## lstat    0.54880478   0.126211376
## medv    -0.36315992  -0.198886821
str(df_betas)
## 'data.frame':    13 obs. of  2 variables:
##  $ Simple  : num  -0.0739 0.5098 -1.8928 31.2485 -2.6841 ...
##  $ Multiple: num  0.0449 -0.0639 -0.7491 -10.3135 0.4301 ...
ggplot(df_betas, aes(x=Simple, y=Multiple, label=rownames(df_betas))) +
  geom_text()

The coefficient for nox changes dramatically, from about +31 in its simple regression to about −10 in the multiple regression; the simple and multiple regression coefficients can differ greatly in both magnitude and sign.

(d)

Is there evidence of non-linear association between any of the predictors and the response? To answer this question, for each predictor X, fit a model of the form

\[Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \epsilon.\]

# For each predictor, fit a cubic polynomial in that predictor and print the summary
apply(Boston[-1],2,function(x){
 fit <- lm(crim~x+I(x^2) + I(x^3), data=Boston)
 summary(fit)
})
## $zn
## 
## Call:
## lm(formula = crim ~ x + I(x^2) + I(x^3), data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.821 -4.614 -1.294  0.473 84.130 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.846e+00  4.330e-01  11.192  < 2e-16 ***
## x           -3.322e-01  1.098e-01  -3.025  0.00261 ** 
## I(x^2)       6.483e-03  3.861e-03   1.679  0.09375 .  
## I(x^3)      -3.776e-05  3.139e-05  -1.203  0.22954    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.372 on 502 degrees of freedom
## Multiple R-squared:  0.05824,    Adjusted R-squared:  0.05261 
## F-statistic: 10.35 on 3 and 502 DF,  p-value: 1.281e-06
## 
## 
## $indus
## 
## Call:
## lm(formula = crim ~ x + I(x^2) + I(x^3), data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.278 -2.514  0.054  0.764 79.713 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.6625683  1.5739833   2.327   0.0204 *  
## x           -1.9652129  0.4819901  -4.077 5.30e-05 ***
## I(x^2)       0.2519373  0.0393221   6.407 3.42e-10 ***
## I(x^3)      -0.0069760  0.0009567  -7.292 1.20e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.423 on 502 degrees of freedom
## Multiple R-squared:  0.2597, Adjusted R-squared:  0.2552 
## F-statistic: 58.69 on 3 and 502 DF,  p-value: < 2.2e-16
## 
## 
## $chas
## 
## Call:
## lm(formula = crim ~ x + I(x^2) + I(x^3), data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.738 -3.661 -3.435  0.018 85.232 
## 
## Coefficients: (2 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.7444     0.3961   9.453   <2e-16 ***
## x            -1.8928     1.5061  -1.257    0.209    
## I(x^2)            NA         NA      NA       NA    
## I(x^3)            NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.597 on 504 degrees of freedom
## Multiple R-squared:  0.003124,   Adjusted R-squared:  0.001146 
## F-statistic: 1.579 on 1 and 504 DF,  p-value: 0.2094
## 
## 
## $nox
## 
## Call:
## lm(formula = crim ~ x + I(x^2) + I(x^3), data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.110 -2.068 -0.255  0.739 78.302 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   233.09      33.64   6.928 1.31e-11 ***
## x           -1279.37     170.40  -7.508 2.76e-13 ***
## I(x^2)       2248.54     279.90   8.033 6.81e-15 ***
## I(x^3)      -1245.70     149.28  -8.345 6.96e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.234 on 502 degrees of freedom
## Multiple R-squared:  0.297,  Adjusted R-squared:  0.2928 
## F-statistic: 70.69 on 3 and 502 DF,  p-value: < 2.2e-16
## 
## 
## $rm
## 
## Call:
## lm(formula = crim ~ x + I(x^2) + I(x^3), data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.485  -3.468  -2.221  -0.015  87.219 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 112.6246    64.5172   1.746   0.0815 .
## x           -39.1501    31.3115  -1.250   0.2118  
## I(x^2)        4.5509     5.0099   0.908   0.3641  
## I(x^3)       -0.1745     0.2637  -0.662   0.5086  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.33 on 502 degrees of freedom
## Multiple R-squared:  0.06779,    Adjusted R-squared:  0.06222 
## F-statistic: 12.17 on 3 and 502 DF,  p-value: 1.067e-07
## 
## 
## $age
## 
## Call:
## lm(formula = crim ~ x + I(x^2) + I(x^3), data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.762 -2.673 -0.516  0.019 82.842 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -2.549e+00  2.769e+00  -0.920  0.35780   
## x            2.737e-01  1.864e-01   1.468  0.14266   
## I(x^2)      -7.230e-03  3.637e-03  -1.988  0.04738 * 
## I(x^3)       5.745e-05  2.109e-05   2.724  0.00668 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.84 on 502 degrees of freedom
## Multiple R-squared:  0.1742, Adjusted R-squared:  0.1693 
## F-statistic: 35.31 on 3 and 502 DF,  p-value: < 2.2e-16
## 
## 
## $dis
## 
## Call:
## lm(formula = crim ~ x + I(x^2) + I(x^3), data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.757  -2.588   0.031   1.267  76.378 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  30.0476     2.4459  12.285  < 2e-16 ***
## x           -15.5543     1.7360  -8.960  < 2e-16 ***
## I(x^2)        2.4521     0.3464   7.078 4.94e-12 ***
## I(x^3)       -0.1186     0.0204  -5.814 1.09e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.331 on 502 degrees of freedom
## Multiple R-squared:  0.2778, Adjusted R-squared:  0.2735 
## F-statistic: 64.37 on 3 and 502 DF,  p-value: < 2.2e-16
## 
## 
## $rad
## 
## Call:
## lm(formula = crim ~ x + I(x^2) + I(x^3), data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.381  -0.412  -0.269   0.179  76.217 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.605545   2.050108  -0.295    0.768
## x            0.512736   1.043597   0.491    0.623
## I(x^2)      -0.075177   0.148543  -0.506    0.613
## I(x^3)       0.003209   0.004564   0.703    0.482
## 
## Residual standard error: 6.682 on 502 degrees of freedom
## Multiple R-squared:    0.4,  Adjusted R-squared:  0.3965 
## F-statistic: 111.6 on 3 and 502 DF,  p-value: < 2.2e-16
## 
## 
## $tax
## 
## Call:
## lm(formula = crim ~ x + I(x^2) + I(x^3), data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.273  -1.389   0.046   0.536  76.950 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  1.918e+01  1.180e+01   1.626    0.105
## x           -1.533e-01  9.568e-02  -1.602    0.110
## I(x^2)       3.608e-04  2.425e-04   1.488    0.137
## I(x^3)      -2.204e-07  1.889e-07  -1.167    0.244
## 
## Residual standard error: 6.854 on 502 degrees of freedom
## Multiple R-squared:  0.3689, Adjusted R-squared:  0.3651 
## F-statistic:  97.8 on 3 and 502 DF,  p-value: < 2.2e-16
## 
## 
## $ptratio
## 
## Call:
## lm(formula = crim ~ x + I(x^2) + I(x^3), data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.833 -4.146 -1.655  1.408 82.697 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 477.18405  156.79498   3.043  0.00246 **
## x           -82.36054   27.64394  -2.979  0.00303 **
## I(x^2)        4.63535    1.60832   2.882  0.00412 **
## I(x^3)       -0.08476    0.03090  -2.743  0.00630 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.122 on 502 degrees of freedom
## Multiple R-squared:  0.1138, Adjusted R-squared:  0.1085 
## F-statistic: 21.48 on 3 and 502 DF,  p-value: 4.171e-13
## 
## 
## $black
## 
## Call:
## lm(formula = crim ~ x + I(x^2) + I(x^3), data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.096  -2.343  -2.128  -1.439  86.790 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.826e+01  2.305e+00   7.924  1.5e-14 ***
## x           -8.356e-02  5.633e-02  -1.483    0.139    
## I(x^2)       2.137e-04  2.984e-04   0.716    0.474    
## I(x^3)      -2.652e-07  4.364e-07  -0.608    0.544    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.955 on 502 degrees of freedom
## Multiple R-squared:  0.1498, Adjusted R-squared:  0.1448 
## F-statistic: 29.49 on 3 and 502 DF,  p-value: < 2.2e-16
## 
## 
## $lstat
## 
## Call:
## lm(formula = crim ~ x + I(x^2) + I(x^3), data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.234  -2.151  -0.486   0.066  83.353 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  1.2009656  2.0286452   0.592   0.5541  
## x           -0.4490656  0.4648911  -0.966   0.3345  
## I(x^2)       0.0557794  0.0301156   1.852   0.0646 .
## I(x^3)      -0.0008574  0.0005652  -1.517   0.1299  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.629 on 502 degrees of freedom
## Multiple R-squared:  0.2179, Adjusted R-squared:  0.2133 
## F-statistic: 46.63 on 3 and 502 DF,  p-value: < 2.2e-16
## 
## 
## $medv
## 
## Call:
## lm(formula = crim ~ x + I(x^2) + I(x^3), data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.427  -1.976  -0.437   0.439  73.655 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 53.1655381  3.3563105  15.840  < 2e-16 ***
## x           -5.0948305  0.4338321 -11.744  < 2e-16 ***
## I(x^2)       0.1554965  0.0171904   9.046  < 2e-16 ***
## I(x^3)      -0.0014901  0.0002038  -7.312 1.05e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.569 on 502 degrees of freedom
## Multiple R-squared:  0.4202, Adjusted R-squared:  0.4167 
## F-statistic: 121.3 on 3 and 502 DF,  p-value: < 2.2e-16

This shows evidence of non-linear association: several predictors (for example indus, nox, age, dis, ptratio, and medv) have significant quadratic and cubic terms.

6. Chapter 5, Question 2

We will now derive the probability that a given observation is part of a bootstrap sample. Suppose that we obtain a bootstrap sample from a set of n observations.

(a)

What is the probability that the first bootstrap observation is not the jth observation from the original sample? Justify your answer.

Each bootstrap draw is made uniformly at random from the n observations, so the probability that the first bootstrap observation is the jth observation is \(\frac{1}{n}\), and the probability that it is not is \(1 - \frac{1}{n}\).

(b)

What is the probability that the second bootstrap observation is not the jth observation from the original sample?

Because bootstrap sampling is done with replacement, the second draw is independent of the first, so the probability that the second bootstrap observation is not the jth observation is again \(1 - \frac{1}{n}\). (The probability that neither of the first two bootstrap observations is the jth observation is \(\left(1 - \frac{1}{n}\right)^2\).)

(c)

Argue that the probability that the jth observation is not in the bootstrap sample is (1 − 1/n)^n.

Because the n bootstrap draws are made independently and with replacement, each draw misses the jth observation with probability \(1 - \frac{1}{n}\). Multiplying over the n draws gives \(\left(1 - \frac{1}{n}\right)^n\) as the probability that the jth observation does not appear in the bootstrap sample.

(d)

When n = 5, what is the probability that the jth observation is in the bootstrap sample?

# P(jth observation IS in the bootstrap sample) = 1 - (1 - 1/n)^n
prob <- function(n){
  ret <- (1-1/n)^n
  return(1- ret)
}

prob(n=5)
## [1] 0.67232

(e)

When n = 100, what is the probability that the jth observation is in the bootstrap sample?

prob(n=100)
## [1] 0.6339677

(f)

When n = 10,000, what is the probability that the jth observation is in the bootstrap sample?

prob(n=10000)
## [1] 0.632139

(g)

Create a plot that displays, for each integer value of n from 1 to 100,000, the probability that the jth observation is in the bootstrap sample. Comment on what you observe.

n <- c(1:100000)
Probability <- prob(n=n)
df <- data.frame(n, Probability)
ggplot(df, aes(x=n, y=Probability)) + geom_point() + geom_line()

The probability drops quickly and then levels off, changing very little as n grows.
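In fact it converges to \(1 - e^{-1} \approx 0.632\), which matches the values computed above for n = 100 and n = 10,000:

1 - exp(-1)
## [1] 0.6321206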

7. Chapter 5, Question 3

Undergrad

8. Choose one of these two: Chapter 5, Question 7 OR Chapter 5, Question 8 (skip part d)

I chose Question 8. We will now perform cross-validation on a simulated data set. (a) Generate a simulated data set as follows:

set.seed(1)
y=rnorm(100)
x=rnorm(100)
y=x-2*x^2+rnorm(100)

In this data set, what is n and what is p? Write out the model used to generate the data in equation form.

n = 100, p = 1, and the model is

\(y = x - 2x^2 + \epsilon\)

(b) Create a scatterplot of X against Y. Comment on what you find.

ggplot() + geom_point(aes(x=x, y=y)) 

The relationship looks quadratic with some noise, exactly as the generating model specifies. It is definitely not linear.

(c) Set a random seed, and then compute the LOOCV errors that result from fitting the following four models using least squares:

i. Y = β0 + β1X + ε

library(boot)
data <- data.frame(x,y)
fit <-glm(y~x, data=data)
cv.error <- cv.glm(data, fit)
cv.error$delta
## [1] 5.890979 5.888812

ii. Y = β0 + β1X + β2X2 + ε

fit2 <-glm(y~x + I(x^2), data=data)
cv.error <- cv.glm(data, fit2)
cv.error$delta
## [1] 1.086596 1.086326

iii. Y = β0 +β1X +β2X2 +β3X3 +ε

fit3 <-glm(y~x + I(x^2) + I(x^3), data=data)
cv.error <- cv.glm(data, fit3)
cv.error$delta
## [1] 1.102585 1.102227

iv. Y = β0 +β1X +β2X2 +β3X3 +β4X4 +ε.

fit4 <-glm(y~x + I(x^2) + I(x^3) + I(x^4), data=data)
cv.error <- cv.glm(data, fit4)
cv.error$delta
## [1] 1.114772 1.114334

fit2, the quadratic model, has the smallest LOOCV error, as expected given the quadratic data-generating model.

(d) Repeat (c) using another random seed, and report your results.

(a) Generate a simulated data set as follows:

set.seed(62090)
y=rnorm(100)
x=rnorm(100)
y=x-2*x^2+rnorm(100)

In this data set, what is n and what is p? Write out the model used to generate the data in equation form.

n = 100, p = 1, and the model is

\(y = x - 2x^2 + \epsilon\)

(b) Create a scatterplot of X against Y. Comment on what you find.

ggplot() + geom_point(aes(x=x, y=y)) 

The relationship looks quadratic with some noise, exactly as the generating model specifies. It is definitely not linear.

(c) Set a random seed, and then compute the LOOCV errors that result from fitting the following four models using least squares:

i. Y = β0 + β1X + ε

data <- data.frame(x,y)
fit <-glm(y~x, data=data)
cv.error <- cv.glm(data, fit)
cv.error$delta
## [1] 8.172267 8.168295
fit2 <-glm(y~x + I(x^2), data=data)
cv.error <- cv.glm(data, fit2)
cv.error$delta
## [1] 1.157316 1.156983
fit3 <-glm(y~x + I(x^2) + I(x^3), data=data)
cv.error <- cv.glm(data, fit3)
cv.error$delta
## [1] 1.179113 1.178631
fit4 <-glm(y~x + I(x^2) + I(x^3) + I(x^4), data=data)
cv.error <- cv.glm(data, fit4)
cv.error$delta
## [1] 1.209000 1.208256

Are your results the same as what you got in (c)? Why?

The LOOCV errors change slightly, because a new data set was simulated with the new seed (LOOCV itself involves no randomness), but the conclusion is the same: fit2, the quadratic model, has the smallest error.

(e) Which of the models in (c) had the smallest LOOCV error? Is this what you expected? Explain your answer.

As noted above, fit2 (the quadratic model) had the smallest LOOCV error. This is what I expected, since the true relationship between x and y is quadratic.

(f) Comment on the statistical significance of the coefficient estimates that results from fitting each of the models in (c) using least squares. Do these results agree with the conclusions drawn based on the cross-validation results?

# Coefficient tables for the four fits from part (c)
fits <- list(fit, fit2, fit3, fit4)
map(fits, summary) %>% map(., function(item){item$coefficients})
## [[1]]
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) -1.798341  0.2748702 -6.542512 2.768321e-09
## x            1.142119  0.2844951  4.014547 1.166651e-04
## 
## [[2]]
##                Estimate Std. Error     t value     Pr(>|t|)
## (Intercept)  0.04356162 0.13165399   0.3308796 7.414491e-01
## x            0.88104115 0.11044340   7.9773095 2.998520e-12
## I(x^2)      -1.97730438 0.08356904 -23.6607283 4.501259e-42
## 
## [[3]]
##                Estimate Std. Error     t value     Pr(>|t|)
## (Intercept)  0.04298628 0.13223346   0.3250787 7.458289e-01
## x            0.94416386 0.19173960   4.9241984 3.520683e-06
## I(x^2)      -1.97915931 0.08405773 -23.5452396 1.149842e-41
## I(x^3)      -0.02366139 0.05862517  -0.4036046 6.874011e-01
## 
## [[4]]
##                Estimate Std. Error    t value     Pr(>|t|)
## (Intercept) -0.02670569 0.15707359 -0.1700202 8.653558e-01
## x            0.94200594 0.19207638  4.9043298 3.866602e-06
## I(x^2)      -1.80638965 0.22560323 -8.0069317 2.910940e-12
## I(x^3)      -0.02234304 0.05874442 -0.3803432 7.045398e-01
## I(x^4)      -0.03655411 0.04428365 -0.8254539 4.111820e-01
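In every model the linear term is significant, and the quadratic term is highly significant once included, while the cubic and quartic terms are not (p-values around 0.4 to 0.7). This agrees with the cross-validation results, which favored the quadratic model.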

9. (UNDERGRAD ONLY)

10.

(GRAD ONLY) Write your own function that performs discriminant analysis when p=1, but replace the normal distribution with a Laplace distribution. Now perform classification on the iris data (excluding the species setosa from the analysis) using sepal width as the only predictor. Compare the classification of LDA (when using the normal distribution) to the classification when using the Laplace distribution. How does the decision boundary change when using a laplace distribution as compared to a normal distribution?

iris2 <- iris %>% filter(Species != "setosa")
ggplot(data=iris2, aes(x=Sepal.Width, fill= Species)) + geom_histogram(bins = 12) + geom_density() + facet_grid(Species~.)

\[p_x(k) = \frac{\pi_k f_k(x)}{\pi_1 f_1(x) + \pi_2 f_2(x)}\]

\[f_k(x) = \frac{1}{2b} \exp\left(-\frac{|x-\mu_k|}{b}\right)\]

assuming b is the same for all k:

\[p_x(k) = \frac{\pi_k \frac{1}{2b}\exp\left(-\frac{|x-\mu_k|}{b}\right)}{\pi_1\frac{1}{2b}\exp\left(-\frac{|x-\mu_1|}{b}\right) + \pi_2\frac{1}{2b}\exp\left(-\frac{|x-\mu_2|}{b}\right)}\]

\[p_x(k) = \frac{\pi_k \exp\left(-\frac{|x-\mu_k|}{b}\right)}{\pi_1\exp\left(-\frac{|x-\mu_1|}{b}\right) + \pi_2\exp\left(-\frac{|x-\mu_2|}{b}\right)}\]

Since the denominator does not depend on k: \(p_x(k) \propto \pi_k \exp\left(-\frac{|x-\mu_k|}{b}\right)\)

Taking the log:

\(\log[p_x(k)] \propto \log(\pi_k) - \frac{|x-\mu_k|}{b}\)

Since the two classes have equal priors (\(\pi_k = 1/2\); there are 50 observations of each species), the \(\log(\pi_k)\) term is a common constant, so x is assigned to the class with the larger \(-\frac{|x-\mu_k|}{b}\), i.e., the class that minimizes \(\frac{|x-\mu_k|}{b}\) (the class whose mean is nearest).

Estimating b. Note that R's mad() is the (scaled) median absolute deviation rather than the mean absolute deviation; it is used here as a rough common estimate of b. (The maximum-likelihood estimate of b for a Laplace distribution would be the mean absolute deviation from the class median.)

x_bar <- mean(iris2$Sepal.Width)

mu_vec <-iris2 %>% group_by(Species) %>% summarise(mean(Sepal.Width), mad(Sepal.Width))
b_hat <- mu_vec$`mad(Sepal.Width)`[1]

mu_1 <- mu_vec[1,2]
mu_2 <- mu_vec[2,2]
  
# Discriminant score |x - mu_k| / b_hat for each class; assign x to the class
# with the smaller score, i.e., the nearer class mean
discriminant <- function(inputs){
  map(inputs, function(x){c(abs(x-mu_1)/b_hat, abs(x-mu_2)/b_hat)})
}

lout <- discriminant(iris2$Sepal.Width)
test <- rep(NA,100)
for (i in 1:100){
  test[i] <- ifelse(lout[[i]][[1]] > lout[[i]][[2]], "virg", "versi")
}
data <- data.frame(iris2$Sepal.Width, iris2$Species, test)
ggplot(data, aes(x=iris2.Sepal.Width, y=0, fill = iris2.Species, col=test))+geom_point(shape=21, size=3)

lda_fit <- lda(iris2$Species~iris2$Sepal.Width)
## Warning in lda.default(x, grouping, ...): group setosa is empty
pred <- predict(lda_fit, iris2)
iris2$pred <- pred$class
ggplot(iris2, aes(x=Sepal.Width, y=0, fill = Species, col=pred))+geom_point(shape=21, size=3)

After all of this work, the two classifiers produce the same classifications, so the decision boundaries appear to be the same (at least in this example).

Theoretically, with equal priors, the assumption of a common \(\sigma\) in LDA is essentially the reason that the decision boundary is \(\frac{\mu_1+\mu_2}{2}\). Since the Laplace distribution is also symmetric and b (here estimated by the MAD) is assumed the same for each class, the decision boundary under the Laplace model is also \(\frac{\mu_1+\mu_2}{2}\).
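As a quick sanity check, the common decision boundary under either model is the midpoint of the two class means (output not shown):

# Midpoint of the class means = decision boundary with equal priors and a common scale
(mu_vec$`mean(Sepal.Width)`[1] + mu_vec$`mean(Sepal.Width)`[2]) / 2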