2)

4)

a)

a)

Since the predictor is uniformly distributed, on average 10 percent of the available observations will be used to make each prediction.

b)

In the case where p = 2, we use observations that are within 10 percent of the range of X1 and within 10 percent of the range of X2. If we think of this criterion visually as a “box”, the area of such a box would be .1 * .1 = 1/100. So, on average, 1/100 of the available observations would be used to make each prediction.

c)

Using the same logic as in b), the average proportion of available observations used to make each prediction would be \(.1^{100}\).

d)

As p increases, the fraction of available observations used to predict decreases exponentially. That is to say, there are few observations “near” any given test observation as the number of predictors increases.
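
A quick check of this arithmetic in R (just a sketch of parts a) through c)):

# fraction of observations within 10 percent of the range of every predictor
p <- c(1, 2, 100)
0.1^p   # 1e-01 1e-02 1e-100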

e)

I am thinking of this as the side length of a hypercube whose “volume” is fixed at .1, so that it contains, on average, 10 percent of the observations. When p = 1, the only side is of length .1. When p = 2, each side is \(\sqrt{.1} \approx .32\). For p = 100, using the same logic, we have \(x^{100}=.1\), so \(x = .1^{1/100} \approx .977\); the cube must span nearly the entire range of every predictor.
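
And the corresponding side lengths in R, assuming the hypercube's volume is fixed at 10 percent of the unit volume:

# side length of a hypercube containing 10 percent of the volume
round(0.1^(1/c(1, 2, 100)), 3)   # 0.100 0.316 0.977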

b)

We would need to know what percent threshold was used to choose nearby observations. The exercise showed that the fraction of available observations used for prediction is a function of this threshold (in the problem it was 10 percent); with high dimensionality, a relatively high threshold could be chosen as a corrective measure. The sample size also matters: with a very large sample, the number of nearby observations can be high enough to make good predictions even with a large number of predictors.

c)

Use your judgement?

d)

I would disagree, simply because the statement is so absolute. More data is usually a good thing because it can allow you to make better predictions and more precise inferences (if you use the data correctly). However, there are a few reasons why more data isn’t inherently good. First, more data might lead researchers to be overconfident about their results: there could be major measurement error or selection bias in the sample, but the test errors are so small that the researchers think they have a perfect prediction, even though the model would not work as well on another sample with the selection bias and measurement error corrected. Second, more data might induce ML researchers to use a complex non-parametric method, because such methods tend to predict better with large samples; but if the true relationship is captured well by a parametric method like regression, they are sacrificing a lot of interpretability for only a small increase in prediction accuracy. Lastly, with huge amounts of data come questions about ethics: how was the data obtained, and does it contain information that needs to be encrypted?

3)

5)

a)

If the Bayes decision boundary is linear, QDA would perform better on the training set (its extra flexibility lets it fit the training data more closely), but LDA would perform better on the test set.

b)

QDA would perform better on the training and the test set if the Bayes decision boundary is non-linear.

c)

As n increases, we would expect the test prediction accuracy of QDA to improve relative to LDA. This is because QDA’s drawback is its higher variance due to increased flexibility. With a higher sample size, variance decreases.

d)

False. If the true decision boundary is linear, LDA will reflect this fact and achieve a lower test error. QDA will fit a quadratic boundary to what is really just noise, which results in overfitting and a higher test error rate.
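
A small simulation sketch of this point, using made-up Gaussian classes with a common covariance matrix so the true Bayes boundary is linear (the means, sample sizes, and seed here are arbitrary choices of mine, not part of the exercise):

library(MASS)
set.seed(1)
n <- 50
# two classes with different means but a common covariance (linear Bayes boundary)
x0 <- mvrnorm(n, mu = c(0, 0), Sigma = diag(2))
x1 <- mvrnorm(n, mu = c(1, 1), Sigma = diag(2))
train <- data.frame(rbind(x0, x1), y = factor(rep(0:1, each = n)))
x0t <- mvrnorm(2000, mu = c(0, 0), Sigma = diag(2))
x1t <- mvrnorm(2000, mu = c(1, 1), Sigma = diag(2))
test <- data.frame(rbind(x0t, x1t), y = factor(rep(0:1, each = 2000)))
lda.err <- mean(predict(lda(y ~ ., data = train), test)$class != test$y)
qda.err <- mean(predict(qda(y ~ ., data = train), test)$class != test$y)
c(lda = lda.err, qda = qda.err)   # QDA's test error is typically no better, often worse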

8)

Setting K = 1 results in the most flexible KNN classifier possible. The regression had an average error rate of 25 percent, which is higher than the 18 percent average error rate for KNN. However, with K = 1 the KNN classifier fits the training data perfectly, so its training error is 0 percent; an 18 percent average across the training and test sets therefore implies a KNN test error of about 36 percent, which is higher than the regression test error of 30 percent. So, to avoid overfitting and for increased interpretability, we should use the regression.
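
A one-line sanity check of that arithmetic in R:

# K = 1 fits the training data perfectly, so its training error is 0;
# an 18 percent average across training and test then implies the test error:
2 * 0.18 - 0   # 0.36, versus 0.30 for logistic regression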

4)

11)

a)

library(ISLR)   # for the Auto data set
# mpg1 = 1 if mpg is above the median, 0 otherwise
mpg1 <- rep(0, nrow(Auto))
mpg1[Auto$mpg > median(Auto$mpg)] <- 1
Auto$mpg1 <- as.factor(mpg1)

b)

Auto <- Auto[, -9]   # drop the name column
par(mfrow = c(2, 4))
# boxplots of each remaining variable, grouped by the mpg1 factor
for(i in 1:8){
        plot(Auto$mpg1, Auto[, i], ylab = names(Auto)[i], xlab = "mpg1")
}

There looks to be a relationship between the mpg1 variable and the following variables: horsepower, weight, acceleration, and displacement.

c)

set.seed(8675309)
# 80/20 train/test split of the 392 observations
train <- rep(FALSE, nrow(Auto))
train[sample(1:nrow(Auto), floor(0.8 * nrow(Auto)))] <- TRUE
autotrain <- Auto[train, ]
autotest <- Auto[!train, ]

d)

library(MASS)   # for lda() and qda()
ldafit <- lda(mpg1 ~ horsepower + weight + acceleration + displacement, data = autotrain)
ldafit
## Call:
## lda(mpg1 ~ horsepower + weight + acceleration + displacement, 
##     data = autotrain)
## 
## Prior probabilities of groups:
##         0         1 
## 0.4760383 0.5239617 
## 
## Group means:
##   horsepower   weight acceleration displacement
## 0  129.62416 3631.215     14.72752     275.0000
## 1   77.89634 2326.896     16.53841     115.4116
## 
## Coefficients of linear discriminants:
##                       LD1
## horsepower    0.007246064
## weight       -0.001058601
## acceleration -0.008635755
## displacement -0.009478145
ldapred <- predict(ldafit, autotest)

table(ldapred$class, autotest$mpg1)
##    
##      0  1
##   0 35  0
##   1 12 32
12/79
## [1] 0.1518987

The test error is about 15.2 percent.
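
As a side note, the same test error can be computed directly (a sketch that applies equally to the QDA, logistic regression, and KNN fits below):

mean(ldapred$class != autotest$mpg1)   # 0.1518987, identical to 12/79 above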

e)

qdafit <- qda(mpg1 ~ horsepower + weight + acceleration + displacement, data = autotrain)
qdafit
## Call:
## qda(mpg1 ~ horsepower + weight + acceleration + displacement, 
##     data = autotrain)
## 
## Prior probabilities of groups:
##         0         1 
## 0.4760383 0.5239617 
## 
## Group means:
##   horsepower   weight acceleration displacement
## 0  129.62416 3631.215     14.72752     275.0000
## 1   77.89634 2326.896     16.53841     115.4116
qdapred <- predict(qdafit, autotest)

table(qdapred$class, autotest$mpg1)
##    
##      0  1
##   0 37  1
##   1 10 31
11/79
## [1] 0.1392405

The test error is about 13.9 percent.

f)

glmfit <- glm(mpg1 ~ horsepower + weight + acceleration, data = autotrain, family = "binomial")
glmfit
## 
## Call:  glm(formula = mpg1 ~ horsepower + weight + acceleration, family = "binomial", 
##     data = autotrain)
## 
## Coefficients:
##  (Intercept)    horsepower        weight  acceleration  
##    13.020228     -0.042410     -0.003678      0.094346  
## 
## Degrees of Freedom: 312 Total (i.e. Null);  309 Residual
## Null Deviance:       433.2 
## Residual Deviance: 162.3     AIC: 170.3
# note: predict() without type = "response" returns the linear predictor (log-odds),
# so this 0.5 cutoff is on the log-odds scale; cutting the log-odds at 0 (or the
# probabilities from type = "response" at 0.5) would be the usual 50 percent rule
glmprobs <- predict(glmfit, autotest)
glmpreds <- rep(0, nrow(autotest))
glmpreds[glmprobs > .5] <- 1
glmpreds
##  [1] 0 0 0 1 0 0 0 0 0 1 1 1 0 0 1 1 1 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 1 0 0 0
## [39] 1 0 0 1 0 0 0 0 0 0 1 1 1 0 0 1 1 0 1 0 0 0 0 1 0 0 1 1 0 0 1 1 0 1 0 1 1 0
## [77] 1 0 1
table(glmpreds, autotest$mpg1)
##         
## glmpreds  0  1
##        0 42  7
##        1  5 25
12/79
## [1] 0.1518987

The test error is about 15.2 percent.

g)

library(class)
# note: the full data frames are passed, so every column (including mpg and mpg1
# itself) acts as an unscaled predictor
knn1 <- knn(autotrain, autotest, k = 1, cl = autotrain$mpg1)

table(knn1, autotest$mpg1)
##     
## knn1  0  1
##    0 37  0
##    1 10 32
10/79
## [1] 0.1265823
knn2 <- knn(autotrain, autotest, k = 2, cl = autotrain$mpg1)

table(knn2, autotest$mpg1)
##     
## knn2  0  1
##    0 39  0
##    1  8 32
8/79
## [1] 0.1012658
knn3 <- knn(autotrain, autotest, k = 3, cl = autotrain$mpg1)

table(knn3, autotest$mpg1)
##     
## knn3  0  1
##    0 39  0
##    1  8 32
8/79
## [1] 0.1012658
knn4 <- knn(autotrain, autotest, k = 4, cl = autotrain$mpg1)

table(knn4, autotest$mpg1)
##     
## knn4  0  1
##    0 37  0
##    1 10 32
10/79
## [1] 0.1265823

The test error for k = 1 is about 12.7 percent, k = 2 and k = 3 both have a test error of about 10.1 percent, and the error rises back to about 12.7 percent at k = 4. So k = 2 and k = 3 are tied for the best choice of k on this dataset.
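
As a hedged aside (not the results reported above): since the knn() calls above use every column unscaled, here is a sketch of restricting KNN to the same four predictors used for LDA/QDA, standardized so that Euclidean distance treats them comparably. Its error rates would differ from those shown above.

vars <- c("horsepower", "weight", "acceleration", "displacement")
trainX <- scale(autotrain[, vars])
testX  <- scale(autotest[, vars],
                center = attr(trainX, "scaled:center"),
                scale  = attr(trainX, "scaled:scale"))
knn1s <- knn(trainX, testX, cl = autotrain$mpg1, k = 1)
mean(knn1s != autotest$mpg1)   # test error with standardized predictors only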

5)

a)

The plot implies that men are more likely to be admitted than women.

#male acceptance rate
1198/(1198+1493)
## [1] 0.4451877
#female acceptance rate
557/(557+1278)
## [1] 0.3035422

The male acceptance rate is 44.5 percent and the female acceptance rate is 30.3 percent.

b)

There looks to be heterogeneity in the number of males and females that apply to a given department. Females tended to apply to the departments that have lower overall acceptance rates, and males tended to apply to the departments with higher overall acceptance rates. So there does not look to be a gender bias university-wide; the difference in overall acceptance rates by gender was due to differences in department acceptance rates and in which departments each gender applied to.
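
A small sketch of how those department-level numbers can be computed directly from the built-in UCBAdmissions table (the object names here are my own):

apps <- apply(UCBAdmissions, c(2, 3), sum)       # applicants by Gender x Dept
round(UCBAdmissions["Admitted", , ] / apps, 2)   # acceptance rate by Gender x Dept
round(prop.table(apps, 1), 2)                    # share of each gender's applications per dept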

c)

Initially, it seemed like the college had a higher acceptance rate for males than for females. However, once the department variable is controlled for, the gender difference disappears.

d)

Again, females tended to apply to departments with lower acceptance rates and males tended to apply to departments with higher acceptance rates. Looking at the plots from b), it seems like the acceptance rates for males were very close to the acceptance rates for females within each department. So the confounding variable is department, or department acceptance rate.

e)

data(UCBAdmissions)
# flatten the 2 x 2 x 6 table into admitted/rejected counts for each Sex x Dept cell
Adm <- as.integer(UCBAdmissions)[(1:(6*2))*2-1]
Rej <- as.integer(UCBAdmissions)[(1:(6*2))*2]
Dept <- gl(6, 2, 6*2, labels = c("A","B","C","D","E","F"))
Sex <- gl(2, 1, 6*2, labels = c("Male","Female"))
Ratio <- Adm/(Rej+Adm)   # acceptance rate in each Sex x Dept cell
berk <- data.frame(Adm, Rej, Sex, Dept, Ratio)

head(berk)
##   Adm Rej    Sex Dept     Ratio
## 1 512 313   Male    A 0.6206061
## 2  89  19 Female    A 0.8240741
## 3 353 207   Male    B 0.6303571
## 4  17   8 Female    B 0.6800000
## 5 120 205   Male    C 0.3692308
## 6 202 391 Female    C 0.3406408
LogReg.gender <- glm(cbind(Adm,Rej)~Sex,data=berk,family=binomial("logit"))
summary(LogReg.gender)
## 
## Call:
## glm(formula = cbind(Adm, Rej) ~ Sex, family = binomial("logit"), 
##     data = berk)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -16.7915   -4.7613   -0.4365    5.1025   11.2022  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.22013    0.03879  -5.675 1.38e-08 ***
## SexFemale   -0.61035    0.06389  -9.553  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 877.06  on 11  degrees of freedom
## Residual deviance: 783.61  on 10  degrees of freedom
## AIC: 856.55
## 
## Number of Fisher Scoring iterations: 4

Based on this logistic regression model, females have a significantly lower probability of acceptance than males.
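
In odds terms (just converting the coefficient reported above):

exp(coef(LogReg.gender)["SexFemale"])   # about 0.54: females' odds of admission are roughly half of males'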

f)

LogReg.gender1 <- glm(cbind(Adm,Rej)~Sex + Dept,data=berk,family=binomial("logit"))
summary(LogReg.gender1)
## 
## Call:
## glm(formula = cbind(Adm, Rej) ~ Sex + Dept, family = binomial("logit"), 
##     data = berk)
## 
## Deviance Residuals: 
##       1        2        3        4        5        6        7        8  
## -1.2487   3.7189  -0.0560   0.2706   1.2533  -0.9243   0.0826  -0.0858  
##       9       10       11       12  
##  1.2205  -0.8509  -0.2076   0.2052  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.58205    0.06899   8.436   <2e-16 ***
## SexFemale    0.09987    0.08085   1.235    0.217    
## DeptB       -0.04340    0.10984  -0.395    0.693    
## DeptC       -1.26260    0.10663 -11.841   <2e-16 ***
## DeptD       -1.29461    0.10582 -12.234   <2e-16 ***
## DeptE       -1.73931    0.12611 -13.792   <2e-16 ***
## DeptF       -3.30648    0.16998 -19.452   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 877.056  on 11  degrees of freedom
## Residual deviance:  20.204  on  5  degrees of freedom
## AIC: 103.14
## 
## Number of Fisher Scoring iterations: 4

The coefficient for females is now positive, albeit not significant. This makes sense given the context of the problem. Before controlling for department, the female coefficient was significantly negative; once department is controlled for, the coefficient flips its sign (which is the paradox) and is no longer significant. That is, there is no overall gender bias in acceptance rates once department acceptance rates are accounted for.