Notebook prepared by Everton Lima

Introduction to Statistical Learning Solutions (ISLR)

Ch 4 Exercises

Table of Contents

Conceptual Exercises

Applied Exercises

Conceptual Exercises

1

\(p(x) = \frac{e^{\beta_0+\beta_1x}}{1+e^{\beta_0+\beta_1x}}\) (4.2)

\(\frac{p(x)}{1-p(x)} = e^{\beta_0+\beta_1x}\) (4.3)

The equations are equivalent since,

\(1-p(x) = 1-\frac{e^{\beta_0+\beta_1x}}{1+e^{\beta_0+\beta_1x}} = \frac{1+e^{\beta_0+\beta_1x}-e^{\beta_0+\beta_1x}}{1+e^{\beta_0+\beta_1x}} = \frac{1}{1+e^{\beta_0+\beta_1x}}\)

\(\frac{1}{1-p(x)} = 1+e^{\beta_0+\beta_1x}\) (1)

and substituting (1) into (4.2) yields (4.3),

\(\frac{p(x)}{1-p(x)} = \frac{e^{\beta_0+\beta_1x}}{1+e^{\beta_0+\beta_1x}}(1+e^{\beta_0+\beta_1x})=e^{\beta_0+\beta_1x}\)
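As a quick numerical sanity check, with arbitrary illustrative values for the coefficients, the odds computed from \(p(x)\) match the right-hand side of (4.3):

beta0 <- -2; beta1 <- 0.5; x <- 3                      # arbitrary illustrative values

p <- exp(beta0 + beta1*x)/(1 + exp(beta0 + beta1*x))   # Eq. 4.2
p/(1 - p)                                              # odds computed from p(x)
exp(beta0 + beta1*x)                                   # Eq. 4.3 -- identical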

2

\(p_k(x) = \frac{\pi_k\frac{1}{\sqrt{2\pi}\sigma}exp(-\frac{1}{2\sigma^2}(x-\mu_k)^2) }{\sum^K_{l=1} \pi_l\frac{1}{\sqrt{2\pi}\sigma}exp(-\frac{1}{2\sigma^2}(x-\mu_l)^2)}\) (4.12)

\(\delta_k(x)=x\cdot\frac{\mu_k}{\sigma^2} -\frac{\mu_k^2}{2\sigma^2} + log(\pi_k)\) (4.13)

\(log(p_k(x)) = log(\frac{\pi_k\frac{1}{\sqrt{2\pi}\sigma}exp(-\frac{1}{2\sigma^2}(x-\mu_k)^2) }{\sum^K_{l=1} \pi_l\frac{1}{\sqrt{2\pi}\sigma}exp(-\frac{1}{2\sigma^2}(x-\mu_l)^2)})=log(\pi_k\frac{1}{\sqrt{2\pi}\sigma}exp(-\frac{1}{2\sigma^2}(x-\mu_k)^2) ) - log(\sum^K_{l=1} \pi_l\frac{1}{\sqrt{2\pi}\sigma}exp(-\frac{1}{2\sigma^2}(x-\mu_l)^2))\) (Eq. 1)

Let \(C = log(\sum^K_{l=1} \pi_l\frac{1}{\sqrt{2\pi}\sigma}exp(-\frac{1}{2\sigma^2}(x-\mu_l)^2))\), which does not depend on \(k\). Then Eq. 1 becomes \(log(p_k(x)) = log(\pi_k\frac{1}{\sqrt{2\pi}\sigma}exp(-\frac{1}{2\sigma^2}(x-\mu_k)^2) ) - C\)

Further simplifications can be made,

\(log(p_k(x)) = log(\pi_k)+log(\frac{1}{\sqrt{2\pi}\sigma})+log(exp(-\frac{1}{2\sigma^2}(x-\mu_k)^2)) - C = -\frac{1}{2\sigma^2}(x-\mu_k)^2 + log(\pi_k)+log(\frac{1}{\sqrt{2\pi}\sigma})-C\)

\(= -\frac{1}{2\sigma^2}(x^2-2x\mu_k+\mu_k^2) + log(\pi_k)+log(\frac{1}{\sqrt{2\pi}\sigma})-C\)

\(= x\cdot\frac{\mu_k}{\sigma^2}-\frac{\mu_k^2}{2\sigma^2} + log(\pi_k)-\frac{x^2}{2\sigma^2}+log(\frac{1}{\sqrt{2\pi}\sigma})-C\)

Since the Bayes classifier assigns an observation to the class for which \(p_k(x)\) is largest, it assigns the observation to class j if \(p_j(x) > p_i(x)\) for all \(i\neq j\).

\(p_j(x) > p_i(x) \iff log(p_j(x)) > log(p_i(x))\) for all \(i\neq j\), so

\(x\cdot\frac{\mu_j}{\sigma^2}-\frac{\mu_j^2}{2\sigma^2} + log(\pi_j)-\frac{x^2}{2\sigma^2}+log(\frac{1}{\sqrt{2\pi}\sigma})-C > x\cdot\frac{\mu_i}{\sigma^2}-\frac{\mu_i^2}{2\sigma^2} + log(\pi_i)-\frac{x^2}{2\sigma^2}+log(\frac{1}{\sqrt{2\pi}\sigma})-C\)

\(x\cdot\frac{\mu_j}{\sigma^2}-\frac{\mu_j^2}{2\sigma^2} + log(\pi_j)> x\cdot\frac{\mu_i}{\sigma^2}-\frac{\mu_i^2}{2\sigma^2} + log(\pi_i)\) for all \(i\neq j\), since the terms \(-\frac{x^2}{2\sigma^2}\), \(log(\frac{1}{\sqrt{2\pi}\sigma})\) and \(C\) are the same for every class (\(\sigma_1 = \sigma_2 = ... = \sigma_K = \sigma\)).

Thus the Bayes classifier is equivalent to assigning an observation to the class for which the discriminant function

\(\delta_k(x) = x\cdot\frac{\mu_k}{\sigma^2} -\frac{\mu_k^2}{2\sigma^2} + log(\pi_k)\) is largest, which is exactly (4.13).
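A small simulation (with illustrative parameter values) confirms that the class maximizing the discriminant \(\delta_k(x)\) is also the class maximizing the posterior \(p_k(x)\):

# Two classes with a shared variance; parameter values chosen only for illustration
mu    <- c(0, 2)        # class means
sigma <- 1.5            # common standard deviation
prior <- c(0.3, 0.7)    # class priors
x     <- seq(-4, 6, by = 0.5)

# Posterior p_k(x) from Bayes' theorem (Eq. 4.12)
post <- sapply(1:2, function(k) prior[k]*dnorm(x, mu[k], sigma))
post <- post/rowSums(post)

# Linear discriminant delta_k(x) (Eq. 4.13)
delta <- sapply(1:2, function(k) x*mu[k]/sigma^2 - mu[k]^2/(2*sigma^2) + log(prior[k]))

# Both rules pick the same class at every x
all(max.col(post) == max.col(delta))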

3

In the previous problem the quadratic term \(\frac{x^2}{2\sigma^2}\) could be cancelled from both sides because LDA assumes a common variance, \(\sigma = \sigma_1 = \sigma_2 = ... = \sigma_K\). QDA makes no such assumption.

Following the steps in the previous question yields

\(x\cdot\frac{\mu_j}{\sigma^2}-\frac{\mu_j^2}{2\sigma^2} + log(\pi_j)-\frac{x^2}{2\sigma^2}+log(\frac{1}{\sqrt{2\pi}\sigma})-C > x\cdot\frac{\mu_i}{\sigma^2}-\frac{\mu_i^2}{2\sigma^2} + log(\pi_i)-\frac{x^2}{2\sigma^2}+log(\frac{1}{\sqrt{2\pi}\sigma})-C\)

but when each class has its own variance \(\sigma_k^2\) the quadratic and normalizing terms no longer cancel, so an observation is assigned to the class for which

\(x\cdot\frac{\mu_j}{\sigma_j^2}-\frac{\mu_j^2}{2\sigma_j^2} + log(\pi_j)-\frac{x^2}{2\sigma_j^2}+log(\frac{1}{\sqrt{2\pi}\sigma_j})> x\cdot\frac{\mu_i}{\sigma_i^2}-\frac{\mu_i^2}{2\sigma_i^2} + log(\pi_i)-\frac{x^2}{2\sigma_i^2}+log(\frac{1}{\sqrt{2\pi}\sigma_i})\) for all \(i \neq j\), which is quadratic in \(x\).

4

4a

If X is a uniformly distributed random variable on [0,1], then \(P( X \leq x) = F(x) = x\) for \(0 \leq x \leq 1\). Thus, for any observation \(x_i\) on [0,1] the probability that a training observation falls in the interval \([x_i-.05,x_i+.05]\) is \(P(x_i-.05 \leq X \leq x_i+.05) = \int_{x_i-.05}^{x_i+.05}f(x)dx = F(x_i+.05)-F(x_i-.05) = (x_i+.05)-(x_i-.05) = .1\), or 10%.

This assumes that \(x_i \in [0.05,0.95]\); for \(x_i\) within 0.05 of either boundary the available neighborhood is slightly smaller.
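A quick Monte Carlo check (simulated data, neighborhood centered at an arbitrary interior point such as 0.5):

set.seed(1)
x <- runif(1e5)               # uniform observations on [0,1]
mean(abs(x - 0.5) <= 0.05)    # fraction within +/- 0.05 of 0.5; close to 0.1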

4b

Similarly to the previous part, if \(X_1\) and \(X_2\) are uniformly distributed random variables then there is a 10% chance of an observation being in the 10% neighborhood of a test point along each of \(X_1\) and \(X_2\) individually.

Assuming that \(X_1\) and \(X_2\) are independent, let the test point have coordinates \((x_1,x_2)\) and define \(A := [x_1-0.05,x_1+0.05]\) and \(B := [x_2-0.05,x_2+0.05]\), so that A and B are the neighborhoods along \(X_1\) and \(X_2\) respectively. Then \(P(X_1 \in A, X_2 \in B) = P(X_1 \in A)P(X_2 \in B) = 0.1 \times 0.1 = 0.01\), or 1%.

Thus, on average only 1% of observations are within the 10% neighborhood of both cases.

4c

By the same argument, the fraction of observations within the neighborhood of all 100 features is \(0.1^{100}\), which is essentially zero.

4d

From the previous parts one can see that if all features are independent and uniformly distributed on [0,1], then the fraction of training observations near any given test observation shrinks exponentially with each additional feature. In practice this means that KNN must use "nearest" neighbors that are in fact very far from the test observation, yielding very poor results.

4e

The neighborhood is a hypercube centered on the test observation, and the side length needed to capture 10% of the training observations is \(0.1^{1/p}\): 0.10 for \(p=1\), about 0.32 for \(p=2\), and about 0.98 for \(p=100\), so for large \(p\) the "neighborhood" spans nearly the entire range of every feature.
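The side lengths for p = 1, 2, and 100 can be computed directly:

p <- c(1, 2, 100)
0.1^(1/p)    # side length of the hypercube covering 10% of observations: ~0.10, 0.32, 0.98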

5

5a

On the training set QDA will perform better than LDA because of its extra flexibility. However, if the Bayes decision boundary is linear then LDA is expected to perform better on the test set.

5b

We expect QDA to perform better on both the training and the test set.

5c

As the sample size increases, the prediction accuracy of QDA relative to LDA is expected to improve. In general this is true of more flexible methods relative to less flexible ones: the flexible method retains its lower bias, while the extra variance it incurs shrinks as n grows.

5d

False.

For small values of n, QDA will tend to model noise, and its test error will therefore be higher than that of LDA.

6

6a

\(p(x) = \frac{exp(\hat{\beta_0}+\hat{\beta_1}x_1+\hat{\beta_2}x_2) }{1+exp(\hat{\beta_0}+\hat{\beta_1}x_1+\hat{\beta_2}x_2)}=\frac{exp(-6+0.05\cdot 40+1\cdot 3.5)}{1+exp(-6+0.05\cdot 40+1\cdot 3.5)}=\frac{e^{-0.5}}{1+e^{-0.5}}\approx .3775\), or about 37.75%

6b

Let \(\hat{Y} = \hat{\beta_0}+\hat{\beta_1} x_1+\hat{\beta_2} x_2 = -6+0.05x_1+3.5\). The logistic regression function can then be expressed as \(p(x) = \frac{ e^{ \hat{Y}} }{ 1 + e^{ \hat{Y}} }\), and if the probability that the student gets an A is .5 then \(\frac{ e^{ \hat{Y}} }{ 1+e^{ \hat{Y}} } = .5\)

so,

\(e^{\hat{Y}} = .5(1+e^{\hat{Y}}) = .5+.5e^{\hat{Y}}\)

\(e^{\hat{Y}} - .5e^{\hat{Y}}= .5\)

\(e^{\hat{Y}}(1 - .5)= .5\)

\(.5e^{\hat{Y}}= .5\)

\(log(.5)+\hat{Y} = log(.5)\)

\(\hat{Y} = -6+0.05x_1+3.5 = 0\)

\(x_1 = 50\)
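Both parts can be checked numerically (plogis is the logistic CDF, i.e. Eq. 4.2):

beta <- c(-6, 0.05, 1)    # beta0, beta1 (hours studied), beta2 (GPA)

# 6a: probability of an A with 40 hours of study and a GPA of 3.5
plogis(beta[1] + beta[2]*40 + beta[3]*3.5)   # ~0.3775

# 6b: hours needed for a 50% chance, i.e. solve beta0 + beta1*x1 + beta2*3.5 = 0
(0 - (beta[1] + beta[3]*3.5))/beta[2]        # 50 hours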

7

The normal density function is \(f_k(x) = \frac{1}{\sqrt{2\pi}\sigma}exp(-\frac{1}{2\sigma^2}(x-\mu_k)^2)\),

where for k=0 (no dividend, \(\mu_0=0\), \(\pi_0=0.2\)), \(f_0(4) = \frac{1}{6\sqrt{2\pi}}exp(-\frac{1}{2(36)}(4-0)^2) \approx 0.0532\), and for k=1 (dividend, \(\mu_1=10\), \(\pi_1=0.8\)), \(f_1(4) = \frac{1}{6\sqrt{2\pi}}exp(-\frac{1}{2(36)}(4-10)^2) \approx 0.0403\). Plugging these into Bayes' formula gives \(P(\text{dividend}\mid X=4) = \frac{0.8\cdot 0.0403}{0.2\cdot 0.0532 + 0.8\cdot 0.0403} \approx 0.752\).
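The same calculation can be done with dnorm, using the priors from the problem (80% of companies issued a dividend):

prior.no <- 0.2; prior.yes <- 0.8
f.no  <- dnorm(4, mean = 0,  sd = 6)
f.yes <- dnorm(4, mean = 10, sd = 6)

prior.yes*f.yes/(prior.no*f.no + prior.yes*f.yes)   # ~0.752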

8

For K = 1, KNN gives a training error of 0%, since the closest point to any training observation is itself. Since the average of the training and test errors is 18%, the test error must be 36%, which is 6% higher than the 30% test error obtained by logistic regression. This indicates that logistic regression is the better approach here.

9

9a

Since the odds are \(\frac{p(x)}{1-p(x)} = 0.37\),

\(p(x) = 0.37(1-p(x))\), so \(p(x)+0.37p(x)=0.37\), and therefore \(p(x) = \frac{0.37}{1.37} \approx 0.27\), or 27%.

9b

\(p(x) = 0.16\) then \(1 - p(x) = 0.84\)

Thus,

\(\frac{p(x)}{1-p(x)} = \frac{0.16}{0.84} \approx 0.19\)
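Both conversions in one small snippet:

odds <- 0.37
odds/(1 + odds)    # 9a: probability of default, ~0.27

p <- 0.16
p/(1 - p)          # 9b: odds, ~0.19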

Applied Exercises

10

library("ISLR")

10a

As before, there is one nominal feature, Direction, with all other features being numerical.

Moreover, the scatterplot matrix of the Weekly data indicates that the Lag variables are only weakly correlated with each other. It can also be noted that the volume of shares traded has been increasing rapidly from year to year.

summary(Weekly)
##       Year           Lag1               Lag2               Lag3         
##  Min.   :1990   Min.   :-18.1950   Min.   :-18.1950   Min.   :-18.1950  
##  1st Qu.:1995   1st Qu.: -1.1540   1st Qu.: -1.1540   1st Qu.: -1.1580  
##  Median :2000   Median :  0.2410   Median :  0.2410   Median :  0.2410  
##  Mean   :2000   Mean   :  0.1506   Mean   :  0.1511   Mean   :  0.1472  
##  3rd Qu.:2005   3rd Qu.:  1.4050   3rd Qu.:  1.4090   3rd Qu.:  1.4090  
##  Max.   :2010   Max.   : 12.0260   Max.   : 12.0260   Max.   : 12.0260  
##       Lag4               Lag5              Volume       
##  Min.   :-18.1950   Min.   :-18.1950   Min.   :0.08747  
##  1st Qu.: -1.1580   1st Qu.: -1.1660   1st Qu.:0.33202  
##  Median :  0.2380   Median :  0.2340   Median :1.00268  
##  Mean   :  0.1458   Mean   :  0.1399   Mean   :1.57462  
##  3rd Qu.:  1.4090   3rd Qu.:  1.4050   3rd Qu.:2.05373  
##  Max.   : 12.0260   Max.   : 12.0260   Max.   :9.32821  
##      Today          Direction 
##  Min.   :-18.1950   Down:484  
##  1st Qu.: -1.1540   Up  :605  
##  Median :  0.2410             
##  Mean   :  0.1499             
##  3rd Qu.:  1.4050             
##  Max.   : 12.0260
plot(Weekly)

As previously observed, the Lag variables are weakly correlated and there is a strong correlation between Year and Volume.

cor(Weekly[,-9])
##               Year         Lag1        Lag2        Lag3         Lag4
## Year    1.00000000 -0.032289274 -0.03339001 -0.03000649 -0.031127923
## Lag1   -0.03228927  1.000000000 -0.07485305  0.05863568 -0.071273876
## Lag2   -0.03339001 -0.074853051  1.00000000 -0.07572091  0.058381535
## Lag3   -0.03000649  0.058635682 -0.07572091  1.00000000 -0.075395865
## Lag4   -0.03112792 -0.071273876  0.05838153 -0.07539587  1.000000000
## Lag5   -0.03051910 -0.008183096 -0.07249948  0.06065717 -0.075675027
## Volume  0.84194162 -0.064951313 -0.08551314 -0.06928771 -0.061074617
## Today  -0.03245989 -0.075031842  0.05916672 -0.07124364 -0.007825873
##                Lag5      Volume        Today
## Year   -0.030519101  0.84194162 -0.032459894
## Lag1   -0.008183096 -0.06495131 -0.075031842
## Lag2   -0.072499482 -0.08551314  0.059166717
## Lag3    0.060657175 -0.06928771 -0.071243639
## Lag4   -0.075675027 -0.06107462 -0.007825873
## Lag5    1.000000000 -0.05851741  0.011012698
## Volume -0.058517414  1.00000000 -0.033077783
## Today   0.011012698 -0.03307778  1.000000000

About 44% of the observations are classified as Down, and about 56% as Up.

table(Weekly$Direction)/sum(table(Weekly$Direction))
## 
##      Down        Up 
## 0.4444444 0.5555556

It is important to note that Direction is simply a nominal form of the Today feature. Today gives the percentage return for the current week, and Direction maps positive and negative returns to 'Up' and 'Down' respectively.

head(Weekly[,c(8,9)])
##    Today Direction
## 1 -0.270      Down
## 2 -2.576      Down
## 3  3.514        Up
## 4  0.712        Up
## 5  1.178        Up
## 6 -1.372      Down

10b

As can be seen below, the only statistically significant predictor (at the 5% significance level) is Lag2, with a p-value of about 0.03; Lag1 has a p-value of about 0.12.

glm.fit <- glm(Direction~.-Year-Today,data=Weekly,family="binomial")
summary(glm.fit)
## 
## Call:
## glm(formula = Direction ~ . - Year - Today, family = "binomial", 
##     data = Weekly)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6949  -1.2565   0.9913   1.0849   1.4579  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.26686    0.08593   3.106   0.0019 **
## Lag1        -0.04127    0.02641  -1.563   0.1181   
## Lag2         0.05844    0.02686   2.175   0.0296 * 
## Lag3        -0.01606    0.02666  -0.602   0.5469   
## Lag4        -0.02779    0.02646  -1.050   0.2937   
## Lag5        -0.01447    0.02638  -0.549   0.5833   
## Volume      -0.02274    0.03690  -0.616   0.5377   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1496.2  on 1088  degrees of freedom
## Residual deviance: 1486.4  on 1082  degrees of freedom
## AIC: 1500.4
## 
## Number of Fisher Scoring iterations: 4

10c

From the table below it can be observed that the model classifies about 56% of the observations correctly (on the training data used to fit it). It does well on the weeks the market goes up, correctly classifying about 92% of the 'Up' entries, but poorly on the weeks the market goes down, with only about 11% of those correctly classified.

glm.probs <- predict(glm.fit,type = "response")
glm.pred <- rep("Down",nrow(Weekly))
glm.pred[glm.probs>0.5] = "Up"

table(glm.pred,Weekly$Direction)
##         
## glm.pred Down  Up
##     Down   54  48
##     Up    430 557
mean(glm.pred == Weekly$Direction)
## [1] 0.5610652
557/(557+48)
## [1] 0.9206612
54/(430+54)
## [1] 0.1115702

10d

train <- Weekly[,"Year"] <= 2008

glm.fit <- glm(Direction~Lag2,data = Weekly,subset = train, family = "binomial")
summary(glm.fit)
## 
## Call:
## glm(formula = Direction ~ Lag2, family = "binomial", data = Weekly, 
##     subset = train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.536  -1.264   1.021   1.091   1.368  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.20326    0.06428   3.162  0.00157 **
## Lag2         0.05810    0.02870   2.024  0.04298 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1354.7  on 984  degrees of freedom
## Residual deviance: 1350.5  on 983  degrees of freedom
## AIC: 1354.5
## 
## Number of Fisher Scoring iterations: 4
glm.probs <- predict(glm.fit,Weekly[!train,],type = "response")

glm.pred <- rep("Down",nrow(Weekly))
glm.pred[glm.probs>0.5] = "Up"

table(glm.pred,Weekly[,"Direction"]) # Confusion Matrix.
##         
## glm.pred Down  Up
##     Down   63  85
##     Up    421 520
mean(glm.pred == Weekly[,"Direction"]) # Fraction of correct predictions.
## [1] 0.5353535
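Note that glm.probs contains one probability per held-out week (2009 and 2010), so the prediction vector should have the same length as the test set. A sketch of that comparison, reusing the objects above (output not shown):

# Build predictions only for the held-out 2009-2010 weeks
Direction.test <- Weekly$Direction[!train]

glm.pred <- rep("Down", sum(!train))
glm.pred[glm.probs > 0.5] <- "Up"

table(glm.pred, Direction.test)     # test-set confusion matrix
mean(glm.pred == Direction.test)    # test-set accuracy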

10e

library(MASS)

lda.fit <- lda(Direction~Lag2,data=Weekly,subset=train)
lda.fit
## Call:
## lda(Direction ~ Lag2, data = Weekly, subset = train)
## 
## Prior probabilities of groups:
##      Down        Up 
## 0.4477157 0.5522843 
## 
## Group means:
##             Lag2
## Down -0.03568254
## Up    0.26036581
## 
## Coefficients of linear discriminants:
##            LD1
## Lag2 0.4414162
lda.pred <- predict(lda.fit,Weekly[!train,])
lda.class <- lda.pred$class

table(lda.class,Weekly[!train,9])
##          
## lda.class Down Up
##      Down    9  5
##      Up     34 56
mean(lda.class == Weekly[!train,9])
## [1] 0.625

10f

qda.fit <- qda(Direction~Lag2,data=Weekly,subset=train)
qda.fit
## Call:
## qda(Direction ~ Lag2, data = Weekly, subset = train)
## 
## Prior probabilities of groups:
##      Down        Up 
## 0.4477157 0.5522843 
## 
## Group means:
##             Lag2
## Down -0.03568254
## Up    0.26036581
qda.pred <- predict(qda.fit,Weekly[!train,])
qda.class <- qda.pred$class

table(qda.class,Weekly[!train,9])
##          
## qda.class Down Up
##      Down    0  0
##      Up     43 61
mean(qda.class == Weekly[!train,9])
## [1] 0.5865385

10g

library(class)

train.X <- cbind(Weekly[train,3])
test.X <- cbind(Weekly[!train,3])

train.Direction <- Weekly[train,c(9)]
test.Direction <- Weekly[!train,c(9)]

knn.pred <- knn(train.X,test.X,train.Direction,k=1)

table(knn.pred,test.Direction)
##         test.Direction
## knn.pred Down Up
##     Down   21 30
##     Up     22 31
mean(knn.pred == test.Direction)
## [1] 0.5

10h

The best performing model is LDA, since it has the highest prediction accuracy on the held-out data (62.5%).

10i

# Logistic Regression

# Raise the classification threshold from 0.5 to 0.55
glm.fit <- glm(Direction~Lag2,data = Weekly,subset = train, family = "binomial")
summary(glm.fit)
## 
## Call:
## glm(formula = Direction ~ Lag2, family = "binomial", data = Weekly, 
##     subset = train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.536  -1.264   1.021   1.091   1.368  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.20326    0.06428   3.162  0.00157 **
## Lag2         0.05810    0.02870   2.024  0.04298 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1354.7  on 984  degrees of freedom
## Residual deviance: 1350.5  on 983  degrees of freedom
## AIC: 1354.5
## 
## Number of Fisher Scoring iterations: 4
glm.probs <- predict(glm.fit,Weekly[!train,],type = "response")

glm.pred <- rep("Down",nrow(Weekly))
glm.pred[glm.probs>0.55] = "Up"

table(glm.pred,Weekly[,"Direction"]) # Confusion Matrix.
##         
## glm.pred Down  Up
##     Down  196 266
##     Up    288 339
mean(glm.pred == Weekly[,"Direction"]) # Fraction of correct predictions.
## [1] 0.4912764
# Non Linear Transformation

# Up to Cubic power
glm.fit <- glm(Direction~Lag2+I(Lag2^2)+I(Lag2^3),data = Weekly,subset = train, family = "binomial")
summary(glm.fit)
## 
## Call:
## glm(formula = Direction ~ Lag2 + I(Lag2^2) + I(Lag2^3), family = "binomial", 
##     data = Weekly, subset = train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.194  -1.245   1.008   1.108   1.142  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept) 0.1608285  0.0714552   2.251   0.0244 *
## Lag2        0.0491970  0.0340597   1.444   0.1486  
## I(Lag2^2)   0.0095243  0.0072076   1.321   0.1864  
## I(Lag2^3)   0.0005092  0.0005445   0.935   0.3497  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1354.7  on 984  degrees of freedom
## Residual deviance: 1348.5  on 981  degrees of freedom
## AIC: 1356.5
## 
## Number of Fisher Scoring iterations: 4
glm.probs <- predict(glm.fit,Weekly[!train,],type = "response")

glm.pred <- rep("Down",nrow(Weekly))
glm.pred[glm.probs>0.5] = "Up"

table(glm.pred,Weekly[,"Direction"]) # Confusion Matrix.
##         
## glm.pred Down  Up
##       Up  484 605
mean(glm.pred == Weekly[,"Direction"]) # Fraction of correct predictions.
## [1] 0.5555556
# Square Root
glm.fit <- glm(Direction~sqrt(abs(Lag2)),data = Weekly,subset = train, family = "binomial")
summary(glm.fit)
## 
## Call:
## glm(formula = Direction ~ sqrt(abs(Lag2)), family = "binomial", 
##     data = Weekly, subset = train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.405  -1.263   1.058   1.093   1.136  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)
## (Intercept)      0.09488    0.15028   0.631    0.528
## sqrt(abs(Lag2))  0.09961    0.11788   0.845    0.398
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1354.7  on 984  degrees of freedom
## Residual deviance: 1354.0  on 983  degrees of freedom
## AIC: 1358
## 
## Number of Fisher Scoring iterations: 3
glm.probs <- predict(glm.fit,Weekly[!train,],type = "response")

glm.pred <- rep("Down",nrow(Weekly))
glm.pred[glm.probs>0.5] = "Up"

table(glm.pred,Weekly[,"Direction"]) # Confusion Matrix.
##         
## glm.pred Down  Up
##       Up  484 605
mean(glm.pred == Weekly[,"Direction"]) # Fraction of correct predictions.
## [1] 0.5555556
# Log
glm.fit <- glm(Direction~log10(abs(Lag2)),data = Weekly,subset = train, family = "binomial")
summary(glm.fit)
## 
## Call:
## glm(formula = Direction ~ log10(abs(Lag2)), family = "binomial", 
##     data = Weekly, subset = train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.305  -1.269   1.074   1.088   1.167  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)   
## (Intercept)       0.20916    0.06410   3.263   0.0011 **
## log10(abs(Lag2))  0.06823    0.13149   0.519   0.6038   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1354.7  on 984  degrees of freedom
## Residual deviance: 1354.4  on 983  degrees of freedom
## AIC: 1358.4
## 
## Number of Fisher Scoring iterations: 3
glm.probs <- predict(glm.fit,Weekly[!train,],type = "response")

glm.pred <- rep("Down",nrow(Weekly))
glm.pred[glm.probs>0.5] = "Up"

table(glm.pred,Weekly[,"Direction"]) # Confusion Matrix.
##         
## glm.pred Down  Up
##       Up  484 605
mean(glm.pred == Weekly[,"Direction"]) # Fraction of correct predictions.
## [1] 0.5555556
train.X <- cbind(Weekly[train,3])
test.X <- cbind(Weekly[!train,3])

train.Direction <- Weekly[train,c(9)]
test.Direction <- Weekly[!train,c(9)]


errors <- c()

maxK <- 100
step <- 2

for(j in seq(1,maxK,step)){
  knn.pred <- knn(train.X,test.X,train.Direction,k=j)
  table(knn.pred,test.Direction)
  errors <- c(1-mean(knn.pred == test.Direction),errors)
}

data <- cbind(seq(1,maxK,step),errors)
plot(data,type="l",xlab="k")

knn.pred <- knn(train.X,test.X,train.Direction,k=which.min(errors))
table(knn.pred,test.Direction)
##         test.Direction
## knn.pred Down Up
##     Down   18 24
##     Up     25 37
mean(knn.pred == test.Direction)
## [1] 0.5288462
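Note that the loop above prepends each new error rate, so errors ends up in reverse order of k, and which.min(errors) returns a position in that vector rather than a value of k. A sketch that keeps the error rates aligned with the k values being tried (output not shown):

ks <- seq(1, maxK, step)
errors <- sapply(ks, function(j){
  knn.pred <- knn(train.X, test.X, train.Direction, k = j)
  1 - mean(knn.pred == test.Direction)
})

plot(ks, errors, type = "l", xlab = "k", ylab = "test error rate")
best.k <- ks[which.min(errors)]   # the k with the lowest test error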

11

11a

mpg01 <- rep(0,nrow(Auto))
mpg01[Auto[,'mpg']>median(Auto[,'mpg'])] <- 1

mpg01 = as.factor(mpg01)
Data = data.frame(Auto,mpg01)


table(mpg01)
## mpg01
##   0   1 
## 196 196

11b

Horsepower, weight and displacement seem useful in predicting mpg01.

par(mfrow = c(2,2))
boxplot(acceleration~mpg01,Data,main="Accel~mpg01")
boxplot(weight~mpg01,Data,main="Weight~mpg01")
boxplot(horsepower~mpg01,Data,main="Horsep~mpg01")
boxplot(displacement~mpg01,Data,main="Disp~mpg01")

11c

Data.train = Data[1:196,]
Data.test = Data[196:392,]
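The split above keeps the observations in their original order, and row 196 appears in both halves. A randomized split is a common alternative; a sketch is given below using different object names, so the ordered split used in the results that follow is left untouched:

set.seed(1)                                   # reproducible split
train.id <- sample(nrow(Data), nrow(Data)/2)
Data.train.r <- Data[train.id, ]              # random training half
Data.test.r  <- Data[-train.id, ]             # random test half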

11d

Using an LDA model with the predictors weight and displacement (fit here on the full data set) yields a misclassification error of about 8.1% on the test rows. Adding the remaining predictors does not provide a significant improvement.

lda.fit = lda(mpg01~weight+displacement,Data)
lda.fit
## Call:
## lda(mpg01 ~ weight + displacement, data = Data)
## 
## Prior probabilities of groups:
##   0   1 
## 0.5 0.5 
## 
## Group means:
##     weight displacement
## 0 3620.403     273.1582
## 1 2334.765     115.6658
## 
## Coefficients of linear discriminants:
##                       LD1
## weight       -0.001011194
## displacement -0.006968032
lda.pred = predict(lda.fit, Data.test[,c('horsepower','weight','displacement')] )
table(lda.pred$class,Data.test[,'mpg01'])
##    
##       0   1
##   0  56   5
##   1  11 125
mean(lda.pred$class != Data.test[,'mpg01'])
## [1] 0.08121827

11e

The test error is slightly higher in this case: about 11% of the test data is misclassified.

qda.fit = qda(mpg01~horsepower+displacement,Data)
qda.fit
## Call:
## qda(mpg01 ~ horsepower + displacement, data = Data)
## 
## Prior probabilities of groups:
##   0   1 
## 0.5 0.5 
## 
## Group means:
##   horsepower displacement
## 0  130.11224     273.1582
## 1   78.82653     115.6658
qda.pred = predict(qda.fit, Data.test[,c('horsepower','weight','displacement')] )
table(qda.pred$class,Data.test[,'mpg01'])
##    
##       0   1
##   0  57  11
##   1  10 119
mean(qda.pred$class != Data.test[,'mpg01'])
## [1] 0.106599

11f

Logistic regression is the worst model so far, with about 19% of the test observations misclassified. Lowering the cutoff applied to the predicted values reduces the error to about 17%.

logit.fit = glm(mpg01~horsepower+displacement,family = binomial,Data.train)
summary(logit.fit)
## 
## Call:
## glm(formula = mpg01 ~ horsepower + displacement, family = binomial, 
##     data = Data.train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.30936  -0.17525  -0.00666   0.34107   2.11244  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   9.19005    2.03526   4.515 6.32e-06 ***
## horsepower   -0.04605    0.02242  -2.054     0.04 *  
## displacement -0.03501    0.00745  -4.700 2.60e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 251.761  on 195  degrees of freedom
## Residual deviance:  84.575  on 193  degrees of freedom
## AIC: 90.575
## 
## Number of Fisher Scoring iterations: 8
logit.pred = predict(logit.fit,Data.test[,c('horsepower','weight','displacement')])

# min P = 0.5
logit.class = ifelse(logit.pred>0.5,1,0)

table(logit.class,Data.test[,'mpg01'])
##            
## logit.class  0  1
##           0 65 35
##           1  2 95
mean(logit.class != Data.test[,'mpg01'])
## [1] 0.1878173
# min P = 0.25
logit.class = ifelse(logit.pred>0.25,1,0)

table(logit.class,Data.test[,'mpg01'])
##            
## logit.class  0  1
##           0 65 31
##           1  2 99
mean(logit.class != Data.test[,'mpg01'])
## [1] 0.1675127
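Note that predict.glm returns the linear predictor (log-odds) by default, so the cutoffs above are applied on the logit scale rather than to probabilities. A sketch that thresholds on the probability scale instead (output not shown):

# Probabilities rather than log-odds
logit.probs <- predict(logit.fit, Data.test[,c('horsepower','weight','displacement')],
                       type = "response")
logit.class <- ifelse(logit.probs > 0.5, 1, 0)

mean(logit.class != Data.test[,'mpg01'])   # test error with a 0.5 probability cutoff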

11g

The test error lies between roughly 0.13 and 0.18. In this run the lowest error occurs at k = 2, although the best k varies from run to run because KNN breaks ties between neighbors at random.

library('class')
knn.preds = sapply(1:20,function(x){
    pred = knn(Data.train[,c('horsepower','weight','displacement')],
               Data.test[,c('horsepower','weight','displacement')], 
                cl = Data.train[,'mpg01'],k=x,l=x/2)
    mean(pred != Data.test[,'mpg01'])
})


plot(1/(1:20),knn.preds,type = 'l',xlab='1/K')

summary(knn.preds)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1320  0.1510  0.1574  0.1561  0.1624  0.1726
which.min(knn.preds)
## [1] 2
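Because the three predictors are on very different scales (weight is in the thousands), the Euclidean distances used by KNN are dominated by weight. A sketch of the same sweep on standardized predictors, with a seed so the random tie-breaking is reproducible (output not shown):

vars <- c('horsepower','weight','displacement')

# Standardize the test set with the training means and standard deviations
scaled.train <- scale(Data.train[,vars])
scaled.test  <- scale(Data.test[,vars],
                      center = attr(scaled.train, "scaled:center"),
                      scale  = attr(scaled.train, "scaled:scale"))

set.seed(1)
knn.err <- sapply(1:20, function(k){
  pred <- knn(scaled.train, scaled.test, cl = Data.train[,'mpg01'], k = k)
  mean(pred != Data.test[,'mpg01'])
})
which.min(knn.err)   # k with the lowest test error on scaled predictors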

12

12a

Power = function(){2^3}
Power()
## [1] 8

12b

Power2 = function(x,a){x^a}
Power2(3,8)
## [1] 6561

12c

Power2(10,3)
## [1] 1000
Power2(8,17)
## [1] 2.2518e+15
Power2(131,3)
## [1] 2248091

12d

Power3 = function(x,a){ result=x^a; return(result)}

12e

x = 1:10
y = sapply(x,function(x){Power3(x,2)})

plot(x,y,log='x',main='x ~ x^2',type='l')

12f

PlotPower = function(x,a){
  y = sapply(x,function(x){x^a})
  plot(x,y,main=paste('x~x^',a,sep = ''),type='l')
}
PlotPower(1:10,3)

13

library('MASS')
mcrim = median(Boston[,'crim'])
mcrim01 = ifelse(Boston[,'crim']>mcrim,1,0)  # create label: 1 if above the median crime rate, 0 otherwise.

table(mcrim01)
## mcrim01
##   0   1 
## 253 253
Data = cbind(Boston,mcrim01)
par(mfrow = c(2,2))
boxplot(medv~mcrim01,Data,main='medv~mcrim01')
boxplot(lstat~mcrim01,Data,main='lstat~mcrim01')
boxplot(dis~mcrim01,Data,main='dis~mcrim01')
boxplot(age~mcrim01,Data,main='age~mcrim01')

From the plots above we can read the following: although there is considerable variation, towns with above-median crime tend to have lower median home values. A higher percentage of lower-status population is positively associated with the crime rate, and the same can be said of the proportion of homes built before 1940.

Interestingly, the weighted mean of distances to employment centers has a strong negative association with crime: high-crime towns tend to be closer to the employment centers.
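As a possible next step (the exercise asks for logistic regression, LDA, and KNN fits), here is a minimal sketch of a logistic regression using the variables examined above, with a hypothetical 50/50 train/test split (results not shown):

set.seed(1)
train.id <- sample(nrow(Data), nrow(Data)/2)

crim.fit   <- glm(mcrim01 ~ medv + lstat + dis + age, data = Data[train.id,], family = binomial)
crim.probs <- predict(crim.fit, Data[-train.id,], type = "response")
crim.pred  <- ifelse(crim.probs > 0.5, 1, 0)

table(crim.pred, Data$mcrim01[-train.id])    # test-set confusion matrix
mean(crim.pred != Data$mcrim01[-train.id])   # test error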