Chapter 2 Problem 2

Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide n and p.

(a) We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary.

This is a regression problem since CEO salary is likely quantitative. This is also an inference problem as we are interested in understanding which factors affect CEO salary.
n = 500 (firms in the US)
p = 3: profit, number of employees, industry

(b) We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price, and ten other variables.

This is a classification problem since whether a new product will be a success or a failure is a binary variable. This is also a prediction problem since we’re interested in whether the new product will be a success or a failure.
n = 20 similar products previously launched
p = 13: price charged, marketing budget, competition price, and ten other variables

(c) We are interested in predicting the % change in the USD/Euro exchange rate in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the USD/Euro, the % change in the US market, the % change in the British market, and the % change in the German market.

This is a regression scenario since % change is quantitative. This is also a prediction problem since “We are interested in predicting the % change in the USD/Euro exchange rate.”
n = 52 weeks of 2012 weekly data
p = 3: % change in US market, % change in British market, % change in German market

Chapter 2 Problem 5

What are the advantages and disadvantages of a very flexible (versus a less flexible) approach for regression or classification? Under what circumstances might a more flexible approach be preferred to a less flexible approach? When might a less flexible approach be preferred?

A very flexible approach can fit a much wider range of shapes for f, so it generally obtains a better fit when the true relationship is non-linear and it has lower bias. Its disadvantages are that it requires estimating a greater number of parameters, it can follow the noise too closely (overfitting), it has higher variance, and it is harder to interpret. A more flexible approach is preferred when we care mainly about prediction accuracy rather than interpretability of the results. A less flexible approach is preferred when we are interested in inference and the interpretability of the results.
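For intuition, here is a toy simulation (my own sketch, not part of the original question): a rigid linear fit versus a degree-15 polynomial fit to the same noisy sample whose true relationship is linear. The flexible fit always wins on training error but typically does no better, and often worse, on fresh data.

set.seed(1)
x <- runif(100, 0, 5)
y <- 2 + 0.5 * x + rnorm(100, sd = 0.5)        # the true relationship is linear
rigid    <- lm(y ~ x)                          # inflexible: two parameters, low variance, easy to interpret
flexible <- lm(y ~ poly(x, 15))                # very flexible: many parameters, low bias, high variance
mean(resid(rigid)^2); mean(resid(flexible)^2)  # training error always favors the flexible fit
x.new <- runif(100, 0.5, 4.5)                  # new data drawn inside the training range
y.new <- 2 + 0.5 * x.new + rnorm(100, sd = 0.5)
mean((y.new - predict(rigid, data.frame(x = x.new)))^2)     # test error of the rigid fit
mean((y.new - predict(flexible, data.frame(x = x.new)))^2)  # typically no better, often worse, for the flexible fit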

Chapter 2 Problem 6

Describe the differences between a parametric and a non-parametric statistical learning approach. What are the advantages of a parametric approach to regression or classification (as opposed to a nonparametric approach)? What are its disadvantages?

A parametric approach reduces the problem of estimating f down to one of estimating a set of parameters because it assumes a form for f. A non-parametric approach does not assume a functional form for f and so requires a very large number of observations to accurately estimate f. The advantages of a parametric approach to regression or classification are the simplifying of modeling f to a few parameters and not as many observations are required compared to a non-parametric approach. The disadvantages of a parametric approach to regression or classification are a potential to inaccurately estimate f if the form of f assumed is wrong or to overfit the observations if more flexible models are used.
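As a small illustration of the distinction (again a toy example with simulated data, not from the text): the parametric model below is summarized entirely by two estimated coefficients, while the non-parametric loess smoother assumes no functional form for f and lets the data trace out the shape instead.

set.seed(2)
x <- runif(200, 0, 10)
y <- log(1 + x) + rnorm(200, sd = 0.2)    # f is non-linear and unknown to us
para    <- lm(y ~ x)                      # parametric: assumes f is linear, estimates 2 coefficients
nonpara <- loess(y ~ x)                   # non-parametric: no functional form assumed
coef(para)                                # the whole parametric fit reduces to these two numbers
plot(x, y, col = "grey")
abline(para, col = "red")                                  # biased when the linearity assumption is wrong
lines(sort(x), predict(nonpara)[order(x)], col = "blue")   # tracks the curve, but needs many observations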

Chapter 3 Problem 10

This question should be answered using the Carseats data set.

library(ISLR)
attach(Carseats)

(a) Fit a multiple regression model to predict Sales using Price,Urban, and US.

fit<-lm(Sales~Price+Urban+US)
summary(fit)

Call:
lm(formula = Sales ~ Price + Urban + US)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.9206 -1.6220 -0.0564  1.5786  7.0581 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
Price       -0.054459   0.005242 -10.389  < 2e-16 ***
UrbanYes    -0.021916   0.271650  -0.081    0.936    
USYes        1.200573   0.259042   4.635 4.86e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.472 on 396 degrees of freedom
Multiple R-squared:  0.2393,    Adjusted R-squared:  0.2335 
F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

(b) Provide an interpretation of each coefficient in the model. Be careful: some of the variables in the model are qualitative!

From the table above, Price and US are significant predictors of Sales, while Urban is not. Because Sales is recorded in thousands of car seats sold, the coefficients say that, holding the other predictors fixed, each $1 increase in price is associated with roughly 54 fewer car seats sold (-0.0545 thousand), and stores in the US sell about 1,201 more car seats on average than stores outside the US. The UrbanYes coefficient is tiny and far from significant (p = 0.936), so there is no evidence that urban location affects Sales.
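To make the units concrete, one can predict Sales for hypothetical stores that differ only in the predictor of interest (the prices and settings below are made-up illustration values):

predict(fit, newdata = data.frame(Price = c(100, 101), Urban = "Yes", US = "Yes"))   # predictions differ by -0.0545, about 54 seats per $1 of price
predict(fit, newdata = data.frame(Price = 100, Urban = "Yes", US = c("Yes", "No")))  # the US store is predicted to sell about 1.2006 thousand more seats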

(c) Write out the model in equation form, being careful to handle the qualitative variables properly.

\(Sales = 13.043469 - 0.054459 \times Price - 0.021916 \times Urban_{Yes} + 1.200573 \times US_{Yes}\), where \(Urban_{Yes} = 1\) if the store is in an urban location (0 otherwise) and \(US_{Yes} = 1\) if the store is in the US (0 otherwise).

(d) For which of the predictors can you reject the null hypothesis \(H_0: \beta_j = 0\)?

Price and US

(e) On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.

fit<-lm(Sales~Price+US)
summary(fit)

Call:
lm(formula = Sales ~ Price + US)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.9269 -1.6286 -0.0574  1.5766  7.0515 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
Price       -0.05448    0.00523 -10.416  < 2e-16 ***
USYes        1.19964    0.25846   4.641 4.71e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.469 on 397 degrees of freedom
Multiple R-squared:  0.2393,    Adjusted R-squared:  0.2354 
F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

(f) How well do the models in (a) and (e) fit the data?

Not well. Each model explains only about 24% of the variance in Sales (R-squared of roughly 0.239 for both), so most of the variation is left unexplained.
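Another quick way to gauge this (a small addition of my own): compare the residual standard error of the reduced model with the average level of Sales.

sigma(fit) / mean(Sales)   # roughly 2.47 / 7.5, so the typical prediction error is about a third of average sales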

(g) Using the model from (e), obtain 95 % confidence intervals for the coefficient(s).

confint(fit)
                  2.5 %      97.5 %
(Intercept) 11.79032020 14.27126531
Price       -0.06475984 -0.04419543
USYes        0.69151957  1.70776632

(h) Is there evidence of outliers or high leverage observations in the model from (e)?
R has built-in functions that can help us identify influential points using various statistics with one simple command. Researchers have suggested several cutoffs for how much influence an observation may have before it is flagged as an outlier or high-leverage point; for example, points whose leverage is well above the average leverage \(\frac{(p+1)}{n}\), which for us is \(\frac{(2+1)}{400} = 0.0075\).
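A minimal sketch of these rules of thumb, using the common (but not universal) cutoffs of twice the average leverage and studentized residuals beyond plus or minus 2:

lev <- hatvalues(fit)             # leverage of each observation
avg.lev <- (2 + 1) / 400          # (p + 1) / n = 0.0075
which(lev > 2 * avg.lev)          # high-leverage observations by the 2x rule of thumb
which(abs(rstudent(fit)) > 2)     # large studentized residuals: potential outliers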

par(mfrow=c(2,2))
plot(fit)

summary(influence.measures(fit))
Potentially influential observations of
     lm(formula = Sales ~ Price + US) :

    dfb.1_ dfb.Pric dfb.USYs dffit   cov.r   cook.d hat    
26   0.24  -0.18    -0.17     0.28_*  0.97_*  0.03   0.01  
29  -0.10   0.10    -0.10    -0.18    0.97_*  0.01   0.01  
43  -0.11   0.10     0.03    -0.11    1.05_*  0.00   0.04_*
50  -0.10   0.17    -0.17     0.26_*  0.98    0.02   0.01  
51  -0.05   0.05    -0.11    -0.18    0.95_*  0.01   0.00  
58  -0.05  -0.02     0.16    -0.20    0.97_*  0.01   0.01  
69  -0.09   0.10     0.09     0.19    0.96_*  0.01   0.01  
126 -0.07   0.06     0.03    -0.07    1.03_*  0.00   0.03_*
160  0.00   0.00     0.00     0.01    1.02_*  0.00   0.02  
166  0.21  -0.23    -0.04    -0.24    1.02    0.02   0.03_*
172  0.06  -0.07     0.02     0.08    1.03_*  0.00   0.02  
175  0.14  -0.19     0.09    -0.21    1.03_*  0.02   0.03_*
210 -0.14   0.15    -0.10    -0.22    0.97_*  0.02   0.01  
270 -0.03   0.05    -0.03     0.06    1.03_*  0.00   0.02  
298 -0.06   0.06    -0.09    -0.15    0.97_*  0.01   0.00  
314 -0.05   0.04     0.02    -0.05    1.03_*  0.00   0.02_*
353 -0.02   0.03     0.09     0.15    0.97_*  0.01   0.00  
357  0.02  -0.02     0.02    -0.03    1.03_*  0.00   0.02  
368  0.26  -0.23    -0.11     0.27_*  1.01    0.02   0.02_*
377  0.14  -0.15     0.12     0.24    0.95_*  0.02   0.01  
384  0.00   0.00     0.00     0.00    1.02_*  0.00   0.02  
387 -0.03   0.04    -0.03     0.05    1.02_*  0.00   0.02  
396 -0.05   0.05     0.08     0.14    0.98_*  0.01   0.00  

R flags a number of observations that exceed the usual cutoffs for one or more influence measures. A common practice is to report both the regression fit to all of the data and a fit with the flagged observations removed, and compare the two.

outlying.obs<-c(26,29,43,50,51,58,69,126,160,166,172,175,210,270,298,314,353,357,368,377,384,387,396)
Carseats.small<-Carseats[-outlying.obs,]
fit2<-lm(Sales~Price+US,data=Carseats.small)
summary(fit2)

Call:
lm(formula = Sales ~ Price + US, data = Carseats.small)

Residuals:
   Min     1Q Median     3Q    Max 
-5.263 -1.605 -0.039  1.590  5.428 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 12.925232   0.665259  19.429  < 2e-16 ***
Price       -0.053973   0.005511  -9.794  < 2e-16 ***
USYes        1.255018   0.248856   5.043 7.15e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.29 on 374 degrees of freedom
Multiple R-squared:  0.2387,    Adjusted R-squared:  0.2347 
F-statistic: 58.64 on 2 and 374 DF,  p-value: < 2.2e-16

With these potential outliers and influential observations removed, very little changes relative to the model fit to the full data set: the 95% confidence intervals from the full-data model contain the coefficient estimates from the model with those observations removed. It is safe to keep all of the observations in the model.
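As a quick check of that claim, line up the full-data intervals from (g) against the estimates from the reduced data:

confint(fit)   # 95% intervals from the full data, as reported in (g)
coef(fit2)     # estimates with the flagged observations removed; each lies inside the corresponding interval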

Chapter 4 Problem 10.

This question should be answered using the Weekly data set, which is part of the ISLR package. This data is similar in nature to the Smarket data from this chapter’s lab, except that it contains 1,089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.
(a) Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns?

library(ISLR)
summary(Weekly)
      Year           Lag1               Lag2               Lag3               Lag4         
 Min.   :1990   Min.   :-18.1950   Min.   :-18.1950   Min.   :-18.1950   Min.   :-18.1950  
 1st Qu.:1995   1st Qu.: -1.1540   1st Qu.: -1.1540   1st Qu.: -1.1580   1st Qu.: -1.1580  
 Median :2000   Median :  0.2410   Median :  0.2410   Median :  0.2410   Median :  0.2380  
 Mean   :2000   Mean   :  0.1506   Mean   :  0.1511   Mean   :  0.1472   Mean   :  0.1458  
 3rd Qu.:2005   3rd Qu.:  1.4050   3rd Qu.:  1.4090   3rd Qu.:  1.4090   3rd Qu.:  1.4090  
 Max.   :2010   Max.   : 12.0260   Max.   : 12.0260   Max.   : 12.0260   Max.   : 12.0260  
      Lag5              Volume            Today          Direction 
 Min.   :-18.1950   Min.   :0.08747   Min.   :-18.1950   Down:484  
 1st Qu.: -1.1660   1st Qu.:0.33202   1st Qu.: -1.1540   Up  :605  
 Median :  0.2340   Median :1.00268   Median :  0.2410             
 Mean   :  0.1399   Mean   :1.57462   Mean   :  0.1499             
 3rd Qu.:  1.4050   3rd Qu.:2.05373   3rd Qu.:  1.4050             
 Max.   : 12.0260   Max.   :9.32821   Max.   : 12.0260             
pairs(Weekly)

cor(Weekly[, -9])
              Year         Lag1        Lag2        Lag3         Lag4         Lag5      Volume
Year    1.00000000 -0.032289274 -0.03339001 -0.03000649 -0.031127923 -0.030519101  0.84194162
Lag1   -0.03228927  1.000000000 -0.07485305  0.05863568 -0.071273876 -0.008183096 -0.06495131
Lag2   -0.03339001 -0.074853051  1.00000000 -0.07572091  0.058381535 -0.072499482 -0.08551314
Lag3   -0.03000649  0.058635682 -0.07572091  1.00000000 -0.075395865  0.060657175 -0.06928771
Lag4   -0.03112792 -0.071273876  0.05838153 -0.07539587  1.000000000 -0.075675027 -0.06107462
Lag5   -0.03051910 -0.008183096 -0.07249948  0.06065717 -0.075675027  1.000000000 -0.05851741
Volume  0.84194162 -0.064951313 -0.08551314 -0.06928771 -0.061074617 -0.058517414  1.00000000
Today  -0.03245989 -0.075031842  0.05916672 -0.07124364 -0.007825873  0.011012698 -0.03307778
              Today
Year   -0.032459894
Lag1   -0.075031842
Lag2    0.059166717
Lag3   -0.071243639
Lag4   -0.007825873
Lag5    0.011012698
Volume -0.033077783
Today   1.000000000

Year and Volume appear to have a strong positive relationship (correlation of about 0.84). The remaining pairwise relationships are weak: the lag variables are essentially uncorrelated with one another and with Today, so the correlation matrix suggests no obvious predictor of Direction.
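A single time-series plot of Volume shows that upward trend more clearly than the full scatterplot matrix (a small optional addition):

plot(Weekly$Volume, type = "l", xlab = "Week (1990-2010)", ylab = "Volume",
     main = "Weekly trading volume over time")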

(b) Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?

attach(Weekly)
glm.fit = glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Weekly, family = binomial)
summary(glm.fit)

Call:
glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
    Volume, family = binomial, data = Weekly)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.6949  -1.2565   0.9913   1.0849   1.4579  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept)  0.26686    0.08593   3.106   0.0019 **
Lag1        -0.04127    0.02641  -1.563   0.1181   
Lag2         0.05844    0.02686   2.175   0.0296 * 
Lag3        -0.01606    0.02666  -0.602   0.5469   
Lag4        -0.02779    0.02646  -1.050   0.2937   
Lag5        -0.01447    0.02638  -0.549   0.5833   
Volume      -0.02274    0.03690  -0.616   0.5377   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1496.2  on 1088  degrees of freedom
Residual deviance: 1486.4  on 1082  degrees of freedom
AIC: 1500.4

Number of Fisher Scoring iterations: 4

Lag2 appears to be the only statistically significant predictor, showing some association with Direction (p-value of about 0.03).

(c) Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.

glm.probs = predict(glm.fit, type = "response")
glm.pred = rep("Down", length(glm.probs))
glm.pred[glm.probs > 0.5] = "Up"
table(glm.pred, Direction)
        Direction
glm.pred Down  Up
    Down   54  48
    Up    430 557

Overall fraction of correct predictions: (54+557)/(54+557+48+430) = 56.1%. The model almost always predicts Up: it is correct 557/(557+48) = 92.1% of the time in weeks when the market actually goes up, but only 54/(54+430) = 11.2% of the time in weeks when the market goes down.
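Those class-specific rates come straight from the confusion matrix above:

mean(glm.pred == Direction)   # overall accuracy: (54 + 557) / 1089 = 0.561
557 / (557 + 48)              # correct rate in weeks the market actually went Up   = 0.921
54 / (54 + 430)               # correct rate in weeks the market actually went Down = 0.112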

(d) Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data (that is, the data from 2009 and 2010).

train = (Year < 2009)
Weekly.0910 = Weekly[!train, ]
glm.fit = glm(Direction ~ Lag2, data = Weekly, family = binomial, subset = train)
glm.probs = predict(glm.fit, Weekly.0910, type = "response")
glm.pred = rep("Down", length(glm.probs))
glm.pred[glm.probs > 0.5] = "Up"
Direction.0910 = Direction[!train]
table(glm.pred, Direction.0910)
        Direction.0910
glm.pred Down Up
    Down    9  5
    Up     34 56
mean(glm.pred == Direction.0910)
[1] 0.625

(e) Repeat (d) using LDA.

library(MASS)
lda.fit = lda(Direction ~ Lag2, data = Weekly, subset = train)
lda.pred = predict(lda.fit, Weekly.0910)
table(lda.pred$class, Direction.0910)
      Direction.0910
       Down Up
  Down    9  5
  Up     34 56
mean(lda.pred$class == Direction.0910)
[1] 0.625

(f) Repeat (d) using QDA.

qda.fit = qda(Direction ~ Lag2, data = Weekly, subset = train)
qda.class = predict(qda.fit, Weekly.0910)$class
table(qda.class, Direction.0910)
         Direction.0910
qda.class Down Up
     Down    0  0
     Up     43 61
mean(qda.class == Direction.0910)
[1] 0.5865385

Amusingly, our quadratic discriminant predicts Up every single week in the test period and is still correct 58.7% of the time.
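That 58.7% is simply the proportion of Up weeks in the 2009-2010 test period, i.e. the accuracy of a naive always-Up rule:

mean(Direction.0910 == "Up")   # 61 / 104, about 0.587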

(g) Repeat (d) using KNN with K = 1.

library(class)
train.X = as.matrix(Lag2[train])
test.X = as.matrix(Lag2[!train])
train.Direction = Direction[train]
set.seed(1)
knn.pred = knn(train.X, test.X, train.Direction, k = 1)
table(knn.pred, Direction.0910)
        Direction.0910
knn.pred Down Up
    Down   21 30
    Up     22 31
mean(knn.pred == Direction.0910)
[1] 0.5

(h) Which of these methods appears to provide the best results on this data?
Logistic regression and LDA gave the highest accuracy on the held-out data (62.5%).

(i) Experiment with different combinations of predictors, including possible transformations and interactions, for each of the methods. Report the variables, method, and associated confusion matrix that appears to provide the best results on the held out data. Note that you should also experiment with values for K in the KNN classifier.

# Logistic regression with Lag2:Lag1
glm.fit = glm(Direction ~ Lag2:Lag1, data = Weekly, family = binomial, subset = train)
glm.probs = predict(glm.fit, Weekly.0910, type = "response")
glm.pred = rep("Down", length(glm.probs))
glm.pred[glm.probs > 0.5] = "Up"
Direction.0910 = Direction[!train]
table(glm.pred, Direction.0910)
        Direction.0910
glm.pred Down Up
    Down    1  1
    Up     42 60
mean(glm.pred == Direction.0910)
[1] 0.5865385
# LDA with Lag2 interaction with Lag1
lda.fit = lda(Direction ~ Lag2:Lag1, data = Weekly, subset = train)
lda.class = predict(lda.fit, Weekly.0910)$class
table(lda.class, Direction.0910)
         Direction.0910
lda.class Down Up
     Down    0  1
     Up     43 60
mean(lda.class == Direction.0910)
[1] 0.5769231
# QDA with sqrt(abs(Lag2))
qda.fit = qda(Direction ~ Lag2 + sqrt(abs(Lag2)), data = Weekly, subset = train)
qda.class = predict(qda.fit, Weekly.0910)$class
table(qda.class, Direction.0910)
         Direction.0910
qda.class Down Up
     Down   12 13
     Up     31 48
mean(qda.class == Direction.0910)
[1] 0.5769231
# KNN k =10
knn.pred = knn(train.X, test.X, train.Direction, k = 10)
table(knn.pred, Direction.0910)
        Direction.0910
knn.pred Down Up
    Down   17 18
    Up     26 43
mean(knn.pred == Direction.0910)
[1] 0.5769231
# KNN k = 100
knn.pred = knn(train.X, test.X, train.Direction, k = 100)
table(knn.pred, Direction.0910)
        Direction.0910
knn.pred Down Up
    Down    9 12
    Up     34 49
mean(knn.pred == Direction.0910)
[1] 0.5576923

The original Lag2-only LDA and logistic regression models from (d) and (e) outperform all of these variations, with an accuracy of 62.5%.

Chapter 4 Problem 11.

In this problem, you will develop a model to predict whether a given car gets high or low gas mileage based on the Auto data set.

(a) Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median. You can compute the median using the median() function. Note you may find it helpful to use the data.frame() function to create a single data set containing both mpg01 and the other Auto variables.

library(ISLR)
summary(Auto)
      mpg          cylinders      displacement     horsepower        weight      acceleration  
 Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0   Min.   :1613   Min.   : 8.00  
 1st Qu.:17.00   1st Qu.:4.000   1st Qu.:105.0   1st Qu.: 75.0   1st Qu.:2225   1st Qu.:13.78  
 Median :22.75   Median :4.000   Median :151.0   Median : 93.5   Median :2804   Median :15.50  
 Mean   :23.45   Mean   :5.472   Mean   :194.4   Mean   :104.5   Mean   :2978   Mean   :15.54  
 3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:275.8   3rd Qu.:126.0   3rd Qu.:3615   3rd Qu.:17.02  
 Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0   Max.   :5140   Max.   :24.80  
                                                                                               
      year           origin                      name         mpg01    
 Min.   :70.00   Min.   :1.000   amc matador       :  5   Min.   :0.0  
 1st Qu.:73.00   1st Qu.:1.000   ford pinto        :  5   1st Qu.:0.0  
 Median :76.00   Median :1.000   toyota corolla    :  5   Median :0.5  
 Mean   :75.98   Mean   :1.577   amc gremlin       :  4   Mean   :0.5  
 3rd Qu.:79.00   3rd Qu.:2.000   amc hornet        :  4   3rd Qu.:1.0  
 Max.   :82.00   Max.   :3.000   chevrolet chevette:  4   Max.   :1.0  
                                 (Other)           :365                
attach(Auto)
mpg01 = rep(0, length(mpg))
mpg01[mpg > median(mpg)] = 1
Auto = data.frame(Auto, mpg01)

(b) Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.

cor(Auto[, -9])
                    mpg  cylinders displacement horsepower     weight acceleration       year
mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442    0.4233285  0.5805410
cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273   -0.5046834 -0.3456474
displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944   -0.5438005 -0.3698552
horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377   -0.6891955 -0.4163615
weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000   -0.4168392 -0.3091199
acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392    1.0000000  0.2903161
year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199    0.2903161  1.0000000
origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054    0.2127458  0.1815277
mpg01         0.8369392 -0.7591939   -0.7534766 -0.6670526 -0.7577566    0.3468215  0.4299042
mpg01.1       0.8369392 -0.7591939   -0.7534766 -0.6670526 -0.7577566    0.3468215  0.4299042
                 origin      mpg01    mpg01.1
mpg           0.5652088  0.8369392  0.8369392
cylinders    -0.5689316 -0.7591939 -0.7591939
displacement -0.6145351 -0.7534766 -0.7534766
horsepower   -0.4551715 -0.6670526 -0.6670526
weight       -0.5850054 -0.7577566 -0.7577566
acceleration  0.2127458  0.3468215  0.3468215
year          0.1815277  0.4299042  0.4299042
origin        1.0000000  0.5136984  0.5136984
mpg01         0.5136984  1.0000000  1.0000000
mpg01.1       0.5136984  1.0000000  1.0000000
pairs(Auto)  # doesn't work well since mpg01 is 0 or 1

mpg01 is strongly negatively correlated with cylinders, displacement, horsepower, and weight, so these look like the most useful predictors. It is also strongly positively correlated with mpg itself (about 0.84), as expected.
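Boxplots of the strongest candidates against mpg01, as the question suggests, make the same point graphically (a small additional sketch):

par(mfrow = c(2, 2))
boxplot(cylinders ~ mpg01, data = Auto, main = "cylinders vs mpg01")
boxplot(displacement ~ mpg01, data = Auto, main = "displacement vs mpg01")
boxplot(horsepower ~ mpg01, data = Auto, main = "horsepower vs mpg01")
boxplot(weight ~ mpg01, data = Auto, main = "weight vs mpg01")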

(c) Split the data into a training set and a test set.

train = (year%%2 == 0)  # if the year is even
test = !train
Auto.train = Auto[train, ]
Auto.test = Auto[test, ]
mpg01.test = mpg01[test]

(d) Perform LDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

# LDA
library(MASS)
lda.fit = lda(mpg01 ~ cylinders + weight + displacement + horsepower, data = Auto, subset = train)
lda.pred = predict(lda.fit, Auto.test)
mean(lda.pred$class != mpg01.test)
[1] 0.1263736

(e) Perform QDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

# QDA
qda.fit = qda(mpg01 ~ cylinders + weight + displacement + horsepower, data = Auto, subset = train)
qda.pred = predict(qda.fit, Auto.test)
mean(qda.pred$class != mpg01.test)
[1] 0.1318681

(f) Perform logistic regression on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

# Logistic regression
glm.fit = glm(mpg01 ~ cylinders + weight + displacement + horsepower, data = Auto, 
    family = binomial, subset = train)
glm.probs = predict(glm.fit, Auto.test, type = "response")
glm.pred = rep(0, length(glm.probs))
glm.pred[glm.probs > 0.5] = 1
mean(glm.pred != mpg01.test)
[1] 0.1208791

(g) Perform KNN on the training data, with several values of K, in order to predict mpg01. Use only the variables that seemed most associated with mpg01 in (b). What test errors do you obtain? Which value of K seems to perform the best on this data set?

library(class)
train.X = cbind(cylinders, weight, displacement, horsepower)[train, ]
test.X = cbind(cylinders, weight, displacement, horsepower)[test, ]
train.mpg01 = mpg01[train]
set.seed(1)
# KNN(k=1)
knn.pred = knn(train.X, test.X, train.mpg01, k = 1)
mean(knn.pred != mpg01.test)
[1] 0.1538462
# KNN(k=10)
knn.pred = knn(train.X, test.X, train.mpg01, k = 10)
mean(knn.pred != mpg01.test)
[1] 0.1648352
# KNN(k=100)
knn.pred = knn(train.X, test.X, train.mpg01, k = 100)
mean(knn.pred != mpg01.test)
[1] 0.1428571
K     Test error rate
1     15.4%
10    16.5%
100   14.3%

K = 100 gives the lowest test error (14.3%) of the values tried, so it seems to perform best on this data set.
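To scan more values of K than the three tried above, a short loop can tabulate the test errors (the grid of K values below is an arbitrary choice):

ks <- c(1, 5, 10, 25, 50, 100)
err <- sapply(ks, function(k) {
  set.seed(1)
  pred <- knn(train.X, test.X, train.mpg01, k = k)
  mean(pred != mpg01.test)
})
data.frame(K = ks, test.error = round(err, 3))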
