Source file ⇒ seminar_5_.Rmd

Linear Regression

1. Simple Linear Regression

The MASS library contains the Boston data set, which records medv (median house value) for 506 neighborhoods around Boston.

We will seek to predict medv using 13 predictors such as

  • rm: average number of rooms per house
  • age: average age of houses
  • lstat: percent of households with low socioeconomic status
names(Boston)
##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"
lm.fit = lm(medv ~ lstat, data=Boston)
summary(lm.fit)
## 
## Call:
## lm(formula = medv ~ lstat, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.168  -3.990  -1.318   2.034  24.500 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 34.55384    0.56263   61.41   <2e-16 ***
## lstat       -0.95005    0.03873  -24.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared:  0.5441, Adjusted R-squared:  0.5432 
## F-statistic: 601.6 on 1 and 504 DF,  p-value: < 2.2e-16

We can use the names() function in order to find out what other pieces of information are stored in lm.fit.

names(lm.fit)
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"

We can extract these quantities by name (e.g., lm.fit$coefficients), but it is safer to use extractor functions such as coef() to access them.

coef(lm.fit)
## (Intercept)       lstat 
##  34.5538409  -0.9500494

In order to obtain a confidence interval for the coefficient estimates, we can use the confint() command.

confint(lm.fit)
##                 2.5 %     97.5 %
## (Intercept) 33.448457 35.6592247
## lstat       -1.026148 -0.8739505

The predict() function can be used to produce confidence intervals and prediction intervals for the prediction of medv for a given value of lstat.

predict(lm.fit, data.frame(lstat=c(5, 10, 15)), interval="confidence")
##        fit      lwr      upr
## 1 29.80359 29.00741 30.59978
## 2 25.05335 24.47413 25.63256
## 3 20.30310 19.73159 20.87461
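
A prediction interval for an individual response can be obtained in the same way by setting interval="prediction" (a minimal sketch; prediction intervals are wider than confidence intervals because they also account for the irreducible error):

predict(lm.fit, data.frame(lstat=c(5, 10, 15)), interval="prediction")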

We can plot medv and lstat along with the least squares regression line using the plot() and abline() functions. The abline() function can be used to draw any line: to draw a line with intercept a and slope b, we type abline(a, b). The lwd=3 argument increases the width of the regression line by a factor of 3, and the pch option selects the plotting symbol.

plot(Boston$lstat, Boston$medv, pch="+")
abline(lm.fit, lwd=3, col="red")
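
The call abline(lm.fit) picks up the fitted coefficients automatically; it is equivalent to passing the intercept and slope explicitly, for example (a minimal sketch that redraws the same line):

abline(a=coef(lm.fit)[1], b=coef(lm.fit)[2], lwd=3, col="red")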

Now we examine some diagnostic plots. Four diagnostic plots are produced automatically by applying the plot() function directly to the output from lm(). Using par(), we can tell R to split the display into separate panels so that multiple plots can be viewed simultaneously.

par(mfrow=c(2, 2))
plot(lm.fit)

Alternatively, we can compute the residuals from a linear regression fit using the residuals() function. The function rstudent() will return the studentized residuals.

plot(predict(lm.fit), residuals(lm.fit)) # residuals vs. fitted values

plot(predict(lm.fit), rstudent(lm.fit)) # studentized residuals vs. fitted values

Leverage statistics can be computed for any number of predictors using the hatvalues() function. The which.max() function identifies the index of the largest element of a vector. In this case, it tells us which observation has the largest leverage statistic.

plot(hatvalues(lm.fit))

which.max(hatvalues(lm.fit))
## 375 
## 375
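
For example, we can mark the highest-leverage observation on the leverage plot (a minimal sketch, not part of the original output):

i = which.max(hatvalues(lm.fit))
points(i, hatvalues(lm.fit)[i], col="red", pch=19)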

2. Multiple Linear Regression

lm.fit = lm(medv ~ lstat + age, data = Boston)
summary(lm.fit)
## 
## Call:
## lm(formula = medv ~ lstat + age, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.981  -3.978  -1.283   1.968  23.158 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 33.22276    0.73085  45.458  < 2e-16 ***
## lstat       -1.03207    0.04819 -21.416  < 2e-16 ***
## age          0.03454    0.01223   2.826  0.00491 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.173 on 503 degrees of freedom
## Multiple R-squared:  0.5513, Adjusted R-squared:  0.5495 
## F-statistic:   309 on 2 and 503 DF,  p-value: < 2.2e-16

If we want to perform a regression using all of the predictors, we can do the following:

lm.fit = lm(medv ~ ., data=Boston)
summary(lm.fit)
## 
## Call:
## lm(formula = medv ~ ., data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.595  -2.730  -0.518   1.777  26.199 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.646e+01  5.103e+00   7.144 3.28e-12 ***
## crim        -1.080e-01  3.286e-02  -3.287 0.001087 ** 
## zn           4.642e-02  1.373e-02   3.382 0.000778 ***
## indus        2.056e-02  6.150e-02   0.334 0.738288    
## chas         2.687e+00  8.616e-01   3.118 0.001925 ** 
## nox         -1.777e+01  3.820e+00  -4.651 4.25e-06 ***
## rm           3.810e+00  4.179e-01   9.116  < 2e-16 ***
## age          6.922e-04  1.321e-02   0.052 0.958229    
## dis         -1.476e+00  1.995e-01  -7.398 6.01e-13 ***
## rad          3.060e-01  6.635e-02   4.613 5.07e-06 ***
## tax         -1.233e-02  3.760e-03  -3.280 0.001112 ** 
## ptratio     -9.527e-01  1.308e-01  -7.283 1.31e-12 ***
## black        9.312e-03  2.686e-03   3.467 0.000573 ***
## lstat       -5.248e-01  5.072e-02 -10.347  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared:  0.7406, Adjusted R-squared:  0.7338 
## F-statistic: 108.1 on 13 and 492 DF,  p-value: < 2.2e-16
summary(lm.fit)$r.sq
## [1] 0.7406427
summary(lm.fit)$sigma
## [1] 4.745298

To check for collinearity, we can compute variance inflation factors with the vif() function from the car library:

vif(lm.fit)
##     crim       zn    indus     chas      nox       rm      age      dis 
## 1.792192 2.298758 3.991596 1.073995 4.393720 1.933744 3.100826 3.955945 
##      rad      tax  ptratio    black    lstat 
## 7.484496 9.008554 1.799084 1.348521 2.941491

If we want to regress using all predictors except age:

lm.fit1 = lm(medv ~ .-age, data = Boston)
summary(lm.fit1)
## 
## Call:
## lm(formula = medv ~ . - age, data = Boston)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.6054  -2.7313  -0.5188   1.7601  26.2243 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  36.436927   5.080119   7.172 2.72e-12 ***
## crim         -0.108006   0.032832  -3.290 0.001075 ** 
## zn            0.046334   0.013613   3.404 0.000719 ***
## indus         0.020562   0.061433   0.335 0.737989    
## chas          2.689026   0.859598   3.128 0.001863 ** 
## nox         -17.713540   3.679308  -4.814 1.97e-06 ***
## rm            3.814394   0.408480   9.338  < 2e-16 ***
## dis          -1.478612   0.190611  -7.757 5.03e-14 ***
## rad           0.305786   0.066089   4.627 4.75e-06 ***
## tax          -0.012329   0.003755  -3.283 0.001099 ** 
## ptratio      -0.952211   0.130294  -7.308 1.10e-12 ***
## black         0.009321   0.002678   3.481 0.000544 ***
## lstat        -0.523852   0.047625 -10.999  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.74 on 493 degrees of freedom
## Multiple R-squared:  0.7406, Adjusted R-squared:  0.7343 
## F-statistic: 117.3 on 12 and 493 DF,  p-value: < 2.2e-16

Alternatively, we can use the update() function.

lm.fit1 = update(lm.fit, ~.-age)

3. Interaction Terms

The syntax a:b tells R to include an interaction term between a and b. The syntax a*b simultaneously includes a, b, and the interaction term a:b as predictors.

summary(lm(medv ~ lstat*age, data = Boston))
## 
## Call:
## lm(formula = medv ~ lstat * age, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.806  -4.045  -1.333   2.085  27.552 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 36.0885359  1.4698355  24.553  < 2e-16 ***
## lstat       -1.3921168  0.1674555  -8.313 8.78e-16 ***
## age         -0.0007209  0.0198792  -0.036   0.9711    
## lstat:age    0.0041560  0.0018518   2.244   0.0252 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.149 on 502 degrees of freedom
## Multiple R-squared:  0.5557, Adjusted R-squared:  0.5531 
## F-statistic: 209.3 on 3 and 502 DF,  p-value: < 2.2e-16

4. Non-linear Transformations of the Predictors

The lm() function can also accommodate non-linear transformations of the predictors. For example, given a predictor X, we can create a predictor \(X^2\) using I(X^2). The function I() is needed since ^ has a special meaning in a formula.

lm.fit2=lm(medv ~ lstat + I(lstat^2), data = Boston)
summary(lm.fit2)
## 
## Call:
## lm(formula = medv ~ lstat + I(lstat^2), data = Boston)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.2834  -3.8313  -0.5295   2.3095  25.4148 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 42.862007   0.872084   49.15   <2e-16 ***
## lstat       -2.332821   0.123803  -18.84   <2e-16 ***
## I(lstat^2)   0.043547   0.003745   11.63   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.524 on 503 degrees of freedom
## Multiple R-squared:  0.6407, Adjusted R-squared:  0.6393 
## F-statistic: 448.5 on 2 and 503 DF,  p-value: < 2.2e-16

The near-zero p-value associated with the quadratic term suggests that it leads to an improved model. We can use the anova() function to further quantify the extent to which the quadratic fit is superior to the linear fit.

lm.fit=lm(medv~lstat, data=Boston)
anova(lm.fit, lm.fit2)
## Analysis of Variance Table
## 
## Model 1: medv ~ lstat
## Model 2: medv ~ lstat + I(lstat^2)
##   Res.Df   RSS Df Sum of Sq     F    Pr(>F)    
## 1    504 19472                                 
## 2    503 15347  1    4125.1 135.2 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The anova() function performs a hypothesis test comparing the two models.

  • null hypothesis: the two models fit the data equally well
  • alternative hypothesis: the full model is superior.

Here the F-statistic is 135 and the associated p-value is virtually zero. This provides very clear evidence that the model containing the predictors lstat and lstat^2 is far superior to the model that only contains the predictor lstat.

par(mfrow=c(2, 2))
plot(lm.fit2)

If we want to include higher-order polynomial terms, we can use the poly() function:

lm.fit5 = lm(medv ~ poly(lstat, 5), data=Boston)
summary(lm.fit5)
## 
## Call:
## lm(formula = medv ~ poly(lstat, 5), data = Boston)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5433  -3.1039  -0.7052   2.0844  27.1153 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       22.5328     0.2318  97.197  < 2e-16 ***
## poly(lstat, 5)1 -152.4595     5.2148 -29.236  < 2e-16 ***
## poly(lstat, 5)2   64.2272     5.2148  12.316  < 2e-16 ***
## poly(lstat, 5)3  -27.0511     5.2148  -5.187 3.10e-07 ***
## poly(lstat, 5)4   25.4517     5.2148   4.881 1.42e-06 ***
## poly(lstat, 5)5  -19.2524     5.2148  -3.692 0.000247 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.215 on 500 degrees of freedom
## Multiple R-squared:  0.6817, Adjusted R-squared:  0.6785 
## F-statistic: 214.2 on 5 and 500 DF,  p-value: < 2.2e-16
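
Note that poly() produces orthogonal polynomials by default; raw polynomial terms can be requested with poly(lstat, 5, raw=TRUE). Other non-linear transformations work in the same way, for example a log transformation (a minimal sketch):

summary(lm(medv ~ log(rm), data=Boston))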

5. Qualitative Predictors

Now we examine the Carseats data in the ISLR library. We will attempt to predict Sales (child car seat sales) in 400 locations based on a number of predictors.

names(Carseats)
##  [1] "Sales"       "CompPrice"   "Income"      "Advertising" "Population" 
##  [6] "Price"       "ShelveLoc"   "Age"         "Education"   "Urban"      
## [11] "US"

This data set includes qualitative predictors such as ShelveLoc, an indicator of the quality of the shelving location that takes the values Bad, Medium, and Good. Given a qualitative variable such as ShelveLoc, R generates dummy variables automatically.

lm.fit = lm(Sales~.+Income:Advertising+Price:Age, data=Carseats)
summary(lm.fit)
## 
## Call:
## lm(formula = Sales ~ . + Income:Advertising + Price:Age, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9208 -0.7503  0.0177  0.6754  3.3413 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         6.5755654  1.0087470   6.519 2.22e-10 ***
## CompPrice           0.0929371  0.0041183  22.567  < 2e-16 ***
## Income              0.0108940  0.0026044   4.183 3.57e-05 ***
## Advertising         0.0702462  0.0226091   3.107 0.002030 ** 
## Population          0.0001592  0.0003679   0.433 0.665330    
## Price              -0.1008064  0.0074399 -13.549  < 2e-16 ***
## ShelveLocGood       4.8486762  0.1528378  31.724  < 2e-16 ***
## ShelveLocMedium     1.9532620  0.1257682  15.531  < 2e-16 ***
## Age                -0.0579466  0.0159506  -3.633 0.000318 ***
## Education          -0.0208525  0.0196131  -1.063 0.288361    
## UrbanYes            0.1401597  0.1124019   1.247 0.213171    
## USYes              -0.1575571  0.1489234  -1.058 0.290729    
## Income:Advertising  0.0007510  0.0002784   2.698 0.007290 ** 
## Price:Age           0.0001068  0.0001333   0.801 0.423812    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.011 on 386 degrees of freedom
## Multiple R-squared:  0.8761, Adjusted R-squared:  0.8719 
## F-statistic:   210 on 13 and 386 DF,  p-value: < 2.2e-16

The contrasts() function returns the coding that R uses for the dummy variables.

contrasts(Carseats$ShelveLoc)
##        Good Medium
## Bad       0      0
## Good      1      0
## Medium    0      1

Classification

1. The Stock Market Data

names(Smarket)
## [1] "Year"      "Lag1"      "Lag2"      "Lag3"      "Lag4"      "Lag5"     
## [7] "Volume"    "Today"     "Direction"
dim(Smarket)
## [1] 1250    9
summary(Smarket)
##       Year           Lag1                Lag2          
##  Min.   :2001   Min.   :-4.922000   Min.   :-4.922000  
##  1st Qu.:2002   1st Qu.:-0.639500   1st Qu.:-0.639500  
##  Median :2003   Median : 0.039000   Median : 0.039000  
##  Mean   :2003   Mean   : 0.003834   Mean   : 0.003919  
##  3rd Qu.:2004   3rd Qu.: 0.596750   3rd Qu.: 0.596750  
##  Max.   :2005   Max.   : 5.733000   Max.   : 5.733000  
##       Lag3                Lag4                Lag5         
##  Min.   :-4.922000   Min.   :-4.922000   Min.   :-4.92200  
##  1st Qu.:-0.640000   1st Qu.:-0.640000   1st Qu.:-0.64000  
##  Median : 0.038500   Median : 0.038500   Median : 0.03850  
##  Mean   : 0.001716   Mean   : 0.001636   Mean   : 0.00561  
##  3rd Qu.: 0.596750   3rd Qu.: 0.596750   3rd Qu.: 0.59700  
##  Max.   : 5.733000   Max.   : 5.733000   Max.   : 5.73300  
##      Volume           Today           Direction 
##  Min.   :0.3561   Min.   :-4.922000   Down:602  
##  1st Qu.:1.2574   1st Qu.:-0.639500   Up  :648  
##  Median :1.4229   Median : 0.038500             
##  Mean   :1.4783   Mean   : 0.003138             
##  3rd Qu.:1.6417   3rd Qu.: 0.596750             
##  Max.   :3.1525   Max.   : 5.733000
cor(Smarket[,-9]) # cor() only works for quantitative variables
##              Year         Lag1         Lag2         Lag3         Lag4
## Year   1.00000000  0.029699649  0.030596422  0.033194581  0.035688718
## Lag1   0.02969965  1.000000000 -0.026294328 -0.010803402 -0.002985911
## Lag2   0.03059642 -0.026294328  1.000000000 -0.025896670 -0.010853533
## Lag3   0.03319458 -0.010803402 -0.025896670  1.000000000 -0.024051036
## Lag4   0.03568872 -0.002985911 -0.010853533 -0.024051036  1.000000000
## Lag5   0.02978799 -0.005674606 -0.003557949 -0.018808338 -0.027083641
## Volume 0.53900647  0.040909908 -0.043383215 -0.041823686 -0.048414246
## Today  0.03009523 -0.026155045 -0.010250033 -0.002447647 -0.006899527
##                Lag5      Volume        Today
## Year    0.029787995  0.53900647  0.030095229
## Lag1   -0.005674606  0.04090991 -0.026155045
## Lag2   -0.003557949 -0.04338321 -0.010250033
## Lag3   -0.018808338 -0.04182369 -0.002447647
## Lag4   -0.027083641 -0.04841425 -0.006899527
## Lag5    1.000000000 -0.02200231 -0.034860083
## Volume -0.022002315  1.00000000  0.014591823
## Today  -0.034860083  0.01459182  1.000000000

The correlations between the lag variables and today’s returns are close to zero, i.e., there appears to be little correlation between today’s returns and previous days’ returns. The only substantial correlation is between Year and Volume.

plot(Smarket$Volume) #because the data is in order of time

2. Logistic Regression

We will fit a logistic regression model in order to predict Direction using Lag1 through Lag5 and Volume. The glm() function fits generalized linear models, a class of models that includes logistic regression. We must pass in the argument family=binomial in order to tell R to run a logistic regression rather than some other type of generalized linear model.

glm.fit = glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume, data=Smarket, family=binomial)
summary(glm.fit)
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
##     Volume, family = binomial, data = Smarket)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.446  -1.203   1.065   1.145   1.326  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.126000   0.240736  -0.523    0.601
## Lag1        -0.073074   0.050167  -1.457    0.145
## Lag2        -0.042301   0.050086  -0.845    0.398
## Lag3         0.011085   0.049939   0.222    0.824
## Lag4         0.009359   0.049974   0.187    0.851
## Lag5         0.010313   0.049511   0.208    0.835
## Volume       0.135441   0.158360   0.855    0.392
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1731.2  on 1249  degrees of freedom
## Residual deviance: 1727.6  on 1243  degrees of freedom
## AIC: 1741.6
## 
## Number of Fisher Scoring iterations: 3

Lag1 has the smallest associated p-value. The negative coefficient for this predictor suggests that if the market had a positive return yesterday, then it is less likely to go up today. However, 0.145 is still relatively large, and so there is no clear evidence of a real association between Lag1 and Direction.

To access just the coefficients of the fitted model, we can use coef() or summary(glm.fit)$coef.

coef(glm.fit)
##  (Intercept)         Lag1         Lag2         Lag3         Lag4 
## -0.126000257 -0.073073746 -0.042301344  0.011085108  0.009358938 
##         Lag5       Volume 
##  0.010313068  0.135440659
summary(glm.fit)$coef
##                 Estimate Std. Error    z value  Pr(>|z|)
## (Intercept) -0.126000257 0.24073574 -0.5233966 0.6006983
## Lag1        -0.073073746 0.05016739 -1.4565986 0.1452272
## Lag2        -0.042301344 0.05008605 -0.8445733 0.3983491
## Lag3         0.011085108 0.04993854  0.2219750 0.8243333
## Lag4         0.009358938 0.04997413  0.1872757 0.8514445
## Lag5         0.010313068 0.04951146  0.2082966 0.8349974
## Volume       0.135440659 0.15835970  0.8552723 0.3924004
summary(glm.fit)$coef[,4] # to access the p-values
## (Intercept)        Lag1        Lag2        Lag3        Lag4        Lag5 
##   0.6006983   0.1452272   0.3983491   0.8243333   0.8514445   0.8349974 
##      Volume 
##   0.3924004

The predict() function is used to predict \(P(Y=1|X)\). The type="response" option tells R to output probabilities of the form \(P(Y=1|X)\), as opposed to other information such as the logit. If no data set is supplied to the predict() function, then the probabilities are computed for the training data that was used to fit the logistic regression model.

contrasts(Smarket$Direction) 
##      Up
## Down  0
## Up    1

So we know that the predicted values are the probability of the market going up, since R encodes Up as 1.

glm.probs=predict(glm.fit, type="response")
glm.probs[1:10]
##         1         2         3         4         5         6         7 
## 0.5070841 0.4814679 0.4811388 0.5152224 0.5107812 0.5069565 0.4926509 
##         8         9        10 
## 0.5092292 0.5176135 0.4888378

In order to make a prediction as to whether the market will go up or down on a particular day, we must convert these predicted probabilities into class labels, Up or Down.

glm.pred=rep("Down", 1250)
glm.pred[glm.probs>0.5]="Up"

The table() function can be used to produce a confusion matrix in order to determine how many observations were correctly or incorrectly classified.

table(glm.pred, Smarket$Direction)
##         
## glm.pred Down  Up
##     Down  145 141
##     Up    457 507

The fraction of days for which the prediction was correct.

(507+145)/1250
## [1] 0.5216

The mean() function can be used to compute the fraction of days for which the prediction was correct. This is equivalent to the result of the calculation above.

mean(glm.pred==Smarket$Direction)
## [1] 0.5216
  • training error rate: 100-52.2=47.8%

The training error rate is often overly optimistic, since it tends to underestimate the test error rate.

Creating the Training and the Test Sets

train = (Smarket$Year<2005) # creates a boolean vector
Smarket.2005=Smarket[!train,]
dim(Smarket.2005)
## [1] 252   9
Direction.2005=Smarket$Direction[!train]

We now fit a logistic regression model using only the subset of the observations that correspond to dates before 2005, using the subset argument.

glm.fit=glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume, data=Smarket, family=binomial, subset=train)
glm.prob=predict(glm.fit, Smarket.2005, type="response")
glm.pred=rep("Down", 252)
glm.pred[glm.prob>0.5]="Up"
table(glm.pred, Direction.2005)
##         Direction.2005
## glm.pred Down Up
##     Down   77 97
##     Up     34 44
mean(glm.pred==Direction.2005)
## [1] 0.4801587
mean(glm.pred!=Direction.2005) #test error rate
## [1] 0.5198413

Using predictors that have no relationship with the response tends to cause a deterioration in the test error rate (since such predictors cause an increase in variance without a corresponding decrease in bias), so removing such predictors may in turn yield an improvement.

We refit the logistic regression using just Lag1 and Lag2, which seemed to have the highest predictive power in the original logistic regression model.

glm.fit = glm(Direction ~ Lag1+Lag2, data=Smarket, family=binomial, subset=train)
glm.probs=predict(glm.fit, Smarket.2005, type="response")
glm.pred=rep("Down", 252)
glm.pred[glm.probs>0.5]="Up"
table(glm.pred, Direction.2005)
##         Direction.2005
## glm.pred Down  Up
##     Down   35  35
##     Up     76 106
mean(glm.pred==Direction.2005) # proportion correctly predicted
## [1] 0.5595238
106/(106+76)
## [1] 0.5824176

Suppose we want to predict the returns associated with particular values of Lag1 and Lag2. In particular, we want to predict Direction on a day when Lag1 and Lag2 equal 1.2 and 1.1, respectively, and on a day when they equal 1.5 and -0.8. We use the predict() function.

predict(glm.fit, newdata = data.frame(Lag1=c(1.2, 1.5), Lag2=c(1.1, -0.8)), type="response")
##         1         2 
## 0.4791462 0.4960939

3. Linear Discriminant Analysis (LDA)

In R, we fit an LDA model using the lda() function, which is part of the MASS library. The syntax for the lda() function is identical to that of lm() except for the absence of the family option. We fit the model using only the observations before 2005.

lda.fit=lda(Direction ~ Lag1+Lag2, data=Smarket, subset=train)
lda.fit
## Call:
## lda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
## 
## Prior probabilities of groups:
##     Down       Up 
## 0.491984 0.508016 
## 
## Group means:
##             Lag1        Lag2
## Down  0.04279022  0.03389409
## Up   -0.03954635 -0.03132544
## 
## Coefficients of linear discriminants:
##             LD1
## Lag1 -0.6420190
## Lag2 -0.5135293

Group Means: There is a tendency for the previous 2 days’ returns to be negative on days when the market increases, and a tendency for the previous days’ returns to be positive on days when the market declines.

Coefficients of Linear Discriminants: These are the multipliers of the elements of \(X=x\). If \(-0.642 \times Lag1 - 0.514 \times Lag2\) is large, then the LDA classifier will predict a market increase, and if it is small, then the LDA classifier will predict a market decline.

plot(lda.fit)

lda.pred=predict(lda.fit, Smarket.2005)
names(lda.pred)
## [1] "class"     "posterior" "x"

class: contains LDA’s predictions about the movement of the market

posterior: matrix whose kth column contains the posterior probability that the corresponding observation belongs to the kth class

x: contains the linear discriminants

lda.class=lda.pred$class
table(lda.class, Direction.2005)
##          Direction.2005
## lda.class Down  Up
##      Down   35  35
##      Up     76 106
mean(lda.class==Direction.2005)
## [1] 0.5595238

The LDA and logistic regression predictions are almost identical.

Applying a 50% threshold to the posterior probabilities allows us to recreate the predictions contained in lda.pred$class.

sum(lda.pred$posterior[,1]>=0.5)
## [1] 70
sum(lda.pred$posterior[,1]<0.5)
## [1] 182

Note: the posterior probability output by the model corresponds to the probability that the market will decrease.

If we want to use a posterior probability threshold other than 50% in order to make predictions, then we could easily do so by changing the command.
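
For example, to count only the days on which the posterior probability of a market decrease exceeds 90%, we could do the following (a minimal sketch):

sum(lda.pred$posterior[,1]>0.9)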

4. Quadratic Discriminant Analysis

qda.fit=qda(Direction~Lag1+Lag2, data=Smarket, subset=train)
qda.fit
## Call:
## qda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
## 
## Prior probabilities of groups:
##     Down       Up 
## 0.491984 0.508016 
## 
## Group means:
##             Lag1        Lag2
## Down  0.04279022  0.03389409
## Up   -0.03954635 -0.03132544
qda.class=predict(qda.fit, Smarket.2005)$class
table(qda.class, Direction.2005)
##          Direction.2005
## qda.class Down  Up
##      Down   30  20
##      Up     81 121
mean(qda.class==Direction.2005)
## [1] 0.5992063

5. K-Nearest Neighbors

We perform KNN using the knn() function, which is part of the class library. The function requires four inputs.

    1. A matrix containing the predictors associated with the training data, labeled train.X below.
    2. A matrix containing the predictors associated with the data for which we wish to make predictions, labeled test.X below.
    3. A vector containing the class labels for the training observations, labeled train.Direction below.
    4. A value for K, the number of nearest neighbors to be used by the classifier.

attach(Smarket)
train.X=cbind(Lag1, Lag2)[train, ]
test.X=cbind(Lag1, Lag2)[!train, ]
train.Direction=Direction[train]
set.seed(1)
knn.pred=knn(train.X, test.X, train.Direction, k=1)
table(knn.pred, Direction.2005)
##         Direction.2005
## knn.pred Down Up
##     Down   43 58
##     Up     68 83
(83+43)/252
## [1] 0.5

Let’s use k=3:

knn.pred=knn(train.X, test.X, train.Direction, k=3)
table(knn.pred, Direction.2005)
##         Direction.2005
## knn.pred Down Up
##     Down   48 54
##     Up     63 87
mean(knn.pred==Direction.2005)
## [1] 0.5357143

It appears that for this data, QDA provides the best results of the methods that we have examined so far.

detach(Smarket)

6. Application: Caravan Insurance Data

KNN Methods

This data set includes 85 predictors that measure demographic characteristics for 5,822 individuals. The response variable is Purchase, which indicates whether or not a given individual purchases a caravan insurance policy. In this data set, only 6% of people purchased caravan insurance.

dim(Caravan)
## [1] 5822   86
attach(Caravan)
summary(Purchase)
##   No  Yes 
## 5474  348

Since the KNN classifier predicts the class of a given test observation by identifying the observations that are nearest to it, the scale of the variables matters. Any variables that are on a large scale will have a much larger effect on the distance between the observations, and hence on the KNN classifier, than variables that are on a small scale.

For example, let’s consider salary and age variables. As far as KNN is concerned, a difference of $1,000 in salary is enormous compared to a difference of 50 years in age. Consequently, salary will drive the KNN classification results, and age will have almost no effect.

A good way to handle this problem is to standardize the data so that all variables are given a mean of zero and a standard deviation of one. The scale() function does this.

standardize.X = scale(Caravan[, -86]) # column 86 is the qualitative Purchase variable
var(Caravan[, 1])
## [1] 165.0378
var(Caravan[, 2])
## [1] 0.1647078
var(standardize.X[, 1])
## [1] 1
var(standardize.X[, 2])
## [1] 1

Now every column of standardize.X has a standard deviation of one and a mean of zero.

Now we split the observations into a test set and a training set. We fit a KNN model on the training data using K=1, and evaluate its performance on the test data.

test=1:1000
train.X=standardize.X[-test, ]
test.X=standardize.X[test, ]
train.Y=Purchase[-test]
test.Y=Purchase[test]
set.seed(1)
knn.pred=knn(train.X, test.X, train.Y, k=1)
mean(test.Y!=knn.pred) #KNN error rate
## [1] 0.118

The KNN error rate is about 12%, which at first glance may seem fairly small. However, since only 6% of customers purchased insurance, we could get the error rate down to almost 6% by always predicting “No”.
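
As a quick check (a minimal sketch), the error rate of the naive classifier that always predicts “No” can be computed directly:

mean(test.Y != "No") # error rate if we always predict "No"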

table(knn.pred, test.Y)
##         test.Y
## knn.pred  No Yes
##      No  873  50
##      Yes  68   9
9/(68+9)
## [1] 0.1168831
  • If the company tries to sell insurance to a random selection of customers, then the success rate will be only 6%.

  • If the company would like to try to sell insurance only to customers who are likely to buy it, then the success rate is 11.7%. This is double the rate that one would obtain from random guessing.

Using K=3, the success rate increases to 19% and with K=5 the rate is 26.7%:

knn.pred=knn(train.X, test.X, train.Y, k=3)
table(knn.pred, test.Y)
##         test.Y
## knn.pred  No Yes
##      No  920  54
##      Yes  21   5
5/(21+5)
## [1] 0.1923077
knn.pred=knn(train.X, test.X, train.Y, k=5)
table(knn.pred, test.Y)
##         test.Y
## knn.pred  No Yes
##      No  930  55
##      Yes  11   4
4/15
## [1] 0.2666667

Logistic Regression

If we use 0.5 as the predicted probability cut-off for the classifier, then only seven of the test observations are predicted to purchase insurance, and none of those predictions are correct. If we instead use 0.25 as the threshold, we get much better results: we predict that 33 people will purchase insurance, and we are correct for about 33% of these people.

glm.fit=glm(Purchase~., data=Caravan, family=binomial, subset=-test)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
glm.probs=predict(glm.fit, Caravan[test,], type="response")
glm.pred=rep("No", 1000)
glm.pred[glm.probs>0.5]="Yes"
table(glm.pred, test.Y)
##         test.Y
## glm.pred  No Yes
##      No  934  59
##      Yes   7   0
glm.pred=rep("No" ,1000)
glm.pred[glm.probs>0.25]="Yes"
table(glm.pred, test.Y)
##         test.Y
## glm.pred  No Yes
##      No  919  48
##      Yes  22  11
11/(22+11)
## [1] 0.3333333