Source file ⇒ seminar_5_.Rmd
The MASS library contains the Boston data set, which records medv (median house value) for 506 neighborhoods around Boston. We will seek to predict medv using 13 predictors such as:
rm: average number of rooms per house
age: average age of houses
lstat: percent of households with low socioeconomic status
names(Boston)
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
lm.fit = lm(medv ~ lstat, data=Boston)
summary(lm.fit)
##
## Call:
## lm(formula = medv ~ lstat, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.168 -3.990 -1.318 2.034 24.500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.55384 0.56263 61.41 <2e-16 ***
## lstat -0.95005 0.03873 -24.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared: 0.5441, Adjusted R-squared: 0.5432
## F-statistic: 601.6 on 1 and 504 DF, p-value: < 2.2e-16
We can use the names() function to find out what other pieces of information are stored in lm.fit.
names(lm.fit)
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
We can use lm.fit$coefficients to extract the estimated coefficients, but it is safer to use extractor functions such as coef() to access them.
coef(lm.fit)
## (Intercept) lstat
## 34.5538409 -0.9500494
In order to obtain a confidence interval for the coefficient estimates, we can use the confint() command.
confint(lm.fit)
## 2.5 % 97.5 %
## (Intercept) 33.448457 35.6592247
## lstat -1.026148 -0.8739505
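By default confint() returns 95% intervals; as a quick sketch, the level argument can be changed if, say, a 90% interval is wanted:
confint(lm.fit, level=0.90)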
The predict()
function can be used to produce confidence intervals and prediction intervals for the prediction of medv
for a given value of lstat
.
predict(lm.fit, data.frame(lstat=c(5, 10, 15)), interval="confidence")
## fit lwr upr
## 1 29.80359 29.00741 30.59978
## 2 25.05335 24.47413 25.63256
## 3 20.30310 19.73159 20.87461
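The same call with interval="prediction" produces prediction intervals instead; these are wider than the confidence intervals because they also account for the irreducible error in an individual response:
predict(lm.fit, data.frame(lstat=c(5, 10, 15)), interval="prediction")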
We can plot medv and lstat along with the least squares regression line using the plot() and abline() functions. The abline() function can be used to draw any line: to draw a line with intercept a and slope b, we type abline(a, b). The lwd=3 argument increases the width of the regression line by a factor of 3, and the pch option selects the plotting symbol.
plot(Boston$lstat, Boston$medv, pch="+")
abline(lm.fit, lwd=3, col="red")
Now we examine some diagnostic plots. Four diagnostic plots are automatically produced by applying the plot() function directly to the output from lm(). By using par(), we can tell R to split the display into separate panels so that multiple plots can be viewed simultaneously.
par(mfrow=c(2, 2))
plot(lm.fit)
Alternatively, we can compute the residuals from a linear regression fit using the residuals()
function. The function rstudent()
will return the studentized residuals.
plot(predict(lm.fit), residuals(lm.fit)) # residuals vs. fitted values
plot(predict(lm.fit), rstudent(lm.fit)) # studentized residuals vs. fitted values
Leverage statistics can be computed for any number of predictors using the hatvalues()
function. The which.max()
function identifies the index of the largest element of a vector. In this case, it tells us which observation has the largest leverage statistic.
plot(hatvalues(lm.fit))
which.max(hatvalues(lm.fit))
## 375
## 375
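To see what makes observation 375 stand out, a quick sketch of inspecting that row (restricted here to the predictor and response used in the fit):
Boston[375, c("lstat", "medv")]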
Next, we fit a multiple linear regression model with lstat and age as predictors.
lm.fit = lm(medv ~ lstat + age, data = Boston)
summary(lm.fit)
##
## Call:
## lm(formula = medv ~ lstat + age, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.981 -3.978 -1.283 1.968 23.158
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.22276 0.73085 45.458 < 2e-16 ***
## lstat -1.03207 0.04819 -21.416 < 2e-16 ***
## age 0.03454 0.01223 2.826 0.00491 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.173 on 503 degrees of freedom
## Multiple R-squared: 0.5513, Adjusted R-squared: 0.5495
## F-statistic: 309 on 2 and 503 DF, p-value: < 2.2e-16
If we want to perform a regression using all of the predictors, we can do the following:
lm.fit = lm(medv ~ ., data=Boston)
summary(lm.fit)
##
## Call:
## lm(formula = medv ~ ., data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.595 -2.730 -0.518 1.777 26.199
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.646e+01 5.103e+00 7.144 3.28e-12 ***
## crim -1.080e-01 3.286e-02 -3.287 0.001087 **
## zn 4.642e-02 1.373e-02 3.382 0.000778 ***
## indus 2.056e-02 6.150e-02 0.334 0.738288
## chas 2.687e+00 8.616e-01 3.118 0.001925 **
## nox -1.777e+01 3.820e+00 -4.651 4.25e-06 ***
## rm 3.810e+00 4.179e-01 9.116 < 2e-16 ***
## age 6.922e-04 1.321e-02 0.052 0.958229
## dis -1.476e+00 1.995e-01 -7.398 6.01e-13 ***
## rad 3.060e-01 6.635e-02 4.613 5.07e-06 ***
## tax -1.233e-02 3.760e-03 -3.280 0.001112 **
## ptratio -9.527e-01 1.308e-01 -7.283 1.31e-12 ***
## black 9.312e-03 2.686e-03 3.467 0.000573 ***
## lstat -5.248e-01 5.072e-02 -10.347 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338
## F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16
summary(lm.fit)$r.sq # R-squared
## [1] 0.7406427
summary(lm.fit)$sigma # residual standard error (RSE)
## [1] 4.745298
To check for collinearity, we can compute variance inflation factors with the vif() function from the car library:
vif(lm.fit)
## crim zn indus chas nox rm age dis
## 1.792192 2.298758 3.991596 1.073995 4.393720 1.933744 3.100826 3.955945
## rad tax ptratio black lstat
## 7.484496 9.008554 1.799084 1.348521 2.941491
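As a rough sketch, assuming the car library is loaded and using the common rule of thumb that a VIF above 5 signals problematic collinearity, we can pull out the offending predictors:
vif.values = vif(lm.fit)
vif.values[vif.values > 5] # rad and tax exceed this threshold in the output above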
If we want to regress using all predictors except age
:
lm.fit1 = lm(medv ~ .-age, data = Boston)
summary(lm.fit1)
##
## Call:
## lm(formula = medv ~ . - age, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.6054 -2.7313 -0.5188 1.7601 26.2243
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.436927 5.080119 7.172 2.72e-12 ***
## crim -0.108006 0.032832 -3.290 0.001075 **
## zn 0.046334 0.013613 3.404 0.000719 ***
## indus 0.020562 0.061433 0.335 0.737989
## chas 2.689026 0.859598 3.128 0.001863 **
## nox -17.713540 3.679308 -4.814 1.97e-06 ***
## rm 3.814394 0.408480 9.338 < 2e-16 ***
## dis -1.478612 0.190611 -7.757 5.03e-14 ***
## rad 0.305786 0.066089 4.627 4.75e-06 ***
## tax -0.012329 0.003755 -3.283 0.001099 **
## ptratio -0.952211 0.130294 -7.308 1.10e-12 ***
## black 0.009321 0.002678 3.481 0.000544 ***
## lstat -0.523852 0.047625 -10.999 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.74 on 493 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7343
## F-statistic: 117.3 on 12 and 493 DF, p-value: < 2.2e-16
Alternatively, we can use the update() function.
lm.fit1 = update(lm.fit, ~.-age)
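The same idiom extends to dropping several predictors at once; for example, a sketch (left unassigned so it does not overwrite the fits used below) that removes both age and indus:
update(lm.fit, ~ . - age - indus)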
The syntax a:b tells R to include an interaction term between a and b. The syntax a*b simultaneously includes a, b, and the interaction term a:b as predictors.
summary(lm(medv ~ lstat*age, data = Boston))
##
## Call:
## lm(formula = medv ~ lstat * age, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.806 -4.045 -1.333 2.085 27.552
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.0885359 1.4698355 24.553 < 2e-16 ***
## lstat -1.3921168 0.1674555 -8.313 8.78e-16 ***
## age -0.0007209 0.0198792 -0.036 0.9711
## lstat:age 0.0041560 0.0018518 2.244 0.0252 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.149 on 502 degrees of freedom
## Multiple R-squared: 0.5557, Adjusted R-squared: 0.5531
## F-statistic: 209.3 on 3 and 502 DF, p-value: < 2.2e-16
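For reference, the a*b shorthand above is equivalent to spelling the terms out explicitly:
lm(medv ~ lstat + age + lstat:age, data = Boston)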
The lm() function can also accommodate non-linear transformations of the predictors. For example, given a predictor X, we can create a predictor \(X^2\) using I(X^2). The function I() is needed since ^ has a special meaning in a formula.
lm.fit2=lm(medv ~ lstat + I(lstat^2), data = Boston)
summary(lm.fit2)
##
## Call:
## lm(formula = medv ~ lstat + I(lstat^2), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.2834 -3.8313 -0.5295 2.3095 25.4148
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.862007 0.872084 49.15 <2e-16 ***
## lstat -2.332821 0.123803 -18.84 <2e-16 ***
## I(lstat^2) 0.043547 0.003745 11.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.524 on 503 degrees of freedom
## Multiple R-squared: 0.6407, Adjusted R-squared: 0.6393
## F-statistic: 448.5 on 2 and 503 DF, p-value: < 2.2e-16
The near-zero p-value associated with the quadratic term suggests that it leads to an improved model. We can use the anova() function to further quantify the extent to which the quadratic fit is superior to the linear fit.
lm.fit=lm(medv~lstat, data=Boston)
anova(lm.fit, lm.fit2)
## Analysis of Variance Table
##
## Model 1: medv ~ lstat
## Model 2: medv ~ lstat + I(lstat^2)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 504 19472
## 2 503 15347 1 4125.1 135.2 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The anova() function performs a hypothesis test comparing the two models.
Here the F-statistic is 135 and the associated p-value is virtually zero. This provides very clear evidence that the model containing the predictors lstat and lstat^2 is far superior to the model that contains only the predictor lstat.
par(mfrow=c(2, 2))
plot(lm.fit2)
If we want to include higher-order polynomial terms, we can use the poly() function:
lm.fit5 = lm(medv ~ poly(lstat, 5), data=Boston)
summary(lm.fit5)
##
## Call:
## lm(formula = medv ~ poly(lstat, 5), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5433 -3.1039 -0.7052 2.0844 27.1153
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.5328 0.2318 97.197 < 2e-16 ***
## poly(lstat, 5)1 -152.4595 5.2148 -29.236 < 2e-16 ***
## poly(lstat, 5)2 64.2272 5.2148 12.316 < 2e-16 ***
## poly(lstat, 5)3 -27.0511 5.2148 -5.187 3.10e-07 ***
## poly(lstat, 5)4 25.4517 5.2148 4.881 1.42e-06 ***
## poly(lstat, 5)5 -19.2524 5.2148 -3.692 0.000247 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.215 on 500 degrees of freedom
## Multiple R-squared: 0.6817, Adjusted R-squared: 0.6785
## F-statistic: 214.2 on 5 and 500 DF, p-value: < 2.2e-16
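By default poly() uses orthogonal polynomials, which is why these coefficients are not directly comparable to the ones from the I(lstat^2) fit. A sketch of the raw-polynomial alternative, if coefficients on the original scale are preferred:
lm(medv ~ poly(lstat, 5, raw=TRUE), data=Boston)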
Now we examine the Carseats data in the ISLR library. We will attempt to predict Sales (child car seat sales) in 400 locations based on a number of predictors.
names(Carseats)
## [1] "Sales" "CompPrice" "Income" "Advertising" "Population"
## [6] "Price" "ShelveLoc" "Age" "Education" "Urban"
## [11] "US"
This data includes qualitative predictors such as ShelveLoc, an indicator of the quality of the shelving location that takes the values Bad, Medium, and Good. Given a qualitative variable such as ShelveLoc, R generates dummy variables automatically.
lm.fit = lm(Sales~.+Income:Advertising+Price:Age, data=Carseats)
summary(lm.fit)
##
## Call:
## lm(formula = Sales ~ . + Income:Advertising + Price:Age, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9208 -0.7503 0.0177 0.6754 3.3413
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.5755654 1.0087470 6.519 2.22e-10 ***
## CompPrice 0.0929371 0.0041183 22.567 < 2e-16 ***
## Income 0.0108940 0.0026044 4.183 3.57e-05 ***
## Advertising 0.0702462 0.0226091 3.107 0.002030 **
## Population 0.0001592 0.0003679 0.433 0.665330
## Price -0.1008064 0.0074399 -13.549 < 2e-16 ***
## ShelveLocGood 4.8486762 0.1528378 31.724 < 2e-16 ***
## ShelveLocMedium 1.9532620 0.1257682 15.531 < 2e-16 ***
## Age -0.0579466 0.0159506 -3.633 0.000318 ***
## Education -0.0208525 0.0196131 -1.063 0.288361
## UrbanYes 0.1401597 0.1124019 1.247 0.213171
## USYes -0.1575571 0.1489234 -1.058 0.290729
## Income:Advertising 0.0007510 0.0002784 2.698 0.007290 **
## Price:Age 0.0001068 0.0001333 0.801 0.423812
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.011 on 386 degrees of freedom
## Multiple R-squared: 0.8761, Adjusted R-squared: 0.8719
## F-statistic: 210 on 13 and 386 DF, p-value: < 2.2e-16
The contrasts()
function returns the coding that R uses for the dummy variables.
contrasts(Carseats$ShelveLoc)
## Good Medium
## Bad 0 0
## Good 1 0
## Medium 0 1
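If a different baseline level is wanted, one option (a sketch) is to re-level the factor before inspecting or fitting, here treating Medium as the baseline:
contrasts(relevel(Carseats$ShelveLoc, ref="Medium"))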
Next we turn to the Smarket (stock market) data, which is also part of the ISLR library.
names(Smarket)
## [1] "Year" "Lag1" "Lag2" "Lag3" "Lag4" "Lag5"
## [7] "Volume" "Today" "Direction"
dim(Smarket)
## [1] 1250 9
summary(Smarket)
## Year Lag1 Lag2
## Min. :2001 Min. :-4.922000 Min. :-4.922000
## 1st Qu.:2002 1st Qu.:-0.639500 1st Qu.:-0.639500
## Median :2003 Median : 0.039000 Median : 0.039000
## Mean :2003 Mean : 0.003834 Mean : 0.003919
## 3rd Qu.:2004 3rd Qu.: 0.596750 3rd Qu.: 0.596750
## Max. :2005 Max. : 5.733000 Max. : 5.733000
## Lag3 Lag4 Lag5
## Min. :-4.922000 Min. :-4.922000 Min. :-4.92200
## 1st Qu.:-0.640000 1st Qu.:-0.640000 1st Qu.:-0.64000
## Median : 0.038500 Median : 0.038500 Median : 0.03850
## Mean : 0.001716 Mean : 0.001636 Mean : 0.00561
## 3rd Qu.: 0.596750 3rd Qu.: 0.596750 3rd Qu.: 0.59700
## Max. : 5.733000 Max. : 5.733000 Max. : 5.73300
## Volume Today Direction
## Min. :0.3561 Min. :-4.922000 Down:602
## 1st Qu.:1.2574 1st Qu.:-0.639500 Up :648
## Median :1.4229 Median : 0.038500
## Mean :1.4783 Mean : 0.003138
## 3rd Qu.:1.6417 3rd Qu.: 0.596750
## Max. :3.1525 Max. : 5.733000
cor(Smarket[,-9]) # cor() works only on quantitative variables, so we drop the Direction column
## Year Lag1 Lag2 Lag3 Lag4
## Year 1.00000000 0.029699649 0.030596422 0.033194581 0.035688718
## Lag1 0.02969965 1.000000000 -0.026294328 -0.010803402 -0.002985911
## Lag2 0.03059642 -0.026294328 1.000000000 -0.025896670 -0.010853533
## Lag3 0.03319458 -0.010803402 -0.025896670 1.000000000 -0.024051036
## Lag4 0.03568872 -0.002985911 -0.010853533 -0.024051036 1.000000000
## Lag5 0.02978799 -0.005674606 -0.003557949 -0.018808338 -0.027083641
## Volume 0.53900647 0.040909908 -0.043383215 -0.041823686 -0.048414246
## Today 0.03009523 -0.026155045 -0.010250033 -0.002447647 -0.006899527
## Lag5 Volume Today
## Year 0.029787995 0.53900647 0.030095229
## Lag1 -0.005674606 0.04090991 -0.026155045
## Lag2 -0.003557949 -0.04338321 -0.010250033
## Lag3 -0.018808338 -0.04182369 -0.002447647
## Lag4 -0.027083641 -0.04841425 -0.006899527
## Lag5 1.000000000 -0.02200231 -0.034860083
## Volume -0.022002315 1.00000000 0.014591823
## Today -0.034860083 0.01459182 1.000000000
The correlations between the lag variables and today’s returns are close to zero, i.e., there appears to be little correlation between today’s returns and previous days’ returns. The only substantial correlation is between Year and Volume.
plot(Smarket$Volume) # the observations are ordered in time, so this shows how trading volume changes over time
We will fit a logistic regression model in order to predict Direction
using Lag1
through Lag5
and Volume
. The glm()
function fits generalized linear models, a class of models that includes logistic regression. We must pass in the argument family=binomial
in order to tell R to run a logistic regression rather than some other type of generalized linear model.
glm.fit = glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume, data=Smarket, family=binomial)
summary(glm.fit)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = binomial, data = Smarket)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.446 -1.203 1.065 1.145 1.326
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.126000 0.240736 -0.523 0.601
## Lag1 -0.073074 0.050167 -1.457 0.145
## Lag2 -0.042301 0.050086 -0.845 0.398
## Lag3 0.011085 0.049939 0.222 0.824
## Lag4 0.009359 0.049974 0.187 0.851
## Lag5 0.010313 0.049511 0.208 0.835
## Volume 0.135441 0.158360 0.855 0.392
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1731.2 on 1249 degrees of freedom
## Residual deviance: 1727.6 on 1243 degrees of freedom
## AIC: 1741.6
##
## Number of Fisher Scoring iterations: 3
Lag1
has the smallest associated p-value. The negative coefficient for this predictor suggests that if the market had a positive return yesterday, then it is less likely to go up today. However, 0.145 is still relatively large, and so there is no clear evidence of a real association between Lag1
and Direction
.
To access the coefficients for the fitted model, we can use coef()
or summary(fitted model)$coef
.
coef(glm.fit)
## (Intercept) Lag1 Lag2 Lag3 Lag4
## -0.126000257 -0.073073746 -0.042301344 0.011085108 0.009358938
## Lag5 Volume
## 0.010313068 0.135440659
summary(glm.fit)$coef
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.126000257 0.24073574 -0.5233966 0.6006983
## Lag1 -0.073073746 0.05016739 -1.4565986 0.1452272
## Lag2 -0.042301344 0.05008605 -0.8445733 0.3983491
## Lag3 0.011085108 0.04993854 0.2219750 0.8243333
## Lag4 0.009358938 0.04997413 0.1872757 0.8514445
## Lag5 0.010313068 0.04951146 0.2082966 0.8349974
## Volume 0.135440659 0.15835970 0.8552723 0.3924004
summary(glm.fit)$coef[,4] # to access the p-values
## (Intercept) Lag1 Lag2 Lag3 Lag4 Lag5
## 0.6006983 0.1452272 0.3983491 0.8243333 0.8514445 0.8349974
## Volume
## 0.3924004
The predict() function is used to predict \(P(Y=1|X)\). The type="response" option tells R to output probabilities of the form \(P(Y=1|X)\), as opposed to other information such as the logit. If no data set is supplied to the predict() function, then the probabilities are computed for the training data that was used to fit the logistic regression model.
contrasts(Smarket$Direction)
## Up
## Down 0
## Up 1
So we know that the predicted values are the probability of the market going up, since R encodes Up as 1.
glm.probs=predict(glm.fit, type="response")
glm.probs[1:10]
## 1 2 3 4 5 6 7
## 0.5070841 0.4814679 0.4811388 0.5152224 0.5107812 0.5069565 0.4926509
## 8 9 10
## 0.5092292 0.5176135 0.4888378
In order to make a prediction as to whether the market will go up or down on a particular day, we must convert these predicted probabilities into class labels, Up
or Down
.
glm.pred=rep("Down", 1250)
glm.pred[glm.probs>0.5]="Up"
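Equivalently, the two steps above can be collapsed into a single ifelse() call (a sketch):
glm.pred = ifelse(glm.probs > 0.5, "Up", "Down")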
The table()
function can be used to produce a confusion matrix in order to determine how many observations were correctly or incorrectly classified.
table(glm.pred, Smarket$Direction)
##
## glm.pred Down Up
## Down 145 141
## Up 457 507
The fraction of days for which the prediction was correct.
(507+145)/1250
## [1] 0.5216
The mean() function can be used to compute the fraction of days for which the prediction was correct. This is equivalent to the result from the above calculation.
mean(glm.pred==Smarket$Direction)
## [1] 0.5216
The training error rate is often overly optimistic, since it tends to underestimate the test error rate. To better assess the model's accuracy, we can fit it using only part of the data and then examine how well it predicts the held-out data; here we hold out the observations from 2005.
train = (Smarket$Year<2005) # creates a logical (TRUE/FALSE) vector
Smarket.2005=Smarket[!train,]
dim(Smarket.2005)
## [1] 252 9
Direction.2005=Smarket$Direction[!train]
We now fit a logistic regression model using only the subset of the observations that correspond to dates before 2005, using the subset
argument.
glm.fit=glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume, data=Smarket, family=binomial, subset=train)
glm.prob=predict(glm.fit, Smarket.2005, type="response")
glm.pred=rep("Down", 252)
glm.pred[glm.prob>0.5]="Up"
table(glm.pred, Direction.2005)
## Direction.2005
## glm.pred Down Up
## Down 77 97
## Up 34 44
mean(glm.pred==Direction.2005)
## [1] 0.4801587
mean(glm.pred!=Direction.2005) #test error rate
## [1] 0.5198413
Using predictors that have no relationship with the response tends to cause a deterioration in the test error rate (since such predictors cause an increase in variance without a corresponding decrease in bias), so removing such predictors may in turn yield an improvement.
We refit the logistic regression using just Lag1
and Lag2
, which seemed to have the highest predictive power in the original logistic regression model.
glm.fit = glm(Direction ~ Lag1+Lag2, data=Smarket, family=binomial, subset=train)
glm.probs=predict(glm.fit, Smarket.2005, type="response")
glm.pred=rep("Down", 252)
glm.pred[glm.probs>0.5]="Up"
table(glm.pred, Direction.2005)
## Direction.2005
## glm.pred Down Up
## Down 35 35
## Up 76 106
mean(glm.pred==Direction.2005) # proportion of days correctly predicted
## [1] 0.5595238
106/(106+76) # accuracy on days when the model predicts an increase
## [1] 0.5824176
Suppose we want to predict the returns associated with particular values of Lag1 and Lag2. In particular, we want to predict Direction on a day when Lag1 and Lag2 equal 1.2 and 1.1 respectively, and on a day when they equal 1.5 and -0.8. We use the predict() function.
predict(glm.fit, newdata = data.frame(Lag1=c(1.2, 1.5), Lag2=c(1.1, -0.8)), type="response")
## 1 2
## 0.4791462 0.4960939
In R, we fit an LDA model using the lda() function, which is part of the MASS library. The syntax for the lda() function is identical to that of lm() except for the absence of the family option. We fit the model using only the observations before 2005.
lda.fit=lda(Direction ~ Lag1+Lag2, data=Smarket, subset=train)
lda.fit
## Call:
## lda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
##
## Prior probabilities of groups:
## Down Up
## 0.491984 0.508016
##
## Group means:
## Lag1 Lag2
## Down 0.04279022 0.03389409
## Up -0.03954635 -0.03132544
##
## Coefficients of linear discriminants:
## LD1
## Lag1 -0.6420190
## Lag2 -0.5135293
Group Means: There is a tendency for the previous 2 days’ returns to be negative on days when the market increases, and a tendency for the previous days’ returns to be positive on days when the market declines.
Coefficients of Linear Discriminants: These are the multipliers of the elements of \(X=x\). If \(-0.642 \times Lag1 - 0.514 \times Lag2\) is large, then the LDA classifier will predict a market increase, and if it is small, then the LDA classifier will predict a market decline.
plot(lda.fit)
lda.pred=predict(lda.fit, Smarket.2005)
names(lda.pred)
## [1] "class" "posterior" "x"
class: contains LDA’s predictions about the movement of the market
posterior: matrix whose kth column contains the posterior probability that the corresponding observation belongs to the kth class
x: contains the linear discriminants
lda.class=lda.pred$class
table(lda.class, Direction.2005)
## Direction.2005
## lda.class Down Up
## Down 35 35
## Up 76 106
mean(lda.class==Direction.2005)
## [1] 0.5595238
The LDA and logistic regression predictions are almost identical.
Applying a 50% threshold to the posterior probabilities allows us to recreate the predictions contained in lda.pred$class
.
sum(lda.pred$posterior[,1]>=0.5)
## [1] 70
sum(lda.pred$posterior[,1]<0.5)
## [1] 182
Note The posterior probability output by the model corresponds to the probability that the market will decrease.
If we want to use a posterior probability threshold other than 50% in order to make predictions, then we could easily do so by changing the threshold in the commands above.
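For example, a sketch of counting the days with at least a 90% posterior probability of a market decrease:
sum(lda.pred$posterior[,1] > 0.9)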
We now fit a QDA model using the qda() function, which, like lda(), is part of the MASS library.
qda.fit=qda(Direction~Lag1+Lag2, data=Smarket, subset=train)
qda.fit
## Call:
## qda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
##
## Prior probabilities of groups:
## Down Up
## 0.491984 0.508016
##
## Group means:
## Lag1 Lag2
## Down 0.04279022 0.03389409
## Up -0.03954635 -0.03132544
qda.class=predict(qda.fit, Smarket.2005)$class
table(qda.class, Direction.2005)
## Direction.2005
## qda.class Down Up
## Down 30 20
## Up 81 121
mean(qda.class==Direction.2005)
## [1] 0.5992063
We perform KNN using the knn() function, which is part of the class library. The function requires four inputs:
A matrix containing the predictors associated with the training data, labeled train.X below.
A matrix containing the predictors associated with the data for which we wish to make predictions, labeled test.X below.
A vector containing the class labels for the training observations, labeled train.Direction below.
A value for K, the number of nearest neighbors to be used by the classifier.
attach(Smarket)
train.X=cbind(Lag1, Lag2)[train, ]
test.X=cbind(Lag1, Lag2)[!train, ]
train.Direction=Direction[train]
set.seed(1) # if several observations are tied as nearest neighbors, knn() breaks the tie at random, so we set a seed for reproducibility
knn.pred=knn(train.X, test.X, train.Direction, k=1)
table(knn.pred, Direction.2005)
## Direction.2005
## knn.pred Down Up
## Down 43 58
## Up 68 83
(83+43)/252
## [1] 0.5
Let’s use k=3:
knn.pred=knn(train.X, test.X, train.Direction, k=3)
table(knn.pred, Direction.2005)
## Direction.2005
## knn.pred Down Up
## Down 48 54
## Up 63 87
mean(knn.pred==Direction.2005)
## [1] 0.5357143
It appears that for this data, QDA provides the best results of the methods that we have examined so far.
detach(Smarket)
Finally, we apply KNN to the Caravan data set, which is part of the ISLR library. This data set includes 85 predictors that measure demographic characteristics for 5,822 individuals. The response variable is Purchase, which indicates whether or not a given individual purchases a caravan insurance policy. In this data set, only 6% of people purchased caravan insurance.
dim(Caravan)
## [1] 5822 86
attach(Caravan)
summary(Purchase)
## No Yes
## 5474 348
Since the KNN classifier predicts the class of a given test observation by identifying the observations that are nearest to it, the scale of the variables matters. Any variables that are on a large scale will have a much larger effect on the distance between the observations, and hence on the KNN classifier, than variables that are on a small scale.
For example, consider a data set containing salary (in dollars) and age (in years) variables. As far as KNN is concerned, a difference of $1,000 in salary is enormous compared to a difference of 50 years in age. Consequently, salary will drive the KNN classification results, and age will have almost no effect.
A good way to handle this problem is to standardize the data so that all variables have a mean of zero and a standard deviation of one. The scale() function does exactly this.
standardize.X = scale(Caravan[, -86]) # column 86 is the qualitative Purchase variable, so we exclude it
var(Caravan[, 1])
## [1] 165.0378
var(Caravan[, 2])
## [1] 0.1647078
var(standardize.X[, 1])
## [1] 1
var(standardize.X[, 2])
## [1] 1
Now every column of standardize.X has a standard deviation of one and a mean of zero.
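A quick sanity check on the first column (a sketch; the mean is zero only up to floating point error):
mean(standardize.X[, 1])
sd(standardize.X[, 1])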
Now we split the observations into a test set and a training set. We fit a KNN model on the training data using K=1, and evaluate its performance on the test data.
test=1:1000
train.X=standardize.X[-test, ]
test.X=standardize.X[test, ]
train.Y=Purchase[-test]
test.Y=Purchase[test]
set.seed(1)
knn.pred=knn(train.X, test.X, train.Y, k=1)
mean(test.Y!=knn.pred) #KNN error rate
## [1] 0.118
The KNN error rate is about 12% which is fairly small. However, since only 6% of customers purchased insurance, we could get the error rate down to 6% by always predicting “No”.
table(knn.pred, test.Y)
## test.Y
## knn.pred No Yes
## No 873 50
## Yes 68 9
9/(68+9)
## [1] 0.1168831
If the company tries to sell insurance to a random selection of customers, then the success rate will be only 6%.
If the company instead tries to sell insurance only to customers whom KNN predicts are likely to buy it, the success rate is 11.7%. This is double the rate that one would obtain from random guessing.
Using K=3, the success rate increases to 19% and with K=5 the rate is 26.7%:
knn.pred=knn(train.X, test.X, train.Y, k=3)
table(knn.pred, test.Y)
## test.Y
## knn.pred No Yes
## No 920 54
## Yes 21 5
5/(21+5)
## [1] 0.1923077
knn.pred=knn(train.X, test.X, train.Y, k=5)
table(knn.pred, test.Y)
## test.Y
## knn.pred No Yes
## No 930 55
## Yes 11 4
4/15
## [1] 0.2666667
As a comparison, we can also fit a logistic regression model to the data. If we use 0.5 as the predicted probability cut-off for the classifier, then we have a problem: only seven of the test observations are predicted to purchase insurance, and none of these predictions is correct. If we instead use 0.25 as the threshold, we get much better results: we predict that 33 people will purchase insurance, and we are correct for about 33% of these people.
glm.fit=glm(Purchase~., data=Caravan, family=binomial, subset=-test)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
glm.probs=predict(glm.fit, Caravan[test,], type="response")
glm.pred=rep("No", 1000)
glm.pred[glm.probs>0.5]="Yes"
table(glm.pred, test.Y)
## test.Y
## glm.pred No Yes
## No 934 59
## Yes 7 0
glm.pred=rep("No" ,1000)
glm.pred[glm.probs>0.25]="Yes"
table(glm.pred, test.Y)
## test.Y
## glm.pred No Yes
## No 919 48
## Yes 22 11
11/(22+11)
## [1] 0.3333333