Visit my website for more like this!
Heavily borrowed from:
Textbook: Introduction to statistical learning
Textbook: Elements of statistical learning
UCLA Example link
require(knitr)
## Loading required package: knitr
Linear regression (tutorial here) assumes that the response variable Y is quantitative. However, in many situations we are dealing with qualitative response variables, which we will generally refer to as categorical variables. For example, eye color is categorical since it takes values like brown, blue, and green. Classification involves assigning an observation to a specific class of a categorical variable; usually we do this by predicting the probability that the observation belongs to each class.
There are many classification techniques, or classifiers, that could be used to predict a qualitative response variable. This notebook covers logistic regression. Later notebooks (link here) will cover more complicated classifiers.
Just like linear regression, in classification we have a set of training observations that we use to build a classifier, and we then evaluate the model on test data to estimate its out-of-sample error. In this notebook we will use a dataset of credit card information to predict whether an individual will default on their credit card payment.
# Load the textbook R package
require(ISLR)
# Load in the credit data
attach(Default)
# Let's take a look at the data
str(Default)
## 'data.frame': 10000 obs. of 4 variables:
## $ default: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ student: Factor w/ 2 levels "No","Yes": 1 2 1 1 1 2 1 2 1 1 ...
## $ balance: num 730 817 1074 529 786 ...
## $ income : num 44362 12106 31767 35704 38463 ...
# How many people actually default?
tmp <- table(default)
(tmp[[2]]/tmp[[1]])*100
## [1] 3.445
We can see that these data have 10000 observations of 4 variables, and that only about 3% of people actually default. Let's create a few diagnostic plots to get a sense of the data. Remember, the goal here will be to predict whether someone will default on their credit card payment, based on the variables student, balance, and income.
library(ggplot2); library(gridExtra)
## Loading required package: grid
x <- qplot(x=balance, y=income, color=default, shape=default, geom='point')+scale_shape(solid=FALSE)
y <- qplot(x=default, y=balance, fill=default, geom='boxplot')+guides(fill=FALSE)
z <- qplot(x=default, y=income, fill=default, geom='boxplot')+guides(fill=FALSE)
# Plot
x
grid.arrange(y, z, nrow=1)
We see that the relationship between the predictor balance and the default rate is rather pronounced; this will likely be a major predictor in the forthcoming model.
In the Default dataset, the response variable falls into two categories (yes or no). Rather than modeling the response Y directly, logistic regression models the probability that Y belongs to a particular category. Therefore, in our case with the Default data, logistic regression models the probability of defaulting. For example, the probability of default given balance can be written as
\[Pr(default = Yes|balance)\]
where the values range from 0 to 1. Then, for any given value of balance, a prediction can be made for default. For example, we could predict default = Yes for any person whose predicted probability of defaulting is > 0.5. We can also use these probabilities in other ways: a company that wishes to be conservative in flagging individuals at risk of default could choose a lower probability threshold, say > 0.1.
Logistic regression uses the logistic function fitted by maximum likelihood. The logistic function will always produce an S-shaped curve, so regardless of the value of X we will always return a sensible prediction. To interpret the model we can rearrange the equation so that it returns the odds. Odds are traditionally used instead of probabilities in horse racing, since they relate more naturally to the correct betting strategy. Taking the log of the equation simplifies it further, and now it looks similar to what we have seen in linear regression.
\[\log\left(\frac{p(X)}{1-p(X)}\right) = β0 + β1X\]
The left-hand side is now the log-odds, or logit. In linear regression, a one-unit increase in X changes the expected response Y by \(β1\); in logistic regression, increasing X by one unit changes the log odds by \(β1\). Since the probability curve is not a straight line, the change in probability associated with a one-unit change in X depends on the current value of X. However, if \(β1\) is positive, then increasing X will result in an increase in probability, and vice-versa.
The coefficients \(β0\) and \(β1\) are initially unknown, and must be estimated from the available training data. Recall that linear regression uses the least squares approach to estimate the coefficients. For logistic regression, the more general method of maximum likelihood is preferred for its robust statistical properties. Intuitively, the algorithm seeks coefficients such that the predicted probability of default is close to 1 for individuals who defaulted and close to 0 for individuals who did not.
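For reference, the standard textbook form behind these statements (general notation, not specific to this dataset) is the logistic function
\[p(X) = \frac{e^{β0 + β1X}}{1 + e^{β0 + β1X}}\]
and maximum likelihood chooses \(β0\) and \(β1\) to maximize the likelihood
\[\ell(β0, β1) = \prod_{i:y_i=1} p(x_i) \prod_{i':y_{i'}=0} \left(1 - p(x_{i'})\right)\]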
Let’s fit a basic model to these values and inspect the results.
logit <- glm(default ~ balance, data=Default, family='binomial')
summary(logit)
##
## Call:
## glm(formula = default ~ balance, family = "binomial", data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.270 -0.146 -0.059 -0.022 3.759
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -10.65133 0.36116 -29.5 <2e-16 ***
## balance 0.00550 0.00022 24.9 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1596.5 on 9998 degrees of freedom
## AIC: 1600
##
## Number of Fisher Scoring iterations: 8
Here we see that \(β1\) is 0.0055; this means that balance is associated with an increased probability of default. Specifically, a one-unit increase in balance increases the log odds of default by 0.0055 units.
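To see what this means on the probability scale, we can plug a couple of illustrative balance values into the fitted model (the values 1000 and 2000 below are arbitrary examples, not from the original analysis):
# Predicted probability of default at two illustrative balances
predict(logit, newdata=data.frame(balance=c(1000, 2000)), type='response')
With the coefficients above these come out to roughly 0.6% and 59%, showing how strongly the probability of default depends on balance.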
Other components of the logistic regression output are similar to linear regression. We can use the standard errors to measure the accuracy of the coefficient estimates, and the z-statistics and p-values to assess whether each predictor has a statistically significant association with the response.
For qualitative predictors, we can take the same dummy variable approach as in linear regression.
# Create a new dummy variable for students
Default$studentD <- 0
Default$studentD[Default$student=="Yes"] <- 1
logit <- glm(default ~ studentD, data=Default, family='binomial')
summary(logit)
##
## Call:
## glm(formula = default ~ studentD, family = "binomial", data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.297 -0.297 -0.243 -0.243 2.659
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.5041 0.0707 -49.55 < 2e-16 ***
## studentD 0.4049 0.1150 3.52 0.00043 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 2908.7 on 9998 degrees of freedom
## AIC: 2913
##
## Number of Fisher Scoring iterations: 6
From these diagnostics we see that the model predicts that students tend to have higher default probabilities than non-students.
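We can translate the positive studentD coefficient onto the probability scale by predicting for a non-student (studentD = 0) and a student (studentD = 1); this is a quick sketch rather than part of the original analysis:
# Predicted default probability for a non-student vs a student
predict(logit, newdata=data.frame(studentD=c(0, 1)), type='response')
With this fit the predicted default probabilities are roughly 2.9% for non-students and 4.3% for students.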
logit <- glm(default ~ income + balance + studentD, family='binomial', data=Default)
summary(logit)
##
## Call:
## glm(formula = default ~ income + balance + studentD, family = "binomial",
## data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.469 -0.142 -0.056 -0.020 3.738
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.09e+01 4.92e-01 -22.08 <2e-16 ***
## income 3.03e-06 8.20e-06 0.37 0.7115
## balance 5.74e-03 2.32e-04 24.74 <2e-16 ***
## studentD -6.47e-01 2.36e-01 -2.74 0.0062 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1571.5 on 9996 degrees of freedom
## AIC: 1580
##
## Number of Fisher Scoring iterations: 8
There is a very interesting result here. Recall that the bivariate logistic regression between default and student predicted that students are more likely to default than non-students. However, when student is paired with balance and income, students appear less likely to default than non-students. What is going on here? For fixed values of balance and income, a student is less likely to default than a non-student; in fact, the student default rate is below the non-student default rate at every value of balance. Yet if we ignore balance and average over all individuals, the overall student default rate is higher than the non-student rate, which is why the bivariate regression produces a positive coefficient. The reason is that the features student and balance are correlated: students tend to hold higher levels of debt, and higher balances are associated with a higher probability of defaulting. So a student is more likely to default than a non-student if no credit card balance information is available, but students are actually less risky than non-students with the same credit card balance. This phenomenon is known as confounding.
In this dataset we are interested in building a predictive model to see how variables like GRE (Graduate Record Exam) scores, GPA, and the prestige of the individual's undergraduate institution affect graduate school admission. The response variable Y is admit / not-admit.
# Load the data
data <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")
str(data)
## 'data.frame': 400 obs. of 4 variables:
## $ admit: int 0 1 1 1 0 1 1 0 1 0 ...
## $ gre : int 380 660 800 640 520 760 560 400 540 700 ...
## $ gpa : num 3.61 3.67 4 3.19 2.93 3 2.98 3.08 3.39 3.92 ...
## $ rank : int 3 3 1 4 4 2 1 2 3 2 ...
# Some basic statistics
summary(data)
## admit gre gpa rank
## Min. :0.000 Min. :220 Min. :2.26 Min. :1.00
## 1st Qu.:0.000 1st Qu.:520 1st Qu.:3.13 1st Qu.:2.00
## Median :0.000 Median :580 Median :3.40 Median :2.00
## Mean :0.318 Mean :588 Mean :3.39 Mean :2.48
## 3rd Qu.:1.000 3rd Qu.:660 3rd Qu.:3.67 3rd Qu.:3.00
## Max. :1.000 Max. :800 Max. :4.00 Max. :4.00
sapply(data, sd) # standard deviation
## admit gre gpa rank
## 0.4661 115.5165 0.3806 0.9445
# Two-way contingency table, to ensure no 0 cells
xtabs(~admit + rank, data=data)
## rank
## admit 1 2 3 4
## 0 28 97 93 55
## 1 33 54 28 12
Next we will have to clean the data a little: the rank feature is stored as an integer, but we need to treat it as a categorical factor.
data$rank <- factor(data$rank)
# Fit the first logit model
logit <- glm(admit ~ gre + gpa + rank, data=data, family = 'binomial')
summary(logit)
##
## Call:
## glm(formula = admit ~ gre + gpa + rank, family = "binomial",
## data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.627 -0.866 -0.639 1.149 2.079
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.98998 1.13995 -3.50 0.00047 ***
## gre 0.00226 0.00109 2.07 0.03847 *
## gpa 0.80404 0.33182 2.42 0.01539 *
## rank2 -0.67544 0.31649 -2.13 0.03283 *
## rank3 -1.34020 0.34531 -3.88 0.00010 ***
## rank4 -1.55146 0.41783 -3.71 0.00020 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 499.98 on 399 degrees of freedom
## Residual deviance: 458.52 on 394 degrees of freedom
## AIC: 470.5
##
## Number of Fisher Scoring iterations: 4
Excellent, these are rather strong-looking results. We see that all the variables are significant. Using the coefficients we can begin to interpret the meaning of the results. R outputs these as log odds.
- For every one-unit increase in gre, the log odds of admission increases by 0.0022.
- For every one-unit increase in gpa, the log odds of being admitted increases by 0.804.
- The rank variables have a slightly different interpretation: attending a rank 2 institution, versus a rank 1 institution, changes the log odds of admission by -0.675. Recall that with dummy variables we are treating rank 1 as the baseline, so all the visible rank coefficients are in reference to rank 1.
Next we can use the confint function to obtain confidence intervals for the coefficient estimates. For logistic regression, these are based on the profiled log-likelihood function. We can also get confidence intervals based on the standard errors.
confint(logit) # Using LL
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -6.2716202 -1.792547
## gre 0.0001376 0.004436
## gpa 0.1602959 1.464143
## rank2 -1.3008888 -0.056746
## rank3 -2.0276713 -0.670372
## rank4 -2.4000265 -0.753543
confint.default(logit) # Using standard errors
## 2.5 % 97.5 %
## (Intercept) -6.2242419 -1.755716
## gre 0.0001202 0.004409
## gpa 0.1536837 1.454391
## rank2 -1.2957513 -0.055135
## rank3 -2.0169921 -0.663416
## rank4 -2.3703986 -0.732529
Since the rank variable is qualitative and is split into individual dummy levels, we can also test the overall effect of rank in the model using the wald.test function from the aod library.
library(aod)
wald.test(b = coef(logit), Sigma = vcov(logit), Terms=4:6)
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 20.9, df = 3, P(> X2) = 0.00011
As expected, the effect of rank is statistically significant. We can further this analysis by testing whether there are significant differences between levels 2-4 of the rank variable. This is not present in the existing model, since we are comparing all levels of rank to the baseline, rank 1. Below we test whether the coefficient for rank=2 is equal to the coefficient for rank=3. The first line of code below creates a vector l that defines the test we want to perform. In this case, we want to test the difference (subtraction) of the terms for rank=2 and rank=3 (i.e., the 4th and 5th terms in the model). To contrast these two terms, we multiply one of them by 1 and the other by -1. The other terms in the model are not involved in the test, so they are multiplied by 0. The second line of code below uses L=l to tell R that we wish to base the test on the vector l (rather than using the Terms option as we did above).
l <- cbind(0,0,0,1,-1,0)
wald.test(b=coef(logit), Sigma=vcov(logit), L=l)
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 5.5, df = 1, P(> X2) = 0.019
Indeed, there is a significant difference between rank 2 and 3.
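To put a number on that difference, we can look at the gap between the two coefficients on the log-odds scale (a quick follow-up, not from the original analysis):
# Difference in the log odds of admission between rank 2 and rank 3 schools
coef(logit)["rank2"] - coef(logit)["rank3"]
Exponentiating this difference would give the odds ratio comparing rank 2 directly to rank 3, which ties into the odds ratios discussed next.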
Currently, the coefficients are reported as log odds, which are difficult to interpret in a meaningful way. To remedy this, we can exponentiate the coefficients and interpret them as odds ratios.
# Odds ratios
exp(coef(logit))
## (Intercept) gre gpa rank2 rank3 rank4
## 0.0185 1.0023 2.2345 0.5089 0.2618 0.2119
# Odds ratio and 95% CI
exp(cbind(OR = coef(logit), confint(logit)))
## Waiting for profiling to be done...
## OR 2.5 % 97.5 %
## (Intercept) 0.0185 0.001889 0.1665
## gre 1.0023 1.000138 1.0044
## gpa 2.2345 1.173858 4.3238
## rank2 0.5089 0.272290 0.9448
## rank3 0.2618 0.131642 0.5115
## rank4 0.2119 0.090716 0.4707
Now we can say more relevant things like: for every one-unit increase in gpa, the odds of being admitted to graduate school (versus not being admitted) increase by a factor of 2.23.
We can also use predicted probabilities to understand the model. First we will calculate the predicted probability of admission at each value of rank, holding gre and gpa at their means.
# Create new dataframe
ranktest <- with(data, data.frame(gre=mean(gre), gpa=mean(gpa), rank=factor(1:4)))
ranktest
## gre gpa rank
## 1 587.7 3.39 1
## 2 587.7 3.39 2
## 3 587.7 3.39 3
## 4 587.7 3.39 4
ranktest$rankP <- predict(logit, newdata=ranktest, type='response')
ranktest
## gre gpa rank rankP
## 1 587.7 3.39 1 0.5166
## 2 587.7 3.39 2 0.3523
## 3 587.7 3.39 3 0.2186
## 4 587.7 3.39 4 0.1847
Very cool. Now we can see very clearly that, holding everything else constant, a student from a rank 1 school has a 52% chance of being admitted, while students from the lowest-ranked schools have only an 18% chance of admittance.
We can also extend this same procedure to a range of gre values, rather than just the mean. Here we test the effect of gre scores ranging from 200 to 800 at each rank.
tmp <- with(data, data.frame(gre=rep(seq(200, 800, length.out=100), 4), gpa=mean(gpa), rank = factor(rep(1:4, each = 100))))
# Get probabilities and standard errors (for plotting)
probs <- cbind(tmp, predict(logit, newdata=tmp, type='link', se=TRUE))
# Get a look at this new data
head(probs)
## gre gpa rank fit se.fit residual.scale
## 1 200.0 3.39 1 -0.8115 0.5148 1
## 2 206.1 3.39 1 -0.7978 0.5091 1
## 3 212.1 3.39 1 -0.7840 0.5034 1
## 4 218.2 3.39 1 -0.7703 0.4978 1
## 5 224.2 3.39 1 -0.7566 0.4922 1
## 6 230.3 3.39 1 -0.7429 0.4866 1
probs <- within(probs, {
PredProb <- plogis(fit) # convert log odds (link scale) to predicted probability
LL <- plogis(fit - (1.96 * se.fit)) # lower bound of 95% confidence interval
UL <- plogis(fit + (1.96 * se.fit)) # upper bound of 95% confidence interval
})
Graph the output…
ggplot(probs, aes(x=gre, y=PredProb)) +
geom_ribbon(aes(ymin=LL, ymax=UL, fill=rank), alpha=0.2) +
geom_line(aes(color=rank))
Coming later…
This data can be found in the ISLR package. The dataset contains percentage returns for the S&P 500 stock index over 1,250 days from 2001 to 2005. For each date there are percentage returns recorded for each of the five previous trading days, Lag1 - Lag5. Also recorded are Volume (the number of shares traded on the previous day, in billions), Today (the percentage return on the given date), and Direction (whether the market was Up or Down on the given date).
library(ISLR)
str(Smarket)
## 'data.frame': 1250 obs. of 9 variables:
## $ Year : num 2001 2001 2001 2001 2001 ...
## $ Lag1 : num 0.381 0.959 1.032 -0.623 0.614 ...
## $ Lag2 : num -0.192 0.381 0.959 1.032 -0.623 ...
## $ Lag3 : num -2.624 -0.192 0.381 0.959 1.032 ...
## $ Lag4 : num -1.055 -2.624 -0.192 0.381 0.959 ...
## $ Lag5 : num 5.01 -1.055 -2.624 -0.192 0.381 ...
## $ Volume : num 1.19 1.3 1.41 1.28 1.21 ...
## $ Today : num 0.959 1.032 -0.623 0.614 0.213 ...
## $ Direction: Factor w/ 2 levels "Down","Up": 2 2 1 2 2 2 1 2 2 2 ...
# Correlation minus the Direction variable
cor(Smarket[,-9])
## Year Lag1 Lag2 Lag3 Lag4 Lag5 Volume
## Year 1.00000 0.029700 0.030596 0.033195 0.035689 0.029788 0.53901
## Lag1 0.02970 1.000000 -0.026294 -0.010803 -0.002986 -0.005675 0.04091
## Lag2 0.03060 -0.026294 1.000000 -0.025897 -0.010854 -0.003558 -0.04338
## Lag3 0.03319 -0.010803 -0.025897 1.000000 -0.024051 -0.018808 -0.04182
## Lag4 0.03569 -0.002986 -0.010854 -0.024051 1.000000 -0.027084 -0.04841
## Lag5 0.02979 -0.005675 -0.003558 -0.018808 -0.027084 1.000000 -0.02200
## Volume 0.53901 0.040910 -0.043383 -0.041824 -0.048414 -0.022002 1.00000
## Today 0.03010 -0.026155 -0.010250 -0.002448 -0.006900 -0.034860 0.01459
## Today
## Year 0.030095
## Lag1 -0.026155
## Lag2 -0.010250
## Lag3 -0.002448
## Lag4 -0.006900
## Lag5 -0.034860
## Volume 0.014592
## Today 1.000000
From this output we can see there is no real correlation between the percentage returns Today and any of the Lag variables, but there is a notable correlation between Year and Volume: trading volume is increasing over time.
attach(Smarket)
plot(Volume)
Next, let's fit a logistic regression on the lag variables using the glm() function with family='binomial'. First we split the data into a training set (years before 2005) and a test set (2005).
# Split data into testing and training
train<-Smarket[Year<2005,]
test<-Smarket[Year==2005,]
Here we fit a logistic model to the stock market data, first using the built-in R functions and then using the caret package.
library(caret)
## Loading required package: lattice
logit <- glm(Direction ~ Lag1+Lag2+Lag3, family='binomial', data=train)
summary(logit)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3, family = "binomial",
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.34 -1.19 1.07 1.16 1.34
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.03223 0.06338 0.51 0.61
## Lag1 -0.05552 0.05171 -1.07 0.28
## Lag2 -0.04430 0.05167 -0.86 0.39
## Lag3 0.00881 0.05150 0.17 0.86
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1383.3 on 997 degrees of freedom
## Residual deviance: 1381.4 on 994 degrees of freedom
## AIC: 1389
##
## Number of Fisher Scoring iterations: 3
# Run the model on the test set
test.probs <-predict(logit, test, type='response')
pred.logit <- rep('Down',length(test.probs))
pred.logit[test.probs>=0.5] <- 'Up'
table(pred.logit, test$Direction)
##
## pred.logit Down Up
## Down 39 31
## Up 72 110
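From this table we can also compute the overall test-set accuracy directly; this is just a cross-check and should match the Accuracy value reported by confusionMatrix() below.
# Proportion of test days classified correctly (roughly 0.59 here)
mean(pred.logit == test$Direction)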
confusionMatrix(test$Direction, pred.logit)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 39 72
## Up 31 110
##
## Accuracy : 0.591
## 95% CI : (0.528, 0.653)
## No Information Rate : 0.722
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.137
## Mcnemar's Test P-Value : 8.1e-05
##
## Sensitivity : 0.557
## Specificity : 0.604
## Pos Pred Value : 0.351
## Neg Pred Value : 0.780
## Prevalence : 0.278
## Detection Rate : 0.155
## Detection Prevalence : 0.440
## Balanced Accuracy : 0.581
##
## 'Positive' Class : Down
##
modelFit<- train(Direction~Lag1+Lag2+Lag3, method='glm',preProcess=c('scale', 'center'), data=train, family=binomial(link='logit'))
summary(modelFit)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.34 -1.19 1.07 1.16 1.34
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.0321 0.0634 0.51 0.61
## Lag1 -0.0683 0.0636 -1.07 0.28
## Lag2 -0.0545 0.0635 -0.86 0.39
## Lag3 0.0109 0.0635 0.17 0.86
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1383.3 on 997 degrees of freedom
## Residual deviance: 1381.4 on 994 degrees of freedom
## AIC: 1389
##
## Number of Fisher Scoring iterations: 3
confusionMatrix(test$Direction, predict(modelFit, test))
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 39 72
## Up 31 110
##
## Accuracy : 0.591
## 95% CI : (0.528, 0.653)
## No Information Rate : 0.722
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.137
## Mcnemar's Test P-Value : 8.1e-05
##
## Sensitivity : 0.557
## Specificity : 0.604
## Pos Pred Value : 0.351
## Neg Pred Value : 0.780
## Prevalence : 0.278
## Detection Rate : 0.155
## Detection Prevalence : 0.440
## Balanced Accuracy : 0.581
##
## 'Positive' Class : Down
##
As you can see, we obtain about 59% accuracy overall, which isn't very good. We do catch about 78% of the actual 'Up' days, but of the days the model predicts 'Up', only about 60% actually turn out to be up, so we should be cautious about trading on an 'Up' prediction. The models are not particularly strong because lagged stock market returns usually aren't good indicators of future performance.
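We can verify that last figure directly from the predictions and the test labels (a quick check, not from the original text):
# Of the days the model predicts 'Up', what fraction are actually up?
mean(test$Direction[pred.logit == "Up"] == "Up")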
To predict Direction for new values of Lag1 through Lag3, we simply use the predict() function and feed it a data frame of new values.
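For example, a minimal sketch (the lag values below are made up purely for illustration):
# Predicted probability of an 'Up' day for two hypothetical sets of lag values
predict(logit, newdata=data.frame(Lag1=c(1.2, 1.5), Lag2=c(1.1, -0.8), Lag3=c(0.0, 0.5)), type='response')
Since Direction has levels Down and Up, these predicted probabilities correspond to the probability of an 'Up' day.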