R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button, a document is generated that includes both the content and the output of any R code chunks embedded in the document.

To run code in a chunk, press Ctrl+Enter on a PC or Command+Enter on a Mac.

Load Packages

In this section, we install and load the necessary packages.

Import Data

In this section, we import the necessary data for this lab.
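
The import chunk itself is not shown in the rendered output. A minimal sketch of what it might look like follows, assuming the file is named Weekly.csv and sits in the working directory; `load_weekly` is a hypothetical helper introduced only for illustration. Keeping strings as characters matches the `str()` output shown in Task 1, where Direction is read in as `chr`.

```r
# Hypothetical helper: read the weekly returns file into a data frame.
# The default path "Weekly.csv" is an assumption; point it at your own copy.
load_weekly <- function(path = "Weekly.csv") {
  if (!file.exists(path)) {
    stop("File not found: ", path)
  }
  # Keep Direction as character here; it is converted to a factor later on.
  read.csv(path, stringsAsFactors = FALSE)
}

# Usage, once the file is in the working directory:
# Weekly <- load_weekly()
```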

Stock Market Case

We use the Weekly.csv data set, which is similar in nature to the Smarket data from the R lab.

This data set contains 1,089 weekly percentage returns for the S&P 500 stock index over 21 years, from the beginning of 1990 until the end of 2010. For each week, we have recorded the percentage returns for each of the five previous trading weeks, Lag1 through Lag5. We have also recorded Volume (the number of shares traded in the previous week, in billions), Today (the percentage return for this week) and Direction (whether the market was Up or Down in this week).

Do the following tasks and answer the questions below.

Task 1: Data exploration

Produce some numerical and graphical summaries of the Weekly data.

# Explore the dataset using five functions: dim(), str(), colnames(), head() and tail()

dim(Weekly)
## [1] 1089    9
str(Weekly)
## 'data.frame':    1089 obs. of  9 variables:
##  $ Year     : int  1990 1990 1990 1990 1990 1990 1990 1990 1990 1990 ...
##  $ Lag1     : num  0.816 -0.27 -2.576 3.514 0.712 ...
##  $ Lag2     : num  1.572 0.816 -0.27 -2.576 3.514 ...
##  $ Lag3     : num  -3.936 1.572 0.816 -0.27 -2.576 ...
##  $ Lag4     : num  -0.229 -3.936 1.572 0.816 -0.27 ...
##  $ Lag5     : num  -3.484 -0.229 -3.936 1.572 0.816 ...
##  $ Volume   : num  0.155 0.149 0.16 0.162 0.154 ...
##  $ Today    : num  -0.27 -2.576 3.514 0.712 1.178 ...
##  $ Direction: chr  "Down" "Down" "Up" "Up" ...
colnames(Weekly)
## [1] "Year"      "Lag1"      "Lag2"      "Lag3"      "Lag4"      "Lag5"     
## [7] "Volume"    "Today"     "Direction"
head(Weekly)
##   Year   Lag1   Lag2   Lag3   Lag4   Lag5    Volume  Today Direction
## 1 1990  0.816  1.572 -3.936 -0.229 -3.484 0.1549760 -0.270      Down
## 2 1990 -0.270  0.816  1.572 -3.936 -0.229 0.1485740 -2.576      Down
## 3 1990 -2.576 -0.270  0.816  1.572 -3.936 0.1598375  3.514        Up
## 4 1990  3.514 -2.576 -0.270  0.816  1.572 0.1616300  0.712        Up
## 5 1990  0.712  3.514 -2.576 -0.270  0.816 0.1537280  1.178        Up
## 6 1990  1.178  0.712  3.514 -2.576 -0.270 0.1544440 -1.372      Down
tail(Weekly)
##      Year   Lag1   Lag2   Lag3   Lag4   Lag5   Volume  Today Direction
## 1084 2010  0.043 -2.173  3.599  0.015  0.586 4.177436 -0.861      Down
## 1085 2010 -0.861  0.043 -2.173  3.599  0.015 3.205160  2.969        Up
## 1086 2010  2.969 -0.861  0.043 -2.173  3.599 4.242568  1.281        Up
## 1087 2010  1.281  2.969 -0.861  0.043 -2.173 4.835082  0.283        Up
## 1088 2010  0.283  1.281  2.969 -0.861  0.043 4.454044  1.034        Up
## 1089 2010  1.034  0.283  1.281  2.969 -0.861 2.707105  0.069        Up
# use summary() to print the descriptive statistics
summary(Weekly)
##       Year           Lag1               Lag2               Lag3         
##  Min.   :1990   Min.   :-18.1950   Min.   :-18.1950   Min.   :-18.1950  
##  1st Qu.:1995   1st Qu.: -1.1540   1st Qu.: -1.1540   1st Qu.: -1.1580  
##  Median :2000   Median :  0.2410   Median :  0.2410   Median :  0.2410  
##  Mean   :2000   Mean   :  0.1506   Mean   :  0.1511   Mean   :  0.1472  
##  3rd Qu.:2005   3rd Qu.:  1.4050   3rd Qu.:  1.4090   3rd Qu.:  1.4090  
##  Max.   :2010   Max.   : 12.0260   Max.   : 12.0260   Max.   : 12.0260  
##       Lag4               Lag5              Volume            Today         
##  Min.   :-18.1950   Min.   :-18.1950   Min.   :0.08747   Min.   :-18.1950  
##  1st Qu.: -1.1580   1st Qu.: -1.1660   1st Qu.:0.33202   1st Qu.: -1.1540  
##  Median :  0.2380   Median :  0.2340   Median :1.00268   Median :  0.2410  
##  Mean   :  0.1458   Mean   :  0.1399   Mean   :1.57462   Mean   :  0.1499  
##  3rd Qu.:  1.4090   3rd Qu.:  1.4050   3rd Qu.:2.05373   3rd Qu.:  1.4050  
##  Max.   : 12.0260   Max.   : 12.0260   Max.   :9.32821   Max.   : 12.0260  
##   Direction        
##  Length:1089       
##  Class :character  
##  Mode  :character  
##                    
##                    
## 
# Convert 'Direction' from character to factor
Weekly$Direction <- as.factor(Weekly$Direction)

# Use pairs() to draw a scatterplot matrix of every pair of variables, with points colored by Direction
pairs(Weekly, col=Weekly$Direction)

# Use cor() to compute the correlation matrix of all numeric variables (column 9, Direction, is dropped)
cor(Weekly[,-9])
##               Year         Lag1        Lag2        Lag3         Lag4
## Year    1.00000000 -0.032289274 -0.03339001 -0.03000649 -0.031127923
## Lag1   -0.03228927  1.000000000 -0.07485305  0.05863568 -0.071273876
## Lag2   -0.03339001 -0.074853051  1.00000000 -0.07572091  0.058381535
## Lag3   -0.03000649  0.058635682 -0.07572091  1.00000000 -0.075395865
## Lag4   -0.03112792 -0.071273876  0.05838153 -0.07539587  1.000000000
## Lag5   -0.03051910 -0.008183096 -0.07249948  0.06065717 -0.075675027
## Volume  0.84194162 -0.064951313 -0.08551314 -0.06928771 -0.061074617
## Today  -0.03245989 -0.075031842  0.05916672 -0.07124364 -0.007825873
##                Lag5      Volume        Today
## Year   -0.030519101  0.84194162 -0.032459894
## Lag1   -0.008183096 -0.06495131 -0.075031842
## Lag2   -0.072499482 -0.08551314  0.059166717
## Lag3    0.060657175 -0.06928771 -0.071243639
## Lag4   -0.075675027 -0.06107462 -0.007825873
## Lag5    1.000000000 -0.05851741  0.011012698
## Volume -0.058517414  1.00000000 -0.033077783
## Today   0.011012698 -0.03307778  1.000000000

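Question 1: Do there appear to be any patterns? The lag variables are essentially uncorrelated with each other and with Today (every such correlation is below 0.08 in absolute value). The one clear pattern is the strong positive correlation between Year and Volume (0.84), which indicates that trading volume grew over time.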

Task 2: Logistic Regression

Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results.

# Use glm() to run a logistic analysis on Lag1 through Lag5 and Volume as predictors and Direction as the response
glm.fits = glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data=Weekly, family=binomial)
summary(glm.fits)
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
##     Volume, family = binomial, data = Weekly)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.26686    0.08593   3.106   0.0019 **
## Lag1        -0.04127    0.02641  -1.563   0.1181   
## Lag2         0.05844    0.02686   2.175   0.0296 * 
## Lag3        -0.01606    0.02666  -0.602   0.5469   
## Lag4        -0.02779    0.02646  -1.050   0.2937   
## Lag5        -0.01447    0.02638  -0.549   0.5833   
## Volume      -0.02274    0.03690  -0.616   0.5377   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1496.2  on 1088  degrees of freedom
## Residual deviance: 1486.4  on 1082  degrees of freedom
## AIC: 1500.4
## 
## Number of Fisher Scoring iterations: 4

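Question 2: Do any of the predictors appear to be statistically significant? If so, which ones? Lag2 is the only predictor that is significant at the 5% level (coefficient 0.058, p-value 0.0296). The other lag variables and Volume all have p-values above 0.1.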

Task 3: Confusion Matrix

Compute the confusion matrix and overall fraction of correct predictions.

# predict the Direction probability of the whole dataset using the fitted logistic regression
glm.probs = predict(glm.fits,type = "response")

# create a vector of class predictions based on whether the predicted probability of a market increase is greater than or less than 0.5
glm.pred = ifelse(glm.probs>0.5,"Up","Down")

# Use the table() function to produce a confusion matrix (rows = actual Direction, columns = predicted Direction)
confusionMatrixweekly <- table(Weekly$Direction, glm.pred)
confusionMatrixweekly
##       glm.pred
##        Down  Up
##   Down   54 430
##   Up     48 557

Use the confusion matrix to compute Accuracy, Sensitivity and Specificity.

# Accuracy

# Sensitivity

# Specificity
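
One way to fill in these three blanks is sketched below. It rebuilds the confusion matrix from the four counts printed above so that it runs on its own, and it assumes "Up" is the positive class. The names `cm`, `accuracy`, `sensitivity` and `specificity` are illustrative and not part of the lab.

```r
# Rebuild the confusion matrix from the counts printed above
# (rows = actual Direction, columns = predicted Direction).
cm <- matrix(c(54, 48, 430, 557), nrow = 2,
             dimnames = list(actual = c("Down", "Up"),
                             predicted = c("Down", "Up")))

# Accuracy: share of all weeks that were classified correctly.
accuracy <- sum(diag(cm)) / sum(cm)

# Sensitivity: share of actual "Up" weeks predicted as "Up"
# (assumes "Up" is treated as the positive class).
sensitivity <- cm["Up", "Up"] / sum(cm["Up", ])

# Specificity: share of actual "Down" weeks predicted as "Down".
specificity <- cm["Down", "Down"] / sum(cm["Down", ])

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity), 3)
```

With these counts, accuracy is 611/1089 (about 0.561), sensitivity is 557/605 (about 0.921) and specificity is 54/484 (about 0.112). The model predicts "Up" for most weeks, so it catches nearly all Up weeks but misclassifies most Down weeks.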

Question 3: Explain what the confusion matrix is telling you about the types of errors made by logistic regression. In other words, interpret the Accuracy, Sensitivity and Specificity.

Task 4: Training and Testing Sets

Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions (accuracy) for the held out data (that is, the data from 2009 and 2010).

# set seed to 1 
set.seed(1)

## Split the data into training and testing sets based on the year.
# Use the data before 2009 as the training set and the data from 2009 and 2010 as the testing set


# Use glm() to run a logistic analysis on Lag2 as predictor and Direction as the response


# predict the Direction probability of the test dataset using the fitted logistic regression


# create a vector of class predictions based on whether the predicted probability of a market increase is greater than or less than 0.5



# Use table() function to produce a confusion matrix


# Calculate accuracy
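
The steps listed in the comments above could be sketched as follows. The names `trainWeekly` and `testWeekly` come from the hint in Task 5; `glm.fit2`, `test.probs`, `test.pred` and `cm.test` are illustrative. So that the snippet runs on its own, it falls back to a small simulated data frame (made-up values, not the real Weekly data) when no `Weekly` object is loaded.

```r
# Stand-in data so the sketch runs on its own: simulated values, NOT the real
# Weekly data. This branch is skipped when Weekly is already loaded.
if (!exists("Weekly")) {
  set.seed(1)
  n <- 210
  Weekly <- data.frame(Year = rep(1990:2010, each = 10), Lag2 = rnorm(n))
  Weekly$Direction <- ifelse(0.2 * Weekly$Lag2 + rnorm(n) > 0, "Up", "Down")
}
# glm() needs a factor response; levels sort as "Down", "Up", so it models P(Up).
Weekly$Direction <- as.factor(Weekly$Direction)

# Split by year: 1990-2008 for training, 2009-2010 for testing.
trainWeekly <- Weekly[Weekly$Year < 2009, ]
testWeekly  <- Weekly[Weekly$Year >= 2009, ]

# Logistic regression with Lag2 as the only predictor, fit on the training set.
glm.fit2 <- glm(Direction ~ Lag2, data = trainWeekly, family = binomial)

# Predicted probability of "Up" for the held-out weeks.
test.probs <- predict(glm.fit2, newdata = testWeekly, type = "response")

# Classify as "Up" when the predicted probability exceeds 0.5.
test.pred <- ifelse(test.probs > 0.5, "Up", "Down")

# Confusion matrix (rows = actual, columns = predicted) and test accuracy.
cm.test <- table(actual = testWeekly$Direction, predicted = test.pred)
acc.test <- mean(test.pred == testWeekly$Direction)
cm.test
acc.test
```

Because the split is determined entirely by Year, the `set.seed(1)` call does not affect which weeks land in each set.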

Question 4: Is this classifier better than the logistic model fitted in Task 2? Explain.

Task 5

Repeat Task 4 using KNN with K = 1 and K = 10. Note that you should only use Lag2 as the predictor and use the training and testing sets you developed in Task 4.

### KNN for k=1
## IMPORTANT: you must use the as.matrix() function to convert the predictor to a matrix
# knn() works on matrices or data frames; a plain vector passed as the test set is treated as a single observation
# So, you should write knn(as.matrix(trainWeekly[,'Lag2']), as.matrix(testWeekly[,'Lag2']), trainWeekly$Direction, k = 1)


# Use table() function to produce a confusion matrix


# Calculate accuracy

### KNN k = 10

# Use table() function to produce a confusion matrix


# Calculate accuracy
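
A possible way to complete both KNN stubs is sketched below, using `knn()` from the class package as the hint above suggests. The object names (`train.X`, `test.X`, `knn.pred1`, `knn.pred10` and so on) are illustrative. When the Task 4 training and testing sets are not in the workspace, the snippet builds simulated stand-ins (made-up values, not the real Weekly data) so that it still runs.

```r
library(class)  # provides knn()

# Stand-in split so the sketch runs on its own: simulated values, NOT the real
# Weekly data. This branch is skipped when the Task 4 objects already exist.
if (!exists("trainWeekly") || !exists("testWeekly")) {
  set.seed(1)
  sim <- data.frame(Year = rep(1990:2010, each = 10), Lag2 = rnorm(210))
  sim$Direction <- factor(ifelse(0.2 * sim$Lag2 + rnorm(210) > 0, "Up", "Down"))
  trainWeekly <- sim[sim$Year < 2009, ]
  testWeekly  <- sim[sim$Year >= 2009, ]
}

# knn() wants matrices, so turn the single Lag2 column into one-column matrices.
train.X <- as.matrix(trainWeekly[, "Lag2"])
test.X  <- as.matrix(testWeekly[, "Lag2"])

# K = 1. The seed matters because knn() breaks ties at random.
set.seed(1)
knn.pred1 <- knn(train.X, test.X, trainWeekly$Direction, k = 1)
cm.knn1 <- table(actual = testWeekly$Direction, predicted = knn.pred1)
acc.knn1 <- mean(knn.pred1 == testWeekly$Direction)

# K = 10.
set.seed(1)
knn.pred10 <- knn(train.X, test.X, trainWeekly$Direction, k = 10)
cm.knn10 <- table(actual = testWeekly$Direction, predicted = knn.pred10)
acc.knn10 <- mean(knn.pred10 == testWeekly$Direction)

c(k1 = acc.knn1, k10 = acc.knn10)
```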

Task 6

Plot the ROC curve and compute the AUC for the most recent logistic regression (the Lag2-only model from Task 4), KNN (k = 1) and KNN (k = 10).

# ROC curve for logistic regression


# ROC curve for KNN k = 1


# ROC curve for KNN k = 10
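
The lab does not say which ROC package to use, so the sketch below computes the curve and the AUC in base R instead of relying on a dedicated package. `roc_points` and `auc_trapezoid` are hypothetical helpers, and the six scores in the demonstration are toy values chosen only to show the mechanics. Each model needs a score for "Up": for logistic regression that is the predicted probability, and for KNN one option is the vote share that `knn(..., prob = TRUE)` stores in the "prob" attribute.

```r
# Hypothetical helper: ROC points (FPR, TPR) for every score threshold.
# scores = predicted probability of the positive class, actual = true labels.
roc_points <- function(scores, actual, positive = "Up") {
  is_pos <- actual == positive
  # Start above every score so that the curve begins at (0, 0).
  thresholds <- c(Inf, sort(unique(scores), decreasing = TRUE))
  tpr <- sapply(thresholds, function(t) mean(scores[is_pos] >= t))
  fpr <- sapply(thresholds, function(t) mean(scores[!is_pos] >= t))
  data.frame(fpr = fpr, tpr = tpr)
}

# Hypothetical helper: area under the ROC points by the trapezoidal rule.
auc_trapezoid <- function(roc) {
  sum(diff(roc$fpr) * (head(roc$tpr, -1) + tail(roc$tpr, -1)) / 2)
}

# Toy demonstration (made-up scores, not results from the Weekly data).
toy.scores <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4)
toy.actual <- c("Up", "Up", "Down", "Up", "Down", "Down")
toy.roc <- roc_points(toy.scores, toy.actual)
toy.auc <- auc_trapezoid(toy.roc)

plot(toy.roc$fpr, toy.roc$tpr, type = "l",
     xlab = "False positive rate", ylab = "True positive rate")
abline(0, 1, lty = 2)  # reference line for a random classifier

# For the lab models, pass the test-set probability of "Up" and the test labels.
# For KNN, the "prob" attribute is the vote share of the winning class, so:
# fit  <- knn(train, test, cl, k = 10, prob = TRUE)
# p.up <- ifelse(fit == "Up", attr(fit, "prob"), 1 - attr(fit, "prob"))
```

With k = 1 every KNN score is either 0 or 1, so the curve reduces to a single interior point joined to the corners.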

Question 5: Which of these methods appears to provide the best results on this data? Base your answer on the accuracy, AUC and ROC curve results.