setwd("/Users/traves/Dropbox/SM339/homework 5")
djiadata = read.csv("djiadata.csv")
nikdata = read.csv("nikdata.csv")
new = merge(x = djiadata, y = nikdata, by.x = "TodaysDate", by.y = "Date")
new
## TodaysDate X.x yesterdays.DJIA.ch X.y Nik.225.ch lag.Nik
## 1 58 NA 58 NA NA
## 2 1-Oct-09 36 -30 36 -155 33
## 3 1-Sep-09 19 -48 19 38 -42
## 4 10-Aug-09 3 114 3 112 24
## 5 10-Sep-09 25 50 25 202 -81
## 6 11-Aug-09 4 -32 4 61 112
## 7 11-Sep-09 26 80 26 -69 202
## 8 12-Aug-09 5 -97 5 -150 61
## 9 13-Aug-09 6 120 6 82 -150
## 10 14-Aug-09 7 37 7 80 82
## 11 14-Oct-09 43 -15 43 -16 60
## 12 14-Sep-09 27 -22 27 -242 -69
## 13 15-Oct-09 44 145 44 178 -16
## 14 15-Sep-09 28 21 28 16 -242
## 15 16-Oct-09 45 47 45 19 178
## 16 16-Sep-09 29 57 29 53 16
## 17 17-Aug-09 8 -77 8 -329 80
## 18 17-Sep-09 30 108 30 173 53
## 19 18-Aug-09 9 -186 9 16 -329
## 20 18-Sep-09 31 -8 31 -73 173
## 21 19-Aug-09 10 83 10 -81 16
## 22 19-Oct-09 46 -67 46 -21 19
## 23 2-Nov-09 56 -250 56 -232 144
## 24 2-Oct-09 37 -203 37 -247 -155
## 25 2-Sep-09 20 -186 20 -250 38
## 26 20-Aug-09 11 61 11 179 -81
## 27 20-Oct-09 47 96 47 100 -21
## 28 21-Aug-09 12 71 12 -145 179
## 29 21-Oct-09 48 -51 48 -3 100
## 30 22-Oct-09 49 -92 49 -66 -3
## 31 23-Oct-09 50 132 50 16 -66
## 32 24-Aug-09 13 156 13 343 -145
## 33 25-Aug-09 14 3 14 -84 343
## 34 25-Sep-09 32 -41 32 -278 174
## 35 26-Aug-09 15 30 15 142 -84
## 36 26-Oct-09 51 -109 51 80 16
## 37 27-Aug-09 16 4 16 -166 142
## 38 27-Oct-09 52 -104 52 -150 80
## 39 28-Aug-09 17 37 17 60 -166
## 40 28-Oct-09 53 14 53 -137 -150
## 41 28-Sep-09 33 -42 33 -256 -278
## 42 29-Oct-09 54 -119 54 -184 -137
## 43 29-Sep-09 34 124 34 91 -256
## 44 3-Nov-09 57 NA 57 -120 NA
## 45 3-Sep-09 21 -30 21 -66 -250
## 46 30-Oct-09 55 200 55 144 NA
## 47 30-Sep-09 35 -47 35 33 91
## 48 31-Aug-09 18 -36 18 -42 60
## 49 4-Sep-09 22 64 22 -28 -66
## 50 5-Oct-09 38 -22 38 -57 -247
## 51 6-Aug-09 1 -39 1 136 -122
## 52 6-Oct-09 39 112 39 17 -57
## 53 7-Aug-09 2 -25 2 24 136
## 54 7-Oct-09 40 132 40 108 17
## 55 7-Sep-09 23 97 23 134 -28
## 56 8-Oct-09 41 -6 41 33 108
## 57 9-Oct-09 42 61 42 184 33
## 58 9-Sep-09 24 56 24 -81 72
a. The three records with NA values are records 1, 44, and 46.
new[c(1, 44, 46), ]
## TodaysDate X.x yesterdays.DJIA.ch X.y Nik.225.ch lag.Nik
## 1 58 NA 58 NA NA
## 44 3-Nov-09 57 NA 57 -120 NA
## 46 30-Oct-09 55 200 55 144 NA
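For reference, these rows can be located programmatically rather than by inspection (a minimal sketch in base R; it should return rows 1, 44, and 46):
which(apply(is.na(new), 1, any)) # rows containing at least one NA in any column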
b. We remove records 1 and 44. The date of the record with the missing lagging-Nikkei value (lag.Nik) is 30-Oct-09.
new = new[-c(1, 44), ]
new$TodaysDate[which(is.na(new$lag.Nik))]
## [1] 30-Oct-09
## 58 Levels: 1-Oct-09 1-Sep-09 10-Aug-09 10-Sep-09 11-Aug-09 ... 9-Sep-09
c. Reorder the data frame new by TodaysDate:
require(date)
## Loading required package: date
new$TodaysDate = gsub("-", "", as.character(new$TodaysDate))
new = new[order(as.date(new$TodaysDate, order = "dmy")), ]
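An alternative that avoids the date package, assuming an English locale for the month abbreviations, is base R's as.Date (a sketch of an equivalent ordering):
new = new[order(as.Date(new$TodaysDate, format = "%d%b%y")), ] # "%d%b%y" matches strings like "30Oct09"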
Deal with the last NA:
which(is.na(new$lag.Nik)) # record 46 is now record 55
## [1] 55
new[c(54, 55), ]
## TodaysDate X.x yesterdays.DJIA.ch X.y Nik.225.ch lag.Nik
## 42 29Oct09 54 -119 54 -184 -137
## 46 30Oct09 55 200 55 144 NA
new[55, 6] = new[54, 5]
new[55, 6]
## [1] -184
The missing value was -184.
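More generally, since the rows are now in date order, each missing lag.Nik value after the first row equals the previous row's Nik.225.ch; a sketch that avoids the hard-coded row indices:
miss = which(is.na(new$lag.Nik)) # assumes the first row is complete
new$lag.Nik[miss] = new$Nik.225.ch[miss - 1] # previous trading day's Nikkei change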
d. Drop the redundant index columns, rename the variables, and add the indicator variable Up. There are 29 records with Up equal to 1.
new = new[, -c(2, 4)]
names(new) = c("Date", "DJIAch", "Nik225ch", "lagNik")
new$Up = as.integer(new$Nik225ch > 0)
sum(new$Up)
## [1] 29
e. Split the records into training and testing sets:
set.seed(1002)
trainingrows = sample(1:dim(new)[1], 40)
training = new[trainingrows, ]
testing = new[-trainingrows, ]
attach(training)
Get the correlation coefficients: the correlation between DJIAch and Nik225ch is 0.668, while the correlation between lagNik and Nik225ch is -0.188. DJIAch looks like the better predictor of Nik225ch since it has the larger correlation coefficient in absolute value.
cor(training$DJIAch, training$Nik225ch)
## [1] 0.6682
cor(training$lagNik, training$Nik225ch)
## [1] -0.188
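The same figures can be read off a single correlation matrix (a sketch):
cor(training[, c("DJIAch", "lagNik", "Nik225ch")]) # pairwise correlations in one call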
f. We fit a linear model for Nik225ch against both DJIAch and lagNik. Since lagNik has a high p-value (0.44), we remove it and refit the model using DJIAch alone. The resulting equation is Nik225ch = -29.7884 + 1.0113 * DJIAch. Setting Nik225ch equal to zero and solving gives the cutoff DJIAch = 29.45555. We define the variable pred to be 1 when DJIAch > 29.45555 and 0 otherwise.
modfboth = lm(Nik225ch ~ DJIAch + lagNik, data = training)
summary(modfboth)
##
## Call:
## lm(formula = Nik225ch ~ DJIAch + lagNik, data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -213.3 -74.2 -10.9 68.8 221.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -31.314 18.540 -1.69 0.10 .
## DJIAch 0.991 0.185 5.34 4.8e-06 ***
## lagNik -0.109 0.138 -0.79 0.44
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 116 on 37 degrees of freedom
## Multiple R-squared: 0.456, Adjusted R-squared: 0.426
## F-statistic: 15.5 on 2 and 37 DF, p-value: 1.3e-05
modfdjia = lm(Nik225ch ~ DJIAch, data = training)
summary(modfdjia)
##
## Call:
## lm(formula = Nik225ch ~ DJIAch, data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -206.75 -71.79 -4.79 63.95 220.02
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -29.788 18.345 -1.62 0.11
## DJIAch 1.011 0.183 5.54 2.5e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 115 on 38 degrees of freedom
## Multiple R-squared: 0.447, Adjusted R-squared: 0.432
## F-statistic: 30.7 on 1 and 38 DF, p-value: 2.46e-06
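As a cross-check, a nested-model F-test comparing the two fits can be run in one call (a sketch; it should agree with the t-test on lagNik above):
anova(modfdjia, modfboth) # partial F-test for dropping lagNik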
29.7884/1.0113
## [1] 29.46
pred = as.integer(DJIAch > 29.45555)
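Equivalently, the cutoff can be computed directly from the fitted coefficients rather than retyping the rounded values (a sketch):
as.numeric(-coef(modfdjia)[1] / coef(modfdjia)[2]) # DJIAch at which predicted Nik225ch is zero; about 29.456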
g. We fit and assess three logistic models to describe Up. The first and third contain the statistically insignificant variable lagNik (p = 0.336 and p = 0.47, respectively), so we prefer the second model, which uses DJIAch as the sole explanatory variable. The equation for our best model is
log(odds of Up=1) = -0.278 + 0.0163 * DJIAch
or
odds of Up=1 = exp(-0.278 + 0.0163 * DJIAch) = 0.757 * 1.0164^DJIAch
or
Prob(Up=1) = 0.757 * 1.0164^DJIAch / (1 + 0.757 * 1.0164^DJIAch)
mod.g1 = glm(Up ~ lagNik, family = binomial(link = "logit"), data = training)
summary(mod.g1)
##
## Call:
## glm(formula = Up ~ lagNik, family = binomial(link = "logit"),
## data = training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4460 -1.1255 -0.0099 1.1582 1.3750
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.03742 0.32220 -0.12 0.91
## lagNik -0.00234 0.00243 -0.96 0.34
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 55.452 on 39 degrees of freedom
## Residual deviance: 54.501 on 38 degrees of freedom
## AIC: 58.5
##
## Number of Fisher Scoring iterations: 4
mod.g2 = glm(Up ~ DJIAch, family = binomial(link = "logit"), data = training)
summary(mod.g2)
##
## Call:
## glm(formula = Up ~ DJIAch, family = binomial(link = "logit"),
## data = training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6545 -0.8733 0.0773 0.8919 2.0861
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.27819 0.39925 -0.70 0.4859
## DJIAch 0.01630 0.00551 2.96 0.0031 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 55.452 on 39 degrees of freedom
## Residual deviance: 41.003 on 38 degrees of freedom
## AIC: 45
##
## Number of Fisher Scoring iterations: 5
mod.g3 = glm(Up ~ DJIAch + lagNik, family = binomial(link = "logit"), data = training)
summary(mod.g3)
##
## Call:
## glm(formula = Up ~ DJIAch + lagNik, family = binomial(link = "logit"),
## data = training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6338 -0.8906 0.0644 0.8577 2.1226
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.31006 0.40525 -0.77 0.4442
## DJIAch 0.01652 0.00569 2.90 0.0037 **
## lagNik -0.00197 0.00274 -0.72 0.4726
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 55.452 on 39 degrees of freedom
## Residual deviance: 40.484 on 37 degrees of freedom
## AIC: 46.48
##
## Number of Fisher Scoring iterations: 5
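The AICs printed above can also be collected in one call, confirming that mod.g2 has the smallest AIC (a sketch):
AIC(mod.g1, mod.g2, mod.g3) # mod.g2: 45, vs. 58.5 and 46.48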
exp(coef(mod.g2))
## (Intercept) DJIAch
## 0.7572 1.0164
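As a quick consistency check on the fitted equation, the hand-written probability formula should agree with predict()'s built-in response scale; a sketch at a hypothetical value DJIAch = 100:
x0 = data.frame(DJIAch = 100) # hypothetical one-day DJIA change
predict(mod.g2, newdata = x0, type = "response") # model's fitted Prob(Up=1)
plogis(-0.27819 + 0.01630 * 100) # same value from the equation, up to rounding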
h. Plot the model:
# inverse logit: converts log odds to probabilities
invlogit = function(x) {
exp(x)/(1 + exp(x))
}
plot(new$DJIAch, new$Up, pch = 19, col = "red", main = "Up vs. change in DJIA",
ylab = "Prob(Up=1)", xlab = "change in DJIA")
lines(-300:200, sapply(predict(mod.g2, newdata = data.frame(DJIAch = -300:200)),
invlogit))
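Equivalently, predict() can return probabilities directly via type = "response", avoiding the manual invlogit step (a sketch that overlays the same curve):
xs = -300:200
lines(xs, predict(mod.g2, newdata = data.frame(DJIAch = xs), type = "response"), lty = 2)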
i. The coefficient can be interpreted as follows. Each one-unit increase in the change in DJIA raises the log odds that Up equals 1 (i.e., the log odds of a positive change in the Nikkei 225 index) by 0.01630439. In terms of odds, each one-unit increase in the change in DJIA multiplies the odds that Up equals 1 by 1.0164380; that is, the odds of Up being equal to 1 increase by about 1.6438% for each point increase in the DJIA.
coef(mod.g2)
## (Intercept) DJIAch
## -0.2782 0.0163
exp(coef(mod.g2))
## (Intercept) DJIAch
## 0.7572 1.0164
j. The accuracy of the best model from part (g) on the training data is 80%, and the accuracy of the pred predictor from part (f) on the training data is 77.5%.
require(caret)
## Loading required package: caret
## Loading required package: lattice
## Loading required package: ggplot2
confusionMatrix(round(sapply(predict(mod.g2, newdata = training), invlogit)),
training$Up)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 16 4
## 1 4 16
##
## Accuracy : 0.8
## 95% CI : (0.644, 0.909)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : 9.11e-05
##
## Kappa : 0.6
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.8
## Specificity : 0.8
## Pos Pred Value : 0.8
## Neg Pred Value : 0.8
## Prevalence : 0.5
## Detection Rate : 0.4
## Detection Prevalence : 0.5
## Balanced Accuracy : 0.8
##
## 'Positive' Class : 0
##
confusionMatrix(pred, training$Up)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 16 5
## 1 4 15
##
## Accuracy : 0.775
## 95% CI : (0.615, 0.892)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : 0.00034
##
## Kappa : 0.55
## Mcnemar's Test P-Value : 1.00000
##
## Sensitivity : 0.800
## Specificity : 0.750
## Pos Pred Value : 0.762
## Neg Pred Value : 0.789
## Prevalence : 0.500
## Detection Rate : 0.400
## Detection Prevalence : 0.525
## Balanced Accuracy : 0.775
##
## 'Positive' Class : 0
##
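The accuracies can also be verified by hand without caret (a sketch; these should reproduce the 80% and 77.5% figures above):
mean(round(invlogit(predict(mod.g2, newdata = training))) == training$Up) # logistic model
mean(pred == training$Up) # cutoff predictor from part (f)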
k. The area under the ROC curve (AUC) for the best model from part (g), computed on the training data, is 0.8.
require(e1071)
require(pROC)
## Loading required package: pROC
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
##
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
auc(training$Up, sapply(predict(mod.g2, newdata = training), invlogit))
## Area under the curve: 0.8
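For a visual check, the ROC curve behind this statistic can be drawn with pROC's roc() and its plot method (a sketch):
roc.g2 = roc(training$Up, sapply(predict(mod.g2, newdata = training), invlogit))
plot(roc.g2) # the AUC above is the area under this curve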
l. The best model from part (g) and the pred predictor from part (f) both have an accuracy of 68.75% on the test data. Both do much better than chance guessing (50% accuracy), so these models do seem helpful for investment decisions. It is not surprising that the behavior of the DJIA influences the Nikkei 225 index, since both are highly correlated with the health of the world economy.
confusionMatrix(round(sapply(predict(mod.g2, newdata = testing), invlogit)),
testing$Up)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 6 4
## 1 1 5
##
## Accuracy : 0.688
## 95% CI : (0.413, 0.89)
## No Information Rate : 0.562
## P-Value [Acc > NIR] : 0.227
##
## Kappa : 0.394
## Mcnemar's Test P-Value : 0.371
##
## Sensitivity : 0.857
## Specificity : 0.556
## Pos Pred Value : 0.600
## Neg Pred Value : 0.833
## Prevalence : 0.438
## Detection Rate : 0.375
## Detection Prevalence : 0.625
## Balanced Accuracy : 0.706
##
## 'Positive' Class : 0
##
confusionMatrix(as.integer(testing$DJIAch > 29.45555), testing$Up)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 6 4
## 1 1 5
##
## Accuracy : 0.688
## 95% CI : (0.413, 0.89)
## No Information Rate : 0.562
## P-Value [Acc > NIR] : 0.227
##
## Kappa : 0.394
## Mcnemar's Test P-Value : 0.371
##
## Sensitivity : 0.857
## Specificity : 0.556
## Pos Pred Value : 0.600
## Neg Pred Value : 0.833
## Prevalence : 0.438
## Detection Rate : 0.375
## Detection Prevalence : 0.625
## Balanced Accuracy : 0.706
##
## 'Positive' Class : 0
##