Solutions to Worksheet

setwd("/Users/traves/Dropbox/SM339/homework 5")
djiadata = read.csv("djiadata.csv")
nikdata = read.csv("nikdata.csv")
new = merge(x = djiadata, y = nikdata, by.x = "TodaysDate", by.y = "Date")
new

##    TodaysDate X.x yesterdays.DJIA.ch X.y Nik.225.ch lag.Nik
## 1              58                 NA  58         NA      NA
## 2    1-Oct-09  36                -30  36       -155      33
## 3    1-Sep-09  19                -48  19         38     -42
## 4   10-Aug-09   3                114   3        112      24
## 5   10-Sep-09  25                 50  25        202     -81
## 6   11-Aug-09   4                -32   4         61     112
## 7   11-Sep-09  26                 80  26        -69     202
## 8   12-Aug-09   5                -97   5       -150      61
## 9   13-Aug-09   6                120   6         82    -150
## 10  14-Aug-09   7                 37   7         80      82
## 11  14-Oct-09  43                -15  43        -16      60
## 12  14-Sep-09  27                -22  27       -242     -69
## 13  15-Oct-09  44                145  44        178     -16
## 14  15-Sep-09  28                 21  28         16    -242
## 15  16-Oct-09  45                 47  45         19     178
## 16  16-Sep-09  29                 57  29         53      16
## 17  17-Aug-09   8                -77   8       -329      80
## 18  17-Sep-09  30                108  30        173      53
## 19  18-Aug-09   9               -186   9         16    -329
## 20  18-Sep-09  31                 -8  31        -73     173
## 21  19-Aug-09  10                 83  10        -81      16
## 22  19-Oct-09  46                -67  46        -21      19
## 23   2-Nov-09  56               -250  56       -232     144
## 24   2-Oct-09  37               -203  37       -247    -155
## 25   2-Sep-09  20               -186  20       -250      38
## 26  20-Aug-09  11                 61  11        179     -81
## 27  20-Oct-09  47                 96  47        100     -21
## 28  21-Aug-09  12                 71  12       -145     179
## 29  21-Oct-09  48                -51  48         -3     100
## 30  22-Oct-09  49                -92  49        -66      -3
## 31  23-Oct-09  50                132  50         16     -66
## 32  24-Aug-09  13                156  13        343    -145
## 33  25-Aug-09  14                  3  14        -84     343
## 34  25-Sep-09  32                -41  32       -278     174
## 35  26-Aug-09  15                 30  15        142     -84
## 36  26-Oct-09  51               -109  51         80      16
## 37  27-Aug-09  16                  4  16       -166     142
## 38  27-Oct-09  52               -104  52       -150      80
## 39  28-Aug-09  17                 37  17         60    -166
## 40  28-Oct-09  53                 14  53       -137    -150
## 41  28-Sep-09  33                -42  33       -256    -278
## 42  29-Oct-09  54               -119  54       -184    -137
## 43  29-Sep-09  34                124  34         91    -256
## 44   3-Nov-09  57                 NA  57       -120      NA
## 45   3-Sep-09  21                -30  21        -66    -250
## 46  30-Oct-09  55                200  55        144      NA
## 47  30-Sep-09  35                -47  35         33      91
## 48  31-Aug-09  18                -36  18        -42      60
## 49   4-Sep-09  22                 64  22        -28     -66
## 50   5-Oct-09  38                -22  38        -57    -247
## 51   6-Aug-09   1                -39   1        136    -122
## 52   6-Oct-09  39                112  39         17     -57
## 53   7-Aug-09   2                -25   2         24     136
## 54   7-Oct-09  40                132  40        108      17
## 55   7-Sep-09  23                 97  23        134     -28
## 56   8-Oct-09  41                 -6  41         33     108
## 57   9-Oct-09  42                 61  42        184      33
## 58   9-Sep-09  24                 56  24        -81      72

a. The three records with NA values are records 1, 44, 46.

new[c(1, 44, 46), ]

##    TodaysDate X.x yesterdays.DJIA.ch X.y Nik.225.ch lag.Nik
## 1              58                 NA  58         NA      NA
## 44   3-Nov-09  57                 NA  57       -120      NA
## 46  30-Oct-09  55                200  55        144      NA

b. We remove records 1 and 44. Date of record with missing lagging Nikkei variable is 30 Oct 09.

new = new[-c(1, 44), ]
new$TodaysDate[which(is.na(new$lag.Nik))]

## [1] 30-Oct-09
## 58 Levels:  1-Oct-09 1-Sep-09 10-Aug-09 10-Sep-09 11-Aug-09 ... 9-Sep-09

c. Reorder the data frame new by TodaysDate:

require(date)

## Loading required package: date

new$TodaysDate = gsub("-", "", as.character(new$TodaysDate))
new = new[order(as.date(new$TodaysDate, order = "dmy")), ]

Deal with the last NA:

which(is.na(new$lag.Nik))  # record 46 is now record 55

## [1] 55

new[c(54, 55), ]

##    TodaysDate X.x yesterdays.DJIA.ch X.y Nik.225.ch lag.Nik
## 42    29Oct09  54               -119  54       -184    -137
## 46    30Oct09  55                200  55        144      NA

new[55, 6] = new[54, 5]
new[55, 6]

## [1] -184

The missing value was -184.

d. Rename variables and add variable Up. There are 29 records with Up equal to 1.

new = new[, -c(2, 4)]
names(new) = c("Date", "DJIAch", "Nik225ch", "lagNik")
new$Up = as.integer(new$Nik225ch > 0)
sum(new$Up)

## [1] 29

e. Separate records into training and testing records:

set.seed(1002)
trainingrows = sample(1:dim(new)[1], 40)
training = new[trainingrows, ]
testing = new[-trainingrows, ]
attach(training)

Get correlation coefficients: the correlation between DJIAch and Nik225ch is 0.668 and the correlation between lagNik and Nik225ch is -0.188. DJIAch looks like it would be a better predictor of Nik225ch since it has the higher correlation coefficient.

cor(training$DJIAch, training$Nik225ch)

## [1] 0.6682

cor(training$lagNik, training$Nik225ch)

## [1] -0.188

f. We fit a linear model for Nik225ch against both DJIAch and lagNik. It turns out that lagNik has a high p-value so we remove it and refit the model. The resulting equation is Nik225ch = -29.7884 + 1.0113 * DJIAch. Setting Nik225ch equal to zero and solving gives the cutoff point, DJIAch = 29.45555. We define the variable pred to be 1 when DJIAch > 29.45555 and 0 otherwise.

modfboth = lm(Nik225ch ~ DJIAch + lagNik, data = training)
summary(modfboth)

## 
## Call:
## lm(formula = Nik225ch ~ DJIAch + lagNik, data = training)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -213.3  -74.2  -10.9   68.8  221.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -31.314     18.540   -1.69     0.10 .  
## DJIAch         0.991      0.185    5.34  4.8e-06 ***
## lagNik        -0.109      0.138   -0.79     0.44    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 116 on 37 degrees of freedom
## Multiple R-squared:  0.456,  Adjusted R-squared:  0.426 
## F-statistic: 15.5 on 2 and 37 DF,  p-value: 1.3e-05

modfdjia = lm(Nik225ch ~ DJIAch, data = training)
summary(modfdjia)

## 
## Call:
## lm(formula = Nik225ch ~ DJIAch, data = training)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -206.75  -71.79   -4.79   63.95  220.02 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -29.788     18.345   -1.62     0.11    
## DJIAch         1.011      0.183    5.54  2.5e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 115 on 38 degrees of freedom
## Multiple R-squared:  0.447,  Adjusted R-squared:  0.432 
## F-statistic: 30.7 on 1 and 38 DF,  p-value: 2.46e-06

29.7884/1.0113

## [1] 29.46

pred = as.integer(DJIAch > 29.45555)

g. We fit and assess three logistic models to describe Up. The first and third contain the statistically insignificant variable lagNik (p=0.336 and p=0.47, respectively), so we prefer the second model that uses DJIAch as the explanatory variable. The equation for our best model is

log(odds of Up=1) = -0.278 + 0.016 * DJIAch

(odds of Up=1) = exp(-0.278 + 0.016 * DJIAch) = (0.757) * \( (1.0164)^{DJIAch} \)

Prob(Up=1) = (0.757) * \( (1.0164)^{DJIAch} \)/(1+(0.757) * \( (1.0164)^{DJIAch} \))

mod.g1 = glm(Up ~ lagNik, family = binomial(link = "logit"), data = training)
summary(mod.g1)

## 
## Call:
## glm(formula = Up ~ lagNik, family = binomial(link = "logit"), 
##     data = training)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4460  -1.1255  -0.0099   1.1582   1.3750  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.03742    0.32220   -0.12     0.91
## lagNik      -0.00234    0.00243   -0.96     0.34
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 55.452  on 39  degrees of freedom
## Residual deviance: 54.501  on 38  degrees of freedom
## AIC: 58.5
## 
## Number of Fisher Scoring iterations: 4

mod.g2 = glm(Up ~ DJIAch, family = binomial(link = "logit"), data = training)
summary(mod.g2)

## 
## Call:
## glm(formula = Up ~ DJIAch, family = binomial(link = "logit"), 
##     data = training)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6545  -0.8733   0.0773   0.8919   2.0861  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -0.27819    0.39925   -0.70   0.4859   
## DJIAch       0.01630    0.00551    2.96   0.0031 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 55.452  on 39  degrees of freedom
## Residual deviance: 41.003  on 38  degrees of freedom
## AIC: 45
## 
## Number of Fisher Scoring iterations: 5

mod.g3 = glm(Up ~ DJIAch + lagNik, family = binomial(link = "logit"), data = training)
summary(mod.g3)

## 
## Call:
## glm(formula = Up ~ DJIAch + lagNik, family = binomial(link = "logit"), 
##     data = training)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6338  -0.8906   0.0644   0.8577   2.1226  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -0.31006    0.40525   -0.77   0.4442   
## DJIAch       0.01652    0.00569    2.90   0.0037 **
## lagNik      -0.00197    0.00274   -0.72   0.4726   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 55.452  on 39  degrees of freedom
## Residual deviance: 40.484  on 37  degrees of freedom
## AIC: 46.48
## 
## Number of Fisher Scoring iterations: 5

exp(coef(mod.g2))

## (Intercept)      DJIAch 
##      0.7572      1.0164

h. Plot the model:

invlogit = function(x) {
    exp(x)/(1 + exp(x))
}
plot(new$DJIAch, new$Up, pch = 19, col = "red", main = "Up vs. change in DJIA", 
    ylab = "Prob(Up=1)", xlab = "change in DJIA")
lines(-300:200, sapply(predict(mod.g2, newdata = data.frame(DJIAch = -300:200)), 
    invlogit))

plot of chunk unnamed-chunk-11

i. The coefficient can be interpreted as follows. Each increase of 1 unit in change in DJIA increases the log odds of Up being 1 (i.e. the log odds of a positive increase in the Nikkei 225 index) by 0.01630439. In terms of odds, each increase of 1 unit in the change in DJIA causes the odds that Up equals 1 to be multiplied by 1.0164380, that is the odds of Up being equal to 1 increase by 1.6438% for each point increase in DJIA.

coef(mod.g2)

## (Intercept)      DJIAch 
##     -0.2782      0.0163

exp(coef(mod.g2))

## (Intercept)      DJIAch 
##      0.7572      1.0164

j. The accuracy of the best model from part (g) on the training data is 80% and the accuracy of the pred predictor from part (f) is 77.5%.

require(caret)

## Loading required package: caret
## Loading required package: lattice
## Loading required package: ggplot2

confusionMatrix(round(sapply(predict(mod.g2, newdata = training), invlogit)), 
    training$Up)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 16  4
##          1  4 16
##                                         
##                Accuracy : 0.8           
##                  95% CI : (0.644, 0.909)
##     No Information Rate : 0.5           
##     P-Value [Acc > NIR] : 9.11e-05      
##                                         
##                   Kappa : 0.6           
##  Mcnemar's Test P-Value : 1             
##                                         
##             Sensitivity : 0.8           
##             Specificity : 0.8           
##          Pos Pred Value : 0.8           
##          Neg Pred Value : 0.8           
##              Prevalence : 0.5           
##          Detection Rate : 0.4           
##    Detection Prevalence : 0.5           
##       Balanced Accuracy : 0.8           
##                                         
##        'Positive' Class : 0             
##

confusionMatrix(pred, training$Up)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 16  5
##          1  4 15
##                                         
##                Accuracy : 0.775         
##                  95% CI : (0.615, 0.892)
##     No Information Rate : 0.5           
##     P-Value [Acc > NIR] : 0.00034       
##                                         
##                   Kappa : 0.55          
##  Mcnemar's Test P-Value : 1.00000       
##                                         
##             Sensitivity : 0.800         
##             Specificity : 0.750         
##          Pos Pred Value : 0.762         
##          Neg Pred Value : 0.789         
##              Prevalence : 0.500         
##          Detection Rate : 0.400         
##    Detection Prevalence : 0.525         
##       Balanced Accuracy : 0.775         
##                                         
##        'Positive' Class : 0             
##

k. The area under the curve statistic for the best model from part (g) is 0.8.

require(e1071)
require(pROC)

## Loading required package: pROC
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## 
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

auc(training$Up, sapply(predict(mod.g2, newdata = training), invlogit))

## Area under the curve: 0.8

l. The best model from part (g) and the pred predictor from part (f) both have an accuracy of 68.75% on the test data. Both of these models do much better than chance guessing (50% accuracy), so these models do seem to be helpful for investment decisions. It is not surprising that the behavior of the DJIA should influence the behavior of the Nikkei 225 index since they both are highly correlated with the health of the world economy.

confusionMatrix(round(sapply(predict(mod.g2, newdata = testing), invlogit)), 
    testing$Up)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction 0 1
##          0 6 4
##          1 1 5
##                                        
##                Accuracy : 0.688        
##                  95% CI : (0.413, 0.89)
##     No Information Rate : 0.562        
##     P-Value [Acc > NIR] : 0.227        
##                                        
##                   Kappa : 0.394        
##  Mcnemar's Test P-Value : 0.371        
##                                        
##             Sensitivity : 0.857        
##             Specificity : 0.556        
##          Pos Pred Value : 0.600        
##          Neg Pred Value : 0.833        
##              Prevalence : 0.438        
##          Detection Rate : 0.375        
##    Detection Prevalence : 0.625        
##       Balanced Accuracy : 0.706        
##                                        
##        'Positive' Class : 0            
##

confusionMatrix(as.integer(testing$DJIAch > 29.45555), testing$Up)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction 0 1
##          0 6 4
##          1 1 5
##                                        
##                Accuracy : 0.688        
##                  95% CI : (0.413, 0.89)
##     No Information Rate : 0.562        
##     P-Value [Acc > NIR] : 0.227        
##                                        
##                   Kappa : 0.394        
##  Mcnemar's Test P-Value : 0.371        
##                                        
##             Sensitivity : 0.857        
##             Specificity : 0.556        
##          Pos Pred Value : 0.600        
##          Neg Pred Value : 0.833        
##              Prevalence : 0.438        
##          Detection Rate : 0.375        
##    Detection Prevalence : 0.625        
##       Balanced Accuracy : 0.706        
##                                        
##        'Positive' Class : 0            
##