4.6.1 The Stock Market Data

library(ISLR)
head(Smarket)
cor(Smarket[,-9])
             Year         Lag1         Lag2         Lag3         Lag4         Lag5      Volume        Today
Year   1.00000000  0.029699649  0.030596422  0.033194581  0.035688718  0.029787995  0.53900647  0.030095229
Lag1   0.02969965  1.000000000 -0.026294328 -0.010803402 -0.002985911 -0.005674606  0.04090991 -0.026155045
Lag2   0.03059642 -0.026294328  1.000000000 -0.025896670 -0.010853533 -0.003557949 -0.04338321 -0.010250033
Lag3   0.03319458 -0.010803402 -0.025896670  1.000000000 -0.024051036 -0.018808338 -0.04182369 -0.002447647
Lag4   0.03568872 -0.002985911 -0.010853533 -0.024051036  1.000000000 -0.027083641 -0.04841425 -0.006899527
Lag5   0.02978799 -0.005674606 -0.003557949 -0.018808338 -0.027083641  1.000000000 -0.02200231 -0.034860083
Volume 0.53900647  0.040909908 -0.043383215 -0.041823686 -0.048414246 -0.022002315  1.00000000  0.014591823
Today  0.03009523 -0.026155045 -0.010250033 -0.002447647 -0.006899527 -0.034860083  0.01459182  1.000000000
plot(Smarket$Volume)

From the correlation matrix we can see that:

- Today is only weakly correlated with every other variable;
- Volume and Year show a moderate positive correlation (about 0.54).

var, sd, cov, cor

Variance measures how spread out a random variable (represented here as a numeric vector) is. The population definition is \[ \sigma^2_X = \frac{\sum (x - \mu)^2}{N} \] while the R function var(X) computes the sample version, dividing by n - 1 instead of N.

Standard deviation, R function sd(X), is the square root of \(Var(X)\).

Covariance, R function cov(X, Y), measures whether two random variables tend to move in the same or in opposite directions: \[ \sigma_{X, Y} = \frac{\sum(x - \bar x)(y - \bar y)}{n - 1} \]

Correlation is the standardized version of covariance, R function cor(X, Y): \[ \rho_{XY} = \frac{\sigma_{XY}}{\sigma_X \sigma_Y} \in [-1, 1] \]
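
These relationships are easy to verify on the Smarket columns loaded above (a minimal check; x and y are just throwaway names):

x <- Smarket$Lag1
y <- Smarket$Lag2
cov(x, y) / (sd(x) * sd(y))  # should equal cor(x, y)
cor(x, y)
sd(x)^2                      # sd() is the square root of var()
var(x)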

4.6.2 Logistic Regression

glm() extends the basic linear-regression function lm() with a family argument; family = binomial gives logistic regression. First, check whether any of the features are related to the market moving up or down:

f2 <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data=Smarket, family=binomial)
summary(f2)

Call:
glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
    Volume, family = binomial, data = Smarket)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-1.446  -1.203   1.065   1.145   1.326  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.126000   0.240736  -0.523    0.601
Lag1        -0.073074   0.050167  -1.457    0.145
Lag2        -0.042301   0.050086  -0.845    0.398
Lag3         0.011085   0.049939   0.222    0.824
Lag4         0.009359   0.049974   0.187    0.851
Lag5         0.010313   0.049511   0.208    0.835
Volume       0.135441   0.158360   0.855    0.392

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1731.2  on 1249  degrees of freedom
Residual deviance: 1727.6  on 1243  degrees of freedom
AIC: 1741.6

Number of Fisher Scoring iterations: 3
coef(f2)
 (Intercept)         Lag1         Lag2         Lag3         Lag4         Lag5       Volume 
-0.126000257 -0.073073746 -0.042301344  0.011085108  0.009358938  0.010313068  0.135440659 
summary(f2)$coef
                Estimate Std. Error    z value  Pr(>|z|)
(Intercept) -0.126000257 0.24073574 -0.5233966 0.6006983
Lag1        -0.073073746 0.05016739 -1.4565986 0.1452272
Lag2        -0.042301344 0.05008605 -0.8445733 0.3983491
Lag3         0.011085108 0.04993854  0.2219750 0.8243333
Lag4         0.009358938 0.04997413  0.1872757 0.8514445
Lag5         0.010313068 0.04951146  0.2082966 0.8349974
Volume       0.135440659 0.15835970  0.8552723 0.3924004
summary(f2)$coef[,4]
(Intercept)        Lag1        Lag2        Lag3        Lag4        Lag5      Volume 
  0.6006983   0.1452272   0.3983491   0.8243333   0.8514445   0.8349974   0.3924004 

All of the p-values are greater than 0.05, so there is no clear evidence of an association between any of the predictors and Direction.

Use predict() to obtain predictions from the fitted model; note the type argument: type = 'response' returns probabilities of the form P(Direction = Up | X) rather than the log-odds:

gprob <- predict(f2, type = 'response')
gprob[1:10]
        1         2         3         4         5         6         7         8         9        10 
0.5070841 0.4814679 0.4811388 0.5152224 0.5107812 0.5069565 0.4926509 0.5092292 0.5176135 0.4888378 
contrasts(Smarket$Direction)
     Up
Down  0
Up    1
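
The effect of the type argument can be checked directly: type = 'link' returns the linear predictor on the log-odds scale, and applying the logistic function plogis() to it reproduces the type = 'response' probabilities (a small sanity check):

head(predict(f2, type = 'link'))          # log-odds (the linear predictor)
head(plogis(predict(f2, type = 'link')))  # same as predict(f2, type = 'response')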

Based on the predicted probabilities, label an observation Up when its probability exceeds 0.5 and Down otherwise, then compare with the actual outcomes:

gpred <- rep('Down', 1250)
gpred[gprob > 0.5] <- 'Up'
table(gpred, Smarket$Direction)
      
gpred  Down  Up
  Down  145 141
  Up    457 507

How should the confusion matrix be read? Rows are the predicted classes, columns are the actual classes, and the diagonal counts the correct predictions.

Compute the training accuracy of the predictions in two ways:

(145 + 507) / 1250
[1] 0.5216
mean(gpred == Smarket$Direction)
[1] 0.5216

The two results agree.
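
Beyond overall accuracy, the same table gives the per-class rates (a minimal sketch reusing gpred from above; cm is just a throwaway name):

cm <- table(gpred, Smarket$Direction)
cm['Up', 'Up'] / sum(cm[, 'Up'])        # sensitivity: true Up days predicted Up
cm['Down', 'Down'] / sum(cm[, 'Down'])  # specificity: true Down days predicted Down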

The training accuracy of the logistic regression is only 52%, and the test accuracy can only be worse. Anyone who knows the broad market trend and simply bets on one side all the time can beat this, which is a favorite trick of self-styled "stock gurus".

To study the test error rate rather than just the training error rate, split the data into training and test sets:

train <- (Smarket$Year < 2005)
sm2005 <- Smarket[!train,]
dim(sm2005)
[1] 252   9
direction2005 <- sm2005$Direction

Here train is a logical (Boolean) vector, so Smarket[!train, ] selects the 252 observations from 2005 as the test set.

Fit a logistic regression model on the pre-2005 data:

f2.2 <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, 
            data=Smarket, family=binomial, subset = train)

Use this model to predict on the test set:

gprob2005 <- predict(f2.2, sm2005, type = 'response')
gpred2005 <- rep('Down', 252)
gpred2005[gprob2005 > 0.5] <- 'Up'
table(gpred2005, direction2005)
         direction2005
gpred2005 Down Up
     Down   77 97
     Up     34 44
mean(gpred2005 != direction2005)
[1] 0.5198413

The test error rate is indeed higher than the training error rate, and it even exceeds 50%: worse than random guessing.
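
For comparison, the naive rule of always predicting Up would already do better on the 2005 data (a quick check):

mean(direction2005 == 'Up')  # always predicting Up is right about 56% of the time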

Can a better model be obtained by dropping the apparently useless predictors and keeping only Lag1 and Lag2? Let's try:

f2.p2 <- glm(Direction ~ Lag1 + Lag2, data=Smarket, family = binomial, subset=train)
gprob2005.p2 <- predict(f2.p2, sm2005, type="response")
gpred2005.p2 <- rep("Down", 252)
gpred2005.p2[gprob2005.p2 > 0.5] <- "Up"
table(gpred2005.p2, direction2005)
            direction2005
gpred2005.p2 Down  Up
        Down   35  35
        Up     76 106
mean(gpred2005.p2 == direction2005)
[1] 0.5595238

The test accuracy rises to about 56%, which supports the idea that discarding the noise predictors helps.

Given specific values of Lag1 and Lag2, predict the direction with the fitted model:

predict(f2.p2, newdata = data.frame(Lag1 = c(1.2, 1.5), Lag2 = c(1.1, -0.8)), type = 'response')
        1         2 
0.4791462 0.4960939 

For both input combinations the model predicts a (marginal) decline: the probability of Up is just below 0.5 in each case.

4.6.3 Linear Discriminant Analysis

Fit an LDA model to the stock market data:

library(MASS)
lda_fit <- lda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
lda_fit
Call:
lda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)

Prior probabilities of groups:
    Down       Up 
0.491984 0.508016 

Group means:
            Lag1        Lag2
Down  0.04279022  0.03389409
Up   -0.03954635 -0.03132544

Coefficients of linear discriminants:
            LD1
Lag1 -0.6420190
Lag2 -0.5135293
plot(lda_fit, dimen = 3)

Prior probabilities: how are Down \(\hat\pi_1 = 0.492\) and Up \(\hat\pi_2 = 0.508\) computed? They are simply the proportions of the two classes in the training data.
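
A quick check using the train indicator defined earlier:

mean(Smarket$Direction[train] == 'Down')  # prior for Down
mean(Smarket$Direction[train] == 'Up')    # prior for Up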

Group means: the average of each predictor within each class, \(\hat\mu_k\) (see Equation 4.15).

The coefficients of linear discriminants combine the predictors into a single score; the larger the score, the higher the probability of Up: \[ -0.642 \times \text{Lag1} - 0.514 \times \text{Lag2} \]

Plugging each observation's Lag1 and Lag2 into this expression gives its linear discriminant value (p. 162): the larger the value, the higher the posterior probability of Up; the smaller, the higher the probability of Down. The plot of the fitted model shows, for each group (Up and Down), a histogram of these linear discriminant values.
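
To make this concrete, the score can be computed by hand for the test observations (a minimal sketch; the values differ from those returned by predict() only by an additive constant, because predict.lda centers the predictors before applying the coefficients):

ld_manual <- as.matrix(sm2005[, c('Lag1', 'Lag2')]) %*% lda_fit$scaling
head(ld_manual)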

Use the LDA model to predict market movements on the test set:

lda.pred <- predict(lda_fit, sm2005)
lda.pred
$class
  [1] Up   Up   Up   Up   Up   Up   Up   Up   Up   Up   Up   Down Up   Up   Up   Up  
 [17] Up   Down Up   Up   Down Down Down Up   Down Down Up   Up   Up   Down Down Up  
 ...  (252 values in total; remainder omitted)
Levels: Down Up

$posterior
          Down        Up
999  0.4901792 0.5098208
1000 0.4792185 0.5207815
1001 0.4668185 0.5331815
1002 0.4740011 0.5259989
1003 0.4927877 0.5072123
1004 0.4938562 0.5061438
 ...  (252 rows in total; remainder omitted)

$x
              LD1
999   0.082930955
1000  0.591141023
1001  1.167230633
1002  0.833350217
1003 -0.037928918
1004 -0.087431416
 ...  (252 rows in total; remainder omitted)
names(lda.pred)
[1] "class"     "posterior" "x"        
summary(lda.pred$x)
      LD1          
 Min.   :-1.30985  
 1st Qu.:-0.40047  
 Median :-0.02960  
 Mean   :-0.01719  
 3rd Qu.: 0.33769  
 Max.   : 1.58799  
summary(lda.pred$class)
Down   Up 
  70  182 
summary(lda.pred$posterior)
      Down              Up        
 Min.   :0.4578   Min.   :0.4798  
 1st Qu.:0.4847   1st Qu.:0.4994  
 Median :0.4926   Median :0.5074  
 Mean   :0.4923   Mean   :0.5077  
 3rd Qu.:0.5006   3rd Qu.:0.5153  
 Max.   :0.5202   Max.   :0.5422  

Compute the test-set accuracy:

lda.class <- lda.pred$class
table(lda.class, direction2005)
         direction2005
lda.class Down  Up
     Down   35  35
     Up     76 106
mean(lda.class == direction2005)
[1] 0.5595238

The predicted class for each observation is simply the group with the larger posterior probability:

sum(lda.pred$posterior[,1] > 0.5)
[1] 70
sum(lda.pred$posterior[,1] < 0.5)
[1] 182
lda.class[1:20]
 [1] Up   Up   Up   Up   Up   Up   Up   Up   Up   Up   Up   Down Up   Up   Up   Up   Up  
[18] Down Up   Up  
Levels: Down Up
lda.pred$posterior[1:20,]
          Down        Up
999  0.4901792 0.5098208
1000 0.4792185 0.5207815
1001 0.4668185 0.5331815
1002 0.4740011 0.5259989
1003 0.4927877 0.5072123
1004 0.4938562 0.5061438
1005 0.4951016 0.5048984
1006 0.4872861 0.5127139
1007 0.4907013 0.5092987
1008 0.4844026 0.5155974
1009 0.4906963 0.5093037
1010 0.5119988 0.4880012
1011 0.4895152 0.5104848
1012 0.4706761 0.5293239
1013 0.4744593 0.5255407
1014 0.4799583 0.5200417
1015 0.4935775 0.5064225
1016 0.5030894 0.4969106
1017 0.4978806 0.5021194
1018 0.4886331 0.5113669

For observations 12 and 18 the posterior probability of Down exceeds that of Up, so they are classified as Down.
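
This can be confirmed programmatically (a one-line check):

head(which(lda.pred$posterior[, 'Down'] > lda.pred$posterior[, 'Up']), 2)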

4.6.4 Quadratic Discriminant Analysis

Fit a QDA model to the stock market data:

qdaf <- qda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
qdaf
Call:
qda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)

Prior probabilities of groups:
    Down       Up 
0.491984 0.508016 

Group means:
            Lag1        Lag2
Down  0.04279022  0.03389409
Up   -0.03954635 -0.03132544
qda.class <- predict(qdaf, sm2005)$class
table(qda.class, direction2005)
         direction2005
qda.class Down  Up
     Down   30  20
     Up     81 121
mean(qda.class == direction2005)
[1] 0.5992063

Compared with linear discriminant analysis, the test accuracy rises from about 0.56 to about 0.60.
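
The two test accuracies can be put side by side (reusing lda.class and qda.class from above):

c(LDA = mean(lda.class == direction2005), QDA = mean(qda.class == direction2005))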

4.6.5 K-Nearest Neighbours

Predict the stock market direction with \(k=1\):

library(class)
train.X <- cbind(Smarket$Lag1, Smarket$Lag2)[train,]
test.X <- cbind(Smarket$Lag1, Smarket$Lag2)[!train,]
train.direction <- Smarket$Direction[train]
set.seed(1)
knn.pred <- knn(train.X, test.X, train.direction, k = 1)
table(knn.pred, direction2005)
        direction2005
knn.pred Down Up
    Down   43 58
    Up     68 83
mean(knn.pred == direction2005)
[1] 0.5

Predict the stock market direction with \(k=3\):

library(class)
knn.pred3 <- knn(train.X, test.X, train.direction, k = 3)
table(knn.pred3, direction2005)
         direction2005
knn.pred3 Down Up
     Down   48 54
     Up     63 87
mean(knn.pred3 == direction2005)
[1] 0.5357143

Going from \(k = 1\) to \(k = 3\) improves the accuracy slightly, but increasing \(k\) further does not help, so QDA remains the best-performing classifier on this data so far.
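
One way to check this claim is to sweep over several values of \(k\) (a minimal sketch; knn() breaks ties at random, so the seed is fixed for reproducibility):

set.seed(1)
sapply(c(1, 3, 5, 7, 9), function(k) {
  pred <- knn(train.X, test.X, train.direction, k = k)
  mean(pred == direction2005)
})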

4.6.6 An Application to Caravan Insurance Data

Load the Caravan data set and standardize it (KNN is based on distances, so predictors measured on large scales would otherwise dominate):

head(Caravan)
str(Caravan)
'data.frame':   5822 obs. of  86 variables:
 $ MOSTYPE : num  33 37 37 9 40 23 39 33 33 11 ...
 $ MAANTHUI: num  1 1 1 1 1 1 2 1 1 2 ...
 $ MGEMOMV : num  3 2 2 3 4 2 3 2 2 3 ...
 $ MGEMLEEF: num  2 2 2 3 2 1 2 3 4 3 ...
 $ MOSHOOFD: num  8 8 8 3 10 5 9 8 8 3 ...
 $ MGODRK  : num  0 1 0 2 1 0 2 0 0 3 ...
 $ MGODPR  : num  5 4 4 3 4 5 2 7 1 5 ...
 $ MGODOV  : num  1 1 2 2 1 0 0 0 3 0 ...
 $ MGODGE  : num  3 4 4 4 4 5 5 2 6 2 ...
 $ MRELGE  : num  7 6 3 5 7 0 7 7 6 7 ...
 $ MRELSA  : num  0 2 2 2 1 6 2 2 0 0 ...
 $ MRELOV  : num  2 2 4 2 2 3 0 0 3 2 ...
 $ MFALLEEN: num  1 0 4 2 2 3 0 0 3 2 ...
 $ MFGEKIND: num  2 4 4 3 4 5 3 5 3 2 ...
 $ MFWEKIND: num  6 5 2 4 4 2 6 4 3 6 ...
 $ MOPLHOOG: num  1 0 0 3 5 0 0 0 0 0 ...
 $ MOPLMIDD: num  2 5 5 4 4 5 4 3 1 4 ...
 $ MOPLLAAG: num  7 4 4 2 0 4 5 6 8 5 ...
 $ MBERHOOG: num  1 0 0 4 0 2 0 2 1 2 ...
 $ MBERZELF: num  0 0 0 0 5 0 0 0 1 0 ...
 $ MBERBOER: num  1 0 0 0 4 0 0 0 0 0 ...
 $ MBERMIDD: num  2 5 7 3 0 4 4 2 1 3 ...
 $ MBERARBG: num  5 0 0 1 0 2 1 5 8 3 ...
 $ MBERARBO: num  2 4 2 2 0 2 5 2 1 3 ...
 $ MSKA    : num  1 0 0 3 9 2 0 2 1 1 ...
 $ MSKB1   : num  1 2 5 2 0 2 1 1 1 2 ...
 $ MSKB2   : num  2 3 0 1 0 2 4 2 0 1 ...
 $ MSKC    : num  6 5 4 4 0 4 5 5 8 4 ...
 $ MSKD    : num  1 0 0 0 0 2 0 2 1 2 ...
 $ MHHUUR  : num  1 2 7 5 4 9 6 0 9 0 ...
 $ MHKOOP  : num  8 7 2 4 5 0 3 9 0 9 ...
 $ MAUT1   : num  8 7 7 9 6 5 8 4 5 6 ...
 $ MAUT2   : num  0 1 0 0 2 3 0 4 2 1 ...
 $ MAUT0   : num  1 2 2 0 1 3 1 2 3 2 ...
 $ MZFONDS : num  8 6 9 7 5 9 9 6 7 6 ...
 $ MZPART  : num  1 3 0 2 4 0 0 3 2 3 ...
 $ MINKM30 : num  0 2 4 1 0 5 4 2 7 2 ...
 $ MINK3045: num  4 0 5 5 0 2 3 5 2 3 ...
 $ MINK4575: num  5 5 0 3 9 3 3 3 1 3 ...
 $ MINK7512: num  0 2 0 0 0 0 0 0 0 1 ...
 $ MINK123M: num  0 0 0 0 0 0 0 0 0 0 ...
 $ MINKGEM : num  4 5 3 4 6 3 3 3 2 4 ...
 $ MKOOPKLA: num  3 4 4 4 3 3 5 3 3 7 ...
 $ PWAPART : num  0 2 2 0 0 0 0 0 0 2 ...
 $ PWABEDR : num  0 0 0 0 0 0 0 0 0 0 ...
 $ PWALAND : num  0 0 0 0 0 0 0 0 0 0 ...
 $ PPERSAUT: num  6 0 6 6 0 6 6 0 5 0 ...
 $ PBESAUT : num  0 0 0 0 0 0 0 0 0 0 ...
 $ PMOTSCO : num  0 0 0 0 0 0 0 0 0 0 ...
 $ PVRAAUT : num  0 0 0 0 0 0 0 0 0 0 ...
 $ PAANHANG: num  0 0 0 0 0 0 0 0 0 0 ...
 $ PTRACTOR: num  0 0 0 0 0 0 0 0 0 0 ...
 $ PWERKT  : num  0 0 0 0 0 0 0 0 0 0 ...
 $ PBROM   : num  0 0 0 0 0 0 0 3 0 0 ...
 $ PLEVEN  : num  0 0 0 0 0 0 0 0 0 0 ...
 $ PPERSONG: num  0 0 0 0 0 0 0 0 0 0 ...
 $ PGEZONG : num  0 0 0 0 0 0 0 0 0 0 ...
 $ PWAOREG : num  0 0 0 0 0 0 0 0 0 0 ...
 $ PBRAND  : num  5 2 2 2 6 0 0 0 0 3 ...
 $ PZEILPL : num  0 0 0 0 0 0 0 0 0 0 ...
 $ PPLEZIER: num  0 0 0 0 0 0 0 0 0 0 ...
 $ PFIETS  : num  0 0 0 0 0 0 0 0 0 0 ...
 $ PINBOED : num  0 0 0 0 0 0 0 0 0 0 ...
 $ PBYSTAND: num  0 0 0 0 0 0 0 0 0 0 ...
 $ AWAPART : num  0 2 1 0 0 0 0 0 0 1 ...
 $ AWABEDR : num  0 0 0 0 0 0 0 0 0 0 ...
 $ AWALAND : num  0 0 0 0 0 0 0 0 0 0 ...
 $ APERSAUT: num  1 0 1 1 0 1 1 0 1 0 ...
 $ ABESAUT : num  0 0 0 0 0 0 0 0 0 0 ...
 $ AMOTSCO : num  0 0 0 0 0 0 0 0 0 0 ...
 $ AVRAAUT : num  0 0 0 0 0 0 0 0 0 0 ...
 $ AAANHANG: num  0 0 0 0 0 0 0 0 0 0 ...
 $ ATRACTOR: num  0 0 0 0 0 0 0 0 0 0 ...
 $ AWERKT  : num  0 0 0 0 0 0 0 0 0 0 ...
 $ ABROM   : num  0 0 0 0 0 0 0 1 0 0 ...
 $ ALEVEN  : num  0 0 0 0 0 0 0 0 0 0 ...
 $ APERSONG: num  0 0 0 0 0 0 0 0 0 0 ...
 $ AGEZONG : num  0 0 0 0 0 0 0 0 0 0 ...
 $ AWAOREG : num  0 0 0 0 0 0 0 0 0 0 ...
 $ ABRAND  : num  1 1 1 1 1 0 0 0 0 1 ...
 $ AZEILPL : num  0 0 0 0 0 0 0 0 0 0 ...
 $ APLEZIER: num  0 0 0 0 0 0 0 0 0 0 ...
 $ AFIETS  : num  0 0 0 0 0 0 0 0 0 0 ...
 $ AINBOED : num  0 0 0 0 0 0 0 0 0 0 ...
 $ ABYSTAND: num  0 0 0 0 0 0 0 0 0 0 ...
 $ Purchase: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
std.X <- scale(Caravan[, -86])
var(Caravan[,1])
[1] 165.0378
var(std.X[,1])
[1] 1

Split the data into training and test sets:

test <- 1:1000
train.X <- std.X[-test,]
test.X <- std.X[test,]
train.Y <- Caravan$Purchase[-test]
test.Y <- Caravan$Purchase[test]
set.seed(1)
knn.pred <- knn(train.X, test.X, train.Y, k=1)
mean(knn.pred != test.Y)
[1] 0.118
mean(test.Y != 'No')
[1] 0.059
table(knn.pred, test.Y)
        test.Y
knn.pred  No Yes
     No  873  50
     Yes  68   9

Among the customers that KNN (\(k=1\)) predicts will buy insurance, the success rate is \(9 \div (68+9) \approx 0.117\), double the 0.059 rate that random guessing would achieve.
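
The same number can be read off the table programmatically (a small check; cm1 is just a throwaway name):

cm1 <- table(knn.pred, test.Y)
cm1['Yes', 'Yes'] / sum(cm1['Yes', ])  # fraction of predicted buyers who actually buy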

Predict with \(k=3\) and \(k=5\):

knn.pred3 <- knn(train.X, test.X, train.Y, k=3)
table(knn.pred3, test.Y)
knn.pred5 <- knn(train.X, test.X, train.Y, k=5)
table(knn.pred5, test.Y)

With \(k=3\) the success rate among predicted buyers is \(5 \div (21 + 5) \approx 0.192\), and with \(k=5\) it is \(4 \div (11 + 4) \approx 0.267\), both well above the \(k=1\) result.

Fit a logistic regression and predict with a cut-off of 0.5:

lr.fit <- glm(Purchase ~ ., data = Caravan, family = 'binomial', subset = -test)
purchase.prob <- predict(lr.fit, Caravan[test,], type = 'response')
purchase.res <- rep('No', 1000)
purchase.res[purchase.prob > 0.5] = 'Yes'
table(purchase.res, test.Y)

All 7 "Yes" predictions are wrong! Lower the cut-off to 0.25:

purchase.res[purchase.prob > 0.25] = 'Yes'
table(purchase.res, test.Y)

With the lower cut-off, the success rate among predicted buyers is \(11 \div (22 + 11) \approx 0.333\), well above the 5.9% base rate.
