This week we compare the two algorithms using cross validation.
First, we separate the data into k subsets and randomly select one as the testing set; the others form the training set. We then estimate beta on the training set, predict y for the testing set, and calculate R square together with MSE. Second, we select a different one of the k subsets as the testing set and calculate the same statistics as in the previous step. Repeating this process k times gives k results. Third, we average the k values of R square and MSE for each algorithm and compare the performance of the algorithms using these cross validation metrics.
Point: Good models should have high predictive power (small MSE) and high explanatory power (large R square).
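The procedure above can be sketched in R as follows. This is only a minimal illustration, not the code used for the results in this report: the helper `cv_metrics`, the stand-in model (`ols_fit`, `ols_predict`) and the simulated data are all assumptions, and the R square formula is one common definition that may differ from the one used here.

```r
# Hypothetical helper: k-fold cross validation for any fit/predict pair.
# fit_fun(X, y) returns a fitted model; predict_fun(model, X) returns predictions.
cv_metrics <- function(X, y, k, fit_fun, predict_fun) {
  n <- nrow(X)
  # Randomly assign each row to one of k folds of (nearly) equal size.
  fold <- sample(rep(seq_len(k), length.out = n))
  r2 <- numeric(k)
  mse <- numeric(k)
  for (i in seq_len(k)) {
    test <- fold == i
    # Estimate on the training folds, predict on the held-out fold.
    model <- fit_fun(X[!test, , drop = FALSE], y[!test])
    y_hat <- predict_fun(model, X[test, , drop = FALSE])
    resid <- y[test] - y_hat
    mse[i] <- mean(resid^2)
    # Assumed definition of R square: 1 - SSE / SST on the held-out fold.
    r2[i] <- 1 - sum(resid^2) / sum((y[test] - mean(y[test]))^2)
  }
  # Average the k per-fold values.
  c(r.square = mean(r2), MSE = mean(mse))
}

# Stand-in model (ordinary least squares), used only to make the sketch runnable.
ols_fit <- function(X, y) lm.fit(cbind(1, X), y)$coefficients
ols_predict <- function(beta, X) drop(cbind(1, X) %*% beta)

# Simulated data, for illustration only.
set.seed(1)
X <- matrix(rnorm(200 * 5), nrow = 200)
y <- drop(X %*% c(1, -1, 0.5, 0, 0)) + rnorm(200, sd = 0.5)

res <- cv_metrics(X, y, k = 5, ols_fit, ols_predict)
print(res)
```

Passing a different `fit_fun`/`predict_fun` pair to the same helper would give the averaged metrics for the second algorithm, so both are scored with the same procedure.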
(a) The dataset is 1000*400 (n = 1000 rows, p = 400 columns); we use 5-fold cross validation.
num <- seq(5, 50, 10)
str <- seq(0.1, 0.5, 0.1)
Results for reg:
## $r.square
##       0.1    0.2    0.3    0.4    0.5
## 5  0.1096 0.2984 0.5673 0.7047 0.7736
## 10 0.1985 0.4791 0.7345 0.7617 0.8597
## 15 0.2063 0.6455 0.7463 0.8036 0.9000
## 20 0.3143 0.6684 0.7087 0.7747 0.8388
## 25 0.3125 0.7593 0.8217 0.8497 0.7825
##
## $MSE
##       0.1    0.2     0.3     0.4     0.5
## 5  0.1757 0.2301  0.2711  0.4494  0.5873
## 10 0.2099 0.3124  0.5548  2.4581  0.3028
## 15 0.6553 0.5304  3.0372  4.4203  1.9081
## 20 0.5578 0.9174 13.3581 21.6094 21.7213
## 25 1.3733 1.4846  2.2022 14.2976 53.6985
Results for smo:
## $r.square
##        0.1    0.2    0.3    0.4    0.5
## 5  0.06644 0.2890 0.4905 0.6177 0.7326
## 10 0.10830 0.4505 0.6631 0.8178 0.8652
## 15 0.14147 0.6140 0.7493 0.8596 0.9027
## 20 0.25290 0.7058 0.8187 0.8865 0.9407
## 25 0.30330 0.7357 0.8337 0.9181 0.9447
##
## $MSE
##       0.1    0.2    0.3    0.4    0.5
## 5  0.1779 0.1788 0.1903 0.2009 0.2080
## 10 0.1779 0.1877 0.1784 0.1639 0.1808
## 15 0.2088 0.1676 0.1906 0.1862 0.2013
## 20 0.1866 0.1548 0.1798 0.1829 0.1496
## 25 0.1671 0.1702 0.2001 0.1749 0.1760
Difference (smo minus reg):
## [1] "R square"
##          0.1       0.2      0.3      0.4       0.5
## 5  -0.043190 -0.009389 -0.07679 -0.08698 -0.041013
## 10 -0.090226 -0.028604 -0.07139  0.05612  0.005412
## 15 -0.064865 -0.031539  0.00300  0.05601  0.002733
## 20 -0.061426  0.037362  0.10999  0.11188  0.101836
## 25 -0.009213 -0.023544  0.01201  0.06842  0.162239
## [1] "MSE"
##          0.1      0.2       0.3      0.4      0.5
## 5   0.002189 -0.05131  -0.08085  -0.2484  -0.3792
## 10 -0.031927 -0.12467  -0.37636  -2.2942  -0.1220
## 15 -0.446454 -0.36286  -2.84659  -4.2341  -1.7068
## 20 -0.371200 -0.76253 -13.17833 -21.4265 -21.5717
## 25 -1.206219 -1.31445  -2.00206 -14.1227 -53.5226
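The last pair of tables in (a) is consistent, up to rounding, with subtracting the first set of results from the second. A small check in R using the first row of the two R square tables (the variable names `first_r2` and `second_r2` are only illustrative, and the values are copied from the printed tables, so they carry the printed rounding):

```r
# Column labels, as in the tables above.
str <- seq(0.1, 0.5, 0.1)

# First row (label 5) of the two R square tables in (a), as printed.
first_r2  <- c(0.1096, 0.2984, 0.5673, 0.7047, 0.7736)
second_r2 <- c(0.06644, 0.2890, 0.4905, 0.6177, 0.7326)

# Elementwise difference: second set of results minus the first.
diff_r2 <- setNames(second_r2 - first_r2, str)
print(round(diff_r2, 4))

# First row of the printed "R square" difference table, for comparison.
printed <- c(-0.043190, -0.009389, -0.07679, -0.08698, -0.041013)
print(max(abs(diff_r2 - printed)))
```

With this sign convention, a negative entry in the MSE difference table means the second method has the smaller error, and a positive entry in the R square difference table means it has the larger R square.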
(b) The dataset is 400*1000 (n = 400 rows, p = 1000 columns); we use 5-fold cross validation.
num <- seq(5, 50, 10)
str <- seq(0.1, 0.5, 0.1)
Results for reg:
## $r.square
##        0.1     0.2    0.3    0.4    0.5
## 5  0.03302 0.06543 0.1030 0.2094 0.2368
## 10 0.06193 0.11328 0.1910 0.2563 0.2187
## 15 0.06849 0.21310 0.2776 0.2249 0.3189
## 20 0.08896 0.19941 0.2874 0.3006 0.2316
## 25 0.13308 0.21736 0.1164 0.2089 0.2333
##
## $MSE
##       0.1    0.2    0.3    0.4    0.5
## 5  0.1648 0.7098  1.267  1.969  4.641
## 10 0.6436 2.3398  5.104  8.215 11.279
## 15 1.3311 4.8227 11.315 23.948 25.304
## 20 2.0609 6.2697 12.785 23.437 49.885
## 25 3.2783 9.7607 31.103 49.486 74.478
Results for smo:
## $r.square
##        0.1    0.2    0.3    0.4    0.5
## 5  0.02494 0.2573 0.5019 0.5986 0.7534
## 10 0.13750 0.4513 0.6277 0.8180 0.8310
## 15 0.15652 0.5253 0.7449 0.8495 0.9011
## 20 0.28478 0.6258 0.7845 0.8956 0.9199
## 25 0.23939 0.6632 0.8446 0.8962 0.9502
##
## $MSE
##       0.1    0.2    0.3    0.4    0.5
## 5  0.2088 0.2251 0.1851 0.2255 0.1742
## 10 0.2035 0.2091 0.2110 0.1922 0.2400
## 15 0.1933 0.2205 0.1868 0.2046 0.1840
## 20 0.1799 0.2132 0.2043 0.1871 0.2125
## 25 0.2133 0.2113 0.2158 0.2228 0.1576
Difference (smo minus reg):
## [1] "R square"
##          0.1    0.2    0.3    0.4    0.5
## 5  -0.008078 0.1919 0.3989 0.3892 0.5166
## 10  0.075569 0.3380 0.4367 0.5617 0.6123
## 15  0.088028 0.3122 0.4672 0.6246 0.5822
## 20  0.195821 0.4264 0.4971 0.5950 0.6883
## 25  0.106311 0.4459 0.7283 0.6872 0.7169
## [1] "MSE"
##         0.1     0.2     0.3     0.4     0.5
## 5   0.04399 -0.4848  -1.082  -1.744  -4.467
## 10 -0.44005 -2.1307  -4.893  -8.023 -11.039
## 15 -1.13781 -4.6022 -11.128 -23.744 -25.120
## 20 -1.88100 -6.0566 -12.580 -23.250 -49.672
## 25 -3.06503 -9.5495 -30.887 -49.263 -74.320
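When n (#rows) > p (#columns), both methods give similar R square, although the MSE of reg tends to grow in the lower-right part of the table while that of smo stays near 0.2. When n < p, smo is much better: its performance stays about the same, but that of reg decreases significantly (R square drops below 0.32 and MSE rises as high as 74.5).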