This week we compare the two algorithms by cross-validation.
First, we separate the data into k subsets and randomly select one as the testing set; the others together form the training set. We then estimate beta from the training set, predict y for the testing set, and calculate R square and MSE. Second, we select a different one of the k subsets as the testing set and calculate the same statistics as in the previous step. Repeating this process k times gives k results. Third, we average the k values of R square and MSE for each algorithm and compare the performance of the algorithms using these cross-validation metrics.
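The procedure above can be sketched in base R as follows. This is only a minimal sketch under stated assumptions: ordinary least squares (`lm.fit`) stands in for the two algorithms being compared, the function name `cv_metrics` and the simulated data are hypothetical, and R square is computed on the held-out fold as 1 - SSE/SST, which may differ from the definition used for the tables in this report.

```r
# Minimal k-fold cross-validation sketch (base R only).
# Assumption: least squares via lm.fit() stands in for the compared
# algorithms; cv_metrics and its arguments are hypothetical names.
cv_metrics <- function(X, y, k = 5) {
  n <- nrow(X)
  # Randomly assign each row to one of k folds of (nearly) equal size.
  fold <- sample(rep(seq_len(k), length.out = n))
  r2 <- numeric(k)
  mse <- numeric(k)
  for (i in seq_len(k)) {
    test <- fold == i
    # Estimate beta on the training folds (intercept column added by hand).
    fit <- lm.fit(cbind(1, X[!test, , drop = FALSE]), y[!test])
    beta <- fit$coefficients
    # Predict y on the held-out fold.
    y_hat <- drop(cbind(1, X[test, , drop = FALSE]) %*% beta)
    res <- y[test] - y_hat
    mse[i] <- mean(res^2)
    r2[i] <- 1 - sum(res^2) / sum((y[test] - mean(y[test]))^2)
  }
  # Average the k per-fold values.
  list(r.square = mean(r2), MSE = mean(mse))
}

# Simulated data, for illustration only.
set.seed(1)
X <- matrix(rnorm(200 * 3), nrow = 200)
y <- drop(X %*% c(1, -2, 0.5)) + rnorm(200, sd = 0.5)
cv_metrics(X, y, k = 5)
```

Running the same function once per algorithm, on the same fold assignment, would give directly comparable averages.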
Point: Good models should have high predictive power (small MSE) and high explanatory power (large R square).
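As a small worked example of the two metrics, assuming the usual definitions (MSE as the mean squared prediction error, R square as 1 - SSE/SST). The numbers are made up for illustration and are not taken from the data set.

```r
# Hypothetical observed and predicted values for four test points.
y     <- c(1, 2, 3, 4)
y_hat <- c(1.1, 1.9, 3.2, 3.8)

sse <- sum((y - y_hat)^2)     # sum of squared prediction errors
sst <- sum((y - mean(y))^2)   # total sum of squares around the mean

mse <- sse / length(y)        # smaller is better
r2  <- 1 - sse / sst          # larger is better

print(mse)  # 0.025
print(r2)   # 0.98
```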
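(a) The size of the dataset is 1000*400 (n = 1000 rows, p = 400 columns); we use 5-fold cross-validation. The output below lists R square and MSE for the first algorithm, then for the second, and finally the element-wise difference (second minus first).
num <- seq(5, 50, 10)
str <- seq(0.1, 0.5, 0.1)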
## Loading required package: Matrix
## Loaded glmnet 1.9-8
## $r.square
##          0.1       0.2       0.3       0.4       0.5
## 5  0.2008781 0.5026543 0.6790637 0.8271048 0.8859842
## 10 0.3167575 0.6900210 0.8320703 0.8886818 0.9323778
## 15 0.3795921 0.7764258 0.8677771 0.9210330 0.9528412
## 20 0.5087991 0.7972769 0.9027514 0.9435720 0.9624498
## 25 0.5360732 0.8413392 0.9089366 0.9536702 0.9695390
##
## $MSE
##          0.1       0.2       0.3       0.4       0.5
## 5  0.1257506 0.1454949 0.1813484 0.1286047 0.1368703
## 10 0.2599441 0.2142148 0.1900072 0.2299578 0.2130743
## 15 0.4280426 0.2584904 0.3111214 0.3747653 0.2618787
## 20 0.3651019 0.5151241 0.3797341 0.3221434 0.3449017
## 25 0.6084528 0.4920019 0.6642314 0.3873418 0.5047170
## $r.square
##           0.1       0.2       0.3       0.4       0.5
## 5  0.08818019 0.2564669 0.5178642 0.6728792 0.7794191
## 10 0.12665561 0.5050797 0.6790230 0.8252676 0.8723655
## 15 0.17057855 0.5798422 0.7436915 0.8582639 0.9105872
## 20 0.32507019 0.6412060 0.8212027 0.9109824 0.9303408
## 25 0.32307054 0.7372403 0.8569574 0.9018533 0.9400922
##
## $MSE
##          0.1       0.2       0.3       0.4       0.5
## 5  0.1730136 0.1892512 0.1811196 0.1861237 0.1639829
## 10 0.1699609 0.1716657 0.1811903 0.1639789 0.1601911
## 15 0.1995950 0.1692117 0.1761055 0.1710857 0.1739390
## 20 0.1722132 0.1805020 0.1740254 0.1535753 0.1719399
## 25 0.1705407 0.1703884 0.1617249 0.1912176 0.1765803
## [1] "R square"
##           0.1        0.2         0.3         0.4         0.5
## 5  -0.1126979 -0.2461874 -0.16119951 -0.15422558 -0.10656513
## 10 -0.1901019 -0.1849412 -0.15304729 -0.06341418 -0.06001229
## 15 -0.2090136 -0.1965836 -0.12408562 -0.06276911 -0.04225404
## 20 -0.1837289 -0.1560710 -0.08154864 -0.03258968 -0.03210900
## 25 -0.2130026 -0.1040989 -0.05197912 -0.05181690 -0.02944682
## [1] "MSE"
##            0.1         0.2           0.3         0.4         0.5
## 5   0.04726304  0.04375630 -0.0002288142  0.05751898  0.02711263
## 10 -0.08998315 -0.04254905 -0.0088169405 -0.06597890 -0.05288319
## 15 -0.22844760 -0.08927870 -0.1350159038 -0.20367970 -0.08793964
## 20 -0.19288868 -0.33462206 -0.2057087348 -0.16856810 -0.17296173
## 25 -0.43791209 -0.32161348 -0.5025065000 -0.19612422 -0.32813674
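The third pair of tables appears to be the element-wise difference between the second and the first pair. A quick check in R, using row 5 of the two R square tables in (a), with values copied from the output above. The variable names `first`, `second`, and `diff_row` are placeholders, not names from the original script.

```r
# Row "5" of the first and second R square tables in (a).
first  <- c(0.2008781, 0.5026543, 0.6790637, 0.8271048, 0.8859842)
second <- c(0.08818019, 0.2564669, 0.5178642, 0.6728792, 0.7794191)

# Element-wise difference, second minus first.
diff_row <- second - first
print(diff_row)
```

Up to rounding, this reproduces row 5 of the "R square" difference table. A negative entry means the second algorithm has the smaller value in that cell.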
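(b) The size of the dataset is 400*1000 (n = 400 rows, p = 1000 columns); we use 5-fold cross-validation. As in (a), the output lists R square and MSE for the first algorithm, then for the second, and finally the element-wise difference (second minus first).
num <- seq(5, 50, 10)
str <- seq(0.1, 0.5, 0.1)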
## $r.square
##           0.1       0.2       0.3       0.4       0.5
## 5  0.23831215 0.4694290 0.7370218 0.8061111 0.8738579
## 10 0.22639645 0.6219594 0.8187850 0.8948619 0.9344760
## 15 0.09038344 0.7105163 0.8614342 0.8983197 0.9394470
## 20 0.22971396 0.7421400 0.9009520 0.9359620 0.9602672
## 25 0.28864353 0.8065801 0.9050298 0.9294089 0.9625095
##
## $MSE
##          0.1       0.2       0.3       0.4       0.5
## 5  0.1722648 0.2487461 0.1596459 0.2139924 0.2096356
## 10 0.4782915 0.4599715 0.3612553 0.3100015 0.4166458
## 15 1.1582672 0.6828772 0.9998702 1.2233563 0.5253977
## 20 1.6411451 1.1028066 0.9308703 1.3647324 1.1486696
## 25 1.9559209 1.2641104 1.2851583 2.1250476 2.3967858
## $r.square
##           0.1       0.2       0.3       0.4       0.5
## 5  0.05608529 0.2342785 0.4491188 0.6104080 0.7440339
## 10 0.10248792 0.3790503 0.6840303 0.6934802 0.8314956
## 15 0.13530730 0.4618243 0.7801026 0.8400674 0.9019191
## 20 0.11470426 0.6223321 0.7617940 0.8424748 0.9293513
## 25 0.22305461 0.6276543 0.8005065 0.8781299 0.9279698
##
## $MSE
##          0.1       0.2       0.3       0.4       0.5
## 5  0.2069876 0.1931473 0.2172081 0.2019818 0.1940121
## 10 0.2167088 0.2290107 0.2081711 0.2731870 0.2006026
## 15 0.2181449 0.2385596 0.1940179 0.1999335 0.1993310
## 20 0.2802830 0.2227097 0.2102410 0.2513478 0.1804901
## 25 0.2426497 0.2435097 0.2570228 0.2111981 0.2416534
## [1] "R square"
##            0.1        0.2        0.3         0.4         0.5
## 5  -0.18222686 -0.2351505 -0.2879031 -0.19570304 -0.12982401
## 10 -0.12390853 -0.2429091 -0.1347547 -0.20138170 -0.10298042
## 15  0.04492386 -0.2486920 -0.0813316 -0.05825230 -0.03752791
## 20 -0.11500969 -0.1198078 -0.1391580 -0.09348718 -0.03091583
## 25 -0.06558891 -0.1789258 -0.1045233 -0.05127905 -0.03453968
## [1] "MSE"
##            0.1         0.2         0.3         0.4         0.5
## 5   0.03472278 -0.05559878  0.05756222 -0.01201065 -0.01562347
## 10 -0.26158267 -0.23096084 -0.15308422 -0.03681448 -0.21604323
## 15 -0.94012237 -0.44431760 -0.80585235 -1.02342274 -0.32606671
## 20 -1.36086210 -0.88009687 -0.72062937 -1.11338457 -0.96817946
## 25 -1.71327122 -1.02060077 -1.02813549 -1.91384953 -2.15513244
When n (#rows) > p (#cols), as in (a), both methods seem to give similar results. When n < p, as in (b), smo is much better: its predictive power stays about the same (MSE around 0.2 in both settings), while that of reg decreases significantly (MSE up to about 2.4).