cross validation

Overview

This week we compare the two algorithms, reg and smo, by k-fold cross validation.

Firstly, we separate the data into k subsets, randomly select one as the testing set, and use the remaining k - 1 subsets as the training set. We then estimate beta on the training set, predict y on the testing set, and calculate R square together with MSE. Secondly, we select a different subset from the k subsets as the testing set and calculate the same statistics as in the first step. Repeating this process k times, so that each subset serves as the testing set once, gives k results. Thirdly, we average the k values of R square and MSE for each algorithm and compare the performance of the algorithms on these cross validation metrics.
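The three steps above can be sketched in base R as follows. This is a minimal illustration, not the code used for this report: the helper cv_metrics, the simulated data, and the use of ordinary least squares (via lm.fit) as a stand-in algorithm are all assumptions made for the example.

```r
# Hypothetical helper: k-fold cross validation for a generic fit/predict pair.
# fit_fun(X, y) returns a model; predict_fun(model, X) returns predictions.
cv_metrics <- function(X, y, k, fit_fun, predict_fun) {
  n <- nrow(X)
  # Randomly assign every row to one of k folds of (almost) equal size.
  fold <- sample(rep(seq_len(k), length.out = n))
  r2 <- numeric(k)
  mse <- numeric(k)
  for (i in seq_len(k)) {
    test <- fold == i
    # Estimate on the training folds, predict on the held-out fold.
    model <- fit_fun(X[!test, , drop = FALSE], y[!test])
    y_hat <- predict_fun(model, X[test, , drop = FALSE])
    res <- y[test] - y_hat
    mse[i] <- mean(res^2)
    r2[i] <- 1 - sum(res^2) / sum((y[test] - mean(y[test]))^2)
  }
  # Average the k fold-level statistics.
  c(r.square = mean(r2), MSE = mean(mse))
}

# Stand-in algorithm (an assumption): ordinary least squares with intercept.
ols_fit <- function(X, y) lm.fit(cbind(1, X), y)$coefficients
ols_predict <- function(beta, X) drop(cbind(1, X) %*% beta)

# Small simulated data set, chosen only to make the example run quickly.
set.seed(1)
X <- matrix(rnorm(200 * 5), nrow = 200, ncol = 5)
y <- drop(X %*% c(1, -1, 0.5, 0, 0)) + rnorm(200, sd = 0.3)

out <- cv_metrics(X, y, k = 5, ols_fit, ols_predict)
print(out)
```

Passing the fitting and prediction steps as functions lets the same folds logic be reused for each algorithm being compared.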

Point: A good model should have high predictive power (small MSE) and high explanatory power (large R square).
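For reference, the usual definitions of the two metrics on a testing set are given below. The report does not state which exact formula was used for R square, so this form (one minus the ratio of residual to total sum of squares, with the mean taken over the testing set) is an assumption.

```latex
\mathrm{MSE} = \frac{1}{n_{\mathrm{test}}} \sum_{i \in \mathrm{test}} \left( y_i - \hat{y}_i \right)^2,
\qquad
R^2 = 1 - \frac{\sum_{i \in \mathrm{test}} \left( y_i - \hat{y}_i \right)^2}
               {\sum_{i \in \mathrm{test}} \left( y_i - \bar{y}_{\mathrm{test}} \right)^2},
\qquad
\hat{y}_i = x_i^{\top} \hat{\beta}
```

Under these definitions a smaller MSE means the predictions are closer to the observed y, and an R square closer to 1 means a larger share of the variation in y is explained.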

Calculation

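(a)

The size of the dataset is 1000*400 (n = 1000 rows, p = 400 columns), and we use 5-fold cross validation. In each table below, rows are indexed by num and columns by str.

num <- seq(5, 25, 5)
str <- seq(0.1, 0.5, 0.1)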
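The layout of the result tables can be sketched as a list of two 5 x 5 matrices, one cell per combination of num and str. The values of num are chosen here to match the row labels of the tables (5, 10, 15, 20, 25). The function run_cv is a hypothetical placeholder for one cross validation run and returns NA, since the report does not show the actual simulation code.

```r
num <- seq(5, 25, 5)        # row labels of the tables: 5 10 15 20 25
str <- seq(0.1, 0.5, 0.1)   # column labels of the tables: 0.1 ... 0.5

# Empty 5 x 5 container with the same row and column labels as the tables.
empty <- matrix(NA_real_, nrow = length(num), ncol = length(str),
                dimnames = list(as.character(num), format(str)))
result <- list(r.square = empty, MSE = empty)

# Hypothetical placeholder for "run 5-fold cross validation for one setting".
run_cv <- function(num_i, str_j) c(r.square = NA_real_, MSE = NA_real_)

# Fill one cell per (num, str) combination.
for (i in seq_along(num)) {
  for (j in seq_along(str)) {
    cell <- run_cv(num[i], str[j])
    result$r.square[i, j] <- cell[["r.square"]]
    result$MSE[i, j] <- cell[["MSE"]]
  }
}
print(dimnames(result$MSE))
```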

reg

## $r.square
##       0.1    0.2    0.3    0.4    0.5
## 5  0.1096 0.2984 0.5673 0.7047 0.7736
## 10 0.1985 0.4791 0.7345 0.7617 0.8597
## 15 0.2063 0.6455 0.7463 0.8036 0.9000
## 20 0.3143 0.6684 0.7087 0.7747 0.8388
## 25 0.3125 0.7593 0.8217 0.8497 0.7825
## 
## $MSE
##       0.1    0.2     0.3     0.4     0.5
## 5  0.1757 0.2301  0.2711  0.4494  0.5873
## 10 0.2099 0.3124  0.5548  2.4581  0.3028
## 15 0.6553 0.5304  3.0372  4.4203  1.9081
## 20 0.5578 0.9174 13.3581 21.6094 21.7213
## 25 1.3733 1.4846  2.2022 14.2976 53.6985

smo

## $r.square
##        0.1    0.2    0.3    0.4    0.5
## 5  0.06644 0.2890 0.4905 0.6177 0.7326
## 10 0.10830 0.4505 0.6631 0.8178 0.8652
## 15 0.14147 0.6140 0.7493 0.8596 0.9027
## 20 0.25290 0.7058 0.8187 0.8865 0.9407
## 25 0.30330 0.7357 0.8337 0.9181 0.9447
## 
## $MSE
##       0.1    0.2    0.3    0.4    0.5
## 5  0.1779 0.1788 0.1903 0.2009 0.2080
## 10 0.1779 0.1877 0.1784 0.1639 0.1808
## 15 0.2088 0.1676 0.1906 0.1862 0.2013
## 20 0.1866 0.1548 0.1798 0.1829 0.1496
## 25 0.1671 0.1702 0.2001 0.1749 0.1760

difference (smo-reg)

## [1] "R square"

##          0.1       0.2      0.3      0.4       0.5
## 5  -0.043190 -0.009389 -0.07679 -0.08698 -0.041013
## 10 -0.090226 -0.028604 -0.07139  0.05612  0.005412
## 15 -0.064865 -0.031539  0.00300  0.05601  0.002733
## 20 -0.061426  0.037362  0.10999  0.11188  0.101836
## 25 -0.009213 -0.023544  0.01201  0.06842  0.162239

## [1] "MSE"

##          0.1      0.2       0.3      0.4      0.5
## 5   0.002189 -0.05131  -0.08085  -0.2484  -0.3792
## 10 -0.031927 -0.12467  -0.37636  -2.2942  -0.1220
## 15 -0.446454 -0.36286  -2.84659  -4.2341  -1.7068
## 20 -0.371200 -0.76253 -13.17833 -21.4265 -21.5717
## 25 -1.206219 -1.31445  -2.00206 -14.1227 -53.5226
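As a small check of how the difference tables are read, the sketch below recomputes the first row (num = 5) of the (a) differences from the rounded values printed above. The list layout is an assumption made for the example. Because the inputs are rounded to four digits, the results can differ from the printed differences in the last digit or two.

```r
cols <- c("0.1", "0.2", "0.3", "0.4", "0.5")

# First row (num = 5) of the reg and smo tables in (a), as printed.
reg <- list(r.square = c(0.1096, 0.2984, 0.5673, 0.7047, 0.7736),
            MSE      = c(0.1757, 0.2301, 0.2711, 0.4494, 0.5873))
smo <- list(r.square = c(0.06644, 0.2890, 0.4905, 0.6177, 0.7326),
            MSE      = c(0.1779, 0.1788, 0.1903, 0.2009, 0.2080))

# Element-wise difference smo - reg for each metric.
difference <- Map(function(s, r) setNames(s - r, cols), smo, reg)
print(difference)
```

A negative MSE difference means smo has the smaller prediction error in that cell, and a positive R square difference means smo explains more of the variation.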

(b)

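The size of the dataset is 400*1000 (n = 400 rows, p = 1000 columns), and we use 5-fold cross validation. In each table below, rows are indexed by num and columns by str.

num <- seq(5, 25, 5)
str <- seq(0.1, 0.5, 0.1)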

reg

## $r.square
##        0.1     0.2    0.3    0.4    0.5
## 5  0.03302 0.06543 0.1030 0.2094 0.2368
## 10 0.06193 0.11328 0.1910 0.2563 0.2187
## 15 0.06849 0.21310 0.2776 0.2249 0.3189
## 20 0.08896 0.19941 0.2874 0.3006 0.2316
## 25 0.13308 0.21736 0.1164 0.2089 0.2333
## 
## $MSE
##       0.1    0.2    0.3    0.4    0.5
## 5  0.1648 0.7098  1.267  1.969  4.641
## 10 0.6436 2.3398  5.104  8.215 11.279
## 15 1.3311 4.8227 11.315 23.948 25.304
## 20 2.0609 6.2697 12.785 23.437 49.885
## 25 3.2783 9.7607 31.103 49.486 74.478

smo

## $r.square
##        0.1    0.2    0.3    0.4    0.5
## 5  0.02494 0.2573 0.5019 0.5986 0.7534
## 10 0.13750 0.4513 0.6277 0.8180 0.8310
## 15 0.15652 0.5253 0.7449 0.8495 0.9011
## 20 0.28478 0.6258 0.7845 0.8956 0.9199
## 25 0.23939 0.6632 0.8446 0.8962 0.9502
## 
## $MSE
##       0.1    0.2    0.3    0.4    0.5
## 5  0.2088 0.2251 0.1851 0.2255 0.1742
## 10 0.2035 0.2091 0.2110 0.1922 0.2400
## 15 0.1933 0.2205 0.1868 0.2046 0.1840
## 20 0.1799 0.2132 0.2043 0.1871 0.2125
## 25 0.2133 0.2113 0.2158 0.2228 0.1576

difference (smo-reg)

## [1] "R square"

##          0.1    0.2    0.3    0.4    0.5
## 5  -0.008078 0.1919 0.3989 0.3892 0.5166
## 10  0.075569 0.3380 0.4367 0.5617 0.6123
## 15  0.088028 0.3122 0.4672 0.6246 0.5822
## 20  0.195821 0.4264 0.4971 0.5950 0.6883
## 25  0.106311 0.4459 0.7283 0.6872 0.7169

## [1] "MSE"

##         0.1     0.2     0.3     0.4     0.5
## 5   0.04399 -0.4848  -1.082  -1.744  -4.467
## 10 -0.44005 -2.1307  -4.893  -8.023 -11.039
## 15 -1.13781 -4.6022 -11.128 -23.744 -25.120
## 20 -1.88100 -6.0566 -12.580 -23.250 -49.672
## 25 -3.06503 -9.5495 -30.887 -49.263 -74.320

Result interpretation

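When n (#rows) > p (#columns), both methods reach a similar R square: the differences (smo - reg) lie between about -0.09 and 0.16. The MSE of reg, however, tends to grow with num and str (up to 53.70), whereas the MSE of smo stays between about 0.15 and 0.21. When n < p, smo is much better. The performance of smo remains at the same level (R square up to 0.95, MSE between about 0.16 and 0.24), but the performance of reg decreases significantly: its R square is at most 0.32 and its MSE rises as high as 74.48.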
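One simple way to summarise the comparison is to count the cells in which smo has the larger R square. The sketch below uses the R square differences (smo - reg) copied from the tables in (a) and (b). The helper names to_matrix and summarise_diff are introduced only for this example.

```r
to_matrix <- function(v) matrix(v, nrow = 5, byrow = TRUE)

# R square differences (smo - reg) for (a), n > p.
diff_a <- to_matrix(c(
  -0.043190, -0.009389, -0.07679, -0.08698, -0.041013,
  -0.090226, -0.028604, -0.07139,  0.05612,  0.005412,
  -0.064865, -0.031539,  0.00300,  0.05601,  0.002733,
  -0.061426,  0.037362,  0.10999,  0.11188,  0.101836,
  -0.009213, -0.023544,  0.01201,  0.06842,  0.162239))

# R square differences (smo - reg) for (b), n < p.
diff_b <- to_matrix(c(
  -0.008078, 0.1919, 0.3989, 0.3892, 0.5166,
   0.075569, 0.3380, 0.4367, 0.5617, 0.6123,
   0.088028, 0.3122, 0.4672, 0.6246, 0.5822,
   0.195821, 0.4264, 0.4971, 0.5950, 0.6883,
   0.106311, 0.4459, 0.7283, 0.6872, 0.7169))

# Number of cells where smo wins, total cells, and the mean difference.
summarise_diff <- function(d) {
  c(smo_better = sum(d > 0), cells = length(d), mean_diff = mean(d))
}
summary_tab <- rbind(a = summarise_diff(diff_a), b = summarise_diff(diff_b))
print(summary_tab)
```

In (a) the signs are mixed (smo has the larger R square in 12 of 25 cells), while in (b) smo has the larger R square in 24 of 25 cells.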

Leyi Zhang

September 23, 2014
