Question 9.1 Using the same crime data set uscrime.txt as in Question 8.2, apply Principal Component Analysis and then create a regression model using the first few principal components. Specify your new model in terms of the original variables (not the principal components), and compare its quality to that of your solution to Question 8.2. You can use the R function prcomp for PCA. (Note that to first scale the data, you can include scale. = TRUE to scale as part of the PCA function. Don’t forget that, to make a prediction for the new city, you’ll need to unscale the coefficients (i.e., do the scaling calculation in reverse)!)
set.seed(1)
crime <- read.table("\\Users\\smparker\\WorkFolders\\Documents\\data 7.2\\uscrime.txt", header = TRUE)
head(crime)
## M So Ed Po1 Po2 LF M.F Pop NW U1 U2 Wealth Ineq Prob
## 1 15.1 1 9.1 5.8 5.6 0.510 95.0 33 30.1 0.108 4.1 3940 26.1 0.084602
## 2 14.3 0 11.3 10.3 9.5 0.583 101.2 13 10.2 0.096 3.6 5570 19.4 0.029599
## 3 14.2 1 8.9 4.5 4.4 0.533 96.9 18 21.9 0.094 3.3 3180 25.0 0.083401
## 4 13.6 0 12.1 14.9 14.1 0.577 99.4 157 8.0 0.102 3.9 6730 16.7 0.015801
## 5 14.1 0 12.1 10.9 10.1 0.591 98.5 18 3.0 0.091 2.0 5780 17.4 0.041399
## 6 12.1 0 11.0 11.8 11.5 0.547 96.4 25 4.4 0.084 2.9 6890 12.6 0.034201
## Time Crime
## 1 26.2011 791
## 2 25.2999 1635
## 3 24.3006 578
## 4 29.9012 1969
## 5 21.2998 1234
## 6 20.9995 682
#Removing the binary variable So
crime1 <- crime[-2]
head (crime1)
## M Ed Po1 Po2 LF M.F Pop NW U1 U2 Wealth Ineq Prob
## 1 15.1 9.1 5.8 5.6 0.510 95.0 33 30.1 0.108 4.1 3940 26.1 0.084602
## 2 14.3 11.3 10.3 9.5 0.583 101.2 13 10.2 0.096 3.6 5570 19.4 0.029599
## 3 14.2 8.9 4.5 4.4 0.533 96.9 18 21.9 0.094 3.3 3180 25.0 0.083401
## 4 13.6 12.1 14.9 14.1 0.577 99.4 157 8.0 0.102 3.9 6730 16.7 0.015801
## 5 14.1 12.1 10.9 10.1 0.591 98.5 18 3.0 0.091 2.0 5780 17.4 0.041399
## 6 12.1 11.0 11.8 11.5 0.547 96.4 25 4.4 0.084 2.9 6890 12.6 0.034201
## Time Crime
## 1 26.2011 791
## 2 25.2999 1635
## 3 24.3006 578
## 4 29.9012 1969
## 5 21.2998 1234
## 6 20.9995 682
#PCA model based on the crime data
pca <- prcomp(crime1[,1:15],scale =TRUE)
#Summary and plotting of PCA crime data
summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.3802 1.6756 1.4202 1.16749 1.03667 0.74864 0.5988
## Proportion of Variance 0.3777 0.1872 0.1345 0.09087 0.07165 0.03736 0.0239
## Cumulative Proportion 0.3777 0.5649 0.6993 0.79020 0.86185 0.89921 0.9231
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.55069 0.48478 0.44375 0.42652 0.32674 0.26644 0.2324
## Proportion of Variance 0.02022 0.01567 0.01313 0.01213 0.00712 0.00473 0.0036
## Cumulative Proportion 0.94334 0.95900 0.97213 0.98426 0.99138 0.99611 0.9997
## PC15
## Standard deviation 0.06595
## Proportion of Variance 0.00029
## Cumulative Proportion 1.00000
plot(pca)
attributes(pca)
## $names
## [1] "sdev" "rotation" "center" "scale" "x"
##
## $class
## [1] "prcomp"
var <- pca$sdev^2
propvar <-var/sum(var)
plot(propvar,xlab ="Principal Component",
ylab ="Proportion of Variance Explained",
ylim =c(0,1), type ="b")
Utilizing the above graph we can visualize how much of the variance is being explained by each of the principal components. The first component explains close to 40% of the variance, the second component explains 20% of the variance. Making use of the elbow diagram it is clearly visible that less and less of the variance is being explained beyond the 6th principle component.
# Performing PCA on the first 6 principal components
attributes(pca$x)
## $dim
## [1] 47 15
##
## $dimnames
## $dimnames[[1]]
## NULL
##
## $dimnames[[2]]
## [1] "PC1" "PC2" "PC3" "PC4" "PC5" "PC6" "PC7" "PC8" "PC9" "PC10"
## [11] "PC11" "PC12" "PC13" "PC14" "PC15"
pca$x
## PC1 PC2 PC3 PC4 PC5 PC6
## [1,] -3.81396173 1.49480900 -1.0603159 0.52406615 0.84528748 0.22511769
## [2,] 1.39378610 -0.30940473 0.2040623 2.01004712 -0.40017797 -0.65963339
## [3,] -4.05081477 0.02927506 -0.3395327 -0.15086987 0.59155503 0.60971007
## [4,] 4.59616332 2.56050623 0.3297528 1.84346257 -0.09151119 0.78941109
## [5,] 1.76782290 -1.32278400 1.3740356 0.49113472 0.64210034 -0.49140529
## [6,] 2.62914861 -0.26847655 0.3101237 -1.60938033 1.47875512 -0.71070444
## [7,] 0.72645393 -0.32738471 -0.9599399 -0.38740571 0.82753989 -0.89946017
## [8,] 0.86025439 1.39484364 0.6927791 1.04326051 1.05777574 -0.03706974
## [9,] -3.17917314 1.08511871 1.3511201 0.52913646 0.54806163 0.15408027
## [10,] 0.60199208 -3.11479451 0.4673390 0.36299278 -0.58642324 0.47082637
## [11,] 3.04377542 2.73346118 1.5244970 0.18323539 -0.84280152 0.38091485
## [12,] 0.73319230 0.01732148 1.1055319 -0.93534947 -1.09714173 0.24719333
## [13,] -0.06538917 -1.06720637 1.7115343 -1.52051166 -1.30690240 1.04747677
## [14,] -0.10350086 -1.88098917 1.0116864 -0.74476842 0.16718590 0.50648502
## [15,] -3.04248733 0.22341805 -0.9321089 0.45593154 0.04048224 0.08214470
## [16,] -2.43899141 2.34887912 -1.7950680 0.78252902 0.48468438 -0.21379267
## [17,] -0.84695940 -1.50854412 -1.4237311 -0.93326334 0.79543267 -0.44159994
## [18,] 0.44967843 0.01794353 -0.4162988 0.38413857 3.72264356 0.19771947
## [19,] 2.78688483 1.02983223 0.7120392 -1.75632220 1.13565429 -0.95181565
## [20,] 2.43597420 1.54260143 -2.8466543 0.58558522 -0.34679165 0.73557815
## [21,] 0.77334017 -0.25055341 0.1887937 -1.01532413 -2.09549445 0.51157039
## [22,] -5.49386562 1.67693083 0.1432269 0.52331840 0.11002034 0.25813985
## [23,] -0.11311127 0.90006187 0.7489911 -0.11236256 0.01722450 0.20261185
## [24,] 1.03960328 -2.03182990 -2.8572832 0.99760296 -0.67660795 0.07573911
## [25,] -0.40823073 -2.67922871 1.8077718 -0.96987069 0.31232179 1.43239586
## [26,] 4.59398246 -1.15115199 -0.4537591 3.77479197 0.50370215 -0.22108628
## [27,] 0.22252243 -1.12965276 1.0415912 -2.40971203 -0.03163130 -1.18670888
## [28,] 0.02336005 -1.07769460 0.3208172 1.46899925 -0.71336389 -0.75664604
## [29,] 3.89430066 4.21532234 0.4266915 -1.25198228 0.91703286 0.80347287
## [30,] -3.81094539 0.80374078 1.4952845 0.42547959 0.67378663 -0.37880247
## [31,] -1.42459732 -1.64684422 -2.8653965 0.07109257 -1.28966751 -0.34826897
## [32,] 1.49499555 1.00654828 -0.7380728 -1.10566201 -0.47311960 1.36112745
## [33,] -1.42496929 -0.41360704 1.5205399 0.16147756 0.01490728 -0.23993431
## [34,] 1.71825488 -0.78610986 -0.8039133 -0.47288584 0.55874447 -0.51869082
## [35,] 1.41200277 2.75113922 -2.0297526 -1.81148045 -1.23507296 0.58048925
## [36,] 1.19331842 1.81970726 0.7712784 -0.22269753 -1.67372755 -2.48306586
## [37,] -3.72446655 0.48770357 2.5691107 1.93768250 -1.13200914 0.41515015
## [38,] -1.42767218 -2.72358805 -0.2711169 0.06885997 -0.39480798 0.98054881
## [39,] -2.95634262 1.23581592 0.2001766 -0.22270316 -0.11335822 -0.34318729
## [40,] -0.39403166 1.11441462 0.8659378 0.50802275 -0.49928311 0.71284028
## [41,] 1.11356353 -1.92797959 1.8341757 -0.41300988 -1.15273251 -0.95896014
## [42,] -1.34536528 -1.84855157 -1.6931182 -1.15528036 1.68481989 -0.38568562
## [43,] -1.97549443 0.65400400 1.4808205 0.68203789 0.05886983 -0.92578651
## [44,] 2.15628759 -0.61234215 -0.6295304 0.35273735 -0.93267343 -0.49690723
## [45,] -2.81341963 1.12379006 -3.4751405 -1.09138875 -1.30975799 -0.51150287
## [46,] 1.27666760 -1.34845267 0.8563272 -0.95154427 1.16939619 0.76049822
## [47,] 1.91646392 -2.84001774 -1.4753031 1.07615212 0.03707310 0.61947271
## PC7 PC8 PC9 PC10 PC11
## [1,] 0.23293604 0.188420565 0.588477019 -0.31107661 0.029512698
## [2,] -0.48862427 0.313999697 0.351041158 -0.08903699 -0.090989502
## [3,] -0.32443483 0.377639219 0.008787560 0.12222732 0.406783557
## [4,] -0.25062337 -1.560852996 0.493019186 -0.05425163 -0.042720507
## [5,] -0.27365756 -0.424185396 0.434133137 -0.34524583 0.209510380
## [6,] 0.04362165 0.524236678 -0.797930672 -0.41254229 0.032972902
## [7,] -0.21607703 0.713231310 0.307583742 -0.54951439 0.344401514
## [8,] -1.13106739 0.540977480 0.411362492 -0.27127601 0.524934080
## [9,] 0.40660095 0.039646985 -0.009798776 -0.17413874 -0.173779568
## [10,] 0.26866889 -0.423485945 -0.225820343 -0.30766025 0.074751069
## [11,] -0.15335190 0.643677975 0.187379172 0.65688192 0.084449047
## [12,] 0.30293369 0.330504907 0.194076881 0.12664704 -0.276916370
## [13,] 0.23057199 0.734714535 0.413082158 0.58841859 0.265790591
## [14,] -0.20217705 -0.005780294 0.496351170 -0.10831130 -0.081162070
## [15,] -1.04460920 -0.281933252 -0.298096355 0.98266596 -0.863742052
## [16,] -0.24230801 0.697607821 -0.190265369 -0.90635771 0.261019877
## [17,] 0.38388490 -0.691499477 0.584784712 0.07149469 -0.380263687
## [18,] 1.30155715 0.316008444 0.026815303 1.06517848 -0.213599385
## [19,] -0.23944874 0.029233855 -0.250618440 -0.75616077 -0.386930287
## [20,] 0.26520763 0.342140816 0.201951433 -0.33387171 -0.786843818
## [21,] 0.11211127 0.673164807 0.067626227 0.29284745 0.283835504
## [22,] 1.34753245 0.504529671 0.266651832 -0.19143220 1.141649407
## [23,] -1.55828555 0.576944979 -0.128224677 -0.13342013 -0.279675573
## [24,] 0.02474439 -0.247628513 0.115335431 -0.52684580 0.385775578
## [25,] -0.04215373 0.461035307 0.399760892 0.23352583 -0.093000965
## [26,] 0.35801996 1.001347760 -0.414337204 0.69245141 0.022591769
## [27,] 0.32119066 -0.060375606 -0.282299894 -0.07277882 0.261845858
## [28,] -0.13879854 -0.571443559 0.201184197 0.04584585 0.310299275
## [29,] 0.16679993 -0.855417031 -0.729350584 0.17719944 0.673990540
## [30,] 0.38597436 -0.868484936 -0.327493489 0.29863183 -0.049265368
## [31,] -0.05195320 -0.639459954 -1.452645328 0.24929516 0.421715622
## [32,] 0.67620197 -0.035577666 0.291494485 -0.13335344 -0.523014001
## [33,] -0.94180066 -0.183629298 0.449351618 0.16087168 0.073036234
## [34,] -0.31712044 0.123970314 0.261160626 -0.12877613 0.458255646
## [35,] 0.51608263 -0.267146763 -0.037536790 -0.32528584 -0.067570727
## [36,] 0.25943780 0.565391928 -0.117459856 0.61282562 -0.545667507
## [37,] 0.89792600 0.205082148 -0.754258100 -1.06567212 -1.215413351
## [38,] -0.85778773 -0.088561940 -0.563547410 0.01276699 0.425073513
## [39,] -1.03976047 -0.033264249 -0.160012992 -0.04715679 -0.174601629
## [40,] -0.56925183 -1.034631620 0.192452912 0.09353996 0.152545486
## [41,] 0.66916197 -0.515870807 0.473619410 -0.32827345 0.006667121
## [42,] -0.13822244 -0.620611580 0.851202267 0.06288717 -0.743999204
## [43,] 0.56818268 -0.782851679 -0.370274134 0.48153133 0.355397538
## [44,] 0.62656095 -0.163025743 0.313821216 -0.06473876 0.297433585
## [45,] -0.26364985 0.161261916 0.198102836 0.73674355 -0.133519659
## [46,] -0.38228168 0.073597678 -1.406020512 -0.04188695 -0.215673297
## [47,] 0.50153557 0.217351511 -0.264618150 -0.08541262 -0.165889864
## PC12 PC13 PC14 PC15
## [1,] 0.084982446 0.21359689 0.205136740 0.027793386
## [2,] -0.027781324 -0.03962376 -0.050788856 -0.036687438
## [3,] 0.202419253 -0.05290100 -0.526988040 0.062127288
## [4,] -0.112252436 0.03025186 0.001588820 0.045230815
## [5,] 0.673294686 0.57593617 0.183222166 -0.066037038
## [6,] -0.318981131 0.27533142 0.118375689 0.005226695
## [7,] -1.072944542 -0.02834628 0.085683627 -0.064827799
## [8,] 0.222045863 -0.50183261 -0.396849336 -0.012742192
## [9,] -0.041344390 0.36868247 -0.116758889 0.039130985
## [10,] 0.173488890 0.06774730 -0.086077852 0.015403181
## [11,] -0.548736305 0.34683138 -0.355015395 0.146307595
## [12,] -0.379768933 0.06918388 -0.131926806 0.031875635
## [13,] 0.405927180 -0.15002671 0.299151219 -0.036121926
## [14,] -0.364211818 -0.42375473 0.147274385 0.005637575
## [15,] -0.010384699 -0.41800511 0.244898684 0.030739089
## [16,] -0.001926467 0.11404701 0.118251932 -0.024671233
## [17,] 0.516179788 0.09946446 -0.290476614 0.008300765
## [18,] -0.057488575 0.47346785 0.091469396 0.013974766
## [19,] 0.480809100 -0.59981420 0.337238616 0.058113020
## [20,] -0.119484477 -0.12717681 -0.003573220 -0.008947907
## [21,] 0.147446682 0.30431578 0.184821076 -0.011493109
## [22,] 0.284135914 -0.30411822 0.016739027 -0.015532938
## [23,] 0.001059369 0.48179717 0.047138994 0.044302898
## [24,] 0.081064620 0.23829521 -0.144057955 -0.024548772
## [25,] 0.217004897 -0.21377094 -0.024863594 -0.055152332
## [26,] 0.181705150 -0.18505106 -0.118918199 -0.110168683
## [27,] -0.170964647 0.04010342 -0.260107576 0.080949179
## [28,] 0.067810231 0.33839284 0.413175030 -0.032032147
## [29,] 0.617257839 -0.05704383 0.061842414 0.013326574
## [30,] -0.367932861 -0.17141830 -0.199733122 -0.025885508
## [31,] -0.038108658 0.12042173 -0.282202515 0.073330432
## [32,] -0.423350198 0.01136968 -0.043868719 -0.078929490
## [33,] -0.024369864 0.16084512 0.090524456 0.113265174
## [34,] 0.029824141 -0.17974799 0.176464680 0.075243661
## [35,] -0.004661230 -0.07244008 -0.272029765 -0.103470686
## [36,] 0.543398035 -0.13423251 -0.377396628 -0.052744461
## [37,] 0.241288352 0.07854077 0.097063461 0.063424731
## [38,] -0.033154088 0.00341939 -0.174832328 -0.029088087
## [39,] -0.035560215 -0.07701236 -0.032644704 -0.099503547
## [40,] -0.247619848 -0.04575911 0.009142559 -0.071839901
## [41,] -0.223090029 0.10160949 -0.157313841 -0.085179749
## [42,] 0.085211989 -0.02293847 -0.313491456 -0.011436895
## [43,] -0.625489926 -0.38565832 0.333733688 -0.054826924
## [44,] -0.024778749 -0.15001567 0.316854952 0.141945758
## [45,] 0.108743701 0.18767185 0.543253672 -0.014875634
## [46,] -0.034765353 0.13091002 0.162001323 -0.109273929
## [47,] -0.055947364 -0.49154508 0.074868804 0.140369122
PCs <-pca$x[,1:6]
summary(PCs)
## PC1 PC2 PC3 PC4
## Min. :-5.4939 Min. :-3.11479 Min. :-3.4751 Min. :-2.40971
## 1st Qu.:-1.4263 1st Qu.:-1.23697 1st Qu.:-0.8680 1st Qu.:-0.94345
## Median : 0.4497 Median : 0.01732 Median : 0.3101 Median : 0.07109
## Mean : 0.0000 Mean : 0.00000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 1.4535 3rd Qu.: 1.11910 3rd Qu.: 1.0266 3rd Qu.: 0.52660
## Max. : 4.5962 Max. : 4.21532 Max. : 2.5691 Max. : 3.77479
## PC5 PC6
## Min. :-2.09549 Min. :-2.48307
## 1st Qu.:-0.69499 1st Qu.:-0.49416
## Median : 0.01723 Median : 0.08214
## Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.61683 3rd Qu.: 0.54603
## Max. : 3.72264 Max. : 1.43240
pcaCrime <-cbind(PCs, crime1[,15])
model1 <-lm(V7~.,data =as.data.frame(pcaCrime))
summary(model1)
##
## Call:
## lm(formula = V7 ~ ., data = as.data.frame(pcaCrime))
##
## Residuals:
## Min 1Q Median 3Q Max
## -289.78 -86.76 14.92 81.89 260.15
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 905.085 20.300 44.584 < 2e-16 ***
## PC1 89.117 8.621 10.337 7.36e-13 ***
## PC2 75.018 12.247 6.126 3.15e-07 ***
## PC3 38.075 14.449 2.635 0.0119 *
## PC4 222.730 17.576 12.672 1.38e-15 ***
## PC5 -2.104 19.794 -0.106 0.9159
## PC6 -50.000 27.410 -1.824 0.0756 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 139.2 on 40 degrees of freedom
## Multiple R-squared: 0.8874, Adjusted R-squared: 0.8705
## F-statistic: 52.54 on 6 and 40 DF, p-value: < 2.2e-16
model1$coefficients
## (Intercept) PC1 PC2 PC3 PC4 PC5
## 905.085106 89.116825 75.017789 38.075301 222.730156 -2.103705
## PC6
## -49.999757
beta0<-model1$coefficient[1]
betas<-model1$coefficients[2:6]
alphas <-pca$rotation[,1:5]%*%betas
summary(pca$rotation)
## PC1 PC2 PC3 PC4
## Min. :-0.36011 Min. :-0.39080 Min. :-0.65264 Min. :-0.11402
## 1st Qu.:-0.12937 1st Qu.:-0.07483 1st Qu.:-0.05689 1st Qu.: 0.02984
## Median : 0.10329 Median : 0.11374 Median : 0.04387 Median : 0.14992
## Mean : 0.06446 Mean : 0.06744 Mean :-0.02514 Mean : 0.16741
## 3rd Qu.: 0.28190 3rd Qu.: 0.24133 3rd Qu.: 0.09821 3rd Qu.: 0.29079
## Max. : 0.39717 Max. : 0.44174 Max. : 0.29184 Max. : 0.57588
## PC5 PC6 PC7 PC8
## Min. :-0.54945 Min. :-0.40227 Min. :-0.40742 Min. :-0.63588
## 1st Qu.:-0.14879 1st Qu.:-0.08372 1st Qu.: 0.04153 1st Qu.:-0.16037
## Median :-0.01729 Median : 0.02759 Median : 0.14541 Median : 0.03786
## Mean :-0.01062 Mean : 0.06866 Mean : 0.11058 Mean :-0.02608
## 3rd Qu.: 0.13323 3rd Qu.: 0.18123 3rd Qu.: 0.24765 3rd Qu.: 0.13961
## Max. : 0.56740 Max. : 0.57132 Max. : 0.43230 Max. : 0.26560
## PC9 PC10 PC11 PC12
## Min. :-0.37573 Min. :-0.53528 Min. :-0.454521 Min. :-0.55896
## 1st Qu.:-0.04830 1st Qu.:-0.10750 1st Qu.:-0.131482 1st Qu.:-0.14340
## Median : 0.00207 Median : 0.04780 Median : 0.042798 Median : 0.09090
## Mean : 0.04750 Mean : 0.03464 Mean :-0.003106 Mean : 0.01597
## 3rd Qu.: 0.12212 3rd Qu.: 0.13496 3rd Qu.: 0.205714 3rd Qu.: 0.16999
## Max. : 0.63478 Max. : 0.51040 Max. : 0.336892 Max. : 0.37835
## PC13 PC14 PC15
## Min. :-0.512378 Min. :-0.32858 Min. :-0.703237
## 1st Qu.:-0.074839 1st Qu.:-0.04619 1st Qu.:-0.008469
## Median :-0.018941 Median : 0.01395 Median : 0.003084
## Mean : 0.002719 Mean : 0.08847 Mean : 0.004122
## 3rd Qu.: 0.112342 3rd Qu.: 0.15597 3rd Qu.: 0.034109
## Max. : 0.509149 Max. : 0.69877 Max. : 0.704390
attributes
## function (x) .Primitive("attributes")
(pca$rotation)
## PC1 PC2 PC3 PC4 PC5
## M -0.299621651 0.02234471 0.206092406 0.343665047 -0.084281050
## Ed 0.333376623 -0.27425273 0.067599563 0.003731567 0.013554430
## Po1 0.355852775 0.21582732 0.042893927 0.170542009 0.255315925
## Po2 0.357078563 0.20737015 0.043865362 0.149923778 0.280968581
## LF 0.158964999 -0.32325373 0.291835477 0.204862901 -0.273794459
## M.F 0.103289228 -0.39080214 -0.159875463 0.482075361 -0.197981811
## Pop 0.152940574 0.44173764 0.050441232 -0.077218554 -0.017287070
## NW -0.260560205 0.29417874 0.097968609 0.334501224 0.195245659
## U1 0.028997051 -0.02265472 -0.652643201 0.100077953 -0.218707462
## U2 0.040646753 0.26683942 -0.580664129 0.113671290 -0.096486344
## Wealth 0.397173344 -0.00313705 -0.008791153 -0.079627704 0.071217657
## Ineq -0.360109089 0.11373640 0.026429360 0.247078877 -0.099589074
## Prob -0.273342185 -0.12700739 -0.104998129 0.055956699 0.567399568
## Time 0.001821896 0.39673014 0.204244603 -0.114024112 -0.549453494
## Crime 0.230417323 0.19396335 0.098446155 0.575883241 -0.005439264
## PC6 PC7 PC8 PC9 PC10
## M -0.402274660 0.27772065 -0.50333387 0.002069821 -0.074774855
## Ed 0.007998959 0.14540962 -0.16346730 0.634782269 -0.228257273
## Po1 -0.045927973 0.06107532 0.05161909 -0.256986250 0.077177550
## Po2 -0.072344908 0.05300705 0.03484300 -0.258961189 -0.009689514
## LF 0.571320554 0.17399374 0.25444573 -0.055983429 -0.140224742
## M.F 0.027585518 0.12044587 -0.15726847 -0.375725667 0.324917640
## Pop 0.564718325 -0.01633697 -0.63588300 -0.016049945 0.024803140
## NW 0.110048809 0.37455132 0.26560317 -0.040614583 -0.535284706
## U1 0.041380534 0.18737480 -0.18421810 0.075030767 -0.239052102
## U2 0.062042387 0.03005833 0.25897286 0.096573399 0.047797829
## Wealth -0.115901092 0.21757237 0.03786171 -0.009708249 0.093353713
## Ineq 0.252410371 -0.40742320 0.05312011 0.040343235 0.136054290
## Prob 0.253222193 0.37429233 0.01725547 0.285289952 0.510396571
## Time -0.095091758 0.43230010 0.19940004 0.147667548 0.398495704
## Crime -0.129277610 -0.36541536 0.07981248 0.444813855 0.133858146
## PC11 PC12 PC13 PC14 PC15
## M -0.444227084 0.13289605 -0.058446475 0.160327787 -0.005502974
## Ed 0.273687069 0.08698933 -0.395781017 0.254142233 -0.046862227
## Po1 0.042797813 0.37835114 -0.029006019 0.099045778 -0.703236535
## Po2 0.072538736 0.34205350 -0.091230582 0.151608097 0.704389818
## LF -0.408599693 0.20708045 0.113969028 -0.010999074 0.031149791
## M.F 0.336892298 -0.28069995 -0.232486798 -0.079617481 0.005539075
## Pop -0.005757335 -0.18265026 -0.056027938 -0.108326331 0.003084137
## NW 0.278215123 -0.30705853 -0.018940960 -0.038953153 -0.035623452
## U1 0.155604974 0.31204948 0.509148734 0.013945723 0.017493978
## U2 -0.454521310 -0.10298416 -0.512377770 0.067509490 -0.005322806
## Wealth -0.201845606 -0.55895793 0.391319261 0.503338086 -0.009197914
## Ineq 0.182010074 0.10051828 0.038575487 0.698767969 -0.007740862
## Prob -0.041683743 0.09090367 0.110714923 -0.053434539 0.039141990
## Time 0.229417628 0.12518170 0.003806471 -0.001782937 0.037068526
## Crime -0.061117981 -0.10415393 0.267554780 -0.328577896 0.037448778
pca$rotation
## PC1 PC2 PC3 PC4 PC5
## M -0.299621651 0.02234471 0.206092406 0.343665047 -0.084281050
## Ed 0.333376623 -0.27425273 0.067599563 0.003731567 0.013554430
## Po1 0.355852775 0.21582732 0.042893927 0.170542009 0.255315925
## Po2 0.357078563 0.20737015 0.043865362 0.149923778 0.280968581
## LF 0.158964999 -0.32325373 0.291835477 0.204862901 -0.273794459
## M.F 0.103289228 -0.39080214 -0.159875463 0.482075361 -0.197981811
## Pop 0.152940574 0.44173764 0.050441232 -0.077218554 -0.017287070
## NW -0.260560205 0.29417874 0.097968609 0.334501224 0.195245659
## U1 0.028997051 -0.02265472 -0.652643201 0.100077953 -0.218707462
## U2 0.040646753 0.26683942 -0.580664129 0.113671290 -0.096486344
## Wealth 0.397173344 -0.00313705 -0.008791153 -0.079627704 0.071217657
## Ineq -0.360109089 0.11373640 0.026429360 0.247078877 -0.099589074
## Prob -0.273342185 -0.12700739 -0.104998129 0.055956699 0.567399568
## Time 0.001821896 0.39673014 0.204244603 -0.114024112 -0.549453494
## Crime 0.230417323 0.19396335 0.098446155 0.575883241 -0.005439264
## PC6 PC7 PC8 PC9 PC10
## M -0.402274660 0.27772065 -0.50333387 0.002069821 -0.074774855
## Ed 0.007998959 0.14540962 -0.16346730 0.634782269 -0.228257273
## Po1 -0.045927973 0.06107532 0.05161909 -0.256986250 0.077177550
## Po2 -0.072344908 0.05300705 0.03484300 -0.258961189 -0.009689514
## LF 0.571320554 0.17399374 0.25444573 -0.055983429 -0.140224742
## M.F 0.027585518 0.12044587 -0.15726847 -0.375725667 0.324917640
## Pop 0.564718325 -0.01633697 -0.63588300 -0.016049945 0.024803140
## NW 0.110048809 0.37455132 0.26560317 -0.040614583 -0.535284706
## U1 0.041380534 0.18737480 -0.18421810 0.075030767 -0.239052102
## U2 0.062042387 0.03005833 0.25897286 0.096573399 0.047797829
## Wealth -0.115901092 0.21757237 0.03786171 -0.009708249 0.093353713
## Ineq 0.252410371 -0.40742320 0.05312011 0.040343235 0.136054290
## Prob 0.253222193 0.37429233 0.01725547 0.285289952 0.510396571
## Time -0.095091758 0.43230010 0.19940004 0.147667548 0.398495704
## Crime -0.129277610 -0.36541536 0.07981248 0.444813855 0.133858146
## PC11 PC12 PC13 PC14 PC15
## M -0.444227084 0.13289605 -0.058446475 0.160327787 -0.005502974
## Ed 0.273687069 0.08698933 -0.395781017 0.254142233 -0.046862227
## Po1 0.042797813 0.37835114 -0.029006019 0.099045778 -0.703236535
## Po2 0.072538736 0.34205350 -0.091230582 0.151608097 0.704389818
## LF -0.408599693 0.20708045 0.113969028 -0.010999074 0.031149791
## M.F 0.336892298 -0.28069995 -0.232486798 -0.079617481 0.005539075
## Pop -0.005757335 -0.18265026 -0.056027938 -0.108326331 0.003084137
## NW 0.278215123 -0.30705853 -0.018940960 -0.038953153 -0.035623452
## U1 0.155604974 0.31204948 0.509148734 0.013945723 0.017493978
## U2 -0.454521310 -0.10298416 -0.512377770 0.067509490 -0.005322806
## Wealth -0.201845606 -0.55895793 0.391319261 0.503338086 -0.009197914
## Ineq 0.182010074 0.10051828 0.038575487 0.698767969 -0.007740862
## Prob -0.041683743 0.09090367 0.110714923 -0.053434539 0.039141990
## Time 0.229417628 0.12518170 0.003806471 -0.001782937 0.037068526
## Crime -0.061117981 -0.10415393 0.267554780 -0.328577896 0.037448778
#Finding the original coefficients
originalalpha <-alphas/sapply(crime1[,1:15], sd)
originalbeta0 <- beta0- sum(alphas*sapply(crime1[,1:15], mean)/sapply
(crime1[,1:15], sd))
originalalpha
## [,1]
## M 4.737871e+01
## Ed 1.118452e+01
## Po1 2.926894e+01
## Po2 2.927252e+01
## LF 1.168805e+03
## M.F 2.768812e+01
## Pop 8.280748e-01
## NW 7.456207e+00
## U1 -6.736417e+01
## U2 3.203151e+01
## Wealth 1.755549e-02
## Ineq 8.193341e+00
## Prob -1.170579e+03
## Time 1.899299e+00
## Crime 4.320767e-01
estimates <-as.matrix(crime1[,1:15])%*%originalalpha+originalbeta0
SSE =sum((estimates-crime1[,15])^2)
SSTot =sum((crime1[,15]- mean(crime1[,15]))^2)
R2 = 1-SSE/SSTot
R2_adjusted = R2-(1-R2)*4/(nrow(crime1)-4-1)
estimates
## [,1]
## [1,] 751.9095
## [2,] 1462.3937
## [3,] 498.5100
## [4,] 1930.1068
## [5,] 1123.7521
## [6,] 769.4856
## [7,] 820.6869
## [8,] 1342.9044
## [9,] 871.3165
## [10,] 824.9449
## [11,] 1482.0257
## [12,] 807.7953
## [13,] 548.4511
## [14,] 627.0400
## [15,] 716.6828
## [16,] 968.8631
## [17,] 452.6909
## [18,] 1008.3823
## [19,] 864.2353
## [20,] 1260.6638
## [21,] 740.6601
## [22,] 663.0696
## [23,] 965.9809
## [24,] 960.1356
## [25,] 519.8700
## [26,] 2050.5527
## [27,] 343.1815
## [28,] 1167.2270
## [29,] 1303.8200
## [30,] 776.0437
## [31,] 564.0336
## [32,] 840.4521
## [33,] 840.8981
## [34,] 862.1277
## [35,] 759.1462
## [36,] 1131.2265
## [37,] 1141.5403
## [38,] 579.3828
## [39,] 692.5910
## [40,] 1100.7443
## [41,] 839.9616
## [42,] 321.1900
## [43,] 986.2665
## [44,] 1107.8678
## [45,] 366.0195
## [46,] 735.9070
## [47,] 1046.2634
Question 10.1 Using the same crime data set uscrime.txt as in Questions 8.2 and 9.1, find the best model you can using (a) a regression tree model, and (b) a random forest model. In R, you can use the tree package or the rpart package, and the randomForest package. For each model, describe one or two qualitative takeaways you get from analyzing the results (i.e., don’t just stop when you have a good model, but interpret it too).
library(tree)
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
tree_model <-tree(Crime~., data = crime)
# plot the cool tree!
plot(tree_model)
text(tree_model, cex = 0.75)
# get all our data model insights
tree_model$frame
## var n dev yval splits.cutleft splits.cutright
## 1 Po1 47 6880927.66 905.0851 <7.65 >7.65
## 2 Pop 23 779243.48 669.6087 <22.5 >22.5
## 4 LF 12 243811.00 550.5000 <0.5675 >0.5675
## 8 <leaf> 7 48518.86 466.8571
## 9 <leaf> 5 77757.20 667.6000
## 5 <leaf> 11 179470.73 799.5455
## 3 NW 24 3604162.50 1130.7500 <7.65 >7.65
## 6 Pop 10 557574.90 886.9000 <21.5 >21.5
## 12 <leaf> 5 146390.80 1049.2000
## 13 <leaf> 5 147771.20 724.6000
## 7 Po1 14 2027224.93 1304.9286 <9.65 >9.65
## 14 <leaf> 6 170828.00 1041.0000
## 15 <leaf> 8 1124984.88 1502.8750
The frame above shows the tree and the nodel size, as well as the actual splits and cuts with deviances per note and leaf identifiers. Here an interpretation I have is that I see the leafs end at LF, Pop, and Po1 with prediction values, but I want to make sure we haven’t overfitted our model though, so let’s make sure the smallest leaf size is no less than 5% of our data. Based on the dataframe with the samllest leaf, 5 / 47 is approximately 10.5% of our data. This is a good first rule of thumb to make sure we haven’t over fit.
summary(tree_model)
##
## Regression tree:
## tree(formula = Crime ~ ., data = crime)
## Variables actually used in tree construction:
## [1] "Po1" "Pop" "LF" "NW"
## Number of terminal nodes: 7
## Residual mean deviance: 47390 = 1896000 / 40
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -573.900 -98.300 -1.545 0.000 110.600 490.100
Is there a way to get a better predictor by pruning the tree? We should go through this exercise so that we can understand where the optimal featuresset is for us to use in a regression model.
# prune option
optimal <-cv.tree(tree_model, , prune.tree)
# elbow plot of the optimal and lowest Mean Squared Error
plot(optimal$size, optimal$dev,type ="b")
# choose the 5
test_final <-prune.tree(tree_model, best =5)
plot(test_final)
text(test_final,cex = 0.75)
As you see in my code above, I chose the 5 leaf pruning, which was shown to be optimal based on the optimal models standard deviance. I then pruned the tree and the output below shows us now only 5 leafs. AGain here, I want to make sure I haven’t overfitted - but that I also have lowered the variance by choosing more optimal facotrs. I see the lowest leaf has only 6 - so that is definitely over 5%. This is an interesting insight I found, that the pruning didn’t necessarily bucket at much higher rates than the previous model. Now, from here, I would take those factors offered in the summarization below, and predict the point if I was given one like in HW 5 and 6. I will then use the R 2 this model outputs for me as the acc uracy. These are interpretations I will try below.
summary(test_final)
##
## Regression tree:
## snip.tree(tree = tree_model, nodes = c(4L, 6L))
## Variables actually used in tree construction:
## [1] "Po1" "Pop" "NW"
## Number of terminal nodes: 5
## Residual mean deviance: 54210 = 2277000 / 42
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -573.9 -107.5 15.5 0.0 122.8 490.1
With this, I want to predict the accuracy of this final pruned model! Let’s see how good it does
preds <-predict(test_final)
SS_resids <-sum((preds-crime$Crime)^2)
SS_total <-sum((crime$Crime-mean(crime$Crime))^2)
R_square <-1-(SS_resids/SS_total)
R_square
## [1] 0.6691333
66% of our variability in the pruned decision tree is explained by this model. For a pruned black box model, I think this is pretty solid. We could continue to tinker here if needed.
forest <-randomForest(Crime~., data =crime)
forest
##
## Call:
## randomForest(formula = Crime ~ ., data = crime)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 5
##
## Mean of squared residuals: 84952.37
## % Var explained: 41.97
plot(forest)
We see the MSE is minimum at something before 100 - but I can’t tell explicitly from this. Another interesting insight is the variance or R 2 explained by this model is 44%, which is interesting. I assumed with a random forest model the accuracy would be great - but this confirms the learnings Joel mentioned around how the random forest model is a black box and we can’t really understand what goes into creating this model, but I read online that it does a decently ok job at predicting questions. Next, I wanted to just explore what different mtry variables would output in terms of accuracy. I will try a few mtrys here and comment on it.
forest_9<-randomForest(Crime~.,data =crime,mtry =9)
forest_9
##
## Call:
## randomForest(formula = Crime ~ ., data = crime, mtry = 9)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 9
##
## Mean of squared residuals: 94148.46
## % Var explained: 35.69
plot(forest_9)
Changing to another mtry factor of 2 we see:
forest_2<-randomForest(Crime~.,data =crime, mytry = 2)
forest_2
##
## Call:
## randomForest(formula = Crime ~ ., data = crime, mytry = 2)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 5
##
## Mean of squared residuals: 88117.04
## % Var explained: 39.81
plot(forest_2)
Now we see that tinkering with Mtry as a variable for the randomforest produces different R 2 , and higher the mtry variables. Let’s go back to the default since we can explain it best with the default variables in the randomForest function. Let’s get the MSE thats the lowest in the randomForest of n=500. This can tell us the number of trees that give us the optimal prediction accuracy. # this finds the lowest minimum MSE based ont hat elbow plot # above. I needed to do this specifically since I can ’ t tell # in that plot which forest it was.
which.min(forest$mse)
## [1] 123
Interesting, so even though we did a bootstrap forest of 500 trees, we found the 25th tree with the lowest MSE giving us the highest prediction. Another interpretation I want to explore is the variable importance based on this model.
importance(forest,type =2)
## IncNodePurity
## M 214193.16
## So 21522.48
## Ed 200541.54
## Po1 1269942.54
## Po2 1123454.79
## LF 273571.09
## M.F 262675.09
## Pop 314825.46
## NW 539605.11
## U1 116104.38
## U2 179517.14
## Wealth 673170.41
## Ineq 231194.83
## Prob 787041.23
## Time 190165.55
varImpPlot(forest)
The final insight here is that the Po2, Po1, Prob, Wealth, and NW factors are the most important in the randomForest model here. Let’s try and predict the value and then also get R 2 for this model too.
preds_rf <-predict(forest)
SS_resids_rf <-sum((preds_rf-crime$Crime)^2)
SS_total_rf <-sum((crime$Crime- mean(crime$Crime))^2)
R_square_rf <-1-(SS_resids_rf/SS_total_rf)
R_square_rf
## [1] 0.419735
We see the Rˆ2 I calculated is also .42 - which validates my work in the model. Cool!
Question 10.2
Describe a situation or problem from your job, everyday life, current events, etc., for which a logistic regression model would be appropriate. List some (up to 5) predictors that you might use.
Disease stratification and classification by its associated comorbidities is an excellent example for using logistic regression analysis. The primary comorbidities associated with the outcome variable multiple sclerosis are targeted using logistic regression with the following variables as predictors: hyperlipidemia, hypertension, diabetes, chronic lung disease, depression; arthritis.
Question 10.3 1. Using the GermanCredit data set germancredit.txt from http://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german / (description at http://archive.ics.uci.edu/ml/datasets/Statlog+%28German+Credit+Data%29 ), use logistic regression to find a good predictive model for whether credit applicants are good credit risks or not. Show your model (factors used and their coefficients), the software output, and the quality of fit. You can use the glm function in R. To get a logistic regression (logit) model on data where the response is either zero or one, use family=binomial(link=”logit”) in your glm function call. 2. Because the model gives a result between 0 and 1, it requires setting a threshold probability to separate between “good” and “bad” answers. In this data set, they estimate that incorrectly identifying a bad customer as good, is 5 times worse than incorrectly classifying a good customer as bad. Determine a good threshold probability based on your model.
set.seed(1)
germancredit <- read.table("\\Users\\smparker\\WorkFolders\\Documents\\data 10.3\\germancredit.txt", header = FALSE)
head(germancredit)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18
## 1 A11 6 A34 A43 1169 A65 A75 4 A93 A101 4 A121 67 A143 A152 2 A173 1
## 2 A12 48 A32 A43 5951 A61 A73 2 A92 A101 2 A121 22 A143 A152 1 A173 1
## 3 A14 12 A34 A46 2096 A61 A74 2 A93 A101 3 A121 49 A143 A152 1 A172 2
## 4 A11 42 A32 A42 7882 A61 A74 2 A93 A103 4 A122 45 A143 A153 1 A173 2
## 5 A11 24 A33 A40 4870 A61 A73 3 A93 A101 4 A124 53 A143 A153 2 A173 2
## 6 A14 36 A32 A46 9055 A65 A73 2 A93 A101 4 A124 35 A143 A153 1 A172 2
## V19 V20 V21
## 1 A192 A201 1
## 2 A191 A201 2
## 3 A191 A201 1
## 4 A191 A201 1
## 5 A191 A201 2
## 6 A192 A201 1
Creating binary responses for logistic regression analysis.
germancredit$V21[germancredit$V21==1] <- 0
germancredit$V21[germancredit$V21==2] <- 1
Splitting 70/30 the data for training set(70%) and test set(30%).
# split our data
training <-sample(nrow(germancredit),0.7* nrow(germancredit), replace = FALSE)
# create the sets
training_data <-germancredit[training, ]
test_data <-germancredit[-training, ]
# Logistic regression model
model <-glm(V21~.,family =binomial(link ="logit"),data =training_data)
summary(model)
##
## Call:
## glm(formula = V21 ~ ., family = binomial(link = "logit"), data = training_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4438 -0.6861 -0.3608 0.6750 2.4540
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.823e-01 1.332e+00 0.287 0.774162
## V1A12 -5.201e-01 2.681e-01 -1.940 0.052408 .
## V1A13 -1.150e+00 4.473e-01 -2.570 0.010173 *
## V1A14 -1.675e+00 2.750e-01 -6.091 1.12e-09 ***
## V2 2.570e-02 1.159e-02 2.217 0.026647 *
## V3A31 8.440e-02 6.580e-01 0.128 0.897943
## V3A32 -8.078e-01 4.996e-01 -1.617 0.105907
## V3A33 -7.683e-01 5.372e-01 -1.430 0.152634
## V3A34 -1.446e+00 5.127e-01 -2.821 0.004784 **
## V4A41 -1.513e+00 4.479e-01 -3.379 0.000728 ***
## V4A410 -2.412e+00 1.160e+00 -2.080 0.037543 *
## V4A42 -5.496e-01 3.195e-01 -1.720 0.085354 .
## V4A43 -9.142e-01 3.024e-01 -3.023 0.002503 **
## V4A44 -4.163e-01 9.455e-01 -0.440 0.659751
## V4A45 -1.562e-01 6.742e-01 -0.232 0.816732
## V4A46 -2.569e-01 5.085e-01 -0.505 0.613382
## V4A48 -1.531e+01 4.556e+02 -0.034 0.973202
## V4A49 -5.397e-01 4.017e-01 -1.344 0.179086
## V5 1.076e-04 5.600e-05 1.922 0.054633 .
## V6A62 -3.474e-01 3.579e-01 -0.971 0.331777
## V6A63 -2.440e-01 4.761e-01 -0.513 0.608232
## V6A64 -1.379e+00 6.535e-01 -2.110 0.034823 *
## V6A65 -8.106e-01 3.223e-01 -2.515 0.011910 *
## V7A72 -1.814e-01 5.243e-01 -0.346 0.729300
## V7A73 -5.253e-01 5.001e-01 -1.050 0.293529
## V7A74 -1.129e+00 5.455e-01 -2.070 0.038431 *
## V7A75 -5.927e-01 5.052e-01 -1.173 0.240705
## V8 3.523e-01 1.094e-01 3.219 0.001284 **
## V9A92 4.849e-02 4.760e-01 0.102 0.918863
## V9A93 -4.446e-01 4.691e-01 -0.948 0.343279
## V9A94 -4.288e-01 5.837e-01 -0.735 0.462524
## V10A102 3.052e-01 5.338e-01 0.572 0.567472
## V10A103 -3.086e-01 5.237e-01 -0.589 0.555669
## V11 -1.080e-01 1.073e-01 -1.007 0.314147
## V12A122 2.219e-01 3.161e-01 0.702 0.482767
## V12A123 3.274e-01 2.922e-01 1.120 0.262504
## V12A124 1.156e+00 5.656e-01 2.044 0.040944 *
## V13 -2.257e-02 1.140e-02 -1.980 0.047667 *
## V14A142 -5.214e-01 4.925e-01 -1.059 0.289757
## V14A143 -7.780e-01 2.848e-01 -2.732 0.006299 **
## V15A152 -6.323e-01 2.870e-01 -2.203 0.027579 *
## V15A153 -6.674e-01 6.202e-01 -1.076 0.281931
## V16 2.866e-01 2.236e-01 1.282 0.199939
## V17A172 1.565e+00 8.891e-01 1.760 0.078442 .
## V17A173 1.564e+00 8.582e-01 1.823 0.068370 .
## V17A174 1.400e+00 8.772e-01 1.596 0.110563
## V18 1.645e-01 3.004e-01 0.548 0.583871
## V19A192 -3.319e-01 2.413e-01 -1.376 0.168942
## V20A202 -2.137e+00 8.573e-01 -2.493 0.012665 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 851.79 on 699 degrees of freedom
## Residual deviance: 613.21 on 651 degrees of freedom
## AIC: 711.21
##
## Number of Fisher Scoring iterations: 14
Utilizing ROC curve to calculate AUC values to determine fit of the model.
preds <-predict(model, test_data,type = "response")
# ROC curve
# AUC based on the predictive power.
roc_val <-roc(test_data$V21,round(preds))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
roc_val
##
## Call:
## roc.default(response = test_data$V21, predictor = round(preds))
##
## Data: round(preds) in 208 controls (test_data$V21 0) < 92 cases (test_data$V21 1).
## Area under the curve: 0.679
Determining the optimum threshold by testing out different threshold values.
# Using .4 as the low threshold
preds_thres_4<-as.integer(preds>0.4)
conf_matrix_4<-as.matrix(table(preds_thres_4, test_data$V21))
conf_matrix_4
##
## preds_thres_4 0 1
## 0 167 38
## 1 41 54
The model shows that (40*5 = 175) + 43, total would be 243 there.
# Using .6 as the threshold value
preds_thres_6<-as.integer(preds>0.6)
conf_matrix_6<-as.matrix(table(preds_thres_6, test_data$V21))
conf_matrix_6
##
## preds_thres_6 0 1
## 0 190 60
## 1 18 32
# this is for .9 as the high threshold.
preds_thres_9<-as.integer(preds>0.9)
conf_matrix_9<-as.matrix(table(preds_thres_9, test_data$V21))
conf_matrix_9
##
## preds_thres_9 0 1
## 0 207 87
## 1 1 5
The .6 threshold model has a cost of (15x5) = 75 + 90 = 165 The model with .9 threshold value has (3x5) + 89 = 104
To avoid missclassification of the data the threshold value should be at .9, therefore minimizing the mistake of labeling a bad customer as good.