Question 9.1 Using the same crime data set uscrime.txt as in Question 8.2, apply Principal Component Analysis and then create a regression model using the first few principal components. Specify your new model in terms of the original variables (not the principal components), and compare its quality to that of your solution to Question 8.2. You can use the R function prcomp for PCA. (Note that to first scale the data, you can include scale. = TRUE to scale as part of the PCA function. Don’t forget that, to make a prediction for the new city, you’ll need to unscale the coefficients (i.e., do the scaling calculation in reverse)!)

set.seed(1) 

crime <- read.table("\\Users\\smparker\\WorkFolders\\Documents\\data 7.2\\uscrime.txt", header = TRUE) 
head(crime) 
##      M So   Ed  Po1  Po2    LF   M.F Pop   NW    U1  U2 Wealth Ineq     Prob
## 1 15.1  1  9.1  5.8  5.6 0.510  95.0  33 30.1 0.108 4.1   3940 26.1 0.084602
## 2 14.3  0 11.3 10.3  9.5 0.583 101.2  13 10.2 0.096 3.6   5570 19.4 0.029599
## 3 14.2  1  8.9  4.5  4.4 0.533  96.9  18 21.9 0.094 3.3   3180 25.0 0.083401
## 4 13.6  0 12.1 14.9 14.1 0.577  99.4 157  8.0 0.102 3.9   6730 16.7 0.015801
## 5 14.1  0 12.1 10.9 10.1 0.591  98.5  18  3.0 0.091 2.0   5780 17.4 0.041399
## 6 12.1  0 11.0 11.8 11.5 0.547  96.4  25  4.4 0.084 2.9   6890 12.6 0.034201
##      Time Crime
## 1 26.2011   791
## 2 25.2999  1635
## 3 24.3006   578
## 4 29.9012  1969
## 5 21.2998  1234
## 6 20.9995   682
#Removing the binary variable So
crime1 <- crime[-2]
head (crime1)
##      M   Ed  Po1  Po2    LF   M.F Pop   NW    U1  U2 Wealth Ineq     Prob
## 1 15.1  9.1  5.8  5.6 0.510  95.0  33 30.1 0.108 4.1   3940 26.1 0.084602
## 2 14.3 11.3 10.3  9.5 0.583 101.2  13 10.2 0.096 3.6   5570 19.4 0.029599
## 3 14.2  8.9  4.5  4.4 0.533  96.9  18 21.9 0.094 3.3   3180 25.0 0.083401
## 4 13.6 12.1 14.9 14.1 0.577  99.4 157  8.0 0.102 3.9   6730 16.7 0.015801
## 5 14.1 12.1 10.9 10.1 0.591  98.5  18  3.0 0.091 2.0   5780 17.4 0.041399
## 6 12.1 11.0 11.8 11.5 0.547  96.4  25  4.4 0.084 2.9   6890 12.6 0.034201
##      Time Crime
## 1 26.2011   791
## 2 25.2999  1635
## 3 24.3006   578
## 4 29.9012  1969
## 5 21.2998  1234
## 6 20.9995   682
#PCA model based on the crime data
pca <- prcomp(crime1[,1:15],scale =TRUE)
#Summary and plotting of PCA crime data
summary(pca)
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6    PC7
## Standard deviation     2.3802 1.6756 1.4202 1.16749 1.03667 0.74864 0.5988
## Proportion of Variance 0.3777 0.1872 0.1345 0.09087 0.07165 0.03736 0.0239
## Cumulative Proportion  0.3777 0.5649 0.6993 0.79020 0.86185 0.89921 0.9231
##                            PC8     PC9    PC10    PC11    PC12    PC13   PC14
## Standard deviation     0.55069 0.48478 0.44375 0.42652 0.32674 0.26644 0.2324
## Proportion of Variance 0.02022 0.01567 0.01313 0.01213 0.00712 0.00473 0.0036
## Cumulative Proportion  0.94334 0.95900 0.97213 0.98426 0.99138 0.99611 0.9997
##                           PC15
## Standard deviation     0.06595
## Proportion of Variance 0.00029
## Cumulative Proportion  1.00000
plot(pca)

attributes(pca)
## $names
## [1] "sdev"     "rotation" "center"   "scale"    "x"       
## 
## $class
## [1] "prcomp"
var <- pca$sdev^2
propvar <-var/sum(var)
plot(propvar,xlab ="Principal Component",
     ylab ="Proportion of Variance Explained",
     ylim =c(0,1), type ="b")

Utilizing the above graph we can visualize how much of the variance is being explained by each of the principal components. The first component explains close to 40% of the variance, the second component explains 20% of the variance. Making use of the elbow diagram it is clearly visible that less and less of the variance is being explained beyond the 6th principle component.

# Performing PCA on the first 6 principal components 
attributes(pca$x)
## $dim
## [1] 47 15
## 
## $dimnames
## $dimnames[[1]]
## NULL
## 
## $dimnames[[2]]
##  [1] "PC1"  "PC2"  "PC3"  "PC4"  "PC5"  "PC6"  "PC7"  "PC8"  "PC9"  "PC10"
## [11] "PC11" "PC12" "PC13" "PC14" "PC15"
pca$x
##               PC1         PC2        PC3         PC4         PC5         PC6
##  [1,] -3.81396173  1.49480900 -1.0603159  0.52406615  0.84528748  0.22511769
##  [2,]  1.39378610 -0.30940473  0.2040623  2.01004712 -0.40017797 -0.65963339
##  [3,] -4.05081477  0.02927506 -0.3395327 -0.15086987  0.59155503  0.60971007
##  [4,]  4.59616332  2.56050623  0.3297528  1.84346257 -0.09151119  0.78941109
##  [5,]  1.76782290 -1.32278400  1.3740356  0.49113472  0.64210034 -0.49140529
##  [6,]  2.62914861 -0.26847655  0.3101237 -1.60938033  1.47875512 -0.71070444
##  [7,]  0.72645393 -0.32738471 -0.9599399 -0.38740571  0.82753989 -0.89946017
##  [8,]  0.86025439  1.39484364  0.6927791  1.04326051  1.05777574 -0.03706974
##  [9,] -3.17917314  1.08511871  1.3511201  0.52913646  0.54806163  0.15408027
## [10,]  0.60199208 -3.11479451  0.4673390  0.36299278 -0.58642324  0.47082637
## [11,]  3.04377542  2.73346118  1.5244970  0.18323539 -0.84280152  0.38091485
## [12,]  0.73319230  0.01732148  1.1055319 -0.93534947 -1.09714173  0.24719333
## [13,] -0.06538917 -1.06720637  1.7115343 -1.52051166 -1.30690240  1.04747677
## [14,] -0.10350086 -1.88098917  1.0116864 -0.74476842  0.16718590  0.50648502
## [15,] -3.04248733  0.22341805 -0.9321089  0.45593154  0.04048224  0.08214470
## [16,] -2.43899141  2.34887912 -1.7950680  0.78252902  0.48468438 -0.21379267
## [17,] -0.84695940 -1.50854412 -1.4237311 -0.93326334  0.79543267 -0.44159994
## [18,]  0.44967843  0.01794353 -0.4162988  0.38413857  3.72264356  0.19771947
## [19,]  2.78688483  1.02983223  0.7120392 -1.75632220  1.13565429 -0.95181565
## [20,]  2.43597420  1.54260143 -2.8466543  0.58558522 -0.34679165  0.73557815
## [21,]  0.77334017 -0.25055341  0.1887937 -1.01532413 -2.09549445  0.51157039
## [22,] -5.49386562  1.67693083  0.1432269  0.52331840  0.11002034  0.25813985
## [23,] -0.11311127  0.90006187  0.7489911 -0.11236256  0.01722450  0.20261185
## [24,]  1.03960328 -2.03182990 -2.8572832  0.99760296 -0.67660795  0.07573911
## [25,] -0.40823073 -2.67922871  1.8077718 -0.96987069  0.31232179  1.43239586
## [26,]  4.59398246 -1.15115199 -0.4537591  3.77479197  0.50370215 -0.22108628
## [27,]  0.22252243 -1.12965276  1.0415912 -2.40971203 -0.03163130 -1.18670888
## [28,]  0.02336005 -1.07769460  0.3208172  1.46899925 -0.71336389 -0.75664604
## [29,]  3.89430066  4.21532234  0.4266915 -1.25198228  0.91703286  0.80347287
## [30,] -3.81094539  0.80374078  1.4952845  0.42547959  0.67378663 -0.37880247
## [31,] -1.42459732 -1.64684422 -2.8653965  0.07109257 -1.28966751 -0.34826897
## [32,]  1.49499555  1.00654828 -0.7380728 -1.10566201 -0.47311960  1.36112745
## [33,] -1.42496929 -0.41360704  1.5205399  0.16147756  0.01490728 -0.23993431
## [34,]  1.71825488 -0.78610986 -0.8039133 -0.47288584  0.55874447 -0.51869082
## [35,]  1.41200277  2.75113922 -2.0297526 -1.81148045 -1.23507296  0.58048925
## [36,]  1.19331842  1.81970726  0.7712784 -0.22269753 -1.67372755 -2.48306586
## [37,] -3.72446655  0.48770357  2.5691107  1.93768250 -1.13200914  0.41515015
## [38,] -1.42767218 -2.72358805 -0.2711169  0.06885997 -0.39480798  0.98054881
## [39,] -2.95634262  1.23581592  0.2001766 -0.22270316 -0.11335822 -0.34318729
## [40,] -0.39403166  1.11441462  0.8659378  0.50802275 -0.49928311  0.71284028
## [41,]  1.11356353 -1.92797959  1.8341757 -0.41300988 -1.15273251 -0.95896014
## [42,] -1.34536528 -1.84855157 -1.6931182 -1.15528036  1.68481989 -0.38568562
## [43,] -1.97549443  0.65400400  1.4808205  0.68203789  0.05886983 -0.92578651
## [44,]  2.15628759 -0.61234215 -0.6295304  0.35273735 -0.93267343 -0.49690723
## [45,] -2.81341963  1.12379006 -3.4751405 -1.09138875 -1.30975799 -0.51150287
## [46,]  1.27666760 -1.34845267  0.8563272 -0.95154427  1.16939619  0.76049822
## [47,]  1.91646392 -2.84001774 -1.4753031  1.07615212  0.03707310  0.61947271
##               PC7          PC8          PC9        PC10         PC11
##  [1,]  0.23293604  0.188420565  0.588477019 -0.31107661  0.029512698
##  [2,] -0.48862427  0.313999697  0.351041158 -0.08903699 -0.090989502
##  [3,] -0.32443483  0.377639219  0.008787560  0.12222732  0.406783557
##  [4,] -0.25062337 -1.560852996  0.493019186 -0.05425163 -0.042720507
##  [5,] -0.27365756 -0.424185396  0.434133137 -0.34524583  0.209510380
##  [6,]  0.04362165  0.524236678 -0.797930672 -0.41254229  0.032972902
##  [7,] -0.21607703  0.713231310  0.307583742 -0.54951439  0.344401514
##  [8,] -1.13106739  0.540977480  0.411362492 -0.27127601  0.524934080
##  [9,]  0.40660095  0.039646985 -0.009798776 -0.17413874 -0.173779568
## [10,]  0.26866889 -0.423485945 -0.225820343 -0.30766025  0.074751069
## [11,] -0.15335190  0.643677975  0.187379172  0.65688192  0.084449047
## [12,]  0.30293369  0.330504907  0.194076881  0.12664704 -0.276916370
## [13,]  0.23057199  0.734714535  0.413082158  0.58841859  0.265790591
## [14,] -0.20217705 -0.005780294  0.496351170 -0.10831130 -0.081162070
## [15,] -1.04460920 -0.281933252 -0.298096355  0.98266596 -0.863742052
## [16,] -0.24230801  0.697607821 -0.190265369 -0.90635771  0.261019877
## [17,]  0.38388490 -0.691499477  0.584784712  0.07149469 -0.380263687
## [18,]  1.30155715  0.316008444  0.026815303  1.06517848 -0.213599385
## [19,] -0.23944874  0.029233855 -0.250618440 -0.75616077 -0.386930287
## [20,]  0.26520763  0.342140816  0.201951433 -0.33387171 -0.786843818
## [21,]  0.11211127  0.673164807  0.067626227  0.29284745  0.283835504
## [22,]  1.34753245  0.504529671  0.266651832 -0.19143220  1.141649407
## [23,] -1.55828555  0.576944979 -0.128224677 -0.13342013 -0.279675573
## [24,]  0.02474439 -0.247628513  0.115335431 -0.52684580  0.385775578
## [25,] -0.04215373  0.461035307  0.399760892  0.23352583 -0.093000965
## [26,]  0.35801996  1.001347760 -0.414337204  0.69245141  0.022591769
## [27,]  0.32119066 -0.060375606 -0.282299894 -0.07277882  0.261845858
## [28,] -0.13879854 -0.571443559  0.201184197  0.04584585  0.310299275
## [29,]  0.16679993 -0.855417031 -0.729350584  0.17719944  0.673990540
## [30,]  0.38597436 -0.868484936 -0.327493489  0.29863183 -0.049265368
## [31,] -0.05195320 -0.639459954 -1.452645328  0.24929516  0.421715622
## [32,]  0.67620197 -0.035577666  0.291494485 -0.13335344 -0.523014001
## [33,] -0.94180066 -0.183629298  0.449351618  0.16087168  0.073036234
## [34,] -0.31712044  0.123970314  0.261160626 -0.12877613  0.458255646
## [35,]  0.51608263 -0.267146763 -0.037536790 -0.32528584 -0.067570727
## [36,]  0.25943780  0.565391928 -0.117459856  0.61282562 -0.545667507
## [37,]  0.89792600  0.205082148 -0.754258100 -1.06567212 -1.215413351
## [38,] -0.85778773 -0.088561940 -0.563547410  0.01276699  0.425073513
## [39,] -1.03976047 -0.033264249 -0.160012992 -0.04715679 -0.174601629
## [40,] -0.56925183 -1.034631620  0.192452912  0.09353996  0.152545486
## [41,]  0.66916197 -0.515870807  0.473619410 -0.32827345  0.006667121
## [42,] -0.13822244 -0.620611580  0.851202267  0.06288717 -0.743999204
## [43,]  0.56818268 -0.782851679 -0.370274134  0.48153133  0.355397538
## [44,]  0.62656095 -0.163025743  0.313821216 -0.06473876  0.297433585
## [45,] -0.26364985  0.161261916  0.198102836  0.73674355 -0.133519659
## [46,] -0.38228168  0.073597678 -1.406020512 -0.04188695 -0.215673297
## [47,]  0.50153557  0.217351511 -0.264618150 -0.08541262 -0.165889864
##               PC12        PC13         PC14         PC15
##  [1,]  0.084982446  0.21359689  0.205136740  0.027793386
##  [2,] -0.027781324 -0.03962376 -0.050788856 -0.036687438
##  [3,]  0.202419253 -0.05290100 -0.526988040  0.062127288
##  [4,] -0.112252436  0.03025186  0.001588820  0.045230815
##  [5,]  0.673294686  0.57593617  0.183222166 -0.066037038
##  [6,] -0.318981131  0.27533142  0.118375689  0.005226695
##  [7,] -1.072944542 -0.02834628  0.085683627 -0.064827799
##  [8,]  0.222045863 -0.50183261 -0.396849336 -0.012742192
##  [9,] -0.041344390  0.36868247 -0.116758889  0.039130985
## [10,]  0.173488890  0.06774730 -0.086077852  0.015403181
## [11,] -0.548736305  0.34683138 -0.355015395  0.146307595
## [12,] -0.379768933  0.06918388 -0.131926806  0.031875635
## [13,]  0.405927180 -0.15002671  0.299151219 -0.036121926
## [14,] -0.364211818 -0.42375473  0.147274385  0.005637575
## [15,] -0.010384699 -0.41800511  0.244898684  0.030739089
## [16,] -0.001926467  0.11404701  0.118251932 -0.024671233
## [17,]  0.516179788  0.09946446 -0.290476614  0.008300765
## [18,] -0.057488575  0.47346785  0.091469396  0.013974766
## [19,]  0.480809100 -0.59981420  0.337238616  0.058113020
## [20,] -0.119484477 -0.12717681 -0.003573220 -0.008947907
## [21,]  0.147446682  0.30431578  0.184821076 -0.011493109
## [22,]  0.284135914 -0.30411822  0.016739027 -0.015532938
## [23,]  0.001059369  0.48179717  0.047138994  0.044302898
## [24,]  0.081064620  0.23829521 -0.144057955 -0.024548772
## [25,]  0.217004897 -0.21377094 -0.024863594 -0.055152332
## [26,]  0.181705150 -0.18505106 -0.118918199 -0.110168683
## [27,] -0.170964647  0.04010342 -0.260107576  0.080949179
## [28,]  0.067810231  0.33839284  0.413175030 -0.032032147
## [29,]  0.617257839 -0.05704383  0.061842414  0.013326574
## [30,] -0.367932861 -0.17141830 -0.199733122 -0.025885508
## [31,] -0.038108658  0.12042173 -0.282202515  0.073330432
## [32,] -0.423350198  0.01136968 -0.043868719 -0.078929490
## [33,] -0.024369864  0.16084512  0.090524456  0.113265174
## [34,]  0.029824141 -0.17974799  0.176464680  0.075243661
## [35,] -0.004661230 -0.07244008 -0.272029765 -0.103470686
## [36,]  0.543398035 -0.13423251 -0.377396628 -0.052744461
## [37,]  0.241288352  0.07854077  0.097063461  0.063424731
## [38,] -0.033154088  0.00341939 -0.174832328 -0.029088087
## [39,] -0.035560215 -0.07701236 -0.032644704 -0.099503547
## [40,] -0.247619848 -0.04575911  0.009142559 -0.071839901
## [41,] -0.223090029  0.10160949 -0.157313841 -0.085179749
## [42,]  0.085211989 -0.02293847 -0.313491456 -0.011436895
## [43,] -0.625489926 -0.38565832  0.333733688 -0.054826924
## [44,] -0.024778749 -0.15001567  0.316854952  0.141945758
## [45,]  0.108743701  0.18767185  0.543253672 -0.014875634
## [46,] -0.034765353  0.13091002  0.162001323 -0.109273929
## [47,] -0.055947364 -0.49154508  0.074868804  0.140369122
PCs <-pca$x[,1:6]
summary(PCs)
##       PC1               PC2                PC3               PC4          
##  Min.   :-5.4939   Min.   :-3.11479   Min.   :-3.4751   Min.   :-2.40971  
##  1st Qu.:-1.4263   1st Qu.:-1.23697   1st Qu.:-0.8680   1st Qu.:-0.94345  
##  Median : 0.4497   Median : 0.01732   Median : 0.3101   Median : 0.07109  
##  Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.00000  
##  3rd Qu.: 1.4535   3rd Qu.: 1.11910   3rd Qu.: 1.0266   3rd Qu.: 0.52660  
##  Max.   : 4.5962   Max.   : 4.21532   Max.   : 2.5691   Max.   : 3.77479  
##       PC5                PC6          
##  Min.   :-2.09549   Min.   :-2.48307  
##  1st Qu.:-0.69499   1st Qu.:-0.49416  
##  Median : 0.01723   Median : 0.08214  
##  Mean   : 0.00000   Mean   : 0.00000  
##  3rd Qu.: 0.61683   3rd Qu.: 0.54603  
##  Max.   : 3.72264   Max.   : 1.43240
pcaCrime <-cbind(PCs, crime1[,15])
model1 <-lm(V7~.,data =as.data.frame(pcaCrime))
summary(model1)
## 
## Call:
## lm(formula = V7 ~ ., data = as.data.frame(pcaCrime))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -289.78  -86.76   14.92   81.89  260.15 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  905.085     20.300  44.584  < 2e-16 ***
## PC1           89.117      8.621  10.337 7.36e-13 ***
## PC2           75.018     12.247   6.126 3.15e-07 ***
## PC3           38.075     14.449   2.635   0.0119 *  
## PC4          222.730     17.576  12.672 1.38e-15 ***
## PC5           -2.104     19.794  -0.106   0.9159    
## PC6          -50.000     27.410  -1.824   0.0756 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 139.2 on 40 degrees of freedom
## Multiple R-squared:  0.8874, Adjusted R-squared:  0.8705 
## F-statistic: 52.54 on 6 and 40 DF,  p-value: < 2.2e-16
model1$coefficients
## (Intercept)         PC1         PC2         PC3         PC4         PC5 
##  905.085106   89.116825   75.017789   38.075301  222.730156   -2.103705 
##         PC6 
##  -49.999757
beta0<-model1$coefficient[1]
betas<-model1$coefficients[2:6]
alphas <-pca$rotation[,1:5]%*%betas
summary(pca$rotation)
##       PC1                PC2                PC3                PC4          
##  Min.   :-0.36011   Min.   :-0.39080   Min.   :-0.65264   Min.   :-0.11402  
##  1st Qu.:-0.12937   1st Qu.:-0.07483   1st Qu.:-0.05689   1st Qu.: 0.02984  
##  Median : 0.10329   Median : 0.11374   Median : 0.04387   Median : 0.14992  
##  Mean   : 0.06446   Mean   : 0.06744   Mean   :-0.02514   Mean   : 0.16741  
##  3rd Qu.: 0.28190   3rd Qu.: 0.24133   3rd Qu.: 0.09821   3rd Qu.: 0.29079  
##  Max.   : 0.39717   Max.   : 0.44174   Max.   : 0.29184   Max.   : 0.57588  
##       PC5                PC6                PC7                PC8          
##  Min.   :-0.54945   Min.   :-0.40227   Min.   :-0.40742   Min.   :-0.63588  
##  1st Qu.:-0.14879   1st Qu.:-0.08372   1st Qu.: 0.04153   1st Qu.:-0.16037  
##  Median :-0.01729   Median : 0.02759   Median : 0.14541   Median : 0.03786  
##  Mean   :-0.01062   Mean   : 0.06866   Mean   : 0.11058   Mean   :-0.02608  
##  3rd Qu.: 0.13323   3rd Qu.: 0.18123   3rd Qu.: 0.24765   3rd Qu.: 0.13961  
##  Max.   : 0.56740   Max.   : 0.57132   Max.   : 0.43230   Max.   : 0.26560  
##       PC9                PC10               PC11                PC12         
##  Min.   :-0.37573   Min.   :-0.53528   Min.   :-0.454521   Min.   :-0.55896  
##  1st Qu.:-0.04830   1st Qu.:-0.10750   1st Qu.:-0.131482   1st Qu.:-0.14340  
##  Median : 0.00207   Median : 0.04780   Median : 0.042798   Median : 0.09090  
##  Mean   : 0.04750   Mean   : 0.03464   Mean   :-0.003106   Mean   : 0.01597  
##  3rd Qu.: 0.12212   3rd Qu.: 0.13496   3rd Qu.: 0.205714   3rd Qu.: 0.16999  
##  Max.   : 0.63478   Max.   : 0.51040   Max.   : 0.336892   Max.   : 0.37835  
##       PC13                PC14               PC15          
##  Min.   :-0.512378   Min.   :-0.32858   Min.   :-0.703237  
##  1st Qu.:-0.074839   1st Qu.:-0.04619   1st Qu.:-0.008469  
##  Median :-0.018941   Median : 0.01395   Median : 0.003084  
##  Mean   : 0.002719   Mean   : 0.08847   Mean   : 0.004122  
##  3rd Qu.: 0.112342   3rd Qu.: 0.15597   3rd Qu.: 0.034109  
##  Max.   : 0.509149   Max.   : 0.69877   Max.   : 0.704390
attributes
## function (x)  .Primitive("attributes")
(pca$rotation)
##                 PC1         PC2          PC3          PC4          PC5
## M      -0.299621651  0.02234471  0.206092406  0.343665047 -0.084281050
## Ed      0.333376623 -0.27425273  0.067599563  0.003731567  0.013554430
## Po1     0.355852775  0.21582732  0.042893927  0.170542009  0.255315925
## Po2     0.357078563  0.20737015  0.043865362  0.149923778  0.280968581
## LF      0.158964999 -0.32325373  0.291835477  0.204862901 -0.273794459
## M.F     0.103289228 -0.39080214 -0.159875463  0.482075361 -0.197981811
## Pop     0.152940574  0.44173764  0.050441232 -0.077218554 -0.017287070
## NW     -0.260560205  0.29417874  0.097968609  0.334501224  0.195245659
## U1      0.028997051 -0.02265472 -0.652643201  0.100077953 -0.218707462
## U2      0.040646753  0.26683942 -0.580664129  0.113671290 -0.096486344
## Wealth  0.397173344 -0.00313705 -0.008791153 -0.079627704  0.071217657
## Ineq   -0.360109089  0.11373640  0.026429360  0.247078877 -0.099589074
## Prob   -0.273342185 -0.12700739 -0.104998129  0.055956699  0.567399568
## Time    0.001821896  0.39673014  0.204244603 -0.114024112 -0.549453494
## Crime   0.230417323  0.19396335  0.098446155  0.575883241 -0.005439264
##                 PC6         PC7         PC8          PC9         PC10
## M      -0.402274660  0.27772065 -0.50333387  0.002069821 -0.074774855
## Ed      0.007998959  0.14540962 -0.16346730  0.634782269 -0.228257273
## Po1    -0.045927973  0.06107532  0.05161909 -0.256986250  0.077177550
## Po2    -0.072344908  0.05300705  0.03484300 -0.258961189 -0.009689514
## LF      0.571320554  0.17399374  0.25444573 -0.055983429 -0.140224742
## M.F     0.027585518  0.12044587 -0.15726847 -0.375725667  0.324917640
## Pop     0.564718325 -0.01633697 -0.63588300 -0.016049945  0.024803140
## NW      0.110048809  0.37455132  0.26560317 -0.040614583 -0.535284706
## U1      0.041380534  0.18737480 -0.18421810  0.075030767 -0.239052102
## U2      0.062042387  0.03005833  0.25897286  0.096573399  0.047797829
## Wealth -0.115901092  0.21757237  0.03786171 -0.009708249  0.093353713
## Ineq    0.252410371 -0.40742320  0.05312011  0.040343235  0.136054290
## Prob    0.253222193  0.37429233  0.01725547  0.285289952  0.510396571
## Time   -0.095091758  0.43230010  0.19940004  0.147667548  0.398495704
## Crime  -0.129277610 -0.36541536  0.07981248  0.444813855  0.133858146
##                PC11        PC12         PC13         PC14         PC15
## M      -0.444227084  0.13289605 -0.058446475  0.160327787 -0.005502974
## Ed      0.273687069  0.08698933 -0.395781017  0.254142233 -0.046862227
## Po1     0.042797813  0.37835114 -0.029006019  0.099045778 -0.703236535
## Po2     0.072538736  0.34205350 -0.091230582  0.151608097  0.704389818
## LF     -0.408599693  0.20708045  0.113969028 -0.010999074  0.031149791
## M.F     0.336892298 -0.28069995 -0.232486798 -0.079617481  0.005539075
## Pop    -0.005757335 -0.18265026 -0.056027938 -0.108326331  0.003084137
## NW      0.278215123 -0.30705853 -0.018940960 -0.038953153 -0.035623452
## U1      0.155604974  0.31204948  0.509148734  0.013945723  0.017493978
## U2     -0.454521310 -0.10298416 -0.512377770  0.067509490 -0.005322806
## Wealth -0.201845606 -0.55895793  0.391319261  0.503338086 -0.009197914
## Ineq    0.182010074  0.10051828  0.038575487  0.698767969 -0.007740862
## Prob   -0.041683743  0.09090367  0.110714923 -0.053434539  0.039141990
## Time    0.229417628  0.12518170  0.003806471 -0.001782937  0.037068526
## Crime  -0.061117981 -0.10415393  0.267554780 -0.328577896  0.037448778
pca$rotation
##                 PC1         PC2          PC3          PC4          PC5
## M      -0.299621651  0.02234471  0.206092406  0.343665047 -0.084281050
## Ed      0.333376623 -0.27425273  0.067599563  0.003731567  0.013554430
## Po1     0.355852775  0.21582732  0.042893927  0.170542009  0.255315925
## Po2     0.357078563  0.20737015  0.043865362  0.149923778  0.280968581
## LF      0.158964999 -0.32325373  0.291835477  0.204862901 -0.273794459
## M.F     0.103289228 -0.39080214 -0.159875463  0.482075361 -0.197981811
## Pop     0.152940574  0.44173764  0.050441232 -0.077218554 -0.017287070
## NW     -0.260560205  0.29417874  0.097968609  0.334501224  0.195245659
## U1      0.028997051 -0.02265472 -0.652643201  0.100077953 -0.218707462
## U2      0.040646753  0.26683942 -0.580664129  0.113671290 -0.096486344
## Wealth  0.397173344 -0.00313705 -0.008791153 -0.079627704  0.071217657
## Ineq   -0.360109089  0.11373640  0.026429360  0.247078877 -0.099589074
## Prob   -0.273342185 -0.12700739 -0.104998129  0.055956699  0.567399568
## Time    0.001821896  0.39673014  0.204244603 -0.114024112 -0.549453494
## Crime   0.230417323  0.19396335  0.098446155  0.575883241 -0.005439264
##                 PC6         PC7         PC8          PC9         PC10
## M      -0.402274660  0.27772065 -0.50333387  0.002069821 -0.074774855
## Ed      0.007998959  0.14540962 -0.16346730  0.634782269 -0.228257273
## Po1    -0.045927973  0.06107532  0.05161909 -0.256986250  0.077177550
## Po2    -0.072344908  0.05300705  0.03484300 -0.258961189 -0.009689514
## LF      0.571320554  0.17399374  0.25444573 -0.055983429 -0.140224742
## M.F     0.027585518  0.12044587 -0.15726847 -0.375725667  0.324917640
## Pop     0.564718325 -0.01633697 -0.63588300 -0.016049945  0.024803140
## NW      0.110048809  0.37455132  0.26560317 -0.040614583 -0.535284706
## U1      0.041380534  0.18737480 -0.18421810  0.075030767 -0.239052102
## U2      0.062042387  0.03005833  0.25897286  0.096573399  0.047797829
## Wealth -0.115901092  0.21757237  0.03786171 -0.009708249  0.093353713
## Ineq    0.252410371 -0.40742320  0.05312011  0.040343235  0.136054290
## Prob    0.253222193  0.37429233  0.01725547  0.285289952  0.510396571
## Time   -0.095091758  0.43230010  0.19940004  0.147667548  0.398495704
## Crime  -0.129277610 -0.36541536  0.07981248  0.444813855  0.133858146
##                PC11        PC12         PC13         PC14         PC15
## M      -0.444227084  0.13289605 -0.058446475  0.160327787 -0.005502974
## Ed      0.273687069  0.08698933 -0.395781017  0.254142233 -0.046862227
## Po1     0.042797813  0.37835114 -0.029006019  0.099045778 -0.703236535
## Po2     0.072538736  0.34205350 -0.091230582  0.151608097  0.704389818
## LF     -0.408599693  0.20708045  0.113969028 -0.010999074  0.031149791
## M.F     0.336892298 -0.28069995 -0.232486798 -0.079617481  0.005539075
## Pop    -0.005757335 -0.18265026 -0.056027938 -0.108326331  0.003084137
## NW      0.278215123 -0.30705853 -0.018940960 -0.038953153 -0.035623452
## U1      0.155604974  0.31204948  0.509148734  0.013945723  0.017493978
## U2     -0.454521310 -0.10298416 -0.512377770  0.067509490 -0.005322806
## Wealth -0.201845606 -0.55895793  0.391319261  0.503338086 -0.009197914
## Ineq    0.182010074  0.10051828  0.038575487  0.698767969 -0.007740862
## Prob   -0.041683743  0.09090367  0.110714923 -0.053434539  0.039141990
## Time    0.229417628  0.12518170  0.003806471 -0.001782937  0.037068526
## Crime  -0.061117981 -0.10415393  0.267554780 -0.328577896  0.037448778
#Finding the original coefficients
originalalpha <-alphas/sapply(crime1[,1:15], sd)
originalbeta0 <- beta0- sum(alphas*sapply(crime1[,1:15], mean)/sapply
(crime1[,1:15], sd))
originalalpha
##                 [,1]
## M       4.737871e+01
## Ed      1.118452e+01
## Po1     2.926894e+01
## Po2     2.927252e+01
## LF      1.168805e+03
## M.F     2.768812e+01
## Pop     8.280748e-01
## NW      7.456207e+00
## U1     -6.736417e+01
## U2      3.203151e+01
## Wealth  1.755549e-02
## Ineq    8.193341e+00
## Prob   -1.170579e+03
## Time    1.899299e+00
## Crime   4.320767e-01
estimates <-as.matrix(crime1[,1:15])%*%originalalpha+originalbeta0
SSE =sum((estimates-crime1[,15])^2)
SSTot =sum((crime1[,15]- mean(crime1[,15]))^2)
R2 = 1-SSE/SSTot
R2_adjusted = R2-(1-R2)*4/(nrow(crime1)-4-1)
estimates
##            [,1]
##  [1,]  751.9095
##  [2,] 1462.3937
##  [3,]  498.5100
##  [4,] 1930.1068
##  [5,] 1123.7521
##  [6,]  769.4856
##  [7,]  820.6869
##  [8,] 1342.9044
##  [9,]  871.3165
## [10,]  824.9449
## [11,] 1482.0257
## [12,]  807.7953
## [13,]  548.4511
## [14,]  627.0400
## [15,]  716.6828
## [16,]  968.8631
## [17,]  452.6909
## [18,] 1008.3823
## [19,]  864.2353
## [20,] 1260.6638
## [21,]  740.6601
## [22,]  663.0696
## [23,]  965.9809
## [24,]  960.1356
## [25,]  519.8700
## [26,] 2050.5527
## [27,]  343.1815
## [28,] 1167.2270
## [29,] 1303.8200
## [30,]  776.0437
## [31,]  564.0336
## [32,]  840.4521
## [33,]  840.8981
## [34,]  862.1277
## [35,]  759.1462
## [36,] 1131.2265
## [37,] 1141.5403
## [38,]  579.3828
## [39,]  692.5910
## [40,] 1100.7443
## [41,]  839.9616
## [42,]  321.1900
## [43,]  986.2665
## [44,] 1107.8678
## [45,]  366.0195
## [46,]  735.9070
## [47,] 1046.2634

Question 10.1 Using the same crime data set uscrime.txt as in Questions 8.2 and 9.1, find the best model you can using (a) a regression tree model, and (b) a random forest model. In R, you can use the tree package or the rpart package, and the randomForest package. For each model, describe one or two qualitative takeaways you get from analyzing the results (i.e., don’t just stop when you have a good model, but interpret it too).

library(tree)
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
tree_model <-tree(Crime~., data = crime)
# plot the cool tree!
plot(tree_model)
text(tree_model, cex = 0.75)

# get all our data model insights
tree_model$frame
##       var  n        dev      yval splits.cutleft splits.cutright
## 1     Po1 47 6880927.66  905.0851          <7.65           >7.65
## 2     Pop 23  779243.48  669.6087          <22.5           >22.5
## 4      LF 12  243811.00  550.5000        <0.5675         >0.5675
## 8  <leaf>  7   48518.86  466.8571                               
## 9  <leaf>  5   77757.20  667.6000                               
## 5  <leaf> 11  179470.73  799.5455                               
## 3      NW 24 3604162.50 1130.7500          <7.65           >7.65
## 6     Pop 10  557574.90  886.9000          <21.5           >21.5
## 12 <leaf>  5  146390.80 1049.2000                               
## 13 <leaf>  5  147771.20  724.6000                               
## 7     Po1 14 2027224.93 1304.9286          <9.65           >9.65
## 14 <leaf>  6  170828.00 1041.0000                               
## 15 <leaf>  8 1124984.88 1502.8750

The frame above shows the tree and the nodel size, as well as the actual splits and cuts with deviances per note and leaf identifiers. Here an interpretation I have is that I see the leafs end at LF, Pop, and Po1 with prediction values, but I want to make sure we haven’t overfitted our model though, so let’s make sure the smallest leaf size is no less than 5% of our data. Based on the dataframe with the samllest leaf, 5 / 47 is approximately 10.5% of our data. This is a good first rule of thumb to make sure we haven’t over fit.

summary(tree_model)
## 
## Regression tree:
## tree(formula = Crime ~ ., data = crime)
## Variables actually used in tree construction:
## [1] "Po1" "Pop" "LF"  "NW" 
## Number of terminal nodes:  7 
## Residual mean deviance:  47390 = 1896000 / 40 
## Distribution of residuals:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -573.900  -98.300   -1.545    0.000  110.600  490.100

Is there a way to get a better predictor by pruning the tree? We should go through this exercise so that we can understand where the optimal featuresset is for us to use in a regression model.

# prune option
optimal <-cv.tree(tree_model, , prune.tree)
# elbow plot of the optimal and lowest Mean Squared Error
plot(optimal$size, optimal$dev,type ="b")

# choose the 5
test_final <-prune.tree(tree_model, best =5)
plot(test_final)
text(test_final,cex = 0.75)

As you see in my code above, I chose the 5 leaf pruning, which was shown to be optimal based on the optimal models standard deviance. I then pruned the tree and the output below shows us now only 5 leafs. AGain here, I want to make sure I haven’t overfitted - but that I also have lowered the variance by choosing more optimal facotrs. I see the lowest leaf has only 6 - so that is definitely over 5%. This is an interesting insight I found, that the pruning didn’t necessarily bucket at much higher rates than the previous model. Now, from here, I would take those factors offered in the summarization below, and predict the point if I was given one like in HW 5 and 6. I will then use the R 2 this model outputs for me as the acc uracy. These are interpretations I will try below.

summary(test_final)
## 
## Regression tree:
## snip.tree(tree = tree_model, nodes = c(4L, 6L))
## Variables actually used in tree construction:
## [1] "Po1" "Pop" "NW" 
## Number of terminal nodes:  5 
## Residual mean deviance:  54210 = 2277000 / 42 
## Distribution of residuals:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -573.9  -107.5    15.5     0.0   122.8   490.1

With this, I want to predict the accuracy of this final pruned model! Let’s see how good it does

preds <-predict(test_final)
SS_resids <-sum((preds-crime$Crime)^2)
SS_total <-sum((crime$Crime-mean(crime$Crime))^2)
R_square <-1-(SS_resids/SS_total)
R_square
## [1] 0.6691333

66% of our variability in the pruned decision tree is explained by this model. For a pruned black box model, I think this is pretty solid. We could continue to tinker here if needed.

forest <-randomForest(Crime~., data =crime)
forest
## 
## Call:
##  randomForest(formula = Crime ~ ., data = crime) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 5
## 
##           Mean of squared residuals: 84952.37
##                     % Var explained: 41.97
plot(forest)

We see the MSE is minimum at something before 100 - but I can’t tell explicitly from this. Another interesting insight is the variance or R 2 explained by this model is 44%, which is interesting. I assumed with a random forest model the accuracy would be great - but this confirms the learnings Joel mentioned around how the random forest model is a black box and we can’t really understand what goes into creating this model, but I read online that it does a decently ok job at predicting questions. Next, I wanted to just explore what different mtry variables would output in terms of accuracy. I will try a few mtrys here and comment on it.

forest_9<-randomForest(Crime~.,data =crime,mtry =9)
forest_9
## 
## Call:
##  randomForest(formula = Crime ~ ., data = crime, mtry = 9) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 9
## 
##           Mean of squared residuals: 94148.46
##                     % Var explained: 35.69
plot(forest_9)

Changing to another mtry factor of 2 we see:

forest_2<-randomForest(Crime~.,data =crime, mytry = 2)
forest_2
## 
## Call:
##  randomForest(formula = Crime ~ ., data = crime, mytry = 2) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 5
## 
##           Mean of squared residuals: 88117.04
##                     % Var explained: 39.81
plot(forest_2)

Now we see that tinkering with Mtry as a variable for the randomforest produces different R 2 , and higher the mtry variables. Let’s go back to the default since we can explain it best with the default variables in the randomForest function. Let’s get the MSE thats the lowest in the randomForest of n=500. This can tell us the number of trees that give us the optimal prediction accuracy. # this finds the lowest minimum MSE based ont hat elbow plot # above. I needed to do this specifically since I can ’ t tell # in that plot which forest it was.

which.min(forest$mse)
## [1] 123

Interesting, so even though we did a bootstrap forest of 500 trees, we found the 25th tree with the lowest MSE giving us the highest prediction. Another interpretation I want to explore is the variable importance based on this model.

importance(forest,type =2)
##        IncNodePurity
## M          214193.16
## So          21522.48
## Ed         200541.54
## Po1       1269942.54
## Po2       1123454.79
## LF         273571.09
## M.F        262675.09
## Pop        314825.46
## NW         539605.11
## U1         116104.38
## U2         179517.14
## Wealth     673170.41
## Ineq       231194.83
## Prob       787041.23
## Time       190165.55
varImpPlot(forest)

The final insight here is that the Po2, Po1, Prob, Wealth, and NW factors are the most important in the randomForest model here. Let’s try and predict the value and then also get R 2 for this model too.

preds_rf <-predict(forest)
SS_resids_rf <-sum((preds_rf-crime$Crime)^2)
SS_total_rf <-sum((crime$Crime- mean(crime$Crime))^2)
R_square_rf <-1-(SS_resids_rf/SS_total_rf)
R_square_rf
## [1] 0.419735

We see the Rˆ2 I calculated is also .42 - which validates my work in the model. Cool!

Question 10.2

Describe a situation or problem from your job, everyday life, current events, etc., for which a logistic regression model would be appropriate. List some (up to 5) predictors that you might use.

Disease stratification and classification by its associated comorbidities is an excellent example for using logistic regression analysis. The primary comorbidities associated with the outcome variable multiple sclerosis are targeted using logistic regression with the following variables as predictors: hyperlipidemia, hypertension, diabetes, chronic lung disease, depression; arthritis.

Question 10.3 1. Using the GermanCredit data set germancredit.txt from http://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german / (description at http://archive.ics.uci.edu/ml/datasets/Statlog+%28German+Credit+Data%29 ), use logistic regression to find a good predictive model for whether credit applicants are good credit risks or not. Show your model (factors used and their coefficients), the software output, and the quality of fit. You can use the glm function in R. To get a logistic regression (logit) model on data where the response is either zero or one, use family=binomial(link=”logit”) in your glm function call. 2. Because the model gives a result between 0 and 1, it requires setting a threshold probability to separate between “good” and “bad” answers. In this data set, they estimate that incorrectly identifying a bad customer as good, is 5 times worse than incorrectly classifying a good customer as bad. Determine a good threshold probability based on your model.

set.seed(1) 



germancredit <- read.table("\\Users\\smparker\\WorkFolders\\Documents\\data 10.3\\germancredit.txt", header = FALSE) 
head(germancredit) 
##    V1 V2  V3  V4   V5  V6  V7 V8  V9  V10 V11  V12 V13  V14  V15 V16  V17 V18
## 1 A11  6 A34 A43 1169 A65 A75  4 A93 A101   4 A121  67 A143 A152   2 A173   1
## 2 A12 48 A32 A43 5951 A61 A73  2 A92 A101   2 A121  22 A143 A152   1 A173   1
## 3 A14 12 A34 A46 2096 A61 A74  2 A93 A101   3 A121  49 A143 A152   1 A172   2
## 4 A11 42 A32 A42 7882 A61 A74  2 A93 A103   4 A122  45 A143 A153   1 A173   2
## 5 A11 24 A33 A40 4870 A61 A73  3 A93 A101   4 A124  53 A143 A153   2 A173   2
## 6 A14 36 A32 A46 9055 A65 A73  2 A93 A101   4 A124  35 A143 A153   1 A172   2
##    V19  V20 V21
## 1 A192 A201   1
## 2 A191 A201   2
## 3 A191 A201   1
## 4 A191 A201   1
## 5 A191 A201   2
## 6 A192 A201   1

Creating binary responses for logistic regression analysis.

germancredit$V21[germancredit$V21==1] <- 0
germancredit$V21[germancredit$V21==2] <- 1

Splitting 70/30 the data for training set(70%) and test set(30%).

# split our data
training <-sample(nrow(germancredit),0.7* nrow(germancredit), replace = FALSE)
# create the sets
training_data <-germancredit[training, ]
test_data <-germancredit[-training, ]
# Logistic regression model
model <-glm(V21~.,family =binomial(link ="logit"),data =training_data)
summary(model)
## 
## Call:
## glm(formula = V21 ~ ., family = binomial(link = "logit"), data = training_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4438  -0.6861  -0.3608   0.6750   2.4540  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.823e-01  1.332e+00   0.287 0.774162    
## V1A12       -5.201e-01  2.681e-01  -1.940 0.052408 .  
## V1A13       -1.150e+00  4.473e-01  -2.570 0.010173 *  
## V1A14       -1.675e+00  2.750e-01  -6.091 1.12e-09 ***
## V2           2.570e-02  1.159e-02   2.217 0.026647 *  
## V3A31        8.440e-02  6.580e-01   0.128 0.897943    
## V3A32       -8.078e-01  4.996e-01  -1.617 0.105907    
## V3A33       -7.683e-01  5.372e-01  -1.430 0.152634    
## V3A34       -1.446e+00  5.127e-01  -2.821 0.004784 ** 
## V4A41       -1.513e+00  4.479e-01  -3.379 0.000728 ***
## V4A410      -2.412e+00  1.160e+00  -2.080 0.037543 *  
## V4A42       -5.496e-01  3.195e-01  -1.720 0.085354 .  
## V4A43       -9.142e-01  3.024e-01  -3.023 0.002503 ** 
## V4A44       -4.163e-01  9.455e-01  -0.440 0.659751    
## V4A45       -1.562e-01  6.742e-01  -0.232 0.816732    
## V4A46       -2.569e-01  5.085e-01  -0.505 0.613382    
## V4A48       -1.531e+01  4.556e+02  -0.034 0.973202    
## V4A49       -5.397e-01  4.017e-01  -1.344 0.179086    
## V5           1.076e-04  5.600e-05   1.922 0.054633 .  
## V6A62       -3.474e-01  3.579e-01  -0.971 0.331777    
## V6A63       -2.440e-01  4.761e-01  -0.513 0.608232    
## V6A64       -1.379e+00  6.535e-01  -2.110 0.034823 *  
## V6A65       -8.106e-01  3.223e-01  -2.515 0.011910 *  
## V7A72       -1.814e-01  5.243e-01  -0.346 0.729300    
## V7A73       -5.253e-01  5.001e-01  -1.050 0.293529    
## V7A74       -1.129e+00  5.455e-01  -2.070 0.038431 *  
## V7A75       -5.927e-01  5.052e-01  -1.173 0.240705    
## V8           3.523e-01  1.094e-01   3.219 0.001284 ** 
## V9A92        4.849e-02  4.760e-01   0.102 0.918863    
## V9A93       -4.446e-01  4.691e-01  -0.948 0.343279    
## V9A94       -4.288e-01  5.837e-01  -0.735 0.462524    
## V10A102      3.052e-01  5.338e-01   0.572 0.567472    
## V10A103     -3.086e-01  5.237e-01  -0.589 0.555669    
## V11         -1.080e-01  1.073e-01  -1.007 0.314147    
## V12A122      2.219e-01  3.161e-01   0.702 0.482767    
## V12A123      3.274e-01  2.922e-01   1.120 0.262504    
## V12A124      1.156e+00  5.656e-01   2.044 0.040944 *  
## V13         -2.257e-02  1.140e-02  -1.980 0.047667 *  
## V14A142     -5.214e-01  4.925e-01  -1.059 0.289757    
## V14A143     -7.780e-01  2.848e-01  -2.732 0.006299 ** 
## V15A152     -6.323e-01  2.870e-01  -2.203 0.027579 *  
## V15A153     -6.674e-01  6.202e-01  -1.076 0.281931    
## V16          2.866e-01  2.236e-01   1.282 0.199939    
## V17A172      1.565e+00  8.891e-01   1.760 0.078442 .  
## V17A173      1.564e+00  8.582e-01   1.823 0.068370 .  
## V17A174      1.400e+00  8.772e-01   1.596 0.110563    
## V18          1.645e-01  3.004e-01   0.548 0.583871    
## V19A192     -3.319e-01  2.413e-01  -1.376 0.168942    
## V20A202     -2.137e+00  8.573e-01  -2.493 0.012665 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 851.79  on 699  degrees of freedom
## Residual deviance: 613.21  on 651  degrees of freedom
## AIC: 711.21
## 
## Number of Fisher Scoring iterations: 14

Utilizing ROC curve to calculate AUC values to determine fit of the model.

preds <-predict(model, test_data,type = "response")
# ROC curve 
# AUC based on the predictive power.
roc_val <-roc(test_data$V21,round(preds))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
roc_val
## 
## Call:
## roc.default(response = test_data$V21, predictor = round(preds))
## 
## Data: round(preds) in 208 controls (test_data$V21 0) < 92 cases (test_data$V21 1).
## Area under the curve: 0.679

Determining the optimum threshold by testing out different threshold values.

# Using .4 as the low threshold
preds_thres_4<-as.integer(preds>0.4)
conf_matrix_4<-as.matrix(table(preds_thres_4, test_data$V21))
conf_matrix_4
##              
## preds_thres_4   0   1
##             0 167  38
##             1  41  54

The model shows that (40*5 = 175) + 43, total would be 243 there.

# Using .6 as the threshold value
preds_thres_6<-as.integer(preds>0.6)
conf_matrix_6<-as.matrix(table(preds_thres_6, test_data$V21))
conf_matrix_6
##              
## preds_thres_6   0   1
##             0 190  60
##             1  18  32
# this is for .9 as the high threshold.
preds_thres_9<-as.integer(preds>0.9)
conf_matrix_9<-as.matrix(table(preds_thres_9, test_data$V21))
conf_matrix_9
##              
## preds_thres_9   0   1
##             0 207  87
##             1   1   5

The .6 threshold model has a cost of (15x5) = 75 + 90 = 165 The model with .9 threshold value has (3x5) + 89 = 104

To avoid missclassification of the data the threshold value should be at .9, therefore minimizing the mistake of labeling a bad customer as good.