---
title: "ISYE6501x - WEEK 4 HW"
author: "Vivian Peng"
output:
  html_notebook: default
  pdf_document: default
---

Question 9.1
Using the same crime data set uscrime.txt as in Question 8.2, apply Principal Component Analysis and then create a regression model using the first few principal components.

Specify your new model in terms of the original variables (not the principal components), and compare its quality to that of your solution to Question 8.2. You can use the R function prcomp for PCA. (Note that to first scale the data, you can include scale. = TRUE to scale as part of the PCA function. Don’t forget that, to make a prediction for the new city, you’ll need to unscale the coefficients (i.e., do the scaling calculation in reverse)!)

library(DAAG)        # provides cv.lm, used for cross-validation below
Data_crime = read.csv("uscrime.txt",sep = "")
str(Data_crime)
'data.frame':   47 obs. of  16 variables:
 $ M     : num  15.1 14.3 14.2 13.6 14.1 12.1 12.7 13.1 15.7 14 ...
 $ So    : int  1 0 1 0 0 0 1 1 1 0 ...
 $ Ed    : num  9.1 11.3 8.9 12.1 12.1 11 11.1 10.9 9 11.8 ...
 $ Po1   : num  5.8 10.3 4.5 14.9 10.9 11.8 8.2 11.5 6.5 7.1 ...
 $ Po2   : num  5.6 9.5 4.4 14.1 10.1 11.5 7.9 10.9 6.2 6.8 ...
 $ LF    : num  0.51 0.583 0.533 0.577 0.591 0.547 0.519 0.542 0.553 0.632 ...
 $ M.F   : num  95 101.2 96.9 99.4 98.5 ...
 $ Pop   : int  33 13 18 157 18 25 4 50 39 7 ...
 $ NW    : num  30.1 10.2 21.9 8 3 4.4 13.9 17.9 28.6 1.5 ...
 $ U1    : num  0.108 0.096 0.094 0.102 0.091 0.084 0.097 0.079 0.081 0.1 ...
 $ U2    : num  4.1 3.6 3.3 3.9 2 2.9 3.8 3.5 2.8 2.4 ...
 $ Wealth: int  3940 5570 3180 6730 5780 6890 6200 4720 4210 5260 ...
 $ Ineq  : num  26.1 19.4 25 16.7 17.4 12.6 16.8 20.6 23.9 17.4 ...
 $ Prob  : num  0.0846 0.0296 0.0834 0.0158 0.0414 ...
 $ Time  : num  26.2 25.3 24.3 29.9 21.3 ...
 $ Crime : int  791 1635 578 1969 1234 682 963 1555 856 705 ...
print(summary(Data_crime))
       M              So             Ed             Po1             Po2       
 Min.   :11.9   Min.   :0.00   Min.   : 8.70   Min.   : 4.50   Min.   : 4.10  
 1st Qu.:13.0   1st Qu.:0.00   1st Qu.: 9.75   1st Qu.: 6.25   1st Qu.: 5.85  
 Median :13.6   Median :0.00   Median :10.80   Median : 7.80   Median : 7.30  
 Mean   :13.9   Mean   :0.34   Mean   :10.56   Mean   : 8.50   Mean   : 8.02  
 3rd Qu.:14.6   3rd Qu.:1.00   3rd Qu.:11.45   3rd Qu.:10.45   3rd Qu.: 9.70  
 Max.   :17.7   Max.   :1.00   Max.   :12.20   Max.   :16.60   Max.   :15.70  
       LF             M.F             Pop              NW             U1        
 Min.   :0.480   Min.   : 93.4   Min.   :  3.0   Min.   : 0.2   Min.   :0.0700  
 1st Qu.:0.530   1st Qu.: 96.4   1st Qu.: 10.0   1st Qu.: 2.4   1st Qu.:0.0805  
 Median :0.560   Median : 97.7   Median : 25.0   Median : 7.6   Median :0.0920  
 Mean   :0.561   Mean   : 98.3   Mean   : 36.6   Mean   :10.1   Mean   :0.0955  
 3rd Qu.:0.593   3rd Qu.: 99.2   3rd Qu.: 41.5   3rd Qu.:13.2   3rd Qu.:0.1040  
 Max.   :0.641   Max.   :107.1   Max.   :168.0   Max.   :42.3   Max.   :0.1420  
       U2           Wealth          Ineq           Prob             Time     
 Min.   :2.00   Min.   :2880   Min.   :12.6   Min.   :0.0069   Min.   :12.2  
 1st Qu.:2.75   1st Qu.:4595   1st Qu.:16.6   1st Qu.:0.0327   1st Qu.:21.6  
 Median :3.40   Median :5370   Median :17.6   Median :0.0421   Median :25.8  
 Mean   :3.40   Mean   :5254   Mean   :19.4   Mean   :0.0471   Mean   :26.6  
 3rd Qu.:3.85   3rd Qu.:5915   3rd Qu.:22.8   3rd Qu.:0.0544   3rd Qu.:30.5  
 Max.   :5.80   Max.   :6890   Max.   :27.6   Max.   :0.1198   Max.   :44.0  
     Crime     
 Min.   : 342  
 1st Qu.: 658  
 Median : 831  
 Mean   : 905  
 3rd Qu.:1058  
 Max.   :1993  
data <- Data_crime[,1:15]
pca <- prcomp(data, scale. = TRUE)
pca
Standard deviations (1, .., p=15):
 [1] 2.4534 1.6739 1.4160 1.0781 0.9789 0.7438 0.5673 0.5544 0.4849 0.4471 0.4191
[12] 0.3580 0.2633 0.2418 0.0679

Rotation (n x k) = (15 x 15):
           PC1      PC2       PC3     PC4     PC5      PC6     PC7     PC8     PC9
M      -0.3037  0.06280  0.172420 -0.0204 -0.3583 -0.44913 -0.1571 -0.5537  0.1547
So     -0.3309 -0.15837  0.015543  0.2925 -0.1206 -0.10050  0.1965  0.2273 -0.6560
Ed      0.3396  0.21461  0.067740  0.0797 -0.0244 -0.00857 -0.2394 -0.1464 -0.4433
Po1     0.3086 -0.26982  0.050646  0.3333 -0.2353 -0.09578  0.0801  0.0461  0.1943
Po2     0.3110 -0.26396  0.053065  0.3519 -0.2047 -0.11952  0.0952  0.0317  0.1951
LF      0.1762  0.31943  0.271530 -0.1433 -0.3941  0.50423 -0.1593  0.2551  0.1439
M.F     0.1164  0.39434 -0.203162  0.0105 -0.5788 -0.07450  0.1555 -0.0551 -0.2438
Pop     0.1131 -0.46723  0.077021 -0.0321 -0.0832  0.54710  0.0905 -0.5908 -0.2024
NW     -0.2936 -0.22801  0.078816  0.2393 -0.3608  0.05122 -0.3115  0.2043  0.1898
U1      0.0405  0.00807 -0.659029 -0.1828 -0.1314  0.01739 -0.1735 -0.2021  0.0207
U2      0.0181 -0.27971 -0.578501 -0.0689 -0.1350  0.04816 -0.0753  0.2437  0.0558
Wealth  0.3797 -0.07719  0.010065  0.1178  0.0117 -0.15468 -0.1486  0.0863 -0.2320
Ineq   -0.3658 -0.02752 -0.000294 -0.0807 -0.2167  0.27203  0.3748  0.0718 -0.0249
Prob   -0.2589  0.15832 -0.117673  0.4930  0.1656  0.28354 -0.5616 -0.0860 -0.0531
Time   -0.0206 -0.38015  0.223566 -0.5406 -0.1476 -0.14820 -0.4420  0.1951 -0.2355
          PC10    PC11    PC12    PC13    PC14     PC15
M      -0.0144  0.3945  0.1658 -0.0514  0.0490  0.00514
So      0.0614  0.2340 -0.0575 -0.2937 -0.2936  0.00844
Ed      0.5189 -0.1182  0.4779  0.1944  0.0396 -0.02801
Po1    -0.1432 -0.1304  0.2261 -0.1859 -0.0949 -0.68942
Po2    -0.0593 -0.1389  0.1909 -0.1345 -0.0826  0.72003
LF      0.0308  0.3853  0.0271 -0.2774 -0.1539  0.03368
M.F    -0.3532 -0.2803 -0.2393  0.3162 -0.0413  0.00979
Pop    -0.0397  0.0585 -0.1835  0.1265 -0.0533  0.00015
NW      0.4920 -0.2070 -0.3667  0.2290  0.1323 -0.03708
U1      0.2277 -0.1786 -0.0931 -0.5904 -0.0234  0.01114
U2     -0.0475  0.4702  0.2844  0.4329 -0.0399  0.00736
Wealth -0.1122  0.3196 -0.3217 -0.1408  0.7003 -0.00257
Ineq   -0.0139 -0.1828  0.4376 -0.1218  0.5928  0.01776
Prob   -0.4253 -0.0898  0.1557 -0.0355  0.0476  0.02934
Time   -0.2926 -0.2636  0.1354 -0.0574 -0.0449  0.03768
#Plot the 1st and 2nd principal components
plot(pca$x[,1], pca$x[,2])


pca.var <- pca$sdev^2
pca.var.per <- round(pca.var/sum(pca.var), 3)
pca.var.per
 [1] 0.401 0.187 0.134 0.077 0.064 0.037 0.021 0.020 0.016 0.013 0.012
[12] 0.009 0.005 0.004 0.000
plot(pca.var.per, main = "Proportion of Variance Explained", xlab = "Principal Component",
        ylab = "Proportion of Variance", ylim = c(0,1), type = "b")
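
# As a quick sketch to back up the cutoff, the cumulative proportion of
# variance can be inspected directly (summary(pca) reports the same figures):
cumsum(pca.var.per)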

# From the scree plot, the first 4 principal components together explain
# about 80% of the variance in the data (0.401 + 0.187 + 0.134 + 0.077),
# so I will use the first four components to build a linear regression model.
pca_data <- pca$x[,1:4] 
crime_4_pc <- cbind(pca_data, Data_crime[,16])


smp_siz = floor(0.75*nrow(crime_4_pc)) 
set.seed(123)
train_ind = sample(seq_len(nrow(crime_4_pc)),size=smp_siz)
train = crime_4_pc[train_ind,]
test = crime_4_pc[-train_ind,]

# build a linear regression model on the first 4 principal components
# (the unnamed response column from cbind() is auto-named "V5" by as.data.frame)
model <- lm(V5 ~ ., data = as.data.frame(train))
test_cv <- cv.lm(as.data.frame(test), model,m=5)
Analysis of Variance Table

Response: V5
          Df Sum Sq Mean Sq F value Pr(>F)   
PC1        1 539158  539158   10.75 0.0135 * 
PC2        1  88136   88136    1.76 0.2267   
PC3        1 139723  139723    2.78 0.1391   
PC4        1 845689  845689   16.85 0.0045 **
Residuals  7 351243   50178                  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


 As there is >1 explanatory variable, cross-validation
 predicted values for a fold are not a linear function
 of corresponding overall predicted values.  Lines that
 are shown for the different folds are approximate

fold 1 
Observations in test set: 2 
               5   11
Predicted   1687  895
cvpred      1323  762
V5          1993 1030
CV residual  670  268

Sum of squares = 521082    Mean square = 260541    n = 2 

fold 2 
Observations in test set: 3 
               4     6    8
Predicted    592 769.4  871
cvpred       880 791.8 1162
V5           511 696.0  754
CV residual -369 -95.8 -408

Sum of squares = 311832    Mean square = 103944    n = 3 

fold 3 
Observations in test set: 3 
                 1   3    9
Predicted   1328.2 796  838
cvpred      1302.3 656  739
V5          1234.0 856 1072
CV residual  -68.3 200  333

Sum of squares = 155472    Mean square = 51824    n = 3 

fold 4 
Observations in test set: 2 
               2      7
Predicted   1187 230.41
cvpred      1236   8.05
V5           963 373.00
CV residual -273 364.95

Sum of squares = 207736    Mean square = 103868    n = 2 

fold 5 
Observations in test set: 2 
              10   12
Predicted    580 1124
cvpred       802 1356
V5           566  849
CV residual -236 -507

Sum of squares = 312141    Mean square = 156071    n = 2 

Overall (Sum over all 5 folds) 
    ms 
125689 

test_cv
# Cross-validation gives an overall mean squared prediction error
# of 125689 for the 4-PC model.
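
# As a simpler sanity check (a sketch reusing model and the held-out test split):
pred_holdout <- predict(model, as.data.frame(test))
mean((as.data.frame(test)$V5 - pred_holdout)^2)   # plain hold-out MSE of the 4-PC model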

# My Solution to Question 8.2
train_original = Data_crime[train_ind,]
test_original = Data_crime[-train_ind,]
train_original
model_orig <- lm(Crime ~ ., train_original)
test_orig_cv <- cv.lm(as.data.frame(test_original), model_orig,m=5)
Warning messages:
  prediction from a rank-deficient fit may be misleading  (repeated 5 times)
  ANOVA F-tests on an essentially perfect fit are unreliable
Analysis of Variance Table

Response: Crime
          Df Sum Sq Mean Sq F value Pr(>F)
M          1  19217   19217               
So         1   5922    5922               
Ed         1 955183  955183               
Po1        1 833986  833986               
Po2        1    425     425               
LF         1  16590   16590               
M.F        1  20652   20652               
Pop        1   2273    2273               
NW         1  22702   22702               
U1         1     30      30               
U2         1  86968   86968               
Residuals  0      0                       


 As there is >1 explanatory variable, cross-validation
 predicted values for a fold are not a linear function
 of corresponding overall predicted values.  Lines that
 are shown for the different folds are approximate

fold 1 
Observations in test set: 2 
                26    44
Predicted     1993  1030
cvpred       16265 -3679
Crime         1993  1030
CV residual -14272  4709

Sum of squares = 2.26e+08    Mean square = 1.13e+08    n = 2 

fold 2 
Observations in test set: 3 
             13    30   32
Predicted   511   696  754
cvpred      409  1872  935
Crime       511   696  754
CV residual 102 -1176 -181

Sum of squares = 1425699    Mean square = 475233    n = 3 

fold 3 
Observations in test set: 3 
                 5     9   33
Predicted   1234.0 856.0 1072
cvpred      1299.2 -96.2  365
Crime       1234.0 856.0 1072
CV residual  -65.2 952.2  707

Sum of squares = 1411375    Mean square = 470458    n = 3 

fold 4 
Observations in test set: 2 
               7    31
Predicted    963   373
cvpred      1319  1787
Crime        963   373
CV residual -356 -1414

Sum of squares = 2126770    Mean square = 1063385    n = 2 

fold 5 
Observations in test set: 2 
             38   47
Predicted   566  849
cvpred      186 1011
Crime       566  849
CV residual 380 -162

Sum of squares = 170374    Mean square = 85187    n = 2 

Overall (Sum over all 5 folds) 
      ms 
19249220 

# Using the original model from Question 8.2, the overall mean squared prediction
# error is 19249220, far higher than the model built on only 4 PCs. (With just 12
# test rows and 15 predictors, the fit is rank-deficient, hence the warnings above
# and the wild cross-validated predictions.)

# Next, map the first 4 PCs back to the original variable space to make predictions.
# Because the PCA used scale. = TRUE, the reconstruction must be unscaled:
# multiply by pca$scale and add pca$center. With unscaled data it would simply be
# t(t(pca$x %*% t(pca$rotation)) + pca$center).
# Reconstruct (approximate) original variables from the first 4 PCs:
lm_data <- t(t(pca$x[,1:4] %*% t(pca$rotation[,1:4])) * pca$scale + pca$center)
print("The original variables ")
[1] "The original variables "
print(lm_data)
         M      So    Ed   Po1   Po2    LF   M.F     Pop      NW     U1   U2 Wealth
 [1,] 15.1  1.1747  8.68  6.02  5.67 0.501  96.3  33.894 26.1005 0.1033 4.10   3862
 [2,] 13.5  0.0911 11.16  8.94  8.45 0.578  99.5  29.571  4.7373 0.0973 3.29   5623
 [3,] 15.4  1.0311  9.05  4.77  4.51 0.529  97.4  12.176 22.6931 0.0956 3.43   3743
 [4,] 12.2 -0.0166 11.45 14.50 13.67 0.556  96.5  99.166  5.7033 0.0939 3.93   6897
 [5,] 13.5  0.0581 11.74 10.02  9.54 0.601  99.7  23.732  4.2427 0.0794 2.45   5923
 [6,] 12.8  0.0798 11.74 12.72 12.08 0.576  98.6  55.082  5.5479 0.0872 3.19   6488
 [7,] 13.5  0.4596 10.67  9.77  9.28 0.545  98.9  34.934 11.6050 0.1027 3.80   5470
 [8,] 13.9  0.6704 10.36 10.99 10.42 0.541  96.4  60.238 17.7222 0.0833 3.34   5477
 [9,] 15.4  1.0515  9.17  6.48  6.12 0.538  95.5  36.371 25.0187 0.0757 2.85   4057
[10,] 13.7 -0.1122 11.74  6.91  6.57 0.614 102.0 -10.826 -0.9229 0.0932 2.53   5427
[11,] 13.0  0.0292 10.93 12.33 11.57 0.567  95.1 100.599  7.7986 0.0815 3.36   6298
[12,] 13.8  0.0105 10.90  8.35  7.83 0.588  97.8  47.115  4.9496 0.0864 2.92   5489
[13,] 14.2 -0.0246 10.99  6.89  6.46 0.607  98.5  28.819  3.8403 0.0794 2.37   5207
[14,] 14.1  0.1292 11.24  7.49  7.10 0.600 100.0   8.650  5.1287 0.0841 2.47   5273
[15,] 14.9  0.8712  9.24  5.33  5.01 0.526  97.8  18.221 19.4811 0.1048 3.82   4025
[16,] 14.4  1.0400  8.82  7.65  7.18 0.489  95.9  54.970 23.9987 0.1130 4.71   4342
[17,] 13.8  0.2967 10.75  7.21  6.84 0.562 100.7   5.473  6.9953 0.1109 3.71   5083
[18,] 13.7  0.9428 10.75 12.60 12.07 0.528  98.2  37.138 20.8539 0.0868 3.45   5691
[19,] 12.8  0.1155 11.44 13.38 12.65 0.568  96.9  79.307  7.7441 0.0832 3.29   6541
[20,] 12.2  0.0538 10.79 11.34 10.65 0.526  98.8  69.365  4.3735 0.1312 5.24   6172
[21,] 13.5 -0.1579 10.86  7.31  6.80 0.587  98.6  42.684  1.1473 0.1002 3.37   5418
[22,] 16.0  1.3140  8.15  4.15  3.88 0.508  95.0  32.130 29.7234 0.0902 3.52   3257
[23,] 14.0  0.3435 10.40  8.54  8.03 0.563  97.2  49.111 11.2602 0.0883 3.21   5216
[24,] 12.8 -0.0994 11.30  7.40  6.99 0.566 102.8  -0.436 -1.9409 0.1334 4.44   5546
[25,] 14.4  0.1186 11.44  6.97  6.64 0.617 100.4  -3.686  4.7354 0.0741 1.88   5182
[26,] 12.2 -0.1638 12.47 12.81 12.21 0.587 101.6  28.737 -1.4430 0.1028 3.48   6834
[27,] 13.9  0.0676 11.14  8.30  7.84 0.595  98.8  30.789  5.1705 0.0831 2.64   5469
[28,] 14.0  0.1358 10.93  6.99  6.60 0.585 100.0  12.441  5.0911 0.0962 3.03   5144
[29,] 12.3  0.2338 10.89 16.34 15.38 0.530  93.6 138.058 12.7329 0.0863 4.17   6993
[30,] 15.7  1.1214  9.07  5.87  5.56 0.539  95.6  29.092 26.1595 0.0733 2.69   3859
[31,] 13.7  0.1168 10.37  4.81  4.48 0.554 101.7  -2.872  2.6937 0.1338 4.48   4703
[32,] 13.0  0.0647 10.80 10.34  9.70 0.554  97.7  66.498  5.8112 0.1055 4.05   5894
[33,] 14.9  0.6308 10.11  6.69  6.34 0.570  97.5  21.979 16.0342 0.0765 2.52   4554
[34,] 13.0  0.0476 11.41 10.20  9.66 0.571 100.0  32.944  3.4661 0.1045 3.65   5973
[35,] 12.7  0.0770 10.13 10.54  9.80 0.523  96.2  96.119  7.1036 0.1223 5.11   5861
[36,] 13.5  0.0428 10.47  9.10  8.47 0.567  96.2  75.282  6.8511 0.0927 3.53   5567
[37,] 16.0  0.9178  9.17  4.40  4.12 0.563  95.6  23.742 22.6673 0.0663 2.18   3695
[38,] 14.3  0.1624 10.91  4.96  4.71 0.592 101.6 -18.408  3.9810 0.1013 2.91   4712
[39,] 15.1  0.9053  9.05  5.95  5.57 0.529  95.9  40.035 21.9164 0.0915 3.49   4069
[40,] 14.4  0.5783  9.96  8.37  7.88 0.550  96.2  53.337 16.1698 0.0851 3.23   4937
[41,] 13.9 -0.1730 11.62  7.60  7.17 0.622  99.9  16.463  0.0898 0.0787 2.15   5559
[42,] 13.9  0.4300 10.71  7.10  6.77 0.556 101.2  -4.900  8.8676 0.1118 3.71   4961
[43,] 15.0  0.8271  9.68  7.28  6.88 0.551  96.1  38.206 20.6725 0.0753 2.75   4464
[44,] 12.9 -0.1492 11.43  9.55  8.99 0.580  99.9  37.949  0.1106 0.1070 3.69   5986
[45,] 14.2  0.6609  8.78  4.80  4.40 0.497  97.9  37.275 15.3621 0.1379 5.36   4069
[46,] 13.4  0.1435 11.63 10.55 10.04 0.591  99.5  28.801  5.9393 0.0817 2.66   5959
[47,] 12.8 -0.1483 11.99  8.92  8.50 0.590 103.0  -4.365 -2.6840 0.1144 3.57   5963
      Ineq    Prob Time
 [1,] 25.4 0.07840 25.8
 [2,] 17.6 0.04183 24.8
 [3,] 25.3 0.07789 24.4
 [4,] 14.0 0.01892 31.9
 [5,] 16.3 0.04568 22.0
 [6,] 14.8 0.04106 23.2
 [7,] 18.7 0.06055 21.0
 [8,] 19.3 0.05751 25.7
 [9,] 24.6 0.06862 28.9
[10,] 17.5 0.04688 20.0
[11,] 16.3 0.00859 39.2
[12,] 18.4 0.02305 33.5
[13,] 19.0 0.02482 32.9
[14,] 18.6 0.04762 23.5
[15,] 24.2 0.07066 25.0
[16,] 23.9 0.06813 27.8
[17,] 19.5 0.06057 19.4
[18,] 18.3 0.09199 11.8
[19,] 15.1 0.03040 29.0
[20,] 16.3 0.03226 27.4
[21,] 18.4 0.01541 34.9
[22,] 27.8 0.07560 30.7
[23,] 19.8 0.03872 31.0
[24,] 17.3 0.04992 17.8
[25,] 18.7 0.05064 22.2
[26,] 12.8 0.04835 15.1
[27,] 18.2 0.03484 28.5
[28,] 19.2 0.04540 24.8
[29,] 14.5 0.01424 37.6
[30,] 25.2 0.07322 28.1
[31,] 20.6 0.05015 22.6
[32,] 17.3 0.02587 31.7
[33,] 22.0 0.05852 26.9
[34,] 16.4 0.04447 22.0
[35,] 18.0 0.01220 38.5
[36,] 18.6 0.01194 39.3
[37,] 25.6 0.05916 32.7
[38,] 20.3 0.05707 20.3
[39,] 24.5 0.05973 31.0
[40,] 21.1 0.04600 31.4
[41,] 17.4 0.02826 28.3
[42,] 19.9 0.07277 15.5
[43,] 22.8 0.06043 28.9
[44,] 16.2 0.02968 26.5
[45,] 24.3 0.05227 30.6
[46,] 16.4 0.04935 21.3
[47,] 15.6 0.05364 14.6
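# Sanity-check sketch for the unscaling formula: reconstructing from all 15 PCs
# should reproduce the original data up to floating-point noise.
full_recon <- t(t(pca$x %*% t(pca$rotation)) * pca$scale + pca$center)
max(abs(full_recon - as.matrix(data)))   # expected to be ~0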
# Fit a regression of Crime on the variables reconstructed from the first 4 PCs
lm_original <- cbind(lm_data, Data_crime[,16])
colnames(lm_original)[16] <- "Crime"
model_lm_original <- lm(Crime~., data = as.data.frame(lm_original))
model_lm_original$coef
(Intercept)           M          So          Ed         Po1         Po2          LF 
     -380.9        69.7      -132.5       -41.4        94.4          NA          NA 
        M.F         Pop          NW          U1          U2      Wealth        Ineq 
         NA          NA          NA          NA          NA          NA          NA 
       Prob        Time 
         NA          NA 
# Coefficients of the original variables: only four are estimated and the rest
# are NA. This is expected, since the reconstructed data matrix has rank 4 (it
# was built from only 4 PCs), so lm() drops the 11 perfectly collinear columns.
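
# A more direct way to express the 4-PC model in terms of the original variables
# is to transform the PC coefficients themselves (a sketch using model and pca
# from above, rather than the rank-deficient regression):
alpha <- coef(model)[2:5]                      # coefficients on PC1..PC4
beta_scaled <- pca$rotation[, 1:4] %*% alpha   # implied coefficients on the scaled variables
beta_orig <- beta_scaled / pca$scale           # unscale the slopes
b0_orig <- coef(model)[1] - sum(beta_scaled * pca$center / pca$scale)
# A prediction for a new point x in original units is then b0_orig + sum(beta_orig * x).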


new_city <- data.frame(M = 14.0, So = 0, Ed = 10.0, Po1 = 12.0, Po2 = 15.5,
                       LF = 0.640, M.F = 94.0, Pop = 150, NW = 1.1, U1 = 0.120,
                       U2 = 3.6, Wealth = 3200, Ineq = 20.1, Prob = 0.04, Time = 39.0)

print(predict(model_orig, new_city))
  1 
410 
#Original prediction result from Question 8.2 is 410 

print(predict(model_lm_original, new_city))
prediction from a rank-deficient fit may be misleading
   1 
1314 
#The output is 1314.
#This prediction seems more reasonable than the original prediction of 410.

Question 10.1
Using the same crime data set uscrime.txt as in Questions 8.2 and 9.1, find the best model you can using (a) a regression tree model, and (b) a random forest model.

In R, you can use the tree package or the rpart package, and the randomForest package. For each model, describe one or two qualitative takeaways you get from analyzing the results (i.e., don’t just stop when you have a good model, but interpret it too).

# Regression Tree 
library(rpart)        # regression trees
library(rpart.plot)   # tree plotting
Data_crime = read.csv("uscrime.txt",sep = "")
smp_siz = floor(0.75*nrow(Data_crime)) 
smp_siz
[1] 35
set.seed(123)
train_ind = sample(seq_len(nrow(Data_crime)),size=smp_siz)
train = Data_crime[train_ind,]
test = Data_crime[-train_ind,]
dim(test)
[1] 12 16
crime_tree_1<- rpart(Crime~ .,data=train, method = "anova")
summary(crime_tree_1)
Call:
rpart(formula = Crime ~ ., data = train, method = "anova")
  n= 35 

    CP nsplit rel error xerror  xstd
1 0.38      0      1.00   1.08 0.294
2 0.01      1      0.62   0.89 0.243

Variable importance
   Po1    Po2   Prob Wealth      M    Pop 
    25     25     15     13     12     10 

Node number 1: 35 observations,    complexity param=0.38
  mean=904, MSE=1.4e+05 
  left son=2 (17 obs) right son=3 (18 obs)
  Primary splits:
      Po1  < 7.65  to the left,  improve=0.380, (0 missing)
      Po2  < 7.2   to the left,  improve=0.380, (0 missing)
      Prob < 0.043 to the right, improve=0.298, (0 missing)
      NW   < 7.75  to the left,  improve=0.281, (0 missing)
      Pop  < 41.5  to the left,  improve=0.264, (0 missing)
  Surrogate splits:
      Po2    < 7.2   to the left,  agree=1.000, adj=1.000, (0 split)
      Prob   < 0.043 to the right, agree=0.800, adj=0.588, (0 split)
      Wealth < 5330  to the left,  agree=0.771, adj=0.529, (0 split)
      M      < 13.3  to the right, agree=0.743, adj=0.471, (0 split)
      Pop    < 38    to the left,  agree=0.714, adj=0.412, (0 split)

Node number 2: 17 observations
  mean=666, MSE=2.69e+04 

Node number 3: 18 observations
  mean=1.13e+03, MSE=1.44e+05 
rpart.plot(crime_tree_1)

plotcp(crime_tree_1)

# The tree has only 3 nodes: a single split on Po1 and two leaves.
# With only 35 training observations, the default control parameters
# stop splitting early. To see whether further splits are warranted,
# I will loosen the control parameters.

# A regression tree with loosened control parameters
# (note: this tree is fit on the full 47-row data set, not the training split)
crime_tree_2 <- rpart(Crime ~ ., data = Data_crime, method = "anova",
                      control = rpart.control(minsplit = 5, minbucket = 4))

summary(crime_tree_2)
Call:
rpart(formula = Crime ~ ., data = Data_crime, method = "anova", 
    control = rpart.control(minsplit = 5, minbucket = 4))
  n= 47 

      CP nsplit rel error xerror  xstd
1 0.3630      0     1.000  1.037 0.258
2 0.1481      1     0.637  0.896 0.211
3 0.1348      2     0.489  1.027 0.268
4 0.1187      3     0.354  1.009 0.266
5 0.0554      4     0.235  0.902 0.248
6 0.0517      5     0.180  0.872 0.240
7 0.0177      6     0.128  0.861 0.248
8 0.0100      7     0.111  0.795 0.240

Variable importance
   Po2    Po1   Prob Wealth   Ineq      M     NW     LF     Ed   Time    Pop    M.F 
    14     14     11     11      8      8      8      8      6      5      3      2 
    So 
     2 

Node number 1: 47 observations,    complexity param=0.363
  mean=905, MSE=1.46e+05 
  left son=2 (23 obs) right son=3 (24 obs)
  Primary splits:
      Po1    < 7.65   to the left,  improve=0.363, (0 missing)
      Po2    < 7.2    to the left,  improve=0.363, (0 missing)
      Prob   < 0.0418 to the right, improve=0.322, (0 missing)
      Wealth < 6470   to the left,  improve=0.289, (0 missing)
      NW     < 7.65   to the left,  improve=0.236, (0 missing)
  Surrogate splits:
      Po2    < 7.2    to the left,  agree=1.000, adj=1.000, (0 split)
      Wealth < 5330   to the left,  agree=0.830, adj=0.652, (0 split)
      Prob   < 0.0436 to the right, agree=0.809, adj=0.609, (0 split)
      M      < 13.2   to the right, agree=0.745, adj=0.478, (0 split)
      Ineq   < 17.2   to the right, agree=0.745, adj=0.478, (0 split)

Node number 2: 23 observations,    complexity param=0.0517
  mean=670, MSE=3.39e+04 
  left son=4 (12 obs) right son=5 (11 obs)
  Primary splits:
      Pop < 22.5   to the left,  improve=0.457, (0 missing)
      M   < 14.5   to the left,  improve=0.393, (0 missing)
      Po1 < 5.65   to the left,  improve=0.328, (0 missing)
      NW  < 5.4    to the left,  improve=0.318, (0 missing)
      U1  < 0.093  to the right, improve=0.212, (0 missing)
  Surrogate splits:
      NW   < 5.4    to the left,  agree=0.826, adj=0.636, (0 split)
      M    < 14.5   to the left,  agree=0.783, adj=0.545, (0 split)
      Time < 22.3   to the left,  agree=0.783, adj=0.545, (0 split)
      So   < 0.5    to the left,  agree=0.739, adj=0.455, (0 split)
      Ed   < 10.9   to the right, agree=0.739, adj=0.455, (0 split)

Node number 3: 24 observations,    complexity param=0.148
  mean=1.13e+03, MSE=1.5e+05 
  left son=6 (10 obs) right son=7 (14 obs)
  Primary splits:
      NW     < 7.65   to the left,  improve=0.283, (0 missing)
      M      < 13     to the left,  improve=0.271, (0 missing)
      Wealth < 6470   to the left,  improve=0.268, (0 missing)
      Po2    < 11.6   to the left,  improve=0.221, (0 missing)
      Time   < 21.1   to the left,  improve=0.219, (0 missing)
  Surrogate splits:
      Ed   < 11.4   to the right, agree=0.750, adj=0.4, (0 split)
      Ineq < 16.2   to the left,  agree=0.750, adj=0.4, (0 split)
      Time < 21.9   to the left,  agree=0.750, adj=0.4, (0 split)
      Pop  < 30     to the left,  agree=0.708, adj=0.3, (0 split)
      LF   < 0.588  to the right, agree=0.667, adj=0.2, (0 split)

Node number 4: 12 observations,    complexity param=0.0177
  mean=550, MSE=2.03e+04 
  left son=8 (8 obs) right son=9 (4 obs)
  Primary splits:
      Ed     < 11.3   to the left,  improve=0.500, (0 missing)
      LF     < 0.568  to the left,  improve=0.482, (0 missing)
      M.F    < 96.8   to the left,  improve=0.277, (0 missing)
      Wealth < 5080   to the left,  improve=0.233, (0 missing)
      U2     < 3.35   to the right, improve=0.230, (0 missing)
  Surrogate splits:
      LF  < 0.568  to the left,  agree=0.917, adj=0.75, (0 split)
      Po1 < 5.9    to the left,  agree=0.833, adj=0.50, (0 split)
      Po2 < 5.55   to the left,  agree=0.833, adj=0.50, (0 split)
      M.F < 98.1   to the left,  agree=0.833, adj=0.50, (0 split)
      U1  < 0.0785 to the right, agree=0.833, adj=0.50, (0 split)

Node number 5: 11 observations
  mean=800, MSE=1.63e+04 

Node number 6: 10 observations,    complexity param=0.0554
  mean=887, MSE=5.58e+04 
  left son=12 (6 obs) right son=13 (4 obs)
  Primary splits:
      M      < 13     to the left,  improve=0.684, (0 missing)
      Pop    < 21.5   to the right, improve=0.472, (0 missing)
      Wealth < 5830   to the right, improve=0.251, (0 missing)
      Time   < 21.1   to the left,  improve=0.218, (0 missing)
      Ed     < 11.7   to the left,  improve=0.178, (0 missing)
  Surrogate splits:
      Wealth < 5660   to the right, agree=0.8, adj=0.50, (0 split)
      Ineq   < 17.2   to the left,  agree=0.8, adj=0.50, (0 split)
      Ed     < 12     to the left,  agree=0.7, adj=0.25, (0 split)
      Po1    < 9.6    to the right, agree=0.7, adj=0.25, (0 split)
      LF     < 0.56   to the left,  agree=0.7, adj=0.25, (0 split)

Node number 7: 14 observations,    complexity param=0.135
  mean=1.3e+03, MSE=1.45e+05 
  left son=14 (4 obs) right son=15 (10 obs)
  Primary splits:
      Prob   < 0.0419 to the right, improve=0.457, (0 missing)
      Ed     < 11.1   to the left,  improve=0.438, (0 missing)
      M.F    < 99     to the left,  improve=0.438, (0 missing)
      Po2    < 11.6   to the left,  improve=0.368, (0 missing)
      Wealth < 6340   to the left,  improve=0.368, (0 missing)
  Surrogate splits:
      LF   < 0.52   to the left,  agree=0.857, adj=0.50, (0 split)
      NW   < 13.2   to the right, agree=0.857, adj=0.50, (0 split)
      Time < 21.4   to the left,  agree=0.857, adj=0.50, (0 split)
      So   < 0.5    to the right, agree=0.786, adj=0.25, (0 split)
      Po2  < 8.2    to the left,  agree=0.786, adj=0.25, (0 split)

Node number 8: 8 observations
  mean=479, MSE=7.14e+03 

Node number 9: 4 observations
  mean=693, MSE=1.62e+04 

Node number 12: 6 observations
  mean=728, MSE=1.82e+04 

Node number 13: 4 observations
  mean=1.13e+03, MSE=1.68e+04 

Node number 14: 4 observations
  mean=898, MSE=7.06e+03 

Node number 15: 10 observations,    complexity param=0.119
  mean=1.47e+03, MSE=1.07e+05 
  left son=30 (6 obs) right son=31 (4 obs)
  Primary splits:
      LF  < 0.574  to the left,  improve=0.762, (0 missing)
      Ed  < 10.9   to the left,  improve=0.396, (0 missing)
      M.F < 99     to the left,  improve=0.345, (0 missing)
      Po1 < 11.4   to the left,  improve=0.299, (0 missing)
      Po2 < 10.7   to the left,  improve=0.299, (0 missing)
  Surrogate splits:
      Ed     < 11.2   to the left,  agree=0.9, adj=0.75, (0 split)
      Wealth < 6470   to the left,  agree=0.9, adj=0.75, (0 split)
      Po1    < 11.8   to the left,  agree=0.8, adj=0.50, (0 split)
      Po2    < 11.2   to the left,  agree=0.8, adj=0.50, (0 split)
      M.F    < 99     to the left,  agree=0.8, adj=0.50, (0 split)

Node number 30: 6 observations
  mean=1.23e+03, MSE=2.45e+04 

Node number 31: 4 observations
  mean=1.82e+03, MSE=2.69e+04 
rpart.plot(crime_tree_2)

plotcp(crime_tree_2)


# Some qualitative takeaways:
# 1. crime_tree_2 has far more terminal nodes than crime_tree_1, and its
#    leaf-node MSEs are much lower, so it fits its data more closely.
# 2. However, with only 47 observations the cross-validated error (xerror)
#    barely improves as the tree grows, so the extra splits are likely
#    overfitting; a more pruned tree may generalize better (see the
#    pruning sketch below).
# 3. For both trees the same variables rank highest in variable
#    importance: Po1, Po2, Prob, and Wealth.
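# A sketch of that pruning step, using the CP table rpart already computed:
best_cp <- crime_tree_2$cptable[which.min(crime_tree_2$cptable[, "xerror"]), "CP"]
crime_tree_pruned <- prune(crime_tree_2, cp = best_cp)   # prune at the lowest-xerror CP
rpart.plot(crime_tree_pruned)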
library(randomForest)
library(Metrics)

#Check for missing values; the summary shows none
summary(Data_crime)
       M              So             Ed             Po1             Po2       
 Min.   :11.9   Min.   :0.00   Min.   : 8.70   Min.   : 4.50   Min.   : 4.10  
 1st Qu.:13.0   1st Qu.:0.00   1st Qu.: 9.75   1st Qu.: 6.25   1st Qu.: 5.85  
 Median :13.6   Median :0.00   Median :10.80   Median : 7.80   Median : 7.30  
 Mean   :13.9   Mean   :0.34   Mean   :10.56   Mean   : 8.50   Mean   : 8.02  
 3rd Qu.:14.6   3rd Qu.:1.00   3rd Qu.:11.45   3rd Qu.:10.45   3rd Qu.: 9.70  
 Max.   :17.7   Max.   :1.00   Max.   :12.20   Max.   :16.60   Max.   :15.70  
       LF             M.F             Pop              NW             U1        
 Min.   :0.480   Min.   : 93.4   Min.   :  3.0   Min.   : 0.2   Min.   :0.0700  
 1st Qu.:0.530   1st Qu.: 96.4   1st Qu.: 10.0   1st Qu.: 2.4   1st Qu.:0.0805  
 Median :0.560   Median : 97.7   Median : 25.0   Median : 7.6   Median :0.0920  
 Mean   :0.561   Mean   : 98.3   Mean   : 36.6   Mean   :10.1   Mean   :0.0955  
 3rd Qu.:0.593   3rd Qu.: 99.2   3rd Qu.: 41.5   3rd Qu.:13.2   3rd Qu.:0.1040  
 Max.   :0.641   Max.   :107.1   Max.   :168.0   Max.   :42.3   Max.   :0.1420  
       U2           Wealth          Ineq           Prob             Time     
 Min.   :2.00   Min.   :2880   Min.   :12.6   Min.   :0.0069   Min.   :12.2  
 1st Qu.:2.75   1st Qu.:4595   1st Qu.:16.6   1st Qu.:0.0327   1st Qu.:21.6  
 Median :3.40   Median :5370   Median :17.6   Median :0.0421   Median :25.8  
 Mean   :3.40   Mean   :5254   Mean   :19.4   Mean   :0.0471   Mean   :26.6  
 3rd Qu.:3.85   3rd Qu.:5915   3rd Qu.:22.8   3rd Qu.:0.0544   3rd Qu.:30.5  
 Max.   :5.80   Max.   :6890   Max.   :27.6   Max.   :0.1198   Max.   :44.0  
     Crime     
 Min.   : 342  
 1st Qu.: 658  
 Median : 831  
 Mean   : 905  
 3rd Qu.:1058  
 Max.   :1993  
#build a Random Forest model
rf <- randomForest(Crime ~ ., data=train)
summary(rf)
                Length Class  Mode     
call              3    -none- call     
type              1    -none- character
predicted        35    -none- numeric  
mse             500    -none- numeric  
rsq             500    -none- numeric  
oob.times        35    -none- numeric  
importance       15    -none- numeric  
importanceSD      0    -none- NULL     
localImportance   0    -none- NULL     
proximity         0    -none- NULL     
ntree             1    -none- numeric  
mtry              1    -none- numeric  
forest           11    -none- list     
coefs             0    -none- NULL     
y                35    -none- numeric  
test              0    -none- NULL     
inbag             0    -none- NULL     
terms             3    terms  call     
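# summary() only lists the components of a randomForest object; for
# interpretation, variable importance is more useful (a sketch reusing rf):
importance(rf)   # IncNodePurity for each predictor
varImpPlot(rf)   # visual ranking of the predictors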
#Compare the predictions from the regression tree and the random forest
regression_tree_result <- predict(crime_tree_1, test)
#   5    7    9   13   26   30   31   32   33   38   44   47 
# 887 1305  800  800 1305  800  550 1305  800  550  887  887  
mse(test$Crime,regression_tree_result)
[1] 111301
#MSE of regression tree is 111301

rf_result <- predict(rf, test)
# 5    7    9   13   26   30   31   32   33   38   44   47 
# 992 1006  783  667 1240  774  726 1099  762  666 1077  962 

mse(test$Crime,rf_result)
[1] 82609
#MSE of the random forest is 82609, which is lower than the regression tree's 111301.
#The random forest performs better; however, it lacks
#the interpretability of a single regression tree.

Question 10.2
Describe a situation or problem from your job, everyday life, current events, etc., for which a logistic regression model would be appropriate. List some (up to 5) predictors that you might use.

For a credit card marketing campaign, logistic regression can be used to assign each potential prospect a probability of responding to an offer. Some predictors are:
1. income
2. industry
3. number of times they visit the company’s website
4. if they made purchases with this company before, how much they spent
5. their credit score

Question 10.3
1. Using the GermanCredit data set germancredit.txt from http://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/ (description at http://archive.ics.uci.edu/ml/datasets/Statlog+%28German+Credit+Data%29 ), use logistic regression to find a good predictive model for whether credit applicants are good credit risks or not.

Show your model (factors used and their coefficients), the software output, and the quality of fit. You can use the glm function in R. To get a logistic regression (logit) model on data where the response is either zero or one, use family=binomial(link=”logit”) in your glm function call.

Data_credit = read.csv("germancredit.txt",sep = "", header=FALSE)
dim(Data_credit)
[1] 1000   21
summary(Data_credit)
      V1                  V2            V3                 V4           
 Length:1000        Min.   : 4.0   Length:1000        Length:1000       
 Class :character   1st Qu.:12.0   Class :character   Class :character  
 Mode  :character   Median :18.0   Mode  :character   Mode  :character  
                    Mean   :20.9                                        
                    3rd Qu.:24.0                                        
                    Max.   :72.0                                        
       V5             V6                 V7                  V8      
 Min.   :  250   Length:1000        Length:1000        Min.   :1.00  
 1st Qu.: 1366   Class :character   Class :character   1st Qu.:2.00  
 Median : 2320   Mode  :character   Mode  :character   Median :3.00  
 Mean   : 3271                                         Mean   :2.97  
 3rd Qu.: 3972                                         3rd Qu.:4.00  
 Max.   :18424                                         Max.   :4.00  
      V9                V10                 V11           V12           
 Length:1000        Length:1000        Min.   :1.00   Length:1000       
 Class :character   Class :character   1st Qu.:2.00   Class :character  
 Mode  :character   Mode  :character   Median :3.00   Mode  :character  
                                       Mean   :2.85                     
                                       3rd Qu.:4.00                     
                                       Max.   :4.00                     
      V13           V14                V15                 V16      
 Min.   :19.0   Length:1000        Length:1000        Min.   :1.00  
 1st Qu.:27.0   Class :character   Class :character   1st Qu.:1.00  
 Median :33.0   Mode  :character   Mode  :character   Median :1.00  
 Mean   :35.5                                         Mean   :1.41  
 3rd Qu.:42.0                                         3rd Qu.:2.00  
 Max.   :75.0                                         Max.   :4.00  
     V17                 V18           V19                V20           
 Length:1000        Min.   :1.00   Length:1000        Length:1000       
 Class :character   1st Qu.:1.00   Class :character   Class :character  
 Mode  :character   Median :1.00   Mode  :character   Mode  :character  
                    Mean   :1.16                                        
                    3rd Qu.:1.00                                        
                    Max.   :2.00                                        
      V21     
 Min.   :1.0  
 1st Qu.:1.0  
 Median :1.0  
 Mean   :1.3  
 3rd Qu.:2.0  
 Max.   :2.0  
table(Data_credit$V21) #70% are good and 30% are bad

  1   2 
700 300 
# 1 = Good, 2 = Bad
Data_credit$V21
   [1] 1 2 1 1 2 1 1 1 1 2 2 2 1 2 1 2 1 1 2 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 2 1 2 1 1
  [41] 1 1 1 1 2 1 1 1 1 1 1 1 1 1 2 1 2 1 1 2 1 1 2 2 1 1 1 1 2 1 1 1 1 1 2 1 2 1 1 1
  [81] 2 1 1 1 1 1 1 2 1 2 1 1 2 1 1 2 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 2 1 1 2 1 2 1
 [121] 2 1 1 1 2 1 1 2 1 2 1 2 1 1 1 1 1 2 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1
 [161] 1 1 1 1 1 1 2 1 1 2 2 1 2 1 2 2 1 1 1 1 2 2 2 1 2 1 2 1 2 1 2 2 2 1 2 2 1 2 1 2
 [201] 1 1 1 2 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 1 2 1 1 1 1 2 2 2 1 1
 [241] 2 1 2 1 1 1 1 1 1 2 1 1 2 1 1 1 1 2 1 1 1 1 1 1 1 2 1 1 2 1 1 1 1 2 2 1 1 1 2 1
 [281] 1 1 1 1 1 1 1 1 1 2 1 2 1 1 1 2 1 1 1 1 1 2 2 1 2 1 1 2 2 1 1 1 1 2 1 2 1 1 1 1
 [321] 2 2 1 1 1 1 1 1 1 1 1 2 2 2 2 2 1 2 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 2 1 2 1 2 1 2
 [361] 1 1 1 1 2 1 1 1 2 1 1 1 1 1 2 2 1 1 2 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1
 [401] 1 1 2 1 1 2 1 1 1 2 1 1 2 1 2 1 2 1 1 2 1 1 1 1 2 1 1 1 1 2 1 2 1 1 1 2 1 1 1 2
 [441] 1 1 1 2 2 1 2 1 1 2 1 1 1 1 2 1 1 2 1 1 1 1 1 1 1 1 2 1 1 1 2 2 2 1 2 2 1 1 1 1
 [481] 1 1 1 1 1 2 1 1 1 1 1 2 1 1 1 2 2 1 1 1 2 1 1 2 2 2 1 2 1 1 2 1 1 1 1 1 1 2 1 1
 [521] 1 2 2 1 1 1 1 1 2 1 1 2 1 1 1 2 1 1 2 1 2 1 2 2 1 2 1 1 2 1 1 1 2 1 1 2 2 2 2 2
 [561] 1 2 1 2 1 1 2 1 1 2 2 1 1 1 1 1 1 1 2 1 2 1 1 2 1 2 1 1 2 2 1 1 1 2 2 2 2 2 2 1
 [601] 1 2 2 2 1 1 1 2 1 1 2 2 1 1 2 1 1 1 2 1 1 2 2 1 2 1 1 2 1 1 1 2 1 2 2 1 1 1 1 2
 [641] 2 1 2 1 1 2 1 2 2 2 1 2 2 2 1 1 2 1 1 1 1 2 1 1 1 1 1 1 2 1 1 1 1 1 2 1 1 2 1 1
 [681] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 2 2 1 1 1 2 1 1 2 1 1 1 1 1
 [721] 2 2 2 1 2 1 1 2 2 1 1 2 1 1 1 1 2 1 1 2 1 1 1 1 1 1 1 2 1 1 1 2 1 1 2 2 1 2 1 2
 [761] 1 2 1 2 1 1 2 1 1 1 1 2 1 1 1 2 1 1 1 1 2 1 1 2 1 1 1 1 2 2 2 1 1 1 1 1 2 1 1 1
 [801] 1 1 1 1 1 2 1 1 1 2 1 1 2 2 2 1 1 1 1 2 1 1 2 1 1 1 2 2 2 1 1 2 2 1 2 2 1 1 1 1
 [841] 2 1 2 1 1 1 2 1 1 2 2 1 1 2 1 1 1 1 2 1 1 2 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1
 [881] 1 1 1 1 2 2 1 2 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 2 1 1 2 2 1 2 2 2
 [921] 1 1 2 1 2 2 1 2 1 1 1 2 1 1 1 2 2 1 2 1 1 1 1 1 1 1 2 1 2 2 1 2 2 2 1 1 1 1 2 1
 [961] 1 1 1 2 1 1 2 1 1 1 1 1 2 2 1 1 1 1 2 2 2 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1
Data_credit$V21[Data_credit$V21==1] <- 0
Data_credit$V21[Data_credit$V21==2] <- 1

# Split data into training and testing
smp_siz = floor(0.75*nrow(Data_credit)) 
smp_siz
[1] 750
set.seed(123)
train_ind = sample(seq_len(nrow(Data_credit)),size=smp_siz)
train = Data_credit[train_ind,]
test = Data_credit[-train_ind,]

# Train a logistic Model
creditLogitModel <- glm(V21 ~ ., data=train,family=binomial(link="logit"))
summary(creditLogitModel)

Call:
glm(formula = V21 ~ ., family = binomial(link = "logit"), data = train)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-2.199  -0.711  -0.372   0.743   2.832  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  4.41e-01   1.40e+00    0.31  0.75281    
V1A12       -3.13e-01   2.51e-01   -1.24  0.21331    
V1A13       -1.00e+00   4.12e-01   -2.43  0.01530 *  
V1A14       -1.81e+00   2.76e-01   -6.56  5.3e-11 ***
V2           2.65e-02   1.06e-02    2.50  0.01247 *  
V3A31       -9.82e-02   6.55e-01   -0.15  0.88088    
V3A32       -7.09e-01   4.84e-01   -1.47  0.14276    
V3A33       -7.76e-01   5.24e-01   -1.48  0.13901    
V3A34       -1.42e+00   4.94e-01   -2.86  0.00418 ** 
V4A41       -1.51e+00   4.42e-01   -3.42  0.00062 ***
V4A410      -1.46e+00   8.60e-01   -1.70  0.08848 .  
V4A42       -6.19e-01   3.02e-01   -2.05  0.04039 *  
V4A43       -7.62e-01   2.90e-01   -2.63  0.00858 ** 
V4A44       -5.40e-01   9.70e-01   -0.56  0.57778    
V4A45       -2.94e-01   6.23e-01   -0.47  0.63728    
V4A46        1.89e-01   4.50e-01    0.42  0.67533    
V4A48       -1.88e+00   1.31e+00   -1.44  0.14935    
V4A49       -6.74e-01   3.86e-01   -1.75  0.08096 .  
V5           1.10e-04   4.97e-05    2.21  0.02694 *  
V6A62       -4.57e-01   3.37e-01   -1.35  0.17550    
V6A63       -1.56e-01   4.66e-01   -0.33  0.73763    
V6A64       -1.50e+00   6.49e-01   -2.31  0.02115 *  
V6A65       -9.78e-01   3.12e-01   -3.14  0.00172 ** 
V7A72       -1.01e-01   5.18e-01   -0.19  0.84593    
V7A73       -1.76e-01   4.89e-01   -0.36  0.71867    
V7A74       -8.08e-01   5.26e-01   -1.54  0.12426    
V7A75       -9.55e-02   4.88e-01   -0.20  0.84470    
V8           3.27e-01   1.02e-01    3.21  0.00134 ** 
V9A92       -1.74e-01   4.41e-01   -0.40  0.69269    
V9A93       -9.09e-01   4.32e-01   -2.10  0.03552 *  
V9A94       -3.44e-01   5.27e-01   -0.65  0.51356    
V10A102      3.01e-01   4.42e-01    0.68  0.49646    
V10A103     -1.11e+00   5.22e-01   -2.13  0.03332 *  
V11         -6.20e-02   1.01e-01   -0.61  0.54076    
V12A122      7.51e-02   2.99e-01    0.25  0.80170    
V12A123      1.31e-01   2.71e-01    0.48  0.63001    
V12A124      2.63e-01   5.23e-01    0.50  0.61521    
V13         -1.83e-02   1.09e-02   -1.68  0.09333 .  
V14A142     -2.25e-02   4.50e-01   -0.05  0.96011    
V14A143     -7.14e-01   2.81e-01   -2.54  0.01108 *  
V15A152     -3.61e-01   2.86e-01   -1.26  0.20727    
V15A153     -2.33e-01   5.70e-01   -0.41  0.68294    
V16          2.18e-01   2.09e-01    1.04  0.29703    
V17A172      1.05e+00   1.00e+00    1.05  0.29546    
V17A173      9.36e-01   9.82e-01    0.95  0.34038    
V17A174      9.46e-01   9.90e-01    0.96  0.33937    
V18          2.92e-01   2.91e-01    1.00  0.31566    
V19A192     -1.67e-01   2.33e-01   -0.72  0.47373    
V20A202     -1.37e+00   7.27e-01   -1.89  0.05936 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 916.30  on 749  degrees of freedom
Residual deviance: 674.06  on 701  degrees of freedom
AIC: 772.1

Number of Fisher Scoring iterations: 5
#Predict the probability of being a bad credit risk (V21 = 1) for the test set
creditPredict <- predict(creditLogitModel,newdata = test[,-21],type="response")
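
# To put a number on the quality of fit (a sketch; pROC is an assumption here,
# it is not used elsewhere in this notebook):
1 - creditLogitModel$deviance / creditLogitModel$null.deviance   # McFadden pseudo-R^2: 1 - 674.06/916.30, about 0.26
library(pROC)
roc(test$V21, creditPredict)$auc   # test-set area under the ROC curve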

#Confusion matrix at the default 0.5 threshold. Note the rows are inverted here:
#round(creditPredict) < 0.5 makes TRUE mean "predicted good (0)" and FALSE mean
#"predicted bad (1)", while the columns are the observed classes.
cm_1 <- table(round(creditPredict)<0.5,test$V21)
cm_1
       
          0   1
  FALSE  22  33
  TRUE  153  42
accuracy_1 <- (cm_1[1,1]+cm_1[2,2])/sum(cm_1)
#0.256 -- but with the inverted rows this is really the misclassification rate;
#the true accuracy at a 0.5 threshold is (153+33)/250 = 0.744
accuracy_1
[1] 0.256
#Re-tabulate with rows oriented to match the observed classes. Note that
#round(creditPredict) > threshold still applies an effective 0.5 cutoff,
#because round() happens before the comparison; the gain below comes from
#the corrected orientation, not from the 0.7 value.
threshold <- 0.7
cm_2 <- table(round(creditPredict) > threshold,test$V21)
names(dimnames(cm_2)) <- c("Predicted", "Observed")
cm_2
         Observed
Predicted   0   1
    FALSE 153  42
    TRUE   22  33
accuracy_2 <- (cm_2[1,1]+cm_2[2,2])/sum(cm_2)
#0.744 -- the jump from 0.256 reflects the corrected row orientation, not a
#real threshold change (both matrices use an effective 0.5 cutoff)
accuracy_2
[1] 0.744
  2. Because the model gives a result between 0 and 1, it requires setting a threshold probability to separate between “good” and “bad” answers. In this data set, they estimate that incorrectly identifying a bad customer as good is 5 times worse than incorrectly classifying a good customer as bad. Determine a good threshold probability based on your model.
# Misclassifying a truly bad customer (1) as good (0) costs $5, while
# misclassifying a good customer (0) as bad (1) costs $1.

# Cost at an example threshold of 0.7. Here round(creditPredict > threshold)
# does apply the 0.7 cutoff, since the comparison happens before round().
threshold <- 0.7
cm_2 <- table(round(creditPredict > threshold),test$V21)
total_cost <- 5*cm_2[1,2] + 1*cm_2[2,1]
total_cost
[1] 300
# Total cost on the test set as a function of the threshold X
cost_threshold = function(X){
  cm <- table(round(creditPredict > X),test$V21)
  total_cost <- 5*cm[1,2] + 1*cm[2,1]   # $5 per bad predicted good, $1 per good predicted bad
  return(total_cost)
}

# Try 17 thresholds from 0.1 to 0.9 in steps of 0.05
threshold_li <- seq(from = 0.1, to = 0.9, by = 0.05)
cost <- rep(0, length(threshold_li))

for (x in seq_along(threshold_li)){
  cost[x] <- cost_threshold(threshold_li[x])
}
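
# A quick plot makes the cost-vs-threshold trade-off visible (a sketch
# reusing threshold_li and cost from above):
plot(threshold_li, cost, type = "b", xlab = "Threshold", ylab = "Total cost")
abline(v = threshold_li[which.min(cost)], lty = 2)   # mark the cheapest threshold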

cm <- table(round(creditPredict > threshold_li[which.min(cost)]),test$V21)
cm
   
     0  1
  0 92  4
  1 83 71
# Threshold 0.15 gives the lowest cost of all the thresholds tried.
# This makes sense: wrongly approving a truly bad account costs five times
# as much as wrongly flagging a good one, so the threshold for labeling an
# account bad ("1") is set very low, catching 71 of the 75 bad accounts in
# the test set at the price of flagging 83 good ones.