Homework3

# na.strings = "" will treat empty string as NA
loan_data <- read.csv('https://raw.githubusercontent.com/metis-macys-66898/data622_fa2021/main/hw3/data/Loan_approval.csv', header = TRUE, na.strings = "")

# Without na.strings, empty strings in text columns are kept as "" rather than NA
loan_raw <- read.csv('https://raw.githubusercontent.com/metis-macys-66898/data622_fa2021/main/hw3/data/Loan_approval.csv', header = TRUE)
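The effect of `na.strings = ""` is easiest to see on a small example. The sketch below is illustrative only: it uses three rows and four columns copied from the loan table, written as inline CSV text so it runs without a network connection. The names `csv_text`, `with_na`, and `raw` are made up for this example and are not part of the original analysis.

```r
# Three rows from the loan table (subset of columns), with blanks written as empty fields.
csv_text <- "Loan_ID,Gender,Self_Employed,LoanAmount
LP001002,Male,No,
LP001027,Male,,109
LP001050,,No,112"

# Same call pattern as loan_data: empty strings become NA in every column.
with_na <- read.csv(text = csv_text, header = TRUE, na.strings = "")

# Same call pattern as loan_raw: text columns keep "" as a value.
raw <- read.csv(text = csv_text, header = TRUE)

# NA counts per column under each reading
colSums(is.na(with_na))
colSums(is.na(raw))

# Blanks that survive as empty strings in the raw version
colSums(raw == "", na.rm = TRUE)
```

In the raw version only the numeric column reports a missing value, because `read.csv` always treats blank numeric fields as NA; the blanks in the text columns would go unnoticed by `is.na()`.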

1. Exploratory Data Analysis

Our loan data has 614 rows and 13 columns: 8 categorical and 5 numerical. The target variable is Loan_Status, which is either Y (yes) or N (no) and indicates whether the applicant’s loan was approved. Seven variables have blank values; Credit_History has the most, with 50 blanks.
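Figures like these can be checked with a short column profile. The following is a minimal sketch, not the code used for this report: `profile_columns` is a hypothetical helper, and `sample_rows` holds four rows copied from the loan table with a subset of columns so the example runs on its own.

```r
# Hypothetical helper: one row per column with its class and NA count,
# sorted so the column with the most missing values comes first.
profile_columns <- function(df) {
  out <- data.frame(
    column    = names(df),
    class     = vapply(df, function(x) class(x)[1], character(1)),
    n_missing = vapply(df, function(x) sum(is.na(x)), integer(1)),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
  out[order(-out$n_missing), ]
}

# Four rows from the loan table (subset of columns); blanks are empty fields.
sample_rows <- read.csv(text = "Loan_ID,Gender,LoanAmount,Credit_History,Loan_Status
LP001002,Male,,1,Y
LP001034,Male,100,,Y
LP001050,,112,0,N
LP001052,Male,151,,N", na.strings = "")

dim(sample_rows)                                  # rows, columns
profile_columns(sample_rows)                      # class and missing count per column
table(sample_rows$Loan_Status, useNA = "ifany")   # distribution of the target
```

Calling `dim(loan_data)` and `profile_columns(loan_data)` on the full data set should give the row and column counts and the per-variable blank counts summarized above.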

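# kableExtra provides kbl(), kable_styling() and scroll_box(), and re-exports %>%
library(kableExtra)
loan_data %>% kbl() %>% kable_styling() %>% scroll_box(width = "750px", height = "250px")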
Loan_ID Gender Married Dependents Education Self_Employed ApplicantIncome CoapplicantIncome LoanAmount Loan_Amount_Term Credit_History Property_Area Loan_Status
LP001002 Male No 0 Graduate No 5849 0.00 NA 360 1 Urban Y
LP001003 Male Yes 1 Graduate No 4583 1508.00 128 360 1 Rural N
LP001005 Male Yes 0 Graduate Yes 3000 0.00 66 360 1 Urban Y
LP001006 Male Yes 0 Not Graduate No 2583 2358.00 120 360 1 Urban Y
LP001008 Male No 0 Graduate No 6000 0.00 141 360 1 Urban Y
LP001011 Male Yes 2 Graduate Yes 5417 4196.00 267 360 1 Urban Y
LP001013 Male Yes 0 Not Graduate No 2333 1516.00 95 360 1 Urban Y
LP001014 Male Yes 3+ Graduate No 3036 2504.00 158 360 0 Semiurban N
LP001018 Male Yes 2 Graduate No 4006 1526.00 168 360 1 Urban Y
LP001020 Male Yes 1 Graduate No 12841 10968.00 349 360 1 Semiurban N
LP001024 Male Yes 2 Graduate No 3200 700.00 70 360 1 Urban Y
LP001027 Male Yes 2 Graduate NA 2500 1840.00 109 360 1 Urban Y
LP001028 Male Yes 2 Graduate No 3073 8106.00 200 360 1 Urban Y
LP001029 Male No 0 Graduate No 1853 2840.00 114 360 1 Rural N
LP001030 Male Yes 2 Graduate No 1299 1086.00 17 120 1 Urban Y
LP001032 Male No 0 Graduate No 4950 0.00 125 360 1 Urban Y
LP001034 Male No 1 Not Graduate No 3596 0.00 100 240 NA Urban Y
LP001036 Female No 0 Graduate No 3510 0.00 76 360 0 Urban N
LP001038 Male Yes 0 Not Graduate No 4887 0.00 133 360 1 Rural N
LP001041 Male Yes 0 Graduate NA 2600 3500.00 115 NA 1 Urban Y
LP001043 Male Yes 0 Not Graduate No 7660 0.00 104 360 0 Urban N
LP001046 Male Yes 1 Graduate No 5955 5625.00 315 360 1 Urban Y
LP001047 Male Yes 0 Not Graduate No 2600 1911.00 116 360 0 Semiurban N
LP001050 NA Yes 2 Not Graduate No 3365 1917.00 112 360 0 Rural N
LP001052 Male Yes 1 Graduate NA 3717 2925.00 151 360 NA Semiurban N
LP001066 Male Yes 0 Graduate Yes 9560 0.00 191 360 1 Semiurban Y
LP001068 Male Yes 0 Graduate No 2799 2253.00 122 360 1 Semiurban Y
LP001073 Male Yes 2 Not Graduate No 4226 1040.00 110 360 1 Urban Y
LP001086 Male No 0 Not Graduate No 1442 0.00 35 360 1 Urban N
LP001087 Female No 2 Graduate NA 3750 2083.00 120 360 1 Semiurban Y
LP001091 Male Yes 1 Graduate NA 4166 3369.00 201 360 NA Urban N
LP001095 Male No 0 Graduate No 3167 0.00 74 360 1 Urban N
LP001097 Male No 1 Graduate Yes 4692 0.00 106 360 1 Rural N
LP001098 Male Yes 0 Graduate No 3500 1667.00 114 360 1 Semiurban Y
LP001100 Male No 3+ Graduate No 12500 3000.00 320 360 1 Rural N
LP001106 Male Yes 0 Graduate No 2275 2067.00 NA 360 1 Urban Y
LP001109 Male Yes 0 Graduate No 1828 1330.00 100 NA 0 Urban N
LP001112 Female Yes 0 Graduate No 3667 1459.00 144 360 1 Semiurban Y
LP001114 Male No 0 Graduate No 4166 7210.00 184 360 1 Urban Y
LP001116 Male No 0 Not Graduate No 3748 1668.00 110 360 1 Semiurban Y
LP001119 Male No 0 Graduate No 3600 0.00 80 360 1 Urban N
LP001120 Male No 0 Graduate No 1800 1213.00 47 360 1 Urban Y
LP001123 Male Yes 0 Graduate No 2400 0.00 75 360 NA Urban Y
LP001131 Male Yes 0 Graduate No 3941 2336.00 134 360 1 Semiurban Y
LP001136 Male Yes 0 Not Graduate Yes 4695 0.00 96 NA 1 Urban Y
LP001137 Female No 0 Graduate No 3410 0.00 88 NA 1 Urban Y
LP001138 Male Yes 1 Graduate No 5649 0.00 44 360 1 Urban Y
LP001144 Male Yes 0 Graduate No 5821 0.00 144 360 1 Urban Y
LP001146 Female Yes 0 Graduate No 2645 3440.00 120 360 0 Urban N
LP001151 Female No 0 Graduate No 4000 2275.00 144 360 1 Semiurban Y
LP001155 Female Yes 0 Not Graduate No 1928 1644.00 100 360 1 Semiurban Y
LP001157 Female No 0 Graduate No 3086 0.00 120 360 1 Semiurban Y
LP001164 Female No 0 Graduate No 4230 0.00 112 360 1 Semiurban N
LP001179 Male Yes 2 Graduate No 4616 0.00 134 360 1 Urban N
LP001186 Female Yes 1 Graduate Yes 11500 0.00 286 360 0 Urban N
LP001194 Male Yes 2 Graduate No 2708 1167.00 97 360 1 Semiurban Y
LP001195 Male Yes 0 Graduate No 2132 1591.00 96 360 1 Semiurban Y
LP001197 Male Yes 0 Graduate No 3366 2200.00 135 360 1 Rural N
LP001198 Male Yes 1 Graduate No 8080 2250.00 180 360 1 Urban Y
LP001199 Male Yes 2 Not Graduate No 3357 2859.00 144 360 1 Urban Y
LP001205 Male Yes 0 Graduate No 2500 3796.00 120 360 1 Urban Y
LP001206 Male Yes 3+ Graduate No 3029 0.00 99 360 1 Urban Y
LP001207 Male Yes 0 Not Graduate Yes 2609 3449.00 165 180 0 Rural N
LP001213 Male Yes 1 Graduate No 4945 0.00 NA 360 0 Rural N
LP001222 Female No 0 Graduate No 4166 0.00 116 360 0 Semiurban N
LP001225 Male Yes 0 Graduate No 5726 4595.00 258 360 1 Semiurban N
LP001228 Male No 0 Not Graduate No 3200 2254.00 126 180 0 Urban N
LP001233 Male Yes 1 Graduate No 10750 0.00 312 360 1 Urban Y
LP001238 Male Yes 3+ Not Graduate Yes 7100 0.00 125 60 1 Urban Y
LP001241 Female No 0 Graduate No 4300 0.00 136 360 0 Semiurban N
LP001243 Male Yes 0 Graduate No 3208 3066.00 172 360 1 Urban Y
LP001245 Male Yes 2 Not Graduate Yes 1875 1875.00 97 360 1 Semiurban Y
LP001248 Male No 0 Graduate No 3500 0.00 81 300 1 Semiurban Y
LP001250 Male Yes 3+ Not Graduate No 4755 0.00 95 NA 0 Semiurban N
LP001253 Male Yes 3+ Graduate Yes 5266 1774.00 187 360 1 Semiurban Y
LP001255 Male No 0 Graduate No 3750 0.00 113 480 1 Urban N
LP001256 Male No 0 Graduate No 3750 4750.00 176 360 1 Urban N
LP001259 Male Yes 1 Graduate Yes 1000 3022.00 110 360 1 Urban N
LP001263 Male Yes 3+ Graduate No 3167 4000.00 180 300 0 Semiurban N
LP001264 Male Yes 3+ Not Graduate Yes 3333 2166.00 130 360 NA Semiurban Y
LP001265 Female No 0 Graduate No 3846 0.00 111 360 1 Semiurban Y
LP001266 Male Yes 1 Graduate Yes 2395 0.00 NA 360 1 Semiurban Y
LP001267 Female Yes 2 Graduate No 1378 1881.00 167 360 1 Urban N
LP001273 Male Yes 0 Graduate No 6000 2250.00 265 360 NA Semiurban N
LP001275 Male Yes 1 Graduate No 3988 0.00 50 240 1 Urban Y
LP001279 Male No 0 Graduate No 2366 2531.00 136 360 1 Semiurban Y
LP001280 Male Yes 2 Not Graduate No 3333 2000.00 99 360 NA Semiurban Y
LP001282 Male Yes 0 Graduate No 2500 2118.00 104 360 1 Semiurban Y
LP001289 Male No 0 Graduate No 8566 0.00 210 360 1 Urban Y
LP001310 Male Yes 0 Graduate No 5695 4167.00 175 360 1 Semiurban Y
LP001316 Male Yes 0 Graduate No 2958 2900.00 131 360 1 Semiurban Y
LP001318 Male Yes 2 Graduate No 6250 5654.00 188 180 1 Semiurban Y
LP001319 Male Yes 2 Not Graduate No 3273 1820.00 81 360 1 Urban Y
LP001322 Male No 0 Graduate No 4133 0.00 122 360 1 Semiurban Y
LP001325 Male No 0 Not Graduate No 3620 0.00 25 120 1 Semiurban Y
LP001326 Male No 0 Graduate NA 6782 0.00 NA 360 NA Urban N
LP001327 Female Yes 0 Graduate No 2484 2302.00 137 360 1 Semiurban Y
LP001333 Male Yes 0 Graduate No 1977 997.00 50 360 1 Semiurban Y
LP001334 Male Yes 0 Not Graduate No 4188 0.00 115 180 1 Semiurban Y
LP001343 Male Yes 0 Graduate No 1759 3541.00 131 360 1 Semiurban Y
LP001345 Male Yes 2 Not Graduate No 4288 3263.00 133 180 1 Urban Y
LP001349 Male No 0 Graduate No 4843 3806.00 151 360 1 Semiurban Y
LP001350 Male Yes NA Graduate No 13650 0.00 NA 360 1 Urban Y
LP001356 Male Yes 0 Graduate No 4652 3583.00 NA 360 1 Semiurban Y
LP001357 Male NA NA Graduate No 3816 754.00 160 360 1 Urban Y
LP001367 Male Yes 1 Graduate No 3052 1030.00 100 360 1 Urban Y
LP001369 Male Yes 2 Graduate No 11417 1126.00 225 360 1 Urban Y
LP001370 Male No 0 Not Graduate NA 7333 0.00 120 360 1 Rural N
LP001379 Male Yes 2 Graduate No 3800 3600.00 216 360 0 Urban N
LP001384 Male Yes 3+ Not Graduate No 2071 754.00 94 480 1 Semiurban Y
LP001385 Male No 0 Graduate No 5316 0.00 136 360 1 Urban Y
LP001387 Female Yes 0 Graduate NA 2929 2333.00 139 360 1 Semiurban Y
LP001391 Male Yes 0 Not Graduate No 3572 4114.00 152 NA 0 Rural N
LP001392 Female No 1 Graduate Yes 7451 0.00 NA 360 1 Semiurban Y
LP001398 Male No 0 Graduate NA 5050 0.00 118 360 1 Semiurban Y
LP001401 Male Yes 1 Graduate No 14583 0.00 185 180 1 Rural Y
LP001404 Female Yes 0 Graduate No 3167 2283.00 154 360 1 Semiurban Y
LP001405 Male Yes 1 Graduate No 2214 1398.00 85 360 NA Urban Y
LP001421 Male Yes 0 Graduate No 5568 2142.00 175 360 1 Rural N
LP001422 Female No 0 Graduate No 10408 0.00 259 360 1 Urban Y
LP001426 Male Yes NA Graduate No 5667 2667.00 180 360 1 Rural Y
LP001430 Female No 0 Graduate No 4166 0.00 44 360 1 Semiurban Y
LP001431 Female No 0 Graduate No 2137 8980.00 137 360 0 Semiurban Y
LP001432 Male Yes 2 Graduate No 2957 0.00 81 360 1 Semiurban Y
LP001439 Male Yes 0 Not Graduate No 4300 2014.00 194 360 1 Rural Y
LP001443 Female No 0 Graduate No 3692 0.00 93 360 NA Rural Y
LP001448 NA Yes 3+ Graduate No 23803 0.00 370 360 1 Rural Y
LP001449 Male No 0 Graduate No 3865 1640.00 NA 360 1 Rural Y
LP001451 Male Yes 1 Graduate Yes 10513 3850.00 160 180 0 Urban N
LP001465 Male Yes 0 Graduate No 6080 2569.00 182 360 NA Rural N
LP001469 Male No 0 Graduate Yes 20166 0.00 650 480 NA Urban Y
LP001473 Male No 0 Graduate No 2014 1929.00 74 360 1 Urban Y
LP001478 Male No 0 Graduate No 2718 0.00 70 360 1 Semiurban Y
LP001482 Male Yes 0 Graduate Yes 3459 0.00 25 120 1 Semiurban Y
LP001487 Male No 0 Graduate No 4895 0.00 102 360 1 Semiurban Y
LP001488 Male Yes 3+ Graduate No 4000 7750.00 290 360 1 Semiurban N
LP001489 Female Yes 0 Graduate No 4583 0.00 84 360 1 Rural N
LP001491 Male Yes 2 Graduate Yes 3316 3500.00 88 360 1 Urban Y
LP001492 Male No 0 Graduate No 14999 0.00 242 360 0 Semiurban N
LP001493 Male Yes 2 Not Graduate No 4200 1430.00 129 360 1 Rural N
LP001497 Male Yes 2 Graduate No 5042 2083.00 185 360 1 Rural N
LP001498 Male No 0 Graduate No 5417 0.00 168 360 1 Urban Y
LP001504 Male No 0 Graduate Yes 6950 0.00 175 180 1 Semiurban Y
LP001507 Male Yes 0 Graduate No 2698 2034.00 122 360 1 Semiurban Y
LP001508 Male Yes 2 Graduate No 11757 0.00 187 180 1 Urban Y
LP001514 Female Yes 0 Graduate No 2330 4486.00 100 360 1 Semiurban Y
LP001516 Female Yes 2 Graduate No 14866 0.00 70 360 1 Urban Y
LP001518 Male Yes 1 Graduate No 1538 1425.00 30 360 1 Urban Y
LP001519 Female No 0 Graduate No 10000 1666.00 225 360 1 Rural N
LP001520 Male Yes 0 Graduate No 4860 830.00 125 360 1 Semiurban Y
LP001528 Male No 0 Graduate No 6277 0.00 118 360 0 Rural N
LP001529 Male Yes 0 Graduate Yes 2577 3750.00 152 360 1 Rural Y
LP001531 Male No 0 Graduate No 9166 0.00 244 360 1 Urban N
LP001532 Male Yes 2 Not Graduate No 2281 0.00 113 360 1 Rural N
LP001535 Male No 0 Graduate No 3254 0.00 50 360 1 Urban Y
LP001536 Male Yes 3+ Graduate No 39999 0.00 600 180 0 Semiurban Y
LP001541 Male Yes 1 Graduate No 6000 0.00 160 360 NA Rural Y
LP001543 Male Yes 1 Graduate No 9538 0.00 187 360 1 Urban Y
LP001546 Male No 0 Graduate NA 2980 2083.00 120 360 1 Rural Y
LP001552 Male Yes 0 Graduate No 4583 5625.00 255 360 1 Semiurban Y
LP001560 Male Yes 0 Not Graduate No 1863 1041.00 98 360 1 Semiurban Y
LP001562 Male Yes 0 Graduate No 7933 0.00 275 360 1 Urban N
LP001565 Male Yes 1 Graduate No 3089 1280.00 121 360 0 Semiurban N
LP001570 Male Yes 2 Graduate No 4167 1447.00 158 360 1 Rural Y
LP001572 Male Yes 0 Graduate No 9323 0.00 75 180 1 Urban Y
LP001574 Male Yes 0 Graduate No 3707 3166.00 182 NA 1 Rural Y
LP001577 Female Yes 0 Graduate No 4583 0.00 112 360 1 Rural N
LP001578 Male Yes 0 Graduate No 2439 3333.00 129 360 1 Rural Y
LP001579 Male No 0 Graduate No 2237 0.00 63 480 0 Semiurban N
LP001580 Male Yes 2 Graduate No 8000 0.00 200 360 1 Semiurban Y
LP001581 Male Yes 0 Not Graduate NA 1820 1769.00 95 360 1 Rural Y
LP001585 NA Yes 3+ Graduate No 51763 0.00 700 300 1 Urban Y
LP001586 Male Yes 3+ Not Graduate No 3522 0.00 81 180 1 Rural N
LP001594 Male Yes 0 Graduate No 5708 5625.00 187 360 1 Semiurban Y
LP001603 Male Yes 0 Not Graduate Yes 4344 736.00 87 360 1 Semiurban N
LP001606 Male Yes 0 Graduate No 3497 1964.00 116 360 1 Rural Y
LP001608 Male Yes 2 Graduate No 2045 1619.00 101 360 1 Rural Y
LP001610 Male Yes 3+ Graduate No 5516 11300.00 495 360 0 Semiurban N
LP001616 Male Yes 1 Graduate No 3750 0.00 116 360 1 Semiurban Y
LP001630 Male No 0 Not Graduate No 2333 1451.00 102 480 0 Urban N
LP001633 Male Yes 1 Graduate No 6400 7250.00 180 360 0 Urban N
LP001634 Male No 0 Graduate No 1916 5063.00 67 360 NA Rural N
LP001636 Male Yes 0 Graduate No 4600 0.00 73 180 1 Semiurban Y
LP001637 Male Yes 1 Graduate No 33846 0.00 260 360 1 Semiurban N
LP001639 Female Yes 0 Graduate No 3625 0.00 108 360 1 Semiurban Y
LP001640 Male Yes 0 Graduate Yes 39147 4750.00 120 360 1 Semiurban Y
LP001641 Male Yes 1 Graduate Yes 2178 0.00 66 300 0 Rural N
LP001643 Male Yes 0 Graduate No 2383 2138.00 58 360 NA Rural Y
LP001644 NA Yes 0 Graduate Yes 674 5296.00 168 360 1 Rural Y
LP001647 Male Yes 0 Graduate No 9328 0.00 188 180 1 Rural Y
LP001653 Male No 0 Not Graduate No 4885 0.00 48 360 1 Rural Y
LP001656 Male No 0 Graduate No 12000 0.00 164 360 1 Semiurban N
LP001657 Male Yes 0 Not Graduate No 6033 0.00 160 360 1 Urban N
LP001658 Male No 0 Graduate No 3858 0.00 76 360 1 Semiurban Y
LP001664 Male No 0 Graduate No 4191 0.00 120 360 1 Rural Y
LP001665 Male Yes 1 Graduate No 3125 2583.00 170 360 1 Semiurban N
LP001666 Male No 0 Graduate No 8333 3750.00 187 360 1 Rural Y
LP001669 Female No 0 Not Graduate No 1907 2365.00 120 NA 1 Urban Y
LP001671 Female Yes 0 Graduate No 3416 2816.00 113 360 NA Semiurban Y
LP001673 Male No 0 Graduate Yes 11000 0.00 83 360 1 Urban N
LP001674 Male Yes 1 Not Graduate No 2600 2500.00 90 360 1 Semiurban Y
LP001677 Male No 2 Graduate No 4923 0.00 166 360 0 Semiurban Y
LP001682 Male Yes 3+ Not Graduate No 3992 0.00 NA 180 1 Urban N
LP001688 Male Yes 1 Not Graduate No 3500 1083.00 135 360 1 Urban Y
LP001691 Male Yes 2 Not Graduate No 3917 0.00 124 360 1 Semiurban Y
LP001692 Female No 0 Not Graduate No 4408 0.00 120 360 1 Semiurban Y
LP001693 Female No 0 Graduate No 3244 0.00 80 360 1 Urban Y
LP001698 Male No 0 Not Graduate No 3975 2531.00 55 360 1 Rural Y
LP001699 Male No 0 Graduate No 2479 0.00 59 360 1 Urban Y
LP001702 Male No 0 Graduate No 3418 0.00 127 360 1 Semiurban N
LP001708 Female No 0 Graduate No 10000 0.00 214 360 1 Semiurban N
LP001711 Male Yes 3+ Graduate No 3430 1250.00 128 360 0 Semiurban N
LP001713 Male Yes 1 Graduate Yes 7787 0.00 240 360 1 Urban Y
LP001715 Male Yes 3+ Not Graduate Yes 5703 0.00 130 360 1 Rural Y
LP001716 Male Yes 0 Graduate No 3173 3021.00 137 360 1 Urban Y
LP001720 Male Yes 3+ Not Graduate No 3850 983.00 100 360 1 Semiurban Y
LP001722 Male Yes 0 Graduate No 150 1800.00 135 360 1 Rural N
LP001726 Male Yes 0 Graduate No 3727 1775.00 131 360 1 Semiurban Y
LP001732 Male Yes 2 Graduate NA 5000 0.00 72 360 0 Semiurban N
LP001734 Female Yes 2 Graduate No 4283 2383.00 127 360 NA Semiurban Y
LP001736 Male Yes 0 Graduate No 2221 0.00 60 360 0 Urban N
LP001743 Male Yes 2 Graduate No 4009 1717.00 116 360 1 Semiurban Y
LP001744 Male No 0 Graduate No 2971 2791.00 144 360 1 Semiurban Y
LP001749 Male Yes 0 Graduate No 7578 1010.00 175 NA 1 Semiurban Y
LP001750 Male Yes 0 Graduate No 6250 0.00 128 360 1 Semiurban Y
LP001751 Male Yes 0 Graduate No 3250 0.00 170 360 1 Rural N
LP001754 Male Yes NA Not Graduate Yes 4735 0.00 138 360 1 Urban N
LP001758 Male Yes 2 Graduate No 6250 1695.00 210 360 1 Semiurban Y
LP001760 Male NA NA Graduate No 4758 0.00 158 480 1 Semiurban Y
LP001761 Male No 0 Graduate Yes 6400 0.00 200 360 1 Rural Y
LP001765 Male Yes 1 Graduate No 2491 2054.00 104 360 1 Semiurban Y
LP001768 Male Yes 0 Graduate NA 3716 0.00 42 180 1 Rural Y
LP001770 Male No 0 Not Graduate No 3189 2598.00 120 NA 1 Rural Y
LP001776 Female No 0 Graduate No 8333 0.00 280 360 1 Semiurban Y
LP001778 Male Yes 1 Graduate No 3155 1779.00 140 360 1 Semiurban Y
LP001784 Male Yes 1 Graduate No 5500 1260.00 170 360 1 Rural Y
LP001786 Male Yes 0 Graduate NA 5746 0.00 255 360 NA Urban N
LP001788 Female No 0 Graduate Yes 3463 0.00 122 360 NA Urban Y
LP001790 Female No 1 Graduate No 3812 0.00 112 360 1 Rural Y
LP001792 Male Yes 1 Graduate No 3315 0.00 96 360 1 Semiurban Y
LP001798 Male Yes 2 Graduate No 5819 5000.00 120 360 1 Rural Y
LP001800 Male Yes 1 Not Graduate No 2510 1983.00 140 180 1 Urban N
LP001806 Male No 0 Graduate No 2965 5701.00 155 60 1 Urban Y
LP001807 Male Yes 2 Graduate Yes 6250 1300.00 108 360 1 Rural Y
LP001811 Male Yes 0 Not Graduate No 3406 4417.00 123 360 1 Semiurban Y
LP001813 Male No 0 Graduate Yes 6050 4333.00 120 180 1 Urban N
LP001814 Male Yes 2 Graduate No 9703 0.00 112 360 1 Urban Y
LP001819 Male Yes 1 Not Graduate No 6608 0.00 137 180 1 Urban Y
LP001824 Male Yes 1 Graduate No 2882 1843.00 123 480 1 Semiurban Y
LP001825 Male Yes 0 Graduate No 1809 1868.00 90 360 1 Urban Y
LP001835 Male Yes 0 Not Graduate No 1668 3890.00 201 360 0 Semiurban N
LP001836 Female No 2 Graduate No 3427 0.00 138 360 1 Urban N
LP001841 Male No 0 Not Graduate Yes 2583 2167.00 104 360 1 Rural Y
LP001843 Male Yes 1 Not Graduate No 2661 7101.00 279 180 1 Semiurban Y
LP001844 Male No 0 Graduate Yes 16250 0.00 192 360 0 Urban N
LP001846 Female No 3+ Graduate No 3083 0.00 255 360 1 Rural Y
LP001849 Male No 0 Not Graduate No 6045 0.00 115 360 0 Rural N
LP001854 Male Yes 3+ Graduate No 5250 0.00 94 360 1 Urban N
LP001859 Male Yes 0 Graduate No 14683 2100.00 304 360 1 Rural N
LP001864 Male Yes 3+ Not Graduate No 4931 0.00 128 360 NA Semiurban N
LP001865 Male Yes 1 Graduate No 6083 4250.00 330 360 NA Urban Y
LP001868 Male No 0 Graduate No 2060 2209.00 134 360 1 Semiurban Y
LP001870 Female No 1 Graduate No 3481 0.00 155 36 1 Semiurban N
LP001871 Female No 0 Graduate No 7200 0.00 120 360 1 Rural Y
LP001872 Male No 0 Graduate Yes 5166 0.00 128 360 1 Semiurban Y
LP001875 Male No 0 Graduate No 4095 3447.00 151 360 1 Rural Y
LP001877 Male Yes 2 Graduate No 4708 1387.00 150 360 1 Semiurban Y
LP001882 Male Yes 3+ Graduate No 4333 1811.00 160 360 0 Urban Y
LP001883 Female No 0 Graduate NA 3418 0.00 135 360 1 Rural N
LP001884 Female No 1 Graduate No 2876 1560.00 90 360 1 Urban Y
LP001888 Female No 0 Graduate No 3237 0.00 30 360 1 Urban Y
LP001891 Male Yes 0 Graduate No 11146 0.00 136 360 1 Urban Y
LP001892 Male No 0 Graduate No 2833 1857.00 126 360 1 Rural Y
LP001894 Male Yes 0 Graduate No 2620 2223.00 150 360 1 Semiurban Y
LP001896 Male Yes 2 Graduate No 3900 0.00 90 360 1 Semiurban Y
LP001900 Male Yes 1 Graduate No 2750 1842.00 115 360 1 Semiurban Y
LP001903 Male Yes 0 Graduate No 3993 3274.00 207 360 1 Semiurban Y
LP001904 Male Yes 0 Graduate No 3103 1300.00 80 360 1 Urban Y
LP001907 Male Yes 0 Graduate No 14583 0.00 436 360 1 Semiurban Y
LP001908 Female Yes 0 Not Graduate No 4100 0.00 124 360 NA Rural Y
LP001910 Male No 1 Not Graduate Yes 4053 2426.00 158 360 0 Urban N
LP001914 Male Yes 0 Graduate No 3927 800.00 112 360 1 Semiurban Y
LP001915 Male Yes 2 Graduate No 2301 985.80 78 180 1 Urban Y
LP001917 Female No 0 Graduate No 1811 1666.00 54 360 1 Urban Y
LP001922 Male Yes 0 Graduate No 20667 0.00 NA 360 1 Rural N
LP001924 Male No 0 Graduate No 3158 3053.00 89 360 1 Rural Y
LP001925 Female No 0 Graduate Yes 2600 1717.00 99 300 1 Semiurban N
LP001926 Male Yes 0 Graduate No 3704 2000.00 120 360 1 Rural Y
LP001931 Female No 0 Graduate No 4124 0.00 115 360 1 Semiurban Y
LP001935 Male No 0 Graduate No 9508 0.00 187 360 1 Rural Y
LP001936 Male Yes 0 Graduate No 3075 2416.00 139 360 1 Rural Y
LP001938 Male Yes 2 Graduate No 4400 0.00 127 360 0 Semiurban N
LP001940 Male Yes 2 Graduate No 3153 1560.00 134 360 1 Urban Y
LP001945 Female No NA Graduate No 5417 0.00 143 480 0 Urban N
LP001947 Male Yes 0 Graduate No 2383 3334.00 172 360 1 Semiurban Y
LP001949 Male Yes 3+ Graduate NA 4416 1250.00 110 360 1 Urban Y
LP001953 Male Yes 1 Graduate No 6875 0.00 200 360 1 Semiurban Y
LP001954 Female Yes 1 Graduate No 4666 0.00 135 360 1 Urban Y
LP001955 Female No 0 Graduate No 5000 2541.00 151 480 1 Rural N
LP001963 Male Yes 1 Graduate No 2014 2925.00 113 360 1 Urban N
LP001964 Male Yes 0 Not Graduate No 1800 2934.00 93 360 0 Urban N
LP001972 Male Yes NA Not Graduate No 2875 1750.00 105 360 1 Semiurban Y
LP001974 Female No 0 Graduate No 5000 0.00 132 360 1 Rural Y
LP001977 Male Yes 1 Graduate No 1625 1803.00 96 360 1 Urban Y
LP001978 Male No 0 Graduate No 4000 2500.00 140 360 1 Rural Y
LP001990 Male No 0 Not Graduate No 2000 0.00 NA 360 1 Urban N
LP001993 Female No 0 Graduate No 3762 1666.00 135 360 1 Rural Y
LP001994 Female No 0 Graduate No 2400 1863.00 104 360 0 Urban N
LP001996 Male No 0 Graduate No 20233 0.00 480 360 1 Rural N
LP001998 Male Yes 2 Not Graduate No 7667 0.00 185 360 NA Rural Y
LP002002 Female No 0 Graduate No 2917 0.00 84 360 1 Semiurban Y
LP002004 Male No 0 Not Graduate No 2927 2405.00 111 360 1 Semiurban Y
LP002006 Female No 0 Graduate No 2507 0.00 56 360 1 Rural Y
LP002008 Male Yes 2 Graduate Yes 5746 0.00 144 84 NA Rural Y
LP002024 NA Yes 0 Graduate No 2473 1843.00 159 360 1 Rural N
LP002031 Male Yes 1 Not Graduate No 3399 1640.00 111 180 1 Urban Y
LP002035 Male Yes 2 Graduate No 3717 0.00 120 360 1 Semiurban Y
LP002036 Male Yes 0 Graduate No 2058 2134.00 88 360 NA Urban Y
LP002043 Female No 1 Graduate No 3541 0.00 112 360 NA Semiurban Y
LP002050 Male Yes 1 Graduate Yes 10000 0.00 155 360 1 Rural N
LP002051 Male Yes 0 Graduate No 2400 2167.00 115 360 1 Semiurban Y
LP002053 Male Yes 3+ Graduate No 4342 189.00 124 360 1 Semiurban Y
LP002054 Male Yes 2 Not Graduate No 3601 1590.00 NA 360 1 Rural Y
LP002055 Female No 0 Graduate No 3166 2985.00 132 360 NA Rural Y
LP002065 Male Yes 3+ Graduate No 15000 0.00 300 360 1 Rural Y
LP002067 Male Yes 1 Graduate Yes 8666 4983.00 376 360 0 Rural N
LP002068 Male No 0 Graduate No 4917 0.00 130 360 0 Rural Y
LP002082 Male Yes 0 Graduate Yes 5818 2160.00 184 360 1 Semiurban Y
LP002086 Female Yes 0 Graduate No 4333 2451.00 110 360 1 Urban N
LP002087 Female No 0 Graduate No 2500 0.00 67 360 1 Urban Y
LP002097 Male No 1 Graduate No 4384 1793.00 117 360 1 Urban Y
LP002098 Male No 0 Graduate No 2935 0.00 98 360 1 Semiurban Y
LP002100 Male No NA Graduate No 2833 0.00 71 360 1 Urban Y
LP002101 Male Yes 0 Graduate NA 63337 0.00 490 180 1 Urban Y
LP002103 NA Yes 1 Graduate Yes 9833 1833.00 182 180 1 Urban Y
LP002106 Male Yes NA Graduate Yes 5503 4490.00 70 NA 1 Semiurban Y
LP002110 Male Yes 1 Graduate NA 5250 688.00 160 360 1 Rural Y
LP002112 Male Yes 2 Graduate Yes 2500 4600.00 176 360 1 Rural Y
LP002113 Female No 3+ Not Graduate No 1830 0.00 NA 360 0 Urban N
LP002114 Female No 0 Graduate No 4160 0.00 71 360 1 Semiurban Y
LP002115 Male Yes 3+ Not Graduate No 2647 1587.00 173 360 1 Rural N
LP002116 Female No 0 Graduate No 2378 0.00 46 360 1 Rural N
LP002119 Male Yes 1 Not Graduate No 4554 1229.00 158 360 1 Urban Y
LP002126 Male Yes 3+ Not Graduate No 3173 0.00 74 360 1 Semiurban Y
LP002128 Male Yes 2 Graduate NA 2583 2330.00 125 360 1 Rural Y
LP002129 Male Yes 0 Graduate No 2499 2458.00 160 360 1 Semiurban Y
LP002130 Male Yes NA Not Graduate No 3523 3230.00 152 360 0 Rural N
LP002131 Male Yes 2 Not Graduate No 3083 2168.00 126 360 1 Urban Y
LP002137 Male Yes 0 Graduate No 6333 4583.00 259 360 NA Semiurban Y
LP002138 Male Yes 0 Graduate No 2625 6250.00 187 360 1 Rural Y
LP002139 Male Yes 0 Graduate No 9083 0.00 228 360 1 Semiurban Y
LP002140 Male No 0 Graduate No 8750 4167.00 308 360 1 Rural N
LP002141 Male Yes 3+ Graduate No 2666 2083.00 95 360 1 Rural Y
LP002142 Female Yes 0 Graduate Yes 5500 0.00 105 360 0 Rural N
LP002143 Female Yes 0 Graduate No 2423 505.00 130 360 1 Semiurban Y
LP002144 Female No NA Graduate No 3813 0.00 116 180 1 Urban Y
LP002149 Male Yes 2 Graduate No 8333 3167.00 165 360 1 Rural Y
LP002151 Male Yes 1 Graduate No 3875 0.00 67 360 1 Urban N
LP002158 Male Yes 0 Not Graduate No 3000 1666.00 100 480 0 Urban N
LP002160 Male Yes 3+ Graduate No 5167 3167.00 200 360 1 Semiurban Y
LP002161 Female No 1 Graduate No 4723 0.00 81 360 1 Semiurban N
LP002170 Male Yes 2 Graduate No 5000 3667.00 236 360 1 Semiurban Y
LP002175 Male Yes 0 Graduate No 4750 2333.00 130 360 1 Urban Y
LP002178 Male Yes 0 Graduate No 3013 3033.00 95 300 NA Urban Y
LP002180 Male No 0 Graduate Yes 6822 0.00 141 360 1 Rural Y
LP002181 Male No 0 Not Graduate No 6216 0.00 133 360 1 Rural N
LP002187 Male No 0 Graduate No 2500 0.00 96 480 1 Semiurban N
LP002188 Male No 0 Graduate No 5124 0.00 124 NA 0 Rural N
LP002190 Male Yes 1 Graduate No 6325 0.00 175 360 1 Semiurban Y
LP002191 Male Yes 0 Graduate No 19730 5266.00 570 360 1 Rural N
LP002194 Female No 0 Graduate Yes 15759 0.00 55 360 1 Semiurban Y
LP002197 Male Yes 2 Graduate No 5185 0.00 155 360 1 Semiurban Y
LP002201 Male Yes 2 Graduate Yes 9323 7873.00 380 300 1 Rural Y
LP002205 Male No 1 Graduate No 3062 1987.00 111 180 0 Urban N
LP002209 Female No 0 Graduate NA 2764 1459.00 110 360 1 Urban Y
LP002211 Male Yes 0 Graduate No 4817 923.00 120 180 1 Urban Y
LP002219 Male Yes 3+ Graduate No 8750 4996.00 130 360 1 Rural Y
LP002223 Male Yes 0 Graduate No 4310 0.00 130 360 NA Semiurban Y
LP002224 Male No 0 Graduate No 3069 0.00 71 480 1 Urban N
LP002225 Male Yes 2 Graduate No 5391 0.00 130 360 1 Urban Y
LP002226 Male Yes 0 Graduate NA 3333 2500.00 128 360 1 Semiurban Y
LP002229 Male No 0 Graduate No 5941 4232.00 296 360 1 Semiurban Y
LP002231 Female No 0 Graduate No 6000 0.00 156 360 1 Urban Y
LP002234 Male No 0 Graduate Yes 7167 0.00 128 360 1 Urban Y
LP002236 Male Yes 2 Graduate No 4566 0.00 100 360 1 Urban N
LP002237 Male No 1 Graduate NA 3667 0.00 113 180 1 Urban Y
LP002239 Male No 0 Not Graduate No 2346 1600.00 132 360 1 Semiurban Y
LP002243 Male Yes 0 Not Graduate No 3010 3136.00 NA 360 0 Urban N
LP002244 Male Yes 0 Graduate No 2333 2417.00 136 360 1 Urban Y
LP002250 Male Yes 0 Graduate No 5488 0.00 125 360 1 Rural Y
LP002255 Male No 3+ Graduate No 9167 0.00 185 360 1 Rural Y
LP002262 Male Yes 3+ Graduate No 9504 0.00 275 360 1 Rural Y
LP002263 Male Yes 0 Graduate No 2583 2115.00 120 360 NA Urban Y
LP002265 Male Yes 2 Not Graduate No 1993 1625.00 113 180 1 Semiurban Y
LP002266 Male Yes 2 Graduate No 3100 1400.00 113 360 1 Urban Y
LP002272 Male Yes 2 Graduate No 3276 484.00 135 360 NA Semiurban Y
LP002277 Female No 0 Graduate No 3180 0.00 71 360 0 Urban N
LP002281 Male Yes 0 Graduate No 3033 1459.00 95 360 1 Urban Y
LP002284 Male No 0 Not Graduate No 3902 1666.00 109 360 1 Rural Y
LP002287 Female No 0 Graduate No 1500 1800.00 103 360 0 Semiurban N
LP002288 Male Yes 2 Not Graduate No 2889 0.00 45 180 0 Urban N
LP002296 Male No 0 Not Graduate No 2755 0.00 65 300 1 Rural N
LP002297 Male No 0 Graduate No 2500 20000.00 103 360 1 Semiurban Y
LP002300 Female No 0 Not Graduate No 1963 0.00 53 360 1 Semiurban Y
LP002301 Female No 0 Graduate Yes 7441 0.00 194 360 1 Rural N
LP002305 Female No 0 Graduate No 4547 0.00 115 360 1 Semiurban Y
LP002308 Male Yes 0 Not Graduate No 2167 2400.00 115 360 1 Urban Y
LP002314 Female No 0 Not Graduate No 2213 0.00 66 360 1 Rural Y
LP002315 Male Yes 1 Graduate No 8300 0.00 152 300 0 Semiurban N
LP002317 Male Yes 3+ Graduate No 81000 0.00 360 360 0 Rural N
LP002318 Female No 1 Not Graduate Yes 3867 0.00 62 360 1 Semiurban N
LP002319 Male Yes 0 Graduate NA 6256 0.00 160 360 NA Urban Y
LP002328 Male Yes 0 Not Graduate No 6096 0.00 218 360 0 Rural N
LP002332 Male Yes 0 Not Graduate No 2253 2033.00 110 360 1 Rural Y
LP002335 Female Yes 0 Not Graduate No 2149 3237.00 178 360 0 Semiurban N
LP002337 Female No 0 Graduate No 2995 0.00 60 360 1 Urban Y
LP002341 Female No 1 Graduate No 2600 0.00 160 360 1 Urban N
LP002342 Male Yes 2 Graduate Yes 1600 20000.00 239 360 1 Urban N
LP002345 Male Yes 0 Graduate No 1025 2773.00 112 360 1 Rural Y
LP002347 Male Yes 0 Graduate No 3246 1417.00 138 360 1 Semiurban Y
LP002348 Male Yes 0 Graduate No 5829 0.00 138 360 1 Rural Y
LP002357 Female No 0 Not Graduate No 2720 0.00 80 NA 0 Urban N
LP002361 Male Yes 0 Graduate No 1820 1719.00 100 360 1 Urban Y
LP002362 Male Yes 1 Graduate No 7250 1667.00 110 NA 0 Urban N
LP002364 Male Yes 0 Graduate No 14880 0.00 96 360 1 Semiurban Y
LP002366 Male Yes 0 Graduate No 2666 4300.00 121 360 1 Rural Y
LP002367 Female No 1 Not Graduate No 4606 0.00 81 360 1 Rural N
LP002368 Male Yes 2 Graduate No 5935 0.00 133 360 1 Semiurban Y
LP002369 Male Yes 0 Graduate No 2920 16.12 87 360 1 Rural Y
LP002370 Male No 0 Not Graduate No 2717 0.00 60 180 1 Urban Y
LP002377 Female No 1 Graduate Yes 8624 0.00 150 360 1 Semiurban Y
LP002379 Male No 0 Graduate No 6500 0.00 105 360 0 Rural N
LP002386 Male No 0 Graduate NA 12876 0.00 405 360 1 Semiurban Y
LP002387 Male Yes 0 Graduate No 2425 2340.00 143 360 1 Semiurban Y
LP002390 Male No 0 Graduate No 3750 0.00 100 360 1 Urban Y
LP002393 Female NA NA Graduate No 10047 0.00 NA 240 1 Semiurban Y
LP002398 Male No 0 Graduate No 1926 1851.00 50 360 1 Semiurban Y
LP002401 Male Yes 0 Graduate No 2213 1125.00 NA 360 1 Urban Y
LP002403 Male No 0 Graduate Yes 10416 0.00 187 360 0 Urban N
LP002407 Female Yes 0 Not Graduate Yes 7142 0.00 138 360 1 Rural Y
LP002408 Male No 0 Graduate No 3660 5064.00 187 360 1 Semiurban Y
LP002409 Male Yes 0 Graduate No 7901 1833.00 180 360 1 Rural Y
LP002418 Male No 3+ Not Graduate No 4707 1993.00 148 360 1 Semiurban Y
LP002422 Male No 1 Graduate No 37719 0.00 152 360 1 Semiurban Y
LP002424 Male Yes 0 Graduate No 7333 8333.00 175 300 NA Rural Y
LP002429 Male Yes 1 Graduate Yes 3466 1210.00 130 360 1 Rural Y
LP002434 Male Yes 2 Not Graduate No 4652 0.00 110 360 1 Rural Y
LP002435 Male Yes 0 Graduate NA 3539 1376.00 55 360 1 Rural N
LP002443 Male Yes 2 Graduate No 3340 1710.00 150 360 0 Rural N
LP002444 Male No 1 Not Graduate Yes 2769 1542.00 190 360 NA Semiurban N
LP002446 Male Yes 2 Not Graduate No 2309 1255.00 125 360 0 Rural N
LP002447 Male Yes 2 Not Graduate No 1958 1456.00 60 300 NA Urban Y
LP002448 Male Yes 0 Graduate No 3948 1733.00 149 360 0 Rural N
LP002449 Male Yes 0 Graduate No 2483 2466.00 90 180 0 Rural Y
LP002453 Male No 0 Graduate Yes 7085 0.00 84 360 1 Semiurban Y
LP002455 Male Yes 2 Graduate No 3859 0.00 96 360 1 Semiurban Y
LP002459 Male Yes 0 Graduate No 4301 0.00 118 360 1 Urban Y
LP002467 Male Yes 0 Graduate No 3708 2569.00 173 360 1 Urban N
LP002472 Male No 2 Graduate No 4354 0.00 136 360 1 Rural Y
LP002473 Male Yes 0 Graduate No 8334 0.00 160 360 1 Semiurban N
LP002478 NA Yes 0 Graduate Yes 2083 4083.00 160 360 NA Semiurban Y
LP002484 Male Yes 3+ Graduate No 7740 0.00 128 180 1 Urban Y
LP002487 Male Yes 0 Graduate No 3015 2188.00 153 360 1 Rural Y
LP002489 Female No 1 Not Graduate NA 5191 0.00 132 360 1 Semiurban Y
LP002493 Male No 0 Graduate No 4166 0.00 98 360 0 Semiurban N
LP002494 Male No 0 Graduate No 6000 0.00 140 360 1 Rural Y
LP002500 Male Yes 3+ Not Graduate No 2947 1664.00 70 180 0 Urban N
LP002501 NA Yes 0 Graduate No 16692 0.00 110 360 1 Semiurban Y
LP002502 Female Yes 2 Not Graduate NA 210 2917.00 98 360 1 Semiurban Y
LP002505 Male Yes 0 Graduate No 4333 2451.00 110 360 1 Urban N
LP002515 Male Yes 1 Graduate Yes 3450 2079.00 162 360 1 Semiurban Y
LP002517 Male Yes 1 Not Graduate No 2653 1500.00 113 180 0 Rural N
LP002519 Male Yes 3+ Graduate No 4691 0.00 100 360 1 Semiurban Y
LP002522 Female No 0 Graduate Yes 2500 0.00 93 360 NA Urban Y
LP002524 Male No 2 Graduate No 5532 4648.00 162 360 1 Rural Y
LP002527 Male Yes 2 Graduate Yes 16525 1014.00 150 360 1 Rural Y
LP002529 Male Yes 2 Graduate No 6700 1750.00 230 300 1 Semiurban Y
LP002530 NA Yes 2 Graduate No 2873 1872.00 132 360 0 Semiurban N
LP002531 Male Yes 1 Graduate Yes 16667 2250.00 86 360 1 Semiurban Y
LP002533 Male Yes 2 Graduate No 2947 1603.00 NA 360 1 Urban N
LP002534 Female No 0 Not Graduate No 4350 0.00 154 360 1 Rural Y
LP002536 Male Yes 3+ Not Graduate No 3095 0.00 113 360 1 Rural Y
LP002537 Male Yes 0 Graduate No 2083 3150.00 128 360 1 Semiurban Y
LP002541 Male Yes 0 Graduate No 10833 0.00 234 360 1 Semiurban Y
LP002543 Male Yes 2 Graduate No 8333 0.00 246 360 1 Semiurban Y
LP002544 Male Yes 1 Not Graduate No 1958 2436.00 131 360 1 Rural Y
LP002545 Male No 2 Graduate No 3547 0.00 80 360 0 Rural N
LP002547 Male Yes 1 Graduate No 18333 0.00 500 360 1 Urban N
LP002555 Male Yes 2 Graduate Yes 4583 2083.00 160 360 1 Semiurban Y
LP002556 Male No 0 Graduate No 2435 0.00 75 360 1 Urban N
LP002560 Male No 0 Not Graduate No 2699 2785.00 96 360 NA Semiurban Y
LP002562 Male Yes 1 Not Graduate No 5333 1131.00 186 360 NA Urban Y
LP002571 Male No 0 Not Graduate No 3691 0.00 110 360 1 Rural Y
LP002582 Female No 0 Not Graduate Yes 17263 0.00 225 360 1 Semiurban Y
LP002585 Male Yes 0 Graduate No 3597 2157.00 119 360 0 Rural N
LP002586 Female Yes 1 Graduate No 3326 913.00 105 84 1 Semiurban Y
LP002587 Male Yes 0 Not Graduate No 2600 1700.00 107 360 1 Rural Y
LP002588 Male Yes 0 Graduate No 4625 2857.00 111 12 NA Urban Y
LP002600 Male Yes 1 Graduate Yes 2895 0.00 95 360 1 Semiurban Y
LP002602 Male No 0 Graduate No 6283 4416.00 209 360 0 Rural N
LP002603 Female No 0 Graduate No 645 3683.00 113 480 1 Rural Y
LP002606 Female No 0 Graduate No 3159 0.00 100 360 1 Semiurban Y
LP002615 Male Yes 2 Graduate No 4865 5624.00 208 360 1 Semiurban Y
LP002618 Male Yes 1 Not Graduate No 4050 5302.00 138 360 NA Rural N
LP002619 Male Yes 0 Not Graduate No 3814 1483.00 124 300 1 Semiurban Y
LP002622 Male Yes 2 Graduate No 3510 4416.00 243 360 1 Rural Y
LP002624 Male Yes 0 Graduate No 20833 6667.00 480 360 NA Urban Y
LP002625 NA No 0 Graduate No 3583 0.00 96 360 1 Urban N
LP002626 Male Yes 0 Graduate Yes 2479 3013.00 188 360 1 Urban Y
LP002634 Female No 1 Graduate No 13262 0.00 40 360 1 Urban Y
LP002637 Male No 0 Not Graduate No 3598 1287.00 100 360 1 Rural N
LP002640 Male Yes 1 Graduate No 6065 2004.00 250 360 1 Semiurban Y
LP002643 Male Yes 2 Graduate No 3283 2035.00 148 360 1 Urban Y
LP002648 Male Yes 0 Graduate No 2130 6666.00 70 180 1 Semiurban N
LP002652 Male No 0 Graduate No 5815 3666.00 311 360 1 Rural N
LP002659 Male Yes 3+ Graduate No 3466 3428.00 150 360 1 Rural Y
LP002670 Female Yes 2 Graduate No 2031 1632.00 113 480 1 Semiurban Y
LP002682 Male Yes NA Not Graduate No 3074 1800.00 123 360 0 Semiurban N
LP002683 Male No 0 Graduate No 4683 1915.00 185 360 1 Semiurban N
LP002684 Female No 0 Not Graduate No 3400 0.00 95 360 1 Rural N
LP002689 Male Yes 2 Not Graduate No 2192 1742.00 45 360 1 Semiurban Y
LP002690 Male No 0 Graduate No 2500 0.00 55 360 1 Semiurban Y
LP002692 Male Yes 3+ Graduate Yes 5677 1424.00 100 360 1 Rural Y
LP002693 Male Yes 2 Graduate Yes 7948 7166.00 480 360 1 Rural Y
LP002697 Male No 0 Graduate No 4680 2087.00 NA 360 1 Semiurban N
LP002699 Male Yes 2 Graduate Yes 17500 0.00 400 360 1 Rural Y
LP002705 Male Yes 0 Graduate No 3775 0.00 110 360 1 Semiurban Y
LP002706 Male Yes 1 Not Graduate No 5285 1430.00 161 360 0 Semiurban Y
LP002714 Male No 1 Not Graduate No 2679 1302.00 94 360 1 Semiurban Y
LP002716 Male No 0 Not Graduate No 6783 0.00 130 360 1 Semiurban Y
LP002717 Male Yes 0 Graduate No 1025 5500.00 216 360 NA Rural Y
LP002720 Male Yes 3+ Graduate No 4281 0.00 100 360 1 Urban Y
LP002723 Male No 2 Graduate No 3588 0.00 110 360 0 Rural N
LP002729 Male No 1 Graduate No 11250 0.00 196 360 NA Semiurban N
LP002731 Female No 0 Not Graduate Yes 18165 0.00 125 360 1 Urban Y
LP002732 Male No 0 Not Graduate NA 2550 2042.00 126 360 1 Rural Y
LP002734 Male Yes 0 Graduate No 6133 3906.00 324 360 1 Urban Y
LP002738 Male No 2 Graduate No 3617 0.00 107 360 1 Semiurban Y
LP002739 Male Yes 0 Not Graduate No 2917 536.00 66 360 1 Rural N
LP002740 Male Yes 3+ Graduate No 6417 0.00 157 180 1 Rural Y
LP002741 Female Yes 1 Graduate No 4608 2845.00 140 180 1 Semiurban Y
LP002743 Female No 0 Graduate No 2138 0.00 99 360 0 Semiurban N
LP002753 Female No 1 Graduate NA 3652 0.00 95 360 1 Semiurban Y
LP002755 Male Yes 1 Not Graduate No 2239 2524.00 128 360 1 Urban Y
LP002757 Female Yes 0 Not Graduate No 3017 663.00 102 360 NA Semiurban Y
LP002767 Male Yes 0 Graduate No 2768 1950.00 155 360 1 Rural Y
LP002768 Male No 0 Not Graduate No 3358 0.00 80 36 1 Semiurban N
LP002772 Male No 0 Graduate No 2526 1783.00 145 360 1 Rural Y
LP002776 Female No 0 Graduate No 5000 0.00 103 360 0 Semiurban N
LP002777 Male Yes 0 Graduate No 2785 2016.00 110 360 1 Rural Y
LP002778 Male Yes 2 Graduate Yes 6633 0.00 NA 360 0 Rural N
LP002784 Male Yes 1 Not Graduate No 2492 2375.00 NA 360 1 Rural Y
LP002785 Male Yes 1 Graduate No 3333 3250.00 158 360 1 Urban Y
LP002788 Male Yes 0 Not Graduate No 2454 2333.00 181 360 0 Urban N
LP002789 Male Yes 0 Graduate No 3593 4266.00 132 180 0 Rural N
LP002792 Male Yes 1 Graduate No 5468 1032.00 26 360 1 Semiurban Y
LP002794 Female No 0 Graduate No 2667 1625.00 84 360 NA Urban Y
LP002795 Male Yes 3+ Graduate Yes 10139 0.00 260 360 1 Semiurban Y
LP002798 Male Yes 0 Graduate No 3887 2669.00 162 360 1 Semiurban Y
LP002804 Female Yes 0 Graduate No 4180 2306.00 182 360 1 Semiurban Y
LP002807 Male Yes 2 Not Graduate No 3675 242.00 108 360 1 Semiurban Y
LP002813 Female Yes 1 Graduate Yes 19484 0.00 600 360 1 Semiurban Y
LP002820 Male Yes 0 Graduate No 5923 2054.00 211 360 1 Rural Y
LP002821 Male No 0 Not Graduate Yes 5800 0.00 132 360 1 Semiurban Y
LP002832 Male Yes 2 Graduate No 8799 0.00 258 360 0 Urban N
LP002833 Male Yes 0 Not Graduate No 4467 0.00 120 360 NA Rural Y
LP002836 Male No 0 Graduate No 3333 0.00 70 360 1 Urban Y
LP002837 Male Yes 3+ Graduate No 3400 2500.00 123 360 0 Rural N
LP002840 Female No 0 Graduate No 2378 0.00 9 360 1 Urban N
LP002841 Male Yes 0 Graduate No 3166 2064.00 104 360 0 Urban N
LP002842 Male Yes 1 Graduate No 3417 1750.00 186 360 1 Urban Y
LP002847 Male Yes NA Graduate No 5116 1451.00 165 360 0 Urban N
LP002855 Male Yes 2 Graduate No 16666 0.00 275 360 1 Urban Y
LP002862 Male Yes 2 Not Graduate No 6125 1625.00 187 480 1 Semiurban N
LP002863 Male Yes 3+ Graduate No 6406 0.00 150 360 1 Semiurban N
LP002868 Male Yes 2 Graduate No 3159 461.00 108 84 1 Urban Y
LP002872 NA Yes 0 Graduate No 3087 2210.00 136 360 0 Semiurban N
LP002874 Male No 0 Graduate No 3229 2739.00 110 360 1 Urban Y
LP002877 Male Yes 1 Graduate No 1782 2232.00 107 360 1 Rural Y
LP002888 Male No 0 Graduate NA 3182 2917.00 161 360 1 Urban Y
LP002892 Male Yes 2 Graduate No 6540 0.00 205 360 1 Semiurban Y
LP002893 Male No 0 Graduate No 1836 33837.00 90 360 1 Urban N
LP002894 Female Yes 0 Graduate No 3166 0.00 36 360 1 Semiurban Y
LP002898 Male Yes 1 Graduate No 1880 0.00 61 360 NA Rural N
LP002911 Male Yes 1 Graduate No 2787 1917.00 146 360 0 Rural N
LP002912 Male Yes 1 Graduate No 4283 3000.00 172 84 1 Rural N
LP002916 Male Yes 0 Graduate No 2297 1522.00 104 360 1 Urban Y
LP002917 Female No 0 Not Graduate No 2165 0.00 70 360 1 Semiurban Y
LP002925 NA No 0 Graduate No 4750 0.00 94 360 1 Semiurban Y
LP002926 Male Yes 2 Graduate Yes 2726 0.00 106 360 0 Semiurban N
LP002928 Male Yes 0 Graduate No 3000 3416.00 56 180 1 Semiurban Y
LP002931 Male Yes 2 Graduate Yes 6000 0.00 205 240 1 Semiurban N
LP002933 NA No 3+ Graduate Yes 9357 0.00 292 360 1 Semiurban Y
LP002936 Male Yes 0 Graduate No 3859 3300.00 142 180 1 Rural Y
LP002938 Male Yes 0 Graduate Yes 16120 0.00 260 360 1 Urban Y
LP002940 Male No 0 Not Graduate No 3833 0.00 110 360 1 Rural Y
LP002941 Male Yes 2 Not Graduate Yes 6383 1000.00 187 360 1 Rural N
LP002943 Male No NA Graduate No 2987 0.00 88 360 0 Semiurban N
LP002945 Male Yes 0 Graduate Yes 9963 0.00 180 360 1 Rural Y
LP002948 Male Yes 2 Graduate No 5780 0.00 192 360 1 Urban Y
LP002949 Female No 3+ Graduate NA 416 41667.00 350 180 NA Urban N
LP002950 Male Yes 0 Not Graduate NA 2894 2792.00 155 360 1 Rural Y
LP002953 Male Yes 3+ Graduate No 5703 0.00 128 360 1 Urban Y
LP002958 Male No 0 Graduate No 3676 4301.00 172 360 1 Rural Y
LP002959 Female Yes 1 Graduate No 12000 0.00 496 360 1 Semiurban Y
LP002960 Male Yes 0 Not Graduate No 2400 3800.00 NA 180 1 Urban N
LP002961 Male Yes 1 Graduate No 3400 2500.00 173 360 1 Semiurban Y
LP002964 Male Yes 2 Not Graduate No 3987 1411.00 157 360 1 Rural Y
LP002974 Male Yes 0 Graduate No 3232 1950.00 108 360 1 Rural Y
LP002978 Female No 0 Graduate No 2900 0.00 71 360 1 Rural Y
LP002979 Male Yes 3+ Graduate No 4106 0.00 40 180 1 Rural Y
LP002983 Male Yes 1 Graduate No 8072 240.00 253 360 1 Urban Y
LP002984 Male Yes 2 Graduate No 7583 0.00 187 360 1 Urban Y
LP002990 Female No 0 Graduate Yes 4583 0.00 133 360 0 Semiurban N
summary(loan_data) %>% kbl() %>% kable_styling() %>% scroll_box(width = "750px", height = "250px")
Loan_ID Gender Married Dependents Education Self_Employed ApplicantIncome CoapplicantIncome LoanAmount Loan_Amount_Term Credit_History Property_Area Loan_Status
Length:614 Length:614 Length:614 Length:614 Length:614 Length:614 Min. : 150 Min. : 0 Min. : 9.0 Min. : 12 Min. :0.0000 Length:614 Length:614
Class :character Class :character Class :character Class :character Class :character Class :character 1st Qu.: 2878 1st Qu.: 0 1st Qu.:100.0 1st Qu.:360 1st Qu.:1.0000 Class :character Class :character
Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Median : 3812 Median : 1188 Median :128.0 Median :360 Median :1.0000 Mode :character Mode :character
NA NA NA NA NA NA Mean : 5403 Mean : 1621 Mean :146.4 Mean :342 Mean :0.8422 NA NA
NA NA NA NA NA NA 3rd Qu.: 5795 3rd Qu.: 2297 3rd Qu.:168.0 3rd Qu.:360 3rd Qu.:1.0000 NA NA
NA NA NA NA NA NA Max. :81000 Max. :41667 Max. :700.0 Max. :480 Max. :1.0000 NA NA
NA NA NA NA NA NA NA NA NA’s :22 NA’s :14 NA’s :50 NA NA
missing <- loan_data %>% mutate_if(is.character, list(~na_if(.,""))) 

missing%>%
  summarise_all(list(~sum(is.na(.)))) %>%
  gather(key="Variable", value="Number_Missing") %>%
  arrange(desc(Number_Missing)) %>% kbl() %>% kable_styling() %>% scroll_box(width = "750px", height = "250px")
Variable Number_Missing
Credit_History 50
Self_Employed 32
LoanAmount 22
Dependents 15
Loan_Amount_Term 14
Gender 13
Married 3
Loan_ID 0
Education 0
ApplicantIncome 0
CoapplicantIncome 0
Property_Area 0
Loan_Status 0
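
The per-column counts above can be cross-checked in base R. The following is a minimal sketch on a small made-up data frame (`toy` is hypothetical and is not the loan data), assuming blanks have already been read in as NA:

```r
# Toy data frame standing in for loan_data (values are illustrative only)
toy <- data.frame(
  Gender         = c("Male", NA, "Female", "Male"),
  LoanAmount     = c(128, NA, 66, NA),
  Credit_History = c(1, NA, NA, NA),
  stringsAsFactors = FALSE
)

# Count NAs per column and sort from most to least missing
missing_counts <- sort(colSums(is.na(toy)), decreasing = TRUE)
missing_counts
```

Applied to loan_data, the same two calls should reproduce the table above, since `is.na()` and `colSums()` work column by column on any data frame.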

Factors

loan_data <- loan_data %>% mutate_if(is.character, factor)

In the following section we continue to look at the raw data (loan_raw).

Categorical Variables

Several variables have blank values (""). Bank customers may have skipped these fields intentionally during data collection, or the values may simply be missing. We will handle this later on.
* Loan_ID: unique identifier
* Gender: either Male or Female or blank
* Married: either No or Yes or blank
* Dependents: how many dependents does someone have? 0, 1, 2, 3+ or blank
* Education: Graduate or Not Graduate
* Self_Employed: No or Yes or blank
* Property_Area: Urban, Rural or Semiurban
* Loan_Status: Y (yes) or N (no)
* Credit_History: does the credit history meet the guidelines? 1 = Yes, 0 = No

Married

Married applicants have a higher approval rate than unmarried applicants. It will be useful to check whether this is correlated with income.

married_loan_status_count <- table(loan_raw$Married,loan_raw$Loan_Status)
married_loan_status_perct <- married_loan_status_count
married_loan_status_perct[1,] <- round(married_loan_status_perct[1,]/3 * 100, 2)
married_loan_status_perct[2,] <- round(married_loan_status_perct[2,]/213 * 100, 2)
married_loan_status_perct[3,] <- round(married_loan_status_perct[3,]/398 * 100, 2)

#set column names for married_loan_status_count
married_loan_status_count <- data.frame(married_loan_status_count)
colnames(married_loan_status_count) <- c('Married','Loan_Status','Count')

#set column names and row names for married_loan_status_perct
rownames(married_loan_status_perct) <-  c("Blank", "Not Married", "Married")
colnames(married_loan_status_perct) <-  c("% Applications Not Approved", "% Applications Approved")

loan_data_Married <- loan_raw
# relabel only the blank Married values (indexing whole rows would overwrite every column)
loan_data_Married$Married[loan_data_Married$Married == ''] <- "Blank"

t1 <- loan_data_Married %>% group_by(Married) %>% tally 
colnames(t1) <- c("Married","Count Loan Applications")
t2 <- married_loan_status_perct

knitr::kable(list(t1, t2)) 
Married Count Loan Applications
Blank 3
No 213
Yes 398
% Applications Not Approved % Applications Approved
Blank 0.00 100.00
Not Married 37.09 62.91
Married 28.39 71.61
ggplot(data=married_loan_status_count, aes(x=Married, y=Count, fill=Loan_Status)) + geom_bar(stat="identity",position="dodge")
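
The percentage table above divides each row by a group size typed in by hand (3, 213, 398). As a sketch of an alternative, `prop.table()` with `margin = 1` computes the row percentages directly from the counts. The data frame `toy` below is made up for illustration and is not the loan data:

```r
# Toy stand-in for loan_raw; "" marks a blank Married value (illustrative data)
toy <- data.frame(
  Married     = c("", "No", "No", "No", "Yes", "Yes", "Yes", "Yes"),
  Loan_Status = c("Y", "N", "Y", "Y", "N", "Y", "Y", "Y")
)

counts <- table(toy$Married, toy$Loan_Status)

# margin = 1 divides each row by its own total, so no group sizes are typed by hand
pct <- round(prop.table(counts, margin = 1) * 100, 2)
rownames(pct)[rownames(pct) == ""] <- "Blank"
colnames(pct) <- c("% Applications Not Approved", "% Applications Approved")
pct
```

The same pattern would apply to the Dependents, Education, Property_Area and Credit_History tables, and it stays correct if the row counts change.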

Dependents

Applicants with 2 dependents appear to have the highest loan approval rate. It’d be interesting to see if the income per dependent has any impact on loan approval if we assume having more income makes it more likely to get a loan approved.

dep_loan_status_count <- table(loan_raw$Dependents,loan_raw$Loan_Status)
dep_loan_status_perct <- dep_loan_status_count
dep_loan_status_perct[1,] <- round(dep_loan_status_perct[1,]/15 * 100, 2)
dep_loan_status_perct[2,] <- round(dep_loan_status_perct[2,]/345 * 100, 2)
dep_loan_status_perct[3,] <- round(dep_loan_status_perct[3,]/102 * 100, 2)
dep_loan_status_perct[4,] <- round(dep_loan_status_perct[4,]/101 * 100, 2)
dep_loan_status_perct[5,] <- round(dep_loan_status_perct[5,]/51 * 100, 2)

#set column names for dep_loan_status_count
dep_loan_status_count <- data.frame(dep_loan_status_count)
colnames(dep_loan_status_count) <- c('Dependents','Loan_Status','Count')

#set column names and row names for dep_loan_status_perct
rownames(dep_loan_status_perct) <-  c("Blank", "0", "1","2","3+")
colnames(dep_loan_status_perct) <-  c("% Applications Not Approved", "% Applications Approved")

loan_data_Dep <- loan_raw
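# relabel only the blank Dependents values (indexing whole rows would overwrite every column)
loan_data_Dep$Dependents[loan_data_Dep$Dependents == ''] <- "Blank"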

t1 <- loan_data_Dep %>% group_by(Dependents) %>% tally 
colnames(t1) <- c("Dependents","Count Loan Applications")
t2 <- dep_loan_status_perct

knitr::kable(list(t1, t2))
Dependents Count Loan Applications
0 345
1 102
2 101
3+ 51
Blank 15
% Applications Not Approved % Applications Approved
Blank 40.00 60.00
0 31.01 68.99
1 35.29 64.71
2 24.75 75.25
3+ 35.29 64.71
ggplot(data=dep_loan_status_count, aes(x=Dependents, y=Count, fill=Loan_Status)) + geom_bar(stat="identity",position="dodge")

Education

Applicants with a Graduate education have a higher loan approval rate than those without (70.83% vs. 61.19%).

edu_loan_status_count <- table(loan_raw$Education,loan_raw$Loan_Status)
edu_loan_status_perct <- edu_loan_status_count
edu_loan_status_perct[1,] <- round(edu_loan_status_perct[1,]/480 * 100, 2)
edu_loan_status_perct[2,] <- round(edu_loan_status_perct[2,]/134 * 100, 2)

#set column names for edu_loan_status_count
edu_loan_status_count <- data.frame(edu_loan_status_count)
colnames(edu_loan_status_count) <- c('Education','Loan_Status','Count')

#set column names for edu_loan_status_perct
colnames(edu_loan_status_perct) <-  c("% Applications Not Approved", "% Applications Approved")

t1 <- loan_raw %>% group_by(Education) %>% tally 
colnames(t1) <- c("Education","Count Loan Applications")
t2 <- edu_loan_status_perct

knitr::kable(list(t1, t2))
Education Count Loan Applications
Graduate 480
Not Graduate 134
% Applications Not Approved % Applications Approved
Graduate 29.17 70.83
Not Graduate 38.81 61.19
ggplot(data=edu_loan_status_count, aes(x=Education, y=Count, fill=Loan_Status)) + geom_bar(stat="identity",position="dodge")

Property Area

Semiurban applicants have the highest loan approval rate, ahead of urban and rural applicants.

proparea_loan_status_count <- table(loan_raw$Property_Area,loan_raw$Loan_Status)
proparea_loan_status_perct <- proparea_loan_status_count
proparea_loan_status_perct[1,] <- round(proparea_loan_status_perct[1,]/179  * 100, 2)
proparea_loan_status_perct[2,] <- round(proparea_loan_status_perct[2,]/233  * 100, 2)
proparea_loan_status_perct[3,] <- round(proparea_loan_status_perct[3,]/202  * 100, 2)

#set column names for proparea_loan_status_count
proparea_loan_status_count <- data.frame(proparea_loan_status_count)
colnames(proparea_loan_status_count) <- c('Property_Area','Loan_Status','Count')

#set column names for proparea_loan_status_perct
colnames(proparea_loan_status_perct) <-  c("% Applications Not Approved", "% Applications Approved")

t1 <- loan_raw %>% group_by(Property_Area) %>% tally 
colnames(t1) <- c("Property_Area","Count Loan Applications")
t2 <- proparea_loan_status_perct

knitr::kable(list(t1, t2))
Property_Area Count Loan Applications
Rural 179
Semiurban 233
Urban 202
% Applications Not Approved % Applications Approved
Rural 38.55 61.45
Semiurban 23.18 76.82
Urban 34.16 65.84
ggplot(data=proparea_loan_status_count, aes(x=Property_Area, y=Count, fill=Loan_Status)) + geom_bar(stat="identity",position="dodge")

Credit History

Having a credit history that meets the guidelines appears to be extremely important in whether the loan is approved or not.

credhist_loan_status_count <- table(loan_raw$Credit_History,loan_raw$Loan_Status)
credhist_loan_status_perct <- credhist_loan_status_count
credhist_loan_status_perct[1,] <- round(credhist_loan_status_perct[1,]/89  * 100, 2)
credhist_loan_status_perct[2,] <- round(credhist_loan_status_perct[2,]/475  * 100, 2)

#set column names for credhist_loan_status_count
credhist_loan_status_count <- data.frame(credhist_loan_status_count)
colnames(credhist_loan_status_count) <- c('Credit_History','Loan_Status','Count')

#set column names for credhist_loan_status_perct
colnames(credhist_loan_status_perct) <-  c("% Applications Not Approved", "% Applications Approved")

t1 <- loan_raw %>% group_by(Credit_History) %>% tally 
colnames(t1) <- c("Credit_History","Count Loan Applications")
t2 <- credhist_loan_status_perct

knitr::kable(list(t1, t2))
Credit_History Count Loan Applications
0 89
1 475
NA 50
% Applications Not Approved % Applications Approved
0 92.13 7.87
1 20.42 79.58
ggplot(data=credhist_loan_status_count, aes(x=Credit_History, y=Count, fill=Loan_Status)) + geom_bar(stat="identity",position="dodge")
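
To put a number on how strongly Credit_History and Loan_Status are associated, one option is a chi-squared test of independence. The sketch below is an illustration, not part of the original analysis. The cell counts are reconstructed from the group sizes (89 and 475) and the rounded percentages in the table above, and the 50 applications with a missing Credit_History are left out:

```r
# Counts reconstructed from the group sizes (89, 475) and the rounded percentages above
credhist <- matrix(c(82,   7,
                     97, 378),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Credit_History = c("0", "1"),
                                   Loan_Status    = c("N", "Y")))

# Pearson chi-squared test of independence
# (Yates' continuity correction is applied by default for 2x2 tables)
res <- chisq.test(credhist)
res$p.value
```

A very small p-value would be consistent with the pattern in the table, where about 8% of applicants with Credit_History = 0 are approved compared with about 80% of those with Credit_History = 1.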

Numerical Variables

  • ApplicantIncome: how much money does the applicant make?
  • CoapplicantIncome: how much money does the coapplicant make? If there is no coapplicant, this is 0.
  • LoanAmount: how much is the loan worth in thousands?
  • Loan_Amount_Term: how many months is the loan?

The pairs.panels function from the psych package shows several useful facts about our numeric data:

  • ApplicantIncome and LoanAmount are strongly correlated
  • The most common Loan_Amount_Term is 360 months
numeric_loan_data <- dplyr::select(loan_data,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term)
pairs.panels(numeric_loan_data, 
             method = "pearson", # correlation method
             hist.col = "#00AFBB",
             density = TRUE,  # show density plots
             ellipses = TRUE # show correlation ellipses
             )
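
The correlations that pairs.panels draws can also be inspected as plain numbers. The sketch below uses base R's `cor()` on the first six rows of the data set shown at the top of this report (`first_rows` is a hypothetical name, and six rows are far too few to draw conclusions from). Since LoanAmount contains missing values, `use = "pairwise.complete.obs"` is assumed here so that each pair of columns uses the rows where both are present:

```r
# First six rows of the loan data, numeric columns only
first_rows <- data.frame(
  ApplicantIncome   = c(5849, 4583, 3000, 2583, 6000, 5417),
  CoapplicantIncome = c(0, 1508, 0, 2358, 0, 4196),
  LoanAmount        = c(NA, 128, 66, 120, 141, 267)
)

# Pearson correlation matrix; NA rows are dropped pair by pair
cor_mat <- cor(first_rows, method = "pearson", use = "pairwise.complete.obs")
round(cor_mat, 2)
```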

Inspecting ApplicantIncome and Loan Income

Here we can see that ApplicantIncome does not have a large effect on whether the loan was approved (Loan_Status = Y) or not. The median ApplicantIncome is about the same for both groups. There are a few more high-income outliers in the group where the loan was approved.

approved <- loan_data[loan_data$Loan_Status == 'Y',]
denied <- loan_data[loan_data$Loan_Status == 'N',]

a <- ggplot(loan_data,aes(x=ApplicantIncome,color=Loan_Status))  + geom_boxplot()
b <- ggplot(approved,aes(x=ApplicantIncome,y=LoanAmount,color=Loan_Status)) + geom_point(color='blue') + xlab('Approved Applicant Income') + scale_x_continuous(limits = c(0, 25000)) + scale_y_continuous(limits = c(0, 650))
grid.arrange(a, b, nrow=2)

c <- ggplot(denied,aes(x=ApplicantIncome,y=LoanAmount,color=Loan_Status)) + geom_point(color='red') + xlab('Denied Applicant Income') + scale_x_continuous(limits = c(0, 25000)) + scale_y_continuous(limits = c(0, 650))

grid.arrange(c)

In addition, the sum of ApplicantIncome and CoapplicantIncome does not appear to have much predictive power for Loan_Status.

ggplot(data = loan_data, aes(x = Loan_Status, y = ApplicantIncome+CoapplicantIncome, fill=Loan_Status)) +
  geom_boxplot() +
  coord_flip()

LoanAmount Per ApplicantIncome

Now let’s see if the ratio of LoanAmount to ApplicantIncome has any predictive power when trying to determine whether a loan will be approved. The idea is that someone requesting a LoanAmount of 5 times their income might not be approved, while someone requesting 3 times their income could be.

Looking at the boxplots below, the median LoanAmtPerSalary is roughly the same for approved and not approved applications, which does not support this theory. The variable might still prove helpful in our modeling, so we will keep it.

loan_data$LoanAmtPerSalary <- loan_data$LoanAmount*100000/loan_data$ApplicantIncome
ggplot(loan_data,aes(x=LoanAmtPerSalary,color=Loan_Status)) + geom_boxplot() + scale_x_continuous(limits = c(0, 30000))
## Warning: Removed 25 rows containing non-finite values (stat_boxplot).
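
The factor of 100000 in the formula is easier to read when split into two steps. Assuming LoanAmount is recorded in thousands (as described above) and in the same currency unit as ApplicantIncome, multiplying by 1000 converts the loan to currency units and multiplying by 100 expresses the ratio as a percentage. A worked example for the second row of the data set:

```r
# Second row of the data set: LoanAmount = 128 (thousands), ApplicantIncome = 4583
loan_amount_thousands <- 128
applicant_income      <- 4583

# * 1000 converts thousands to currency units, * 100 expresses the ratio as a percentage
ratio_pct <- loan_amount_thousands * 1000 / applicant_income * 100
round(ratio_pct)  # 2793
```

On this scale a value of 2793 means the requested loan is about 27.9 times the stated ApplicantIncome, and "5 times income" corresponds to 500. Rows with a missing LoanAmount produce NA, so the 22 missing LoanAmount values presumably account for most of the 25 rows removed in the warning above.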

Data Prep for Model-fitting

I explicitly recoded the Y/N values of Loan_Status into 1/0.
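
The recoding step itself is not shown in this report. One way it could be done, sketched on a made-up vector (`loan_status` is hypothetical and stands in for the Loan_Status column):

```r
# Toy vector standing in for the Loan_Status column (illustrative values)
loan_status <- factor(c("Y", "N", "Y", "Y", "N"))

# Compare against "Y" and convert TRUE/FALSE to 1/0; an NA would stay NA
loan_status_num <- as.numeric(loan_status == "Y")
loan_status_num
```

Comparing against the label avoids relying on the internal integer codes of the factor, which depend on the order of the levels.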

Credit_History is categorical and is missing in 50 rows. Rather than deleting those rows, the code below keeps all 614 observations and uses the mice package to impute Credit_History along with the other incomplete columns, including LoanAmount and Loan_Amount_Term.

Additional Data Processing / Manipulation Steps

First, Credit_History is converted to a factor so that mice treats the column as a categorical variable.

Next, ApplicantIncome and CoapplicantIncome are combined into a new variable, TotalIncome, and the two input columns are dropped. Loan_ID is a unique identifier with no predictive value, so it is dropped as well.

loan_knn_pre_imp <- loan_knn
loan_knn_pre_imp$Credit_History <- as.factor(loan_knn_pre_imp$Credit_History)

loan_knn_pre_imp <- loan_knn_pre_imp %>% mutate(TotalIncome = ApplicantIncome + CoapplicantIncome)
loan_knn_pre_imp <- loan_knn_pre_imp %>% dplyr::select(-c('Loan_ID','ApplicantIncome','CoapplicantIncome'))

# recode Dependents level "3+" to "3" (revalue() comes from the plyr package)
loan_knn_pre_imp$Dependents <- plyr::revalue(loan_knn_pre_imp$Dependents, c("3+"="3"))


str(loan_knn_pre_imp)
## 'data.frame':    614 obs. of  12 variables:
##  $ Gender          : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Married         : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 2 2 2 2 2 ...
##  $ Dependents      : Factor w/ 4 levels "0","1","2","3": 1 2 1 1 1 3 1 4 3 2 ...
##  $ Education       : Factor w/ 2 levels "Graduate","Not Graduate": 1 1 1 2 1 1 2 1 1 1 ...
##  $ Self_Employed   : Factor w/ 2 levels "No","Yes": 1 1 2 1 1 2 1 1 1 1 ...
##  $ LoanAmount      : int  NA 128 66 120 141 267 95 158 168 349 ...
##  $ Loan_Amount_Term: int  360 360 360 360 360 360 360 360 360 360 ...
##  $ Credit_History  : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 1 2 2 ...
##  $ Property_Area   : Factor w/ 3 levels "Rural","Semiurban",..: 3 1 3 3 3 3 3 2 3 2 ...
##  $ Loan_Status     : num  1 0 1 1 1 1 1 0 1 0 ...
##  $ LoanAmtPerSalary: num  NA 2793 2200 4646 2350 ...
##  $ TotalIncome     : num  5849 6091 3000 4941 6000 ...

Status quo of missing data

clean_loan_data <- loan_data

vis_dat(clean_loan_data)

I set up a method vector (meth) that tells mice which imputation method to use for each column, and passed it together with the default predictorMatrix.

The seed is set to 501 so that the imputation is reproducible.

# clean_loan_data <- complete(mice(clean_loan_data,m=5,meth='pmm',print=FALSE))
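# dry run (maxit = 0) to obtain the default method vector and predictor matrix
init <- mice(loan_knn_pre_imp, maxit=0) 
meth <- init$method
predM <- init$predictorMatrix
# numeric columns: Bayesian linear regression
meth[c('LoanAmount','Loan_Amount_Term')] <- 'norm'
# binary factors: logistic regression
meth[c('Credit_History','Self_Employed','Gender','Married')] <- 'logreg'
# factor with more than two levels: polytomous regression
meth[c('Dependents')] <- 'polyreg'
# complete columns need no imputation method
meth[c('Loan_Status','TotalIncome','Property_Area','Education')] <- ''
# LoanAmtPerSalary keeps the default method chosen by mice (pmm for numeric columns)
loan_knn_imp1 <- mice(loan_knn_pre_imp, method=meth, predictorMatrix=predM, seed=501)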
## 
##  iter imp variable
##   1   1  Gender  Married  Dependents  Self_Employed  LoanAmount  Loan_Amount_Term  Credit_History  LoanAmtPerSalary
##   1   2  Gender  Married  Dependents  Self_Employed  LoanAmount  Loan_Amount_Term  Credit_History  LoanAmtPerSalary
##   1   3  Gender  Married  Dependents  Self_Employed  LoanAmount  Loan_Amount_Term  Credit_History  LoanAmtPerSalary
##   1   4  Gender  Married  Dependents  Self_Employed  LoanAmount  Loan_Amount_Term  Credit_History  LoanAmtPerSalary
##   1   5  Gender  Married  Dependents  Self_Employed  LoanAmount  Loan_Amount_Term  Credit_History  LoanAmtPerSalary
##   2   1  Gender  Married  Dependents  Self_Employed  LoanAmount  Loan_Amount_Term  Credit_History  LoanAmtPerSalary
##   2   2  Gender  Married  Dependents  Self_Employed  LoanAmount  Loan_Amount_Term  Credit_History  LoanAmtPerSalary
##   2   3  Gender  Married  Dependents  Self_Employed  LoanAmount  Loan_Amount_Term  Credit_History  LoanAmtPerSalary
##   2   4  Gender  Married  Dependents  Self_Employed  LoanAmount  Loan_Amount_Term  Credit_History  LoanAmtPerSalary
##   2   5  Gender  Married  Dependents  Self_Employed  LoanAmount  Loan_Amount_Term  Credit_History  LoanAmtPerSalary
##   3   1  Gender  Married  Dependents  Self_Employed  LoanAmount  Loan_Amount_Term  Credit_History  LoanAmtPerSalary
##   3   2  Gender  Married  Dependents  Self_Employed  LoanAmount  Loan_Amount_Term  Credit_History  LoanAmtPerSalary
##   3   3  Gender  Married  Dependents  Self_Employed  LoanAmount  Loan_Amount_Term  Credit_History  LoanAmtPerSalary
##   3   4  Gender  Married  Dependents  Self_Employed  LoanAmount  Loan_Amount_Term  Credit_History  LoanAmtPerSalary
##   3   5  Gender  Married  Dependents  Self_Employed  LoanAmount  Loan_Amount_Term  Credit_History  LoanAmtPerSalary
##   4   1  Gender  Married  Dependents  Self_Employed  LoanAmount  Loan_Amount_Term  Credit_History  LoanAmtPerSalary
##   4   2  Gender  Married  Dependents  Self_Employed  LoanAmount  Loan_Amount_Term  Credit_History  LoanAmtPerSalary
##   4   3  Gender  Married  Dependents  Self_Employed  LoanAmount  Loan_Amount_Term  Credit_History  LoanAmtPerSalary
##   4   4  Gender  Married  Dependents  Self_Employed  LoanAmount  Loan_Amount_Term  Credit_History  LoanAmtPerSalary
##   4   5  Gender  Married  Dependents  Self_Employed  LoanAmount  Loan_Amount_Term  Credit_History  LoanAmtPerSalary
##   5   1  Gender  Married  Dependents  Self_Employed  LoanAmount  Loan_Amount_Term  Credit_History  LoanAmtPerSalary
##   5   2  Gender  Married  Dependents  Self_Employed  LoanAmount  Loan_Amount_Term  Credit_History  LoanAmtPerSalary
##   5   3  Gender  Married  Dependents  Self_Employed  LoanAmount  Loan_Amount_Term  Credit_History  LoanAmtPerSalary
##   5   4  Gender  Married  Dependents  Self_Employed  LoanAmount  Loan_Amount_Term  Credit_History  LoanAmtPerSalary
##   5   5  Gender  Married  Dependents  Self_Employed  LoanAmount  Loan_Amount_Term  Credit_History  LoanAmtPerSalary

Manual Examinations

After manually examining the imputed values across the five imputations, I decided to go with imputation #3.

# Manual examination 
#Credit_History
loan_knn_imp1$imp$Credit_History
##     1 2 3 4 5
## 17  1 1 1 1 1
## 25  0 1 1 0 1
## 31  1 1 1 1 1
## 43  1 1 1 1 1
## 80  1 1 1 1 1
## 84  0 0 1 0 0
## 87  1 1 1 1 1
## 96  0 0 1 0 1
## 118 1 1 1 1 1
## 126 1 1 1 1 1
## 130 1 0 1 0 1
## 131 1 1 1 1 1
## 157 1 1 1 1 1
## 182 0 1 1 0 1
## 188 1 1 1 1 1
## 199 1 1 1 1 1
## 220 1 1 1 1 1
## 237 1 0 0 1 1
## 238 1 1 1 1 1
## 260 0 0 1 1 0
##  [ reached 'max' / getOption("max.print") -- omitted 30 rows ]
loan_knn[96:118,]
##      Loan_ID Gender Married Dependents    Education Self_Employed
## 96  LP001326   Male      No          0     Graduate          <NA>
## 97  LP001327 Female     Yes          0     Graduate            No
## 98  LP001333   Male     Yes          0     Graduate            No
## 99  LP001334   Male     Yes          0 Not Graduate            No
## 100 LP001343   Male     Yes          0     Graduate            No
## 101 LP001345   Male     Yes          2 Not Graduate            No
## 102 LP001349   Male      No          0     Graduate            No
##     ApplicantIncome CoapplicantIncome LoanAmount Loan_Amount_Term
## 96             6782                 0         NA              360
## 97             2484              2302        137              360
## 98             1977               997         50              360
## 99             4188                 0        115              180
## 100            1759              3541        131              360
## 101            4288              3263        133              180
## 102            4843              3806        151              360
##     Credit_History Property_Area Loan_Status LoanAmtPerSalary
## 96              NA         Urban           0               NA
## 97               1     Semiurban           1         5515.298
## 98               1     Semiurban           1         2529.084
## 99               1     Semiurban           1         2745.941
## 100              1     Semiurban           1         7447.413
## 101              1         Urban           1         3101.679
## 102              1     Semiurban           1         3117.902
##  [ reached 'max' / getOption("max.print") -- omitted 16 rows ]
#Married
loan_knn_imp1$imp$Married
##       1   2   3   4   5
## 105 Yes Yes Yes  No Yes
## 229  No Yes Yes Yes Yes
## 436 Yes Yes  No  No  No
loan_knn[430:436,]
##      Loan_ID Gender Married Dependents    Education Self_Employed
## 430 LP002370   Male      No          0 Not Graduate            No
## 431 LP002377 Female      No          1     Graduate           Yes
## 432 LP002379   Male      No          0     Graduate            No
## 433 LP002386   Male      No          0     Graduate          <NA>
## 434 LP002387   Male     Yes          0     Graduate            No
## 435 LP002390   Male      No          0     Graduate            No
## 436 LP002393 Female    <NA>       <NA>     Graduate            No
##     ApplicantIncome CoapplicantIncome LoanAmount Loan_Amount_Term
## 430            2717                 0         60              180
## 431            8624                 0        150              360
## 432            6500                 0        105              360
## 433           12876                 0        405              360
## 434            2425              2340        143              360
## 435            3750                 0        100              360
## 436           10047                 0         NA              240
##     Credit_History Property_Area Loan_Status LoanAmtPerSalary
## 430              1         Urban           1         2208.318
## 431              1     Semiurban           1         1739.332
## 432              0         Rural           0         1615.385
## 433              1     Semiurban           1         3145.387
## 434              1     Semiurban           1         5896.907
## 435              1         Urban           1         2666.667
## 436              1     Semiurban           1               NA
#Dependents
loan_knn_imp1$imp$Dependents
##     1 2 3 4 5
## 103 2 1 3 1 0
## 105 0 3 0 2 0
## 121 2 2 3 2 0
## 227 2 2 1 1 1
## 229 0 0 2 1 2
## 294 0 0 3 0 0
## 302 0 0 1 3 1
## 333 0 0 0 0 0
## 336 0 0 1 2 2
## 347 0 0 3 0 1
## 356 1 0 0 0 0
## 436 2 2 0 0 0
## 518 0 2 0 1 2
## 572 0 0 1 1 2
## 598 0 0 0 0 0
loan_knn[227:229,]
##      Loan_ID Gender Married Dependents    Education Self_Employed
## 227 LP001754   Male     Yes       <NA> Not Graduate           Yes
## 228 LP001758   Male     Yes          2     Graduate            No
## 229 LP001760   Male    <NA>       <NA>     Graduate            No
##     ApplicantIncome CoapplicantIncome LoanAmount Loan_Amount_Term
## 227            4735                 0        138              360
## 228            6250              1695        210              360
## 229            4758                 0        158              480
##     Credit_History Property_Area Loan_Status LoanAmtPerSalary
## 227              1         Urban           0         2914.467
## 228              1     Semiurban           1         3360.000
## 229              1     Semiurban           1         3320.723

Decision

We picked imputation #3 of the five imputed data sets.

loan_knn2 <- complete(loan_knn_imp1, 3) # 2nd argument if not provided is defaulted to 1
clean_loan_data <- loan_knn2

# have to redo loan_status as loan_knn's loan status had been recoded to numeric on purpose
clean_loan_data$Loan_Status <- as.factor(clean_loan_data$Loan_Status)
str(clean_loan_data)
## 'data.frame':    614 obs. of  12 variables:
##  $ Gender          : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Married         : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 2 2 2 2 2 ...
##  $ Dependents      : Factor w/ 4 levels "0","1","2","3": 1 2 1 1 1 3 1 4 3 2 ...
##  $ Education       : Factor w/ 2 levels "Graduate","Not Graduate": 1 1 1 2 1 1 2 1 1 1 ...
##  $ Self_Employed   : Factor w/ 2 levels "No","Yes": 1 1 2 1 1 2 1 1 1 1 ...
##  $ LoanAmount      : num  21.1 128 66 120 141 ...
##  $ Loan_Amount_Term: num  360 360 360 360 360 360 360 360 360 360 ...
##  $ Credit_History  : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 1 2 2 ...
##  $ Property_Area   : Factor w/ 3 levels "Rural","Semiurban",..: 3 1 3 3 3 3 3 2 3 2 ...
##  $ Loan_Status     : Factor w/ 2 levels "0","1": 2 1 2 2 2 2 2 1 2 1 ...
##  $ LoanAmtPerSalary: num  403 2793 2200 4646 2350 ...
##  $ TotalIncome     : num  5849 6091 3000 4941 6000 ...

Imbalanced Dataset

Notice that the response variable is imbalanced: about 31% of the loans are No (0) and 69% are Yes (1).

imb_dat <- as.data.frame(prop.table(x = table(clean_loan_data$Loan_Status)))
colnames(imb_dat) <- c("Loan Status", "Freq")
imb_dat
##   Loan Status      Freq
## 1           0 0.3127036
## 2           1 0.6872964

Splitting Data into Training & Testing

Here we are going to use 80% of our data to train the model and reserve 20% to test the model we pick.

set.seed(1042)
sample_size <- floor(nrow(clean_loan_data)*0.8)
indices <- sample(1:nrow(clean_loan_data),sample_size)
train <- clean_loan_data[c(indices),]
test <- clean_loan_data[-c(indices),]
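
The split above samples rows uniformly, so the class shares in the train and test sets can drift away from the overall 31/69 split. As a point of comparison, here is a minimal base-R sketch of a stratified 80/20 split. The helper name `stratified_indices` and the synthetic vector `y` are assumptions made for illustration; they are not part of the analysis above.

```r
# Hypothetical helper: sample a fixed fraction of row indices within each class.
stratified_indices <- function(y, p = 0.8) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    n_take <- floor(length(idx) * p)
    idx[sample.int(length(idx), n_take)]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(1042)
# Synthetic stand-in for Loan_Status: 192 zeros and 422 ones (about 31/69).
y <- factor(c(rep("0", 192), rep("1", 422)))
train_idx <- stratified_indices(y, p = 0.8)

prop.table(table(y[train_idx]))   # class shares in the training part
prop.table(table(y[-train_idx]))  # class shares in the held-out part
```

Both parts keep roughly the same class shares as the full vector, which makes test-set metrics easier to compare with cross-validated ones.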

2. Linear Discriminant Analysis

LDA does not seem to be a good approach for this data set, as the two classes are not linearly separable in the available data.

train%>%
  ggplot(aes(x = log(LoanAmount), y= log(TotalIncome), color = Loan_Status)) + geom_point()

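LDA Cross Validation

Predictions with the LDA model are less accurate than if we just used the binary predictor Credit_History to determine whether or not a loan would be approved. On the test set, the model predicts approval for every applicant, so its accuracy only matches the no-information rate (73.17%).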

# repeated cross-validation: 10 folds (caret's default number), repeated 11 times
ctrl <- trainControl(method = 'repeatedcv', repeats = 11)

LDA Model Results

lda.fit <- train(Loan_Status ~ TotalIncome + LoanAmount,
             data = train,
             method = 'lda',
             trControl = ctrl
             )
test$lda <- predict(lda.fit, test)
confusionMatrix(test$lda, test$Loan_Status)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0  0  0
##          1 33 90
##                                           
##                Accuracy : 0.7317          
##                  95% CI : (0.6443, 0.8076)
##     No Information Rate : 0.7317          
##     P-Value [Acc > NIR] : 0.5467          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : 2.54e-08        
##                                           
##             Sensitivity : 0.0000          
##             Specificity : 1.0000          
##          Pos Pred Value :    NaN          
##          Neg Pred Value : 0.7317          
##              Prevalence : 0.2683          
##          Detection Rate : 0.0000          
##    Detection Prevalence : 0.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : 0               
## 
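
The confusion matrix above has an empty first row: the model never predicts class 0. The following sketch recomputes accuracy, the no-information rate, and Cohen's kappa from those four counts to show why kappa comes out as exactly 0. The variable names are chosen for illustration, and the kappa formula is the standard observed-versus-chance agreement definition.

```r
# Test-set confusion matrix from the LDA output above
# (rows = prediction, columns = reference; matrix() fills column by column).
cm <- matrix(c(0, 33, 0, 90), nrow = 2,
             dimnames = list(Prediction = c("0", "1"),
                             Reference  = c("0", "1")))

n <- sum(cm)
accuracy <- sum(diag(cm)) / n                     # observed agreement
nir      <- max(colSums(cm)) / n                  # share of the largest reference class
expected <- sum(rowSums(cm) * colSums(cm)) / n^2  # agreement expected by chance
cohen_kappa <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, nir = nir, kappa = cohen_kappa), 4)
```

Because every prediction is class 1, the observed agreement equals the chance agreement, so the model adds nothing beyond always guessing the majority class.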

3. K-nearest Neighbor

First off, set seed = 688.

Create training/test partitions by calling createDataPartition. p is set to .8 to mean 80/20 split for train/test set.

Checking the structure of the train set (knn_train)

Checking the structure of the test set (knn_test)

str(knn_test)
## 'data.frame':    122 obs. of  12 variables:
##  $ Gender          : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 1 2 ...
##  $ Married         : Factor w/ 2 levels "No","Yes": 1 2 2 2 2 1 1 2 2 2 ...
##  $ Dependents      : Factor w/ 4 levels "0","1","2","3": 1 3 2 3 3 2 4 3 2 1 ...
##  $ Education       : Factor w/ 2 levels "Graduate","Not Graduate": 1 1 1 2 2 1 1 1 1 1 ...
##  $ Self_Employed   : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 2 1 1 2 1 ...
##  $ LoanAmount      : num  141 267 349 112 110 106 320 134 286 96 ...
##  $ Loan_Amount_Term: num  360 360 360 360 360 360 360 360 360 360 ...
##  $ Credit_History  : Factor w/ 2 levels "0","1": 2 2 2 1 2 2 2 2 1 2 ...
##  $ Property_Area   : Factor w/ 3 levels "Rural","Semiurban",..: 3 3 2 1 3 1 1 3 3 2 ...
##  $ Loan_Status     : Factor w/ 2 levels "0","1": 2 2 1 1 2 1 1 1 1 2 ...
##  $ LoanAmtPerSalary: num  2350 4929 2718 3328 2603 ...
##  $ TotalIncome     : num  6000 9613 23809 5282 5266 ...

Cross Validation

Perform repeated 10-fold cross-validation with 11 repeats; that is, 11 complete sets of 10 folds are computed (trainControl defaults to number = 10 when only repeats is given). For this classification problem, we assigned our fitted model to knn.fit. The cross-validation settings are passed to train through the trControl argument.

# cleaning up some parallel computing
# https://stackoverflow.com/questions/25097729/un-register-a-doparallel-cluster
registerDoSEQ()

trControl <- trainControl(method  = "repeatedcv",
                          repeats  = 11)
knn.fit <- train(Loan_Status ~ .,
             method     = "knn",
             tuneGrid   = expand.grid(k = 1:10),
             trControl  = trControl,
             preProcess = c("center","scale"),
             data       = knn_train
             )

knn.fit 
## k-Nearest Neighbors 
## 
## 492 samples
##  11 predictor
##   2 classes: '0', '1' 
## 
## Pre-processing: centered (14), scaled (14) 
## Resampling: Cross-Validated (10 fold, repeated 11 times) 
## Summary of sample sizes: 442, 444, 443, 443, 442, 444, ... 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    1  0.7122390  0.3217246
##    2  0.7089797  0.3061592
##    3  0.7694411  0.4090793
##    4  0.7605727  0.3808163
##    5  0.7838922  0.4188985
##    6  0.7732961  0.3882704
##    7  0.7848086  0.4070143
##    8  0.7832899  0.4023703
##    9  0.7912427  0.4176016
##   10  0.7873503  0.4063482
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.
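
To make the resampling scheme concrete, the sketch below builds fold assignments for 10-fold cross-validation repeated 11 times on 492 rows, using base R only. It is an illustration under simplifying assumptions: the helper `repeated_folds` is hypothetical, and it assigns folds purely at random, whereas caret also balances the class shares across folds, so its fold sizes vary slightly more.

```r
# Hypothetical helper: for each repeat, randomly assign each of n rows to one of k folds.
repeated_folds <- function(n, k = 10, repeats = 11) {
  lapply(seq_len(repeats), function(r) sample(rep_len(seq_len(k), n)))
}

set.seed(688)
folds <- repeated_folds(n = 492, k = 10, repeats = 11)

n_resamples <- length(folds) * 10        # model fits per candidate number of neighbors
n_resamples
held_out_sizes <- table(folds[[1]])      # rows held out by each fold in the first repeat
held_out_sizes
492 - range(held_out_sizes)              # matching training-set sizes
```

Each candidate value of k neighbors is therefore evaluated on 110 train/hold-out pairs, and the reported accuracy is the average over them.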

Key Information on the value of K

Since our target variable is a two-level factor, caret uses Accuracy as the default performance metric, so the optimal K is the one with the highest cross-validated accuracy. K = 9 was selected; that is, the final model uses 9 neighbors.

plot(knn.fit)

Model Results

knn_pred <- predict(knn.fit, newdata = knn_test)
# options('max.print' = 100)  
# getOption("max.print")
confusionMatrix(knn_pred, knn_test$Loan_Status)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 15  3
##          1 23 81
##                                           
##                Accuracy : 0.7869          
##                  95% CI : (0.7035, 0.8558)
##     No Information Rate : 0.6885          
##     P-Value [Acc > NIR] : 0.0104406       
##                                           
##                   Kappa : 0.4195          
##                                           
##  Mcnemar's Test P-Value : 0.0001944       
##                                           
##             Sensitivity : 0.3947          
##             Specificity : 0.9643          
##          Pos Pred Value : 0.8333          
##          Neg Pred Value : 0.7788          
##              Prevalence : 0.3115          
##          Detection Rate : 0.1230          
##    Detection Prevalence : 0.1475          
##       Balanced Accuracy : 0.6795          
##                                           
##        'Positive' Class : 0               
## 

Accuracy is 78.69% while balanced accuracy is only 67.95%.
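
The gap between the two figures comes from the low sensitivity on class 0. As a worked check, the sketch below recomputes the four metrics from the KNN confusion matrix counts above; the variable names are chosen for illustration.

```r
# Test-set confusion matrix from the KNN output above
# (rows = prediction, columns = reference; matrix() fills column by column).
cm <- matrix(c(15, 23, 3, 81), nrow = 2,
             dimnames = list(Prediction = c("0", "1"),
                             Reference  = c("0", "1")))

# caret reports "0" (loan not approved) as the positive class.
sensitivity <- cm["0", "0"] / sum(cm[, "0"])   # 15 / 38
specificity <- cm["1", "1"] / sum(cm[, "1"])   # 81 / 84
accuracy    <- sum(diag(cm)) / sum(cm)         # 96 / 122
balanced_accuracy <- (sensitivity + specificity) / 2

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, balanced_accuracy = balanced_accuracy), 4)
```

Plain accuracy is dominated by the larger class 1, while balanced accuracy weights both classes equally and so exposes the weak detection of rejected loans.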

4. Decision Trees

Now we will use a decision tree to see how well it will perform on our data.
* Our decision tree starts by splitting users based on their Credit_History. This makes sense based on our exploratory data analysis.
* Other variables used in the decision tree include LoanAmount, Property_Area, etc.

loan_tree = tree(Loan_Status ~., train)
plot(loan_tree)
text(loan_tree)
title(main = "Unpruned Decision Tree")

Decision Tree Performance

Training Data

Now we will see how our model performs on the training data. The model predicted Loan_Status with an accuracy of ~83.5%; 81 of the 491 training instances were incorrectly classified.

pred_tree_train <- predict(loan_tree,train,type="class")
# this table summarizes the training data, so it is named accordingly
train_table <- table(pred_tree_train,train$Loan_Status) %>% kbl() %>% kable_styling()
train_table
Prediction    0    1
0            89   11
1            70  321
mean(pred_tree_train == train$Loan_Status)
## [1] 0.8350305

Cross-validation for better performance

The first version of our model was a full, unpruned tree. Now we are going to prune it back using cross-validation. We have plotted the number of misclassifications against tree size. As we can see, the trees of size 2 and 4 have the fewest cross-validated misclassifications (97 and 101, respectively). We will choose size 4, which is within a few misclassifications of the minimum.

set.seed(2311)
cv_trees = cv.tree(loan_tree,FUN = prune.misclass)
cv_trees
## $size
## [1] 15 10  4  2  1
## 
## $dev
## [1] 108 108 101  97 159
## 
## $k
## [1]      -Inf  0.000000  1.333333  2.500000 65.000000
## 
## $method
## [1] "misclass"
## 
## attr(,"class")
## [1] "prune"         "tree.sequence"
plot(cv_trees)
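
The best size can also be read off the `$size` and `$dev` vectors directly rather than from the plot. The sketch below does this with the values copied from the output above; the tolerance of 5 misclassifications is an arbitrary choice made only for illustration.

```r
# Values copied from the cv.tree output above.
size <- c(15, 10, 4, 2, 1)
dev  <- c(108, 108, 101, 97, 159)

# Tree size with the fewest cross-validated misclassifications.
best_size <- size[which.min(dev)]
best_size

# Sizes whose misclassification count is within 5 of the minimum
# (the tolerance of 5 is an illustrative assumption).
near_best <- size[dev <= min(dev) + 5]
near_best
```

Sizes 2 and 4 are the only candidates close to the minimum, while the single-node tree (size 1) is far worse than any tree that splits.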

With size = 4, our pruned decision tree looks like the following:

loan_tree_pruned = prune.misclass(loan_tree,best=4)
plot(loan_tree_pruned)
text(loan_tree_pruned)

Testing Data

Now let’s see how our pruned tree performs on our testing data. The accuracy on the test data was ~83%, almost the same as the unpruned tree’s accuracy on the training data. 21 of the 123 test observations were misclassified.

pred_tree_test  <- predict(loan_tree_pruned,test, type="class")
test_table <- table(pred_tree_test,test$Loan_Status) %>% kbl() %>% kable_styling()
test_table
Prediction    0    1
0            14    2
1            19   88
mean(pred_tree_test == test$Loan_Status)
## [1] 0.8292683
confusionMatrix(pred_tree_test, test$Loan_Status)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 14  2
##          1 19 88
##                                           
##                Accuracy : 0.8293          
##                  95% CI : (0.7509, 0.8911)
##     No Information Rate : 0.7317          
##     P-Value [Acc > NIR] : 0.0075806       
##                                           
##                   Kappa : 0.4804          
##                                           
##  Mcnemar's Test P-Value : 0.0004803       
##                                           
##             Sensitivity : 0.4242          
##             Specificity : 0.9778          
##          Pos Pred Value : 0.8750          
##          Neg Pred Value : 0.8224          
##              Prevalence : 0.2683          
##          Detection Rate : 0.1138          
##    Detection Prevalence : 0.1301          
##       Balanced Accuracy : 0.7010          
##                                           
##        'Positive' Class : 0               
## 

5. Random Forests

Now we will develop a random forest model to see how well it performs on our data. Parameters for a random forest include:
* mtry: Number of variables randomly sampled as candidates at each split. Note that the default values are different for classification (sqrt(p), where p is the number of variables in x) and regression (p/3).
* ntree: Number of trees to grow. This should not be set to too small a number, to ensure that every input row gets predicted at least a few times.

Our initial model uses mtry = sqrt(12) ≈ 3.46, computed from the 12 columns of the training data (response included), and the default ntree = 500.
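
As a quick arithmetic check on the mtry values involved, the sketch below compares the square root of the column count, which the model code that follows passes to caret, with the floored defaults that randomForest applies on its own. It assumes p counts the 11 predictor columns before any dummy-variable expansion; the variable names are chosen for illustration.

```r
n_cols <- 12                  # columns in the training data, response included
n_predictors <- n_cols - 1    # Loan_Status is the response, not a predictor

mtry_used <- sqrt(n_cols)     # square root of the full column count
mtry_used

# randomForest's defaults are floored:
mtry_classification <- floor(sqrt(n_predictors))       # classification default
mtry_regression     <- max(floor(n_predictors / 3), 1) # regression default, for comparison
c(classification = mtry_classification, regression = mtry_regression)
```

Under these assumptions the package default for classification would be 3, slightly below the 3.46 obtained from the full column count.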

# find out no of cores 
no_cores <- detectCores() - 1

cl<-makePSOCKcluster(no_cores)
  
registerDoParallel(cl)
  
# start.time<-proc.time()
  
# model<-train(target~., data=trainingset, method='rf')
# Loan_ID is already absent from train/test (12 variables), so we only copy them here
train_rf1 <- train
test_rf1 <- test 

# Create model with default parameters
control <- trainControl(method="repeatedcv", number=10, repeats=3)
mtry <- sqrt(ncol(train_rf1))
tunegrid <- expand.grid(.mtry=mtry)
rf_default <- train(Loan_Status~., data=train_rf1, method="rf", metric="Accuracy", tuneGrid=tunegrid, trControl=control)
print(rf_default)
## Random Forest 
## 
## 491 samples
##  11 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 443, 442, 441, 442, 442, 442, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.8011678  0.4879166
## 
## Tuning parameter 'mtry' was held constant at a value of 3.464102
# stop.time<-proc.time()
# 
# run.time<-stop.time -start.time
# 
# print(run.time)
#   
# stopCluster(cl)

Our initial model has an accuracy of about 80%. Let’s see if we can improve accuracy by finding an optimal mtry value. We will test mtry values from 1 to 10 using a grid search. The results show that mtry = 3 gives the highest cross-validated accuracy (0.8031), with mtry = 2 close behind (0.8003).

control <- trainControl(method="repeatedcv", number=10, repeats=3, search="grid")
set.seed(123)
tunegrid <- expand.grid(.mtry=c(1:10))
rf_gridsearch <- train(Loan_Status~., data=train_rf1, method="rf", metric="Accuracy", tuneGrid=tunegrid, trControl=control)
print(rf_gridsearch)
## Random Forest 
## 
## 491 samples
##  11 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 442, 442, 441, 442, 442, 441, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    1    0.6795697  0.0138783
##    2    0.8003464  0.4745384
##    3    0.8030964  0.4913187
##    4    0.7983611  0.4853485
##    5    0.7895170  0.4665550
##    6    0.7874632  0.4625780
##    7    0.7881434  0.4672603
##    8    0.7847562  0.4591980
##    9    0.7846871  0.4605619
##   10    0.7820074  0.4550405
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 3.
plot(rf_gridsearch)

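Next let’s find the optimal value for ntree. caret’s rf method tunes only mtry, so we loop over several ntree values with mtry held at 2. The mean accuracies differ by less than 0.2 percentage points: ntree = 500 has the highest mean (0.8025), while 1500 and 2500 tie at 0.8018. We proceed with ntree = 1500.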

control <- trainControl(method="repeatedcv", number=10, repeats=3, search="grid")
tunegrid <- expand.grid(.mtry=2)
modellist <- list()
for (ntree in c(500, 1000, 1500, 2000, 2500)) {
  set.seed(124)
    fit <- train(Loan_Status~., data=train_rf1, method="rf", metric="Accuracy", tuneGrid=tunegrid, trControl=control, ntree=ntree)
    key <- toString(ntree)
    modellist[[key]] <- fit
}
# compare results
results <- resamples(modellist)
summary(results)
## 
## Call:
## summary.resamples(object = results)
## 
## Models: 500, 1000, 1500, 2000, 2500 
## Number of resamples: 30 
## 
## Accuracy 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## 500  0.7346939 0.7755102 0.7959184 0.8025006 0.8190816 0.8958333    0
## 1000 0.7346939 0.7755102 0.7959184 0.8011400 0.8163265 0.8958333    0
## 1500 0.7346939 0.7755102 0.7959184 0.8018203 0.8163265 0.8979592    0
## 2000 0.7346939 0.7755102 0.7959184 0.8011400 0.8163265 0.8958333    0
## 2500 0.7346939 0.7755102 0.7959184 0.8018203 0.8163265 0.8979592    0
## 
## Kappa 
##           Min.   1st Qu.    Median      Mean  3rd Qu.      Max. NA's
## 500  0.2669735 0.4038375 0.4688651 0.4828621 0.545829 0.7435897    0
## 1000 0.2669735 0.4035402 0.4688651 0.4794511 0.541709 0.7435897    0
## 1500 0.2669735 0.4035402 0.4688651 0.4813131 0.541709 0.7476828    0
## 2000 0.2669735 0.4035402 0.4688651 0.4794511 0.541709 0.7435897    0
## 2500 0.2669735 0.4035402 0.4688651 0.4813131 0.541709 0.7476828    0
dotplot(results)

Our final random forest model will have mtry=2 and ntree=1500.

rf_final <- randomForest(Loan_Status ~ ., 
                        data = train_rf1, 
                        ntree = 1500, 
                        mtry = 2,
                        importance = TRUE,
                        proximity = TRUE)

print(rf_final)
## 
## Call:
##  randomForest(formula = Loan_Status ~ ., data = train_rf1, ntree = 1500,      mtry = 2, importance = TRUE, proximity = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 1500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 20.16%
## Confusion matrix:
##    0   1 class.error
## 0 73  86  0.54088050
## 1 13 319  0.03915663
#variable importance
round(importance(rf_final), 2)
##                      0     1 MeanDecreaseAccuracy MeanDecreaseGini
## Gender           -3.23  7.74                 4.91             3.08
## Married          -3.14  7.90                 5.27             3.78
## Dependents       -4.43 10.80                 6.82             8.80
## Education         0.92  2.94                 3.06             3.91
## Self_Employed    -4.98  8.16                 4.25             3.06
## LoanAmount       -3.72 24.77                20.67            25.34
## Loan_Amount_Term  7.10  9.66                11.98             8.48
## Credit_History   93.68 95.74               104.16            50.51
## Property_Area     1.28  3.23                 3.38             8.43
## LoanAmtPerSalary  3.34 17.43                16.68            27.54
## TotalIncome      -4.38 23.64                19.72            27.19
prediction <-predict(rf_final, test_rf1)
confusionMatrix(prediction, test_rf1$Loan_Status)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 14  1
##          1 19 89
##                                           
##                Accuracy : 0.8374          
##                  95% CI : (0.7601, 0.8978)
##     No Information Rate : 0.7317          
##     P-Value [Acc > NIR] : 0.0039800       
##                                           
##                   Kappa : 0.4994          
##                                           
##  Mcnemar's Test P-Value : 0.0001439       
##                                           
##             Sensitivity : 0.4242          
##             Specificity : 0.9889          
##          Pos Pred Value : 0.9333          
##          Neg Pred Value : 0.8241          
##              Prevalence : 0.2683          
##          Detection Rate : 0.1138          
##    Detection Prevalence : 0.1220          
##       Balanced Accuracy : 0.7066          
##                                           
##        'Positive' Class : 0               
## 
# stop.time<-proc.time()

# run.time<-stop.time -start.time

# print(run.time)

# Stopping Cluster
stopCluster(cl)

Accuracy of our final random forest model is about 84% on the test data, with 20 of 123 instances misclassified: 19 loans that were not approved (0) were predicted as approved, and 1 approved loan was predicted as not approved. Credit_History is the most important feature.

6. Model Performance

Model Performance Matrix
Metric             LDA     K-Nearest Neighbor (KNN)  Decision Trees  Random Forest
Accuracy           0.7317  0.7869                    0.8293          0.8374
Balanced Accuracy  0.5000  0.6795                    0.7010          0.7066
Sensitivity        0.0000  0.3947                    0.4242          0.4242

Notice that Decision Trees and Random Forest have the same sensitivity, 42.42%. They also have the highest accuracy and balanced accuracy, while LDA has the lowest value on every metric. Balanced accuracy is usually the more informative metric for an imbalanced binary response such as ours, which is not split 50/50. The model we picked is Random Forest, which scores highest on both accuracy and balanced accuracy.