Source Code: https://github.com/djlofland/DATA622_S2021_Group2/tree/master/Homework3
Please use K-nearest neighbor (KNN) algorithm to predict the species variable. Please be sure to walk through the steps you took. (40 points)
Similar to past assignments when using the Palmer Penguins dataset, we’ll first do some quick exploratory analysis to examine the different features available.
We can see below that there are four continuous variables: bill_length_mm, bill_depth_mm, flipper_length_mm and body_mass_g. Additionally, there are a few categorical variables: island, sex, and year.
| species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|
| Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male | 2007 |
| Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female | 2007 |
| Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female | 2007 |
| Adelie | Torgersen | NA | NA | NA | NA | NA | 2007 |
| Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female | 2007 |
| Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male | 2007 |
For the continuous variables, we can examine the distributions, broken out by the target variable, species:
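A minimal sketch of how such per-species density plots could be produced, assuming the tidyverse is loaded alongside the penguins data:

penguins %>%
  dplyr::select(species, bill_length_mm, bill_depth_mm,
                flipper_length_mm, body_mass_g) %>%
  pivot_longer(-species, names_to = 'feature', values_to = 'value') %>%
  ggplot(aes(x = value, fill = species)) +
  geom_density(alpha = 0.5, na.rm = TRUE) +       # one density per species
  facet_wrap(~ feature, scales = 'free')          # one panel per feature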
By separating our distributions by our target variable, species, we can see that many of the feature interactions show clustering of Adelie and Chinstrap penguins (red and green), while Gentoo penguins tend to stand apart from the other two species in most interactions. This is also confirmed by most of the single-variable distributions split out by species in the plots below. With the exception of bill_length_mm, the distributions of body_mass_g, bill_depth_mm, and flipper_length_mm all overlap substantially between the Adelie and Chinstrap species.
Next, we’ll do some data tidying to get our data set ready for our KNN model. year reflects the date/time of recording and requires some additional thought. If we hypothesize that variability in penguin features could have a time-dependent effect, then year could be important: for example, if penguin food supplies vary year by year and this impacts growth, or if yearly temperature affects penguin development. If we thought global warming might be impacting penguins’ physical development, or that natural disasters might have impacted growth in specific years, then again, year might matter. In this situation, however, because we don’t have any reason to expect a time-dependent effect, year probably doesn’t add any explanatory power to our models and we can remove it from our data set:
penguins <- penguins %>%
  dplyr::select(-year)

Additionally, when looking at the number of missing values, we can see the following:
penguins %>%
  summarise_all(~ sum(is.na(.))) %>%
  pivot_longer(cols = 1:7, names_to = 'columns', values_to = 'NA_count') %>%
  arrange(desc(NA_count)) %>%
  ggplot(aes(y = columns, x = NA_count)) + geom_col(fill = 'deepskyblue4') +
  geom_label(aes(label = NA_count)) +
  theme_minimal() +
  labs(title = 'Count of missing values in penguins dataset') +
  theme(plot.title = element_text(hjust = 0.5))

gg_miss_upset(penguins)

We can see that 11 individuals have at least one missing data point, and that the missing data points are clustered: two penguin records are missing bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, and sex, while nine other records are missing just the sex data point. We cannot do much with the 2 records missing most of their data, so those will be dropped. For the records missing just sex, while we could try to KNN impute, we don’t know enough about sex differences in these measurements to know whether that’s a safe operation or whether it would introduce additional variability. Therefore, we’ll drop these records from our data set as well.
penguins <- na.omit(penguins)

Now, we can take a look at a summary of the data before splitting, to get a sense of what still needs to be tidied in order to get it ready for our KNN model:
skim(penguins)

| Name | penguins |
| Number of rows | 333 |
| Number of columns | 7 |
| _______________________ | |
| Column type frequency: | |
| factor | 3 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| species | 0 | 1 | FALSE | 3 | Ade: 146, Gen: 119, Chi: 68 |
| island | 0 | 1 | FALSE | 3 | Bis: 163, Dre: 123, Tor: 47 |
| sex | 0 | 1 | FALSE | 2 | mal: 168, fem: 165 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| bill_length_mm | 0 | 1 | 43.99 | 5.47 | 32.1 | 39.5 | 44.5 | 48.6 | 59.6 | ▃▇▇▆▁ |
| bill_depth_mm | 0 | 1 | 17.16 | 1.97 | 13.1 | 15.6 | 17.3 | 18.7 | 21.5 | ▅▆▇▇▂ |
| flipper_length_mm | 0 | 1 | 200.97 | 14.02 | 172.0 | 190.0 | 197.0 | 213.0 | 231.0 | ▂▇▃▅▃ |
| body_mass_g | 0 | 1 | 4207.06 | 805.22 | 2700.0 | 3550.0 | 4050.0 | 4775.0 | 6300.0 | ▃▇▅▃▂ |
From above, we can see that we’ll need to remove our species variable, and save it to a separate outcome variable in order to evaluate the performance of our KNN model later on. We can do this by running the following syntax:
species_actual <- penguins %>%
dplyr::select(species)
penguins <- penguins %>%
  dplyr::select(-species)

Additionally, we can see below in the distributions of each of our continuous variables that the scales are inconsistent across features. For better model performance, we’ll want to standardize each of our features.
In order to do this, we’ll apply a z-score standardization to put every feature on a common scale, which will ultimately help with the distance calculations that KNN relies on.
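For each feature \(x\), the standardized value is \(z = (x - \bar{x}) / s\), where \(\bar{x}\) and \(s\) are that feature’s sample mean and standard deviation; R’s scale() computes exactly this for each column.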
penguins[, c("bill_length_mm",
"bill_depth_mm",
"flipper_length_mm",
"body_mass_g")] <- scale(penguins[, c("bill_length_mm",
"bill_depth_mm",
"flipper_length_mm",
"body_mass_g")])As you can see below, after the z-score standardization, the scaling on the x and y axis is a lot more consistent across our features.
penguins %>%
dplyr::select(body_mass_g, bill_length_mm, bill_depth_mm, flipper_length_mm) %>%
  ggpairs()

Although this may help with our KNN model, we’ll have to be careful about interpretability later on! Finally, since island and sex are categorical, we’ll want to dummy code them, converting each level to 1s and 0s:
island_dcode <- as.data.frame(dummy.code(penguins$island))
penguins <- cbind(penguins, island_dcode)

We’ll also want to remove the original island variable from this dataset, since we’ve created our dummy variables:
penguins <- penguins %>%
  dplyr::select(-island)

The same process will then be applied to the sex variable:
penguins$sex <- dummy.code(penguins$sex)

Finally, after a fair amount of data tidying and investigation, our data set is ready for our KNN model. We can take a quick look at the updated version of our data set:
kable(head(penguins)) %>%
kable_styling(bootstrap_options = "basic")| bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | Biscoe | Dream | Torgersen | |
|---|---|---|---|---|---|---|---|---|
| 1 | -0.8946955 | 0.7795590 | -1.4246077 | -0.5676206 | 1 | 0 | 0 | 0 |
| 2 | -0.8215515 | 0.1194043 | -1.0678666 | -0.5055254 | 0 | 1 | 0 | 0 |
| 3 | -0.6752636 | 0.4240910 | -0.4257325 | -1.1885721 | 0 | 1 | 0 | 0 |
| 5 | -1.3335592 | 1.0842457 | -0.5684290 | -0.9401915 | 0 | 1 | 0 | 0 |
| 6 | -0.8581235 | 1.7444004 | -0.7824736 | -0.6918109 | 1 | 0 | 0 | 0 |
| 7 | -0.9312674 | 0.3225288 | -1.4246077 | -0.7228585 | 0 | 1 | 0 | 0 |
With our data set tidied, we then split it into a training and test set.
set.seed(123)
trainingRows <- createDataPartition(as.matrix(species_actual$species), p=0.8, list=FALSE)
train_penguins <- penguins[trainingRows,]
test_penguins <- penguins[-trainingRows,]

We’ll also do the same for our target variable, using the same split:
species_actual_train <- species_actual$species[trainingRows]
species_actual_test <- species_actual$species[-trainingRows]

Now, with our data split accordingly, we’ll need to identify the appropriate number for k. To do this, we will first take a standard approach of calculating the square root of the number of rows in our training data set:
sqrt(nrow(train_penguins))

## [1] 16.37071
We can see above that it is roughly 16.4, between 16 and 17. Therefore, we’ll fit two KNN models, one with k=16 and one with k=17:
set.seed(123)
k16 <- knn(train_penguins,
test_penguins,
cl=species_actual_train,
k=16)
k17 <- knn(train_penguins,
test_penguins,
cl=species_actual_train,
    k=17)

misClassError <- mean(k16 != species_actual_test)
paste0('The misclassification rate with 16 neighbors is: ', misClassError)

## [1] "The misclassification rate with 16 neighbors is: 0.0307692307692308"
table(k16, species_actual_test)

## species_actual_test
## k16 Adelie Chinstrap Gentoo
## Adelie 28 1 0
## Chinstrap 1 12 0
## Gentoo 0 0 23
misClassError <- mean(k17 != species_actual_test)
paste0('The misclassification rate with 17 neighbors is: ', misClassError)

## [1] "The misclassification rate with 17 neighbors is: 0.0307692307692308"
table(k17, species_actual_test)

## species_actual_test
## k17 Adelie Chinstrap Gentoo
## Adelie 28 1 0
## Chinstrap 1 12 0
## Gentoo 0 0 23
From the resulting confusion matrices, which compare the penguins classified by our KNN model against the actual species designations, we can see that both k=16 and k=17 performed well on our test set; indeed, the misclassification rate was identical for the two values. Although this approach seemed to work quite effectively, we do need to be careful about how we’d go about using either one of these models to predict new data: with an artificially high k value, we could be underfitting the data.
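As a quick illustration (not a substitute for proper tuning, since selecting k against the test set leaks information), one could sweep a range of k values and track the test misclassification rate:

set.seed(123)
errs <- sapply(1:25, function(k) {
  pred <- knn(train_penguins, test_penguins, cl = species_actual_train, k = k)
  mean(pred != species_actual_test)  # misclassification rate for this k
})
which.min(errs)  # the k with the lowest test error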
Therefore, as a final step, we can fit one more KNN model using the caret package, whose built-in train() function runs KNN classification and automatically tunes the number of neighbors (k).
knn_caret <- train(train_penguins,
species_actual_train,
method = "knn",
preProcess = c("center", "scale"))
knn_caret

## k-Nearest Neighbors
##
## 268 samples
## 8 predictor
## 3 classes: 'Adelie', 'Chinstrap', 'Gentoo'
##
## Pre-processing: centered (7), scaled (7), ignore (1)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 268, 268, 268, 268, 268, 268, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.9980298 0.9968970
## 7 0.9976347 0.9963040
## 9 0.9991997 0.9987279
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.
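For reference, Cohen’s kappa is defined as \(\kappa = (p_o - p_e) / (1 - p_e)\), where \(p_o\) is the observed accuracy and \(p_e\) is the accuracy expected by chance given the marginal class frequencies.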
From our output, we can see that a k value of 9 was chosen as the optimal value for our model, based on the Accuracy and Kappa values calculated. Since Kappa, an agreement metric that corrects for chance, is quite important when working with unbalanced data sets such as our penguins data set, this k value of 9 may ultimately perform better than our initial KNN models with k=16 or k=17. We can see the relationship the train function used to find the optimal number of neighbors plotted below:
plot(knn_caret)

We can see that a neighbors value of 9 clearly showed the highest accuracy value.
knn_caret_predictions <- predict(knn_caret, newdata = test_penguins)
confusionMatrix(knn_caret_predictions, as.factor(species_actual_test))

## Confusion Matrix and Statistics
##
## Reference
## Prediction Adelie Chinstrap Gentoo
## Adelie 28 1 0
## Chinstrap 1 12 0
## Gentoo 0 0 23
##
## Overall Statistics
##
## Accuracy : 0.9692
## 95% CI : (0.8932, 0.9963)
## No Information Rate : 0.4462
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9516
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity 0.9655 0.9231 1.0000
## Specificity 0.9722 0.9808 1.0000
## Pos Pred Value 0.9655 0.9231 1.0000
## Neg Pred Value 0.9722 0.9808 1.0000
## Prevalence 0.4462 0.2000 0.3538
## Detection Rate 0.4308 0.1846 0.3538
## Detection Prevalence 0.4462 0.2000 0.3538
## Balanced Accuracy 0.9689 0.9519 1.0000
In the end, with the caret package identifying the “optimal” k value, we can see that all three of our KNN models performed very well on our test data set. Although the choice will ultimately depend on how the model is to be used, it appears that any of our three models could be chosen to provide accurate predictions of the three penguin species.
Please use the attached dataset on loan approval status to predict loan approval using Decision Trees. Please be sure to conduct a thorough exploratory analysis to start the task and walk us through your reasoning behind all the steps you are taking.
First, we decided to read in the loan approval data set and take a look at its features:
loan <- read.csv("https://raw.githubusercontent.com/DaisyCai2019/NewData/master/Loan_approval.csv")
kable(head(loan)) %>%
kable_styling(bootstrap_options = "basic")| Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LP001002 | Male | No | 0 | Graduate | No | 5849 | 0 | NA | 360 | 1 | Urban | Y |
| LP001003 | Male | Yes | 1 | Graduate | No | 4583 | 1508 | 128 | 360 | 1 | Rural | N |
| LP001005 | Male | Yes | 0 | Graduate | Yes | 3000 | 0 | 66 | 360 | 1 | Urban | Y |
| LP001006 | Male | Yes | 0 | Not Graduate | No | 2583 | 2358 | 120 | 360 | 1 | Urban | Y |
| LP001008 | Male | No | 0 | Graduate | No | 6000 | 0 | 141 | 360 | 1 | Urban | Y |
| LP001011 | Male | Yes | 2 | Graduate | Yes | 5417 | 4196 | 267 | 360 | 1 | Urban | Y |
As we can see from a glimpse of the data set above, the available features are a Loan_ID, applicant demographics (Gender, Married, Dependents, Education, Self_Employed), incomes (ApplicantIncome, CoapplicantIncome), loan terms (LoanAmount, Loan_Amount_Term), Credit_History, Property_Area, and the target variable, Loan_Status.
Now, we can run some exploratory data analysis to get a better sense of how to tidy and interpret these features. Here’s an initial summary of the dataset:
skim(loan)

| Name | loan |
| Number of rows | 614 |
| Number of columns | 13 |
| _______________________ | |
| Column type frequency: | |
| character | 8 |
| numeric | 5 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Loan_ID | 0 | 1 | 8 | 8 | 0 | 614 | 0 |
| Gender | 0 | 1 | 0 | 6 | 13 | 3 | 0 |
| Married | 0 | 1 | 0 | 3 | 3 | 3 | 0 |
| Dependents | 0 | 1 | 0 | 2 | 15 | 5 | 0 |
| Education | 0 | 1 | 8 | 12 | 0 | 2 | 0 |
| Self_Employed | 0 | 1 | 0 | 3 | 32 | 3 | 0 |
| Property_Area | 0 | 1 | 5 | 9 | 0 | 3 | 0 |
| Loan_Status | 0 | 1 | 1 | 1 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| ApplicantIncome | 0 | 1.00 | 5403.46 | 6109.04 | 150 | 2877.5 | 3812.5 | 5795.00 | 81000 | ▇▁▁▁▁ |
| CoapplicantIncome | 0 | 1.00 | 1621.25 | 2926.25 | 0 | 0.0 | 1188.5 | 2297.25 | 41667 | ▇▁▁▁▁ |
| LoanAmount | 22 | 0.96 | 146.41 | 85.59 | 9 | 100.0 | 128.0 | 168.00 | 700 | ▇▃▁▁▁ |
| Loan_Amount_Term | 14 | 0.98 | 342.00 | 65.12 | 12 | 360.0 | 360.0 | 360.00 | 480 | ▁▁▁▇▁ |
| Credit_History | 50 | 0.92 | 0.84 | 0.36 | 0 | 1.0 | 1.0 | 1.00 | 1 | ▂▁▁▁▇ |
We can see that there are a fair number of things we’ll need to do to clean the dataset before we can run our decision tree algorithm. First, there are quite a few missing values (NAs) in the LoanAmount, Loan_Amount_Term and Credit_History features. We also noticed a lot of blank values, which needed to be recoded to NAs. We used the naniar package below to handle this:
loan <- loan %>%
  replace_with_na_all(condition = ~. == "")

loan %>%
  summarise_all(~ sum(is.na(.))) %>%
pivot_longer(cols = 1:13, names_to = 'columns', values_to = 'NA_count') %>%
arrange(desc(NA_count)) %>%
ggplot(aes(y = columns, x = NA_count)) + geom_col(fill = 'deepskyblue4') +
geom_label(aes(label = NA_count)) +
theme_minimal() +
  labs(title = 'Count of missing values in loan dataset') +
  theme(plot.title = element_text(hjust = 0.5))

gg_miss_upset(loan)

This can be further visualized by the “Missingness Map” below:
missmap(loan)

Before progressing, we thought it would be helpful to conduct some transformations on this data: converting the character features to factors makes the categories, and the remaining NAs, much easier to summarize. To do this, we used the factor function on each feature:
loan <- loan %>%
mutate(Gender = factor(Gender),
Married = factor(Married),
Dependents = factor(Dependents),
Education = factor(Education),
Self_Employed = factor(Self_Employed),
Property_Area = factor(Property_Area),
Loan_Status = factor(Loan_Status))
summary(loan)

## Loan_ID Gender Married Dependents Education
## Length:614 Female:112 No :213 0 :345 Graduate :480
## Class :character Male :489 Yes :398 1 :102 Not Graduate:134
## Mode :character NA's : 13 NA's: 3 2 :101
## 3+ : 51
## NA's: 15
##
##
## Self_Employed ApplicantIncome CoapplicantIncome LoanAmount
## No :500 Min. : 150 Min. : 0 Min. : 9.0
## Yes : 82 1st Qu.: 2878 1st Qu.: 0 1st Qu.:100.0
## NA's: 32 Median : 3812 Median : 1188 Median :128.0
## Mean : 5403 Mean : 1621 Mean :146.4
## 3rd Qu.: 5795 3rd Qu.: 2297 3rd Qu.:168.0
## Max. :81000 Max. :41667 Max. :700.0
## NA's :22
## Loan_Amount_Term Credit_History Property_Area Loan_Status
## Min. : 12 Min. :0.0000 Rural :179 N:192
## 1st Qu.:360 1st Qu.:1.0000 Semiurban:233 Y:422
## Median :360 Median :1.0000 Urban :202
## Mean :342 Mean :0.8422
## 3rd Qu.:360 3rd Qu.:1.0000
## Max. :480 Max. :1.0000
## NA's :14 NA's :50
Next, we’ll take a look at the distributions of the continuous features, which appear to be ApplicantIncome, CoapplicantIncome, LoanAmount, and Loan_Amount_Term. We color coded Loan_Status (blue for Y, red for N) to help visualize patterns between the target and each feature’s distribution:
We can see that ApplicantIncome, CoapplicantIncome and LoanAmount are highly right skewed, with long right tails. Conversely, it looks like Loan_Amount_Term is highly left skewed, with a long left tail.
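To put numbers to these observations, one could compute the sample skewness of each feature; a sketch, assuming the e1071 package is available:

library(e1071)
# Positive values indicate right skew, negative values left skew
sapply(loan[, c('ApplicantIncome', 'CoapplicantIncome',
                'LoanAmount', 'Loan_Amount_Term')],
       skewness, na.rm = TRUE)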
Additionally, we can also take a look at the categorical variables, broken out by our target variable of whether or not a loan was issued to the applicant. The green bars below show the total number of loan approvals, relative to the red bars documenting the loan denials.
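A sketch of how one of these grouped bar charts could be drawn (the variable shown and the fill colors are illustrative choices):

loan %>%
  filter(!is.na(Gender)) %>%
  ggplot(aes(x = Gender, fill = Loan_Status)) +
  geom_bar(position = 'dodge') +   # side-by-side bars for N vs. Y
  scale_fill_manual(values = c('N' = 'firebrick', 'Y' = 'seagreen')) +
  labs(title = 'Loan approval status by gender')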
From the categorical variable splits based on loan approval status, we can see a few interesting things:
The number of loans applied for by males was significantly higher than loans applied for by females in this dataset.
The same phenomenon is true for education, where a much higher number of loan applications came from college graduates relative to non-graduates.
Whether or not the applicant was self-employed or married also showed a pretty large class imbalance: applicants who were not self-employed, and applicants who were married, made up notably higher proportions of this dataset than their counterparts.
There was a pretty even split in the number of applicants based on property area, where someone identifying that they live in rural, semiurban, or urban settings were pretty evenly distributed.
We can see that applicants who did not pass the credit history criterion were almost always denied a loan.
These factors will be interesting to examine and discuss as we look to build our decision tree model and random forest models, since these class imbalances could affect model performance and interpretation.
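As a quick check of the imbalance in the target itself:

# Roughly 31% of applications were denied and 69% approved
prop.table(table(loan$Loan_Status))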
Since we identified a large number of missing values above, we can use the kNN() function from the VIM package to help with imputation. With the KNN method, a categorical missing value is imputed with the majority class among its k nearest neighbors, while a numeric missing value is imputed with an aggregate (the median, by VIM’s default) of its k nearest neighbors’ values.
loan <- kNN(loan) %>%
  subset(select = Loan_ID:Loan_Status)

Now, we can double check that this worked correctly by counting the missing values in each column:
sapply(loan, function(x) sum(is.na(x)))

## Loan_ID Gender Married Dependents
## 0 0 0 0
## Education Self_Employed ApplicantIncome CoapplicantIncome
## 0 0 0 0
## LoanAmount Loan_Amount_Term Credit_History Property_Area
## 0 0 0 0
## Loan_Status
## 0
Fortunately, we can see that there are now no missing data points!
Next, we’ll have to work through a few transformations for our highly skewed continuous data. For our LoanAmount feature, we can conduct a log transformation:
loan$LogLoanAmount <- log(loan$LoanAmount)
loan$LogLoan_Amount_Term <- log(loan$Loan_Amount_Term)

Banks generally don’t want loan payments to exceed 25% of an applicant’s income, and most banks will add the co-applicant income to the applicant income and use the sum in their decision criteria. For this reason, we combine ApplicantIncome and CoapplicantIncome into a single Income feature and drop the two separate columns. We then perform a log transformation on Income to reduce its skew:
loan$Income <- loan$ApplicantIncome + loan$CoapplicantIncome
loan$ApplicantIncome <- NULL
loan$CoapplicantIncome <- NULL
loan$LogIncome <- log(loan$Income)

When checking the distribution of the log-transformed income, we can see that it is much more normally distributed:
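A sketch of that check (the bin count is an arbitrary choice):

ggplot(loan, aes(x = LogIncome)) +
  geom_histogram(bins = 30, fill = 'deepskyblue4') +
  labs(title = 'Distribution of log-transformed total income')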
With the exploratory data analysis behind us, and our data tidied, we are now ready to build our decision tree!
Similar to our KNN model process above, we’ll use an 80/20 split for our training and test datasets:
set.seed(123)
trainingRows <- createDataPartition(loan$Loan_Status, p=0.8, list=FALSE)
train_loan <- loan[trainingRows,]
test_loan <- loan[-trainingRows,]

With the data split, we can now fit our first decision tree:
tree1 <- rpart(Loan_Status ~ Gender + Married + Dependents + Education +
Self_Employed + LogIncome + LogLoanAmount + Loan_Amount_Term +
Credit_History + Property_Area,
data=train_loan)
rpart.plot(tree1, nn=TRUE)

summary(tree1)

## Call:
## rpart(formula = Loan_Status ~ Gender + Married + Dependents +
## Education + Self_Employed + LogIncome + LogLoanAmount + Loan_Amount_Term +
## Credit_History + Property_Area, data = train_loan)
## n= 268
##
## CP nsplit rel error xerror xstd
## 1 0.31764706 0 1.0000000 1.0000000 0.08962903
## 2 0.01647059 1 0.6823529 0.6823529 0.07931171
## 3 0.01176471 8 0.5647059 0.9294118 0.08781286
## 4 0.01000000 10 0.5411765 0.9058824 0.08715167
##
## Variable importance
## Credit_History LogIncome LogLoanAmount Property_Area Gender
## 47 22 18 9 2
## Self_Employed Education Dependents
## 1 1 1
##
## Node number 1: 268 observations, complexity param=0.3176471
## predicted class=Y expected loss=0.3171642 P(node) =1
## class counts: 85 183
## probabilities: 0.317 0.683
## left son=2 (37 obs) right son=3 (231 obs)
## Primary splits:
## Credit_History < 0.5 to the left, improve=25.753790, (0 missing)
## LogIncome < 7.779046 to the left, improve= 4.191559, (0 missing)
## LogLoanAmount < 5.361193 to the right, improve= 3.578266, (0 missing)
## Property_Area splits as LRL, improve= 2.391196, (0 missing)
## Dependents splits as RLRL, improve= 1.315920, (0 missing)
##
## Node number 2: 37 observations
## predicted class=N expected loss=0.1351351 P(node) =0.1380597
## class counts: 32 5
## probabilities: 0.865 0.135
##
## Node number 3: 231 observations, complexity param=0.01647059
## predicted class=Y expected loss=0.2294372 P(node) =0.8619403
## class counts: 53 178
## probabilities: 0.229 0.771
## left son=6 (23 obs) right son=7 (208 obs)
## Primary splits:
## LogLoanAmount < 5.381604 to the right, improve=3.1629310, (0 missing)
## Property_Area splits as LRL, improve=2.9041910, (0 missing)
## LogIncome < 8.869608 to the right, improve=2.0238030, (0 missing)
## Married splits as LR, improve=0.5197897, (0 missing)
## Dependents splits as LLRL, improve=0.4655822, (0 missing)
## Surrogate splits:
## LogIncome < 9.627714 to the right, agree=0.922, adj=0.217, (0 split)
##
## Node number 6: 23 observations, complexity param=0.01647059
## predicted class=Y expected loss=0.4782609 P(node) =0.0858209
## class counts: 11 12
## probabilities: 0.478 0.522
## left son=12 (14 obs) right son=13 (9 obs)
## Primary splits:
## Property_Area splits as LLR, improve=0.6211180, (0 missing)
## Dependents splits as LRRR, improve=0.5540184, (0 missing)
## LogLoanAmount < 5.623991 to the left, improve=0.5540184, (0 missing)
## LogIncome < 9.360734 to the right, improve=0.5244147, (0 missing)
## Married splits as LR, improve=0.0115942, (0 missing)
## Surrogate splits:
## Self_Employed splits as LR, agree=0.739, adj=0.333, (0 split)
## LogIncome < 8.998378 to the right, agree=0.696, adj=0.222, (0 split)
## LogLoanAmount < 6.325379 to the left, agree=0.696, adj=0.222, (0 split)
## Dependents splits as LLRL, agree=0.652, adj=0.111, (0 split)
##
## Node number 7: 208 observations, complexity param=0.01647059
## predicted class=Y expected loss=0.2019231 P(node) =0.7761194
## class counts: 42 166
## probabilities: 0.202 0.798
## left son=14 (121 obs) right son=15 (87 obs)
## Primary splits:
## Property_Area splits as LRL, improve=3.6171640, (0 missing)
## LogIncome < 7.799417 to the left, improve=1.9780490, (0 missing)
## LogLoanAmount < 4.297262 to the right, improve=1.0214400, (0 missing)
## Married splits as LR, improve=0.5536131, (0 missing)
## Dependents splits as RLRL, improve=0.3148534, (0 missing)
## Surrogate splits:
## Gender splits as RL, agree=0.601, adj=0.046, (0 split)
## LogLoanAmount < 5.318012 to the left, agree=0.587, adj=0.011, (0 split)
## Loan_Amount_Term < 420 to the left, agree=0.587, adj=0.011, (0 split)
##
## Node number 12: 14 observations
## predicted class=N expected loss=0.4285714 P(node) =0.05223881
## class counts: 8 6
## probabilities: 0.571 0.429
##
## Node number 13: 9 observations
## predicted class=Y expected loss=0.3333333 P(node) =0.03358209
## class counts: 3 6
## probabilities: 0.333 0.667
##
## Node number 14: 121 observations, complexity param=0.01647059
## predicted class=Y expected loss=0.2809917 P(node) =0.4514925
## class counts: 34 87
## probabilities: 0.281 0.719
## left son=28 (106 obs) right son=29 (15 obs)
## Primary splits:
## LogLoanAmount < 4.283372 to the right, improve=1.57306500, (0 missing)
## LogIncome < 8.505121 to the left, improve=1.37741000, (0 missing)
## Education splits as RL, improve=0.19043430, (0 missing)
## Dependents splits as LRRL, improve=0.13282170, (0 missing)
## Self_Employed splits as RL, improve=0.09381985, (0 missing)
## Surrogate splits:
## LogIncome < 8.01334 to the right, agree=0.901, adj=0.2, (0 split)
##
## Node number 15: 87 observations
## predicted class=Y expected loss=0.09195402 P(node) =0.3246269
## class counts: 8 79
## probabilities: 0.092 0.908
##
## Node number 28: 106 observations, complexity param=0.01647059
## predicted class=Y expected loss=0.3113208 P(node) =0.3955224
## class counts: 33 73
## probabilities: 0.311 0.689
## left son=56 (11 obs) right son=57 (95 obs)
## Primary splits:
## LogIncome < 8.139586 to the left, improve=4.24708900, (0 missing)
## LogLoanAmount < 5.225732 to the left, improve=0.78845450, (0 missing)
## Dependents splits as LRRL, improve=0.10571960, (0 missing)
## Gender splits as LR, improve=0.06773702, (0 missing)
## Self_Employed splits as RL, improve=0.06773702, (0 missing)
## Surrogate splits:
## LogLoanAmount < 4.349757 to the left, agree=0.915, adj=0.182, (0 split)
##
## Node number 29: 15 observations
## predicted class=Y expected loss=0.06666667 P(node) =0.05597015
## class counts: 1 14
## probabilities: 0.067 0.933
##
## Node number 56: 11 observations
## predicted class=N expected loss=0.2727273 P(node) =0.04104478
## class counts: 8 3
## probabilities: 0.727 0.273
##
## Node number 57: 95 observations, complexity param=0.01647059
## predicted class=Y expected loss=0.2631579 P(node) =0.3544776
## class counts: 25 70
## probabilities: 0.263 0.737
## left son=114 (26 obs) right son=115 (69 obs)
## Primary splits:
## LogIncome < 8.869608 to the right, improve=1.05615200, (0 missing)
## LogLoanAmount < 4.740565 to the left, improve=0.62682750, (0 missing)
## Dependents splits as RLRL, improve=0.57809530, (0 missing)
## Self_Employed splits as RL, improve=0.29007700, (0 missing)
## Gender splits as RL, improve=0.08916409, (0 missing)
## Surrogate splits:
## LogLoanAmount < 5.15614 to the right, agree=0.832, adj=0.385, (0 split)
##
## Node number 114: 26 observations, complexity param=0.01647059
## predicted class=Y expected loss=0.3846154 P(node) =0.09701493
## class counts: 10 16
## probabilities: 0.385 0.615
## left son=228 (9 obs) right son=229 (17 obs)
## Primary splits:
## LogIncome < 9.051689 to the left, improve=2.1900450, (0 missing)
## Dependents splits as LLR-, improve=1.1197220, (0 missing)
## LogLoanAmount < 5.225732 to the left, improve=1.1197220, (0 missing)
## Property_Area splits as L-R, improve=0.3076923, (0 missing)
## Married splits as LR, improve=0.1864802, (0 missing)
## Surrogate splits:
## Education splits as RL, agree=0.731, adj=0.222, (0 split)
## Gender splits as LR, agree=0.692, adj=0.111, (0 split)
##
## Node number 115: 69 observations, complexity param=0.01176471
## predicted class=Y expected loss=0.2173913 P(node) =0.2574627
## class counts: 15 54
## probabilities: 0.217 0.783
## left son=230 (31 obs) right son=231 (38 obs)
## Primary splits:
## LogIncome < 8.505121 to the left, improve=2.1268180, (0 missing)
## LogLoanAmount < 4.740565 to the left, improve=1.6248130, (0 missing)
## Loan_Amount_Term < 210 to the left, improve=0.6948507, (0 missing)
## Dependents splits as RLRL, improve=0.5077346, (0 missing)
## Property_Area splits as R-L, improve=0.4461663, (0 missing)
## Surrogate splits:
## LogLoanAmount < 4.740565 to the left, agree=0.797, adj=0.548, (0 split)
## Property_Area splits as R-L, agree=0.652, adj=0.226, (0 split)
## Dependents splits as RLRR, agree=0.638, adj=0.194, (0 split)
## Gender splits as LR, agree=0.594, adj=0.097, (0 split)
##
## Node number 228: 9 observations
## predicted class=N expected loss=0.3333333 P(node) =0.03358209
## class counts: 6 3
## probabilities: 0.667 0.333
##
## Node number 229: 17 observations
## predicted class=Y expected loss=0.2352941 P(node) =0.06343284
## class counts: 4 13
## probabilities: 0.235 0.765
##
## Node number 230: 31 observations, complexity param=0.01176471
## predicted class=Y expected loss=0.3548387 P(node) =0.1156716
## class counts: 11 20
## probabilities: 0.355 0.645
## left son=460 (14 obs) right son=461 (17 obs)
## Primary splits:
## LogLoanAmount < 4.695914 to the right, improve=2.39522900, (0 missing)
## LogIncome < 8.403892 to the right, improve=1.77450100, (0 missing)
## Married splits as RL, improve=0.43039050, (0 missing)
## Dependents splits as LRRL, improve=0.24396860, (0 missing)
## Education splits as RL, improve=0.06021505, (0 missing)
## Surrogate splits:
## LogIncome < 8.403892 to the right, agree=0.806, adj=0.571, (0 split)
## Gender splits as LR, agree=0.613, adj=0.143, (0 split)
## Self_Employed splits as RL, agree=0.613, adj=0.143, (0 split)
##
## Node number 231: 38 observations
## predicted class=Y expected loss=0.1052632 P(node) =0.141791
## class counts: 4 34
## probabilities: 0.105 0.895
##
## Node number 460: 14 observations
## predicted class=N expected loss=0.4285714 P(node) =0.05223881
## class counts: 8 6
## probabilities: 0.571 0.429
##
## Node number 461: 17 observations
## predicted class=Y expected loss=0.1764706 P(node) =0.06343284
## class counts: 3 14
## probabilities: 0.176 0.824
From our decision tree summary and plot, we can see that a few variables, namely Credit_History, LogIncome and LogLoanAmount, were the most important features in this classification. We can use this decision tree to make predictions on our holdout test set.
loanPre <- predict(tree1,
                   test_loan,
                   type = "class")

confusionMatrix(loanPre, test_loan$Loan_Status)

## Confusion Matrix and Statistics
##
## Reference
## Prediction N Y
## N 68 69
## Y 39 170
##
## Accuracy : 0.6879
## 95% CI : (0.6361, 0.7363)
## No Information Rate : 0.6908
## P-Value [Acc > NIR] : 0.572055
##
## Kappa : 0.3219
##
## Mcnemar's Test P-Value : 0.005262
##
## Sensitivity : 0.6355
## Specificity : 0.7113
## Pos Pred Value : 0.4964
## Neg Pred Value : 0.8134
## Prevalence : 0.3092
## Detection Rate : 0.1965
## Detection Prevalence : 0.3960
## Balanced Accuracy : 0.6734
##
## 'Positive' Class : N
##
As we can see from our predictions and the confusion matrix, we obtained roughly 69% accuracy with this decision tree. To try to achieve higher accuracy, we’ll use the prune() function from the rpart package to examine the cross-validated optimal tree size.
plotcp(tree1)

By plotting the cross-validated error against the complexity parameter, we can see that the cross-validated error is minimized at a tree size of 2 (a single split).
tree1$cptable

## CP nsplit rel error xerror xstd
## 1 0.31764706 0 1.0000000 1.0000000 0.08962903
## 2 0.01647059 1 0.6823529 0.6823529 0.07931171
## 3 0.01176471 8 0.5647059 0.9294118 0.08781286
## 4 0.01000000 10 0.5411765 0.9058824 0.08715167
When pruning with the default cp value of 0.01, we obtain essentially the same tree and therefore the same accuracy of around 69%. To test a different complexity parameter, we used the cp value of 0.01647059 from the table above, which corresponds to a tree size of 2. When we prune our initial tree accordingly, we get the resulting decision tree below:
tree1.pruned <- prune(tree1, cp=0.01647059)
rpart.plot(tree1.pruned,
extra = 104,
main = "Decision Tree")Interestingly, from this pruning process, the Credit_History feature alone seems to do a fairly good job of classifying applicants into the approval vs. disapproval status. The accuracy is about 86% on our test set, which is slightly higher than our initial tree.
However, it’s important to think critically about whether or not this pruned tree would perform better on another dataset, and if it is applicable for real-life events.
tree1.pruned.pred <- predict(tree1.pruned,
test_loan,
type = "class" )
(dtree.cm <- confusionMatrix(tree1.pruned.pred, test_loan$Loan_Status))

## Confusion Matrix and Statistics
##
## Reference
## Prediction N Y
## N 54 2
## Y 53 237
##
## Accuracy : 0.841
## 95% CI : (0.7982, 0.8779)
## No Information Rate : 0.6908
## P-Value [Acc > NIR] : 1.040e-10
##
## Kappa : 0.5715
##
## Mcnemar's Test P-Value : 1.562e-11
##
## Sensitivity : 0.5047
## Specificity : 0.9916
## Pos Pred Value : 0.9643
## Neg Pred Value : 0.8172
## Prevalence : 0.3092
## Detection Rate : 0.1561
## Detection Prevalence : 0.1618
## Balanced Accuracy : 0.7482
##
## 'Positive' Class : N
##
While a single feature does offer fairly good accuracy, there may be mitigating information in other features that would affect a loan decision, and that information is lost if we only include a single feature. While decision trees are simple to construct and understand, their weakness is that each prediction follows a single sequential path through the tree; this rigid flow doesn’t allow downstream features to alter or moderate a prediction once an observation has been sent down a branch.
Using the same dataset on Loan Approval Status, please use Random Forests to predict on loan approval status. Again, please be sure to walk us through the steps you took to get to your final model. (50 points)
set.seed(123)
fit.forest <- randomForest(Loan_Status ~ Gender + Married + Dependents +
Education + Self_Employed + LogIncome +
LogLoanAmount + Loan_Amount_Term +
Credit_History + Property_Area,
data=train_loan)
fit.forest

##
## Call:
## randomForest(formula = Loan_Status ~ Gender + Married + Dependents + Education + Self_Employed + LogIncome + LogLoanAmount + Loan_Amount_Term + Credit_History + Property_Area, data = train_loan)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 23.88%
## Confusion matrix:
## N Y class.error
## N 38 47 0.55294118
## Y 17 166 0.09289617
importance(fit.forest)

## MeanDecreaseGini
## Gender 2.437321
## Married 3.169019
## Dependents 7.414653
## Education 2.277848
## Self_Employed 2.834501
## LogIncome 27.251346
## LogLoanAmount 24.507891
## Loan_Amount_Term 5.023619
## Credit_History 22.057853
## Property_Area 6.812607
forest.pred <- predict(fit.forest,
newdata = test_loan)
(forest.cm <- confusionMatrix(forest.pred, test_loan$Loan_Status))

## Confusion Matrix and Statistics
##
## Reference
## Prediction N Y
## N 59 13
## Y 48 226
##
## Accuracy : 0.8237
## 95% CI : (0.7794, 0.8624)
## No Information Rate : 0.6908
## P-Value [Acc > NIR] : 1.331e-08
##
## Kappa : 0.5464
##
## Mcnemar's Test P-Value : 1.341e-05
##
## Sensitivity : 0.5514
## Specificity : 0.9456
## Pos Pred Value : 0.8194
## Neg Pred Value : 0.8248
## Prevalence : 0.3092
## Detection Rate : 0.1705
## Detection Prevalence : 0.2081
## Balanced Accuracy : 0.7485
##
## 'Positive' Class : N
##
Here, accuracy on the test sample is 82.37%. Notice the disparity between Specificity and Sensitivity: with ‘N’ as the positive class, the model tends to over-predict Y when it should have chosen N, missing a large share of the true denials.
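For reference, with ‘N’ treated as the positive class, \(\text{Sensitivity} = TP / (TP + FN)\) is the share of true denials the model catches, and \(\text{Specificity} = TN / (TN + FP)\) is the share of true approvals it catches.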
Next, we want to see whether we have generated enough trees for the Out-Of-Bag (OOB) error rate to reach its minimum. From the plot below, we see that the OOB error rate decreases rapidly over the first 20 trees and stabilizes at around 100 trees.
plot(fit.forest,
col = c("black", "red", "dark green"),
main = "Predicted Loan Error Rates")
legend("topright", colnames(fit.forest$err.rate), col = 1:6, cex = 0.8, fill = 1:6)In order to see whether the Random forest model can be improved, we will run the same model, but this time we will use the tuneRF function. We provide the features, x, target, y, and number of trees used at each tuning step, ntreeTry.
The initial values were assigned somewhat arbitrarily and then tweaked until an optimum was found.
set.seed(123)
tree_var <- tuneRF(x = subset(train_loan, select = -Loan_Status),
y = train_loan$Loan_Status,
                    ntreeTry = 1000)

## mtry = 3 OOB error = 23.88%
## Searching left ...
## mtry = 2 OOB error = 23.13%
## 0.03125 0.05
## Searching right ...
## mtry = 6 OOB error = 23.13%
## 0.03125 0.05
We rerun the Random Forest model with the new parameters \(ntree = 1000\) and \(mtry = 2\), the value selected by tuneRF:
set.seed(123)
val_opt <- tree_var [,"mtry"][which.min(tree_var [,"OOBError"])]
fit.forest2 <- randomForest(Loan_Status ~ Gender + Married + Dependents +
Education + Self_Employed + LogIncome +
LogLoanAmount + Loan_Amount_Term +
Credit_History + Property_Area,
data=train_loan,
importance = TRUE,
ntree=1000,
mtry = val_opt)
fit.forest2

##
## Call:
## randomForest(formula = Loan_Status ~ Gender + Married + Dependents + Education + Self_Employed + LogIncome + LogLoanAmount + Loan_Amount_Term + Credit_History + Property_Area, data = train_loan, importance = TRUE, ntree = 1000, mtry = val_opt)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 22.39%
## Confusion matrix:
## N Y class.error
## N 37 48 0.56470588
## Y 12 171 0.06557377
The retrained Random Forest model has an out-of-bag error of 22.39%, which is lower than the original model’s 23.88%, so the tuning yielded a modest improvement.
forest.pred2 <- predict(fit.forest2,
newdata = test_loan)
(forest.cm_train <- confusionMatrix(forest.pred2, test_loan$Loan_Status))

## Confusion Matrix and Statistics
##
## Reference
## Prediction N Y
## N 57 5
## Y 50 234
##
## Accuracy : 0.841
## 95% CI : (0.7982, 0.8779)
## No Information Rate : 0.6908
## P-Value [Acc > NIR] : 1.040e-10
##
## Kappa : 0.579
##
## Mcnemar's Test P-Value : 2.975e-09
##
## Sensitivity : 0.5327
## Specificity : 0.9791
## Pos Pred Value : 0.9194
## Neg Pred Value : 0.8239
## Prevalence : 0.3092
## Detection Rate : 0.1647
## Detection Prevalence : 0.1792
## Balanced Accuracy : 0.7559
##
## 'Positive' Class : N
##
Here, the tuned model reaches 84.1% accuracy on the test sample, an improvement over the original Random Forest’s 82.37%; the Kappa and the 95% CI have improved slightly as well.
Again, we want to see whether we have generated enough trees for the OOB error rate to reach its minimum. From the plot below, we see that the OOB error rate decreases sharply over the first 20 trees and then stabilizes.
plot(fit.forest2,
col = c("black", "red", "dark green"),
main = "Predicted Loan Error Rates")
legend("topright", colnames(fit.forest2$err.rate), col = 1:6, cex = 0.8, fill = 1:6)Using the Loan Approval Status data, please use Gradient Boosting to predict on the loan approval status. Please use whatever boosting approach you deem appropriate;but please be sure to walk us through your steps. (50 points)
set.seed(123)
boost <- gbm(Loan_Status~.,
data=train_loan[,-1],
distribution = "gaussian",
n.trees = 1000,
             cv.folds = 3)

gbm uses a default number of trees of 100, which is rarely sufficient, so we crank it up to 1,000 trees. The default depth of each tree (interaction.depth) is 1, which means we are ensembling a bunch of stumps. Lastly, we also include cv.folds to perform a 3-fold cross validation.
boost

## gbm(formula = Loan_Status ~ ., distribution = "gaussian", data = train_loan[,
## -1], n.trees = 1000, cv.folds = 3)
## A gradient boosted model with gaussian loss function.
## 1000 iterations were performed.
## The best cross-validation iteration was 23.
## There were 13 predictors of which 4 had non-zero influence.
sqrt(min(boost$cv.error))

## [1] 0.4136226
summary(boost)

## var rel.inf
## LoanAmount LoanAmount 34.8085776
## Income Income 34.0120783
## Dependents Dependents 9.6081783
## Credit_History Credit_History 7.1610501
## Property_Area Property_Area 5.3913204
## Loan_Amount_Term Loan_Amount_Term 3.7712276
## Married Married 1.6303913
## Education Education 1.4209212
## Gender Gender 1.1968845
## Self_Employed Self_Employed 0.9993707
## LogLoanAmount LogLoanAmount 0.0000000
## LogLoan_Amount_Term LogLoan_Amount_Term 0.0000000
## LogIncome LogIncome 0.0000000
The best cross-validation iteration was 23, and the corresponding cross-validated RMSE is 0.4136226.
The summary output creates a new data set with var, a factor variable holding the variables in our model, and rel.inf, the relative influence each variable had on our model predictions. From the table, we can see LoanAmount, Income, and Dependents are the top three most important variables for our model, with Credit_History close behind. Note that the log-transformed copies show zero influence, likely because trees split on thresholds and are insensitive to monotone transformations, making the raw and log versions redundant.
We can tune parameters to see how the results change. Here, we increase the number of trees and also perform a 5-fold cross validation.
set.seed(123)
boost2<-gbm(Loan_Status~.,
data=train_loan[,-1],
distribution = "gaussian",
n.trees = 2000,
cv.folds = 5)
sqrt(min(boost2$cv.error))

## [1] 0.4160404
summary(boost2)

## var rel.inf
## Income Income 34.132899
## LoanAmount LoanAmount 31.436681
## Dependents Dependents 12.092748
## Property_Area Property_Area 6.425018
## Credit_History Credit_History 5.068015
## Loan_Amount_Term Loan_Amount_Term 4.341234
## Self_Employed Self_Employed 1.879418
## Married Married 1.838764
## Gender Gender 1.540934
## Education Education 1.244288
## LogLoanAmount LogLoanAmount 0.000000
## LogLoan_Amount_Term LogLoan_Amount_Term 0.000000
## LogIncome LogIncome 0.000000
We can see that the relative influence of each variable changes, and the cross-validated RMSE changes to 0.4160404, slightly higher than our previous model’s 0.4136226.
Partial dependence plots show the marginal relationship between a predictor X and the response Y.
# partial dependence plots
plot(boost2, i = "LoanAmount")
plot(boost2, i = "Income")

The first plot shows LoanAmount is negatively associated with the response Loan_Status up to around 300 (in thousands). The second plot indicates Income is positively associated with Loan_Status when it is less than about $20,000.
# plot loss function as a result of n trees added to the ensemble
gbm.perf(boost2, method = "cv")## [1] 26
gbm.perf(boost2, method = "OOB")## [1] 33
## attr(,"smoother")
## Call:
## loess(formula = object$oobag.improve ~ x, enp.target = min(max(4,
## length(x)/10), 50))
##
## Number of Observations: 2000
## Equivalent Number of Parameters: 39.99
## Residual Standard Error: 0.0003932
The plots indicate the optimum number of trees based on the technique used. The green line indicates the error on the held-out data and the blue dotted line marks the optimum number of iterations. We can observe that beyond a certain point (26 iterations for the cv method and 33 for the OOB method), the error on the held-out data begins to increase because of overfitting.
We use model 2 to forecast on our test data, setting the number of trees to 55, somewhat above the cv-selected optimum of 26.
boostPre <- predict.gbm(boost2,
test_loan,
n.trees = 55)
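# Under the gaussian fit the factor target is handled numerically (assumed
# coding: N = 1, Y = 2), so we map the continuous predictions back to labels
# by thresholding at the midpoint, 1.5: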
boostPre <- ifelse(boostPre < 1.5, 'N', 'Y')
boostPre <- as.factor(boostPre)
(boost.cm <- confusionMatrix(boostPre, test_loan$Loan_Status))

## Confusion Matrix and Statistics
##
## Reference
## Prediction N Y
## N 54 2
## Y 53 237
##
## Accuracy : 0.841
## 95% CI : (0.7982, 0.8779)
## No Information Rate : 0.6908
## P-Value [Acc > NIR] : 1.040e-10
##
## Kappa : 0.5715
##
## Mcnemar's Test P-Value : 1.562e-11
##
## Sensitivity : 0.5047
## Specificity : 0.9916
## Pos Pred Value : 0.9643
## Neg Pred Value : 0.8172
## Prevalence : 0.3092
## Detection Rate : 0.1561
## Detection Prevalence : 0.1618
## Balanced Accuracy : 0.7482
##
## 'Positive' Class : N
##
According to the confusion matrix, our model accuracy is 0.841. Looking at Sensitivity and Specificity (with ‘N’ as the positive class), this model almost never predicts ‘N’ when the true status is ‘Y’ (only 2 such errors), but it frequently predicts ‘Y’ when it should have predicted ‘N’ (53 such errors).
Model performance: please compare the models you settled on for problems #2–4. Comment on their relative performance. Which one would you prefer the most? Why? (20 points)
temp <- data.frame(dtree.cm$overall,
                   forest.cm_train$overall,
                   boost.cm$overall) %>%
  t() %>%
  data.frame() %>%
  dplyr::select(Accuracy) %>%
  mutate(`Classification Error Rate` = 1 - Accuracy)

eval <- data.frame(dtree.cm$byClass,
                   forest.cm_train$byClass,
                   boost.cm$byClass)
eval <- data.frame(t(eval)) %>%
  cbind(temp) %>%
  mutate(eval = c("Decision Tree", "Random Forest", "Gradient Boosting"))
eval <- dplyr::select(eval, Accuracy, `Classification Error Rate`, Sensitivity, Specificity, Precision, Recall, F1)
rownames(eval) = c("Decision Tree", "Random Forest", "Gradient Boosting")
t_eval <- t(eval)
colnames(t_eval) <- rownames(eval)
rownames(t_eval) <- colnames(eval)
knitr::kable(t_eval)

| | Decision Tree | Random Forest | Gradient Boosting |
|---|---|---|---|
| Accuracy | 0.8410405 | 0.8410405 | 0.8410405 |
| Classification Error Rate | 0.1589595 | 0.1589595 | 0.1589595 |
| Sensitivity | 0.5046729 | 0.5327103 | 0.5046729 |
| Specificity | 0.9916318 | 0.9790795 | 0.9916318 |
| Precision | 0.9642857 | 0.9193548 | 0.9642857 |
| Recall | 0.5046729 | 0.5327103 | 0.5046729 |
| F1 | 0.6625767 | 0.6745562 | 0.6625767 |