2024-04-03

knitr::opts_chunk$set(echo = FALSE)

# Load the packages used in this deck
library(dplyr)
library(ggplot2)
library(randomForest)
library(rpart)
library(rpart.plot)

Decision Tree

DATA 607, Spring 2024
Chhiring Lama

What Is a Decision Tree?

  • Imagine a tree where each branch represents a decision, and each leaf represents an outcome.
  • In simple terms, decision trees help us understand which decisions to make based on certain conditions.
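
To make the branch-and-leaf picture concrete, a very small tree can be written by hand as nested conditions. This is only an illustrative sketch: the function name `classify_applicant` and both thresholds are hypothetical and are not taken from the loan data.

```r
# Hypothetical hand-written tree; the rules and thresholds are made up.
classify_applicant <- function(credit_score, owns_home) {
  if (credit_score >= 700) {      # first branch: question about credit score
    "Approved"                    # leaf
  } else if (owns_home) {         # second branch: question about home ownership
    "Approved"                    # leaf
  } else {
    "Denied"                      # leaf
  }
}

classify_applicant(720, FALSE)
classify_applicant(650, TRUE)
classify_applicant(650, FALSE)
```

A fitted decision tree has the same shape, except that the questions and thresholds are learned from the data instead of being written by hand.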

How Does a Decision Tree Work?

  • At each step, a decision tree looks at the data and asks questions to make the best choice.
  • It splits the data into smaller groups based on features that help classify or predict outcomes.
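
As a rough sketch of a single splitting step, the toy example below asks one yes/no question and counts the outcomes in each resulting group. The data frame `toy` and the cutoff of 650 are assumptions made up for illustration.

```r
# Hypothetical toy data, not the loan dataset
toy <- data.frame(
  credit_score = c(580, 610, 640, 690, 720, 750),
  status = c("Denied", "Denied", "Approved", "Denied", "Approved", "Approved")
)

# One candidate question: is credit_score below 650?
groups <- split(toy$status, toy$credit_score < 650)

# Count the outcomes inside each group ("TRUE" = below 650, "FALSE" = otherwise)
lapply(groups, table)
```

A tree repeats this step inside each group until the groups are small or uniform enough to stop.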

Splitting Criteria

  • The tree decides which questions to ask first based on how much they help to separate the data into pure groups.
  • It’s like picking the most useful questions to ask to get the clearest answers.
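
One common way to measure how "pure" a group is, and the default that rpart uses for classification trees, is the Gini index: 1 minus the sum of squared class proportions. The sketch below compares two candidate questions on made-up data; the helper names `gini` and `split_impurity` and all values are hypothetical.

```r
# Gini impurity of a vector of class labels: 1 - sum(p_k^2)
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Size-weighted impurity of the two groups produced by a yes/no question
split_impurity <- function(labels, answer) {
  left <- labels[answer]
  right <- labels[!answer]
  (length(left) * gini(left) + length(right) * gini(right)) / length(labels)
}

# Hypothetical toy data
status <- c("Denied", "Denied", "Denied", "Approved", "Approved", "Approved")
credit_score <- c(580, 610, 640, 690, 720, 750)
owns_home <- c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE)

gini(status)                                # impurity before any split
split_impurity(status, credit_score < 650)  # separates the two classes completely
split_impurity(status, owns_home)           # leaves both groups mixed
```

The question with the lower weighted impurity is the more useful one to ask first, so here the credit score question would be chosen over home ownership.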

Load the data from GitHub

loan_url <- "https://media.githubusercontent.com/media/topkelama/lfsStorage/main/loan_eligibility.csv"
loana_df <- read.csv(loan_url)

Set a random seed and shuffle the data

set.seed(1234)  # Set a seed for reproducibility 
shuffled_loan <- loana_df[sample(nrow(loana_df)), ] 
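
The effect of the seed can be checked on a small example: resetting the same seed reproduces the same permutation, and shuffling only reorders rows. The data frame `toy` below is a hypothetical stand-in for the loan data.

```r
# Same seed, same permutation
set.seed(1234)
first <- sample(10)
set.seed(1234)
second <- sample(10)
identical(first, second)  # TRUE

# Shuffling a data frame reorders its rows but keeps each row exactly once
toy <- data.frame(id = 1:5, value = c(10, 20, 30, 40, 50))
shuffled <- toy[sample(nrow(toy)), ]
setequal(shuffled$id, toy$id)
```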

Data Transformation

# Drop the identifier columns, which carry no predictive information
clean_loan <- shuffled_loan %>%
  select(-c(Loan.ID, Customer.ID)) %>%
  # Convert the categorical columns to factors
  mutate(Term = factor(Term),
         Years.in.current.job = factor(Years.in.current.job),
         Home.Ownership = factor(Home.Ownership),
         Purpose = factor(Purpose),
         Bankruptcies = factor(Bankruptcies),
         Tax.Liens = factor(Tax.Liens)) %>%
  # Remove rows that contain missing values
  na.omit()
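
The two main transformations here, converting text columns to factors and dropping incomplete rows, can be seen on a tiny example. This sketch uses base R only and a made-up data frame `toy`; the values are hypothetical.

```r
# Hypothetical toy data with one missing credit score
toy <- data.frame(
  Term = c("Short Term", "Long Term", "Short Term", "Long Term"),
  Credit.Score = c(710, NA, 650, 730)
)

# A factor stores the distinct categories as levels
toy$Term <- factor(toy$Term)
levels(toy$Term)

# na.omit() drops every row that has at least one missing value
complete <- na.omit(toy)
nrow(toy)
nrow(complete)
```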

Derive the loan status label

# Label loans below 500,000 as "Approved" and all others as "Denied"
clean_loan <- clean_loan %>%
  mutate(Loan_Status = ifelse(Current.Loan.Amount < 500000, "Approved", "Denied"))

Create Function for Train and Test

create_train_test <- function(data, size = 0.8, train = TRUE) {
    n_row <- nrow(data)
    total_row <- round(size * n_row)

    # The rows were already shuffled, so the first `total_row` rows form the
    # training set and the remaining rows form the test set. Fixed indices
    # keep the two calls from returning overlapping rows.
    train_sample <- seq_len(total_row)

    if (train) {
        return(data[train_sample, ])
    } else {
        return(data[-train_sample, ])
    }
}

Dataset Splitting into 80/20

train_set <- create_train_test(clean_loan, size = 0.8, train = TRUE)
dim(train_set)
## [1] 598  17
test_set <- create_train_test(clean_loan, size = 0.8, train = FALSE)
dim(test_set)
## [1] 150  17
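
A useful sanity check for any train/test split is that no row lands in both sets and that the two sets together cover the data. The sketch below shows the check on a hypothetical 10-row data frame `toy` with an `id` column; it assumes rows can be identified by a unique key.

```r
set.seed(1234)
toy <- data.frame(id = 1:10)
toy <- toy[sample(nrow(toy)), , drop = FALSE]  # shuffle once

# Take the first 80% of the shuffled rows for training, the rest for testing
n_train <- round(0.8 * nrow(toy))
train_idx <- seq_len(n_train)
toy_train <- toy[train_idx, , drop = FALSE]
toy_test <- toy[-train_idx, , drop = FALSE]

# Number of ids shared by both sets; 0 means the sets are disjoint
length(intersect(toy_train$id, toy_test$id))

# The two sets together should account for every row
nrow(toy_train) + nrow(toy_test) == nrow(toy)
```

If the first number is greater than zero, some test rows were also seen during training and the reported accuracy will be optimistic.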

Define the formula for the decision tree model

formula <- Loan_Status ~ Credit.Score + Years.in.current.job + Home.Ownership

Build the decision tree model using rpart

decision_tree <- rpart(formula, data = train_set, method = "class")

Print the summary of the decision tree model
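
As a sketch of what printing a fitted tree involves, the calls below use the `kyphosis` data bundled with rpart as a stand-in, since it needs no download. The model `fit` and its formula are assumptions for illustration and are not the loan model.

```r
library(rpart)

# Stand-in classification tree on rpart's bundled kyphosis data
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")

print(fit)    # one line per node: split rule, n, misclassified, predicted class
printcp(fit)  # complexity parameter table, useful when deciding how far to prune
summary(fit)  # detailed per-node report, including variable importance
```

The same three calls can be applied to any object returned by `rpart()`.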

Plot the Decision Tree

# Plot the decision tree
rpart.plot(decision_tree, uniform = TRUE, main = "Decision Tree Model")

Evaluate the model on the testing set and calculate accuracy

predictions <- predict(decision_tree, test_set, type = "class")

accuracy <- mean(predictions == test_set$Loan_Status)
print(paste("Accuracy of the decision tree model:", round(accuracy * 100, 2), "%"))
## [1] "Accuracy of the decision tree model: 74.67 %"
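
Accuracy alone does not show which kind of mistake the model makes. A confusion matrix breaks the predictions down by actual class; the sketch below builds one with `table()` on made-up label vectors (`actual` and `predicted` are hypothetical, not the model's output).

```r
# Hypothetical labels for six loans
actual <- c("Approved", "Approved", "Denied", "Denied", "Approved", "Denied")
predicted <- c("Approved", "Denied", "Denied", "Approved", "Approved", "Denied")

# Rows are predictions, columns are the true labels
confusion <- table(Predicted = predicted, Actual = actual)
confusion

# Accuracy is the share of cases on the diagonal
accuracy <- sum(diag(confusion)) / sum(confusion)
accuracy
```

The off-diagonal cells separate loans wrongly predicted as approved from loans wrongly predicted as denied, which matter differently to a lender.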

The End

Citation:-