I split the data into 2 parts: train with 80% data and test with 20% data:
data <- Default
data <- data %>% mutate(default=ifelse(default=="Yes",1,0)) %>% mutate(student=ifelse(student=="Yes",1,0))
set.seed(123)
data_split <- initial_split(data, pop=0.8, strat= default)
data_train <- training(data_split)
data_test <- testing(data_split)
I build three models to test the impacts of variables on the default: The first model is called model1 to test the impact of balance on default. The second model is called model2 to test the impact of student on default. The third model is called model3 to test the impact of balance, student, and income on default.
model1 <- glm(default ~ balance,family="binomial", data= data_train)
model2 <- glm(default ~ student, family="binomial", data= data_train)
model3 <- glm(default ~., family="binomial",data = data_train)
tidy(model1)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -10.8 0.423 -25.4 8.90e-143
## 2 balance 0.00554 0.000256 21.6 1.22e-103
tidy(model2)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -3.53 0.0826 -42.8 0
## 2 student 0.429 0.134 3.20 0.00137
tidy(model3)
## # A tibble: 4 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -10.9 0.575 -18.9 7.51e- 80
## 2 student -0.700 0.275 -2.54 1.10e- 2
## 3 balance 0.00578 0.000270 21.4 1.41e-101
## 4 income 0.000000568 0.00000961 0.0591 9.53e- 1
Looking at the first table, we have the model1:
default = -10.75+ 0.0055*balance
It means that the estimated coefficient of balance is 0.0055. In other words, if balance increases 1 unit, the log odds of default will increase by 0.0055 units.
Looking at the second table, we have the model2:
default= -3.53+ 0.4287*student
It means that the estimated coefficient of student is 0.4287. Because student is dummy variable, we denote student is 1 and non-student is 0. The result suggests that student have higher probability to default than non-student.
Looking at the third table, we have the model3:
default= -10.88-0.7 * student + 0.00578 * balance + 0.000 * income
The result shows negative relationship between student and default; positive relationship betwwen default and balance; and approximately no relationship between default and income. It means that when it’s defined as student, and the balance is higher, the possibility of default will increase while the income approximately has no impacts.
data_train$default <- factor(data_train$default, levels = c("0", "1"))
data_test$default <- factor(data_test$default, levels = c("0", "1"))
pred_default <- predict(model3,data_test, type="response")
pred_result <- data_test %>% select(default) %>% bind_cols(pred_default)
## New names:
## • `` -> `...2`
colnames(pred_result) <- c("actual","prediction")
pred_default <- as.factor(pred_default)
str(pred_default)
## Factor w/ 2500 levels "9.43883611530627e-06",..: 1252 660 1740 947 2254 1490 2101 1041 463 962 ...
## - attr(*, "names")= chr [1:2500] "6" "19" "23" "24" ...
#confusionMatrix(data= relevel(pred_default,ref=1),
# reference=relevel(data_test$default,ref=1))
actual<- as.numeric(pred_result$actual)
prediction <- as.numeric(pred_result$prediction)
roc_data <- roc(actual,prediction)
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
auc_value <- auc(roc_data)
plot(roc_data, print.auc = TRUE, auc.polygon = TRUE,
legacy.axes = TRUE, grid = TRUE, main = paste0("ROC Curve (AUC = ", round(auc_value, 3), ")"))