hw03

R Splitting data

I split the data into 2 parts: train with 80% data and test with 20% data:

data <- Default
data <- data %>% mutate(default=ifelse(default=="Yes",1,0)) %>% mutate(student=ifelse(student=="Yes",1,0))
set.seed(123)
data_split <- initial_split(data, pop=0.8, strat= default)
data_train <- training(data_split)
data_test <- testing(data_split)

Model

I build three models to test the impacts of variables on the default: The first model is called model1 to test the impact of balance on default. The second model is called model2 to test the impact of student on default. The third model is called model3 to test the impact of balance, student, and income on default.

model1 <- glm(default ~ balance,family="binomial", data= data_train)
model2 <- glm(default ~ student, family="binomial", data= data_train)
model3 <- glm(default ~., family="binomial",data = data_train)
tidy(model1)

## # A tibble: 2 × 5
##   term         estimate std.error statistic   p.value
##   <chr>           <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept) -10.8      0.423        -25.4 8.90e-143
## 2 balance       0.00554  0.000256      21.6 1.22e-103

tidy(model2)

## # A tibble: 2 × 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)   -3.53     0.0826    -42.8  0      
## 2 student        0.429    0.134       3.20 0.00137

tidy(model3)

## # A tibble: 4 × 5
##   term             estimate  std.error statistic   p.value
##   <chr>               <dbl>      <dbl>     <dbl>     <dbl>
## 1 (Intercept) -10.9         0.575       -18.9    7.51e- 80
## 2 student      -0.700       0.275        -2.54   1.10e-  2
## 3 balance       0.00578     0.000270     21.4    1.41e-101
## 4 income        0.000000568 0.00000961    0.0591 9.53e-  1

Looking at the first table, we have the model1:

default = -10.75+ 0.0055*balance

It means that the estimated coefficient of balance is 0.0055. In other words, if balance increases 1 unit, the log odds of default will increase by 0.0055 units.

Looking at the second table, we have the model2:

default= -3.53+ 0.4287*student

It means that the estimated coefficient of student is 0.4287. Because student is dummy variable, we denote student is 1 and non-student is 0. The result suggests that student have higher probability to default than non-student.

Looking at the third table, we have the model3:

default= -10.88-0.7 * student + 0.00578 * balance + 0.000 * income

The result shows negative relationship between student and default; positive relationship betwwen default and balance; and approximately no relationship between default and income. It means that when it’s defined as student, and the balance is higher, the possibility of default will increase while the income approximately has no impacts.

Out-of-sample performance measures

data_train$default <- factor(data_train$default, levels = c("0", "1"))
data_test$default <- factor(data_test$default, levels = c("0", "1"))

pred_default <- predict(model3,data_test, type="response")
pred_result <- data_test %>% select(default) %>% bind_cols(pred_default)

## New names:
## • `` -> `...2`

colnames(pred_result) <- c("actual","prediction")

pred_default <- as.factor(pred_default)
str(pred_default)

##  Factor w/ 2500 levels "9.43883611530627e-06",..: 1252 660 1740 947 2254 1490 2101 1041 463 962 ...
##  - attr(*, "names")= chr [1:2500] "6" "19" "23" "24" ...

#confusionMatrix(data= relevel(pred_default,ref=1),
#                reference=relevel(data_test$default,ref=1))

ROC

actual<- as.numeric(pred_result$actual)
prediction <- as.numeric(pred_result$prediction)
roc_data <- roc(actual,prediction)

## Setting levels: control = 1, case = 2

## Setting direction: controls < cases

auc_value <- auc(roc_data)
plot(roc_data, print.auc = TRUE, auc.polygon = TRUE,
     legacy.axes = TRUE, grid = TRUE, main = paste0("ROC Curve (AUC = ", round(auc_value, 3), ")"))

hw03

Nguyen Bao Quynh Trang

2023-04-26

R Splitting data

Model

Out-of-sample performance measures

ROC