Ever been heart broken and/or wondered what makes a lasting Marriage? This dataset and prediction analysis may help you.
This dataset contains data about 150 couples with their corresponding Divorce Predictors Scale variables (DPS) on the basis of Gottman couples therapy.The couples are from various regions of Turkey wherein the records were acquired from face-to-face interviews from couples who were already divorced or happily married. All responses were collected on a 5 point scale (0=Never, 1=Seldom, 2=Averagely, 3=Frequently, 4=Always).
Source dataset : https://archive.ics.uci.edu/ml/datasets/Divorce+Predictors+data+set.
Before we do analysis, we need to load the required library packages.
library(dplyr) #praprocess data
library(caret) #confusion matrix
library(e1071) #Naive Bayes Classifier
library(rsample) #Splitting data
library(partykit) #Decision Tree
library(randomForest) #Random Forest
library(readr) #Read RDS
# plot
library(ggplot2)
library(tidyr)We need the data to do the analysis. Then, we have to load the dataset
divorce <- readxl::read_xlsx("divorce.xlsx")
head(divorce)Divorce Predictors dataset is consist of 54 questionnaire which carried out by using the Divorce Predictors Scale (DPS) on the basis of Gottman couples therapy. Questionnaire was taken place in Turkey.
Target Variable –> Class, 0-Not_Divorce ; 1-Divorce
All responses were collected on a 5 point scale (0=Never, 1=Seldom, 2=Averagely, 3=Frequently, 4=Always).
The list of Questionnaires:
Let us check each column’s data type.
glimpse(divorce)#> Rows: 170
#> Columns: 55
#> $ Atr1 <dbl> 2, 4, 2, 3, 2, 0, 3, 2, 2, 1, 4, 4, 3, 3, 3, 4, 4, 4, 3, 4, 4, 4~
#> $ Atr2 <dbl> 2, 4, 2, 2, 2, 0, 3, 1, 2, 1, 4, 4, 4, 4, 4, 4, 4, 4, 3, 4, 3, 3~
#> $ Atr3 <dbl> 4, 4, 2, 3, 1, 1, 3, 2, 1, 1, 4, 4, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3~
#> $ Atr4 <dbl> 1, 4, 2, 2, 1, 0, 2, 2, 0, 1, 3, 3, 4, 4, 4, 2, 2, 3, 4, 3, 3, 3~
#> $ Atr5 <dbl> 0, 4, 1, 3, 1, 0, 1, 2, 0, 1, 4, 4, 3, 3, 3, 4, 4, 4, 3, 4, 4, 4~
#> $ Atr6 <dbl> 0, 0, 3, 3, 1, 2, 3, 1, 4, 2, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1~
#> $ Atr7 <dbl> 0, 0, 2, 3, 0, 0, 4, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0~
#> $ Atr8 <dbl> 0, 4, 1, 3, 0, 0, 3, 3, 3, 2, 4, 4, 4, 4, 4, 4, 4, 4, 3, 4, 3, 3~
#> $ Atr9 <dbl> 0, 4, 1, 3, 0, 0, 2, 3, 3, 2, 4, 4, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3~
#> $ Atr10 <dbl> 0, 4, 2, 3, 0, 1, 2, 2, 3, 2, 3, 3, 4, 4, 4, 2, 2, 3, 4, 3, 3, 3~
#> $ Atr11 <dbl> 1, 4, 3, 4, 0, 0, 2, 4, 3, 3, 4, 4, 3, 3, 3, 4, 4, 4, 3, 4, 4, 4~
#> $ Atr12 <dbl> 0, 3, 4, 3, 1, 2, 2, 3, 3, 0, 4, 4, 4, 4, 4, 4, 4, 4, 3, 4, 3, 3~
#> $ Atr13 <dbl> 1, 4, 2, 3, 0, 1, 2, 2, 3, 0, 4, 4, 3, 3, 3, 4, 4, 4, 3, 4, 4, 4~
#> $ Atr14 <dbl> 1, 0, 3, 4, 1, 0, 3, 3, 3, 2, 4, 4, 4, 4, 4, 4, 4, 4, 3, 4, 3, 3~
#> $ Atr15 <dbl> 0, 4, 3, 3, 1, 2, 2, 4, 3, 1, 4, 4, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3~
#> $ Atr16 <dbl> 1, 4, 3, 3, 1, 0, 3, 3, 3, 0, 3, 3, 4, 4, 4, 2, 2, 3, 4, 3, 3, 3~
#> $ Atr17 <dbl> 0, 4, 3, 3, 1, 2, 3, 2, 3, 1, 4, 4, 3, 3, 3, 4, 4, 4, 3, 4, 4, 4~
#> $ Atr18 <dbl> 0, 4, 3, 3, 1, 1, 3, 3, 3, 2, 4, 4, 4, 4, 4, 4, 4, 4, 3, 4, 3, 3~
#> $ Atr19 <dbl> 0, 3, 3, 3, 2, 0, 3, 2, 3, 1, 4, 4, 3, 3, 3, 4, 4, 4, 3, 4, 4, 4~
#> $ Atr20 <dbl> 1, 2, 2, 4, 1, 1, 2, 1, 3, 0, 4, 4, 4, 4, 4, 4, 4, 4, 3, 4, 3, 3~
#> $ Atr21 <dbl> 0, 1, 1, 1, 1, 0, 3, 2, 2, 0, 4, 4, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3~
#> $ Atr22 <dbl> 0, 1, 0, 1, 0, 0, 3, 1, 2, 0, 3, 3, 4, 4, 4, 2, 2, 3, 4, 3, 3, 3~
#> $ Atr23 <dbl> 0, 0, 1, 1, 0, 0, 3, 1, 2, 0, 4, 4, 3, 3, 3, 4, 4, 4, 3, 4, 4, 4~
#> $ Atr24 <dbl> 0, 2, 2, 1, 0, 0, 3, 2, 3, 1, 4, 4, 4, 4, 4, 4, 4, 4, 3, 4, 3, 3~
#> $ Atr25 <dbl> 0, 2, 2, 2, 0, 2, 2, 3, 2, 1, 4, 4, 3, 3, 3, 4, 4, 4, 3, 4, 4, 4~
#> $ Atr26 <dbl> 0, 1, 2, 1, 2, 2, 3, 3, 3, 1, 4, 4, 4, 4, 4, 4, 4, 4, 3, 4, 3, 3~
#> $ Atr27 <dbl> 0, 2, 2, 1, 1, 0, 3, 2, 2, 1, 4, 4, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3~
#> $ Atr28 <dbl> 0, 0, 2, 1, 2, 0, 2, 2, 3, 1, 3, 3, 4, 4, 4, 2, 2, 3, 4, 3, 3, 3~
#> $ Atr29 <dbl> 0, 1, 3, 1, 1, 0, 2, 2, 2, 1, 4, 4, 3, 3, 3, 4, 4, 4, 3, 4, 4, 4~
#> $ Atr30 <dbl> 1, 1, 2, 3, 1, 0, 2, 3, 3, 1, 4, 4, 4, 4, 4, 4, 4, 4, 3, 4, 3, 3~
#> $ Atr31 <dbl> 1, 0, 3, 2, 1, 4, 1, 1, 1, 1, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4~
#> $ Atr32 <dbl> 2, 4, 3, 3, 1, 1, 2, 1, 1, 1, 4, 4, 4, 4, 4, 4, 4, 3, 4, 3, 4, 4~
#> $ Atr33 <dbl> 1, 2, 1, 2, 1, 1, 2, 0, 1, 0, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4~
#> $ Atr34 <dbl> 2, 3, 1, 2, 1, 1, 1, 2, 1, 1, 4, 4, 4, 4, 4, 4, 4, 3, 4, 3, 4, 4~
#> $ Atr35 <dbl> 0, 0, 1, 1, 0, 1, 1, 2, 1, 0, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4~
#> $ Atr36 <dbl> 1, 2, 1, 1, 0, 1, 2, 1, 1, 0, 4, 4, 4, 4, 4, 4, 4, 3, 4, 3, 4, 4~
#> $ Atr37 <dbl> 2, 3, 2, 3, 0, 1, 3, 4, 1, 1, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4~
#> $ Atr38 <dbl> 1, 4, 1, 3, 0, 2, 2, 4, 2, 1, 4, 4, 4, 4, 4, 4, 4, 3, 4, 3, 4, 4~
#> $ Atr39 <dbl> 3, 2, 3, 4, 2, 0, 2, 4, 2, 2, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4~
#> $ Atr40 <dbl> 3, 4, 3, 4, 1, 2, 3, 4, 2, 2, 4, 4, 4, 4, 4, 4, 4, 3, 4, 3, 4, 4~
#> $ Atr41 <dbl> 2, 2, 3, 2, 0, 2, 3, 4, 2, 1, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4~
#> $ Atr42 <dbl> 1, 2, 3, 2, 2, 1, 3, 4, 2, 2, 4, 4, 4, 4, 4, 4, 4, 3, 4, 3, 4, 4~
#> $ Atr43 <dbl> 1, 3, 2, 3, 3, 2, 3, 3, 2, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4~
#> $ Atr44 <dbl> 2, 4, 3, 2, 0, 3, 4, 2, 2, 2, 4, 4, 4, 4, 3, 4, 4, 3, 4, 3, 4, 4~
#> $ Atr45 <dbl> 3, 2, 2, 3, 2, 0, 3, 0, 2, 2, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4~
#> $ Atr46 <dbl> 2, 2, 3, 2, 2, 2, 3, 0, 1, 2, 4, 4, 4, 4, 4, 4, 4, 3, 4, 3, 4, 4~
#> $ Atr47 <dbl> 1, 2, 2, 2, 1, 2, 2, 1, 1, 0, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4~
#> $ Atr48 <dbl> 3, 3, 3, 3, 2, 1, 3, 2, 1, 2, 4, 4, 4, 4, 4, 4, 4, 3, 4, 3, 4, 4~
#> $ Atr49 <dbl> 3, 4, 1, 3, 3, 2, 2, 2, 1, 2, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4~
#> $ Atr50 <dbl> 3, 4, 1, 3, 2, 1, 3, 2, 1, 2, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4~
#> $ Atr51 <dbl> 2, 4, 1, 3, 2, 1, 3, 1, 1, 2, 4, 4, 4, 4, 4, 4, 4, 3, 4, 3, 4, 4~
#> $ Atr52 <dbl> 3, 4, 2, 2, 2, 1, 2, 1, 1, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4~
#> $ Atr53 <dbl> 2, 2, 2, 2, 1, 2, 2, 1, 1, 3, 4, 4, 4, 4, 4, 4, 4, 3, 4, 3, 4, 4~
#> $ Atr54 <dbl> 1, 2, 2, 2, 0, 0, 2, 0, 1, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4~
#> $ Class <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
After we check the datatypes, only our target variable data type that needs to be changed to factor/category
divorce <- divorce %>%
mutate(Class=ifelse(Class==1,"divorce","not_divorce")) %>%
mutate(Class=as.factor(Class))
head(divorce)All the data types are correct, we are ready to go for the next step
We have to check if there is any missing values in our data set
colSums(is.na(divorce))#> Atr1 Atr2 Atr3 Atr4 Atr5 Atr6 Atr7 Atr8 Atr9 Atr10 Atr11 Atr12 Atr13
#> 0 0 0 0 0 0 0 0 0 0 0 0 0
#> Atr14 Atr15 Atr16 Atr17 Atr18 Atr19 Atr20 Atr21 Atr22 Atr23 Atr24 Atr25 Atr26
#> 0 0 0 0 0 0 0 0 0 0 0 0 0
#> Atr27 Atr28 Atr29 Atr30 Atr31 Atr32 Atr33 Atr34 Atr35 Atr36 Atr37 Atr38 Atr39
#> 0 0 0 0 0 0 0 0 0 0 0 0 0
#> Atr40 Atr41 Atr42 Atr43 Atr44 Atr45 Atr46 Atr47 Atr48 Atr49 Atr50 Atr51 Atr52
#> 0 0 0 0 0 0 0 0 0 0 0 0 0
#> Atr53 Atr54 Class
#> 0 0 0
There are no missing values in our data set. Now we are ready to go to the data analysis.
To get to know more about our data, let us check the summary.
summary(divorce)#> Atr1 Atr2 Atr3 Atr4
#> Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.000
#> 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.000
#> Median :2.000 Median :2.000 Median :2.000 Median :1.000
#> Mean :1.776 Mean :1.653 Mean :1.765 Mean :1.482
#> 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:3.000
#> Max. :4.000 Max. :4.000 Max. :4.000 Max. :4.000
#> Atr5 Atr6 Atr7 Atr8
#> Min. :0.000 Min. :0.0000 Min. :0.0000 Min. :0.000
#> 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.000
#> Median :1.000 Median :0.0000 Median :0.0000 Median :1.000
#> Mean :1.541 Mean :0.7471 Mean :0.4941 Mean :1.453
#> 3rd Qu.:3.000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:3.000
#> Max. :4.000 Max. :4.0000 Max. :4.0000 Max. :4.000
#> Atr9 Atr10 Atr11 Atr12
#> Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.000
#> 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.000
#> Median :1.000 Median :2.000 Median :1.000 Median :1.500
#> Mean :1.459 Mean :1.576 Mean :1.688 Mean :1.653
#> 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:3.000
#> Max. :4.000 Max. :4.000 Max. :4.000 Max. :4.000
#> Atr13 Atr14 Atr15 Atr16
#> Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.000
#> 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.000
#> Median :2.000 Median :1.000 Median :1.000 Median :1.000
#> Mean :1.835 Mean :1.571 Mean :1.571 Mean :1.476
#> 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:3.000
#> Max. :4.000 Max. :4.000 Max. :4.000 Max. :4.000
#> Atr17 Atr18 Atr19 Atr20
#> Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.000
#> 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.000
#> Median :1.000 Median :1.000 Median :1.000 Median :1.000
#> Mean :1.653 Mean :1.518 Mean :1.641 Mean :1.459
#> 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:3.000
#> Max. :4.000 Max. :4.000 Max. :4.000 Max. :4.000
#> Atr21 Atr22 Atr23 Atr24
#> Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.000
#> 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.000
#> Median :1.000 Median :0.000 Median :0.000 Median :1.000
#> Mean :1.388 Mean :1.247 Mean :1.412 Mean :1.512
#> 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:3.000
#> Max. :4.000 Max. :4.000 Max. :4.000 Max. :4.000
#> Atr25 Atr26 Atr27 Atr28 Atr29
#> Min. :0.000 Min. :0.000 Min. :0.0 Min. :0.000 Min. :0.000
#> 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.0 1st Qu.:0.000 1st Qu.:0.000
#> Median :1.000 Median :1.000 Median :1.0 Median :0.500 Median :1.000
#> Mean :1.629 Mean :1.488 Mean :1.4 Mean :1.306 Mean :1.494
#> 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:3.0 3rd Qu.:3.000 3rd Qu.:3.000
#> Max. :4.000 Max. :4.000 Max. :4.0 Max. :4.000 Max. :4.000
#> Atr30 Atr31 Atr32 Atr33 Atr34
#> Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.0
#> 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.0
#> Median :1.000 Median :2.000 Median :2.000 Median :1.000 Median :1.0
#> Mean :1.494 Mean :2.124 Mean :2.059 Mean :1.806 Mean :1.9
#> 3rd Qu.:3.000 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.0
#> Max. :4.000 Max. :4.000 Max. :4.000 Max. :4.000 Max. :4.0
#> Atr35 Atr36 Atr37 Atr38
#> Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.000
#> 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.000
#> Median :0.500 Median :0.000 Median :2.000 Median :1.000
#> Mean :1.671 Mean :1.606 Mean :2.088 Mean :1.859
#> 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.000
#> Max. :4.000 Max. :4.000 Max. :4.000 Max. :4.000
#> Atr39 Atr40 Atr41 Atr42
#> Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.000
#> 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.000
#> Median :2.000 Median :1.500 Median :2.000 Median :2.000
#> Mean :2.088 Mean :1.871 Mean :1.994 Mean :2.159
#> 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.000
#> Max. :4.000 Max. :4.000 Max. :4.000 Max. :4.000
#> Atr43 Atr44 Atr45 Atr46
#> Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.000
#> 1st Qu.:2.000 1st Qu.:0.000 1st Qu.:1.000 1st Qu.:2.000
#> Median :3.000 Median :2.000 Median :3.000 Median :3.000
#> Mean :2.706 Mean :1.941 Mean :2.459 Mean :2.553
#> 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.000
#> Max. :4.000 Max. :4.000 Max. :4.000 Max. :4.000
#> Atr47 Atr48 Atr49 Atr50
#> Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.000
#> 1st Qu.:1.000 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:1.000
#> Median :2.000 Median :3.000 Median :3.000 Median :2.000
#> Mean :2.271 Mean :2.741 Mean :2.382 Mean :2.429
#> 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.000
#> Max. :4.000 Max. :4.000 Max. :4.000 Max. :4.000
#> Atr51 Atr52 Atr53 Atr54
#> Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.000
#> 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:0.000
#> Median :3.000 Median :3.000 Median :2.000 Median :2.000
#> Mean :2.476 Mean :2.518 Mean :2.241 Mean :2.012
#> 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.000
#> Max. :4.000 Max. :4.000 Max. :4.000 Max. :4.000
#> Class
#> divorce :84
#> not_divorce:86
#>
#>
#>
#>
Below frequency data visualization for each numerical variables
ggplot(gather(divorce %>% select_if(is.numeric)), aes(value)) +
geom_histogram(bins = 10, fill="dodgerblue4") +
facet_wrap(~key, scales = 'free_x',nrow=6) +
theme_bw()Summary of the variables:
Let’s check the proportion of our target variable
prop.table(table(divorce$Class))#>
#> divorce not_divorce
#> 0.4941176 0.5058824
Our target variable has a balance proportion
Before we build our model, we should split the dataset into training and test data. We will split the data into 80% training and 20% test using sample() function, set.seed(100), and store it as data.train and data.test
RNGkind(sample.kind = "Rounding")
set.seed(100)
splitter_sample <- sample(nrow(divorce), nrow(divorce)*0.8)
data.train <- divorce[splitter_sample, ]
data.test <- divorce[-splitter_sample, ]Now let us check the target variable proportion of our data train.
prop.table(table(data.train$Class))#>
#> divorce not_divorce
#> 0.4926471 0.5073529
Now let us check the target variable proportion of our data test.
prop.table(table(data.test$Class))#>
#> divorce not_divorce
#> 0.5 0.5
If we look at above proportion results, both of our data train and data set have balance target variables. So there is no needs to do upsampling or downsampling.
Now let us build our models!
Naive Bayes is a simple technique for constructing classifiers: models that assign class labels to problem instances, represented as vectors of feature values, where the class labels are drawn from some finite set. There is not a single algorithm for training such classifiers, but a family of algorithms based on a common principle: all naive Bayes classifiers assume that the value of a particular feature is independent of the value of any other feature, given the class variable.
There are several advantages in using this model, for example:
After splitting our data into train and test set and downsample our train data, let us build our first model of Naive Bayes. Build a Naive Bayes model using naiveBayes() function from the e1071 package, then set the laplace parameter as 1.
model_naive <- naiveBayes(x = divorce %>% select(-Class),
y = divorce$Class,
laplace = 1) Now let us predict our test data to obtain class prediction.
pred_naive <- predict(object= model_naive,
newdata = data.test,
type="class")Now let us check the model performance for the Naive Bayes model using confusionMatrix() and compare the predicted class (pred_naive) with the actual label in data.testusing divorce Class as the positive class (positive = "divorce").
confusionMatrix(data= pred_naive,
reference= data.test$Class,
positive="divorce")#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction divorce not_divorce
#> divorce 16 0
#> not_divorce 1 17
#>
#> Accuracy : 0.9706
#> 95% CI : (0.8467, 0.9993)
#> No Information Rate : 0.5
#> P-Value [Acc > NIR] : 0.000000002037
#>
#> Kappa : 0.9412
#>
#> Mcnemar's Test P-Value : 1
#>
#> Sensitivity : 0.9412
#> Specificity : 1.0000
#> Pos Pred Value : 1.0000
#> Neg Pred Value : 0.9444
#> Prevalence : 0.5000
#> Detection Rate : 0.4706
#> Detection Prevalence : 0.4706
#> Balanced Accuracy : 0.9706
#>
#> 'Positive' Class : divorce
#>
If we look at above confusion matrix. We have a 97,06 % accuracy. It means that this model can predict divorce with 97,06 % accuracy. It means this model is good to predict both divorce and no_divorce. If we can predict a couple is going to potentially divorce, we can propose them to go to marriage consultant to reconciliate and fix the marriage before it’s too late. If we can predict a couple is not going to divorce it will help them to retain and tighter the marriage .
This model also has high percentage in other metrics such as Recall/Sensitivity, Specificity, and Pos Pred value/Precision.
Since we have balanced data with good accuracy, recall, precision, specificity, then we do not need to check the Receiver-Operating Curve (ROC) and the Area Under ROC Curve (AUC).
A decision tree is a flowchart-like structure in which each internal node represents a “test” on an attribute (e.g. whether a coin flip comes up heads or tails), each branch represents the outcome of the test, and each leaf node represents a class label (decision taken after computing all attributes). The paths from root to leaf represent classification rules.
In decision analysis, a decision tree and the closely related influence diagram are used as a visual and analytical decision support tool, where the expected values (or expected utility) of competing alternatives are calculated.
A decision tree consists of three types of nodes:[1]
Decision nodes – typically represented by squares Chance nodes – typically represented by circles End nodes – typically represented by triangles
Let us build our decision tree model using ctree() function to build the model and store it under the model_dt object.
set.seed(100)
model_dt <-ctree(Class ~ ., divorce)Now let us predict our test data to obtain class prediction.
pred_dt <- predict(model_dt, newdata = data.test, type = "response")Now let us check the model performance for the Decision Tree model using confusionMatrix() and compare the predicted class (pred_dt) with the actual label in data.testusing divorce Class as the positive class (positive = "divorce").
confusionMatrix(pred_dt, data.test$Class, positive = "divorce")#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction divorce not_divorce
#> divorce 16 0
#> not_divorce 1 17
#>
#> Accuracy : 0.9706
#> 95% CI : (0.8467, 0.9993)
#> No Information Rate : 0.5
#> P-Value [Acc > NIR] : 0.000000002037
#>
#> Kappa : 0.9412
#>
#> Mcnemar's Test P-Value : 1
#>
#> Sensitivity : 0.9412
#> Specificity : 1.0000
#> Pos Pred Value : 1.0000
#> Neg Pred Value : 0.9444
#> Prevalence : 0.5000
#> Detection Rate : 0.4706
#> Detection Prevalence : 0.4706
#> Balanced Accuracy : 0.9706
#>
#> 'Positive' Class : divorce
#>
If we look at above confusion matrix, we can see the accuracy is 97,06 %. It is interesting because we have the same accuracy and another metrics as Naive bayes model.
Now let us check the model performance with data train and compare to data test, to check whether this model is overfitting, just right, or under fitting
pred_dt_train <- predict(model_dt, newdata = data.train, type = "response")
confusionMatrix(pred_dt_train, data.train$Class, positive = "divorce")#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction divorce not_divorce
#> divorce 63 0
#> not_divorce 4 69
#>
#> Accuracy : 0.9706
#> 95% CI : (0.9264, 0.9919)
#> No Information Rate : 0.5074
#> P-Value [Acc > NIR] : <0.0000000000000002
#>
#> Kappa : 0.9411
#>
#> Mcnemar's Test P-Value : 0.1336
#>
#> Sensitivity : 0.9403
#> Specificity : 1.0000
#> Pos Pred Value : 1.0000
#> Neg Pred Value : 0.9452
#> Prevalence : 0.4926
#> Detection Rate : 0.4632
#> Detection Prevalence : 0.4632
#> Balanced Accuracy : 0.9701
#>
#> 'Positive' Class : divorce
#>
Accuracy data comparison:
Basically the Accuracy value for both data.test and data.train is high and the same value. So the model is just right fitting and we do not need to do pruning.
Why Accuracy metrics? Same as mentioned in naive bayes,It means this model is good to predict both divorce and no_divorce. If we can predict a couple is going to potentially divorce, we can propose them to go to marriage consultant to reconciliate and fix the marriage before it’s too late. If we can predict a couple is not going to divorce it will help them to retain and tighter the marriage .
To have a better grasp of our model, we will plot the model and set type = "simple".
plot(model_dt,type="simple") It looks like the plot is self explanatory. It is interesting that we have a variable Atr40 that directly get to Class decision.
Random forest is an ensemble-based state-of-the-art algorithm built on the decision tree method we learned about above and is also known for its versatility and performance. Among the family of ensemble-based classifier include a technique called boosting and it works by combining the performance of weak learners to gain an overall boosted performance.
The idea of ensembling is largely in principle and doesn’t necessarily reference any particular algorithm. They describe any meta-algorithms that combine several machine learning techniques into one predictive model in order to decrease variance, reduce bias, or improve predictions.
When we apply the ensemble-based approach on a decision tree model, the trees we built are usually trained using resampled data. In the prediction phase, these trees then vote for a final prediction. Another way to apply ensemble methods on our tree model is known as bagging (bootstrap aggregation). Bagging proposes the idea of creating many subsets of training sample through random sampling (with replacement). Then each of these sets of training sample are used to train one unit of decision tree. This leads us to an “ensemble” of trees, and we’ll use the average of all the predictions from these different trees in the prediction phase.
Random Forest extends the idea of bagging by taking one more measure: in addition to creating subsets from the training set, each of the tree is also trained using a random selection of features (rather than using all features). Because each tree is built with a random set of predictors and training samples, the collective of it is called a Random Forest, which is a lot more robust as a model compared to a single tree.
Among many of its advantages, random forest can be used to solve for both regression and classification tasks, handles extremely large datasets well (since the ensemble approach means it only use a small sampled subset from the full dataset), would solve for the dimensionality problems through implicit feature selection while treating noisy data (missing values and outlier values) out of the box.
Let us create our Random Forest now, using a 5-Fold Cross Validation, with 3 repeats.
set.seed(100)
ctrl <- trainControl(method = "repeatedcv",
number = 5,
repeats = 3)
model_forest <- train(Class ~ .,
data = data.train,
method = "rf",
trControl = ctrl)
saveRDS(model_forest, "model_forest_update.RDS")model_rf <- readRDS("model_forest_update.RDS")Now let us check the summary of the final model we built using model_rf$finalModel
model_rf$finalModel#>
#> Call:
#> randomForest(x = x, y = y, mtry = param$mtry)
#> Type of random forest: classification
#> Number of trees: 500
#> No. of variables tried at each split: 2
#>
#> OOB estimate of error rate: 2.21%
#> Confusion matrix:
#> divorce not_divorce class.error
#> divorce 64 3 0.04477612
#> not_divorce 0 69 0.00000000
If we look at the model above, the OOB or Out-Off-Bag error is 2.21%, It means that this model has accuracy of 97.79%.
We could also use Variable Importance, to get a list of the most important variables used in our random forest. Many would argue that random forest, being a black box model, can offer no true information beyond its job in accuracy; actually paying special attention to attributes like variable importance for example often do help us gain valuable information about our data.
Let us check which variable has a high influence to the prediction.
varImp(model_rf)#> rf variable importance
#>
#> only 20 most important variables shown (out of 54)
#>
#> Overall
#> Atr17 100.00
#> Atr30 96.69
#> Atr19 95.64
#> Atr20 91.34
#> Atr18 89.54
#> Atr9 87.88
#> Atr29 78.93
#> Atr11 71.37
#> Atr40 66.55
#> Atr16 65.97
#> Atr27 64.87
#> Atr12 63.10
#> Atr39 61.55
#> Atr26 61.11
#> Atr15 58.66
#> Atr36 57.54
#> Atr38 55.79
#> Atr14 54.47
#> Atr41 53.65
#> Atr28 53.35
plot(varImp(model_rf)) If we look at plot above, Top 5 most important attributes are:
After building the model, we can now predict the test data based on model_rf using predict() function and set the parameter type = "raw" to obtain class prediction.
pred_rf <- predict(model_rf, data.test, type = "raw")Next, let us evaluate the random forest model built using confusionMatrix()
confusionMatrix(pred_rf, data.test$Class, positive = "divorce")#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction divorce not_divorce
#> divorce 16 0
#> not_divorce 1 17
#>
#> Accuracy : 0.9706
#> 95% CI : (0.8467, 0.9993)
#> No Information Rate : 0.5
#> P-Value [Acc > NIR] : 0.000000002037
#>
#> Kappa : 0.9412
#>
#> Mcnemar's Test P-Value : 1
#>
#> Sensitivity : 0.9412
#> Specificity : 1.0000
#> Pos Pred Value : 1.0000
#> Neg Pred Value : 0.9444
#> Prevalence : 0.5000
#> Detection Rate : 0.4706
#> Detection Prevalence : 0.4706
#> Balanced Accuracy : 0.9706
#>
#> 'Positive' Class : divorce
#>
Let us look below comparison between each models with every metrics
#> Model Accuracy Recall specificity Precision
#> 1 Naive Bayes 0.9706 0.9412 1 1
#> 2 Decision Tree 0.9706 0.9412 1 1
#> 3 Random Forest (OOB-Accuracy 0.9779 0.9412 1 1
If we look result above, All of our models perform very well. Even they have same result in Recall, specifity and precision. However we will choose Random Forest for this dataset since the Accuracy from the OOB is higher than Accuracy from another models.
As mention in the model evaluation, the reason we choose accuracy metric is that this metric is good to predict both divorce and no_divorce. If we can predict a couple is going to potentially divorce, we can propose them to go to marriage consultant to reconciliate and fix the marriage before it’s too late. If we can predict a couple is not going to divorce it will help them to retain and tighter the marriage .
Suggestion from this report is we need more data sample to predict more accurate result since the sample is only less than 200 couple.