Machine Learning to Predict a Successful Marriage by Determining Some Important Variables
library(dplyr)        # data wrangling
library(caret)        # confusionMatrix and model training helpers
library(randomForest) # random forest
library(class)        # k-nearest neighbours
library(keras)        # neural network API
library(tensorflow)   # neural network backend
library(GGally)
library(e1071)        # naive bayes, SVM
library(partykit)     # decision tree
library(scales)       # label formatting (comma)
library(lmtest)
library(MLmetrics)
library(car)
library(ggplot2)      # plotting
library(tidyverse)
library(gridExtra)
library(treemapify)
library(plotly)       # interactive plots (ggplotly)
library(rsample)      # initial_split
library(ggthemes)     # theme_wsj, theme_tufte
library(fmsb)
library(hrbrthemes)   # theme_ipsum
library(egg)          # theme_article
library(bigstatsr)    # theme_bigstatsr
library(formatR)      # code tidying
options(scipen = 9999)
Intro
Background
When we talk about the relationship between a man and a woman, one of its forms is marriage. Can you spot a good relationship? Of course, nobody knows what really goes on between any couple. Good relationships don't happen overnight; they take commitment, compromise, forgiveness and, most of all, effort. In the current era, maintaining a good relationship has become more challenging. Decades of scientific research into love, intimacy and relationships have taught us that a number of behaviors can predict whether a couple is on solid ground or headed for troubled waters.
There are many areas of closeness that can enhance a marriage or relationship, help it remain strong, and help it get back on track when it has become distant and difficult. Sometimes couples feel that things are not right between them and wonder what is wrong and what they can do about it.
In Indonesia, the divorce rate has been increasing almost every year. Data from Badan Pusat Statistik show that in 2012 the number of divorces exceeded 340,000, and it continued to rise to roughly 365,000 in 2016. Unfortunately, I could not obtain data for 2017-2019. The marriage rate, on the other hand, tells a rather different story: it increased from 2007 to 2011 but has declined from 2011 through 2016.
The data and graphs for the divorce and marriage rates are shown below:
- The Rate of Divorce Data taken from BPS RI https://www.bps.go.id/linkTableDinamis/view/id/893
Type Tahun Jumlah
1 Cerai 2007 175713
2 Cerai 2008 193189
3 Cerai 2009 223371
4 Cerai 2010 285184
5 Cerai 2011 276791
6 Cerai 2012 346480
7 Cerai 2013 324247
8 Cerai 2014 344237
9 Cerai 2015 347256
10 Cerai 2016 365633
cerai %>% ggplot(aes(Jumlah, Tahun)) + geom_segment(aes(x = 0, xend = Jumlah,
y = Tahun, yend = Tahun)) + geom_point(size = 4, color = "red", fill = "orange",
alpha = 0.7, shape = 21, stroke = 2) + geom_text(aes(label = comma(Jumlah)),
hjust = -0.3, size = 3) + scale_x_continuous(limits = c(0, 450000)) + theme_wsj() +
labs(title = "Divorce Rate In Indonesia") + theme(plot.title = element_text(size = 15),
legend.title = element_text(size = 15), panel.grid.major = element_line(size = 0.7,
linetype = "solid", colour = "white"), panel.grid.major.x = element_line(size = 2,
    linetype = "solid", colour = "white"))
Interpretation: The divorce rate in Indonesia is increasing year by year. According to the latest update from BPS Indonesia, 365,633 divorces were recorded in 2016.
- The Rate of Marriage Data taken from BPS RI https://www.bps.go.id/linkTableDinamis/view/id/893
Type Tahun Jumlah
1 Nikah 2007 1944569
2 Nikah 2008 2195037
3 Nikah 2009 2162115
4 Nikah 2010 2207224
5 Nikah 2011 2319821
6 Nikah 2012 2289648
7 Nikah 2013 2210046
8 Nikah 2014 2110776
9 Nikah 2015 1958394
10 Nikah 2016 1837185
nikah %>% ggplot(aes(Jumlah, Tahun)) + geom_segment(aes(x = 0, xend = Jumlah,
y = Tahun, yend = Tahun)) + geom_point(size = 4, color = "red", fill = "orange",
alpha = 0.7, shape = 21, stroke = 2) + geom_text(aes(label = comma(Jumlah)),
hjust = -0.3, size = 3) + scale_x_continuous(limits = c(0, 2500000)) + theme_wsj() +
labs(title = "Marriage Rate In Indonesia") + theme(plot.title = element_text(size = 15),
legend.title = element_text(size = 15), panel.grid.major = element_line(size = 0.7,
linetype = "solid", colour = "white"), panel.grid.major.x = element_line(size = 2,
    linetype = "solid", colour = "white"))
Problem to be Solved
The goal is to predict potential divorce before it happens by applying the Divorce Predictors Scale. Hopefully this makes it possible to detect a divorce risk early and helps a couple fix their relationship, taking corrective action correctly and precisely based on the most influential predictors from the scale. The Divorce Predictors Scale has already been applied successfully in Turkey, in the Karadeniz (Black Sea) region.
In Indonesia itself, the number of divorces increases year by year, so hopefully these predictors will help couples whose relationship is in crisis to detect their most critical problems and work on them, and help couples who are not in crisis to maintain their relationship.
Contents
This dataset is taken from an external source: https://archive.ics.uci.edu/ml/datasets/Divorce+Predictors+data+set.
The Divorce Predictors dataset consists of 54 questionnaire items collected using the Divorce Predictors Scale (DPS), which is based on Gottman couples therapy. The questionnaire was administered in Turkey. Of the participants, 84 (49%) were divorced and 86 (51%) were married couples.
Additional information:
a) The ages of the participants ranged from 20 to 63 (X̄ = 36.04, SD = 9.34)
b) Of the participants, 74 (43.5%) were married for love and 96 (56.5%) were married in an arranged marriage
c) 127 (74.7%) of the participants had children, 43 (25.3%) had no children
d) 18 (10.58%) of the participants were primary school graduates, 15 (8.8%) were secondary school graduates, 33 (19.41%) were high school graduates, 88 (51.76%) were college graduates, and 15 (8.8%) had a master's degree
e) Monthly income: 34 (20%) individuals earned under 2000 TL, 54 (31.76%) between 2001-3000 TL, 28 (16.47%) between 3001-4000 TL, and 54 (31.76%) had a monthly income over 4000 TL
The list of Questionnaire:
1.When one of us apologizes when our discussions go bad, the issue does not extend.
2.I know we can ignore our differences, even if things get hard sometimes.
3.When we need to, we can take our discussions from the beginning and correct it.
4.When I argue with my spouse, it will eventually work for me to contact him.
5.The time I spent with my spouse is special for us.
6.We don’t have time at home as partners.
7.We are like two strangers who share the same environment at home rather than family.
8.I enjoy our holidays with my spouse.
9.I enjoy traveling with my spouse.
10.My spouse and most of our goals are common.
11.I think that some day, my spouse and I will be in harmony with each other.
12.My spouse and I have similar values regarding personal freedom.
13.My spouse and I have similar entertainment.
14.Most of our goals in regards to people (children, friends, etc.) are the same.
15.My dreams of living are similar and harmonious with those of my spouse.
16.I’m compatible with my spouse about what love should be.
17.I share the same views with my spouse about being happy.
18.My spouse and I have similar ideas about how marriage should be.
19.My spouse and I have similar ideas about how roles should be in marriage.
20.My spouse and I have similar values regarding trust.
21.I know exactly what my spouse likes.
22.I know how my spouse wants to be taken care of when she’s sick.
23.I know my spouse’s favorite food.
24.I can tell you what kind of stress my spouse is having in life.
25.I have knowledge of my spouse’s inner world.
26.I know my spouse’s basic concerns.
27.I know what my spouse’s current sources of stress are.
28.I know my spouse’s hopes and wishes.
29.I know my spouse very well.
30.I know my spouse’s friends and their social relationships.
31.I feel aggressive when I argue with my spouse.
32.When discussing with my spouse, I usually use expressions such as X, Y, Z.
33.I can use negative statements about my spouse’s personality during our discussions.
34.I can use offensive expressions during our discussions.
35.I can insult our discussions.
36.I can be humiliating when we argue.
37.My argument with my spouse is not calm.
38.I hate my spouse’s way of bringing it up.
39.Fights often occur suddenly.
40.We’re just starting a fight before I know what’s going on.
41.When I talk to my spouse about something, my calm suddenly breaks.
42.When I argue with my spouse, it only snaps in and I don’t say a word.
43.I’m mostly willing to calm the environment a little bit.
44.Sometimes I think it’s good for me to leave home for a while.
45.I’d rather stay silent than argue with my spouse.
46.Even if I’m right in the argument, I’m willing not to upset the other side.
47.When I argue with my spouse, I remain silent because I am afraid of not being able to control my anger.
48.I feel right in our discussions.
49.I have nothing to do with what I’ve been accused of.
50.I’m not actually the one who’s guilty of what I’m accused of.
51.I’m not the one who’s wrong about problems at home.
52.I wouldn’t hesitate to tell her about my spouse’s inadequacy.
53.I remind my spouse of her inadequacies during our discussion.
54.I’m not afraid to tell her about my spouse’s incompetence.
Target Variable –> Class, 0-Not Divorce ; 1-Divorce Potential
Exploratory Data
Read Data
Atr1 Atr2 Atr3 Atr4 Atr5 Atr6 Atr7 Atr8 Atr9 Atr10 Atr11 Atr12 Atr13
1 2 2 4 1 0 0 0 0 0 0 1 0 1
Atr14 Atr15 Atr16 Atr17 Atr18 Atr19 Atr20 Atr21 Atr22 Atr23 Atr24 Atr25
1 1 0 1 0 0 0 1 0 0 0 0 0
Atr26 Atr27 Atr28 Atr29 Atr30 Atr31 Atr32 Atr33 Atr34 Atr35 Atr36 Atr37
1 0 0 0 0 1 1 2 1 2 0 1 2
Atr38 Atr39 Atr40 Atr41 Atr42 Atr43 Atr44 Atr45 Atr46 Atr47 Atr48 Atr49
1 1 3 3 2 1 1 2 3 2 1 3 3
Atr50 Atr51 Atr52 Atr53 Atr54 Class
1 3 2 3 2 1 1
[ reached 'max' / getOption("max.print") -- omitted 169 rows ]
This data contains 170 rows and 55 columns. The last column (Class) is our target variable, and the other columns are the questionnaire questions.
Exploratory Data Analysis
*Change the 'Class' column into a factor, then relabel 1 as 'Divorce_Potential' and 0 as 'Not_Divorce'
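The conversion itself is not echoed above; a minimal sketch of this step, assuming the raw data was read into an object called 'divorce' and the relabeled copy is stored as 'd' (the name used by the plotting code below), could look like:
# Relabel the 0/1 target as a factor (object names 'divorce' and 'd' are assumptions)
d <- divorce %>%
    mutate(Class = factor(Class, levels = c(0, 1),
        labels = c("Not_Divorce", "Divorce_Potential")))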
*Let's check again
Atr1 Atr2 Atr3 Atr4 Atr5 Atr6 Atr7 Atr8 Atr9 Atr10 Atr11 Atr12 Atr13
1 2 2 4 1 0 0 0 0 0 0 1 0 1
Atr14 Atr15 Atr16 Atr17 Atr18 Atr19 Atr20 Atr21 Atr22 Atr23 Atr24 Atr25
1 1 0 1 0 0 0 1 0 0 0 0 0
Atr26 Atr27 Atr28 Atr29 Atr30 Atr31 Atr32 Atr33 Atr34 Atr35 Atr36 Atr37
1 0 0 0 0 1 1 2 1 2 0 1 2
Atr38 Atr39 Atr40 Atr41 Atr42 Atr43 Atr44 Atr45 Atr46 Atr47 Atr48 Atr49
1 1 3 3 2 1 1 2 3 2 1 3 3
Atr50 Atr51 Atr52 Atr53 Atr54 Class
1 3 2 3 2 1 Divorce_Potential
[ reached 'max' / getOption("max.print") -- omitted 5 rows ]
*Check the proportion of the target variable
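The call that produces the table below is not shown; with the relabeled data frame 'd' it is presumably something along these lines:
# Class proportions in percent
prop.table(table(d$Class)) * 100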
Divorce_Potential Not_Divorce
49.41176 50.58824
The Not_Divorce class (0) is slightly larger than the Divorce_Potential class (1), so the two classes are almost perfectly balanced.
Let's check which columns are potentially significant by comparing their means per class, to find the columns that have the most influence in this case.
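The grouped means below were presumably produced by a chunk along these lines (the result is stored in 'diff', the object transposed in the next chunk):
# Mean answer per question, separately for each class
diff <- d %>%
    group_by(Class) %>%
    summarise_all(mean)
diff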
# A tibble: 2 x 55
Class Atr1 Atr2 Atr3 Atr4 Atr5 Atr6 Atr7 Atr8 Atr9 Atr10 Atr11
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Divo~ 3.19 2.87 2.92 2.73 3.01 1.13 0.988 2.81 2.89 2.77 3.21
2 Not_~ 0.395 0.465 0.640 0.267 0.105 0.372 0.0116 0.128 0.0581 0.407 0.198
# ... with 43 more variables: Atr12 <dbl>, Atr13 <dbl>, Atr14 <dbl>,
# Atr15 <dbl>, Atr16 <dbl>, Atr17 <dbl>, Atr18 <dbl>, Atr19 <dbl>,
# Atr20 <dbl>, Atr21 <dbl>, Atr22 <dbl>, Atr23 <dbl>, Atr24 <dbl>,
# Atr25 <dbl>, Atr26 <dbl>, Atr27 <dbl>, Atr28 <dbl>, Atr29 <dbl>,
# Atr30 <dbl>, Atr31 <dbl>, Atr32 <dbl>, Atr33 <dbl>, Atr34 <dbl>,
# Atr35 <dbl>, Atr36 <dbl>, Atr37 <dbl>, Atr38 <dbl>, Atr39 <dbl>,
# Atr40 <dbl>, Atr41 <dbl>, Atr42 <dbl>, Atr43 <dbl>, Atr44 <dbl>,
# Atr45 <dbl>, Atr46 <dbl>, Atr47 <dbl>, Atr48 <dbl>, Atr49 <dbl>,
# Atr50 <dbl>, Atr51 <dbl>, Atr52 <dbl>, Atr53 <dbl>, Atr54 <dbl>
Transpose the result so that the classes become columns, and create some additional columns:
diff2 <- data.frame(t(diff))
diff2 <- diff2 %>% mutate(X1 = as.numeric(as.character(X1))) %>% mutate(X2 = as.numeric(as.character(X2))) %>%
na.omit %>% mutate(Difference = X1 - X2) %>% mutate(Questions = c(1:54))
diff2
          X1         X2 Difference Questions
1 3.1904762 0.39534880 2.7951274 1
2 2.8690476 0.46511630 2.4039313 2
3 2.9166667 0.63953490 2.2771318 3
4 2.7261905 0.26744190 2.4587486 4
5 3.0119048 0.10465120 2.9072536 5
6 1.1309520 0.37209300 0.7588590 6
7 0.9880952 0.01162791 0.9764673 7
8 2.8095240 0.12790700 2.6816170 8
9 2.8928571 0.05813953 2.8347176 9
10 2.7738095 0.40697670 2.3668328 10
11 3.2142857 0.19767440 3.0166113 11
12 2.9404762 0.39534880 2.5451274 12
13 3.0952381 0.60465120 2.4905869 13
14 2.8809524 0.29069770 2.5902547 14
15 2.9404762 0.23255810 2.7079181 15
16 2.8214286 0.16279070 2.6586379 16
17 3.1666667 0.17441860 2.9922481 17
18 2.9761905 0.09302326 2.8831672 18
[ reached 'max' / getOption("max.print") -- omitted 36 rows ]
Create a graph showing the differences:
IV <- diff2 %>% ggplot(aes(Questions, Difference)) + geom_line(col = "#ef2964") +
geom_point(aes(text = paste("Quest#", Questions, "<br>", "Var:", round(Difference,
2))), pch = 18, cex = 1) + geom_hline(aes(yintercept = 3), color = "green",
linetype = "dashed", size = 1) + scale_x_continuous(limits = c(1, 54), breaks = seq(1,
54, 2)) + labs(title = "Important Questions", x = "Questions", y = NULL) +
theme(plot.title = element_text(size = 30, family = "arial"), legend.title = element_text(size = 50,
family = "arial")) + theme_article()
ggplotly(IV, tooltip = "text")
From here, we see that the columns with a high difference in means are:
Atr11 –> I think that some day, my spouse and I will be in harmony with each other (3.01)
Atr19 –> My spouse and I have similar ideas about how roles should be in marriage (3.03)
Atr33 –> I can use negative statements about my spouse's personality during our discussions (3.06)
Atr35 –> I can insult our discussions (3.16)
Atr36 –> I can be humiliating when we argue (3.17)
Atr38 –> I hate my spouse's way of bringing it up (3.05)
Atr39 –> Fights often occur suddenly (3.07)
Atr40 –> We're just starting a fight before I know what's going on (3.3)
Atr41 –> When I talk to my spouse about something, my calm suddenly breaks (3.07)
The top 3 are: Atr35, Atr36, Atr40
Visualization
Create graphs using the original data that show the distribution of 'Not_Divorce' and 'Divorce_Potential' for each questionnaire statement. The first part shows the distribution of all parameters using density plots; the second part shows the distributions (violin plots and density plots) of only the 9 parameters above, which have a high difference in means.
Using All Parameters
In this part, we use density plots to see the distribution of all parameters. First, divide the data into two parts: Atr1-Atr27, saved into the 'gat1' object, and Atr28-Atr54, saved into the 'gat2' object. Each part is reshaped to long format using the 'gather()' function.
Atr1-Atr27 : ‘gat1’
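The chunk that builds 'gat1' is not echoed; mirroring the 'gat2' chunk shown further below, it was presumably:
gat1 <- d %>% select("Atr1":"Atr27", "Class") %>% gather(key = "key", value = "value",
    -Class)
gat1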
Class key value
1 Divorce_Potential Atr1 2
2 Divorce_Potential Atr1 4
3 Divorce_Potential Atr1 2
4 Divorce_Potential Atr1 3
5 Divorce_Potential Atr1 2
6 Divorce_Potential Atr1 0
7 Divorce_Potential Atr1 3
8 Divorce_Potential Atr1 2
9 Divorce_Potential Atr1 2
10 Divorce_Potential Atr1 1
11 Divorce_Potential Atr1 4
12 Divorce_Potential Atr1 4
13 Divorce_Potential Atr1 3
14 Divorce_Potential Atr1 3
15 Divorce_Potential Atr1 3
16 Divorce_Potential Atr1 4
17 Divorce_Potential Atr1 4
18 Divorce_Potential Atr1 4
19 Divorce_Potential Atr1 3
20 Divorce_Potential Atr1 4
21 Divorce_Potential Atr1 4
22 Divorce_Potential Atr1 4
23 Divorce_Potential Atr1 3
24 Divorce_Potential Atr1 3
25 Divorce_Potential Atr1 4
[ reached 'max' / getOption("max.print") -- omitted 4565 rows ]
Continue to Density Plot
den_all1 <- gat1 %>% ggplot(aes(value)) + geom_density(aes(fill = Class)) +
facet_wrap(~key, scales = "free_y", ncol = 4) + theme_tufte()
den_all1
Atr28-Atr54 : 'gat2'
gat2 <- d %>% select("Atr28":"Atr54", "Class") %>% gather(key = "key", value = "value",
-Class)
gat2
               Class   key value
1 Divorce_Potential Atr28 0
2 Divorce_Potential Atr28 0
3 Divorce_Potential Atr28 2
4 Divorce_Potential Atr28 1
5 Divorce_Potential Atr28 2
6 Divorce_Potential Atr28 0
7 Divorce_Potential Atr28 2
8 Divorce_Potential Atr28 2
9 Divorce_Potential Atr28 3
10 Divorce_Potential Atr28 1
11 Divorce_Potential Atr28 3
12 Divorce_Potential Atr28 3
13 Divorce_Potential Atr28 4
14 Divorce_Potential Atr28 4
15 Divorce_Potential Atr28 4
16 Divorce_Potential Atr28 2
17 Divorce_Potential Atr28 2
18 Divorce_Potential Atr28 3
19 Divorce_Potential Atr28 4
20 Divorce_Potential Atr28 3
21 Divorce_Potential Atr28 3
22 Divorce_Potential Atr28 3
23 Divorce_Potential Atr28 4
24 Divorce_Potential Atr28 4
25 Divorce_Potential Atr28 4
[ reached 'max' / getOption("max.print") -- omitted 4565 rows ]
Continue to Density Plot
den_all2 <- gat2 %>% ggplot(aes(value)) + geom_density(aes(fill = Class)) +
facet_wrap(~key, scales = "free_y", ncol = 4) + theme_tufte()
den_all2
Visualization - 9 Parameters
From the analysis above, we found the columns with a significant difference in means: Atr11, Atr19, Atr33, Atr35, Atr36, Atr38, Atr39, Atr40, Atr41.
The top 3 highest are:
Atr35 –> I can insult our discussions
Atr36 –> I can be humiliating when we argue
Atr40 –> We’re just starting a fight before I know what’s going on
Atr40: We’re just starting a fight before I know what’s going on
Atr40 Violin Plot
d %>% ggplot(aes(Class, Atr40, fill = Class)) + geom_violin(alpha = 0.3) + labs(title = "Atr40",
subtitle = "Starting a fight before know what’s going on") + theme_wsj() +
theme(plot.title = element_text(size = 25), plot.subtitle = element_text(size = 15,
family = "arial"), legend.title = element_text(size = 15), panel.grid.major = element_line(size = 0.7,
    linetype = "solid", colour = "white"))
Atr40 Density Plot
den_art40 <- d %>% ggplot(aes(Atr40, fill = Class)) + geom_density(alpha = 0.3) +
labs(title = "Atr40", subtitle = "Starting a fight before know what’s going on") +
theme_wsj() + theme(plot.title = element_text(size = 25), plot.subtitle = element_text(size = 15,
family = "arial"), legend.title = element_text(size = 15), panel.grid.major = element_line(size = 0.7,
linetype = "solid", colour = "white"))
den_art40
Interpretation: People classified as 'Not_Divorce' tend to strongly disagree with the statement (starting a fight before knowing what's going on), while people classified as 'Divorce_Potential' mostly strongly agree with it. Atr40 belongs to the negative conflict behaviors group.
Atr36: I can be humiliating when we argue
Atr36 Violin Plot
d %>% ggplot(aes(Class, Atr36, fill = Class)) + geom_violin(alpha = 0.3) + labs(title = "Atr36",
subtitle = "humiliating when we argue") + theme_wsj() + theme(plot.title = element_text(size = 25),
plot.subtitle = element_text(size = 15, family = "arial"), legend.title = element_text(size = 15),
    panel.grid.major = element_line(size = 0.7, linetype = "solid", colour = "white"))
Atr36 Density Plot
den_art36 <- d %>% ggplot(aes(Atr36, fill = Class)) + geom_density(alpha = 0.3) +
labs(title = "Atr36", subtitle = "humiliating when we argue") + theme_wsj() +
theme(plot.title = element_text(size = 25), plot.subtitle = element_text(size = 15,
family = "arial"), legend.title = element_text(size = 15), panel.grid.major = element_line(size = 0.7,
linetype = "solid", colour = "white"))
den_art36
Interpretation: This graph tells us that everyone in the 'Not_Divorce' class strongly disagrees with the statement; no one chose a middle answer. In contrast, people in the 'Divorce_Potential' class gave more varied answers. Atr36 belongs to the negative conflict behaviors group.
Atr35: I can insult our discussions
Atr35 Violin Plot
d %>% ggplot(aes(Class, Atr35, fill = Class)) + geom_violin(alpha = 0.3) + labs(title = "Atr35",
subtitle = "I can insult our discussions") + theme_wsj() + theme(plot.title = element_text(size = 25),
plot.subtitle = element_text(size = 15, family = "arial"), legend.title = element_text(size = 15),
    panel.grid.major = element_line(size = 0.7, linetype = "solid", colour = "white"))
Atr35 Density Plot
den_art35 <- d %>% ggplot(aes(Atr35, fill = Class)) + geom_density(alpha = 0.3) +
labs(title = "Atr35", subtitle = "I can insult our discussions") + theme_wsj() +
theme(plot.title = element_text(size = 25), plot.subtitle = element_text(size = 15,
family = "arial"), legend.title = element_text(size = 15), panel.grid.major = element_line(size = 0.7,
linetype = "solid", colour = "white"))
den_art35
Interpretation: Atr35 is a strong statement and belongs to the negative conflict behaviors group. The result tells us that people classified as 'Not_Divorce' mostly disagree with the statement (I can insult our discussions), whereas people classified as 'Divorce_Potential' give more varied answers, but most of them strongly agree with it.
Atr 39: Fights often occur suddenly
Atr39 Violin Plot
d %>% ggplot(aes(Class, Atr39, fill = Class)) + geom_violin(alpha = 0.3) + labs(title = "Atr39",
subtitle = "Fights often occur suddenly") + theme_wsj() + theme(plot.title = element_text(size = 25),
plot.subtitle = element_text(size = 15, family = "arial"), legend.title = element_text(size = 15),
    panel.grid.major = element_line(size = 0.7, linetype = "solid", colour = "white"))
Atr39 Density Plot
den_art39 <- d %>% ggplot(aes(Atr39, fill = Class)) + geom_density(alpha = 0.3) +
labs(title = "Atr39", subtitle = "Fights often occur suddenly") + theme_wsj() +
theme(plot.title = element_text(size = 25), plot.subtitle = element_text(size = 15,
family = "arial"), legend.title = element_text(size = 15), panel.grid.major = element_line(size = 0.7,
linetype = "solid", colour = "white"))
den_art39
Interpretation: Atr39 belongs to the negative conflict behaviors group. The graph shows that people in the 'Divorce_Potential' class mostly strongly agree with the statement (fights often occur suddenly), while people in the 'Not_Divorce' class mostly strongly disagree with it.
Atr41: When I talk to my spouse about something, my calm suddenly breaks
Atr41 Violin Plot
d %>% ggplot(aes(Class, Atr41, fill = Class)) + geom_violin(alpha = 0.3) + labs(title = "Atr41",
subtitle = "My calm suddenly breaks when talking") + theme_wsj() + theme(plot.title = element_text(size = 25),
plot.subtitle = element_text(size = 15, family = "arial"), legend.title = element_text(size = 15),
    panel.grid.major = element_line(size = 0.7, linetype = "solid", colour = "white"))
Atr41 Density Plot
den_art41 <- d %>% ggplot(aes(Atr41, fill = Class)) + geom_density(alpha = 0.3) +
labs(title = "Atr41", subtitle = "My calm suddenly breaks when talking") +
theme_wsj() + theme(plot.title = element_text(size = 25), plot.subtitle = element_text(size = 15,
family = "arial"), legend.title = element_text(size = 15), panel.grid.major = element_line(size = 0.7,
linetype = "solid", colour = "white"))
den_art41
Interpretation: From this graph we can see that people classified as 'Divorce_Potential' mostly strongly agree with the statement, while people classified as 'Not_Divorce' mostly strongly disagree with it. The graph also shows that the two classes have very different tendencies in how they behave toward their spouses. Atr41 belongs to the negative conflict behaviors group.
Atr33: I can use negative statements about my spouse’s personality during our discussions
Atr33 Violin Plot
d %>% ggplot(aes(Class, Atr33, fill = Class)) + geom_violin(alpha = 0.3) + labs(title = "Atr33",
subtitle = "Use negative statements during discussions") + theme_wsj() +
theme(plot.title = element_text(size = 25), plot.subtitle = element_text(size = 15,
family = "arial"), legend.title = element_text(size = 15), panel.grid.major = element_line(size = 0.7,
    linetype = "solid", colour = "white"))
Atr33 Density Plot
den_art33 <- d %>% ggplot(aes(Atr33, fill = Class)) + geom_density(alpha = 0.3) +
labs(title = "Atr33", subtitle = "Use negative statements during discussions") +
theme_wsj() + theme(plot.title = element_text(size = 25), plot.subtitle = element_text(size = 15,
family = "arial"), legend.title = element_text(size = 15), panel.grid.major = element_line(size = 0.7,
linetype = "solid", colour = "white"))
den_art33
Interpretation: This graph shows a big difference between the 'Divorce_Potential' and 'Not_Divorce' classes; their answers differ so clearly that the classes are easy to separate. People in the 'Not_Divorce' class strongly disagree with the statement, while most people in the 'Divorce_Potential' class strongly agree with it (using negative statements during discussions). This statement belongs to the negative conflict behaviors group.
Atr38: I hate my spouse’s way of bringing it up
Atr38 Violin Plot
d %>% ggplot(aes(Class, Atr38, fill = Class)) + geom_violin(alpha = 0.3) + labs(title = "Atr38",
subtitle = "I hate my spouse’s way of bringing it up") + theme_wsj() + theme(plot.title = element_text(size = 25),
plot.subtitle = element_text(size = 15, family = "arial"), legend.title = element_text(size = 15),
    panel.grid.major = element_line(size = 0.7, linetype = "solid", colour = "white"))
Atr38 Density Plot
den_art38 <- d %>% ggplot(aes(Atr38, fill = Class)) + geom_density(alpha = 0.3) +
labs(title = "Atr38", subtitle = "I hate my spouse’s way of bringing it up") +
theme_wsj() + theme(plot.title = element_text(size = 25), plot.subtitle = element_text(size = 15,
family = "arial"), legend.title = element_text(size = 15), panel.grid.major = element_line(size = 0.7,
linetype = "solid", colour = "white"))
den_art38
Interpretation: From this graph we can conclude that most people in the 'Not_Divorce' class strongly disagree with the statement, while most people in the 'Divorce_Potential' class strongly agree with it (I hate my spouse's way of bringing it up). Atr38 belongs to the negative conflict behaviors group.
Atr19: My spouse and I have similar ideas about how roles should be in marriage
Atr19 Violin Plot
d %>% ggplot(aes(Class, Atr19, fill = Class)) + geom_violin(alpha = 0.3) + labs(title = "Atr19",
subtitle = "Similar ideas about how roles should be in marriage") + theme_wsj() +
theme(plot.title = element_text(size = 25), plot.subtitle = element_text(size = 15,
family = "arial"), legend.title = element_text(size = 15), panel.grid.major = element_line(size = 0.7,
    linetype = "solid", colour = "white"))
Atr19 Density Plot
den_art19 <- d %>% ggplot(aes(Atr19, fill = Class)) + geom_density(alpha = 0.3) +
labs(title = "Atr19", subtitle = "Similar ideas about how roles should be in marriage") +
theme_wsj() + theme(plot.title = element_text(size = 25), plot.subtitle = element_text(size = 15,
family = "arial"), legend.title = element_text(size = 15), panel.grid.major = element_line(size = 0.7,
linetype = "solid", colour = "white"))
den_art19
Interpretation: Atr19 belongs to the common meaning and failed repair attempts group. The result shows that people in the 'Not_Divorce' class mostly do not agree with the statement about having similar ideas on how roles should be in marriage, while people in the 'Divorce_Potential' class mostly tend to agree with it. Surprising!
Atr11: I think that one day in the future, when I look back, I see that my wife and I are in harmony with each other
Atr11 Violin Plot
d %>% ggplot(aes(Class, Atr11, fill = Class)) + geom_violin(alpha = 0.3) + labs(title = "Atr11",
    subtitle = "Some day my spouse and I will be in harmony with each other") +
theme_wsj() + theme(plot.title = element_text(size = 25), plot.subtitle = element_text(size = 15,
family = "arial"), legend.title = element_text(size = 15), panel.grid.major = element_line(size = 0.7,
    linetype = "solid", colour = "white"))
Atr11 Density Plot
den_art11 <- d %>% ggplot(aes(Atr11, fill = Class)) + geom_density(alpha = 0.3) +
    labs(title = "Atr11", subtitle = "Some day my spouse and I will be in harmony with each other") +
theme_wsj() + theme(plot.title = element_text(size = 25), plot.subtitle = element_text(size = 15,
family = "arial"), legend.title = element_text(size = 15), panel.grid.major = element_line(size = 0.7,
linetype = "solid", colour = "white"))
den_art11
Interpretation: Atr11 belongs to the love map group. Surprisingly, the dataset shows that people who strongly agree with the statement "Some day my spouse and I will be in harmony with each other" fall into the 'Divorce_Potential' class, while people in the 'Not_Divorce' class strongly disagree with it.
Cross Validation Dataset
Read Data again
Atr1 Atr2 Atr3 Atr4 Atr5 Atr6 Atr7 Atr8 Atr9 Atr10 Atr11 Atr12 Atr13
1 2 2 4 1 0 0 0 0 0 0 1 0 1
Atr14 Atr15 Atr16 Atr17 Atr18 Atr19 Atr20 Atr21 Atr22 Atr23 Atr24 Atr25
1 1 0 1 0 0 0 1 0 0 0 0 0
Atr26 Atr27 Atr28 Atr29 Atr30 Atr31 Atr32 Atr33 Atr34 Atr35 Atr36 Atr37
1 0 0 0 0 1 1 2 1 2 0 1 2
Atr38 Atr39 Atr40 Atr41 Atr42 Atr43 Atr44 Atr45 Atr46 Atr47 Atr48 Atr49
1 1 3 3 2 1 1 2 3 2 1 3 3
Atr50 Atr51 Atr52 Atr53 Atr54 Class
1 3 2 3 2 1 1
[ reached 'max' / getOption("max.print") -- omitted 169 rows ]
Target Variable –> Class, 0-Not Divorce ; 1-Divorce Potential
Subsetting using only the 9 parameters
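The subsetting chunk is not echoed; since the split below uses an object named 'div9', it was presumably something like the following (the name 'div' for the re-read raw data frame is an assumption):
# Keep only the 9 most influential questions plus the raw 0/1 target
div9 <- div %>% select(Atr11, Atr19, Atr33, Atr35, Atr36, Atr38, Atr39, Atr40,
    Atr41, Class)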
Using the initial_split() function to split the train and test datasets
set.seed(55)
split_nn9 <- initial_split(data = div9, prop = 0.75, strata = "Class")
train_nn9 <- training(split_nn9)
test_nn9_fin <- testing(split_nn9)
set.seed(55)
split_nn9b <- initial_split(data = train_nn9, prop = 0.8, strata = "Class")
train_nn9b <- training(split_nn9b)
test_nn9b <- testing(split_nn9b)
For the next steps, we will use 'train_nn9b' as the training dataset, 'test_nn9b' as the test dataset, and 'test_nn9_fin' as the final (unseen) validation dataset.
Machine Learning
Neural Network - Using 9 Parameters only
Change into Matrix type
Separate the x variables (predictors) and the y variable (the target) in order to create the one-hot encoding
Predictor
Target
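None of these preparation chunks are echoed; a minimal sketch, assuming the predictor matrices are named 'train_nn9b_x' / 'test_nn9b_x' (the names used by array_reshape() below) and the raw 0/1 targets 'train_nn9b_y' / 'test_nn9b_y' (assumed names), could be:
# Predictors: drop Class and convert the 9 questionnaire columns to a numeric matrix
train_nn9b_x <- data.matrix(train_nn9b %>% select(-Class))
test_nn9b_x <- data.matrix(test_nn9b %>% select(-Class))
# Target: the raw 0/1 Class column (0 = Not_Divorce, 1 = Divorce_Potential)
train_nn9b_y <- train_nn9b$Class
test_nn9b_y <- test_nn9b$Class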
Change into Array
Array
train_nn9b_x_keras <- array_reshape(x = train_nn9b_x, dim = c(nrow(train_nn9b_x),
ncol(train_nn9b_x)))
test_nn9b_x_keras <- array_reshape(x = test_nn9b_x, dim = c(nrow(test_nn9b_x),
ncol(test_nn9b_x)))
head(train_nn9b_x_keras)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,] 1 0 1 0 1 1 3 3 2
[2,] 3 3 1 1 1 1 3 3 3
[3,] 4 3 2 1 1 3 4 4 2
[4,] 0 2 1 0 0 0 2 1 0
[5,] 0 0 1 1 1 2 0 2 2
[6,] 2 3 2 1 2 2 2 3 3
One-Hot Encoding
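The encoding chunk is not echoed; with keras it is presumably a to_categorical() call on the 0/1 targets (the '_y' object names are assumptions carried over from the sketch above):
# One-hot encode the 0/1 targets into two columns
train_nn9b_y_keras <- to_categorical(train_nn9b_y, num_classes = 2)
test_nn9b_y_keras <- to_categorical(test_nn9b_y, num_classes = 2)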
Creating Neural Network Architecture
Neural Network sequencing
# Initialize the sequential model (this step was not echoed in the original output)
mnn9b <- keras_model_sequential()

mnn9b %>% layer_dense(input_shape = ncol(train_nn9b_x_keras), units = 64, activation = "relu",
    name = "hidden_A") %>% layer_dense(units = 32, activation = "relu", name = "hidden_B",
    kernel_initializer = initializer_random_uniform(seed = 55)) %>% layer_dense(units = 2,
    activation = "sigmoid")

summary(mnn9b)
Model: "sequential"
___________________________________________________________________________
Layer (type) Output Shape Param #
===========================================================================
hidden_A (Dense) (None, 64) 640
___________________________________________________________________________
hidden_B (Dense) (None, 32) 2080
___________________________________________________________________________
dense (Dense) (None, 2) 66
===========================================================================
Total params: 2,786
Trainable params: 2,786
Non-trainable params: 0
___________________________________________________________________________
Compiling
Gather all Neural Network architecture, set learning rate = 0.001
mnn9b %>% compile(loss = "binary_crossentropy", optimizer = optimizer_sgd(lr = 0.001),
    metric = "accuracy")
*Creating the Neural Network model using the 'fit()' function
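The fit() call is not echoed; given the objects above and the 'history9b' object plotted next, it was presumably along these lines (the epoch count, batch size, validation data, and the one-hot target names are assumptions):
history9b <- mnn9b %>% fit(
    x = train_nn9b_x_keras,
    y = train_nn9b_y_keras,  # one-hot encoded training target (assumed name)
    epochs = 100,            # assumed
    batch_size = 16,         # assumed
    validation_data = list(test_nn9b_x_keras, test_nn9b_y_keras))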
Plotting Model
plot(history9b) + geom_line(col = "#ef2964", lwd = 1) + geom_point(cex = 1,
    col = 14) + theme_bigstatsr() + labs(y = NULL)
Prediction using Probability Number
prob_test9b <- predict_proba(object = mnn9b, x = test_nn9b_x_keras)
pred_test9b <- if_else(prob_test9b[, 1] >= 0.45, "Not_Divorce", "Divorce_Potential")
*Rename the target variable values to 'Divorce_Potential' and 'Not_Divorce'
Model Validation (confusionMatrix)
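The call that produces the matrix below is not echoed; with caret it is presumably something like this (where 'test_label9b' is a hypothetical name for the relabeled test target):
confusionMatrix(data = factor(pred_test9b, levels = c("Divorce_Potential", "Not_Divorce")),
    reference = factor(test_label9b, levels = c("Divorce_Potential", "Not_Divorce")),
    positive = "Divorce_Potential")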
Confusion Matrix and Statistics
Reference
Prediction Divorce_Potential Not_Divorce
Divorce_Potential 12 2
Not_Divorce 0 10
Accuracy : 0.9167
95% CI : (0.73, 0.9897)
No Information Rate : 0.5
P-Value [Acc > NIR] : 0.00001794
Kappa : 0.8333
Mcnemar's Test P-Value : 0.4795
Sensitivity : 1.0000
Specificity : 0.8333
Pos Pred Value : 0.8571
Neg Pred Value : 1.0000
Prevalence : 0.5000
Detection Rate : 0.5000
Detection Prevalence : 0.5833
Balanced Accuracy : 0.9167
'Positive' Class : Divorce_Potential
Since the purpose of this case is to predict potential divorce before it happens, the model needs to catch as many positive cases as possible, so we look at Recall/Sensitivity in addition to Accuracy. The confusion matrix gives us good results for both recall/sensitivity and accuracy.
Checking model performance against the train dataset
Prediction using Probability Number
prob_9b <- predict_proba(object = mnn9b, x = train_nn9b_x_keras)
pred_9b <- if_else(prob_9b[, 1] >= 0.45, "Not_Divorce", "Divorce_Potential")
Model Validation (confusionMatrix)
Confusion Matrix and Statistics
Reference
Prediction Divorce_Potential Not_Divorce
Divorce_Potential 51 9
Not_Divorce 0 44
Accuracy : 0.9135
95% CI : (0.8421, 0.9597)
No Information Rate : 0.5096
P-Value [Acc > NIR] : < 0.00000000000000022
Kappa : 0.8274
Mcnemar's Test P-Value : 0.007661
Sensitivity : 1.0000
Specificity : 0.8302
Pos Pred Value : 0.8500
Neg Pred Value : 1.0000
Prevalence : 0.4904
Detection Rate : 0.4904
Detection Prevalence : 0.5769
Balanced Accuracy : 0.9151
'Positive' Class : Divorce_Potential
Comparing these train-dataset results with the test-dataset results above, the model performs consistently on both, so the resulting model is good enough.
Prediction Using the 'NN_92' Model
Load Model
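The loading step is not echoed; assuming the trained network was saved to an HDF5 file (the file name here is hypothetical), it would be something like:
# Load the previously saved keras model (hypothetical file name)
model_NN_92 <- load_model_hdf5("model_NN_92.hdf5")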
Change the 'test_nn9_fin' dataset into keras format. First, change it into a matrix.
Separate the x variables (predictors) and the y variable (the target) to create the one-hot encoding
Predictor
Target
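As before, the preparation chunk is not echoed; a minimal sketch under the same naming assumptions ('test_nn9b_fin_x' matches the array_reshape() call below; 'test_nn9b_fin_y' is an assumed name for the raw target):
# Predictors of the validation split as a numeric matrix; target as the raw 0/1 vector
test_nn9b_fin_x <- data.matrix(test_nn9_fin %>% select(-Class))
test_nn9b_fin_y <- test_nn9_fin$Class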
Change into Array
test_nn9b_fin_x_keras <- array_reshape(x = test_nn9b_fin_x, dim = c(nrow(test_nn9b_fin_x),
ncol(test_nn9b_fin_x)))
head(test_nn9b_fin_x_keras)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,] 3 3 4 4 4 4 4 4 4
[2,] 3 3 4 4 4 4 4 4 4
[3,] 4 4 4 4 3 3 4 3 4
[4,] 4 4 4 4 4 4 4 4 4
[5,] 3 3 4 4 4 4 4 4 4
[6,] 3 3 4 4 4 4 4 4 4
Prediction using Probability Number
prob_test92_fin <- predict_proba(object = model_NN_92, x = test_nn9b_fin_x_keras)
pred_test92_fin <- if_else(prob_test92_fin[, 1] >= 0.45, "Not_Divorce", "Divorce_Potential")

func <- function(data) {
sapply(as.character(data), switch, `0` = "Not_Divorce", `1` = "Divorce_Potential")
}
Model Validation (confusionMatrix) - Using the Unseen Dataset
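The 'func' helper maps the raw 0/1 codes to class labels; it is presumably applied to the validation target and the result compared with the predictions, roughly as follows (the raw target name 'test_nn9b_fin_y' is an assumption):
# Relabel the raw 0/1 validation target and evaluate the predictions
label_fin <- func(test_nn9b_fin_y)
confusionMatrix(data = factor(pred_test92_fin, levels = c("Divorce_Potential", "Not_Divorce")),
    reference = factor(label_fin, levels = c("Divorce_Potential", "Not_Divorce")),
    positive = "Divorce_Potential")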
Confusion Matrix and Statistics
Reference
Prediction Divorce_Potential Not_Divorce
Divorce_Potential 21 2
Not_Divorce 0 19
Accuracy : 0.9524
95% CI : (0.8384, 0.9942)
No Information Rate : 0.5
P-Value [Acc > NIR] : 0.0000000002055
Kappa : 0.9048
Mcnemar's Test P-Value : 0.4795
Sensitivity : 1.0000
Specificity : 0.9048
Pos Pred Value : 0.9130
Neg Pred Value : 1.0000
Prevalence : 0.5000
Detection Rate : 0.5000
Detection Prevalence : 0.5476
Balanced Accuracy : 0.9524
'Positive' Class : Divorce_Potential
Recall/Sensitivity delivers a good result, and so does Accuracy.
Conclusion
After trying several algorithms (Logistic Regression, KNN, Naive Bayes, Decision Tree, Random Forest), I conclude that the Neural Network explains this case better than the others while using only the 9 parameters that have the biggest impact among all the questions in the questionnaire. It also delivers decent accuracy and recall (sensitivity), which matters because the goal of this questionnaire is to predict as many potential divorce cases as possible before they happen.
The 9 parameters are:
Atr11 –> I think that some day, my spouse and I will be in harmony with each other (3.01)
Atr19 –> My spouse and I have similar ideas about how roles should be in marriage (3.03)
Atr33 –> I can use negative statements about my spouse's personality during our discussions (3.06)
Atr35 –> I can insult our discussions (3.16)
Atr36 –> I can be humiliating when we argue (3.17)
Atr38 –> I hate my spouse's way of bringing it up (3.05)
Atr39 –> Fights often occur suddenly (3.07)
Atr40 –> We're just starting a fight before I know what's going on (3.3)
Atr41 –> When I talk to my spouse about something, my calm suddenly breaks (3.07)
The top 3 are: Atr35, Atr36, Atr40
A simple practical use in a company would be to administer this questionnaire to employees to reduce the chance of personal issues that could lead to less focus and lower productivity at work. In other words, the questionnaire could help maintain employee productivity.