Machine Learning to Predict a Successful Marriage by Determining Some Important Variables
library(dplyr)        # data wrangling
library(caret)        # confusionMatrix and model training helpers
library(randomForest) # random forest
library(class)        # k-nearest neighbours
library(keras)        # neural network API
library(tensorflow)   # neural network backend
library(GGally)
library(e1071)        # naive bayes, SVM
library(partykit)     # decision tree
library(scales)       # label formatting (comma)
library(lmtest)
library(MLmetrics)
library(car)
library(ggplot2)      # plotting
library(tidyverse)
library(gridExtra)
library(treemapify)
library(plotly)       # interactive plots (ggplotly)
library(rsample)      # initial_split
library(ggthemes)     # theme_wsj, theme_tufte
library(fmsb)
library(hrbrthemes)   # theme_ipsum
library(egg)          # theme_article
library(bigstatsr)    # theme_bigstatsr
library(formatR)      # code tidying
options(scipen = 9999)
Intro
Background
When we talk about the relationship between a man and a woman, one of its forms is marriage. Can you spot a good relationship? Of course, nobody knows what really goes on between any couple. Good relationships don't happen overnight; they take commitment, compromise, forgiveness and, most of all, effort. In the current era, maintaining a good relationship has become more challenging. Decades of scientific research into love, intimacy and relationships have taught us that a number of behaviors can predict whether a couple is on solid ground or headed for troubled waters.
There are many areas of closeness that can enhance a marriage or relationship, help it remain strong, and help it get back on track when it has become distant and difficult. Sometimes couples feel that things are not right between them and wonder what is wrong and what they can do about it.
In Indonesia, the divorce rate has been increasing almost every year. Data from Badan Pusat Statistik show that in 2012 the number of divorces exceeded 340,000, and it continued to rise to roughly 365,000 in 2016. Unfortunately, I could not obtain data for 2017-2019. The marriage rate, on the other hand, tells a rather different story: it increased from 2007 to 2011 but has declined from 2011 through 2016.
The data and graphs for the divorce and marriage rates are shown below:
- The Rate of Divorce Data taken from BPS RI https://www.bps.go.id/linkTableDinamis/view/id/893
Type Tahun Jumlah
1 Cerai 2007 175713
2 Cerai 2008 193189
3 Cerai 2009 223371
4 Cerai 2010 285184
5 Cerai 2011 276791
6 Cerai 2012 346480
7 Cerai 2013 324247
8 Cerai 2014 344237
9 Cerai 2015 347256
10 Cerai 2016 365633
cerai %>% ggplot(aes(Jumlah, Tahun)) + geom_segment(aes(x = 0, xend = Jumlah,
y = Tahun, yend = Tahun)) + geom_point(size = 4, color = "red", fill = "orange",
alpha = 0.7, shape = 21, stroke = 2) + geom_text(aes(label = comma(Jumlah)),
hjust = -0.3, size = 3) + scale_x_continuous(limits = c(0, 450000)) + theme_wsj() +
labs(title = "Divorce Rate In Indonesia") + theme(plot.title = element_text(size = 15),
legend.title = element_text(size = 15), panel.grid.major = element_line(size = 0.7,
linetype = "solid", colour = "white"), panel.grid.major.x = element_line(size = 2,
    linetype = "solid", colour = "white"))
Interpretation: The divorce rate in Indonesia is increasing year by year. According to the latest update from BPS Indonesia, 365,633 divorces were recorded in 2016.
- The Rate of Marriage Data taken from BPS RI https://www.bps.go.id/linkTableDinamis/view/id/893
Type Tahun Jumlah
1 Nikah 2007 1944569
2 Nikah 2008 2195037
3 Nikah 2009 2162115
4 Nikah 2010 2207224
5 Nikah 2011 2319821
6 Nikah 2012 2289648
7 Nikah 2013 2210046
8 Nikah 2014 2110776
9 Nikah 2015 1958394
10 Nikah 2016 1837185
nikah %>% ggplot(aes(Jumlah, Tahun)) + geom_segment(aes(x = 0, xend = Jumlah,
y = Tahun, yend = Tahun)) + geom_point(size = 4, color = "red", fill = "orange",
alpha = 0.7, shape = 21, stroke = 2) + geom_text(aes(label = comma(Jumlah)),
hjust = -0.3, size = 3) + scale_x_continuous(limits = c(0, 2500000)) + theme_wsj() +
labs(title = "Marriage Rate In Indonesia") + theme(plot.title = element_text(size = 15),
legend.title = element_text(size = 15), panel.grid.major = element_line(size = 0.7,
linetype = "solid", colour = "white"), panel.grid.major.x = element_line(size = 2,
    linetype = "solid", colour = "white"))
Problem to be Solved
The goal is to predict potential divorce before it happens by applying the Divorce Predictors Scale. Hopefully this makes it possible to detect a divorce risk early and helps a couple fix their relationship, taking corrective action correctly and precisely based on the most influential predictors from the scale. The Divorce Predictors Scale has already been applied successfully in Turkey, in the Karadeniz (Black Sea) region.
In Indonesia itself, the number of divorces increases year by year, so hopefully these predictors will help couples whose relationship is in crisis to detect their most critical problems and work on them, and help couples who are not in crisis to maintain their relationship.
Contents
This dataset is taken from an external source: https://archive.ics.uci.edu/ml/datasets/Divorce+Predictors+data+set.
The Divorce Predictors dataset consists of 54 questionnaire items collected using the Divorce Predictors Scale (DPS), which is based on Gottman couples therapy. The questionnaire was administered in Turkey. Of the participants, 84 (49%) were divorced and 86 (51%) were married couples.
Additional information:
a) The ages of the participants ranged from 20 to 63 (X̄ = 36.04, SD = 9.34)
b) Of the participants, 74 (43.5%) were married for love and 96 (56.5%) were married in an arranged marriage
c) 127 (74.7%) of the participants had children, 43 (25.3%) had no children
d) 18 (10.58%) of the participants were primary school graduates, 15 (8.8%) were secondary school graduates, 33 (19.41%) were high school graduates, 88 (51.76%) were college graduates, and 15 (8.8%) had a master's degree
e) Monthly income: 34 (20%) individuals earned under 2000 TL, 54 (31.76%) between 2001-3000 TL, 28 (16.47%) between 3001-4000 TL, and 54 (31.76%) had a monthly income over 4000 TL
The list of Questionnaire:
1.When one of us apologizes when our discussions go bad, the issue does not extend.
2.I know we can ignore our differences, even if things get hard sometimes.
3.When we need to, we can take our discussions from the beginning and correct it.
4.When I argue with my spouse, it will eventually work for me to contact him.
5.The time I spent with my spouse is special for us.
6.We don’t have time at home as partners.
7.We are like two strangers who share the same environment at home rather than family.
8.I enjoy our holidays with my spouse.
9.I enjoy traveling with my spouse.
10.My spouse and most of our goals are common.
11.I think that some day, my spouse and I will be in harmony with each other.
12.My spouse and I have similar values regarding personal freedom.
13.My spouse and I have similar entertainment.
14.Most of our goals in regards to people (children, friends, etc.) are the same.
15.My dreams of living are similar and harmonious with those of my spouse.
16.I’m compatible with my spouse about what love should be.
17.I share the same views with my spouse about being happy.
18.My spouse and I have similar ideas about how marriage should be.
19.My spouse and I have similar ideas about how roles should be in marriage.
20.My spouse and I have similar values regarding trust.
21.I know exactly what my spouse likes.
22.I know how my spouse wants to be taken care of when she’s sick.
23.I know my spouse’s favorite food.
24.I can tell you what kind of stress my spouse is having in life.
25.I have knowledge of my spouse’s inner world.
26.I know my spouse’s basic concerns.
27.I know what my spouse’s current sources of stress are.
28.I know my spouse’s hopes and wishes.
29.I know my spouse very well.
30.I know my spouse’s friends and their social relationships.
31.I feel aggressive when I argue with my spouse.
32.When discussing with my spouse, I usually use expressions such as X, Y, Z.
33.I can use negative statements about my spouse’s personality during our discussions.
34.I can use offensive expressions during our discussions.
35.I can insult our discussions.
36.I can be humiliating when we argue.
37.My argument with my spouse is not calm.
38.I hate my spouse’s way of bringing it up.
39.Fights often occur suddenly.
40.We’re just starting a fight before I know what’s going on.
41.When I talk to my spouse about something, my calm suddenly breaks.
42.When I argue with my spouse, it only snaps in and I don’t say a word.
43.I’m mostly willing to calm the environment a little bit.
44.Sometimes I think it’s good for me to leave home for a while.
45.I’d rather stay silent than argue with my spouse.
46.Even if I’m right in the argument, I’m willing not to upset the other side.
47.When I argue with my spouse, I remain silent because I am afraid of not being able to control my anger.
48.I feel right in our discussions.
49.I have nothing to do with what I’ve been accused of.
50.I’m not actually the one who’s guilty of what I’m accused of.
51.I’m not the one who’s wrong about problems at home.
52.I wouldn’t hesitate to tell her about my spouse’s inadequacy.
53.I remind my spouse of her inadequacies during our discussion.
54.I’m not afraid to tell her about my spouse’s incompetence.
Target Variable –> Class, 0-Not Divorce ; 1-Divorce Potential
Exploratory Data
Read Data
Atr1 Atr2 Atr3 Atr4 Atr5 Atr6 Atr7 Atr8 Atr9 Atr10 Atr11 Atr12 Atr13
1 2 2 4 1 0 0 0 0 0 0 1 0 1
Atr14 Atr15 Atr16 Atr17 Atr18 Atr19 Atr20 Atr21 Atr22 Atr23 Atr24 Atr25
1 1 0 1 0 0 0 1 0 0 0 0 0
Atr26 Atr27 Atr28 Atr29 Atr30 Atr31 Atr32 Atr33 Atr34 Atr35 Atr36 Atr37
1 0 0 0 0 1 1 2 1 2 0 1 2
Atr38 Atr39 Atr40 Atr41 Atr42 Atr43 Atr44 Atr45 Atr46 Atr47 Atr48 Atr49
1 1 3 3 2 1 1 2 3 2 1 3 3
Atr50 Atr51 Atr52 Atr53 Atr54 Class
1 3 2 3 2 1 1
[ reached 'max' / getOption("max.print") -- omitted 169 rows ]
This data contains 170 rows and 55 columns. The last column (Class) is our target variable, and the other columns are the questionnaire questions.
Exploratory Data Analysis
*Change the 'Class' column into a factor, then relabel 1 as 'Divorce_Potential' and 0 as 'Not_Divorce'
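The conversion itself is not echoed above; a minimal sketch of this step, assuming the raw data was read into an object called 'divorce' and the relabeled copy is stored as 'd' (the name used by the plotting code below), could look like:
# Relabel the 0/1 target as a factor (object names 'divorce' and 'd' are assumptions)
d <- divorce %>%
    mutate(Class = factor(Class, levels = c(0, 1),
        labels = c("Not_Divorce", "Divorce_Potential")))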
*Let's check again
Atr1 Atr2 Atr3 Atr4 Atr5 Atr6 Atr7 Atr8 Atr9 Atr10 Atr11 Atr12 Atr13
1 2 2 4 1 0 0 0 0 0 0 1 0 1
Atr14 Atr15 Atr16 Atr17 Atr18 Atr19 Atr20 Atr21 Atr22 Atr23 Atr24 Atr25
1 1 0 1 0 0 0 1 0 0 0 0 0
Atr26 Atr27 Atr28 Atr29 Atr30 Atr31 Atr32 Atr33 Atr34 Atr35 Atr36 Atr37
1 0 0 0 0 1 1 2 1 2 0 1 2
Atr38 Atr39 Atr40 Atr41 Atr42 Atr43 Atr44 Atr45 Atr46 Atr47 Atr48 Atr49
1 1 3 3 2 1 1 2 3 2 1 3 3
Atr50 Atr51 Atr52 Atr53 Atr54 Class
1 3 2 3 2 1 Divorce_Potential
[ reached 'max' / getOption("max.print") -- omitted 5 rows ]
*Check the proportion of the target variable
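The call that produces the table below is not shown; with the relabeled data frame 'd' it is presumably something along these lines:
# Class proportions in percent
prop.table(table(d$Class)) * 100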
Divorce_Potential Not_Divorce
49.41176 50.58824
The Not_Divorce class (0) is slightly larger than the Divorce_Potential class (1), so the two classes are almost perfectly balanced.
Let's check which columns are potentially significant by comparing their means per class, to find the columns that have the most influence in this case.
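The grouped means below were presumably produced by a chunk along these lines (the result is stored in 'diff', the object transposed in the next chunk):
# Mean answer per question, separately for each class
diff <- d %>%
    group_by(Class) %>%
    summarise_all(mean)
diff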
# A tibble: 2 x 55
Class Atr1 Atr2 Atr3 Atr4 Atr5 Atr6 Atr7 Atr8 Atr9 Atr10 Atr11
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Divo~ 3.19 2.87 2.92 2.73 3.01 1.13 0.988 2.81 2.89 2.77 3.21
2 Not_~ 0.395 0.465 0.640 0.267 0.105 0.372 0.0116 0.128 0.0581 0.407 0.198
# ... with 43 more variables: Atr12 <dbl>, Atr13 <dbl>, Atr14 <dbl>,
# Atr15 <dbl>, Atr16 <dbl>, Atr17 <dbl>, Atr18 <dbl>, Atr19 <dbl>,
# Atr20 <dbl>, Atr21 <dbl>, Atr22 <dbl>, Atr23 <dbl>, Atr24 <dbl>,
# Atr25 <dbl>, Atr26 <dbl>, Atr27 <dbl>, Atr28 <dbl>, Atr29 <dbl>,
# Atr30 <dbl>, Atr31 <dbl>, Atr32 <dbl>, Atr33 <dbl>, Atr34 <dbl>,
# Atr35 <dbl>, Atr36 <dbl>, Atr37 <dbl>, Atr38 <dbl>, Atr39 <dbl>,
# Atr40 <dbl>, Atr41 <dbl>, Atr42 <dbl>, Atr43 <dbl>, Atr44 <dbl>,
# Atr45 <dbl>, Atr46 <dbl>, Atr47 <dbl>, Atr48 <dbl>, Atr49 <dbl>,
# Atr50 <dbl>, Atr51 <dbl>, Atr52 <dbl>, Atr53 <dbl>, Atr54 <dbl>
Transpose the result so that the classes become columns, and create some additional columns:
diff2 <- data.frame(t(diff))
diff2 <- diff2 %>% mutate(X1 = as.numeric(as.character(X1))) %>% mutate(X2 = as.numeric(as.character(X2))) %>%
na.omit %>% mutate(Difference = X1 - X2) %>% mutate(Questions = c(1:54))
diff2
          X1         X2 Difference Questions
1 3.1904762 0.39534880 2.7951274 1
2 2.8690476 0.46511630 2.4039313 2
3 2.9166667 0.63953490 2.2771318 3
4 2.7261905 0.26744190 2.4587486 4
5 3.0119048 0.10465120 2.9072536 5
6 1.1309520 0.37209300 0.7588590 6
7 0.9880952 0.01162791 0.9764673 7
8 2.8095240 0.12790700 2.6816170 8
9 2.8928571 0.05813953 2.8347176 9
10 2.7738095 0.40697670 2.3668328 10
11 3.2142857 0.19767440 3.0166113 11
12 2.9404762 0.39534880 2.5451274 12
13 3.0952381 0.60465120 2.4905869 13
14 2.8809524 0.29069770 2.5902547 14
15 2.9404762 0.23255810 2.7079181 15
16 2.8214286 0.16279070 2.6586379 16
17 3.1666667 0.17441860 2.9922481 17
18 2.9761905 0.09302326 2.8831672 18
[ reached 'max' / getOption("max.print") -- omitted 36 rows ]
Create a graph showing the differences:
IV <- diff2 %>% ggplot(aes(Questions, Difference)) + geom_line(col = "#ef2964") +
geom_point(aes(text = paste("Quest#", Questions, "<br>", "Var:", round(Difference,
2))), pch = 18, cex = 1) + geom_hline(aes(yintercept = 3), color = "green",
linetype = "dashed", size = 1) + scale_x_continuous(limits = c(1, 54), breaks = seq(1,
54, 2)) + labs(title = "Important Questions", x = "Questions", y = NULL) +
theme(plot.title = element_text(size = 30, family = "arial"), legend.title = element_text(size = 50,
family = "arial")) + theme_article()
ggplotly(IV, tooltip = "text")
From here, we see that the columns with a high difference in means are:
Atr11 –> I think that some day, my spouse and I will be in harmony with each other (3.01)
Atr19 –> My spouse and I have similar ideas about how roles should be in marriage (3.03)
Atr33 –> I can use negative statements about my spouse's personality during our discussions (3.06)
Atr35 –> I can insult our discussions (3.16)
Atr36 –> I can be humiliating when we argue (3.17)
Atr38 –> I hate my spouse's way of bringing it up (3.05)
Atr39 –> Fights often occur suddenly (3.07)
Atr40 –> We're just starting a fight before I know what's going on (3.3)
Atr41 –> When I talk to my spouse about something, my calm suddenly breaks (3.07)
The top 3 are: Atr35, Atr36, Atr40
Visualization
Create graphs using the original data that show the distribution of 'Not_Divorce' and 'Divorce_Potential' for each questionnaire statement. The first part shows the distribution of all parameters using density plots; the second part shows the distributions (violin plots and density plots) of only the 9 parameters above, which have a high difference in means.
Using All Parameters
In this part, we use density plots to see the distribution of all parameters. First, divide the data into two parts: Atr1-Atr27, saved into the 'gat1' object, and Atr28-Atr54, saved into the 'gat2' object. Each part is reshaped to long format using the 'gather()' function.
Atr1-Atr27 : ‘gat1’
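The chunk that builds 'gat1' is not echoed; mirroring the 'gat2' chunk shown further below, it was presumably:
gat1 <- d %>% select("Atr1":"Atr27", "Class") %>% gather(key = "key", value = "value",
    -Class)
gat1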
Class key value
1 Divorce_Potential Atr1 2
2 Divorce_Potential Atr1 4
3 Divorce_Potential Atr1 2
4 Divorce_Potential Atr1 3
5 Divorce_Potential Atr1 2
6 Divorce_Potential Atr1 0
7 Divorce_Potential Atr1 3
8 Divorce_Potential Atr1 2
9 Divorce_Potential Atr1 2
10 Divorce_Potential Atr1 1
11 Divorce_Potential Atr1 4
12 Divorce_Potential Atr1 4
13 Divorce_Potential Atr1 3
14 Divorce_Potential Atr1 3
15 Divorce_Potential Atr1 3
16 Divorce_Potential Atr1 4
17 Divorce_Potential Atr1 4
18 Divorce_Potential Atr1 4
19 Divorce_Potential Atr1 3
20 Divorce_Potential Atr1 4
21 Divorce_Potential Atr1 4
22 Divorce_Potential Atr1 4
23 Divorce_Potential Atr1 3
24 Divorce_Potential Atr1 3
25 Divorce_Potential Atr1 4
[ reached 'max' / getOption("max.print") -- omitted 4565 rows ]
Continue to Density Plot
den_all1 <- gat1 %>% ggplot(aes(value)) + geom_density(aes(fill = Class)) +
facet_wrap(~key, scales = "free_y", ncol = 4) + theme_tufte()
den_all1
Atr28-Atr54 : 'gat2'
gat2 <- d %>% select("Atr28":"Atr54", "Class") %>% gather(key = "key", value = "value",
-Class)
gat2
               Class   key value
1 Divorce_Potential Atr28 0
2 Divorce_Potential Atr28 0
3 Divorce_Potential Atr28 2
4 Divorce_Potential Atr28 1
5 Divorce_Potential Atr28 2
6 Divorce_Potential Atr28 0
7 Divorce_Potential Atr28 2
8 Divorce_Potential Atr28 2
9 Divorce_Potential Atr28 3
10 Divorce_Potential Atr28 1
11 Divorce_Potential Atr28 3
12 Divorce_Potential Atr28 3
13 Divorce_Potential Atr28 4
14 Divorce_Potential Atr28 4
15 Divorce_Potential Atr28 4
16 Divorce_Potential Atr28 2
17 Divorce_Potential Atr28 2
18 Divorce_Potential Atr28 3
19 Divorce_Potential Atr28 4
20 Divorce_Potential Atr28 3
21 Divorce_Potential Atr28 3
22 Divorce_Potential Atr28 3
23 Divorce_Potential Atr28 4
24 Divorce_Potential Atr28 4
25 Divorce_Potential Atr28 4
[ reached 'max' / getOption("max.print") -- omitted 4565 rows ]
Continue to Density Plot
den_all2 <- gat2 %>% ggplot(aes(value)) + geom_density(aes(fill = Class)) +
facet_wrap(~key, scales = "free_y", ncol = 4) + theme_tufte()
den_all2
Visualization - 9 Parameters
From the analysis above, we found the columns with a significant difference in means: Atr11, Atr19, Atr33, Atr35, Atr36, Atr38, Atr39, Atr40, Atr41.
The top 3 highest are:
Atr35 –> I can insult our discussions
Atr36 –> I can be humiliating when we argue
Atr40 –> We’re just starting a fight before I know what’s going on
Atr40: We’re just starting a fight before I know what’s going on
Atr40 Violin Plot
d %>% ggplot(aes(Class, Atr40, fill = Class)) + geom_violin(alpha = 0.3) + labs(title = "Atr40",
subtitle = "Starting a fight before know what’s going on") + theme_wsj() +
theme(plot.title = element_text(size = 25), plot.subtitle = element_text(size = 15,
family = "arial"), legend.title = element_text(size = 15), panel.grid.major = element_line(size = 0.7,
    linetype = "solid", colour = "white"))
Atr40 Density Plot
den_art40 <- d %>% ggplot(aes(Atr40, fill = Class)) + geom_density(alpha = 0.3) +
labs(title = "Atr40", subtitle = "Starting a fight before know what’s going on") +
theme_wsj() + theme(plot.title = element_text(size = 25), plot.subtitle = element_text(size = 15,
family = "arial"), legend.title = element_text(size = 15), panel.grid.major = element_line(size = 0.7,
linetype = "solid", colour = "white"))
den_art40
Interpretation: People classified as 'Not_Divorce' tend to strongly disagree with the statement (starting a fight before knowing what's going on), while people classified as 'Divorce_Potential' mostly strongly agree with it. Atr40 belongs to the negative conflict behaviors group.
Atr36: I can be humiliating when we argue
Atr36 Violin Plot
d %>% ggplot(aes(Class, Atr36, fill = Class)) + geom_violin(alpha = 0.3) + labs(title = "Atr36",
subtitle = "humiliating when we argue") + theme_wsj() + theme(plot.title = element_text(size = 25),
plot.subtitle = element_text(size = 15, family = "arial"), legend.title = element_text(size = 15),
    panel.grid.major = element_line(size = 0.7, linetype = "solid", colour = "white"))
Atr36 Density Plot
den_art36 <- d %>% ggplot(aes(Atr36, fill = Class)) + geom_density(alpha = 0.3) +
labs(title = "Atr36", subtitle = "humiliating when we argue") + theme_wsj() +
theme(plot.title = element_text(size = 25), plot.subtitle = element_text(size = 15,
family = "arial"), legend.title = element_text(size = 15), panel.grid.major = element_line(size = 0.7,
linetype = "solid", colour = "white"))
den_art36
Interpretation: This graph tells us that everyone in the 'Not_Divorce' class strongly disagrees with the statement; no one chose a middle answer. In contrast, people in the 'Divorce_Potential' class gave more varied answers. Atr36 belongs to the negative conflict behaviors group.
Atr35: I can insult our discussions
Atr35 Violin Plot
d %>% ggplot(aes(Class, Atr35, fill = Class)) + geom_violin(alpha = 0.3) + labs(title = "Atr35",
subtitle = "I can insult our discussions") + theme_wsj() + theme(plot.title = element_text(size = 25),
plot.subtitle = element_text(size = 15, family = "arial"), legend.title = element_text(size = 15),
    panel.grid.major = element_line(size = 0.7, linetype = "solid", colour = "white"))
Atr35 Density Plot
den_art35 <- d %>% ggplot(aes(Atr35, fill = Class)) + geom_density(alpha = 0.3) +
labs(title = "Atr35", subtitle = "I can insult our discussions") + theme_wsj() +
theme(plot.title = element_text(size = 25), plot.subtitle = element_text(size = 15,
family = "arial"), legend.title = element_text(size = 15), panel.grid.major = element_line(size = 0.7,
linetype = "solid", colour = "white"))
den_art35
Interpretation: Atr35 is a strong statement and belongs to the negative conflict behaviors group. The result tells us that people classified as 'Not_Divorce' mostly disagree with the statement (I can insult our discussions), whereas people classified as 'Divorce_Potential' give more varied answers, but most of them strongly agree with it.
Atr 39: Fights often occur suddenly
Atr39 Violin Plot
d %>% ggplot(aes(Class, Atr39, fill = Class)) + geom_violin(alpha = 0.3) + labs(title = "Atr39",
subtitle = "Fights often occur suddenly") + theme_wsj() + theme(plot.title = element_text(size = 25),
plot.subtitle = element_text(size = 15, family = "arial"), legend.title = element_text(size = 15),
    panel.grid.major = element_line(size = 0.7, linetype = "solid", colour = "white"))
Atr39 Density Plot
den_art39 <- d %>% ggplot(aes(Atr39, fill = Class)) + geom_density(alpha = 0.3) +
labs(title = "Atr39", subtitle = "Fights often occur suddenly") + theme_wsj() +
theme(plot.title = element_text(size = 25), plot.subtitle = element_text(size = 15,
family = "arial"), legend.title = element_text(size = 15), panel.grid.major = element_line(size = 0.7,
linetype = "solid", colour = "white"))
den_art39
Interpretation: Atr39 belongs to the negative conflict behaviors group. The graph shows that people in the 'Divorce_Potential' class mostly strongly agree with the statement (fights often occur suddenly), while people in the 'Not_Divorce' class mostly strongly disagree with it.
Atr41: When I talk to my spouse about something, my calm suddenly breaks
Atr41 Violin Plot
d %>% ggplot(aes(Class, Atr41, fill = Class)) + geom_violin(alpha = 0.3) + labs(title = "Atr41",
subtitle = "My calm suddenly breaks when talking") + theme_wsj() + theme(plot.title = element_text(size = 25),
plot.subtitle = element_text(size = 15, family = "arial"), legend.title = element_text(size = 15),
    panel.grid.major = element_line(size = 0.7, linetype = "solid", colour = "white"))
Atr41 Density Plot
den_art41 <- d %>% ggplot(aes(Atr41, fill = Class)) + geom_density(alpha = 0.3) +
labs(title = "Atr41", subtitle = "My calm suddenly breaks when talking") +
theme_wsj() + theme(plot.title = element_text(size = 25), plot.subtitle = element_text(size = 15,
family = "arial"), legend.title = element_text(size = 15), panel.grid.major = element_line(size = 0.7,
linetype = "solid", colour = "white"))
den_art41
Interpretation: From this graph we can see that people classified as 'Divorce_Potential' mostly strongly agree with the statement, while people classified as 'Not_Divorce' mostly strongly disagree with it. The graph also shows that the two classes have very different tendencies in how they behave toward their spouses. Atr41 belongs to the negative conflict behaviors group.
Atr33: I can use negative statements about my spouse’s personality during our discussions
Atr33 Violin Plot
d %>% ggplot(aes(Class, Atr33, fill = Class)) + geom_violin(alpha = 0.3) + labs(title = "Atr33",
subtitle = "Use negative statements during discussions") + theme_wsj() +
theme(plot.title = element_text(size = 25), plot.subtitle = element_text(size = 15,
family = "arial"), legend.title = element_text(size = 15), panel.grid.major = element_line(size = 0.7,
    linetype = "solid", colour = "white"))
Atr33 Density Plot
den_art33 <- d %>% ggplot(aes(Atr33, fill = Class)) + geom_density(alpha = 0.3) +
labs(title = "Atr33", subtitle = "Use negative statements during discussions") +
theme_wsj() + theme(plot.title = element_text(size = 25), plot.subtitle = element_text(size = 15,
family = "arial"), legend.title = element_text(size = 15), panel.grid.major = element_line(size = 0.7,
linetype = "solid", colour = "white"))
den_art33
Interpretation: This graph shows a big difference between the 'Divorce_Potential' and 'Not_Divorce' classes; their answers differ so clearly that the classes are easy to separate. People in the 'Not_Divorce' class strongly disagree with the statement, while most people in the 'Divorce_Potential' class strongly agree with it (using negative statements during discussions). This statement belongs to the negative conflict behaviors group.
Atr38: I hate my spouse’s way of bringing it up
Atr38 Violin Plot
d %>% ggplot(aes(Class, Atr38, fill = Class)) + geom_violin(alpha = 0.3) + labs(title = "Atr38",
subtitle = "I hate my spouse’s way of bringing it up") + theme_wsj() + theme(plot.title = element_text(size = 25),
plot.subtitle = element_text(size = 15, family = "arial"), legend.title = element_text(size = 15),
    panel.grid.major = element_line(size = 0.7, linetype = "solid", colour = "white"))
Atr38 Density Plot
den_art38 <- d %>% ggplot(aes(Atr38, fill = Class)) + geom_density(alpha = 0.3) +
labs(title = "Atr38", subtitle = "I hate my spouse’s way of bringing it up") +
theme_wsj() + theme(plot.title = element_text(size = 25), plot.subtitle = element_text(size = 15,
family = "arial"), legend.title = element_text(size = 15), panel.grid.major = element_line(size = 0.7,
linetype = "solid", colour = "white"))
den_art38
Interpretation: From this graph we can conclude that most people in the 'Not_Divorce' class strongly disagree with the statement, while most people in the 'Divorce_Potential' class strongly agree with it (I hate my spouse's way of bringing it up). Atr38 belongs to the negative conflict behaviors group.
Atr19: My spouse and I have similar ideas about how roles should be in marriage
Atr19 Violin Plot
d %>% ggplot(aes(Class, Atr19, fill = Class)) + geom_violin(alpha = 0.3) + labs(title = "Atr19",
subtitle = "Similar ideas about how roles should be in marriage") + theme_wsj() +
theme(plot.title = element_text(size = 25), plot.subtitle = element_text(size = 15,
family = "arial"), legend.title = element_text(size = 15), panel.grid.major = element_line(size = 0.7,
    linetype = "solid", colour = "white"))
Atr19 Density Plot
den_art19 <- d %>% ggplot(aes(Atr19, fill = Class)) + geom_density(alpha = 0.3) +
labs(title = "Atr19", subtitle = "Similar ideas about how roles should be in marriage") +
theme_wsj() + theme(plot.title = element_text(size = 25), plot.subtitle = element_text(size = 15,
family = "arial"), legend.title = element_text(size = 15), panel.grid.major = element_line(size = 0.7,
linetype = "solid", colour = "white"))
den_art19
Interpretation: Atr19 belongs to the common meaning and failed repair attempts group. The result shows that people in the 'Not_Divorce' class mostly do not agree with the statement about having similar ideas on how roles should be in marriage, while people in the 'Divorce_Potential' class mostly tend to agree with it. Surprising!
Atr11: I think that one day in the future, when I look back, I see that my wife and I are in harmony with each other
Atr11 Violin Plot
d %>% ggplot(aes(Class, Atr11, fill = Class)) + geom_violin(alpha = 0.3) + labs(title = "Atr11",
    subtitle = "Some day my spouse and I will be in harmony with each other") +
theme_wsj() + theme(plot.title = element_text(size = 25), plot.subtitle = element_text(size = 15,
family = "arial"), legend.title = element_text(size = 15), panel.grid.major = element_line(size = 0.7,
    linetype = "solid", colour = "white"))
Atr11 Density Plot
den_art11 <- d %>% ggplot(aes(Atr11, fill = Class)) + geom_density(alpha = 0.3) +
    labs(title = "Atr11", subtitle = "Some day my spouse and I will be in harmony with each other") +
theme_wsj() + theme(plot.title = element_text(size = 25), plot.subtitle = element_text(size = 15,
family = "arial"), legend.title = element_text(size = 15), panel.grid.major = element_line(size = 0.7,
linetype = "solid", colour = "white"))
den_art11
Interpretation: Atr11 belongs to the love map group. Surprisingly, the dataset shows that people who strongly agree with the statement "Some day my spouse and I will be in harmony with each other" fall into the 'Divorce_Potential' class, while people in the 'Not_Divorce' class strongly disagree with it.
Cross Validation Dataset
Read Data again
Atr1 Atr2 Atr3 Atr4 Atr5 Atr6 Atr7 Atr8 Atr9 Atr10 Atr11 Atr12 Atr13
1 2 2 4 1 0 0 0 0 0 0 1 0 1
Atr14 Atr15 Atr16 Atr17 Atr18 Atr19 Atr20 Atr21 Atr22 Atr23 Atr24 Atr25
1 1 0 1 0 0 0 1 0 0 0 0 0
Atr26 Atr27 Atr28 Atr29 Atr30 Atr31 Atr32 Atr33 Atr34 Atr35 Atr36 Atr37
1 0 0 0 0 1 1 2 1 2 0 1 2
Atr38 Atr39 Atr40 Atr41 Atr42 Atr43 Atr44 Atr45 Atr46 Atr47 Atr48 Atr49
1 1 3 3 2 1 1 2 3 2 1 3 3
Atr50 Atr51 Atr52 Atr53 Atr54 Class
1 3 2 3 2 1 1
[ reached 'max' / getOption("max.print") -- omitted 169 rows ]
Target Variable –> Class, 0-Not Divorce ; 1-Divorce Potential
Subsetting using only the 9 parameters
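The subsetting chunk is not echoed; since the split below uses an object named 'div9', it was presumably something like the following (the name 'div' for the re-read raw data frame is an assumption):
# Keep only the 9 most influential questions plus the raw 0/1 target
div9 <- div %>% select(Atr11, Atr19, Atr33, Atr35, Atr36, Atr38, Atr39, Atr40,
    Atr41, Class)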
Using the initial_split() function to split the train and test datasets
set.seed(55)
split_nn9 <- initial_split(data = div9, prop = 0.75, strata = "Class")
train_nn9 <- training(split_nn9)
test_nn9_fin <- testing(split_nn9)
set.seed(55)
split_nn9b <- initial_split(data = train_nn9, prop = 0.8, strata = "Class")
train_nn9b <- training(split_nn9b)
test_nn9b <- testing(split_nn9b)
For the next steps, we will use 'train_nn9b' as the training dataset, 'test_nn9b' as the test dataset, and 'test_nn9_fin' as the final (unseen) validation dataset.
Machine Learning
Neural Network - Using 9 Parameters only
Change into Matrix type
Separate the x variables (predictors) and the y variable (the target) in order to create the one-hot encoding
Predictor
Target
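None of these preparation chunks are echoed; a minimal sketch, assuming the predictor matrices are named 'train_nn9b_x' / 'test_nn9b_x' (the names used by array_reshape() below) and the raw 0/1 targets 'train_nn9b_y' / 'test_nn9b_y' (assumed names), could be:
# Predictors: drop Class and convert the 9 questionnaire columns to a numeric matrix
train_nn9b_x <- data.matrix(train_nn9b %>% select(-Class))
test_nn9b_x <- data.matrix(test_nn9b %>% select(-Class))
# Target: the raw 0/1 Class column (0 = Not_Divorce, 1 = Divorce_Potential)
train_nn9b_y <- train_nn9b$Class
test_nn9b_y <- test_nn9b$Class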
Change into Array
Array
train_nn9b_x_keras <- array_reshape(x = train_nn9b_x, dim = c(nrow(train_nn9b_x),
ncol(train_nn9b_x)))
test_nn9b_x_keras <- array_reshape(x = test_nn9b_x, dim = c(nrow(test_nn9b_x),
ncol(test_nn9b_x)))
head(train_nn9b_x_keras)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,] 1 0 1 0 1 1 3 3 2
[2,] 3 3 1 1 1 1 3 3 3
[3,] 4 3 2 1 1 3 4 4 2
[4,] 0 2 1 0 0 0 2 1 0
[5,] 0 0 1 1 1 2 0 2 2
[6,] 2 3 2 1 2 2 2 3 3
One-Hot Encoding
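The encoding chunk is not echoed; with keras it is presumably a to_categorical() call on the 0/1 targets (the '_y' object names are assumptions carried over from the sketch above):
# One-hot encode the 0/1 targets into two columns
train_nn9b_y_keras <- to_categorical(train_nn9b_y, num_classes = 2)
test_nn9b_y_keras <- to_categorical(test_nn9b_y, num_classes = 2)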
Creating Neural Network Architecture
Neural Network sequencing
# Initialize the sequential model (this step was not echoed in the original output)
mnn9b <- keras_model_sequential()

mnn9b %>% layer_dense(input_shape = ncol(train_nn9b_x_keras), units = 64, activation = "relu",
    name = "hidden_A") %>% layer_dense(units = 32, activation = "relu", name = "hidden_B",
    kernel_initializer = initializer_random_uniform(seed = 55)) %>% layer_dense(units = 2,
    activation = "sigmoid")

summary(mnn9b)
Model: "sequential"
___________________________________________________________________________
Layer (type) Output Shape Param #
===========================================================================
hidden_A (Dense) (None, 64) 640
___________________________________________________________________________
hidden_B (Dense) (None, 32) 2080
___________________________________________________________________________
dense (Dense) (None, 2) 66
===========================================================================
Total params: 2,786
Trainable params: 2,786
Non-trainable params: 0
___________________________________________________________________________
Compiling
Gather all Neural Network architecture, set learning rate = 0.001
mnn9b %>% compile(loss = "binary_crossentropy", optimizer = optimizer_sgd(lr = 0.001),
    metric = "accuracy")
*Creating the Neural Network model using the 'fit()' function
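The fit() call is not echoed; given the objects above and the 'history9b' object plotted next, it was presumably along these lines (the epoch count, batch size, validation data, and the one-hot target names are assumptions):
history9b <- mnn9b %>% fit(
    x = train_nn9b_x_keras,
    y = train_nn9b_y_keras,  # one-hot encoded training target (assumed name)
    epochs = 100,            # assumed
    batch_size = 16,         # assumed
    validation_data = list(test_nn9b_x_keras, test_nn9b_y_keras))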
Plotting Model
plot(history9b) + geom_line(col = "#ef2964", lwd = 1) + geom_point(cex = 1,
    col = 14) + theme_bigstatsr() + labs(y = NULL)
Prediction using Probability Number
prob_test9b <- predict_proba(object = mnn9b, x = test_nn9b_x_keras)
pred_test9b <- if_else(prob_test9b[, 1] >= 0.45, "Not_Divorce", "Divorce_Potential")
*Rename the target variable values to 'Divorce_Potential' and 'Not_Divorce'
Model Validation (confusionMatrix)
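The call that produces the matrix below is not echoed; with caret it is presumably something like this (where 'test_label9b' is a hypothetical name for the relabeled test target):
confusionMatrix(data = factor(pred_test9b, levels = c("Divorce_Potential", "Not_Divorce")),
    reference = factor(test_label9b, levels = c("Divorce_Potential", "Not_Divorce")),
    positive = "Divorce_Potential")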
Confusion Matrix and Statistics
Reference
Prediction Divorce_Potential Not_Divorce
Divorce_Potential 12 2
Not_Divorce 0 10
Accuracy : 0.9167
95% CI : (0.73, 0.9897)
No Information Rate : 0.5
P-Value [Acc > NIR] : 0.00001794
Kappa : 0.8333
Mcnemar's Test P-Value : 0.4795
Sensitivity : 1.0000
Specificity : 0.8333
Pos Pred Value : 0.8571
Neg Pred Value : 1.0000
Prevalence : 0.5000
Detection Rate : 0.5000
Detection Prevalence : 0.5833
Balanced Accuracy : 0.9167
'Positive' Class : Divorce_Potential
Since the purpose of this case is to predict potential divorce before it happens, the model needs to catch as many positive cases as possible, so we look at Recall/Sensitivity in addition to Accuracy. The confusion matrix gives us good results for both recall/sensitivity and accuracy.
Checking model performance against the train dataset
Prediction using Probability Number
prob_9b <- predict_proba(object = mnn9b, x = train_nn9b_x_keras)
pred_9b <- if_else(prob_9b[, 1] >= 0.45, "Not_Divorce", "Divorce_Potential")
Model Validation (confusionMatrix)
Confusion Matrix and Statistics
Reference
Prediction Divorce_Potential Not_Divorce
Divorce_Potential 51 9
Not_Divorce 0 44
Accuracy : 0.9135
95% CI : (0.8421, 0.9597)
No Information Rate : 0.5096
P-Value [Acc > NIR] : < 0.00000000000000022
Kappa : 0.8274
Mcnemar's Test P-Value : 0.007661
Sensitivity : 1.0000
Specificity : 0.8302
Pos Pred Value : 0.8500
Neg Pred Value : 1.0000
Prevalence : 0.4904
Detection Rate : 0.4904
Detection Prevalence : 0.5769
Balanced Accuracy : 0.9151
'Positive' Class : Divorce_Potential
Comparing these train-dataset results with the test-dataset results above, the model performs consistently on both, so the resulting model is good enough.
Prediction Using the 'NN_92' Model
Load Model
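The loading step is not echoed; assuming the trained network was saved to an HDF5 file (the file name here is hypothetical), it would be something like:
# Load the previously saved keras model (hypothetical file name)
model_NN_92 <- load_model_hdf5("model_NN_92.hdf5")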
Change the 'test_nn9_fin' dataset into keras format. First, change it into a matrix.
Separate the x variables (predictors) and the y variable (the target) to create the one-hot encoding
Predictor
Target
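As before, the preparation chunk is not echoed; a minimal sketch under the same naming assumptions ('test_nn9b_fin_x' matches the array_reshape() call below; 'test_nn9b_fin_y' is an assumed name for the raw target):
# Predictors of the validation split as a numeric matrix; target as the raw 0/1 vector
test_nn9b_fin_x <- data.matrix(test_nn9_fin %>% select(-Class))
test_nn9b_fin_y <- test_nn9_fin$Class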
Change into Array
test_nn9b_fin_x_keras <- array_reshape(x = test_nn9b_fin_x, dim = c(nrow(test_nn9b_fin_x),
ncol(test_nn9b_fin_x)))
head(test_nn9b_fin_x_keras)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,] 3 3 4 4 4 4 4 4 4
[2,] 3 3 4 4 4 4 4 4 4
[3,] 4 4 4 4 3 3 4 3 4
[4,] 4 4 4 4 4 4 4 4 4
[5,] 3 3 4 4 4 4 4 4 4
[6,] 3 3 4 4 4 4 4 4 4
Prediction using Probability Number
prob_test92_fin <- predict_proba(object = model_NN_92, x = test_nn9b_fin_x_keras)
pred_test92_fin <- if_else(prob_test92_fin[, 1] >= 0.45, "Not_Divorce", "Divorce_Potential")

func <- function(data) {
sapply(as.character(data), switch, `0` = "Not_Divorce", `1` = "Divorce_Potential")
}
Model Validation (confusionMatrix) - Using the Unseen Dataset
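The 'func' helper maps the raw 0/1 codes to class labels; it is presumably applied to the validation target and the result compared with the predictions, roughly as follows (the raw target name 'test_nn9b_fin_y' is an assumption):
# Relabel the raw 0/1 validation target and evaluate the predictions
label_fin <- func(test_nn9b_fin_y)
confusionMatrix(data = factor(pred_test92_fin, levels = c("Divorce_Potential", "Not_Divorce")),
    reference = factor(label_fin, levels = c("Divorce_Potential", "Not_Divorce")),
    positive = "Divorce_Potential")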
Confusion Matrix and Statistics
Reference
Prediction Divorce_Potential Not_Divorce
Divorce_Potential 21 2
Not_Divorce 0 19
Accuracy : 0.9524
95% CI : (0.8384, 0.9942)
No Information Rate : 0.5
P-Value [Acc > NIR] : 0.0000000002055
Kappa : 0.9048
Mcnemar's Test P-Value : 0.4795
Sensitivity : 1.0000
Specificity : 0.9048
Pos Pred Value : 0.9130
Neg Pred Value : 1.0000
Prevalence : 0.5000
Detection Rate : 0.5000
Detection Prevalence : 0.5476
Balanced Accuracy : 0.9524
'Positive' Class : Divorce_Potential
Recall/Sensitivity delivers a good result, and so does Accuracy.
Conclusion
After trying several algorithms (Logistic Regression, KNN, Naive Bayes, Decision Tree, Random Forest), I conclude that the Neural Network explains this case better than the others while using only the 9 parameters that have the biggest impact among all the questions in the questionnaire. It also delivers decent accuracy and recall (sensitivity), which matters because the goal of this questionnaire is to predict as many potential divorce cases as possible before they happen.
The 9 parameters are:
Atr11 –> I think that some day, my spouse and I will be in harmony with each other (3.01)
Atr19 –> My spouse and I have similar ideas about how roles should be in marriage (3.03)
Atr33 –> I can use negative statements about my spouse's personality during our discussions (3.06)
Atr35 –> I can insult our discussions (3.16)
Atr36 –> I can be humiliating when we argue (3.17)
Atr38 –> I hate my spouse's way of bringing it up (3.05)
Atr39 –> Fights often occur suddenly (3.07)
Atr40 –> We're just starting a fight before I know what's going on (3.3)
Atr41 –> When I talk to my spouse about something, my calm suddenly breaks (3.07)
The top 3 are: Atr35, Atr36, Atr40
A simple practical use in a company would be to administer this questionnaire to employees to reduce the chance of personal issues that could lead to less focus and lower productivity at work. In other words, the questionnaire could help maintain employee productivity.