In 1978, a survey of housewives was conducted to identify the factors that lead to extramarital affairs. The study became the basis for much later research on both men and women, and it tried to focus on the characteristics of marriages and of the people who end up looking for a partner outside of them, without their spouse's knowledge.
Supervised learning is not always about prediction. In this case study, the goal is to identify a few of the many factors that appear to matter most when someone has an affair.
The data set is provided by the Python statsmodels package; you can download it from statsmodels.
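The CSV read below can be produced from statsmodels (its "fair" data set). One way to export it from R, assuming Python and statsmodels are installed and reachable by reticulate, is sketched here:
# sketch: export the statsmodels "fair" data set to the CSV used below
library(reticulate)
sm <- import("statsmodels.api")
fair <- sm$datasets$fair$load_pandas()$data
write.csv(fair, "./input/affair.csv")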
suppressMessages(library(dplyr))        # data manipulation
suppressMessages(library(readxl))       # Excel import
suppressMessages(library(doBy))         # grouped summaries
suppressMessages(library(fmsb))         # radar chart
suppressMessages(library(ggplot2))      # visualization
suppressMessages(library(corrplot))     # correlation plot
suppressMessages(library(VIM))          # missing data detection
suppressMessages(library(DMwR))         # outlier detection
suppressMessages(library(PerformanceAnalytics)) # correlation chart
suppressMessages(library(rpart))        # decision tree
suppressMessages(library(C50))          # decision tree
suppressMessages(library(rattle))       # fancy decision tree plot
suppressMessages(library(rpart.plot))   # fancy decision tree plot
suppressMessages(library(RColorBrewer)) # fancy decision tree plot
suppressMessages(library(ipred))        # bagging
suppressMessages(library(randomForest)) # random forest
suppressMessages(library(adabag))       # adaptive boosting
suppressMessages(library(lme4))         # dummy function
suppressMessages(library(caret))        # standardization, model utilities
suppressMessages(library(class))        # KNN
suppressMessages(library(VennDiagram))  # Venn diagram
suppressMessages(library(neuralnet))    # ANN
suppressMessages(library(e1071))        # SVM
suppressMessages(library(ROCR))         # ROC
suppressMessages(library(mctest))       # multicollinearity diagnostics
suppressMessages(library(dummies))      # dummy variable creation
suppressMessages(library(Information))  # information value / WOE
suppressMessages(library(pROC))         # ROC / AUC
suppressMessages(library(gmodels))      # CrossTable
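At the time of writing, a couple of these packages (notably DMwR and dummies) appear to have been archived from CRAN. If install.packages() cannot find them, they can usually still be installed from the CRAN archive; a sketch (the version numbers are the last archived releases I am aware of, so adjust if needed):
# sketch: install archived packages from the CRAN archive via remotes
install.packages("remotes")
remotes::install_version("DMwR", version = "0.4.1")
remotes::install_version("dummies", version = "1.5.6")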
affairs_df <- read.csv("./input/affair.csv")
affairs_df <- affairs_df[-1]
str(affairs_df)
## 'data.frame': 6366 obs. of 9 variables:
## $ rate_marriage : num 3 3 4 4 5 4 5 5 3 3 ...
## $ age : num 32 27 22 37 27 27 37 37 22 27 ...
## $ yrs_married : num 9 13 2.5 16.5 9 9 23 23 2.5 6 ...
## $ children : num 3 3 0 4 1 0 5.5 5.5 0 0 ...
## $ religious : num 3 1 1 3 1 2 2 2 2 1 ...
## $ educ : num 17 14 16 16 14 14 12 12 12 16 ...
## $ occupation : num 2 3 3 5 3 3 5 2 3 3 ...
## $ occupation_husb: num 5 4 5 5 4 4 4 3 3 5 ...
## $ affairs : num 0.111 3.231 1.4 0.727 4.667 ...
head(affairs_df)
## rate_marriage age yrs_married children religious educ occupation
## 1 3 32 9.0 3 3 17 2
## 2 3 27 13.0 3 1 14 3
## 3 4 22 2.5 0 1 16 3
## 4 4 37 16.5 4 3 16 5
## 5 5 27 9.0 1 1 14 3
## 6 4 27 9.0 0 2 14 3
## occupation_husb affairs
## 1 5 0.1111111
## 2 4 3.2307692
## 3 5 1.3999996
## 4 5 0.7272727
## 5 4 4.6666660
## 6 4 4.6666660
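Some of the packages loaded above (e.g. VIM) are there for missing-data detection. A quick check, as a minimal sketch, confirms whether any values are missing before moving on:
# sketch: count missing values per column and visualize them with VIM
colSums(is.na(affairs_df))
aggr(affairs_df, prop = FALSE, numbers = TRUE)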
• rate_marriage : the wife's rating of the marriage; ordinal level
- 1 = very poor, 2 = poor, 3 = fair, 4 = good, 5 = very good
• age : the wife's age; ratio level
• yrs_married : number of years married; ratio level
• children : number of children between husband and wife; ratio level
• religious : how religious the wife is; ordinal level
- 1 = not, 2 = mildly, 3 = fairly, 4 = strongly
• educ : level of education; ratio level
- 9 = grade school, 12 = high school, 14 = college, 16 = college graduate, 17 = graduate school, 20 = advanced degree
• occupation : the wife's occupation; nominal level
- 1 = student
- 2 = farming, semi-skilled, or unskilled worker
- 3 = white collar
- 4 = teacher, counselor, social worker, nurse, artist, writer, technician, skilled worker
- 5 = managerial, administrative, business
- 6 = professional with an advanced degree
• occupation_husb : the husband's occupation; same coding as occupation; nominal level
• affairs : a measure of time spent in extramarital affairs; ratio level
The responses are quantitative, but the question is which factors lead a person to have an affair; the exact number of minutes or hours matters little. For that reason, let's create a new categorical variable called affair_binary, which is TRUE (more than 0 minutes of affairs) or FALSE (0 minutes of affairs).
# Create a categorical variable
affairs_df$affair_binary <- affairs_df$affairs > 0
table(affairs_df$affair_binary)
##
## FALSE TRUE
## 4313 2053
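Since ggplot2 is already loaded, the class balance can also be shown as a quick bar chart; a minimal sketch:
# sketch: visualize the FALSE/TRUE balance of the new target variable
ggplot(affairs_df, aes(x = affair_binary)) +
  geom_bar(fill = "steelblue") +
  labs(x = "affair_binary", y = "count")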
# find linear correlations between variables and affair_binary
# correlation matrix
data_corr <- affairs_df
# coerce every column (including the logical affair_binary) to integer so cor() accepts it;
# note that this truncates the fractional affairs values
for (i in 1:ncol(data_corr)) {
  data_corr[, i] <- as.integer(data_corr[, i])
}
cor(data_corr)
## rate_marriage age yrs_married children
## rate_marriage 1.00000000 -0.11116426 -0.13018925 -0.13108300
## age -0.11116426 1.00000000 0.89295708 0.67750331
## yrs_married -0.13018925 0.89295708 1.00000000 0.77839473
## children -0.13108300 0.67750331 0.77839473 1.00000000
## religious 0.07879426 0.13633078 0.13156771 0.14176023
## educ 0.07986881 0.02906597 -0.10716905 -0.14329400
## occupation 0.03952804 0.10672123 0.04265341 -0.01505493
## occupation_husb 0.02774453 0.16298952 0.12989878 0.08810300
## affairs -0.15253690 -0.11282504 -0.11577227 -0.09040363
## affair_binary -0.33177597 0.14689599 0.20486231 0.16321222
## religious educ occupation occupation_husb
## rate_marriage 0.078794261 0.07986881 0.0395280396 0.027744525
## age 0.136330784 0.02906597 0.1067212260 0.162989521
## yrs_married 0.131567714 -0.10716905 0.0426534104 0.129898782
## children 0.141760227 -0.14329400 -0.0150549307 0.088102999
## religious 1.000000000 0.03224460 0.0357464670 0.004060543
## educ 0.032244597 1.00000000 0.3822863399 0.183932372
## occupation 0.035746467 0.38228634 1.0000000000 0.201155871
## occupation_husb 0.004060543 0.18393237 0.2011558711 1.000000000
## affairs -0.118156123 -0.01107978 0.0005584221 -0.017982213
## affair_binary -0.129299436 -0.07528010 0.0289813052 0.017637479
## affairs affair_binary
## rate_marriage -0.1525369013 -0.33177597
## age -0.1128250432 0.14689599
## yrs_married -0.1157722671 0.20486231
## children -0.0904036320 0.16321222
## religious -0.1181561228 -0.12929944
## educ -0.0110797815 -0.07528010
## occupation 0.0005584221 0.02898131
## occupation_husb -0.0179822126 0.01763748
## affairs 1.0000000000 0.38841462
## affair_binary 0.3884146226 1.00000000
corrplot(cor(data_corr), order="hclust", title="Correlation", addrect=2, method="number") #visualization
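PerformanceAnalytics was loaded above for a correlation chart; as an alternative view of the same relationships, a minimal sketch (it draws pairwise scatter plots, so it can be slow for 6,366 rows):
# sketch: pairwise scatter plots, histograms and correlation coefficients in one figure
chart.Correlation(data_corr, histogram = TRUE)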
We can see the top four variables correlated with the dependent variable:
• affairs
• age
• yrs_married
• children
The affairs variable shows the strongest correlation, but only because affair_binary was derived directly from affairs, so the two are correlated by construction. We can therefore ignore it.
# one-hot encode the two nominal variables with dummy() from the dummies package
affair_dummy <- data.frame(affairs_df, dummy(affairs_df$occupation))
affair_dummy <- data.frame(affair_dummy, dummy(affair_dummy$occupation_husb))
head(affair_dummy)
## rate_marriage age yrs_married children religious educ occupation
## 1 3 32 9.0 3 3 17 2
## 2 3 27 13.0 3 1 14 3
## 3 4 22 2.5 0 1 16 3
## 4 4 37 16.5 4 3 16 5
## 5 5 27 9.0 1 1 14 3
## 6 4 27 9.0 0 2 14 3
## occupation_husb affairs affair_binary
## 1 5 0.1111111 TRUE
## 2 4 3.2307692 TRUE
## 3 5 1.3999996 TRUE
## 4 5 0.7272727 TRUE
## 5 4 4.6666660 TRUE
## 6 4 4.6666660 TRUE
## X.Users.mac.codebox_modified.dev.data.analytics.project.love.affair.love.affair.Rmd1
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
## X.Users.mac.codebox_modified.dev.data.analytics.project.love.affair.love.affair.Rmd2
## 1 1
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
## X.Users.mac.codebox_modified.dev.data.analytics.project.love.affair.love.affair.Rmd3
## 1 0
## 2 1
## 3 1
## 4 0
## 5 1
## 6 1
## X.Users.mac.codebox_modified.dev.data.analytics.project.love.affair.love.affair.Rmd4
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
## X.Users.mac.codebox_modified.dev.data.analytics.project.love.affair.love.affair.Rmd5
## 1 0
## 2 0
## 3 0
## 4 1
## 5 0
## 6 0
## X.Users.mac.codebox_modified.dev.data.analytics.project.love.affair.love.affair.Rmd6
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
## X.Users.mac.codebox_modified.dev.data.analytics.project.love.affair.love.affair.Rmd1.1
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
## X.Users.mac.codebox_modified.dev.data.analytics.project.love.affair.love.affair.Rmd2.1
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
## X.Users.mac.codebox_modified.dev.data.analytics.project.love.affair.love.affair.Rmd3.1
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
## X.Users.mac.codebox_modified.dev.data.analytics.project.love.affair.love.affair.Rmd4.1
## 1 0
## 2 1
## 3 0
## 4 0
## 5 1
## 6 1
## X.Users.mac.codebox_modified.dev.data.analytics.project.love.affair.love.affair.Rmd5.1
## 1 1
## 2 0
## 3 1
## 4 1
## 5 0
## 6 0
## X.Users.mac.codebox_modified.dev.data.analytics.project.love.affair.love.affair.Rmd6.1
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
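As the output above shows, dummy() from the dummies package derives the dummy column names from the path of the calling .Rmd file, which makes them hard to read. The analysis below keeps those columns, but if readable names are preferred, base R's model.matrix() is one alternative; a minimal sketch (the occ_wife_/occ_husb_ prefixes are just illustrative names):
# sketch: one-hot encode the nominal occupation variables with readable names
occ_w <- model.matrix(~ factor(occupation) - 1, data = affairs_df)
occ_h <- model.matrix(~ factor(occupation_husb) - 1, data = affairs_df)
colnames(occ_w) <- paste0("occ_wife_", levels(factor(affairs_df$occupation)))
colnames(occ_h) <- paste0("occ_husb_", levels(factor(affairs_df$occupation_husb)))
affair_dummy_alt <- data.frame(affairs_df, occ_w, occ_h)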
# create a random sample for training and test data
# use set.seed to use the same random number sequence as the tutorial
set.seed(123)
train_sample <- sample(6366, 5092)
str(train_sample)
## int [1:5092] 1831 5018 2603 5619 5984 290 3359 5675 3507 2903 ...
# split the data frames, dropping column 9 (the continuous affairs variable)
data_train <- affair_dummy[train_sample, -9]
data_test <- affair_dummy[-train_sample, -9]
# check the proportion of class variable
dim(data_train)
## [1] 5092 21
prop.table(table(data_train$affair_binary))
##
## FALSE TRUE
## 0.6812647 0.3187353
dim(data_test)
## [1] 1274 21
prop.table(table(data_test$affair_binary))
##
## FALSE TRUE
## 0.6624804 0.3375196
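The random split above happens to preserve the class proportions reasonably well (about 32% vs. 34% TRUE). If an exactly stratified split is preferred, caret's createDataPartition() can be used instead; a minimal sketch (not used below):
# sketch: stratified 80/20 split on affair_binary as an alternative to sample()
set.seed(123)
idx <- createDataPartition(factor(affair_dummy$affair_binary), p = 0.8, list = FALSE)
strat_train <- affair_dummy[idx, -9]
strat_test <- affair_dummy[-idx, -9]
prop.table(table(strat_train$affair_binary))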
## Training
dt <- rpart(as.factor(affair_binary) ~ ., data = data_train, cp = 0.1^20) # use all variables and grow a full tree
xerror_min_which <- which.min(dt$cptable[, "xerror"])
xerror_min <- min(dt$cptable[, "xerror"])
printcp(dt) # print the cp table
##
## Classification tree:
## rpart(formula = as.factor(affair_binary) ~ ., data = data_train,
## cp = 0.1^20)
##
## Variables actually used in tree construction:
## [1] age
## [2] children
## [3] educ
## [4] occupation
## [5] occupation_husb
## [6] rate_marriage
## [7] religious
## [8] X.Users.mac.codebox_modified.dev.data.analytics.project.love.affair.love.affair.Rmd2
## [9] X.Users.mac.codebox_modified.dev.data.analytics.project.love.affair.love.affair.Rmd2.1
## [10] X.Users.mac.codebox_modified.dev.data.analytics.project.love.affair.love.affair.Rmd3
## [11] X.Users.mac.codebox_modified.dev.data.analytics.project.love.affair.love.affair.Rmd3.1
## [12] X.Users.mac.codebox_modified.dev.data.analytics.project.love.affair.love.affair.Rmd4
## [13] X.Users.mac.codebox_modified.dev.data.analytics.project.love.affair.love.affair.Rmd4.1
## [14] X.Users.mac.codebox_modified.dev.data.analytics.project.love.affair.love.affair.Rmd5
## [15] X.Users.mac.codebox_modified.dev.data.analytics.project.love.affair.love.affair.Rmd5.1
## [16] yrs_married
##
## Root node error: 1623/5092 = 0.31874
##
## n= 5092
##
## CP nsplit rel error xerror xstd
## 1 1.0228e-01 0 1.00000 1.00000 0.020488
## 2 2.1565e-02 1 0.89772 0.89772 0.019871
## 3 4.3130e-03 2 0.87616 0.88417 0.019780
## 4 3.6969e-03 3 0.87184 0.89156 0.019830
## 5 2.4646e-03 4 0.86815 0.90265 0.019904
## 6 1.8484e-03 9 0.85582 0.89649 0.019863
## 7 1.5404e-03 26 0.81023 0.90696 0.019932
## 8 1.4377e-03 32 0.80099 0.90450 0.019916
## 9 1.3093e-03 35 0.79667 0.90450 0.019916
## 10 1.2323e-03 45 0.78065 0.90943 0.019948
## 11 9.2421e-04 58 0.76340 0.91251 0.019968
## 12 8.2152e-04 66 0.75601 0.91374 0.019976
## 13 6.1614e-04 72 0.75108 0.91436 0.019980
## 14 4.9291e-04 95 0.73629 0.93284 0.020096
## 15 4.6211e-04 101 0.73259 0.94208 0.020153
## 16 3.8509e-04 109 0.72890 0.94701 0.020183
## 17 3.5208e-04 126 0.72089 0.94824 0.020191
## 18 3.0807e-04 134 0.71781 0.97043 0.020322
## 19 2.4646e-04 147 0.71349 0.97227 0.020333
## 20 2.0538e-04 152 0.71226 0.97289 0.020336
## 21 1.2323e-04 164 0.70980 0.97843 0.020368
## 22 1.0000e-20 169 0.70918 0.97905 0.020371
plotcp(dt) # plot the cross-validated error for each cp value
abline(v = xerror_min_which, lty = 2, col = "red")
text(xerror_min_which, xerror_min, labels = round(xerror_min_which, 2), pos = 3, col = "red") # mark the tree with the minimum xerror
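The pruning step below uses the cp value with the minimum cross-validated error. A slightly more conservative alternative sometimes used is the 1-SE rule (pick the simplest tree whose xerror is within one standard error of the minimum); a sketch, not used in what follows:
# sketch: 1-SE rule as an alternative pruning criterion
cp_tab <- dt$cptable
one_se <- cp_tab[xerror_min_which, "xerror"] + cp_tab[xerror_min_which, "xstd"]
cp_1se <- cp_tab[which(cp_tab[, "xerror"] <= one_se)[1], "CP"]
dt_prune_1se <- prune(dt, cp = cp_1se)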
# pruning
dt_prune <- prune(dt, cp = dt$cptable[which.min(dt$cptable[, "xerror"]), "CP"])
# training accuracy
pred_tr_dt <- predict(dt_prune, type = "class") # predict class labels on the training data
t_tr_dt <- table(pred_tr_dt, data_train$affair_binary) # confusion matrix
t_tr_dt
##
## pred_tr_dt FALSE TRUE
## FALSE 3156 1109
## TRUE 313 514
acc_tr_dt <- sum(diag(t_tr_dt)) / sum(t_tr_dt) # accuracy
acc_tr_dt
## [1] 0.7207384
CrossTable(x = data_train$affair_binary, y = pred_tr_dt, prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 5092
##
##
## | pred_tr_dt
## data_train$affair_binary | FALSE | TRUE | Row Total |
## -------------------------|-----------|-----------|-----------|
## FALSE | 3156 | 313 | 3469 |
## | 0.910 | 0.090 | 0.681 |
## | 0.740 | 0.378 | |
## | 0.620 | 0.061 | |
## -------------------------|-----------|-----------|-----------|
## TRUE | 1109 | 514 | 1623 |
## | 0.683 | 0.317 | 0.319 |
## | 0.260 | 0.622 | |
## | 0.218 | 0.101 | |
## -------------------------|-----------|-----------|-----------|
## Column Total | 4265 | 827 | 5092 |
## | 0.838 | 0.162 | |
## -------------------------|-----------|-----------|-----------|
##
##
# test accuracy
pred_te_dt <- predict(dt_prune, data_test, type = "class")
t_te_dt <- table(pred_te_dt, data_test$affair_binary)
t_te_dt
##
## pred_te_dt FALSE TRUE
## FALSE 770 278
## TRUE 74 152
acc_te_dt <- sum(diag(t_te_dt)) / sum(t_te_dt) # accuracy
acc_te_dt
## [1] 0.7237049
CrossTable(x = data_test$affair_binary, y = pred_te_dt, prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 1274
##
##
## | pred_te_dt
## data_test$affair_binary | FALSE | TRUE | Row Total |
## ------------------------|-----------|-----------|-----------|
## FALSE | 770 | 74 | 844 |
## | 0.912 | 0.088 | 0.662 |
## | 0.735 | 0.327 | |
## | 0.604 | 0.058 | |
## ------------------------|-----------|-----------|-----------|
## TRUE | 278 | 152 | 430 |
## | 0.647 | 0.353 | 0.338 |
## | 0.265 | 0.673 | |
## | 0.218 | 0.119 | |
## ------------------------|-----------|-----------|-----------|
## Column Total | 1048 | 226 | 1274 |
## | 0.823 | 0.177 | |
## ------------------------|-----------|-----------|-----------|
##
##
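Accuracy alone can be misleading here because the classes are imbalanced (about one third TRUE). Since caret and pROC are already loaded, further metrics can be computed from the same test predictions; a minimal sketch:
# sketch: sensitivity, specificity, kappa and AUC on the test set
act_te <- factor(data_test$affair_binary)
confusionMatrix(pred_te_dt, act_te, positive = "TRUE")
prob_te_dt <- predict(dt_prune, data_test, type = "prob")[, "TRUE"]
roc_te <- roc(response = act_te, predictor = prob_te_dt)
auc(roc_te)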
# plotting
plot(dt_prune, margin = 0.1)
text(dt_prune, use.n = T)
fancyRpartPlot(dt_prune, cex = 1) #fancy tree
dt_prune$variable.importance
## rate_marriage
## 187.27219167
## yrs_married
## 14.21489739
## age
## 7.61685628
## children
## 5.09407586
## occupation_husb
## 0.09703002
## X.Users.mac.codebox_modified.dev.data.analytics.project.love.affair.love.affair.Rmd1.1
## 0.09703002
barplot(dt_prune$variable.importance) # the largest importance is about 187, so let barplot choose the y-axis range
• rate_marriage : the wife's rating of the marriage (satisfaction)
• yrs_married : years married
• age : the wife's age
• occupation : the wife's occupation
• children : number of children
• educ : the wife's level of education
• occupation_husb : the husband's occupation
=> These are the variables ranked as most important in determining whether a woman in the 1978 survey was involved in an affair.