In 1978, a survey of housewives was conducted to identify the factors that lead to extramarital affairs. The study became the basis for much later research on both men and women, and it tried to focus on the characteristics of marriages and of the people who end up looking for a partner outside of them, without their spouse's knowledge.
Supervised learning is not always about prediction. In this case study, the goal is to identify a few of the many factors that appear to matter most when someone has an affair.
The data set is provided by the Python statsmodels package; you can download it from statsmodels.
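The CSV read below can be produced from statsmodels (its "fair" data set). One way to export it from R, assuming Python and statsmodels are installed and reachable by reticulate, is sketched here:
# sketch: export the statsmodels "fair" data set to the CSV used below
library(reticulate)
sm <- import("statsmodels.api")
fair <- sm$datasets$fair$load_pandas()$data
write.csv(fair, "./input/affair.csv")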
suppressMessages(library(dplyr))        # data manipulation
suppressMessages(library(readxl))       # Excel import
suppressMessages(library(doBy))         # grouped summaries
suppressMessages(library(fmsb))         # radar chart
suppressMessages(library(ggplot2))      # visualization
suppressMessages(library(corrplot))     # correlation plot
suppressMessages(library(VIM))          # missing data detection
suppressMessages(library(DMwR))         # outlier detection
suppressMessages(library(PerformanceAnalytics)) # correlation chart
suppressMessages(library(rpart))        # decision tree
suppressMessages(library(C50))          # decision tree
suppressMessages(library(rattle))       # fancy decision tree plot
suppressMessages(library(rpart.plot))   # fancy decision tree plot
suppressMessages(library(RColorBrewer)) # fancy decision tree plot
suppressMessages(library(ipred))        # bagging
suppressMessages(library(randomForest)) # random forest
suppressMessages(library(adabag))       # adaptive boosting
suppressMessages(library(lme4))         # dummy function
suppressMessages(library(caret))        # standardization, model utilities
suppressMessages(library(class))        # KNN
suppressMessages(library(VennDiagram))  # Venn diagram
suppressMessages(library(neuralnet))    # ANN
suppressMessages(library(e1071))        # SVM
suppressMessages(library(ROCR))         # ROC
suppressMessages(library(mctest))       # multicollinearity diagnostics
suppressMessages(library(dummies))      # dummy variable creation
suppressMessages(library(Information))  # information value / WOE
suppressMessages(library(pROC))         # ROC / AUC
suppressMessages(library(gmodels))      # CrossTable
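At the time of writing, a couple of these packages (notably DMwR and dummies) appear to have been archived from CRAN. If install.packages() cannot find them, they can usually still be installed from the CRAN archive; a sketch (the version numbers are the last archived releases I am aware of, so adjust if needed):
# sketch: install archived packages from the CRAN archive via remotes
install.packages("remotes")
remotes::install_version("DMwR", version = "0.4.1")
remotes::install_version("dummies", version = "1.5.6")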
affairs_df <- read.csv("./input/affair.csv")
affairs_df <- affairs_df[-1]
str(affairs_df)
## 'data.frame': 6366 obs. of 9 variables:
## $ rate_marriage : num 3 3 4 4 5 4 5 5 3 3 ...
## $ age : num 32 27 22 37 27 27 37 37 22 27 ...
## $ yrs_married : num 9 13 2.5 16.5 9 9 23 23 2.5 6 ...
## $ children : num 3 3 0 4 1 0 5.5 5.5 0 0 ...
## $ religious : num 3 1 1 3 1 2 2 2 2 1 ...
## $ educ : num 17 14 16 16 14 14 12 12 12 16 ...
## $ occupation : num 2 3 3 5 3 3 5 2 3 3 ...
## $ occupation_husb: num 5 4 5 5 4 4 4 3 3 5 ...
## $ affairs : num 0.111 3.231 1.4 0.727 4.667 ...
head(affairs_df)
## rate_marriage age yrs_married children religious educ occupation
## 1 3 32 9.0 3 3 17 2
## 2 3 27 13.0 3 1 14 3
## 3 4 22 2.5 0 1 16 3
## 4 4 37 16.5 4 3 16 5
## 5 5 27 9.0 1 1 14 3
## 6 4 27 9.0 0 2 14 3
## occupation_husb affairs
## 1 5 0.1111111
## 2 4 3.2307692
## 3 5 1.3999996
## 4 5 0.7272727
## 5 4 4.6666660
## 6 4 4.6666660
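Some of the packages loaded above (e.g. VIM) are there for missing-data detection. A quick check, as a minimal sketch, confirms whether any values are missing before moving on:
# sketch: count missing values per column and visualize them with VIM
colSums(is.na(affairs_df))
aggr(affairs_df, prop = FALSE, numbers = TRUE)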
• rate_marriage : the wife's rating of the marriage; ordinal level
- 1 = very poor, 2 = poor, 3 = fair, 4 = good, 5 = very good
• age : the wife's age; ratio level
• yrs_married : number of years married; ratio level
• children : number of children between husband and wife; ratio level
• religious : how religious the wife is; ordinal level
- 1 = not, 2 = mildly, 3 = fairly, 4 = strongly
• educ : level of education; ratio level
- 9 = grade school, 12 = high school, 14 = college, 16 = college graduate, 17 = graduate school, 20 = advanced degree
• occupation : the wife's occupation; nominal level
- 1 = student
- 2 = farming, semi-skilled, or unskilled worker
- 3 = white collar
- 4 = teacher, counselor, social worker, nurse, artist, writer, technician, skilled worker
- 5 = managerial, administrative, business
- 6 = professional with an advanced degree
• occupation_husb : the husband's occupation; same coding as occupation; nominal level
• affairs : a measure of time spent in extramarital affairs; ratio level
The responses are quantitative, but the question is which factors lead a person to have an affair; the exact number of minutes or hours matters little. For that reason, let's create a new categorical variable called affair_binary, which is TRUE (more than 0 minutes of affairs) or FALSE (0 minutes of affairs).
# Create a categorical variable
affairs_df$affair_binary <- affairs_df$affairs > 0
table(affairs_df$affair_binary)
##
## FALSE TRUE
## 4313 2053
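Since ggplot2 is already loaded, the class balance can also be shown as a quick bar chart; a minimal sketch:
# sketch: visualize the FALSE/TRUE balance of the new target variable
ggplot(affairs_df, aes(x = affair_binary)) +
  geom_bar(fill = "steelblue") +
  labs(x = "affair_binary", y = "count")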
# find linear correlations between variables and affair_binary
# correlation matrix
data_corr <- affairs_df
# coerce every column (including the logical affair_binary) to integer so cor() accepts it;
# note that this truncates the fractional affairs values
for (i in 1:ncol(data_corr)) {
  data_corr[, i] <- as.integer(data_corr[, i])
}
cor(data_corr)
## rate_marriage age yrs_married children
## rate_marriage 1.00000000 -0.11116426 -0.13018925 -0.13108300
## age -0.11116426 1.00000000 0.89295708 0.67750331
## yrs_married -0.13018925 0.89295708 1.00000000 0.77839473
## children -0.13108300 0.67750331 0.77839473 1.00000000
## religious 0.07879426 0.13633078 0.13156771 0.14176023
## educ 0.07986881 0.02906597 -0.10716905 -0.14329400
## occupation 0.03952804 0.10672123 0.04265341 -0.01505493
## occupation_husb 0.02774453 0.16298952 0.12989878 0.08810300
## affairs -0.15253690 -0.11282504 -0.11577227 -0.09040363
## affair_binary -0.33177597 0.14689599 0.20486231 0.16321222
## religious educ occupation occupation_husb
## rate_marriage 0.078794261 0.07986881 0.0395280396 0.027744525
## age 0.136330784 0.02906597 0.1067212260 0.162989521
## yrs_married 0.131567714 -0.10716905 0.0426534104 0.129898782
## children 0.141760227 -0.14329400 -0.0150549307 0.088102999
## religious 1.000000000 0.03224460 0.0357464670 0.004060543
## educ 0.032244597 1.00000000 0.3822863399 0.183932372
## occupation 0.035746467 0.38228634 1.0000000000 0.201155871
## occupation_husb 0.004060543 0.18393237 0.2011558711 1.000000000
## affairs -0.118156123 -0.01107978 0.0005584221 -0.017982213
## affair_binary -0.129299436 -0.07528010 0.0289813052 0.017637479
## affairs affair_binary
## rate_marriage -0.1525369013 -0.33177597
## age -0.1128250432 0.14689599
## yrs_married -0.1157722671 0.20486231
## children -0.0904036320 0.16321222
## religious -0.1181561228 -0.12929944
## educ -0.0110797815 -0.07528010
## occupation 0.0005584221 0.02898131
## occupation_husb -0.0179822126 0.01763748
## affairs 1.0000000000 0.38841462
## affair_binary 0.3884146226 1.00000000
corrplot(cor(data_corr), order="hclust", title="Correlation", addrect=2, method="number") #visualization
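PerformanceAnalytics was loaded above for a correlation chart; as an alternative view of the same relationships, a minimal sketch (it draws pairwise scatter plots, so it can be slow for 6,366 rows):
# sketch: pairwise scatter plots, histograms and correlation coefficients in one figure
chart.Correlation(data_corr, histogram = TRUE)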
We can see the top four variables correlated with the dependent variable:
• affairs
• age
• yrs_married
• children
The affairs variable shows the strongest correlation, but only because affair_binary was derived directly from affairs, so the two are correlated by construction. We can therefore ignore it.
# one-hot encode the two nominal variables with dummy() from the dummies package
affair_dummy <- data.frame(affairs_df, dummy(affairs_df$occupation))
affair_dummy <- data.frame(affair_dummy, dummy(affair_dummy$occupation_husb))
head(affair_dummy)
## rate_marriage age yrs_married children religious educ occupation
## 1 3 32 9.0 3 3 17 2
## 2 3 27 13.0 3 1 14 3
## 3 4 22 2.5 0 1 16 3
## 4 4 37 16.5 4 3 16 5
## 5 5 27 9.0 1 1 14 3
## 6 4 27 9.0 0 2 14 3
## occupation_husb affairs affair_binary
## 1 5 0.1111111 TRUE
## 2 4 3.2307692 TRUE
## 3 5 1.3999996 TRUE
## 4 5 0.7272727 TRUE
## 5 4 4.6666660 TRUE
## 6 4 4.6666660 TRUE
## X.Users.mac.codebox_modified.dev.data.analytics.project.love.affair.love.affair.Rmd1
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
## X.Users.mac.codebox_modified.dev.data.analytics.project.love.affair.love.affair.Rmd2
## 1 1
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
## X.Users.mac.codebox_modified.dev.data.analytics.project.love.affair.love.affair.Rmd3
## 1 0
## 2 1
## 3 1
## 4 0
## 5 1
## 6 1
## X.Users.mac.codebox_modified.dev.data.analytics.project.love.affair.love.affair.Rmd4
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
## X.Users.mac.codebox_modified.dev.data.analytics.project.love.affair.love.affair.Rmd5
## 1 0
## 2 0
## 3 0
## 4 1
## 5 0
## 6 0
## X.Users.mac.codebox_modified.dev.data.analytics.project.love.affair.love.affair.Rmd6
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
## X.Users.mac.codebox_modified.dev.data.analytics.project.love.affair.love.affair.Rmd1.1
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
## X.Users.mac.codebox_modified.dev.data.analytics.project.love.affair.love.affair.Rmd2.1
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
## X.Users.mac.codebox_modified.dev.data.analytics.project.love.affair.love.affair.Rmd3.1
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
## X.Users.mac.codebox_modified.dev.data.analytics.project.love.affair.love.affair.Rmd4.1
## 1 0
## 2 1
## 3 0
## 4 0
## 5 1
## 6 1
## X.Users.mac.codebox_modified.dev.data.analytics.project.love.affair.love.affair.Rmd5.1
## 1 1
## 2 0
## 3 1
## 4 1
## 5 0
## 6 0
## X.Users.mac.codebox_modified.dev.data.analytics.project.love.affair.love.affair.Rmd6.1
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
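As the output above shows, dummy() from the dummies package derives the dummy column names from the path of the calling .Rmd file, which makes them hard to read. The analysis below keeps those columns, but if readable names are preferred, base R's model.matrix() is one alternative; a minimal sketch (the occ_wife_/occ_husb_ prefixes are just illustrative names):
# sketch: one-hot encode the nominal occupation variables with readable names
occ_w <- model.matrix(~ factor(occupation) - 1, data = affairs_df)
occ_h <- model.matrix(~ factor(occupation_husb) - 1, data = affairs_df)
colnames(occ_w) <- paste0("occ_wife_", levels(factor(affairs_df$occupation)))
colnames(occ_h) <- paste0("occ_husb_", levels(factor(affairs_df$occupation_husb)))
affair_dummy_alt <- data.frame(affairs_df, occ_w, occ_h)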
# create a random sample for training and test data
# use set.seed to use the same random number sequence as the tutorial
set.seed(123)
train_sample <- sample(6366, 5092)
str(train_sample)
## int [1:5092] 1831 5018 2603 5619 5984 290 3359 5675 3507 2903 ...
# split the data frames, dropping column 9 (the continuous affairs variable)
data_train <- affair_dummy[train_sample, -9]
data_test <- affair_dummy[-train_sample, -9]
# check the proportion of class variable
dim(data_train)
## [1] 5092 21
prop.table(table(data_train$affair_binary))
##
## FALSE TRUE
## 0.6812647 0.3187353
dim(data_test)
## [1] 1274 21
prop.table(table(data_test$affair_binary))
##
## FALSE TRUE
## 0.6624804 0.3375196
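The random split above happens to preserve the class proportions reasonably well (about 32% vs. 34% TRUE). If an exactly stratified split is preferred, caret's createDataPartition() can be used instead; a minimal sketch (not used below):
# sketch: stratified 80/20 split on affair_binary as an alternative to sample()
set.seed(123)
idx <- createDataPartition(factor(affair_dummy$affair_binary), p = 0.8, list = FALSE)
strat_train <- affair_dummy[idx, -9]
strat_test <- affair_dummy[-idx, -9]
prop.table(table(strat_train$affair_binary))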
## Training
dt <- rpart(as.factor(affair_binary) ~ ., data = data_train, cp = 0.1^20) # use all variables and grow a full tree
xerror_min_which <- which.min(dt$cptable[, "xerror"])
xerror_min <- min(dt$cptable[, "xerror"])
printcp(dt) # print the cp table
##
## Classification tree:
## rpart(formula = as.factor(affair_binary) ~ ., data = data_train,
## cp = 0.1^20)
##
## Variables actually used in tree construction:
## [1] age
## [2] children
## [3] educ
## [4] occupation
## [5] occupation_husb
## [6] rate_marriage
## [7] religious
## [8] X.Users.mac.codebox_modified.dev.data.analytics.project.love.affair.love.affair.Rmd2
## [9] X.Users.mac.codebox_modified.dev.data.analytics.project.love.affair.love.affair.Rmd2.1
## [10] X.Users.mac.codebox_modified.dev.data.analytics.project.love.affair.love.affair.Rmd3
## [11] X.Users.mac.codebox_modified.dev.data.analytics.project.love.affair.love.affair.Rmd3.1
## [12] X.Users.mac.codebox_modified.dev.data.analytics.project.love.affair.love.affair.Rmd4
## [13] X.Users.mac.codebox_modified.dev.data.analytics.project.love.affair.love.affair.Rmd4.1
## [14] X.Users.mac.codebox_modified.dev.data.analytics.project.love.affair.love.affair.Rmd5
## [15] X.Users.mac.codebox_modified.dev.data.analytics.project.love.affair.love.affair.Rmd5.1
## [16] yrs_married
##
## Root node error: 1623/5092 = 0.31874
##
## n= 5092
##
## CP nsplit rel error xerror xstd
## 1 1.0228e-01 0 1.00000 1.00000 0.020488
## 2 2.1565e-02 1 0.89772 0.89772 0.019871
## 3 4.3130e-03 2 0.87616 0.88417 0.019780
## 4 3.6969e-03 3 0.87184 0.89156 0.019830
## 5 2.4646e-03 4 0.86815 0.90265 0.019904
## 6 1.8484e-03 9 0.85582 0.89649 0.019863
## 7 1.5404e-03 26 0.81023 0.90696 0.019932
## 8 1.4377e-03 32 0.80099 0.90450 0.019916
## 9 1.3093e-03 35 0.79667 0.90450 0.019916
## 10 1.2323e-03 45 0.78065 0.90943 0.019948
## 11 9.2421e-04 58 0.76340 0.91251 0.019968
## 12 8.2152e-04 66 0.75601 0.91374 0.019976
## 13 6.1614e-04 72 0.75108 0.91436 0.019980
## 14 4.9291e-04 95 0.73629 0.93284 0.020096
## 15 4.6211e-04 101 0.73259 0.94208 0.020153
## 16 3.8509e-04 109 0.72890 0.94701 0.020183
## 17 3.5208e-04 126 0.72089 0.94824 0.020191
## 18 3.0807e-04 134 0.71781 0.97043 0.020322
## 19 2.4646e-04 147 0.71349 0.97227 0.020333
## 20 2.0538e-04 152 0.71226 0.97289 0.020336
## 21 1.2323e-04 164 0.70980 0.97843 0.020368
## 22 1.0000e-20 169 0.70918 0.97905 0.020371
plotcp(dt) # plot the cross-validated error for each cp value
abline(v = xerror_min_which, lty = 2, col = "red")
text(xerror_min_which, xerror_min, labels = round(xerror_min_which, 2), pos = 3, col = "red") # mark the tree with the minimum xerror
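The pruning step below uses the cp value with the minimum cross-validated error. A slightly more conservative alternative sometimes used is the 1-SE rule (pick the simplest tree whose xerror is within one standard error of the minimum); a sketch, not used in what follows:
# sketch: 1-SE rule as an alternative pruning criterion
cp_tab <- dt$cptable
one_se <- cp_tab[xerror_min_which, "xerror"] + cp_tab[xerror_min_which, "xstd"]
cp_1se <- cp_tab[which(cp_tab[, "xerror"] <= one_se)[1], "CP"]
dt_prune_1se <- prune(dt, cp = cp_1se)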
# pruning
dt_prune <- prune(dt, cp = dt$cptable[which.min(dt$cptable[, "xerror"]), "CP"])
# training accuracy
pred_tr_dt <- predict(dt_prune, type = "class") # predict class labels on the training data
t_tr_dt <- table(pred_tr_dt, data_train$affair_binary) # confusion matrix
t_tr_dt
##
## pred_tr_dt FALSE TRUE
## FALSE 3156 1109
## TRUE 313 514
acc_tr_dt <- sum(diag(t_tr_dt)) / sum(t_tr_dt) # accuracy
acc_tr_dt
## [1] 0.7207384
CrossTable(x = data_train$affair_binary, y = pred_tr_dt, prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 5092
##
##
## | pred_tr_dt
## data_train$affair_binary | FALSE | TRUE | Row Total |
## -------------------------|-----------|-----------|-----------|
## FALSE | 3156 | 313 | 3469 |
## | 0.910 | 0.090 | 0.681 |
## | 0.740 | 0.378 | |
## | 0.620 | 0.061 | |
## -------------------------|-----------|-----------|-----------|
## TRUE | 1109 | 514 | 1623 |
## | 0.683 | 0.317 | 0.319 |
## | 0.260 | 0.622 | |
## | 0.218 | 0.101 | |
## -------------------------|-----------|-----------|-----------|
## Column Total | 4265 | 827 | 5092 |
## | 0.838 | 0.162 | |
## -------------------------|-----------|-----------|-----------|
##
##
# test accuracy
pred_te_dt <- predict(dt_prune, data_test, type = "class")
t_te_dt <- table(pred_te_dt, data_test$affair_binary)
t_te_dt
##
## pred_te_dt FALSE TRUE
## FALSE 770 278
## TRUE 74 152
acc_te_dt <- sum(diag(t_te_dt)) / sum(t_te_dt) # accuracy
acc_te_dt
## [1] 0.7237049
CrossTable(x = data_test$affair_binary, y = pred_te_dt, prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 1274
##
##
## | pred_te_dt
## data_test$affair_binary | FALSE | TRUE | Row Total |
## ------------------------|-----------|-----------|-----------|
## FALSE | 770 | 74 | 844 |
## | 0.912 | 0.088 | 0.662 |
## | 0.735 | 0.327 | |
## | 0.604 | 0.058 | |
## ------------------------|-----------|-----------|-----------|
## TRUE | 278 | 152 | 430 |
## | 0.647 | 0.353 | 0.338 |
## | 0.265 | 0.673 | |
## | 0.218 | 0.119 | |
## ------------------------|-----------|-----------|-----------|
## Column Total | 1048 | 226 | 1274 |
## | 0.823 | 0.177 | |
## ------------------------|-----------|-----------|-----------|
##
##
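Accuracy alone can be misleading here because the classes are imbalanced (about one third TRUE). Since caret and pROC are already loaded, further metrics can be computed from the same test predictions; a minimal sketch:
# sketch: sensitivity, specificity, kappa and AUC on the test set
act_te <- factor(data_test$affair_binary)
confusionMatrix(pred_te_dt, act_te, positive = "TRUE")
prob_te_dt <- predict(dt_prune, data_test, type = "prob")[, "TRUE"]
roc_te <- roc(response = act_te, predictor = prob_te_dt)
auc(roc_te)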
# plotting
plot(dt_prune, margin = 0.1)
text(dt_prune, use.n = T)
fancyRpartPlot(dt_prune, cex = 1) #fancy tree
dt_prune$variable.importance
## rate_marriage
## 187.27219167
## yrs_married
## 14.21489739
## age
## 7.61685628
## children
## 5.09407586
## occupation_husb
## 0.09703002
## X.Users.mac.codebox_modified.dev.data.analytics.project.love.affair.love.affair.Rmd1.1
## 0.09703002
barplot(dt_prune$variable.importance) # the largest importance is about 187, so let barplot choose the y-axis range
• rate_marriage : the wife's rating of the marriage (satisfaction)
• yrs_married : years married
• age : the wife's age
• occupation : the wife's occupation
• children : number of children
• educ : the wife's level of education
• occupation_husb : the husband's occupation
=> These are the variables ranked as most important in determining whether a woman in the 1978 survey was involved in an affair.