Nayana Davis
How did you develop your question and what relevant research has already been completed on this topic?
Infidelity is a topic where people are not exactly going to be the most forthcoming if they're commited the act. As such, I wasn't able to glean much valuable insight on the drivers for extramarital affairs beyond basic demographic markers.
{r} [NPR reports] (https://www.npr.org/2015/07/26/426434619/sorting-through-the-numbers-on-infidelity) that data from Ashley Madison revealed that men approaching their next decade (ie 29, 39, etc) were more likely to be on the site. {r} [The General Social Survey] (https://ifstudies.org/blog/who-cheats-more-the-demographics-of-cheating-in-america), conducted by Institute for Family Studies found that Democrats, those who grew up in broken homes, and those who did not attend many religious services where more likely to take part in adultery.
Although, the Extramarital Affairs Data set used in a survey conducted by Yale's Ray C. Fair is older -- 1977, it did provide additional factors that I suspected could contribute to having an extramarital affair, such as rating the participant's happiness in the marriage, if children were involved and years married. This prompted me to wonder if there was a factor evaluated that would allow for prediction of infidelity.
As previously mentioned, I used the {r} [Fairs Extramarital Affairs Data set] (https://www.npr.org/2015/07/26/426434619/sorting-through-the-numbers-on-infidelity) to conduct my analysis. I converted many of the predictor variables into factors from integers since they were categorical and ordinal values. I actually got rid of the affairs column, which reported the number of times a participant had sexual intercourse outside of marriage and instead made a new column called hadaffair, which is set to "yes" if the number in the old affairs column was greater than or equal to 1. I did this because I thought the old column was a strange metric since it doesn't reveal if there were multiple partners or time span -- I believed the new one I created is more straightforward.
setwd("/Users/nayanadavis/Desktop/R/Intro to DS/Final Project/")
affair_data <- data.frame(read.csv("./Fair.csv", header = TRUE))
str(affair_data)## 'data.frame': 601 obs. of 10 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ sex : Factor w/ 2 levels "female","male": 2 1 1 2 2 1 1 2 1 2 ...
## $ age : num 37 27 32 57 22 32 22 57 32 22 ...
## $ ym : num 10 4 15 15 0.75 1.5 0.75 15 15 1.5 ...
## $ child : Factor w/ 2 levels "no","yes": 1 1 2 2 1 1 1 2 2 1 ...
## $ religious : int 3 4 1 5 2 2 2 2 4 4 ...
## $ education : int 18 14 12 18 17 17 12 14 16 14 ...
## $ occupation: int 7 6 1 6 6 5 1 4 1 4 ...
## $ rate : int 4 4 4 5 3 5 3 4 2 5 ...
## $ nbaffairs : int 0 0 0 0 0 0 0 0 0 0 ...
# added column to say if person had at least one affair
affair_data <- within(affair_data, {
hadaffair = ifelse(nbaffairs >= 1, "yes", "no")
})
# removed number of affairs column
affair_data<- affair_data[ -c(10) ]
affair_data$age<- factor(affair_data$age)
affair_data$ym <- factor(affair_data$ym)
affair_data$religious <- factor(affair_data$religious)
affair_data$education <- factor(affair_data$education)
affair_data$occupation <- factor(affair_data$occupation)
affair_data$rate <- factor(affair_data$rate)
affair_data$hadaffair <- factor(affair_data$hadaffair)str(affair_data)## 'data.frame': 601 obs. of 10 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ sex : Factor w/ 2 levels "female","male": 2 1 1 2 2 1 1 2 1 2 ...
## $ age : Factor w/ 9 levels "17.5","22","27",..: 5 3 4 9 2 4 2 9 4 2 ...
## $ ym : Factor w/ 8 levels "0.125","0.417",..: 7 5 8 8 3 4 3 8 8 4 ...
## $ child : Factor w/ 2 levels "no","yes": 1 1 2 2 1 1 1 2 2 1 ...
## $ religious : Factor w/ 5 levels "1","2","3","4",..: 3 4 1 5 2 2 2 2 4 4 ...
## $ education : Factor w/ 7 levels "9","12","14",..: 6 3 2 6 5 5 2 3 4 3 ...
## $ occupation: Factor w/ 7 levels "1","2","3","4",..: 7 6 1 6 6 5 1 4 1 4 ...
## $ rate : Factor w/ 5 levels "1","2","3","4",..: 4 4 4 5 3 5 3 4 2 5 ...
## $ hadaffair : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
head(affair_data)## X sex age ym child religious education occupation rate hadaffair
## 1 1 male 37 10 no 3 18 7 4 no
## 2 2 female 27 4 no 4 14 6 4 no
## 3 3 female 32 15 yes 1 12 1 4 no
## 4 4 male 57 15 yes 5 18 6 5 no
## 5 5 male 22 0.75 no 2 17 6 3 no
## 6 6 female 32 1.5 no 2 17 5 5 no
#check for averages in continous variables, counts for categorical
summary(affair_data)## X sex age ym child
## Min. : 1 female:315 27 :153 15 :204 no :171
## 1st Qu.:151 male :286 22 :117 4 :105 yes:430
## Median :301 32 :115 1.5 : 88
## Mean :301 37 : 88 7 : 82
## 3rd Qu.:451 42 : 56 10 : 70
## Max. :601 47 : 23 0.75 : 31
## (Other): 49 (Other): 21
## religious education occupation rate hadaffair
## 1: 48 9 : 7 1:113 1: 16 no :451
## 2:164 12: 44 2: 13 2: 66 yes:150
## 3:129 14:154 3: 47 3: 93
## 4:190 16:115 4: 68 4:194
## 5: 70 17: 89 5:204 5:232
## 18:112 6:143
## 20: 80 7: 13
plot(affair_data$hadaffair)# check for missing values
library(Amelia)## Loading required package: Rcpp
## ##
## ## Amelia II: Multiple Imputation
## ## (Version 1.7.5, built: 2018-05-07)
## ## Copyright (C) 2005-2018 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
missmap(affair_data, main = "Missing vs observed values") 3. How did you select and determine the correct model to answer your question? Since the question I want to answer was if certain factors could predict if a person was going to participate in an extramarital affair not, I thought a logistic regression model would be best. I wanted to know if specific responses could point to infidelity, which, if you've committed the act is just a yes or no answer.
library(caret)## Loading required package: lattice
## Loading required package: ggplot2
trainIndex <- createDataPartition(affair_data$hadaffair, p=.66, list=FALSE,times = 1)
train <- affair_data[trainIndex, ]
test <- affair_data[-trainIndex, ]
hadaffairlogit <- glm(hadaffair~sex+age+ym+child+religious+education+occupation+rate, family = binomial(link = "logit"), data = train)
# Estimate: the log odds amount in which hadaffair would increase if the covariate was one unit higher
# Intercept: values when covariates are all 0
summary(hadaffairlogit)##
## Call:
## glm(formula = hadaffair ~ sex + age + ym + child + religious +
## education + occupation + rate, family = binomial(link = "logit"),
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.80691 -0.70616 -0.42314 -0.00046 2.77275
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -11.83284 827.94767 -0.014 0.988597
## sexmale 0.51785 0.35605 1.454 0.145831
## age22 -2.71799 1.16624 -2.331 0.019776 *
## age27 -3.02774 1.20859 -2.505 0.012239 *
## age32 -2.72557 1.21895 -2.236 0.025352 *
## age37 -3.25788 1.30758 -2.492 0.012719 *
## age42 -3.53115 1.32935 -2.656 0.007900 **
## age47 -2.88979 1.44859 -1.995 0.046055 *
## age52 -2.35739 1.44236 -1.634 0.102176
## age57 -5.02536 1.57435 -3.192 0.001413 **
## ym0.417 -0.63253 1.77298 -0.357 0.721270
## ym0.75 -0.30413 1.53382 -0.198 0.842823
## ym1.5 -0.53658 1.36277 -0.394 0.693770
## ym4 0.20331 1.33831 0.152 0.879254
## ym7 0.52731 1.36773 0.386 0.699842
## ym10 1.08263 1.37796 0.786 0.432057
## ym15 1.08468 1.37727 0.788 0.430953
## childyes 0.36275 0.43817 0.828 0.407741
## religious2 -1.35944 0.50497 -2.692 0.007100 **
## religious3 -0.51550 0.49905 -1.033 0.301619
## religious4 -1.67986 0.50284 -3.341 0.000836 ***
## religious5 -1.56811 0.60406 -2.596 0.009433 **
## education12 14.44728 827.94536 0.017 0.986078
## education14 14.05038 827.94531 0.017 0.986460
## education16 14.26372 827.94536 0.017 0.986255
## education17 15.28214 827.94537 0.018 0.985274
## education18 14.83120 827.94535 0.018 0.985708
## education20 14.21861 827.94542 0.017 0.986298
## occupation2 1.04662 1.02018 1.026 0.304930
## occupation3 1.02466 0.56499 1.814 0.069740 .
## occupation4 0.75553 0.55738 1.356 0.175255
## occupation5 0.03996 0.47027 0.085 0.932289
## occupation6 0.37683 0.55134 0.683 0.494308
## occupation7 0.54877 0.88389 0.621 0.534694
## rate2 0.40693 0.85946 0.473 0.635874
## rate3 -0.97356 0.86390 -1.127 0.259770
## rate4 -1.15331 0.83014 -1.389 0.164744
## rate5 -1.78873 0.85964 -2.081 0.037454 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 445.94 on 396 degrees of freedom
## Residual deviance: 355.28 on 359 degrees of freedom
## AIC: 431.28
##
## Number of Fisher Scoring iterations: 14
#There was no positive correlation between any of the predictor variables with a low enough p-value that would allow us to reject the null hypothesis
#Kept religion and age since it showed the lowest p-values, indicating we can more confidently reject the null hypothesis.
hadaffairlogit <- glm(hadaffair~age + religious + rate, family = binomial(link = "logit"), data = train)
summary(hadaffairlogit)##
## Call:
## glm(formula = hadaffair ~ age + religious + rate, family = binomial(link = "logit"),
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6224 -0.7001 -0.4856 -0.2268 2.5056
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.4871 1.2810 2.722 0.006484 **
## age22 -3.0515 1.0212 -2.988 0.002806 **
## age27 -2.5742 0.9985 -2.578 0.009936 **
## age32 -1.7877 0.9904 -1.805 0.071071 .
## age37 -2.0795 1.0217 -2.035 0.041817 *
## age42 -2.2544 1.0391 -2.170 0.030039 *
## age47 -1.4405 1.1522 -1.250 0.211218
## age52 -1.3368 1.1403 -1.172 0.241048
## age57 -3.6047 1.2693 -2.840 0.004512 **
## religious2 -1.3484 0.4669 -2.888 0.003878 **
## religious3 -0.5496 0.4711 -1.167 0.243355
## religious4 -1.5661 0.4710 -3.325 0.000885 ***
## religious5 -1.5318 0.5741 -2.668 0.007629 **
## rate2 0.2020 0.7677 0.263 0.792482
## rate3 -1.0735 0.7736 -1.388 0.165218
## rate4 -1.2462 0.7383 -1.688 0.091430 .
## rate5 -1.9641 0.7658 -2.565 0.010328 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 445.94 on 396 degrees of freedom
## Residual deviance: 378.33 on 380 degrees of freedom
## AIC: 412.33
##
## Number of Fisher Scoring iterations: 5
#Removed age for a stronger model
hadaffairlogit <- glm(hadaffair~religious + rate, family = binomial(link = "logit"), data = train)
summary(hadaffairlogit)##
## Call:
## glm(formula = hadaffair ~ religious + rate, family = binomial(link = "logit"),
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4891 -0.6977 -0.6188 -0.4493 2.1648
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.15770 0.78509 1.475 0.14032
## religious2 -1.20341 0.44826 -2.685 0.00726 **
## religious3 -0.42447 0.45224 -0.939 0.34794
## religious4 -1.37783 0.45271 -3.044 0.00234 **
## religious5 -1.23847 0.55023 -2.251 0.02440 *
## rate2 -0.02501 0.75180 -0.033 0.97347
## rate3 -1.15502 0.75703 -1.526 0.12708
## rate4 -1.33561 0.72362 -1.846 0.06493 .
## rate5 -2.02222 0.73607 -2.747 0.00601 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 445.94 on 396 degrees of freedom
## Residual deviance: 401.06 on 388 degrees of freedom
## AIC: 419.06
##
## Number of Fisher Scoring iterations: 4
modeloutput <- exp(coef(hadaffairlogit))
modeloutput## (Intercept) religious2 religious3 religious4 religious5 rate2
## 3.1825964 0.3001701 0.6541137 0.2521245 0.2898285 0.9753045
## rate3 rate4 rate5
## 0.3150520 0.2629986 0.1323619
library(ResourceSelection)## ResourceSelection 0.3-2 2017-02-28
hoslem.test(train$hadaffair, fitted(hadaffairlogit))## Warning in Ops.factor(1, y): '-' not meaningful for factors
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: train$hadaffair, fitted(hadaffairlogit)
## X-squared = 397, df = 8, p-value < 2.2e-16
pred.model <- predict.glm(hadaffairlogit,test,type='response')
pred.model ## 6 13 17 20 22 27
## 0.11225376 0.19523046 0.23134606 0.11225376 0.11225376 0.09601139
## 30 35 37 42 46 51
## 0.35379859 0.45563963 0.35379859 0.22517033 0.21602347 0.10880715
## 52 55 56 59 61 62
## 0.20079784 0.29639630 0.44518749 0.17425859 0.47358109 0.35379859
## 65 74 77 82 85 87
## 0.17425859 0.20178871 0.21602347 0.09601139 0.09601139 0.09601139
## 88 91 92 93 95 97
## 0.67000721 0.17425859 0.23134606 0.17425859 0.45563963 0.67551218
## 98 99 102 106 107 112
## 0.45563963 0.21602347 0.20079784 0.17425859 0.17425859 0.09601139
## 114 115 116 125 127 128
## 0.19523046 0.35379859 0.09601139 0.21602347 0.11225376 0.09601139
## 132 134 136 148 151 152
## 0.09601139 0.23134606 0.35379859 0.22517033 0.20178871 0.23134606
## 157 159 161 163 166 168
## 0.47981881 0.20178871 0.11225376 0.10880715 0.43902001 0.17425859
## 169 178 179 180 181 183
## 0.35379859 0.20079784 0.20178871 0.17425859 0.47358109 0.39608748
## 184 185 187 189 190 191
## 0.09601139 0.21602347 0.21602347 0.09601139 0.29639630 0.21602347
## 196 197 199 204 207 209
## 0.45563963 0.20178871 0.43902001 0.23134606 0.43902001 0.20079784
## 215 223 226 229 234 239
## 0.11225376 0.09601139 0.29639630 0.09601139 0.17425859 0.21602347
## 240 241 244 246 253 255
## 0.17425859 0.20079784 0.21602347 0.35379859 0.09601139 0.29639630
## 259 260 262 264 265 269
## 0.17425859 0.11225376 0.20178871 0.11225376 0.39608748 0.09601139
## 270 271 273 282 285 287
## 0.21602347 0.09601139 0.23134606 0.17425859 0.35379859 0.21602347
## 289 290 293 294 297 298
## 0.67000721 0.20079784 0.19523046 0.20178871 0.09601139 0.11225376
## 306 308 309 315 317 327
## 0.20079784 0.22517033 0.20178871 0.22517033 0.11225376 0.35379859
## 328 330 335 340 349 350
## 0.11225376 0.20178871 0.21602347 0.11225376 0.20178871 0.47358109
## 351 352 355 356 357 360
## 0.11225376 0.21602347 0.09601139 0.11225376 0.39608748 0.21602347
## 363 366 368 373 375 377
## 0.19523046 0.21602347 0.09601139 0.11225376 0.09601139 0.21602347
## 378 379 381 382 384 390
## 0.35379859 0.21602347 0.29639630 0.09601139 0.17425859 0.35379859
## 394 399 401 402 404 405
## 0.67000721 0.09601139 0.10880715 0.09601139 0.23134606 0.44518749
## 406 407 412 413 417 421
## 0.09601139 0.20079784 0.11225376 0.20178871 0.09601139 0.09601139
## 424 427 429 430 434 438
## 0.19523046 0.29639630 0.22517033 0.17425859 0.10880715 0.17425859
## 446 449 450 459 464 465
## 0.09601139 0.48232882 0.35379859 0.20079784 0.47981881 0.48232882
## 466 467 469 470 472 473
## 0.67000721 0.29639630 0.21602347 0.09601139 0.35379859 0.19523046
## 476 481 484 491 497 501
## 0.17425859 0.17425859 0.09601139 0.23134606 0.44518749 0.19523046
## 502 504 510 511 512 514
## 0.20079784 0.11225376 0.75633533 0.35379859 0.17425859 0.67000721
## 516 519 525 526 528 534
## 0.17425859 0.50066997 0.35379859 0.29639630 0.11225376 0.23134606
## 535 542 543 545 546 550
## 0.39608748 0.35379859 0.20178871 0.48232882 0.22517033 0.48232882
## 551 552 553 555 559 560
## 0.67551218 0.10880715 0.17425859 0.39608748 0.75633533 0.39608748
## 562 565 567 569 577 579
## 0.11225376 0.11225376 0.39608748 0.29639630 0.20079784 0.67000721
## 582 594 595 596 597 601
## 0.20178871 0.20079784 0.20178871 0.67000721 0.29639630 0.21602347
I wasn't able to find any factors that positively correlate with extramarital activity. The p-values rendered the majority of the factor variables nonsignificant. However, I did find that those identifying as more religious and more unhappy in the marriage are predictors for not having an extramarital affair. Even though I wasn't able to answer my question directly, but I was able to see that there are factors that contribute to not committing adultery.
My goodness of fit test revealed a very low p-value showing evidence of poor fit.
I'm not sure that I accounted for the for ordinal values appropriately. I think I should have accounted for the fact that the number values are signficant.