Final Project: Examining Affair Data

Nayana Davis

Problem: What factor(s) contribute(s) most strongly to an individual having an extramarital affair?

How did you develop your question and what relevant research has already been completed on this topic?

Infidelity is a topic where people are not exactly going to be the most forthcoming if they're commited the act. As such, I wasn't able to glean much valuable insight on the drivers for extramarital affairs beyond basic demographic markers.

{r} [NPR reports] (https://www.npr.org/2015/07/26/426434619/sorting-through-the-numbers-on-infidelity) that data from Ashley Madison revealed that men approaching their next decade (ie 29, 39, etc) were more likely to be on the site. {r} [The General Social Survey] (https://ifstudies.org/blog/who-cheats-more-the-demographics-of-cheating-in-america), conducted by Institute for Family Studies found that Democrats, those who grew up in broken homes, and those who did not attend many religious services where more likely to take part in adultery.

Although, the Extramarital Affairs Data set used in a survey conducted by Yale's Ray C. Fair is older -- 1977, it did provide additional factors that I suspected could contribute to having an extramarital affair, such as rating the participant's happiness in the marriage, if children were involved and years married. This prompted me to wonder if there was a factor evaluated that would allow for prediction of infidelity.

How did you gather and prepare the data for analysis?

As previously mentioned, I used the {r} [Fairs Extramarital Affairs Data set] (https://www.npr.org/2015/07/26/426434619/sorting-through-the-numbers-on-infidelity) to conduct my analysis. I converted many of the predictor variables into factors from integers since they were categorical and ordinal values. I actually got rid of the affairs column, which reported the number of times a participant had sexual intercourse outside of marriage and instead made a new column called hadaffair, which is set to "yes" if the number in the old affairs column was greater than or equal to 1. I did this because I thought the old column was a strange metric since it doesn't reveal if there were multiple partners or time span -- I believed the new one I created is more straightforward.

Data Processing

setwd("/Users/nayanadavis/Desktop/R/Intro to DS/Final Project/")
affair_data <- data.frame(read.csv("./Fair.csv", header = TRUE))
str(affair_data)

## 'data.frame':    601 obs. of  10 variables:
##  $ X         : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ sex       : Factor w/ 2 levels "female","male": 2 1 1 2 2 1 1 2 1 2 ...
##  $ age       : num  37 27 32 57 22 32 22 57 32 22 ...
##  $ ym        : num  10 4 15 15 0.75 1.5 0.75 15 15 1.5 ...
##  $ child     : Factor w/ 2 levels "no","yes": 1 1 2 2 1 1 1 2 2 1 ...
##  $ religious : int  3 4 1 5 2 2 2 2 4 4 ...
##  $ education : int  18 14 12 18 17 17 12 14 16 14 ...
##  $ occupation: int  7 6 1 6 6 5 1 4 1 4 ...
##  $ rate      : int  4 4 4 5 3 5 3 4 2 5 ...
##  $ nbaffairs : int  0 0 0 0 0 0 0 0 0 0 ...

# added column to say if person had at least one affair
affair_data <- within(affair_data, {
    hadaffair = ifelse(nbaffairs >= 1, "yes", "no")
 })

# removed number of affairs column
affair_data<- affair_data[ -c(10) ]

affair_data$age<- factor(affair_data$age)
affair_data$ym <- factor(affair_data$ym)
affair_data$religious <- factor(affair_data$religious)
affair_data$education <- factor(affair_data$education)
affair_data$occupation <- factor(affair_data$occupation)
affair_data$rate <- factor(affair_data$rate)
affair_data$hadaffair <- factor(affair_data$hadaffair)

Exploratory Data Analysis

str(affair_data)

## 'data.frame':    601 obs. of  10 variables:
##  $ X         : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ sex       : Factor w/ 2 levels "female","male": 2 1 1 2 2 1 1 2 1 2 ...
##  $ age       : Factor w/ 9 levels "17.5","22","27",..: 5 3 4 9 2 4 2 9 4 2 ...
##  $ ym        : Factor w/ 8 levels "0.125","0.417",..: 7 5 8 8 3 4 3 8 8 4 ...
##  $ child     : Factor w/ 2 levels "no","yes": 1 1 2 2 1 1 1 2 2 1 ...
##  $ religious : Factor w/ 5 levels "1","2","3","4",..: 3 4 1 5 2 2 2 2 4 4 ...
##  $ education : Factor w/ 7 levels "9","12","14",..: 6 3 2 6 5 5 2 3 4 3 ...
##  $ occupation: Factor w/ 7 levels "1","2","3","4",..: 7 6 1 6 6 5 1 4 1 4 ...
##  $ rate      : Factor w/ 5 levels "1","2","3","4",..: 4 4 4 5 3 5 3 4 2 5 ...
##  $ hadaffair : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

head(affair_data)

##   X    sex age   ym child religious education occupation rate hadaffair
## 1 1   male  37   10    no         3        18          7    4        no
## 2 2 female  27    4    no         4        14          6    4        no
## 3 3 female  32   15   yes         1        12          1    4        no
## 4 4   male  57   15   yes         5        18          6    5        no
## 5 5   male  22 0.75    no         2        17          6    3        no
## 6 6 female  32  1.5    no         2        17          5    5        no

#check for averages in continous variables, counts for categorical 
summary(affair_data)

##        X           sex           age            ym      child    
##  Min.   :  1   female:315   27     :153   15     :204   no :171  
##  1st Qu.:151   male  :286   22     :117   4      :105   yes:430  
##  Median :301                32     :115   1.5    : 88            
##  Mean   :301                37     : 88   7      : 82            
##  3rd Qu.:451                42     : 56   10     : 70            
##  Max.   :601                47     : 23   0.75   : 31            
##                             (Other): 49   (Other): 21            
##  religious education occupation rate    hadaffair
##  1: 48     9 :  7    1:113      1: 16   no :451  
##  2:164     12: 44    2: 13      2: 66   yes:150  
##  3:129     14:154    3: 47      3: 93            
##  4:190     16:115    4: 68      4:194            
##  5: 70     17: 89    5:204      5:232            
##            18:112    6:143                       
##            20: 80    7: 13

plot(affair_data$hadaffair)

# check for missing values
library(Amelia)

## Loading required package: Rcpp

## ## 
## ## Amelia II: Multiple Imputation
## ## (Version 1.7.5, built: 2018-05-07)
## ## Copyright (C) 2005-2018 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##

missmap(affair_data, main = "Missing vs observed values")

3. How did you select and determine the correct model to answer your question? Since the question I want to answer was if certain factors could predict if a person was going to participate in an extramarital affair not, I thought a logistic regression model would be best. I wanted to know if specific responses could point to infidelity, which, if you've committed the act is just a yes or no answer.

Model Building

library(caret)

## Loading required package: lattice

## Loading required package: ggplot2

trainIndex <- createDataPartition(affair_data$hadaffair, p=.66, list=FALSE,times = 1)

train <- affair_data[trainIndex, ]
test <- affair_data[-trainIndex, ]


hadaffairlogit <- glm(hadaffair~sex+age+ym+child+religious+education+occupation+rate, family = binomial(link = "logit"), data = train)

# Estimate: the log odds amount in which hadaffair would increase if the covariate was one unit higher
# Intercept: values when covariates are all 0
summary(hadaffairlogit)

## 
## Call:
## glm(formula = hadaffair ~ sex + age + ym + child + religious + 
##     education + occupation + rate, family = binomial(link = "logit"), 
##     data = train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.80691  -0.70616  -0.42314  -0.00046   2.77275  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -11.83284  827.94767  -0.014 0.988597    
## sexmale       0.51785    0.35605   1.454 0.145831    
## age22        -2.71799    1.16624  -2.331 0.019776 *  
## age27        -3.02774    1.20859  -2.505 0.012239 *  
## age32        -2.72557    1.21895  -2.236 0.025352 *  
## age37        -3.25788    1.30758  -2.492 0.012719 *  
## age42        -3.53115    1.32935  -2.656 0.007900 ** 
## age47        -2.88979    1.44859  -1.995 0.046055 *  
## age52        -2.35739    1.44236  -1.634 0.102176    
## age57        -5.02536    1.57435  -3.192 0.001413 ** 
## ym0.417      -0.63253    1.77298  -0.357 0.721270    
## ym0.75       -0.30413    1.53382  -0.198 0.842823    
## ym1.5        -0.53658    1.36277  -0.394 0.693770    
## ym4           0.20331    1.33831   0.152 0.879254    
## ym7           0.52731    1.36773   0.386 0.699842    
## ym10          1.08263    1.37796   0.786 0.432057    
## ym15          1.08468    1.37727   0.788 0.430953    
## childyes      0.36275    0.43817   0.828 0.407741    
## religious2   -1.35944    0.50497  -2.692 0.007100 ** 
## religious3   -0.51550    0.49905  -1.033 0.301619    
## religious4   -1.67986    0.50284  -3.341 0.000836 ***
## religious5   -1.56811    0.60406  -2.596 0.009433 ** 
## education12  14.44728  827.94536   0.017 0.986078    
## education14  14.05038  827.94531   0.017 0.986460    
## education16  14.26372  827.94536   0.017 0.986255    
## education17  15.28214  827.94537   0.018 0.985274    
## education18  14.83120  827.94535   0.018 0.985708    
## education20  14.21861  827.94542   0.017 0.986298    
## occupation2   1.04662    1.02018   1.026 0.304930    
## occupation3   1.02466    0.56499   1.814 0.069740 .  
## occupation4   0.75553    0.55738   1.356 0.175255    
## occupation5   0.03996    0.47027   0.085 0.932289    
## occupation6   0.37683    0.55134   0.683 0.494308    
## occupation7   0.54877    0.88389   0.621 0.534694    
## rate2         0.40693    0.85946   0.473 0.635874    
## rate3        -0.97356    0.86390  -1.127 0.259770    
## rate4        -1.15331    0.83014  -1.389 0.164744    
## rate5        -1.78873    0.85964  -2.081 0.037454 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 445.94  on 396  degrees of freedom
## Residual deviance: 355.28  on 359  degrees of freedom
## AIC: 431.28
## 
## Number of Fisher Scoring iterations: 14

#There was no positive correlation between any of the predictor variables with a low enough p-value that would allow us to reject the null hypothesis

#Kept religion and age since it showed the lowest p-values, indicating we can more confidently reject the null hypothesis. 
hadaffairlogit <- glm(hadaffair~age + religious + rate, family = binomial(link = "logit"), data = train)

summary(hadaffairlogit)

## 
## Call:
## glm(formula = hadaffair ~ age + religious + rate, family = binomial(link = "logit"), 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6224  -0.7001  -0.4856  -0.2268   2.5056  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   3.4871     1.2810   2.722 0.006484 ** 
## age22        -3.0515     1.0212  -2.988 0.002806 ** 
## age27        -2.5742     0.9985  -2.578 0.009936 ** 
## age32        -1.7877     0.9904  -1.805 0.071071 .  
## age37        -2.0795     1.0217  -2.035 0.041817 *  
## age42        -2.2544     1.0391  -2.170 0.030039 *  
## age47        -1.4405     1.1522  -1.250 0.211218    
## age52        -1.3368     1.1403  -1.172 0.241048    
## age57        -3.6047     1.2693  -2.840 0.004512 ** 
## religious2   -1.3484     0.4669  -2.888 0.003878 ** 
## religious3   -0.5496     0.4711  -1.167 0.243355    
## religious4   -1.5661     0.4710  -3.325 0.000885 ***
## religious5   -1.5318     0.5741  -2.668 0.007629 ** 
## rate2         0.2020     0.7677   0.263 0.792482    
## rate3        -1.0735     0.7736  -1.388 0.165218    
## rate4        -1.2462     0.7383  -1.688 0.091430 .  
## rate5        -1.9641     0.7658  -2.565 0.010328 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 445.94  on 396  degrees of freedom
## Residual deviance: 378.33  on 380  degrees of freedom
## AIC: 412.33
## 
## Number of Fisher Scoring iterations: 5

#Removed age for a stronger model
hadaffairlogit <- glm(hadaffair~religious + rate, family = binomial(link = "logit"), data = train)

summary(hadaffairlogit)

## 
## Call:
## glm(formula = hadaffair ~ religious + rate, family = binomial(link = "logit"), 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4891  -0.6977  -0.6188  -0.4493   2.1648  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  1.15770    0.78509   1.475  0.14032   
## religious2  -1.20341    0.44826  -2.685  0.00726 **
## religious3  -0.42447    0.45224  -0.939  0.34794   
## religious4  -1.37783    0.45271  -3.044  0.00234 **
## religious5  -1.23847    0.55023  -2.251  0.02440 * 
## rate2       -0.02501    0.75180  -0.033  0.97347   
## rate3       -1.15502    0.75703  -1.526  0.12708   
## rate4       -1.33561    0.72362  -1.846  0.06493 . 
## rate5       -2.02222    0.73607  -2.747  0.00601 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 445.94  on 396  degrees of freedom
## Residual deviance: 401.06  on 388  degrees of freedom
## AIC: 419.06
## 
## Number of Fisher Scoring iterations: 4

modeloutput <- exp(coef(hadaffairlogit))
modeloutput

## (Intercept)  religious2  religious3  religious4  religious5       rate2 
##   3.1825964   0.3001701   0.6541137   0.2521245   0.2898285   0.9753045 
##       rate3       rate4       rate5 
##   0.3150520   0.2629986   0.1323619

library(ResourceSelection)

## ResourceSelection 0.3-2   2017-02-28

hoslem.test(train$hadaffair, fitted(hadaffairlogit))

## Warning in Ops.factor(1, y): '-' not meaningful for factors

## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  train$hadaffair, fitted(hadaffairlogit)
## X-squared = 397, df = 8, p-value < 2.2e-16

pred.model <- predict.glm(hadaffairlogit,test,type='response')
pred.model

##          6         13         17         20         22         27 
## 0.11225376 0.19523046 0.23134606 0.11225376 0.11225376 0.09601139 
##         30         35         37         42         46         51 
## 0.35379859 0.45563963 0.35379859 0.22517033 0.21602347 0.10880715 
##         52         55         56         59         61         62 
## 0.20079784 0.29639630 0.44518749 0.17425859 0.47358109 0.35379859 
##         65         74         77         82         85         87 
## 0.17425859 0.20178871 0.21602347 0.09601139 0.09601139 0.09601139 
##         88         91         92         93         95         97 
## 0.67000721 0.17425859 0.23134606 0.17425859 0.45563963 0.67551218 
##         98         99        102        106        107        112 
## 0.45563963 0.21602347 0.20079784 0.17425859 0.17425859 0.09601139 
##        114        115        116        125        127        128 
## 0.19523046 0.35379859 0.09601139 0.21602347 0.11225376 0.09601139 
##        132        134        136        148        151        152 
## 0.09601139 0.23134606 0.35379859 0.22517033 0.20178871 0.23134606 
##        157        159        161        163        166        168 
## 0.47981881 0.20178871 0.11225376 0.10880715 0.43902001 0.17425859 
##        169        178        179        180        181        183 
## 0.35379859 0.20079784 0.20178871 0.17425859 0.47358109 0.39608748 
##        184        185        187        189        190        191 
## 0.09601139 0.21602347 0.21602347 0.09601139 0.29639630 0.21602347 
##        196        197        199        204        207        209 
## 0.45563963 0.20178871 0.43902001 0.23134606 0.43902001 0.20079784 
##        215        223        226        229        234        239 
## 0.11225376 0.09601139 0.29639630 0.09601139 0.17425859 0.21602347 
##        240        241        244        246        253        255 
## 0.17425859 0.20079784 0.21602347 0.35379859 0.09601139 0.29639630 
##        259        260        262        264        265        269 
## 0.17425859 0.11225376 0.20178871 0.11225376 0.39608748 0.09601139 
##        270        271        273        282        285        287 
## 0.21602347 0.09601139 0.23134606 0.17425859 0.35379859 0.21602347 
##        289        290        293        294        297        298 
## 0.67000721 0.20079784 0.19523046 0.20178871 0.09601139 0.11225376 
##        306        308        309        315        317        327 
## 0.20079784 0.22517033 0.20178871 0.22517033 0.11225376 0.35379859 
##        328        330        335        340        349        350 
## 0.11225376 0.20178871 0.21602347 0.11225376 0.20178871 0.47358109 
##        351        352        355        356        357        360 
## 0.11225376 0.21602347 0.09601139 0.11225376 0.39608748 0.21602347 
##        363        366        368        373        375        377 
## 0.19523046 0.21602347 0.09601139 0.11225376 0.09601139 0.21602347 
##        378        379        381        382        384        390 
## 0.35379859 0.21602347 0.29639630 0.09601139 0.17425859 0.35379859 
##        394        399        401        402        404        405 
## 0.67000721 0.09601139 0.10880715 0.09601139 0.23134606 0.44518749 
##        406        407        412        413        417        421 
## 0.09601139 0.20079784 0.11225376 0.20178871 0.09601139 0.09601139 
##        424        427        429        430        434        438 
## 0.19523046 0.29639630 0.22517033 0.17425859 0.10880715 0.17425859 
##        446        449        450        459        464        465 
## 0.09601139 0.48232882 0.35379859 0.20079784 0.47981881 0.48232882 
##        466        467        469        470        472        473 
## 0.67000721 0.29639630 0.21602347 0.09601139 0.35379859 0.19523046 
##        476        481        484        491        497        501 
## 0.17425859 0.17425859 0.09601139 0.23134606 0.44518749 0.19523046 
##        502        504        510        511        512        514 
## 0.20079784 0.11225376 0.75633533 0.35379859 0.17425859 0.67000721 
##        516        519        525        526        528        534 
## 0.17425859 0.50066997 0.35379859 0.29639630 0.11225376 0.23134606 
##        535        542        543        545        546        550 
## 0.39608748 0.35379859 0.20178871 0.48232882 0.22517033 0.48232882 
##        551        552        553        555        559        560 
## 0.67551218 0.10880715 0.17425859 0.39608748 0.75633533 0.39608748 
##        562        565        567        569        577        579 
## 0.11225376 0.11225376 0.39608748 0.29639630 0.20079784 0.67000721 
##        582        594        595        596        597        601 
## 0.20178871 0.20079784 0.20178871 0.67000721 0.29639630 0.21602347

What predictions can you make with your model?

I wasn't able to find any factors that positively correlate with extramarital activity. The p-values rendered the majority of the factor variables nonsignificant. However, I did find that those identifying as more religious and more unhappy in the marriage are predictors for not having an extramarital affair. Even though I wasn't able to answer my question directly, but I was able to see that there are factors that contribute to not committing adultery.

How reliable are your results?

My goodness of fit test revealed a very low p-value showing evidence of poor fit.

What additional information or analysis might improve your model results or work to control limitations?

I'm not sure that I accounted for the for ordinal values appropriately. I think I should have accounted for the fact that the number values are signficant.