My final project will be an exploratory data analysis of the Extramarital affairs data set. A link to the original documentation can be found at this page: https://vincentarelbundock.github.io/Rdatasets/doc/AER/Affairs.html. I am curious about the demographics around extra-martial affairs and whether stereotypes around the middle age man are in fact representative in a survey.
Do men at all ages tend to have extramarital affairs at rates higher than women? Does level of education impact the number of affairs across gender and is there an inverse relationship with more years of education?
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
gh_input = 'https://raw.githubusercontent.com/jforster19/R_BRIDGE/main/Affairs.csv'
local_input = 'Affairs.csv'
df <- read.table(file=gh_input,header=TRUE,sep = ',')
head(df)
## X affairs gender age yearsmarried children religiousness education
## 1 4 0 male 37 10.00 no 3 18
## 2 5 0 female 27 4.00 no 4 14
## 3 11 0 female 32 15.00 yes 1 12
## 4 16 0 male 57 15.00 yes 5 18
## 5 23 0 male 22 0.75 no 2 17
## 6 29 0 female 32 1.50 no 2 17
## occupation rating
## 1 7 4
## 2 6 4
## 3 1 4
## 4 6 5
## 5 6 3
## 6 5 5
summary(df)
## X affairs gender age
## Min. : 4 Min. : 0.000 Length:601 Min. :17.50
## 1st Qu.: 528 1st Qu.: 0.000 Class :character 1st Qu.:27.00
## Median :1009 Median : 0.000 Mode :character Median :32.00
## Mean :1060 Mean : 1.456 Mean :32.49
## 3rd Qu.:1453 3rd Qu.: 0.000 3rd Qu.:37.00
## Max. :9029 Max. :12.000 Max. :57.00
## yearsmarried children religiousness education
## Min. : 0.125 Length:601 Min. :1.000 Min. : 9.00
## 1st Qu.: 4.000 Class :character 1st Qu.:2.000 1st Qu.:14.00
## Median : 7.000 Mode :character Median :3.000 Median :16.00
## Mean : 8.178 Mean :3.116 Mean :16.17
## 3rd Qu.:15.000 3rd Qu.:4.000 3rd Qu.:18.00
## Max. :15.000 Max. :5.000 Max. :20.00
## occupation rating
## Min. :1.000 Min. :1.000
## 1st Qu.:3.000 1st Qu.:3.000
## Median :5.000 Median :4.000
## Mean :4.195 Mean :3.932
## 3rd Qu.:6.000 3rd Qu.:5.000
## Max. :7.000 Max. :5.000
# nrow(df) 601
rapply(df,typeof)
## X affairs gender age yearsmarried
## "integer" "integer" "character" "double" "double"
## children religiousness education occupation rating
## "character" "integer" "integer" "integer" "integer"
table(df$affairs,df$gender)
##
## female male
## 0 243 208
## 1 15 19
## 2 7 10
## 3 8 11
## 7 22 20
## 12 20 18
table(df$affairs,df$age)
##
## 17.5 22 27 32 37 42 47 52 57
## 0 3 101 117 77 65 38 16 15 19
## 1 1 7 11 9 1 2 1 2 0
## 2 0 2 4 3 5 1 0 0 2
## 3 0 2 6 6 0 3 2 0 0
## 7 0 4 9 11 8 6 1 2 1
## 12 2 1 6 9 9 6 3 2 0
My initial thoughts after reviewing the summary statistics were that some of the visuals would require aggregation due to the fact that many of the fields have dummy categorical discrete data or were slightly modified to be discrete with the design of the survey. On a positive note the vast majority of individuals surveyed did not have extramarital affairs. From some basic counts it does not appear that men have that many more extramarital affairs than women at a high level.
#install.packages("dplyr")
library(dplyr)
df$EduDesc <- NA
df$EduDesc[which(df$education==9)] <- "09 - grade school"
df$EduDesc[which(df$education==12)] <- "12 - high school degree"
df$EduDesc[which(df$education==14)] <- "14 - some college"
df$EduDesc[which(df$education==16)] <- "16 - college degree"
df$EduDesc[which(df$education==17)] <- "17 - some postgrad"
df$EduDesc[which(df$education==18)] <- "18 - masters degree"
df$EduDesc[which(df$education==20)] <- "20 - advanced degree"
df_m <- subset(df,gender=='male' & affairs>0)
summary(df_m)
## X affairs gender age
## Min. : 6 Min. : 1.000 Length:78 Min. :22.00
## 1st Qu.: 522 1st Qu.: 2.000 Class :character 1st Qu.:27.00
## Median : 970 Median : 3.000 Mode :character Median :32.00
## Mean :1017 Mean : 5.487 Mean :34.24
## 3rd Qu.:1560 3rd Qu.: 7.000 3rd Qu.:37.00
## Max. :1959 Max. :12.000 Max. :57.00
## yearsmarried children religiousness education
## Min. : 0.125 Length:78 Min. :1.000 Min. :12.00
## 1st Qu.: 4.000 Class :character 1st Qu.:2.000 1st Qu.:16.00
## Median :10.000 Mode :character Median :3.000 Median :18.00
## Mean : 9.257 Mean :2.795 Mean :17.17
## 3rd Qu.:15.000 3rd Qu.:4.000 3rd Qu.:18.00
## Max. :15.000 Max. :5.000 Max. :20.00
## occupation rating EduDesc
## Min. :1.000 Min. :1.000 Length:78
## 1st Qu.:4.000 1st Qu.:2.250 Class :character
## Median :5.500 Median :4.000 Mode :character
## Mean :5.103 Mean :3.513
## 3rd Qu.:6.000 3rd Qu.:4.000
## Max. :7.000 Max. :5.000
df_w <-subset(df,gender=='female' & affairs>0)
summary(df_w)
## X affairs gender age
## Min. : 12.0 Min. : 1.000 Length:72 Min. :17.50
## 1st Qu.: 346.8 1st Qu.: 2.000 Class :character 1st Qu.:27.00
## Median : 979.0 Median : 7.000 Mode :character Median :32.00
## Mean :1038.4 Mean : 6.208 Mean :32.51
## 3rd Qu.:1564.0 3rd Qu.:12.000 3rd Qu.:38.25
## Max. :9010.0 Max. :12.000 Max. :57.00
## yearsmarried children religiousness education
## Min. : 0.75 Length:72 Min. :1.000 Min. : 9.00
## 1st Qu.: 4.00 Class :character 1st Qu.:2.000 1st Qu.:14.00
## Median :10.00 Mode :character Median :3.000 Median :16.00
## Mean : 9.83 Mean :2.917 Mean :15.25
## 3rd Qu.:15.00 3rd Qu.:4.000 3rd Qu.:17.00
## Max. :15.00 Max. :5.000 Max. :20.00
## occupation rating EduDesc
## Min. :1.000 Min. :1.000 Length:72
## 1st Qu.:1.000 1st Qu.:2.000 Class :character
## Median :4.000 Median :3.000 Mode :character
## Mean :3.458 Mean :3.375
## 3rd Qu.:5.000 3rd Qu.:4.250
## Max. :7.000 Max. :5.000
prop.table(table(df_m$affairs,df_m$education))
##
## 12 14 16 17 18 20
## 1 0.02564103 0.06410256 0.06410256 0.00000000 0.03846154 0.05128205
## 2 0.00000000 0.00000000 0.01282051 0.02564103 0.06410256 0.02564103
## 3 0.01282051 0.03846154 0.01282051 0.01282051 0.05128205 0.01282051
## 7 0.00000000 0.02564103 0.02564103 0.02564103 0.07692308 0.10256410
## 12 0.01282051 0.02564103 0.01282051 0.07692308 0.05128205 0.05128205
prop.table(table(df_w$affairs,df_w$education))
##
## 9 12 14 16 17 18
## 1 0.00000000 0.00000000 0.12500000 0.04166667 0.01388889 0.02777778
## 2 0.00000000 0.00000000 0.02777778 0.01388889 0.04166667 0.01388889
## 3 0.00000000 0.02777778 0.01388889 0.02777778 0.04166667 0.00000000
## 7 0.00000000 0.02777778 0.05555556 0.04166667 0.09722222 0.06944444
## 12 0.02777778 0.06944444 0.09722222 0.01388889 0.02777778 0.04166667
##
## 20
## 1 0.00000000
## 2 0.00000000
## 3 0.00000000
## 7 0.01388889
## 12 0.00000000
df_w %>% group_by(gender) %>% summarize(AvgAffairs=mean(affairs))
## # A tibble: 1 × 2
## gender AvgAffairs
## <chr> <dbl>
## 1 female 6.21
df_m %>% group_by(gender) %>% summarize(AvgAffairs=mean(affairs))
## # A tibble: 1 × 2
## gender AvgAffairs
## <chr> <dbl>
## 1 male 5.49
The main transformations made to the data set were designed to exclude those records without extramarital affairs as they heavily skewed zero. It is also was interesting to separate by gender to be able to review some of the summary statistics again. Lastly, although the education field most closely resembles number of years of education, it appeared helpful to create a new field including the related description from the documentation to help provide additional context for users.
The summary statistics for those subsets of the data showed that females had a higher average of affairs although one value is bringing up the average. Looking at generic percentages across years of education, it was a bit surprising to see somewhat of a positive relationship between the number of years of education and the number of affairs particularly for those respondents that engaged in this behavior more than once.
#install.packages("ggplot2")
library(ggplot2)
hist_aff <- ggplot(data=df,aes(x=affairs,fill=gender))+geom_histogram(stat="count")
## Warning: Ignoring unknown parameters: binwidth, bins, pad
hist_aff + ggtitle('Total Affairs encoded by gender')+xlab('Binned Number of Affairs')+ylab('Number of Respondents')
box_m <- ggplot(data=df_m,aes(y=affairs,x=EduDesc))+geom_boxplot()+theme(axis.text.x = element_text(angle = 90,vjust = 0.5,hjust = 1))
box_m + ggtitle('Distribution of Male Affairs (excl 0) by educational attainment')+ylab('Number of Affairs')+xlab('Education Level')
box_w <- ggplot(data=df_w,aes(y=affairs,x=EduDesc))+geom_boxplot()+theme(axis.text.x = element_text(angle = 90,vjust = 0.5,hjust = 1))
box_w + ggtitle('Distribution of Female Affairs (excl 0) by educational attainment')+xlab('Educational Level')+ylab('Number of Affairs')
line_age <- ggplot(df,aes(x=age,y=affairs,color=gender))+stat_summary(fun = 'mean',geom='line')
line_age + ggtitle('Average number of affairs by age and gender')+xlab('Age')+ylab('Average Number of Affairs')
line_yrs_married <- ggplot(df,aes(x=yearsmarried,y=affairs,color=gender))+stat_summary(fun = 'mean',geom='line')
line_yrs_married + ggtitle('Average number of affairs by years married and gender')+xlab('Years Married')+ylab('Average Number of Affairs')
scat_edu <- ggplot(df,aes(x=EduDesc,y=affairs,color=gender))+stat_summary(fun = 'mean',geom='point')+theme(axis.text.x = element_text(angle = 90,vjust = 0.5,hjust = 1))
scat_edu + ggtitle('Average number of affairs by years of education and gender')+xlab('Educational Level')+ylab('Average Number of Affairs')
After all of the exploration of the data using ggplot graphs, it continued to reinforce the fact that additional years of education does not decrease the chances of extramarital affairs. Unlike the original hypothesis, there appears to be a somewhat positive relationship between years of education and number of affairs. The median number of affairs clearly displayed higher totals in the box plots for men in the survey respondents with more education. One could speculate why that might be the case, but further research is likely needed to determine if the difference is statistically significant. The graphs of boxplots for women did not clearly identify a pattern based on this review; however, it appears that some of the results particularly for lower education did not have many respondents that had extramarital affairs and therefore may have skewed the averages and distributions.
The remaining graphs were intended to show that there did not appear to be substantial differences between male and female affair totals. Unsurprisingly, as years of marriage go up the average number of affairs was higher across gender and both lines seemed to trend in similar fashion.It would be interesting to further review how children, religion, and relationship ratings might impact the analysis and better segment those individuals more likely to have extramarital affairs.