R Bridge Course Final Project

My final project will be an exploratory data analysis of the Extramarital affairs data set. A link to the original documentation can be found at this page: https://vincentarelbundock.github.io/Rdatasets/doc/AER/Affairs.html. I am curious about the demographics around extra-martial affairs and whether stereotypes around the middle age man are in fact representative in a survey.

Do men at all ages tend to have extramarital affairs at rates higher than women? Does level of education impact the number of affairs across gender and is there an inverse relationship with more years of education?

Load Data into DataFrame and review summary

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
gh_input = 'https://raw.githubusercontent.com/jforster19/R_BRIDGE/main/Affairs.csv'
local_input = 'Affairs.csv'
df <- read.table(file=gh_input,header=TRUE,sep = ',')
head(df)
##    X affairs gender age yearsmarried children religiousness education
## 1  4       0   male  37        10.00       no             3        18
## 2  5       0 female  27         4.00       no             4        14
## 3 11       0 female  32        15.00      yes             1        12
## 4 16       0   male  57        15.00      yes             5        18
## 5 23       0   male  22         0.75       no             2        17
## 6 29       0 female  32         1.50       no             2        17
##   occupation rating
## 1          7      4
## 2          6      4
## 3          1      4
## 4          6      5
## 5          6      3
## 6          5      5
summary(df)
##        X           affairs          gender               age       
##  Min.   :   4   Min.   : 0.000   Length:601         Min.   :17.50  
##  1st Qu.: 528   1st Qu.: 0.000   Class :character   1st Qu.:27.00  
##  Median :1009   Median : 0.000   Mode  :character   Median :32.00  
##  Mean   :1060   Mean   : 1.456                      Mean   :32.49  
##  3rd Qu.:1453   3rd Qu.: 0.000                      3rd Qu.:37.00  
##  Max.   :9029   Max.   :12.000                      Max.   :57.00  
##   yearsmarried      children         religiousness     education    
##  Min.   : 0.125   Length:601         Min.   :1.000   Min.   : 9.00  
##  1st Qu.: 4.000   Class :character   1st Qu.:2.000   1st Qu.:14.00  
##  Median : 7.000   Mode  :character   Median :3.000   Median :16.00  
##  Mean   : 8.178                      Mean   :3.116   Mean   :16.17  
##  3rd Qu.:15.000                      3rd Qu.:4.000   3rd Qu.:18.00  
##  Max.   :15.000                      Max.   :5.000   Max.   :20.00  
##    occupation        rating     
##  Min.   :1.000   Min.   :1.000  
##  1st Qu.:3.000   1st Qu.:3.000  
##  Median :5.000   Median :4.000  
##  Mean   :4.195   Mean   :3.932  
##  3rd Qu.:6.000   3rd Qu.:5.000  
##  Max.   :7.000   Max.   :5.000
# nrow(df) 601
rapply(df,typeof)
##             X       affairs        gender           age  yearsmarried 
##     "integer"     "integer"   "character"      "double"      "double" 
##      children religiousness     education    occupation        rating 
##   "character"     "integer"     "integer"     "integer"     "integer"
table(df$affairs,df$gender)
##     
##      female male
##   0     243  208
##   1      15   19
##   2       7   10
##   3       8   11
##   7      22   20
##   12     20   18
table(df$affairs,df$age)
##     
##      17.5  22  27  32  37  42  47  52  57
##   0     3 101 117  77  65  38  16  15  19
##   1     1   7  11   9   1   2   1   2   0
##   2     0   2   4   3   5   1   0   0   2
##   3     0   2   6   6   0   3   2   0   0
##   7     0   4   9  11   8   6   1   2   1
##   12    2   1   6   9   9   6   3   2   0

My initial thoughts after reviewing the summary statistics were that some of the visuals would require aggregation due to the fact that many of the fields have dummy categorical discrete data or were slightly modified to be discrete with the design of the survey. On a positive note the vast majority of individuals surveyed did not have extramarital affairs. From some basic counts it does not appear that men have that many more extramarital affairs than women at a high level.

Wrangle Data

#install.packages("dplyr")
library(dplyr)

df$EduDesc <- NA
df$EduDesc[which(df$education==9)] <- "09 - grade school"
df$EduDesc[which(df$education==12)] <- "12 - high school degree"
df$EduDesc[which(df$education==14)] <- "14 - some college"
df$EduDesc[which(df$education==16)] <- "16 - college degree"
df$EduDesc[which(df$education==17)] <- "17 - some postgrad"
df$EduDesc[which(df$education==18)] <- "18 - masters degree"
df$EduDesc[which(df$education==20)] <- "20 - advanced degree"

df_m <- subset(df,gender=='male' & affairs>0)
summary(df_m)
##        X           affairs          gender               age       
##  Min.   :   6   Min.   : 1.000   Length:78          Min.   :22.00  
##  1st Qu.: 522   1st Qu.: 2.000   Class :character   1st Qu.:27.00  
##  Median : 970   Median : 3.000   Mode  :character   Median :32.00  
##  Mean   :1017   Mean   : 5.487                      Mean   :34.24  
##  3rd Qu.:1560   3rd Qu.: 7.000                      3rd Qu.:37.00  
##  Max.   :1959   Max.   :12.000                      Max.   :57.00  
##   yearsmarried      children         religiousness     education    
##  Min.   : 0.125   Length:78          Min.   :1.000   Min.   :12.00  
##  1st Qu.: 4.000   Class :character   1st Qu.:2.000   1st Qu.:16.00  
##  Median :10.000   Mode  :character   Median :3.000   Median :18.00  
##  Mean   : 9.257                      Mean   :2.795   Mean   :17.17  
##  3rd Qu.:15.000                      3rd Qu.:4.000   3rd Qu.:18.00  
##  Max.   :15.000                      Max.   :5.000   Max.   :20.00  
##    occupation        rating        EduDesc         
##  Min.   :1.000   Min.   :1.000   Length:78         
##  1st Qu.:4.000   1st Qu.:2.250   Class :character  
##  Median :5.500   Median :4.000   Mode  :character  
##  Mean   :5.103   Mean   :3.513                     
##  3rd Qu.:6.000   3rd Qu.:4.000                     
##  Max.   :7.000   Max.   :5.000
df_w <-subset(df,gender=='female' & affairs>0)
summary(df_w)
##        X             affairs          gender               age       
##  Min.   :  12.0   Min.   : 1.000   Length:72          Min.   :17.50  
##  1st Qu.: 346.8   1st Qu.: 2.000   Class :character   1st Qu.:27.00  
##  Median : 979.0   Median : 7.000   Mode  :character   Median :32.00  
##  Mean   :1038.4   Mean   : 6.208                      Mean   :32.51  
##  3rd Qu.:1564.0   3rd Qu.:12.000                      3rd Qu.:38.25  
##  Max.   :9010.0   Max.   :12.000                      Max.   :57.00  
##   yearsmarried     children         religiousness     education    
##  Min.   : 0.75   Length:72          Min.   :1.000   Min.   : 9.00  
##  1st Qu.: 4.00   Class :character   1st Qu.:2.000   1st Qu.:14.00  
##  Median :10.00   Mode  :character   Median :3.000   Median :16.00  
##  Mean   : 9.83                      Mean   :2.917   Mean   :15.25  
##  3rd Qu.:15.00                      3rd Qu.:4.000   3rd Qu.:17.00  
##  Max.   :15.00                      Max.   :5.000   Max.   :20.00  
##    occupation        rating        EduDesc         
##  Min.   :1.000   Min.   :1.000   Length:72         
##  1st Qu.:1.000   1st Qu.:2.000   Class :character  
##  Median :4.000   Median :3.000   Mode  :character  
##  Mean   :3.458   Mean   :3.375                     
##  3rd Qu.:5.000   3rd Qu.:4.250                     
##  Max.   :7.000   Max.   :5.000
prop.table(table(df_m$affairs,df_m$education))
##     
##              12         14         16         17         18         20
##   1  0.02564103 0.06410256 0.06410256 0.00000000 0.03846154 0.05128205
##   2  0.00000000 0.00000000 0.01282051 0.02564103 0.06410256 0.02564103
##   3  0.01282051 0.03846154 0.01282051 0.01282051 0.05128205 0.01282051
##   7  0.00000000 0.02564103 0.02564103 0.02564103 0.07692308 0.10256410
##   12 0.01282051 0.02564103 0.01282051 0.07692308 0.05128205 0.05128205
prop.table(table(df_w$affairs,df_w$education))
##     
##               9         12         14         16         17         18
##   1  0.00000000 0.00000000 0.12500000 0.04166667 0.01388889 0.02777778
##   2  0.00000000 0.00000000 0.02777778 0.01388889 0.04166667 0.01388889
##   3  0.00000000 0.02777778 0.01388889 0.02777778 0.04166667 0.00000000
##   7  0.00000000 0.02777778 0.05555556 0.04166667 0.09722222 0.06944444
##   12 0.02777778 0.06944444 0.09722222 0.01388889 0.02777778 0.04166667
##     
##              20
##   1  0.00000000
##   2  0.00000000
##   3  0.00000000
##   7  0.01388889
##   12 0.00000000
df_w %>% group_by(gender) %>% summarize(AvgAffairs=mean(affairs))
## # A tibble: 1 × 2
##   gender AvgAffairs
##   <chr>       <dbl>
## 1 female       6.21
df_m %>% group_by(gender) %>% summarize(AvgAffairs=mean(affairs))
## # A tibble: 1 × 2
##   gender AvgAffairs
##   <chr>       <dbl>
## 1 male         5.49

The main transformations made to the data set were designed to exclude those records without extramarital affairs as they heavily skewed zero. It is also was interesting to separate by gender to be able to review some of the summary statistics again. Lastly, although the education field most closely resembles number of years of education, it appeared helpful to create a new field including the related description from the documentation to help provide additional context for users.

The summary statistics for those subsets of the data showed that females had a higher average of affairs although one value is bringing up the average. Looking at generic percentages across years of education, it was a bit surprising to see somewhat of a positive relationship between the number of years of education and the number of affairs particularly for those respondents that engaged in this behavior more than once.

Visualize data with ggplot

#install.packages("ggplot2")
library(ggplot2)

hist_aff <- ggplot(data=df,aes(x=affairs,fill=gender))+geom_histogram(stat="count") 
## Warning: Ignoring unknown parameters: binwidth, bins, pad
hist_aff + ggtitle('Total Affairs encoded by gender')+xlab('Binned Number of Affairs')+ylab('Number of Respondents')

box_m <- ggplot(data=df_m,aes(y=affairs,x=EduDesc))+geom_boxplot()+theme(axis.text.x = element_text(angle = 90,vjust = 0.5,hjust = 1))
box_m + ggtitle('Distribution of Male Affairs (excl 0) by educational attainment')+ylab('Number of Affairs')+xlab('Education Level')

box_w <- ggplot(data=df_w,aes(y=affairs,x=EduDesc))+geom_boxplot()+theme(axis.text.x = element_text(angle = 90,vjust = 0.5,hjust = 1))
box_w + ggtitle('Distribution of Female Affairs (excl 0) by educational attainment')+xlab('Educational Level')+ylab('Number of Affairs')

line_age <- ggplot(df,aes(x=age,y=affairs,color=gender))+stat_summary(fun = 'mean',geom='line')
line_age + ggtitle('Average number of affairs by age and gender')+xlab('Age')+ylab('Average Number of Affairs')

line_yrs_married <- ggplot(df,aes(x=yearsmarried,y=affairs,color=gender))+stat_summary(fun = 'mean',geom='line')
line_yrs_married + ggtitle('Average number of affairs by years married and gender')+xlab('Years Married')+ylab('Average Number of Affairs')

scat_edu <- ggplot(df,aes(x=EduDesc,y=affairs,color=gender))+stat_summary(fun = 'mean',geom='point')+theme(axis.text.x = element_text(angle = 90,vjust = 0.5,hjust = 1))
scat_edu + ggtitle('Average number of affairs by years of education and gender')+xlab('Educational Level')+ylab('Average Number of Affairs')

After all of the exploration of the data using ggplot graphs, it continued to reinforce the fact that additional years of education does not decrease the chances of extramarital affairs. Unlike the original hypothesis, there appears to be a somewhat positive relationship between years of education and number of affairs. The median number of affairs clearly displayed higher totals in the box plots for men in the survey respondents with more education. One could speculate why that might be the case, but further research is likely needed to determine if the difference is statistically significant. The graphs of boxplots for women did not clearly identify a pattern based on this review; however, it appears that some of the results particularly for lower education did not have many respondents that had extramarital affairs and therefore may have skewed the averages and distributions.

The remaining graphs were intended to show that there did not appear to be substantial differences between male and female affair totals. Unsurprisingly, as years of marriage go up the average number of affairs was higher across gender and both lines seemed to trend in similar fashion.It would be interesting to further review how children, religion, and relationship ratings might impact the analysis and better segment those individuals more likely to have extramarital affairs.