It’s wedding season. Again. Even though approximately 45% of married couples eventually divorce (it was closer to 50% in the 1980s), couples are still tying the knot, secretly hoping that their marriage will be the one that endures/beats the odds, thus debunking the stats. There have been more studies than not on this topic, with each one offering their opinion on what it takes to remain a happily married couple, yet none has been able to pinpoint the exact reason for the demise of the marital union. There is a myriad of reasons why marriages fail - financial struggles, infidelity, abuse(emotional, physical, mental), illness, sexual incompatibility and/or abandonment, married too young, contentiousness et al. Too many to list. For my final project, I chose the “Affairs” dataset entitled “Fair’s Extramarital Affairs Data” located at http://vincentarelbundock.github.io/Rdatasets). Described as “Infidelity data, known as Fair’s Affairs, this dataset is a cross-section data from a survey conducted by Psychology Today in 1969.”

I am curious to know whether one’s educational/occupational status leads to a greater likelihood of having affairs, or if it’s the opposite, the lack thereof that is the main cause of having affairs.

  1. Data Exploration : This should include summary statistics, means, median, quartiles or any other relevant information about the data set. Please include some conclusiOns in the R Markdown text.
# import file, set header = FALSE to get row name header
affairs <- read.csv('https://raw.githubusercontent.com/carolc57/RBridgeFinal/main/Affairs.csv', header = TRUE, sep = ",") 

head (affairs)   #get a glimpse of the data
##   affairs gender age yearsmarried children religiousness education occupation
## 1       0   male  37        10.00       no             3        18          7
## 2       0 female  27         4.00       no             4        14          6
## 3       0 female  32        15.00      yes             1        12          1
## 4       0   male  57        15.00      yes             5        18          6
## 5       0   male  22         0.75       no             2        17          6
## 6       0 female  32         1.50       no             2        17          5
##   rating
## 1      4
## 2      4
## 3      4
## 4      5
## 5      3
## 6      5
summary (affairs)
##     affairs          gender               age         yearsmarried   
##  Min.   : 0.000   Length:601         Min.   :17.50   Min.   : 0.125  
##  1st Qu.: 0.000   Class :character   1st Qu.:27.00   1st Qu.: 4.000  
##  Median : 0.000   Mode  :character   Median :32.00   Median : 7.000  
##  Mean   : 1.456                      Mean   :32.49   Mean   : 8.178  
##  3rd Qu.: 0.000                      3rd Qu.:37.00   3rd Qu.:15.000  
##  Max.   :12.000                      Max.   :57.00   Max.   :15.000  
##    children         religiousness     education       occupation   
##  Length:601         Min.   :1.000   Min.   : 9.00   Min.   :1.000  
##  Class :character   1st Qu.:2.000   1st Qu.:14.00   1st Qu.:3.000  
##  Mode  :character   Median :3.000   Median :16.00   Median :5.000  
##                     Mean   :3.116   Mean   :16.17   Mean   :4.195  
##                     3rd Qu.:4.000   3rd Qu.:18.00   3rd Qu.:6.000  
##                     Max.   :5.000   Max.   :20.00   Max.   :7.000  
##      rating     
##  Min.   :1.000  
##  1st Qu.:3.000  
##  Median :4.000  
##  Mean   :3.932  
##  3rd Qu.:5.000  
##  Max.   :5.000
affairs_mean <- RoundTo(mean (affairs$age),0.2, FUN=round)
sprintf(paste("The mean of age at marriage is", affairs_mean, "years"))
## [1] "The mean of age at marriage is 32.4 years"
affairs_median <- median(affairs$yearsmarried)
sprintf(paste("The median years married is", affairs_median, "years"))
## [1] "The median years married is 7 years"
#resort data in age column in ascending order; prep for quartile analysis
affairs_agesort <- affairs %>% arrange(age) %>% head (601)
print (head(affairs_agesort, 15)) #print 1st ten rows 
##    affairs gender  age yearsmarried children religiousness education occupation
## 1        0   male 17.5         1.50      yes             3        18          6
## 2        0 female 17.5         0.75       no             2        18          5
## 3        0 female 17.5        10.00       no             4        14          4
## 4        1 female 17.5         0.75       no             5        14          4
## 5       12 female 17.5         0.75      yes             2        12          1
## 6       12 female 17.5         0.75      yes             2        12          3
## 7        0   male 22.0         0.75       no             2        17          6
## 8        0 female 22.0         0.75       no             2        12          1
## 9        0   male 22.0         1.50       no             4        14          4
## 10       0 female 22.0         1.50       no             2        17          5
## 11       0 female 22.0         0.75       no             3        16          5
## 12       0 female 22.0         1.50       no             2        16          5
## 13       0 female 22.0         1.50       no             2        16          5
## 14       0 female 22.0         1.50       no             2        16          5
## 15       0 female 22.0         1.50       no             2        18          5
##    rating
## 1       5
## 2       4
## 3       5
## 4       5
## 5       3
## 6       5
## 7       3
## 8       3
## 9       5
## 10      4
## 11      4
## 12      5
## 13      5
## 14      5
## 15      5
print (tail(affairs_agesort, 15))   #print last 15 rows
##     affairs gender age yearsmarried children religiousness education occupation
## 587       0   male  57           15      yes             4        20          5
## 588       0 female  57           15      yes             4        16          6
## 589       0 female  57           15      yes             2        18          5
## 590       0   male  57           15      yes             4         9          2
## 591       0   male  57           15      yes             4        20          6
## 592       0   male  57           15      yes             2        20          6
## 593       0   male  57           15      yes             4         9          2
## 594       0   male  57           15      yes             4        17          5
## 595       0   male  57           15      yes             5        18          5
## 596       0 female  57           15      yes             3        18          5
## 597       0   male  57           15       no             4         9          3
## 598       0 female  57           15       no             4        20          6
## 599       2 female  57           15      yes             1        18          5
## 600       2   male  57           15      yes             1        17          4
## 601       7   male  57           15      yes             5        20          4
##     rating
## 587      4
## 588      4
## 589      2
## 590      2
## 591      5
## 592      4
## 593      4
## 594      5
## 595      2
## 596      2
## 597      1
## 598      5
## 599      4
## 600      4
## 601      5
affairs_quantile <- quantile(affairs$age, probs = c(0.25,0.5,0.75,1)) 
print (affairs_quantile)
##  25%  50%  75% 100% 
##   27   32   37   57
print("The affairs_data_quantile shows that the first, second, and third quartiles are 27, 32 and 37 respectively")
## [1] "The affairs_data_quantile shows that the first, second, and third quartiles are 27, 32 and 37 respectively"
print("The first quartile, or lower quartile, is the value that cuts off the first 25% of the data when it is sorted in ascending order. The second quartile, or median, is the value that cuts off the first 50%. The third quartile, or upper quartile, is the value that cuts off the first 75%.")
## [1] "The first quartile, or lower quartile, is the value that cuts off the first 25% of the data when it is sorted in ascending order. The second quartile, or median, is the value that cuts off the first 50%. The third quartile, or upper quartile, is the value that cuts off the first 75%."
  1. Data wrangling: Please perform some basic transformations. They will need to make sense but could include column renaming, creating a subset of the data, replacing values, or creating new columns with derived data (for example - if it makes sense you could sum two columns together)
#Create a new data frame with a subset of the columns and rows. Make sure to name it
#rename a column, ensure use of rename () function in dplyer due to "Error in `chr_as_locations()`:"

affairs_colname <- plyr::rename(affairs, c(
       "rating" = "marital_satisfaction",
       "children" = "number_children"))

head(affairs_colname)
##   affairs gender age yearsmarried number_children religiousness education
## 1       0   male  37        10.00              no             3        18
## 2       0 female  27         4.00              no             4        14
## 3       0 female  32        15.00             yes             1        12
## 4       0   male  57        15.00             yes             5        18
## 5       0   male  22         0.75              no             2        17
## 6       0 female  32         1.50              no             2        17
##   occupation marital_satisfaction
## 1          7                    4
## 2          6                    4
## 3          1                    4
## 4          6                    5
## 5          6                    3
## 6          5                    5
#Change values in education columns to text meanings; store in new column

affairs_change <- affairs

for (i in 1:nrow(affairs_change)){
  if (affairs_change$education [i] == "9"){
    affairs_change$edu_descr[i] <- "< Grade school"
  }else if (affairs_change$education [i] == "12"){
    affairs_change$edu_descr[i] <- "HS Grad"
  }else if (affairs_change$education [i] == "14"){
    affairs_change$edu_descr[i] <- "Some College"
  }else if (affairs_change$education [i] == "16"){
    affairs_change$edu_descr[i] <- "College Grad"
  }else if (affairs_change$education [i] == "17"){
    affairs_change$edu_descr[i] <- "Some Grad work"
  }else if (affairs_change$education [i] == "18"){
    affairs_change$edu_descr[i] <- " Master's"
  }else {(affairs_change$education [i] == "")
    affairs_change$edu_descr[i] <- "PhD, MD, other_adv"
  }  
}

#reorder columns so that education and edu_descr are together, give new name
affairs_change1 <- affairs_change %>% relocate(edu_descr, .after = education)


print (head(affairs_change1, 15)) #print 1st 15 rows
##    affairs gender age yearsmarried children religiousness education
## 1        0   male  37        10.00       no             3        18
## 2        0 female  27         4.00       no             4        14
## 3        0 female  32        15.00      yes             1        12
## 4        0   male  57        15.00      yes             5        18
## 5        0   male  22         0.75       no             2        17
## 6        0 female  32         1.50       no             2        17
## 7        0 female  22         0.75       no             2        12
## 8        0   male  57        15.00      yes             2        14
## 9        0 female  32        15.00      yes             4        16
## 10       0   male  22         1.50       no             4        14
## 11       0   male  37        15.00      yes             2        20
## 12       0   male  27         4.00      yes             4        18
## 13       0   male  47        15.00      yes             5        17
## 14       0 female  22         1.50       no             2        17
## 15       0 female  27         4.00       no             4        14
##             edu_descr occupation rating
## 1            Master's          7      4
## 2        Some College          6      4
## 3             HS Grad          1      4
## 4            Master's          6      5
## 5      Some Grad work          6      3
## 6      Some Grad work          5      5
## 7             HS Grad          1      3
## 8        Some College          4      4
## 9        College Grad          1      2
## 10       Some College          4      5
## 11 PhD, MD, other_adv          7      2
## 12           Master's          6      4
## 13     Some Grad work          6      4
## 14     Some Grad work          5      4
## 15       Some College          5      4
print (tail(affairs_change1, 15)) #print last 15 rows 
##     affairs gender age yearsmarried children religiousness education
## 587      12 female  52         15.0      yes             1        18
## 588      12   male  47         15.0       no             1        18
## 589       3 female  32         15.0      yes             4        16
## 590       7 female  32         15.0      yes             3        14
## 591       7 female  27          7.0      yes             4        16
## 592      12   male  42         15.0      yes             3        18
## 593       7 female  42         15.0      yes             2        14
## 594      12   male  27          7.0      yes             2        17
## 595       3   male  32         10.0      yes             4        14
## 596       7   male  47         15.0      yes             3        16
## 597       1   male  22          1.5      yes             1        12
## 598       7 female  32         10.0      yes             2        18
## 599       2   male  32         10.0      yes             2        17
## 600       2   male  22          7.0      yes             3        18
## 601       1 female  32         15.0      yes             3        14
##          edu_descr occupation rating
## 587       Master's          5      5
## 588       Master's          6      5
## 589   College Grad          4      4
## 590   Some College          3      2
## 591   College Grad          1      2
## 592       Master's          6      2
## 593   Some College          3      2
## 594 Some Grad work          5      4
## 595   Some College          4      3
## 596   College Grad          4      2
## 597        HS Grad          2      5
## 598       Master's          5      4
## 599 Some Grad work          6      5
## 600       Master's          6      2
## 601   Some College          1      5
  1. Graphics: Please make sure to display at least one scatterplot, box plot and histogram. Don’t be limited to this. Please explore the many other options in R packages such as ggplot2.
#scatterplot
affairs_change1 %>%
  filter (yearsmarried > 5) %>%
  ggplot(aes (x= yearsmarried,
              y= affairs))+
  geom_point(aes(color = education))+
  geom_smooth (method = lm,
               se = T)+
  labs(x = "# Years Married",
       y = "# of Affairs",
       title = "Affairs vs Yrs Married")+
  theme_minimal()
## `geom_smooth()` using formula 'y ~ x'

ggsave("scatterplot.png", width = 4, height = 3, dpi = 1000)
## `geom_smooth()` using formula 'y ~ x'
#histogram of educational level
affairs_change1 %>%
 filter(education >= 12 & affairs >= 1) %>%
  ggplot(aes(x = education))+
  geom_histogram(color = "black",fill = "maroon", binwidth = 2)+
  theme_bw()+
  labs(x = "Education",
       y = NULL,
       title = "Affairs by educational level, HS or higher")

ggsave ("histogram.png", width = 4, height = 3, dpi = 1000)
#boxplot
library(dplyr)

affairs_1ormore <-  filter(affairs_change1, affairs >= 1)
head(affairs_1ormore)
##   affairs gender age yearsmarried children religiousness education
## 1       3   male  27        1.500       no             3        18
## 2       3 female  27        4.000      yes             3        17
## 3       7   male  37       15.000      yes             5        18
## 4      12 female  32       10.000      yes             3        17
## 5       1   male  22        0.125       no             4        16
## 6       1 female  22        1.500      yes             2        14
##        edu_descr occupation rating
## 1       Master's          4      4
## 2 Some Grad work          1      5
## 3       Master's          6      2
## 4 Some Grad work          5      2
## 5   College Grad          5      5
## 6   Some College          1      5
ggplot(affairs_1ormore, aes(edu_descr, affairs))+
  geom_boxplot(fill = "pink", color = "darkblue" )+
  labs(x = "Education level" , y =  "# of Affairs had")+
  coord_flip()+
  theme_classic()+
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())

# saving the final figure as .png fle 4" wide x 3" high
ggsave ("boxplot.png", width = 4, height = 3, dpi = 1000)
  1. Please write a conclusion paragraph in R markdown at the end.

After studying and analyzing this dataset in R, I conclude that the educational group(s) that had the highest incidents of cheating/affairs were those who married 5 years or more and had either HS Diploma, some college or some Grad work had the highest probability of cheating. Interestingly enough, those who graduated college, or had Master’s or advanced level degrees had a lower probability of cheating.

  1. BONUS - place the original.csv in a github file and have R read from the link. This will be a very useful skill as you progress in your data science education and career.
##import from github directory
affairs <- read.csv('https://raw.githubusercontent.com/carolc57/RBridgeFinal/main/Affairs.csv', header = TRUE, sep = ",")