R Bridge Course Final Project
The presentation approach is up to you but it should contain the following:
1. Data Exploration: This should include summary statistics, means, medians, quartiles, or any other relevant information about the data set. Please include some conclusions in the R Markdown text.
2. Data wrangling: Please perform some basic transformations. They will need to make sense but could include column renaming, creating a subset of the data, replacing values, or creating new columns with derived data (for example - if it makes sense you could sum two columns together)
3. Graphics: Please make sure to display at least one scatter plot, box plot and histogram. Don’t be limited to this. Please explore the many other options in R packages such as ggplot2.
4. Meaningful question for analysis: Please state at the beginning a meaningful question for analysis. Use the first three steps and anything else that would be helpful to answer the question you are posing from the data set you chose. Please write a brief conclusion paragraph in R markdown at the end.
5. BONUS - place the original .csv in a GitHub file and have R read from the link. This will be a very useful skill as you progress in your data science education and career.
Please submit your .rmd file and the .csv file as well as a link to your RPubs.
This dataset looks into:
- Affairs:
- 0 = none
- 1 = once
- 2 = twice
- 3 = three times
- 7 = 4 - 10 times
- 12 = daily, weekly or monthly
- Gender:
- Age:
- 17.5 = under 20 yrs old
- 22 = 20 - 24 yrs old
- 27 = 25 - 29 yrs old
- 32 = 30 - 34 yrs old
- 37 = 35 - 39 yrs old
- 42 = 40 - 44 yrs old
- 47 = 45 - 49 yrs old
- 52 = 50 - 54 yrs old
- 57 = 55 and over
- Years Married:
- 0.125 = 3 months or less
- 0.417 = 4 - 6 months
- 0.75 = 6 months - 1 yr
- 1.5 = 1 - 2 yrs
- 4 = 3 - 5 yrs
- 7 = 6 - 8 yrs
- 10 = 9 - 11 yrs
- 15 = 12 or more yrs
- Children: if they had any children in their marriage
- Religiousness:
- 1 = anti-religious
- 2 = not at all religious
- 3 = slightly religious
- 4 = somewhat religious
- 5 = very religious
- Education:
- 9 = grade school
- 12 = high school graduate
- 14 = some college
- 16 = college graduate
- 17 = some graduate work
- 18 = master’s degree
- 20 = advanced degree
- Occupation: according to the Hollingshead classification in reverse numbering
- 1 = higher executive, major professional, etc
- 2 = Small business owner, farm owner, teacher, low level manager, salaried worker
- 3 = Technician, semiprofessional, supervisor, office manager
- 4 = Clerical/sales, small farm owner
- 5 = Skilled manual worker, craftsman, police and fire services, enlisted military and non-commissioned officer
- 6 = machine operators, semi-skilled worker
- 7 = unskilled, service worker
- Self Rating:
- 1 = very unhappy
- 2 = somewhat unhappy
- 3 = average / neutral
- 4 = happier than average
- 5 = very happy
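
Because these variables are stored as numeric codes, it can help to attach the labels before tabulating. The following is a minimal sketch in base R using the Affairs codes listed above. `sample_affairs` is a hypothetical vector of twelve illustrative values, not the full data set.

```r
# Affairs codes and labels as listed in the codebook above
affairs_codes  <- c(0, 1, 2, 3, 7, 12)
affairs_labels <- c("none", "once", "twice", "three times",
                    "4 - 10 times", "daily, weekly or monthly")

# Hypothetical sample: twelve illustrative Affairs values, not the full data set
sample_affairs <- c(0, 0, 0, 0, 0, 0, 7, 1, 7, 2, 2, 1)

# Convert the numeric codes to a labeled factor so tables are readable
affairs_factor <- factor(sample_affairs,
                         levels = affairs_codes,
                         labels = affairs_labels)

# Count how many sample values fall in each category
table(affairs_factor)
```

Keeping the original numeric column alongside the factor is probably the safer choice, since the summaries later in this report rely on the numeric codes.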
Using this dataset I am looking to answer the following questions:
- What is the average age of the individuals that participated in the data set?
- Is there an occupation that has a greater number of affairs / no affairs? What about in the education level?
- Is there a level of religiousness that had more affairs? Any level that had the least or no affairs?
- Is there any correlation between the religiousness and years married with the number of affairs?
- Would more years of marriage have an effect on the self rating from the individuals? How does this affect the number of affairs?
Loading the data set and using the summary function to gain an overview of the data set.
# Load the data set
#theURL <- "https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/AER/Affairs.csv"
#affairsdf <- read.csv(file = theURL , header = TRUE , sep = ",")
#head(affairsdf)
# BONUS
theURL <- "https://raw.githubusercontent.com/letisalbal/R-Final-Project/main/Affairs.csv"
affairsdf <- read.csv(file = theURL , header = TRUE , sep = ",")
head(affairsdf)
## X affairs gender age yearsmarried children religiousness education
## 1 4 0 male 37 10.00 no 3 18
## 2 5 0 female 27 4.00 no 4 14
## 3 11 0 female 32 15.00 yes 1 12
## 4 16 0 male 57 15.00 yes 5 18
## 5 23 0 male 22 0.75 no 2 17
## 6 29 0 female 32 1.50 no 2 17
## occupation rating
## 1 7 4
## 2 6 4
## 3 1 4
## 4 6 5
## 5 6 3
## 6 5 5
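
Before summarizing, a quick structural check can confirm the column types and whether any values are missing. This is a minimal sketch, assuming a small data frame named `head_rows` (a hypothetical name) that re-creates only the six rows printed above rather than reading the full file.

```r
# Re-create the six rows shown by head(affairsdf); illustration only
head_rows <- data.frame(
  X             = c(4, 5, 11, 16, 23, 29),
  affairs       = c(0, 0, 0, 0, 0, 0),
  gender        = c("male", "female", "female", "male", "male", "female"),
  age           = c(37, 27, 32, 57, 22, 32),
  yearsmarried  = c(10, 4, 15, 15, 0.75, 1.5),
  children      = c("no", "no", "yes", "yes", "no", "no"),
  religiousness = c(3, 4, 1, 5, 2, 2),
  education     = c(18, 14, 12, 18, 17, 17),
  occupation    = c(7, 6, 1, 6, 6, 5),
  rating        = c(4, 4, 4, 5, 3, 5)
)

# X is only a row identifier, so drop it before analysis
head_rows$X <- NULL

# Treat the two text columns as categorical variables
head_rows$gender   <- factor(head_rows$gender)
head_rows$children <- factor(head_rows$children)

# Count missing values per column and inspect the column types
colSums(is.na(head_rows))
str(head_rows)
```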
Data Exploration:
# Print summary
summary(affairsdf)
## X affairs gender age
## Min. : 4 Min. : 0.000 Length:601 Min. :17.50
## 1st Qu.: 528 1st Qu.: 0.000 Class :character 1st Qu.:27.00
## Median :1009 Median : 0.000 Mode :character Median :32.00
## Mean :1060 Mean : 1.456 Mean :32.49
## 3rd Qu.:1453 3rd Qu.: 0.000 3rd Qu.:37.00
## Max. :9029 Max. :12.000 Max. :57.00
## yearsmarried children religiousness education
## Min. : 0.125 Length:601 Min. :1.000 Min. : 9.00
## 1st Qu.: 4.000 Class :character 1st Qu.:2.000 1st Qu.:14.00
## Median : 7.000 Mode :character Median :3.000 Median :16.00
## Mean : 8.178 Mean :3.116 Mean :16.17
## 3rd Qu.:15.000 3rd Qu.:4.000 3rd Qu.:18.00
## Max. :15.000 Max. :5.000 Max. :20.00
## occupation rating
## Min. :1.000 Min. :1.000
## 1st Qu.:3.000 1st Qu.:3.000
## Median :5.000 Median :4.000
## Mean :4.195 Mean :3.932
## 3rd Qu.:6.000 3rd Qu.:5.000
## Max. :7.000 Max. :5.000
# Print colnames
colnames(affairsdf, do.NULL = TRUE, prefix = "col")
## [1] "X" "affairs" "gender" "age"
## [5] "yearsmarried" "children" "religiousness" "education"
## [9] "occupation" "rating"
# Print Mean and Median
means <- sapply(affairsdf[, c("affairs", "age", "yearsmarried","religiousness", "education", "occupation", "rating")], mean)
medians <- sapply(affairsdf[, c("affairs", "age", "yearsmarried","religiousness", "education", "occupation", "rating")], median)
means_medianDF <- data.frame(means, medians)
means_medianDF
## means medians
## affairs 1.455907 0
## age 32.487521 32
## yearsmarried 8.177696 7
## religiousness 3.116473 3
## education 16.166389 16
## occupation 4.194676 5
## rating 3.931780 4
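
The gap between the mean of affairs (about 1.46) and its median (0) suggests that a large share of respondents report zero. One way to check that share directly is sketched below. `sample_affairs` is a hypothetical vector of twelve illustrative values, so its results will differ from those of the full data set.

```r
# Hypothetical sample: twelve illustrative Affairs values, not the full data set
sample_affairs <- c(0, 0, 0, 0, 0, 0, 7, 1, 7, 2, 2, 1)

# mean() of a logical vector gives the proportion of TRUE values,
# here the share of individuals reporting zero affairs
share_none <- mean(sample_affairs == 0)
share_none

# Compare the mean and median to see the effect of the skew
c(mean = mean(sample_affairs), median = median(sample_affairs))

# Quartiles of the same values
quantile(sample_affairs, probs = c(0.25, 0.5, 0.75))
```

Replacing `sample_affairs` with `affairsdf$affairs` would give the figure for the whole data set.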
Data wrangling:
# Create a new data frame with a subset of the columns
Affairs_subset <- affairsdf[,c ("affairs", "age", "yearsmarried", "religiousness", "education", "occupation", "rating")]
head(Affairs_subset)
## affairs age yearsmarried religiousness education occupation rating
## 1 0 37 10.00 3 18 7 4
## 2 0 27 4.00 4 14 6 4
## 3 0 32 15.00 1 12 1 4
## 4 0 57 15.00 5 18 6 5
## 5 0 22 0.75 2 17 6 3
## 6 0 32 1.50 2 17 5 5
tail(Affairs_subset)
## affairs age yearsmarried religiousness education occupation rating
## 596 7 47 15.0 3 16 4 2
## 597 1 22 1.5 1 12 2 5
## 598 7 32 10.0 2 18 5 4
## 599 2 32 10.0 2 17 6 5
## 600 2 22 7.0 3 18 6 2
## 601 1 32 15.0 3 14 1 5
# Create new column names for the new data frame
colnames(Affairs_subset) <- c("Affairs", "Age", "Years_Married", "Religiousness", "Education_Level","Occupation", "Self_Rating")
colnames(Affairs_subset)
## [1] "Affairs" "Age" "Years_Married" "Religiousness"
## [5] "Education_Level" "Occupation" "Self_Rating"
# Print Table to see new changes
head(Affairs_subset)
## Affairs Age Years_Married Religiousness Education_Level Occupation
## 1 0 37 10.00 3 18 7
## 2 0 27 4.00 4 14 6
## 3 0 32 15.00 1 12 1
## 4 0 57 15.00 5 18 6
## 5 0 22 0.75 2 17 6
## 6 0 32 1.50 2 17 5
## Self_Rating
## 1 4
## 2 4
## 3 4
## 4 5
## 5 3
## 6 5
tail(Affairs_subset)
## Affairs Age Years_Married Religiousness Education_Level Occupation
## 596 7 47 15.0 3 16 4
## 597 1 22 1.5 1 12 2
## 598 7 32 10.0 2 18 5
## 599 2 32 10.0 2 17 6
## 600 2 22 7.0 3 18 6
## 601 1 32 15.0 3 14 1
## Self_Rating
## 596 2
## 597 5
## 598 4
## 599 5
## 600 2
## 601 5
# Print summary of new data set
summary(Affairs_subset)
## Affairs Age Years_Married Religiousness
## Min. : 0.000 Min. :17.50 Min. : 0.125 Min. :1.000
## 1st Qu.: 0.000 1st Qu.:27.00 1st Qu.: 4.000 1st Qu.:2.000
## Median : 0.000 Median :32.00 Median : 7.000 Median :3.000
## Mean : 1.456 Mean :32.49 Mean : 8.178 Mean :3.116
## 3rd Qu.: 0.000 3rd Qu.:37.00 3rd Qu.:15.000 3rd Qu.:4.000
## Max. :12.000 Max. :57.00 Max. :15.000 Max. :5.000
## Education_Level Occupation Self_Rating
## Min. : 9.00 Min. :1.000 Min. :1.000
## 1st Qu.:14.00 1st Qu.:3.000 1st Qu.:3.000
## Median :16.00 Median :5.000 Median :4.000
## Mean :16.17 Mean :4.195 Mean :3.932
## 3rd Qu.:18.00 3rd Qu.:6.000 3rd Qu.:5.000
## Max. :20.00 Max. :7.000 Max. :5.000
# Print Mean and Median for new data set
means2 <- sapply(Affairs_subset[, c("Affairs", "Age", "Years_Married", "Religiousness", "Education_Level", "Occupation", "Self_Rating")], mean)
medians2 <- sapply(Affairs_subset[, c("Affairs", "Age", "Years_Married", "Religiousness", "Education_Level", "Occupation", "Self_Rating")], median)
means_medianDF2 <- data.frame(means2, medians2)
means_medianDF2
## means2 medians2
## Affairs 1.455907 0
## Age 32.487521 32
## Years_Married 8.177696 7
## Religiousness 3.116473 3
## Education_Level 16.166389 16
## Occupation 4.194676 5
## Self_Rating 3.931780 4
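
The wrangling step could also add derived columns. The sketch below is one possible approach, not part of the original analysis. The column names `Had_Affair` and `Marriage_Group`, the break points, and the twelve-row `sample_subset` are all assumptions made for illustration.

```r
# Hypothetical sample: twelve illustrative rows, not the full data set
sample_subset <- data.frame(
  Affairs       = c(0, 0, 0, 0, 0, 0, 7, 1, 7, 2, 2, 1),
  Years_Married = c(10, 4, 15, 15, 0.75, 1.5, 15, 1.5, 10, 10, 7, 15)
)

# Derived column 1: TRUE when the individual reported at least one affair
sample_subset$Had_Affair <- sample_subset$Affairs > 0

# Derived column 2: group the coded years of marriage into three bands.
# cut() uses right-closed intervals: (0,2], (2,8], (8,15]
sample_subset$Marriage_Group <- cut(
  sample_subset$Years_Married,
  breaks = c(0, 2, 8, 15),
  labels = c("2 yrs or less", "3 - 8 yrs", "9 yrs or more")
)

# Cross-tabulate the two derived columns
table(sample_subset$Marriage_Group, sample_subset$Had_Affair)
```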
Graphics:
I’ve created a grouped bar graph to show the occupation of the individuals by education level. Occupation 1, higher executives, has many individuals with some college completed. Occupation 5, skilled manual workers, has individuals with master’s degrees, some graduate work and college degrees. Lastly, occupation 6, machine operators / semi-skilled, has more individuals with advanced degrees. In occupations 7 and 2, unskilled and managers respectively, the group consists of individuals with a grade school, high school or some college level of education.
# Summary of Occupation
summary(Affairs_subset$Occupation)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 3.000 5.000 4.195 6.000 7.000
# Summary of Education Level
summary(Affairs_subset$Education_Level)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.00 14.00 16.00 16.17 18.00 20.00
# Summary of Affairs
summary(Affairs_subset$Affairs)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 1.456 0.000 12.000
# Grouped Bar graph
counts <- table(Affairs_subset$Education_Level, Affairs_subset$Occupation)
barplot(counts, main="Occupation based on Education Level",
xlab="Occupation", col=c("yellow","orange", "purple", "pink", "red", "blue", "green"), legend = rownames(counts), beside=TRUE)

To answer question 2, here’s a bar graph showing the number of affairs per occupation. More individuals had zero affairs than one or more. The occupations with the highest counts of zero affairs were higher executives, skilled manual workers, and machine operators / semi-skilled workers.
# Grouped Bar graph
counts <- table(Affairs_subset$Occupation, Affairs_subset$Affairs)
barplot(counts, main="Occupation and Affairs Comparison",
xlab="Affairs", col=c("yellow","orange", "purple", "pink", "red", "blue", "green"), legend = rownames(counts), beside=TRUE)

For part 2 of question 2, I compare education level with the number of affairs. Similar to part 1 of the question, a range of education levels show high counts of zero affairs, including some college, college graduate, master’s degree, some graduate work and advanced degree.
# Grouped Bar graph
counts <- table(Affairs_subset$Education_Level, Affairs_subset$Affairs)
barplot(counts, main="Education Level and Affairs Comparison",
xlab="Affairs", col=c("yellow","orange", "purple", "pink", "red", "blue", "green"), legend = rownames(counts), beside=TRUE)
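
Raw counts can be hard to compare when the groups differ in size, so a table of within-group shares may help with question 2. This is a minimal sketch using `prop.table()` from base R. The twelve-row `sample_subset` is a hypothetical sample, so the shares it prints say nothing about the full data set.

```r
# Hypothetical sample: twelve illustrative rows, not the full data set
sample_subset <- data.frame(
  Affairs         = c(0, 0, 0, 0, 0, 0, 7, 1, 7, 2, 2, 1),
  Education_Level = c(18, 14, 12, 18, 17, 17, 16, 12, 18, 17, 18, 14)
)

# Counts of individuals with and without any affair, per education level
counts <- table(Education  = sample_subset$Education_Level,
                Had_Affair = sample_subset$Affairs > 0)

# margin = 1 divides each row by its own total,
# giving the share within each education level
row_shares <- prop.table(counts, margin = 1)
round(row_shares, 2)
```

The same pattern should work for occupation by swapping in the `Occupation` column.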

A big question is whether religiousness plays an important part in whether people have an affair. Here we see the levels of religiousness throughout the data set, where (2) “not at all” and (4) “somewhat” religious have the highest counts.
# Summary of Religiousness
summary(Affairs_subset$Religiousness)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 3.116 4.000 5.000
# Summary of Years Married
summary(Affairs_subset$Years_Married)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.125 4.000 7.000 8.178 15.000 15.000
#Histogram for Religiousness
library(ggplot2)  # provides ggplot() and geom_histogram()
Hist <- ggplot(Affairs_subset, aes(x=Religiousness)) +
geom_histogram(fill="blue", bins=10)
Hist
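
Religiousness takes only five coded values, so a bar chart of counts per level may be easier to read than a histogram with 10 bins, which leaves empty gaps between the levels. The following is a sketch of that alternative using `geom_bar()` from ggplot2. `sample_religiousness` is a hypothetical data frame of twelve illustrative values.

```r
library(ggplot2)

# Hypothetical sample: twelve illustrative values, not the full data set
sample_religiousness <- data.frame(
  Religiousness = c(3, 4, 1, 5, 2, 2, 3, 1, 2, 2, 3, 3)
)

# factor() makes ggplot2 treat each level as a category,
# and geom_bar() counts the rows in each category
bar_plot <- ggplot(sample_religiousness, aes(x = factor(Religiousness))) +
  geom_bar(fill = "blue") +
  labs(x = "Religiousness", y = "Count")
bar_plot
```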

The histogram of Affairs values shows that most people in the data set had zero affairs compared to the rest.
#Histogram for Affairs
Hist <- ggplot(Affairs_subset, aes(x=Affairs)) +
geom_histogram(fill="green", bins=10)
Hist

The histogram of the years married shows that most people were married for 12 or more years in this data set.
#Histogram for Years Married
Hist <- ggplot(Affairs_subset, aes(x=Years_Married)) +
geom_histogram(fill="red", bins=10)
Hist

The scatter plot below shows the relationship between years married and religiousness, with points colored by number of affairs. Many of the individuals married less than 5 years, at all levels of religiousness, had between 0 - 3 affairs, with a couple of them having daily, weekly or monthly affairs. With more years of marriage and less religiousness you see more variation in the number of affairs, while those with more years married and more religiousness show fewer affairs.
# Scatter Plot between Religiousness and Years Married with the number of Affairs
ScatterPlot <- ggplot(Affairs_subset, aes(x = Religiousness, y = Years_Married, color = factor(Affairs)))+
geom_point(size=2.5)
ScatterPlot
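
Question 4 asks about correlation, which a plot alone cannot quantify. One hedged option is a rank-based (Spearman) correlation, which suits ordinal codes like these better than Pearson. Choosing Spearman is an assumption made here, not a method used in the original analysis, and the twelve-row `sample_subset` is a hypothetical sample far too small to support conclusions.

```r
# Hypothetical sample: twelve illustrative rows, not the full data set
sample_subset <- data.frame(
  Affairs       = c(0, 0, 0, 0, 0, 0, 7, 1, 7, 2, 2, 1),
  Religiousness = c(3, 4, 1, 5, 2, 2, 3, 1, 2, 2, 3, 3),
  Years_Married = c(10, 4, 15, 15, 0.75, 1.5, 15, 1.5, 10, 10, 7, 15)
)

# Spearman correlation works on ranks, so it does not assume
# that the coded values are evenly spaced
cor_matrix <- cor(sample_subset, method = "spearman")
round(cor_matrix, 2)
```

Values near 0 indicate little monotonic association, and values near -1 or 1 indicate a strong one.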

Looking into years married, the average in this data set is about 8 years. I created a box plot as well as a scatter plot to see if there’s any correlation between years married and self rating, as well as between self rating and affairs.
# Summary for Years Married
summary(Affairs_subset$Years_Married)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.125 4.000 7.000 8.178 15.000 15.000
# Summary for Self Rating
summary(Affairs_subset$Self_Rating)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 3.000 4.000 3.932 5.000 5.000
The box plot below shows that, among those who rated themselves very unhappy, there are outliers for individuals married 3 - 5 years and 6 - 8 years.
# Box plot for Years Married and Self Rating
ggplot(Affairs_subset, aes(x=as.factor(Self_Rating), y=Years_Married)) +
geom_boxplot(fill="slateblue", alpha=0.2) + coord_cartesian(ylim = c(0, 20)) +
xlab("Self Rating")

Comparing self rating and affairs, there are plenty of outliers among individuals who rated their marriages average / neutral, happier than average and very happy.
# Box plot for Affairs and Self Rating
ggplot(Affairs_subset, aes(x=as.factor(Self_Rating), y=Affairs)) +
geom_boxplot(fill="slateblue", alpha=0.2) + coord_cartesian(ylim = c(0, 20)) +
xlab("Self Rating")
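
Since most individuals report zero affairs, the boxes collapse toward zero and the plot mostly shows outliers. A numeric summary per self rating can complement it. This is a minimal sketch with `tapply()` from base R, using a hypothetical twelve-row `sample_subset` rather than the full data set.

```r
# Hypothetical sample: twelve illustrative rows, not the full data set
sample_subset <- data.frame(
  Affairs     = c(0, 0, 0, 0, 0, 0, 7, 1, 7, 2, 2, 1),
  Self_Rating = c(4, 4, 4, 5, 3, 5, 2, 5, 4, 5, 2, 5)
)

# Mean Affairs code within each self rating level
mean_by_rating <- tapply(sample_subset$Affairs,
                         sample_subset$Self_Rating, mean)

# Share of individuals with at least one affair within each level
share_any <- tapply(sample_subset$Affairs > 0,
                    sample_subset$Self_Rating, mean)

mean_by_rating
share_any
```

Because Affairs is a coded variable (7 stands for 4 - 10 times, 12 for daily, weekly or monthly), the share with any affair is probably the easier of the two numbers to interpret.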

The scatter plot below shows the relationship between self rating and years married, with points colored by number of affairs. Many individuals married less than 5 years who rated unhappy, average / neutral, happy and very unhappy have fewer affairs. Between 5 and 12 years of marriage, the number of affairs goes up at all self rating levels. Note the two points coded 12 (daily, weekly or monthly) among those who rated average / neutral and very happy.
# Scatter Plot between Self Rating and Years Married with the number of Affairs
ScatterPlot <- ggplot(Affairs_subset, aes(x = Self_Rating, y = Years_Married, color = factor(Affairs)))+
geom_point(size=2.5)
ScatterPlot
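
Both axes in this plot hold a handful of coded values, so many individuals land on exactly the same point and hide one another. A small random offset (jitter) is one common way to reveal the overlap. The sketch below uses `geom_jitter()` from ggplot2. The jitter widths, the seed and the twelve-row `sample_subset` are assumptions made for illustration.

```r
library(ggplot2)

# Hypothetical sample: twelve illustrative rows, not the full data set
sample_subset <- data.frame(
  Affairs       = c(0, 0, 0, 0, 0, 0, 7, 1, 7, 2, 2, 1),
  Self_Rating   = c(4, 4, 4, 5, 3, 5, 2, 5, 4, 5, 2, 5),
  Years_Married = c(10, 4, 15, 15, 0.75, 1.5, 15, 1.5, 10, 10, 7, 15)
)

# Jitter shifts each point slightly so overlapping points become visible;
# the seed makes the random offsets reproducible
jitter_plot <- ggplot(sample_subset,
                      aes(x = Self_Rating, y = Years_Married,
                          color = factor(Affairs))) +
  geom_jitter(size = 2.5,
              position = position_jitter(width = 0.2, height = 0.3, seed = 1)) +
  labs(x = "Self Rating", y = "Years Married", color = "Affairs")
jitter_plot
```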

Scatter plot Matrix showing the data set as a whole
# Scatter plot Matrix Part 1
pairs(~Affairs+Age+Years_Married+Religiousness+Education_Level+Occupation+Self_Rating, data=Affairs_subset,
main="Affairs Data Set Scatterplot Matrix")

CONCLUSION: Through data exploration and wrangling, I came to the realization that in any occupation or education level there will be some level of affair(s). In this data set, more individuals had zero affairs regardless of occupation and education level. My initial thought in regard to religiousness was that the more religious an individual was and the longer the years married, the fewer the affairs, but this analysis proved me wrong. The data showed a wide distribution of affairs regardless of any of these factors. The same goes for years married and self rating. There were individuals who rated very unhappy and unhappy and still had fewer affairs under 10 years of marriage than those who rated happy and very happy. The truth is there’s no telling what kind of person will be more inclined to have more or fewer affairs based on years married, age, education level, occupation or religiousness.