R Bridge Course Final Project

This is a final project to show off what you have learned. Select your data set from the list below: http://vincentarelbundock.github.io/Rdatasets/ (click on the csv index for a list). Another good source is found here: https://archive.ics.uci.edu/ml/datasets.html

The presentation approach is up to you but it should contain the following:

1. Data Exploration: This should include summary statistics, means, medians, quartiles, or any other relevant information about the data set. Please include some conclusions in the R Markdown text
2. Data wrangling: Please perform some basic transformations. They will need to make sense but could include column renaming, creating a subset of the data, replacing values, or creating new columns with derived data (for example - if it makes sense you could sum two columns together)
3. Graphics: Please make sure to display at least one scatter plot, box plot and histogram. Don’t be limited to this. Please explore the many other options in R packages such as ggplot2.
4. Meaningful question for analysis: Please state at the beginning a meaningful question for analysis. Use the first three steps and anything else that would be helpful to answer the question you are posing from the data set you chose. Please write a brief conclusion paragraph in R markdown at the end.
5. BONUS - place the original .csv in a github file and have R read from the link. This will be a very useful skill as you progress in your data science education and career.
ANALYSIS: I decided to analyze the Affairs data set taken from http://vincentarelbundock.github.io/Rdatasets/. This data set covers 601 married individuals and records how often, if at all, each of them had an affair during the past year.
This dataset looks into:
  • Affairs:
    • 0 = none
    • 1 = once
    • 2 = twice
    • 3 = three times
    • 7 = 4 - 10 times
    • 12 = daily, weekly or monthly
  • Gender:
    • Male or Female
  • Age:
    • 17.5 = under 20 yrs old
    • 22 = 20 - 24 yrs old
    • 27 = 25 - 29 yrs old
    • 32 = 30 - 34 yrs old
    • 37 = 35 - 39 yrs old
    • 42 = 40 - 44 yrs old
    • 47 = 45 - 49 yrs old
    • 52 = 50 - 54 yrs old
    • 57 = 55 and over
  • Years Married:
    • 0.125 = 3 months or less
    • 0.417 = 4 - 6 months
    • 0.75 = 6 months - 1 yr
    • 1.5 = 1 - 2 yrs
    • 4 = 3 - 5 yrs
    • 7 = 6 - 8 yrs
    • 10 = 9 - 11 yrs
    • 15 = 12 or more yrs
  • Children: if they had any children in their marriage
    • Yes or No
  • Religiousness:
    • 1 = anti-religious
    • 2 = not at all religious
    • 3 = slightly religious
    • 4 = somewhat religious
    • 5 = very religious
  • Education:
    • 9 = grade school
    • 12 = high school graduate
    • 14 = some college
    • 16 = college graduate
    • 17 = some graduate work
    • 18 = master’s degree
    • 20 = advanced degree
  • Occupation: according to the Hollingshead classification in reverse numbering
    • 1 = higher executive, major professional, etc
    • 2 = small business owner, farm owner, teacher, low level manager, salaried worker
    • 3 = technician, semiprofessional, supervisor, office manager
    • 4 = clerical/sales, small farm owner
    • 5 = skilled manual worker, craftsman, police and fire services, enlisted military and non-commissioned officer
    • 6 = machine operators, semi-skilled worker
    • 7 = unskilled, service worker
  • Self Rating:
    • 1 = very unhappy
    • 2 = somewhat unhappy
    • 3 = average / neutral
    • 4 = happier than average
    • 5 = very happy
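
Because these variables are stored as plain numbers, tables and plot legends show the codes rather than their meaning. The following is a minimal sketch of how the Affairs codes could be turned into a labeled, ordered factor. The helper name `label_affairs` and the toy input are assumptions made for illustration; the labels are copied from the Affairs list above.

```r
# Hypothetical helper (not part of the original analysis): map the numeric
# Affairs codes to the descriptive labels listed above.
label_affairs <- function(codes) {
  factor(codes,
         levels = c(0, 1, 2, 3, 7, 12),
         labels = c("none", "once", "twice", "three times",
                    "4 - 10 times", "daily, weekly or monthly"),
         ordered = TRUE)
}

# Toy codes for illustration only, not rows from the data set
example_codes <- c(7, 0, 12, 1)
labeled_codes <- label_affairs(example_codes)
print(labeled_codes)
```

The same pattern would apply to the other coded columns, with their own `levels` and `labels`.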

Using this dataset I am looking to answer the following questions:

Below I load the data set and use the summary function to get an overview of it.

# Load the data set
#theURL <- "https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/AER/Affairs.csv"

#affairsdf <- read.csv(file = theURL , header = TRUE , sep = ",")

#head(affairsdf)

# BONUS 
theURL <- "https://raw.githubusercontent.com/letisalbal/R-Final-Project/main/Affairs.csv"

affairsdf <- read.csv(file = theURL , header = TRUE , sep = ",")

head(affairsdf)
##    X affairs gender age yearsmarried children religiousness education
## 1  4       0   male  37        10.00       no             3        18
## 2  5       0 female  27         4.00       no             4        14
## 3 11       0 female  32        15.00      yes             1        12
## 4 16       0   male  57        15.00      yes             5        18
## 5 23       0   male  22         0.75       no             2        17
## 6 29       0 female  32         1.50       no             2        17
##   occupation rating
## 1          7      4
## 2          6      4
## 3          1      4
## 4          6      5
## 5          6      3
## 6          5      5

Data Exploration:

# Print summary
summary(affairsdf)
##        X           affairs          gender               age       
##  Min.   :   4   Min.   : 0.000   Length:601         Min.   :17.50  
##  1st Qu.: 528   1st Qu.: 0.000   Class :character   1st Qu.:27.00  
##  Median :1009   Median : 0.000   Mode  :character   Median :32.00  
##  Mean   :1060   Mean   : 1.456                      Mean   :32.49  
##  3rd Qu.:1453   3rd Qu.: 0.000                      3rd Qu.:37.00  
##  Max.   :9029   Max.   :12.000                      Max.   :57.00  
##   yearsmarried      children         religiousness     education    
##  Min.   : 0.125   Length:601         Min.   :1.000   Min.   : 9.00  
##  1st Qu.: 4.000   Class :character   1st Qu.:2.000   1st Qu.:14.00  
##  Median : 7.000   Mode  :character   Median :3.000   Median :16.00  
##  Mean   : 8.178                      Mean   :3.116   Mean   :16.17  
##  3rd Qu.:15.000                      3rd Qu.:4.000   3rd Qu.:18.00  
##  Max.   :15.000                      Max.   :5.000   Max.   :20.00  
##    occupation        rating     
##  Min.   :1.000   Min.   :1.000  
##  1st Qu.:3.000   1st Qu.:3.000  
##  Median :5.000   Median :4.000  
##  Mean   :4.195   Mean   :3.932  
##  3rd Qu.:6.000   3rd Qu.:5.000  
##  Max.   :7.000   Max.   :5.000
# Print colnames
colnames(affairsdf, do.NULL = TRUE, prefix = "col")
##  [1] "X"             "affairs"       "gender"        "age"          
##  [5] "yearsmarried"  "children"      "religiousness" "education"    
##  [9] "occupation"    "rating"
# Print Mean and Median
means <- sapply(affairsdf[, c("affairs", "age", "yearsmarried","religiousness", "education", "occupation", "rating")], mean)
medians <- sapply(affairsdf[, c("affairs", "age", "yearsmarried","religiousness", "education", "occupation", "rating")], median)

means_medianDF <- data.frame(means, medians)
means_medianDF
##                   means medians
## affairs        1.455907       0
## age           32.487521      32
## yearsmarried   8.177696       7
## religiousness  3.116473       3
## education     16.166389      16
## occupation     4.194676       5
## rating         3.931780       4
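
The mean and median table above lists the column names by hand. As a possible alternative, the sketch below picks out the numeric columns automatically and adds the first and third quartiles. The helper name `numeric_summary` and the toy data frame are assumptions for illustration, not part of the original analysis.

```r
# Hypothetical helper: mean, median and quartiles for every numeric column.
numeric_summary <- function(df) {
  # Keep only the numeric columns (drops character columns such as gender)
  numeric_cols <- df[sapply(df, is.numeric)]
  # One row per column, one column per statistic
  t(sapply(numeric_cols, function(x) {
    c(mean   = mean(x),
      median = median(x),
      q1     = unname(quantile(x, 0.25)),
      q3     = unname(quantile(x, 0.75)))
  }))
}

# Toy data for illustration only, not rows from the data set
toy <- data.frame(affairs = c(0, 0, 1, 7),
                  age     = c(22, 27, 32, 37),
                  gender  = c("male", "female", "female", "male"))
stats <- numeric_summary(toy)
print(stats)
```

With the real data this would be called as `numeric_summary(affairsdf)`.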

Data wrangling:

# Create a new data frame with a subset of the columns
Affairs_subset <- affairsdf[,c ("affairs", "age", "yearsmarried", "religiousness", "education", "occupation", "rating")]
head(Affairs_subset)
##   affairs age yearsmarried religiousness education occupation rating
## 1       0  37        10.00             3        18          7      4
## 2       0  27         4.00             4        14          6      4
## 3       0  32        15.00             1        12          1      4
## 4       0  57        15.00             5        18          6      5
## 5       0  22         0.75             2        17          6      3
## 6       0  32         1.50             2        17          5      5
tail(Affairs_subset)
##     affairs age yearsmarried religiousness education occupation rating
## 596       7  47         15.0             3        16          4      2
## 597       1  22          1.5             1        12          2      5
## 598       7  32         10.0             2        18          5      4
## 599       2  32         10.0             2        17          6      5
## 600       2  22          7.0             3        18          6      2
## 601       1  32         15.0             3        14          1      5
# Create new column names for the new data frame
colnames(Affairs_subset) <- c("Affairs", "Age", "Years_Married", "Religiousness", "Education_Level","Occupation", "Self_Rating")
colnames(Affairs_subset)
## [1] "Affairs"         "Age"             "Years_Married"   "Religiousness"  
## [5] "Education_Level" "Occupation"      "Self_Rating"
# Print Table to see new changes
head(Affairs_subset)
##   Affairs Age Years_Married Religiousness Education_Level Occupation
## 1       0  37         10.00             3              18          7
## 2       0  27          4.00             4              14          6
## 3       0  32         15.00             1              12          1
## 4       0  57         15.00             5              18          6
## 5       0  22          0.75             2              17          6
## 6       0  32          1.50             2              17          5
##   Self_Rating
## 1           4
## 2           4
## 3           4
## 4           5
## 5           3
## 6           5
tail(Affairs_subset)
##     Affairs Age Years_Married Religiousness Education_Level Occupation
## 596       7  47          15.0             3              16          4
## 597       1  22           1.5             1              12          2
## 598       7  32          10.0             2              18          5
## 599       2  32          10.0             2              17          6
## 600       2  22           7.0             3              18          6
## 601       1  32          15.0             3              14          1
##     Self_Rating
## 596           2
## 597           5
## 598           4
## 599           5
## 600           2
## 601           5
# Print summary of new data set
summary(Affairs_subset)
##     Affairs            Age        Years_Married    Religiousness  
##  Min.   : 0.000   Min.   :17.50   Min.   : 0.125   Min.   :1.000  
##  1st Qu.: 0.000   1st Qu.:27.00   1st Qu.: 4.000   1st Qu.:2.000  
##  Median : 0.000   Median :32.00   Median : 7.000   Median :3.000  
##  Mean   : 1.456   Mean   :32.49   Mean   : 8.178   Mean   :3.116  
##  3rd Qu.: 0.000   3rd Qu.:37.00   3rd Qu.:15.000   3rd Qu.:4.000  
##  Max.   :12.000   Max.   :57.00   Max.   :15.000   Max.   :5.000  
##  Education_Level   Occupation     Self_Rating   
##  Min.   : 9.00   Min.   :1.000   Min.   :1.000  
##  1st Qu.:14.00   1st Qu.:3.000   1st Qu.:3.000  
##  Median :16.00   Median :5.000   Median :4.000  
##  Mean   :16.17   Mean   :4.195   Mean   :3.932  
##  3rd Qu.:18.00   3rd Qu.:6.000   3rd Qu.:5.000  
##  Max.   :20.00   Max.   :7.000   Max.   :5.000
# Print Mean and Median for new data set
means2 <- sapply(Affairs_subset[, c("Affairs", "Age", "Years_Married", "Religiousness", "Education_Level", "Occupation", "Self_Rating")], mean)
medians2 <- sapply(Affairs_subset[, c("Affairs", "Age", "Years_Married", "Religiousness", "Education_Level", "Occupation", "Self_Rating")], median)

means_medianDF2 <- data.frame(means2, medians2)
means_medianDF2
##                    means2 medians2
## Affairs          1.455907        0
## Age             32.487521       32
## Years_Married    8.177696        7
## Religiousness    3.116473        3
## Education_Level 16.166389       16
## Occupation       4.194676        5
## Self_Rating      3.931780        4
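
The wrangling above subsets and renames columns. A derived column is another option, for example a yes/no flag for whether any affair was reported. The sketch below is only an illustration: the helper name `add_affair_flag`, the column name `Had_Affair` and the toy rows are assumptions and do not appear in the original analysis.

```r
# Hypothetical helper: add a yes/no indicator derived from the Affairs count.
add_affair_flag <- function(df) {
  df$Had_Affair <- ifelse(df$Affairs > 0, "yes", "no")
  df
}

# Toy data for illustration only, not rows from the data set
toy <- data.frame(Affairs = c(0, 0, 1, 12),
                  Years_Married = c(10, 4, 1.5, 15))
toy <- add_affair_flag(toy)
print(table(toy$Had_Affair))
```

With the real data this would be called as `Affairs_subset <- add_affair_flag(Affairs_subset)`.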

Graphics:

Half of the people in this data set fall between the ages of 27 and 37 (the first and third quartiles). Both the mean and the median age are about 32.
# Summary for Age
summary(Affairs_subset$Age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   17.50   27.00   32.00   32.49   37.00   57.00
# Histogram for Age
hist(Affairs_subset$Age, main = "Age Histogram", xlab = "Age")
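
The mean and median describe the centre of the ages, not how often each age group occurs. A frequency table is one way to check which group is the most common. The sketch below assumes a hypothetical helper `most_common_value` and uses a toy vector for illustration.

```r
# Hypothetical helper: return the value that occurs most often in a numeric vector.
most_common_value <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[which.max(counts)])
}

# Toy ages for illustration only, not rows from the data set
toy_age <- c(22, 27, 27, 32, 27, 37)
print(table(toy_age))
print(most_common_value(toy_age))
```

With the real data this would be called as `most_common_value(Affairs_subset$Age)`.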

I created a grouped bar graph to show the education levels of the individuals in each occupation. Occupation 1, higher executives, has many individuals with some college completed. Occupation 5, skilled manual workers, has individuals with master’s degrees, some graduate work and college degrees. Occupation 6, machine operators / semi-skilled workers, has more individuals with advanced degrees. Occupations 7 and 2, unskilled workers and managers respectively, consist of individuals with a grade school, high school or some college level of education.
# Summary of Occupation
summary(Affairs_subset$Occupation)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   3.000   5.000   4.195   6.000   7.000
# Summary of Education Level
summary(Affairs_subset$Education_Level)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.00   14.00   16.00   16.17   18.00   20.00
# Summary of Affairs
summary(Affairs_subset$Affairs)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   1.456   0.000  12.000
# Grouped Bar graph
counts <- table(Affairs_subset$Education_Level, Affairs_subset$Occupation)
barplot(counts, main="Occupation based on Education Level",
  xlab="Occupation", col=c("yellow","orange", "purple", "pink", "red", "blue", "green"), legend = rownames(counts), beside=TRUE)

To answer question 2, here is a bar graph showing the number of affairs reported in each occupation. Across occupations, more individuals reported zero affairs than one or more. The occupations with the highest counts of zero affairs were higher executives, skilled manual workers, and machine operators / semi-skilled workers.
# Grouped Bar graph
counts <- table(Affairs_subset$Occupation, Affairs_subset$Affairs)
barplot(counts, main="Occupation and Affairs Comparison",
  xlab="Affairs", col=c("yellow","orange", "purple", "pink", "red", "blue", "green"), legend = rownames(counts), beside=TRUE)
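
Raw counts make large occupations look dominant simply because they contain more people. One possible complement is to compare shares within each occupation using `prop.table`. This is a sketch only: the helper name `affair_share_by_occupation` and the toy rows are assumptions for illustration.

```r
# Hypothetical helper: share of each Affairs code within every occupation.
affair_share_by_occupation <- function(df) {
  counts <- table(df$Occupation, df$Affairs)
  # margin = 1 makes every row (occupation) sum to 1
  round(prop.table(counts, margin = 1), 2)
}

# Toy data for illustration only, not rows from the data set
toy <- data.frame(Occupation = c(1, 1, 1, 1, 5, 5),
                  Affairs    = c(0, 0, 0, 1, 0, 7))
shares <- affair_share_by_occupation(toy)
print(shares)
```

With the real data this would be called as `affair_share_by_occupation(Affairs_subset)`.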

For part 2 of question 2, I compare education level with the number of affairs. Similar to part 1, a range of education levels shows many individuals with zero affairs, including some college, college graduate, some graduate work, master’s degree and advanced degree.
# Grouped Bar graph
counts <- table(Affairs_subset$Education_Level, Affairs_subset$Affairs)
barplot(counts, main="Education Level and Affairs Comparison",
  xlab="Affairs", col=c("yellow","orange", "purple", "pink", "red", "blue", "green"), legend = rownames(counts), beside=TRUE)

A big question is whether religiousness plays an important part in whether people have an affair. Here we see the levels of religiousness throughout the data set, where (2) “not at all” and (4) “somewhat” religious have the highest counts.
# Summary of Religiousness
summary(Affairs_subset$Religiousness)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   3.116   4.000   5.000
# Summary of Years Married
summary(Affairs_subset$Years_Married)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.125   4.000   7.000   8.178  15.000  15.000
# Load ggplot2, which the plots below rely on
library(ggplot2)
# Histogram for Religiousness
Hist <- ggplot(Affairs_subset, aes(x=Religiousness)) + 
  geom_histogram(fill="blue", bins=10)
Hist

The histogram of Affairs shows that most people in the data set reported zero affairs.
#Histogram for Affairs
Hist <- ggplot(Affairs_subset, aes(x=Affairs)) + 
  geom_histogram(fill="green", bins=10)
Hist

The histogram of years married shows that the largest group in this data set had been married for 12 or more years.
#Histogram for Years Married
Hist <- ggplot(Affairs_subset, aes(x=Years_Married)) + 
  geom_histogram(fill="red", bins=10)
Hist

The scatter plot below shows the relationship between religiousness and years married, colored by the number of affairs. Many of the individuals married for less than 5 years, at all levels of religiousness, had between 0 and 3 affairs, with a couple of them having affairs daily, weekly or monthly. With more years of marriage and less religiousness, the number of affairs varies more. Those with more years married and more religiousness tend to have fewer affairs.
# Scatter Plot between Religiousness and Years Married with the number of Affairs
ScatterPlot <- ggplot(Affairs_subset, aes(x = Religiousness, y = Years_Married, color = factor(Affairs)))+ 
  geom_point(size=2.5)
ScatterPlot
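
Both axes in this plot hold a small number of coded values, so many individuals land on exactly the same point and hide each other. Adding a little random noise (jitter) is one common way to reveal the overlap. The sketch below uses base R's `jitter` on toy values; the `amount` of 0.2 and the seed are arbitrary assumptions.

```r
# Fix the random seed so the jittered positions are reproducible
set.seed(1)

# Toy data for illustration only, not rows from the data set
toy <- data.frame(Religiousness = c(1, 1, 1, 5, 5),
                  Years_Married = c(15, 15, 15, 4, 4))

# Shift each point by a small random amount (at most 0.2 in either direction)
x_jit <- jitter(toy$Religiousness, amount = 0.2)
y_jit <- jitter(toy$Years_Married, amount = 0.2)

plot(x_jit, y_jit,
     xlab = "Religiousness (jittered)",
     ylab = "Years Married (jittered)",
     main = "Jittered Scatter Plot")
```

The jitter changes only where points are drawn, so it should be used for display and not for any calculation.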

Looking into years married, the average in this data set is about 8 years. I created box plots and a scatter plot to see if there is any correlation between years married and self rating, as well as between self rating and affairs.
# Summary for Years Married
summary(Affairs_subset$Years_Married)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.125   4.000   7.000   8.178  15.000  15.000
# Summary for Self Rating
summary(Affairs_subset$Self_Rating)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   3.000   4.000   3.932   5.000   5.000
The box plot below shows outliers among those who rated their marriage very unhappy and had been married 3 - 5 years or 6 - 8 years.
# Box plot for Years Married and Self Rating 
ggplot(Affairs_subset, aes(x=as.factor(Self_Rating), y=Years_Married)) + 
    geom_boxplot(fill="slateblue", alpha=0.2) + coord_cartesian(ylim = c(0, 20)) + 
    xlab("Self Rating")

Comparing self rating and affairs, there are plenty of outliers among individuals who rated their marriages average / neutral, happy and very happy.
# Box plot for Affairs and Self Rating
ggplot(Affairs_subset, aes(x=as.factor(Self_Rating), y=Affairs)) + 
    geom_boxplot(fill="slateblue", alpha=0.2) + coord_cartesian(ylim = c(0, 20)) +
    xlab("Self Rating")
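
Since most people report zero affairs, the boxes collapse to zero and nearly every non-zero value is drawn as an outlier. A table of the mean number of affairs per self rating can be easier to compare. The sketch below assumes a hypothetical helper `mean_affairs_by_rating` and uses toy rows for illustration.

```r
# Hypothetical helper: mean Affairs code for each Self_Rating level.
mean_affairs_by_rating <- function(df) {
  tapply(df$Affairs, df$Self_Rating, mean)
}

# Toy data for illustration only, not rows from the data set
toy <- data.frame(Self_Rating = c(1, 1, 5, 5, 5),
                  Affairs     = c(12, 0, 0, 0, 3))
means_by_rating <- mean_affairs_by_rating(toy)
print(means_by_rating)
```

With the real data this would be called as `mean_affairs_by_rating(Affairs_subset)`. Note that the Affairs values are category codes (7 stands for 4 - 10 times), so these means are a rough guide only.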

The scatter plot below shows the relationship between self rating and years married, colored by the number of affairs. Many individuals married for less than 5 years have fewer affairs, whether they rated their marriage unhappy, average / neutral, happy or very unhappy. Between 5 and 12 years of marriage, the number of affairs goes up at all self rating levels. Notice the two points with 12 affairs (daily, weekly or monthly), for individuals who rated average / neutral and very happy.
# Scatter Plot between Self Rating and Years Married with the number of Affairs
ScatterPlot <- ggplot(Affairs_subset, aes(x = Self_Rating, y = Years_Married, color = factor(Affairs)))+ 
  geom_point(size=2.5)
ScatterPlot

Scatter plot Matrix showing the data set as a whole
# Scatter plot matrix
pairs(~Affairs+Age+Years_Married+Religiousness+Education_Level+Occupation+Self_Rating, data=Affairs_subset,
   main="Affairs Data Set Scatterplot Matrix")
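
A scatter plot matrix of coded variables is hard to read by eye, so a correlation matrix can serve as a numeric companion. The sketch below uses Spearman rank correlation, which is my assumption here because most columns are ordered codes and not true measurements; the original analysis does not compute correlations. The helper name `rank_correlations` and the toy rows are illustrative.

```r
# Hypothetical helper: Spearman rank correlations between all columns.
# Assumes every column of df is numeric.
rank_correlations <- function(df) {
  round(cor(df, method = "spearman"), 2)
}

# Toy data for illustration only, not rows from the data set
toy <- data.frame(Affairs       = c(0, 1, 2, 3),
                  Self_Rating   = c(5, 4, 3, 1),
                  Years_Married = c(1.5, 4, 7, 15))
cors <- rank_correlations(toy)
print(cors)
```

With the real data this would be called as `rank_correlations(Affairs_subset)`. Values near 0 would be in line with the conclusion that no single factor stands out.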

CONCLUSION: Through data exploration and wrangling, I came to the realization that in any occupation or education level there will be some level of affairs. In this data set, more individuals had zero affairs than one or more, regardless of occupation and education level. My initial thought in regards to religiousness was that the more religious an individual was and the longer they had been married, the fewer affairs they would have, but the plots did not support this. The data showed a wide spread of affairs regardless of these factors. The same goes for years married and self rating. There were individuals who rated very unhappy and unhappy and still had fewer affairs under 10 years of marriage than those who rated happy and very happy. The truth is there’s no telling from this data set what kind of person will be inclined to have more or fewer affairs based on years married, age, education level, occupation or religiousness.