I selected the next data set as it was part of fivethirtyeight data set and was very interesting as it involved data on marriage rates. This wide data is across many cross sections like educational background, Race, Geographic Region, Employment backgrounds etc.
Readme: https://github.com/fivethirtyeight/data/tree/master/marriage
I thought it would be fun to play with this data to understand and analyse how the marriage rates vary across some of these cross sections for people of ages 25-34.
My Hypothesis (before starting the project):
1. Marriage rates have fallen with time. People are more independent now compared to in past
2. Marriage rates might be lower for people with higher educational backgrounds as I think they spend more time in attaining education and are more independent
3. Marriage rates might be lower in people of color
4. Marriage rates might be higher in people who are high income vs people who are low income. Let’s see…
# Load the libraries
library(tidyverse) #For Tidyverse## -- Attaching packages ---------------------------------------- tidyverse 1.2.1 --
## v ggplot2 2.2.1 v purrr 0.2.4
## v tibble 1.4.1 v dplyr 0.7.4
## v tidyr 0.8.0 v stringr 1.2.0
## v readr 1.1.1 v forcats 0.3.0
## -- Conflicts ------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(RCurl) #For File Operations## Loading required package: bitops
##
## Attaching package: 'RCurl'
## The following object is masked from 'package:tidyr':
##
## complete
library(dplyr) #For Manipulating the data frames
library(DT) #For Data table package
library(ggplot2) #For Visualizations# Good Practise: Basic house keeping: cleanup the env before you start new work
rm(list=ls())
# Garbage collector to free the memory
gc()## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 786326 42.0 1442291 77.1 1168576 62.5
## Vcells 1152423 8.8 2060183 15.8 1431918 11.0
# Good Practise: Set up the Working Directory when working with a file system
setwd("C:\\CUNY\\607Data\\Assignments\\project02")
# Read the File directly from Github
#fileURL <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/marriage/both_sexes.csv"
#untidy_data <- read.csv(text = getURL(fileURL), header = TRUE, sep = ",")
untidy_data <- read.csv("both_sexes.csv", header = TRUE, sep = ",")
# check the dimenstions
dim(untidy_data)## [1] 17 75
# Structure of the data frame
head(untidy_data,1)## X year date all_2534 HS_2534 SC_2534 BAp_2534 BAo_2534
## 1 1 1960 1960-01-01 0.1233145 0.1095332 0.1522818 0.2389952 0.2389952
## GD_2534 White_2534 Black_2534 Hisp_2534 NE_2534 MA_2534 Midwest_2534
## 1 NA 0.1164848 0.1621855 0.1393736 0.1504184 0.1628934 0.1121467
## South_2534 Mountain_2534 Pacific_2534 poor_2534 mid_2534 rich_2534
## 1 0.1090562 0.09152117 0.1198758 0.1371597 0.07514929 0.2066776
## all_3544 HS_3544 SC_3544 BAp_3544 BAo_3544 GD_3544 White_3544
## 1 0.07058157 0.06860309 0.06663695 0.1326265 0.1326265 NA 0.06825586
## Black_3544 Hisp_3544 NE_3544 MA_3544 Midwest_3544 South_3544
## 1 0.08836728 0.07307651 0.09194322 0.09347468 0.0686336 0.06026353
## Mountain_3544 Pacific_3544 poor_3544 mid_3544 rich_3544 all_4554
## 1 0.04739747 0.05822486 0.1019749 0.04717272 0.0855387 0.07254649
## HS_4554 SC_4554 BAp_4554 BAo_4554 GD_4554 White_4554 Black_4554
## 1 0.06840792 0.07903755 0.1536089 0.1536089 NA 0.07246692 0.06913249
## Hisp_4554 NE_4554 MA_4554 Midwest_4554 South_4554 Mountain_4554
## 1 0.06636058 0.1023641 0.09264788 0.07285321 0.05977295 0.04754183
## Pacific_4554 poor_4554 mid_4554 rich_4554 nokids_all_2534
## 1 0.05996993 0.1030055 0.05364421 0.07908591 0.4640564
## kids_all_2534 nokids_HS_2534 nokids_SC_2534 nokids_BAp_2534
## 1 0.002820625 0.4430148 0.5000402 0.5619099
## nokids_BAo_2534 nokids_GD_2534 kids_HS_2534 kids_SC_2534 kids_BAp_2534
## 1 0.5619099 NA 0.003318886 0.001150824 0.0005751073
## kids_BAo_2534 kids_GD_2534 nokids_poor_2534 nokids_mid_2534
## 1 0.0005751073 NA 0.4933061 0.410008
## nokids_rich_2534 kids_poor_2534 kids_mid_2534 kids_rich_2534
## 1 0.4921184 0.008722711 0.0007532065 0.0008027331
# Names of the variables
names(untidy_data) ## [1] "X" "year" "date"
## [4] "all_2534" "HS_2534" "SC_2534"
## [7] "BAp_2534" "BAo_2534" "GD_2534"
## [10] "White_2534" "Black_2534" "Hisp_2534"
## [13] "NE_2534" "MA_2534" "Midwest_2534"
## [16] "South_2534" "Mountain_2534" "Pacific_2534"
## [19] "poor_2534" "mid_2534" "rich_2534"
## [22] "all_3544" "HS_3544" "SC_3544"
## [25] "BAp_3544" "BAo_3544" "GD_3544"
## [28] "White_3544" "Black_3544" "Hisp_3544"
## [31] "NE_3544" "MA_3544" "Midwest_3544"
## [34] "South_3544" "Mountain_3544" "Pacific_3544"
## [37] "poor_3544" "mid_3544" "rich_3544"
## [40] "all_4554" "HS_4554" "SC_4554"
## [43] "BAp_4554" "BAo_4554" "GD_4554"
## [46] "White_4554" "Black_4554" "Hisp_4554"
## [49] "NE_4554" "MA_4554" "Midwest_4554"
## [52] "South_4554" "Mountain_4554" "Pacific_4554"
## [55] "poor_4554" "mid_4554" "rich_4554"
## [58] "nokids_all_2534" "kids_all_2534" "nokids_HS_2534"
## [61] "nokids_SC_2534" "nokids_BAp_2534" "nokids_BAo_2534"
## [64] "nokids_GD_2534" "kids_HS_2534" "kids_SC_2534"
## [67] "kids_BAp_2534" "kids_BAo_2534" "kids_GD_2534"
## [70] "nokids_poor_2534" "nokids_mid_2534" "nokids_rich_2534"
## [73] "kids_poor_2534" "kids_mid_2534" "kids_rich_2534"
# Create dataframe with rates
# Create focus group of people between ages 25-34
age2534 <- untidy_data %>% select(1:21)
head(age2534, 1)## X year date all_2534 HS_2534 SC_2534 BAp_2534 BAo_2534
## 1 1 1960 1960-01-01 0.1233145 0.1095332 0.1522818 0.2389952 0.2389952
## GD_2534 White_2534 Black_2534 Hisp_2534 NE_2534 MA_2534 Midwest_2534
## 1 NA 0.1164848 0.1621855 0.1393736 0.1504184 0.1628934 0.1121467
## South_2534 Mountain_2534 Pacific_2534 poor_2534 mid_2534 rich_2534
## 1 0.1090562 0.09152117 0.1198758 0.1371597 0.07514929 0.2066776
# Fix the Column names to more readable names
names(age2534) <- c("x", "Year", "Date", "All", "High School", "Some College", "Bachelor Degree Plus", "Bachelor Degree", "Graduate Degree", "White", "Black", "Hispanic", "New England", "Mid Atlantic", "Mid West", "South", "Mountain", "Pacific", "Low Income", "Middle Class", "Rich")
##
# Create a long format overall data for age 25-34
##
head(age2534)## x Year Date All High School Some College
## 1 1 1960 1960-01-01 0.1233145 0.1095332 0.1522818
## 2 2 1970 1970-01-01 0.1269715 0.1094000 0.1495096
## 3 3 1980 1980-01-01 0.1991767 0.1617313 0.2236916
## 4 4 1990 1990-01-01 0.2968306 0.2777491 0.2780912
## 5 5 2000 2000-01-01 0.3450087 0.3316545 0.3249205
## 6 6 2001 2001-01-01 0.3527767 0.3446069 0.3341101
## Bachelor Degree Plus Bachelor Degree Graduate Degree White Black
## 1 0.2389952 0.2389952 NA 0.1164848 0.1621855
## 2 0.2187031 0.2187031 NA 0.1179043 0.1855163
## 3 0.2881646 0.2881646 NA 0.1824126 0.3137500
## 4 0.3612968 0.3656655 0.3474505 0.2639256 0.4838556
## 5 0.3874906 0.3939579 0.3691740 0.3127149 0.5144994
## 6 0.3835686 0.3925148 0.3590304 0.3183506 0.5437985
## Hispanic New England Mid Atlantic Mid West South Mountain
## 1 0.1393736 0.1504184 0.1628934 0.1121467 0.1090562 0.09152117
## 2 0.1298769 0.1517231 0.1640680 0.1153741 0.1126220 0.10293602
## 3 0.1885440 0.2414327 0.2505925 0.1828339 0.1688435 0.17434230
## 4 0.2962372 0.3500384 0.3623321 0.2755046 0.2639794 0.25264326
## 5 0.3180681 0.4091852 0.4175565 0.3308022 0.3099712 0.30621032
## 6 0.3321214 0.4200581 0.4294281 0.3344332 0.3182688 0.30980779
## Pacific Low Income Middle Class Rich
## 1 0.1198758 0.1371597 0.07514929 0.2066776
## 2 0.1374964 0.1717202 0.08159207 0.1724093
## 3 0.2334279 0.3100591 0.14825303 0.1851082
## 4 0.3319579 0.4199108 0.24320008 0.2783226
## 5 0.3753061 0.5033676 0.30202036 0.2717386
## 6 0.3844799 0.5178771 0.31716118 0.2532041
age2534_tidy <- age2534 %>%
gather(Category, Single, 4:21) %>%
filter(!is.na(Single)) %>%
mutate(Year = as.numeric(Year), Married=1-Single) %>%
mutate(Single=format(Single, digits=2, nsmall=2),
Married=format(Married, digits=2, nsmall=2))# Datatable for people based on their education background
datatable(age2534_tidy)allpeople <- age2534_tidy %>% filter(Category=="All")
ggplot(allpeople, aes(Year, as.numeric(Married), group=1)) +
geom_line(color="red") +
geom_point(color="blue") +
expand_limits(y=.5) +
scale_x_continuous(limits = c(1960, 2013)) +
theme_bw() +
ggtitle("Declining Marriage Rates in All People of Ages 25-34") +
ylab("Marriage Rate") +
xlab("Year") +
theme(plot.title = element_text(lineheight = .8))people_by_edu <- age2534_tidy %>% filter(Category %in% c("High School",
"Some College",
"Bachelor Degree Plus",
"Bachelor Degree",
"Graduate Degree"))
ggplot(people_by_edu, aes(x = as.numeric(Year) , y = as.numeric(Married), group = Category, colour = Category)) +
geom_line() +
geom_point() +
scale_y_continuous() +
scale_x_continuous(limits = c(1960, 2013)) +
theme_bw() +
ylab("Marriage Rate") +
xlab("Year") +
ggtitle("Declining Marriage Rates by Educational Background \nin People Ages 25-34") +
ylab("Marriage Rate") +
theme(plot.title = element_text(lineheight = .8))people_by_race <- age2534_tidy %>% filter(Category %in% c("Black",
"White",
"Hispanic"))
ggplot(people_by_race, aes(x = as.numeric(Year) , y = as.numeric(Married), group = Category, colour = Category)) +
geom_line() +
geom_point() +
scale_y_continuous() +
scale_x_continuous(limits = c(1960, 2013)) +
theme_bw() +
ylab("Marriage Rate") +
xlab("Year") +
ggtitle("Declining Marriage Rates by Demographics \nin People Ages 25-34") +
ylab("Marriage Rate") +
theme(plot.title = element_text(lineheight = .8))unique(age2534_tidy$Category)## [1] "All" "High School" "Some College"
## [4] "Bachelor Degree Plus" "Bachelor Degree" "Graduate Degree"
## [7] "White" "Black" "Hispanic"
## [10] "New England" "Mid Atlantic" "Mid West"
## [13] "South" "Mountain" "Pacific"
## [16] "Low Income" "Middle Class" "Rich"
people_by_income <- age2534_tidy %>%
filter(Category %in% c("Low Income",
"Middle Class",
"Rich"))
ggplot(people_by_income, aes(x = as.numeric(Year) , y = as.numeric(Married), group = Category, colour = Category)) +
geom_line() +
geom_point() +
scale_y_continuous() +
scale_x_continuous(limits = c(1960, 2013)) +
theme_bw() +
ylab("Marriage Rate") +
xlab("Year") +
ggtitle("Declining Marriage Rates by Income \nin People Ages 25-34") +
ylab("Marriage Rate") +
theme(plot.title = element_text(lineheight = .8))
1. We saw in 4a, how the marriage rates have falled nearly every year since 1960. But the surprise was the more steeper decline from year 2000 onwards
2. We saw in 4b, that my hypothesis that higher educated people might have lower marriage rates was **wrong**. What surprised me was that how in 1960 people with just High School had the highest married rate, but since then this exact group has had the greatest decline in the marraige rates. People with graduate degree seemed to have the lowest level of decline.
3. We saw in 4c, that my hypothsis for marriage rates amoung people of color was correct. What disturbed me was that marriage rates among African American were mere 30%!!
4. We saw in 4d, that my hypothsis for marriage rates amoung people of various income levels was partially correct. Middle class seems to have done the best to preserve the sanctity of a marriage. Again, it was disturbing to see how the steep decline in the Low Income group.
Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.