Marriage Rates

I selected the next data set as it was part of fivethirtyeight data set and was very interesting as it involved data on marriage rates. This wide data is across many cross sections like educational background, Race, Geographic Region, Employment backgrounds etc.

Readme: https://github.com/fivethirtyeight/data/tree/master/marriage

I thought it would be fun to play with this data to understand and analyse how the marriage rates vary across some of these cross sections for people of ages 25-34.


My Hypothesis

My Hypothesis (before starting the project):
1. Marriage rates have fallen with time. People are more independent now compared to in past
2. Marriage rates might be lower for people with higher educational backgrounds as I think they spend more time in attaining education and are more independent
3. Marriage rates might be lower in people of color
4. Marriage rates might be higher in people who are high income vs people who are low income. Let’s see…

INDEX (Step by Step)

STEP 1. Load Libraries
STEP 2. Load the file
STEP 3. Use Dplyr to convert the data in long format
STEP 4a. Analysis 1: How have marriage rates done with time
STEP 4b. Analysis 2: How have marriage rates done based on education background by time
STEP 4c. Analysis 3: How have marriage rates done for people of color by time
STEP 4d. Analysis 4: How have marriage rates done for people who are high income vs. low income
STEP 5. Conclusion

STEP 1 : Load your libraries

# Load the libraries
library(tidyverse)  #For Tidyverse
## -- Attaching packages ---------------------------------------- tidyverse 1.2.1 --
## v ggplot2 2.2.1     v purrr   0.2.4
## v tibble  1.4.1     v dplyr   0.7.4
## v tidyr   0.8.0     v stringr 1.2.0
## v readr   1.1.1     v forcats 0.3.0
## -- Conflicts ------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(RCurl)      #For File Operations
## Loading required package: bitops
## 
## Attaching package: 'RCurl'
## The following object is masked from 'package:tidyr':
## 
##     complete
library(dplyr)      #For Manipulating the data frames
library(DT)         #For Data table package
library(ggplot2)    #For Visualizations

STEP 2 : Load the File

# Good Practise: Basic house keeping: cleanup the env before you start new work
rm(list=ls())

# Garbage collector to free the memory
gc()
##           used (Mb) gc trigger (Mb) max used (Mb)
## Ncells  786326 42.0    1442291 77.1  1168576 62.5
## Vcells 1152423  8.8    2060183 15.8  1431918 11.0
# Good Practise: Set up the Working Directory when working with a file system
setwd("C:\\CUNY\\607Data\\Assignments\\project02")

# Read the File directly from Github
#fileURL <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/marriage/both_sexes.csv"
#untidy_data <- read.csv(text = getURL(fileURL), header = TRUE, sep = ",")
untidy_data <- read.csv("both_sexes.csv", header = TRUE, sep = ",")


# check the dimenstions
dim(untidy_data)
## [1] 17 75
# Structure of the data frame
head(untidy_data,1)
##   X year       date  all_2534   HS_2534   SC_2534  BAp_2534  BAo_2534
## 1 1 1960 1960-01-01 0.1233145 0.1095332 0.1522818 0.2389952 0.2389952
##   GD_2534 White_2534 Black_2534 Hisp_2534   NE_2534   MA_2534 Midwest_2534
## 1      NA  0.1164848  0.1621855 0.1393736 0.1504184 0.1628934    0.1121467
##   South_2534 Mountain_2534 Pacific_2534 poor_2534   mid_2534 rich_2534
## 1  0.1090562    0.09152117    0.1198758 0.1371597 0.07514929 0.2066776
##     all_3544    HS_3544    SC_3544  BAp_3544  BAo_3544 GD_3544 White_3544
## 1 0.07058157 0.06860309 0.06663695 0.1326265 0.1326265      NA 0.06825586
##   Black_3544  Hisp_3544    NE_3544    MA_3544 Midwest_3544 South_3544
## 1 0.08836728 0.07307651 0.09194322 0.09347468    0.0686336 0.06026353
##   Mountain_3544 Pacific_3544 poor_3544   mid_3544 rich_3544   all_4554
## 1    0.04739747   0.05822486 0.1019749 0.04717272 0.0855387 0.07254649
##      HS_4554    SC_4554  BAp_4554  BAo_4554 GD_4554 White_4554 Black_4554
## 1 0.06840792 0.07903755 0.1536089 0.1536089      NA 0.07246692 0.06913249
##    Hisp_4554   NE_4554    MA_4554 Midwest_4554 South_4554 Mountain_4554
## 1 0.06636058 0.1023641 0.09264788   0.07285321 0.05977295    0.04754183
##   Pacific_4554 poor_4554   mid_4554  rich_4554 nokids_all_2534
## 1   0.05996993 0.1030055 0.05364421 0.07908591       0.4640564
##   kids_all_2534 nokids_HS_2534 nokids_SC_2534 nokids_BAp_2534
## 1   0.002820625      0.4430148      0.5000402       0.5619099
##   nokids_BAo_2534 nokids_GD_2534 kids_HS_2534 kids_SC_2534 kids_BAp_2534
## 1       0.5619099             NA  0.003318886  0.001150824  0.0005751073
##   kids_BAo_2534 kids_GD_2534 nokids_poor_2534 nokids_mid_2534
## 1  0.0005751073           NA        0.4933061        0.410008
##   nokids_rich_2534 kids_poor_2534 kids_mid_2534 kids_rich_2534
## 1        0.4921184    0.008722711  0.0007532065   0.0008027331
# Names of the variables
names(untidy_data) 
##  [1] "X"                "year"             "date"            
##  [4] "all_2534"         "HS_2534"          "SC_2534"         
##  [7] "BAp_2534"         "BAo_2534"         "GD_2534"         
## [10] "White_2534"       "Black_2534"       "Hisp_2534"       
## [13] "NE_2534"          "MA_2534"          "Midwest_2534"    
## [16] "South_2534"       "Mountain_2534"    "Pacific_2534"    
## [19] "poor_2534"        "mid_2534"         "rich_2534"       
## [22] "all_3544"         "HS_3544"          "SC_3544"         
## [25] "BAp_3544"         "BAo_3544"         "GD_3544"         
## [28] "White_3544"       "Black_3544"       "Hisp_3544"       
## [31] "NE_3544"          "MA_3544"          "Midwest_3544"    
## [34] "South_3544"       "Mountain_3544"    "Pacific_3544"    
## [37] "poor_3544"        "mid_3544"         "rich_3544"       
## [40] "all_4554"         "HS_4554"          "SC_4554"         
## [43] "BAp_4554"         "BAo_4554"         "GD_4554"         
## [46] "White_4554"       "Black_4554"       "Hisp_4554"       
## [49] "NE_4554"          "MA_4554"          "Midwest_4554"    
## [52] "South_4554"       "Mountain_4554"    "Pacific_4554"    
## [55] "poor_4554"        "mid_4554"         "rich_4554"       
## [58] "nokids_all_2534"  "kids_all_2534"    "nokids_HS_2534"  
## [61] "nokids_SC_2534"   "nokids_BAp_2534"  "nokids_BAo_2534" 
## [64] "nokids_GD_2534"   "kids_HS_2534"     "kids_SC_2534"    
## [67] "kids_BAp_2534"    "kids_BAo_2534"    "kids_GD_2534"    
## [70] "nokids_poor_2534" "nokids_mid_2534"  "nokids_rich_2534"
## [73] "kids_poor_2534"   "kids_mid_2534"    "kids_rich_2534"
As we can see that the data has 75 variables and only 17 observations, it is in a wide format. We need to convert this to the long format.

STEP 3. Use Dplyr to convert the data in long format

# Create dataframe with rates
# Create focus group of people between ages 25-34
age2534 <- untidy_data %>% select(1:21)

head(age2534, 1)
##   X year       date  all_2534   HS_2534   SC_2534  BAp_2534  BAo_2534
## 1 1 1960 1960-01-01 0.1233145 0.1095332 0.1522818 0.2389952 0.2389952
##   GD_2534 White_2534 Black_2534 Hisp_2534   NE_2534   MA_2534 Midwest_2534
## 1      NA  0.1164848  0.1621855 0.1393736 0.1504184 0.1628934    0.1121467
##   South_2534 Mountain_2534 Pacific_2534 poor_2534   mid_2534 rich_2534
## 1  0.1090562    0.09152117    0.1198758 0.1371597 0.07514929 0.2066776
# Fix the Column names to more readable names 
names(age2534) <- c("x", "Year", "Date", "All", "High School", "Some College", "Bachelor Degree Plus", "Bachelor Degree", "Graduate Degree", "White", "Black", "Hispanic", "New England", "Mid Atlantic", "Mid West", "South", "Mountain", "Pacific", "Low Income", "Middle Class", "Rich")

##
# Create a long format overall data for age 25-34
##

head(age2534)
##   x Year       Date       All High School Some College
## 1 1 1960 1960-01-01 0.1233145   0.1095332    0.1522818
## 2 2 1970 1970-01-01 0.1269715   0.1094000    0.1495096
## 3 3 1980 1980-01-01 0.1991767   0.1617313    0.2236916
## 4 4 1990 1990-01-01 0.2968306   0.2777491    0.2780912
## 5 5 2000 2000-01-01 0.3450087   0.3316545    0.3249205
## 6 6 2001 2001-01-01 0.3527767   0.3446069    0.3341101
##   Bachelor Degree Plus Bachelor Degree Graduate Degree     White     Black
## 1            0.2389952       0.2389952              NA 0.1164848 0.1621855
## 2            0.2187031       0.2187031              NA 0.1179043 0.1855163
## 3            0.2881646       0.2881646              NA 0.1824126 0.3137500
## 4            0.3612968       0.3656655       0.3474505 0.2639256 0.4838556
## 5            0.3874906       0.3939579       0.3691740 0.3127149 0.5144994
## 6            0.3835686       0.3925148       0.3590304 0.3183506 0.5437985
##    Hispanic New England Mid Atlantic  Mid West     South   Mountain
## 1 0.1393736   0.1504184    0.1628934 0.1121467 0.1090562 0.09152117
## 2 0.1298769   0.1517231    0.1640680 0.1153741 0.1126220 0.10293602
## 3 0.1885440   0.2414327    0.2505925 0.1828339 0.1688435 0.17434230
## 4 0.2962372   0.3500384    0.3623321 0.2755046 0.2639794 0.25264326
## 5 0.3180681   0.4091852    0.4175565 0.3308022 0.3099712 0.30621032
## 6 0.3321214   0.4200581    0.4294281 0.3344332 0.3182688 0.30980779
##     Pacific Low Income Middle Class      Rich
## 1 0.1198758  0.1371597   0.07514929 0.2066776
## 2 0.1374964  0.1717202   0.08159207 0.1724093
## 3 0.2334279  0.3100591   0.14825303 0.1851082
## 4 0.3319579  0.4199108   0.24320008 0.2783226
## 5 0.3753061  0.5033676   0.30202036 0.2717386
## 6 0.3844799  0.5178771   0.31716118 0.2532041
age2534_tidy <- age2534 %>% 
    gather(Category, Single, 4:21) %>% 
    filter(!is.na(Single)) %>% 
    mutate(Year = as.numeric(Year), Married=1-Single)  %>% 
    mutate(Single=format(Single, digits=2, nsmall=2),
           Married=format(Married, digits=2, nsmall=2))

Create Data tables for these tidy dataframes

# Datatable for people based on their education background
datatable(age2534_tidy)

STEP 4a. Analysis 1: How have marriage rates done with time

allpeople <- age2534_tidy %>% filter(Category=="All")

ggplot(allpeople, aes(Year, as.numeric(Married), group=1)) +
  geom_line(color="red") +
  geom_point(color="blue") +
  expand_limits(y=.5) +
  scale_x_continuous(limits = c(1960, 2013)) +
  theme_bw() +
  ggtitle("Declining Marriage Rates in All People of Ages 25-34") +
  ylab("Marriage Rate") +
  xlab("Year") +
  theme(plot.title = element_text(lineheight = .8))

STEP 4b. Analysis 2: How have marriage rates done based on education background by time

people_by_edu <- age2534_tidy %>% filter(Category %in% c("High School", 
                                                            "Some College", 
                                                            "Bachelor Degree Plus", 
                                                            "Bachelor Degree", 
                                                            "Graduate Degree"))


ggplot(people_by_edu, aes(x = as.numeric(Year) , y = as.numeric(Married), group = Category, colour = Category)) +
  geom_line() +
  geom_point() +
  scale_y_continuous() +
  scale_x_continuous(limits = c(1960, 2013)) +
  theme_bw() +
  ylab("Marriage Rate") +
  xlab("Year") +
  ggtitle("Declining Marriage Rates by Educational Background \nin People Ages 25-34") +
  ylab("Marriage Rate") +
  theme(plot.title = element_text(lineheight = .8))

STEP 4c. Analysis 3: How have marriage rates done for people of color by time

people_by_race <- age2534_tidy %>% filter(Category %in% c("Black", 
                                                            "White", 
                                                            "Hispanic"))


ggplot(people_by_race, aes(x = as.numeric(Year) , y = as.numeric(Married), group = Category, colour = Category)) +
  geom_line() +
  geom_point() +
  scale_y_continuous() +
  scale_x_continuous(limits = c(1960, 2013)) +
  theme_bw() +
  ylab("Marriage Rate") +
  xlab("Year") +
  ggtitle("Declining Marriage Rates by Demographics \nin People Ages 25-34") +
  ylab("Marriage Rate") +
  theme(plot.title = element_text(lineheight = .8))

STEP 4d. Analysis 4: How have marriage rates done for people who are high income vs. low income

unique(age2534_tidy$Category)
##  [1] "All"                  "High School"          "Some College"        
##  [4] "Bachelor Degree Plus" "Bachelor Degree"      "Graduate Degree"     
##  [7] "White"                "Black"                "Hispanic"            
## [10] "New England"          "Mid Atlantic"         "Mid West"            
## [13] "South"                "Mountain"             "Pacific"             
## [16] "Low Income"           "Middle Class"         "Rich"
people_by_income <- age2534_tidy %>% 
    filter(Category %in% c("Low Income", 
                                                            "Middle Class", 
                                                            "Rich")) 



ggplot(people_by_income, aes(x = as.numeric(Year) , y = as.numeric(Married), group = Category, colour = Category)) +
  geom_line() +
  geom_point() +
  scale_y_continuous() +
  scale_x_continuous(limits = c(1960, 2013)) +
  theme_bw() +
  ylab("Marriage Rate") +
  xlab("Year") +
  ggtitle("Declining Marriage Rates by Income \nin People Ages 25-34") +
  ylab("Marriage Rate") +
  theme(plot.title = element_text(lineheight = .8))

STEP 5: Conclusion

Conclusion

1. We saw in 4a, how the marriage rates have falled nearly every year since 1960. But the surprise was the more steeper decline from year 2000 onwards


2. We saw in 4b, that my hypothesis that higher educated people might have lower marriage rates was **wrong**. What surprised me was that how in 1960 people with just High School had the highest married rate, but since then this exact group has had the greatest decline in the marraige rates. People with graduate degree seemed to have the lowest level of decline.


3. We saw in 4c, that my hypothsis for marriage rates amoung people of color was correct. What disturbed me was that marriage rates among African American were mere 30%!! 


4. We saw in 4d, that my hypothsis for marriage rates amoung people of various income levels was partially correct. Middle class seems to have done the best to preserve the sanctity of a marriage. Again, it was disturbing to see how the steep decline in the Low Income group.  

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.