Assignment Summary

For Project 2, choose any three of the “wide” datasets identified in the Week 6 Discussion items. This document contains the first of the three files selected for this assignment.

File selected: State Marriage Rates, submitted by Gabriel Abreu (link to state marriage rates).

Analysis to perform: This file provides state marriage rates broken down by region and year. Group the data by census region or census division, then organize the rates by year, converting the data from wide to long format. Provide narrative descriptions of the data cleanup work and analysis completed, along with conclusions about the data.

Setup and load data

library(dplyr)
library(tidyr)
library(data.table)
library(ggplot2)
data <-  read.csv('data/state_marriage_rates_90_95_99_16.csv')
head(data,4)
##      state    X2016     X2015     X2014    X2013 X2012 X2011 X2010 X2009 X2008
## 1  Alabama 7.147821  7.351544  7.806776 7.817785   8.2   8.4   8.2   8.3   8.6
## 2   Alaska 7.103441  7.407588  7.508836 7.293928   7.2   7.8   8.0   7.8   8.4
## 3  Arizona 5.930541  5.922469  5.780449 5.401091   5.6   5.7   5.9   5.6   6.0
## 4 Arkansas 9.860962 10.040279 10.112026 9.751052  10.9  10.4  10.8  10.7  10.6
##   X2007 X2006 X2005 X2004 X2003 X2002 X2001 X2000 X1999 X1995 X1990
## 1   8.9   9.2   9.2   9.4   9.6   9.9   9.4  10.1  10.8   9.8  10.6
## 2   8.5   8.2   8.2   8.5   8.1   8.3   8.1   8.9   8.6   9.0  10.2
## 3   6.4   6.5   6.6   6.7   6.5   6.7   7.6   7.5   8.2   8.8  10.0
## 4  12.0  12.4  12.9  13.4  13.4  14.3  14.3  15.4  14.8  14.4  15.3
regions <- read.csv('data/Census_Regions_and_Divisions.csv')
regions <- select(regions, state = State, region = Region)
head(regions, 4)
##      state region
## 1  Alabama  South
## 2   Alaska   West
## 3  Arizona   West
## 4 Arkansas  South

Tidy up the data

#Join the census region to the marriage rates data
data2 <- merge(data, regions, by.x = "state", by.y = "state")
tail(data2,2)
##        state    X2016    X2015    X2014    X2013 X2012 X2011 X2010 X2009 X2008
## 50 Wisconsin 5.616134 5.611351 5.691296 5.219136   5.4   5.3   5.3   5.3   5.6
## 51   Wyoming 7.079407 7.341663 7.658952 7.549883   7.6   7.8   7.6   8.0   8.6
##    X2007 X2006 X2005 X2004 X2003 X2002 X2001 X2000 X1999 X1995 X1990  region
## 50   5.7   6.0   6.1   6.2   6.2   6.3   6.5   6.7   6.7   7.0   7.9 Midwest
## 51   9.0   9.3   9.3   9.3   9.3   9.5  10.0  10.0   9.9  10.6  10.7    West
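
By default merge() keeps only rows whose key matches in both tables, so a state spelled differently in the two files would silently drop out of data2. A minimal sketch of a check, written in base R so it runs without extra packages. The `rates_demo` and `regions_demo` tables and the misspelled entry are hypothetical and are not taken from the real files:

```r
# Hypothetical miniature versions of the two tables (not the real files)
rates_demo <- data.frame(
  state = c("Alabama", "Alaska", "Arizona"),
  X2016 = c(7.147821, 7.103441, 5.930541),
  stringsAsFactors = FALSE
)
regions_demo <- data.frame(
  state  = c("Alabama", "Alaska", "Arizonna"),  # deliberate misspelling
  region = c("South", "West", "West"),
  stringsAsFactors = FALSE
)

# States present in the rates table but missing from the region lookup
unmatched <- setdiff(rates_demo$state, regions_demo$state)
print(unmatched)

# The default inner join drops the unmatched state
joined <- merge(rates_demo, regions_demo, by = "state")
print(nrow(joined))
```

Running the same setdiff() on the real `data` and `regions` tables before the join would show whether any states are lost.
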
#Convert the year columns to a single year column with the associated value in a second column
#Name the arguments explicitly; recent tidyr versions do not accept names_to by position
data3 <- pivot_longer(data2, cols = starts_with('X'), names_to = 'year', values_to = 'value')
head(data3)
## # A tibble: 6 x 4
##   state   region year  value
##   <fct>   <fct>  <chr> <dbl>
## 1 Alabama South  X2016  7.15
## 2 Alabama South  X2015  7.35
## 3 Alabama South  X2014  7.81
## 4 Alabama South  X2013  7.82
## 5 Alabama South  X2012  8.2 
## 6 Alabama South  X2011  8.4
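
The year column above still holds strings such as X2016. When plotted, these are treated as evenly spaced categories, even though the gaps between 1990, 1995, and 1999 are larger than one year. A minimal sketch of converting the labels to integers, using Alabama values from the output above (rounded as printed). `long_demo` is a hypothetical name:

```r
# A few Alabama rows from the output above
long_demo <- data.frame(
  state = "Alabama",
  year  = c("X2016", "X2015", "X1990"),
  value = c(7.15, 7.35, 10.6),
  stringsAsFactors = FALSE
)

# Strip the leading "X" that read.csv adds to numeric column names
long_demo$year <- as.integer(sub("^X", "", long_demo$year))

str(long_demo$year)
```

With an integer year, the x-axis of a line chart becomes continuous and the spacing between observations reflects the actual number of years.
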

Analysis

#How many regions are there and how many states are in each region?
data4 <- data3 %>% group_by(region, state) %>% summarize(count=n())
data4 %>% group_by(region) %>% summarize(count=n())
## # A tibble: 4 x 2
##   region    count
##   <fct>     <int>
## 1 Midwest      12
## 2 Northeast     9
## 3 South        17
## 4 West         13
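#Keep the per-region state counts; data4 has three columns at this point, so renaming it with two names would not work
data4 <- data4 %>% group_by(region) %>% summarize(state_count = n())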
library(sqldf)

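#Total the state rates by region and year.
#Note: this is a sum of state rates, so regions with more states have larger totals.
data5 <- sqldf('SELECT region, year, SUM(value) AS marriage_rate FROM data3 GROUP BY region, year')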
head(data5,2)
##    region  year marriage_rate
## 1 Midwest X1990         105.6
## 2 Midwest X1995          93.6
#Since there are only four regions, chart each region as a series of marriage rates per year
p<-ggplot(data5, aes(x=year, y=marriage_rate, group=region)) +
  geom_line(aes(color=region))+
  geom_point(aes(color=region))
p

Based on the chart above, marriage rates in the Northeast and Midwest declined after 2002 and have since leveled off. Of greater interest are the steep declines in the West and South and, in recent years, the convergence of the rates in those two regions.
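
The series charted above is the sum of the state rates in each region, so a region with more states (the South has 17) starts from a larger total regardless of how often its residents marry. One possible alternative is the unweighted mean of the state rates. A minimal sketch using the 2016 values of the first four states from the head() output. `demo`, `region_sum`, and `region_mean` are hypothetical names, and a true regional rate would also need state populations, which are not among the columns shown:

```r
# 2016 rates for four states, taken from the head() output above
demo <- data.frame(
  state  = c("Alabama", "Arkansas", "Alaska", "Arizona"),
  region = c("South", "South", "West", "West"),
  year   = "X2016",
  value  = c(7.147821, 9.860962, 7.103441, 5.930541),
  stringsAsFactors = FALSE
)

# Sum of state rates per region and year (what the SQL query computes)
region_sum <- aggregate(value ~ region + year, data = demo, FUN = sum)

# Unweighted mean of state rates per region and year
region_mean <- aggregate(value ~ region + year, data = demo, FUN = mean)

print(region_sum)
print(region_mean)
```

In the SQL query, the equivalent change would be to use AVG(value) in place of SUM(value).
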

#Create a series that shows each state's marriage rate per year in the South
south <- subset(data3, region == 'South')

p<-ggplot(south, aes(x=year, y=value, group=state)) +
  geom_line(aes(color=state))+
  geom_point(aes(color=state))
p

It looks like South Carolina, Tennessee, and Arkansas have the steepest declines, but all of the states in the region show a consistent decline throughout the 1990s.

#Now create a similar series for the West that shows each state's marriage rate per year
west <- subset(data3, region == 'West')

p<-ggplot(west, aes(x=year, y=value, group=state)) +
  geom_line(aes(color=state))+
  geom_point(aes(color=state))
p

Nevada shows an extremely large decline in marriage rates, large enough that the data should be reviewed for collection problems or reporting errors. Hawaii shows a slight increase in the 1990s, but its rate had returned to its former level by 2016.

#For our final analysis, we'll perform a t-test on the 1990 data versus the 2016 data.
#We are assuming the data are independent and collected in a probabilistic manner.

#H0: mean 1990 marriage rate = mean 2016 marriage rate
#H1: mean 1990 marriage rate != mean 2016 marriage rate

tdata <- select(data2, X2016, X1990)

#Equivalent call with the defaults written out:
#t.test(tdata$X1990, tdata$X2016, alternative = 'two.sided', mu = 0, paired = FALSE, var.equal = FALSE, conf.level = 0.95)
t.test(tdata$X1990, tdata$X2016)
## 
##  Welch Two Sample t-test
## 
## data:  tdata$X1990 and tdata$X2016
## t = 2.271, df = 57.043, p-value = 0.02694
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.4932151 7.8479970
## sample estimates:
## mean of x mean of y 
## 11.588235  7.417629
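
The Welch test above treats the 1990 and 2016 values as two independent samples. Each state appears in both years, however, so a paired test is arguably a closer match to the design. A minimal sketch on the four states shown in the head() output, which is far too few for real inference and is only meant to show the call. `x1990`, `x2016`, and `paired_result` are hypothetical names:

```r
# 1990 and 2016 rates for Alabama, Alaska, Arizona, Arkansas (from the head() output)
x1990 <- c(10.6, 10.2, 10.0, 15.3)
x2016 <- c(7.147821, 7.103441, 5.930541, 9.860962)

# Paired test: works on the per-state differences x1990 - x2016
paired_result <- t.test(x1990, x2016, paired = TRUE)

print(paired_result$estimate)   # mean of the per-state differences
print(paired_result$parameter)  # degrees of freedom = number of pairs - 1
```

On the full table the call would be t.test(tdata$X1990, tdata$X2016, paired = TRUE), assuming both columns are in the same state order, which holds here because they come from the same rows of data2.
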

Conclusions and Recommendations

Marriage rates declined across the United States by a statistically significant amount between 1990 and 2016. A Welch two-sample t-test comparing the state rates in those two years gave a p-value of 0.027, below the 0.05 threshold that corresponds to a 95% confidence level. The mean state marriage rate was 11.6 in 1990 and 7.4 in 2016.

Across the four census regions (the dataset contains no Puerto Rico data), marriage rates in the Northeast, Midwest, South, and West all declined most steeply in the 1990s, and since 2009 the rates in each region have remained relatively stable.

The sharpest declines occurred in the West and the South. Within these regions, the states that contributed most to the decline were South Carolina, Tennessee, and Arkansas (South) and Nevada (West).

Recommendations for further analysis and study:

  1. Validate the accuracy of Nevada’s data. Its decline in marriage rates is far sharper than that of the other states.

  2. Investigate data that might help to explain this story further, such as divorce rates.

  3. Assess whether birth rates are declining in proportion to marriage rates.

  4. Conduct further analysis at the state level. If there are states where marriage rates increased, compare them with several states where rates decreased.
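
The last recommendation can be started by computing each state's change between the first and last years. A minimal sketch on the four states from the head() output. `wide_demo`, `increased`, and `by_decline` are hypothetical names, and the same expressions should apply to the full data2 table:

```r
# 2016 and 1990 rates for four states, taken from the head() output
wide_demo <- data.frame(
  state = c("Alabama", "Alaska", "Arizona", "Arkansas"),
  X2016 = c(7.147821, 7.103441, 5.930541, 9.860962),
  X1990 = c(10.6, 10.2, 10.0, 15.3),
  stringsAsFactors = FALSE
)

# Positive values mean the rate rose between 1990 and 2016
wide_demo$change <- wide_demo$X2016 - wide_demo$X1990

# States whose rate increased (none among these four)
increased <- wide_demo[wide_demo$change > 0, ]

# States ordered from steepest decline to smallest
by_decline <- wide_demo[order(wide_demo$change), ]
print(by_decline[, c("state", "change")])
```

Sorting by the change column also gives a quick way to see how far a state such as Nevada sits from the rest before validating its data.
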