Tidy Data

First we will be looking at our marriage dataset by state.

df <- read.csv('marriage.csv',skip = 5, header = TRUE, as.is = TRUE)
head(df, 10)

##                       X X2016 X2015 X2014 X2013 X2012 X2011 X2010 X2009 X2008
## 1                          NA    NA                NA    NA    NA    NA    NA
## 2               Alabama   7.1   7.4   7.8   7.8   8.2   8.4   8.2   8.3   8.6
## 3                Alaska   7.1   7.4   7.5   7.3   7.2   7.8   8.0   7.8   8.4
## 4               Arizona   5.9   5.9   5.8   5.4   5.6   5.7   5.9   5.6   6.0
## 5              Arkansas   9.9  10.0  10.1   9.8  10.9  10.4  10.8  10.7  10.6
## 6          California 1   6.5   6.2   6.4   6.5   6.0   5.8   5.8   5.8   6.7
## 7              Colorado   7.4   6.8   7.1   6.5   6.8   7.0   6.9   6.9   7.4
## 8           Connecticut   5.6   5.3   5.4   5.0   5.2   5.5   5.6   5.9   5.4
## 9              Delaware   5.6   5.7   6.0   6.6   5.8   5.2   5.2   5.4   5.5
## 10 District of Columbia   8.1   8.2  11.8  10.8   8.4   8.7   7.6   4.7   4.1
##    X2007 X2006 X2005 X2004 X2003 X2002 X2001 X2000 X1999 X1995 X1990
## 1     NA          NA    NA                            NA    NA    NA
## 2    8.9   9.2   9.2   9.4   9.6   9.9   9.4  10.1  10.8   9.8  10.6
## 3    8.5   8.2   8.2   8.5   8.1   8.3   8.1   8.9   8.6   9.0  10.2
## 4    6.4   6.5   6.6   6.7   6.5   6.7   7.6   7.5   8.2   8.8  10.0
## 5   12.0  12.4  12.9  13.4  13.4  14.3  14.3  15.4  14.8  14.4  15.3
## 6    6.2   6.3   6.4   6.4   6.1   6.2   6.5   5.8   6.4   6.3   7.9
## 7    7.1   7.2   7.6   7.4   7.8   8.0   8.2   8.3   8.2   9.0   9.8
## 8    5.5   5.5   5.8   5.8   5.5   5.7   5.4   5.7   5.8   6.6   7.9
## 9    5.7   5.9   5.9   6.1   6.0   6.4   6.5   6.5   6.7   7.3   8.4
## 10   4.2   4.0   4.1   5.2   5.1   5.1   6.2   4.9   6.6   6.1   8.2

Couple clear things that need to be done. The first row must be dropped. The columns are not all the same datatype, and one variable is split amongst many different columns. First I will clean the columns and rename our first column.

df <- df[c(-1, -53:-60),]
df[-1] <- sapply(df[,-1], as.double)
names(df)[1] <- 'State'
head(df,10)

##                   State X2016 X2015 X2014 X2013 X2012 X2011 X2010 X2009 X2008
## 2               Alabama   7.1   7.4   7.8   7.8   8.2   8.4   8.2   8.3   8.6
## 3                Alaska   7.1   7.4   7.5   7.3   7.2   7.8   8.0   7.8   8.4
## 4               Arizona   5.9   5.9   5.8   5.4   5.6   5.7   5.9   5.6   6.0
## 5              Arkansas   9.9  10.0  10.1   9.8  10.9  10.4  10.8  10.7  10.6
## 6          California 1   6.5   6.2   6.4   6.5   6.0   5.8   5.8   5.8   6.7
## 7              Colorado   7.4   6.8   7.1   6.5   6.8   7.0   6.9   6.9   7.4
## 8           Connecticut   5.6   5.3   5.4   5.0   5.2   5.5   5.6   5.9   5.4
## 9              Delaware   5.6   5.7   6.0   6.6   5.8   5.2   5.2   5.4   5.5
## 10 District of Columbia   8.1   8.2  11.8  10.8   8.4   8.7   7.6   4.7   4.1
## 11              Florida   8.1   8.2   7.3   7.0   7.2   7.4   7.3   7.5   8.0
##    X2007 X2006 X2005 X2004 X2003 X2002 X2001 X2000 X1999 X1995 X1990
## 2    8.9   9.2   9.2   9.4   9.6   9.9   9.4  10.1  10.8   9.8  10.6
## 3    8.5   8.2   8.2   8.5   8.1   8.3   8.1   8.9   8.6   9.0  10.2
## 4    6.4   6.5   6.6   6.7   6.5   6.7   7.6   7.5   8.2   8.8  10.0
## 5   12.0  12.4  12.9  13.4  13.4  14.3  14.3  15.4  14.8  14.4  15.3
## 6    6.2   6.3   6.4   6.4   6.1   6.2   6.5   5.8   6.4   6.3   7.9
## 7    7.1   7.2   7.6   7.4   7.8   8.0   8.2   8.3   8.2   9.0   9.8
## 8    5.5   5.5   5.8   5.8   5.5   5.7   5.4   5.7   5.8   6.6   7.9
## 9    5.7   5.9   5.9   6.1   6.0   6.4   6.5   6.5   6.7   7.3   8.4
## 10   4.2   4.0   4.1   5.2   5.1   5.1   6.2   4.9   6.6   6.1   8.2
## 11   8.5   8.6   8.9   9.0   9.0   9.4   9.3   8.9   8.7   9.9  10.9

Now we can get the year value into one column and rename the values to remove the X’s.

library(tidyverse)
library(stringr)
df <- pivot_longer(df, 2:ncol(df), names_to = 'Year', values_to = 'Marriage_Rate')
df$Year <- str_remove_all(df$Year, 'X')
head(df, 10)

## # A tibble: 10 x 3
##    State   Year  Marriage_Rate
##    <chr>   <chr>         <dbl>
##  1 Alabama 2016            7.1
##  2 Alabama 2015            7.4
##  3 Alabama 2014            7.8
##  4 Alabama 2013            7.8
##  5 Alabama 2012            8.2
##  6 Alabama 2011            8.4
##  7 Alabama 2010            8.2
##  8 Alabama 2009            8.3
##  9 Alabama 2008            8.6
## 10 Alabama 2007            8.9

Much cleaner! Now that the data is properly tidied, we can begin to analyze the data. The first thing I want to see is the overall trend across all states.

yearmeans <- df %>% group_by(Year) %>% summarize(yearmean = mean(Marriage_Rate, na.rm = TRUE))
yearmeans %>% ggplot(aes(Year, yearmean, group = 1)) + geom_line(size = 1) + geom_point() + ylab('Average Marriage Rate') + ggtitle('Average Marriage Rate by Year') + theme(axis.text.x = element_text(angle = 90))

Although this is not a true marriage rate for the entire country as it is simply a mean of the means (correctly calculating the overall marriage rate would require us to weight each mean with population data for each state), this does tell a very clear story: overall, marriage rates in the US have been going down.

Let us examine the states with the overall highest and lowest marriage rates.

statemeans <- df %>% group_by(State) %>% summarize(statemean = mean(Marriage_Rate, na.rm = TRUE))
maxstate <- statemeans[statemeans$statemean == max(statemeans$statemean), 'State']
minstate <- statemeans[statemeans$statemean == min(statemeans$statemean), 'State']
maxmin <- df %>% filter(State %in% c(maxstate, minstate))
maxmin %>% ggplot(aes(Year, Marriage_Rate, color = State, group = State)) + geom_line() +
           theme(axis.text.x = element_text(angle = 90)) + ylab('Marriage Rate') + ggtitle('Max and Min States by Average Marriage Rate')

Interesting! This makes me think that perhaps the drop in marriage rates is driven by only a few states, as Nevada has seen a massive drop and New Jersey has seen almost no movement. I will plot the states with the 10 largest ranges to see if there are other states like Nevada.

top10 <- df %>% group_by(State) %>% summarize(staterange = max(Marriage_Rate, na.rm = TRUE) - 
                                              min(Marriage_Rate, na.rm = TRUE)) %>%
                                              arrange(desc(staterange)) %>% .[1:10,]
df %>% filter(State %in% top10$State) %>% ggplot(aes(Year, Marriage_Rate, group = 1)) + 
                                          facet_wrap( ~ State, ncol = 5) + geom_line() +
                                          theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())+
                                          labs(title = 'Marriage Rate by State', subtitle = 'Top 10 States by Range of Marriage Rate')

It appears Nevada is quite an extreme outlier for marriage rates. I will remake the first plot of marriage rates but exclude Nevada this time, as it is having an untoward effect on the rate while having a relatively low population.

yearmeans2 <- df %>% filter(State != 'Nevada') %>% group_by(Year) %>% 
              summarize(yearmean = mean(Marriage_Rate, na.rm = TRUE))
ggplot() + geom_line(data = yearmeans, aes(Year, yearmean, group = 1), color = 'red') + 
           geom_line(data = yearmeans2, aes(Year, yearmean, group = 1), color = 'blue') +
           ylab('Average Marriage Rate') + ggtitle('Marriage Rates with and without Nevada')

We see that Nevada was indeed having a strong effect on the overall data, pulling it upward approximately 2 percentage points, but the trend remains intact even when accounting for this outlier.

Project 2

Tidy Data