Dataset 1 - Marriage and Divorce rate

Libraries:

library(tidyverse)
library(readr)

Read data: the data was uploaded to GitHub and is served through RawGit.

The data has issues with column names, formatting, and types (int, char). An elongated head is shown to display these issues.

data <- read.csv("https://rawgit.com/nschettini/CUNY-MSDS-DATA-607/master/national_marriage_divorce_rates_00-16.csv")

head(data, 30)

##                                  ï..Provisional.number.of.marriages.and.marriage.rate..United.States..2000.2016
## 1                                                                                                              
## 2                                                                                                          Year
## 3                                                                                                          2016
## 4                                                                                                          2015
## 5                                                                                                        2014/1
## 6                                                                                                        2013/1
## 7                                                                                                          2012
## 8                                                                                                          2011
## 9                                                                                                          2010
## 10                                                                                                         2009
## 11                                                                                                         2008
## 12                                                                                                         2007
## 13                                                                                                       2006/2
## 14                                                                                                         2005
## 15                                                                                                         2004
## 16                                                                                                         2003
## 17                                                                                                         2002
## 18                                                                                                         2001
## 19                                                                                                         2000
## 20                                                                                                             
## 21                                                                                 1/Excludes data for Georgia.
## 22                                                                               2/Excludes data for Louisiana.
## 23                                                                                                             
## 24 Note: Rates for 2001-2009 have been revised and are based on intercensal population estimates from the 2000 
## 25                                  and 2010 censuses. Populations for 2010 rates are based on the 2010 census.
## 26                                                           Source: CDC/NCHS National Vital Statistics System.
## 27                                                                                                             
## 28                                                                                                             
## 29                             Provisional number of divorces and annulments and rate: United States, 2000-2016
## 30                                                                                                             
##            X         X.1                             X.2 X.3 X.4 X.5 X.6
## 1                                                         NA  NA  NA  NA
## 2  Marriages  Population Rate per 1,000 total population  NA  NA  NA  NA
## 3  2,245,404 323,127,513                             6.9  NA  NA  NA  NA
## 4  2,221,579 321,418,820                             6.9  NA  NA  NA  NA
## 5  2,140,272 308,759,713                             6.9  NA  NA  NA  NA
## 6  2,081,301 306,136,672                             6.8  NA  NA  NA  NA
## 7  2,131,000 313,914,040                             6.8  NA  NA  NA  NA
## 8  2,118,000 311,591,917                             6.8  NA  NA  NA  NA
## 9  2,096,000 308,745,538                             6.8  NA  NA  NA  NA
## 10 2,080,000 306,771,529                             6.8  NA  NA  NA  NA
## 11 2,157,000 304,093,966                             7.1  NA  NA  NA  NA
## 12 2,197,000 301,231,207                             7.3  NA  NA  NA  NA
## 13 2,193,000 294,077,247                             7.5  NA  NA  NA  NA
## 14 2,249,000 295,516,599                             7.6  NA  NA  NA  NA
## 15 2,279,000 292,805,298                             7.8  NA  NA  NA  NA
## 16 2,245,000 290,107,933                             7.7  NA  NA  NA  NA
## 17 2,290,000 287,625,193                             8.0  NA  NA  NA  NA
## 18 2,326,000 284,968,955                             8.2  NA  NA  NA  NA
## 19 2,315,000 281,421,906                             8.2  NA  NA  NA  NA
## 20                                                        NA  NA  NA  NA
## 21                                                        NA  NA  NA  NA
## 22                                                        NA  NA  NA  NA
## 23                                                        NA  NA  NA  NA
## 24                                                        NA  NA  NA  NA
## 25                                                        NA  NA  NA  NA
## 26                                                        NA  NA  NA  NA
## 27                                                        NA  NA  NA  NA
## 28                                                        NA  NA  NA  NA
## 29                                                        NA  NA  NA  NA
## 30                                                        NA  NA  NA  NA
##    X.7 X.8
## 1   NA  NA
## 2   NA  NA
## 3   NA  NA
## 4   NA  NA
## 5   NA  NA
## 6   NA  NA
## 7   NA  NA
## 8   NA  NA
## 9   NA  NA
## 10  NA  NA
## 11  NA  NA
## 12  NA  NA
## 13  NA  NA
## 14  NA  NA
## 15  NA  NA
## 16  NA  NA
## 17  NA  NA
## 18  NA  NA
## 19  NA  NA
## 20  NA  NA
## 21  NA  NA
## 22  NA  NA
## 23  NA  NA
## 24  NA  NA
## 25  NA  NA
## 26  NA  NA
## 27  NA  NA
## 28  NA  NA
## 29  NA  NA
## 30  NA  NA
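Several of the issues visible above come from how the file is read rather than from the data itself. The following is a minimal sketch, not the approach used in this write-up, of letting read.csv skip the title lines and keep text as text. The inline `raw_text` is a hypothetical excerpt that only mimics the layout of the CDC file, and `marriages_raw` is an illustrative name.

```r
# Hypothetical excerpt mimicking the file layout:
# a title line, an empty row, the real header, then data rows.
raw_text <- paste(
  "Provisional number of marriages and marriage rate: United States 2000-2016,,,",
  ",,,",
  "Year,Marriages,Population,\"Rate per 1,000 total population\"",
  "2016,\"2,245,404\",\"323,127,513\",6.9",
  "2015,\"2,221,579\",\"321,418,820\",6.9",
  "2014/1,\"2,140,272\",\"308,759,713\",6.9",
  sep = "\n"
)

marriages_raw <- read.csv(
  text = raw_text,
  skip = 2,                  # jump over the title line and the empty row
  nrows = 3,                 # stop before footnotes (3 rows in this excerpt)
  stringsAsFactors = FALSE,  # keep text columns as character, not factor
  check.names = FALSE        # keep the header text exactly as written
)

str(marriages_raw)
```

With the real file, the stray `ï..` prefix on the first column name is typically a byte-order mark, and passing fileEncoding = "UTF-8-BOM" to read.csv is one common way to drop it. That has not been checked against this particular file.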

Convert the data into a tibble. Tibbles print more compactly, which makes the data easier to inspect.

as_tibble(data)

## # A tibble: 60 x 10
##    ï..Provisional.n~ X     X.1   X.2   X.3   X.4   X.5   X.6   X.7   X.8  
##    <fct>             <fct> <fct> <fct> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
##  1 ""                ""    ""    ""    NA    NA    NA    NA    NA    NA   
##  2 Year              Marr~ Popu~ Rate~ NA    NA    NA    NA    NA    NA   
##  3 2016              2,24~ 323,~ 6.9   NA    NA    NA    NA    NA    NA   
##  4 2015              2,22~ 321,~ 6.9   NA    NA    NA    NA    NA    NA   
##  5 2014/1            2,14~ 308,~ 6.9   NA    NA    NA    NA    NA    NA   
##  6 2013/1            2,08~ 306,~ 6.8   NA    NA    NA    NA    NA    NA   
##  7 2012              2,13~ 313,~ 6.8   NA    NA    NA    NA    NA    NA   
##  8 2011              2,11~ 311,~ 6.8   NA    NA    NA    NA    NA    NA   
##  9 2010              2,09~ 308,~ 6.8   NA    NA    NA    NA    NA    NA   
## 10 2009              2,08~ 306,~ 6.8   NA    NA    NA    NA    NA    NA   
## # ... with 50 more rows

tbl_df(data)

## # A tibble: 60 x 10
##    ï..Provisional.n~ X     X.1   X.2   X.3   X.4   X.5   X.6   X.7   X.8  
##    <fct>             <fct> <fct> <fct> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
##  1 ""                ""    ""    ""    NA    NA    NA    NA    NA    NA   
##  2 Year              Marr~ Popu~ Rate~ NA    NA    NA    NA    NA    NA   
##  3 2016              2,24~ 323,~ 6.9   NA    NA    NA    NA    NA    NA   
##  4 2015              2,22~ 321,~ 6.9   NA    NA    NA    NA    NA    NA   
##  5 2014/1            2,14~ 308,~ 6.9   NA    NA    NA    NA    NA    NA   
##  6 2013/1            2,08~ 306,~ 6.8   NA    NA    NA    NA    NA    NA   
##  7 2012              2,13~ 313,~ 6.8   NA    NA    NA    NA    NA    NA   
##  8 2011              2,11~ 311,~ 6.8   NA    NA    NA    NA    NA    NA   
##  9 2010              2,09~ 308,~ 6.8   NA    NA    NA    NA    NA    NA   
## 10 2009              2,08~ 306,~ 6.8   NA    NA    NA    NA    NA    NA   
## # ... with 50 more rows

Select the columns that hold data and rename the first column to something understandable.

data <- data %>%
  as_tibble() %>%
  select(ï..Provisional.number.of.marriages.and.marriage.rate..United.States..2000.2016,
         X,
         X.1,
         X.2) %>%
  rename(num_marrage_rate = ï..Provisional.number.of.marriages.and.marriage.rate..United.States..2000.2016)

data

## # A tibble: 60 x 4
##    num_marrage_rate X         X.1         X.2                            
##    <fct>            <fct>     <fct>       <fct>                          
##  1 ""               ""        ""          ""                             
##  2 Year             Marriages Population  Rate per 1,000 total population
##  3 2016             2,245,404 323,127,513 6.9                            
##  4 2015             2,221,579 321,418,820 6.9                            
##  5 2014/1           2,140,272 308,759,713 6.9                            
##  6 2013/1           2,081,301 306,136,672 6.8                            
##  7 2012             2,131,000 313,914,040 6.8                            
##  8 2011             2,118,000 311,591,917 6.8                            
##  9 2010             2,096,000 308,745,538 6.8                            
## 10 2009             2,080,000 306,771,529 6.8                            
## # ... with 50 more rows

Remove the “/” footnote markers (for example “2014/1”) from the year column, using a regular expression to find what needs to be removed.

data1 <- data
data1$num_marrage_rate <- gsub("/\\d", "", data$num_marrage_rate)
data1

## # A tibble: 60 x 4
##    num_marrage_rate X         X.1         X.2                            
##    <chr>            <fct>     <fct>       <fct>                          
##  1 ""               ""        ""          ""                             
##  2 Year             Marriages Population  Rate per 1,000 total population
##  3 2016             2,245,404 323,127,513 6.9                            
##  4 2015             2,221,579 321,418,820 6.9                            
##  5 2014             2,140,272 308,759,713 6.9                            
##  6 2013             2,081,301 306,136,672 6.8                            
##  7 2012             2,131,000 313,914,040 6.8                            
##  8 2011             2,118,000 311,591,917 6.8                            
##  9 2010             2,096,000 308,745,538 6.8                            
## 10 2009             2,080,000 306,771,529 6.8                            
## # ... with 50 more rows
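As a small standalone check of the pattern, the sketch below applies a slightly stricter variant to a handful of year labels taken from the output above. It anchors the match to the end of the string and allows more than one footnote digit; `years_raw` and `years_clean` are illustrative names.

```r
# Year labels as they appear in the raw file; "/1" and "/2" are footnote markers.
years_raw <- c("2016", "2015", "2014/1", "2013/1", "2006/2")

# "/\\d+$" matches a slash followed by one or more digits at the end of the string.
years_clean <- sub("/\\d+$", "", years_raw)
years_clean

# After cleaning, every label should be a plain four-digit year.
stopifnot(all(grepl("^\\d{4}$", years_clean)))
```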

Remove the first two rows: 1. a blank row, and 2. a row that holds the column names. We’ll rename the columns next.

data1 <- data1[-c(1:2), ]

Rename the columns to something understandable: X becomes marriages, and so on.

data1 <- rename(data1, marriages = X, population = X.1, rate_per_1000 = X.2)
data1 <- rename(data1, year = num_marrage_rate)
data1

## # A tibble: 58 x 4
##    year  marriages population  rate_per_1000
##    <chr> <fct>     <fct>       <fct>        
##  1 2016  2,245,404 323,127,513 6.9          
##  2 2015  2,221,579 321,418,820 6.9          
##  3 2014  2,140,272 308,759,713 6.9          
##  4 2013  2,081,301 306,136,672 6.8          
##  5 2012  2,131,000 313,914,040 6.8          
##  6 2011  2,118,000 311,591,917 6.8          
##  7 2010  2,096,000 308,745,538 6.8          
##  8 2009  2,080,000 306,771,529 6.8          
##  9 2008  2,157,000 304,093,966 7.1          
## 10 2007  2,197,000 301,231,207 7.3          
## # ... with 48 more rows

Remove the commas. The commas interfered with calculations on the data; it turns out the values are stored as text (factors), not as numbers.

data1$marriages <- gsub(",", "", data1$marriages)
data1$population <- gsub(",", "", data1$population)
data1

## # A tibble: 58 x 4
##    year  marriages population rate_per_1000
##    <chr> <chr>     <chr>      <fct>        
##  1 2016  2245404   323127513  6.9          
##  2 2015  2221579   321418820  6.9          
##  3 2014  2140272   308759713  6.9          
##  4 2013  2081301   306136672  6.8          
##  5 2012  2131000   313914040  6.8          
##  6 2011  2118000   311591917  6.8          
##  7 2010  2096000   308745538  6.8          
##  8 2009  2080000   306771529  6.8          
##  9 2008  2157000   304093966  7.1          
## 10 2007  2197000   301231207  7.3          
## # ... with 48 more rows

Convert the columns from character to integer using as.integer.

data1$marriages <- as.integer(data1$marriages)
data1$population <- as.integer(data1$population)
head(data1)

## # A tibble: 6 x 4
##   year  marriages population rate_per_1000
##   <chr>     <int>      <int> <fct>        
## 1 2016    2245404  323127513 6.9          
## 2 2015    2221579  321418820 6.9          
## 3 2014    2140272  308759713 6.9          
## 4 2013    2081301  306136672 6.8          
## 5 2012    2131000  313914040 6.8          
## 6 2011    2118000  311591917 6.8
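The order of the two steps matters because the columns start out as factors. The sketch below, using three values copied from the marriage rows, shows the pitfall of calling as.integer directly on a factor and one way around it. `to_int` is a hypothetical helper, not part of any package.

```r
# Illustrative values copied from the first three marriage rows.
marriages_fct <- factor(c("2,245,404", "2,221,579", "2,140,272"))

# Pitfall: on a factor, as.integer() returns the internal level codes,
# not the numbers printed on screen.
level_codes <- as.integer(marriages_fct)

# Hypothetical helper: factor -> character -> strip commas -> integer.
to_int <- function(x) {
  as.integer(gsub(",", "", as.character(x), fixed = TRUE))
}
marriages_int <- to_int(marriages_fct)

# The same idea for a decimal column such as rate_per_1000,
# which is still a factor in the tables shown here.
rate_fct <- factor(c("6.9", "6.9", "6.8"))
rate_num <- as.numeric(as.character(rate_fct))
```

Going through as.character first is what keeps the printed values; the gsub call in the previous step does this implicitly, because gsub coerces its input to character.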

Create a dataframe for just marriages. It’s easier to manipulate the marriage and divorce data when they’re not in the same columns.

df_marrages <- data1[1:17,]
df_marrages

## # A tibble: 17 x 4
##    year  marriages population rate_per_1000
##    <chr>     <int>      <int> <fct>        
##  1 2016    2245404  323127513 6.9          
##  2 2015    2221579  321418820 6.9          
##  3 2014    2140272  308759713 6.9          
##  4 2013    2081301  306136672 6.8          
##  5 2012    2131000  313914040 6.8          
##  6 2011    2118000  311591917 6.8          
##  7 2010    2096000  308745538 6.8          
##  8 2009    2080000  306771529 6.8          
##  9 2008    2157000  304093966 7.1          
## 10 2007    2197000  301231207 7.3          
## 11 2006    2193000  294077247 7.5          
## 12 2005    2249000  295516599 7.6          
## 13 2004    2279000  292805298 7.8          
## 14 2003    2245000  290107933 7.7          
## 15 2002    2290000  287625193 8.0          
## 16 2001    2326000  284968955 8.2          
## 17 2000    2315000  281421906 8.2

Create a dataframe for divorce.

df_divorce <- data1[30:46,]
df_divorce

## # A tibble: 17 x 4
##    year  marriages population rate_per_1000
##    <chr>     <int>      <int> <fct>        
##  1 2016     827261  257904548 3.2          
##  2 2015     800909  258518265 3.1          
##  3 2014     813862  256483624 3.2          
##  4 2013     832157  254408815 3.3          
##  5 2012     851000  248041986 3.4          
##  6 2011     877000  246273366 3.6          
##  7 2010     872000  244122529 3.6          
##  8 2009     840000  242610561 3.5          
##  9 2008     844000  240545163 3.5          
## 10 2007     856000  238352850 3.6          
## 11 2006     872000  236094277 3.7          
## 12 2005     847000  233495163 3.6          
## 13 2004     879000  236402656 3.7          
## 14 2003     927000  243902090 3.8          
## 15 2002     955000  243108303 3.9          
## 16 2001     940000  236416762 4.0          
## 17 2000     944000  233550143 4.0

Rename marriage column to divorce.

df_divorce <- rename(df_divorce, divorce = marriages)
df_divorce

## # A tibble: 17 x 4
##    year  divorce population rate_per_1000
##    <chr>   <int>      <int> <fct>        
##  1 2016   827261  257904548 3.2          
##  2 2015   800909  258518265 3.1          
##  3 2014   813862  256483624 3.2          
##  4 2013   832157  254408815 3.3          
##  5 2012   851000  248041986 3.4          
##  6 2011   877000  246273366 3.6          
##  7 2010   872000  244122529 3.6          
##  8 2009   840000  242610561 3.5          
##  9 2008   844000  240545163 3.5          
## 10 2007   856000  238352850 3.6          
## 11 2006   872000  236094277 3.7          
## 12 2005   847000  233495163 3.6          
## 13 2004   879000  236402656 3.7          
## 14 2003   927000  243902090 3.8          
## 15 2002   955000  243108303 3.9          
## 16 2001   940000  236416762 4.0          
## 17 2000   944000  233550143 4.0
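The row ranges 1:17 and 30:46 were read off the printed output, so they would break if the file gained or lost a line. As a hedged alternative, the sketch below locates the two blocks by looking for runs of four-digit years. `first_col` is a hypothetical, shortened stand-in for the first column after the footnote markers have been removed.

```r
# Hypothetical, shortened version of the first column of the raw file.
first_col <- c("", "Year", "2016", "2015",
               "", "Source: CDC/NCHS National Vital Statistics System.",
               "Provisional number of divorces and annulments and rate", "",
               "Year", "2016", "2015")

# Rows that hold data are the ones whose first column is a plain year.
is_year <- grepl("^\\d{4}$", first_col)

# A new block starts wherever a year row follows a non-year row.
starts <- is_year & !c(FALSE, head(is_year, -1))
block_id <- cumsum(starts)

# Row indices of each block: the first is marriages, the second is divorces.
blocks <- split(which(is_year), block_id[is_year])
blocks
```

The indices in `blocks[[1]]` and `blocks[[2]]` could then replace the hard-coded ranges when subsetting.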

Crude Divorce Rate - The number of divorces per 1,000 people in the population.

This does not take into account the number of people who can’t marry (kids, etc.), so it isn’t that accurate. We’ll see a better way below, looking at the divorce-to-marriage ratio. The same per-1,000 calculation is applied to marriages first.

df_marrages <- df_marrages %>%
  mutate(marrages_rate = marriages/population *1000)

df_marrages

## # A tibble: 17 x 5
##    year  marriages population rate_per_1000 marrages_rate
##    <chr>     <int>      <int> <fct>                 <dbl>
##  1 2016    2245404  323127513 6.9                    6.95
##  2 2015    2221579  321418820 6.9                    6.91
##  3 2014    2140272  308759713 6.9                    6.93
##  4 2013    2081301  306136672 6.8                    6.80
##  5 2012    2131000  313914040 6.8                    6.79
##  6 2011    2118000  311591917 6.8                    6.80
##  7 2010    2096000  308745538 6.8                    6.79
##  8 2009    2080000  306771529 6.8                    6.78
##  9 2008    2157000  304093966 7.1                    7.09
## 10 2007    2197000  301231207 7.3                    7.29
## 11 2006    2193000  294077247 7.5                    7.46
## 12 2005    2249000  295516599 7.6                    7.61
## 13 2004    2279000  292805298 7.8                    7.78
## 14 2003    2245000  290107933 7.7                    7.74
## 15 2002    2290000  287625193 8.0                    7.96
## 16 2001    2326000  284968955 8.2                    8.16
## 17 2000    2315000  281421906 8.2                    8.23

df_divorce <- df_divorce %>%
  mutate(divorse_rate = divorce/population*1000)

df_divorce

## # A tibble: 17 x 5
##    year  divorce population rate_per_1000 divorse_rate
##    <chr>   <int>      <int> <fct>                <dbl>
##  1 2016   827261  257904548 3.2                   3.21
##  2 2015   800909  258518265 3.1                   3.10
##  3 2014   813862  256483624 3.2                   3.17
##  4 2013   832157  254408815 3.3                   3.27
##  5 2012   851000  248041986 3.4                   3.43
##  6 2011   877000  246273366 3.6                   3.56
##  7 2010   872000  244122529 3.6                   3.57
##  8 2009   840000  242610561 3.5                   3.46
##  9 2008   844000  240545163 3.5                   3.51
## 10 2007   856000  238352850 3.6                   3.59
## 11 2006   872000  236094277 3.7                   3.69
## 12 2005   847000  233495163 3.6                   3.63
## 13 2004   879000  236402656 3.7                   3.72
## 14 2003   927000  243902090 3.8                   3.80
## 15 2002   955000  243108303 3.9                   3.93
## 16 2001   940000  236416762 4.0                   3.98
## 17 2000   944000  233550143 4.0                   4.04
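Both rates use the same formula: events divided by population, times 1,000. The sketch below wraps that in a hypothetical helper, `crude_rate`, and checks it against the 2016 rows shown above.

```r
# Hypothetical helper: events per 1,000 people.
crude_rate <- function(events, population) {
  events / population * 1000
}

# 2016 values copied from the tables above.
marriage_rate_2016 <- crude_rate(2245404, 323127513)
divorce_rate_2016  <- crude_rate(827261, 257904548)

# Rounded to one decimal, these match the published rate_per_1000 column.
round(c(marriage = marriage_rate_2016, divorce = divorce_rate_2016), 1)
```

Note that the two populations differ, because the divorce figures exclude some states, so the two rates are not measured against the same base.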

Combine the divorce and marriage dataframes into a single dataframe: df_combine

df_combine <- data.frame(c(df_marrages, df_divorce))

df_combine

##    year marriages population rate_per_1000 marrages_rate year.1 divorce
## 1  2016   2245404  323127513           6.9      6.948972   2016  827261
## 2  2015   2221579  321418820           6.9      6.911789   2015  800909
## 3  2014   2140272  308759713           6.9      6.931837   2014  813862
## 4  2013   2081301  306136672           6.8      6.798601   2013  832157
## 5  2012   2131000  313914040           6.8      6.788483   2012  851000
## 6  2011   2118000  311591917           6.8      6.797352   2011  877000
## 7  2010   2096000  308745538           6.8      6.788762   2010  872000
## 8  2009   2080000  306771529           6.8      6.780290   2009  840000
## 9  2008   2157000  304093966           7.1      7.093202   2008  844000
## 10 2007   2197000  301231207           7.3      7.293401   2007  856000
## 11 2006   2193000  294077247           7.5      7.457224   2006  872000
## 12 2005   2249000  295516599           7.6      7.610402   2005  847000
## 13 2004   2279000  292805298           7.8      7.783329   2004  879000
## 14 2003   2245000  290107933           7.7      7.738499   2003  927000
## 15 2002   2290000  287625193           8.0      7.961750   2002  955000
## 16 2001   2326000  284968955           8.2      8.162293   2001  940000
## 17 2000   2315000  281421906           8.2      8.226083   2000  944000
##    population.1 rate_per_1000.1 divorse_rate
## 1     257904548             3.2     3.207625
## 2     258518265             3.1     3.098075
## 3     256483624             3.2     3.173154
## 4     254408815             3.3     3.270944
## 5     248041986             3.4     3.430871
## 6     246273366             3.6     3.561083
## 7     244122529             3.6     3.571977
## 8     242610561             3.5     3.462339
## 9     240545163             3.5     3.508697
## 10    238352850             3.6     3.591314
## 11    236094277             3.7     3.693440
## 12    233495163             3.6     3.627484
## 13    236402656             3.7     3.718232
## 14    243902090             3.8     3.800705
## 15    243108303             3.9     3.928290
## 16    236416762             4.0     3.976029
## 17    233550143             4.0     4.041959
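Binding the columns side by side works here only because both tables list the same years in the same order. A join on year does not rely on that. The sketch below is an alternative under that assumption, using base R's merge on two-row stand-ins; `marriages_df` and `divorces_df` are illustrative names, and the divorce rows are deliberately shuffled.

```r
# Two-row stand-ins built from values in the tables above.
marriages_df <- data.frame(
  year = c("2016", "2015"),
  marriages = c(2245404L, 2221579L),
  population = c(323127513L, 321418820L),
  stringsAsFactors = FALSE
)
divorces_df <- data.frame(
  year = c("2015", "2016"),               # deliberately in a different order
  divorce = c(800909L, 827261L),
  population = c(258518265L, 257904548L),
  stringsAsFactors = FALSE
)

# Rows are matched on year; the clashing population columns get suffixes.
combined <- merge(marriages_df, divorces_df, by = "year", suffixes = c("_m", "_d"))
combined
```

The suffixes produce population_m and population_d directly, so the separate rename step would not be needed with this approach.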

Select columns from the combined dataset, and rename the population columns, whose names were duplicated.

df_combine <- df_combine %>%
  select(year, marriages, population, marrages_rate, divorce, population.1, divorse_rate) %>%
  rename(population_m = population, population_d = population.1)

Visualization of marriages and divorce over the 16 years in the data set.

Looking at the data, there seems to be an overall relationship: as the number of marriages increases, so does the number of divorces. But what does this actually tell us? It could just be that the total population has changed over the 16 years…

ggplot(df_combine) + 
  geom_point(aes(year, marriages), color = 'blue') +   # marriages per year
  geom_point(aes(year, divorce), color = 'red') +      # divorces per year
  facet_grid(~year) +
  theme_minimal()

Divorce to Marriage ratio.

Number of divorces to the number of marriages in a given year. This takes into account how many people were actually married!

d_2_m_ratio <- df_combine$divorce/df_combine$marriages
d_2_m_ratio

##  [1] 0.3684241 0.3605134 0.3802610 0.3998254 0.3993430 0.4140699 0.4160305
##  [8] 0.4038462 0.3912842 0.3896222 0.3976288 0.3766118 0.3856955 0.4129176
## [15] 0.4170306 0.4041273 0.4077754

Add ratio column to dataframe for our visualization.

df_combine <- df_combine %>%
  mutate(dm_ratio = divorce / marriages)

ggplot(df_combine, aes(year, dm_ratio)) +
  geom_point(aes(color = dm_ratio)) +
  theme_dark()

It looks like the divorce-to-marriage ratio has actually decreased over the 16 years in the dataset, from ~41% to ~37%. It would be interesting if the data had more information about the individuals surveyed, for example the age at which they married, their education status, and their income level. All of these factors could affect whether a couple gets divorced.
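To put a number on that decrease, the sketch below reuses the 17 ratios printed earlier and computes the change between 2000 and 2016 in percentage points, plus a simple linear trend. The linear fit is only a rough summary added here for illustration; the ratio does not fall steadily from year to year.

```r
# Ratios copied from the d_2_m_ratio output above, ordered 2016 down to 2000.
years <- 2016:2000
dm_ratio <- c(0.3684241, 0.3605134, 0.3802610, 0.3998254, 0.3993430, 0.4140699,
              0.4160305, 0.4038462, 0.3912842, 0.3896222, 0.3976288, 0.3766118,
              0.3856955, 0.4129176, 0.4170306, 0.4041273, 0.4077754)

# Change between the first and last year, in percentage points.
change_pp <- (dm_ratio[years == 2016] - dm_ratio[years == 2000]) * 100

# Least-squares slope: average change in the ratio per year.
trend <- lm(dm_ratio ~ years)
slope_per_year <- unname(coef(trend)["years"])

c(change_pp = change_pp, slope_per_year = slope_per_year)
```

A negative slope agrees with the direction of the endpoint comparison, although with only 17 points and visible year-to-year swings it should be read as descriptive rather than conclusive.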

Doing some research online, it seems that couples who have a college degree have a much lower chance of being divorced.

Marriage and Divorce rate

Nicholas Schettini

2018-03-12
