Dataset 1 - Marriage and Divorce rate
Libraries:
library(tidyverse)
library(readr)
Read data: Data uploaded to GitHub, then served through RawGit.
The data has problems with column names, formatting, and types (int, char). An extended head is shown to display these issues.
data <- read.csv("https://rawgit.com/nschettini/CUNY-MSDS-DATA-607/master/national_marriage_divorce_rates_00-16.csv")
head(data, 30)
## ï..Provisional.number.of.marriages.and.marriage.rate..United.States..2000.2016
## 1
## 2 Year
## 3 2016
## 4 2015
## 5 2014/1
## 6 2013/1
## 7 2012
## 8 2011
## 9 2010
## 10 2009
## 11 2008
## 12 2007
## 13 2006/2
## 14 2005
## 15 2004
## 16 2003
## 17 2002
## 18 2001
## 19 2000
## 20
## 21 1/Excludes data for Georgia.
## 22 2/Excludes data for Louisiana.
## 23
## 24 Note: Rates for 2001-2009 have been revised and are based on intercensal population estimates from the 2000
## 25 and 2010 censuses. Populations for 2010 rates are based on the 2010 census.
## 26 Source: CDC/NCHS National Vital Statistics System.
## 27
## 28
## 29 Provisional number of divorces and annulments and rate: United States, 2000-2016
## 30
## X X.1 X.2 X.3 X.4 X.5 X.6
## 1 NA NA NA NA
## 2 Marriages Population Rate per 1,000 total population NA NA NA NA
## 3 2,245,404 323,127,513 6.9 NA NA NA NA
## 4 2,221,579 321,418,820 6.9 NA NA NA NA
## 5 2,140,272 308,759,713 6.9 NA NA NA NA
## 6 2,081,301 306,136,672 6.8 NA NA NA NA
## 7 2,131,000 313,914,040 6.8 NA NA NA NA
## 8 2,118,000 311,591,917 6.8 NA NA NA NA
## 9 2,096,000 308,745,538 6.8 NA NA NA NA
## 10 2,080,000 306,771,529 6.8 NA NA NA NA
## 11 2,157,000 304,093,966 7.1 NA NA NA NA
## 12 2,197,000 301,231,207 7.3 NA NA NA NA
## 13 2,193,000 294,077,247 7.5 NA NA NA NA
## 14 2,249,000 295,516,599 7.6 NA NA NA NA
## 15 2,279,000 292,805,298 7.8 NA NA NA NA
## 16 2,245,000 290,107,933 7.7 NA NA NA NA
## 17 2,290,000 287,625,193 8.0 NA NA NA NA
## 18 2,326,000 284,968,955 8.2 NA NA NA NA
## 19 2,315,000 281,421,906 8.2 NA NA NA NA
## 20 NA NA NA NA
## 21 NA NA NA NA
## 22 NA NA NA NA
## 23 NA NA NA NA
## 24 NA NA NA NA
## 25 NA NA NA NA
## 26 NA NA NA NA
## 27 NA NA NA NA
## 28 NA NA NA NA
## 29 NA NA NA NA
## 30 NA NA NA NA
## X.7 X.8
## 1 NA NA
## 2 NA NA
## 3 NA NA
## 4 NA NA
## 5 NA NA
## 6 NA NA
## 7 NA NA
## 8 NA NA
## 9 NA NA
## 10 NA NA
## 11 NA NA
## 12 NA NA
## 13 NA NA
## 14 NA NA
## 15 NA NA
## 16 NA NA
## 17 NA NA
## 18 NA NA
## 19 NA NA
## 20 NA NA
## 21 NA NA
## 22 NA NA
## 23 NA NA
## 24 NA NA
## 25 NA NA
## 26 NA NA
## 27 NA NA
## 28 NA NA
## 29 NA NA
## 30 NA NA
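The title row and the blank row seen above could also be skipped when the file is read, so that the real header line becomes the column names. This is a minimal sketch on a hypothetical three-row miniature of the file; `raw_text` and `marriages_raw` are illustrative names, and the real file may need a different `skip` value and still has footnote rows to drop.

```r
# Hypothetical miniature of the CSV layout: title line, blank line, header, data.
raw_text <- paste(
  "\"Provisional number of marriages and marriage rate: United States, 2000-2016\",,,",
  ",,,",
  "Year,Marriages,Population,\"Rate per 1,000 total population\"",
  "2016,\"2,245,404\",\"323,127,513\",6.9",
  "2015,\"2,221,579\",\"321,418,820\",6.9",
  "2014/1,\"2,140,272\",\"308,759,713\",6.9",
  sep = "\n"
)

# skip = 2 drops the title and blank lines, so "Year,Marriages,..." is the header.
# stringsAsFactors = FALSE keeps text columns as character instead of factor.
marriages_raw <- read.csv(text = raw_text, skip = 2, header = TRUE,
                          stringsAsFactors = FALSE)

names(marriages_raw)
marriages_raw
```

The counts are still text because of the thousands separators, so the cleaning steps that follow would apply either way.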
Convert the data into a tibble. Tibbles print more compactly, which makes the data easier to read.
as.tbl(data)
## # A tibble: 60 x 10
## ï..Provisional.n~ X X.1 X.2 X.3 X.4 X.5 X.6 X.7 X.8
## <fct> <fct> <fct> <fct> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
## 1 "" "" "" "" NA NA NA NA NA NA
## 2 Year Marr~ Popu~ Rate~ NA NA NA NA NA NA
## 3 2016 2,24~ 323,~ 6.9 NA NA NA NA NA NA
## 4 2015 2,22~ 321,~ 6.9 NA NA NA NA NA NA
## 5 2014/1 2,14~ 308,~ 6.9 NA NA NA NA NA NA
## 6 2013/1 2,08~ 306,~ 6.8 NA NA NA NA NA NA
## 7 2012 2,13~ 313,~ 6.8 NA NA NA NA NA NA
## 8 2011 2,11~ 311,~ 6.8 NA NA NA NA NA NA
## 9 2010 2,09~ 308,~ 6.8 NA NA NA NA NA NA
## 10 2009 2,08~ 306,~ 6.8 NA NA NA NA NA NA
## # ... with 50 more rows
tbl_df(data)
## # A tibble: 60 x 10
## ï..Provisional.n~ X X.1 X.2 X.3 X.4 X.5 X.6 X.7 X.8
## <fct> <fct> <fct> <fct> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
## 1 "" "" "" "" NA NA NA NA NA NA
## 2 Year Marr~ Popu~ Rate~ NA NA NA NA NA NA
## 3 2016 2,24~ 323,~ 6.9 NA NA NA NA NA NA
## 4 2015 2,22~ 321,~ 6.9 NA NA NA NA NA NA
## 5 2014/1 2,14~ 308,~ 6.9 NA NA NA NA NA NA
## 6 2013/1 2,08~ 306,~ 6.8 NA NA NA NA NA NA
## 7 2012 2,13~ 313,~ 6.8 NA NA NA NA NA NA
## 8 2011 2,11~ 311,~ 6.8 NA NA NA NA NA NA
## 9 2010 2,09~ 308,~ 6.8 NA NA NA NA NA NA
## 10 2009 2,08~ 306,~ 6.8 NA NA NA NA NA NA
## # ... with 50 more rows
Select columns from the data and rename the first column to something understandable.
data <- data %>%
  as.tbl() %>%
  select(ï..Provisional.number.of.marriages.and.marriage.rate..United.States..2000.2016,
         X,
         X.1,
         X.2) %>%
  rename(num_marrage_rate = ï..Provisional.number.of.marriages.and.marriage.rate..United.States..2000.2016)
data
## # A tibble: 60 x 4
## num_marrage_rate X X.1 X.2
## <fct> <fct> <fct> <fct>
## 1 "" "" "" ""
## 2 Year Marriages Population Rate per 1,000 total population
## 3 2016 2,245,404 323,127,513 6.9
## 4 2015 2,221,579 321,418,820 6.9
## 5 2014/1 2,140,272 308,759,713 6.9
## 6 2013/1 2,081,301 306,136,672 6.8
## 7 2012 2,131,000 313,914,040 6.8
## 8 2011 2,118,000 311,591,917 6.8
## 9 2010 2,096,000 308,745,538 6.8
## 10 2009 2,080,000 306,771,529 6.8
## # ... with 50 more rows
Remove the “/1” and “/2” footnote markers from the year column. A regular expression matches a slash followed by a digit and replaces it with an empty string.
data1 <- data
data1$num_marrage_rate <- gsub("/\\d", "", data$num_marrage_rate)
data1
## # A tibble: 60 x 4
## num_marrage_rate X X.1 X.2
## <chr> <fct> <fct> <fct>
## 1 "" "" "" ""
## 2 Year Marriages Population Rate per 1,000 total population
## 3 2016 2,245,404 323,127,513 6.9
## 4 2015 2,221,579 321,418,820 6.9
## 5 2014 2,140,272 308,759,713 6.9
## 6 2013 2,081,301 306,136,672 6.8
## 7 2012 2,131,000 313,914,040 6.8
## 8 2011 2,118,000 311,591,917 6.8
## 9 2010 2,096,000 308,745,538 6.8
## 10 2009 2,080,000 306,771,529 6.8
## # ... with 50 more rows
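To see the footnote pattern in isolation, here is a small sketch on a toy vector (`years` and `cleaned` are illustrative names, not objects from the analysis). The pattern `/\\d` matches a slash followed by a single digit, which is all the footnote markers in this file use.

```r
# Toy year labels, two of them carrying footnote markers as in the raw data.
years <- c("2016", "2014/1", "2006/2")

# Replace "slash + digit" with nothing; entries without a marker are unchanged.
cleaned <- gsub("/\\d", "", years)
cleaned
```

If a file used multi-digit footnote numbers, a pattern such as `/\\d+$` would probably be the safer choice.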
Remove the first two rows. The first is a blank row and the second holds the column names, which are applied in the next step.
data1 <- data1[-c(1:2), ]
Rename columns. Rename the columns to something understandable: X becomes marriages, and so on.
data1 <- rename(data1, marriages = X, population = X.1, rate_per_1000 = X.2)
data1 <- rename(data1, year = num_marrage_rate)
data1
## # A tibble: 58 x 4
## year marriages population rate_per_1000
## <chr> <fct> <fct> <fct>
## 1 2016 2,245,404 323,127,513 6.9
## 2 2015 2,221,579 321,418,820 6.9
## 3 2014 2,140,272 308,759,713 6.9
## 4 2013 2,081,301 306,136,672 6.8
## 5 2012 2,131,000 313,914,040 6.8
## 6 2011 2,118,000 311,591,917 6.8
## 7 2010 2,096,000 308,745,538 6.8
## 8 2009 2,080,000 306,771,529 6.8
## 9 2008 2,157,000 304,093,966 7.1
## 10 2007 2,197,000 301,231,207 7.3
## # ... with 48 more rows
Remove the commas. The commas interfered with calculations on the data: the numbers are stored as text (factors here), not as numeric values.
data1$marriages <- gsub(",", "", data1$marriages)
data1$population <- gsub(",", "", data1$population)
data1
## # A tibble: 58 x 4
## year marriages population rate_per_1000
## <chr> <chr> <chr> <fct>
## 1 2016 2245404 323127513 6.9
## 2 2015 2221579 321418820 6.9
## 3 2014 2140272 308759713 6.9
## 4 2013 2081301 306136672 6.8
## 5 2012 2131000 313914040 6.8
## 6 2011 2118000 311591917 6.8
## 7 2010 2096000 308745538 6.8
## 8 2009 2080000 306771529 6.8
## 9 2008 2157000 304093966 7.1
## 10 2007 2197000 301231207 7.3
## # ... with 48 more rows
Convert the columns from character to integer using as.integer.
data1$marriages <- as.integer(data1$marriages)
data1$population <- as.integer(data1$population)
head(data1)
## # A tibble: 6 x 4
## year marriages population rate_per_1000
## <chr> <int> <int> <fct>
## 1 2016 2245404 323127513 6.9
## 2 2015 2221579 321418820 6.9
## 3 2014 2140272 308759713 6.9
## 4 2013 2081301 306136672 6.8
## 5 2012 2131000 313914040 6.8
## 6 2011 2118000 311591917 6.8
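One caveat applies to columns that are still factors, such as rate_per_1000: calling `as.numeric()` or `as.integer()` directly on a factor returns its internal level codes rather than the printed values. A minimal sketch on toy vectors (`rate` and `marriages_fct` are illustrative names) showing the difference:

```r
# A factor holding rates as text, as read.csv produced in this analysis.
rate <- factor(c("6.9", "6.8", "7.1"))

# Direct conversion gives the level codes (positions in the sorted levels).
codes <- as.numeric(rate)

# Going through character first recovers the actual numbers.
values <- as.numeric(as.character(rate))

# The same idea for counts with thousands separators:
marriages_fct <- factor(c("2,245,404", "2,221,579"))
marriages_int <- as.integer(gsub(",", "", as.character(marriages_fct)))

codes
values
marriages_int
```

In the steps above, `gsub()` already converts the factor to character before `as.integer()` is applied, which is why the marriages and population columns come out correctly.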
Create a data frame for just marriages. It’s easier to manipulate the marriage and divorce data when they are not in the same columns.
df_marrages <- data1[1:17,]
df_marrages
## # A tibble: 17 x 4
## year marriages population rate_per_1000
## <chr> <int> <int> <fct>
## 1 2016 2245404 323127513 6.9
## 2 2015 2221579 321418820 6.9
## 3 2014 2140272 308759713 6.9
## 4 2013 2081301 306136672 6.8
## 5 2012 2131000 313914040 6.8
## 6 2011 2118000 311591917 6.8
## 7 2010 2096000 308745538 6.8
## 8 2009 2080000 306771529 6.8
## 9 2008 2157000 304093966 7.1
## 10 2007 2197000 301231207 7.3
## 11 2006 2193000 294077247 7.5
## 12 2005 2249000 295516599 7.6
## 13 2004 2279000 292805298 7.8
## 14 2003 2245000 290107933 7.7
## 15 2002 2290000 287625193 8.0
## 16 2001 2326000 284968955 8.2
## 17 2000 2315000 281421906 8.2
Create a data frame for divorces.
df_divorce <- data1[30:46,]
df_divorce
## # A tibble: 17 x 4
## year marriages population rate_per_1000
## <chr> <int> <int> <fct>
## 1 2016 827261 257904548 3.2
## 2 2015 800909 258518265 3.1
## 3 2014 813862 256483624 3.2
## 4 2013 832157 254408815 3.3
## 5 2012 851000 248041986 3.4
## 6 2011 877000 246273366 3.6
## 7 2010 872000 244122529 3.6
## 8 2009 840000 242610561 3.5
## 9 2008 844000 240545163 3.5
## 10 2007 856000 238352850 3.6
## 11 2006 872000 236094277 3.7
## 12 2005 847000 233495163 3.6
## 13 2004 879000 236402656 3.7
## 14 2003 927000 243902090 3.8
## 15 2002 955000 243108303 3.9
## 16 2001 940000 236416762 4.0
## 17 2000 944000 233550143 4.0
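The row ranges 1:17 and 30:46 are hard-coded, so they would break if the file layout shifted. As a hedged alternative, the two tables could be located by content: rows whose year field is exactly four digits are data rows, and each consecutive run of such rows is one table. The sketch below uses a made-up `toy` data frame and illustrative names (`is_year`, `block_id`, `blocks`).

```r
# Toy frame imitating the layout: data rows, then notes and a second header,
# then a second block of data rows.
toy <- data.frame(
  year  = c("2016", "2015", "", "Note: ...", "Year", "2016", "2015"),
  count = c("10", "9", "", "", "Number", "4", "3"),
  stringsAsFactors = FALSE
)

# TRUE for rows whose year field is exactly four digits.
is_year <- grepl("^\\d{4}$", toy$year)

# rle() finds runs of consecutive TRUE/FALSE; give every run its own id.
runs <- rle(is_year)
block_id <- rep(seq_along(runs$lengths), runs$lengths)

# Keep only the data rows and split them by run: one element per table.
blocks <- split(toy[is_year, ], block_id[is_year])

first_table  <- blocks[[1]]
second_table <- blocks[[2]]
```

This assumes the footnote markers have already been stripped from the year column; otherwise rows such as "2014/1" would not match the four-digit pattern.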
Rename marriage column to divorce.
df_divorce <- rename(df_divorce, divorce = marriages)
df_divorce
## # A tibble: 17 x 4
## year divorce population rate_per_1000
## <chr> <int> <int> <fct>
## 1 2016 827261 257904548 3.2
## 2 2015 800909 258518265 3.1
## 3 2014 813862 256483624 3.2
## 4 2013 832157 254408815 3.3
## 5 2012 851000 248041986 3.4
## 6 2011 877000 246273366 3.6
## 7 2010 872000 244122529 3.6
## 8 2009 840000 242610561 3.5
## 9 2008 844000 240545163 3.5
## 10 2007 856000 238352850 3.6
## 11 2006 872000 236094277 3.7
## 12 2005 847000 233495163 3.6
## 13 2004 879000 236402656 3.7
## 14 2003 927000 243902090 3.8
## 15 2002 955000 243108303 3.9
## 16 2001 940000 236416762 4.0
## 17 2000 944000 233550143 4.0
Crude divorce rate: the number of divorces per 1,000 people in the population. The crude marriage rate is defined the same way for marriages.
This does not take into account the number of people who can’t marry (children, etc.), so it isn’t very accurate. A better measure, the divorce-to-marriage ratio, appears further below.
df_marrages <- df_marrages %>%
  mutate(marrages_rate = marriages / population * 1000)
df_marrages
## # A tibble: 17 x 5
## year marriages population rate_per_1000 marrages_rate
## <chr> <int> <int> <fct> <dbl>
## 1 2016 2245404 323127513 6.9 6.95
## 2 2015 2221579 321418820 6.9 6.91
## 3 2014 2140272 308759713 6.9 6.93
## 4 2013 2081301 306136672 6.8 6.80
## 5 2012 2131000 313914040 6.8 6.79
## 6 2011 2118000 311591917 6.8 6.80
## 7 2010 2096000 308745538 6.8 6.79
## 8 2009 2080000 306771529 6.8 6.78
## 9 2008 2157000 304093966 7.1 7.09
## 10 2007 2197000 301231207 7.3 7.29
## 11 2006 2193000 294077247 7.5 7.46
## 12 2005 2249000 295516599 7.6 7.61
## 13 2004 2279000 292805298 7.8 7.78
## 14 2003 2245000 290107933 7.7 7.74
## 15 2002 2290000 287625193 8.0 7.96
## 16 2001 2326000 284968955 8.2 8.16
## 17 2000 2315000 281421906 8.2 8.23
df_divorce <- df_divorce %>%
  mutate(divorse_rate = divorce / population * 1000)
df_divorce
## # A tibble: 17 x 5
## year divorce population rate_per_1000 divorse_rate
## <chr> <int> <int> <fct> <dbl>
## 1 2016 827261 257904548 3.2 3.21
## 2 2015 800909 258518265 3.1 3.10
## 3 2014 813862 256483624 3.2 3.17
## 4 2013 832157 254408815 3.3 3.27
## 5 2012 851000 248041986 3.4 3.43
## 6 2011 877000 246273366 3.6 3.56
## 7 2010 872000 244122529 3.6 3.57
## 8 2009 840000 242610561 3.5 3.46
## 9 2008 844000 240545163 3.5 3.51
## 10 2007 856000 238352850 3.6 3.59
## 11 2006 872000 236094277 3.7 3.69
## 12 2005 847000 233495163 3.6 3.63
## 13 2004 879000 236402656 3.7 3.72
## 14 2003 927000 243902090 3.8 3.80
## 15 2002 955000 243108303 3.9 3.93
## 16 2001 940000 236416762 4.0 3.98
## 17 2000 944000 233550143 4.0 4.04
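The crude rates above can be checked by hand for a single year. The sketch below wraps the formula in a small helper, `crude_rate`, which is a hypothetical name introduced here for illustration, and applies it to the 2016 figures from the tables.

```r
# Events per 1,000 people: events / population * 1000.
crude_rate <- function(events, population) {
  events / population * 1000
}

# 2016 values taken from the marriage and divorce tables above.
marriage_rate_2016 <- crude_rate(2245404, 323127513)
divorce_rate_2016  <- crude_rate(827261, 257904548)

round(marriage_rate_2016, 2)
round(divorce_rate_2016, 2)
```

Note that the two tables report different population figures for the same year, so the marriage and divorce rates are not computed over the same denominator.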
Combine divorce and marriage dataframes into a single variable: df_combine
df_combine <- data.frame(c(df_marrages, df_divorce))
df_combine
## year marriages population rate_per_1000 marrages_rate year.1 divorce
## 1 2016 2245404 323127513 6.9 6.948972 2016 827261
## 2 2015 2221579 321418820 6.9 6.911789 2015 800909
## 3 2014 2140272 308759713 6.9 6.931837 2014 813862
## 4 2013 2081301 306136672 6.8 6.798601 2013 832157
## 5 2012 2131000 313914040 6.8 6.788483 2012 851000
## 6 2011 2118000 311591917 6.8 6.797352 2011 877000
## 7 2010 2096000 308745538 6.8 6.788762 2010 872000
## 8 2009 2080000 306771529 6.8 6.780290 2009 840000
## 9 2008 2157000 304093966 7.1 7.093202 2008 844000
## 10 2007 2197000 301231207 7.3 7.293401 2007 856000
## 11 2006 2193000 294077247 7.5 7.457224 2006 872000
## 12 2005 2249000 295516599 7.6 7.610402 2005 847000
## 13 2004 2279000 292805298 7.8 7.783329 2004 879000
## 14 2003 2245000 290107933 7.7 7.738499 2003 927000
## 15 2002 2290000 287625193 8.0 7.961750 2002 955000
## 16 2001 2326000 284968955 8.2 8.162293 2001 940000
## 17 2000 2315000 281421906 8.2 8.226083 2000 944000
## population.1 rate_per_1000.1 divorse_rate
## 1 257904548 3.2 3.207625
## 2 258518265 3.1 3.098075
## 3 256483624 3.2 3.173154
## 4 254408815 3.3 3.270944
## 5 248041986 3.4 3.430871
## 6 246273366 3.6 3.561083
## 7 244122529 3.6 3.571977
## 8 242610561 3.5 3.462339
## 9 240545163 3.5 3.508697
## 10 238352850 3.6 3.591314
## 11 236094277 3.7 3.693440
## 12 233495163 3.6 3.627484
## 13 236402656 3.7 3.718232
## 14 243902090 3.8 3.800705
## 15 243108303 3.9 3.928290
## 16 236416762 4.0 3.976029
## 17 233550143 4.0 4.041959
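Combining with `c()` pairs the rows purely by position, which works here because both data frames list the years in the same order. A hedged alternative is to join on the year column, which stays correct even if one table is sorted differently. The sketch uses base R `merge()` on two tiny data frames (`marriages_df`, `divorces_df`, and `combined` are illustrative names; the values are the 2015 and 2016 counts from above).

```r
marriages_df <- data.frame(
  year = c("2016", "2015"),
  marriages = c(2245404L, 2221579L),
  stringsAsFactors = FALSE
)

# Deliberately in a different row order than marriages_df.
divorces_df <- data.frame(
  year = c("2015", "2016"),
  divorce = c(800909L, 827261L),
  stringsAsFactors = FALSE
)

# Rows are matched on the value of year, not on their position.
combined <- merge(marriages_df, divorces_df, by = "year")
combined
```

A join also avoids the duplicated year.1 column, since the key appears only once in the result.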
Select columns from the combined dataset, and rename some columns that were ‘duplicate’ names.
df_combine <- df_combine %>%
  select(year, marriages, population, marrages_rate, divorce, population.1, divorse_rate) %>%
  rename(population_m = population, population_d = population.1)
Visualization of marriages and divorces from 2000 to 2016.
Looking at the data, there seems to be an overall relationship: as the number of marriages increases, so does the divorce rate. But what does this actually tell us? It could just be that the total population has increased over the 16 years…
ggplot(df_combine) +
  geom_point(aes(year, population_m), color = 'blue') +
  geom_point(aes(year, population_d), color = 'red') +
  facet_grid(~year) +
  theme_minimal()
Divorce-to-marriage ratio.
The number of divorces divided by the number of marriages in a given year. This takes into account how many people actually married!
d_2_m_ratio <- df_combine$divorce/df_combine$marriages
d_2_m_ratio
## [1] 0.3684241 0.3605134 0.3802610 0.3998254 0.3993430 0.4140699 0.4160305
## [8] 0.4038462 0.3912842 0.3896222 0.3976288 0.3766118 0.3856955 0.4129176
## [15] 0.4170306 0.4041273 0.4077754
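As a quick check of the ratio, the first and last years can be worked through by hand. This sketch uses the 2000 and 2016 counts from the tables above; `divorces`, `marriages`, `ratio`, and `change_points` are illustrative names.

```r
# Counts for the first and last year in the data set.
divorces  <- c(`2000` = 944000, `2016` = 827261)
marriages <- c(`2000` = 2315000, `2016` = 2245404)

# Divorces per marriage in each year.
ratio <- divorces / marriages

# Change between 2000 and 2016, in percentage points.
change_points <- 100 * (ratio[["2016"]] - ratio[["2000"]])

round(100 * ratio, 1)
round(change_points, 1)
```

The ratio compares divorces and marriages that happened in the same calendar year, so it describes the balance between the two counts rather than the fate of any particular cohort of couples.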
Add ratio column to dataframe for our visualization.
df_combine <- df_combine %>%
  mutate(dm_ratio = divorce / marriages)
ggplot(df_combine, aes(year, dm_ratio)) +
  geom_point(aes(color = dm_ratio)) +
  theme_dark()
It looks like the divorce-to-marriage ratio has actually decreased over the 16 years in the dataset, from ~41% to ~37%. It would be interesting if the data had more information about the individuals involved, for example the age at which they married, their education level, and their income level. All of these factors could affect whether a couple gets divorced.
Doing some research online, it seems that couples who have a college degree have a much lower chance of being divorced.
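Because year is stored as character in the combined data, ggplot treats it as a discrete axis. A possible variation on the ratio plot, sketched here on three values copied from the ratio output above, converts year to integer so the points can be joined by a line. `ratio_df` and `p` are illustrative names, and this is only one way the trend could be drawn.

```r
library(ggplot2)

# Ratios for 2016, 2015 and 2014, copied from the d_2_m_ratio output.
ratio_df <- data.frame(
  year = c("2016", "2015", "2014"),
  dm_ratio = c(0.3684241, 0.3605134, 0.3802610),
  stringsAsFactors = FALSE
)

# A numeric year gives a continuous x axis, which geom_line() needs
# in order to connect the points without an explicit group aesthetic.
ratio_df$year <- as.integer(ratio_df$year)

p <- ggplot(ratio_df, aes(year, dm_ratio)) +
  geom_line() +
  geom_point() +
  labs(x = "Year", y = "Divorces per marriage") +
  theme_minimal()

p
```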