Chaoter 12 of "Data Science For Business", discusses data reduction, latent information, and how
they can be a useful tool. It can be useful to manipulate, or "tidy" larger datasets and replace
them with a smaller data, while preserving information from the larger dataset. In many cases, a
smaller dataset will be easier to work with. Additionally, the smaller dataset may provide more
insights from the data.
In this project, I imported a CSV datasets, tidied it, and analyzed the data to answer potential
questions about the data.
The datsets came from the following link: https://www.cdc.gov/nchs/nvss/marriage-divorce.htm?CDC_AA_refVal=https%3A%2F%2Fwww.cdc.gov%2Fnchs%2Fmardiv.htm
A larger dataset on marriage and divorce rates can be reduced to smaller datasets that we can use
to draw insights. Also, indications that are latent in the dataset.
Here is a link to the original datasets that I converted into CSV files.
I will use:
tidyr, dplyr, and stringr to reshape, replace, and tidy the data
knitr and kableExtra to create HTML tables
ggplot2 to visualize the data
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(stringr)
library(knitr)
library(kableExtra)
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
library(ggplot2)
I imported the CSV from a folder on my github.
url <- 'https://raw.githubusercontent.com/Vthomps000/DATA607_VT/master/national-marriage-divorce-rates-00-18.csv'
MnD <- read.csv(url)
MnD
## Provisional.number.of.marriages.and.marriage.rate..United.States..2000.2018
## 1
## 2 Year
## 3 2018
## 4 2017
## 5 2016
## 6 2015
## 7 20141
## 8 20131
## 9 2012
## 10 2011
## 11 2010
## 12 2009
## 13 2008
## 14 2007
## 15 20062
## 16 2005
## 17 2004
## 18 2003
## 19 2002
## 20 2001
## 21 2000
## 22 1 Excludes data for Georgia.
## 23 2 Excludes data for Louisiana.
## 24
## 25 Note: Number and rate for 2016 has been revised due to revised figures for Illinois. Rates for 2001-2009 have been revised and are based on intercensal population estimates from the 2000 and 2010 censuses. Populations for 2010 rates are based on the 2010 census.
## 26
## 27
## 28 Source: CDC/NCHS National Vital Statistics System.
## 29
## 30
## 31 Provisional number of divorces and annulments and rate: United States, 2000-2018
## 32
## 33 Year
## 34 20181
## 35 20171
## 36 20162
## 37 20153
## 38 20143
## 39 20133
## 40 20124
## 41 20114
## 42 20104
## 43 20094
## 44 20084
## 45 20074
## 46 20064
## 47 20054
## 48 20045
## 49 20036
## 50 20027
## 51 20018
## 52 20008
## 53 1 Excludes data for California, Hawaii, Indiana, Minnesota, and New Mexico.
## 54 2 Excludes data for California, Georgia, Hawaii, Indiana, Minnesota, and New Mexico.
## 55 3 Excludes data for California, Georgia, Hawaii, Indiana, and Minnesota.
## 56 4 Excludes data for California, Georgia, Hawaii, Indiana, Louisiana, and Minnesota.
## 57 5 Excludes data for California, Georgia, Hawaii, Indiana, and Louisiana.
## 58 6 Excludes data for California, Hawaii, Indiana, and Oklahoma.
## 59 7 Excludes data for California, Indiana, and Oklahoma.
## 60 8 Excludes data for California, Indiana, Louisiana, and Oklahoma.
## 61
## 62 Note: Number and rate for 2016 has been revised due to revised figures for Illinois and Texas. Rates for 2001-2009 have been revised and are based on intercensal population estimates from the 2000 and 2010 censuses. Populations for 2010 rates are based on the 2010 census.
## 63
## 64
## 65 Source: CDC/NCHS National Vital Statistics System.
## X X.1 X.2 X.3 X.4
## 1 NA NA
## 2 Marriages Population Rate per 1,000 total population NA NA
## 3 2,132,853 327,167,434 6.5 NA NA
## 4 2,236,496 325,719,178 6.9 NA NA
## 5 2,251,411 323,127,513 7.0 NA NA
## 6 2,221,579 321,418,820 6.9 NA NA
## 7 2,140,272 308,759,713 6.9 NA NA
## 8 2,081,301 306,136,672 6.8 NA NA
## 9 2,131,000 313,914,040 6.8 NA NA
## 10 2,118,000 311,591,917 6.8 NA NA
## 11 2,096,000 308,745,538 6.8 NA NA
## 12 2,080,000 306,771,529 6.8 NA NA
## 13 2,157,000 304,093,966 7.1 NA NA
## 14 2,197,000 301,231,207 7.3 NA NA
## 15 2,193,000 294,077,247 7.5 NA NA
## 16 2,249,000 295,516,599 7.6 NA NA
## 17 2,279,000 292,805,298 7.8 NA NA
## 18 2,245,000 290,107,933 7.7 NA NA
## 19 2,290,000 287,625,193 8.0 NA NA
## 20 2,326,000 284,968,955 8.2 NA NA
## 21 2,315,000 281,421,906 8.2 NA NA
## 22 NA NA
## 23 NA NA
## 24 NA NA
## 25 NA NA
## 26 NA NA
## 27 NA NA
## 28 NA NA
## 29 NA NA
## 30 NA NA
## 31 NA NA
## 32 NA NA
## 33 Divorces & annulments Population Rate per 1,000 total population NA NA
## 34 782,038 271,791,413 2.9 NA NA
## 35 787,251 270,423,493 2.9 NA NA
## 36 776,288 257,904,548 3.0 NA NA
## 37 800,909 258,518,265 3.1 NA NA
## 38 813,862 256,483,624 3.2 NA NA
## 39 832,157 254,408,815 3.3 NA NA
## 40 851,000 248,041,986 3.4 NA NA
## 41 877,000 246,273,366 3.6 NA NA
## 42 872,000 244,122,529 3.6 NA NA
## 43 840,000 242,610,561 3.5 NA NA
## 44 844,000 240,545,163 3.5 NA NA
## 45 856,000 238,352,850 3.6 NA NA
## 46 872,000 236,094,277 3.7 NA NA
## 47 847,000 233,495,163 3.6 NA NA
## 48 879,000 236,402,656 3.7 NA NA
## 49 927,000 243,902,090 3.8 NA NA
## 50 955,000 243,108,303 3.9 NA NA
## 51 940,000 236,416,762 4.0 NA NA
## 52 944,000 233,550,143 4.0 NA NA
## 53 NA NA
## 54 NA NA
## 55 NA NA
## 56 NA NA
## 57 NA NA
## 58 NA NA
## 59 NA NA
## 60 NA NA
## 61 NA NA
## 62 NA NA
## 63 NA NA
## 64 NA NA
## 65 NA NA
## X.5 X.6 X.7 X.8 X.9 X.10
## 1 NA NA NA NA NA NA
## 2 NA NA NA NA NA NA
## 3 NA NA NA NA NA NA
## 4 NA NA NA NA NA NA
## 5 NA NA NA NA NA NA
## 6 NA NA NA NA NA NA
## 7 NA NA NA NA NA NA
## 8 NA NA NA NA NA NA
## 9 NA NA NA NA NA NA
## 10 NA NA NA NA NA NA
## 11 NA NA NA NA NA NA
## 12 NA NA NA NA NA NA
## 13 NA NA NA NA NA NA
## 14 NA NA NA NA NA NA
## 15 NA NA NA NA NA NA
## 16 NA NA NA NA NA NA
## 17 NA NA NA NA NA NA
## 18 NA NA NA NA NA NA
## 19 NA NA NA NA NA NA
## 20 NA NA NA NA NA NA
## 21 NA NA NA NA NA NA
## 22 NA NA NA NA NA NA
## 23 NA NA NA NA NA NA
## 24 NA NA NA NA NA NA
## 25 NA NA NA NA NA NA
## 26 NA NA NA NA NA NA
## 27 NA NA NA NA NA NA
## 28 NA NA NA NA NA NA
## 29 NA NA NA NA NA NA
## 30 NA NA NA NA NA NA
## 31 NA NA NA NA NA NA
## 32 NA NA NA NA NA NA
## 33 NA NA NA NA NA NA
## 34 NA NA NA NA NA NA
## 35 NA NA NA NA NA NA
## 36 NA NA NA NA NA NA
## 37 NA NA NA NA NA NA
## 38 NA NA NA NA NA NA
## 39 NA NA NA NA NA NA
## 40 NA NA NA NA NA NA
## 41 NA NA NA NA NA NA
## 42 NA NA NA NA NA NA
## 43 NA NA NA NA NA NA
## 44 NA NA NA NA NA NA
## 45 NA NA NA NA NA NA
## 46 NA NA NA NA NA NA
## 47 NA NA NA NA NA NA
## 48 NA NA NA NA NA NA
## 49 NA NA NA NA NA NA
## 50 NA NA NA NA NA NA
## 51 NA NA NA NA NA NA
## 52 NA NA NA NA NA NA
## 53 NA NA NA NA NA NA
## 54 NA NA NA NA NA NA
## 55 NA NA NA NA NA NA
## 56 NA NA NA NA NA NA
## 57 NA NA NA NA NA NA
## 58 NA NA NA NA NA NA
## 59 NA NA NA NA NA NA
## 60 NA NA NA NA NA NA
## 61 NA NA NA NA NA NA
## 62 NA NA NA NA NA NA
## 63 NA NA NA NA NA NA
## 64 NA NA NA NA NA NA
## 65 NA NA NA NA NA NA
The dataset contains two tables; one that describes the number of marriages per year in the U.S population from 2000-2018, and another that describes the number of divorces and annulments during the same time period.
The dataset also contains many uneccesary columns, and variables that aren’t incorrectly formatted.
I split the larger dataset into two smaller datasets, one for marriages and one for divorces. Then, I renamed the column headers.
marriage <- MnD[3:21, 1:4]
names(marriage) <- c("Year", "Marriages", "Population", "Marriage_Rate")
head(marriage) %>%
kable("html") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Year | Marriages | Population | Marriage_Rate | |
---|---|---|---|---|
3 | 2018 | 2,132,853 | 327,167,434 | 6.5 |
4 | 2017 | 2,236,496 | 325,719,178 | 6.9 |
5 | 2016 | 2,251,411 | 323,127,513 | 7.0 |
6 | 2015 | 2,221,579 | 321,418,820 | 6.9 |
7 | 20141 | 2,140,272 | 308,759,713 | 6.9 |
8 | 20131 | 2,081,301 | 306,136,672 | 6.8 |
divorce <- MnD[34:52, 1:4]
names(divorce) <- c("Year", "Divorces", "Population", "Divorce_Rate")
head(divorce) %>%
kable("html") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Year | Divorces | Population | Divorce_Rate | |
---|---|---|---|---|
34 | 20181 | 782,038 | 271,791,413 | 2.9 |
35 | 20171 | 787,251 | 270,423,493 | 2.9 |
36 | 20162 | 776,288 | 257,904,548 | 3.0 |
37 | 20153 | 800,909 | 258,518,265 | 3.1 |
38 | 20143 | 813,862 | 256,483,624 | 3.2 |
39 | 20133 | 832,157 | 254,408,815 | 3.3 |
I attempted to remove an extra numerical digit from each row in the “Year” column.
marriage_sep <- marriage %>%
separate(Year, c("Year", "X"), sep = ("[\\/]"))
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 19 rows [1, 2, 3,
## 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19].
marriage <- marriage_sep[, -2]
marriage
## Year Marriages Population Marriage_Rate
## 3 2018 2,132,853 327,167,434 6.5
## 4 2017 2,236,496 325,719,178 6.9
## 5 2016 2,251,411 323,127,513 7.0
## 6 2015 2,221,579 321,418,820 6.9
## 7 20141 2,140,272 308,759,713 6.9
## 8 20131 2,081,301 306,136,672 6.8
## 9 2012 2,131,000 313,914,040 6.8
## 10 2011 2,118,000 311,591,917 6.8
## 11 2010 2,096,000 308,745,538 6.8
## 12 2009 2,080,000 306,771,529 6.8
## 13 2008 2,157,000 304,093,966 7.1
## 14 2007 2,197,000 301,231,207 7.3
## 15 20062 2,193,000 294,077,247 7.5
## 16 2005 2,249,000 295,516,599 7.6
## 17 2004 2,279,000 292,805,298 7.8
## 18 2003 2,245,000 290,107,933 7.7
## 19 2002 2,290,000 287,625,193 8.0
## 20 2001 2,326,000 284,968,955 8.2
## 21 2000 2,315,000 281,421,906 8.2
divorce_sep <- divorce %>%
separate(Year, c("Year", "X"), sep = "[\\/]")
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 19 rows [1, 2, 3,
## 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19].
divorce <- divorce_sep[, -2]
divorce
## Year Divorces Population Divorce_Rate
## 34 20181 782,038 271,791,413 2.9
## 35 20171 787,251 270,423,493 2.9
## 36 20162 776,288 257,904,548 3.0
## 37 20153 800,909 258,518,265 3.1
## 38 20143 813,862 256,483,624 3.2
## 39 20133 832,157 254,408,815 3.3
## 40 20124 851,000 248,041,986 3.4
## 41 20114 877,000 246,273,366 3.6
## 42 20104 872,000 244,122,529 3.6
## 43 20094 840,000 242,610,561 3.5
## 44 20084 844,000 240,545,163 3.5
## 45 20074 856,000 238,352,850 3.6
## 46 20064 872,000 236,094,277 3.7
## 47 20054 847,000 233,495,163 3.6
## 48 20045 879,000 236,402,656 3.7
## 49 20036 927,000 243,902,090 3.8
## 50 20027 955,000 243,108,303 3.9
## 51 20018 940,000 236,416,762 4.0
## 52 20008 944,000 233,550,143 4.0
Then, I removed the commas from the datasets and coerced each variable into a numeric.
Marriage
# Coerce "Year" into a numeric
marriage$Year <- as.numeric(
as.character(marriage$Year))
# Remove commas and coerce "Marriages" into a numeric
m_replace1 <- str_replace_all(marriage$Marriages, "[\\,]", "")
marriage$Marriages <- as.numeric(
as.character(m_replace1))
# Remove commas and coerce "Population" into a numeric
m_replace2 <- str_replace_all(marriage$Population, "[\\,]", "")
marriage$Population <- as.numeric(
as.character(m_replace2))
# Coerce "Rate_Per_1000" into a numeric
marriage$Marriage_Rate <- as.numeric(
as.character(marriage$Marriage_Rate))
Divorce
# Coerce "Year" into a numeric
divorce$Year <- as.numeric(
as.character(divorce$Year))
# Remove commas and coerce "Divorces" into a numeric
d_replace1 <- str_replace_all(divorce$Divorces, "[\\,]", "")
divorce$Divorces <- as.numeric(
as.character(d_replace1))
# Remove commas and coerce "Population" into a numeric
d_replace2 <- str_replace_all(divorce$Population, "[\\,]", "")
divorce$Population <- as.numeric(
as.character(d_replace2))
# Coerce "Rate_Per_1000" into a numeric
divorce$Divorce_Rate <- as.numeric(
as.character(divorce$Divorce_Rate))
I now have two datasets that contain clean variables in the correct format.
head(marriage) %>%
kable("html") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Year | Marriages | Population | Marriage_Rate | |
---|---|---|---|---|
3 | 2018 | 2132853 | 327167434 | 6.5 |
4 | 2017 | 2236496 | 325719178 | 6.9 |
5 | 2016 | 2251411 | 323127513 | 7.0 |
6 | 2015 | 2221579 | 321418820 | 6.9 |
7 | 20141 | 2140272 | 308759713 | 6.9 |
8 | 20131 | 2081301 | 306136672 | 6.8 |
head(divorce) %>%
kable("html") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Year | Divorces | Population | Divorce_Rate | |
---|---|---|---|---|
34 | 20181 | 782038 | 271791413 | 2.9 |
35 | 20171 | 787251 | 270423493 | 2.9 |
36 | 20162 | 776288 | 257904548 | 3.0 |
37 | 20153 | 800909 | 258518265 | 3.1 |
38 | 20143 | 813862 | 256483624 | 3.2 |
39 | 20133 | 832157 | 254408815 | 3.3 |
I was able to generate insights from a larger dataset by creating two smaller datasets. I found an insight based on the following hypothesis:
The decrease in the divorce rate may be attributedto the decrease in the marriage rate.
I may not be able to prove causality using this dataset due to other factors not included in the dataset. However, we can deduce whether the two rates move in the same direction, and predict what that might mean.
The data I wanted to visualize were the divorce rates and marriage rates over time. The two datasets have the “Year” column in common, so I performed a left join based on year.
Then, I created a separate dataframe called d1_viz with just the year and rates, and gathered the data into columns by “Rate_Type”, Marriage or Divorce – and “Rate”.
# Join the "Marriage" and "Divorce" datasets by Year
d1_joined <- left_join(marriage, divorce, by="Year")
# Create a new dataset with Year, Marriage Rate, and Divorce Rate
d1_viz <- data.frame(d1_joined$Year, d1_joined$Marriage_Rate, d1_joined$Divorce_Rate)
# Rename the columns of the new dataset
names(d1_viz) <- c("Year", "Marriage_Rate", "Divorce_Rate")
# Gather the dataset
d1_viz <- gather(d1_viz, "Rate_Type", "Rate", 2:3)
head(d1_viz) %>%
kable("html") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Year | Rate_Type | Rate |
---|---|---|
2018 | Marriage_Rate | 6.5 |
2017 | Marriage_Rate | 6.9 |
2016 | Marriage_Rate | 7.0 |
2015 | Marriage_Rate | 6.9 |
20141 | Marriage_Rate | 6.9 |
20131 | Marriage_Rate | 6.8 |
Visualize the data
I used ggplot2 to visualize the data in a smoothed line graph, which helped to uncover trends.
ggplot(d1_viz, aes(x = d1_viz$Year, y = d1_viz$Rate, group = d1_viz$Rate_Type, colour = d1_viz$Rate_Type)) +
geom_point() +
labs(title = "U.S. Marriage and Divorce Rates from 2000 - 2018", colour = "") +
xlab("Year") +
ylab("Rate (per 1000 people)") +
geom_smooth(method = "auto")
## Warning: Use of `d1_viz$Year` is discouraged. Use `Year` instead.
## Warning: Use of `d1_viz$Rate` is discouraged. Use `Rate` instead.
## Warning: Use of `d1_viz$Rate_Type` is discouraged. Use `Rate_Type` instead.
## Warning: Use of `d1_viz$Rate_Type` is discouraged. Use `Rate_Type` instead.
## Warning: Use of `d1_viz$Year` is discouraged. Use `Year` instead.
## Warning: Use of `d1_viz$Rate` is discouraged. Use `Rate` instead.
## Warning: Use of `d1_viz$Rate_Type` is discouraged. Use `Rate_Type` instead.
## Warning: Use of `d1_viz$Rate_Type` is discouraged. Use `Rate_Type` instead.
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Warning: Removed 19 rows containing non-finite values (stat_smooth).
## Warning: Removed 19 rows containing missing values (geom_point).
The chart shows that both marriage and divorce rates have been on a downward decline since the year 2000 in the United States. However, the marriage rate appears to increase slightly after 2010, while the divorce rate continues to decline. This complicates the hypothesis that a decline in marriage causes a decline in divorce.
HO: Is the decrease in divorce rate due to the decrease in marriage rate?
We don’t have enough information to answer this question. The question assumes that both marriage and divorce rate move negatively together and directly affect one other. However, the data shows that they do not always move together, and we don’t have enough information on other individual factors such as, income level, education, sex or location.