Data Reduction, Latent Information and Predictions

Background

Chaoter 12 of "Data Science For Business", discusses data reduction, latent information, and how 
they can be a useful tool. It can be useful to manipulate, or "tidy" larger datasets and replace 
them with a smaller data, while preserving information from the larger dataset. In many cases, a
smaller dataset will be easier to work with.  Additionally, the smaller dataset may provide more 
insights from the data.


In this project, I imported a CSV datasets, tidied it, and analyzed the data to answer potential 
questions about the data. 

The datsets came from the following link:  https://www.cdc.gov/nchs/nvss/marriage-divorce.htm?CDC_AA_refVal=https%3A%2F%2Fwww.cdc.gov%2Fnchs%2Fmardiv.htm

A larger dataset on marriage and divorce rates can be reduced to smaller datasets that we can use
to draw insights. Also, indications that are latent in the dataset.

CSV Dataset

Here is a link to the original datasets that I converted into CSV files.

U.S. Marriage and Divorce Rates

Load Libraries

I will use:

  • tidyr, dplyr, and stringr to reshape, replace, and tidy the data

  • knitr and kableExtra to create HTML tables

  • ggplot2 to visualize the data

library(tidyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(stringr)
library(knitr)
library(kableExtra)
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows
library(ggplot2)

US Marriage and Divorce Rates 00-18

Importing and examining the dataset in R.

    I imported the CSV from a folder on my github.
    
url <- 'https://raw.githubusercontent.com/Vthomps000/DATA607_VT/master/national-marriage-divorce-rates-00-18.csv'
MnD <- read.csv(url)
MnD
##                                                                                                                                                                                                          Provisional.number.of.marriages.and.marriage.rate..United.States..2000.2018
## 1                                                                                                                                                                                                                                                                                   
## 2                                                                                                                                                                                                                                                                               Year
## 3                                                                                                                                                                                                                                                                               2018
## 4                                                                                                                                                                                                                                                                               2017
## 5                                                                                                                                                                                                                                                                               2016
## 6                                                                                                                                                                                                                                                                               2015
## 7                                                                                                                                                                                                                                                                              20141
## 8                                                                                                                                                                                                                                                                              20131
## 9                                                                                                                                                                                                                                                                               2012
## 10                                                                                                                                                                                                                                                                              2011
## 11                                                                                                                                                                                                                                                                              2010
## 12                                                                                                                                                                                                                                                                              2009
## 13                                                                                                                                                                                                                                                                              2008
## 14                                                                                                                                                                                                                                                                              2007
## 15                                                                                                                                                                                                                                                                             20062
## 16                                                                                                                                                                                                                                                                              2005
## 17                                                                                                                                                                                                                                                                              2004
## 18                                                                                                                                                                                                                                                                              2003
## 19                                                                                                                                                                                                                                                                              2002
## 20                                                                                                                                                                                                                                                                              2001
## 21                                                                                                                                                                                                                                                                              2000
## 22                                                                                                                                                                                                                                                      1 Excludes data for Georgia.
## 23                                                                                                                                                                                                                                                    2 Excludes data for Louisiana.
## 24                                                                                                                                                                                                                                                                                  
## 25           Note: Number and rate for 2016 has been revised due to revised figures for Illinois.  Rates for 2001-2009 have been revised and are based on intercensal population estimates from the 2000 and 2010 censuses. Populations for 2010 rates are based on the 2010 census.
## 26                                                                                                                                                                                                                                                                                  
## 27                                                                                                                                                                                                                                                                                  
## 28                                                                                                                                                                                                                                Source: CDC/NCHS National Vital Statistics System.
## 29                                                                                                                                                                                                                                                                                  
## 30                                                                                                                                                                                                                                                                                  
## 31                                                                                                                                                                                                  Provisional number of divorces and annulments and rate: United States, 2000-2018
## 32                                                                                                                                                                                                                                                                                  
## 33                                                                                                                                                                                                                                                                              Year
## 34                                                                                                                                                                                                                                                                             20181
## 35                                                                                                                                                                                                                                                                             20171
## 36                                                                                                                                                                                                                                                                             20162
## 37                                                                                                                                                                                                                                                                             20153
## 38                                                                                                                                                                                                                                                                             20143
## 39                                                                                                                                                                                                                                                                             20133
## 40                                                                                                                                                                                                                                                                             20124
## 41                                                                                                                                                                                                                                                                             20114
## 42                                                                                                                                                                                                                                                                             20104
## 43                                                                                                                                                                                                                                                                             20094
## 44                                                                                                                                                                                                                                                                             20084
## 45                                                                                                                                                                                                                                                                             20074
## 46                                                                                                                                                                                                                                                                             20064
## 47                                                                                                                                                                                                                                                                             20054
## 48                                                                                                                                                                                                                                                                             20045
## 49                                                                                                                                                                                                                                                                             20036
## 50                                                                                                                                                                                                                                                                             20027
## 51                                                                                                                                                                                                                                                                             20018
## 52                                                                                                                                                                                                                                                                             20008
## 53                                                                                                                                                                                                       1 Excludes data for California, Hawaii, Indiana, Minnesota, and New Mexico.
## 54                                                                                                                                                                                              2 Excludes data for California, Georgia, Hawaii, Indiana, Minnesota, and New Mexico.
## 55                                                                                                                                                                                                          3 Excludes data for California, Georgia, Hawaii, Indiana, and Minnesota.
## 56                                                                                                                                                                                               4 Excludes data for California, Georgia, Hawaii, Indiana, Louisiana, and Minnesota.
## 57                                                                                                                                                                                                          5 Excludes data for California, Georgia, Hawaii, Indiana, and Louisiana.
## 58                                                                                                                                                                                                                    6 Excludes data for California, Hawaii, Indiana, and Oklahoma.
## 59                                                                                                                                                                                                                            7 Excludes data for California, Indiana, and Oklahoma.
## 60                                                                                                                                                                                                                 8 Excludes data for California, Indiana, Louisiana, and Oklahoma.
## 61                                                                                                                                                                                                                                                                                  
## 62 Note: Number and rate for 2016 has been revised due to revised figures for Illinois and Texas.  Rates for 2001-2009 have been revised and are based on intercensal population estimates from the 2000 and 2010 censuses. Populations for 2010 rates are based on the 2010 census.
## 63                                                                                                                                                                                                                                                                                  
## 64                                                                                                                                                                                                                                                                                  
## 65                                                                                                                                                                                                                                Source: CDC/NCHS National Vital Statistics System.
##                        X         X.1                             X.2 X.3 X.4
## 1                                                                     NA  NA
## 2              Marriages  Population Rate per 1,000 total population  NA  NA
## 3              2,132,853 327,167,434                             6.5  NA  NA
## 4              2,236,496 325,719,178                             6.9  NA  NA
## 5              2,251,411 323,127,513                             7.0  NA  NA
## 6              2,221,579 321,418,820                             6.9  NA  NA
## 7              2,140,272 308,759,713                             6.9  NA  NA
## 8              2,081,301 306,136,672                             6.8  NA  NA
## 9              2,131,000 313,914,040                             6.8  NA  NA
## 10             2,118,000 311,591,917                             6.8  NA  NA
## 11             2,096,000 308,745,538                             6.8  NA  NA
## 12             2,080,000 306,771,529                             6.8  NA  NA
## 13             2,157,000 304,093,966                             7.1  NA  NA
## 14             2,197,000 301,231,207                             7.3  NA  NA
## 15             2,193,000 294,077,247                             7.5  NA  NA
## 16             2,249,000 295,516,599                             7.6  NA  NA
## 17             2,279,000 292,805,298                             7.8  NA  NA
## 18             2,245,000 290,107,933                             7.7  NA  NA
## 19             2,290,000 287,625,193                             8.0  NA  NA
## 20             2,326,000 284,968,955                             8.2  NA  NA
## 21             2,315,000 281,421,906                             8.2  NA  NA
## 22                                                                    NA  NA
## 23                                                                    NA  NA
## 24                                                                    NA  NA
## 25                                                                    NA  NA
## 26                                                                    NA  NA
## 27                                                                    NA  NA
## 28                                                                    NA  NA
## 29                                                                    NA  NA
## 30                                                                    NA  NA
## 31                                                                    NA  NA
## 32                                                                    NA  NA
## 33 Divorces & annulments  Population Rate per 1,000 total population  NA  NA
## 34               782,038 271,791,413                             2.9  NA  NA
## 35               787,251 270,423,493                             2.9  NA  NA
## 36               776,288 257,904,548                             3.0  NA  NA
## 37               800,909 258,518,265                             3.1  NA  NA
## 38               813,862 256,483,624                             3.2  NA  NA
## 39               832,157 254,408,815                             3.3  NA  NA
## 40               851,000 248,041,986                             3.4  NA  NA
## 41               877,000 246,273,366                             3.6  NA  NA
## 42               872,000 244,122,529                             3.6  NA  NA
## 43               840,000 242,610,561                             3.5  NA  NA
## 44               844,000 240,545,163                             3.5  NA  NA
## 45               856,000 238,352,850                             3.6  NA  NA
## 46               872,000 236,094,277                             3.7  NA  NA
## 47               847,000 233,495,163                             3.6  NA  NA
## 48               879,000 236,402,656                             3.7  NA  NA
## 49               927,000 243,902,090                             3.8  NA  NA
## 50               955,000 243,108,303                             3.9  NA  NA
## 51               940,000 236,416,762                             4.0  NA  NA
## 52               944,000 233,550,143                             4.0  NA  NA
## 53                                                                    NA  NA
## 54                                                                    NA  NA
## 55                                                                    NA  NA
## 56                                                                    NA  NA
## 57                                                                    NA  NA
## 58                                                                    NA  NA
## 59                                                                    NA  NA
## 60                                                                    NA  NA
## 61                                                                    NA  NA
## 62                                                                    NA  NA
## 63                                                                    NA  NA
## 64                                                                    NA  NA
## 65                                                                    NA  NA
##    X.5 X.6 X.7 X.8 X.9 X.10
## 1   NA  NA  NA  NA  NA   NA
## 2   NA  NA  NA  NA  NA   NA
## 3   NA  NA  NA  NA  NA   NA
## 4   NA  NA  NA  NA  NA   NA
## 5   NA  NA  NA  NA  NA   NA
## 6   NA  NA  NA  NA  NA   NA
## 7   NA  NA  NA  NA  NA   NA
## 8   NA  NA  NA  NA  NA   NA
## 9   NA  NA  NA  NA  NA   NA
## 10  NA  NA  NA  NA  NA   NA
## 11  NA  NA  NA  NA  NA   NA
## 12  NA  NA  NA  NA  NA   NA
## 13  NA  NA  NA  NA  NA   NA
## 14  NA  NA  NA  NA  NA   NA
## 15  NA  NA  NA  NA  NA   NA
## 16  NA  NA  NA  NA  NA   NA
## 17  NA  NA  NA  NA  NA   NA
## 18  NA  NA  NA  NA  NA   NA
## 19  NA  NA  NA  NA  NA   NA
## 20  NA  NA  NA  NA  NA   NA
## 21  NA  NA  NA  NA  NA   NA
## 22  NA  NA  NA  NA  NA   NA
## 23  NA  NA  NA  NA  NA   NA
## 24  NA  NA  NA  NA  NA   NA
## 25  NA  NA  NA  NA  NA   NA
## 26  NA  NA  NA  NA  NA   NA
## 27  NA  NA  NA  NA  NA   NA
## 28  NA  NA  NA  NA  NA   NA
## 29  NA  NA  NA  NA  NA   NA
## 30  NA  NA  NA  NA  NA   NA
## 31  NA  NA  NA  NA  NA   NA
## 32  NA  NA  NA  NA  NA   NA
## 33  NA  NA  NA  NA  NA   NA
## 34  NA  NA  NA  NA  NA   NA
## 35  NA  NA  NA  NA  NA   NA
## 36  NA  NA  NA  NA  NA   NA
## 37  NA  NA  NA  NA  NA   NA
## 38  NA  NA  NA  NA  NA   NA
## 39  NA  NA  NA  NA  NA   NA
## 40  NA  NA  NA  NA  NA   NA
## 41  NA  NA  NA  NA  NA   NA
## 42  NA  NA  NA  NA  NA   NA
## 43  NA  NA  NA  NA  NA   NA
## 44  NA  NA  NA  NA  NA   NA
## 45  NA  NA  NA  NA  NA   NA
## 46  NA  NA  NA  NA  NA   NA
## 47  NA  NA  NA  NA  NA   NA
## 48  NA  NA  NA  NA  NA   NA
## 49  NA  NA  NA  NA  NA   NA
## 50  NA  NA  NA  NA  NA   NA
## 51  NA  NA  NA  NA  NA   NA
## 52  NA  NA  NA  NA  NA   NA
## 53  NA  NA  NA  NA  NA   NA
## 54  NA  NA  NA  NA  NA   NA
## 55  NA  NA  NA  NA  NA   NA
## 56  NA  NA  NA  NA  NA   NA
## 57  NA  NA  NA  NA  NA   NA
## 58  NA  NA  NA  NA  NA   NA
## 59  NA  NA  NA  NA  NA   NA
## 60  NA  NA  NA  NA  NA   NA
## 61  NA  NA  NA  NA  NA   NA
## 62  NA  NA  NA  NA  NA   NA
## 63  NA  NA  NA  NA  NA   NA
## 64  NA  NA  NA  NA  NA   NA
## 65  NA  NA  NA  NA  NA   NA

The dataset contains two tables; one that describes the number of marriages per year in the U.S population from 2000-2018, and another that describes the number of divorces and annulments during the same time period.

The dataset also contains many uneccesary columns, and variables that aren’t incorrectly formatted.

Preparing the dataset.

I split the larger dataset into two smaller datasets, one for marriages and one for divorces. Then, I renamed the column headers.

marriage <- MnD[3:21, 1:4]
names(marriage) <- c("Year", "Marriages", "Population", "Marriage_Rate")
head(marriage) %>% 
  kable("html") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Year Marriages Population Marriage_Rate
3 2018 2,132,853 327,167,434 6.5
4 2017 2,236,496 325,719,178 6.9
5 2016 2,251,411 323,127,513 7.0
6 2015 2,221,579 321,418,820 6.9
7 20141 2,140,272 308,759,713 6.9
8 20131 2,081,301 306,136,672 6.8
divorce <- MnD[34:52, 1:4]
names(divorce) <- c("Year", "Divorces", "Population", "Divorce_Rate")
head(divorce) %>% 
  kable("html") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Year Divorces Population Divorce_Rate
34 20181 782,038 271,791,413 2.9
35 20171 787,251 270,423,493 2.9
36 20162 776,288 257,904,548 3.0
37 20153 800,909 258,518,265 3.1
38 20143 813,862 256,483,624 3.2
39 20133 832,157 254,408,815 3.3

Tidying up the data.

I attempted to remove an extra numerical digit from each row in the “Year” column.

marriage_sep <- marriage %>%
  separate(Year, c("Year", "X"), sep = ("[\\/]"))
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 19 rows [1, 2, 3,
## 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19].
marriage <- marriage_sep[, -2]
marriage
##     Year Marriages  Population Marriage_Rate
## 3   2018 2,132,853 327,167,434           6.5
## 4   2017 2,236,496 325,719,178           6.9
## 5   2016 2,251,411 323,127,513           7.0
## 6   2015 2,221,579 321,418,820           6.9
## 7  20141 2,140,272 308,759,713           6.9
## 8  20131 2,081,301 306,136,672           6.8
## 9   2012 2,131,000 313,914,040           6.8
## 10  2011 2,118,000 311,591,917           6.8
## 11  2010 2,096,000 308,745,538           6.8
## 12  2009 2,080,000 306,771,529           6.8
## 13  2008 2,157,000 304,093,966           7.1
## 14  2007 2,197,000 301,231,207           7.3
## 15 20062 2,193,000 294,077,247           7.5
## 16  2005 2,249,000 295,516,599           7.6
## 17  2004 2,279,000 292,805,298           7.8
## 18  2003 2,245,000 290,107,933           7.7
## 19  2002 2,290,000 287,625,193           8.0
## 20  2001 2,326,000 284,968,955           8.2
## 21  2000 2,315,000 281,421,906           8.2
divorce_sep <- divorce %>%
  separate(Year, c("Year", "X"), sep = "[\\/]")
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 19 rows [1, 2, 3,
## 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19].
divorce <- divorce_sep[, -2]
divorce
##     Year Divorces  Population Divorce_Rate
## 34 20181  782,038 271,791,413          2.9
## 35 20171  787,251 270,423,493          2.9
## 36 20162  776,288 257,904,548          3.0
## 37 20153  800,909 258,518,265          3.1
## 38 20143  813,862 256,483,624          3.2
## 39 20133  832,157 254,408,815          3.3
## 40 20124  851,000 248,041,986          3.4
## 41 20114  877,000 246,273,366          3.6
## 42 20104  872,000 244,122,529          3.6
## 43 20094  840,000 242,610,561          3.5
## 44 20084  844,000 240,545,163          3.5
## 45 20074  856,000 238,352,850          3.6
## 46 20064  872,000 236,094,277          3.7
## 47 20054  847,000 233,495,163          3.6
## 48 20045  879,000 236,402,656          3.7
## 49 20036  927,000 243,902,090          3.8
## 50 20027  955,000 243,108,303          3.9
## 51 20018  940,000 236,416,762          4.0
## 52 20008  944,000 233,550,143          4.0

Coerce into numeric

Then, I removed the commas from the datasets and coerced each variable into a numeric.

Marriage

# Coerce "Year" into a numeric
marriage$Year <- as.numeric(
                 as.character(marriage$Year))
# Remove commas and coerce "Marriages" into a numeric
m_replace1 <- str_replace_all(marriage$Marriages, "[\\,]", "")
marriage$Marriages <- as.numeric(
                      as.character(m_replace1))
# Remove commas and coerce "Population" into a numeric
m_replace2 <- str_replace_all(marriage$Population, "[\\,]", "")
marriage$Population <- as.numeric(
                       as.character(m_replace2))
# Coerce "Rate_Per_1000" into a numeric
marriage$Marriage_Rate <- as.numeric(
                          as.character(marriage$Marriage_Rate))

Divorce

# Coerce "Year" into a numeric
divorce$Year <- as.numeric(
                as.character(divorce$Year))
# Remove commas and coerce "Divorces" into a numeric
d_replace1 <- str_replace_all(divorce$Divorces, "[\\,]", "")
divorce$Divorces <- as.numeric(
                    as.character(d_replace1))
# Remove commas and coerce "Population" into a numeric
d_replace2 <- str_replace_all(divorce$Population, "[\\,]", "")
divorce$Population <- as.numeric(
                      as.character(d_replace2))
# Coerce "Rate_Per_1000" into a numeric
divorce$Divorce_Rate <- as.numeric(
                         as.character(divorce$Divorce_Rate))

Cleaned data

I now have two datasets that contain clean variables in the correct format.

head(marriage) %>% 
  kable("html") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Year Marriages Population Marriage_Rate
3 2018 2132853 327167434 6.5
4 2017 2236496 325719178 6.9
5 2016 2251411 323127513 7.0
6 2015 2221579 321418820 6.9
7 20141 2140272 308759713 6.9
8 20131 2081301 306136672 6.8
head(divorce) %>% 
  kable("html") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Year Divorces Population Divorce_Rate
34 20181 782038 271791413 2.9
35 20171 787251 270423493 2.9
36 20162 776288 257904548 3.0
37 20153 800909 258518265 3.1
38 20143 813862 256483624 3.2
39 20133 832157 254408815 3.3

Analysis and Visulization

I was able to generate insights from a larger dataset by creating two smaller datasets. I found an insight based on the following hypothesis:

The decrease in the divorce rate may be attributedto the decrease in the marriage rate.

I may not be able to prove causality using this dataset due to other factors not included in the dataset. However, we can deduce whether the two rates move in the same direction, and predict what that might mean.

Reshape the data

The data I wanted to visualize were the divorce rates and marriage rates over time. The two datasets have the “Year” column in common, so I performed a left join based on year.

Then, I created a separate dataframe called d1_viz with just the year and rates, and gathered the data into columns by “Rate_Type”, Marriage or Divorce – and “Rate”.

# Join the "Marriage" and "Divorce" datasets by Year
d1_joined <- left_join(marriage, divorce, by="Year")
# Create a new dataset with Year, Marriage Rate, and Divorce Rate
d1_viz <- data.frame(d1_joined$Year, d1_joined$Marriage_Rate, d1_joined$Divorce_Rate)
# Rename the columns of the new dataset
names(d1_viz) <- c("Year", "Marriage_Rate", "Divorce_Rate")
# Gather the dataset
d1_viz <- gather(d1_viz, "Rate_Type", "Rate", 2:3)
head(d1_viz) %>% 
  kable("html") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Year Rate_Type Rate
2018 Marriage_Rate 6.5
2017 Marriage_Rate 6.9
2016 Marriage_Rate 7.0
2015 Marriage_Rate 6.9
20141 Marriage_Rate 6.9
20131 Marriage_Rate 6.8

Visualize the data

I used ggplot2 to visualize the data in a smoothed line graph, which helped to uncover trends.

ggplot(d1_viz, aes(x = d1_viz$Year, y = d1_viz$Rate, group = d1_viz$Rate_Type, colour = d1_viz$Rate_Type)) +
  geom_point() +
  labs(title = "U.S. Marriage and Divorce Rates from 2000 - 2018", colour = "") +
  xlab("Year") +
  ylab("Rate (per 1000 people)") +
  geom_smooth(method = "auto")
## Warning: Use of `d1_viz$Year` is discouraged. Use `Year` instead.
## Warning: Use of `d1_viz$Rate` is discouraged. Use `Rate` instead.
## Warning: Use of `d1_viz$Rate_Type` is discouraged. Use `Rate_Type` instead.

## Warning: Use of `d1_viz$Rate_Type` is discouraged. Use `Rate_Type` instead.
## Warning: Use of `d1_viz$Year` is discouraged. Use `Year` instead.
## Warning: Use of `d1_viz$Rate` is discouraged. Use `Rate` instead.
## Warning: Use of `d1_viz$Rate_Type` is discouraged. Use `Rate_Type` instead.

## Warning: Use of `d1_viz$Rate_Type` is discouraged. Use `Rate_Type` instead.
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Warning: Removed 19 rows containing non-finite values (stat_smooth).
## Warning: Removed 19 rows containing missing values (geom_point).

Analysis

The chart shows that both marriage and divorce rates have been on a downward decline since the year 2000 in the United States. However, the marriage rate appears to increase slightly after 2010, while the divorce rate continues to decline. This complicates the hypothesis that a decline in marriage causes a decline in divorce.

HO: Is the decrease in divorce rate due to the decrease in marriage rate?

We don’t have enough information to answer this question. The question assumes that both marriage and divorce rate move negatively together and directly affect one other. However, the data shows that they do not always move together, and we don’t have enough information on other individual factors such as, income level, education, sex or location.