Data Reduction, Latent Information and Predictions

Background

Chapter 12 of "Data Science for Business" discusses data reduction and latent information, and how 
they can be useful tools. It can be useful to manipulate, or "tidy", a larger dataset and replace 
it with a smaller one while preserving information from the larger dataset. In many cases, a
smaller dataset is easier to work with. Additionally, the smaller dataset may make insights 
in the data easier to find.


In this project, I imported a CSV dataset, tidied it, and analyzed the data to answer potential 
questions about it. 

The datasets came from the following link:  https://www.cdc.gov/nchs/nvss/marriage-divorce.htm?CDC_AA_refVal=https%3A%2F%2Fwww.cdc.gov%2Fnchs%2Fmardiv.htm

A larger dataset on marriage and divorce rates can be reduced to smaller datasets that we can use
to draw insights and to surface indications that are latent in the data.

CSV Dataset

Here is a link to the original datasets that I converted into CSV files.

U.S. Marriage and Divorce Rates

Load Libraries

I will use:

tidyr, dplyr, and stringr to reshape, replace, and tidy the data
knitr and kableExtra to create HTML tables
ggplot2 to visualize the data

library(tidyr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(stringr)
library(knitr)
library(kableExtra)

## 
## Attaching package: 'kableExtra'

## The following object is masked from 'package:dplyr':
## 
##     group_rows

library(ggplot2)

US Marriage and Divorce Rates 00-18

Importing and examining the dataset in R.

I imported the CSV from a folder on my GitHub.

url <- 'https://raw.githubusercontent.com/Vthomps000/DATA607_VT/master/national-marriage-divorce-rates-00-18.csv'
MnD <- read.csv(url)
MnD

##                                                                                                                                                                                                          Provisional.number.of.marriages.and.marriage.rate..United.States..2000.2018
## 1                                                                                                                                                                                                                                                                                   
## 2                                                                                                                                                                                                                                                                               Year
## 3                                                                                                                                                                                                                                                                               2018
## 4                                                                                                                                                                                                                                                                               2017
## 5                                                                                                                                                                                                                                                                               2016
## 6                                                                                                                                                                                                                                                                               2015
## 7                                                                                                                                                                                                                                                                              20141
## 8                                                                                                                                                                                                                                                                              20131
## 9                                                                                                                                                                                                                                                                               2012
## 10                                                                                                                                                                                                                                                                              2011
## 11                                                                                                                                                                                                                                                                              2010
## 12                                                                                                                                                                                                                                                                              2009
## 13                                                                                                                                                                                                                                                                              2008
## 14                                                                                                                                                                                                                                                                              2007
## 15                                                                                                                                                                                                                                                                             20062
## 16                                                                                                                                                                                                                                                                              2005
## 17                                                                                                                                                                                                                                                                              2004
## 18                                                                                                                                                                                                                                                                              2003
## 19                                                                                                                                                                                                                                                                              2002
## 20                                                                                                                                                                                                                                                                              2001
## 21                                                                                                                                                                                                                                                                              2000
## 22                                                                                                                                                                                                                                                      1 Excludes data for Georgia.
## 23                                                                                                                                                                                                                                                    2 Excludes data for Louisiana.
## 24                                                                                                                                                                                                                                                                                  
## 25           Note: Number and rate for 2016 has been revised due to revised figures for Illinois.  Rates for 2001-2009 have been revised and are based on intercensal population estimates from the 2000 and 2010 censuses. Populations for 2010 rates are based on the 2010 census.
## 26                                                                                                                                                                                                                                                                                  
## 27                                                                                                                                                                                                                                                                                  
## 28                                                                                                                                                                                                                                Source: CDC/NCHS National Vital Statistics System.
## 29                                                                                                                                                                                                                                                                                  
## 30                                                                                                                                                                                                                                                                                  
## 31                                                                                                                                                                                                  Provisional number of divorces and annulments and rate: United States, 2000-2018
## 32                                                                                                                                                                                                                                                                                  
## 33                                                                                                                                                                                                                                                                              Year
## 34                                                                                                                                                                                                                                                                             20181
## 35                                                                                                                                                                                                                                                                             20171
## 36                                                                                                                                                                                                                                                                             20162
## 37                                                                                                                                                                                                                                                                             20153
## 38                                                                                                                                                                                                                                                                             20143
## 39                                                                                                                                                                                                                                                                             20133
## 40                                                                                                                                                                                                                                                                             20124
## 41                                                                                                                                                                                                                                                                             20114
## 42                                                                                                                                                                                                                                                                             20104
## 43                                                                                                                                                                                                                                                                             20094
## 44                                                                                                                                                                                                                                                                             20084
## 45                                                                                                                                                                                                                                                                             20074
## 46                                                                                                                                                                                                                                                                             20064
## 47                                                                                                                                                                                                                                                                             20054
## 48                                                                                                                                                                                                                                                                             20045
## 49                                                                                                                                                                                                                                                                             20036
## 50                                                                                                                                                                                                                                                                             20027
## 51                                                                                                                                                                                                                                                                             20018
## 52                                                                                                                                                                                                                                                                             20008
## 53                                                                                                                                                                                                       1 Excludes data for California, Hawaii, Indiana, Minnesota, and New Mexico.
## 54                                                                                                                                                                                              2 Excludes data for California, Georgia, Hawaii, Indiana, Minnesota, and New Mexico.
## 55                                                                                                                                                                                                          3 Excludes data for California, Georgia, Hawaii, Indiana, and Minnesota.
## 56                                                                                                                                                                                               4 Excludes data for California, Georgia, Hawaii, Indiana, Louisiana, and Minnesota.
## 57                                                                                                                                                                                                          5 Excludes data for California, Georgia, Hawaii, Indiana, and Louisiana.
## 58                                                                                                                                                                                                                    6 Excludes data for California, Hawaii, Indiana, and Oklahoma.
## 59                                                                                                                                                                                                                            7 Excludes data for California, Indiana, and Oklahoma.
## 60                                                                                                                                                                                                                 8 Excludes data for California, Indiana, Louisiana, and Oklahoma.
## 61                                                                                                                                                                                                                                                                                  
## 62 Note: Number and rate for 2016 has been revised due to revised figures for Illinois and Texas.  Rates for 2001-2009 have been revised and are based on intercensal population estimates from the 2000 and 2010 censuses. Populations for 2010 rates are based on the 2010 census.
## 63                                                                                                                                                                                                                                                                                  
## 64                                                                                                                                                                                                                                                                                  
## 65                                                                                                                                                                                                                                Source: CDC/NCHS National Vital Statistics System.
##                        X         X.1                             X.2 X.3 X.4
## 1                                                                     NA  NA
## 2              Marriages  Population Rate per 1,000 total population  NA  NA
## 3              2,132,853 327,167,434                             6.5  NA  NA
## 4              2,236,496 325,719,178                             6.9  NA  NA
## 5              2,251,411 323,127,513                             7.0  NA  NA
## 6              2,221,579 321,418,820                             6.9  NA  NA
## 7              2,140,272 308,759,713                             6.9  NA  NA
## 8              2,081,301 306,136,672                             6.8  NA  NA
## 9              2,131,000 313,914,040                             6.8  NA  NA
## 10             2,118,000 311,591,917                             6.8  NA  NA
## 11             2,096,000 308,745,538                             6.8  NA  NA
## 12             2,080,000 306,771,529                             6.8  NA  NA
## 13             2,157,000 304,093,966                             7.1  NA  NA
## 14             2,197,000 301,231,207                             7.3  NA  NA
## 15             2,193,000 294,077,247                             7.5  NA  NA
## 16             2,249,000 295,516,599                             7.6  NA  NA
## 17             2,279,000 292,805,298                             7.8  NA  NA
## 18             2,245,000 290,107,933                             7.7  NA  NA
## 19             2,290,000 287,625,193                             8.0  NA  NA
## 20             2,326,000 284,968,955                             8.2  NA  NA
## 21             2,315,000 281,421,906                             8.2  NA  NA
## 22                                                                    NA  NA
## 23                                                                    NA  NA
## 24                                                                    NA  NA
## 25                                                                    NA  NA
## 26                                                                    NA  NA
## 27                                                                    NA  NA
## 28                                                                    NA  NA
## 29                                                                    NA  NA
## 30                                                                    NA  NA
## 31                                                                    NA  NA
## 32                                                                    NA  NA
## 33 Divorces & annulments  Population Rate per 1,000 total population  NA  NA
## 34               782,038 271,791,413                             2.9  NA  NA
## 35               787,251 270,423,493                             2.9  NA  NA
## 36               776,288 257,904,548                             3.0  NA  NA
## 37               800,909 258,518,265                             3.1  NA  NA
## 38               813,862 256,483,624                             3.2  NA  NA
## 39               832,157 254,408,815                             3.3  NA  NA
## 40               851,000 248,041,986                             3.4  NA  NA
## 41               877,000 246,273,366                             3.6  NA  NA
## 42               872,000 244,122,529                             3.6  NA  NA
## 43               840,000 242,610,561                             3.5  NA  NA
## 44               844,000 240,545,163                             3.5  NA  NA
## 45               856,000 238,352,850                             3.6  NA  NA
## 46               872,000 236,094,277                             3.7  NA  NA
## 47               847,000 233,495,163                             3.6  NA  NA
## 48               879,000 236,402,656                             3.7  NA  NA
## 49               927,000 243,902,090                             3.8  NA  NA
## 50               955,000 243,108,303                             3.9  NA  NA
## 51               940,000 236,416,762                             4.0  NA  NA
## 52               944,000 233,550,143                             4.0  NA  NA
## 53                                                                    NA  NA
## 54                                                                    NA  NA
## 55                                                                    NA  NA
## 56                                                                    NA  NA
## 57                                                                    NA  NA
## 58                                                                    NA  NA
## 59                                                                    NA  NA
## 60                                                                    NA  NA
## 61                                                                    NA  NA
## 62                                                                    NA  NA
## 63                                                                    NA  NA
## 64                                                                    NA  NA
## 65                                                                    NA  NA
##    X.5 X.6 X.7 X.8 X.9 X.10
## 1   NA  NA  NA  NA  NA   NA
## 2   NA  NA  NA  NA  NA   NA
## 3   NA  NA  NA  NA  NA   NA
## 4   NA  NA  NA  NA  NA   NA
## 5   NA  NA  NA  NA  NA   NA
## 6   NA  NA  NA  NA  NA   NA
## 7   NA  NA  NA  NA  NA   NA
## 8   NA  NA  NA  NA  NA   NA
## 9   NA  NA  NA  NA  NA   NA
## 10  NA  NA  NA  NA  NA   NA
## 11  NA  NA  NA  NA  NA   NA
## 12  NA  NA  NA  NA  NA   NA
## 13  NA  NA  NA  NA  NA   NA
## 14  NA  NA  NA  NA  NA   NA
## 15  NA  NA  NA  NA  NA   NA
## 16  NA  NA  NA  NA  NA   NA
## 17  NA  NA  NA  NA  NA   NA
## 18  NA  NA  NA  NA  NA   NA
## 19  NA  NA  NA  NA  NA   NA
## 20  NA  NA  NA  NA  NA   NA
## 21  NA  NA  NA  NA  NA   NA
## 22  NA  NA  NA  NA  NA   NA
## 23  NA  NA  NA  NA  NA   NA
## 24  NA  NA  NA  NA  NA   NA
## 25  NA  NA  NA  NA  NA   NA
## 26  NA  NA  NA  NA  NA   NA
## 27  NA  NA  NA  NA  NA   NA
## 28  NA  NA  NA  NA  NA   NA
## 29  NA  NA  NA  NA  NA   NA
## 30  NA  NA  NA  NA  NA   NA
## 31  NA  NA  NA  NA  NA   NA
## 32  NA  NA  NA  NA  NA   NA
## 33  NA  NA  NA  NA  NA   NA
## 34  NA  NA  NA  NA  NA   NA
## 35  NA  NA  NA  NA  NA   NA
## 36  NA  NA  NA  NA  NA   NA
## 37  NA  NA  NA  NA  NA   NA
## 38  NA  NA  NA  NA  NA   NA
## 39  NA  NA  NA  NA  NA   NA
## 40  NA  NA  NA  NA  NA   NA
## 41  NA  NA  NA  NA  NA   NA
## 42  NA  NA  NA  NA  NA   NA
## 43  NA  NA  NA  NA  NA   NA
## 44  NA  NA  NA  NA  NA   NA
## 45  NA  NA  NA  NA  NA   NA
## 46  NA  NA  NA  NA  NA   NA
## 47  NA  NA  NA  NA  NA   NA
## 48  NA  NA  NA  NA  NA   NA
## 49  NA  NA  NA  NA  NA   NA
## 50  NA  NA  NA  NA  NA   NA
## 51  NA  NA  NA  NA  NA   NA
## 52  NA  NA  NA  NA  NA   NA
## 53  NA  NA  NA  NA  NA   NA
## 54  NA  NA  NA  NA  NA   NA
## 55  NA  NA  NA  NA  NA   NA
## 56  NA  NA  NA  NA  NA   NA
## 57  NA  NA  NA  NA  NA   NA
## 58  NA  NA  NA  NA  NA   NA
## 59  NA  NA  NA  NA  NA   NA
## 60  NA  NA  NA  NA  NA   NA
## 61  NA  NA  NA  NA  NA   NA
## 62  NA  NA  NA  NA  NA   NA
## 63  NA  NA  NA  NA  NA   NA
## 64  NA  NA  NA  NA  NA   NA
## 65  NA  NA  NA  NA  NA   NA

The dataset contains two tables: one that describes the number of marriages per year in the U.S. population from 2000 to 2018, and another that describes the number of divorces and annulments during the same time period.

The dataset also contains many unnecessary columns, and variables that are incorrectly formatted.
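The unnecessary columns in the printout (X.3 through X.10) contain nothing but NA. A minimal sketch of one way to drop such columns without listing them by name is shown here on a small hand-built data frame; the name `raw` and its contents are illustrative stand-ins, not the imported `MnD` object.

```r
# Toy stand-in for the imported data: two real columns and two all-NA columns.
raw <- data.frame(
  Year      = c("2018", "2017"),
  Marriages = c("2,132,853", "2,236,496"),
  X.3       = c(NA, NA),
  X.4       = c(NA, NA),
  stringsAsFactors = FALSE
)

# Keep only the columns that hold at least one non-missing value.
keep    <- colSums(!is.na(raw)) > 0
trimmed <- raw[, keep, drop = FALSE]

names(trimmed)
```

This only removes columns that are entirely NA; columns of empty strings, such as the blank cells between the two tables, would need a separate check.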

Preparing the dataset.

I split the larger dataset into two smaller datasets, one for marriages and one for divorces. Then, I renamed the column headers.

marriage <- MnD[3:21, 1:4]
names(marriage) <- c("Year", "Marriages", "Population", "Marriage_Rate")
head(marriage) %>% 
  kable("html") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

	Year	Marriages	Population	Marriage_Rate
3	2018	2,132,853	327,167,434	6.5
4	2017	2,236,496	325,719,178	6.9
5	2016	2,251,411	323,127,513	7.0
6	2015	2,221,579	321,418,820	6.9
7	20141	2,140,272	308,759,713	6.9
8	20131	2,081,301	306,136,672	6.8

divorce <- MnD[34:52, 1:4]
names(divorce) <- c("Year", "Divorces", "Population", "Divorce_Rate")
head(divorce) %>% 
  kable("html") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

	Year	Divorces	Population	Divorce_Rate
34	20181	782,038	271,791,413	2.9
35	20171	787,251	270,423,493	2.9
36	20162	776,288	257,904,548	3.0
37	20153	800,909	258,518,265	3.1
38	20143	813,862	256,483,624	3.2
39	20133	832,157	254,408,815	3.3
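The row ranges 3:21 and 34:52 above are hard-coded from reading the printout. As a hedged alternative, the sketch below finds each table by searching for its "Year" header row. The data frame `raw` and the helper `extract_table` are hypothetical and built only for illustration; the number of data rows per table is still assumed to be known.

```r
# Toy stand-in with the same layout as the import: two stacked tables,
# each introduced by a header row whose first cell is "Year".
raw <- data.frame(
  V1 = c("", "Year", "2018", "2017", "1 Excludes data for Georgia.",
         "", "Year", "20181", "20171"),
  V2 = c("", "Marriages", "2,132,853", "2,236,496", "",
         "", "Divorces & annulments", "782,038", "787,251"),
  stringsAsFactors = FALSE
)

# Positions of the header rows of the two tables.
header_rows <- which(raw$V1 == "Year")

# Hypothetical helper: take n_rows rows below a header row and use the
# header row's cells as column names.
extract_table <- function(df, header_row, n_rows) {
  out <- df[(header_row + 1):(header_row + n_rows), , drop = FALSE]
  names(out) <- as.character(unlist(df[header_row, ]))
  rownames(out) <- NULL
  out
}

marriage_toy <- extract_table(raw, header_rows[1], 2)
divorce_toy  <- extract_table(raw, header_rows[2], 2)
```

With the real data, `n_rows` would be 19 for both tables, matching the ranges used above.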

Tidying up the data.

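I attempted to remove the extra footnote digit that is appended to some values in the “Year” column (for example, 20141 is the year 2014 followed by footnote marker 1). As the output shows, splitting on "/" left these values unchanged.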

marriage_sep <- marriage %>%
  separate(Year, c("Year", "X"), sep = ("[\\/]"))

## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 19 rows [1, 2, 3,
## 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19].

marriage <- marriage_sep[, -2]
marriage

##     Year Marriages  Population Marriage_Rate
## 3   2018 2,132,853 327,167,434           6.5
## 4   2017 2,236,496 325,719,178           6.9
## 5   2016 2,251,411 323,127,513           7.0
## 6   2015 2,221,579 321,418,820           6.9
## 7  20141 2,140,272 308,759,713           6.9
## 8  20131 2,081,301 306,136,672           6.8
## 9   2012 2,131,000 313,914,040           6.8
## 10  2011 2,118,000 311,591,917           6.8
## 11  2010 2,096,000 308,745,538           6.8
## 12  2009 2,080,000 306,771,529           6.8
## 13  2008 2,157,000 304,093,966           7.1
## 14  2007 2,197,000 301,231,207           7.3
## 15 20062 2,193,000 294,077,247           7.5
## 16  2005 2,249,000 295,516,599           7.6
## 17  2004 2,279,000 292,805,298           7.8
## 18  2003 2,245,000 290,107,933           7.7
## 19  2002 2,290,000 287,625,193           8.0
## 20  2001 2,326,000 284,968,955           8.2
## 21  2000 2,315,000 281,421,906           8.2

divorce_sep <- divorce %>%
  separate(Year, c("Year", "X"), sep = "[\\/]")

## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 19 rows [1, 2, 3,
## 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19].

divorce <- divorce_sep[, -2]
divorce

##     Year Divorces  Population Divorce_Rate
## 34 20181  782,038 271,791,413          2.9
## 35 20171  787,251 270,423,493          2.9
## 36 20162  776,288 257,904,548          3.0
## 37 20153  800,909 258,518,265          3.1
## 38 20143  813,862 256,483,624          3.2
## 39 20133  832,157 254,408,815          3.3
## 40 20124  851,000 248,041,986          3.4
## 41 20114  877,000 246,273,366          3.6
## 42 20104  872,000 244,122,529          3.6
## 43 20094  840,000 242,610,561          3.5
## 44 20084  844,000 240,545,163          3.5
## 45 20074  856,000 238,352,850          3.6
## 46 20064  872,000 236,094,277          3.7
## 47 20054  847,000 233,495,163          3.6
## 48 20045  879,000 236,402,656          3.7
## 49 20036  927,000 243,902,090          3.8
## 50 20027  955,000 243,108,303          3.9
## 51 20018  940,000 236,416,762          4.0
## 52 20008  944,000 233,550,143          4.0
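The Year strings contain no "/" character, so separate() has nothing to split on, which is why the warning reports NA in all 19 rows and values such as 20141 survive. A minimal sketch of one alternative follows, assuming every footnote marker is a single digit that comes after a four-digit year. The vector `years` is a hand-typed sample of values from the tables above.

```r
# Sample Year values as they appear in the source, with and without
# a trailing footnote digit.
years <- c("2018", "20141", "20062", "20181", "20008")

# The year is always the first four characters.
clean_year <- as.integer(substr(years, 1, 4))

# The footnote marker, if any, is the fifth character ("" when absent).
footnote <- substr(years, 5, 5)

clean_year
footnote
```

Keeping the footnote marker in its own vector preserves the information about which states are excluded in each year.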

Coerce into numeric

Then, I removed the commas from the datasets and coerced each variable into a numeric.

Marriage

# Coerce "Year" into a numeric
marriage$Year <- as.numeric(
                 as.character(marriage$Year))
# Remove commas and coerce "Marriages" into a numeric
m_replace1 <- str_replace_all(marriage$Marriages, "[\\,]", "")
marriage$Marriages <- as.numeric(
                      as.character(m_replace1))
# Remove commas and coerce "Population" into a numeric
m_replace2 <- str_replace_all(marriage$Population, "[\\,]", "")
marriage$Population <- as.numeric(
                       as.character(m_replace2))
# Coerce "Marriage_Rate" into a numeric
marriage$Marriage_Rate <- as.numeric(
                          as.character(marriage$Marriage_Rate))

Divorce

# Coerce "Year" into a numeric
divorce$Year <- as.numeric(
                as.character(divorce$Year))
# Remove commas and coerce "Divorces" into a numeric
d_replace1 <- str_replace_all(divorce$Divorces, "[\\,]", "")
divorce$Divorces <- as.numeric(
                    as.character(d_replace1))
# Remove commas and coerce "Population" into a numeric
d_replace2 <- str_replace_all(divorce$Population, "[\\,]", "")
divorce$Population <- as.numeric(
                      as.character(d_replace2))
# Coerce "Divorce_Rate" into a numeric
divorce$Divorce_Rate <- as.numeric(
                         as.character(divorce$Divorce_Rate))
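The comma-stripping and coercion steps above repeat the same pattern for each column. As a rough sketch, they could be wrapped in a small helper; `to_number` is a hypothetical name, not part of the original analysis, and it uses base R's gsub() in place of str_replace_all():

```r
# Hypothetical helper: remove thousands separators from a character or
# factor vector and return the values as numbers.
to_number <- function(x) {
  as.numeric(gsub(",", "", as.character(x), fixed = TRUE))
}

# Example values copied from the first divorce row printed earlier.
example_values <- to_number(c("782,038", "271,791,413", "2.9"))
example_values
```

With such a helper, each column could be converted in one line, for example `divorce$Divorces <- to_number(divorce$Divorces)`.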

Cleaned data

I now have two datasets that contain clean variables in the correct format.

head(marriage) %>% 
  kable("html") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

	Year	Marriages	Population	Marriage_Rate
3	2018	2132853	327167434	6.5
4	2017	2236496	325719178	6.9
5	2016	2251411	323127513	7.0
6	2015	2221579	321418820	6.9
7	20141	2140272	308759713	6.9
8	20131	2081301	306136672	6.8

head(divorce) %>% 
  kable("html") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

	Year	Divorces	Population	Divorce_Rate
34	20181	782038	271791413	2.9
35	20171	787251	270423493	2.9
36	20162	776288	257904548	3.0
37	20153	800909	258518265	3.1
38	20143	813862	256483624	3.2
39	20133	832157	254408815	3.3

Analysis and Visualization

Splitting the larger dataset into two smaller ones made it easier to explore. I used them to examine the following hypothesis:

The decrease in the divorce rate may be attributed to the decrease in the marriage rate.

I may not be able to prove causality using this dataset due to other factors not included in the dataset. However, we can deduce whether the two rates move in the same direction, and predict what that might mean.

Reshape the data

The data I wanted to visualize were the divorce rates and marriage rates over time. The two datasets have the “Year” column in common, so I performed a left join based on year.

Then, I created a separate data frame called d1_viz containing just the year and the two rates, and gathered it into long format with two new columns: “Rate_Type” (Marriage_Rate or Divorce_Rate) and “Rate”.

# Join the "Marriage" and "Divorce" datasets by Year
d1_joined <- left_join(marriage, divorce, by="Year")
# Create a new dataset with Year, Marriage Rate, and Divorce Rate
d1_viz <- data.frame(d1_joined$Year, d1_joined$Marriage_Rate, d1_joined$Divorce_Rate)
# Rename the columns of the new dataset
names(d1_viz) <- c("Year", "Marriage_Rate", "Divorce_Rate")
# Gather the dataset
d1_viz <- gather(d1_viz, "Rate_Type", "Rate", 2:3)
head(d1_viz) %>% 
  kable("html") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

Year	Rate_Type	Rate
2018	Marriage_Rate	6.5
2017	Marriage_Rate	6.9
2016	Marriage_Rate	7.0
2015	Marriage_Rate	6.9
20141	Marriage_Rate	6.9
20131	Marriage_Rate	6.8
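A left join keeps every row of the left table and fills in NA wherever the key has no match on the right, so mismatched Year values show up as missing rates. The sketch below illustrates this on toy data (a few rates copied from the tables above, with simplified Year keys; `m`, `d`, `joined`, `unmatched` and `long` are hypothetical names), and reshapes the result with pivot_longer(), the newer tidyr counterpart of gather():

```r
library(dplyr)
library(tidyr)

# Toy data: Year 20141 stands in for a year that still carries a
# footnote digit, so it has no partner in the right-hand table.
m <- data.frame(Year = c(2018, 2017, 20141), Marriage_Rate = c(6.5, 6.9, 6.9))
d <- data.frame(Year = c(2018, 2017, 2014),  Divorce_Rate  = c(2.9, 2.9, 3.2))

joined <- left_join(m, d, by = "Year")

# Rows whose Year found no match have NA in the columns from `d`.
unmatched <- joined %>% filter(is.na(Divorce_Rate))

# Long format: one row per Year and rate type.
long <- joined %>%
  pivot_longer(cols = c(Marriage_Rate, Divorce_Rate),
               names_to = "Rate_Type", values_to = "Rate")
```

Inspecting `unmatched` before plotting is a quick way to see whether the join keys line up.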

Visualize the data

I used ggplot2 to visualize the data in a smoothed line graph, which helped to uncover trends.

ggplot(d1_viz, aes(x = Year, y = Rate, group = Rate_Type, colour = Rate_Type)) +
  geom_point() +
  labs(title = "U.S. Marriage and Divorce Rates from 2000 - 2018", colour = "") +
  xlab("Year") +
  ylab("Rate (per 1000 people)") +
  geom_smooth(method = "auto")

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

## Warning: Removed 19 rows containing non-finite values (stat_smooth).

## Warning: Removed 19 rows containing missing values (geom_point).

Analysis

The chart shows that both marriage and divorce rates in the United States have declined since 2000. However, the marriage rate appears to increase slightly after 2010, while the divorce rate continues to decline. This complicates the hypothesis that a decline in marriage causes a decline in divorce.

H0: Is the decrease in the divorce rate due to the decrease in the marriage rate?

We don’t have enough information to answer this question. The question assumes that the marriage and divorce rates decline together and directly affect one another. However, the data show that they do not always move together, and we lack information on other factors such as income level, education, sex, or location.
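One way to put a number on how closely the two rates move together is a correlation coefficient. The sketch below is only illustrative: it uses the rates printed in the tables of this report for 2000 to 2009 and 2013 to 2018 (the 2010 to 2012 marriage rates are left out here), with the years written without footnote digits, and `rates` and `r_levels` are hypothetical names. A high correlation would still say nothing about causality.

```r
# Rates copied from the tables printed in this report, newest year first.
rates <- data.frame(
  Year = c(2018:2013, 2009:2000),
  Marriage_Rate = c(6.5, 6.9, 7.0, 6.9, 6.9, 6.8,
                    6.8, 7.1, 7.3, 7.5, 7.6, 7.8, 7.7, 8.0, 8.2, 8.2),
  Divorce_Rate  = c(2.9, 2.9, 3.0, 3.1, 3.2, 3.3,
                    3.5, 3.5, 3.6, 3.7, 3.6, 3.7, 3.8, 3.9, 4.0, 4.0)
)

# Pearson correlation between the two rate series.
r_levels <- cor(rates$Marriage_Rate, rates$Divorce_Rate)
r_levels
```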

DATA 607- Data In Context

Vanita Thompson

4/20/2020