The goal of this assignment is to give you practice in preparing different datasets for downstream
analysis work.
Your task is to:
(1) Choose any three of the “wide” datasets identified in the Week 6 Discussion items. (You may
use your own dataset; please don’t use my Sample Post dataset, since that was used in your
Week 6 assignment!) For each of the three chosen datasets:
• Create a .CSV file (or optionally, a MySQL database!) that includes all of the information
included in the dataset. You’re encouraged to use a “wide” structure similar to how the
information appears in the discussion item, so that you can practice tidying and
transformations as described below.
• Read the information from your .CSV file into R, and use tidyr and dplyr as needed to
tidy and transform your data. [Most of your grade will be based on this step!]
• Perform the analysis requested in the discussion item.
• Your code should be in an R Markdown file, posted to rpubs.com, and should include
narrative descriptions of your data cleanup work, analysis, and conclusions.
(2) Please include in your homework submission, for each of the three chosen datasets:
• The URL to the .Rmd file in your GitHub repository, and
• The URL for your rpubs.com web page.

DATASET: POPULATION vs MARRIAGE

Found here: https://raw.githubusercontent.com/theoracley/Data607/master/Project2/national_marriage_divorce_rates.csv

In this dataset, we will look at the relationship between marriage and population rate between 2000 and 2016 by analysing how each of them changed over time. We will also look at what may have driven those changes in these specific years.

#Load all required packages
library(DT)
library(tidyr)
library(dplyr)    
library(ggplot2) 
library(tidyverse)

Load the data first

#Reading our data from csv file
data <- read.csv("https://raw.githubusercontent.com/theoracley/Data607/master/Project2/national_marriage_divorce_rates.csv", header=FALSE, sep=",")

#convert to a tibble to check out the data (as_tibble() replaces the deprecated as.tibble())
as_tibble(data)
## # A tibble: 61 x 10
##    V1             V2     V3    V4       V5    V6    V7    V8    V9    V10  
##    <fct>          <fct>  <fct> <fct>    <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
##  1 Provisiona~ ""     ""    ""       NA    NA    NA    NA    NA    NA   
##  2 ""             ""     ""    ""       NA    NA    NA    NA    NA    NA   
##  3 Year           Marri~ Popu~ Rate pe~ NA    NA    NA    NA    NA    NA   
##  4 2016           2,245~ 323,~ 6.9      NA    NA    NA    NA    NA    NA   
##  5 2015           2,221~ 321,~ 6.9      NA    NA    NA    NA    NA    NA   
##  6 2014/1         2,140~ 308,~ 6.9      NA    NA    NA    NA    NA    NA   
##  7 2013/1         2,081~ 306,~ 6.8      NA    NA    NA    NA    NA    NA   
##  8 2012           2,131~ 313,~ 6.8      NA    NA    NA    NA    NA    NA   
##  9 2011           2,118~ 311,~ 6.8      NA    NA    NA    NA    NA    NA   
## 10 2010           2,096~ 308,~ 6.8      NA    NA    NA    NA    NA    NA   
## # ... with 51 more rows

Time for clean up

#check out the columns
colnames(data)
##  [1] "V1"  "V2"  "V3"  "V4"  "V5"  "V6"  "V7"  "V8"  "V9"  "V10"
colnames(data) <- c("years","marriage","population","population_rate","X.3","X.4","X.5","X.6","X.7","X.8")
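#keep only rows 4 to 20 (the yearly figures for 2016 down to 2000):
#drop everything after row 20 first, then the three title/header rows
data <- data[-c(33:61),]
data <- data[-c(21:32),]
data <- data[-c(1:3),]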

#keep only the four columns we need
data_cols <- c("years","marriage","population","population_rate")

#get the cleaned data
clean_data <- data[data_cols]

#strip footnote markers such as "/1" from the year labels (e.g. "2014/1" becomes "2014")
clean_data$years <- gsub("/\\d", "", clean_data$years)

#Resetting the index as usual
rownames(clean_data) <- 1:nrow(clean_data)

#finally we get it
clean_data
##    years  marriage  population population_rate
## 1   2016 2,245,404 323,127,513             6.9
## 2   2015 2,221,579 321,418,820             6.9
## 3   2014 2,140,272 308,759,713             6.9
## 4   2013 2,081,301 306,136,672             6.8
## 5   2012 2,131,000 313,914,040             6.8
## 6   2011 2,118,000 311,591,917             6.8
## 7   2010 2,096,000 308,745,538             6.8
## 8   2009 2,080,000 306,771,529             6.8
## 9   2008 2,157,000 304,093,966             7.1
## 10  2007 2,197,000 301,231,207             7.3
## 11  2006 2,193,000 294,077,247             7.5
## 12  2005 2,249,000 295,516,599             7.6
## 13  2004 2,279,000 292,805,298             7.8
## 14  2003 2,245,000 290,107,933             7.7
## 15  2002 2,290,000 287,625,193             8.0
## 16  2001 2,326,000 284,968,955             8.2
## 17  2000 2,315,000 281,421,906             8.2
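
In the printout above, marriage and population still carry thousands separators, so R holds them as text (or factors) rather than numbers. The following is a minimal sketch of converting them before any plotting or arithmetic, using three rows copied from the table. The names sample_data, to_number, numeric_data and per_1000 are illustrative and not part of the original analysis.

```r
# Three rows copied from the clean_data printout; all values are still text.
sample_data <- data.frame(
  years = c("2016", "2009", "2000"),
  marriage = c("2,245,404", "2,080,000", "2,315,000"),
  population = c("323,127,513", "306,771,529", "281,421,906"),
  population_rate = c("6.9", "6.8", "8.2"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: remove thousands separators, then convert to numeric.
# as.character() makes it safe for factor columns as well.
to_number <- function(x) as.numeric(gsub(",", "", as.character(x)))

numeric_data <- data.frame(
  years = as.integer(sample_data$years),
  marriage = to_number(sample_data$marriage),
  population = to_number(sample_data$population),
  population_rate = to_number(sample_data$population_rate)
)

# Marriages per 1,000 people, computed for comparison with population_rate.
numeric_data$per_1000 <- round(1000 * numeric_data$marriage / numeric_data$population, 1)

numeric_data
```

For these three rows the computed per_1000 value equals the population_rate column, which suggests that column is the number of marriages per 1,000 people. That reading is an inference from the arithmetic, not something stated in the file header shown here.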

Let’s plot

qplot(data=clean_data, x=years, y=population, size=I(3), color=I("#388e3c"), main="Population vs Years")

qplot(data=clean_data, x=years, y=population_rate, size=I(3), color=I("#4dd0e1"), main="Population Rate vs Years")

qplot(data=clean_data, x=years, y=marriage, size=I(3), color=I("#ff6d00"), main="Marriage vs Years")
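
qplot() is deprecated in recent ggplot2 releases (3.4.0 onward), and with text columns the axes are treated as categories rather than a numeric scale. As a hedged alternative, here is a minimal sketch of the "Marriage vs Years" plot written with ggplot() on numeric columns. It uses five rows from the table above, and plot_data and p are illustrative names.

```r
library(ggplot2)

# Five rows from the cleaned table, entered as numbers (commas removed).
plot_data <- data.frame(
  years = c(2016, 2012, 2009, 2004, 2000),
  marriage = c(2245404, 2131000, 2080000, 2279000, 2315000)
)

# Same point size and colour as the qplot() call, with a connecting line
# so the trend over time is easier to follow.
p <- ggplot(plot_data, aes(x = years, y = marriage)) +
  geom_line(colour = "#ff6d00") +
  geom_point(size = 3, colour = "#ff6d00") +
  labs(title = "Marriage vs Years", x = "Year", y = "Marriages")

print(p)
```

The same pattern applies to the other two plots by swapping the y column, the colour and the title.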

Conclusion

Population generally increases over time. From the plots above, we can see that the number of marriages declined from 2000 to 2009 (“Marriage vs Years”) and then picked up after 2009. Over the same period (“Population Rate vs Years”), the population rate also declined from 2000 to 2009, stayed steady from 2009 to 2013, and has been picking up since then. This suggests a strong positive correlation between population rate and marriage.

One possible explanation is that after 9/11 many businesses were lost, which led to many layoffs; families losing their jobs discouraged couples from getting married, hence fewer newborns and a falling population rate. Add to that the housing crisis, Enron, bank fraud, etc., which led to a big recession. To bring the economy back on its feet, the government passed many new laws and regulations to stop the housing madness, and added new packages to stimulate the economy and create jobs. After that happened in 2009, the economy started to pick up and people regained confidence in it. Things returned to normal, and couples who felt good about the economy started thinking about starting families, leading to an increase in newborns.
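
The correlation claim in the conclusion can be checked numerically instead of being read off the plots. The following is a minimal sketch, assuming the 17 yearly values from the cleaned table entered as numbers; the variable r is an illustrative name, and Pearson correlation is one reasonable choice among several.

```r
# Yearly values for 2016 down to 2000, copied from the cleaned table.
marriage <- c(2245404, 2221579, 2140272, 2081301, 2131000, 2118000,
              2096000, 2080000, 2157000, 2197000, 2193000, 2249000,
              2279000, 2245000, 2290000, 2326000, 2315000)
population_rate <- c(6.9, 6.9, 6.9, 6.8, 6.8, 6.8, 6.8, 6.8, 7.1,
                     7.3, 7.5, 7.6, 7.8, 7.7, 8.0, 8.2, 8.2)

# Both vectors must describe the same years in the same order.
stopifnot(length(marriage) == length(population_rate))

# Pearson correlation coefficient between the two series.
r <- cor(marriage, population_rate)
r
```

A value near 1 would support the "strong positive" wording. If the rate column is itself derived from the marriage count, some positive correlation is expected by construction, so this is a sanity check rather than independent evidence.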