Project 2 Assignment

The goal of this assignment is to give you practice in preparing different datasets for downstream analysis work.

Choose any three of the wide datasets identified in the Week 5 Discussion items. For each of the three chosen datasets: Create a .CSV fil, that includes all of the information included in the dataset:

library(tidyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(readr)
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(ggplot2)
library(knitr)
library(stringr)
library(xtable)

Project 2, Dataset 1: IMDB Movies

This untidy dataset was scraped from IMDB with a lot of interesting information on the top movies ever made. However, there is instances of missing and inconsistent information that makes using this dataset difficult in its current state. In this section, I will tidy and pivot the original data into meaningful subgroups using tidyr and dplyr.I performed all of these transformations in a row with the pipe (%>%) operator.

Reading in the original CSV dataframe:

initial_df <- read_csv("C:/Data 607/Project 2 Dataset/ds1_movie_metadata.csv")
## Parsed with column specification:
## cols(
##   .default = col_integer(),
##   color = col_character(),
##   director_name = col_character(),
##   actor_2_name = col_character(),
##   genres = col_character(),
##   actor_1_name = col_character(),
##   movie_title = col_character(),
##   actor_3_name = col_character(),
##   plot_keywords = col_character(),
##   movie_imdb_link = col_character(),
##   language = col_character(),
##   country = col_character(),
##   content_rating = col_character(),
##   imdb_score = col_double(),
##   aspect_ratio = col_double()
## )
## See spec(...) for full column specifications.
## Warning in rbind(names(probs), probs_f): number of columns of result is not
## a multiple of vector length (arg 1)
## Warning: 4 parsing failures.
## row # A tibble: 4 x 5 col     row col    expected   actual      file                                 expected   <int> <chr>  <chr>      <chr>       <chr>                                actual 1  2324 budget an integer 2400000000  'C:/Data 607/Project 2 Dataset/ds1_~ file 2  2989 budget an integer 12215500000 'C:/Data 607/Project 2 Dataset/ds1_~ row 3  3006 budget an integer 2500000000  'C:/Data 607/Project 2 Dataset/ds1_~ col 4  3860 budget an integer 4200000000  'C:/Data 607/Project 2 Dataset/ds1_~
kable(head(initial_df,5),  caption = "Initial Movie Datatable")
Initial Movie Datatable
color director_name num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_2_name actor_1_facebook_likes gross genres actor_1_name movie_title num_voted_users cast_total_facebook_likes actor_3_name facenumber_in_poster plot_keywords movie_imdb_link num_user_for_reviews language country content_rating budget title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes
Color James Cameron 723 178 0 855 Joel David Moore 1000 760505847 Action|Adventure|Fantasy|Sci-Fi CCH Pounder Avatar 886204 4834 Wes Studi 0 avatar|future|marine|native|paraplegic http://www.imdb.com/title/tt0499549/?ref_=fn_tt_tt_1 3054 English USA PG-13 2.37e+08 2009 936 7.9 1.78 33000
Color Gore Verbinski 302 169 563 1000 Orlando Bloom 40000 309404152 Action|Adventure|Fantasy Johnny Depp Pirates of the Caribbean: At World’s End 471220 48350 Jack Davenport 0 goddess|marriage ceremony|marriage proposal|pirate|singapore http://www.imdb.com/title/tt0449088/?ref_=fn_tt_tt_1 1238 English USA PG-13 3.00e+08 2007 5000 7.1 2.35 0
Color Sam Mendes 602 148 0 161 Rory Kinnear 11000 200074175 Action|Adventure|Thriller Christoph Waltz Spectre 275868 11700 Stephanie Sigman 1 bomb|espionage|sequel|spy|terrorist http://www.imdb.com/title/tt2379713/?ref_=fn_tt_tt_1 994 English UK PG-13 2.45e+08 2015 393 6.8 2.35 85000
Color Christopher Nolan 813 164 22000 23000 Christian Bale 27000 448130642 Action|Thriller Tom Hardy The Dark Knight Rises 1144337 106759 Joseph Gordon-Levitt 0 deception|imprisonment|lawlessness|police officer|terrorist plot http://www.imdb.com/title/tt1345836/?ref_=fn_tt_tt_1 2701 English USA PG-13 2.50e+08 2012 23000 8.5 2.35 164000
NA Doug Walker NA NA 131 NA Rob Walker 131 NA Documentary Doug Walker Star Wars: Episode VII - The Force Awakens 8 143 NA 0 NA http://www.imdb.com/title/tt5289954/?ref_=fn_tt_tt_1 NA NA NA NA NA NA 12 7.1 NA 0

I was interested in the genres included in the top 5,000 movies. Each firm had a few genres tagged. The genres were stored in one column, pipe delimited. To access the information, I used the “separate” function. I then melt, restructure, the data and remove the movie titles. What I am left with is two columns: the genre and the counter. I decided to group the data and summarize by the counter.

#Top Genres In The Top 5,000 Movies

genre <-select(initial_df, movie_title, genres) %>%
    separate(genres, c("Genre_1", "Genre_2", "Genre_3", "Genre_4", "Genre_5", "Genre_6", "Genre_7"), sep = "\\|",
    extra = "drop", fill = "right")%>% 
    melt(id.vars=c("movie_title"))%>% 
    select(-(variable: movie_title ))%>%
    mutate(countvar =1) %>%
    group_by(value) %>% 
    summarise(genre_sum = sum(countvar)) %>%
    arrange(desc(genre_sum))%>%
    filter(value != "Fi") %>%
    drop_na()%>%
    rename(movie_genre = value)

kable(head(genre,5),  caption = "Movie Genre Datatable")
Movie Genre Datatable
movie_genre genre_sum
Drama 2594
Comedy 1872
Thriller 1408
Action 1153
Romance 1106

I then plot the top 10 rated genres

ggplot(head(genre,10), aes(x= reorder(movie_genre, -genre_sum), y= genre_sum)) +
  geom_bar(stat="identity", color="blue", fill="white") +
  geom_text(aes(label=genre_sum), vjust=-0.3, size=3.5)+
  theme_minimal()+ labs(title="Top 10 Movie Genres")

I was also interested most profitable directors and their associated facebook likes. Each row in the dataset has the director’s name, the budget and the gross of each movie. I used mutate the find the net profit of firms for each director. I then grouped the data by the director and summarized the average profit and sum of facebook likes.

#Top Profitable Directors In The Top 5,000 Movies

director <-select(initial_df,director_name,director_facebook_likes, budget, gross) %>%
    mutate(countvar =1) %>%
    mutate(net_profit= gross - budget) %>%
    group_by(director_name) %>% 
    #drop_na()%>%
    summarise(avg_profit = mean(net_profit, na.rm = TRUE), movie_count = sum(countvar, na.rm = TRUE),
              facebook_likes = sum(director_facebook_likes, na.rm = TRUE)) %>%
    arrange(desc(avg_profit))

kable(head(director,5),  caption = "Director Datatable")
Director Datatable
director_name avg_profit movie_count facebook_likes
Tim Miller 305024263 1 84
George Lucas 277328296 5 0
Richard Marquand 276625409 1 37
Kyle Balda 262029560 1 22
Colin Trevorrow 252717532 2 730

I was also interested the actor information. More specifically, I was interest in the number of “top rated” movies each actor stared in as welll as and their facebook likes. Each row in the dataset has the actors’ name and their facebook likes. I used rbind to merge all the temp dataframes together. I then grouped the data by the actor and summarized by the sum of facebook likes and movie counts.

df1 <- select( initial_df,actor_1_name,actor_1_facebook_likes)%>%
        rename(actor_name= actor_1_name, fb_likes = actor_1_facebook_likes)
df2 <- select(initial_df, actor_2_name, actor_2_facebook_likes)%>%
        rename(actor_name= actor_2_name, fb_likes = actor_2_facebook_likes)
df3 <- select( initial_df,actor_3_name,actor_3_facebook_likes)%>%
        rename(actor_name= actor_3_name, fb_likes = actor_3_facebook_likes)

actors <- rbind(df1, df2, df3)%>%
    mutate(countvar =1) %>%
    group_by(actor_name) %>% 
    #drop_na()%>%
    summarise(fb_likes = sum(fb_likes, na.rm = TRUE), movie_count = sum(countvar, na.rm = TRUE)) %>%
    arrange(desc(fb_likes))

kable(head(actors,5),  caption = "Actors Datatable")
Actors Datatable
actor_name fb_likes movie_count
Johnny Depp 1640000 41
Robin Williams 1323000 27
Robert De Niro 1188000 54
Matthew Ziff 780000 3
J.K. Simmons 744000 31

I was also interested most popular search keywords for the top movies. The keywords were stored in one column, pipe delimited. To access the information, I used the “separate” function. I then melted, restructured, the data and remove the movie titles. I grouped by the keyword and summarized with the sum of the counter.

keyword <-select(initial_df, movie_title, plot_keywords) %>%
    separate(plot_keywords, c("key_1", "key_2", "key_3", "key_4", "key_5", "key_6", "key_7", "key_8",
                              "key_9", "key_10"), sep = "\\|",
    extra = "drop", fill = "right")%>% 
    melt(id.vars=c("movie_title"))%>% 
    select(-(variable: movie_title ))%>%
    mutate(countvar =1) %>%
    group_by(value) %>% 
    summarise(key_word = sum(countvar)) %>%
    arrange(desc(key_word))%>%
    drop_na()
kable(head(keyword,10),  caption = "Keyword Datatable")
Keyword Datatable
value key_word
love 198
friend 166
murder 161
death 132
police 126
new york city 91
high school 89
alien 82
school 73
boy 72

Plotted the top 15 keywords

ggplot(head(keyword, 15), aes(reorder(x=value,-key_word), y=key_word)) + 
  geom_point(size=3) + 
  geom_segment(aes(x=value, 
                   xend=value, 
                   y=0, 
                   yend=key_word)) + 
  labs(title="Keyword Chart", 
       subtitle="Search Keyword Popularity") + 
  theme(axis.text.x = element_text(angle=65, vjust=0.6))

Project 2, Dataset 2: US Chronic Disease Indicators (CDI)

From the description provided on the discussion board: The dataset provided by fed for US chronic disease indicator (CDI) is an interesting dataset. The dataset actually outlines that how different states are impacted by certain types of disease category along with clear indicators such as alcohol use among youth, Binge drinking prevalence among adults aged ???18 years, Heavy drinking among adults aged ???18 years, Chronic liver disease mortality etc.

library(data.table)
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:reshape2':
## 
##     dcast, melt
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
initial_cdi_df<-(read.table('https://raw.githubusercontent.com/niteen11/MSDS/master/DATA607/Week5/dataset/U.S._Chronic_Disease_Indicators__CDI.csv',  header = FALSE, sep = ",", quote = "", stringsAsFactors = FALSE,fill = TRUE)) 

names(initial_cdi_df) <- c("Year", "LocationAbbr","LocationDesc","Category","Indicator","Datasource","DataValueUnit","DataValueType","DataValue",
"DataValueAlt","DataValueFootnoteSymbol","DataValueFootnote","Gender","StratificationID1","IndicatorID","LocationID","LowConfidenceInterval",
"HighConfidenceInterval","GeoLocation")
initial_cdi_df <- initial_cdi_df[-1,]

kable(head(initial_cdi_df,10),  caption = "Initial CDI Datatable")
Initial CDI Datatable
Year LocationAbbr LocationDesc Category Indicator Datasource DataValueUnit DataValueType DataValue DataValueAlt DataValueFootnoteSymbol DataValueFootnote Gender StratificationID1 IndicatorID LocationID LowConfidenceInterval HighConfidenceInterval GeoLocation NA
2 2013 AL Alabama Alcohol Alcohol use among youth YRBSS % Crude Prevalence 35 35.0 ALC1_1 01 30.1 40.3 “(32.84057112200048 -86.63186076199969)“
3 2013 AK Alaska Alcohol Alcohol use among youth YRBSS % Crude Prevalence 22.5 22.5 ALC1_1 02 19.3 26.1 “(64.84507995700051 -147.72205903599973)“
4 2013 AZ Arizona Alcohol Alcohol use among youth YRBSS % Crude Prevalence 36 36.0 ALC1_1 04 31.4 40.9 “(34.865970280000454 -111.76381127699972)“
5 2013 AR Arkansas Alcohol Alcohol use among youth YRBSS % Crude Prevalence 36.3 36.3 ALC1_1 05 32.3 40.4 “(34.74865012400045 -92.27449074299966)“
6 2013 CA California Alcohol Alcohol use among youth YRBSS % Crude Prevalence ¯ No data available ALC1_1 06 “(37.63864012300047 -120.99999953799971)“
7 2013 CO Colorado Alcohol Alcohol use among youth YRBSS % Crude Prevalence ¯ No data available ALC1_1 08 “(38.843840757000464 -106.13361092099967)“
8 2013 CT Connecticut Alcohol Alcohol use among youth YRBSS % Crude Prevalence 36.7 36.7 ALC1_1 09 32.7 41.0 “(41.56266102000046 -72.64984095199964)“
9 2013 DE Delaware Alcohol Alcohol use among youth YRBSS % Crude Prevalence 36.3 36.3 ALC1_1 10 33.7 39.0 “(39.008830667000495 -75.57774116799965)“
10 2013 DC District of Columbia Alcohol Alcohol use among youth YRBSS % Crude Prevalence 31.4 31.4 ALC1_1 11 30.2 32.5 “(38.907192 -77.036871)“
11 2013 FL Florida Alcohol Alcohol use among youth YRBSS % Crude Prevalence 34.8 34.8 ALC1_1 12 33.1 36.6 “(28.932040377000476 -81.92896053899966)“

I was interested in the aggregate prevalence of drinking in different age groups. I filted by the associated category and data value type. Since all the dataset values are stored as characters, I had to convert to a numeric. I then replaced the special character to “>=” using the stringr library. I grouped by the age group and summarized by the average crude prevalence. I added a sequental counter to the dataframe.

category_df <-select(initial_cdi_df, Category, Indicator,DataValueType, DataValue) %>%
    filter(Category == "Alcohol" & DataValueType == "Crude Prevalence" ) %>%
  filter(complete.cases(.)) %>%
  mutate(DataValue =  as.numeric(DataValue)) %>%
  #mutate(Indicator= (str_replace_all(Indicator, "â???¥18", ">=18")))%>%
    group_by(Indicator, DataValueType) %>%
    summarise(average = mean(DataValue, na.rm = TRUE))%>%
    arrange(Indicator)%>%
  rename(Measure = DataValueType) 
category_df$ID <- seq.int(nrow(category_df))

kable(head(category_df,10),  caption = "Alcohol Datatable")
Alcohol Datatable
Indicator Measure average ID
Alcohol use among youth Crude Prevalence 31.388889 1
Alcohol use before pregnancy Crude Prevalence 55.408696 2
Binge drinking prevalence among adults aged ≥18 years Crude Prevalence 16.775309 3
Binge drinking prevalence among women aged 18-44 years Crude Prevalence 17.325926 4
Binge drinking prevalence among youth Crude Prevalence 17.689130 5
Heavy drinking among adults aged ≥18 years Crude Prevalence 5.972222 6
Heavy drinking among women aged 18-44 years Crude Prevalence 5.788679 7
# Draw plot
ggplot(category_df, aes(x=reorder(ID, -average), y=average)) + 
  geom_bar(stat="identity", width=.5, fill="tomato3") +
  geom_text(aes(label=round(average)))+
  labs(title="Ranging Alcohol Use", 
       subtitle="Crude Prevalence", 
       caption="source: CDI") + 
  theme(axis.text.x = element_text(angle=65, vjust=0.6))

I was interested in the average mortality rates of asthma for males and females by geographic location. I selected the location, gender, category, confidence interval data value type. Since all the dataset values are stored as characters, I had to convert to a numeric. I grouped by the location, gender, indicator and summarized by the average asthma mortality rates.

Asthma_df <-select(initial_cdi_df, LocationAbbr, Category, Indicator,Gender, DataValueType, DataValue, HighConfidenceInterval) %>%
    filter(Category == "Asthma" & DataValueType == "Age-adjusted rate"
           & Gender != "Total" & Indicator == "Asthma mortality rate") %>%
  #filter(complete.cases(.)) %>%
  mutate(DataValue =  as.numeric(DataValue)) %>%
  mutate(HighConfidenceInterval =  as.numeric(HighConfidenceInterval)) %>%
    group_by(LocationAbbr,Gender,Indicator, DataValueType) %>%
    summarise(average = mean(DataValue, na.rm = TRUE), confidence_average = mean(HighConfidenceInterval, na.rm =TRUE))%>%
  na.omit() %>%
    arrange(Indicator)%>%
  rename(Measure = DataValueType) 

kable(head(Asthma_df,10),  caption = "Asthma Datatable")
Asthma Datatable
LocationAbbr Gender Indicator Measure average confidence_average
AL Female Asthma mortality rate Age-adjusted rate 15.370 20.630
AR Female Asthma mortality rate Age-adjusted rate 15.960 22.930
AZ Female Asthma mortality rate Age-adjusted rate 8.790 12.480
AZ Male Asthma mortality rate Age-adjusted rate 8.220 11.950
CA Female Asthma mortality rate Age-adjusted rate 13.485 16.065
CA Male Asthma mortality rate Age-adjusted rate 8.880 11.305
CO Female Asthma mortality rate Age-adjusted rate 13.030 18.300
CT Female Asthma mortality rate Age-adjusted rate 11.630 16.600
FL Female Asthma mortality rate Age-adjusted rate 9.350 11.050
FL Male Asthma mortality rate Age-adjusted rate 6.370 8.180

Plot the asthma mortality rates, split by males and females

ggplot(Asthma_df, aes(x = reorder(LocationAbbr,-average), y = average, fill = Gender)) +   
                              geom_bar(stat = "identity", width = .6)

Plot the average confidence interval for the mortality rates

ggplot(Asthma_df, aes(x = reorder(LocationAbbr,-average), y = confidence_average, fill = Gender)) +   
                              geom_bar(stat = "identity", width = .6)

I was interested in the instances of melanoma by geographic location. The .readcsv delimiter separator incorrectly split columns due to the presence of a comma. I used the unite function to correct the columns. I then flited by the age-adjusted rate indicator. Converted summarize columns to numeric and omitted the NAs.

melanoma_df <-select(initial_cdi_df, LocationAbbr, Category, Indicator,Datasource,
                    DataValueUnit,DataValueType, DataValue, DataValueAlt, DataValueFootnoteSymbol) %>%
    filter( Category == "Cancer" & Indicator == "\"Invasive melanoma") %>%
  unite_("Indicator", c("Indicator","Datasource")) %>%
  unite_("DataValueUnit",c("DataValueType","DataValue")) %>%
  filter( DataValueAlt == "Average Annual Age-adjusted Rate" | 
          DataValueUnit == "Average Annual Age-adjusted Rate")%>%
  rename(DataValue = DataValueFootnoteSymbol) %>%
  mutate(DataValue =  as.numeric(DataValue)) %>%
  group_by(LocationAbbr,Indicator, DataValueAlt) %>%
    summarise(average = mean(DataValue, na.rm = TRUE))%>%
  na.omit() %>%
    arrange(Indicator)%>%
  rename(Measure = DataValueAlt) 

kable(head(melanoma_df,10),  caption = "Melanoma Datatable")
Melanoma Datatable
LocationAbbr Indicator Measure average
AK “Invasive melanoma_ incidence” Average Annual Age-adjusted Rate 11.90000
AL “Invasive melanoma_ incidence” Average Annual Age-adjusted Rate 20.90000
AR “Invasive melanoma_ incidence” Average Annual Age-adjusted Rate 15.00000
AZ “Invasive melanoma_ incidence” Average Annual Age-adjusted Rate 16.65000
CO “Invasive melanoma_ incidence” Average Annual Age-adjusted Rate 20.90000
DC “Invasive melanoma_ incidence” Average Annual Age-adjusted Rate 8.60000
DE “Invasive melanoma_ incidence” Average Annual Age-adjusted Rate 27.20000
FL “Invasive melanoma_ incidence” Average Annual Age-adjusted Rate 17.77143
GA “Invasive melanoma_ incidence” Average Annual Age-adjusted Rate 24.00000
ID “Invasive melanoma_ incidence” Average Annual Age-adjusted Rate 24.80000
ggplot(head(melanoma_df,20), aes(x= reorder(LocationAbbr, -average), y= average)) +
  geom_bar(stat="identity", color="blue", fill="white") +
  geom_text(aes(label=round(average)), vjust=-0.3, size=3.5)+
  theme_minimal()+ labs(title="Top 20 States With Melanoma Cases %")

I was interested in the mortality rates of melanoma by geographic location. The .readcsv delimiter separator incorrectly split columns due to the presence of a comma. I used the unite function to correct the columns. I then flited by the age-adjusted rate indicator. Converted summarize columns to numeric and omitted the NAs.

melanomadeaths_df <-select(initial_cdi_df, LocationAbbr, Category, Indicator,Datasource,
                    DataValueUnit,DataValueType, DataValue, DataValueAlt, DataValueFootnoteSymbol) %>%
    filter( Category == "Cancer" & Indicator == "\"Melanoma") %>%
  unite_("Indicator", c("Indicator","Datasource")) %>%
  unite_("DataValueUnit",c("DataValueType","DataValue")) %>%
  filter( DataValueAlt == "Average Annual Age-adjusted Rate" | 
          DataValueUnit == "Average Annual Age-adjusted Rate")%>%
  rename(DataValue = DataValueFootnoteSymbol) %>%
  mutate(DataValue =  as.numeric(DataValue)) %>%
  group_by(LocationAbbr,Indicator, DataValueAlt) %>%
    summarise(average = mean(DataValue, na.rm = TRUE))%>%
  na.omit() %>%
    arrange(Indicator)%>%
  rename(Measure = DataValueAlt) 

kable(head(melanomadeaths_df,10),  caption = "Melanoma Mortality Datatable")
Melanoma Mortality Datatable
LocationAbbr Indicator Measure average
AK “Melanoma_ mortality” Average Annual Age-adjusted Rate 2.3
AL “Melanoma_ mortality” Average Annual Age-adjusted Rate 2.9
AR “Melanoma_ mortality” Average Annual Age-adjusted Rate 2.7
AZ “Melanoma_ mortality” Average Annual Age-adjusted Rate 3.0
CA “Melanoma_ mortality” Average Annual Age-adjusted Rate 2.6
CO “Melanoma_ mortality” Average Annual Age-adjusted Rate 3.5
CT “Melanoma_ mortality” Average Annual Age-adjusted Rate 2.5
DC “Melanoma_ mortality” Average Annual Age-adjusted Rate 1.3
DE “Melanoma_ mortality” Average Annual Age-adjusted Rate 2.7
FL “Melanoma_ mortality” Average Annual Age-adjusted Rate 2.9
ggplot(head(melanomadeaths_df,20), aes(x= reorder(LocationAbbr, -average), y= average)) +
  geom_bar(stat="identity", color="red", fill="red") +
  geom_text(aes(label=round(average)), vjust=-0.3, size=3.5)+
  theme_minimal()+ labs(title="Average Melanoma Deaths By State %")

Project 2, Dataset 3: Marriage and Divorce Rates

Taken from the week 5 discussion board. National marriage and divorce rate data from National Vital Statistics System

Read in the initial dataframe. I used the na.strings to turn all blanks to NAs. This is will me quickly remove unwanted columns in the tidying process

initial_md_df<-(read.table('C:/Data 607/Project 2 Dataset/marriage_divorce.csv',  header = FALSE, sep = ",", quote = "\"" , stringsAsFactors = FALSE,fill = TRUE,na.strings = c("", "NA"))) 

#initial_md_df <- initial_md_df[!apply(is.na(initial_md_df) | initial_md_df == "", 1, all),]

kable(head(initial_md_df,10),  caption = "Initial Marriages & Divorces Datatable")
Initial Marriages & Divorces Datatable
V1 V2 V3 V4
Provisional number of marriages and marriage rate: United States, 2000-2016 NA NA NA
NA NA NA NA
Year Marriages Population Rate per 1,000 total population
2016 2,245,404 323,127,513 6.9
2015 2,221,579 321,418,820 6.9
2014/1 2,140,272 308,759,713 6.9
2013/1 2,081,301 306,136,672 6.8
2012 2,131,000 313,914,040 6.8
2011 2,118,000 311,591,917 6.8
2010 2,096,000 308,745,538 6.8

I created the marriages dataframe. I used the stringr library to replace the commas to blanks and the converted all the columns to numeric. I noticed during my initial examination of the dataset that the number of marriages exceed 1 million per year and the number of divorces are below 1 million. To make the marriages datatable I filtered population column for above 1 million.

marriages_df <-initial_md_df %>% 
               na.omit() %>%
              mutate(V2 = (str_replace_all(V2, ",", "")),
                     V3 = (str_replace_all(V3, ",", "")),
                     V4 = (str_replace_all(V4, ",", "")))%>%

             
              mutate(V1 = (str_replace_all(V1, "/1", "")),
                     V1 = (str_replace_all(V1, "/2", "")))%>%
  
                mutate(V2 =  as.numeric(V2),
                       V3 =  as.numeric(V3),
                       V1 =  as.numeric(V1),
                       V4 =  as.numeric(V4))%>%
               filter( V2 > 1000000) %>%
      
             
              rename(Year = V1, Marriages = V2, Population = V3,
                     Rate = V4)
## Warning in eval(substitute(expr), envir, enclos): NAs introduced by
## coercion

## Warning in eval(substitute(expr), envir, enclos): NAs introduced by
## coercion

## Warning in eval(substitute(expr), envir, enclos): NAs introduced by
## coercion

## Warning in eval(substitute(expr), envir, enclos): NAs introduced by
## coercion
kable(head(marriages_df,10),  caption = "Marriages Datatable")             
Marriages Datatable
Year Marriages Population Rate
2016 2245404 323127513 6.9
2015 2221579 321418820 6.9
2014 2140272 308759713 6.9
2013 2081301 306136672 6.8
2012 2131000 313914040 6.8
2011 2118000 311591917 6.8
2010 2096000 308745538 6.8
2009 2080000 306771529 6.8
2008 2157000 304093966 7.1
2007 2197000 301231207 7.3
theme_set(theme_bw())
ggplot(marriages_df, aes(Population, Marriages))+ 
      geom_point()+
  labs( title="Marraiges vs Population", 
       caption="Source: marriages_df")

ggplot(marriages_df, aes(x= Year, y= Marriages)) +
  geom_bar(stat="identity", color="yellow", fill="white") +
  geom_text(aes(label= Marriages), vjust=-0.3, size=2, color="blue")+
    labs(subtitle="Marriage Numbers From 2000 to 2016", 
     
       title="Marriages Per Year", 
       caption="Source: marriages_df")

ggplot(marriages_df)  + 
    geom_line(aes(x=Year, y=Rate),stat="identity", color="yellow") +
    geom_text(aes(label=Rate, x=Year, y=Rate), color="blue")+
    labs(subtitle="Rate per 1,000 total population", 
     
       title="Marriage Rates", 
       caption="Source: marriages_df")

divorse_df <-initial_md_df %>% 
               na.omit() %>%
              mutate(V2 = (str_replace_all(V2, ",", "")),
                     V3 = (str_replace_all(V3, ",", "")),
                     V4 = (str_replace_all(V4, ",", "")))%>%

             
              mutate(V1 = (str_replace_all(V1, "/1", "")),
                     V1 = (str_replace_all(V1, "/2", "")),
                     V1 = (str_replace_all(V1, "/3", "")),
                     V1 = (str_replace_all(V1, "/4", "")),
                     V1 = (str_replace_all(V1, "/5", "")),
                     V1 = (str_replace_all(V1, "/6", "")),
                     V1 = (str_replace_all(V1, "/7", "")))%>%
  
                mutate(V2 =  as.numeric(V2),
                       V3 =  as.numeric(V3),
                       V1 =  as.numeric(V1),
                       V4 =  as.numeric(V4))%>%
               filter( V2 < 1000000) %>%
      
             
              rename(Year = V1, Divorces_Annulments = V2, Population = V3,
                     Rate = V4)
## Warning in eval(substitute(expr), envir, enclos): NAs introduced by
## coercion

## Warning in eval(substitute(expr), envir, enclos): NAs introduced by
## coercion

## Warning in eval(substitute(expr), envir, enclos): NAs introduced by
## coercion

## Warning in eval(substitute(expr), envir, enclos): NAs introduced by
## coercion
kable(head(divorse_df ,10),  caption = "Divorse Datatable")              
Divorse Datatable
Year Divorces_Annulments Population Rate
2016 827261 257904548 3.2
2015 800909 258518265 3.1
2014 813862 256483624 3.2
2013 832157 254408815 3.3
2012 851000 248041986 3.4
2011 877000 246273366 3.6
2010 872000 244122529 3.6
2009 840000 242610561 3.5
2008 844000 240545163 3.5
2007 856000 238352850 3.6
theme_set(theme_bw())
ggplot(divorse_df, aes(Population,Divorces_Annulments))+ 
      geom_point()+
  labs(title="Divorces & Annulments vs Population", 
       caption="Source: divorse_df")

ggplot(divorse_df, aes(x= Year, y= Divorces_Annulments)) +
  geom_bar(stat="identity", color="purple", fill="white") +
  geom_text(aes(label= Divorces_Annulments), vjust=-0.3, size=2, color="blue")+
    labs(subtitle="Divorces & Annulments Numbers From 2000 to 2016", 
       title="Divorces & Annulments Per Year", 
       caption="Source: divorse_df")

ggplot(divorse_df)  + 
    geom_line(aes(x=Year, y=Rate),stat="identity", color="yellow") +
    geom_text(aes(label=Rate, x=Year, y=Rate), color="blue")+
    labs(subtitle="Rate per 1,000 total population", 
       title="Divorce Rates", 
       caption="Source: divorse_df")

I joined the reformatted divorce and marriages dataframes together so I could create a scatter plot of population of divorces vs marriages.

joined_df <- marriages_df %>%
    left_join(divorse_df, by='Year')

kable(head(joined_df ,10),  caption = "Total Datatable")
Total Datatable
Year Marriages Population.x Rate.x Divorces_Annulments Population.y Rate.y
2016 2245404 323127513 6.9 827261 257904548 3.2
2015 2221579 321418820 6.9 800909 258518265 3.1
2014 2140272 308759713 6.9 813862 256483624 3.2
2013 2081301 306136672 6.8 832157 254408815 3.3
2012 2131000 313914040 6.8 851000 248041986 3.4
2011 2118000 311591917 6.8 877000 246273366 3.6
2010 2096000 308745538 6.8 872000 244122529 3.6
2009 2080000 306771529 6.8 840000 242610561 3.5
2008 2157000 304093966 7.1 844000 240545163 3.5
2007 2197000 301231207 7.3 856000 238352850 3.6
theme_set(theme_bw())
ggplot(joined_df, aes(Marriages,Divorces_Annulments))+ 
      geom_point()+ 
  geom_smooth(method="lm", se=F) +
  labs(title="Marriage vs Divorce", 
       caption="Source: joined_df")