Project 2

Project 2 Assignment

The goal of this assignment is to give you practice in preparing different datasets for downstream analysis work.

Choose any three of the wide datasets identified in the Week 5 Discussion items. For each of the three chosen datasets: Create a .CSV fil, that includes all of the information included in the dataset:

Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. [Most of your grade will be based on this step! -Perform the analysis requested in the discussion item.Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions.

library(tidyr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(readr)
library(reshape2)

## 
## Attaching package: 'reshape2'

## The following object is masked from 'package:tidyr':
## 
##     smiths

library(ggplot2)
library(knitr)
library(stringr)
library(xtable)

Project 2, Dataset 1: IMDB Movies

This untidy dataset was scraped from IMDB with a lot of interesting information on the top movies ever made. However, there is instances of missing and inconsistent information that makes using this dataset difficult in its current state. In this section, I will tidy and pivot the original data into meaningful subgroups using tidyr and dplyr.I performed all of these transformations in a row with the pipe (%>%) operator.

Reading in the original CSV dataframe:

initial_df <- read_csv("C:/Data 607/Project 2 Dataset/ds1_movie_metadata.csv")

## Parsed with column specification:
## cols(
##   .default = col_integer(),
##   color = col_character(),
##   director_name = col_character(),
##   actor_2_name = col_character(),
##   genres = col_character(),
##   actor_1_name = col_character(),
##   movie_title = col_character(),
##   actor_3_name = col_character(),
##   plot_keywords = col_character(),
##   movie_imdb_link = col_character(),
##   language = col_character(),
##   country = col_character(),
##   content_rating = col_character(),
##   imdb_score = col_double(),
##   aspect_ratio = col_double()
## )

## See spec(...) for full column specifications.

## Warning in rbind(names(probs), probs_f): number of columns of result is not
## a multiple of vector length (arg 1)

## Warning: 4 parsing failures.
## row # A tibble: 4 x 5 col     row col    expected   actual      file                                 expected   <int> <chr>  <chr>      <chr>       <chr>                                actual 1  2324 budget an integer 2400000000  'C:/Data 607/Project 2 Dataset/ds1_~ file 2  2989 budget an integer 12215500000 'C:/Data 607/Project 2 Dataset/ds1_~ row 3  3006 budget an integer 2500000000  'C:/Data 607/Project 2 Dataset/ds1_~ col 4  3860 budget an integer 4200000000  'C:/Data 607/Project 2 Dataset/ds1_~

kable(head(initial_df,5),  caption = "Initial Movie Datatable")

Initial Movie Datatable
color	director_name	num_critic_for_reviews	duration	director_facebook_likes	actor_3_facebook_likes	actor_2_name	actor_1_facebook_likes	gross	genres	actor_1_name	movie_title	num_voted_users	cast_total_facebook_likes	actor_3_name	facenumber_in_poster	plot_keywords	movie_imdb_link	num_user_for_reviews	language	country	content_rating	budget	title_year	actor_2_facebook_likes	imdb_score	aspect_ratio	movie_facebook_likes
Color	James Cameron	723	178	0	855	Joel David Moore	1000	760505847	Action\|Adventure\|Fantasy\|Sci-Fi	CCH Pounder	Avatar	886204	4834	Wes Studi	0	avatar\|future\|marine\|native\|paraplegic	http://www.imdb.com/title/tt0499549/?ref_=fn_tt_tt_1	3054	English	USA	PG-13	2.37e+08	2009	936	7.9	1.78	33000
Color	Gore Verbinski	302	169	563	1000	Orlando Bloom	40000	309404152	Action\|Adventure\|Fantasy	Johnny Depp	Pirates of the Caribbean: At World’s End	471220	48350	Jack Davenport	0	goddess\|marriage ceremony\|marriage proposal\|pirate\|singapore	http://www.imdb.com/title/tt0449088/?ref_=fn_tt_tt_1	1238	English	USA	PG-13	3.00e+08	2007	5000	7.1	2.35	0
Color	Sam Mendes	602	148	0	161	Rory Kinnear	11000	200074175	Action\|Adventure\|Thriller	Christoph Waltz	Spectre	275868	11700	Stephanie Sigman	1	bomb\|espionage\|sequel\|spy\|terrorist	http://www.imdb.com/title/tt2379713/?ref_=fn_tt_tt_1	994	English	UK	PG-13	2.45e+08	2015	393	6.8	2.35	85000
Color	Christopher Nolan	813	164	22000	23000	Christian Bale	27000	448130642	Action\|Thriller	Tom Hardy	The Dark Knight Rises	1144337	106759	Joseph Gordon-Levitt	0	deception\|imprisonment\|lawlessness\|police officer\|terrorist plot	http://www.imdb.com/title/tt1345836/?ref_=fn_tt_tt_1	2701	English	USA	PG-13	2.50e+08	2012	23000	8.5	2.35	164000
NA	Doug Walker	NA	NA	131	NA	Rob Walker	131	NA	Documentary	Doug Walker	Star Wars: Episode VII - The Force Awakens	8	143	NA	0	NA	http://www.imdb.com/title/tt5289954/?ref_=fn_tt_tt_1	NA	NA	NA	NA	NA	NA	12	7.1	NA	0

I was interested in the genres included in the top 5,000 movies. Each firm had a few genres tagged. The genres were stored in one column, pipe delimited. To access the information, I used the “separate” function. I then melt, restructure, the data and remove the movie titles. What I am left with is two columns: the genre and the counter. I decided to group the data and summarize by the counter.

#Top Genres In The Top 5,000 Movies

genre <-select(initial_df, movie_title, genres) %>%
    separate(genres, c("Genre_1", "Genre_2", "Genre_3", "Genre_4", "Genre_5", "Genre_6", "Genre_7"), sep = "\\|",
    extra = "drop", fill = "right")%>% 
    melt(id.vars=c("movie_title"))%>% 
    select(-(variable: movie_title ))%>%
    mutate(countvar =1) %>%
    group_by(value) %>% 
    summarise(genre_sum = sum(countvar)) %>%
    arrange(desc(genre_sum))%>%
    filter(value != "Fi") %>%
    drop_na()%>%
    rename(movie_genre = value)

kable(head(genre,5),  caption = "Movie Genre Datatable")

Movie Genre Datatable
movie_genre	genre_sum
Drama	2594
Comedy	1872
Thriller	1408
Action	1153
Romance	1106

I then plot the top 10 rated genres

ggplot(head(genre,10), aes(x= reorder(movie_genre, -genre_sum), y= genre_sum)) +
  geom_bar(stat="identity", color="blue", fill="white") +
  geom_text(aes(label=genre_sum), vjust=-0.3, size=3.5)+
  theme_minimal()+ labs(title="Top 10 Movie Genres")

I was also interested most profitable directors and their associated facebook likes. Each row in the dataset has the director’s name, the budget and the gross of each movie. I used mutate the find the net profit of firms for each director. I then grouped the data by the director and summarized the average profit and sum of facebook likes.

#Top Profitable Directors In The Top 5,000 Movies

director <-select(initial_df,director_name,director_facebook_likes, budget, gross) %>%
    mutate(countvar =1) %>%
    mutate(net_profit= gross - budget) %>%
    group_by(director_name) %>% 
    #drop_na()%>%
    summarise(avg_profit = mean(net_profit, na.rm = TRUE), movie_count = sum(countvar, na.rm = TRUE),
              facebook_likes = sum(director_facebook_likes, na.rm = TRUE)) %>%
    arrange(desc(avg_profit))

kable(head(director,5),  caption = "Director Datatable")

Director Datatable
director_name	avg_profit	movie_count	facebook_likes
Tim Miller	305024263	1	84
George Lucas	277328296	5	0
Richard Marquand	276625409	1	37
Kyle Balda	262029560	1	22
Colin Trevorrow	252717532	2	730

I was also interested the actor information. More specifically, I was interest in the number of “top rated” movies each actor stared in as welll as and their facebook likes. Each row in the dataset has the actors’ name and their facebook likes. I used rbind to merge all the temp dataframes together. I then grouped the data by the actor and summarized by the sum of facebook likes and movie counts.

df1 <- select( initial_df,actor_1_name,actor_1_facebook_likes)%>%
        rename(actor_name= actor_1_name, fb_likes = actor_1_facebook_likes)
df2 <- select(initial_df, actor_2_name, actor_2_facebook_likes)%>%
        rename(actor_name= actor_2_name, fb_likes = actor_2_facebook_likes)
df3 <- select( initial_df,actor_3_name,actor_3_facebook_likes)%>%
        rename(actor_name= actor_3_name, fb_likes = actor_3_facebook_likes)

actors <- rbind(df1, df2, df3)%>%
    mutate(countvar =1) %>%
    group_by(actor_name) %>% 
    #drop_na()%>%
    summarise(fb_likes = sum(fb_likes, na.rm = TRUE), movie_count = sum(countvar, na.rm = TRUE)) %>%
    arrange(desc(fb_likes))

kable(head(actors,5),  caption = "Actors Datatable")

Actors Datatable
actor_name	fb_likes	movie_count
Johnny Depp	1640000	41
Robin Williams	1323000	27
Robert De Niro	1188000	54
Matthew Ziff	780000	3
J.K. Simmons	744000	31

I was also interested most popular search keywords for the top movies. The keywords were stored in one column, pipe delimited. To access the information, I used the “separate” function. I then melted, restructured, the data and remove the movie titles. I grouped by the keyword and summarized with the sum of the counter.

keyword <-select(initial_df, movie_title, plot_keywords) %>%
    separate(plot_keywords, c("key_1", "key_2", "key_3", "key_4", "key_5", "key_6", "key_7", "key_8",
                              "key_9", "key_10"), sep = "\\|",
    extra = "drop", fill = "right")%>% 
    melt(id.vars=c("movie_title"))%>% 
    select(-(variable: movie_title ))%>%
    mutate(countvar =1) %>%
    group_by(value) %>% 
    summarise(key_word = sum(countvar)) %>%
    arrange(desc(key_word))%>%
    drop_na()
kable(head(keyword,10),  caption = "Keyword Datatable")

Keyword Datatable
value	key_word
love	198
friend	166
murder	161
death	132
police	126
new york city	91
high school	89
alien	82
school	73
boy	72

Plotted the top 15 keywords

ggplot(head(keyword, 15), aes(reorder(x=value,-key_word), y=key_word)) + 
  geom_point(size=3) + 
  geom_segment(aes(x=value, 
                   xend=value, 
                   y=0, 
                   yend=key_word)) + 
  labs(title="Keyword Chart", 
       subtitle="Search Keyword Popularity") + 
  theme(axis.text.x = element_text(angle=65, vjust=0.6))

Project 2, Dataset 2: US Chronic Disease Indicators (CDI)

From the description provided on the discussion board: The dataset provided by fed for US chronic disease indicator (CDI) is an interesting dataset. The dataset actually outlines that how different states are impacted by certain types of disease category along with clear indicators such as alcohol use among youth, Binge drinking prevalence among adults aged ???18 years, Heavy drinking among adults aged ???18 years, Chronic liver disease mortality etc.

library(data.table)

## 
## Attaching package: 'data.table'

## The following objects are masked from 'package:reshape2':
## 
##     dcast, melt

## The following objects are masked from 'package:dplyr':
## 
##     between, first, last

initial_cdi_df<-(read.table('https://raw.githubusercontent.com/niteen11/MSDS/master/DATA607/Week5/dataset/U.S._Chronic_Disease_Indicators__CDI.csv',  header = FALSE, sep = ",", quote = "", stringsAsFactors = FALSE,fill = TRUE)) 

names(initial_cdi_df) <- c("Year", "LocationAbbr","LocationDesc","Category","Indicator","Datasource","DataValueUnit","DataValueType","DataValue",
"DataValueAlt","DataValueFootnoteSymbol","DataValueFootnote","Gender","StratificationID1","IndicatorID","LocationID","LowConfidenceInterval",
"HighConfidenceInterval","GeoLocation")
initial_cdi_df <- initial_cdi_df[-1,]

kable(head(initial_cdi_df,10),  caption = "Initial CDI Datatable")

Initial CDI Datatable
	Year	LocationAbbr	LocationDesc	Category	Indicator	Datasource	DataValueUnit	DataValueType	DataValue	DataValueAlt	DataValueFootnoteSymbol	DataValueFootnote	IndicatorID	LocationID	LowConfidenceInterval	HighConfidenceInterval	GeoLocation	NA
2	2013	AL	Alabama	Alcohol	Alcohol use among youth	YRBSS	%	Crude Prevalence	35	35.0			ALC1_1	01	30.1	40.3	“(32.84057112200048	-86.63186076199969)“
3	2013	AK	Alaska	Alcohol	Alcohol use among youth	YRBSS	%	Crude Prevalence	22.5	22.5			ALC1_1	02	19.3	26.1	“(64.84507995700051	-147.72205903599973)“
4	2013	AZ	Arizona	Alcohol	Alcohol use among youth	YRBSS	%	Crude Prevalence	36	36.0			ALC1_1	04	31.4	40.9	“(34.865970280000454	-111.76381127699972)“
5	2013	AR	Arkansas	Alcohol	Alcohol use among youth	YRBSS	%	Crude Prevalence	36.3	36.3			ALC1_1	05	32.3	40.4	“(34.74865012400045	-92.27449074299966)“
6	2013	CA	California	Alcohol	Alcohol use among youth	YRBSS	%	Crude Prevalence			Â¯	No data available	ALC1_1	06			“(37.63864012300047	-120.99999953799971)“
7	2013	CO	Colorado	Alcohol	Alcohol use among youth	YRBSS	%	Crude Prevalence			Â¯	No data available	ALC1_1	08			“(38.843840757000464	-106.13361092099967)“
8	2013	CT	Connecticut	Alcohol	Alcohol use among youth	YRBSS	%	Crude Prevalence	36.7	36.7			ALC1_1	09	32.7	41.0	“(41.56266102000046	-72.64984095199964)“
9	2013	DE	Delaware	Alcohol	Alcohol use among youth	YRBSS	%	Crude Prevalence	36.3	36.3			ALC1_1	10	33.7	39.0	“(39.008830667000495	-75.57774116799965)“
10	2013	DC	District of Columbia	Alcohol	Alcohol use among youth	YRBSS	%	Crude Prevalence	31.4	31.4			ALC1_1	11	30.2	32.5	“(38.907192	-77.036871)“
11	2013	FL	Florida	Alcohol	Alcohol use among youth	YRBSS	%	Crude Prevalence	34.8	34.8			ALC1_1	12	33.1	36.6	“(28.932040377000476	-81.92896053899966)“

I was interested in the aggregate prevalence of drinking in different age groups. I filted by the associated category and data value type. Since all the dataset values are stored as characters, I had to convert to a numeric. I then replaced the special character to “>=” using the stringr library. I grouped by the age group and summarized by the average crude prevalence. I added a sequental counter to the dataframe.

category_df <-select(initial_cdi_df, Category, Indicator,DataValueType, DataValue) %>%
    filter(Category == "Alcohol" & DataValueType == "Crude Prevalence" ) %>%
  filter(complete.cases(.)) %>%
  mutate(DataValue =  as.numeric(DataValue)) %>%
  #mutate(Indicator= (str_replace_all(Indicator, "â???¥18", ">=18")))%>%
    group_by(Indicator, DataValueType) %>%
    summarise(average = mean(DataValue, na.rm = TRUE))%>%
    arrange(Indicator)%>%
  rename(Measure = DataValueType) 
category_df$ID <- seq.int(nrow(category_df))

kable(head(category_df,10),  caption = "Alcohol Datatable")

Alcohol Datatable
Indicator	Measure	average	ID
Alcohol use among youth	Crude Prevalence	31.388889	1
Alcohol use before pregnancy	Crude Prevalence	55.408696	2
Binge drinking prevalence among adults aged â¥18 years	Crude Prevalence	16.775309	3
Binge drinking prevalence among women aged 18-44 years	Crude Prevalence	17.325926	4
Binge drinking prevalence among youth	Crude Prevalence	17.689130	5
Heavy drinking among adults aged â¥18 years	Crude Prevalence	5.972222	6
Heavy drinking among women aged 18-44 years	Crude Prevalence	5.788679	7

# Draw plot
ggplot(category_df, aes(x=reorder(ID, -average), y=average)) + 
  geom_bar(stat="identity", width=.5, fill="tomato3") +
  geom_text(aes(label=round(average)))+
  labs(title="Ranging Alcohol Use", 
       subtitle="Crude Prevalence", 
       caption="source: CDI") + 
  theme(axis.text.x = element_text(angle=65, vjust=0.6))

I was interested in the average mortality rates of asthma for males and females by geographic location. I selected the location, gender, category, confidence interval data value type. Since all the dataset values are stored as characters, I had to convert to a numeric. I grouped by the location, gender, indicator and summarized by the average asthma mortality rates.

Asthma_df <-select(initial_cdi_df, LocationAbbr, Category, Indicator,Gender, DataValueType, DataValue, HighConfidenceInterval) %>%
    filter(Category == "Asthma" & DataValueType == "Age-adjusted rate"
           & Gender != "Total" & Indicator == "Asthma mortality rate") %>%
  #filter(complete.cases(.)) %>%
  mutate(DataValue =  as.numeric(DataValue)) %>%
  mutate(HighConfidenceInterval =  as.numeric(HighConfidenceInterval)) %>%
    group_by(LocationAbbr,Gender,Indicator, DataValueType) %>%
    summarise(average = mean(DataValue, na.rm = TRUE), confidence_average = mean(HighConfidenceInterval, na.rm =TRUE))%>%
  na.omit() %>%
    arrange(Indicator)%>%
  rename(Measure = DataValueType) 

kable(head(Asthma_df,10),  caption = "Asthma Datatable")

Asthma Datatable
LocationAbbr	Gender	Indicator	Measure	average	confidence_average
AL	Female	Asthma mortality rate	Age-adjusted rate	15.370	20.630
AR	Female	Asthma mortality rate	Age-adjusted rate	15.960	22.930
AZ	Female	Asthma mortality rate	Age-adjusted rate	8.790	12.480
AZ	Male	Asthma mortality rate	Age-adjusted rate	8.220	11.950
CA	Female	Asthma mortality rate	Age-adjusted rate	13.485	16.065
CA	Male	Asthma mortality rate	Age-adjusted rate	8.880	11.305
CO	Female	Asthma mortality rate	Age-adjusted rate	13.030	18.300
CT	Female	Asthma mortality rate	Age-adjusted rate	11.630	16.600
FL	Female	Asthma mortality rate	Age-adjusted rate	9.350	11.050
FL	Male	Asthma mortality rate	Age-adjusted rate	6.370	8.180

Plot the asthma mortality rates, split by males and females

ggplot(Asthma_df, aes(x = reorder(LocationAbbr,-average), y = average, fill = Gender)) +   
                              geom_bar(stat = "identity", width = .6)

Plot the average confidence interval for the mortality rates

ggplot(Asthma_df, aes(x = reorder(LocationAbbr,-average), y = confidence_average, fill = Gender)) +   
                              geom_bar(stat = "identity", width = .6)

I was interested in the instances of melanoma by geographic location. The .readcsv delimiter separator incorrectly split columns due to the presence of a comma. I used the unite function to correct the columns. I then flited by the age-adjusted rate indicator. Converted summarize columns to numeric and omitted the NAs.

melanoma_df <-select(initial_cdi_df, LocationAbbr, Category, Indicator,Datasource,
                    DataValueUnit,DataValueType, DataValue, DataValueAlt, DataValueFootnoteSymbol) %>%
    filter( Category == "Cancer" & Indicator == "\"Invasive melanoma") %>%
  unite_("Indicator", c("Indicator","Datasource")) %>%
  unite_("DataValueUnit",c("DataValueType","DataValue")) %>%
  filter( DataValueAlt == "Average Annual Age-adjusted Rate" | 
          DataValueUnit == "Average Annual Age-adjusted Rate")%>%
  rename(DataValue = DataValueFootnoteSymbol) %>%
  mutate(DataValue =  as.numeric(DataValue)) %>%
  group_by(LocationAbbr,Indicator, DataValueAlt) %>%
    summarise(average = mean(DataValue, na.rm = TRUE))%>%
  na.omit() %>%
    arrange(Indicator)%>%
  rename(Measure = DataValueAlt) 

kable(head(melanoma_df,10),  caption = "Melanoma Datatable")

Melanoma Datatable
LocationAbbr	Indicator	Measure	average
AK	“Invasive melanoma_ incidence”	Average Annual Age-adjusted Rate	11.90000
AL	“Invasive melanoma_ incidence”	Average Annual Age-adjusted Rate	20.90000
AR	“Invasive melanoma_ incidence”	Average Annual Age-adjusted Rate	15.00000
AZ	“Invasive melanoma_ incidence”	Average Annual Age-adjusted Rate	16.65000
CO	“Invasive melanoma_ incidence”	Average Annual Age-adjusted Rate	20.90000
DC	“Invasive melanoma_ incidence”	Average Annual Age-adjusted Rate	8.60000
DE	“Invasive melanoma_ incidence”	Average Annual Age-adjusted Rate	27.20000
FL	“Invasive melanoma_ incidence”	Average Annual Age-adjusted Rate	17.77143
GA	“Invasive melanoma_ incidence”	Average Annual Age-adjusted Rate	24.00000
ID	“Invasive melanoma_ incidence”	Average Annual Age-adjusted Rate	24.80000

ggplot(head(melanoma_df,20), aes(x= reorder(LocationAbbr, -average), y= average)) +
  geom_bar(stat="identity", color="blue", fill="white") +
  geom_text(aes(label=round(average)), vjust=-0.3, size=3.5)+
  theme_minimal()+ labs(title="Top 20 States With Melanoma Cases %")

I was interested in the mortality rates of melanoma by geographic location. The .readcsv delimiter separator incorrectly split columns due to the presence of a comma. I used the unite function to correct the columns. I then flited by the age-adjusted rate indicator. Converted summarize columns to numeric and omitted the NAs.

melanomadeaths_df <-select(initial_cdi_df, LocationAbbr, Category, Indicator,Datasource,
                    DataValueUnit,DataValueType, DataValue, DataValueAlt, DataValueFootnoteSymbol) %>%
    filter( Category == "Cancer" & Indicator == "\"Melanoma") %>%
  unite_("Indicator", c("Indicator","Datasource")) %>%
  unite_("DataValueUnit",c("DataValueType","DataValue")) %>%
  filter( DataValueAlt == "Average Annual Age-adjusted Rate" | 
          DataValueUnit == "Average Annual Age-adjusted Rate")%>%
  rename(DataValue = DataValueFootnoteSymbol) %>%
  mutate(DataValue =  as.numeric(DataValue)) %>%
  group_by(LocationAbbr,Indicator, DataValueAlt) %>%
    summarise(average = mean(DataValue, na.rm = TRUE))%>%
  na.omit() %>%
    arrange(Indicator)%>%
  rename(Measure = DataValueAlt) 

kable(head(melanomadeaths_df,10),  caption = "Melanoma Mortality Datatable")

Melanoma Mortality Datatable
LocationAbbr	Indicator	Measure	average
AK	“Melanoma_ mortality”	Average Annual Age-adjusted Rate	2.3
AL	“Melanoma_ mortality”	Average Annual Age-adjusted Rate	2.9
AR	“Melanoma_ mortality”	Average Annual Age-adjusted Rate	2.7
AZ	“Melanoma_ mortality”	Average Annual Age-adjusted Rate	3.0
CA	“Melanoma_ mortality”	Average Annual Age-adjusted Rate	2.6
CO	“Melanoma_ mortality”	Average Annual Age-adjusted Rate	3.5
CT	“Melanoma_ mortality”	Average Annual Age-adjusted Rate	2.5
DC	“Melanoma_ mortality”	Average Annual Age-adjusted Rate	1.3
DE	“Melanoma_ mortality”	Average Annual Age-adjusted Rate	2.7
FL	“Melanoma_ mortality”	Average Annual Age-adjusted Rate	2.9

ggplot(head(melanomadeaths_df,20), aes(x= reorder(LocationAbbr, -average), y= average)) +
  geom_bar(stat="identity", color="red", fill="red") +
  geom_text(aes(label=round(average)), vjust=-0.3, size=3.5)+
  theme_minimal()+ labs(title="Average Melanoma Deaths By State %")

Project 2, Dataset 3: Marriage and Divorce Rates

Taken from the week 5 discussion board. National marriage and divorce rate data from National Vital Statistics System

Read in the initial dataframe. I used the na.strings to turn all blanks to NAs. This is will me quickly remove unwanted columns in the tidying process

initial_md_df<-(read.table('C:/Data 607/Project 2 Dataset/marriage_divorce.csv',  header = FALSE, sep = ",", quote = "\"" , stringsAsFactors = FALSE,fill = TRUE,na.strings = c("", "NA"))) 

#initial_md_df <- initial_md_df[!apply(is.na(initial_md_df) | initial_md_df == "", 1, all),]

kable(head(initial_md_df,10),  caption = "Initial Marriages & Divorces Datatable")

Initial Marriages & Divorces Datatable
V1	V2	V3	V4
Provisional number of marriages and marriage rate: United States, 2000-2016	NA	NA	NA
NA	NA	NA	NA
Year	Marriages	Population	Rate per 1,000 total population
2016	2,245,404	323,127,513	6.9
2015	2,221,579	321,418,820	6.9
2014/1	2,140,272	308,759,713	6.9
2013/1	2,081,301	306,136,672	6.8
2012	2,131,000	313,914,040	6.8
2011	2,118,000	311,591,917	6.8
2010	2,096,000	308,745,538	6.8

I created the marriages dataframe. I used the stringr library to replace the commas to blanks and the converted all the columns to numeric. I noticed during my initial examination of the dataset that the number of marriages exceed 1 million per year and the number of divorces are below 1 million. To make the marriages datatable I filtered population column for above 1 million.

marriages_df <-initial_md_df %>% 
               na.omit() %>%
              mutate(V2 = (str_replace_all(V2, ",", "")),
                     V3 = (str_replace_all(V3, ",", "")),
                     V4 = (str_replace_all(V4, ",", "")))%>%

             
              mutate(V1 = (str_replace_all(V1, "/1", "")),
                     V1 = (str_replace_all(V1, "/2", "")))%>%
  
                mutate(V2 =  as.numeric(V2),
                       V3 =  as.numeric(V3),
                       V1 =  as.numeric(V1),
                       V4 =  as.numeric(V4))%>%
               filter( V2 > 1000000) %>%
      
             
              rename(Year = V1, Marriages = V2, Population = V3,
                     Rate = V4)

## Warning in eval(substitute(expr), envir, enclos): NAs introduced by
## coercion

## Warning in eval(substitute(expr), envir, enclos): NAs introduced by
## coercion

## Warning in eval(substitute(expr), envir, enclos): NAs introduced by
## coercion

## Warning in eval(substitute(expr), envir, enclos): NAs introduced by
## coercion

kable(head(marriages_df,10),  caption = "Marriages Datatable")

Marriages Datatable
Year	Marriages	Population	Rate
2016	2245404	323127513	6.9
2015	2221579	321418820	6.9
2014	2140272	308759713	6.9
2013	2081301	306136672	6.8
2012	2131000	313914040	6.8
2011	2118000	311591917	6.8
2010	2096000	308745538	6.8
2009	2080000	306771529	6.8
2008	2157000	304093966	7.1
2007	2197000	301231207	7.3

theme_set(theme_bw())
ggplot(marriages_df, aes(Population, Marriages))+ 
      geom_point()+
  labs( title="Marraiges vs Population", 
       caption="Source: marriages_df")

ggplot(marriages_df, aes(x= Year, y= Marriages)) +
  geom_bar(stat="identity", color="yellow", fill="white") +
  geom_text(aes(label= Marriages), vjust=-0.3, size=2, color="blue")+
    labs(subtitle="Marriage Numbers From 2000 to 2016", 
     
       title="Marriages Per Year", 
       caption="Source: marriages_df")

ggplot(marriages_df)  + 
    geom_line(aes(x=Year, y=Rate),stat="identity", color="yellow") +
    geom_text(aes(label=Rate, x=Year, y=Rate), color="blue")+
    labs(subtitle="Rate per 1,000 total population", 
     
       title="Marriage Rates", 
       caption="Source: marriages_df")

divorse_df <-initial_md_df %>% 
               na.omit() %>%
              mutate(V2 = (str_replace_all(V2, ",", "")),
                     V3 = (str_replace_all(V3, ",", "")),
                     V4 = (str_replace_all(V4, ",", "")))%>%

             
              mutate(V1 = (str_replace_all(V1, "/1", "")),
                     V1 = (str_replace_all(V1, "/2", "")),
                     V1 = (str_replace_all(V1, "/3", "")),
                     V1 = (str_replace_all(V1, "/4", "")),
                     V1 = (str_replace_all(V1, "/5", "")),
                     V1 = (str_replace_all(V1, "/6", "")),
                     V1 = (str_replace_all(V1, "/7", "")))%>%
  
                mutate(V2 =  as.numeric(V2),
                       V3 =  as.numeric(V3),
                       V1 =  as.numeric(V1),
                       V4 =  as.numeric(V4))%>%
               filter( V2 < 1000000) %>%
      
             
              rename(Year = V1, Divorces_Annulments = V2, Population = V3,
                     Rate = V4)

## Warning in eval(substitute(expr), envir, enclos): NAs introduced by
## coercion

## Warning in eval(substitute(expr), envir, enclos): NAs introduced by
## coercion

## Warning in eval(substitute(expr), envir, enclos): NAs introduced by
## coercion

## Warning in eval(substitute(expr), envir, enclos): NAs introduced by
## coercion

kable(head(divorse_df ,10),  caption = "Divorse Datatable")

Divorse Datatable
Year	Divorces_Annulments	Population	Rate
2016	827261	257904548	3.2
2015	800909	258518265	3.1
2014	813862	256483624	3.2
2013	832157	254408815	3.3
2012	851000	248041986	3.4
2011	877000	246273366	3.6
2010	872000	244122529	3.6
2009	840000	242610561	3.5
2008	844000	240545163	3.5
2007	856000	238352850	3.6

theme_set(theme_bw())
ggplot(divorse_df, aes(Population,Divorces_Annulments))+ 
      geom_point()+
  labs(title="Divorces & Annulments vs Population", 
       caption="Source: divorse_df")

ggplot(divorse_df, aes(x= Year, y= Divorces_Annulments)) +
  geom_bar(stat="identity", color="purple", fill="white") +
  geom_text(aes(label= Divorces_Annulments), vjust=-0.3, size=2, color="blue")+
    labs(subtitle="Divorces & Annulments Numbers From 2000 to 2016", 
       title="Divorces & Annulments Per Year", 
       caption="Source: divorse_df")

ggplot(divorse_df)  + 
    geom_line(aes(x=Year, y=Rate),stat="identity", color="yellow") +
    geom_text(aes(label=Rate, x=Year, y=Rate), color="blue")+
    labs(subtitle="Rate per 1,000 total population", 
       title="Divorce Rates", 
       caption="Source: divorse_df")

I joined the reformatted divorce and marriages dataframes together so I could create a scatter plot of population of divorces vs marriages.

joined_df <- marriages_df %>%
    left_join(divorse_df, by='Year')

kable(head(joined_df ,10),  caption = "Total Datatable")

Total Datatable
Year	Marriages	Population.x	Rate.x	Divorces_Annulments	Population.y	Rate.y
2016	2245404	323127513	6.9	827261	257904548	3.2
2015	2221579	321418820	6.9	800909	258518265	3.1
2014	2140272	308759713	6.9	813862	256483624	3.2
2013	2081301	306136672	6.8	832157	254408815	3.3
2012	2131000	313914040	6.8	851000	248041986	3.4
2011	2118000	311591917	6.8	877000	246273366	3.6
2010	2096000	308745538	6.8	872000	244122529	3.6
2009	2080000	306771529	6.8	840000	242610561	3.5
2008	2157000	304093966	7.1	844000	240545163	3.5
2007	2197000	301231207	7.3	856000	238352850	3.6

theme_set(theme_bw())
ggplot(joined_df, aes(Marriages,Divorces_Annulments))+ 
      geom_point()+ 
  geom_smooth(method="lm", se=F) +
  labs(title="Marriage vs Divorce", 
       caption="Source: joined_df")

Project 2

Meaghan Burke

March 11, 2018

Project 2 Assignment

Project 2, Dataset 1: IMDB Movies

Project 2, Dataset 2: US Chronic Disease Indicators (CDI)

Project 2, Dataset 3: Marriage and Divorce Rates