The goal of this assignment is to give you practice in preparing different datasets for downstream analysis work.
Choose any three of the wide datasets identified in the Week 5 Discussion items. For each of the three chosen datasets: Create a .CSV fil, that includes all of the information included in the dataset:
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(readr)
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
library(ggplot2)
library(knitr)
library(stringr)
library(xtable)
This untidy dataset was scraped from IMDB with a lot of interesting information on the top movies ever made. However, there is instances of missing and inconsistent information that makes using this dataset difficult in its current state. In this section, I will tidy and pivot the original data into meaningful subgroups using tidyr and dplyr.I performed all of these transformations in a row with the pipe (%>%) operator.
Reading in the original CSV dataframe:
initial_df <- read_csv("C:/Data 607/Project 2 Dataset/ds1_movie_metadata.csv")
## Parsed with column specification:
## cols(
## .default = col_integer(),
## color = col_character(),
## director_name = col_character(),
## actor_2_name = col_character(),
## genres = col_character(),
## actor_1_name = col_character(),
## movie_title = col_character(),
## actor_3_name = col_character(),
## plot_keywords = col_character(),
## movie_imdb_link = col_character(),
## language = col_character(),
## country = col_character(),
## content_rating = col_character(),
## imdb_score = col_double(),
## aspect_ratio = col_double()
## )
## See spec(...) for full column specifications.
## Warning in rbind(names(probs), probs_f): number of columns of result is not
## a multiple of vector length (arg 1)
## Warning: 4 parsing failures.
## row # A tibble: 4 x 5 col row col expected actual file expected <int> <chr> <chr> <chr> <chr> actual 1 2324 budget an integer 2400000000 'C:/Data 607/Project 2 Dataset/ds1_~ file 2 2989 budget an integer 12215500000 'C:/Data 607/Project 2 Dataset/ds1_~ row 3 3006 budget an integer 2500000000 'C:/Data 607/Project 2 Dataset/ds1_~ col 4 3860 budget an integer 4200000000 'C:/Data 607/Project 2 Dataset/ds1_~
kable(head(initial_df,5), caption = "Initial Movie Datatable")
| color | director_name | num_critic_for_reviews | duration | director_facebook_likes | actor_3_facebook_likes | actor_2_name | actor_1_facebook_likes | gross | genres | actor_1_name | movie_title | num_voted_users | cast_total_facebook_likes | actor_3_name | facenumber_in_poster | plot_keywords | movie_imdb_link | num_user_for_reviews | language | country | content_rating | budget | title_year | actor_2_facebook_likes | imdb_score | aspect_ratio | movie_facebook_likes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Color | James Cameron | 723 | 178 | 0 | 855 | Joel David Moore | 1000 | 760505847 | Action|Adventure|Fantasy|Sci-Fi | CCH Pounder | Avatar | 886204 | 4834 | Wes Studi | 0 | avatar|future|marine|native|paraplegic | http://www.imdb.com/title/tt0499549/?ref_=fn_tt_tt_1 | 3054 | English | USA | PG-13 | 2.37e+08 | 2009 | 936 | 7.9 | 1.78 | 33000 |
| Color | Gore Verbinski | 302 | 169 | 563 | 1000 | Orlando Bloom | 40000 | 309404152 | Action|Adventure|Fantasy | Johnny Depp | Pirates of the Caribbean: At World’s End | 471220 | 48350 | Jack Davenport | 0 | goddess|marriage ceremony|marriage proposal|pirate|singapore | http://www.imdb.com/title/tt0449088/?ref_=fn_tt_tt_1 | 1238 | English | USA | PG-13 | 3.00e+08 | 2007 | 5000 | 7.1 | 2.35 | 0 |
| Color | Sam Mendes | 602 | 148 | 0 | 161 | Rory Kinnear | 11000 | 200074175 | Action|Adventure|Thriller | Christoph Waltz | Spectre | 275868 | 11700 | Stephanie Sigman | 1 | bomb|espionage|sequel|spy|terrorist | http://www.imdb.com/title/tt2379713/?ref_=fn_tt_tt_1 | 994 | English | UK | PG-13 | 2.45e+08 | 2015 | 393 | 6.8 | 2.35 | 85000 |
| Color | Christopher Nolan | 813 | 164 | 22000 | 23000 | Christian Bale | 27000 | 448130642 | Action|Thriller | Tom Hardy | The Dark Knight Rises | 1144337 | 106759 | Joseph Gordon-Levitt | 0 | deception|imprisonment|lawlessness|police officer|terrorist plot | http://www.imdb.com/title/tt1345836/?ref_=fn_tt_tt_1 | 2701 | English | USA | PG-13 | 2.50e+08 | 2012 | 23000 | 8.5 | 2.35 | 164000 |
| NA | Doug Walker | NA | NA | 131 | NA | Rob Walker | 131 | NA | Documentary | Doug Walker | Star Wars: Episode VII - The Force Awakens | 8 | 143 | NA | 0 | NA | http://www.imdb.com/title/tt5289954/?ref_=fn_tt_tt_1 | NA | NA | NA | NA | NA | NA | 12 | 7.1 | NA | 0 |
I was interested in the genres included in the top 5,000 movies. Each firm had a few genres tagged. The genres were stored in one column, pipe delimited. To access the information, I used the “separate” function. I then melt, restructure, the data and remove the movie titles. What I am left with is two columns: the genre and the counter. I decided to group the data and summarize by the counter.
#Top Genres In The Top 5,000 Movies
genre <-select(initial_df, movie_title, genres) %>%
separate(genres, c("Genre_1", "Genre_2", "Genre_3", "Genre_4", "Genre_5", "Genre_6", "Genre_7"), sep = "\\|",
extra = "drop", fill = "right")%>%
melt(id.vars=c("movie_title"))%>%
select(-(variable: movie_title ))%>%
mutate(countvar =1) %>%
group_by(value) %>%
summarise(genre_sum = sum(countvar)) %>%
arrange(desc(genre_sum))%>%
filter(value != "Fi") %>%
drop_na()%>%
rename(movie_genre = value)
kable(head(genre,5), caption = "Movie Genre Datatable")
| movie_genre | genre_sum |
|---|---|
| Drama | 2594 |
| Comedy | 1872 |
| Thriller | 1408 |
| Action | 1153 |
| Romance | 1106 |
I then plot the top 10 rated genres
ggplot(head(genre,10), aes(x= reorder(movie_genre, -genre_sum), y= genre_sum)) +
geom_bar(stat="identity", color="blue", fill="white") +
geom_text(aes(label=genre_sum), vjust=-0.3, size=3.5)+
theme_minimal()+ labs(title="Top 10 Movie Genres")
I was also interested most profitable directors and their associated facebook likes. Each row in the dataset has the director’s name, the budget and the gross of each movie. I used mutate the find the net profit of firms for each director. I then grouped the data by the director and summarized the average profit and sum of facebook likes.
#Top Profitable Directors In The Top 5,000 Movies
director <-select(initial_df,director_name,director_facebook_likes, budget, gross) %>%
mutate(countvar =1) %>%
mutate(net_profit= gross - budget) %>%
group_by(director_name) %>%
#drop_na()%>%
summarise(avg_profit = mean(net_profit, na.rm = TRUE), movie_count = sum(countvar, na.rm = TRUE),
facebook_likes = sum(director_facebook_likes, na.rm = TRUE)) %>%
arrange(desc(avg_profit))
kable(head(director,5), caption = "Director Datatable")
| director_name | avg_profit | movie_count | facebook_likes |
|---|---|---|---|
| Tim Miller | 305024263 | 1 | 84 |
| George Lucas | 277328296 | 5 | 0 |
| Richard Marquand | 276625409 | 1 | 37 |
| Kyle Balda | 262029560 | 1 | 22 |
| Colin Trevorrow | 252717532 | 2 | 730 |
I was also interested the actor information. More specifically, I was interest in the number of “top rated” movies each actor stared in as welll as and their facebook likes. Each row in the dataset has the actors’ name and their facebook likes. I used rbind to merge all the temp dataframes together. I then grouped the data by the actor and summarized by the sum of facebook likes and movie counts.
df1 <- select( initial_df,actor_1_name,actor_1_facebook_likes)%>%
rename(actor_name= actor_1_name, fb_likes = actor_1_facebook_likes)
df2 <- select(initial_df, actor_2_name, actor_2_facebook_likes)%>%
rename(actor_name= actor_2_name, fb_likes = actor_2_facebook_likes)
df3 <- select( initial_df,actor_3_name,actor_3_facebook_likes)%>%
rename(actor_name= actor_3_name, fb_likes = actor_3_facebook_likes)
actors <- rbind(df1, df2, df3)%>%
mutate(countvar =1) %>%
group_by(actor_name) %>%
#drop_na()%>%
summarise(fb_likes = sum(fb_likes, na.rm = TRUE), movie_count = sum(countvar, na.rm = TRUE)) %>%
arrange(desc(fb_likes))
kable(head(actors,5), caption = "Actors Datatable")
| actor_name | fb_likes | movie_count |
|---|---|---|
| Johnny Depp | 1640000 | 41 |
| Robin Williams | 1323000 | 27 |
| Robert De Niro | 1188000 | 54 |
| Matthew Ziff | 780000 | 3 |
| J.K. Simmons | 744000 | 31 |
I was also interested most popular search keywords for the top movies. The keywords were stored in one column, pipe delimited. To access the information, I used the “separate” function. I then melted, restructured, the data and remove the movie titles. I grouped by the keyword and summarized with the sum of the counter.
keyword <-select(initial_df, movie_title, plot_keywords) %>%
separate(plot_keywords, c("key_1", "key_2", "key_3", "key_4", "key_5", "key_6", "key_7", "key_8",
"key_9", "key_10"), sep = "\\|",
extra = "drop", fill = "right")%>%
melt(id.vars=c("movie_title"))%>%
select(-(variable: movie_title ))%>%
mutate(countvar =1) %>%
group_by(value) %>%
summarise(key_word = sum(countvar)) %>%
arrange(desc(key_word))%>%
drop_na()
kable(head(keyword,10), caption = "Keyword Datatable")
| value | key_word |
|---|---|
| love | 198 |
| friend | 166 |
| murder | 161 |
| death | 132 |
| police | 126 |
| new york city | 91 |
| high school | 89 |
| alien | 82 |
| school | 73 |
| boy | 72 |
Plotted the top 15 keywords
ggplot(head(keyword, 15), aes(reorder(x=value,-key_word), y=key_word)) +
geom_point(size=3) +
geom_segment(aes(x=value,
xend=value,
y=0,
yend=key_word)) +
labs(title="Keyword Chart",
subtitle="Search Keyword Popularity") +
theme(axis.text.x = element_text(angle=65, vjust=0.6))
From the description provided on the discussion board: The dataset provided by fed for US chronic disease indicator (CDI) is an interesting dataset. The dataset actually outlines that how different states are impacted by certain types of disease category along with clear indicators such as alcohol use among youth, Binge drinking prevalence among adults aged ???18 years, Heavy drinking among adults aged ???18 years, Chronic liver disease mortality etc.
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:reshape2':
##
## dcast, melt
## The following objects are masked from 'package:dplyr':
##
## between, first, last
initial_cdi_df<-(read.table('https://raw.githubusercontent.com/niteen11/MSDS/master/DATA607/Week5/dataset/U.S._Chronic_Disease_Indicators__CDI.csv', header = FALSE, sep = ",", quote = "", stringsAsFactors = FALSE,fill = TRUE))
names(initial_cdi_df) <- c("Year", "LocationAbbr","LocationDesc","Category","Indicator","Datasource","DataValueUnit","DataValueType","DataValue",
"DataValueAlt","DataValueFootnoteSymbol","DataValueFootnote","Gender","StratificationID1","IndicatorID","LocationID","LowConfidenceInterval",
"HighConfidenceInterval","GeoLocation")
initial_cdi_df <- initial_cdi_df[-1,]
kable(head(initial_cdi_df,10), caption = "Initial CDI Datatable")
| Year | LocationAbbr | LocationDesc | Category | Indicator | Datasource | DataValueUnit | DataValueType | DataValue | DataValueAlt | DataValueFootnoteSymbol | DataValueFootnote | Gender | StratificationID1 | IndicatorID | LocationID | LowConfidenceInterval | HighConfidenceInterval | GeoLocation | NA | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 2013 | AL | Alabama | Alcohol | Alcohol use among youth | YRBSS | % | Crude Prevalence | 35 | 35.0 | ALC1_1 | 01 | 30.1 | 40.3 | “(32.84057112200048 | -86.63186076199969)“ | ||||
| 3 | 2013 | AK | Alaska | Alcohol | Alcohol use among youth | YRBSS | % | Crude Prevalence | 22.5 | 22.5 | ALC1_1 | 02 | 19.3 | 26.1 | “(64.84507995700051 | -147.72205903599973)“ | ||||
| 4 | 2013 | AZ | Arizona | Alcohol | Alcohol use among youth | YRBSS | % | Crude Prevalence | 36 | 36.0 | ALC1_1 | 04 | 31.4 | 40.9 | “(34.865970280000454 | -111.76381127699972)“ | ||||
| 5 | 2013 | AR | Arkansas | Alcohol | Alcohol use among youth | YRBSS | % | Crude Prevalence | 36.3 | 36.3 | ALC1_1 | 05 | 32.3 | 40.4 | “(34.74865012400045 | -92.27449074299966)“ | ||||
| 6 | 2013 | CA | California | Alcohol | Alcohol use among youth | YRBSS | % | Crude Prevalence | ¯ | No data available | ALC1_1 | 06 | “(37.63864012300047 | -120.99999953799971)“ | ||||||
| 7 | 2013 | CO | Colorado | Alcohol | Alcohol use among youth | YRBSS | % | Crude Prevalence | ¯ | No data available | ALC1_1 | 08 | “(38.843840757000464 | -106.13361092099967)“ | ||||||
| 8 | 2013 | CT | Connecticut | Alcohol | Alcohol use among youth | YRBSS | % | Crude Prevalence | 36.7 | 36.7 | ALC1_1 | 09 | 32.7 | 41.0 | “(41.56266102000046 | -72.64984095199964)“ | ||||
| 9 | 2013 | DE | Delaware | Alcohol | Alcohol use among youth | YRBSS | % | Crude Prevalence | 36.3 | 36.3 | ALC1_1 | 10 | 33.7 | 39.0 | “(39.008830667000495 | -75.57774116799965)“ | ||||
| 10 | 2013 | DC | District of Columbia | Alcohol | Alcohol use among youth | YRBSS | % | Crude Prevalence | 31.4 | 31.4 | ALC1_1 | 11 | 30.2 | 32.5 | “(38.907192 | -77.036871)“ | ||||
| 11 | 2013 | FL | Florida | Alcohol | Alcohol use among youth | YRBSS | % | Crude Prevalence | 34.8 | 34.8 | ALC1_1 | 12 | 33.1 | 36.6 | “(28.932040377000476 | -81.92896053899966)“ |
I was interested in the aggregate prevalence of drinking in different age groups. I filted by the associated category and data value type. Since all the dataset values are stored as characters, I had to convert to a numeric. I then replaced the special character to “>=” using the stringr library. I grouped by the age group and summarized by the average crude prevalence. I added a sequental counter to the dataframe.
category_df <-select(initial_cdi_df, Category, Indicator,DataValueType, DataValue) %>%
filter(Category == "Alcohol" & DataValueType == "Crude Prevalence" ) %>%
filter(complete.cases(.)) %>%
mutate(DataValue = as.numeric(DataValue)) %>%
#mutate(Indicator= (str_replace_all(Indicator, "â???¥18", ">=18")))%>%
group_by(Indicator, DataValueType) %>%
summarise(average = mean(DataValue, na.rm = TRUE))%>%
arrange(Indicator)%>%
rename(Measure = DataValueType)
category_df$ID <- seq.int(nrow(category_df))
kable(head(category_df,10), caption = "Alcohol Datatable")
| Indicator | Measure | average | ID |
|---|---|---|---|
| Alcohol use among youth | Crude Prevalence | 31.388889 | 1 |
| Alcohol use before pregnancy | Crude Prevalence | 55.408696 | 2 |
| Binge drinking prevalence among adults aged â¥18 years | Crude Prevalence | 16.775309 | 3 |
| Binge drinking prevalence among women aged 18-44 years | Crude Prevalence | 17.325926 | 4 |
| Binge drinking prevalence among youth | Crude Prevalence | 17.689130 | 5 |
| Heavy drinking among adults aged â¥18 years | Crude Prevalence | 5.972222 | 6 |
| Heavy drinking among women aged 18-44 years | Crude Prevalence | 5.788679 | 7 |
# Draw plot
ggplot(category_df, aes(x=reorder(ID, -average), y=average)) +
geom_bar(stat="identity", width=.5, fill="tomato3") +
geom_text(aes(label=round(average)))+
labs(title="Ranging Alcohol Use",
subtitle="Crude Prevalence",
caption="source: CDI") +
theme(axis.text.x = element_text(angle=65, vjust=0.6))
I was interested in the average mortality rates of asthma for males and females by geographic location. I selected the location, gender, category, confidence interval data value type. Since all the dataset values are stored as characters, I had to convert to a numeric. I grouped by the location, gender, indicator and summarized by the average asthma mortality rates.
Asthma_df <-select(initial_cdi_df, LocationAbbr, Category, Indicator,Gender, DataValueType, DataValue, HighConfidenceInterval) %>%
filter(Category == "Asthma" & DataValueType == "Age-adjusted rate"
& Gender != "Total" & Indicator == "Asthma mortality rate") %>%
#filter(complete.cases(.)) %>%
mutate(DataValue = as.numeric(DataValue)) %>%
mutate(HighConfidenceInterval = as.numeric(HighConfidenceInterval)) %>%
group_by(LocationAbbr,Gender,Indicator, DataValueType) %>%
summarise(average = mean(DataValue, na.rm = TRUE), confidence_average = mean(HighConfidenceInterval, na.rm =TRUE))%>%
na.omit() %>%
arrange(Indicator)%>%
rename(Measure = DataValueType)
kable(head(Asthma_df,10), caption = "Asthma Datatable")
| LocationAbbr | Gender | Indicator | Measure | average | confidence_average |
|---|---|---|---|---|---|
| AL | Female | Asthma mortality rate | Age-adjusted rate | 15.370 | 20.630 |
| AR | Female | Asthma mortality rate | Age-adjusted rate | 15.960 | 22.930 |
| AZ | Female | Asthma mortality rate | Age-adjusted rate | 8.790 | 12.480 |
| AZ | Male | Asthma mortality rate | Age-adjusted rate | 8.220 | 11.950 |
| CA | Female | Asthma mortality rate | Age-adjusted rate | 13.485 | 16.065 |
| CA | Male | Asthma mortality rate | Age-adjusted rate | 8.880 | 11.305 |
| CO | Female | Asthma mortality rate | Age-adjusted rate | 13.030 | 18.300 |
| CT | Female | Asthma mortality rate | Age-adjusted rate | 11.630 | 16.600 |
| FL | Female | Asthma mortality rate | Age-adjusted rate | 9.350 | 11.050 |
| FL | Male | Asthma mortality rate | Age-adjusted rate | 6.370 | 8.180 |
Plot the asthma mortality rates, split by males and females
ggplot(Asthma_df, aes(x = reorder(LocationAbbr,-average), y = average, fill = Gender)) +
geom_bar(stat = "identity", width = .6)
Plot the average confidence interval for the mortality rates
ggplot(Asthma_df, aes(x = reorder(LocationAbbr,-average), y = confidence_average, fill = Gender)) +
geom_bar(stat = "identity", width = .6)
I was interested in the instances of melanoma by geographic location. The .readcsv delimiter separator incorrectly split columns due to the presence of a comma. I used the unite function to correct the columns. I then flited by the age-adjusted rate indicator. Converted summarize columns to numeric and omitted the NAs.
melanoma_df <-select(initial_cdi_df, LocationAbbr, Category, Indicator,Datasource,
DataValueUnit,DataValueType, DataValue, DataValueAlt, DataValueFootnoteSymbol) %>%
filter( Category == "Cancer" & Indicator == "\"Invasive melanoma") %>%
unite_("Indicator", c("Indicator","Datasource")) %>%
unite_("DataValueUnit",c("DataValueType","DataValue")) %>%
filter( DataValueAlt == "Average Annual Age-adjusted Rate" |
DataValueUnit == "Average Annual Age-adjusted Rate")%>%
rename(DataValue = DataValueFootnoteSymbol) %>%
mutate(DataValue = as.numeric(DataValue)) %>%
group_by(LocationAbbr,Indicator, DataValueAlt) %>%
summarise(average = mean(DataValue, na.rm = TRUE))%>%
na.omit() %>%
arrange(Indicator)%>%
rename(Measure = DataValueAlt)
kable(head(melanoma_df,10), caption = "Melanoma Datatable")
| LocationAbbr | Indicator | Measure | average |
|---|---|---|---|
| AK | “Invasive melanoma_ incidence” | Average Annual Age-adjusted Rate | 11.90000 |
| AL | “Invasive melanoma_ incidence” | Average Annual Age-adjusted Rate | 20.90000 |
| AR | “Invasive melanoma_ incidence” | Average Annual Age-adjusted Rate | 15.00000 |
| AZ | “Invasive melanoma_ incidence” | Average Annual Age-adjusted Rate | 16.65000 |
| CO | “Invasive melanoma_ incidence” | Average Annual Age-adjusted Rate | 20.90000 |
| DC | “Invasive melanoma_ incidence” | Average Annual Age-adjusted Rate | 8.60000 |
| DE | “Invasive melanoma_ incidence” | Average Annual Age-adjusted Rate | 27.20000 |
| FL | “Invasive melanoma_ incidence” | Average Annual Age-adjusted Rate | 17.77143 |
| GA | “Invasive melanoma_ incidence” | Average Annual Age-adjusted Rate | 24.00000 |
| ID | “Invasive melanoma_ incidence” | Average Annual Age-adjusted Rate | 24.80000 |
ggplot(head(melanoma_df,20), aes(x= reorder(LocationAbbr, -average), y= average)) +
geom_bar(stat="identity", color="blue", fill="white") +
geom_text(aes(label=round(average)), vjust=-0.3, size=3.5)+
theme_minimal()+ labs(title="Top 20 States With Melanoma Cases %")
I was interested in the mortality rates of melanoma by geographic location. The .readcsv delimiter separator incorrectly split columns due to the presence of a comma. I used the unite function to correct the columns. I then flited by the age-adjusted rate indicator. Converted summarize columns to numeric and omitted the NAs.
melanomadeaths_df <-select(initial_cdi_df, LocationAbbr, Category, Indicator,Datasource,
DataValueUnit,DataValueType, DataValue, DataValueAlt, DataValueFootnoteSymbol) %>%
filter( Category == "Cancer" & Indicator == "\"Melanoma") %>%
unite_("Indicator", c("Indicator","Datasource")) %>%
unite_("DataValueUnit",c("DataValueType","DataValue")) %>%
filter( DataValueAlt == "Average Annual Age-adjusted Rate" |
DataValueUnit == "Average Annual Age-adjusted Rate")%>%
rename(DataValue = DataValueFootnoteSymbol) %>%
mutate(DataValue = as.numeric(DataValue)) %>%
group_by(LocationAbbr,Indicator, DataValueAlt) %>%
summarise(average = mean(DataValue, na.rm = TRUE))%>%
na.omit() %>%
arrange(Indicator)%>%
rename(Measure = DataValueAlt)
kable(head(melanomadeaths_df,10), caption = "Melanoma Mortality Datatable")
| LocationAbbr | Indicator | Measure | average |
|---|---|---|---|
| AK | “Melanoma_ mortality” | Average Annual Age-adjusted Rate | 2.3 |
| AL | “Melanoma_ mortality” | Average Annual Age-adjusted Rate | 2.9 |
| AR | “Melanoma_ mortality” | Average Annual Age-adjusted Rate | 2.7 |
| AZ | “Melanoma_ mortality” | Average Annual Age-adjusted Rate | 3.0 |
| CA | “Melanoma_ mortality” | Average Annual Age-adjusted Rate | 2.6 |
| CO | “Melanoma_ mortality” | Average Annual Age-adjusted Rate | 3.5 |
| CT | “Melanoma_ mortality” | Average Annual Age-adjusted Rate | 2.5 |
| DC | “Melanoma_ mortality” | Average Annual Age-adjusted Rate | 1.3 |
| DE | “Melanoma_ mortality” | Average Annual Age-adjusted Rate | 2.7 |
| FL | “Melanoma_ mortality” | Average Annual Age-adjusted Rate | 2.9 |
ggplot(head(melanomadeaths_df,20), aes(x= reorder(LocationAbbr, -average), y= average)) +
geom_bar(stat="identity", color="red", fill="red") +
geom_text(aes(label=round(average)), vjust=-0.3, size=3.5)+
theme_minimal()+ labs(title="Average Melanoma Deaths By State %")
Taken from the week 5 discussion board. National marriage and divorce rate data from National Vital Statistics System
Read in the initial dataframe. I used the na.strings to turn all blanks to NAs. This is will me quickly remove unwanted columns in the tidying process
initial_md_df<-(read.table('C:/Data 607/Project 2 Dataset/marriage_divorce.csv', header = FALSE, sep = ",", quote = "\"" , stringsAsFactors = FALSE,fill = TRUE,na.strings = c("", "NA")))
#initial_md_df <- initial_md_df[!apply(is.na(initial_md_df) | initial_md_df == "", 1, all),]
kable(head(initial_md_df,10), caption = "Initial Marriages & Divorces Datatable")
| V1 | V2 | V3 | V4 |
|---|---|---|---|
| Provisional number of marriages and marriage rate: United States, 2000-2016 | NA | NA | NA |
| NA | NA | NA | NA |
| Year | Marriages | Population | Rate per 1,000 total population |
| 2016 | 2,245,404 | 323,127,513 | 6.9 |
| 2015 | 2,221,579 | 321,418,820 | 6.9 |
| 2014/1 | 2,140,272 | 308,759,713 | 6.9 |
| 2013/1 | 2,081,301 | 306,136,672 | 6.8 |
| 2012 | 2,131,000 | 313,914,040 | 6.8 |
| 2011 | 2,118,000 | 311,591,917 | 6.8 |
| 2010 | 2,096,000 | 308,745,538 | 6.8 |
I created the marriages dataframe. I used the stringr library to replace the commas to blanks and the converted all the columns to numeric. I noticed during my initial examination of the dataset that the number of marriages exceed 1 million per year and the number of divorces are below 1 million. To make the marriages datatable I filtered population column for above 1 million.
marriages_df <-initial_md_df %>%
na.omit() %>%
mutate(V2 = (str_replace_all(V2, ",", "")),
V3 = (str_replace_all(V3, ",", "")),
V4 = (str_replace_all(V4, ",", "")))%>%
mutate(V1 = (str_replace_all(V1, "/1", "")),
V1 = (str_replace_all(V1, "/2", "")))%>%
mutate(V2 = as.numeric(V2),
V3 = as.numeric(V3),
V1 = as.numeric(V1),
V4 = as.numeric(V4))%>%
filter( V2 > 1000000) %>%
rename(Year = V1, Marriages = V2, Population = V3,
Rate = V4)
## Warning in eval(substitute(expr), envir, enclos): NAs introduced by
## coercion
## Warning in eval(substitute(expr), envir, enclos): NAs introduced by
## coercion
## Warning in eval(substitute(expr), envir, enclos): NAs introduced by
## coercion
## Warning in eval(substitute(expr), envir, enclos): NAs introduced by
## coercion
kable(head(marriages_df,10), caption = "Marriages Datatable")
| Year | Marriages | Population | Rate |
|---|---|---|---|
| 2016 | 2245404 | 323127513 | 6.9 |
| 2015 | 2221579 | 321418820 | 6.9 |
| 2014 | 2140272 | 308759713 | 6.9 |
| 2013 | 2081301 | 306136672 | 6.8 |
| 2012 | 2131000 | 313914040 | 6.8 |
| 2011 | 2118000 | 311591917 | 6.8 |
| 2010 | 2096000 | 308745538 | 6.8 |
| 2009 | 2080000 | 306771529 | 6.8 |
| 2008 | 2157000 | 304093966 | 7.1 |
| 2007 | 2197000 | 301231207 | 7.3 |
theme_set(theme_bw())
ggplot(marriages_df, aes(Population, Marriages))+
geom_point()+
labs( title="Marraiges vs Population",
caption="Source: marriages_df")
ggplot(marriages_df, aes(x= Year, y= Marriages)) +
geom_bar(stat="identity", color="yellow", fill="white") +
geom_text(aes(label= Marriages), vjust=-0.3, size=2, color="blue")+
labs(subtitle="Marriage Numbers From 2000 to 2016",
title="Marriages Per Year",
caption="Source: marriages_df")
ggplot(marriages_df) +
geom_line(aes(x=Year, y=Rate),stat="identity", color="yellow") +
geom_text(aes(label=Rate, x=Year, y=Rate), color="blue")+
labs(subtitle="Rate per 1,000 total population",
title="Marriage Rates",
caption="Source: marriages_df")
divorse_df <-initial_md_df %>%
na.omit() %>%
mutate(V2 = (str_replace_all(V2, ",", "")),
V3 = (str_replace_all(V3, ",", "")),
V4 = (str_replace_all(V4, ",", "")))%>%
mutate(V1 = (str_replace_all(V1, "/1", "")),
V1 = (str_replace_all(V1, "/2", "")),
V1 = (str_replace_all(V1, "/3", "")),
V1 = (str_replace_all(V1, "/4", "")),
V1 = (str_replace_all(V1, "/5", "")),
V1 = (str_replace_all(V1, "/6", "")),
V1 = (str_replace_all(V1, "/7", "")))%>%
mutate(V2 = as.numeric(V2),
V3 = as.numeric(V3),
V1 = as.numeric(V1),
V4 = as.numeric(V4))%>%
filter( V2 < 1000000) %>%
rename(Year = V1, Divorces_Annulments = V2, Population = V3,
Rate = V4)
## Warning in eval(substitute(expr), envir, enclos): NAs introduced by
## coercion
## Warning in eval(substitute(expr), envir, enclos): NAs introduced by
## coercion
## Warning in eval(substitute(expr), envir, enclos): NAs introduced by
## coercion
## Warning in eval(substitute(expr), envir, enclos): NAs introduced by
## coercion
kable(head(divorse_df ,10), caption = "Divorse Datatable")
| Year | Divorces_Annulments | Population | Rate |
|---|---|---|---|
| 2016 | 827261 | 257904548 | 3.2 |
| 2015 | 800909 | 258518265 | 3.1 |
| 2014 | 813862 | 256483624 | 3.2 |
| 2013 | 832157 | 254408815 | 3.3 |
| 2012 | 851000 | 248041986 | 3.4 |
| 2011 | 877000 | 246273366 | 3.6 |
| 2010 | 872000 | 244122529 | 3.6 |
| 2009 | 840000 | 242610561 | 3.5 |
| 2008 | 844000 | 240545163 | 3.5 |
| 2007 | 856000 | 238352850 | 3.6 |
theme_set(theme_bw())
ggplot(divorse_df, aes(Population,Divorces_Annulments))+
geom_point()+
labs(title="Divorces & Annulments vs Population",
caption="Source: divorse_df")
ggplot(divorse_df, aes(x= Year, y= Divorces_Annulments)) +
geom_bar(stat="identity", color="purple", fill="white") +
geom_text(aes(label= Divorces_Annulments), vjust=-0.3, size=2, color="blue")+
labs(subtitle="Divorces & Annulments Numbers From 2000 to 2016",
title="Divorces & Annulments Per Year",
caption="Source: divorse_df")
ggplot(divorse_df) +
geom_line(aes(x=Year, y=Rate),stat="identity", color="yellow") +
geom_text(aes(label=Rate, x=Year, y=Rate), color="blue")+
labs(subtitle="Rate per 1,000 total population",
title="Divorce Rates",
caption="Source: divorse_df")
I joined the reformatted divorce and marriages dataframes together so I could create a scatter plot of population of divorces vs marriages.
joined_df <- marriages_df %>%
left_join(divorse_df, by='Year')
kable(head(joined_df ,10), caption = "Total Datatable")
| Year | Marriages | Population.x | Rate.x | Divorces_Annulments | Population.y | Rate.y |
|---|---|---|---|---|---|---|
| 2016 | 2245404 | 323127513 | 6.9 | 827261 | 257904548 | 3.2 |
| 2015 | 2221579 | 321418820 | 6.9 | 800909 | 258518265 | 3.1 |
| 2014 | 2140272 | 308759713 | 6.9 | 813862 | 256483624 | 3.2 |
| 2013 | 2081301 | 306136672 | 6.8 | 832157 | 254408815 | 3.3 |
| 2012 | 2131000 | 313914040 | 6.8 | 851000 | 248041986 | 3.4 |
| 2011 | 2118000 | 311591917 | 6.8 | 877000 | 246273366 | 3.6 |
| 2010 | 2096000 | 308745538 | 6.8 | 872000 | 244122529 | 3.6 |
| 2009 | 2080000 | 306771529 | 6.8 | 840000 | 242610561 | 3.5 |
| 2008 | 2157000 | 304093966 | 7.1 | 844000 | 240545163 | 3.5 |
| 2007 | 2197000 | 301231207 | 7.3 | 856000 | 238352850 | 3.6 |
theme_set(theme_bw())
ggplot(joined_df, aes(Marriages,Divorces_Annulments))+
geom_point()+
geom_smooth(method="lm", se=F) +
labs(title="Marriage vs Divorce",
caption="Source: joined_df")