library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(stringr)
When I began my journey toward becoming a data scientist, I had trouble identifying the primary targets of a data analysis. Asking the right questions and deciding what to look at, how to look at it, and what to analyze became the biggest challenge in assignments and projects. My motivation for the final project is to challenge myself and to show that I have gained the skills and knowledge needed to become a data scientist. I chose a public dataset from NYC Open Data on the leading causes of death in New York City. The dataset is rich: it reports the leading causes of death by sex and ethnicity in the city since 2007, including the year, cause of death, and sex, among other fields. In addition, I would like to include a similar dataset from another city so that I can compare the scenarios and causes of death in both cities.
nyc_chart<-read.csv("C:/Users/vitug/Downloads/New_York_City_Leading_Causes_of_Death_20240416.csv")
head(nyc_chart)
## Year Leading.Cause
## 1 2011 Nephritis, Nephrotic Syndrome and Nephrisis (N00-N07, N17-N19, N25-N27)
## 2 2009 Human Immunodeficiency Virus Disease (HIV: B20-B24)
## 3 2009 Chronic Lower Respiratory Diseases (J40-J47)
## 4 2008 Diseases of Heart (I00-I09, I11, I13, I20-I51)
## 5 2009 Alzheimer's Disease (G30)
## 6 2008 Accidents Except Drug Posioning (V01-X39, X43, X45-X59, Y85-Y86)
## Sex Race.Ethnicity Deaths Death.Rate Age.Adjusted.Death.Rate
## 1 F Black Non-Hispanic 83 7.9 6.9
## 2 F Hispanic 96 8 8.1
## 3 F Hispanic 155 12.9 16
## 4 F Hispanic 1445 122.3 160.7
## 5 F Asian and Pacific Islander 14 2.5 3.6
## 6 F Asian and Pacific Islander 36 6.8 8.5
I will start by performing exploratory data analysis on both datasets by cause of death and ethnicity. I will tidy and clean the data for analysis using the tools learned in this course, for example by handling missing values, correcting inconsistencies, adding new columns with the results of the analysis, and deleting unnecessary data from both data frames. I will then use visualization techniques such as bar charts and scatterplots to display the data in a meaningful way and draw conclusions based on the findings.
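As a minimal sketch of the cleaning steps described above, the snippet below works on a small made-up data frame (`toy`, a hypothetical name) that mirrors some of the column names shown by `head(nyc_chart)`. It assumes that missing values are recorded as `"."`, which is an assumption about the file and not something visible in the output above. The row containing `"."` is illustrative only.

```r
library(dplyr)
library(stringr)

# Hypothetical sample with the same column names as nyc_chart.
# The second row uses "." as a made-up missing-value marker (assumption).
toy <- data.frame(
  Year = c(2011, 2009, 2008),
  Leading.Cause = c(
    "  Nephritis, Nephrotic Syndrome and Nephrisis (N00-N07, N17-N19, N25-N27)",
    "Alzheimer's Disease (G30)",
    "Accidents Except Drug Posioning (V01-X39, X43, X45-X59, Y85-Y86)"
  ),
  Deaths = c("83", ".", "36"),
  Death.Rate = c("7.9", ".", "6.8"),
  stringsAsFactors = FALSE
)

toy_clean <- toy %>%
  mutate(
    # Correct inconsistencies: trim stray whitespace and fix the "Posioning" typo
    Leading.Cause = str_replace(str_squish(Leading.Cause), "Posioning", "Poisoning"),
    # Handle missing values: turn "." into NA, then convert text to numbers
    across(c(Deaths, Death.Rate), ~ as.numeric(na_if(.x, ".")))
  ) %>%
  # Delete rows that carry no death count
  filter(!is.na(Deaths))

print(toy_clean)
```

Converting the count and rate columns to numeric early should make later summaries and plots simpler, since text columns cannot be summed or averaged directly.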
colnames(nyc_chart)[2] = "Cause"
colnames(nyc_chart)[4] = "Ethnicity"
comparison_col <- colnames(nyc_chart[3:length(nyc_chart)])
comparison_col
## [1] "Sex" "Ethnicity"
## [3] "Deaths" "Death.Rate"
## [5] "Age.Adjusted.Death.Rate"
nyc_data <- nyc_chart %>% pivot_longer(cols = all_of(comparison_col), names_to = "comparison_column", values_to = "Num_of_deaths")
print(nyc_data)
## # A tibble: 5,470 × 4
## Year Cause comparison_column Num_of_deaths
## <int> <chr> <chr> <chr>
## 1 2011 Nephritis, Nephrotic Syndrome and Neph… Sex F
## 2 2011 Nephritis, Nephrotic Syndrome and Neph… Ethnicity Black Non-Hi…
## 3 2011 Nephritis, Nephrotic Syndrome and Neph… Deaths 83
## 4 2011 Nephritis, Nephrotic Syndrome and Neph… Death.Rate 7.9
## 5 2011 Nephritis, Nephrotic Syndrome and Neph… Age.Adjusted.Dea… 6.9
## 6 2009 Human Immunodeficiency Virus Disease (… Sex F
## 7 2009 Human Immunodeficiency Virus Disease (… Ethnicity Hispanic
## 8 2009 Human Immunodeficiency Virus Disease (… Deaths 96
## 9 2009 Human Immunodeficiency Virus Disease (… Death.Rate 8
## 10 2009 Human Immunodeficiency Virus Disease (… Age.Adjusted.Dea… 8.1
## # ℹ 5,460 more rows
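The long table above stacks text values (Sex, Ethnicity) and numeric values in a single character column. One possible alternative, sketched here on a two-row sample copied from the `head()` output and not part of the original analysis, is to keep `Sex` and `Ethnicity` as identifier columns and pivot only the three measures after converting them to numbers. The names `toy`, `measure_cols`, `toy_long`, `measure`, and `value` are hypothetical.

```r
library(dplyr)
library(tidyr)

# Two-row sample using the renamed columns (Cause, Ethnicity)
toy <- data.frame(
  Year = c(2011, 2009),
  Cause = c(
    "Nephritis, Nephrotic Syndrome and Nephrisis (N00-N07, N17-N19, N25-N27)",
    "Human Immunodeficiency Virus Disease (HIV: B20-B24)"
  ),
  Sex = c("F", "F"),
  Ethnicity = c("Black Non-Hispanic", "Hispanic"),
  Deaths = c("83", "96"),
  Death.Rate = c("7.9", "8"),
  Age.Adjusted.Death.Rate = c("6.9", "8.1"),
  stringsAsFactors = FALSE
)

# Only the numeric measures are pivoted; Sex and Ethnicity stay as columns
measure_cols <- c("Deaths", "Death.Rate", "Age.Adjusted.Death.Rate")

toy_long <- toy %>%
  # Convert the measures from text to numbers before stacking them
  mutate(across(all_of(measure_cols), as.numeric)) %>%
  pivot_longer(
    cols = all_of(measure_cols),
    names_to = "measure",
    values_to = "value"
  )

print(toy_long)
```

With this shape, each row still identifies its year, cause, sex, and ethnicity, so grouping by ethnicity or cause for a bar chart does not require separating text from numbers afterwards.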
By exploring this public dataset, I will gain valuable insights into the leading causes of death in NYC and NJ. The project will let me practice the skills learned in this course as I work toward my future as a data analyst, and it will challenge me to think about what data to work with, why, and how, and about which questions to ask when I perform a data analysis.