DATA 607 - Project # 3

Vladimir Nimchenko

INTRODUCTION:

The Global Burden of Disease data set (I chose only a small subset of it for the purposes of this project) shows how many deaths, in millions, are attributed to each of three causes (communicable diseases, injuries, and non-communicable diseases), broken down by gender and year (1990 and 2017). I will transform and tidy my data in two ways because I want to analyze it from two different perspectives. First, I want to compare the total number of deaths by gender. Second, I want to compare the total number of deaths for each disease type.

DATA LOAD

library(tidyr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
 #Manually created a csv from the chart, uploaded it to GitHub, and loaded it into an object called "disease_burden"
 disease_burden <- read.csv("https://raw.githubusercontent.com/GitHub-Vlad/Data-Science/main/Global%20Burden%20of%20Disease.csv",header = TRUE)

Data Tidying/Transformation 1

This prepares the data for my analysis of the total number of female and male deaths, summed over both years (1990 and 2017) and all diseases.

The data is currently in wide format. We must first pivot the data into long format so that we can analyze it. I will also separate the pivoted column, whose values combine gender and year, into two columns: “gender” and “year”. For example, the value “Female_1990” is split into “Female” and “1990”. This makes the data much easier to prepare for analysis.

#Pivoting the data from a wide format to a long format
disease_burden_gender <-disease_burden %>% pivot_longer(cols=c("Female_1990","Female_2017","Male_1990","Male_2017"),
                    names_to="gender",
                    values_to="deaths_in_millions")%>%

#Separating the "gender" column into two columns: "gender" and "year"  
separate(gender, into = c("gender", "year"), sep = "_", convert = TRUE) 

#remove the year column by name (less fragile than a column position).
disease_burden_gender <- disease_burden_gender %>% select(-year)
print(disease_burden_gender)

## # A tibble: 12 x 3
##    cause                     gender deaths_in_millions
##    <chr>                     <chr>               <dbl>
##  1 Communicable diseases     Female               7.3 
##  2 Communicable diseases     Female               4.91
##  3 Communicable diseases     Male                 8.06
##  4 Communicable diseases     Male                 5.47
##  5 Injuries                  Female               1.41
##  6 Injuries                  Female               1.42
##  7 Injuries                  Male                 2.84
##  8 Injuries                  Male                 3.05
##  9 Non-communicable diseases Female              12.8 
## 10 Non-communicable diseases Female              19.2 
## 11 Non-communicable diseases Male                13.9 
## 12 Non-communicable diseases Male                21.7

#remove the disease ("cause") column by name.
disease_burden_gender <- disease_burden_gender %>% select(-cause)

 #summing number of deaths by gender. 
 disease_burden_gender<-aggregate(deaths_in_millions ~ gender , disease_burden_gender, sum)
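The pivot, separate, and sum steps above can also be sketched as a single pipeline. This is only an illustration under stated assumptions: `disease_burden_demo` and `gender_totals` are hypothetical names, and the demo table is typed in from the rounded values printed in this report rather than read from the CSV. It relies on the `names_sep` argument of `pivot_longer()` to split names such as "Female_1990" during the pivot itself.

```r
library(tidyr)
library(dplyr)

# Hypothetical stand-in for the wide table; values are the rounded
# figures printed in this report, not the raw CSV contents.
disease_burden_demo <- data.frame(
  cause       = c("Communicable diseases", "Injuries", "Non-communicable diseases"),
  Female_1990 = c(7.30, 1.41, 12.8),
  Female_2017 = c(4.91, 1.42, 19.2),
  Male_1990   = c(8.06, 2.84, 13.9),
  Male_2017   = c(5.47, 3.05, 21.7)
)

gender_totals <- disease_burden_demo %>%
  # Pivot to long format and split "Female_1990" into gender and year in one call.
  pivot_longer(
    cols = -cause,
    names_to = c("gender", "year"),
    names_sep = "_",
    values_to = "deaths_in_millions"
  ) %>%
  # Sum deaths over all causes and both years for each gender.
  group_by(gender) %>%
  summarise(deaths_in_millions = sum(deaths_in_millions))

print(gender_totals)
```

Because the demo values are rounded, the totals it produces differ slightly from the 46.99 and 55.07 reported from the full-precision data.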

Data Tidying/Transformation 2

This prepares the data for my analysis of the total number of deaths for each disease, summed over both years (1990 and 2017) and both genders.

The data is currently in wide format. We must first pivot the data into long format so that we can analyze it. Afterwards, I will remove the gender and year columns because they are not needed for this analysis. Finally, I will sum the deaths by cause to get the data into the necessary format for this analysis.

#Pivoting the data from a wide format to a long format
disease_burden_year <-disease_burden %>% pivot_longer(cols=c("Female_1990","Female_2017","Male_1990","Male_2017"),
                    names_to="gender",
                    values_to="deaths_in_millions")%>%
#Separating the "gender" column into two columns: "gender" and "year"  
separate(gender, into = c("gender", "year"), sep = "_", convert = TRUE) 


#remove the gender and year columns by name.
disease_burden_year <- disease_burden_year %>% select(-gender, -year)


 #summing number of deaths by disease. 
 disease_burden_year<-aggregate(deaths_in_millions ~ cause , disease_burden_year, sum)
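Since both transformations sum the same twelve values, just grouped differently, the gender totals and the disease totals should add up to the same grand total. A minimal base R sketch of that cross-check, computed straight from the wide table, might look like the following. The names `disease_burden_demo`, `cause_totals`, and `gender_totals` are hypothetical, and the values are the rounded figures printed in this report.

```r
# Hypothetical stand-in for the wide table (rounded values from this report).
disease_burden_demo <- data.frame(
  cause       = c("Communicable diseases", "Injuries", "Non-communicable diseases"),
  Female_1990 = c(7.30, 1.41, 12.8),
  Female_2017 = c(4.91, 1.42, 19.2),
  Male_1990   = c(8.06, 2.84, 13.9),
  Male_2017   = c(5.47, 3.05, 21.7)
)

# Total per cause: sum across the four gender/year columns of each row.
cause_totals <- rowSums(disease_burden_demo[, -1])
names(cause_totals) <- disease_burden_demo$cause

# Total per gender: sum the two year columns belonging to each gender.
gender_totals <- c(
  Female = sum(disease_burden_demo$Female_1990, disease_burden_demo$Female_2017),
  Male   = sum(disease_burden_demo$Male_1990, disease_burden_demo$Male_2017)
)

# Both groupings must account for the same grand total.
stopifnot(isTRUE(all.equal(sum(cause_totals), sum(gender_totals))))
```

If the check fails, a column was probably dropped or double counted in one of the transformations.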

DATA ANALYSIS 1

For my first analysis, I will create a bar plot to compare the total number of deaths for each gender.

#creating a gender vector to represent the x-axis (same order as the aggregated rows).
disease_gender <- c("Female","Male")

#creating a values vector to represent the y-axis
deaths_in_millions <- disease_burden_gender$deaths_in_millions

print(disease_burden_gender)

##   gender deaths_in_millions
## 1 Female              46.99
## 2   Male              55.07

#plotting the bar graph
barplot(deaths_in_millions,names.arg=disease_gender,xlab="Gender",ylab="Deaths in Millions",col="blue",
main="Number of Deaths by Gender",border="red")

My first analysis compares the total number of deaths by gender. This comparison is important because it reveals which gender accounts for more deaths overall. In this data, males account for more deaths than females (55.07 million versus 46.99 million across 1990 and 2017 combined). Knowing this opens up a variety of interesting questions: Why do more males die than females? Does it have to do with their biological makeup? Answering such questions could help explain, and eventually reduce, these numbers.
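Since ggplot2 is already loaded in this report, the same comparison could also be drawn with it. The sketch below is one possible version, not the plot used above. `gender_totals` and `p` are hypothetical names, and the two totals are copied from the printed output of the aggregated data.

```r
library(ggplot2)

# Gender totals as printed in this report (deaths in millions,
# 1990 and 2017 combined).
gender_totals <- data.frame(
  gender = c("Female", "Male"),
  deaths_in_millions = c(46.99, 55.07)
)

# geom_col() draws bar heights directly from the y values.
p <- ggplot(gender_totals, aes(x = gender, y = deaths_in_millions)) +
  geom_col(fill = "blue", colour = "red") +
  labs(
    title = "Number of Deaths by Gender",
    x = "Gender",
    y = "Deaths in Millions"
  )

print(p)
```

Mapping the labels and the values from the same data frame keeps each bar tied to its own label, which a separate hand-typed label vector cannot guarantee.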

DATA ANALYSIS 2

For my second analysis, I will create a bar plot to compare the total number of deaths for each disease type.

#creating a disease vector to represent the x-axis.
#aggregate() returns the causes in alphabetical order, so the labels must follow that order.
disease_type <- c("Communicable","Injuries","Non-Communicable")

#creating a values vector to represent the y-axis
deaths_in_millions <- disease_burden_year$deaths_in_millions

#plotting the bar graph
barplot(deaths_in_millions,names.arg=disease_type,xlab="Disease Type",ylab="Deaths in Millions",col="blue",
main="Number of Deaths by Disease",border="red")

My second analysis compares the total number of deaths by disease type. It is very important to understand the number of deaths from each disease type to get an idea of which ones should be addressed first. In our case, we see that there are many more deaths from non-communicable diseases than from any other type. With this visual, we can now focus on this type and break it down further into specific diseases.
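One detail worth checking in a plot like this is that each bar carries the right label. `aggregate()` returns one row per group, ordered by the grouping variable, so labels are safest when derived from the aggregated data rather than typed by hand. A minimal sketch follows, assuming a hypothetical long-format table `deaths_long` built from the rounded values printed in this report. `cause_totals` and `bar_labels` are also illustrative names.

```r
# Hypothetical long-format data (rounded values from this report).
deaths_long <- data.frame(
  cause = rep(
    c("Communicable diseases", "Injuries", "Non-communicable diseases"),
    each = 4
  ),
  deaths_in_millions = c(
    7.30, 4.91, 8.06, 5.47,   # communicable diseases
    1.41, 1.42, 2.84, 3.05,   # injuries
    12.8, 19.2, 13.9, 21.7    # non-communicable diseases
  )
)

# Sum deaths per cause; rows come back ordered by cause.
cause_totals <- aggregate(deaths_in_millions ~ cause, deaths_long, sum)

# Shorten the labels, but keep the row order produced by aggregate().
bar_labels <- sub(" diseases$", "", cause_totals$cause)

barplot(
  cause_totals$deaths_in_millions,
  names.arg = bar_labels,
  xlab = "Disease Type",
  ylab = "Deaths in Millions",
  main = "Number of Deaths by Disease",
  col = "blue",
  border = "red"
)
```

Deriving `bar_labels` from `cause_totals` means the labels stay correct even if a cause is added or renamed in the data.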

CONCLUSION:

After conducting two different analyses on one data set, it became clear that the analyses (visuals) open a gateway to many questions whose answers can help lower the number of deaths. Knowing the right questions to ask is one of the most important parts, if not the most important part, of problem solving. I also learned that it is very important to break a data set into various subsets in order to analyze it from different angles (perspectives). This provides crucial information that you could not obtain by looking at the data set from only one perspective. In my example, knowing that more males than females died, and that non-communicable diseases caused the most deaths, points to links that can prove indispensable in trying to bring death rates down for each disease for both males and females.