About the Dataset used in this paper
The supplied dataset contains statistics on the causes of death in various countries and territories over a specific time period. It gives a detailed overview of causes of death across geographies and years, which makes it possible to explore mortality patterns and trends and to gain insight into the prevalence and impact of diseases in each country or territory.
Objective
The objective of this project is to learn more about the causes of death in various countries and territories over a specific time period. The dataset is large: more than thirty causes of death are reported for many countries around the world. Depending on the questions to be answered, many different data displays could be designed. For this project, I answered only a few questions.
Content
This dataset contains historical data on various causes of mortality for people of all ages around the world. Its main features include: Meningitis, Alzheimer’s Disease and Other Dementias, Parkinson’s Disease, Nutritional Deficiencies, Malaria, Drowning, Interpersonal Violence, Maternal Disorders, HIV/AIDS, Drug Use Disorders, Tuberculosis, Cardiovascular Diseases, Lower Respiratory Infections, Neonatal Disorders, Alcohol Use Disorders, Self-harm, Forces of Nature, Diarrheal Diseases, and others.
General steps to import the “cause_of_deaths.csv” file into R: set the working directory to the folder that contains the file, then use the read.csv() function to read the CSV file into a data frame and assign the result to a variable, such as df.
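The import step can be sketched as follows. Because the real file is not bundled here, the sketch writes a tiny stand-in CSV to a temporary path; its two rows and their values are made up for illustration, and only the file name "cause_of_deaths.csv" and the variable name df come from the text. It also shows why a column such as Country.Territory has a dot in its name: read.csv() rewrites header characters that are not valid in R names.

```r
# Hypothetical stand-in for cause_of_deaths.csv (values are invented)
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Country/Territory,Code,Year,Cardiovascular Diseases",
  "Country A,AAA,1990,100",
  "Country B,BBB,1990,200"
), csv_path)

# With the real file, the equivalent would be:
#   setwd("path/to/folder")              # assumed location, adjust as needed
#   df <- read.csv("cause_of_deaths.csv")
df <- read.csv(csv_path)

# "/" and " " in the header are replaced by "." (check.names = TRUE by default)
names(df)
```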
Explore the data: After importing the data, it is good practice to explore it to understand its structure and quality. We can use functions like head(), summary(), str(), or any other relevant functions to get an overview of the data frame.
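A minimal sketch of this exploration step, run on a toy data frame that stands in for df (the rows are invented; the column names follow those used in this report):

```r
# Toy data frame standing in for df (values are made up for illustration)
df <- data.frame(
  Country.Territory = c("A", "A", "B", "B"),
  Code = c("AAA", "AAA", "BBB", "BBB"),
  Year = c(1990L, 1991L, 1990L, 1991L),
  Cardiovascular.Diseases = c(100, 110, 2000, 2100)
)

head(df, 3)   # first rows of the data frame
str(df)       # column types and a preview of the values
summary(df)   # per-column summaries (quartiles for numeric columns)
dim(df)       # number of rows and columns
```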
First: Analysis of at least one categorical variable and at least one numerical variable.
Categorical variable analysis: Country.Territory. Plotting a bar chart of country counts.
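One way to produce such a bar chart in base R is sketched below on invented rows. It counts how many rows each Country.Territory value has and plots the counts. If every country has exactly one row per year, all bars will have the same height.

```r
# Toy data standing in for df (countries and years are invented)
df <- data.frame(
  Country.Territory = c("A", "A", "A", "B", "B", "C"),
  Year = c(1990, 1991, 1992, 1990, 1991, 1990)
)

# Number of rows per country
country_counts <- table(df$Country.Territory)

# Bar chart of the counts, largest first
barplot(sort(country_counts, decreasing = TRUE),
        las = 2,
        xlab = "Country.Territory",
        ylab = "Number of rows",
        main = "Rows per country")
```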
Numerical variable analysis: Cardiovascular.Diseases, as it is the leading cause of death. Plotting a boxplot of average cardiovascular disease deaths by country to observe trends in fatalities related to cardiovascular illnesses.
2nd: To perform an analysis using two or more variables, let’s consider the relationship between the “Year” and “Cardiovascular.Diseases” variables from the df data frame.
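A rough sketch of this two-variable analysis, assuming the goal is to sum cardiovascular deaths over all countries for each year and then look at the trend. The data below are invented; aggregate() is base R and is used here in place of whatever grouping code the original analysis ran.

```r
# Toy data standing in for df (values are made up)
df <- data.frame(
  Country.Territory = rep(c("A", "B"), each = 3),
  Year = rep(1990:1992, times = 2),
  Cardiovascular.Diseases = c(100, 110, 120, 200, 190, 210)
)

# Total cardiovascular deaths per year, summed over countries
yearly <- aggregate(Cardiovascular.Diseases ~ Year, data = df, FUN = sum)

# Line plot of the yearly totals
plot(yearly$Year, yearly$Cardiovascular.Diseases, type = "b",
     xlab = "Year", ylab = "Cardiovascular deaths",
     main = "Cardiovascular deaths by year")

# Linear association between year and yearly total
cor(yearly$Year, yearly$Cardiovascular.Diseases)
```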
It is useful to know what the biggest causes of death are, so here are the top ten. Calculate the total number of deaths for each cause, then rank the causes to find the top 10.
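The ranking can be sketched with colSums() and sort(). The data frame below is a made-up miniature with only four cause columns. The identifier columns Country.Territory, Code and Year are assumed to be the only non-cause columns.

```r
# Toy data standing in for df (values are invented)
df <- data.frame(
  Country.Territory = c("A", "B"),
  Code = c("AAA", "BBB"),
  Year = c(1990, 1990),
  Meningitis = c(5, 7),
  Malaria = c(20, 1),
  Cardiovascular.Diseases = c(300, 450),
  Neoplasms = c(150, 200)
)

# Every column except the identifiers is treated as a cause of death
cause_cols <- setdiff(names(df), c("Country.Territory", "Code", "Year"))

# Total deaths per cause, then the ten largest
totals <- colSums(df[, cause_cols], na.rm = TRUE)
top_causes <- head(sort(totals, decreasing = TRUE), 10)
top_causes
```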
3rd: Let us investigate the distribution of the “Cardiovascular.Diseases” variable, as it is the largest cause of death in the dataset. To visualize the distribution of numerical data, we can generate a histogram and a density plot.
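A sketch of a histogram with an overlaid density curve follows. It uses synthetic right-skewed values drawn from a log-normal distribution as a stand-in for the real column. The log10 transform is an assumption of this sketch: it is a common choice for count data whose mean lies far above the median, but it is not necessarily what the original plots used.

```r
set.seed(1)
# Synthetic right-skewed stand-in for df$Cardiovascular.Diseases
x <- rlnorm(500, meanlog = 9, sdlog = 2)

# Histogram on a log10 scale, drawn as a density so the curve is comparable
hist(log10(x), breaks = 30, freq = FALSE,
     xlab = "log10(deaths)",
     main = "Distribution of cardiovascular deaths (synthetic)")

# Kernel density estimate on the same scale
lines(density(log10(x)), lwd = 2)
```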
We can use descriptive statistics and a box plot to investigate the
distribution of the “Cardiovascular.Diseases” data. This provides
information about the central tendency, spread, and shape of the
distribution.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4 2028 11742 73160 42546 4584273
Calculate the percentage of each cause of death
4th: Central Limit Theorem. Draw various random samples of the data
and show the applicability of the Central Limit Theorem for this
variable.
Simple random sampling yields a representative sample of the population.
The distribution of the sample means clearly takes a bell shape, close to
the normal distribution predicted by the Central Limit Theorem, so the
results of the analysis can be extrapolated to the complete population.
The spread of the sample means narrows as the sample size grows, and
stratified sampling can reduce it further.
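The effect can be illustrated with a short simulation. This is a sketch under stated assumptions: the population is synthetic and log-normal (skewed, like the death counts), and the helper sample_means() is a hypothetical name introduced only for this example.

```r
set.seed(42)
# Synthetic skewed population standing in for the Cardiovascular.Diseases column
population <- rlnorm(10000, meanlog = 9, sdlog = 1.5)

# Draw `reps` simple random samples of size n and return their means
sample_means <- function(n, reps = 1000) {
  replicate(reps, mean(sample(population, n)))
}

means_10  <- sample_means(10)
means_200 <- sample_means(200)

# The spread of the sample means shrinks as the sample size grows
sd(means_10)
sd(means_200)

# The histogram of means for the larger sample size is closer to a bell shape
hist(means_200, breaks = 30,
     xlab = "Sample mean", main = "Means of 1000 samples, n = 200")
abline(v = mean(population), lwd = 2)
```

With strongly skewed data, small samples still give a skewed distribution of means, so the bell shape only becomes apparent at larger sample sizes.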
5th: Show how various sampling methods can be used on the data, and
what conclusions follow if these samples are used instead of the
whole dataset. The methods compared are simple random sampling,
stratified sampling, and systematic sampling.
## Warning: Removed 1 rows containing non-finite values (`stat_bin()`).
Calculate the means and compare the different sampling
techniques
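The three sampling schemes and the comparison of means can be sketched in base R as below. The data are synthetic, the strata are assumed to be the Country.Territory values, and the stratified draw takes an equal number of rows from each stratum. This is an illustration, not a reconstruction of the code that produced the table in this report.

```r
set.seed(7)
# Synthetic data: four countries with very different death counts
df <- data.frame(
  Country.Territory = rep(c("A", "B", "C", "D"), each = 25),
  Cardiovascular.Diseases = c(rnorm(25, 100, 10), rnorm(25, 1000, 50),
                              rnorm(25, 5000, 200), rnorm(25, 20000, 500))
)
n <- 20  # sample size used by every method

# Simple random sampling: n rows chosen uniformly at random
srs <- df[sample(nrow(df), n), ]

# Stratified sampling: the same number of rows from each country
per_stratum <- n / length(unique(df$Country.Territory))
rows_by_country <- split(seq_len(nrow(df)), df$Country.Territory)
strat_idx <- unlist(lapply(rows_by_country, function(i) sample(i, per_stratum)))
strat <- df[strat_idx, ]

# Systematic sampling: every k-th row after a random start
k <- nrow(df) %/% n
start <- sample(k, 1)
sys <- df[seq(start, by = k, length.out = n), ]

# Compare the sample means with the mean of the whole dataset
comparison <- data.frame(
  SamplingMethod = c("Whole dataset", "Random Sampling",
                     "Stratified Sampling", "Systematic Sampling"),
  Mean = c(mean(df$Cardiovascular.Diseases), mean(srs$Cardiovascular.Diseases),
           mean(strat$Cardiovascular.Diseases), mean(sys$Cardiovascular.Diseases))
)
comparison
```

When the strata differ as much as countries do here, a sample that covers every stratum tends to track the whole-dataset mean more closely than a simple random sample of the same size.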
The ggplot panels show the selected causes of death: Parkinson’s Disease, Drug Use Disorders, Cardiovascular Diseases, and Maternal Disorders. Any other cause of death can be chosen instead, according to your interests.
## SamplingMethod Mean
## 1 mean_whole 1173.169
## 2 Random Sampling 164.420
## 3 Stratified Sampling 73708.896
## 4 Systematic Sampling 1246.051
[Figure: four plots arranged in a 2 x 2 grid]
Group the data by Country.Territory and calculate the total
occurrences of each disease
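A base R sketch of this grouping step is shown below, using aggregate() on two invented cause columns. More causes can be added inside cbind().

```r
# Toy data standing in for df (values are made up)
df <- data.frame(
  Country.Territory = rep(c("A", "B"), each = 2),
  Year = rep(c(1990, 1991), times = 2),
  Malaria = c(1, 2, 30, 40),
  Tuberculosis = c(5, 5, 10, 10)
)

# Total deaths per country for each listed cause
by_country <- aggregate(cbind(Malaria, Tuberculosis) ~ Country.Territory,
                        data = df, FUN = sum)
by_country
```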
The pattern of diseases over the years (the relationship between
year and a specific disease): group the data by Year and
calculate the total occurrences of the specific disease.
We can repeat the above code for other diseases of interest.
Calculate the total number of deaths for each cause
## Meningitis Alzheimer.s.Disease.and.Other.Dementias Parkinson.s.Disease
## 1 10524572 29768839 7179795
## Nutritional.Deficiencies Malaria Drowning Interpersonal.Violence
## 1 13792032 25342676 10301999 12752839
## Maternal.Disorders HIV.AIDS Drug.Use.Disorders Tuberculosis
## 1 7727046 36364419 2656121 45850603
## Cardiovascular.Diseases Lower.Respiratory.Infections Neonatal.Disorders
## 1 447741982 83770038 76860729
## Alcohol.Use.Disorders Self.harm Exposure.to.Forces.of.Nature
## 1 4819018 23713931 1490132
## Diarrheal.Diseases Environmental.Heat.and.Cold.Exposure Neoplasms
## 1 66235508 1788851 229758538
## Conflict.and.Terrorism Diabetes.Mellitus Chronic.Kidney.Disease Poisonings
## 1 3294053 31448872 28911692 2601082
## Protein.Energy.Malnutrition Road.Injuries Chronic.Respiratory.Diseases
## 1 12031885 36296469 104605334
## Cirrhosis.and.Other.Chronic.Liver.Diseases Digestive.Diseases
## 1 37479321 65638635
## Fire..Heat..and.Hot.Substances
## 1 3602914
Calculate the percentage of each cause of death
## Meningitis Alzheimer.s.Disease.and.Other.Dementias Parkinson.s.Disease
## 1 10524572 29768839 7179795
## Nutritional.Deficiencies Malaria Drowning Interpersonal.Violence
## 1 13792032 25342676 10301999 12752839
## Maternal.Disorders HIV.AIDS Drug.Use.Disorders Tuberculosis
## 1 7727046 36364419 2656121 45850603
## Cardiovascular.Diseases Lower.Respiratory.Infections Neonatal.Disorders
## 1 447741982 83770038 76860729
## Alcohol.Use.Disorders Self.harm Exposure.to.Forces.of.Nature
## 1 4819018 23713931 1490132
## Diarrheal.Diseases Environmental.Heat.and.Cold.Exposure Neoplasms
## 1 66235508 1788851 229758538
## Conflict.and.Terrorism Diabetes.Mellitus Chronic.Kidney.Disease Poisonings
## 1 3294053 31448872 28911692 2601082
## Protein.Energy.Malnutrition Road.Injuries Chronic.Respiratory.Diseases
## 1 12031885 36296469 104605334
## Cirrhosis.and.Other.Chronic.Liver.Diseases Digestive.Diseases
## 1 37479321 65638635
## Fire..Heat..and.Hot.Substances Percent_Meningitis
## 1 3602914 1
## Percent_Alzheimer.s.Disease.and.Other.Dementias Percent_Parkinson.s.Disease
## 1 1 1
## Percent_Nutritional.Deficiencies Percent_Malaria Percent_Drowning
## 1 1 1 1
## Percent_Interpersonal.Violence Percent_Maternal.Disorders Percent_HIV.AIDS
## 1 1 1 1
## Percent_Drug.Use.Disorders Percent_Tuberculosis
## 1 1 1
## Percent_Cardiovascular.Diseases Percent_Lower.Respiratory.Infections
## 1 1 1
## Percent_Neonatal.Disorders Percent_Alcohol.Use.Disorders Percent_Self.harm
## 1 1 1 1
## Percent_Exposure.to.Forces.of.Nature Percent_Diarrheal.Diseases
## 1 1 1
## Percent_Environmental.Heat.and.Cold.Exposure Percent_Neoplasms
## 1 1 1
## Percent_Conflict.and.Terrorism Percent_Diabetes.Mellitus
## 1 1 1
## Percent_Chronic.Kidney.Disease Percent_Poisonings
## 1 1 1
## Percent_Protein.Energy.Malnutrition Percent_Road.Injuries
## 1 1 1
## Percent_Chronic.Respiratory.Diseases
## 1 1
## Percent_Cirrhosis.and.Other.Chronic.Liver.Diseases Percent_Digestive.Diseases
## 1 1 1
## Percent_Fire..Heat..and.Hot.Substances
## 1 1
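Every Percent_ column in the output above prints as 1, which suggests the totals were not divided by the grand total. A minimal sketch of computing each cause's share of all deaths follows. The three cause names and their totals are placeholders, and the Percent_ prefix mirrors the column names in the output.

```r
# One-row data frame of per-cause totals (placeholder names and values)
totals_df <- data.frame(CauseA = 50, CauseB = 30, CauseC = 20)

# Grand total over all causes
grand_total <- sum(totals_df)

# Share of each cause, in percent, rounded to two decimals
percent_df <- round(100 * totals_df / grand_total, 2)
names(percent_df) <- paste0("Percent_", names(totals_df))

cbind(totals_df, percent_df)
```

Applied to the totals listed in this report, the cardiovascular total of 447741982 out of a grand total of 1464349925 comes to roughly 30.6%.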
6th: Use data wrangling techniques for the appropriate analysis
of the data
Filter rows based on conditions
Combine select and filter operations
Count the number of missing values in each column
colSums(is.na(df))
Remove rows with missing values
cleaned_df <- na.omit(df)
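The difference between the overall and the per-column missing-value counts, and the effect of na.omit(), can be seen on a small invented data frame:

```r
# Toy data with a few missing values (values are made up)
df <- data.frame(
  Country.Territory = c("A", "B", "C"),
  Year = c(1990, 1990, 1990),
  Malaria = c(1, NA, 3),
  Tuberculosis = c(NA, NA, 6)
)

sum(is.na(df))       # one number: missing values in the whole data frame
colSums(is.na(df))   # one number per column

# Keep only the rows without any missing value
cleaned_df <- na.omit(df)
nrow(cleaned_df)
```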
Modify the structure or format of the data, for example by converting data types
Create new variables
cleaned_df$Total_Deaths <- rowSums(cleaned_df[, 4:33])
cleaned_df$Total_Deaths
Aggregate data by year and calculate the total deaths for each year
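Calculate the average deaths per year. The value printed below equals the sum of the per-cause totals listed earlier, so it is the grand total of deaths across all years rather than a per-year average.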
## [1] 1464349925
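A sketch of the yearly aggregation that keeps the grand total and the per-year average as separate quantities is shown below. The data and the cause column names are invented. The Total_Deaths column is built by name here rather than by the column positions 4:33 used above.

```r
# Toy data standing in for cleaned_df (values are made up)
cleaned_df <- data.frame(
  Year = c(1990, 1990, 1991, 1991),
  CauseA = c(10, 20, 30, 40),
  CauseB = c(1, 2, 3, 4)
)

# Row-wise total over the cause columns
cleaned_df$Total_Deaths <- rowSums(cleaned_df[, c("CauseA", "CauseB")])

# Total deaths for each year
deaths_by_year <- aggregate(Total_Deaths ~ Year, data = cleaned_df, FUN = sum)

# Grand total over all years versus the average of the yearly totals
total_all_years  <- sum(deaths_by_year$Total_Deaths)
average_per_year <- mean(deaths_by_year$Total_Deaths)
c(total = total_all_years, average = average_per_year)
```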
7th: Use plotly to make the plots interactive.
## `summarise()` has grouped output by 'Country.Territory'. You can override using
## the `.groups` argument.
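An interactive version of a per-country trend plot could look like the sketch below. It assumes the plotly package is installed and skips the plot otherwise. The data are invented, and the grouping uses base aggregate() rather than the summarise() call that produced the message above.

```r
# Toy data standing in for df (values are made up)
df <- data.frame(
  Country.Territory = rep(c("A", "B"), each = 3),
  Year = rep(1990:1992, times = 2),
  Cardiovascular.Diseases = c(100, 110, 120, 200, 190, 210)
)

# Total cardiovascular deaths per country and year
yearly <- aggregate(Cardiovascular.Diseases ~ Country.Territory + Year,
                    data = df, FUN = sum)

# Build the interactive plot only if plotly is available
if (requireNamespace("plotly", quietly = TRUE)) {
  p <- plotly::plot_ly(yearly,
                       x = ~Year,
                       y = ~Cardiovascular.Diseases,
                       color = ~Country.Territory,
                       type = "scatter",
                       mode = "lines+markers")
  p <- plotly::layout(p,
                      title = "Cardiovascular deaths by country and year",
                      yaxis = list(title = "Deaths"))
  # print(p) displays the widget in an interactive session
}
```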
Conclusion
Cardiovascular diseases have been the major cause of death in the top five countries in the world, which is partly due to the large populations of these five countries. The number of deaths has increased over time, which may reflect unhealthy living, increased intake of dangerous substances such as drugs, alcohol, and tobacco, and a polluted environment. Neoplasms, the second leading cause of death, have also posed a significant danger to these countries. There is an anomaly in the dataset indicating that Russia has zero deaths due to malaria (this could be due to a dataset issue). The data is displayed below.
It is worth mentioning that, depending on the individual analysis or research aims, further preparation and data wrangling may be required to address missing values, outliers, and inconsistencies in the dataset.