About the Dataset Used in This Paper

The dataset comprises statistics on the causes of death in various countries and territories over a specific time period. It gives a detailed overview of causes of death across geographies and years, which makes it possible to explore mortality patterns and trends and to gain insight into the prevalence and impact of various diseases in each country or territory.

Objective

The objective of this project is to gain further knowledge about the causes of death in various countries and territories over a specific time period. The dataset is large: more than thirty causes of death are reported for many countries around the world. Many different data displays could be designed depending on the questions to be answered; for the purpose of this project, I answer only a few of them.

Content

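This dataset contains historical data on various causes of mortality for people of all ages around the world. Its main features are the following causes of death: Meningitis, Alzheimer’s Disease and Other Dementias, Parkinson’s Disease, Nutritional Deficiencies, Malaria, Drowning, Interpersonal Violence, Maternal Disorders, HIV/AIDS, Drug Use Disorders, Tuberculosis, Cardiovascular Diseases, Lower Respiratory Infections, Neonatal Disorders, Alcohol Use Disorders, Self-harm, Exposure to Forces of Nature, Diarrheal Diseases, Environmental Heat and Cold Exposure, Neoplasms, Conflict and Terrorism, Diabetes Mellitus, Chronic Kidney Disease, Poisonings, Protein-Energy Malnutrition, Road Injuries, Chronic Respiratory Diseases, Cirrhosis and Other Chronic Liver Diseases, Digestive Diseases, and Fire, Heat, and Hot Substances.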

General steps to import the “cause_of_deaths.csv” file into R: first, set the working directory to the folder that contains the file; then use the read.csv() function to read the CSV file into a data frame and assign the result to a variable, such as df.
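The import steps can be sketched as follows. This is a minimal example, not the original code: the rows are hypothetical toy values, the Code column is an assumed identifier column, and the file is written to a temporary folder so that the example runs without the real dataset.

```r
# Toy stand-in for "cause_of_deaths.csv" (hypothetical values).
toy <- data.frame(
  Country.Territory = c("Country A", "Country A", "Country B", "Country B"),
  Code = c("AAA", "AAA", "BBB", "BBB"),          # assumed identifier column
  Year = c(1990, 1991, 1990, 1991),
  Cardiovascular.Diseases = c(120, 130, 900, 950)
)
csv_path <- file.path(tempdir(), "cause_of_deaths.csv")
write.csv(toy, csv_path, row.names = FALSE)

# With the real file, set the working directory to its folder first, e.g.
# setwd("path/to/folder"), and then call read.csv("cause_of_deaths.csv").
df <- read.csv(csv_path)

# Confirm that the import produced the expected shape.
dim(df)
names(df)
```

Reading by full path, as done here with csv_path, avoids depending on the working directory at all, which can be convenient when the script is shared.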

Explore the data: after importing the data, it is good practice to explore it to understand its structure and quality. We can use functions such as head(), summary(), and str(), or any other relevant functions, to get an overview of the data frame.
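A first look at the data frame might resemble the sketch below. The data frame is a small hypothetical stand-in for df, so the printed values do not correspond to the real dataset.

```r
# Hypothetical stand-in for the imported data frame.
df <- data.frame(
  Country.Territory = rep(c("Country A", "Country B"), each = 3),
  Year = rep(1990:1992, times = 2),
  Cardiovascular.Diseases = c(100, 110, 120, 900, 920, 940)
)

first_rows <- head(df, 3)          # first three rows
first_rows

str(df)                            # column types and a preview of the values

num_summary <- summary(df$Cardiovascular.Diseases)  # five-number summary plus mean
num_summary
```

str() is usually the quickest way to spot a numeric column that was read as text, while summary() exposes implausible minima or maxima.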

1st: Analysis of at least one categorical variable and at least one numerical variable.

Categorical variable analysis: Country.Territory. Plot a bar chart of the number of records per country.
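One way to produce such a bar chart in base R is sketched below, using hypothetical toy rows. The original report may have used ggplot2 instead; base graphics are used here only to keep the example dependency-free.

```r
# Hypothetical stand-in: three countries with four yearly records each.
df <- data.frame(
  Country.Territory = rep(c("Country A", "Country B", "Country C"), each = 4),
  Year = rep(1990:1993, times = 3)
)

# Count how many rows belong to each country.
country_counts <- table(df$Country.Territory)

# Draw one bar per country; las = 2 rotates the labels so long names fit.
barplot(country_counts,
        main = "Records per country",
        ylab = "Number of rows",
        las = 2)
```

If every country has exactly one row per year, all bars have the same height, and the chart mainly serves as a check that no country is missing records.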

Numerical variable analysis: Cardiovascular.Diseases, as it is the leading cause of death. Plot a boxplot of the average number of deaths from cardiovascular diseases by country.
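The per-country averages and their boxplot could be computed roughly as follows. The values are hypothetical, and aggregate() is only one of several ways to group the data.

```r
# Hypothetical stand-in with one numeric cause-of-death column.
df <- data.frame(
  Country.Territory = rep(c("Country A", "Country B", "Country C"), each = 4),
  Year = rep(1990:1993, times = 3),
  Cardiovascular.Diseases = c(100, 110, 120, 130,
                              900, 920, 940, 960,
                              10, 12, 14, 16)
)

# Average deaths per country across all years.
avg_by_country <- aggregate(Cardiovascular.Diseases ~ Country.Territory,
                            data = df, FUN = mean)
avg_by_country

# Boxplot of the country averages.
boxplot(avg_by_country$Cardiovascular.Diseases,
        main = "Average cardiovascular deaths by country",
        ylab = "Average deaths per year")
```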

Next, observe trends in fatalities related to cardiovascular diseases.

2nd: Perform an analysis using two or more variables. Consider the relationship between the “Year” and “Cardiovascular.Diseases” variables in the df data frame.
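A simple way to look at this relationship is to total the deaths per year and plot them against the year. The sketch below does so on hypothetical toy rows.

```r
# Hypothetical stand-in for df.
df <- data.frame(
  Country.Territory = rep(c("Country A", "Country B", "Country C"), each = 4),
  Year = rep(1990:1993, times = 3),
  Cardiovascular.Diseases = c(100, 110, 120, 130,
                              900, 920, 940, 960,
                              10, 12, 14, 16)
)

# Total cardiovascular deaths across all countries for each year.
yearly <- aggregate(Cardiovascular.Diseases ~ Year, data = df, FUN = sum)
yearly

# Line plot of the yearly totals to show the trend.
plot(yearly$Year, yearly$Cardiovascular.Diseases,
     type = "b",
     main = "Cardiovascular deaths over time",
     xlab = "Year", ylab = "Total deaths")
```

Aggregating first avoids overplotting one point per country and year, at the cost of hiding differences between countries.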

It is useful to know what the biggest causes of death are. Calculate the total number of deaths for each cause and list the top ten.
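The ranking can be obtained by summing each cause column and sorting the totals. In the sketch below the rows are hypothetical, only three causes are included, and the Code column is an assumed identifier column.

```r
# Hypothetical stand-in with three identifier columns and three causes.
df <- data.frame(
  Country.Territory = c("Country A", "Country B", "Country C"),
  Code = c("AAA", "BBB", "CCC"),                 # assumed identifier column
  Year = c(1990, 1990, 1990),
  Cardiovascular.Diseases = c(100, 900, 10),
  Neoplasms = c(50, 400, 5),
  Malaria = c(0, 3, 40)
)

# Every column that is not an identifier is treated as a cause of death.
cause_cols <- setdiff(names(df), c("Country.Territory", "Code", "Year"))

# Total deaths per cause, sorted from largest to smallest.
cause_totals <- colSums(df[, cause_cols])
top_causes <- head(sort(cause_totals, decreasing = TRUE), 10)
top_causes
```

Selecting the cause columns by name rather than by position keeps the code correct if columns are added or reordered.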

3rd: Investigate the distribution of the “Cardiovascular.Diseases” variable, as it is the largest cause of death in the dataset. To visualize the distribution of numerical data, we can generate a histogram and a density plot.
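A histogram with an overlaid density curve can be drawn as sketched below. The values are simulated from a log-normal distribution purely as a hypothetical, right-skewed stand-in for the real column.

```r
set.seed(123)

# Simulated, right-skewed stand-in for Cardiovascular.Diseases (hypothetical).
cardio <- round(rlnorm(500, meanlog = 9, sdlog = 2))

# freq = FALSE puts the histogram on the density scale,
# so the kernel density curve can be drawn on top of it.
hist(cardio, breaks = 40, freq = FALSE,
     main = "Distribution of cardiovascular deaths",
     xlab = "Deaths")

dens <- density(cardio)
lines(dens, lwd = 2)
```

For data this skewed, most observations fall into the first few bins; plotting log10(cardio + 1) instead is a common way to make the shape easier to read.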

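We can use descriptive statistics and a box plot to investigate the distribution of the “Cardiovascular.Diseases” data. This provides information about its central tendency, spread, and shape. In the summary that follows, the mean (73160) is far above the median (11742), which indicates a strongly right-skewed distribution.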

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       4    2028   11742   73160   42546 4584273
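A summary like the one above, together with measures of spread and a box plot, could be produced as in this sketch. The five values are hypothetical and chosen only so that one large observation pulls the mean above the median.

```r
# Hypothetical, right-skewed toy values.
cardio <- c(4, 20, 35, 50, 400)

stats <- summary(cardio)                          # min, quartiles, median, mean, max
spread <- c(SD = sd(cardio), IQR = IQR(cardio))   # two measures of spread

stats
spread

# A horizontal box plot makes the long right tail easy to see.
boxplot(cardio, horizontal = TRUE,
        main = "Cardiovascular deaths",
        xlab = "Deaths")
```

When the mean is much larger than the median, as here, the median and the interquartile range are usually the more robust descriptions of a typical value and its spread.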

4th: Central Limit Theorem. Draw various random samples of the data and show the applicability of the Central Limit Theorem for this variable. Simple random sampling yields a representative sample of the population, and the distribution of the sample means takes a bell shape, or close to it (approximately normal, as the Central Limit Theorem predicts), so the results of the analysis can be extrapolated to the complete population. The spread of the sample means shrinks as the sample size grows, and stratified sampling can reduce it further when countries differ strongly from one another.
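The effect can be illustrated with a small simulation. The population below is simulated from a log-normal distribution as a hypothetical stand-in for the skewed variable, and sample_means() is a helper name introduced only for this sketch.

```r
set.seed(42)

# Simulated, right-skewed population (hypothetical stand-in for the real column).
population <- rlnorm(5000, meanlog = 9, sdlog = 1)

# Draw `reps` simple random samples of size n and return their means.
sample_means <- function(n, reps = 1000) {
  replicate(reps, mean(sample(population, n)))
}

means_10  <- sample_means(10)
means_200 <- sample_means(200)

# Side-by-side histograms of the sample means for the two sample sizes.
par(mfrow = c(1, 2))
hist(means_10,  breaks = 30, main = "Sample means, n = 10",  xlab = "Mean")
hist(means_200, breaks = 30, main = "Sample means, n = 200", xlab = "Mean")
par(mfrow = c(1, 1))

# The spread of the sample means is smaller for the larger sample size.
c(sd_n10 = sd(means_10), sd_n200 = sd(means_200))
```

With a strongly skewed variable the histogram for small samples is still visibly skewed; the bell shape emerges only as the sample size increases.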

5th: Show how various sampling methods can be used on the data, and what conclusions follow if these samples are used instead of the whole dataset. The methods compared are simple random sampling, stratified sampling, and systematic sampling.

## Warning: Removed 1 rows containing non-finite values (`stat_bin()`).

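Calculate the means and compare the different sampling techniques. The means in the table that follows should be compared with caution: the whole-data mean (1173.169) is far from the Cardiovascular.Diseases mean reported earlier (73160), whereas the stratified mean (73708.896) is close to it, so the rows do not all appear to describe the same variable.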
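The three sampling methods and the comparison of means can be sketched as follows. The data are simulated and hypothetical, the Deaths column stands in for any single cause of death, and all four means are computed on that same column so that they are directly comparable.

```r
set.seed(1)

# Simulated stand-in: 5 countries with 20 rows each (hypothetical values).
df <- data.frame(
  Country.Territory = rep(paste("Country", LETTERS[1:5]), each = 20),
  Deaths = round(rlnorm(100, meanlog = 6, sdlog = 1))
)
n <- 20  # sample size used by every method

# Simple random sampling: n rows chosen at random.
srs <- df[sample(nrow(df), n), ]

# Stratified sampling: the same number of rows (4) from every country.
rows_by_country <- split(seq_len(nrow(df)), df$Country.Territory)
strat_idx <- unlist(lapply(rows_by_country, function(idx) sample(idx, 4)))
strat <- df[strat_idx, ]

# Systematic sampling: every k-th row, starting from a random offset.
k <- nrow(df) / n
start <- sample(k, 1)
sys <- df[seq(start, nrow(df), by = k), ]

# Compare the mean of the same variable under each method.
means <- data.frame(
  SamplingMethod = c("Whole data", "Simple random", "Stratified", "Systematic"),
  Mean = c(mean(df$Deaths), mean(srs$Deaths), mean(strat$Deaths), mean(sys$Deaths))
)
means
```

Because the rows are ordered by country, the systematic sample here also spreads evenly across countries; with a different row order it could behave quite differently.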

Here the ggplots are shown for the chosen causes of death: Parkinson’s Disease, Drug Use Disorders, Cardiovascular Diseases, and Maternal Disorders. Any other cause of death can be chosen according to your interests.

##        SamplingMethod      Mean
## 1          mean_whole  1173.169
## 2     Random Sampling   164.420
## 3 Stratified Sampling 73708.896
## 4 Systematic Sampling  1246.051

[Figure: four plots arranged in a 2 x 2 grid]

Group the data by Country.Territory and calculate the total occurrences of each disease.

To see the pattern of a disease over the years (the relationship between year and a specific disease), group the data by Year and calculate the total occurrences of that disease.
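Both groupings can be expressed with aggregate(), as in this sketch on hypothetical toy rows. The original report appears to have used dplyr for this step; base R is used here only to keep the example free of dependencies.

```r
# Hypothetical stand-in with two causes of death.
df <- data.frame(
  Country.Territory = rep(c("Country A", "Country B"), each = 3),
  Year = rep(1990:1992, times = 2),
  Cardiovascular.Diseases = c(100, 110, 120, 900, 920, 940),
  Malaria = c(0, 1, 2, 30, 25, 20)
)

# Total deaths per country for several causes at once.
by_country <- aggregate(cbind(Cardiovascular.Diseases, Malaria) ~ Country.Territory,
                        data = df, FUN = sum)
by_country

# Total deaths per year for one specific disease.
malaria_by_year <- aggregate(Malaria ~ Year, data = df, FUN = sum)
malaria_by_year
```

Swapping Malaria for another column name in the second call gives the yearly pattern for any other disease.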

We can repeat the above code for other diseases of interest.

Calculate the total number of deaths for each cause:

##   Meningitis Alzheimer.s.Disease.and.Other.Dementias Parkinson.s.Disease
## 1   10524572                                29768839             7179795
##   Nutritional.Deficiencies  Malaria Drowning Interpersonal.Violence
## 1                 13792032 25342676 10301999               12752839
##   Maternal.Disorders HIV.AIDS Drug.Use.Disorders Tuberculosis
## 1            7727046 36364419            2656121     45850603
##   Cardiovascular.Diseases Lower.Respiratory.Infections Neonatal.Disorders
## 1               447741982                     83770038           76860729
##   Alcohol.Use.Disorders Self.harm Exposure.to.Forces.of.Nature
## 1               4819018  23713931                      1490132
##   Diarrheal.Diseases Environmental.Heat.and.Cold.Exposure Neoplasms
## 1           66235508                              1788851 229758538
##   Conflict.and.Terrorism Diabetes.Mellitus Chronic.Kidney.Disease Poisonings
## 1                3294053          31448872               28911692    2601082
##   Protein.Energy.Malnutrition Road.Injuries Chronic.Respiratory.Diseases
## 1                    12031885      36296469                    104605334
##   Cirrhosis.and.Other.Chronic.Liver.Diseases Digestive.Diseases
## 1                                   37479321           65638635
##   Fire..Heat..and.Hot.Substances
## 1                        3602914

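Calculate the percentage of each cause of death. In the output that follows, every Percent_ column equals 1, which suggests that each total was divided by itself rather than by the grand total across all causes, so these columns should not be read as shares of total deaths.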

##   Meningitis Alzheimer.s.Disease.and.Other.Dementias Parkinson.s.Disease
## 1   10524572                                29768839             7179795
##   Nutritional.Deficiencies  Malaria Drowning Interpersonal.Violence
## 1                 13792032 25342676 10301999               12752839
##   Maternal.Disorders HIV.AIDS Drug.Use.Disorders Tuberculosis
## 1            7727046 36364419            2656121     45850603
##   Cardiovascular.Diseases Lower.Respiratory.Infections Neonatal.Disorders
## 1               447741982                     83770038           76860729
##   Alcohol.Use.Disorders Self.harm Exposure.to.Forces.of.Nature
## 1               4819018  23713931                      1490132
##   Diarrheal.Diseases Environmental.Heat.and.Cold.Exposure Neoplasms
## 1           66235508                              1788851 229758538
##   Conflict.and.Terrorism Diabetes.Mellitus Chronic.Kidney.Disease Poisonings
## 1                3294053          31448872               28911692    2601082
##   Protein.Energy.Malnutrition Road.Injuries Chronic.Respiratory.Diseases
## 1                    12031885      36296469                    104605334
##   Cirrhosis.and.Other.Chronic.Liver.Diseases Digestive.Diseases
## 1                                   37479321           65638635
##   Fire..Heat..and.Hot.Substances Percent_Meningitis
## 1                        3602914                  1
##   Percent_Alzheimer.s.Disease.and.Other.Dementias Percent_Parkinson.s.Disease
## 1                                               1                           1
##   Percent_Nutritional.Deficiencies Percent_Malaria Percent_Drowning
## 1                                1               1                1
##   Percent_Interpersonal.Violence Percent_Maternal.Disorders Percent_HIV.AIDS
## 1                              1                          1                1
##   Percent_Drug.Use.Disorders Percent_Tuberculosis
## 1                          1                    1
##   Percent_Cardiovascular.Diseases Percent_Lower.Respiratory.Infections
## 1                               1                                    1
##   Percent_Neonatal.Disorders Percent_Alcohol.Use.Disorders Percent_Self.harm
## 1                          1                             1                 1
##   Percent_Exposure.to.Forces.of.Nature Percent_Diarrheal.Diseases
## 1                                    1                          1
##   Percent_Environmental.Heat.and.Cold.Exposure Percent_Neoplasms
## 1                                            1                 1
##   Percent_Conflict.and.Terrorism Percent_Diabetes.Mellitus
## 1                              1                         1
##   Percent_Chronic.Kidney.Disease Percent_Poisonings
## 1                              1                  1
##   Percent_Protein.Energy.Malnutrition Percent_Road.Injuries
## 1                                   1                     1
##   Percent_Chronic.Respiratory.Diseases
## 1                                    1
##   Percent_Cirrhosis.and.Other.Chronic.Liver.Diseases Percent_Digestive.Diseases
## 1                                                  1                          1
##   Percent_Fire..Heat..and.Hot.Substances
## 1                                      1
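A percentage share is normally each cause's total divided by the grand total across all causes. The sketch below applies that to three of the totals printed above; grand_total is the sum of the thirty cause totals in that output, and the variable names are introduced only for this example.

```r
# Three cause totals copied from the output above.
totals <- c(
  Cardiovascular.Diseases = 447741982,
  Neoplasms = 229758538,
  Chronic.Respiratory.Diseases = 104605334
)

# Sum of all thirty cause totals printed above.
grand_total <- 1464349925

# Share of all deaths attributed to each cause, in percent.
percent <- round(100 * totals / grand_total, 1)
percent
```

With the full vector of totals available, 100 * totals / sum(totals) gives the same result without hard-coding the grand total, and the shares then add up to 100.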

6th: Use data wrangling techniques for the appropriate analysis of the data.

Filter rows based on conditions

Combine select and filter operations
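Filtering and selecting can be done in base R as sketched below, on hypothetical toy rows. The thresholds and country names are arbitrary choices for illustration.

```r
# Hypothetical stand-in for df.
df <- data.frame(
  Country.Territory = rep(c("Country A", "Country B", "Country C"), each = 4),
  Year = rep(1990:1993, times = 3),
  Cardiovascular.Diseases = c(100, 110, 120, 130,
                              900, 920, 940, 960,
                              10, 12, 14, 16),
  Malaria = c(0, 0, 1, 1, 30, 28, 25, 20, 5, 4, 4, 3)
)

# Filter rows based on a condition: keep rows with more than 500 deaths.
high_cardio <- df[df$Cardiovascular.Diseases > 500, ]

# Combine filter and select: rows for one country from 1992 onward,
# keeping only three of the columns.
recent_a <- subset(df,
                   Year >= 1992 & Country.Territory == "Country A",
                   select = c(Country.Territory, Year, Cardiovascular.Diseases))

high_cardio
recent_a
```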

Count the number of missing values in each column

colSums(is.na(df))

Remove rows with missing values

cleaned_df <- na.omit(df)

Modify the structure or format of the data as needed, for example by converting data types
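Type conversions typically look like the sketch below. The data frame is hypothetical and deliberately stores every column as text, as can happen after a careless import.

```r
# Hypothetical stand-in in which every column was read as character.
df <- data.frame(
  Country.Territory = c("Country A", "Country B"),
  Year = c("1990", "1991"),
  Malaria = c("12", "30"),
  stringsAsFactors = FALSE
)

# Convert each column to a type that suits the analysis.
df$Country.Territory <- as.factor(df$Country.Territory)  # categorical variable
df$Year <- as.integer(df$Year)                           # whole-number year
df$Malaria <- as.numeric(df$Malaria)                     # numeric count

str(df)
```

as.numeric() returns NA with a warning for text that is not a number, so it is worth re-checking the missing-value counts after converting.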

Create new variables

# Sum the cause-of-death columns (columns 4 to 33) into a per-row total
cleaned_df$Total_Deaths <- rowSums(cleaned_df[, 4:33])
cleaned_df$Total_Deaths

Aggregate data by year and calculate the total deaths for each year

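Calculate the average deaths per year. The value printed below, 1464349925, equals the sum of the thirty cause totals listed earlier, so it is the overall number of deaths across all years rather than a per-year average.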

## [1] 1464349925
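The yearly totals and their average could be computed as in this sketch. The rows are hypothetical, and Total_Deaths is assumed to have been created already, as in the step above.

```r
# Hypothetical stand-in with a precomputed Total_Deaths column.
cleaned_df <- data.frame(
  Year = c(1990, 1990, 1991, 1991),
  Total_Deaths = c(100, 200, 150, 250)
)

# Total deaths for each year.
deaths_by_year <- aggregate(Total_Deaths ~ Year, data = cleaned_df, FUN = sum)
deaths_by_year

# Average of the yearly totals: the per-year average.
avg_per_year <- mean(deaths_by_year$Total_Deaths)

# Sum over all rows: the overall total, which is a different quantity.
grand_total <- sum(cleaned_df$Total_Deaths)

c(avg_per_year = avg_per_year, grand_total = grand_total)
```

The per-year average is the grand total divided by the number of years, so the two values coincide only when the data cover a single year.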

7th: Use plotly to make the plots interactive.
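An interactive line chart could be built with plotly roughly as follows. This is a sketch on hypothetical toy rows, not the original plot; it assumes the plotly package is installed and skips the plotting step otherwise.

```r
# Hypothetical stand-in: yearly deaths for two countries.
yearly <- data.frame(
  Country.Territory = rep(c("Country A", "Country B"), each = 3),
  Year = rep(1990:1992, times = 2),
  Cardiovascular.Diseases = c(100, 110, 120, 900, 920, 940)
)

# Build the figure only if plotly is available.
if (requireNamespace("plotly", quietly = TRUE)) {
  fig <- plotly::plot_ly(yearly,
                         x = ~Year,
                         y = ~Cardiovascular.Diseases,
                         color = ~Country.Territory,
                         type = "scatter",
                         mode = "lines+markers")
  fig <- plotly::layout(fig,
                        title = "Cardiovascular deaths by country",
                        yaxis = list(title = "Deaths"))
  # Call print(fig) to display the chart in the viewer or a browser.
}
```

Hovering over a point in the rendered chart shows the exact year and value, and clicking a legend entry hides or shows that country.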

## `summarise()` has grouped output by 'Country.Territory'. You can override using
## the `.groups` argument.

Conclusion

Cardiovascular diseases have been a major cause of death in the top five countries in the world. This is partly due to the large populations of these five countries. The number of deaths has increased over time, which may reflect unhealthy living, increased intake of dangerous substances such as drugs, alcohol, and tobacco, and a polluted environment. Neoplasms, the second leading cause of death, have also posed a significant danger to these countries. There is an anomaly in the dataset indicating that Russia has zero deaths due to malaria. (This could be due to a dataset issue.) The data is displayed below.

It is worth mentioning that, depending on the individual analysis or research aims, further preparation and data wrangling may be required to address missing values, outliers, and inconsistencies in the dataset.