Part I - Data Visualisation

The visualisation analysed here is “Analysis of death causes of Celebrities”, created by Elena Petrova and posted on Kaggle. It was created to investigate the claim that 2016 had an unnaturally large number of celebrity deaths. The original visualisation is given below.

Celebrity Deaths

Taken from: Kaggle

The data used for the visualisation is available at the following link:

https://www.kaggle.com/hugodarwood/celebrity-deaths/downloads/celebrity-deaths.zip

Part II - Deconstruct

The visualisation conveys the most frequent causes of celebrity deaths through a bar plot. It shows the number of deaths associated with eight frequent causes. Some of the significant problems with the visualisation that could be fixed are:

  1. Many of the frequent causes of celebrity deaths are hidden in the “Other” category. Not all of the frequent causes are shown, e.g. “Death after long illness” is assigned to the “Other” category.
  2. Some categories are very small compared with large groups such as “Other” and “Cancer”, so they are difficult to identify in the plot, e.g. “Murder”.
  3. The change in values over the years is small, so differences are difficult to identify, e.g. the change in the number of people who died of cancer or heart disease in recent years.
  4. The title does not clearly reflect the analysis behind the visualisation, so it is difficult to understand the story from the plot alone. A proper heading and a proper legend title would have made it more convincing.

A further problem stems from a data manipulation error: switch cases do not accept the ‘OR’ operator, and the author used ‘OR’ to group similar causes of death, which left many of the frequent causes in the ‘Other’ category.
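The grouping problem described above can be avoided by matching each cause against a pattern that lists the alternatives, instead of joining alternatives with OR inside a switch. The following is a minimal sketch in base R, not the original author's code; the helper name `classify_cause`, the three patterns and the sample strings are assumptions made for illustration.

```r
# Hypothetical helper: map free-text causes of death to broad groups.
# Alternatives within a group are written as one regular expression,
# e.g. "cancer|leukemia", so no OR operator is needed in a switch.
classify_cause <- function(cause) {
  patterns <- c(
    "Cancer"             = "cancer|leukemia",
    "Cardiac"            = "heart|cardiac",
    "After Long Illness" = "illness"
  )
  # Start with "Other Reasons"; keep missing values as NA
  group <- ifelse(is.na(cause), NA_character_, "Other Reasons")
  # Walk the patterns in reverse so that earlier patterns take priority
  for (label in rev(names(patterns))) {
    hit <- !is.na(cause) & grepl(patterns[[label]], cause, ignore.case = TRUE)
    group[hit] <- label
  }
  group
}

# Made-up sample values, for illustration only
sample_causes <- c("lung cancer", "heart attack", "long illness", "drowning", NA)
sample_groups <- classify_cause(sample_causes)
print(sample_groups)
```

Because the patterns are applied in a fixed priority order, a cause that matches more than one group is assigned to the group listed first, which is the same behaviour as a SQL CASE expression.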

Part III - Reconstruct

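# load the packages used below
library(dplyr)    # %>% and as_tibble()
library(sqldf)    # SQL queries on data frames
library(ggplot2)  # plotting
library(viridis)  # scale_fill_viridis()

# importing dataset
CelebrityDeath <- read.csv("~/Downloads/celebrity_deaths.csv",
                           na.strings = c("", "NA"),
                           stringsAsFactors = FALSE) %>%
  as_tibble()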
# sql solution: assign each cause_of_death to a broad group
# CASE returns the first matching branch, so the order of the branches matters
# (e.g. 'lung cancer' is counted as Cancer, not Pneumonia).
# LIKE matches substrings, so a short pattern such as '%car%' also matches longer words.
CelebrityDeath <- sqldf("SELECT *,
CASE 
  WHEN cause_of_death LIKE '%cancer%' THEN 'Cancer'
  WHEN cause_of_death LIKE '%leukemia%' THEN 'Cancer'
  WHEN cause_of_death LIKE '%natural%' THEN 'Natural Death'
  WHEN cause_of_death LIKE '%dies%' THEN 'Natural Death'
  WHEN cause_of_death LIKE '%sleep%' THEN 'Natural Death'
  WHEN cause_of_death LIKE '%health%' THEN 'Natural Death'
  WHEN cause_of_death LIKE '%murder%' THEN 'Murder'
  WHEN cause_of_death LIKE '%stab%' THEN 'Murder'
  WHEN cause_of_death LIKE '%kill%' THEN 'Murder'
  WHEN cause_of_death LIKE '%shot%' THEN 'Murder'
  WHEN cause_of_death LIKE '%Alzheimer%' THEN 'Alzheimer or Parkinson'
  WHEN cause_of_death LIKE '%Parkinson%' THEN 'Alzheimer or Parkinson'
  WHEN cause_of_death LIKE '%kidney%' THEN 'Kidney, Liver or Brain Damage'
  WHEN cause_of_death LIKE '%liver%' THEN 'Kidney, Liver or Brain Damage'
  WHEN cause_of_death LIKE '%brain%' THEN 'Kidney, Liver or Brain Damage'
  WHEN cause_of_death LIKE '%cerebral%' THEN 'Kidney, Liver or Brain Damage'
  WHEN cause_of_death LIKE '%heart%' THEN 'Cardiac'
  WHEN cause_of_death LIKE '%stroke%' THEN 'Cardiac'
  WHEN cause_of_death LIKE '%cardiac%' THEN 'Cardiac'
  WHEN cause_of_death LIKE '%chest%' THEN 'Cardiac'
  WHEN cause_of_death LIKE '%illness%' THEN 'After Long Illness'
  WHEN cause_of_death LIKE '%suicide%' THEN 'Suicide'
  WHEN cause_of_death LIKE '%pneumonia%' THEN 'Pneumonia'
  WHEN cause_of_death LIKE '%lung%' THEN 'Pneumonia'
  WHEN cause_of_death LIKE '%pulmonary%' THEN 'Pneumonia'
  WHEN cause_of_death LIKE '%respiratory%' THEN 'Pneumonia'
  WHEN cause_of_death LIKE '%crash%' THEN 'Accident'
  WHEN cause_of_death LIKE '%accident%' THEN 'Accident'
  WHEN cause_of_death LIKE '%fall%' THEN 'Accident'
  WHEN cause_of_death LIKE '%collision%' THEN 'Accident'
  WHEN cause_of_death LIKE '%car%' THEN 'Accident'
  WHEN cause_of_death LIKE '%bike%' THEN 'Accident'
  WHEN cause_of_death LIKE '%biking%' THEN 'Accident'
  WHEN cause_of_death LIKE '%injury%' THEN 'Accident'
  WHEN cause_of_death IS NULL THEN NULL
ELSE 'Other Reasons' END AS Death_Cause
FROM CelebrityDeath")

# count deaths per cause and year (table() drops rows with a missing cause)
Manipulated_Data <- as.data.frame(table(Death_Cause = CelebrityDeath$Death_Cause,
                                        death_year = CelebrityDeath$death_year))

# heat map of death counts by year and cause
ggplot(Manipulated_Data, aes(death_year, Death_Cause, fill = Freq)) +
  geom_tile() +
  theme(text = element_text(size = 12, family = "Calibri"),
        plot.title = element_text(hjust = 0.5)) +
  scale_fill_viridis(discrete = FALSE, option = "D",
                     guide = guide_legend(title = "Frequency of Deaths")) +
  ggtitle("Major reasons for Celebrity Deaths from 2006 to 2016") +
  xlab("Year of Death") +
  ylab("Reason of Death")

Modified Visualisation from: Kaggle
Data taken from: Wikipedia

The data manipulation issue was corrected, and the visualisation was modified to address the problems identified in the original:

  1. Other frequent causes of celebrity deaths are now included in the visualisation. These include “Kidney, Liver or Brain Damage”, “After Long Illness” and “Natural Death”.
  2. All the frequent groups are represented on the y-axis, so it is easier to identify individual groups and see how they change over the years.
  3. It is clearly visible that cancer was the most frequent cause of celebrity death in 2016, and that the number of celebrities who died of cancer has increased over the years. Similarly, for heart disease, the change in the frequency of deaths is easy to recognise.
  4. The title has been changed so that the story of the visualisation can be read from it. The legend title has also been changed for easier understanding.

A Trifecta check-up was carried out on the visualisation. The question behind it was “Did 2016 have an unnaturally large number of celebrity deaths?” All three parts of the check-up give a positive answer: the visualisation has an interesting question behind it, a reliable data source is provided, and the visualisation answers the question behind the analysis.
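
The question about 2016 can also be examined directly by tabulating total deaths per year alongside the heat map. The sketch below is only an illustration under stated assumptions: `toy_deaths` is a small made-up data frame standing in for the real dataset, only the column name `death_year` is taken from the code above, and the ratio column is a hypothetical descriptive measure rather than a formal test.

```r
# Made-up stand-in for the real data; only the column name death_year
# matches the dataset used in the reconstruction
toy_deaths <- data.frame(
  death_year = c(2014, 2014, 2015, 2015, 2015, 2016, 2016, 2016, 2016)
)

# Count deaths per year
yearly_totals <- as.data.frame(table(death_year = toy_deaths$death_year),
                               stringsAsFactors = FALSE)

# Compare each year's count with the mean count of all other years
yearly_totals$ratio_to_other_years <- sapply(
  seq_len(nrow(yearly_totals)),
  function(i) yearly_totals$Freq[i] / mean(yearly_totals$Freq[-i])
)
print(yearly_totals)
```

With the real data, `toy_deaths` would be replaced by `CelebrityDeath`. A ratio well above 1 for 2016 would be consistent with an unusually high number of deaths in that year, although it would not by itself show that the difference is more than chance or a reporting effect.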

