Click the Original, Code and Reconstruction tabs to read about the issues and how they were fixed.

Original


Source: Kaggle (2017).


The data used in this assessment were sourced from the Occupational Safety and Health Administration (OSHA), a US government agency, and cover the years 2015 to 2017. The data were collected from employers, who have reported all severe work-related injuries since January 2015. The dataset covers about 22,000 incidents and contains 26 columns, including geographical data, employer details, and outcomes.

Objective

  • The primary goal of the data visualization is to draw the audience’s attention to the secondary sources of worker injuries in the US from 2015 to 2017.

  • The target audience for this visualization is the government, employers and workers who are trying to reduce the accidents that occur during work hours in the US. Workers are the victims of these injuries and are responsible for taking more precautions at work; employers are responsible for their employees’ safety and for providing a safe working environment; and the government is responsible for analysing root causes to find better solutions and for tightening rules and regulations so that the working environment is safe for everyone.

The visualisation chosen had the following three main issues:

  • Selection of the chart: The author used a pyramid chart to represent too many secondary sources. A pyramid chart is suitable only for data that has a hierarchical structure, so here it makes the data difficult to read and understand. For example, the chart has so many divisions that the audience must read the numeric value of each one and compare it with the others, which is time consuming.
  • Poor labelling: The labels on the pyramid chart are cluttered, and the labels of the smaller segments are hard to see, which may mislead the audience. For example, the water label sits next to the structural element segment. Although the labels have arrows, the labelling is confusing and reduces readability.
  • Deceptive: Some segments have the same value, yet one is drawn larger than the other. This can give the false impression that one segment has a higher value than the other. For example, strapping and jacks both have a count of 37, but strapping is placed below jacks and so occupies a larger portion of the pyramid. This may deceive an audience that does not read the numeric values.
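
The distortion described in the last point can be illustrated with a short calculation. This is only a sketch under stated assumptions: it treats the pyramid as a plain triangle with its apex at the top and assumes each segment's height is proportional to its count, which may not match how the original chart was drawn. All counts other than the two values of 37 are hypothetical.

```r
# Sketch: drawn area of equal-count segments in a triangular pyramid chart.
# Assumptions: plain triangle, apex at the top, segment height proportional
# to count. Only the two counts of 37 come from the text; the others are
# hypothetical placeholders.
counts <- c(other_top = 20, jacks = 37, strapping = 37, other_bottom = 60)

total  <- sum(counts)
bottom <- cumsum(counts) / total   # depth of each segment's lower edge (0 = apex, 1 = base)
top    <- bottom - counts / total  # depth of each segment's upper edge

# In a triangle of unit height and unit base width, the width at depth d is d,
# so the area between two depths is (bottom^2 - top^2) / 2.
area <- (bottom^2 - top^2) / 2

result <- data.frame(
  source     = names(counts),
  count      = unname(counts),
  area_share = round(area / sum(area), 3)
)
print(result)
```

Under these assumptions the strapping segment takes up a larger share of the drawn area than the jacks segment despite the identical count, because it sits lower, where the triangle is wider. A bar chart avoids this because bar length depends only on the count.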

Reference

Code

The following code was used to fix the issues identified in the original.

library(ggplot2)
library(readr)
library(dplyr)
library(lubridate)
library(highcharter)
library(data.table)


#Load the data
df <- read_csv("severeinjury.csv")


#Replace the spaces in the column names with underscores
name <- names(df)

clean_name <- gsub(' ', '_', name)
colnames(df) <- clean_name

#Keep the columns needed for the analysis and parse the event date
df <- df %>%
  select(EventDate, Employer, Zip, City, State, Longitude, Latitude,
         NatureTitle, Part_of_Body_Title, Hospitalized, Amputation,
         EventTitle, SourceTitle, Secondary_Source_Title, Final_Narrative) %>%
  mutate(EventDate = mdy(EventDate))


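#Count hospitalisation or amputation cases per secondary source (top 30)
df2 <- df %>%
  filter(Hospitalized != 0 | Amputation != 0) %>%  #element-wise OR; || is not vectorised
  na.omit() %>%  #drops rows with a missing value in any selected column
  group_by(Secondary_Source_Title) %>%
  summarise(count = n()) %>%
  arrange(desc(count)) %>%
  head(30)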

#Horizontal bar chart of the top 30 sources with a reference line at their mean count
p1 <- df2 %>%
  hchart("bar", innerSize = '60%', showInLegend = FALSE,
         hcaes(x = Secondary_Source_Title, y = count, colour = -count)) %>%
  hc_add_theme(hc_theme_flat()) %>%
  hc_title(text = "Top 30 secondary sources of worker injuries from 2015 onward") %>%
  hc_credits(enabled = TRUE, text = "Source: U.S. Department of Labor") %>%
  hc_yAxis(plotLines = list(list(label = list(text = "Average"),
                                 color = "#A93226",
                                 width = 2,
                                 value = mean(df2$count))))
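
The filter-and-count step above can be checked on a small toy table before it is run on the full file. The sketch below uses base R only, so it runs without extra packages. The rows are hypothetical and are not taken from the OSHA data. It also drops only rows with a missing secondary source, which is narrower than calling na.omit() on every column.

```r
# Toy data: hypothetical rows that reuse three column names from the
# pipeline above. None of these values come from the OSHA file.
toy <- data.frame(
  Secondary_Source_Title = c("Jacks", "Jacks", "Water", NA, "Strapping", "Water"),
  Hospitalized = c(1, 0, 1, 1, 0, 0),
  Amputation   = c(0, 1, 0, 0, 0, 0)
)

# Keep rows with a hospitalisation or an amputation.
# `|` is evaluated row by row, unlike `||`, which expects single values.
severe <- toy[toy$Hospitalized != 0 | toy$Amputation != 0, ]

# table() ignores NA by default, so the row without a source is not counted.
counts <- sort(table(severe$Secondary_Source_Title), decreasing = TRUE)

# Keep the two most frequent sources (the pipeline above keeps 30).
top_sources <- head(counts, 2)
print(top_sources)
```

If the logic is right, "Strapping" should not appear in the result, because its only row has neither a hospitalisation nor an amputation.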

Data Reference

Reconstruction

The following plot fixes the main issues in the original.