We are Laura and Harley, two undergraduate public health majors who are learning to tell stories about data using the statistical programming language R. For a recent class assignment, we reproduced data visuals that had been published by Julia Silge at r-bloggers.com. The visualizations were created using mortality data from Utah’s open data catalog and a recently introduced R package called gganimate. You can find Julia’s original post here and our recreation of the visuals here.

Exciting as the gganimate package was to use, we found that when the data was animated, it was difficult to visualize how mortality rates from the top 10 causes of death have changed over the years. To get an easier-to-view picture, we plotted each of the top 10 causes of death in a single, static graph.

First we loaded the packages necessary for creating our visuals:

library(tidyr)
library(dplyr)
library(ggplot2)
library(RSocrata)

Next, we needed to load and clean our data, just like we did to create our animated graph:

deathDF <- read.socrata("https://opendata.utah.gov/resource/fu2n-aa2y.csv")
colnames(deathDF) <- c("cause", "year", "number", "notes", "population", 
                       "adjustedrate", "LL95CI", "UL95CI", "standarderror")
sapply(deathDF, class)
##         cause          year        number         notes    population 
##      "factor"     "integer"     "integer"      "factor"      "factor" 
##  adjustedrate        LL95CI        UL95CI standarderror 
##      "factor"      "factor"      "factor"      "factor"
deathDF <- deathDF[!is.na(deathDF$year),]
deathDF$cause <- as.factor(as.character(deathDF$cause))
deathDF$population <- as.numeric(gsub("[[:punct:]]", "", deathDF$population))
summary(deathDF$population)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## 2193000 2325000 2526000 2541000 2774000 2901000      67
# Escape the asterisk so it is matched literally rather than read as a regex quantifier
deathDF[,6:9] <- apply(deathDF[,6:9], 2, function(x) gsub("\\*", "", x))
deathDF[,6:9] <- apply(deathDF[,6:9], 2, as.numeric)
deathDF <- complete(deathDF, cause, year)
totalDF <- deathDF[deathDF$cause == "Total",]
deathDF <- left_join(deathDF[,c("cause", "year", "number", "adjustedrate")], 
                     totalDF[,c("year", "number", "population", "adjustedrate")],
                     by = "year")
colnames(deathDF) <- c("cause", "year", "number", "adjustedrate", 
                       "totalnumber", "population", "totaladjustedrate")
deathDF$number[is.na(deathDF$number)] <- 0
deathDF$adjustedrate[is.na(deathDF$adjustedrate)] <- 0
summary(deathDF$number)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0     4.0    32.5   473.2   172.2 12670.0
summary(deathDF$adjustedrate)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    1.63   25.31    9.24  657.50
top10 <- deathDF[deathDF$cause != "Total",] %>% 
  group_by(cause) %>% summarise(adjustedrate = mean(adjustedrate)) %>% 
  top_n(10, adjustedrate) %>% arrange(desc(adjustedrate))
deathDFtop10 <- deathDF[deathDF$cause %in% top10$cause,]
deathDFtop10$cause <- as.factor(as.character(deathDFtop10$cause))
deathDFtop10$shortcause <- deathDFtop10$cause
levels(deathDFtop10$shortcause) <- c("Alzheimer's", "Stroke", "COPD","Diabetes", "Heart disease", "Flu/pneumonia", "Suicide", "Cancer", "Kidney disease", "Accident")
deathDFtop10$shortcause <- as.factor(as.character(deathDFtop10$shortcause))
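
The cleaning step above has to turn factor columns that contain flag characters into numbers. As a small illustration of why the conversion goes through character strings first, here is a minimal sketch in base R. The values in `rate` are made up for this example and are not taken from the Utah data.

```r
# Toy values standing in for a rate column that was read in as a factor;
# these numbers are invented for illustration only.
rate <- factor(c("12.5", "*3.1", "100.2", "*"))

# as.numeric() on a factor returns the internal level codes, not the values
codes <- as.numeric(rate)

# Strip the literal asterisk (fixed = TRUE turns off regex matching),
# then convert the resulting character vector to numbers.
# An entry that was only "*" becomes an empty string and then NA.
clean <- as.numeric(gsub("*", "", as.character(rate), fixed = TRUE))
clean
```

Using `fixed = TRUE` is one alternative to escaping the asterisk in a regular expression; either approach should remove the flag character before the numeric conversion.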

Finally, we created our graph:

ggplot(data = deathDFtop10, aes(x = year, y = adjustedrate, color = cause)) +
  geom_line(size = 2.0, alpha = 0.7) +
  geom_point(size = 0.5) +
  xlab("Year") +
  ylab("Age adjusted mortality rate") +
  ggtitle("Top 10 Causes of Death in Utah")

From this graph we could see how age-adjusted mortality rates for the top 10 causes of death relate to each other and how each rate has changed over time. The graph shows that although the mortality rate for diseases of the heart has decreased substantially since 2000, it still far exceeds the rate for almost any other cause of death in Utah. Interestingly, this rate is well below the rate of heart disease deaths in the U.S. overall, most likely because Utah has the youngest population of the 50 states.

After creating this graph, we were curious to see what rates look like for other diseases in Utah, so we loaded a dataset containing information regarding the number of communicable disease cases in the state. The data included information regarding 93 different diseases from 2000-2009.

First we loaded the packages we would use to create our plot:

library(reshape2)
library(data.table)
library(RCurl)

We downloaded and cleaned the data:

disease <- getURL('https://opendata.utah.gov/api/views/wy8g-i9mg/rows.csv', 
                  ssl.verifyhost = FALSE, ssl.verifypeer = FALSE)
disease <- read.csv(textConnection(disease), header = TRUE)
disease[disease == "?"] <- NA
names(disease) <- c("Disease", "2000", "2001", "2002", "2003", "2004", "2005", "2006", "2007", "2008", "2009")
disease2 <- melt(disease, id = "Disease")
names(disease2) <- c("Disease", "Year", "n")
disease3 <- arrange(disease2, Disease)
disease4 <- filter(disease3, !is.na(n), n != "*")
# Convert via character so factor level codes are never mistaken for counts
disease4$n <- as.numeric(as.character(disease4$n))
disease4$Disease <- as.character(disease4$Disease)
disease4$Year <- as.character(disease4$Year)
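# Sum the case counts per disease; wt = n makes the weighting explicit
# instead of relying on tally() picking up the n column on its own
disease6 <- disease4 %>% 
      group_by(Disease) %>%
      tally(wt = n) %>%
      mutate(rank = row_number(-n)) %>%
      arrange(rank)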
topsix <- head(disease6, 6)
label <- as.vector(unique(topsix$Disease))
label
## [1] "Chlamydia"                                                  
## [2] "Methicillin-Resistant Staphylococcus aureus (MRSA) Isolates"
## [3] "Chickenpox"                                                 
## [4] "Gonorrhea"                                                  
## [5] "Giardiasis"                                                 
## [6] "Influenza-associated hospitalization"
disease6 <- subset(disease4, Disease %in% label)
disease6$Disease[disease6$Disease == "Methicillin-Resistant Staphylococcus aureus (MRSA) Isolates"] <- "MRSA"
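
The ranking step above picks the diseases with the most reported cases across all years. The same idea can be sketched in base R on a toy table. The data frame `toy` and its numbers are hypothetical and exist only to show the logic of summing per disease and then ordering.

```r
# Hypothetical long-format counts (made-up numbers, three diseases, two years)
toy <- data.frame(
  Disease = c("A", "A", "B", "B", "C", "C"),
  Year    = c("2000", "2001", "2000", "2001", "2000", "2001"),
  n       = c(10, 20, 500, 700, 40, 5),
  stringsAsFactors = FALSE
)

# Sum cases per disease across years
totals <- aggregate(n ~ Disease, data = toy, FUN = sum)

# Order from most to fewest total cases and keep the top two names
totals <- totals[order(-totals$n), ]
top2 <- as.character(head(totals$Disease, 2))
top2
```

Ranking on summed counts rather than on the number of rows matters here, because every disease with complete data contributes the same number of rows.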

Then we created our graph:

p4 <- ggplot(data = disease6,
             aes(x = as.numeric(Year), y = n, 
                 col = Disease, shape = Disease)) +
              geom_line(size = 1.5, alpha = 0.4) +
              geom_point(size = 2) +
              scale_x_continuous(breaks = 2000:2009) +
              scale_y_continuous(breaks = seq(0, 7000, by = 1000)) +
              xlab("Year") +
              ylab("# Reported Cases") +
              ggtitle("Total Reported Cases of Disease")
p4

Looking at the raw numbers for the six most common communicable diseases from 2000-2009, there was a large increase in reported cases of Chlamydia and, between 2002 and 2006, a sharp rise in reported cases of MRSA. Anyone planning to implement public health interventions or prevention work related to communicable diseases in Utah might want to focus on MRSA, since the sharp increase in cases is unlikely to simply be a reflection of Utah’s increasing population. Because this data set included only absolute numbers rather than numbers as a proportion of the population, it is difficult to make inferences about trends in diseases aside from MRSA. Public health officials might want to calculate the number of cases per 100,000 population to see whether these diseases have in fact been trending upward over time, or whether the increased numbers can be attributed to population growth.
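
The per-100,000 calculation described above could look something like the sketch below. The case counts and populations are invented for illustration. The communicable disease data set does not include population, so real yearly estimates would have to be joined in from another source.

```r
# Hypothetical case counts and populations (made-up numbers for illustration)
cases <- data.frame(
  Year       = c(2000, 2009),
  n          = c(2000, 3000),
  population = c(2000000, 3000000)
)

# Cases per 100,000 residents
cases$rate_per_100k <- cases$n / cases$population * 100000
cases
```

In this toy example the raw count rises by half while the rate stays the same, which is the kind of pattern that would point to population growth rather than a true increase in disease.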