Coursera's “Computing For Data Analysis” Programming Assignment 3 Graphs Using ggplot2

The base graphics package in R is powerful and flexible, but has some quirks, inconsistencies and other issues. The ggplot2 package aims to “take the good parts of base and lattice graphics and none of the bad parts.”

Two fundamentals of ggplot2 are that it works off of data frames, which makes it more straightforward to use with a wide variety of workflows in R, and that it is based on the grammar of graphics.

It can handle much of the dirty work for you and can make very visually appealing charts with little effort.
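As a minimal sketch of those two ideas (assuming the built-in mtcars data set, which is not part of this assignment), a plot starts from a data frame and is built up by adding layers:

```r
library(ggplot2)

# mtcars ships with R; it stands in here for any data frame
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +  # data + aesthetic mapping
    geom_point() +                            # layer 1: the raw points
    geom_smooth(method = "lm")                # layer 2: a linear fit

# the plot is an ordinary object; printing it draws it
print(p)
```

Swapping or adding a layer changes the chart without touching the data or the mapping.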

I thought it would be an interesting exercise to redo the graphs from Programming Assignment 3 using ggplot2.

We'll be using the following libraries (there's a short description next to each of them):

library(ggplot2)  # need this to use 'ggplot'
library(ggthemes)  # many additional beautiful themes
library(gridExtra)  # provides 'grid.arrange()', which stands in for 'par()' layouts

We still need to do the data work:

df = read.csv("~/Desktop/p3/outcome-of-care-measures.csv", colClasses = "character")

# I like to work with column names vs indices

colnames(df)[11] = "HeartAttack"
colnames(df)[17] = "HeartFailure"
colnames(df)[23] = "Pneumonia"

# convert everything we need to numbers

df$HeartAttack = as.numeric(df$HeartAttack)
df$HeartFailure = as.numeric(df$HeartFailure)
df$Pneumonia = as.numeric(df$Pneumonia)

# compute the ranges of each column

rHA = range(df$HeartAttack, na.rm = TRUE)
rHF = range(df$HeartFailure, na.rm = TRUE)
rP = range(df$Pneumonia, na.rm = TRUE)

# get min/max for xlim

rng = c(min(c(rHA[1], rHF[1], rP[1])), max(c(rHA[2], rHF[2], rP[2])))

# get mean values

meanHeartAttack = round(mean(df$HeartAttack, na.rm = TRUE))
meanHeartFailure = round(mean(df$HeartFailure, na.rm = TRUE))
meanPneumonia = round(mean(df$Pneumonia, na.rm = TRUE))
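To make the conversion step concrete, here is a small sketch on made-up values. The toy data frame, its numbers, and the "Not Available" placeholder string are assumptions for illustration, not the assignment data. Strings that cannot be parsed become NA under as.numeric(), which is why every summary above passes na.rm = TRUE:

```r
# toy stand-in for two of the rate columns, read in as character
toy <- data.frame(HeartAttack = c("14.3", "Not Available", "16.1"),
                  Pneumonia = c("11.0", "12.5", "Not Available"),
                  stringsAsFactors = FALSE)

# as.numeric() warns and returns NA for strings it cannot parse
toy$HeartAttack <- suppressWarnings(as.numeric(toy$HeartAttack))
toy$Pneumonia <- suppressWarnings(as.numeric(toy$Pneumonia))

# one shared range across both columns, ignoring the NAs
toy_rng <- range(c(toy$HeartAttack, toy$Pneumonia), na.rm = TRUE)
toy_rng  # 11.0 16.1

# rounded mean, in the same style as the values used in the plot titles
toy_mean <- round(mean(toy$HeartAttack, na.rm = TRUE))
toy_mean  # 15
```

Computing the range over the concatenated columns gives the same limits as taking the min and max of the per-column ranges.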

Here's where the ggplot work comes in. We build three graphs of the death rate for each condition and put them into a grid.

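I like to put each ggplot element on a separate line as I'm building a plot, and that also makes it easier to explain what's going on: geom_histogram() draws the histogram on a density scale, geom_density() overlays the density curve, geom_vline() marks the median, labs() puts the rounded mean in the title, scale_x_continuous() gives all three plots the same x limits, and theme_few() comes from ggthemes.

Lather/rinse/repeat for each death rate.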

haPlot = ggplot(df, aes(x = HeartAttack)) +
    geom_histogram(aes(y = ..density..), fill = "lightblue", color = "black") +
    geom_density() +
    geom_vline(xintercept = median(df$HeartAttack, na.rm = TRUE), color = "maroon") +
    labs(title = paste("Heart Attack 30-day Death Rate ( X =", meanHeartAttack, ")"),
        x = "30-day Death Rate") +
    scale_x_continuous(limits = rng) +
    theme_few()

hfPlot = ggplot(df, aes(x = HeartFailure)) +
    geom_histogram(aes(y = ..density..), fill = "lightblue", color = "black") +
    geom_density() +
    geom_vline(xintercept = median(df$HeartFailure, na.rm = TRUE), color = "maroon") +
    labs(title = paste("Heart Failure 30-day Death Rate ( X =", meanHeartFailure, ")"),
        x = "30-day Death Rate") +
    scale_x_continuous(limits = rng) +
    theme_few()

pnPlot = ggplot(df, aes(x = Pneumonia)) +
    geom_histogram(aes(y = ..density..), fill = "lightblue", color = "black") +
    geom_density() +
    geom_vline(xintercept = median(df$Pneumonia, na.rm = TRUE), color = "maroon") +
    labs(title = paste("Pneumonia 30-day Death Rate ( X =", meanPneumonia, ")"),
        x = "30-day Death Rate") +
    scale_x_continuous(limits = rng) +
    theme_few()
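Since the three plots differ only in the column and the label, the repetition could be folded into a function. The sketch below is one way to do that, not the approach used in this post: rate_title() and rate_histogram() are hypothetical helpers, the toy data is made up, after_stat() assumes ggplot2 3.3.0 or later, and theme_few() is left out so the block needs only ggplot2.

```r
library(ggplot2)

# hypothetical helper: title text with the rounded mean, as in the plots above
rate_title <- function(label, values) {
    paste(label, "30-day Death Rate ( X =", round(mean(values, na.rm = TRUE)), ")")
}

# hypothetical helper: density-scaled histogram, density curve and median line
rate_histogram <- function(data, column, label, xlim) {
    values <- data[[column]]
    ggplot(data, aes(x = .data[[column]])) +
        geom_histogram(aes(y = after_stat(density)), bins = 30,
                       fill = "lightblue", color = "black", na.rm = TRUE) +
        geom_density(na.rm = TRUE) +
        geom_vline(xintercept = median(values, na.rm = TRUE), color = "maroon") +
        labs(title = rate_title(label, values), x = "30-day Death Rate") +
        scale_x_continuous(limits = xlim)
}

# made-up values, only to show the call
toy <- data.frame(HeartAttack = c(14.2, 15.1, NA, 16.8, 15.5))
toy_title <- rate_title("Heart Attack", toy$HeartAttack)
toy_plot <- rate_histogram(toy, "HeartAttack", "Heart Attack", c(10, 20))
```

With the real data, the three plots would then be three calls that differ only in the column name and label.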

We then use grid.arrange() from gridExtra, which plays the role that par(mfrow = ...) plays in base graphics, to show our work:

grid.arrange(haPlot, hfPlot, pnPlot, nrow = 3)


Next, we'll show the boxplots from the assignment, and I'll show the sorted & non-sorted ones together. I'm ignoring the math symbol requirement as it seemed to be a rather silly focus for a data analysis course.

I've annotated the R code pretty well, so no need to explain more here.

# get median values and merge back into the data frame

dfHAMed = with(df, aggregate(HeartAttack, by = list(State), FUN = function(v) {
    round(median(v, na.rm = TRUE))
}))
colnames(dfHAMed) = c("State", "Median.Heart.Attack.Death.Rate")
df = merge(df, dfHAMed, by = "State")

# since we're going to show the sorted & non-sorted plots together make
# another data frame with the sorted values vs sort in place

df.sort1 = transform(df, State = reorder(State, Median.Heart.Attack.Death.Rate))

# this is a function that we'll use in stat_summary() which will provide
# the number of hospitals with data in each state as a text label placed
# at the state's mean rate (as I'm not a fan of altering the x axis labels
# with this info)

lengthformean <- function(x) {
    return(c(y = mean(x), label = length(x)))
}
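To see what these steps produce, here is a sketch on made-up rates for three states (the toy data frame is an assumption, not the assignment data). It repeats the median merge, the reorder() call and the label function on a scale small enough to check by eye:

```r
# made-up rates for three states
toy <- data.frame(State = c("TX", "TX", "TX", "AK", "AK", "NY", "NY"),
                  HeartAttack = c(13, 14, 15, 16, 17, 15, 16),
                  stringsAsFactors = FALSE)

# median per state, merged back so every row carries its state's median
med <- with(toy, aggregate(HeartAttack, by = list(State), FUN = median))
colnames(med) <- c("State", "Median")
toy <- merge(toy, med, by = "State")

# reorder() turns State into a factor whose levels are sorted by the median
sorted <- transform(toy, State = reorder(State, Median))
levels(sorted$State)  # "TX" "NY" "AK"

# same function as above: y positions the label, label is the group size
lengthformean <- function(x) {
    return(c(y = mean(x), label = length(x)))
}
tx_label <- lengthformean(toy$HeartAttack[toy$State == "TX"])
tx_label  # y = 14, label = 3
```

ggplot2 draws a discrete axis in factor-level order, so reordering the levels is all it takes to sort the boxplots.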

Now, we generate the sorted & non-sorted box plots for use with the grid plot. Again, we store these in variables (ggplots are just objects) for actual plotting later on.

bystate = ggplot(df, aes(factor(State), HeartAttack)) +
    geom_boxplot(outlier.colour = "maroon", fill = "white", color = "lightblue") +
    labs(x = "State", y = "30-day Death Rate",
        title = "Heart Attack 30-day Death Rate by State") +
    stat_summary(fun.data = lengthformean, geom = "text", color = "black",
        size = 3, position = "stack") +
    theme_few() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

bymedian = ggplot(df.sort1, aes(factor(State), HeartAttack)) +
    geom_boxplot(outlier.colour = "maroon", fill = "white", color = "lightblue") +
    labs(x = "State", y = "30-day Death Rate",
        title = "Heart Attack 30-day Death Rate by State") +
    stat_summary(fun.data = lengthformean, geom = "text", color = "black",
        size = 3, position = "stack") +
    theme_few() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

Finally, we plot the figures:

grid.arrange(bystate, bymedian, nrow = 2)


Now we're ready to compare the 30-day death rates and numbers of patients with latticed scatterplots.

# read in the additional data
hospital = read.csv("~/Desktop/p3/hospital-data.csv", colClasses = "character")

# merge the old & new data frames together
outcome.hospital = merge(df, hospital, by = "Provider.Number")

# I like decent column names to work with
colnames(outcome.hospital)[11] = "death"
colnames(outcome.hospital)[15] = "npatient"

# Make sure the necessary columns are numeric (read in as strings,
# remember)
outcome.hospital$death = as.numeric(outcome.hospital$death)
outcome.hospital$npatient = as.numeric(outcome.hospital$npatient)
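One thing to watch with merge(): any non-key column that appears in both data frames comes back with .x and .y suffixes, and the column order changes, so positions such as 11 and 15 are worth checking with colnames() before renaming. A sketch with made-up rows (only the column names are borrowed from this post, and the shared State column is an assumption for illustration):

```r
# made-up rows standing in for the two files
outcome <- data.frame(Provider.Number = c("010001", "010005"),
                      State = c("AL", "AL"),
                      HeartAttack = c(14.3, 18.5),
                      stringsAsFactors = FALSE)
hospital <- data.frame(Provider.Number = c("010005", "010001", "010007"),
                       State = c("AL", "AL", "AL"),
                       Hospital.Ownership = c("Government", "Proprietary", "Voluntary"),
                       stringsAsFactors = FALSE)

merged <- merge(outcome, hospital, by = "Provider.Number")

# key first, then the rest of the left frame, then the rest of the right frame
colnames(merged)
# "Provider.Number" "State.x" "HeartAttack" "State.y" "Hospital.Ownership"

# the default is an inner join: provider 010007 has no outcome row and is dropped
nrow(merged)  # 2
```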

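The ggplot2 library is nothing short of magic when it comes to how succinctly we can make complex graphs. For the latticed scatterplot, we draw open-circle points, add a linear fit with geom_smooth(), and split the panels by hospital ownership with facet_wrap():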

sc = ggplot(outcome.hospital, aes(x = npatient, y = death)) +
    geom_point(shape = 1) +
    geom_smooth(method = lm) +
    facet_wrap(~Hospital.Ownership) +
    labs(x = "Number of Patients Seen", y = "30-day Death Rate",
        title = "Heart Attack 30-day Death Rate by Ownership") +
    theme_few()
sc
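Combined with facet_wrap(), geom_smooth(method = lm) fits a separate straight line within each panel. A rough base-R equivalent on made-up numbers (the toy data frame and its ownership labels are assumptions) shows what those per-panel fits amount to:

```r
# made-up data: two ownership types with opposite trends
toy <- data.frame(
    ownership = rep(c("Government", "Proprietary"), each = 4),
    npatient = c(100, 200, 300, 400, 100, 200, 300, 400),
    death = c(16, 15, 14, 13, 15, 15.5, 16, 16.5),
    stringsAsFactors = FALSE)

# one lm() per ownership group, keeping only the slope on npatient
slopes <- sapply(split(toy, toy$ownership), function(d) {
    coef(lm(death ~ npatient, data = d))[["npatient"]]
})
slopes  # Government: -0.01, Proprietary: 0.005
```

The slope of each line in the faceted plot corresponds to one entry of a vector like this.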
