Introduction

Download the file ProgAssignment3-data.zip file containing the data for Programming Assignment 3 from the Coursera web site. Unzip the file in a directory that will serve as your working directory. When you start up R make sure to change your working directory to the directory where you unzipped the data.

The data for this assignment come from the Hospital Compare web site (http://hospitalcompare.hhs.gov) run by the U.S. Department of Health and Human Services. The purpose of the web site is to provide data and information about the quality of care at over 4,000 Medicare-certified hospitals in the U.S. This dataset essentially covers all major U.S. hospitals. This dataset is used for a variety of purposes, including determining whether hospitals should be fined for not providing high quality care to patients (see http://goo.gl/jAXFX for some background on this particular topic).

The Hospital Compare web site contains a lot of data and we will only look at a small subset for this assignment. The zip file for this assignment contains three files

A description of the variables in each of the files is in the included PDF file named Hospital_Revised_Flatfiles.pdf. This document contains information about many other files that are not included with this programming assignment. You will want to focus on the variables for Number 19 (“Outcome of Care Measures.csv”) and Number 11 (“Hospital Data.csv”). You may find it useful to print out this document (at least the pages for Tables 19 and 11) to have next to you while you work on this assignment. In particular, the numbers of the variables for each table indicate column indices in each table (i.e. “Hospital Name” is column 2 in the outcome-of-care-measures.csv file).

Downloading data

Set up directory for downloading data

dir.create("./data", showWarnings = FALSE)

downloadURL <- "https://d396qusza40orc.cloudfront.net/rprog%2Fdata%2FProgAssignment3-data.zip"
downloadedFile <- "./data/rprog_data_ProgAssignment3-data.zip"

Download and upzip data

if(!file.exists(downloadedFile)) {
    download.file(downloadURL, downloadedFile, method = "curl")
    unzip(downloadedFile, exdir = "./data")
}

1. Plot the 30-day mortality rates for heart attack

Read the outcome data into R via the read.csv function and get the number of dimensions.

outcome <- read.csv("./data/outcome-of-care-measures.csv", colClasses = "character")
dim(outcome)
## [1] 4706   46

Make a simple histogram of the 30-day death rates from heart attack

outcome[, 11] <- as.numeric(outcome[, 11])
hist(outcome[, 11], main = "30-day death rates from heart attack", xlab = "Death rates")

→ As can be seen from the plot, the majority death rates is 16.

2. Finding the best hospital in a state

In this part, we will write a function to discover a name of the hospital in the outcome-of-care-measures.csv file has the best (i.e. lowest) 30-day mortality for the specific outcome in that state. The hospital name is the name provided in the Hospital.Name variable. The outcomes can be one of “heart attack”, “heart failure” or “pneumonia”. Hospitals that do not have data on a particular outcome should be excluded from the set of hospitals when deciding the rankings.

Handling ties

If there is a tie for the best hospital for a given outcome, then the hospital names should be sorted in alphabetical order and the first hospital in the set should be chosen (i.e. if hospitals “b”, “c” and “f” are tied for the best, then hospital “b” should be returned).

Function summary

best(state, outcome)

Input

  • state: the 2-character abbreviated name of a state
  • outcome: an outcome name

Output

Return a hospital name in that sate with the lowest 30-day death rate

Function definitions

  • Util functions

check.validity(state, outcome): Check the validity of state and outcome with the following conditions: state must be a 2-character abbreviated name of a state, outcome must be one of “heart attack”, “heart failure” or “pneumonia”

check.validity <- function(state, outcome) {
    stopifnot(nchar(state) == 2)
    stopifnot(outcome %in% c("heart attack", "heart failure", "pneumonia"))
}

get.hospital(df, state, outcome): filter the hospital in that state then sort them by 30-day death rate and name

get.hospital <- function(df, state, outcome) {
    ## uppercase the first letter of every word in 'outcome', 
    ## remove spaces between words and make the column name
    outcome %<>% str_to_title() %>%
        make.names() %>%
        paste("Hospital.30.Day.Death..Mortality..Rates.from.", ., sep = "")
    
    ## find the hospital names and their outcome in 'state'
    hospital.outcome <- df[df$State == state, c("Hospital.Name", outcome)]
    hospital.outcome[, outcome] <- as.numeric(hospital.outcome[, outcome])
    
    ## remove rows containing NA values
    hospital.outcome <- na.omit(hospital.outcome)
    
    ## sort 'hospital.outcome'
    hospital.outcome <- hospital.outcome[order(hospital.outcome[, outcome],
                                               hospital.outcome$Hospital.Name), ]
    
    return(hospital.outcome)
}
  • Main function
library(stringr)
library(dplyr)
best <- function(state, outcome) {
    ## read the outcome data
    df <- read.csv("./data/outcome-of-care-measures.csv", colClasses = "character")
    
    check.validity(state, outcome)
    
    hospital.outcome <- get.hospital(df, state, outcome)
    
    return(hospital.outcome[1, 1])
}
  • Some illustrations
best("TX", "heart attack")
## [1] "CYPRESS FAIRBANKS MEDICAL CENTER"
best("TX", "heart failure")
## [1] "FORT DUNCAN MEDICAL CENTER"
best("MD", "heart attack")
## [1] "JOHNS HOPKINS HOSPITAL, THE"
best("MD", "pneumonia")
## [1] "GREATER BALTIMORE MEDICAL CENTER"
best("BB", "heart attack")
## [1] NA
best("NY", "hert attack")
## Error in check.validity(state, outcome): outcome %in% c("heart attack", "heart failure", "pneumonia") is not TRUE

3. Ranking hospitals by outcome in a state

In this part, we will write a function called rankhospital, the function reads the outcome-of-care-measures.csv file and returns a character vector with the name of the hospital that has the specified ranking. Hospitals that do not have data on a particular outcome should be excluded from the set of hospitals when deciding the rankings.

Handling ties

It may occur multiple hospitals have the same 30-day mortality rate for a given cause of death. In those cases ties should be broken by using the hospital name.

Function summary

rankhospital(state, outcome, name)

Input

  • state: The 2-character abbreviated name of a state
  • outcome: An outcome name
  • num: The ranking of a hospital in that state for that outcome. This argument can take values “best”, “worst”, or an integer indicating the ranking (smaller numbers are better). If the given number is larger than the number of hospitals in that state, then the function should return NA.

Output

Return hospital names in that state with the given rank 30-day death rate. For example, the call rankhospital("MD", "heart failure", 5) would return a character vector containing the name of the hospital with 5th lowest 30-day death rate for heart value.

Function definition

  • Main function
rankhospital <- function(state, outcome, num = "best") {
    ## read the outcome data
    df <- read.csv("./data/outcome-of-care-measures.csv", colClasses = "character")
    
    check.validity(state, outcome)
    
    hospital.outcome <- get.hospital(df, state, outcome)
    
    if(num == "best") return(hospital.outcome[1, 1])
    if(num == "worst") return(tail(hospital.outcome[, 1], 1))
    return(hospital.outcome[num, 1])
}
  • Some illustrations
rankhospital("TX", "heart failure", 4)
## [1] "DETAR HOSPITAL NAVARRO"
rankhospital("MD", "heart attack", "worst")
## [1] "HARFORD MEMORIAL HOSPITAL"
 rankhospital("MN", "heart attack", 5000)
## [1] NA

4. Ranking hospitals in all states

In this part, we will write a function called rankdall, the function reads the outcome-of-care-measures.csv file and returns a 2-column data frame containing the hospital in each state that has the given ranking.

Handling ties

The rankall function should handle ties in the 30-day mortality rates in the same way that the rankhospital function handles ties. The function should return a value for every state (some may be NA). The first column in the data frame is named hospital, which contains the hospital name, and the second column is named state, which contains the 2-character abbreviation for the state name. Hospitals that do not have data on a particular outcome should be excluded from the set of hospitals when deciding the rankings.

Function summary

rankall(outcome, num = "best")

Input

  • outcome: An outcome name
  • num: The ranking of a hospital in that state for that outcome. This argument can take values “best”, “worst”, or an integer indicating the ranking (smaller numbers are better). If the given number is larger than the number of hospitals in that state, then the function should return NA.

Output

Return a data frame with the hospital names and the abbreviated state name. For example the function call rankall("heart attack", "best") would return a data frame containing the names of the hospitals that are the best in their respective states for 30-day heart attack death rates.

Function definition

  • Main function
rankall <- function(outcome, num = "best") {
    ## read the 'State' column in the outcome data
    state.df <- read.csv("./data/outcome-of-care-measures.csv", colClasses = c(rep("NULL", 6), "character", rep("NULL", 39)))[, 1]
    
    states<- unique(sort(state.df))
    
    hospitals <- c()
    for (state in states) {
        hospital <- rankhospital(state, outcome, num)
        hospitals <- append(hospitals, hospital)
    }
    
    data.frame(state = states, hospital = hospitals)
}
  • Some illustrations
head(rankall("heart attack", 20), 10)
##    state                            hospital
## 1     AK                                <NA>
## 2     AL      D W MCMILLAN MEMORIAL HOSPITAL
## 3     AR   ARKANSAS METHODIST MEDICAL CENTER
## 4     AZ JOHN C LINCOLN DEER VALLEY HOSPITAL
## 5     CA               SHERMAN OAKS HOSPITAL
## 6     CO            SKY RIDGE MEDICAL CENTER
## 7     CT             MIDSTATE MEDICAL CENTER
## 8     DC                                <NA>
## 9     DE                                <NA>
## 10    FL      SOUTH FLORIDA BAPTIST HOSPITAL
tail(rankall("pneumonia", "worst"), 3)
##    state                                   hospital
## 52    WI MAYO CLINIC HEALTH SYSTEM - NORTHLAND, INC
## 53    WV                     PLATEAU MEDICAL CENTER
## 54    WY           NORTH BIG HORN HOSPITAL DISTRICT
tail(rankall("heart failure"), 10)
##    state                                                          hospital
## 45    TN                         WELLMONT HAWKINS COUNTY MEMORIAL HOSPITAL
## 46    TX                                        FORT DUNCAN MEDICAL CENTER
## 47    UT VA SALT LAKE CITY HEALTHCARE - GEORGE E. WAHLEN VA MEDICAL CENTER
## 48    VA                                          SENTARA POTOMAC HOSPITAL
## 49    VI                            GOV JUAN F LUIS HOSPITAL & MEDICAL CTR
## 50    VT                                              SPRINGFIELD HOSPITAL
## 51    WA                                         HARBORVIEW MEDICAL CENTER
## 52    WI                                    AURORA ST LUKES MEDICAL CENTER
## 53    WV                                         FAIRMONT GENERAL HOSPITAL
## 54    WY                                        CHEYENNE VA MEDICAL CENTER

Libraries

Here are the libraries and their versions that I am using in this project:

##   Library Version
## 1 stringr   1.4.0
## 2   dplyr   1.0.7