1- Histogram- 30-day mortality rate from heart attack

Read the outcome data into R via the read.csv function and look at the first few rows.

outcome <- read.csv("outcome-of-care-measures.csv", colClasses = "character")
head(outcome)

There are many columns in this dataset. You can see how many by typing ncol(outcome) (you can see the number of rows with the nrow function). In addition, you can see the names of each column by typing names(outcome) (the names are also in the PDF document). To make a simple histogram of the 30-day death rates from heart attack (column 11 in the outcome dataset), run

outcome[, 11] <- as.numeric(outcome[, 11])
## You may get a warning about NAs being introduced; that is okay
hist(outcome[, 11])

Because we originally read the data in as character (by specifying colClasses = "character"), we need to coerce the column to be numeric.

Creating the histogram

Approach:
- Read data into R
- plot histogram

file <- "outcome-of-care-measures.csv"  # path to the assignment's data file
outcome <- read.csv(file, colClasses = "character")
outcome[, 11] <- as.numeric(outcome[, 11])
## Warning: NAs introduced by coercion
hist(outcome[, 11])
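
The coercion warning comes from entries that are not numbers. Here is a minimal sketch on a made-up vector; the values, including the "Not Available" placeholder, are assumptions for illustration and not rows taken from the file.

```r
# Toy character vector standing in for one rate column (assumed values).
rates <- c("14.3", "Not Available", "18.5")

# as.numeric() turns non-numeric strings into NA and emits a warning;
# suppressWarnings() silences it once we know the warning is expected.
numeric_rates <- suppressWarnings(as.numeric(rates))

# hist() leaves NA values out of the plot, so the warning is harmless here.
n_missing <- sum(is.na(numeric_rates))
print(numeric_rates)
print(n_missing)
```

Counting the NAs after coercion is a quick way to check how many hospitals have no usable rate.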

2- Finding the best hospital in a state

Write a function called best that takes two arguments: the 2-character abbreviated name of a state and an outcome name. The function reads the outcome-of-care-measures.csv file and returns a character vector with the name of the hospital that has the best (i.e. lowest) 30-day mortality for the specified outcome in that state. The hospital name is the name provided in the Hospital.Name variable. The outcomes can be one of “heart attack”, “heart failure”, or “pneumonia”. Hospitals that do not have data on a particular outcome should be excluded from the set of hospitals when deciding the rankings.
Handling ties. If there is a tie for the best hospital for a given outcome, then the hospital names should be sorted in alphabetical order and the first hospital in that set should be chosen (i.e. if hospitals “b”, “c”, and “f” are tied for best, then hospital “b” should be returned).
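
The tie-breaking rule can be sketched in base R on made-up data. The data frame and its column names are hypothetical, chosen only to mirror the “b”, “c”, “f” example.

```r
# Toy data: hospitals "b", "c" and "f" share the lowest rate (assumed values).
hospitals <- data.frame(
  name = c("f", "a", "c", "b"),
  rate = c(10.1, 12.4, 10.1, 10.1),
  stringsAsFactors = FALSE
)

# order() sorts by rate first and uses the name only to break ties.
sorted <- hospitals[order(hospitals$rate, hospitals$name), ]

# The first row is the best hospital under the assignment's rule.
best_name <- sorted$name[1]
print(best_name)
```

Sorting by both keys at once avoids a separate step for detecting ties.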

Cleaning the Data:

Approach:
- Read data into R
- extract data set with only relevant information
- convert rate columns originally coded as character to numeric, and get rid of NAs
- group data by given parameters for easier analysis

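library(readr)
library(dplyr)
library(tidyr)
library(stringr)
library(magrittr)  # for extract2()
data <- read_csv(file)
reldata <- data %>%
  select(starts_with("Hospital"), -contains("Readmission"), State) %>%
  mutate_at(vars(contains("Mortality")), as.numeric) %>%
  # note: drops a hospital if any of its three rates is missing
  tidyr::drop_na() %>%
  group_by(`Hospital Name`, State)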

Creating the function:

Approach:
- use stop() function to check validity of input
- filter rows by specified state and select column with specified outcome
- arrange by rate, as well as name of hospital for tie-breaking
- extract the information we need

best <- function(state, outcome) {
  # case-insensitive pattern used to find the matching mortality column
  outcome <- regex(outcome, ignore_case = TRUE)
  `%notin%` <- Negate(`%in%`)

  if (!any(str_detect(names(reldata), outcome))) {
    stop("invalid outcome")
  } else if (state %notin% reldata$State) {
    stop("invalid state")
  } else {
    filter(reldata, State == state) %>%
      select(`Hospital Name`, State, Rate = contains(outcome)) %>%
      # lowest rate first; hospital name breaks ties alphabetically
      arrange(Rate, `Hospital Name`) %>%
      extract2(1, 1)  # first row, first column: the best hospital's name
  }
}

Test Cases:

best("TX","heart attack")
## [1] "CYPRESS FAIRBANKS MEDICAL CENTER"
best("TX", "heart failure")
## [1] "FORT DUNCAN MEDICAL CENTER"
best("MD", "heart attack")
## [1] "JOHNS HOPKINS HOSPITAL, THE"
best("MD", "pneumonia")
## [1] "GREATER BALTIMORE MEDICAL CENTER"

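Invalid input stops the function with an error: “invalid outcome” for an unrecognized outcome, “invalid state” for an unrecognized state.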
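
To see the error behavior without loading the data, here is a minimal base R sketch. The check_inputs helper and its short list of valid states are hypothetical stand-ins, not part of the solution above; tryCatch captures the message that stop() raises.

```r
# Hypothetical validator with a made-up list of valid states.
check_inputs <- function(state, outcome,
                         valid_states = c("MD", "TX"),
                         valid_outcomes = c("heart attack", "heart failure",
                                            "pneumonia")) {
  if (!(state %in% valid_states)) stop("invalid state")
  if (!(tolower(outcome) %in% valid_outcomes)) stop("invalid outcome")
  invisible(TRUE)
}

# conditionMessage() extracts the text passed to stop().
msg_state <- tryCatch(check_inputs("BB", "heart attack"),
                      error = function(e) conditionMessage(e))
msg_outcome <- tryCatch(check_inputs("TX", "hert attack"),
                        error = function(e) conditionMessage(e))
print(msg_state)
print(msg_outcome)
```

Unlike the regex match used in best, this sketch compares the outcome against an exact list, so a partial name such as "heart" would be rejected.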

3- Ranking hospitals by outcome in a state

Write a function called rankhospital that takes three arguments: the 2-character abbreviated name of a state (state), an outcome (outcome), and the ranking of a hospital in that state for that outcome (num). The function reads the outcome-of-care-measures.csv file and returns a character vector with the name of the hospital that has the ranking specified by the num argument. For example, the call rankhospital("MD", "heart failure", 5) would return a character vector containing the name of the hospital with the 5th lowest 30-day death rate for heart failure. The num argument can take values “best”, “worst”, or an integer indicating the ranking (smaller numbers are better). If the number given by num is larger than the number of hospitals in that state, then the function should return NA. Hospitals that do not have data on a particular outcome should be excluded from the set of hospitals when deciding the rankings.
Handling ties. It may occur that multiple hospitals have the same 30-day mortality rate for a given cause of death. In those cases ties should be broken by using the hospital name.
The function should check the validity of its arguments. If an invalid state value is passed to rankhospital, the function should throw an error via the stop function with the exact message “invalid state”. If an invalid outcome value is passed to rankhospital, the function should throw an error via the stop function with the exact message “invalid outcome”.
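
The rules for num can be sketched on their own, separate from the data handling. The helper name resolve_rank is hypothetical; it assumes n is the number of hospitals with data in the state.

```r
# Hypothetical helper: map num ("best", "worst" or a number) to a row index.
resolve_rank <- function(num, n) {
  if (identical(num, "best")) return(1L)
  if (identical(num, "worst")) return(as.integer(n))
  # Anything that is not a rank between 1 and n has no matching hospital.
  if (!is.numeric(num) || num < 1 || num > n) return(NA_integer_)
  as.integer(num)
}

print(resolve_rank("best", 5))
print(resolve_rank("worst", 5))
print(resolve_rank(4, 5))
print(resolve_rank(5000, 5))
```

Resolving num first keeps the later lookup to a single indexing step, and an NA index can be returned as NA directly.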

Creating the function:

Approach:
- Build upon previously filtered data:
- add “Rank” column using dplyr::mutate, but first ungroup so each hospital gets its own rank
- specify conditions for num, as we’re allowing inputs such as “best” and “worst”
- extract the information we need

rankhospital <- function(state, outcome, num = "best") {
  outcome <- regex(outcome, ignore_case = TRUE)
  `%notin%` <- Negate(`%in%`)

  if (!any(str_detect(names(reldata), outcome))) {
    stop("invalid outcome")
  } else if (state %notin% reldata$State) {
    stop("invalid state")
  }

  # rank the state's hospitals: lowest rate first, name breaks ties
  ranked <- filter(reldata, State == state) %>%
    select(`Hospital Name`, State, Rate = contains(outcome)) %>%
    arrange(Rate, `Hospital Name`) %>%
    ungroup() %>%
    mutate(Rank = row_number())

  # translate "best"/"worst" into a rank
  if (num == "best") {
    num <- min(ranked$Rank)
  } else if (num == "worst") {
    num <- max(ranked$Rank)
  } else if (num %notin% ranked$Rank) {
    return(NA)  # requested rank exceeds the number of hospitals
  }

  ranked %>%
    filter(Rank == num) %>%
    extract2(1, 1)
}

Test Cases:

rankhospital("TX", "heart failure", 4)
## [1] "DETAR HOSPITAL NAVARRO"
rankhospital("MD", "heart attack", "worst")
## [1] "HARFORD MEMORIAL HOSPITAL"
rankhospital("MN", "heart attack", 5000)
## [1] NA
rankhospital("AL", "pneumonia")
## [1] "LAWRENCE MEDICAL CENTER"

4- Ranking hospitals in all states

Write a function called rankall that takes two arguments: an outcome name (outcome) and a hospital ranking (num). The function reads the outcome-of-care-measures.csv file and returns a 2-column data frame containing the hospital in each state that has the ranking specified in num. For example the function call rankall("heart attack", "best") would return a data frame containing the names of the hospitals that are the best in their respective states for 30-day heart attack death rates. The function should return a value for every state (some may be NA). The first column in the data frame is named hospital, which contains the hospital name, and the second column is named state, which contains the 2-character abbreviation for the state name. Hospitals that do not have data on a particular outcome should be excluded from the set of hospitals when deciding the rankings.
Handling ties. The rankall function should handle ties in the 30-day mortality rates in the same way that the rankhospital function handles ties.
NOTE: For the purpose of this part of the assignment (and for efficiency), your function should NOT call the rankhospital function from the previous section.
The function should check the validity of its arguments. If an invalid outcome value is passed to rankall, the function should throw an error via the stop function with the exact message “invalid outcome”. The num variable can take values “best”, “worst”, or an integer indicating the ranking (smaller numbers are better). If the number given by num is larger than the number of hospitals in that state, then the function should return NA.
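
The requirement to return a row for every state, with NA where the rank does not exist, can be sketched in base R on made-up data. The helper rank_in_each_state and the toy values are hypothetical and are not the tidyverse solution that follows.

```r
# Toy data (assumed values); state "DC" has no usable rate at all.
toy <- data.frame(
  hospital = c("A", "B", "C", "D", "E"),
  state    = c("AK", "AL", "AL", "AL", "DC"),
  rate     = c(13.0, 15.2, 14.1, NA, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one row per state, NA where the rank does not exist.
rank_in_each_state <- function(df, num) {
  # Fix the set of states before dropping rows, so none disappears.
  df$state <- factor(df$state, levels = sort(unique(df$state)))
  df <- df[!is.na(df$rate), ]
  picked <- vapply(split(df, df$state), function(s) {
    s <- s[order(s$rate, s$hospital), ]
    if (num > nrow(s)) NA_character_ else s$hospital[num]
  }, character(1))
  data.frame(hospital = unname(picked), state = names(picked),
             stringsAsFactors = FALSE)
}

result <- rank_in_each_state(toy, 2)
print(result)
```

Turning state into a factor before filtering is what keeps states with no data in the result, since split() keeps empty factor levels by default.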

Creating the function:

Approach:
- Since the output should also show missing data, recreate the data frame without dropping NAs.
- Use if/else statements to specify behavior in case of non-numeric rank input:
- for “worst”, use dplyr::summarize to first get the highest rank per state (the worst rate), then inner_join to pull the hospital name from the original data frame; for “best”, take the minimum rank directly
- To represent NAs, first filter the original data frame by the specified rank, then create a dataset of the states with missing data, and finally combine both and arrange by state.

reldata2 <- data  %>%
  select(starts_with("Hospital"), -contains("Readmission"), State) %>%
  mutate_at(vars(contains("Mortality")), as.numeric)%>%
  group_by(`Hospital Name`,State)

rankall <- function(outcome, num = "best") {
  outcome <- regex(outcome, ignore_case = TRUE)
  `%notin%` <- Negate(`%in%`)

  if (!any(str_detect(names(reldata2), outcome))) {
    stop("invalid outcome")
  }

  # rank hospitals within each state: lowest rate first, name breaks ties
  ranking <- reldata2 %>%
    select(hospital = `Hospital Name`, state = State, Rate = contains(outcome)) %>%
    filter(!is.na(Rate)) %>%  # hospitals without data are not ranked
    arrange(Rate, hospital) %>%
    group_by(state) %>%
    arrange(state) %>%
    mutate(rank = row_number())

  if (num == "worst") {
    # the highest rank in each state is its worst hospital
    worst <- summarize(ranking, rank = max(rank))
    return(inner_join(ranking, worst, by = c("state", "rank")) %>%
             select(hospital, state))
  } else if (num == "best") {
    num <- min(ranking$rank)
  }

  ranking <- ranking %>%
    filter(rank == num) %>%
    select(hospital, state)

  # states with no hospital at this rank get a row with a real NA
  statenames <- unique(reldata2$State)
  subs <- statenames[statenames %notin% ranking$state]
  missing <- tibble(hospital = rep(NA_character_, length(subs)), state = subs)

  bind_rows(ranking, missing) %>%
    arrange(state)
}

Test Cases:

head(rankall("heart attack", 20), 10)
## # A tibble: 10 x 2
## # Groups:   state [10]
##    hospital                            state
##    <chr>                               <chr>
##  1 <NA>                                AK   
##  2 D W MCMILLAN MEMORIAL HOSPITAL      AL   
##  3 ARKANSAS METHODIST MEDICAL CENTER   AR   
##  4 JOHN C LINCOLN DEER VALLEY HOSPITAL AZ   
##  5 SHERMAN OAKS HOSPITAL               CA   
##  6 SKY RIDGE MEDICAL CENTER            CO   
##  7 MIDSTATE MEDICAL CENTER             CT   
##  8 <NA>                                DC   
##  9 <NA>                                DE   
## 10 SOUTH FLORIDA BAPTIST HOSPITAL      FL
tail(rankall("pneumonia", "worst"), 3)
## # A tibble: 3 x 2
## # Groups:   state [3]
##   hospital                                   state
##   <chr>                                      <chr>
## 1 MAYO CLINIC HEALTH SYSTEM - NORTHLAND, INC WI   
## 2 PLATEAU MEDICAL CENTER                     WV   
## 3 NORTH BIG HORN HOSPITAL DISTRICT           WY
tail(rankall("heart failure"), 10)
## # A tibble: 10 x 2
## # Groups:   state [10]
##    hospital                                                          state
##    <chr>                                                             <chr>
##  1 WELLMONT HAWKINS COUNTY MEMORIAL HOSPITAL                         TN   
##  2 FORT DUNCAN MEDICAL CENTER                                        TX   
##  3 VA SALT LAKE CITY HEALTHCARE - GEORGE E. WAHLEN VA MEDICAL CENTER UT   
##  4 SENTARA POTOMAC HOSPITAL                                          VA   
##  5 GOV JUAN F LUIS HOSPITAL & MEDICAL CTR                            VI   
##  6 SPRINGFIELD HOSPITAL                                              VT   
##  7 HARBORVIEW MEDICAL CENTER                                         WA   
##  8 AURORA ST LUKES MEDICAL CENTER                                    WI   
##  9 FAIRMONT GENERAL HOSPITAL                                         WV   
## 10 CHEYENNE VA MEDICAL CENTER                                        WY

Notes:

I did this assignment before, but decided to retry it from scratch to better familiarize myself with the tidyverse. This may not be the best approach, and the code may be clunky, but it’s an original solution and an attempt at oiling rusty gears.

R session info:

## R version 3.5.2 (2018-12-20)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.5
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] magrittr_1.5  stringr_1.4.0 tidyr_0.8.3   dplyr_0.8.0.1 readr_1.3.1  
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.0       knitr_1.21       hms_0.4.2        tidyselect_0.2.5
##  [5] R6_2.4.0         rlang_0.4.0      fansi_0.4.0      tools_3.5.2     
##  [9] xfun_0.5         utf8_1.1.4       cli_1.1.0        htmltools_0.3.6 
## [13] yaml_2.2.0       digest_0.6.18    assertthat_0.2.0 tibble_2.0.1    
## [17] crayon_1.3.4     purrr_0.3.1      glue_1.3.0       evaluate_0.13   
## [21] rmarkdown_1.11   stringi_1.3.1    compiler_3.5.2   pillar_1.3.1    
## [25] pkgconfig_2.0.2