Please submit this .Rmd file and compiled (Knit) output (as .html or .pdf)

Scenario: Welcome to Care4All PCPs, the largest Primary Care Network in the state of California! We are glad to have someone with your analytical prowess on board. We are interested in ensuring that our patients get the highest quality care at the fairest price. To that end, we hired a consultant to perform an analysis of hospitals in California to help us understand 1) Which hospitals are the highest quality? 2) Which hospitals charge the most/least?

Based on our request, the consultant provided data and code for each of those questions. While this was helpful, we want to rewrite the code in a different language and explain it in detail (no comments or explanations were provided). Then, we would like to extend this work to learn about the relationship between health quality and cost for our patients. Therefore, we have laid out 3 tasks.

Your Tasks:

Task 1: Describe hospital quality ratings in California Using code written in R, 1a) Explain the code then 1b) Translate that code into Python, improving it as necessary to best answer the question

Task 2: Describe procedure prices at hospitals in California Using code written in Python, 2a) Explain the code, then 2b) Translate that code into R, improving it as necessary to best answer the question

Task 3: Combine Data and Create Visualization Use the data from the first two tasks to determine the relationship between price and quality.

Hints and Advice

Task 1: Hospital quality ratings in the state of California

For this task, you are given a .csv from data.medicare.gov/data/hospital-compare to help answer the question. This dataset contains identifying information for the hospital (Provider ID, Hospital Name, Phone Number, etc.) as well as Medicare-determined quality metrics (Overall rating, national comparison, safety of care, etc.).

1a) Explain the code

Explain in as much detail as possible what the code is doing and how it arrives at the final output table. How does this address the task of describing quality ratings of hospitals in California?

Add comments in the code and a several sentence summary to complete this task.

# Get the unique hospital names
hosp_names = hosp_info %>% #creating new dataset 'hosp_names' from 'hosp_info' dataset
  filter(`Hospital Type` == "Acute Care Hospitals") %>% #keeping only rows that contain "Acute Care Hospitals" as the 'Hospital Type'  
  filter(State == "CA") %>% #keeping only rows that contain "CA" as 'State'
  pull(`Hospital Name`) #pulling the list of hospital names that meet both criteria that was filtered for

# Filter based on those hospital names
hosp_info_CA = hosp_info %>% #creating new dataset 'hosp_info_CA' from 'hosp_info' dataset
  rename(Hospital = `Hospital Name`, #renaming column 'Hospital Name' to 'Hospital'
         Provider_ID = `Provider ID`, #renaming column 'Provider ID' to 'Provider_ID'
         Safety = `Safety of care national comparison`, #renaming column 'Safety of care national comparison' to 'Safety'
         Effectiveness = `Effectiveness of care national comparison` #renaming column 'Effectiveness of care national comparison' to 'Effectiveness'
         ) %>%
  filter(Hospital %in% hosp_names, State == "CA") %>% #keeping only hospitals that are present in 'hosp_names' dataset and that also contain "CA" as the 'State'
  mutate(Overall_Rating = as.numeric(`Hospital overall rating`)) %>% #converting 'Hospital overall rating' column to numeric data type and saving the values in new column 'Overall_Rating' 
  drop_na(Overall_Rating) #removing rows in 'Overall Rating' where data is missing or "NA" 
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Overall_Rating = as.numeric(`Hospital overall rating`)`.
## Caused by warning:
## ! NAs introduced by coercion
hosp_info_CA %>% 
  arrange(desc(Overall_Rating), Hospital) %>% #sorting hospitals by descending order of 'Overall Rating', then by ascending order of values in 'Hospital'
  head(7) #displaying top 7 rows of this dataset
## # A tibble: 7 × 30
##   Provider_ID Hospital              Address City  State `ZIP Code` `County Name`
##         <dbl> <chr>                 <chr>   <chr> <chr>      <dbl> <chr>        
## 1       50145 COMMUNITY HOSPITAL O… 23625 … MONT… CA         93940 MONTEREY     
## 2       50357 GOLETA VALLEY COTTAG… 351 S … SANT… CA         93111 SANTA BARBARA
## 3       50238 METHODIST HOSPITAL O… 300 W … ARCA… CA         91006 LOS ANGELES  
## 4       50396 SANTA BARBARA COTTAG… 400 WE… SANT… CA         93102 SANTA BARBARA
## 5       50424 SCRIPPS GREEN HOSPIT… 10666 … LA J… CA         92037 SAN DIEGO    
## 6       50324 SCRIPPS MEMORIAL HOS… 9888 G… LA J… CA         92037 SAN DIEGO    
## 7       50281 ALHAMBRA HOSPITAL ME… 100 S … ALHA… CA         91801 LOS ANGELES  
## # ℹ 23 more variables: `Phone Number` <dbl>, `Hospital Type` <chr>,
## #   `Hospital Ownership` <chr>, `Emergency Services` <lgl>,
## #   `Meets criteria for meaningful use of EHRs` <lgl>,
## #   `Hospital overall rating` <chr>, `Hospital overall rating footnote` <chr>,
## #   `Mortality national comparison` <chr>,
## #   `Mortality national comparison footnote` <chr>, Safety <chr>,
## #   `Safety of care national comparison footnote` <chr>, …
hosp_info_CA %>% 
  group_by(Overall_Rating, Safety) %>% #grouping data by 'Overall_Rating' and 'Safety' columns 
  count() #getting the count for each occurance
## # A tibble: 17 × 3
## # Groups:   Overall_Rating, Safety [17]
##    Overall_Rating Safety                           n
##             <dbl> <chr>                        <int>
##  1              1 Above the national average       1
##  2              1 Below the national average       7
##  3              1 Same as the national average     1
##  4              2 Above the national average      10
##  5              2 Below the national average      26
##  6              2 Not Available                    6
##  7              2 Same as the national average    46
##  8              3 Above the national average      36
##  9              3 Below the national average      16
## 10              3 Not Available                    4
## 11              3 Same as the national average    69
## 12              4 Above the national average      24
## 13              4 Below the national average       4
## 14              4 Not Available                    3
## 15              4 Same as the national average    17
## 16              5 Above the national average       5
## 17              5 Same as the national average     1
write_csv(hosp_info_CA, 'hosp_info_CA.csv') #writing final dataset 'hosp_info_CA' to a csv file and exporting to working directory

#this code pulls in the raw dataset, filtering for Acute Care hospitals in California. the new dataset is then arranged first in descending order by the overall rating and then ascending order of hospital, with the top 7 values displayed. the final output groups the hospitals by both the overall rating and the national comparison of safety metric and gives that count of hospitals in each group. this final output does not answer the question of "Which hospitals are the highest quality?" since it only gives the count of each rating/safety group and not the hospitals assigned to each group. additionally, since the safety metric is already considered in the overall rating, it does not seem useful to have this step. however, the step above that displayed the top 7 values does provide a helpful dataset to answer this question. the medicare website outlines the process that goes into the overall rating and how the metrics on the raw data are weighted to provide a rating from 0-5 for the hospital. therefore, the overall rating metric appears to be the most useful metric to find the highest quality hospitals. this code has some redundancies: I would suggest eliminating the dataset "hosp_names" since the hospitals are filtered for CA in the next step and filtering for "Acute Care Hospitals" can easily be added to that step as well. additionally, there does not seem to be a reason for creating a new column "Overall_Rating" since the values are the same as in "Hospital overall rating". The column could have been renamed with the other columns and converted to numeric without creating a new column (ex: mutate(`Hospital overall rating` = as.numeric(`Hospital overall rating`))). lastly, it might be helpful to filter the hospitals based on a specific rating (ex: with ratings >= 4) rather than just the top "n" number for this task.

1b) (Translation to Python, see .ipynb)

hosp_info_CA = hosp_info[ (hosp_info[‘Hospital Type’] == “Acute Care Hospitals”) & (hosp_info[‘State’] == “CA”) #filtering for ‘Acute Care Hospitals’ in ‘CA’ and creating new dataset ].rename(columns={ ‘Hospital Name’: ‘Hospital’, #renaming column ‘Hospital Name’ to ‘Hospital’ ‘Provider ID’: ‘Provider_ID’, #renaming column ‘Provider ID’ to ‘Provider_ID’ ‘Safety of care national comparison’: ‘Safety’, #renaming column ‘Safety of care national comparison’ to ‘Safety’ ‘Effectiveness of care national comparison’: ‘Effectiveness’, #renaming column ‘Effectiveness of care national comparison’ to ‘Effectiveness’ ‘Hospital overall rating’: ‘Overall_Rating’ #renaming column ‘Hospital overall rating’ to ‘Overall_Rating’ })

hosp_info_CA = ( hosp_info_CA.assign(Overall_Rating=lambda df: pd.to_numeric(df[‘Overall_Rating’], errors=‘coerce’)) #converting ‘Overall_Rating’ column to numeric datatype; any non-numeric text replaced with NaN .sort_values(by=[‘Overall_Rating’, ‘Provider_ID’], ascending=[False, True]) #sorting dataset first by descending order of ‘Overall Rating’, then by ascending order of hospital names )

hosp_info_CA_filtered = hosp_info_CA.loc[:, [‘Provider_ID’, ‘Hospital’, ‘City’, ‘State’, ‘Overall_Rating’]] #selecting only the columns needed for the final report

hosp_info_CA_top = ( hosp_info_CA_filtered.query(“Overall_Rating >= 5”) #filtering dataset for hospitals with a rating of 5 or higher; #the inequality allows for the code be adjusted easily to expand the rating range )

hosp_info_CA_top.to_csv(‘hosp_info_CA.csv’, index=False) #writing final dataset ‘hosp_info_CA’ to a csv file and exporting to working directory

##the output of this code gives the list of hospitals with a rating of 5, which is the highest rating that can be given from this rating system. since the task was to find the highest quality hospitals, I kept the rating to only those that were rated 5. after researching the calculations that the medicare website outlines to calculate overall rating metric (https://data.cms.gov/provider-data/topics/hospitals/overall-hospital-quality-star-rating), I felt like this metric was the most useful metric to find the highest quality hospitals and that no additional calculations from the dataset were needed. it does seem interesting to note that there are x2 Scripps and x2 Cottage Hospitals in the top 6 values here, it would be interesting to see if these hospital groups tend to have higher quality rankings overall or if it is just these facilities.

Task 2: Hospital Costs in the state of California

Motivating Question : Which hospitals charge the most/least?

For this task, you are given a .csv from https://data.cms.gov/Medicare-Inpatient/Inpatient-Prospective-Payment-System-IPPS-Provider/97k6-zzx3 to help investigate hospital costs in California. The dataset contains identifying information for the hospital (Provider ID, Hospital Name, Address, Zip Code), the diagnosis-related group (DRG), and associated costs (Average Total Payments, Average Medicare Payments)

Average Total Payments: The average of Medicare payments to the provider for the DRG including the DRG amount, teaching, disproportionate share, capital, and outlier payments for all cases. Also included are co-payment and deductible amounts that the patient is responsible for.

2a) (Code Explanation, see.ipynb)

costs = pd.read_csv(“Inpatient_Prospective_Payment_System__IPPS__Provider_Summary_for_the_Top_100_Diagnosis-Related_Groups__DRG__-_FY2011.csv”)

costs = costs.rename(columns = {‘DRG Definition’: ‘DRG’, ’ Average Total Payments ‘: ’Total_Cost’, # Note spaces around ’ Average Total Payments ’ ’ Total Discharges ‘: ’Count_Discharges’}) # Note spaces around ’ Total Discharges ’

costs[[‘DRG_Code’, ‘DRG_Description’]] = costs[‘DRG’].str.split(’ - ‘, expand = True) # Note spaces around’ - ’ costs = costs.drop([‘DRG’], axis = 1)

costs[‘Avg_DRG_Cost’] = costs.groupby(‘DRG_Code’)[‘Total_Cost’].transform(np.mean) costs[‘Cost_Diff’] = costs[‘Total_Cost’] - costs[‘Avg_DRG_Cost’]

top_drgs = (costs .groupby(‘DRG_Code’)[‘Count_Discharges’] .sum() .sort_values(ascending = False) )

top_drgs.index[0]

(costs .query(“DRG_Code == ‘470’ and ‘Provider State’ == ‘CA’”) .to_csv(“Hip_Replacement_Costs_by_Hosp.csv”, index = False) )

#this code pulls in the raw dataset (“costs”) and groups the data by DRG code (patient classification framework based on procedures required during a hospital visit) and the average cost for each code. the difference between the total cost by provider and the average cost by DRG code (across all providers) is calculated. a new dataset (“top_drgs”) is created which groups the DRG codes and number of discharges by code. the discharges are then summed and sorted in ascending order; the DRG code associated with the largest sum of the discharges is pulled. finally the “costs” dataset is filtered for the code pulled in the last step (largest sum of discharges) and for CA hospitals.

#this code does not clearly answer the question that was asked “Which hospitals in California cost the most/least?”. the method used to find the “Cost_Diff” in the first code chunk calculates the difference between the average total payment to a hospital per DRG code and the average cost overall hospitals for the same DRG code. since the goal is too look at which hospitals charge the most or the least, I do not think that looking at where the hospital stands compared to the average among hospitals is the best approach. additionally, the last two code chunks give a final dataset is filtered only for the first DRG value provided from the “top_drgs” column: DRG code = 470 (MAJOR JOINT REPLACEMENT OR REATTACHMENT OF LOWER EXTREMITY W/O MCC). I think that an approach that would make this data more clear would be to look at the difference between the total cost and the average Medicare payments by hospital. This would result in the average amount owed by patients at a specific facility and by sorting the list descending or ascending, would give a clearer answer on which hospitals charge the most or the least.

2b) Translate the Python Code to R

Translate the provided code from Python to R, improving it if necessary to best address the question: Which hospitals cost the most/least?.

Provide your insights from the output of the code.

#Loading the necessary libraries
library(dplyr)
library(tidyr)

costs <- read.csv("INPATI~1.CSV") #loading in CSV file with needed data; file name did not import as written but showed up as this value when the file path was copied

colnames(costs) <- gsub("\\.", "_", colnames(costs)) #replacing all '.' in the header with '_' to standardize column names

#renaming column names for clarity
costs <- costs %>%
  rename(
    DRG = DRG_Definition, #renaming column `DRG_Definition` to `DRG`
    Total_Cost = Average_Total_Payments, #renaming column `Average_Total_Payments` to `Total_Cost`
    Count_Discharges = Total_Discharges, #renaming column `Total_Discharges` to `Count_Discharges`
    Medicare_Coverage = Average_Medicare_Payments, #renaming `Average_Medicare_Payments` to `Medicare_Coverage`
    Hospital = Provider_Name, #renaming column `Provider_Name` to `Hospital`
    State = Provider_State, #renaming column `Provider_State` to `State`
    Provider_ID = Provider_Id, #renaming column `Provider_Id` to `Provider_ID`
  ) %>%
  separate(DRG, into = c("DRG_Code", "DRG_Description"), sep = " - ") %>% #splitting column 'DRG' into 'DRG_Code', 'DRG_Description' at ' - '
  group_by(Provider_ID, Hospital, State) %>% #grouping the data by provider id and state
  summarise(
    Average_Total_Cost = mean(Total_Cost, na.rm = TRUE), #calculating the average Total Cost over all DRG codes by provider
    Average_Medicare_Payments = mean(Medicare_Coverage, na.rm = TRUE) #calculating the average Medicare Payment over all DRG codes by provider
  ) %>%
  mutate(Average_Patient_Cost = Average_Total_Cost - Average_Medicare_Payments) %>% #calculating average patient cost by provider by taking the difference between the average total payment to the hospital and the average total Medicare payment to the hospital
  ungroup() #removing provider grouping if further calculations are needed on the data
## `summarise()` has grouped output by 'Provider_ID', 'Hospital'. You can override
## using the `.groups` argument.
costs_CA <- costs %>%
  filter(State == "CA") #filtering dataset for only California hospitals


costs_filtered <- costs_CA %>%
  select(Provider_ID, Hospital, Average_Patient_Cost) %>% #selecting only columns needed to simplify dataset; this dataset can be used if a list of all the hospitals are needed
arrange(desc(Average_Patient_Cost))

print(costs_filtered)
## # A tibble: 299 × 3
##    Provider_ID Hospital                                     Average_Patient_Cost
##          <int> <chr>                                                       <dbl>
##  1       50708 FRESNO SURGICAL HOSPITAL                                    2970.
##  2       50127 WOODLAND MEMORIAL HOSPITAL                                  2619.
##  3       50017 MERCY GENERAL HOSPITAL                                      2454.
##  4       50492 CLOVIS COMMUNITY MEDICAL CENTER                             2181.
##  5       50334 SALINAS VALLEY MEMORIAL HOSPITAL                            2091.
##  6       50516 MERCY SAN JUAN MEDICAL CENTER                               2004.
##  7       50414 MERCY HOSPITAL OF FOLSOM                                    1992.
##  8       50454 UCSF MEDICAL CENTER                                         1883.
##  9       50180 JOHN MUIR MEDICAL CENTER - WALNUT CREEK CAM…                1805.
## 10       50022 RIVERSIDE COMMUNITY HOSPITAL                                1755.
## # ℹ 289 more rows
write.csv(costs_filtered, "CA_hospitals_costs.csv")
highest_10_hospital <- costs_filtered %>%
  arrange(desc(Average_Patient_Cost)) %>% #arranging dataset in descending order by average patient cost to get the highest values at the top
  slice_head(n = 10) #selecting top 10 hospitals in descending order by average cost to patients; the number of hospitals listed can be adjusted as needed for client

print(highest_10_hospital)

lowest_10_hospital <- costs_filtered %>%
  arrange(Average_Patient_Cost) %>% #arranging dataset in ascending order by average patient cost to get lowest values at the top
  slice_head(n = 10) #selecting bottom 10 hospitals in ascending order by average cost to patients; the number of hospitals listed can be adjusted as needed for client

print(lowest_10_hospital)


#the output of my modified code provides a full list of all CA hospitals in the dataset and the associated average patient cost; along with the 10 hospitals with the highest average patient cost and the 10 hospitals with the lowest average patient cost. this both answers the question of which hospitals cost the most/least, but also allows for the code to be easily modified to pull "n" number of values for both the highest and lowest hospitals. 
#from the data we see that the average patient cost across all CA hospitals is $1070.65, with a standard deviation of $323.78. however, the cost difference between the top and bottom hospitals is large ($2,530.82) with a 150% difference between the values (top: $2970.22; bottom: $439.40). while this task provides a big-picture look at the average cost associated with each hospital, I think that it would be helpful to take a deeper look into why we observe this gap and the factors that contribute to it. some examples of the additional ways that the data can be filtered to provide more useful insights are: looking at the average income level in the same zip code as the hospital(s), grouping hospitals by hospital group, type, ownership, DRG code, etc. 

Task 3: What is the relationship between cost and quality?

Is it the case that “you get what you pay for”? Now that we have completed some preliminary analyses of the cost and quality of Hospitals in the state of California, we would like to take a look at their relationship jointly. That is, we would like to see how cost relates to quality by combining the output from the first two questions.

With the language of your choosing (either R or Python),

3a) Join/Merge together Cost and Quality tables

Join together the resulting tables from tasks 1 and 2. What type of join did you perform and why? How many hospitals were removed (if any) due to the type of join?

3b) Create a Visualization

Using the insights you gained from the sections above, create a visualization to address the question. Provide a detailed explanation of the insights gained from your visualization.

3c) Extend the insights from above

With the code and data you used in the previous tasks as a base, provide additional insights that augment those from the previous task (3b).

For example, you could consider: - Visualizing more variables from the datasets in tasks 1 and 2 - Leveraging the whole dataset to understand where the California providers stand nationally - Gather outside data (e.g. Census) and join with the data in this task (e.g. using Zip Code) - Create an interactive plot (plotly or ggplotly) to help explore an expanded dataset

Be sure to structure this response as:

1) Question or hypothesis
2) Code/Data Formatting and Plotting
3) Description of new insights

**See.ipynb for task 3