library(tidyverse)
library(psych)
hospital_df_raw <- read.csv("https://raw.githubusercontent.com/LeJQC/MSDS/main/DATA%20606/Final%20Project/Hospital_General_Information.csv")
income_df_raw <- read.csv("https://raw.githubusercontent.com/LeJQC/MSDS/main/DATA%20606/Final%20Project/Income%20data.csv")
# Tidying the hospital data frame. The only columns I need are facility name, city, state, zipcode, hospital overall rating
hospital_df <- hospital_df_raw %>%
select("Facility.Name","City", "State", "ZIP.Code", "Hospital.overall.rating")
colnames(hospital_df) <- c("HospitalName", "City", "State", "ZipCode", "HospitalRating")
# Tidying the income data frame. Looking to make a data frame with 2 columns: zip code and median income
income_df <- income_df_raw %>%
select("NAME","S1901_C01_012E")
income_df <- income_df[-1,]
income_df$NAME <- as.numeric(substr(income_df$NAME, nchar(income_df$NAME) - 4, nchar(income_df$NAME)))
income_df$S1901_C01_012E <- as.numeric(income_df$S1901_C01_012E)
## Warning: NAs introduced by coercion
colnames(income_df) <- c("ZipCode", "Income")
# Merging both the hospital and income data frames together
merged_df <- merge(income_df, hospital_df, by ="ZipCode")
ratings_df <- merged_df %>%
select(c("HospitalName","City","State","ZipCode","Income","HospitalRating")) %>%
filter(merged_df$HospitalRating != "Not Available") %>%
group_by(HospitalRating) %>%
filter(Income >= 0)
You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.
Based on zip code, is there a correlation between hospital rating and income?
What are the cases, and how many are there?
Each case is a zip code that contains a hospital rating and income. There are 2956 cases.
Describe the method of data collection.
The hospital data was collected from the Centers for Medicare & Medicaid Services (CMS) website (https://data.cms.gov/provider-data/dataset/xubh-q36u), which was released on January 2023. Only data from Medicare-certified hospitals were included. The Overall Hospital Quality Star Rating (scale 1-5) is an average of a hospital’s score for the following measure group: “Mortality”, “Safety”, “Readmission”, “Patient Experience”, and Timely & Effective Care”.
The 2020 median income data was collected from the Internal Revenue Service (IRS) website (https://www.irs.gov/statistics/soi-tax-stats-individual-income-tax-statistics-2020-zip-code-data-soi). Their data is gathered from individual tax returns filed every year.
What type of study is this (observational/experiment)?
This is an observational study.
If you collected the data, state self-collected. If not, provide a citation/link.
The hospital data was collected from the CMS website (https://data.cms.gov/provider-data/dataset/xubh-q36u). It was extracted to a csv file and pushed to my Github.
The income data was collected from the IRS website (https://www.irs.gov/statistics/soi-tax-stats-individual-income-tax-statistics-2020-zip-code-data-soi). Again, it was extracted to a csv file and pushed to my Github.
What is the response variable? Is it quantitative or qualitative?
The response variable is the hospital’s rating, which is a categorical variable measure from 1 to 5. It is qualitative.
The independent variable is median income and it is quantitative.
Provide summary statistics for each the variables. Also include appropriate visualizations related to your research question (e.g. scatter plot, boxplots, etc). This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.
#Summary statistic of the independent variable
summary(ratings_df$Income)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11404 48686 58870 65631 75885 232400
describeBy(ratings_df$Income, group = ratings_df$HospitalRating, mat = TRUE)
## item group1 vars n mean sd median trimmed mad min
## X11 1 1 1 181 59186.27 27148.16 53449.0 55390.79 17244.12 16764
## X12 2 2 1 664 61236.21 24760.65 54574.0 57760.15 16151.44 20239
## X13 3 3 1 850 62739.36 23588.13 57207.5 59637.69 16410.16 19083
## X14 4 4 1 853 68829.42 25437.08 62054.0 65728.34 18627.39 15833
## X15 5 5 1 408 74978.42 30610.96 66046.5 71216.35 23825.38 11404
## max range skew kurtosis se
## X11 160890 144126 1.474958 2.287252 2017.9070
## X12 174419 154180 1.661035 3.605549 960.8995
## X13 194462 175379 1.558184 3.433349 809.0664
## X14 232400 216567 1.534699 3.928313 870.9490
## X15 216286 204882 1.399343 2.707254 1515.4685
ratings_df %>%
ggplot(aes(x=Income)) +
geom_histogram(bins = 100)
ratings_df %>%
ggplot(aes(x = Income, y= HospitalRating))+
geom_boxplot()
#Hospital Rating and mean income
ratings_df$Income <- as.numeric(ratings_df$Income)
ratings_df %>%
summarise(mean_income = mean(Income))
## # A tibble: 5 × 2
## HospitalRating mean_income
## <chr> <dbl>
## 1 1 59186.
## 2 2 61236.
## 3 3 62739.
## 4 4 68829.
## 5 5 74978.