R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

Milestone 2

We left our milestone 2 code her for context, but you can skip to the Milestone 3 section if needed.

Question 1

Description of dataset - What is the data source? (1-2 sentences on where the data is coming from, dates included, etc.) - How does the dataset relate to the group problem statement and question?

The data source is coming from is coming from 3 locations. The first data source is from the California county census demographic data, which was collected from census surveys in 2012. The second data source is on mortality surveillance in California which is from the California Open Data Portal and reports counts of death for California counties from information entered on death certificates from 2014-2020. The last data source is the account of HCAI funding from California Open Data Portal demonstrating the total dollar value and number of healthcare projects either in review, pending construction start, under construction, or in closure aggregated in California counties once every two weeks since September 2013.

The data set relates to the group problem statement and question because it includes information on demographics of each county, mortality data, and healthcare projects that are in closure in each county. We are tasked to explore and evaluate which 5 counties are best targets for funding a public-private patnership for healthcare facility improvement in rural areas of California that have received minimal funding from the Department of Health Care Access and Information (HCAI) over the last 5 years. From there, we can look at mortality surveillance data to see which counties are disproportionately impacted by chronic disease to further guide our funding choice.

Question 2

Import statement NOTE: Please use datasets available in the PHW251 Project Data github repoLinks to an external site. (this is important to make sure everyone is using the same datasets) - Use appropriate import function and package based on the type of file - Utilize function arguments to control relevant components (i.e. change column types, column names, missing values, etc.) - Document the import process

library(tidyverse)
library(kableExtra)
library(DT)
library(formattable)

# STEP 1: LOAD IN DATA 
# --------------------
# We need to load in our three datasets. To do this, we need a file path for each .csv file. 

# Create a string of the file path for the ca_county_demographic.csv file
# The paste function just joins strings together, so this should join your working directory with the rest of the file path.
ca_demo_file_path <- paste(getwd(), "/data/ca_county_demographic.csv", sep="")
ca_demo_file_path

## [1] "/home/rstudio/PHW251_Fall2022/rcourse_group_project/data/ca_county_demographic.csv"

# Create a string of the file path for the ca_county_mortality.csv file
ca_mortality_file_path <- paste(getwd(), "/data/ca_county_mortality.csv", sep="")

# Create a string of the file path for the hcai_healthcare_construction.csv file
hcai_hc_file_path <- paste(getwd(), "/data/hcai_healthcare_construction.csv", sep="")

# Now that we have the file paths for all three .csv files, we can load in the datasets:

# CA DEMOGRAPHIC DATA
# -------------------
ca_demo_data <- read_csv(
  ca_demo_file_path,
  skip_empty_rows = TRUE)

# CA MORTALITY DATA
# -----------------
# For consistency, let's make all the column names use lowercase snake case like
# in our first dataset. 
col_names_mortality <- c("year", "county", "geography_type", "strata", "strata_name", 
                "cause", "cause_desc", "count", "annotation_code", 
                "annotation_desc")
ca_mortality_data <- read_csv(
  ca_mortality_file_path, 
  col_names = col_names_mortality,
  skip = 1, # Skip first row since it just contains column names
  skip_empty_rows = TRUE)

# HCAI DATA
# ---------
# For consistency, let's make all the column names use lowercase snake case like
# in our first dataset. 
# We now need to skip the first row since we're re-writing the column names and 
# that row contains the original column names. 
col_names_hcai <- c("county", "data_generation_date", "oshpd_project_status", 
                "total_costs_oshpd", "num_oshpd_projs", "county_collection")
hcai_hc_data <- read_csv(
  hcai_hc_file_path,
  col_names = col_names_hcai,
  skip = 1, # Skip first row since it just contains column names
  skip_empty_rows = TRUE)

Question 3

Identify data types for 5+ data elements/columns/variables - Identify 5+ data elements required for your specified scenario. If <5 elements are required to complete the analysis, please choose additional variables of interest in the data set to explore in this milestone. - Utilize functions or resources in RStudio to determine the types of each data element (i.e. character, numeric, factor) - Identify the desired type/format for each variable—will you need to convert any columns to numeric or another type?

# 5 data elements in specified scenario: `pop12_sqmi`, `med_age`, `renter_occ`,
# `owner_occ`, `county`, `OSHPD Project Status`, `Total Costs of OSHPD Projects`

# |pop12_sqmi|
typeof(ca_demo_data$pop12_sqmi)

## [1] "double"

# It's a double; this is fine

# |med_age|
typeof(ca_demo_data$med_age)

## [1] "double"

# It's a double;  this is fine

# |renter_occ|
typeof(ca_demo_data$renter_occ)

## [1] "double"

# It's a double;  this is fine

# |owner_occ|
typeof(ca_demo_data$owner_occ)

## [1] "double"

# It's a double;  this is fine

# |count|
typeof(ca_mortality_data$county)

## [1] "character"

# It's a character variable;  this is fine

# |OSHPD Project Status|
typeof(hcai_hc_data$oshpd_project_status)

## [1] "character"

# It's a character variable; this in fine

# |Total Costs of OSHPD Projects|
typeof(hcai_hc_data$total_costs_oshpd)

## [1] "character"

# This is a character variable but we'll probably need it in numeric format. 
# Let's create a numeric version:
hcai_hc_data_cl <- hcai_hc_data %>%
  mutate(
    total_costs_oshpd_cl = str_remove(total_costs_oshpd, "\\$"), # Get rid of #
    total_costs_oshpd_cl_a = str_remove_all(total_costs_oshpd_cl, "\\,"), # Get rid of commas
    total_costs_oshpd_num = as.numeric(total_costs_oshpd_cl_a), # Convert to num
    total_costs_oshpd_cl = NULL, # Drop this column
    total_costs_oshpd_cl_a = NULL # Drop this column 
  )
# Now we have a numeric version of |total_costs_oshpd|

Question 4

Provide a basic description of the 5+ data elements - Numeric: mean, median, range - Character: unique values/categories - Or any other descriptives that will be useful to the analysis

# For numeric variables:
summary(ca_demo_data$owner_occ)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     357   13089   39306  121300  120804 1544749

summary(ca_demo_data$med_age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   29.60   33.70   37.05   38.49   43.08   51.00

summary(ca_demo_data$renter_occ)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     140    6080   25140   95554   84189 1696455

summary(ca_demo_data$owner_occ)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     357   13089   39306  121300  120804 1544749

summary(hcai_hc_data_cl$total_costs_oshpd_num)

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.000e+00 9.807e+04 2.824e+06 5.914e+07 2.845e+07 2.340e+09

# For character, unique/distinct function:
county_distinct <- distinct(ca_mortality_data, county)
project_status_distinct <- distinct(hcai_hc_data, oshpd_project_status)

Milestone 3

Subset rows or columns, as needed

We filtered ca_mortality_data to only look at strata marked as “Total Population.” We discarded columns on all three datsets that we won’t be using.

# Look at ca_mortality_data “strata” column: only keep rows where strata = 
# Total Population
ca_mortality_data <- ca_mortality_data %>%
  filter(strata == "Total Population")

# Drop columns that we don't need in ca_demo_data
ca_demo_data <- ca_demo_data %>%
  select(name, pop2012, pop12_sqmi, med_age, owner_occ, renter_occ)

# Drop columns that we don't need in ca_mortality_data
ca_mortality_data <- ca_mortality_data %>%
  select(year, county, cause, cause_desc, count)

# Drop columns that we don't need hcai_hc_data
hcai_hc_data_cl <- hcai_hc_data_cl %>%
  select(county, data_generation_date, oshpd_project_status, total_costs_oshpd)

Create new variables needed for analysis

We’ll need a variable for the proportion of renters to homeowners in each county, so we created a new variable called proportion_rent which is equal to renter_occ / owner_occ There are no NA values for renter_occ or owner_occ so we don’t need to worry about that.

ca_demo_data <- ca_demo_data %>%
  mutate(
    proportion_rent = renter_occ / owner_occ
  ) %>%
  select(name, pop2012, pop12_sqmi, med_age, proportion_rent)

We’ll also need to know whether cause of death was due to chronic illness or not, so we created an indicator variable isChronic which equals 1 if the cause of death was chronic disease and 0 otherwise.

# Run this line of code to see a list of all causes of death (descriptions)
unique(pull(ca_mortality_data, var = cause_desc))

##  [1] "All causes (total)"                                   
##  [2] "Alzheimer's disease"                                  
##  [3] "Malignant neoplasms"                                  
##  [4] "Chronic lower respiratory diseases"                   
##  [5] "Diabetes mellitus"                                    
##  [6] "Assault (homicide)"                                   
##  [7] "Diseases of heart"                                    
##  [8] "Essential hypertension and hypertensive renal disease"
##  [9] "Accidents (unintentional injuries)"                   
## [10] "Chronic liver disease and cirrhosis"                  
## [11] "Nephritis, nephrotic syndrome and nephrosis"          
## [12] "Parkinson's disease"                                  
## [13] "Influenza and pneumonia"                              
## [14] "Cerebrovascular diseases"                             
## [15] "Intentional self-harm (suicide)"

chronic_disease_codes <- c("ALZ", "CLD", "DIA", "HTD", "HYP", "LIV", "NEP", "PAR")  

ca_mortality_data <- ca_mortality_data %>%
  mutate(
    isChronic = if_else(cause %in% chronic_disease_codes, 
      1,
      0
    )
  )

Clean variables needed for analysis

We wondered if there were any values that needed to be coded as missings, so we wrote the following code to figure that out. For each dataset, we loop over the columns, and for each column we print out all the unique values of that column. We then scan through those unique values to look for any values that should be coded as missings (e.g. a ” ” or a “*“). We use eval = false here because it prints out a LOT of information, and ultimately we dind’t find any values that needed to be coded as missings, but you can see the code we used to figure this out.

# CA DEMOGRAPHIC DATA
# -------------------
# Let's see what the unique values are for each column so we can see if there 
# are any values that should be coded as missings. 

# Start by creating a vector of column names that we can loop over
col_names_demo_data <- colnames(ca_demo_data) 

# Loop over the column names. For each column, print out all the unique values
# in that column and scan through the results to see if there's anything that
# should be coded as missing. 
for (i in 1:length(col_names_demo_data)) {
  print(unique(pull(ca_demo_data, var = col_names_demo_data[i])))
}
# They all look good; nothing needs to be coded as missing. 

# Check types of each column
str(ca_demo_data)
# Look good

# CA MORTALITY DATA
# -----------------
# Loop over the column names. For each column, print out all the unique values
# in that column and scan through the results to see if there's anything that
# should be coded as missing. 
col_names_mortality <- c("year", "county", "cause", "cause_desc", "count")

for (i in 1:length(col_names_mortality)) {
  print(unique(pull(ca_mortality_data, var = col_names_mortality[i])))
}
# They all look good; nothing needs to be coded as missing. 

# Check types of each column
str(ca_mortality_data)
# Looks good

# HCAI DATA
# ---------
# Loop over the column names. For each column, print out all the unique values
# in that column and scan through the results to see if there's anything that
# should be coded as missing. 
for (i in 1:length(col_names_hcai)) {
  print(unique(pull(hcai_hc_data, var = col_names_hcai[i])))
}
# They all look good; nothing needs to be coded as missing. 
# However, let's recode the counties to get rid of the numbers at the beginning.

# Check types of each column
str(hcai_hc_data)

In our hcai_hc_data dataset, the counties all had numbers and dashes in front of them. We cleaned the county variable to strip these leading numbers so they look like the counties in other datasets. For example, “01 - Alameda” should just be “Alameda”

hcai_hc_data_cl <- hcai_hc_data_cl %>%
  mutate(
    county_cl = substr(county, 6, nchar(county)),
    county = county_cl,
    county_cl = NULL
  )

In our ca_mortality_data dataset, we needed to clean the count column; there were many “NAs” which we recoded to 0s.

ca_mortality_data <- ca_mortality_data %>%
  mutate(
    count_cl = if_else(
      is.na(count), 
      0,
      count
    ), 
    count = count_cl, 
    count_cl = NULL
  )

Data dictionary

CA mortality data:

count

Type: Double
Desc: Total count of deaths due to chronic illnesses by county. If 0, data missing.

isChronic
Type: Double
Desc: Serves as an identifier. Provides output 1 if death due to chronic illness and 0 if not.

HCAI data:

county_cl
Type: Character
Desc: County name

CA demographic data

proportion_rent
Type: Double
Desc: Proportion of renters to owners.

Merged dataset (CA demographic data & CA mortality data)

estimated_mortality_rate
Type: Double
Desc: Estimated rate of death from chronic illnesses/1000 people (utilizing average of chronic deaths by county from 2014-2020 divided by total population from 2012 census)

Tables with descriptive stats

Step 1: Prep data for tables by merging datasets

We started by joining our ca_demo_data and ca_mortality_data datasets. We did this by first filtering ca_demo_data to only contain rows for chronic diseases. We then grouped the rows by county and year and summed up the total number of deaths per year due to chronic disease in each county-year pair. We pivoted the table from long to wide so that there was one row for each county and a column for each year containing the total # of chronic disease deaths in that county in each year. Then, we joined this table with ca_demo_data which also has one row per county. Once we joined them, we created a new row that estimated the average mortality rate from 2014-2020. This is an estimate because we used the 2012 population to calculate the mortality rate for each year.

# We are only interested in chronic disease for ca_mortality_data
ca_mortality_data_summarized <- ca_mortality_data %>%
  filter(isChronic == 1) %>%
  group_by(county, year) %>%
  summarize(total_chronic_deaths_per_year = sum(count))

# Pivot the table from long to wide
ca_mortality_data_wide <- ca_mortality_data_summarized %>% pivot_wider(
        names_from = "year",  
        names_prefix = "cd_deaths_",
        values_from = total_chronic_deaths_per_year, 
) 

# Merge our demographic and mortality datasets
merged_dataset <- left_join(ca_mortality_data_wide, ca_demo_data, 
                            by = c("county" = "name"))

# Create a new column that averages over # of chronic disease deaths each year
merged_dataset_a <- merged_dataset
merged_dataset_a$avg_chronic_deaths <- rowMeans(merged_dataset[ , 
                                          c(2, 3, 4, 5, 6, 7, 8)], na.rm=TRUE) 

# Create a new column that is the "estimated" mortality rate from 2014-2020, 
# using the 2012 population as the denominator in the rate calculation.
merged_dataset_a <- merged_dataset_a %>%
  mutate(
    estimated_mortality_rate = avg_chronic_deaths / pop2012
  )

Step 2: Create Table 1

Table 1: we’ll display some important data for our project at the county-level, like the population in 2012 per square mile, the median age, the proportion of renters to homeowners, and the estimated mortality rate.

# Now our dataset is ready to be turned into a table!
# Make a new dataset where we will format things differently for table
# e.g. we will round the rates to 2 decimal points and get rid of columns that
# we won't display
merged_dataset_tbl<- merged_dataset_a %>%
  mutate(
    estimated_mortality_rate = round(estimated_mortality_rate * 1000, 2),
    proportion_rent = round(proportion_rent, 2),
    pop12_sqmi = round(pop12_sqmi, 2)
  ) %>%
  select(county, pop12_sqmi, med_age, proportion_rent, estimated_mortality_rate)
  
# Table 1:
# Display relevant information for each county
# -----------------------------------------------
kable(merged_dataset_tbl, booktabs = T, longtable = T, align = 'lcccc', col.names = c("County", 
        "Pop'l/sq mi", "Median age", "Renters:Owners", 
        "~Mortality rate/1000"), caption = "Descriptive Stats Per County")

Descriptive Stats Per County
County	Pop’l/sq mi	Median age	Renters:Owners	~Mortality rate/1000
Alameda	2062.40	36.6	0.87	5.25
Alpine	1.54	46.4	0.39	0.00
Amador	63.29	48.2	0.34	8.56
Butte	132.55	37.1	0.72	9.36
Calaveras	44.58	49.1	0.30	7.18
Colusa	18.83	33.5	0.63	3.67
Contra Costa	1405.33	38.4	0.49	5.86
Del Norte	28.30	39.0	0.62	6.27
El Dorado	102.16	43.5	0.37	6.62
Fresno	157.17	30.7	0.82	6.69
Glenn	21.49	35.3	0.61	4.64
Humboldt	38.06	37.3	0.82	7.24
Imperial	39.74	32.0	0.79	4.80
Inyo	1.82	45.5	0.57	6.02
Kern	104.28	30.7	0.67	6.69
Kings	111.43	31.1	0.85	4.47
Lake	49.08	45.0	0.52	8.71
Lassen	7.42	37.0	0.53	3.31
Los Angeles	2423.26	34.8	1.10	6.36
Madera	71.07	33.1	0.56	5.56
Marin	486.10	44.5	0.60	6.29
Mariposa	12.61	49.2	0.47	5.78
Mendocino	25.08	41.6	0.70	6.83
Merced	129.90	29.6	0.84	5.19
Modoc	2.33	46.0	0.46	5.85
Mono	4.60	37.2	0.79	1.36
Monterey	126.86	33.0	0.97	4.68
Napa	172.31	39.7	0.60	7.45
Nevada	102.56	47.5	0.39	8.22
Orange	3822.42	36.2	0.69	6.09
Placer	237.08	40.3	0.41	8.36
Plumas	7.65	49.5	0.44	5.99
Riverside	305.04	33.7	0.48	7.29
Sacramento	1441.22	34.8	0.74	7.01
San Benito	40.63	34.3	0.54	3.46
San Bernardino	102.56	31.7	0.59	6.59
San Diego	740.58	34.7	0.84	5.95
San Francisco	17398.35	38.5	1.80	5.45
San Joaquin	482.64	32.7	0.69	6.87
San Luis Obispo	81.82	39.4	0.67	7.11
San Mateo	1591.22	39.2	0.68	5.56
Santa Barbara	154.04	33.7	0.90	6.53
Santa Clara	1401.07	36.2	0.73	4.51
Santa Cruz	587.52	36.8	0.74	5.09
Shasta	46.48	41.8	0.55	12.14
Sierra	3.35	51.0	0.39	0.00
Siskiyou	7.12	46.8	0.54	9.38
Solano	470.01	36.9	0.58	6.33
Sonoma	306.32	39.8	0.66	7.05
Stanislaus	342.54	32.9	0.66	8.26
Sutter	157.13	34.6	0.64	5.77
Tehama	21.52	39.5	0.55	7.90
Trinity	4.38	49.2	0.42	5.06
Tulare	92.74	29.6	0.70	6.17
Tuolumne	24.30	47.1	0.43	8.82
Ventura	444.79	36.2	0.53	6.25
Yolo	199.66	30.5	0.89	5.31
Yuba	113.15	32.2	0.68	8.16

Step 3: Create Table 2

Table 2: We first created three categories for median age: <37, 38-44, and 45+. We will display how many counties fall into each of these categories. This is important for our project because we are interested in targeting counties with aging populations.

# Table 2: 
# Categorize the median ages for each county 
# Print frequencies of how many counties fall in each age category
# ---------------------------------------------------------------------
min(ca_demo_data$med_age) # Min median age: 29.6

## [1] 29.6

max(ca_demo_data$med_age) # Max median age: 51

## [1] 51

bucket_size <- (max(ca_demo_data$med_age) - min(ca_demo_data$med_age))/3
bucket1_lower_limit <- min(ca_demo_data$med_age)
bucket2_lower_limit <- min(ca_demo_data$med_age) + bucket_size
bucket3_lower_limit <- min(ca_demo_data$med_age) + bucket_size*2
# Bucket 1: 29.6 - 36.73
# Bucket 2: 36.74 - 43.8733
# Bucket 3: 43.8733 - 51
# Round these buckets to:
# Bucket 1: <37
# Bucket 2: 38 - 44
# Bucket 3: 45+ 

merged_dataset_tbl2 <- merged_dataset_tbl %>%
  mutate(
    median_age_cat = case_when(
      med_age < 37 ~ "<37",
      med_age < 44 ~ "37 - 44",
      TRUE ~ "45+"
    )
  )

merged_dataset_tbl3 <- merged_dataset_tbl2 %>%
  group_by(median_age_cat) %>%
  summarise(total_counties = n())

kable(merged_dataset_tbl3, booktabs = T, align = 'cc', col.names = c("Median Age", 
        "Total Counties"), caption = "Frequency of Median Age Range by County")

Frequency of Median Age Range by County
Median Age	Total Counties
<37	28
37 - 44	16
45+	14

Milestone 4

# Now we'll clean up the HCAI dataset
# We are only interested in the snapshot of costs for projects in closure
# for the most recent week in the dataset, which we can see is 8/11/22

hcai_recent <- hcai_hc_data_cl %>%
  filter(data_generation_date == "2022-08-11" & 
           oshpd_project_status == "In Closure")

# Now we can merge this dataset with our merged dataset that has demographic
# data
merged_dataset_tbl4 <- left_join(merged_dataset_tbl, hcai_recent, 
                            by = c("county" = "county")) %>%
  select(county, med_age, proportion_rent, pop12_sqmi, estimated_mortality_rate,
         total_costs_oshpd) %>%
  mutate(
    total_costs_oshpd = if_else(total_costs_oshpd == "0", "$0", total_costs_oshpd)
  )

Visualization 1

Color scheme: Median age is highlighted in red, with darker red signifying higher median age. Proportion rent is highlighted in rent, with darker red signifying a higher proportion of renters. Population density is highlighted in red if the population is considered “rural” (we used USDA’s definition of ≥500 people/sq mi = urban). If the county is urban, it is not highlighted in red. Mortality rate due to chronic disease is highlighted in red with darker red signifying a higher mortality rate. OSHPD total costs is highlighted in red if the country received no funding at the last time period.

# First, OHE would like you to focus on rural areas, non-homeowners, and aging individuals as populations of interest in your analysis. Your task is to explore the California county census demographic data Links to an external site.and begin to identify counties that share three common attributes: low population per square mile `pop12_sqmi1`, high median age `med_age`, a high proportion of renters vs. homeowners (you may need to create a new variable for this third criteria). 

# STRATEGY
# --------
# Let's make a formattable table and highlight the range of these important variables
# to help us identify counties that meet all the criteria. 
# |med_age|: darker red = older
# |proportion_rent|: darker red = more renters
# |pop12_sq_mi|: red = rural, white = urban. Using a cut-off of 500 people/square
# |total_costs_oshpd|: red if no funding, white if had funding at last snapshot
# mile (this is how USDA defines urban)
# Cite: https://www.ers.usda.gov/topics/rural-economy-population/rural-classifications/what-is-rural.aspx
# |estimated_mortality_rate|: darker red = higher mortality rate
formattable(merged_dataset_tbl4, caption= "Table 1", col.names = c("County", "Median Age", "Proportion Renters vs. Homeowners", "Population per Square Mile", "Chronic Disease Mortality Rate", "Total HCAI Funding"), 
            align = c("l", "c", "c", "c", "c", "c"),
            list(
                  'county' = formatter("span", style = ~style(color="grey", font.weight = "bold")),
                  'med_age' = formatter("span", style = x ~ style(display = "block",
                                 "border-radius" = "4px",
                                 "padding-right" = "4px",
                                 color = "black",
                                "background-color" = csscolor(gradient(as.numeric(x), "white", "red")))),
                  'proportion_rent' = formatter("span", style = x ~ style(display = "block",
                                 "border-radius" = "4px",
                                 "padding-right" = "4px",
                                 color = "black",
                                "background-color" = csscolor(gradient(as.numeric(x), "white", "red")))),
                  'pop12_sqmi' = formatter("span", 
                                          style= ~ style(
                                            display = "block",
                                            "border-radius" = "4px",
                                            "padding-right" = "4px",
                                            "background-color" = ifelse(pop12_sqmi >= 500, "white", "red"))),
                  'estimated_mortality_rate' = formatter("span", style = x ~ style(display = "block",
                                 "border-radius" = "4px",
                                 "padding-right" = "4px",
                                 color = "black",
                                "background-color" = csscolor(gradient(as.numeric(x), "white", "red")))),
                  'total_costs_oshpd' = formatter("span", 
                                          style= ~ style(
                                            display = "block",
                                            "border-radius" = "4px",
                                            "padding-right" = "4px",
                                            "background-color" = ifelse(total_costs_oshpd == "$0", "red", "white")))
                  ) 
          )

Table 1
County	Median Age	Proportion Renters vs. Homeowners	Population per Square Mile	Chronic Disease Mortality Rate	Total HCAI Funding
Alameda	36.6	0.87	2062.40	5.25	$15,250,836.10
Alpine	46.4	0.39	1.54	0.00	$0
Amador	48.2	0.34	63.29	8.56	$0
Butte	37.1	0.72	132.55	9.36	$0
Calaveras	49.1	0.30	44.58	7.18	$0
Colusa	33.5	0.63	18.83	3.67	$0
Contra Costa	38.4	0.49	1405.33	5.86	$7,837,754.00
Del Norte	39.0	0.62	28.30	6.27	$0
El Dorado	43.5	0.37	102.16	6.62	$30,961.00
Fresno	30.7	0.82	157.17	6.69	$5,230,681.00
Glenn	35.3	0.61	21.49	4.64	$0
Humboldt	37.3	0.82	38.06	7.24	$0
Imperial	32.0	0.79	39.74	4.80	$0
Inyo	45.5	0.57	1.82	6.02	$0
Kern	30.7	0.67	104.28	6.69	$2,000,187.99
Kings	31.1	0.85	111.43	4.47	$0
Lake	45.0	0.52	49.08	8.71	$0
Lassen	37.0	0.53	7.42	3.31	$0
Los Angeles	34.8	1.10	2423.26	6.36	$129,179,056.02
Madera	33.1	0.56	71.07	5.56	$139,488.40
Marin	44.5	0.60	486.10	6.29	$5,788,177.72
Mariposa	49.2	0.47	12.61	5.78	$0
Mendocino	41.6	0.70	25.08	6.83	$34,803.00
Merced	29.6	0.84	129.90	5.19	$167,026.00
Modoc	46.0	0.46	2.33	5.85	$0
Mono	37.2	0.79	4.60	1.36	$0
Monterey	33.0	0.97	126.86	4.68	$10,657,237.90
Napa	39.7	0.60	172.31	7.45	$2,743,185.00
Nevada	47.5	0.39	102.56	8.22	$625,345.00
Orange	36.2	0.69	3822.42	6.09	$64,278,886.66
Placer	40.3	0.41	237.08	8.36	$3,985,582.15
Plumas	49.5	0.44	7.65	5.99	$0
Riverside	33.7	0.48	305.04	7.29	$268,651,237.29
Sacramento	34.8	0.74	1441.22	7.01	$12,724,854.30
San Benito	34.3	0.54	40.63	3.46	$0
San Bernardino	31.7	0.59	102.56	6.59	$55,980,818.58
San Diego	34.7	0.84	740.58	5.95	$58,237,267.71
San Francisco	38.5	1.80	17398.35	5.45	$17,012,804.99
San Joaquin	32.7	0.69	482.64	6.87	$0
San Luis Obispo	39.4	0.67	81.82	7.11	$89,105.00
San Mateo	39.2	0.68	1591.22	5.56	$4,254,277.00
Santa Barbara	33.7	0.90	154.04	6.53	$1,709,878.00
Santa Clara	36.2	0.73	1401.07	4.51	$21,401,921.35
Santa Cruz	36.8	0.74	587.52	5.09	$232,403.00
Shasta	41.8	0.55	46.48	12.14	$505,710.00
Sierra	51.0	0.39	3.35	0.00	$0
Siskiyou	46.8	0.54	7.12	9.38	$0
Solano	36.9	0.58	470.01	6.33	$0
Sonoma	39.8	0.66	306.32	7.05	$1,084,897.00
Stanislaus	32.9	0.66	342.54	8.26	$3,039,277.00
Sutter	34.6	0.64	157.13	5.77	$0
Tehama	39.5	0.55	21.52	7.90	$0
Trinity	49.2	0.42	4.38	5.06	$0
Tulare	29.6	0.70	92.74	6.17	$0
Tuolumne	47.1	0.43	24.30	8.82	$0
Ventura	36.2	0.53	444.79	6.25	$17,037,565.00
Yolo	30.5	0.89	199.66	5.31	$0
Yuba	32.2	0.68	113.15	8.16	$0

# Dark red cells under population per square mile are defined as rural by a cut-off of 500 people/square mile (USDA). Dark red cells under the HCAI funding column were counties identified as receiving $0 in funding. These were the primary indicators. The gradient colors for proportion renters vs. homeowners, median age and chronic disease mortality rates show a wide range across counties meeting the criteria for rural and lack of funding. Further analysis is required to identify grantees.

Visualizations 2 and 3

# Convert |total_costs_oshpd| to a numeric variable called |project_cost|
# Add indicator variable for whether county is urban or rural
merged_dataset_tbl5 <- merged_dataset_tbl4 %>% 
  mutate(
    project_cost = substr(total_costs_oshpd, 2, nchar(total_costs_oshpd)),
    project_cost_cl = as.numeric(gsub(",","", project_cost)),
    project_cost = project_cost_cl,
    project_cost_cl = NULL,
    total_costs_oshpd = NULL,
    is_urban = if_else(pop12_sqmi >= 500, "Urban", "Rural")
  )

# FILTER DATA WITH CUT-OFFS
# -------------------------
# First we'll limit to only look at rural counties that did not receive
# funding
merged_dataset_tbl6 <- merged_dataset_tbl5 %>%
  filter(is_urban == "Rural", project_cost == 0)

summary(merged_dataset_tbl6$med_age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    29.6    34.3    37.3    40.1    46.8    51.0

# The median median age of all the counties left is 37.3

merged_dataset_tbl7 <- merged_dataset_tbl6 %>%
  filter(med_age > 37.3)

summary(merged_dataset_tbl7$proportion_rent)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3000  0.3975  0.4500  0.4600  0.5350  0.6200

# The median proportion renters of all the counties left is .45

merged_dataset_tbl8 <- merged_dataset_tbl7 %>%
  filter(proportion_rent >= .45)

# This leaves us with 7 counties. Now let's make a bar graph for each of these
# counties to compare their |estimated_mortality_rate| due to chronic illnesses:
ggplot(data = merged_dataset_tbl8, aes(x = county, y = estimated_mortality_rate,
                                       color = county, fill=county)) + 
  theme(legend.position="none") + 
  theme(plot.caption = element_text(hjust = 0))+
    geom_bar(stat = "identity")+ 
  labs(x="County", y="Chronic Death Mortality Rate/1,000 people", 
                  title="Figure 1: Average Chronic Disease Mortality Rate by County")  +
           labs(caption = str_wrap("Shown in this figure are estimated mortality rates/1,000 people for the counties that met the criteria for being rural (based off the population per square mile), having a high proportion of renters (higher than the median), and having no HCAI funding. Note that the estimated mortality rates are calculated based on an average of mortality totals from years 2014-2020 in the numerator and the 2012 total population estimate in the denominator. In this figure, the counties of Lake, Siskiyou, and Tehama have the three highest estimated chronic disease mortality rates. It is more challenging to see the differences for the counties of Del Norte, Inyo, Mariposa, and Modoc. Further analysis is needed to narrow down the county selections. See Figure 2 for further analysis.", width=100,))

# Some of the counties are pretty close to each other, so let's look more into
# their mortality data.
# Bring back the dataset that had the mortality counts (due to chronic illness)
# over the years, and we'll compare these 7 counties' mortality rates over the
# years to see if some of those counties that have similar mortality rates
# have INCREASING mortality rates vs. DECREASING.
ca_mortality_data_summarized_a <- ca_mortality_data_summarized %>%
  filter(county %in% c("Del Norte", "Inyo", "Lake", "Mariposa", "Modoc", "Siskiyou", "Tehama"))

ca_mortality_data_b <- left_join(ca_mortality_data_summarized_a, ca_demo_data, 
                            by = c("county" = "name")) %>%
  select(county, year, total_chronic_deaths_per_year, pop2012) %>%
  mutate(chronic_death_rate = (total_chronic_deaths_per_year/pop2012)*1000)

ggplot(data = ca_mortality_data_b, aes(group = county))+  # Everything we're doing has that group stratification
    geom_line(aes(x=year, y=chronic_death_rate, color = county))+
  theme(plot.caption = element_text(hjust = 0))+
                labs(x="Year", y="Chronic Death Mortality Rate/1,000 people", 
                  title="Figure 2: County Chronic Disease Mortality Rates by Year", subtitle="2014-2020",colour="County")+
  labs(caption = str_wrap("Shown plotted here are the chronic disease mortality rates for
the remaining 7 counties to analyze. Of the four lowest mortality rates, both 
Del Norte and Inyo have decreasing mortality rates. 
Modoc's rate is increasing while Mariposa's is remaining steady.", width=100,))

#Shown plotted here are the chronic disease mortality rates for the remaining 7 counties to analyze. Of the four lowest mortality rates, both Del Norte and Inyo have decreasing mortality rates. Modoc's rate is increasing while Mariposa's is remaining steady.

Milestone 4

Amanda Pitts, Serine Dumbrique, Rachel Grau

2022-09-29

R Markdown

Milestone 2

Question 1

Question 2

Question 3

Question 4

Milestone 3

Subset rows or columns, as needed

Create new variables needed for analysis

Clean variables needed for analysis

Data dictionary

CA mortality data:

HCAI data:

CA demographic data

Merged dataset (CA demographic data & CA mortality data)

Tables with descriptive stats

Step 1: Prep data for tables by merging datasets

Step 2: Create Table 1

Step 3: Create Table 2

Milestone 4

Visualization 1

Visualizations 2 and 3