Part 1

Question 1

Below is a function that ought to return “I’m an even number!” if x is an even number. However, we’re having trouble receiving a value despite x == 4, which we know is an even number. Fix the code chunk and explain why this error is occurring. You will have to change the eval=FALSE option in the code chunk header to get the chunk to knit in your PDF.

NOTE: %% is the “modulo” operator, which returns the remainder when you divide the left number by the right number. For example, try 2 %% 2 (should equal 0 as 2/2 = 1 with no remainder) and 5 %% 2 (should equal 1 as 5/2 = 2 with a remainder of 1).

# Corrected code: 2 is the default value; else completes the operation 
# based on the argument

return_even <- function(x = 2) {
  if (x %% 2 == 0) {
    return("I'm an even number!")
  } else {                        #else completes operation based on the argument
    return("I'm not an even number!")
  }
}

x <- 4

return_even(x)
## [1] "I'm an even number!"

EXPLAIN THE ISSUE

The error message “argument ‘x’ is missing” occurs because x is a parameter and there is no default value or value to the function. The “if” argument is incomplete without providing “else” statement.

Question 2

R functions are not able to access global variables unless we provide them as inputs.

Below is a function that determines if a number is odd and adds 1 to that number. The function ought to return that value, but we can’t seem to access the value. Debug the code and explain why this error is occurring. Does it make sense to try and call odd_add_1 after running the function?

# Incorrect code to debug

return_odd <- function(y){
  if (y %% 2 != 0) {
    odd_add_1 <- y + 1
  }
}

#return_odd(y)   
#odd_add_1


# Rewrite the code to return the output and complete the statements that 
# operate on the function

# Example

y <- 7

return_odd <- function(y) {
  if (y %% 2 == 0) {    # checks if the number is even
    odd_add_1 <- y + 1
    return(odd_add_1)
  } else {              # output for the case where the y is odd
    return(y)
  }
}

EXPLAIN THE ISSUE HERE

The function, return_odd, is misleading because it does not return an odd number. If the input is odd, it adds 1 to the input (making it even) and the output that is retunred is an even number. To debug this, remove the ! and replace with == so that when y is an equal number, then y + 1 will return an odd number. To complete the conditional statement, if y is an odd number, then return y. In this way, only odd numbers will be returned.
If you create odd_add_1 outside the function as a global scope variable, then you can print the output for odd_add_1 after you run the function. If odd_add_1 is only inside the function, then it is not a defined object.

Question 3

BMI calculations and conversions: - metric: \(BMI = weight (kg) / [height (m)]^2\) - imperial: \(BMI = 703 * weight (lbs) / [height (in)]^2\) - 1 foot = 12 inches - 1 cm = 0.01 meter

Below is a function bmi_imperial() that calculates BMI and assumes the weight and height inputs are from the imperial system (height in feet and weight in pounds).

df_colorado <- read_csv("data/colorado_data.csv")

bmi_imperial <- function(height, weight){
  bmi = (703 * weight)/(height * 12)^2
  return(bmi)
}

# calculate bmi for the first observation
bmi_imperial(df_colorado$height[1], df_colorado$weight[1])
## [1] 42.62802

Write a function called bmi_metric() that calculates BMI based on the metric system. You can test your function with the Taiwan data set excel file in the data folder, which has height in cm and weight in kg.

# your code here
library(readxl)

df_taiwan <- read_xlsx("data/taiwan_data.xlsx")

bmi_metric <- function(height, weight){
  bmi = weight / (height * .01)^2
  return(bmi)
}

# uncomment the line below to test the bmi calculation on the first row

bmi_metric(df_taiwan$height[1], df_taiwan$weight[1])
## [1] 21.45357

Question 4

Can you write a function called calculate_bmi() that combines both of the BMI functions? You will need to figure out a way to determine which calculation to perform based on the values in the data.

# your code here

# test
# calculate_bmi(df_colorado$height[1], df_colorado$weight[1])
# calculate_bmi(df_taiwan$height[1], df_taiwan$weight[1])


calculate_bmi <- function(height, weight){
  if(height < 9){
    return(bmi_imperial(height, weight))
  }
  else {
    return(bmi_metric(height, weight))
  }  
}
  
  
calculate_bmi(df_taiwan$height[1], df_taiwan$weight[1])
## [1] 21.45357
calculate_bmi(df_colorado$height[1], df_colorado$weight[1])
## [1] 42.62802

Question 5

Use your function calculate_bmi() to answer the following questions:

What is the average BMI of the individuals in the Colorado data set?

45.60881

# your code here

library(dplyr)

calculate_bmi <- function(height, weight) {
  ifelse(height < 9,
         703 * weight/(height * 12)^2,
          weight / (height * .01)^2)
          
}


colorado_results <- df_colorado %>%
  mutate(bmi_colorado = calculate_bmi(height, weight))

mean(colorado_results$bmi_colorado)
## [1] 45.60881

What is the average BMI of the individuals in the Taiwan data set?

22.99287

# your code here


taiwan_results <- df_taiwan %>%
  mutate(bmi_taiwan = calculate_bmi(height, weight))

mean(taiwan_results$bmi_taiwan)
## [1] 22.99287

Question 6

Combine the Colorado and Taiwan data sets into one data frame and calculate the BMI for every row using your calculate_bmi() function. Print the first six rows and the last six rows of that new data set.

# your code here
# Combine datasets and calculate BMI 

# Convert dates to consistent format 
combined_data <- bind_rows( 
# Convert Colorado numeric dates to proper date format 
df_colorado %>% 
mutate(date = as.Date(as.character(date), format="%Y%d%m")), 

# Convert Taiwan character dates to same date format 
df_taiwan %>% 
mutate(date = as.Date(date, format="%d-%b-%y")) 
) %>% 
# Calculate BMI for all rows 
mutate(bmi = calculate_bmi(height, weight)) 

# Show first 6 and last 6 rows 
head(combined_data) 
tail(combined_data) 

Question 7

Make a boxplot that shows the BMI distribution of the combined data, separated by location on the x-axis. Use a theme of your choice, put a title on your graph, and hide the y-axis title.

NOTE: These data are for practice only and are not representative populations, which is why we aren’t comparing them with statistical tests. It would not be responsible to draw any conclusions from this graph!

# your code here

# ggplot_bmi

ggplot(combined_data, aes(x = location, y = bmi)) +
geom_boxplot() +
ggtitle("BMI Distribution by Location") +
theme_minimal() +
theme(axis.title.y = element_blank()) +
labs(x = "Location")

Part 2

Question 8

Recall the patient data from a healthcare facility that we used in Part 2 of Problem Set 7.

We had four tables that were relational to each other and the following keys linking the tables together:

  • patient_id: patients, schedule
  • visit_id: schedule, visits
  • doctor_id: visits, doctors

Use a join to find out which patients have no visits on the schedule.

# your code here

#Left-join shows NA in columns visit_id and date for 
# all patients without visits

patients_without_visits <- left_join(x= patients, 
                                     y = schedule, 
                                     by = "patient_id")
print(patients_without_visits)
## # A tibble: 131 × 8
##    patient_id   age race_ethnicity  gender_identity height weight visit_id date 
##         <dbl> <dbl> <chr>           <chr>            <dbl>  <dbl>    <dbl> <chr>
##  1       1000    54 Asian           woman              163     57       17 7/5/…
##  2       1001    60 Hispanic, Lati… woman              190     80        1 1/2/…
##  3       1001    60 Hispanic, Lati… woman              190     80       37 2/7/…
##  4       1001    60 Hispanic, Lati… woman              190     80       53 8/3/…
##  5       1001    60 Hispanic, Lati… woman              190     80       80 3/7/…
##  6       1001    60 Hispanic, Lati… woman              190     80       83 4/7/…
##  7       1002    72 Asian           transgender        156     52       50 6/13…
##  8       1003    38 African Americ… man                183    105       23 8/12…
##  9       1004    40 African Americ… woman              162     68        3 1/15…
## 10       1005    44 White           woman              169     88       84 4/11…
## # ℹ 121 more rows
# Select rows where visit_id is NA. There are seven patients with no scheduled visits.

rows_with_na_visit_id <- patients_without_visits[is.na(patients_without_visits$visit_id), ]
print(rows_with_na_visit_id)
## # A tibble: 7 × 8
##   patient_id   age race_ethnicity   gender_identity height weight visit_id date 
##        <dbl> <dbl> <chr>            <chr>            <dbl>  <dbl>    <dbl> <chr>
## 1       1013    32 <NA>             man             180       58        NA <NA> 
## 2       1015    71 White            man             151       62        NA <NA> 
## 3       1017    72 Asian            woman           191       62        NA <NA> 
## 4       1023    48 Asian            man               4.82   203.       NA <NA> 
## 5       1033    64 Asian            woman             6.43   256.       NA <NA> 
## 6       1038    38 African America… transgender       4.86   342.       NA <NA> 
## 7       1042    66 White            man               4.95   251.       NA <NA>

Question 9

With this data, can you tell if those patients with no visits on the schedule have been assigned to a doctor? Why or why not?

No because the only way to connect patient_id (in schedule) to doctor_id (in visits) is an inner_join between schedule and visits by visit_id, and patients with no_visits are not on the schedule (no visit_id), so they will not be present in the inner_join between schedule and visits.

# your code here? (optional)

#  Left join with prior dataset to visits data confirms 
# that there is no doctor id listed (NAs) for patients without visits                                                                                    
                                                                
no_doctors_for_patients_without_visits <- 
  left_join(x = patients_without_visits, y = visits, by = "visit_id")

# As final check, to confirm which patients do not have 
# scheduled visits and therefore no doctor assigned, you 
# can do an anti-join. This returns all rows in x where 
# there is not a match in y. The same seven patients are listed. 


anti_join_patients_schedule <- 
  anti_join(x = patients, y = schedule, by = "patient_id")

print(anti_join_patients_schedule)
## # A tibble: 7 × 6
##   patient_id   age race_ethnicity         gender_identity height weight
##        <dbl> <dbl> <chr>                  <chr>            <dbl>  <dbl>
## 1       1013    32 <NA>                   man             180       58 
## 2       1015    71 White                  man             151       62 
## 3       1017    72 Asian                  woman           191       62 
## 4       1023    48 Asian                  man               4.82   203.
## 5       1033    64 Asian                  woman             6.43   256.
## 6       1038    38 African American/Black transgender       4.86   342.
## 7       1042    66 White                  man               4.95   251.

YOUR ANSWER HERE

Question 10

Assume those patients need primary care and haven’t been assigned a doctor yet. Which primary care doctors have the least amount of visits? Rank them from least to most visits.

# your code here


visits_doctors <- inner_join(visits, doctors, by = "doctor_id") %>%
  filter(speciality == "primary care")

visits_doctors %>% 
group_by(doctor) %>% 
summarise(total_visits = n()) %>% 
arrange(total_visits)

Part 3

Recall in Problem Set 5, Part 2, we were working with data from New York City that tested children under 6 years old for elevated blood lead levels (BLL). [You can read more about the data on their website]).

About the data:

All NYC children are required to be tested for lead poisoning at around age 1 and age 2, and to be screened for risk of lead poisoning, and tested if at risk, up until age 6. These data are an indicator of children younger that 6 years of age tested in NYC in a given year with blood lead levels (BLL) of 5 mcg/dL or greater. In 2012, CDC established that a blood lead level of 5 mcg/dL is the reference level for exposure to lead in children. This level is used to identify children who have blood lead levels higher than most children’s levels. The reference level is determined by measuring the NHANES blood lead distribution in US children ages 1 to 5 years, and is reviewed every 4 years.

Question 11

Load in a cleaned-up version of the blood lead levels data:

library(dplyr)

bll_nyc_per_1000 <- read_csv("data/bll_nyc_per_1000.csv") %>%
  rename(Borough = borough_id)

Create a formattable table (example below) that shows the elevated blood lead levels per 1000 tested across 2013-2016. If the BLL increases from the previous year, turn the text red. If the BLL decreases from the previous year, turn the text green. To accomplish this color changing, you may want to create three indicator variables that check the value between years (e.g. use if_else). If you’ve have used conditional formatting on excel/google sheets, the concept is the same, but with R.

Note: If you are using if_else (hint hint) and checking by the year, you will likely need to use the left quote, actute, backtip, to reference the variable.

We have also provided you a function that you can use within your formattable table to reference this indicator variable to help reduce the code. However, you do not have to use this, and feel free to change the hex colors.

Borough 2013 2014 2015 2016
Bronx 20.1 18.7 15.7 15.0
Brooklyn 30.2 26.8 22.6 22.3
Manhattan 15.2 14.1 10.6 8.1
Queens 18.2 18.5 15.4 14.3
Staten Island 17.6 17.1 12.0 14.8
# in the event that plotly was run below, detach plotly
# the option 'style' conflicts when both libraries are loaded
# detach("package:plotly", unload=TRUE)

# your code here

# see above for my code to color code the text

Question 12

Starting with the data frame bll_nyc_per_1000 create a table with the DT library showing elevated blood lead levels per 1000 tested in 2013-2016 by borough. Below is an example of the table to replicate.

# your code here

library(DT)
library(dplyr)

# Sort the dataset by the Borough column
bll_nyc_per_1000 <- arrange(bll_nyc_per_1000, Borough)

# Datatable

datatable(bll_nyc_per_1000,
  options = list(
    pageLength = 10,   # Display 10 rows per page
    autoWidth = FALSE,  # Automatically adjust column widths
    dom = 'ltpf'          # Hide the default table controls
  ),
  caption = htmltools::tags$caption(
    style = 'caption-side: top; text-align: center; font-weight: bold; font-size: 16px;',
    "New York City: Elevated Blood Lead Levels 2013-2016 by Borough"
  ),
  colnames = c("Borough", "Year", "BLL>5"
  ),
  rownames = F)

Part 4

Question 13

For this question, we will use suicide rates data that comes from the CDC.

Replicate the graph below using plotly.

# issues with formattable and plotly together since the "style" option overlap
# https://stackoverflow.com/questions/39319427/using-formattable-and-plotly-simultaneously
detach("package:formattable", unload=TRUE)

library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
df_suicide <- read_csv("data/Suicide Mortality by State.csv")
## Rows: 300 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): STATE, NAME, URL
## dbl (2): YEAR, RATE
## num (1): DEATHS
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
df_suicide_clean <- df_suicide %>% 
  rename(year = YEAR, rate = RATE, state = STATE) %>%
  filter(state %in% c("AZ", "CA", "FL", "HI", "MI", "NY", "WY"))


plot_suicide_rate <- plot_ly(
  data = df_suicide_clean,  # Explicitly specify the data argument
  x = ~year,
  y = ~rate,
  type = "scatter",
  mode = "lines",
  color = ~state
) %>%
  layout(
    title = "Suicide Rates by State Over Time",
    showlegend = TRUE,  # Ensure legend is displayed
    legend = list(title = list(text = NULL)),  # Remove legend title
    xaxis = list(
      title = ""  # Remove x-axis label
    ),
    yaxis = list(
      title = "Suicide Rate per 100,000"  # Keep y-axis label
    )
  )

print(plot_suicide_rate)
## Warning: `arrange_()` was deprecated in dplyr 0.7.0.
## ℹ Please use `arrange()` instead.
## ℹ See vignette('programming') for more help
## ℹ The deprecated feature was likely used in the plotly package.
##   Please report the issue at <https://github.com/ropensci/plotly/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Question 14

Create an interactive (choropleth) map with plotly similar to the one presented on the CDC website. On the CDC map you can hover over each state and see the state name in bold, the death rate, and deaths. We can do much of this with plotly. As a challenge, use only the 2018 data to create an interactive US map colored by the suicide rates of that state. When you hover over the state, you should see the state name in bold, the death rate, and the number of deaths.

Some key search terms that may help:

  • choropleth
  • hover text plotly r
  • hover text bold plotly r
  • plotly and html
  • html bold
  • html subtitle

Below is an image of an example final map when you hover over California.

Here is the shell of the map to get you started. Copy the plotly code into the chunk below this one and customize it.

# data pulled from CDC website described above 
df_suicide <- read_csv("data/Suicide Mortality by State.csv") %>%
  filter(YEAR == 2018)

#plot_ly(df_suicide,
        #type="choropleth",
        #locationmode = "USA-states") %>%
  #layout(geo=list(scope="usa")) 
# your code here


library(plotly)
library(dplyr)
library(readr)

# Load and filter the data
df_suicide <- read_csv("data/Suicide Mortality by State.csv") %>%
  filter(YEAR == 2018)
## Rows: 300 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): STATE, NAME, URL
## dbl (2): YEAR, RATE
## num (1): DEATHS
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Create a choropleth map
plot_ly(
  data = df_suicide,
  type = "choropleth",
  locations = ~STATE,         # Column with state abbreviations (e.g., "NY", "CA")
  locationmode = "USA-states", # Use two-letter state codes
  z = ~RATE,                  # Column with the values to display (e.g., suicide rates)
  text = ~STATE,         # Optional: Add state names as hover text
  colorscale = "Blues"        # Choose a color scale (e.g., Blues, Reds)
) %>%
  layout(
    title = "Suicide Mortality Rates by State (2018)",
    geo = list(
      scope = "usa",          
      projection = list(type = "albers usa") # Use appropriate map projection
    )
  )