Below is a function that ought to return “I’m an even number!” if
x is an even number. However, we’re having trouble
receiving a value despite x == 4, which we know is an even
number. Fix the code chunk and explain why this error is occurring. You
will have to change the eval=FALSE option in the code chunk
header to get the chunk to knit in your PDF.
NOTE: %% is the “modulo” operator, which returns the
remainder when you divide the left number by the right number. For
example, try 2 %% 2 (should equal 0 as 2/2 = 1 with no
remainder) and 5 %% 2 (should equal 1 as 5/2 = 2 with a
remainder of 1).
# Corrected code: 2 is the default value; else completes the operation
# based on the argument
return_even <- function(x = 2) {
if (x %% 2 == 0) {
return("I'm an even number!")
} else { #else completes operation based on the argument
return("I'm not an even number!")
}
}
x <- 4
return_even(x)
## [1] "I'm an even number!"
EXPLAIN THE ISSUE
The error message “argument ‘x’ is missing” occurs because x is a parameter and there is no default value or value to the function. The “if” argument is incomplete without providing “else” statement.
R functions are not able to access global variables unless we provide them as inputs.
Below is a function that determines if a number is odd and adds 1 to that number. The function ought to return that value, but we can’t seem to access the value. Debug the code and explain why this error is occurring. Does it make sense to try and call odd_add_1 after running the function?
# Incorrect code to debug
return_odd <- function(y){
if (y %% 2 != 0) {
odd_add_1 <- y + 1
}
}
#return_odd(y)
#odd_add_1
# Rewrite the code to return the output and complete the statements that
# operate on the function
# Example
y <- 7
return_odd <- function(y) {
if (y %% 2 == 0) { # checks if the number is even
odd_add_1 <- y + 1
return(odd_add_1)
} else { # output for the case where the y is odd
return(y)
}
}
EXPLAIN THE ISSUE HERE
The function, return_odd, is misleading because it does not return an
odd number. If the input is odd, it adds 1 to the input (making it even)
and the output that is retunred is an even number. To debug this, remove
the ! and replace with == so that when y is an equal number, then y + 1
will return an odd number. To complete the conditional statement, if y
is an odd number, then return y. In this way, only odd numbers will be
returned.
If you create odd_add_1 outside the function as a global scope variable,
then you can print the output for odd_add_1 after you run the function.
If odd_add_1 is only inside the function, then it is not a defined
object.
BMI calculations and conversions: - metric: \(BMI = weight (kg) / [height (m)]^2\) - imperial: \(BMI = 703 * weight (lbs) / [height (in)]^2\) - 1 foot = 12 inches - 1 cm = 0.01 meter
Below is a function bmi_imperial() that calculates BMI
and assumes the weight and height inputs are from the imperial system
(height in feet and weight in pounds).
df_colorado <- read_csv("data/colorado_data.csv")
bmi_imperial <- function(height, weight){
bmi = (703 * weight)/(height * 12)^2
return(bmi)
}
# calculate bmi for the first observation
bmi_imperial(df_colorado$height[1], df_colorado$weight[1])
## [1] 42.62802
Write a function called bmi_metric() that calculates BMI
based on the metric system. You can test your function with the Taiwan
data set excel file in the data folder, which has height in cm and
weight in kg.
# your code here
library(readxl)
df_taiwan <- read_xlsx("data/taiwan_data.xlsx")
bmi_metric <- function(height, weight){
bmi = weight / (height * .01)^2
return(bmi)
}
# uncomment the line below to test the bmi calculation on the first row
bmi_metric(df_taiwan$height[1], df_taiwan$weight[1])
## [1] 21.45357
Can you write a function called calculate_bmi() that
combines both of the BMI functions? You will need to figure out a way to
determine which calculation to perform based on the values in the
data.
# your code here
# test
# calculate_bmi(df_colorado$height[1], df_colorado$weight[1])
# calculate_bmi(df_taiwan$height[1], df_taiwan$weight[1])
calculate_bmi <- function(height, weight){
if(height < 9){
return(bmi_imperial(height, weight))
}
else {
return(bmi_metric(height, weight))
}
}
calculate_bmi(df_taiwan$height[1], df_taiwan$weight[1])
## [1] 21.45357
calculate_bmi(df_colorado$height[1], df_colorado$weight[1])
## [1] 42.62802
Use your function calculate_bmi() to answer the
following questions:
What is the average BMI of the individuals in the Colorado data set?
45.60881
# your code here
library(dplyr)
calculate_bmi <- function(height, weight) {
ifelse(height < 9,
703 * weight/(height * 12)^2,
weight / (height * .01)^2)
}
colorado_results <- df_colorado %>%
mutate(bmi_colorado = calculate_bmi(height, weight))
mean(colorado_results$bmi_colorado)
## [1] 45.60881
What is the average BMI of the individuals in the Taiwan data set?
22.99287
# your code here
taiwan_results <- df_taiwan %>%
mutate(bmi_taiwan = calculate_bmi(height, weight))
mean(taiwan_results$bmi_taiwan)
## [1] 22.99287
Combine the Colorado and Taiwan data sets into one data frame and
calculate the BMI for every row using your calculate_bmi()
function. Print the first six rows and the last six rows of that new
data set.
# your code here
# Combine datasets and calculate BMI
# Convert dates to consistent format
combined_data <- bind_rows(
# Convert Colorado numeric dates to proper date format
df_colorado %>%
mutate(date = as.Date(as.character(date), format="%Y%d%m")),
# Convert Taiwan character dates to same date format
df_taiwan %>%
mutate(date = as.Date(date, format="%d-%b-%y"))
) %>%
# Calculate BMI for all rows
mutate(bmi = calculate_bmi(height, weight))
# Show first 6 and last 6 rows
head(combined_data)
tail(combined_data)
Make a boxplot that shows the BMI distribution of the combined data, separated by location on the x-axis. Use a theme of your choice, put a title on your graph, and hide the y-axis title.
NOTE: These data are for practice only and are not representative populations, which is why we aren’t comparing them with statistical tests. It would not be responsible to draw any conclusions from this graph!
# your code here
# ggplot_bmi
ggplot(combined_data, aes(x = location, y = bmi)) +
geom_boxplot() +
ggtitle("BMI Distribution by Location") +
theme_minimal() +
theme(axis.title.y = element_blank()) +
labs(x = "Location")
Recall the patient data from a healthcare facility that we used in Part 2 of Problem Set 7.
We had four tables that were relational to each other and the following keys linking the tables together:
Use a join to find out which patients have no visits on the schedule.
# your code here
#Left-join shows NA in columns visit_id and date for
# all patients without visits
patients_without_visits <- left_join(x= patients,
y = schedule,
by = "patient_id")
print(patients_without_visits)
## # A tibble: 131 × 8
## patient_id age race_ethnicity gender_identity height weight visit_id date
## <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr>
## 1 1000 54 Asian woman 163 57 17 7/5/…
## 2 1001 60 Hispanic, Lati… woman 190 80 1 1/2/…
## 3 1001 60 Hispanic, Lati… woman 190 80 37 2/7/…
## 4 1001 60 Hispanic, Lati… woman 190 80 53 8/3/…
## 5 1001 60 Hispanic, Lati… woman 190 80 80 3/7/…
## 6 1001 60 Hispanic, Lati… woman 190 80 83 4/7/…
## 7 1002 72 Asian transgender 156 52 50 6/13…
## 8 1003 38 African Americ… man 183 105 23 8/12…
## 9 1004 40 African Americ… woman 162 68 3 1/15…
## 10 1005 44 White woman 169 88 84 4/11…
## # ℹ 121 more rows
# Select rows where visit_id is NA. There are seven patients with no scheduled visits.
rows_with_na_visit_id <- patients_without_visits[is.na(patients_without_visits$visit_id), ]
print(rows_with_na_visit_id)
## # A tibble: 7 × 8
## patient_id age race_ethnicity gender_identity height weight visit_id date
## <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr>
## 1 1013 32 <NA> man 180 58 NA <NA>
## 2 1015 71 White man 151 62 NA <NA>
## 3 1017 72 Asian woman 191 62 NA <NA>
## 4 1023 48 Asian man 4.82 203. NA <NA>
## 5 1033 64 Asian woman 6.43 256. NA <NA>
## 6 1038 38 African America… transgender 4.86 342. NA <NA>
## 7 1042 66 White man 4.95 251. NA <NA>
With this data, can you tell if those patients with no visits on the schedule have been assigned to a doctor? Why or why not?
No because the only way to connect patient_id (in schedule) to doctor_id (in visits) is an inner_join between schedule and visits by visit_id, and patients with no_visits are not on the schedule (no visit_id), so they will not be present in the inner_join between schedule and visits.
# your code here? (optional)
# Left join with prior dataset to visits data confirms
# that there is no doctor id listed (NAs) for patients without visits
no_doctors_for_patients_without_visits <-
left_join(x = patients_without_visits, y = visits, by = "visit_id")
# As final check, to confirm which patients do not have
# scheduled visits and therefore no doctor assigned, you
# can do an anti-join. This returns all rows in x where
# there is not a match in y. The same seven patients are listed.
anti_join_patients_schedule <-
anti_join(x = patients, y = schedule, by = "patient_id")
print(anti_join_patients_schedule)
## # A tibble: 7 × 6
## patient_id age race_ethnicity gender_identity height weight
## <dbl> <dbl> <chr> <chr> <dbl> <dbl>
## 1 1013 32 <NA> man 180 58
## 2 1015 71 White man 151 62
## 3 1017 72 Asian woman 191 62
## 4 1023 48 Asian man 4.82 203.
## 5 1033 64 Asian woman 6.43 256.
## 6 1038 38 African American/Black transgender 4.86 342.
## 7 1042 66 White man 4.95 251.
YOUR ANSWER HERE
Assume those patients need primary care and haven’t been assigned a doctor yet. Which primary care doctors have the least amount of visits? Rank them from least to most visits.
# your code here
visits_doctors <- inner_join(visits, doctors, by = "doctor_id") %>%
filter(speciality == "primary care")
visits_doctors %>%
group_by(doctor) %>%
summarise(total_visits = n()) %>%
arrange(total_visits)
Recall in Problem Set 5, Part 2, we were working with data from New York City that tested children under 6 years old for elevated blood lead levels (BLL). [You can read more about the data on their website]).
About the data:
All NYC children are required to be tested for lead poisoning at around age 1 and age 2, and to be screened for risk of lead poisoning, and tested if at risk, up until age 6. These data are an indicator of children younger that 6 years of age tested in NYC in a given year with blood lead levels (BLL) of 5 mcg/dL or greater. In 2012, CDC established that a blood lead level of 5 mcg/dL is the reference level for exposure to lead in children. This level is used to identify children who have blood lead levels higher than most children’s levels. The reference level is determined by measuring the NHANES blood lead distribution in US children ages 1 to 5 years, and is reviewed every 4 years.
Load in a cleaned-up version of the blood lead levels data:
library(dplyr)
bll_nyc_per_1000 <- read_csv("data/bll_nyc_per_1000.csv") %>%
rename(Borough = borough_id)
Create a formattable table (example below) that shows the elevated blood lead levels per 1000 tested across 2013-2016. If the BLL increases from the previous year, turn the text red. If the BLL decreases from the previous year, turn the text green. To accomplish this color changing, you may want to create three indicator variables that check the value between years (e.g. use if_else). If you’ve have used conditional formatting on excel/google sheets, the concept is the same, but with R.
Note: If you are using if_else (hint hint) and checking by the year, you will likely need to use the left quote, actute, backtip, to reference the variable.
We have also provided you a function that you can use within your formattable table to reference this indicator variable to help reduce the code. However, you do not have to use this, and feel free to change the hex colors.
| Borough | 2013 | 2014 | 2015 | 2016 |
|---|---|---|---|---|
| Bronx | 20.1 | 18.7 | 15.7 | 15.0 |
| Brooklyn | 30.2 | 26.8 | 22.6 | 22.3 |
| Manhattan | 15.2 | 14.1 | 10.6 | 8.1 |
| Queens | 18.2 | 18.5 | 15.4 | 14.3 |
| Staten Island | 17.6 | 17.1 | 12.0 | 14.8 |
# in the event that plotly was run below, detach plotly
# the option 'style' conflicts when both libraries are loaded
# detach("package:plotly", unload=TRUE)
# your code here
# see above for my code to color code the text
Starting with the data frame bll_nyc_per_1000 create a
table with the DT library showing elevated blood lead levels per 1000
tested in 2013-2016 by borough. Below is an example of the table to
replicate.
# your code here
library(DT)
library(dplyr)
# Sort the dataset by the Borough column
bll_nyc_per_1000 <- arrange(bll_nyc_per_1000, Borough)
# Datatable
datatable(bll_nyc_per_1000,
options = list(
pageLength = 10, # Display 10 rows per page
autoWidth = FALSE, # Automatically adjust column widths
dom = 'ltpf' # Hide the default table controls
),
caption = htmltools::tags$caption(
style = 'caption-side: top; text-align: center; font-weight: bold; font-size: 16px;',
"New York City: Elevated Blood Lead Levels 2013-2016 by Borough"
),
colnames = c("Borough", "Year", "BLL>5"
),
rownames = F)
For this question, we will use suicide rates data that comes from the CDC.
Replicate the graph below using plotly.
# issues with formattable and plotly together since the "style" option overlap
# https://stackoverflow.com/questions/39319427/using-formattable-and-plotly-simultaneously
detach("package:formattable", unload=TRUE)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
df_suicide <- read_csv("data/Suicide Mortality by State.csv")
## Rows: 300 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): STATE, NAME, URL
## dbl (2): YEAR, RATE
## num (1): DEATHS
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
df_suicide_clean <- df_suicide %>%
rename(year = YEAR, rate = RATE, state = STATE) %>%
filter(state %in% c("AZ", "CA", "FL", "HI", "MI", "NY", "WY"))
plot_suicide_rate <- plot_ly(
data = df_suicide_clean, # Explicitly specify the data argument
x = ~year,
y = ~rate,
type = "scatter",
mode = "lines",
color = ~state
) %>%
layout(
title = "Suicide Rates by State Over Time",
showlegend = TRUE, # Ensure legend is displayed
legend = list(title = list(text = NULL)), # Remove legend title
xaxis = list(
title = "" # Remove x-axis label
),
yaxis = list(
title = "Suicide Rate per 100,000" # Keep y-axis label
)
)
print(plot_suicide_rate)
## Warning: `arrange_()` was deprecated in dplyr 0.7.0.
## ℹ Please use `arrange()` instead.
## ℹ See vignette('programming') for more help
## ℹ The deprecated feature was likely used in the plotly package.
## Please report the issue at <https://github.com/ropensci/plotly/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Create an interactive (choropleth) map with plotly similar to the one presented on the CDC website. On the CDC map you can hover over each state and see the state name in bold, the death rate, and deaths. We can do much of this with plotly. As a challenge, use only the 2018 data to create an interactive US map colored by the suicide rates of that state. When you hover over the state, you should see the state name in bold, the death rate, and the number of deaths.
Some key search terms that may help:
Below is an image of an example final map when you hover over California.
Here is the shell of the map to get you started. Copy the plotly code into the chunk below this one and customize it.
# data pulled from CDC website described above
df_suicide <- read_csv("data/Suicide Mortality by State.csv") %>%
filter(YEAR == 2018)
#plot_ly(df_suicide,
#type="choropleth",
#locationmode = "USA-states") %>%
#layout(geo=list(scope="usa"))
# your code here
library(plotly)
library(dplyr)
library(readr)
# Load and filter the data
df_suicide <- read_csv("data/Suicide Mortality by State.csv") %>%
filter(YEAR == 2018)
## Rows: 300 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): STATE, NAME, URL
## dbl (2): YEAR, RATE
## num (1): DEATHS
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Create a choropleth map
plot_ly(
data = df_suicide,
type = "choropleth",
locations = ~STATE, # Column with state abbreviations (e.g., "NY", "CA")
locationmode = "USA-states", # Use two-letter state codes
z = ~RATE, # Column with the values to display (e.g., suicide rates)
text = ~STATE, # Optional: Add state names as hover text
colorscale = "Blues" # Choose a color scale (e.g., Blues, Reds)
) %>%
layout(
title = "Suicide Mortality Rates by State (2018)",
geo = list(
scope = "usa",
projection = list(type = "albers usa") # Use appropriate map projection
)
)