Below is a function that ought to return “I’m an even number!” if
x is an even number. However, we’re having trouble
receiving a value despite x == 4, which we know is an even
number. Fix the code chunk and explain why this error is occurring. You
will have to change the eval=FALSE option in the code chunk
header to get the chunk to knit in your PDF.
NOTE: %% is the “modulo” operator, which returns the
remainder when you divide the left number by the right number. For
example, try 2 %% 2 (should equal 0 as 2/2 = 1 with no
remainder) and 5 %% 2 (should equal 1 as 5/2 = 2 with a
remainder of 1).
return_even <- function(x){
if (x %% 2 == 0) {
return("I'm an even number!")
}
}
x <- 4
return_even(x)
## [1] "I'm an even number!"
EXPLAIN THE ISSUE HERE
The issue is that the function was not called, it was simply written as return_even() instead of return_even(x).
R functions are not able to access global variables unless we provide them as inputs.
Below is a function that determines if a number is odd and adds 1 to that number. The function ought to return that value, but we can’t seem to access the value. Debug the code and explain why this error is occurring. Does it make sense to try and call odd_add_1 after running the function?
return_odd <- function(y) {
if (y %% 2 != 0) {
return(y + 1)
}
}
odd_add_1 <- return_odd(3)
odd_add_1
## [1] 4
EXPLAIN THE ISSUE HERE
The issue is that the variable odd_add_1 is created inside the function return_odd() and is thus local to the function. Variables defined inside a function are not accessible outside of that function’s scope in R, which is why trying to access odd_add_1 after calling the function results in an error.
It doesn’t make sense to call the variable odd_add_1 because it is inaccessible outside of the function, and can only be re-assigned to out put the return_odd varibale as follows;
odd_add_1 <- return_odd(3) odd_add_1
BMI calculations and conversions: - metric: \(BMI = weight (kg) / [height (m)]^2\) - imperial: \(BMI = 703 * weight (lbs) / [height (in)]^2\) - 1 foot = 12 inches - 1 cm = 0.01 meter
Below is a function bmi_imperial() that calculates BMI
and assumes the weight and height inputs are from the imperial system
(height in feet and weight in pounds).
df_colorado <- read_csv("data/colorado_data.csv")
## Rows: 24 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): location, gender
## dbl (3): height, weight, date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
bmi_imperial <- function(height, weight){
bmi = (703 * weight)/(height * 12)^2
return(bmi)
}
# calculate bmi for the first observation
bmi_imperial(df_colorado$height[1], df_colorado$weight[1])
## [1] 42.62802
Write a function called bmi_metric() that calculates BMI
based on the metric system. You can test your function with the Taiwan
data set excel file in the data folder, which has height in cm and
weight in kg.
df_taiwan <- readxl::read_excel("data/taiwan_data.xlsx")
bmi_metric <- function(height, weight) {
height <- height * 0.01
bmi <- weight/ (height^2)
return(bmi)
}
bmi_metric(df_taiwan$height[1], df_taiwan$weight[1])
## [1] 21.45357
Can you write a function called calculate_bmi() that
combines both of the BMI functions? You will need to figure out a way to
determine which calculation to perform based on the values in the
data.
calculate_bmi <- function(height, weight, system = "metric") {
if (system == "metric") {
height <- height * 0.01
bmi <- weight / (height^2)
} else if (system == "imperial") {
bmi <- (703 * weight) / ((height * 12)^2)
} else {
stop("Invalid system. Please use 'metric' or 'imperial'.")
}
return(bmi)
}
calculate_bmi(df_colorado$height[1], df_colorado$weight[1], system = "imperial")
## [1] 42.62802
calculate_bmi(df_taiwan$height[1], df_taiwan$weight[1], system = "metric")
## [1] 21.45357
Use your function calculate_bmi() to answer the
following questions:
What is the average BMI of the individuals in the Colorado data set?
avg_bmi_colorado <- df_colorado %>%
mutate(bmi = calculate_bmi(height, weight, system = "imperial")) %>%
summarize(avg_bmi = mean(bmi, na.rm = TRUE)) %>%
pull(avg_bmi)
avg_bmi_colorado
## [1] 45.60881
What is the average BMI of the individuals in the Taiwan data set?
avg_bmi_taiwan <- df_taiwan %>%
mutate(bmi = calculate_bmi(height, weight, system = "metric")) %>%
summarize(avg_bmi = mean(bmi, na.rm = TRUE)) %>%
pull(avg_bmi)
avg_bmi_taiwan
## [1] 22.99287
Combine the Colorado and Taiwan data sets into one data frame and
calculate the BMI for every row using your calculate_bmi()
function. Print the first six rows and the last six rows of that new
data set.
#I will select out the height and weight columns in each data set and do a full join
df_colorado <- df_colorado %>%
select("location", "height", "weight")
df_taiwan <- df_taiwan %>%
select("location", "height", "weight")
df_combined <- full_join(df_taiwan, df_colorado, by = c("location","height", "weight"))
df_combined
#I will now calculate bmi for each row
df_combined <- df_combined %>%
mutate(
bmi = if_else(
location == "Taiwan",
calculate_bmi(height, weight, "metric"),
calculate_bmi(height, weight, "imperial")
)
)
df_combined
Make a boxplot that shows the BMI distribution of the combined data, separated by location on the x-axis. Use a theme of your choice, put a title on your graph, and hide the y-axis title.
NOTE: These data are for practice only and are not representative populations, which is why we aren’t comparing them with statistical tests. It would not be responsible to draw any conclusions from this graph!
library(ggplot2)
ggplot(df_combined, aes(x = location, y = bmi)) +
geom_boxplot() +
theme_minimal() +
ggtitle("BMI Distribution by Location") +
theme(axis.title.y = element_blank())
Recall the patient data from a healthcare facility that we used in Part 2 of Problem Set 7.
We had four tables that were relational to each other and the following keys linking the tables together:
Use a join to find out which patients have no visits on the schedule.
# Perform the left join between patients, schedule, and visits
patients_no_visits <- patients %>%
left_join(schedule, by = "patient_id")%>%
left_join(visits, by = "visit_id") %>%
filter(is.na(visit_id))
# Display the result
print(patients_no_visits)
## # A tibble: 7 × 10
## patient_id age race_ethnicity gender_identity height weight visit_id date
## <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr>
## 1 1013 32 <NA> man 180 58 NA <NA>
## 2 1015 71 White man 151 62 NA <NA>
## 3 1017 72 Asian woman 191 62 NA <NA>
## 4 1023 48 Asian man 4.82 203. NA <NA>
## 5 1033 64 Asian woman 6.43 256. NA <NA>
## 6 1038 38 African America… transgender 4.86 342. NA <NA>
## 7 1042 66 White man 4.95 251. NA <NA>
## # ℹ 2 more variables: follow_up <chr>, doctor_id <dbl>
With this data, can you tell if those patients with no visits on the schedule have been assigned to a doctor? Why or why not?
patients_no_visits_with_doctor <- patients_no_visits %>%
select(patient_id, doctor_id, follow_up)
print(patients_no_visits_with_doctor)
## # A tibble: 7 × 3
## patient_id doctor_id follow_up
## <dbl> <dbl> <chr>
## 1 1013 NA <NA>
## 2 1015 NA <NA>
## 3 1017 NA <NA>
## 4 1023 NA <NA>
## 5 1033 NA <NA>
## 6 1038 NA <NA>
## 7 1042 NA <NA>
Patients with no visits on the schedule have not been assigned to any doctor. This could be so because they have no follow up assigned to them as well.
Assume those patients need primary care and haven’t been assigned a doctor yet. Which primary care doctors have the least amount of visits? Rank them from least to most visits.
primary_care_doctors <- doctors %>%
filter(speciality == "primary care")
doctor_visit_counts <- visits %>%
group_by(doctor_id)%>%
summarise(visit_count = n())%>%
left_join(primary_care_doctors, by = "doctor_id")%>%
arrange(visit_count)%>%
select(doctor_id, doctor, visit_count)%>%
na.omit()
print(doctor_visit_counts)
## # A tibble: 11 × 3
## doctor_id doctor visit_count
## <dbl> <chr> <int>
## 1 5015 Aron Randolph 1
## 2 5021 Jemima Velazquez 2
## 3 5000 Daanyaal Griffin 3
## 4 5006 Merlin Jacobs 3
## 5 5010 Irfan Mcghee 3
## 6 5013 Abdul Bostock 3
## 7 5025 Amritpal Goodman 3
## 8 5024 Bea Frame 4
## 9 5020 Ariah Atherton 5
## 10 5002 Rabia Browning 7
## 11 5014 Huzaifa Chung 7
Recall in Problem Set 5, Part 2, we were working with data from New York City that tested children under 6 years old for elevated blood lead levels (BLL). [You can read more about the data on their website]).
About the data:
All NYC children are required to be tested for lead poisoning at around age 1 and age 2, and to be screened for risk of lead poisoning, and tested if at risk, up until age 6. These data are an indicator of children younger that 6 years of age tested in NYC in a given year with blood lead levels (BLL) of 5 mcg/dL or greater. In 2012, CDC established that a blood lead level of 5 mcg/dL is the reference level for exposure to lead in children. This level is used to identify children who have blood lead levels higher than most children’s levels. The reference level is determined by measuring the NHANES blood lead distribution in US children ages 1 to 5 years, and is reviewed every 4 years.
Load in a cleaned-up version of the blood lead levels data:
bll_nyc_per_1000 <- read_csv("data/bll_nyc_per_1000.csv")
## Rows: 20 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): borough_id
## dbl (2): time_period, bll_5plus_1k
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Create a formattable table (example below) that shows the elevated blood lead levels per 1000 tested across 2013-2016. If the BLL increases from the previous year, turn the text red. If the BLL decreases from the previous year, turn the text green. To accomplish this color changing, you may want to create three indicator variables that check the value between years (e.g. use if_else). If you’ve have used conditional formatting on excel/google sheets, the concept is the same, but with R.
Note: If you are using if_else (hint hint) and checking by the year, you will likely need to use the left quote, actute, backtip, to reference the variable.
We have also provided you a function that you can use within your formattable table to reference this indicator variable to help reduce the code. However, you do not have to use this, and feel free to change the hex colors.
#knitr::include_graphics('data/table_1.png')
library(formattable)
bll_nyc_per_1000_years_only <- bll_nyc_per_1000 %>%
rename(
`Borough ID` = `borough_id`,
`Year` = `time_period`,
`BLL>5` = `bll_5plus_1k`
)
bll_nyc_per_1000_years_only_wide <- bll_nyc_per_1000_years_only %>%
spread(key = Year, value = `BLL>5`)
bll_nyc_per_1000_years_only_wide <- bll_nyc_per_1000_years_only_wide %>%
mutate(
increase_2014 = ifelse(`2014` > `2013`, 1, 0),
increase_2015 = ifelse(`2015` > `2014`, 1, 0),
increase_2016 = ifelse(`2016` > `2015`, 1, 0)
)
bll_nyc_per_1000_years_only_wide <- bll_nyc_per_1000_years_only_wide %>%
select(-increase_2014, -increase_2015, -increase_2016)
formattable(
bll_nyc_per_1000_years_only_wide,
align = c("l", "c", "c", "c", "c"),
list(
`2013` = color_tile("white", "lightgreen"),
`2014` = formatter("span",
style = ~ ifelse(`2014` > `2013`, "color: red;", "color: green;"),
x ~ as.character(x)
),
`2015` = formatter("span",
style = ~ ifelse(`2015` > `2014`, "color: red;", "color: green;"),
x ~ as.character(x)
),
`2016` = formatter("span",
style = ~ ifelse(`2016` > `2015`, "color: red;", "color: green;"),
x ~ as.character(x)
)
)
)
| Borough ID | 2013 | 2014 | 2015 | 2016 |
|---|---|---|---|---|
| Bronx | 20.1 | 18.7 | 15.7 | 15 |
| Brooklyn | 30.2 | 26.8 | 22.6 | 22.3 |
| Manhattan | 15.2 | 14.1 | 10.6 | 8.1 |
| Queens | 18.2 | 18.5 | 15.4 | 14.3 |
| Staten Island | 17.6 | 17.1 | 12 | 14.8 |
library(formattable)
# Detach plotly if it's loaded to avoid style conflicts
if ("package:plotly" %in% .packages()) {
detach("package:plotly", unload = TRUE)
}
Starting with the data frame bll_nyc_per_1000 create a
table with the DT library showing elevated blood lead levels per 1000
tested in 2013-2016 by borough. Below is an example of the table to
replicate.
library(DT)
# Rename the columns
bll_nyc_per_1000 <- bll_nyc_per_1000 %>%
rename(
`Borough ID` = borough_id, # Rename borough_id to Borough ID
Year = time_period, # Rename time_period to Year
`BLL > 5` = bll_5plus_1k # Rename bll_5plus_1k to BLL > 5
)
# Create the interactive DT table without row numbers
datatable(
bll_nyc_per_1000,
rownames = FALSE, # Disable row numbers
options = list(
pageLength = 10, # Show 10 rows per page
searching = TRUE, # Enable search
ordering = TRUE # Enable column sorting
),
caption = "New York City: Elevated Blood Lead Levels 2013-2016 by Borough"
)
For this question, we will use suicide rates data that comes from the CDC.
Replicate the graph below using plotly.
# issues with formattable and plotly together since the "style" option overlap
# https://stackoverflow.com/questions/39319427/using-formattable-and-plotly-simultaneously
#detach("package:formattable", unload=TRUE)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:formattable':
##
## style
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
df_suicide <- read_csv("data/Suicide Mortality by State.csv")
## Rows: 300 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): STATE, NAME, URL
## dbl (2): YEAR, RATE
## num (1): DEATHS
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Filter to include only the specified states and summarize the data
df_filtered <- df_suicide %>%
filter(STATE %in% c("AZ", "CA", "FL", "HI", "MI", "NY", "WY")) %>%
group_by(STATE, YEAR) %>%
summarize(RATE = mean(RATE, na.rm = TRUE), .groups = 'drop')
# Create the plot
plot <- df_filtered %>%
plot_ly(
x = ~YEAR, y = ~RATE, color = ~STATE, type = 'scatter', mode = 'lines') %>%
layout(
title = "Suicide Rates by State Over Time",
xaxis = list(title = ""),
yaxis = list(title = "Suicide Rate per 100,000"),
legend = list(title = list(text = "State"))
)
plot
## Warning: `arrange_()` was deprecated in dplyr 0.7.0.
## ℹ Please use `arrange()` instead.
## ℹ See vignette('programming') for more help
## ℹ The deprecated feature was likely used in the plotly package.
## Please report the issue at <https://github.com/ropensci/plotly/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Create an interactive (choropleth) map with plotly similar to the one presented on the CDC website. On the CDC map you can hover over each state and see the state name in bold, the death rate, and deaths. We can do much of this with plotly. As a challenge, use only the 2018 data to create an interactive US map colored by the suicide rates of that state. When you hover over the state, you should see the state name in bold, the death rate, and the number of deaths.
Some key search terms that may help:
Below is an image of an example final map when you hover over California.
Here is the shell of the map to get you started. Copy the plotly code into the chunk below this one and customize it.
# data pulled from CDC website described above
#df_suicide <- read_csv("data/Suicide Mortality by State.csv") %>%
#filter(YEAR == 2018)
#plot_ly(df_suicide,
# type="choropleth",
# locationmode = "USA-states") %>%
#layout(geo=list(scope="usa"))
# Load and filter the dataset for the year 2018
df_suicide <- read_csv("data/Suicide Mortality by State.csv") %>%
filter(YEAR == 2018)
## Rows: 300 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): STATE, NAME, URL
## dbl (2): YEAR, RATE
## num (1): DEATHS
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Plot the choropleth map
plot_ly(df_suicide,
type = "choropleth",
locationmode = "USA-states",
locations = df_suicide$STATE,
z = df_suicide$RATE,
text = paste("State: ", df_suicide$STATE, "<br>Rate: ", df_suicide$RATE), color = df_suicide$RATE,
colorscale = "Viridis",
colorbar = list(title = "Suicide Rate"),
showlegend = TRUE) %>%
layout(title = "Suicide Mortality by State in 2018",
geo = list(scope = "usa"))