Replace “lastname_firstname” in the file name with your own last and first name before beginning the project (e.g., Doe_Jane_PHAR305_finalproject.Rmd). The same naming convention will automatically apply when you knit this file to HTML.
This project will assess your understanding of how to formulate and test a scientific research question, conduct and interpret a regression analysis, and contextualize your research findings. You will apply these concepts to a dataset of your choice.
Total Points: 60
Deadline to submit knit HTML and RMD files on Canvas: April 7, 2026 at 11:59 PM
| Section | Topic | Points |
|---|---|---|
| Part 0 | Dataset Identified | 1 |
| Part 1 | Research Question & Hypotheses | 11 |
| Part 2 | Data Exploration | 13 |
| Part 3 | Regression Analysis | 15 |
| Part 4 | Communicating Results | 6 |
| Part 5 | Limitations & Interpretation | 6 |
| Overall | RMD Rendered Successfully | 3 |
| Overall | Code readability | 5 |
| Total | | 60 |
Provide the name of your chosen dataset
Your answer: Heart Disease (Cleveland UCI) Dataset
Load your chosen dataset.
## # A tibble: 6 × 14
## age sex cp trestbps chol fbs restecg thalach exang oldpeak slope
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 69 1 0 160 234 1 2 131 0 0.1 1
## 2 69 0 0 140 239 0 0 151 0 1.8 0
## 3 66 0 0 150 226 0 0 114 0 2.6 2
## 4 65 1 0 138 282 1 2 174 0 1.4 1
## 5 64 1 0 110 211 0 2 144 1 1.8 1
## 6 64 1 0 170 227 0 2 155 0 0.6 1
## # ℹ 3 more variables: ca <dbl>, thal <dbl>, condition <dbl>
library(tidyverse)  # provides read_csv(), rename(), and ggplot()

# Load the dataset and preview the first rows
data <- read_csv("Heart_Disease_Cleveland_UCI.csv")
head(data)

# Give the key variables descriptive names
data <- rename(data,
  heart_rate = "thalach",
  cholesterol_Level = "chol",
  blood_pressure = "trestbps"
)

# Convert sex to a factor and set Female as the reference group
data$sex <- factor(data$sex, levels = c(0, 1), labels = c("Female", "Male"))
data$sex <- relevel(data$sex, ref = "Female")
Based on your chosen dataset, formulate a research question using the PICO framework (5 points):
Population
Your answer: Adults included in the Cleveland Heart Disease dataset
Intervention
Your answer: Higher serum cholesterol level (mg/dL)
Comparator
Your answer: Lower serum cholesterol levels.
Outcome
Your answer: Maximum heart rate achieved
Your Research Question (1 sentence)
Your answer: Among adults in the Cleveland Heart Disease UCI dataset, is a higher serum cholesterol level associated with a lower or higher maximum heart rate achieved, compared with a lower cholesterol level?
Now, identify the variables in your dataset that will allow you to test this research question (3 points):
Outcome (Y) What is your dependent variable?
Your answer: Maximum heart rate achieved
Exposure (X) What is your main predictor of interest?
Your answer: Serum cholesterol level, measured in mg/dL.
List 2-3 potential confounders for the X-Y relationship
Your answer: Age, Sex, resting blood pressure
Confounder justification
Your answer:

Age: Age is a well‑established risk factor for cardiovascular disease and is commonly included in risk prediction models because older adults have a substantially higher likelihood of developing heart disease than younger adults. This is partly due to the cumulative effect of risk factors like cholesterol and blood pressure over time, as well as age‑related changes in vascular function that increase cardiovascular risk. Age also influences cholesterol levels and lipid metabolism across the life course, meaning that without adjustment, age could bias the estimated relationship between cholesterol and heart disease. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3297980/

Sex: Biological sex is associated with differences in cardiovascular disease risk and lipid metabolism, with men having higher risk at earlier ages and women experiencing shifts in lipid profiles and risk after menopause due to hormonal changes. These differences in cholesterol regulation and heart disease susceptibility by sex mean that sex is related to both the exposure (cholesterol level) and the outcome (heart disease), making it a confounder that should be adjusted for in regression analysis. https://pubmed.ncbi.nlm.nih.gov/36881927/

Blood Pressure: Blood pressure is closely linked to both cholesterol levels and heart disease risk, as elevated blood pressure interacts with dyslipidemia to increase long‑term coronary heart disease risk. Individuals with high blood pressure often also have abnormal lipid profiles, and both risk factors independently contribute to cardiovascular disease outcomes. Because blood pressure is associated with the exposure and the outcome, failing to adjust for it can lead to biased effect estimates in regression analysis. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6764095/
Provide at least one informative summary statistic for each of your key variables (outcome, exposure, and each of your confounders identified in Section 1.2).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 71.0 133.0 153.0 149.6 166.0 202.0
## [1] 22.94156
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 126.0 211.0 243.0 247.4 276.0 564.0
## [1] 51.99758
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 29.00 48.00 56.00 54.54 61.00 77.00
## [1] 9.049736
##
## Female Male
## 96 201
##
## Female Male
## 0.3232323 0.6767677
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 94.0 120.0 130.0 131.7 140.0 200.0
## [1] 17.76281
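The output above shows five-number summaries, standard deviations, and a frequency table, but the calls that produced it are not shown in the rendered report. As a minimal sketch of how summaries of this kind are typically produced in base R, the example below uses a small made-up data frame (`toy`, with invented values) rather than the actual dataset:

```r
# Hypothetical toy data; the values are invented for illustration only
toy <- data.frame(
  heart_rate = c(150, 162, 131, 174, 144, 155),
  sex = factor(c("Male", "Female", "Female", "Male", "Male", "Male"),
               levels = c("Female", "Male"))
)

# Continuous variable: min, quartiles, median, mean, max, then the standard deviation
hr_summary <- summary(toy$heart_rate)
hr_sd <- sd(toy$heart_rate)

# Categorical variable: counts per level and the matching proportions
sex_counts <- table(toy$sex)
sex_props <- prop.table(sex_counts)

print(hr_summary)
print(hr_sd)
print(sex_counts)
print(sex_props)
```

Reporting the standard deviation next to `summary()` is useful because `summary()` alone does not describe spread beyond the quartiles.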
The histogram of heart rate shows that most participants have values clustered between approximately 130 and 180 beats per minute, indicating a roughly normal distribution with a slight left skew (the mean, 149.6 bpm, sits below the median, 153 bpm, and the lower tail extends down to 71 bpm). Few participants have heart rates below 100 bpm, and the maximum is 202 bpm, so extreme values are rare in this dataset. Overall, the distribution appears reasonably symmetric around the central peak near 160 bpm. This pattern supports using linear regression for modeling heart rate as a continuous outcome.
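The direction of skew can also be checked numerically rather than judged from the histogram alone. The sketch below is one way to do that, assuming a moment-based skewness estimate; `sample_skewness()` is a helper written for this illustration (it is not part of the original analysis), and the two example vectors are invented.

```r
# Moment-based sample skewness, written with base R only
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / (mean(centered^2)^1.5)
}

# Summary values reported for maximum heart rate in the output above
hr_mean <- 149.6
hr_median <- 153.0
# A mean below the median points to a longer lower (left) tail
mean_below_median <- hr_mean < hr_median

# Invented example vectors: one with a long left tail, one symmetric
left_tailed <- c(71, 120, 140, 150, 155, 160, 165, 170)
symmetric <- c(1, 2, 3, 4, 5)

sample_skewness(left_tailed)  # negative for a left-skewed sample
sample_skewness(symmetric)    # zero for a symmetric sample
```

A negative value indicates a longer left tail and a positive value a longer right tail. For linear regression, the more relevant check is usually the distribution of the model residuals rather than of the raw outcome.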
# Histogram of the outcome variable: heart rate
# Using ggplot() from the ggplot2 package (part of tidyverse)
ggplot(data, aes(x = heart_rate)) +
  geom_histogram(bins = 10, fill = "steelblue",
                 alpha = 0.7, color = "white")
## 2.3 Exposure-Outcome Relationship (3 points)
# Scatterplot of the exposure (cholesterol level) against the outcome (heart rate)
ggplot(data, aes(x = cholesterol_Level, y = heart_rate)) +
  geom_point(alpha = 0.5, color = "steelblue") +  # draw individual data points
  geom_smooth(method = "lm", color = "red") +     # add a linear trend line with 95% CI shading
  labs(
    title = "Cholesterol Level vs. Heart Rate",
    x = "Cholesterol Level (mg/dL)",
    y = "Heart Rate (bpm)"
  ) +
  theme_minimal()
Interpretation
Describe the crude (unadjusted) relationship you observe between exposure and outcome in 3-4 sentences.
Your answer: The scatterplot of cholesterol level versus heart rate shows a very weak or negligible linear relationship, as the red regression line is almost flat across the range of cholesterol values. Most heart rate values are clustered between 120 and 180 bpm regardless of cholesterol level, suggesting little to no change in heart rate as cholesterol increases. The confidence band around the regression line widens slightly at extreme cholesterol values, reflecting greater uncertainty due to fewer observations in those ranges. Overall, the crude association between cholesterol level and heart rate appears minimal in this dataset.
Show evidence that at least one of your confounders is associated with BOTH the exposure and the outcome.
# Confounder vs. exposure: age and cholesterol level
ggplot(data, aes(x = age, y = cholesterol_Level)) +       # set up plot; map X and Y variables
  geom_point(alpha = 0.5, color = "darkorange") +         # draw individual data points
  geom_smooth(method = "lm", se = TRUE, color = "red") +  # add a linear trend line with 95% CI shading
  labs(                                                   # add labels for title and axes
    title = "Age vs. Cholesterol Level",
    x = "Age (years)",
    y = "Cholesterol Level (mg/dL)"
  ) +
  theme_minimal()                                         # apply a clean, minimal theme
# Confounder vs. outcome: age and heart rate
ggplot(data, aes(x = age, y = heart_rate)) +              # set up plot; map X and Y variables
  geom_point(alpha = 0.5, color = "purple") +             # draw individual data points
  geom_smooth(method = "lm", se = TRUE, color = "red") +  # add a linear trend line with 95% CI shading
  labs(                                                   # add labels for title and axes
    title = "Age vs. Heart Rate",
    x = "Age (years)",
    y = "Heart Rate (bpm)"
  ) +
  theme_minimal()                                         # apply a clean, minimal theme
Interpretation
Explain how this evidence supports that this variable could be a confounder in 3-5 sentences.
Your answer: The plots demonstrate that age is associated with both cholesterol level and heart rate. The first plot shows a slight positive trend between age and cholesterol level, indicating that cholesterol tends to increase with increasing age. The second plot shows a negative association between age and heart rate, where older individuals generally achieve lower maximum heart rates. Because age is related to both the exposure (cholesterol level) and the outcome (heart rate), it meets the definition of a potential confounder. Therefore, failing to adjust for age could distort the observed relationship between cholesterol level and heart rate.
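The visual evidence can be paired with a numeric check of each association. The sketch below uses `cor.test()` from base R on simulated data; the variables `age`, `cholesterol`, and `heart_rate` are generated here under assumed relationships and are not the Cleveland data, so the printed values only illustrate the pattern described above.

```r
set.seed(2026)
n <- 2000

# Simulated (not real) data: age raises cholesterol and lowers maximum heart rate
age <- rnorm(n, mean = 55, sd = 9)
cholesterol <- 190 + 1.0 * age + rnorm(n, sd = 50)
heart_rate <- 210 - 1.1 * age + rnorm(n, sd = 20)

# Pearson correlation tests: confounder vs. exposure, confounder vs. outcome
age_vs_exposure <- cor.test(age, cholesterol)
age_vs_outcome <- cor.test(age, heart_rate)

r_age_chol <- unname(age_vs_exposure$estimate)
r_age_hr <- unname(age_vs_outcome$estimate)

# Correlation coefficients and their p-values
round(c(age_cholesterol = r_age_chol, age_heart_rate = r_age_hr), 2)
c(age_cholesterol = age_vs_exposure$p.value, age_heart_rate = age_vs_outcome$p.value)
```

A variable that is correlated with both the exposure and the outcome, and is not on the causal pathway between them, is a candidate confounder. Applying the same two calls to the real columns would give the numeric counterpart to the two scatterplots.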
Based on your outcome variable, select the appropriate regression model:
a) What type of outcome variable do you have? (1 sentence)
Your answer: The outcome variable is continuous, as heart rate is measured on a numerical scale and can take a range of values.
b) Which regression model will you use and why? (2-3 sentences)
Your answer: I will use a linear regression model because the outcome variable, heart rate, is continuous. This model allows me to examine the relationship between the predictor variable, cholesterol level, and heart rate while adjusting for potential confounders such as age, blood pressure, and sex. Including these confounders in the model helps estimate the association between cholesterol level and heart rate more accurately.
Fit a simple regression model with only your exposure variable as the predictor. Use the appropriate model type based on your answer above.
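# ---- UNADJUSTED LINEAR REGRESSION ----
# Exposure only: cholesterol_Level
model_simple_linear <- lm(heart_rate ~ cholesterol_Level, data = data)
# Using tidy() from the broom package
# tidy() converts model output into a clean data frame with one row per coefficient
tidy_simple_linear <- tidy(model_simple_linear, conf.int = TRUE)
# Round every numeric column to 3 decimal places for display
tidy_simple_linear[, sapply(tidy_simple_linear, is.numeric)] <-
  round(tidy_simple_linear[, sapply(tidy_simple_linear, is.numeric)], 3)
print(tidy_simple_linear)
## # A tibble: 2 × 7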
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 150. 6.49 23.0 0 137. 162.
## 2 cholesterol_Level 0 0.026 -0.001 0.999 -0.051 0.051
Interpretation of Exposure Effect
In the unadjusted linear regression model, the coefficient for cholesterol level is approximately 0.000 bpm per mg/dL, indicating that a 1 mg/dL increase in cholesterol level is associated with essentially no change in maximum heart rate on average. The estimated effect size is extremely small, suggesting no meaningful relationship between cholesterol level and heart rate when cholesterol is examined alone.
This association is not statistically significant. The 95% confidence interval (−0.051 to 0.051) includes zero, meaning the true effect could be negative, positive, or null. Additionally, the p-value (0.999) is much greater than 0.05, providing no statistical evidence of an association between cholesterol level and heart rate in the unadjusted model.
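The reported interval can be reproduced by hand from the estimate and its standard error. The sketch below uses the rounded values from the unadjusted model output above and assumes that all 297 participants (96 female plus 201 male in the frequency table) entered the model, with no rows dropped for missing values.

```r
# Rounded values from the unadjusted model output above
estimate <- 0
std_error <- 0.026

# Assumption: all 297 participants were used; 2 parameters (intercept + slope)
n_obs <- 297
df_resid <- n_obs - 2

# 95% CI = estimate +/- t critical value * standard error
t_crit <- qt(0.975, df = df_resid)
ci_low <- estimate - t_crit * std_error
ci_high <- estimate + t_crit * std_error

round(c(ci_low, ci_high), 3)
```

With about 295 residual degrees of freedom the critical value is close to 1.97, so the interval is roughly the estimate plus or minus two standard errors, which agrees with the reported (−0.051, 0.051).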
Fit a multiple regression model adjusting for your confounders selected in Section 1.2. Use the same model type as above.
# ---- ADJUSTED LINEAR REGRESSION ----
# Add confounders: age, sex, blood_pressure
model_adj_linear <- lm(
  heart_rate ~ cholesterol_Level + age + sex + blood_pressure,
  data = data
)
# Using tidy() from the broom package
# tidy() converts model output into a clean data frame with one row per coefficient
tidy_adj_linear <- tidy(model_adj_linear, conf.int = TRUE)
# Round every numeric column to 3 decimal places for display
tidy_adj_linear[, sapply(tidy_adj_linear, is.numeric)] <-
  round(tidy_adj_linear[, sapply(tidy_adj_linear, is.numeric)], 3)
print(tidy_adj_linear)
## # A tibble: 5 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 195. 11.6 16.8 0 172. 218.
## 2 cholesterol_Level 0.028 0.024 1.13 0.258 -0.02 0.076
## 3 age -1.10 0.143 -7.67 0 -1.38 -0.816
## 4 sexMale -4.11 2.66 -1.54 0.124 -9.35 1.13
## 5 blood_pressure 0.081 0.072 1.13 0.26 -0.061 0.223
Interpretation of Exposure Effect after adjusting for confounders
Adjustment was performed to control for confounders and estimate the independent effect of cholesterol on maximum heart rate. After adjusting for age, sex, and blood pressure, the coefficient for cholesterol level was 0.028, indicating that each one-unit increase in cholesterol level (mg/dL) is associated with an estimated 0.028 bpm increase in heart rate, holding other variables constant. Although the direction of the association becomes slightly positive compared with the unadjusted model, the magnitude of the effect remains extremely small, suggesting little practical relationship between cholesterol level and heart rate. The association was not statistically significant, as the 95% confidence interval (−0.020 to 0.076) includes zero and the p-value (0.258) exceeds 0.05, indicating insufficient evidence of an independent association.
Among the covariates, age showed a strong and statistically significant negative association with heart rate (β = −1.099, p < 0.001), suggesting that heart rate tends to decrease with increasing age. Sex and blood pressure were not statistically significant predictors in the adjusted model. Overall, these results suggest that differences in heart rate are more strongly explained by age rather than cholesterol level in this dataset.
Create a numeric AND visual comparison of your unadjusted and adjusted estimates for the exposure variable.
# Extract the cholesterol_Level coefficient from the unadjusted and adjusted linear models
unadj_lin <- subset(tidy_simple_linear, term == "cholesterol_Level")
unadj_lin$Model <- "Linear – Unadjusted"
adj_lin <- subset(tidy_adj_linear, term == "cholesterol_Level")
adj_lin$Model <- "Linear – Adjusted"
# Stack the two rows into one comparison table
compare_linear <- rbind(unadj_lin, adj_lin)
print(compare_linear)
## # A tibble: 2 × 8
## term estimate std.error statistic p.value conf.low conf.high Model
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 cholesterol_Lev… 0 0.026 -0.001 0.999 -0.051 0.051 Line…
## 2 cholesterol_Lev… 0.028 0.024 1.13 0.258 -0.02 0.076 Line…
library(dplyr)
library(broom)
# The adjusted model (model_adj_linear) was already fit above, so it is not refit here.
# Keep only the columns needed to compare the exposure estimate across the two models
comparison_table <- compare_linear[, c("Model", "estimate", "conf.low", "conf.high", "p.value")]
print(comparison_table)
## # A tibble: 2 × 5
## Model estimate conf.low conf.high p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Linear – Unadjusted 0 -0.051 0.051 0.999
## 2 Linear – Adjusted 0.028 -0.02 0.076 0.258
# Label each row with its model family and adjustment status for plotting
comparison_table$Family <- ifelse(
  grepl("Linear", comparison_table$Model),
  "Linear (Coefficient)",
  NA
)
comparison_table$Type <- ifelse(
  grepl("Unadjusted", comparison_table$Model),
  "Unadjusted",
  "Adjusted"
)
# Forest plot: Unadjusted vs Adjusted estimates with 95% CI
# Using ggplot() from the ggplot2 package (part of tidyverse)
ggplot(comparison_table, aes(x = estimate, y = Type, colour = Type)) +  # set up plot; map estimate to x, model type to y
  geom_point(size = 3) +                                                # draw point estimates
  geom_errorbar(aes(xmin = conf.low, xmax = conf.high), width = 0.2) +  # draw horizontal 95% CI bars
  facet_wrap(~ Family, scales = "free_x") +                             # create separate panels for each regression type
  labs(                                                                 # add labels for title and axes
    title = "Forest Plot: Effect of Cholesterol Level on Maximum Heart Rate",
    subtitle = "Unadjusted vs Adjusted Estimates with 95% CI",
    x = "Estimate (bpm per mg/dL)",
    y = NULL,
    colour = "Model"
  ) +
  theme_minimal() +                                                     # apply a clean, minimal theme
  theme(legend.position = "bottom")                                     # move legend to the bottom
Analysis of Confounding
How did the inclusion of confounders change the coefficient for your exposure after adjustment? In your answer, describe the direction the coefficient changed (toward or away from the null effect), and whether your conclusion about statistical significance was affected (4-5 sentences).
Your answer: Comparing the unadjusted and adjusted models shows that the estimated association between cholesterol level and heart rate changes slightly after adjusting for confounders. In the unadjusted model, the effect estimate was approximately 0, indicating little to no observable relationship. After adjusting for age, sex, and blood pressure, the coefficient increased to about 0.028, meaning the estimate moved slightly away from the null value of 0 and became positive. However, the 95 percent confidence intervals for both models include 0, indicating that the association remains not statistically significant. Therefore, although adjusting for confounders slightly changed the estimated effect, it did not change the overall conclusion that cholesterol level is not significantly associated with heart rate in this dataset.
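A shift from roughly zero to a small positive coefficient is the pattern expected when a confounder masks a weak association. The simulation below is a sketch of that mechanism only: the data are generated under assumed effect sizes (a true cholesterol slope of 0.03 and an age slope of −1.1, chosen to resemble the estimates above) and are not the Cleveland data.

```r
set.seed(305)
n <- 5000

# Simulated data: age raises cholesterol and lowers heart rate,
# while cholesterol has a small positive effect of its own (0.03 bpm per mg/dL)
age <- rnorm(n, mean = 55, sd = 9)
chol <- 190 + 1.0 * age + rnorm(n, sd = 50)
hr <- 200 + 0.03 * chol - 1.1 * age + rnorm(n, sd = 20)
sim <- data.frame(age, chol, hr)

# Crude model omits age; adjusted model includes it
crude_fit <- lm(hr ~ chol, data = sim)
adj_fit <- lm(hr ~ chol + age, data = sim)

crude_slope <- unname(coef(crude_fit)["chol"])
adj_slope <- unname(coef(adj_fit)["chol"])

round(c(crude = crude_slope, adjusted = adj_slope), 3)
```

In this setup the crude slope is pulled toward zero because older participants have both higher cholesterol and lower heart rates. Adjusting for age removes that distortion, so the adjusted slope sits closer to the simulated true value.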
Create a professional results table for your adjusted regression model.
# Format the adjusted linear model results for a journal-style table
pub_table <- tidy_adj_linear
# Create readable variable names
pub_table$Variable <- pub_table$term
pub_table$Variable[pub_table$term == "(Intercept)"] <- "Intercept"
pub_table$Variable[pub_table$term == "cholesterol_Level"] <- "Cholesterol level (mg/dL)"
pub_table$Variable[pub_table$term == "age"] <- "Age (years)"
pub_table$Variable[pub_table$term == "sexMale"] <- "Sex (Male vs Female)"
pub_table$Variable[pub_table$term == "blood_pressure"] <- "Resting blood pressure (mm Hg)"
# Paste confidence interval into one column
pub_table$CI <- paste0("(", pub_table$conf.low, ", ", pub_table$conf.high, ")")
# Format p-value
pub_table$P_value <- ifelse(pub_table$p.value < 0.001, "< 0.001", round(pub_table$p.value, 3))
# Keep only the columns we need
pub_table <- pub_table[, c("Variable", "estimate", "CI", "P_value")]
colnames(pub_table) <- c("Variable", "Estimate", "95% CI", "P-value")
kable(pub_table)
| Variable | Estimate | 95% CI | P-value |
|---|---|---|---|
| Intercept | 194.734 | (171.858, 217.61) | < 0.001 |
| Cholesterol level (mg/dL) | 0.028 | (-0.02, 0.076) | 0.258 |
| Age (years) | -1.099 | (-1.381, -0.816) | < 0.001 |
| Sex (Male vs Female) | -4.109 | (-9.352, 1.133) | 0.124 |
| Resting blood pressure (mm Hg) | 0.081 | (-0.061, 0.223) | 0.26 |
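Some entries in the table above lose their trailing zeros (for example 0.26 and -0.02) because `round()` returns numbers and `paste0()` prints them at their shortest width. One possible way to keep a fixed number of decimals is `sprintf()`; the helpers `format_ci()` and `format_p()` below are hypothetical names written for this sketch, not part of the original analysis.

```r
# Fixed-decimal formatting keeps trailing zeros that round() drops
format_ci <- function(low, high, digits = 3) {
  fmt <- paste0("%.", digits, "f")
  paste0("(", sprintf(fmt, low), ", ", sprintf(fmt, high), ")")
}

# P-values below the display threshold are shown as "< 0.001"
format_p <- function(p, digits = 3) {
  fmt <- paste0("%.", digits, "f")
  threshold <- 10^(-digits)
  ifelse(p < threshold, paste0("< ", sprintf(fmt, threshold)), sprintf(fmt, p))
}

format_ci(-0.02, 0.076)    # "(-0.020, 0.076)"
format_p(c(0.26, 0.0004))  # "0.260" "< 0.001"
```

Both helpers return character vectors, so they would be applied only in the final display step, after all numeric work is done.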
Write a 5-7 sentence results paragraph suitable for a research paper. Your statement should include:
Results
Your answer: This study examined whether higher serum cholesterol levels were associated with maximum heart rate achieved among adults in the Cleveland UCI Heart Disease dataset. In the adjusted multiple linear regression model controlling for age, sex, and blood pressure, cholesterol level was associated with a 0.028 beats-per-minute (bpm) increase in heart rate for each 1 mg/dL increase in cholesterol. The 95% confidence interval ranged from −0.020 to 0.076 bpm, indicating substantial uncertainty around the estimate. This association was not statistically significant (p = 0.258), suggesting insufficient evidence of an independent relationship between cholesterol level and heart rate. Compared with the unadjusted model, where the estimated effect was essentially zero, adjustment for confounders slightly shifted the association in a positive direction but did not meaningfully change its magnitude. Among the covariates, age demonstrated a strong negative association with heart rate, indicating that confounding by age influenced the crude relationship. Overall, after accounting for age, sex, and blood pressure, cholesterol level did not appear to be an important predictor of heart rate in this population.
Discuss THREE limitations of your dataset or analysis. These could include unmeasured confounding variables, measurement bias in data collection, selection bias in study design, lack of external validity, limitations in the statistical analysis, etc.
Limitation 1 Unmeasured confounding
Describe and explain the potential impact on results in 2-3 sentences.
Your answer: Although this analysis adjusted for age, sex, and blood pressure, other important cardiovascular risk factors such as socioeconomic status, smoking status, physical activity, diet, medication use, and body mass index were not included. These variables may influence both cholesterol levels and heart rate, meaning residual confounding could still bias the estimated association between exposure and outcome.
Limitation 2 Measurement limitations
Describe and explain the potential impact on results in 2-3 sentences.
Your answer: The Cleveland UCI dataset relies on single clinical measurements rather than repeated assessments over time. Because heart rate and cholesterol levels can fluctuate due to stress, treatment, or temporary health conditions, these measurements may not fully represent participants’ usual physiological status, introducing measurement variability.
Limitation 3 Selection bias and limited generalizability
Describe and explain the potential impact on results in 2-3 sentences.
Your answer: Participants in this dataset were patients undergoing cardiac evaluation rather than a random sample of the general population. This may overrepresent individuals already at higher cardiovascular risk, limiting the ability to generalize the findings to healthier or community-based populations.
## 5.2 Interpretation (3 points)
What is your conclusion on whether your exposure causes the outcome? Provide an informative answer based on your response to Sections 4.2 and 5.1; do not simply state that correlation does not equal causation. Incorporate your assessment of the strengths and weaknesses of your dataset and/or analysis. Answer in 4-5 sentences.
Interpretation
Your answer: Based on the adjusted regression analysis, higher serum cholesterol levels were not meaningfully associated with heart rate in this dataset, as the estimated effect was small and not statistically significant after controlling for age, sex, and blood pressure. The lack of association suggests that cholesterol level is unlikely to be a strong determinant of heart rate among participants in this sample. However, causal conclusions cannot be confidently drawn because the dataset is observational and may be affected by unmeasured confounding factors such as lifestyle behaviors, socioeconomic status, and medication use. Additionally, measurement limitations and the clinical nature of the sample reduce the ability to generalize findings to broader populations. Overall, the evidence from this analysis does not support a causal relationship between cholesterol level and heart rate, though further studies using more comprehensive data and longitudinal designs would be needed to better assess causality.
Before submitting, verify that you have the following, successfully rendered in HTML format:
## R version 4.4.1 (2024-06-14 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
##
## Matrix products: default
##
##
## locale:
## [1] LC_COLLATE=English_Canada.utf8 LC_CTYPE=English_Canada.utf8
## [3] LC_MONETARY=English_Canada.utf8 LC_NUMERIC=C
## [5] LC_TIME=English_Canada.utf8
##
## time zone: America/Vancouver
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] knitr_1.51 broom_1.0.12 lubridate_1.9.5 forcats_1.0.1
## [5] stringr_1.6.0 dplyr_1.2.0 purrr_1.2.1 readr_2.2.0
## [9] tidyr_1.3.2 tibble_3.3.1 ggplot2_4.0.2 tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] utf8_1.2.6 sass_0.4.10 generics_0.1.4 lattice_0.22-6
## [5] stringi_1.8.7 hms_1.1.4 digest_0.6.39 magrittr_2.0.4
## [9] evaluate_1.0.5 grid_4.4.1 timechange_0.4.0 RColorBrewer_1.1-3
## [13] fastmap_1.2.0 Matrix_1.7-0 jsonlite_2.0.0 backports_1.5.0
## [17] mgcv_1.9-1 scales_1.4.0 jquerylib_0.1.4 cli_3.6.5
## [21] rlang_1.1.7 crayon_1.5.3 bit64_4.6.0-1 splines_4.4.1
## [25] withr_3.0.2 cachem_1.1.0 yaml_2.3.12 tools_4.4.1
## [29] parallel_4.4.1 tzdb_0.5.0 vctrs_0.7.2 R6_2.6.1
## [33] lifecycle_1.0.5 bit_4.6.0 vroom_1.7.0 pkgconfig_2.0.3
## [37] pillar_1.11.1 bslib_0.10.0 gtable_0.3.6 glue_1.8.0
## [41] xfun_0.57 tidyselect_1.2.1 rstudioapi_0.18.0 farver_2.1.2
## [45] nlme_3.1-164 htmltools_0.5.9 rmarkdown_2.30 labeling_0.4.3
## [49] compiler_4.4.1 S7_0.2.1