Instructions

Rename this file

Replace “lastname_firstname” in the file name with your own last and first name before beginning the project (e.g., Doe_Jane_PHAR305_finalproject.Rmd). The same naming convention will automatically apply when you knit this file to HTML.

Overview

This project will assess your understanding of how to formulate and test a scientific research question, conduct and interpret a regression analysis, and contextualize your research findings. You will apply these concepts to a dataset of your choice.

Total Points: 60

Deadline to submit knit HTML and RMD files on Canvas: April 7, 2026 at 11:59 PM

Requirements

Choose ONE dataset from the options provided on Canvas
Complete ALL sections of this analysis template
Show all code and provide written interpretations
Knit your final document to HTML before submitting

Grading Criteria

Section	Topic	Points
Part 0	Dataset Identified	1
Part 1	Research Question & Hypotheses	11
Part 2	Data Exploration	13
Part 3	Regression Analysis	15
Part 4	Communicating Results	6
Part 5	Limitations & Interpretation	6
Overall	RMD Rendered Successfully	3
Overall	Code readability	5
Total		60

Your Dataset Selection (/1 point)

Provide the name of your chosen dataset

Your answer: Heart Disease (Cleveland UCI) Dataset

Part 1: Research Question & Hypotheses (/11 points)

1.1 Load Your Dataset (1 point)

Load your chosen dataset.

data <- read_csv("Heart_Disease_Cleveland_UCI.csv")
head(data)

## # A tibble: 6 × 14
##     age   sex    cp trestbps  chol   fbs restecg thalach exang oldpeak slope
##   <dbl> <dbl> <dbl>    <dbl> <dbl> <dbl>   <dbl>   <dbl> <dbl>   <dbl> <dbl>
## 1    69     1     0      160   234     1       2     131     0     0.1     1
## 2    69     0     0      140   239     0       0     151     0     1.8     0
## 3    66     0     0      150   226     0       0     114     0     2.6     2
## 4    65     1     0      138   282     1       2     174     0     1.4     1
## 5    64     1     0      110   211     0       2     144     1     1.8     1
## 6    64     1     0      170   227     0       2     155     0     0.6     1
## # ℹ 3 more variables: ca <dbl>, thal <dbl>, condition <dbl>

# Rename the key variables to descriptive names (new_name = old_name)
data <- rename(data,
  heart_rate        = thalach,
  cholesterol_Level = chol,
  blood_pressure    = trestbps
)
# Convert sex to a factor and set Female as the reference group
data$sex <- factor(data$sex, levels = c(0,1), labels = c("Female","Male"))
data$sex <- relevel(data$sex, ref = "Female")

1.2 Define Your Research Question (8 points)

Based on your chosen dataset, formulate a research question using the PICO framework (5 points):

Population

Your answer: Adults included in the Cleveland Heart Disease dataset

Intervention

Your answer: Higher serum cholesterol level (mg/dL)

Comparator

Your answer: Lower serum cholesterol levels.

Outcome

Your answer: Maximum heart rate achieved

Your Research Question (1 sentence)

Your answer: Among adults in the Cleveland Heart Disease UCI dataset, is a higher serum cholesterol level associated with a lower or higher maximum heart rate achieved, compared with a lower cholesterol level?

Now, identify the variables in your dataset that will allow you to test this research question (3 points):

Outcome (Y) What is your dependent variable?

Your answer: Maximum heart rate achieved

Exposure (X) What is your main predictor of interest?

Your answer: Serum cholesterol level, measured in mg/dL.

List 2-3 potential confounders for the X-Y relationship

Your answer: Age, sex, and resting blood pressure

1.3 Justify Your Confounders (2 points)

Confounder justification

Your answer:

Age: Age is a well-established risk factor for cardiovascular disease and is commonly included in risk prediction models because older adults have a substantially higher likelihood of developing heart disease than younger adults. This is partly due to the cumulative effect of risk factors like cholesterol and blood pressure over time, as well as age-related changes in vascular function that increase cardiovascular risk. Age also influences cholesterol levels and lipid metabolism across the life course, meaning that without adjustment, age could bias the estimated relationship between cholesterol and heart disease. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3297980/

Sex: Biological sex is associated with differences in cardiovascular disease risk and lipid metabolism, with men having higher risk at earlier ages and women experiencing shifts in lipid profiles and risk after menopause due to hormonal changes. These differences in cholesterol regulation and heart disease susceptibility by sex mean that sex is related to both the exposure (cholesterol level) and the outcome (heart disease), making it a confounder that should be adjusted for in regression analysis. https://pubmed.ncbi.nlm.nih.gov/36881927/

Blood pressure: Blood pressure is closely linked to both cholesterol levels and heart disease risk, as elevated blood pressure interacts with dyslipidemia to increase long-term coronary heart disease risk. Individuals with high blood pressure often also have abnormal lipid profiles, and both risk factors independently contribute to cardiovascular disease outcomes. Because blood pressure is associated with the exposure and the outcome, failing to adjust for it can lead to biased effect estimates in regression analysis. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6764095/

Part 2: Data Exploration (/13 points)

2.1 Summary Statistics (4 points)

Provide at least one informative summary statistic for each of your key variables (outcome, exposure, and each of your confounders identified in Section 1.2).

# Summary statistics for the outcome variable (Heart Rate) 
summary(data$heart_rate)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    71.0   133.0   153.0   149.6   166.0   202.0

sd(data$heart_rate)

## [1] 22.94156

# Summary statistics for the exposure variable (Cholesterol Level)
summary(data$cholesterol_Level)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   126.0   211.0   243.0   247.4   276.0   564.0

sd(data$cholesterol_Level)

## [1] 51.99758

# Confounder 1: Age (continuous)
summary(data$age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   29.00   48.00   56.00   54.54   61.00   77.00

sd(data$age)

## [1] 9.049736

# Confounder 2: sex (categorical)
table(data$sex)

## 
## Female   Male 
##     96    201

prop.table(table(data$sex))

## 
##    Female      Male 
## 0.3232323 0.6767677

# Confounder 3: blood_pressure (continuous)
summary(data$blood_pressure)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    94.0   120.0   130.0   131.7   140.0   200.0

sd(data$blood_pressure)

## [1] 17.76281

2.2 Distribution of Outcome Variable (3 points)

The histogram of maximum heart rate shows that most participants have values clustered between approximately 130 and 180 beats per minute. The distribution is roughly bell-shaped with a slight left skew: the mean (149.6 bpm) sits below the median (153 bpm), and the lower tail extends down to 71 bpm. Very few participants have heart rates below 100 or above 200 bpm, so extreme values are rare in this dataset. Overall, the distribution appears reasonably symmetric around the central peak near 160 bpm, which supports using linear regression to model heart rate as a continuous outcome.

# Histogram of the outcome variable: heart rate
# Using ggplot() from the ggplot2 package (part of tidyverse)

ggplot(data, aes(x = heart_rate)) +
  geom_histogram(bins = 10, fill = "steelblue",
                 alpha = 0.7, color = "white") +
  labs(x = "Maximum heart rate (bpm)", y = "Number of participants") +
  theme_minimal()
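
The direction of skew read off a histogram can be cross-checked numerically. The following is a minimal sketch, assuming a moment-based definition of sample skewness; the helper `skewness()` and both example vectors are hypothetical and are not part of the dataset or of any loaded package. In this report the same helper could be applied to `data$heart_rate`.

```r
# Moment-based sample skewness (hypothetical helper, base R only).
# Negative values suggest a longer left tail; positive values a longer right tail.
skewness <- function(x) {
  x <- x[!is.na(x)]                        # drop missing values
  centered <- x - mean(x)                  # deviations from the mean
  mean(centered^3) / mean(centered^2)^1.5  # third moment scaled by variance^(3/2)
}

# Illustrative vectors only (not values from the Cleveland data)
symmetric_values   <- c(1, 2, 3, 4, 5)
left_tailed_values <- c(1, 8, 9, 10)

skewness(symmetric_values)    # zero for a perfectly symmetric vector
skewness(left_tailed_values)  # negative: one low value stretches the left tail
```

A value near zero would be consistent with the approximate symmetry that linear regression on this outcome relies on.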

2.3 Exposure-Outcome Relationship (3 points)

# Scatterplot of exposure vs. outcome with a fitted linear trend line
ggplot(data, aes(x = cholesterol_Level, y = heart_rate)) +
  geom_point(alpha = 0.5, color = "steelblue") +  # draw individual data points
  geom_smooth(method = "lm", color = "red") +     # add a linear trend line with 95% CI shading
  labs(
    title = "Cholesterol Level vs. Maximum Heart Rate",
    x = "Cholesterol Level (mg/dL)",
    y = "Maximum Heart Rate (bpm)"
  ) +
  theme_minimal()

Interpretation

Describe the crude (unadjusted) relationship you observe between exposure and outcome in 3-4 sentences.

Your answer: The scatterplot of cholesterol levels versus heart rate shows a very weak or negligible linear relationship, as the red regression line is almost flat across the range of cholesterol values. Most heart rate values are clustered between 120 and 180 bpm regardless of cholesterol levels, suggesting little to no change in heart rate as cholesterol increases. The confidence interval around the regression line widens slightly at extreme cholesterol values, reflecting greater uncertainty due to fewer observations in those ranges. Overall, the crude association between cholesterol level and heart rate appears minimal in this dataset.

2.4 Exploring Potential Confounding (3 points)

Show evidence that at least one of your confounders is associated with BOTH the exposure and the outcome.

ggplot(data, aes(x = age, y = cholesterol_Level)) +           # set up plot; map X and Y variables
  geom_point(alpha = 0.5, color = "darkorange") +              # draw individual data points
  geom_smooth(method = "lm", se = TRUE, color = "red") +      # add a linear trend line with 95% CI shading
  labs(                                                        # add labels for title and axes
    title = "Age vs. Cholesterol Level",
    x = "Age (years)",
    y = "Cholesterol Level (mg/dL)"
  ) +
  theme_minimal()                                              # apply a clean, minimal theme

ggplot(data, aes(x = age, y = heart_rate)) +                  # set up plot; map X and Y variables
  geom_point(alpha = 0.5, color = "purple") +                 # draw individual data points
  geom_smooth(method = "lm", se = TRUE, color = "red") +     # add a linear trend line with 95% CI shading
  labs(                                                       # add labels for title and axes
    title = "Age vs. Maximum Heart Rate",
    x = "Age (years)",
    y = "Maximum Heart Rate (bpm)"
  ) +
  theme_minimal()                                             # apply a clean, minimal theme

Interpretation

Explain how this evidence supports that this variable could be a confounder in 3-5 sentences.

Your answer: The plots demonstrate that age is associated with both cholesterol level and heart rate. The first plot shows a slight positive trend between age and cholesterol level, indicating that cholesterol tends to increase with increasing age. The second plot shows a negative association between age and heart rate, where older individuals generally achieve lower maximum heart rates. Because age is related to both the exposure (cholesterol level) and the outcome (heart rate), it meets the definition of a potential confounder. Therefore, failing to adjust for age could distort the observed relationship between cholesterol level and heart rate.
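
The mechanism described above can be illustrated with simulated data. The sketch below is not the Cleveland dataset: every variable, coefficient, and the seed are hypothetical choices made only for illustration. Age is built to raise cholesterol and lower heart rate, while cholesterol is given no direct effect on heart rate, so any crude association between the two comes from age alone.

```r
# Simulated illustration of confounding by age (all numbers are hypothetical)
set.seed(305)
n <- 5000

age         <- rnorm(n, mean = 55, sd = 9)
cholesterol <- 150 + 2 * age + rnorm(n, sd = 30)  # age raises cholesterol
heart_rate  <- 210 - 1 * age + rnorm(n, sd = 15)  # age lowers heart rate; no cholesterol effect

sim <- data.frame(age, cholesterol, heart_rate)

# Crude model: cholesterol absorbs part of the age effect
crude_slope <- coef(lm(heart_rate ~ cholesterol, data = sim))[["cholesterol"]]

# Adjusted model: holding age constant removes the spurious slope
adjusted_slope <- coef(lm(heart_rate ~ cholesterol + age, data = sim))[["cholesterol"]]

round(c(crude = crude_slope, adjusted = adjusted_slope), 3)
```

In this simulation the crude slope is clearly negative while the adjusted slope sits close to zero, which is the kind of distortion that adjusting for age is meant to remove.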

Part 3: Regression Analysis (/15 points)

3.1 Choose the Appropriate Regression Model (2 points)

Based on your outcome variable, select the appropriate regression model:

a) What type of outcome variable do you have? (1 sentence)

Your answer: The outcome variable is continuous, as heart rate is measured on a numerical scale and can take a range of values.

b) Which regression model will you use and why? (2-3 sentences)

Your answer: I will use a linear regression model because the outcome variable, heart rate, is continuous. This model allows me to examine the relationship between the predictor variable, cholesterol level, and heart rate while adjusting for potential confounders such as age, blood pressure, and sex. Including these confounders in the model helps estimate the association between cholesterol level and heart rate more accurately.

3.2 Unadjusted (Simple) Regression (4 points)

Fit a simple regression model with only your exposure variable as the predictor. Use the appropriate model type based on your answer above.

model_simple_linear <- lm(heart_rate ~ cholesterol_Level, data = data)

# Using tidy() from the broom package
# tidy() converts model output into a clean data frame with one row per coefficient

tidy_simple_linear <- tidy(model_simple_linear, conf.int = TRUE)
tidy_simple_linear[, sapply(tidy_simple_linear, is.numeric)] <-
  round(tidy_simple_linear[, sapply(tidy_simple_linear, is.numeric)], 3)

print(tidy_simple_linear)

## # A tibble: 2 × 7
##   term              estimate std.error statistic p.value conf.low conf.high
##   <chr>                <dbl>     <dbl>     <dbl>   <dbl>    <dbl>     <dbl>
## 1 (Intercept)           150.     6.49     23.0     0      137.      162.   
## 2 cholesterol_Level       0      0.026    -0.001   0.999   -0.051     0.051

Interpretation of Exposure Effect

In the unadjusted linear regression model, the coefficient for cholesterol level is approximately 0.000, indicating that a one-unit increase in cholesterol level is associated with essentially no change in heart rate on average. The estimated effect size is extremely small, suggesting no meaningful relationship between cholesterol level and heart rate when cholesterol is examined alone.

This association is not statistically significant. The 95% confidence interval (−0.051 to 0.051) includes zero, meaning the true effect could be negative, positive, or null. Additionally, the p-value (0.999) is much greater than 0.05, providing no statistical evidence of an association between cholesterol level and heart rate in the unadjusted model.
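
As a check on how the interval and p-value follow from the estimate and its standard error, here is a minimal sketch using only the rounded values printed above. It assumes all 297 participants (96 female plus 201 male in Section 2.1) entered the model with no rows dropped, giving 295 residual degrees of freedom; the variable names are hypothetical. Because the inputs are rounded to three decimals, the results can only approximate the `tidy()` output.

```r
# Rebuild the 95% CI and p-value from the rounded unadjusted output
estimate  <- 0.000    # rounded slope for cholesterol_Level
std_error <- 0.026    # rounded standard error
n_obs     <- 96 + 201 # assumption: every participant was used in the fit
df_resid  <- n_obs - 2  # intercept + one slope estimated

t_crit <- qt(0.975, df = df_resid)                 # two-sided 95% critical value
ci     <- estimate + c(-1, 1) * t_crit * std_error # estimate +/- margin of error

t_stat  <- estimate / std_error
p_value <- 2 * pt(abs(t_stat), df = df_resid, lower.tail = FALSE)

round(ci, 3)
p_value
```

The reconstructed interval agrees with the reported (−0.051, 0.051). The reconstructed p-value differs slightly from the reported 0.999 only because the slope was rounded to exactly zero here.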

3.3 Adjusted (Multiple) Regression (4 points)

Fit a multiple regression model adjusting for your confounders selected in Section 1.2. Use the same model type as above.

# ---- ADJUSTED LINEAR REGRESSION ----
# Add confounders: Age, Sex, blood_pressure
model_adj_linear <- lm(
  heart_rate~cholesterol_Level + age + sex + blood_pressure,
  data = data
)

# Using tidy() from the broom package
# tidy() converts model output into a clean data frame with one row per coefficient
tidy_adj_linear <- tidy(model_adj_linear, conf.int = TRUE)
tidy_adj_linear[, sapply(tidy_adj_linear, is.numeric)] <-
  round(tidy_adj_linear[, sapply(tidy_adj_linear, is.numeric)], 3)

print(tidy_adj_linear)

## # A tibble: 5 × 7
##   term              estimate std.error statistic p.value conf.low conf.high
##   <chr>                <dbl>     <dbl>     <dbl>   <dbl>    <dbl>     <dbl>
## 1 (Intercept)        195.       11.6       16.8    0      172.      218.   
## 2 cholesterol_Level    0.028     0.024      1.13   0.258   -0.02      0.076
## 3 age                 -1.10      0.143     -7.67   0       -1.38     -0.816
## 4 sexMale             -4.11      2.66      -1.54   0.124   -9.35      1.13 
## 5 blood_pressure       0.081     0.072      1.13   0.26    -0.061     0.223

Interpretation of Exposure Effect after adjusting for confounders

Adjustment was performed to control for confounders and estimate the independent effect of cholesterol on maximum heart rate. After adjusting for age, sex, and blood pressure, the coefficient for cholesterol level was 0.028, indicating that each one-unit increase in cholesterol level (mg/dL) is associated with an estimated 0.028 bpm increase in heart rate, holding other variables constant. Although the direction of the association becomes slightly positive compared with the unadjusted model, the magnitude of the effect remains extremely small, suggesting little practical relationship between cholesterol level and heart rate. The association was not statistically significant, as the 95% confidence interval (−0.020 to 0.076) includes zero and the p-value (0.258) exceeds 0.05, indicating insufficient evidence of an independent association.

Among the covariates, age showed a strong and statistically significant negative association with heart rate (β = −1.099, p < 0.001), suggesting that heart rate tends to decrease with increasing age. Sex and blood pressure were not statistically significant predictors in the adjusted model. Overall, these results suggest that differences in heart rate are explained more strongly by age than by cholesterol level in this dataset.
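
A slope expressed per 1 mg/dL can look negligible simply because the unit is small. The sketch below rescales the adjusted estimate to larger contrasts, using only rounded values already reported in this document (the adjusted coefficient and its CI from Section 3.3, and the cholesterol standard deviation from Section 2.1). The variable names are hypothetical, and the rescaled numbers inherit the rounding of their inputs.

```r
# Rescale the adjusted cholesterol slope to more interpretable contrasts
beta_per_mgdl <- c(estimate = 0.028, conf.low = -0.020, conf.high = 0.076)
sd_chol       <- 51.99758  # SD of cholesterol_Level from Section 2.1

per_10_mgdl <- 10 * beta_per_mgdl       # bpm per 10 mg/dL higher cholesterol
per_sd      <- sd_chol * beta_per_mgdl  # bpm per 1 SD higher cholesterol

round(rbind(per_10_mgdl, per_sd), 2)
```

Even per standard deviation of cholesterol the point estimate is under 2 bpm, and the rescaled interval still spans zero, so the conclusion of a small, non-significant association does not depend on the unit chosen.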

3.4 Compare Unadjusted vs. Adjusted Estimates (5 points)

Create a numeric AND visual comparison of your unadjusted and adjusted estimates for the exposure variable.

# Extract the cholesterol_Level coefficient from unadjusted vs adjusted linear models

unadj_lin <- subset(tidy_simple_linear, term == "cholesterol_Level")
unadj_lin$Model <- "Linear – Unadjusted"

adj_lin <- subset(tidy_adj_linear, term == "cholesterol_Level")
adj_lin$Model <- "Linear – Adjusted"

compare_linear <- rbind(unadj_lin, adj_lin)
print(compare_linear)

## # A tibble: 2 × 8
##   term             estimate std.error statistic p.value conf.low conf.high Model
##   <chr>               <dbl>     <dbl>     <dbl>   <dbl>    <dbl>     <dbl> <chr>
## 1 cholesterol_Lev…    0         0.026    -0.001   0.999   -0.051     0.051 Line…
## 2 cholesterol_Lev…    0.028     0.024     1.13    0.258   -0.02      0.076 Line…

# Keep only the columns needed for the numeric comparison
comparison_table <- compare_linear[, c("Model", "estimate", "conf.low", "conf.high", "p.value")]

print(comparison_table)

## # A tibble: 2 × 5
##   Model               estimate conf.low conf.high p.value
##   <chr>                  <dbl>    <dbl>     <dbl>   <dbl>
## 1 Linear – Unadjusted    0       -0.051     0.051   0.999
## 2 Linear – Adjusted      0.028   -0.02      0.076   0.258

comparison_table$Family <- ifelse(
  grepl("Linear", comparison_table$Model),
  "Linear (Coefficient)",
  NA
)

comparison_table$Type <- ifelse(
  grepl("Unadjusted", comparison_table$Model),
  "Unadjusted",
  "Adjusted"
)

# Forest plot: Unadjusted vs Adjusted estimates with 95% CI
# Using ggplot() from the ggplot2 package (part of tidyverse)
ggplot(comparison_table, aes(x = estimate, y = Type, colour = Type)) +  # set up plot; map estimate to x, model type to y
  geom_point(size = 3) +                                                 # draw point estimates
  geom_errorbar(aes(xmin = conf.low, xmax = conf.high), height = 0.2) + # draw horizontal 95% CI bars
  facet_wrap(~ Family, scales = "free_x") +                              # create separate panels for each regression type
  labs(                                                                   # add labels for title and axes
    title = "Forest Plot: Effect of Cholesterol Level on Maximum Heart Rate",
    subtitle = "Unadjusted vs Adjusted Estimates with 95% CI",
    x = "Estimate",
    y = NULL,
    colour = "Model"
  ) +
  theme_minimal() +                                                      # apply a clean, minimal theme
  theme(legend.position = "bottom")                                      # move legend to the bottom

Analysis of Confounding

How did the inclusion of confounders change the coefficient for your exposure after adjustment? In your answer, describe the direction the coefficient changed (toward or away from the null effect), and whether your conclusion about statistical significance was affected (4-5 sentences).

Your answer: Comparing the unadjusted and adjusted models shows that the estimated association between cholesterol level and heart rate changes slightly after adjusting for confounders. In the unadjusted model, the effect estimate was approximately 0, indicating little to no observable relationship. After adjusting for age, sex, and blood pressure, the coefficient increased to about 0.028, meaning the estimate moved slightly away from the null value of 0 and became positive. However, the 95% confidence intervals for both models include 0, indicating that the association remains not statistically significant. Therefore, although adjusting for confounders slightly changed the estimated effect, it did not change the overall conclusion that cholesterol level is not significantly associated with heart rate in this dataset.
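
The comparison described in this answer can be written out as a short calculation. This is a sketch built from the rounded estimates reported in Sections 3.2 and 3.3; the object names and the helper `includes_zero()` are hypothetical. A percent change in the estimate is not computed because the unadjusted estimate rounds to zero, which would make the ratio undefined.

```r
# Rounded estimates copied from the unadjusted and adjusted model output
unadj <- c(estimate = 0.000, conf.low = -0.051, conf.high = 0.051)
adj   <- c(estimate = 0.028, conf.low = -0.020, conf.high = 0.076)

# Absolute shift in the cholesterol coefficient after adjustment (bpm per mg/dL)
abs_change <- adj[["estimate"]] - unadj[["estimate"]]

# Direction relative to the null value of 0
moved_away_from_null <- abs(adj[["estimate"]]) > abs(unadj[["estimate"]])

# Hypothetical helper: does a 95% CI contain the null value?
includes_zero <- function(ci) ci[["conf.low"]] <= 0 && ci[["conf.high"]] >= 0

abs_change
moved_away_from_null
c(unadjusted = includes_zero(unadj), adjusted = includes_zero(adj))
```

Both intervals contain zero, so the shift away from the null changes the point estimate without changing the conclusion about statistical significance.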

Part 4: Communicating Results (/6 points)

4.1 Publication-Ready Table (2 points)

Create a professional results table for your adjusted regression model.

# Format the adjusted linear model results for a journal-style table

pub_table <- tidy_adj_linear

# Create readable variable names
pub_table$Variable <- pub_table$term
pub_table$Variable[pub_table$term == "(Intercept)"]       <- "Intercept"
pub_table$Variable[pub_table$term == "cholesterol_Level"] <- "Cholesterol level (mg/dL)"
pub_table$Variable[pub_table$term == "age"]               <- "Age (years)"
pub_table$Variable[pub_table$term == "sexMale"]           <- "Sex (Male vs Female)"
pub_table$Variable[pub_table$term == "blood_pressure"]    <- "Resting blood pressure (mm Hg)"

# Paste confidence interval into one column, keeping three decimals
pub_table$CI <- sprintf("(%.3f, %.3f)", pub_table$conf.low, pub_table$conf.high)

# Format p-value, keeping trailing zeros
pub_table$P_value <- ifelse(pub_table$p.value < 0.001, "< 0.001",
                            sprintf("%.3f", pub_table$p.value))

# Keep only the columns we need
pub_table <- pub_table[, c("Variable", "estimate", "CI", "P_value")]
colnames(pub_table) <- c("Variable", "Estimate", "95% CI", "P-value")

kable(pub_table)

Variable	Estimate	95% CI	P-value
Intercept	194.734	(171.858, 217.610)	< 0.001
Cholesterol level (mg/dL)	0.028	(-0.020, 0.076)	0.258
Age (years)	-1.099	(-1.381, -0.816)	< 0.001
Sex (Male vs Female)	-4.109	(-9.352, 1.133)	0.124
Resting blood pressure (mm Hg)	0.081	(-0.061, 0.223)	0.260

4.2 Written Results Statement (4 points)

Write a 5-7 sentence results paragraph suitable for a research paper. Your statement should include:

The research question addressed
The adjusted effect estimate with units
The 95% confidence interval
Whether the effect is statistically significant
The confounders that were controlled for and how they impacted the effect estimate

Results

Your answer: This study examined whether higher serum cholesterol levels were associated with maximum heart rate achieved among adults in the Cleveland UCI Heart Disease dataset. In the adjusted multiple linear regression model controlling for age, sex, and blood pressure, cholesterol level was associated with a 0.028 beats-per-minute (bpm) increase in heart rate for each 1 mg/dL increase in cholesterol. The 95% confidence interval ranged from −0.020 to 0.076 bpm, indicating substantial uncertainty around the estimate. This association was not statistically significant (p = 0.258), suggesting insufficient evidence of an independent relationship between cholesterol level and heart rate. Compared with the unadjusted model, where the estimated effect was essentially zero, adjustment for confounders slightly shifted the association in a positive direction but did not meaningfully change its magnitude. Among the covariates, age demonstrated a strong negative association with heart rate, indicating that confounding by age influenced the crude relationship. Overall, after accounting for age, sex, and blood pressure, cholesterol level did not appear to be an important predictor of heart rate in this population.

Part 5: Limitations & Interpretation (/6 points)

5.1 Limitations (3 points)

Discuss THREE limitations of your dataset or analysis. These could include unmeasured confounding variables, measurement bias in data collection, selection bias in study design, lack of external validity, limitations in the statistical analysis, etc.

Limitation 1 Unmeasured confounding

Describe and explain the potential impact on results in 2-3 sentences.

Your answer: Although this analysis adjusted for age, sex, and blood pressure, other important cardiovascular risk factors such as socioeconomic status, smoking status, physical activity, diet, medication use, and body mass index were not included. These variables may influence both cholesterol levels and heart rate, meaning residual confounding could still bias the estimated association between exposure and outcome.

Limitation 2 Measurement limitations

Describe and explain the potential impact on results in 2-3 sentences.

Your answer: The Cleveland UCI dataset relies on single clinical measurements rather than repeated assessments over time. Because heart rate and cholesterol levels can fluctuate due to stress, treatment, or temporary health conditions, these measurements may not fully represent participants’ usual physiological status, introducing measurement variability.

Limitation 3 Selection bias and limited generalizability

Describe and explain the potential impact on results in 2-3 sentences.

Your answer: Participants in this dataset were patients undergoing cardiac evaluation rather than a random sample of the general population. This may overrepresent individuals already at higher cardiovascular risk, limiting the ability to generalize the findings to healthier or community-based populations.

5.2 Interpretation (3 points)

What is your conclusion on whether your exposure causes the outcome? Provide an informative answer based on your response to Sections 4.2 and 5.1; do not simply state that correlation does not equal causation. Incorporate your assessment of the strengths and weaknesses of your dataset and/or analysis. Answer in 4-5 sentences.

Interpretation

Your answer: Based on the adjusted regression analysis, higher serum cholesterol levels were not meaningfully associated with heart rate in this dataset, as the estimated effect was small and not statistically significant after controlling for age, sex, and blood pressure. The lack of association suggests that cholesterol level is unlikely to be a strong determinant of heart rate among participants in this sample. However, causal conclusions cannot be confidently drawn because the dataset is observational and may be affected by unmeasured confounding factors such as lifestyle behaviors, socioeconomic status, and medication use. Additionally, measurement limitations and the clinical nature of the sample reduce the ability to generalize findings to broader populations. Overall, the evidence from this analysis does not support a causal relationship between cholesterol level and heart rate, though further studies using more comprehensive data and longitudinal designs would be needed to better assess causality.

Submission Checklist

Before submitting, verify that you have the following, successfully rendered in HTML format:

Session Info

sessionInfo()

## R version 4.4.1 (2024-06-14 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
## 
## Matrix products: default
## 
## 
## locale:
## [1] LC_COLLATE=English_Canada.utf8  LC_CTYPE=English_Canada.utf8   
## [3] LC_MONETARY=English_Canada.utf8 LC_NUMERIC=C                   
## [5] LC_TIME=English_Canada.utf8    
## 
## time zone: America/Vancouver
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] knitr_1.51      broom_1.0.12    lubridate_1.9.5 forcats_1.0.1  
##  [5] stringr_1.6.0   dplyr_1.2.0     purrr_1.2.1     readr_2.2.0    
##  [9] tidyr_1.3.2     tibble_3.3.1    ggplot2_4.0.2   tidyverse_2.0.0
## 
## loaded via a namespace (and not attached):
##  [1] utf8_1.2.6         sass_0.4.10        generics_0.1.4     lattice_0.22-6    
##  [5] stringi_1.8.7      hms_1.1.4          digest_0.6.39      magrittr_2.0.4    
##  [9] evaluate_1.0.5     grid_4.4.1         timechange_0.4.0   RColorBrewer_1.1-3
## [13] fastmap_1.2.0      Matrix_1.7-0       jsonlite_2.0.0     backports_1.5.0   
## [17] mgcv_1.9-1         scales_1.4.0       jquerylib_0.1.4    cli_3.6.5         
## [21] rlang_1.1.7        crayon_1.5.3       bit64_4.6.0-1      splines_4.4.1     
## [25] withr_3.0.2        cachem_1.1.0       yaml_2.3.12        tools_4.4.1       
## [29] parallel_4.4.1     tzdb_0.5.0         vctrs_0.7.2        R6_2.6.1          
## [33] lifecycle_1.0.5    bit_4.6.0          vroom_1.7.0        pkgconfig_2.0.3   
## [37] pillar_1.11.1      bslib_0.10.0       gtable_0.3.6       glue_1.8.0        
## [41] xfun_0.57          tidyselect_1.2.1   rstudioapi_0.18.0  farver_2.1.2      
## [45] nlme_3.1-164       htmltools_0.5.9    rmarkdown_2.30     labeling_0.4.3    
## [49] compiler_4.4.1     S7_0.2.1

PHAR 305 Final Project

Appolinaire Manirakiza
