Complete these questions on your own using short answers of 1-2 sentences.
Written Answer: Graph A would use a boxplot/violin plot. Graph B would use a scatter plot. Graph C would use a linear plot.
Written Answer: The pipe command %>% takes the result of one step and passes it to the next step. It helps connect the different steps in the code instead of nesting several functions.
For this assignment we will examine the plasma profiles of individuals before and after weight loss program. For the sake of the assignment, I’ve included only a fraction of the plasma proteins that were identified and added from new groupings.
For this assignemnt: Remember to check you knotted document and ensure it looks well formatted and clear. Important is to ensure that code you might use to verify your steps (such as head() or count()) is removed or commented with the pound sign to ensure it wont run. This will prevent large dumps of tables to the knitted output. Only “show” what we ask.
Load the data. The data is provided as a .csv file called “WeightLossProgramData4Assignment.csv”. Verify that the data is loaded. Familiarize your self with the variables and columns. See the warm-up for more information.
List of variables ID - patient ID Weight - weight (kg) BMI - body mass index (kg/m^2) Lipids - Cholesterol, LDL cholesterol, HDL cholesterol and triglycerides (mmol/L) HOMA-IR - Homeostatic Model Assessment for Insulin Resistance (unitless) Diet - Diet during weight loss program intermittent fasting (1) or regular (0) The rest - Proteomic data whereby the values represent mass-spectrometry intensities (a relative measurement so no units)
# load libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggpubr)
# download the assignment data from quercus and load it here
data_weightloss <- read_csv("WeightLossProgramData4Assignment.csv")
## Rows: 318 Columns: 19
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): ID, Timepoint
## dbl (17): Week, Weight, BMI, Cholesterol, LDL, HDL, Triglyceride, HOMA-IR, A...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
For clarify, I’ve written in some comments to guide you. Feel free to add your own comments.
# Create plot that shows the distribution of weight in the study (0.5 points).
library(tidyverse)
library(ggpubr)
ggplot(data_weightloss, aes(x = Weight)) +
geom_histogram(bins=30) +
labs(
title = "Distribution of Weight",
x = "Weight (kg)",
y = "Frequency"
) +
theme_classic()
For all future questions, make sure to replace these “wrong” numbers with NA. Remember that 0 is not NA.
Written Answer:There are weights that are unrealistically high, such as values around 1000 kg. These are likely due to data entry errors.
# Wrangle your data so that you have 3 main columns that outlines the Variable (everything except ID, Timepoint, Week, and Diet), Week, Number of Missing Data. Recall the functions: is.numeric, is.character, is.na... (1.0 point)
missing_data <- data_weightloss %>%
select(-ID, -Timepoint, -Weight, -Diet) %>%
pivot_longer(cols = -Week, names_to = "Variable", values_to = "Value") %>%
group_by (Variable, Week) %>%
summarize (MissingN = sum(is.na(Value)),
.groups = "drop")
# Pipe into the following code to make a tile plot. You may edit accordingly.
missing_data %>%
ggplot(aes(x=factor(Week), y = Variable, fill = MissingN)) +
geom_tile(color = "black") +
scale_fill_gradient(low = "white", high = "purple") +
labs(
title = "Missing Weight Loss Data",
x = "Week",
y = "Variable",
fill = "Missing Values"
) +
theme_classic()
# Change the fill gradient to show low values in "white" and high values in a color of your choosing (0.5 points)
Written Answer: Variables like SERPINF1, LRG1,FN1, APOF, APOC3, and AGT did not have missing values, suggesting that they were consistently collected throughout the study. However, F7 and DPP4 show poor data quality as they have missing values for every time point. Triglycerides, LDL, HOMA-IR, HDL, Cholesterol, and BMI also show greater missing values after week 4.
# write a script to stores the plot into an object (0.5 points)
data_obesity <- data_weightloss %>%
filter(Week == -8) %>%
mutate(
obesity_category = ifelse(BMI >= 40, "Class 3",
ifelse(BMI >= 35, "Class 2",
ifelse(BMI >= 30, "Class 1",)))
)
obesity_plot <- data_obesity %>%
ggplot(aes(x = obesity_category, fill = factor(Diet))) +
geom_bar(position = "dodge") +
labs(
title = "Number of People by Obesity Category",
x = "Obesity Category",
y = "Number of people",
fill = "Diet"
) +
theme_classic()
obesity_plot
# run statistical test and print results (0.5 points)
obesity_table <- data_obesity %>%
count(Diet, obesity_category) %>%
pivot_wider(names_from = obesity_category, values_from = n, values_fill = 0) # Reshape to wide format
# Step 5: Perform the chi-square test
obesity_chisquare <- chisq.test(obesity_table[, -1]) # Exclude the Diet column for the test
print(obesity_chisquare)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: obesity_table[, -1]
## X-squared = 5.7479, df = 1, p-value = 0.01651
# write a script to stores the plot into an object (0.5 points)
data_weightloss$Weight <- ifelse(data_weightloss$Weight > 250, NA, data_weightloss$Weight)
weight_difference_data <- data_weightloss %>%
filter(Week %in% c(-8, 0)) %>% # Only keeps rows for Week -8 and Week 0.
select(ID, Weight, Diet, Week) %>%
pivot_wider(names_from = Week, values_from = Weight, names_prefix = "Week ") %>%
filter(!is.na(`Week -8`) & !is.na(`Week 0`)) %>% # Removes rows with missing weights in Week -8 and Week 0
mutate(
Diet = factor(Diet),
weight_difference = `Week -8` - `Week 0`)
weight_diff_plot <- weight_difference_data %>%
ggplot(mapping = aes(x = Diet, y = weight_difference)) +
geom_boxplot() +
labs(
title = "Weight Difference Between Start and End by Diet",
x = "Diet Type",
y = "Weight Difference (kg)",
) +
theme_classic()
weight_diff_plot
# run a statistical test and print results (0.5 points)
diet_0 <- weight_difference_data %>%
filter(Diet == 0) %>%
pull(weight_difference)
diet_1 <- weight_difference_data %>%
filter(Diet == 1) %>%
pull(weight_difference)
shapiro_diet_0 <- shapiro.test(diet_0)
shapiro_diet_1 <- shapiro.test(diet_1)
print(shapiro_diet_0)
##
## Shapiro-Wilk normality test
##
## data: diet_0
## W = 0.92503, p-value = 0.1402
print(shapiro_diet_1)
##
## Shapiro-Wilk normality test
##
## data: diet_1
## W = 0.95682, p-value = 0.3328
t_test <- t.test(diet_0, diet_1)
print(t_test)
##
## Welch Two Sample t-test
##
## data: diet_0 and diet_1
## t = -2.2944, df = 37.24, p-value = 0.0275
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -4.1226159 -0.2563315
## sample estimates:
## mean of x mean of y
## 11.01053 13.20000
# write a script to stores the plot into an object (0.5 points)
protein_data <- data_weightloss %>%
filter(!is.na(APOF)) # Only keep the rows where APOF values were recorded
mean_intensities <- protein_data %>%
group_by(Week, Diet) %>%
summarize(
Mean_Intensity = mean(APOF, na.rm = TRUE), # Computes the mean APOF intensity, ignoring NA values
.groups = "drop"
)
protein_plot <- mean_intensities %>%
ggplot(aes(x = Week, y = Mean_Intensity, color = factor(Diet), group = Diet)) +
geom_line() +
geom_point() +
labs(
title = "Mean Intensities of APOF Over Time by Diet",
x = "Week",
y = "Mean Intensity",
color = "Diet"
) +
theme_classic()
protein_plot
# Put your panels together using ggarrange function from ggpubr package. And present it here. (0.5 points)
library(ggpubr)
top_row <- ggarrange(obesity_plot, weight_diff_plot,
ncol = 2,
labels = c("A","B"))
second_row <- ggarrange(protein_plot, labels = c("C"))
both_rows <- ggarrange(top_row, second_row, ncol = 1, heights = c(1, 1))
print(both_rows)
Written Answer: Individuals following Diet type 1 showed a greater weight difference compared to those on Diet type 0, suggesting it may be more effective for weight loss. The mean intensities of APOF, a protein involved in cholesterol metabolism, were higher at the beginning of the diet but decreased over time, potentially reflecting metabolic improvements. A key limitation of this analysis is the presence of missing values, wile a strength was that the clear trends observed between the variables across time. Future work should address missing data and also explore how lower levels of specific proteins can correlate to other beneficial health factors, such as cardiovascular.
Enjoy! It’s been a blast 🚀