Lab 2 Assignment

Due: FRIDAY JANURARY 24TH

Conceptual questions (1 point each, 2 points total)

Complete these questions on your own using short answers of 1-2 sentences.

What graphs would you use to visualize the following relationships:
1. continuous variable x categorical variable, b) continuous variable x continuous variable, and c) continuous variable over time.

Written Answer: Graph A would use a boxplot/violin plot. Graph B would use a scatter plot. Graph C would use a linear plot.

What is the symbol for the pipe command? What does it do?

Written Answer: The pipe command %>% takes the result of one step and passes it to the next step. It helps connect the different steps in the code instead of nesting several functions.

Coding questions

For this assignment we will examine the plasma profiles of individuals before and after weight loss program. For the sake of the assignment, I’ve included only a fraction of the plasma proteins that were identified and added from new groupings.

For this assignemnt: Remember to check you knotted document and ensure it looks well formatted and clear. Important is to ensure that code you might use to verify your steps (such as head() or count()) is removed or commented with the pound sign to ensure it wont run. This will prevent large dumps of tables to the knitted output. Only “show” what we ask.

Loading the data

Load the data. The data is provided as a .csv file called “WeightLossProgramData4Assignment.csv”. Verify that the data is loaded. Familiarize your self with the variables and columns. See the warm-up for more information.

List of variables ID - patient ID Weight - weight (kg) BMI - body mass index (kg/m^2) Lipids - Cholesterol, LDL cholesterol, HDL cholesterol and triglycerides (mmol/L) HOMA-IR - Homeostatic Model Assessment for Insulin Resistance (unitless) Diet - Diet during weight loss program intermittent fasting (1) or regular (0) The rest - Proteomic data whereby the values represent mass-spectrometry intensities (a relative measurement so no units)

# load libraries
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggpubr)

# download the assignment data from quercus and load it here
data_weightloss <- read_csv("WeightLossProgramData4Assignment.csv")

## Rows: 318 Columns: 19
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (2): ID, Timepoint
## dbl (17): Week, Weight, BMI, Cholesterol, LDL, HDL, Triglyceride, HOMA-IR, A...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

For clarify, I’ve written in some comments to guide you. Feel free to add your own comments.

Q1 (1 total point). Rarely do we ever get tidy data. Often times, we get data dumps that are not always accurate (e.g., typos, tired undergraduate data collectors, etc). Create an appropriate plot that shows the distrbution of weight in the study.

# Create plot that shows the distribution of weight in the study (0.5 points). 

library(tidyverse)
library(ggpubr)

ggplot(data_weightloss, aes(x = Weight)) + 
  geom_histogram(bins=30) + 
  labs(
    title = "Distribution of Weight",
    x = "Weight (kg)",
    y = "Frequency"
  ) +
  theme_classic()

What do you see? Are there any non-sensible data like unrealistically low or high weights? Remember that these are in kg. (0.5 points)

For all future questions, make sure to replace these “wrong” numbers with NA. Remember that 0 is not NA.

Written Answer:There are weights that are unrealistically high, such as values around 1000 kg. These are likely due to data entry errors.

Q2 (2 total point). Data is never complete. Patients might miss an appointment, values are too low/high/weird, tons of reasons. So, let’s check the “missingness” of our data. Create a tile plot that shows the number of missing data for a given variable and time, with variables as rows and weeks as columns. {PS: Part of this question is exploring a new function. Feel free to use the help reference for geom_tile().}

# Wrangle your data so that you have 3 main columns that outlines the Variable (everything except ID, Timepoint, Week, and Diet), Week, Number of Missing Data. Recall the functions: is.numeric, is.character, is.na... (1.0 point)
 
missing_data <- data_weightloss %>%
  select(-ID, -Timepoint, -Weight, -Diet) %>%
  pivot_longer(cols = -Week, names_to = "Variable", values_to = "Value") %>%
  group_by (Variable, Week) %>%
  summarize (MissingN = sum(is.na(Value)), 
             .groups = "drop")


# Pipe into the following code to make a tile plot. You may edit accordingly.
missing_data %>% 
  ggplot(aes(x=factor(Week), y = Variable, fill = MissingN)) +
  geom_tile(color = "black") +
  scale_fill_gradient(low = "white", high = "purple") +
  labs(
   title = "Missing Weight Loss Data",
   x = "Week",
   y = "Variable",
   fill = "Missing Values"
  ) +
  theme_classic()

# Change the fill gradient to show low values in "white" and high values in a color of your choosing (0.5 points)

What did you observe from the tile plot? Are there certain groups of variables that were collected consistently or not? Are there are variables that stand out because of poor quality (e.g., missing in ~40% in every timepoint)? (0.5 points)

Written Answer: Variables like SERPINF1, LRG1,FN1, APOF, APOC3, and AGT did not have missing values, suggesting that they were consistently collected throughout the study. However, F7 and DPP4 show poor data quality as they have missing values for every time point. Triglycerides, LDL, HOMA-IR, HDL, Cholesterol, and BMI also show greater missing values after week 4.

Q3 (5 total points). We’re going to create a whole figure with multiple panels. For this question, do not print the individual panels. Only write the code and then show the whole figure. Don’t forget labels, etc.

A. Number of people by obesity category (Class 1 >= 30, Class 2 >= 35, Class 3 >= 40) and diet at the start of the program (week = -8).

# write a script to stores the plot into an object (0.5 points)

data_obesity <- data_weightloss %>%
  filter(Week == -8) %>%
  mutate(
    obesity_category = ifelse(BMI >= 40, "Class 3",
                    ifelse(BMI >= 35, "Class 2",
                    ifelse(BMI >= 30, "Class 1",)))
  ) 
obesity_plot <- data_obesity %>%
  ggplot(aes(x = obesity_category, fill = factor(Diet))) +
  geom_bar(position = "dodge") +
  labs(
    title = "Number of People by Obesity Category",
    x = "Obesity Category", 
    y = "Number of people",
    fill = "Diet"
  ) +
  theme_classic()

obesity_plot

# run statistical test and print results (0.5 points)

obesity_table <- data_obesity %>%
  count(Diet, obesity_category) %>%  
  pivot_wider(names_from = obesity_category, values_from = n, values_fill = 0)  # Reshape to wide format

# Step 5: Perform the chi-square test
obesity_chisquare <- chisq.test(obesity_table[, -1])  # Exclude the Diet column for the test
print(obesity_chisquare)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  obesity_table[, -1]
## X-squared = 5.7479, df = 1, p-value = 0.01651

B. Weight difference between start (week = -8) and end (week = 0) by Diet. AND run a statistical test. If using filter, the %in% operator be might helpful.

# write a script to stores the plot into an object (0.5 points)

data_weightloss$Weight <- ifelse(data_weightloss$Weight > 250, NA, data_weightloss$Weight)
weight_difference_data <- data_weightloss %>%
  filter(Week %in% c(-8, 0)) %>% # Only keeps rows for Week -8 and Week 0. 
  select(ID, Weight, Diet, Week) %>%
  pivot_wider(names_from = Week, values_from = Weight, names_prefix = "Week ") %>%
  filter(!is.na(`Week -8`) & !is.na(`Week 0`)) %>% # Removes rows with missing weights in Week -8 and Week 0
  mutate(
    Diet = factor(Diet), 
    weight_difference = `Week -8` - `Week 0`)

weight_diff_plot <- weight_difference_data %>%
  ggplot(mapping = aes(x = Diet, y = weight_difference)) +
  geom_boxplot() +
  labs(
    title = "Weight Difference Between Start and End by Diet",
    x = "Diet Type",
    y = "Weight Difference (kg)",
    
  ) +
  theme_classic()
weight_diff_plot

# run a statistical test and print results (0.5 points)

diet_0 <- weight_difference_data %>%
  filter(Diet == 0) %>%
  pull(weight_difference)

diet_1 <- weight_difference_data %>%
  filter(Diet == 1) %>%
  pull(weight_difference)

shapiro_diet_0 <- shapiro.test(diet_0)
shapiro_diet_1 <- shapiro.test(diet_1)

print(shapiro_diet_0)

## 
##  Shapiro-Wilk normality test
## 
## data:  diet_0
## W = 0.92503, p-value = 0.1402

print(shapiro_diet_1)

## 
##  Shapiro-Wilk normality test
## 
## data:  diet_1
## W = 0.95682, p-value = 0.3328

t_test <- t.test(diet_0, diet_1)
print(t_test)

## 
##  Welch Two Sample t-test
## 
## data:  diet_0 and diet_1
## t = -2.2944, df = 37.24, p-value = 0.0275
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -4.1226159 -0.2563315
## sample estimates:
## mean of x mean of y 
##  11.01053  13.20000

C. Mean intensities of 1 good quality protein of your choice over time by diet.

# write a script to stores the plot into an object (0.5 points)

protein_data <- data_weightloss %>%
  filter(!is.na(APOF)) # Only keep the rows where APOF values were recorded

mean_intensities <- protein_data %>%
  group_by(Week, Diet) %>%
  summarize(
    Mean_Intensity = mean(APOF, na.rm = TRUE), # Computes the mean APOF intensity, ignoring NA values
    .groups = "drop" 
  )

protein_plot <- mean_intensities %>%
  ggplot(aes(x = Week, y = Mean_Intensity, color = factor(Diet), group = Diet)) +
  geom_line() +                      
  geom_point() +                      
  labs(
    title = "Mean Intensities of APOF Over Time by Diet",
    x = "Week",
    y = "Mean Intensity",
    color = "Diet"
  ) +
  theme_classic()
  
protein_plot

Present the figure accordingly: A and B in the top row. C-E in the 2nd row. (0.5 point)

# Put your panels together using ggarrange function from ggpubr package. And present it here. (0.5 points)

library(ggpubr)

top_row <- ggarrange(obesity_plot, weight_diff_plot,
                     ncol = 2,
                     labels = c("A","B"))

second_row <- ggarrange(protein_plot, labels = c("C"))

both_rows <- ggarrange(top_row, second_row, ncol = 1, heights = c(1, 1))

print(both_rows)

Write mini discussion (3-4 sentences) that summarises and makes sense of your findings. Are these findings expected? Were there any strengths/weaknesses? What would you explore next? (2 points)

Written Answer: Individuals following Diet type 1 showed a greater weight difference compared to those on Diet type 0, suggesting it may be more effective for weight loss. The mean intensities of APOF, a protein involved in cholesterol metabolism, were higher at the beginning of the diet but decreased over time, potentially reflecting metabolic improvements. A key limitation of this analysis is the presence of missing values, wile a strength was that the clear trends observed between the variables across time. Future work should address missing data and also explore how lower levels of specific proteins can correlate to other beneficial health factors, such as cardiovascular.

Enjoy! It’s been a blast 🚀