Student name:
Student UID:
This assignment is due on 2025-11-25 (Tuesday) 23:59, and is out of 60 marks (15% of final grade).
Analysis documentation (recommended):
A key part of experimental data analysis (and coding) is documentation and readability. You are encouraged to annotate your code to explain what each line is intended to do (refer to the code from our R workshops for examples).
Data is available on Moodle as “polyps.csv”.
All statistical tests, unless otherwise stated, should be assessed at a significance level of 0.05.
All your code, output, and answers to questions should be inputted into this single R markdown (.Rmd) file. All your code and text input should be made within the appropriate code/text cells.
Submission guidelines:
Please rename your completed R markdown file as “yourUID-yourName.Rmd”, and knit your Rmd file to the HTML format by clicking the “Knit” button at the top of the code editor.
You should upload the HTML file for submission to Moodle.
Guidelines on GenAI usage:
Use of GenAI is acceptable for code generation and troubleshooting for this assignment, but not for data interpretation.
In code chunks where GenAI is used, comment with the # prefix to 1) specify which GenAI model you have used, and 2) your prompt.
You are a research assistant in a laboratory specializing in gastrointestinal cancer research. Your senior, with an interest in colorectal cancers, has recently collected some data from endoscopy patients. The raw dataset (“polyps.csv”) has been passed to you for analysis. Complete this assignment using this dataset.
‘polyps.csv’ has 70 independent observations and 9 variables:
- participant_id
- on_time (1 for yes, 0 for no). Unclear why this variable was recorded. Timeliness will not be studied in this investigation.
- gender (male, female).
- age
- family_history (first < second < third < none).
  - first degree: The participant’s immediate family i.e. parents, siblings, and children.
  - second degree: The participant’s grandparents, grandchildren, uncles, aunts, nephews, nieces, and half-siblings.
  - third degree: The participant’s great-grandparents, great-grandchildren, great uncles/aunts, and first cousins.
  - none: No known relatives with colorectal cancer.
  - […] first and third degrees, only the first degree is recorded.
- baseline
- treatment (placebo, sulindac).
- m3
- m12

This step sets up a proper R environment. Write scripts to:
- Install (if needed) and load the tidyverse.
- Import the dataset ‘polyps.csv’ and assign it to the variable polyps.
# Write your codes for the "initialization" section here
if (!require("tidyverse"))
install.packages("tidyverse")
## Loading required package: tidyverse
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library('tidyverse')
polyps = read.csv("polyps.csv")
polyps
## participant_id on_time gender age family_history baseline treatment m3 m12
## 1 1 1 female 45 third 7 sulindac 4 2
## 2 2 0 female 36 first 42 placebo 39 22
## 3 3 1 male 24 none 4 sulindac 1 1
## 4 4 0 female 42 none 64 placebo 68 93
## 5 5 1 male 57 third 23 sulindac 16 17
## 6 6 1 female 41 first 35 placebo 42 61
## 7 7 1 female 47 second 11 sulindac 6 1
## 8 8 NA male 74 none 12 placebo 20 23
## 9 9 NA male 44 second 7 placebo 7 15
## 10 10 NA male 48 none 38 placebo 34 47
## 11 11 NA male 36 first 84 sulindac 72 39
## 12 12 NA female 23 none 8 sulindac 2 3
## 13 13 NA male 37 first 20 placebo 18 24
## 14 14 NA male 37 third 11 sulindac 20 10
## 15 15 NA male 34 third 24 placebo 26 40
## 16 16 NA male 43 first 34 sulindac 27 33
## 17 17 NA female 42 second 54 placebo 45 46
## 18 18 NA male 18 first 16 sulindac 10 NA
## 19 19 NA <NA> 37 second 18 placebo 30 50
## 20 20 NA female 49 none 10 sulindac 6 3
## 21 21 NA female 30 second 20 sulindac 5 1
## 22 22 NA male 52 third 91 sulindac 97 97
## 23 23 NA female 39 third 19 sulindac 15 8
## 24 24 NA female 48 third 11 placebo 13 15
## 25 25 NA female 52 none 24 sulindac 19 12
## 26 26 NA <NA> 40 <NA> 17 placebo NA NA
## 27 27 NA male 53 second 12 sulindac 13 14
## 28 28 NA female 59 none 6 placebo 8 11
## 29 29 NA male 28 first 21 sulindac 21 13
## 30 30 NA male 40 third 13 placebo 14 17
## 31 31 NA female 42 second 18 placebo 19 24
## 32 32 NA female 21 none 9 placebo NA NA
## 33 33 NA male 52 first 72 sulindac 65 24
## 34 34 NA female 48 third 13 sulindac 14 6
## 35 35 NA male 38 first 22 placebo 24 37
## 36 36 NA female 42 second 17 placebo 19 28
## 37 37 NA female 31 none 9 placebo 9 11
## 38 38 NA male 51 second 63 sulindac 68 51
## 39 39 NA male 46 first 35 placebo 42 49
## 40 40 NA male 36 none 18 sulindac 14 7
## 41 41 NA male 35 first 19 placebo 19 26
## 42 42 NA female 43 <NA> 9 sulindac 6 2
## 43 43 NA male 28 second 8 sulindac 7 2
## 44 44 NA male 48 second 17 sulindac 14 5
## 45 45 NA male 40 second 11 sulindac 9 7
## 46 46 NA male 47 second 23 placebo 24 27
## 47 47 NA female 43 first 16 sulindac 12 8
## 48 48 NA male 46 third 14 placebo 18 22
## 49 49 NA male 27 first 22 sulindac 9 4
## 50 50 NA female 53 second 27 placebo 31 38
## 51 51 NA female NA <NA> 6 sulindac NA NA
## 52 52 NA male 47 third 19 placebo 20 23
## 53 53 NA female 69 third 45 placebo 46 50
## 54 54 NA female 43 none 12 placebo 14 15
## 55 55 NA female 37 second 16 sulindac 13 10
## 56 56 NA female 45 third 7 sulindac 4 2
## 57 57 NA female 36 first 42 placebo 39 22
## 58 58 NA male 24 none 4 sulindac 1 1
## 59 59 NA female 42 none 64 placebo 68 93
## 60 60 NA male 57 third 23 sulindac 16 17
## 61 61 NA female 41 first 35 placebo 42 61
## 62 62 NA female 47 second 11 sulindac 6 1
## 63 63 NA male 74 none 12 placebo 20 23
## 64 64 NA male 44 second 7 placebo 7 15
## 65 65 NA male 48 none 38 placebo 34 47
## 66 66 NA male 36 first 84 sulindac 72 39
## 67 67 NA female 23 none 8 sulindac 2 3
## 68 68 NA male 37 first 20 placebo 18 24
## 69 69 NA male 37 third 11 sulindac 20 10
## 70 70 NA male 34 third 24 placebo 26 40
This section prepares the data for statistical analyses and data visualization. All code used has been covered in our R workshops up until 2025-11-11 (Tue). Write scripts to:
(a) Convert appropriate variables into factors with the factor level orders specified in the ‘Dataset information’ section. (3 marks)
# Write your codes for part a here
polyps$gender <- factor(polyps$gender, levels = c("male", "female"))
polyps$family_history <- factor(polyps$family_history, levels = c("first", "second", "third", "none"))
polyps$treatment <- factor(polyps$treatment, levels = c("placebo", "sulindac"))
str(polyps)
## 'data.frame': 70 obs. of 9 variables:
## $ participant_id: int 1 2 3 4 5 6 7 8 9 10 ...
## $ on_time : int 1 0 1 0 1 1 1 NA NA NA ...
## $ gender : Factor w/ 2 levels "male","female": 2 2 1 2 1 2 2 1 1 1 ...
## $ age : int 45 36 24 42 57 41 47 74 44 48 ...
## $ family_history: Factor w/ 4 levels "first","second",..: 3 1 4 4 3 1 2 4 2 4 ...
## $ baseline : int 7 42 4 64 23 35 11 12 7 38 ...
## $ treatment : Factor w/ 2 levels "placebo","sulindac": 2 1 2 1 2 1 2 1 1 1 ...
## $ m3 : int 4 39 1 68 16 42 6 20 7 34 ...
## $ m12 : int 2 22 1 93 17 61 1 23 15 47 ...
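The dataset information writes the family history levels as first < second < third < none, which could be read as asking for an ordered factor. The sketch below contrasts the two options on a toy vector; `fh_toy` and its values are made up for illustration, and whether an ordered factor is expected here is an assumption, not something the assignment states.

```r
# Toy vector standing in for polyps$family_history (hypothetical values)
fh_toy <- c("third", "first", "none", "second", NA, "first")

# Unordered factor: levels have a fixed display order but no rank
fh_nominal <- factor(fh_toy, levels = c("first", "second", "third", "none"))

# Ordered factor: levels are ranked first < second < third < none
fh_ordered <- factor(fh_toy,
                     levels = c("first", "second", "third", "none"),
                     ordered = TRUE)

levels(fh_nominal)       # the level order is the same in both versions
is.ordered(fh_nominal)   # FALSE
is.ordered(fh_ordered)   # TRUE
fh_ordered < "none"      # rank comparisons need an ordered factor; NA stays NA
```

Both versions keep the same level order in plots and tables, so the choice mainly matters when a later step relies on rank comparisons or on the contrasts R builds for ordered factors.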
(b) Further determine if there are missing values (NA) in the dataset and process them appropriately. The processed dataset should be assigned to the variable polyps_complete, and should contain no missing values. (7 marks)
Hint: For data column(s) that will not be investigated in this study but have a large proportion of NA values, you can consider excluding the entire column(s) from polyps_complete.
# Write your codes for part b here
# Count the NA values in each column
colSums(is.na(polyps))
## participant_id on_time gender age family_history
## 0 63 2 1 3
## baseline treatment m3 m12
## 0 0 3 4
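# Remove the on_time column (mostly NA and not studied in this investigation)
polyps$on_time <- NULL
# Drop rows with a missing gender or family_history value
polyps_complete = polyps %>% filter(!is.na(gender) & !is.na(family_history))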
# Check if there are still NAs
colSums(is.na(polyps_complete))
## participant_id gender age family_history baseline
## 0 0 0 0 0
## treatment m3 m12
## 0 1 2
# NA values remain in "m3" and "m12"
# Quick histogram check of whether "m3" and "m12" are normally distributed
polyps_complete %>% ggplot(aes(m3)) +
geom_histogram(bins = 150)
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_bin()`).
polyps_complete %>% ggplot(aes(m12)) +
geom_histogram(bins = 150)
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_bin()`).
# Both histograms show right-skewed distributions
# Impute the median for the NA values in the "m3" and "m12" columns
polyps_complete$m3 = replace(polyps_complete$m3, is.na(polyps_complete$m3), median(polyps_complete$m3, na.rm = TRUE))
polyps_complete$m12 = replace(polyps_complete$m12, is.na(polyps_complete$m12), median(polyps_complete$m12, na.rm = TRUE))
colSums(is.na(polyps_complete))
## participant_id gender age family_history baseline
## 0 0 0 0 0
## treatment m3 m12
## 0 0 0
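The two `replace()` calls above repeat the same logic for each column. As a minimal sketch, assuming base R only, the step could be wrapped in a small helper and applied to both columns at once. `impute_median` and the toy data frame are hypothetical names and values introduced for illustration, not part of the assignment.

```r
# Hypothetical helper: replace NA values in a numeric vector with its median
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# Toy data frame standing in for polyps_complete (values are made up)
toy <- data.frame(m3  = c(4, 39, NA, 68, 16),
                  m12 = c(2, NA, 1, 93, NA))

# Apply the helper to both follow-up columns in one step
toy[c("m3", "m12")] <- lapply(toy[c("m3", "m12")], impute_median)

toy
colSums(is.na(toy))   # both columns should now report 0 missing values
```

A helper like this keeps the imputation rule in one place, so switching to another rule later would only need one change.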
(c) Explain and justify your approach for handling NA values in (b). (4 marks)
Write your answers for part (c) here:
- on_time variable: I removed this column because 63 of its 70 values are NA and timeliness will not be studied in this investigation.
- Categorical variables (gender, family_history): I removed rows with missing values, because imputing a category could misclassify participants and bias the analysis.
- Quantitative variables (m3, m12): For the polyp counts at 3 and 12 months, I used median imputation because both variables have right-skewed distributions, as seen in the histograms, and the median is less affected by extreme values than the mean.
- Age variable: The single missing age value needed no separate handling, because it occurred in a row that was already removed for missing categorical data.
While you had a go at preprocessing the data, your senior prepared a clean dataset (‘polyps_clean.csv’) for you to analyze. They took a quick look at the dataset and made a few statements. In this section, you will conduct appropriate statistical analyses on polyps_clean.csv to prove/disprove their statements.
Import the clean dataset, assign it to the variable processed, and preview the first 6 rows of data.
# importing the clean dataset
processed = read.csv("polyps_clean.csv")
# setting factor levels for the clean dataset
processed$gender = factor(processed$gender, levels = c("male", "female"))
processed$treatment = factor(processed$treatment, levels = c("placebo", "sulindac"))
processed$family_history = factor(processed$family_history,
levels = c("first", "second", "third", "none"))
# previewing the first 6 rows of the clean dataset
head(processed)
## participant_id gender age family_history baseline treatment m3 m12 m12_diff
## 1 1 female 45 third 7 sulindac 4 2 5
## 2 2 female 36 first 42 placebo 39 22 20
## 3 3 male 24 none 4 sulindac 1 1 3
## 4 4 female 42 none 64 placebo 68 93 -29
## 5 5 male 57 third 23 sulindac 16 17 6
## 6 6 female 41 first 35 placebo 42 61 -26
# checking to confirm there are no NAs in the dataset
sum(is.na(processed))
## [1] 0
colSums(is.na(processed))
## participant_id gender age family_history baseline
## 0 0 0 0 0
## treatment m3 m12 m12_diff
## 0 0 0 0
(a) Plot a bar plot to visualize the gender distribution in the study groups. The plot should have a suitable title, and the x-/y-axes should also be labelled. (2 marks)
# Write your codes for part a here
processed %>%
ggplot(aes(x = treatment, fill = gender)) +
geom_bar() +
labs(title = "Gender Distribution Across Treatment Groups",
x = "Treatment Group",
y = "Number of Participants",
fill = "Gender") +
theme_minimal()
(b) Choose and conduct the appropriate test to test this statement. Display the summary output for the test. (4 marks)
Hint: Remember to also check the fulfillment of assumptions when running the test.
# Write your codes for part b here
# Contingency table
gender_treatment_table = table(polyps_complete$gender, polyps_complete$treatment)
print(gender_treatment_table)
##
## placebo sulindac
## male 17 20
## female 16 13
# All cell counts are >= 5 (the expected counts, 18.5 and 14.5, are too), so perform the chi-square test
chi_square_test = chisq.test(gender_treatment_table)
print(chi_square_test)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: gender_treatment_table
## X-squared = 0.24604, df = 1, p-value = 0.6199
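The chi-square approximation is usually judged on the expected counts rather than the observed ones. The sketch below rebuilds the contingency table from the counts printed above and reads the expected counts from the test object; `observed` and `chi_res` are illustrative names, and the threshold of 5 is the usual rule of thumb.

```r
# Observed counts copied from the contingency table above
observed <- matrix(c(17, 16, 20, 13), nrow = 2,
                   dimnames = list(gender = c("male", "female"),
                                   treatment = c("placebo", "sulindac")))

# Yates' continuity correction is applied by default for 2x2 tables
chi_res <- chisq.test(observed)

chi_res$expected             # expected counts under independence
all(chi_res$expected >= 5)   # rule-of-thumb check for the approximation
unname(chi_res$statistic)    # should agree with the X-squared value above
```

If any expected count were below 5, Fisher's exact test (`fisher.test()`) would be the usual alternative.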
(c) Is your senior’s statement correct? Briefly explain. (2 marks)
Write your answers for Statement 1 part (c) here:
The p-value is 0.6199, which is greater than the 0.05 significance level. Equivalently, the test statistic (0.24604) is below the critical chi-square value of 3.84 at df = 1 and α = 0.05. We therefore fail to reject the null hypothesis: there is no significant association between gender and treatment group. The senior's statement is correct - males were not preferentially assigned to the Sulindac treatment group.
(a) Your senior created a new column called m12_diff in the clean dataset. The column contains the difference in polyp number 12 months after treatment initiation (calculated as baseline - m12). What could be the rationale for creating this column? (2 marks)
Write your answers for Statement 2 part (a) here:
The new column captures each participant's change in polyp number over 12 months. This gives a direct per-participant measure of treatment effect that accounts for differences in baseline polyp number.
(b) Conduct the F-test for equal variances using the m12_diff variance values for placebo and Sulindac treatment groups. (4 marks)
# Write your codes for part b here
var(processed$m12_diff[processed$treatment == "placebo"])
## [1] 135.4848
var(processed$m12_diff[processed$treatment == "sulindac"])
## [1] 156.057
var.test(processed$m12_diff[processed$treatment == "sulindac"],
processed$m12_diff[processed$treatment == "placebo"])
##
## F test to compare two variances
##
## data: processed$m12_diff[processed$treatment == "sulindac"] and processed$m12_diff[processed$treatment == "placebo"]
## F = 1.1518, num df = 33, denom df = 32, p-value = 0.6909
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.5709629 2.3148736
## sample estimates:
## ratio of variances
## 1.151841
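As a cross-check, the F statistic can be rebuilt by hand from the two variances printed above. This is a sketch using the rounded values from the output, so small rounding differences are expected; the variable names are illustrative.

```r
# Sample variances and degrees of freedom taken from the outputs above
var_sulindac <- 156.057
var_placebo  <- 135.4848
df_sulindac  <- 33   # n - 1, so n = 34
df_placebo   <- 32   # n - 1, so n = 33

# F statistic: ratio of the two sample variances
f_stat <- var_sulindac / var_placebo

# Two-sided p-value from the F distribution
p_lower <- pf(f_stat, df_sulindac, df_placebo)
p_value <- 2 * min(p_lower, 1 - p_lower)

f_stat    # about 1.15
p_value   # compare with the p-value reported by var.test()
```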
(c) Name a test to be performed with m12_diff values to assess the efficacy of Sulindac. Are the assumptions for the test fulfilled? Briefly explain. (3 marks)
Write your answers for Statement 2 part (c) here:
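A two-sample (independent) t-test should be used with the m12_diff values. The assumptions are met: participants were randomly assigned, so the observations are independent; each group has more than 30 participants (33 placebo, 34 sulindac), so the sample means are approximately normal by the central limit theorem; and the F-test shows no evidence of unequal variances (p = 0.691 > 0.05).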
(d) Assuming the test assumptions are fulfilled, use the m12_diff values to perform the above-named test to assess the efficacy of Sulindac. (4 marks)
# Write your codes for part d here
t.test(m12_diff ~ treatment, data = processed, var.equal = TRUE)
##
## Two Sample t-test
##
## data: m12_diff by treatment
## t = -6.3754, df = 65, p-value = 2.167e-08
## alternative hypothesis: true difference in means between group placebo and group sulindac is not equal to 0
## 95 percent confidence interval:
## -24.71546 -12.92447
## sample estimates:
## mean in group placebo mean in group sulindac
## -8.878788 9.941176
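The t statistic above can be reproduced from the group means, variances and sizes already printed in this section, which shows how the pooled-variance test is put together. This is a sketch using the rounded values from the output; the variable names are illustrative.

```r
# Summary statistics taken from the outputs above
mean_placebo  <- -8.878788
mean_sulindac <-  9.941176
var_placebo   <- 135.4848
var_sulindac  <- 156.057
n_placebo     <- 33
n_sulindac    <- 34

df <- n_placebo + n_sulindac - 2

# Pooled variance: the two group variances weighted by their degrees of freedom
pooled_var <- ((n_placebo - 1) * var_placebo +
               (n_sulindac - 1) * var_sulindac) / df

# Standard error of the difference in means, then the t statistic
se_diff <- sqrt(pooled_var * (1 / n_placebo + 1 / n_sulindac))
t_stat  <- (mean_placebo - mean_sulindac) / se_diff

t_crit  <- qt(0.975, df)             # two-sided critical value at alpha = 0.05
p_value <- 2 * pt(-abs(t_stat), df)  # two-sided p-value

c(t = t_stat, critical = t_crit, p = p_value)
```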
(e) Is your senior’s statement correct? Briefly explain. (2 marks)
Write your answers for Statement 2 part (e) here:
The p-value is 2.167e-08, which is less than the 0.05 significance level. Equivalently, the t-statistic (-6.3754) lies beyond the critical value of approximately ±1.997 at df = 65 and α = 0.05. We therefore reject the null hypothesis: the mean m12_diff differs significantly between treatment groups. Polyp number fell by about 9.9 on average in the sulindac group but rose by about 8.9 in the placebo group, so the senior's statement is correct - polyp number 12 months after treatment initiation is significantly reduced by Sulindac compared to placebo.
(a) Perform an appropriate ANOVA test to assess the effect of family history and gender on baseline polyp number. Display the summary output for the ANOVA test. (4 marks)
# Write your codes for part a here
anova_result <- aov(baseline ~ family_history + gender, data = processed)
summary(anova_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## family_history 3 2603 867.6 2.147 0.103
## gender 1 52 51.5 0.128 0.722
## Residuals 62 25048 404.0
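Each F value in the table above is the factor's mean square divided by the residual mean square, and the p-value is the upper tail of the matching F distribution. The sketch below redoes that arithmetic from the rounded table entries; the variable names are illustrative.

```r
# Mean squares and degrees of freedom copied from the ANOVA table above
ms_family <- 867.6
df_family <- 3
ms_gender <- 51.5
df_gender <- 1
ms_resid  <- 404.0
df_resid  <- 62

# F value: factor mean square over residual mean square
f_family <- ms_family / ms_resid
f_gender <- ms_gender / ms_resid

# p-value: upper tail of the F distribution
p_family <- pf(f_family, df_family, df_resid, lower.tail = FALSE)
p_gender <- pf(f_gender, df_gender, df_resid, lower.tail = FALSE)

round(c(F_family = f_family, p_family = p_family,
        F_gender = f_gender, p_gender = p_gender), 3)
```

Note that `aov()` reports sequential sums of squares, so with unbalanced groups the order of terms in the formula can change the individual rows of the table.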
(b) Based on the ANOVA summary output in (a), what would you conclude about the effect of family history and gender on baseline polyp number? (2 marks)
Write your answers for Statement 3 part (b) here:
The p-value for family history is 0.103 (p > 0.05), and for gender is 0.722 (p > 0.05). Therefore, we fail to reject the null hypothesis for both factors, indicating that neither family history of colorectal cancer nor gender significantly influences the baseline polyp count in this study population.
(c) Generate the residual plots (that are necessary for assessing residuals assumptions) from your ANOVA test. (4 marks)
# Write your codes for part c here
plot(anova_result, 1, cex = 0.1, main = "Residuals vs Fitted Values")
plot(anova_result, 2, cex = 0.1, main = "Q-Q Plot of Residuals")
(d) Are the residual assumptions for your ANOVA test fulfilled? Explain in relation to the residual plots. (4 marks)
Write your answers for Statement 3 part (d) here:
The Q-Q plot shows the residuals are approximately normally distributed, with only minor deviations in the tails. The Residuals vs Fitted plot shows roughly constant variance (homoscedasticity), with points randomly scattered around zero. The residual assumptions for the ANOVA test are therefore fulfilled.
(e) Is your senior’s statement correct? Briefly explain. (2 marks)
Write your answers for Statement 3 part (e) here:
The ANOVA results show that neither family history (p = 0.103) nor gender (p = 0.722) has a statistically significant effect on baseline polyp number, as both p-values are greater than 0.05. Furthermore, the residual plots confirm that the ANOVA assumptions are adequately met, making the test results reliable. Therefore, the senior's statement is correct - ANOVA analysis reliably shows that family history and gender do not significantly influence the baseline polyp number.
(a) Plot a scatter plot to visualize the above-mentioned relationship between baseline polyp number and age. The plot should have a suitable title, and the x-/y-axes should also be labelled. (2 marks)
# Write your codes for part a here
processed %>%
ggplot(aes(age,
baseline)) +
geom_point() +
geom_smooth(method = "lm", color = "blue") +
labs(title = "Relationship Between Baseline Polyp Number and Participant Age",
x = "Participant Age",
y = "Baseline Polyp Number")
## `geom_smooth()` using formula = 'y ~ x'
(b) Generate the linear regression model to represent the relationship between baseline polyp number and age in (a). Display the summary output for the linear regression model. (3 marks)
# Write your codes for part b here
lm_model <- lm(baseline ~ age, data = processed)
summary(lm_model)
##
## Call:
## lm(formula = baseline ~ age, data = processed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.583 -13.110 -5.227 5.191 64.047
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.8430 10.0607 1.475 0.145
## age 0.2329 0.2305 1.010 0.316
##
## Residual standard error: 20.48 on 65 degrees of freedom
## Multiple R-squared: 0.01546, Adjusted R-squared: 0.0003098
## F-statistic: 1.02 on 1 and 65 DF, p-value: 0.3162
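In a simple linear regression with one predictor, the slope's t value, the overall F-statistic and R-squared are all tied together, which explains why the two p-values in the summary agree. The sketch below recovers them from the printed slope estimate and standard error; it uses rounded values, and the variable names are illustrative.

```r
# Slope estimate, standard error and residual df from the summary above
est      <- 0.2329
se       <- 0.2305
df_resid <- 65

t_age     <- est / se                        # t value for the slope
f_stat    <- t_age^2                         # with one predictor, F equals t squared
r_squared <- f_stat / (f_stat + df_resid)    # R-squared implied by F and residual df
p_value   <- 2 * pt(-abs(t_age), df_resid)   # two-sided p-value for the slope

round(c(t = t_age, F = f_stat, R2 = r_squared, p = p_value), 4)
```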
(c) Assuming the assumptions of linear regression are all met, is your senior’s statement correct based on your outputs in (a) and (b)? Briefly explain. (3 marks)
Write your answers for Statement 4 part (c) here:
Based on the linear regression output, there is no significant linear association between baseline polyp number and age. The p-value for the age coefficient is 0.316, which is greater than the 0.05 significance level. Additionally, the R-squared value of 0.015 indicates that age explains only 1.5% of the variance in baseline polyp numbers. The F-statistic p-value of 0.316 also confirms that the overall model is not statistically significant. Therefore, the data does not support the statement that baseline polyp number is linearly associated with age.
(d) Your senior asked you to further test their statement with data transformation. Identify the appropriate data transformation approaches (if applicable) for baseline polyp number and age. (3 marks)
Hint: Transformation of one variable is independent of the other variable(s) i.e. Different variables can have different transformation approaches.
# Write your codes for part d here
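One common option for a right-skewed count variable is a log transformation. The sketch below illustrates the idea on simulated data only: `baseline_sim` is generated, not taken from the dataset, and `skewness` is a hypothetical base R helper. Whether this transformation suits the real baseline values, and whether age needs any transformation at all, has to be judged from the actual data.

```r
set.seed(1)

# Simulated right-skewed counts standing in for baseline polyp number
# (NOT the real data; counts are kept at 1 or above so log() is defined)
baseline_sim <- pmax(round(rlnorm(67, meanlog = 3, sdlog = 0.8)), 1)

# Hypothetical helper: simple moment-based sample skewness
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

# Natural log transformation
log_baseline_sim <- log(baseline_sim)

skewness(baseline_sim)       # clearly positive for right-skewed data
skewness(log_baseline_sim)   # closer to 0 after the transformation

# Visual comparison before and after the transformation
hist(baseline_sim, main = "Simulated counts", xlab = "Count")
hist(log_baseline_sim, main = "Log-transformed counts", xlab = "log(count)")
```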
(e) Perform the visualization and analyses steps needed to generate and assess the new linear regression model. (4 marks)
Hint: Remember to also check the fulfillment of assumptions when assessing the model.
# Write your codes for part e here
(f) Justify your choice of transformation approach(es). Based on your outputs relating to the new linear regression model in (e), is your senior’s statement correct? Briefly explain. (5 marks)
Write your answers for Statement 4 part (f) here: