Student name:
Student UID:

Instructions:

Scenario:

You are a research assistant in a laboratory specializing in gastrointestinal cancer research. Your senior, with an interest in colorectal cancers, has recently collected some data from endoscopy patients. The raw dataset (“polyps.csv”) has been passed to you for analysis. Complete this assignment using this dataset.

Background

  • Polyp formation is the development of abnormal tissue growths in the lining of the colon or rectum. While most polyps are benign, some can progress to colorectal cancer if left untreated. Endoscopy is currently widely used to visualize and, if necessary, excise the polyps.
  • Colorectal cancer can be attributed to genetic and lifestyle factors. Patients with a family history of colorectal cancer are advised to undergo regular endoscopy screenings for polyps, starting at an age earlier than the general population.
  • Your senior would like to further investigate the effectiveness of a preventive treatment with Sulindac, a nonsteroidal pain reliever, in reducing polyp numbers. They also wish to determine the potential factors contributing to polyposis and the risk of developing colorectal cancer.

Experimental approach

  • 70 patients were randomly assigned into 2 treatment groups (Randomized controlled trial: placebo control, and the Sulindac treatment).
  • Polyp numbers were measured by endoscopy at the start of the clinical trial, 3 months, and 12 months after initiation of the treatment.
  • It is known that the sampling is reliable and independent.

Dataset information

‘polyps.csv’ has 70 independent observations and 9 variables:

  • participant_id: One ID for each participant.
  • on_time: Whether the participant arrived on time for the endoscopies (1 for yes, 0 for no). Unclear why this variable was recorded. Timeliness will not be studied in this investigation.
  • gender: Gender of participant (in no particular order: male, female).
  • age: Age of participant.
  • family_history: Family history of colorectal cancer by degrees (in the order of first < second < third < none).
    • first degree: The participant’s immediate family i.e. parents, siblings, and children.
    • second degree: The participant’s grandparents, grandchildren, uncles, aunts, nephews, nieces, and half-siblings.
    • third degree: The participant’s great-grandparents, great grandchildren, great uncles/aunts, and first cousins.
    • none: No known relatives with colorectal cancer.
    • In the event that a participant has multiple degrees of colorectal cancer family history, only the more immediate degree is recorded e.g. If a participant has colorectal cancer family history in both the first and third degrees, only the first degree is recorded.
  • baseline: Polyp number measured by endoscopy prior to initiating treatment.
  • treatment: Treatment group the participant was assigned to (in the order of placebo, sulindac).
  • m3: Polyp number measured by endoscopy 3 months after initiating treatment.
  • m12: Polyp number measured by endoscopy 12 months after initiating treatment.
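
The recording rule for family_history (keep only the most immediate degree) can be sketched in R as follows. This is a minimal illustration and not part of the original data collection: the helper `most_immediate_degree()` and its input vector are hypothetical, and the sketch assumes the degrees are encoded as an ordered factor in the order first < second < third < none.

```r
# Hypothetical helper: return the most immediate degree among a participant's
# affected relatives, assuming the ordering first < second < third < none.
degree_levels <- c("first", "second", "third", "none")

most_immediate_degree <- function(degrees) {
  # No affected relatives reported: record "none"
  if (length(degrees) == 0) {
    return(factor("none", levels = degree_levels, ordered = TRUE))
  }
  # min() is defined for ordered factors and returns the lowest level present
  min(factor(degrees, levels = degree_levels, ordered = TRUE))
}

# A participant with affected relatives in the first and third degrees
recorded <- most_immediate_degree(c("third", "first"))
as.character(recorded)
```

Encoding the variable as an ordered factor is one possible design choice; an unordered factor with the same level order is enough when only the display order of the levels matters.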

Initialization (2 marks):

This step sets up a proper R environment. Write scripts to:

# Write your codes for the "initialization" section here
if (!require("tidyverse")) 
  install.packages("tidyverse")
## Loading required package: tidyverse
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library('tidyverse')
polyps = read.csv("polyps.csv")
polyps
##    participant_id on_time gender age family_history baseline treatment m3 m12
## 1               1       1 female  45          third        7  sulindac  4   2
## 2               2       0 female  36          first       42   placebo 39  22
## 3               3       1   male  24           none        4  sulindac  1   1
## 4               4       0 female  42           none       64   placebo 68  93
## 5               5       1   male  57          third       23  sulindac 16  17
## 6               6       1 female  41          first       35   placebo 42  61
## 7               7       1 female  47         second       11  sulindac  6   1
## 8               8      NA   male  74           none       12   placebo 20  23
## 9               9      NA   male  44         second        7   placebo  7  15
## 10             10      NA   male  48           none       38   placebo 34  47
## 11             11      NA   male  36          first       84  sulindac 72  39
## 12             12      NA female  23           none        8  sulindac  2   3
## 13             13      NA   male  37          first       20   placebo 18  24
## 14             14      NA   male  37          third       11  sulindac 20  10
## 15             15      NA   male  34          third       24   placebo 26  40
## 16             16      NA   male  43          first       34  sulindac 27  33
## 17             17      NA female  42         second       54   placebo 45  46
## 18             18      NA   male  18          first       16  sulindac 10  NA
## 19             19      NA   <NA>  37         second       18   placebo 30  50
## 20             20      NA female  49           none       10  sulindac  6   3
## 21             21      NA female  30         second       20  sulindac  5   1
## 22             22      NA   male  52          third       91  sulindac 97  97
## 23             23      NA female  39          third       19  sulindac 15   8
## 24             24      NA female  48          third       11   placebo 13  15
## 25             25      NA female  52           none       24  sulindac 19  12
## 26             26      NA   <NA>  40           <NA>       17   placebo NA  NA
## 27             27      NA   male  53         second       12  sulindac 13  14
## 28             28      NA female  59           none        6   placebo  8  11
## 29             29      NA   male  28          first       21  sulindac 21  13
## 30             30      NA   male  40          third       13   placebo 14  17
## 31             31      NA female  42         second       18   placebo 19  24
## 32             32      NA female  21           none        9   placebo NA  NA
## 33             33      NA   male  52          first       72  sulindac 65  24
## 34             34      NA female  48          third       13  sulindac 14   6
## 35             35      NA   male  38          first       22   placebo 24  37
## 36             36      NA female  42         second       17   placebo 19  28
## 37             37      NA female  31           none        9   placebo  9  11
## 38             38      NA   male  51         second       63  sulindac 68  51
## 39             39      NA   male  46          first       35   placebo 42  49
## 40             40      NA   male  36           none       18  sulindac 14   7
## 41             41      NA   male  35          first       19   placebo 19  26
## 42             42      NA female  43           <NA>        9  sulindac  6   2
## 43             43      NA   male  28         second        8  sulindac  7   2
## 44             44      NA   male  48         second       17  sulindac 14   5
## 45             45      NA   male  40         second       11  sulindac  9   7
## 46             46      NA   male  47         second       23   placebo 24  27
## 47             47      NA female  43          first       16  sulindac 12   8
## 48             48      NA   male  46          third       14   placebo 18  22
## 49             49      NA   male  27          first       22  sulindac  9   4
## 50             50      NA female  53         second       27   placebo 31  38
## 51             51      NA female  NA           <NA>        6  sulindac NA  NA
## 52             52      NA   male  47          third       19   placebo 20  23
## 53             53      NA female  69          third       45   placebo 46  50
## 54             54      NA female  43           none       12   placebo 14  15
## 55             55      NA female  37         second       16  sulindac 13  10
## 56             56      NA female  45          third        7  sulindac  4   2
## 57             57      NA female  36          first       42   placebo 39  22
## 58             58      NA   male  24           none        4  sulindac  1   1
## 59             59      NA female  42           none       64   placebo 68  93
## 60             60      NA   male  57          third       23  sulindac 16  17
## 61             61      NA female  41          first       35   placebo 42  61
## 62             62      NA female  47         second       11  sulindac  6   1
## 63             63      NA   male  74           none       12   placebo 20  23
## 64             64      NA   male  44         second        7   placebo  7  15
## 65             65      NA   male  48           none       38   placebo 34  47
## 66             66      NA   male  36          first       84  sulindac 72  39
## 67             67      NA female  23           none        8  sulindac  2   3
## 68             68      NA   male  37          first       20   placebo 18  24
## 69             69      NA   male  37          third       11  sulindac 20  10
## 70             70      NA   male  34          third       24   placebo 26  40

Data pre-processing (14 marks):

This section prepares the data for statistical analyses and data visualization. All codes used have been covered in our R workshops up until 2025-11-11 (Tue). Write scripts to:

(a) Convert appropriate variables into factors with the factor level orders specified in the ‘Dataset information’ section. (3 marks)

# Write your codes for part a here

polyps$gender <- factor(polyps$gender, levels = c("male", "female"))
polyps$family_history <- factor(polyps$family_history, levels = c("first", "second", "third", "none"))
polyps$treatment <- factor(polyps$treatment, levels = c("placebo", "sulindac"))

str(polyps)
## 'data.frame':    70 obs. of  9 variables:
##  $ participant_id: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ on_time       : int  1 0 1 0 1 1 1 NA NA NA ...
##  $ gender        : Factor w/ 2 levels "male","female": 2 2 1 2 1 2 2 1 1 1 ...
##  $ age           : int  45 36 24 42 57 41 47 74 44 48 ...
##  $ family_history: Factor w/ 4 levels "first","second",..: 3 1 4 4 3 1 2 4 2 4 ...
##  $ baseline      : int  7 42 4 64 23 35 11 12 7 38 ...
##  $ treatment     : Factor w/ 2 levels "placebo","sulindac": 2 1 2 1 2 1 2 1 1 1 ...
##  $ m3            : int  4 39 1 68 16 42 6 20 7 34 ...
##  $ m12           : int  2 22 1 93 17 61 1 23 15 47 ...
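
One pitfall worth checking after a conversion like the one above: any value that does not exactly match a declared level is silently turned into NA. The sketch below uses a small hypothetical vector (not values from polyps.csv) to show how such losses could be counted.

```r
# Toy vector (hypothetical values, not taken from polyps.csv)
gender_raw <- c("male", "female", "Female", NA)
gender_fct <- factor(gender_raw, levels = c("male", "female"))

# "Female" does not match any declared level, so it becomes NA
is.na(gender_fct)

# Count values that were present in the raw data but lost during conversion
n_lost <- sum(!is.na(gender_raw) & is.na(gender_fct))
n_lost
```

If `n_lost` is greater than zero, the raw spellings would need cleaning before the factor levels are set.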

(b) Further determine if there are missing values (NA) in the dataset and process them appropriately. The processed dataset should be assigned to the variable polyps_complete, and should contain no missing values. (7 marks)

Hint: For data column(s) that will not be investigated in this study but have a large proportion of NA values, you can consider excluding the entire column(s) from polyps_complete.

# Write your codes for part b here 

# Count the NA values in each column
colSums(is.na(polyps))
## participant_id        on_time         gender            age family_history 
##              0             63              2              1              3 
##       baseline      treatment             m3            m12 
##              0              0              3              4
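# Remove the on_time column (63 of 70 values are missing; not studied here)
polyps$on_time <- NULL

# Drop rows with a missing categorical value (gender or family_history)
polyps_complete = polyps %>% filter(!is.na(gender) & !is.na(family_history))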

# Check if there are still NAs
colSums(is.na(polyps_complete))
## participant_id         gender            age family_history       baseline 
##              0              0              0              0              0 
##      treatment             m3            m12 
##              0              1              2
# NA values remain in "m3" and "m12"
# Quick check with ggplot to see whether "m3" and "m12" are normally distributed
polyps_complete %>%
  ggplot(aes(m3)) +
  geom_histogram(bins = 150)
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_bin()`).

polyps_complete %>%
  ggplot(aes(m12)) +
  geom_histogram(bins = 150)
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_bin()`).

# Both histograms show right-skewed distributions
# Replace the NA values in the "m3" and "m12" columns with the column median
polyps_complete$m3 = replace(polyps_complete$m3, is.na(polyps_complete$m3), median(polyps_complete$m3, na.rm = TRUE))
polyps_complete$m12 = replace(polyps_complete$m12, is.na(polyps_complete$m12), median(polyps_complete$m12, na.rm = TRUE))
colSums(is.na(polyps_complete))
## participant_id         gender            age family_history       baseline 
##              0              0              0              0              0 
##      treatment             m3            m12 
##              0              0              0

(c) Explain and justify your approach for handling NA values in (b). (4 marks)

Write your answers for part (c) here:
- on_time variable: I removed this column because 63 of its 70 values are missing and timeliness will not be studied in this investigation.
- Categorical variables (gender, family_history): I removed the rows with missing values, because imputing a category (for example, the most common level) could misclassify participants and bias group comparisons.
- Quantitative variables (m3, m12): For the polyp counts at 3 and 12 months, I imputed the median because the histograms show right-skewed distributions, and the median is less affected by extreme values than the mean.
- age variable: The single missing age value needed no separate handling, because it occurred in a row that was already removed for missing categorical data.
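
The column-then-row approach described above can be sketched on a toy data frame. This is only an illustration under stated assumptions: the data values and the 50% cut-off for dropping a column are hypothetical choices, not rules from the assignment.

```r
# Toy data frame (hypothetical values) with one mostly-missing column
toy <- data.frame(
  on_time = c(1, NA, NA, NA, NA),
  gender  = c("male", NA, "female", "male", "female"),
  m12     = c(2, 22, NA, 93, 17)
)

# Share of missing values per column
na_share <- colMeans(is.na(toy))
na_share

# Step 1: drop columns where more than half the values are missing
# (the 0.5 threshold is an assumption made for this sketch)
keep_cols <- names(na_share)[na_share <= 0.5]
toy_kept  <- toy[, keep_cols, drop = FALSE]

# Step 2: keep only rows that are complete in the remaining columns
toy_complete <- toy_kept[complete.cases(toy_kept), , drop = FALSE]
toy_complete
```

Unlike the answer above, this sketch removes every incomplete row instead of imputing the numeric columns; which option is preferable depends on how much data can be spared.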

Statistical analysis:

While you had a go at preprocessing the data, your senior prepared a clean dataset (‘polyps_clean.csv’) for you to analyze. They took a quick look at the dataset and made a few statements. In this section, you will conduct appropriate statistical analyses on polyps_clean.csv to prove/disprove their statements.

Compulsory step: Run the following code chunk to import your senior’s clean dataset as processed and preview the first 6 rows of data.

# importing the clean dataset
processed = read.csv("polyps_clean.csv")

# setting factor levels for the clean dataset
processed$gender = factor(processed$gender, levels = c("male", "female"))
processed$treatment = factor(processed$treatment, levels = c("placebo", "sulindac"))
processed$family_history = factor(processed$family_history, 
                               levels = c("first", "second", "third", "none"))

# previewing the first 6 rows of the clean dataset
head(processed)
##   participant_id gender age family_history baseline treatment m3 m12 m12_diff
## 1              1 female  45          third        7  sulindac  4   2        5
## 2              2 female  36          first       42   placebo 39  22       20
## 3              3   male  24           none        4  sulindac  1   1        3
## 4              4 female  42           none       64   placebo 68  93      -29
## 5              5   male  57          third       23  sulindac 16  17        6
## 6              6 female  41          first       35   placebo 42  61      -26
# checking to confirm there are no NAs in the dataset
sum(is.na(processed))
## [1] 0
colSums(is.na(processed))
## participant_id         gender            age family_history       baseline 
##              0              0              0              0              0 
##      treatment             m3            m12       m12_diff 
##              0              0              0              0

Statement 1: Males were not preferentially assigned to the Sulindac treatment group. (8 marks)

(a) Plot a bar plot to visualize the gender distribution in the study groups. The plot should have a suitable title, and the x-/y-axes should also be labelled. (2 marks)

# Write your codes for part a here  

processed %>% 
  ggplot(aes(x = treatment, fill = gender)) +
  geom_bar() +
  labs(title = "Gender Distribution Across Treatment Groups",
       x = "Treatment Group", 
       y = "Number of Participants",
       fill = "Gender") +
  theme_minimal()

(b) Choose and conduct the appropriate test to test this statement. Display the summary output for the test. (4 marks)

Hint: Remember to also check the fulfillment of assumptions when running the test.

# Write your codes for part b here  

# Contingency table of gender by treatment group (built from polyps_complete)
gender_treatment_table = table(polyps_complete$gender, polyps_complete$treatment)
print(gender_treatment_table)
##         
##          placebo sulindac
##   male        17       20
##   female      16       13
# Chi-squared test of independence; all expected counts are >= 5 (smallest is 29 * 33 / 66 = 14.5)
chi_square_test = chisq.test(gender_treatment_table)
print(chi_square_test)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  gender_treatment_table
## X-squared = 0.24604, df = 1, p-value = 0.6199
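
The assumption for the chi-squared test concerns the expected counts, not the observed ones. A minimal sketch of that check follows, re-entering the four counts from the contingency table printed above by hand; the object name `tab` is introduced here only for illustration. The same check would apply to a table built from `processed`.

```r
# Counts copied from the printed contingency table (rows: gender, cols: treatment)
tab <- matrix(
  c(17, 16,   # placebo:  male, female
    20, 13),  # sulindac: male, female
  nrow = 2,
  dimnames = list(gender = c("male", "female"),
                  treatment = c("placebo", "sulindac"))
)

test <- chisq.test(tab)  # Yates' continuity correction is the default for 2 x 2 tables

# Expected counts under independence: (row total * column total) / grand total
test$expected

# The usual rule of thumb: every expected count should be at least 5
all(test$expected >= 5)
```

If any expected count fell below 5, Fisher's exact test (`fisher.test()`) would be the usual alternative.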

(c) Is your senior’s statement correct? Briefly explain. (2 marks)

Write your answers for Statement 1 part (c) here:
The p-value (0.6199) exceeds the 0.05 significance level, and the test statistic (X-squared = 0.24604) is below the critical value of 3.84 for df = 1 at α = 0.05, so we fail to reject the null hypothesis that gender and treatment group are independent. There is no evidence of an association between gender and treatment group. The senior's statement is therefore supported: males were not preferentially assigned to the Sulindac treatment group.

Statement 2: Compared to the baseline, polyp number 12 months after treatment initiation is significantly reduced by Sulindac. (15 marks)

(a) Your senior created a new column called m12_diff in the clean dataset. The column contains the difference in polyp number 12 months after treatment initiation (calculated as baseline - m12). What could be the rationale for creating this column? (2 marks)

Write your answers for Statement 2 part (a) here: 
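The m12_diff column (baseline - m12) gives each participant's change in polyp number over 12 months, with a positive value indicating a reduction. It condenses the two measurements into a single value per participant, accounts for participants starting from different baseline counts, and provides a direct measure of treatment effect that can be compared between the two groups.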

(b) Conduct the F-test for equal variances using the m12_diff variance values for placebo and Sulindac treatment groups. (4 marks)

# Write your codes for part b here  
var(processed$m12_diff[processed$treatment == "placebo"]) 
## [1] 135.4848
var(processed$m12_diff[processed$treatment == "sulindac"])
## [1] 156.057
# Two-sided F-test of H0: the two groups have equal m12_diff variances
var.test(processed$m12_diff[processed$treatment == "sulindac"],
         processed$m12_diff[processed$treatment == "placebo"])
## 
##  F test to compare two variances
## 
## data:  processed$m12_diff[processed$treatment == "sulindac"] and processed$m12_diff[processed$treatment == "placebo"]
## F = 1.1518, num df = 33, denom df = 32, p-value = 0.6909
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.5709629 2.3148736
## sample estimates:
## ratio of variances 
##           1.151841
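
As a cross-check, the F statistic and its two-sided p-value can be reproduced by hand from the two variances printed above. This is a sketch that re-enters the rounded values from the output, so the results agree with `var.test()` only up to rounding; the variable names are introduced here for illustration.

```r
# Group variances of m12_diff as printed above
var_placebo  <- 135.4848
var_sulindac <- 156.057

# Degrees of freedom as reported by var.test(): num df = 33, denom df = 32
df_num <- 33
df_den <- 32

# F statistic: ratio of the two sample variances (sulindac over placebo)
f_stat <- var_sulindac / var_placebo

# Two-sided p-value: twice the smaller tail probability of the F distribution
p_upper     <- pf(f_stat, df_num, df_den, lower.tail = FALSE)
p_two_sided <- 2 * min(p_upper, 1 - p_upper)

f_stat
p_two_sided
```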

(c) Name a test to be performed with m12_diff values to assess the efficacy of Sulindac. Are the assumptions for the test fulfilled? Briefly explain. (3 marks)

Write your answers for Statement 2 part (c) here:
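A two-sample (independent-samples) t-test comparing mean m12_diff between the placebo and Sulindac groups is appropriate. The assumptions are reasonably met: participants were randomly assigned, so the two groups are independent; each group has more than 30 observations (33 placebo, 34 Sulindac), so the sampling distribution of the group means can be treated as approximately normal by the central limit theorem; and the F-test gives no evidence of unequal variances (p = 0.691 > 0.05).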

(d) Assuming the test assumptions are fulfilled, use the m12_diff values, to perform the above-named test to assess the efficacy of Sulindac. (4 marks)

# Write your codes for part d here  

# Pooled-variance two-sample t-test; var.equal = TRUE follows from the F-test result in (b)
t.test(m12_diff ~ treatment, data = processed, var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  m12_diff by treatment
## t = -6.3754, df = 65, p-value = 2.167e-08
## alternative hypothesis: true difference in means between group placebo and group sulindac is not equal to 0
## 95 percent confidence interval:
##  -24.71546 -12.92447
## sample estimates:
##  mean in group placebo mean in group sulindac 
##              -8.878788               9.941176
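
The t statistic and confidence interval above can be reproduced from the summary values already printed in this section (group means, group variances, and group sizes implied by the F-test degrees of freedom). The sketch below re-enters those rounded values by hand, so it matches the `t.test()` output only up to rounding.

```r
# Summary values copied from the outputs above
mean_placebo  <- -8.878788; var_placebo  <- 135.4848; n_placebo  <- 33
mean_sulindac <-  9.941176; var_sulindac <- 156.057;  n_sulindac <- 34

# Pooled variance: weighted average of the group variances
df_pooled  <- n_placebo + n_sulindac - 2
var_pooled <- ((n_placebo - 1) * var_placebo +
               (n_sulindac - 1) * var_sulindac) / df_pooled

# Standard error of the difference in means
se_diff <- sqrt(var_pooled * (1 / n_placebo + 1 / n_sulindac))

# t statistic (placebo minus sulindac, matching the order used by t.test)
t_stat <- (mean_placebo - mean_sulindac) / se_diff

# 95% confidence interval for the difference in means
ci <- (mean_placebo - mean_sulindac) + c(-1, 1) * qt(0.975, df_pooled) * se_diff

t_stat
ci
```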

(e) Is your senior’s statement correct? Briefly explain. (2 marks)

Write your answers for Statement 2 part (e) here:
The p-value (2.167e-08) is far below the 0.05 significance level, and the test statistic (t = -6.3754) lies beyond the critical value of approximately ±1.997 for df = 65 at α = 0.05, so we reject the null hypothesis of equal mean m12_diff in the two groups. On average, polyp number fell by 9.94 in the Sulindac group and rose by 8.88 in the placebo group (m12_diff = baseline - m12). The senior's statement is therefore supported: polyp number 12 months after treatment initiation is significantly reduced by Sulindac.

Statement 3: ANOVA analysis reliably shows that family history and gender do not significantly influence the baseline polyp number. (16 marks)

(a) Perform an appropriate ANOVA test to assess the effect of family history and gender on baseline polyp number. Display the summary output for the ANOVA test. (4 marks)

# Write your codes for part a here  

# Two-way ANOVA with main effects only (no interaction term)
anova_result <- aov(baseline ~ family_history + gender, data = processed)
summary(anova_result)
##                Df Sum Sq Mean Sq F value Pr(>F)
## family_history  3   2603   867.6   2.147  0.103
## gender          1     52    51.5   0.128  0.722
## Residuals      62  25048   404.0

(b) Based on the ANOVA summary output in (a), what would you conclude about the effect of family history and gender on baseline polyp number? (2 marks)

Write your answers for Statement 3 part (b) here:
The p-value for family history is 0.103 (p > 0.05), and for gender is 0.722 (p > 0.05). Therefore, we fail to reject the null hypothesis for both factors, indicating that neither family history of colorectal cancer nor gender significantly influences the baseline polyp count in this study population.

(c) Generate the residual plots (that are necessary for assessing residuals assumptions) from your ANOVA test. (4 marks)

# Write your codes for part c here 
plot(anova_result, 1, cex = 0.1, main = "Residuals vs Fitted Values")

plot(anova_result, 2, cex = 0.1, main = "Q-Q Plot of Residuals")

(d) Are the residual assumptions for your ANOVA test fulfilled? Explain in relation to the residual plots. (4 marks)

Write your answers for Statement 3 part (d) here:
The Q-Q plot shows the residuals are approximately normally distributed, with only minor deviations in the tails. The Residuals vs Fitted plot demonstrates constant variance (homoscedasticity), with points randomly scattered around zero. As a result, the residual assumptions for the ANOVA test are fulfilled.
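
Visual checks of this kind can be complemented with a numeric test of residual normality. The sketch below is an illustration only: because polyps_clean.csv is not reproduced here, it uses R's built-in `warpbreaks` dataset as a stand-in, and the choice of the Shapiro-Wilk test is an assumption, not a method named in the assignment.

```r
# Stand-in example: additive two-factor ANOVA on the built-in warpbreaks data
fit <- aov(breaks ~ wool + tension, data = warpbreaks)

# Extract the model residuals
res <- residuals(fit)

# Shapiro-Wilk test of H0: the residuals are normally distributed
sw <- shapiro.test(res)
sw

# The same two diagnostic plots as above, drawn side by side
op <- par(mfrow = c(1, 2))
plot(fit, which = 1)  # Residuals vs Fitted: checks constant variance
plot(fit, which = 2)  # Normal Q-Q: checks normality of residuals
par(op)
```

For the polyps analysis, the same calls would be applied to `anova_result`; a small Shapiro-Wilk p-value would indicate a departure from normality that the plots alone might understate.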

(e) Is your senior’s statement correct? Briefly explain. (2 marks)

Write your answers for Statement 3 part (e) here:
The ANOVA results show that neither family history (p = 0.103) nor gender (p = 0.722) has a statistically significant effect on baseline polyp number, as both p-values are greater than 0.05. Furthermore, the residual plots confirm that the ANOVA assumptions are adequately met, making the test results reliable. Therefore, the senior's statement is correct - ANOVA analysis reliably shows that family history and gender do not significantly influence the baseline polyp number.

Statement 4: Baseline polyp number is linearly associated with age as the independent variable (20 marks).

(a) Plot a scatter plot to visualize the above-mentioned relationship between baseline polyp number and age. The plot should have a suitable title, and the x-/y-axes should also be labelled. (2 marks)

# Write your codes for part a here  

processed %>% 
  ggplot(aes(age,
            baseline)) + 
  geom_point() +
  geom_smooth(method = "lm", color = "blue") + 
  labs(title = "Relationship Between Baseline Polyp Number and Participant Age",
       x = "Participant Age",
       y = "Baseline Polyp Number")
## `geom_smooth()` using formula = 'y ~ x'

(b) Generate the linear regression model to represent the relationship between baseline polyp number and age in (a). Display the summary output for the linear regression model. (3 marks)

# Write your codes for part b here 

lm_model <- lm(baseline ~ age, data = processed)
summary(lm_model)
## 
## Call:
## lm(formula = baseline ~ age, data = processed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.583 -13.110  -5.227   5.191  64.047 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  14.8430    10.0607   1.475    0.145
## age           0.2329     0.2305   1.010    0.316
## 
## Residual standard error: 20.48 on 65 degrees of freedom
## Multiple R-squared:  0.01546,    Adjusted R-squared:  0.0003098 
## F-statistic:  1.02 on 1 and 65 DF,  p-value: 0.3162

(c) Assuming the assumptions of linear regression are all met, is your senior’s statement correct based on your outputs in (a) and (b)? Briefly explain. (3 marks)

Write your answers for Statement 4 part (c) here:
Based on the linear regression output, there is no significant linear association between baseline polyp number and age. The p-value for the age coefficient is 0.316, which is greater than the 0.05 significance level. Additionally, the R-squared value of 0.015 indicates that age explains only 1.5% of the variance in baseline polyp numbers. The F-statistic p-value of 0.316 also confirms that the overall model is not statistically significant. Therefore, the data does not support the statement that baseline polyp number is linearly associated with age.

(d) Your senior asked you to further test their statement with data transformation. Identify the appropriate data transformation approaches (if applicable) for baseline polyp number and age. (3 marks)

Hint: Transformation of one variable is independent of the other variable(s) i.e. Different variables can have different transformation approaches.

# Write your codes for part d here  
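
One way to approach part (d) is to inspect each variable's distribution before and after a candidate transformation. The sketch below is a hedged illustration and not a completed answer: it uses only the first 15 rows of the raw printout in the Initialization section, and the log transformation of baseline is an assumption made here as a common first choice for right-skewed counts. The full analysis would use every row of `processed`.

```r
# age and baseline for participants 1 to 15, copied from the raw printout
age      <- c(45, 36, 24, 42, 57, 41, 47, 74, 44, 48, 36, 23, 37, 37, 34)
baseline <- c(7, 42, 4, 64, 23, 35, 11, 12, 7, 38, 84, 8, 20, 11, 24)

# A mean well above the median suggests right skew
summary(baseline)
summary(age)

# Candidate transformation (assumption): natural log of baseline.
# All baseline values here are positive, so log() is defined for each one.
log_baseline <- log(baseline)

# Compare the distributions before and after transformation
op <- par(mfrow = c(1, 3))
hist(baseline, main = "baseline")
hist(log_baseline, main = "log(baseline)")
hist(age, main = "age")
par(op)

# Optional numeric check of normality for each version
shapiro.test(baseline)
shapiro.test(log_baseline)
shapiro.test(age)
```

A variable whose histogram already looks roughly symmetric may not need any transformation, which is consistent with the hint that each variable is assessed independently.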

(e) Perform the visualization and analyses steps needed to generate and assess the new linear regression model. (4 marks)

Hint: Remember to also check the fulfillment of assumptions when assessing the model.

# Write your codes for part e here  
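
The steps for part (e) could follow the same pattern as parts (a) and (b), applied to the transformed variable. The sketch below is illustrative only: it again uses the first 15 rows of the raw printout, assumes a log transformation of baseline with age left untransformed, and introduces the names `subset15` and `fit_log` for this example. No conclusion about the senior's statement should be drawn from this subset.

```r
# age and baseline for participants 1 to 15, copied from the raw printout
subset15 <- data.frame(
  age      = c(45, 36, 24, 42, 57, 41, 47, 74, 44, 48, 36, 23, 37, 37, 34),
  baseline = c(7, 42, 4, 64, 23, 35, 11, 12, 7, 38, 84, 8, 20, 11, 24)
)

# Scatter plot of the transformed response against age
plot(subset15$age, log(subset15$baseline),
     main = "log(Baseline Polyp Number) vs Participant Age",
     xlab = "Participant Age",
     ylab = "log(Baseline Polyp Number)")

# Linear regression on the transformed response (assumed transformation)
fit_log <- lm(log(baseline) ~ age, data = subset15)
abline(fit_log)
summary(fit_log)

# Residual diagnostics for the new model
op <- par(mfrow = c(1, 2))
plot(fit_log, which = 1)  # Residuals vs Fitted: linearity and constant variance
plot(fit_log, which = 2)  # Normal Q-Q: normality of residuals
par(op)
```

With the full dataset, `data = processed` would replace `subset15`, and the age coefficient's p-value together with the residual plots would inform the answer to part (f).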

(f) Justify your choice of transformation approach(es). Based on your outputs relating to the new linear regression model in (e), is your senior’s statement correct? Briefly explain. (5 marks)

Write your answers for Statement 4 part (f) here: