Student name: Xu King Chun,Jethro
Student UID: 3036401107

Instructions:

Scenario:

You are a research assistant in a laboratory specializing in gastrointestinal cancer research. Your senior, with an interest in colorectal cancers, has recently collected some data from endoscopy patients. The raw dataset (“polyps.csv”) has been passed to you for analysis. Complete this assignment using this dataset.

Background

  • Polyp formation is the development of abnormal tissue growths in the lining of the colon or rectum. While most polyps are benign, some can progress to colorectal cancer if left untreated. Endoscopy is currently widely used to visualize and, if necessary, excise the polyps.
  • Colorectal cancer can be attributed to genetic and lifestyle factors. Patients with a family history of colorectal cancer are advised to undergo regular endoscopy screenings for polyps, starting at an age earlier than the general population.
  • Your senior would like to further investigate the effectiveness of a preventive treatment with Sulindac, a nonsteroidal pain reliever, in reducing polyp numbers. They also wish to determine the potential factors contributing to polyposis and the risk of developing colorectal cancer.

Experimental approach

  • 70 patients were randomly assigned into 2 treatment groups (Randomized controlled trial: placebo control, and the Sulindac treatment).
  • Polyp numbers were measured by endoscopy at the start of the clinical trial, 3 months, and 12 months after initiation of the treatment.
  • It is known that the sampling is reliable and independent.

Dataset information

‘polyps.csv’ has 70 independent observations and 9 variables:

  • participant_ID: One ID for each participant.
  • on_time: Whether the participant arrived on time for the endoscopies (1 for yes, 0 for no). Unclear why this variable was recorded. Timeliness will not be studied in this investigation.
  • gender: Gender of participant (in no particular order: male, female).
  • age: Age of participant.
  • family_history: Family history of colorectal cancer by degrees (in the order of first < second < third < none).
    • first degree: The participant’s immediate family i.e. parents, siblings, and children.
    • second degree: The participant’s grandparents, grandchildren, uncles, aunts, nephews, nieces, and half-siblings.
    • third degree: The participant’s great-grandparents, great grandchildren, great uncles/aunts, and first cousins.
    • none: No known relatives with colorectal cancer.
    • In the event that a participant has multiple degrees of colorectal cancer family history, only the more immediate degree is recorded e.g. If a participant has colorectal cancer family history in both the first and third degrees, only the first degree is recorded.
  • baseline: Polyp number measured by endoscopy prior to initiating treatment.
  • treatment: Treatment group the participant was assigned to (in the order of placebo, sulindac).
  • m3: Polyp number measured by endoscopy 3 months after initiating treatment.
  • m12: Polyp number measured by endoscopy 12 months after initiating treatment.

Initialization (2 marks):

This step sets up a proper R environment. Write scripts to:

# Write your codes for the "initialization" section here
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
polyps = read.csv("polyps.csv", header = TRUE, sep = ",")

Data pre-processing (14 marks):

This section prepares the data for statistical analyses and data visualization. All codes used have been covered in our R workshops up until 2025-11-11 (Tue). Write scripts to:

(a) Convert appropriate variables into factors with the factor level orders specified in the ‘Dataset information’ section. (3 marks)

# Write your codes for part a here


polyps$gender = factor(polyps$gender,
                       levels = c("male", "female"))

polyps$family_history = factor(polyps$family_history,
                               levels = c("first", "second", "third", "none"),
                               ordered = TRUE)

polyps$treatment = factor(polyps$treatment,
                          levels = c("placebo", "sulindac"),
                          ordered = TRUE)
summary(polyps)
##  participant_id     on_time          gender        age        family_history
##  Min.   : 1.00   Min.   :0.0000   male  :37   Min.   :18.00   first :17     
##  1st Qu.:18.25   1st Qu.:0.5000   female:31   1st Qu.:36.00   second:17     
##  Median :35.50   Median :1.0000   NA's  : 2   Median :42.00   third :16     
##  Mean   :35.50   Mean   :0.7143               Mean   :41.93   none  :17     
##  3rd Qu.:52.75   3rd Qu.:1.0000               3rd Qu.:48.00   NA's  : 3     
##  Max.   :70.00   Max.   :1.0000               Max.   :74.00                 
##                  NA's   :63                   NA's   :1                     
##     baseline        treatment        m3             m12       
##  Min.   : 4.00   placebo :35   Min.   : 1.00   Min.   : 1.00  
##  1st Qu.:11.00   sulindac:35   1st Qu.: 9.00   1st Qu.: 7.25  
##  Median :18.00                 Median :18.00   Median :17.00  
##  Mean   :24.09                 Mean   :23.72   Mean   :24.12  
##  3rd Qu.:26.25                 3rd Qu.:30.50   3rd Qu.:37.75  
##  Max.   :91.00                 Max.   :97.00   Max.   :97.00  
##                                NA's   :3       NA's   :4

(b) Further determine if there are missing values (NA) in the dataset and process them appropriately. The processed dataset should be assigned to the variable polyps_complete, and should contain no missing values. (7 marks)

Hint: For data column(s) that will not be investigated in this study but have a large proportion of NA values, you can consider excluding the entire column(s) from polyps_complete.

# Write your codes for part b here 

sum(is.na(polyps))
## [1] 76
colSums(is.na(polyps))
## participant_id        on_time         gender            age family_history 
##              0             63              2              1              3 
##       baseline      treatment             m3            m12 
##              0              0              3              4
polyps_complete = polyps %>%
  filter(!is.na(gender))

polyps_complete = polyps_complete %>%
  filter(!is.na(family_history))

polyps_complete = polyps_complete %>%
  select(!on_time)

polyps_complete %>% 
  ggplot(aes(m3)) +
  geom_histogram(bins = 30)
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_bin()`).

polyps_complete$m3 = replace(polyps_complete$m3,
                             is.na(polyps_complete$m3),
                             median(polyps_complete$m3, na.rm = TRUE))

polyps_complete %>% 
  ggplot(aes(m12)) +
  geom_histogram(bins = 30)
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_bin()`).

polyps_complete$m12 = replace(polyps_complete$m12,
                              is.na(polyps_complete$m12),
                              median(polyps_complete$m12, na.rm = TRUE))

polyps_complete %>% 
  ggplot(aes(age)) +
  geom_histogram(bins = 30)

polyps_complete$age = replace(polyps_complete$age,
                              is.na(polyps_complete$age),
                              mean(polyps_complete$age, na.rm = TRUE))

colSums(is.na(polyps_complete))
## participant_id         gender            age family_history       baseline 
##              0              0              0              0              0 
##      treatment             m3            m12 
##              0              0              0
summary(polyps_complete)
##  participant_id     gender        age        family_history    baseline    
##  Min.   : 1.00   male  :37   Min.   :18.00   first :17      Min.   : 4.00  
##  1st Qu.:17.25   female:29   1st Qu.:36.00   second:16      1st Qu.:11.00  
##  Median :35.50               Median :42.00   third :16      Median :18.50  
##  Mean   :35.56               Mean   :42.02   none  :17      Mean   :24.79  
##  3rd Qu.:53.75               3rd Qu.:48.00                  3rd Qu.:32.25  
##  Max.   :70.00               Max.   :74.00                  Max.   :91.00  
##     treatment        m3             m12       
##  placebo :33   Min.   : 1.00   Min.   : 1.00  
##  sulindac:33   1st Qu.: 9.25   1st Qu.: 8.00  
##                Median :18.00   Median :17.00  
##                Mean   :23.80   Mean   :23.85  
##                3rd Qu.:30.00   3rd Qu.:36.00  
##                Max.   :97.00   Max.   :97.00

(c) Explain and justify your approach for handling NA values in (b). (4 marks)

Write your answers for part (c) here: The entire 'on time' column is deleted as there is so many NAs and timeliness will not be studied in this investigation. The number of NAs in gender and family_history are is small, so I omit those NAs completely. For NAs in m3 and m12, i replaec them with median as their histogram show a righ-skewed pattern. For the age i replace the NA with mean is the data distribution is roughly symmetric

Statistical analysis:

While you had a go at preprocessing the data, your senior prepared a clean dataset (’polyps_clean.csv’) for you to analyze. They took a quick look at the dataset and made a few statements. In this section, you will conduct appropriate statistical analyses on polyps_clean.csv to prove/disprove their statements.

Compulsory step: Run the following code chunk to import your senior’s clean dataset as processed and preview the first 6 rows of data.

# importing the clean dataset
processed = read.csv("polyps_clean.csv")

# setting factor levels for the clean dataset
processed$gender = factor(processed$gender, levels = c("male", "female"))
processed$treatment = factor(processed$treatment, levels = c("placebo", "sulindac"))
processed$family_history = factor(processed$family_history, 
                               levels = c("first", "second", "third", "none"))

# previewing the first 6 rows of the clean dataset
head(processed)
##   participant_id gender age family_history baseline treatment m3 m12 m12_diff
## 1              1 female  45          third        7  sulindac  4   2        5
## 2              2 female  36          first       42   placebo 39  22       20
## 3              3   male  24           none        4  sulindac  1   1        3
## 4              4 female  42           none       64   placebo 68  93      -29
## 5              5   male  57          third       23  sulindac 16  17        6
## 6              6 female  41          first       35   placebo 42  61      -26
# checking to confirm there are no NAs in the dataset
sum(is.na(processed))
## [1] 0
colSums(is.na(processed))
## participant_id         gender            age family_history       baseline 
##              0              0              0              0              0 
##      treatment             m3            m12       m12_diff 
##              0              0              0              0

Statement 1: Males were not preferentially assigned to the Sulindac treatment group. (8 marks)

(a) Plot a bar plot to visualize the gender distribution in the study groups. The plot should have a suitable title, and the x-/y-axes should also be labelled. (2 marks)

# Write your codes for part a here  
processed %>%
  ggplot(aes(x = treatment, fill = gender)) +
  geom_bar(position = "fill") +
  labs (title = "Gender distribution by treatment group",
        x = "Treatment group",
        y = "Proportion of participants")

(b) Choose and conduct the appropriate test to test this statement. Display the summary output for the test. (4 marks)

Hint: Remember to also check the fulfillment of assumptions when running the test.

# Write your codes for part b here  

gender_treatment = table(processed$gender, processed$treatment)
gender_treatment
##         
##          placebo sulindac
##   male        18       20
##   female      15       14
gender_treatment_chisq = chisq.test(gender_treatment)
gender_treatment_chisq
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  gender_treatment
## X-squared = 0.011393, df = 1, p-value = 0.915
gender_treatment_chisq$observed 
##         
##          placebo sulindac
##   male        18       20
##   female      15       14
gender_treatment_chisq$expected
##         
##           placebo sulindac
##   male   18.71642 19.28358
##   female 14.28358 14.71642

(c) Is your senior’s statement correct? Briefly explain. (2 marks)

Write your answers for Statement 1 part (c) here: Correct, the p-value = 0.915 >0.05, we fail to reject the null hypothesis, meaning there is no significant association between gender and treatment assignment at the 5% significance level.

Statement 2: Compared to the baseline, polyp number 12 months after treatment initiation is significantly reduced by Sulindac. (15 marks)

(a) Your senior created a new column called m12_diff in the clean dataset. The column contains the difference in polyp number 12 months after treatment initiation (calculated as baseline - m12). What could be the rationale for creating this column? (2 marks)

Write your answers for Statement 2 part (a) here: The column summarises each participant’s change in polyp number over 12 months into a single quantitative outcome. A positive value indicates a reduction in polyps (improvement), while a negative value indicates an increase. This makes it easy to interpret the treatment's effect. This calculated difference can be further used to perform statistical tests such as paired t-test to determine if the observed change is significant

(b) Conduct the F-test for equal variances using the m12_diff variance values for placebo and Sulindac treatment groups. (4 marks)

# Write your codes for part b here  
var(processed$m12_diff[processed$treatment == "placebo"])
## [1] 135.4848
var(processed$m12_diff[processed$treatment == "sulindac"])
## [1] 156.057
var.test(processed$m12_diff[processed$treatment == "sulindac"],
         processed$m12_diff[processed$treatment == "placebo"],
         alternative = c("greater"))
## 
##  F test to compare two variances
## 
## data:  processed$m12_diff[processed$treatment == "sulindac"] and processed$m12_diff[processed$treatment == "placebo"]
## F = 1.1518, num df = 33, denom df = 32, p-value = 0.3454
## alternative hypothesis: true ratio of variances is greater than 1
## 95 percent confidence interval:
##  0.6403017       Inf
## sample estimates:
## ratio of variances 
##           1.151841

(c) Name a test to be performed with m12_diff values to assess the efficacy of Sulindac. Are the assumptions for the test fulfilled? Briefly explain. (3 marks)

Write your answers for Statement 2 part (c) here: two-sample t-test, the assumptions are fulfilled: as with n>30 for each group,each group are approximated to be normal disrtibution by central limit theorem, and they have equal variance is the p-value of F-test for equal variance =0.3454>0.05, there is no strong statistical evidence that the variance differ

(d) Assuming the test assumptions are fulfilled, use the m12_diff values, to perform the above-named test to assess the efficacy of Sulindac. (4 marks)

# Write your codes for part d here  

t_m12diff = t.test(m12_diff ~ treatment,
                   data = processed,
                   alternative = "less",
                   var.equal   = TRUE)

t_m12diff
## 
##  Two Sample t-test
## 
## data:  m12_diff by treatment
## t = -6.3754, df = 65, p-value = 1.084e-08
## alternative hypothesis: true difference in means between group placebo and group sulindac is less than 0
## 95 percent confidence interval:
##      -Inf -13.8942
## sample estimates:
##  mean in group placebo mean in group sulindac 
##              -8.878788               9.941176

(e) Is your senior’s statement correct? Briefly explain. (2 marks)

Write your answers for Statement 2 part (e) here:the p value of two sample t test=10.084e-08<0.05,so we reject the null hypothesis of no difference in mean m12_diff between the groups and accept the alternative that the mean m12_diff is greater in the Sulindac group. The 95% confidence interval for the difference (placebo − Sulindac) is (−∞, −13.9), entirely below 0, which also indicates that Sulindac produces a larger reduction in polyp number; therefore, the senior’s statement is supported to be correct.

Statement 3: ANOVA analysis reliably shows that family history and gender do not significantly influence the baseline polyp number. (16 marks)

(a) Perform an appropriate ANOVA test to assess the effect of family history and gender on baseline polyp number. Display the summary output for the ANOVA test. (4 marks)

# Write your codes for part a here  
baseline_famhist_gender.aov = aov(
  baseline ~ family_history * gender,
  data = processed)

summary(baseline_famhist_gender.aov)
##                       Df Sum Sq Mean Sq F value Pr(>F)
## family_history         3   2603   867.6   2.091  0.111
## gender                 1     52    51.5   0.124  0.726
## family_history:gender  3    573   191.0   0.460  0.711
## Residuals             59  24475   414.8

(b) Based on the ANOVA summary output in (a), what would you conclude about the effect of family history and gender on baseline polyp number? (2 marks)

Write your answers for Statement 3 part (b) here:
All three p-values for the effects are > 0.05. There is no statistical evidence that baseline polyp number differs by family history.There is no statistical evidence that baseline polyp number differs by gender.There is no statistical evidence of an interaction between family history and gender.

(c) Generate the residual plots (that are necessary for assessing residuals assumptions) from your ANOVA test. (4 marks)

# Write your codes for part c here 

plot(baseline_famhist_gender.aov, 1, cex = 0.8) 

plot(baseline_famhist_gender.aov, 2, cex = 0.1)

(d) Are the residual assumptions for your ANOVA test fulfilled? Explain in relation to the residual plots. (4 marks)

Write your answers for Statement 3 part (d) here:No,the residual do not have equal variance as there are some obvious influential points such as points labeled as 22 and 63 in the 'Residual v.s. Fitted'. Also the QQ plot also shows deviation from the diagonal straight line at both left end and right end so the residuals are not normally distributed as well

(e) Is your senior’s statement correct? Briefly explain. (2 marks)

Write your answers for Statement 3 part (e) here:Incorrect, as the assumption of anova is not fulfilled, so even the anova result show hows that family history and gender do not significantly influence the baseline polyp number, the test result is not reliable.

Statement 4: Baseline polyp number is linearly associated with age as the independent variable (20 marks).

(a) Plot a scatter plot to visualize the above-mentioned relationship between baseline polyp number and age. The plot should have a suitable title, and the x-/y-axes should also be labelled. (2 marks)

# Write your codes for part a here  
processed %>%
  ggplot(aes(x = age, y = baseline)) +
  geom_point(size=0.5) +
  labs( title = "Baseline polyp number vs age", x = "Age (years)", y = "Baseline polyp count")

(b) Generate the linear regression model to represent the relationship between baseline polyp number and age in (a). Display the summary output for the linear regression model. (3 marks)

# Write your codes for part b here  
baseline_age.lm = lm(baseline ~ age, data = processed)
summary(baseline_age.lm)
## 
## Call:
## lm(formula = baseline ~ age, data = processed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.583 -13.110  -5.227   5.191  64.047 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  14.8430    10.0607   1.475    0.145
## age           0.2329     0.2305   1.010    0.316
## 
## Residual standard error: 20.48 on 65 degrees of freedom
## Multiple R-squared:  0.01546,    Adjusted R-squared:  0.0003098 
## F-statistic:  1.02 on 1 and 65 DF,  p-value: 0.3162

(c) Assuming the assumptions of linear regression are all met, is your senior’s statement correct based on your outputs in (a) and (b)? Briefly explain. (3 marks)

Write your answers for Statement 4 part (c) here:Not correct.In (a),the scatter plot shows no clear linear trend between age and baseline polyp number as the counts are very spread out at all ages.From (b), the p-value for age is 0.3162 which is much larger than 0.05,showing no statisticaly signifcant between age and baseline polyp level number, and with adjusted R-squared = 0.0003098, meaning the model explains almost none of the variation.

(d) Your senior asked you to further test their statement with data transformation. Identify the appropriate data transformation approaches (if applicable) for baseline polyp number and age. (3 marks)

Hint: Transformation of one variable is independent of the other variable(s) i.e. Different variables can have different transformation approaches.

# Write your codes for part d here

# Untransformed
processed %>% ggplot(aes(baseline)) +
  geom_histogram(bins = 200) +
  labs(title = "Distribution of baseline polyp count",
       x = "Baseline polyp count",
       y = "Frequency")

# Log transformation
processed %>% ggplot(aes(log(baseline))) +
  geom_histogram(bins = 200) +
  labs(title = "Distribution of log(baseline)",
       x = "log(baseline)",
       y = "Frequency")

# Untransformed
processed %>% ggplot(aes(age)) +
  geom_histogram(bins = 200) +
  labs(title = "Distribution of age",
       x = "Age (years)",
       y = "Frequency")

(e) Perform the visualization and analyses steps needed to generate and assess the new linear regression model. (4 marks)

Hint: Remember to also check the fulfillment of assumptions when assessing the model.

# Write your codes for part e here  


processed %>%
  ggplot(aes(age,
             log(baseline))) +
  geom_point(size = 0.5)

log_baseline_age_lm = lm(log(baseline) ~ age,
                         data = processed)
summary(log_baseline_age_lm)
## 
## Call:
## lm(formula = log(baseline) ~ age, data = processed)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.33490 -0.58349  0.03342  0.46006  1.57387 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.449692   0.360713   6.791 4.05e-09 ***
## age         0.011313   0.008266   1.369    0.176    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7344 on 65 degrees of freedom
## Multiple R-squared:  0.02801,    Adjusted R-squared:  0.01306 
## F-statistic: 1.873 on 1 and 65 DF,  p-value: 0.1758
qqnorm(resid(log_baseline_age_lm), main = "QQ-Plot of Residuals")
qqline(resid(log_baseline_age_lm), col = "red")

plot(predict(log_baseline_age_lm), resid(log_baseline_age_lm))

(f) Justify your choice of transformation approach(es). Based on your outputs relating to the new linear regression model in (e), is your senior’s statement correct? Briefly explain. (5 marks)

Write your answers for Statement 4 part (f) here: For baseline polyp number, the untransformed histogram in (d) showed a strong right-skew, so i choose to log-transform the graph after which the distribution looked much closer to normal and the QQ-plot of residuals in (e) lay roughly on the straight line, and the residuals–vs–fitted plot showed equal variance.This suggests that the log transformation is appropriate and improves the linear regression assumptions.For age, the original histogram was already reasonably symmetric and can be regraded as normally distribution, therefore no further transformation approach is needed. In the new model log(baseline) ~ age, the estimated slope for age is small (≈ 0.011) and not statistically significant (p ≈ 0.18<0.05), and the adjuste) R-squared remains very low and lose to 0 (0.01306) , meaning age explains only a tiny proportion of the variation in log baseline polyp number. Even after using an appropriate transformation and checking that the model assumptions are reasonably met, there is no strong evidence of a linear association between baseline polyp number and age. Therefore, the statement is not correct.