Student name: Xu King Chun,Jethro
Student UID: 3036401107
This assignment is due on 2025-11-25 (Tuesday) 23:59, and is out of 60 marks (15% of final grade).
Analysis documentation (recommended):
A key part of experimental data analysis (and coding) is documentation and readability. You are encouraged to annotate your code to explain what each line of code is intended to do (refer to our R workshops codes for example).
Data is available on Moodle as “polyps.csv”.
All statistical tests, unless otherwise stated, should be assessed at a significance level of 0.05.
All your code, output, and answers to questions should be inputted into this single R markdown (.Rmd) file. All your code and text input should be made within the appropriate code/text cells.
Submission guidelines:
Please rename your completed R markdown file as “yourUID-yourName.Rmd”, and knit your Rmd file to the HTML format by clicking the “Knit” button at the top of the code editor.
You should upload the HTML file for submission to Moodle.
Guidelines on GenAI usage:
Use of GenAI is acceptable for code generation and troubleshooting for this assignment, but not for data interpretation.
In code chunks where GenAI is used, comment with the
# prefix to 1) specify which GenAI model you have used, and
2) your prompt.
You are a research assistant in a laboratory specializing in gastrointestinal cancer research. Your senior, with an interest in colorectal cancers, has recently collected some data from endoscopy patients. The raw dataset (“polyps.csv”) has been passed to you for analysis. Complete this assignment using this dataset.
‘polyps.csv’ has 70 independent observations and 9 variables:
1 for yes, 0 for no). Unclear why
this variable was recorded. Timeliness will not be studied in this
investigation.male, female).first < second
< third < none).
first degree: The participant’s immediate family
i.e. parents, siblings, and children.second degree: The participant’s grandparents,
grandchildren, uncles, aunts, nephews, nieces, and half-siblings.third degree: The participant’s great-grandparents,
great grandchildren, great uncles/aunts, and first cousins.none: No known relatives with colorectal cancer.first and third degrees, only the
first degree is recorded.placebo, sulindac).This step sets up a proper R environment. Write scripts to:
Install (if needed) and load the tidyverse.
Import the dataset ‘polyps.csv’ and assign it to the
variable polyps.
# Write your codes for the "initialization" section here
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
polyps = read.csv("polyps.csv", header = TRUE, sep = ",")
This section prepares the data for statistical analyses and data visualization. All codes used have been covered in our R workshops up until 2025-11-11 (Tue). Write scripts to:
(a) Convert appropriate variables into factors with the factor level orders specified in the ‘Dataset information’ section. (3 marks)
# Write your codes for part a here
polyps$gender = factor(polyps$gender,
levels = c("male", "female"))
polyps$family_history = factor(polyps$family_history,
levels = c("first", "second", "third", "none"),
ordered = TRUE)
polyps$treatment = factor(polyps$treatment,
levels = c("placebo", "sulindac"),
ordered = TRUE)
summary(polyps)
## participant_id on_time gender age family_history
## Min. : 1.00 Min. :0.0000 male :37 Min. :18.00 first :17
## 1st Qu.:18.25 1st Qu.:0.5000 female:31 1st Qu.:36.00 second:17
## Median :35.50 Median :1.0000 NA's : 2 Median :42.00 third :16
## Mean :35.50 Mean :0.7143 Mean :41.93 none :17
## 3rd Qu.:52.75 3rd Qu.:1.0000 3rd Qu.:48.00 NA's : 3
## Max. :70.00 Max. :1.0000 Max. :74.00
## NA's :63 NA's :1
## baseline treatment m3 m12
## Min. : 4.00 placebo :35 Min. : 1.00 Min. : 1.00
## 1st Qu.:11.00 sulindac:35 1st Qu.: 9.00 1st Qu.: 7.25
## Median :18.00 Median :18.00 Median :17.00
## Mean :24.09 Mean :23.72 Mean :24.12
## 3rd Qu.:26.25 3rd Qu.:30.50 3rd Qu.:37.75
## Max. :91.00 Max. :97.00 Max. :97.00
## NA's :3 NA's :4
(b) Further determine if there are missing values
(NA) in the dataset and process them appropriately. The
processed dataset should be assigned to the variable
polyps_complete, and should contain no missing values. (7
marks)
Hint: For data column(s) that will not be investigated in
this study but have a large proportion of NA values, you can consider
excluding the entire column(s) from
polyps_complete.
# Write your codes for part b here
sum(is.na(polyps))
## [1] 76
colSums(is.na(polyps))
## participant_id on_time gender age family_history
## 0 63 2 1 3
## baseline treatment m3 m12
## 0 0 3 4
polyps_complete = polyps %>%
filter(!is.na(gender))
polyps_complete = polyps_complete %>%
filter(!is.na(family_history))
polyps_complete = polyps_complete %>%
select(!on_time)
polyps_complete %>%
ggplot(aes(m3)) +
geom_histogram(bins = 30)
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_bin()`).
polyps_complete$m3 = replace(polyps_complete$m3,
is.na(polyps_complete$m3),
median(polyps_complete$m3, na.rm = TRUE))
polyps_complete %>%
ggplot(aes(m12)) +
geom_histogram(bins = 30)
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_bin()`).
polyps_complete$m12 = replace(polyps_complete$m12,
is.na(polyps_complete$m12),
median(polyps_complete$m12, na.rm = TRUE))
polyps_complete %>%
ggplot(aes(age)) +
geom_histogram(bins = 30)
polyps_complete$age = replace(polyps_complete$age,
is.na(polyps_complete$age),
mean(polyps_complete$age, na.rm = TRUE))
colSums(is.na(polyps_complete))
## participant_id gender age family_history baseline
## 0 0 0 0 0
## treatment m3 m12
## 0 0 0
summary(polyps_complete)
## participant_id gender age family_history baseline
## Min. : 1.00 male :37 Min. :18.00 first :17 Min. : 4.00
## 1st Qu.:17.25 female:29 1st Qu.:36.00 second:16 1st Qu.:11.00
## Median :35.50 Median :42.00 third :16 Median :18.50
## Mean :35.56 Mean :42.02 none :17 Mean :24.79
## 3rd Qu.:53.75 3rd Qu.:48.00 3rd Qu.:32.25
## Max. :70.00 Max. :74.00 Max. :91.00
## treatment m3 m12
## placebo :33 Min. : 1.00 Min. : 1.00
## sulindac:33 1st Qu.: 9.25 1st Qu.: 8.00
## Median :18.00 Median :17.00
## Mean :23.80 Mean :23.85
## 3rd Qu.:30.00 3rd Qu.:36.00
## Max. :97.00 Max. :97.00
(c) Explain and justify your approach for handling NA values in (b). (4 marks)
Write your answers for part (c) here: The entire 'on time' column is deleted as there is so many NAs and timeliness will not be studied in this investigation. The number of NAs in gender and family_history are is small, so I omit those NAs completely. For NAs in m3 and m12, i replaec them with median as their histogram show a righ-skewed pattern. For the age i replace the NA with mean is the data distribution is roughly symmetric
While you had a go at preprocessing the data, your senior prepared a
clean dataset (’polyps_clean.csv’) for you to analyze. They
took a quick look at the dataset and made a few statements. In this
section, you will conduct appropriate statistical analyses on
polyps_clean.csv to prove/disprove their statements.
processed and preview the first 6 rows of
data.# importing the clean dataset
processed = read.csv("polyps_clean.csv")
# setting factor levels for the clean dataset
processed$gender = factor(processed$gender, levels = c("male", "female"))
processed$treatment = factor(processed$treatment, levels = c("placebo", "sulindac"))
processed$family_history = factor(processed$family_history,
levels = c("first", "second", "third", "none"))
# previewing the first 6 rows of the clean dataset
head(processed)
## participant_id gender age family_history baseline treatment m3 m12 m12_diff
## 1 1 female 45 third 7 sulindac 4 2 5
## 2 2 female 36 first 42 placebo 39 22 20
## 3 3 male 24 none 4 sulindac 1 1 3
## 4 4 female 42 none 64 placebo 68 93 -29
## 5 5 male 57 third 23 sulindac 16 17 6
## 6 6 female 41 first 35 placebo 42 61 -26
# checking to confirm there are no NAs in the dataset
sum(is.na(processed))
## [1] 0
colSums(is.na(processed))
## participant_id gender age family_history baseline
## 0 0 0 0 0
## treatment m3 m12 m12_diff
## 0 0 0 0
(a) Plot a bar plot to visualize the gender distribution in the study groups. The plot should have a suitable title, and the x-/y-axes should also be labelled. (2 marks)
# Write your codes for part a here
processed %>%
ggplot(aes(x = treatment, fill = gender)) +
geom_bar(position = "fill") +
labs (title = "Gender distribution by treatment group",
x = "Treatment group",
y = "Proportion of participants")
(b) Choose and conduct the appropriate test to test this statement. Display the summary output for the test. (4 marks)
Hint: Remember to also check the fulfillment of assumptions when running the test.
# Write your codes for part b here
gender_treatment = table(processed$gender, processed$treatment)
gender_treatment
##
## placebo sulindac
## male 18 20
## female 15 14
gender_treatment_chisq = chisq.test(gender_treatment)
gender_treatment_chisq
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: gender_treatment
## X-squared = 0.011393, df = 1, p-value = 0.915
gender_treatment_chisq$observed
##
## placebo sulindac
## male 18 20
## female 15 14
gender_treatment_chisq$expected
##
## placebo sulindac
## male 18.71642 19.28358
## female 14.28358 14.71642
(c) Is your senior’s statement correct? Briefly explain. (2 marks)
Write your answers for Statement 1 part (c) here: Correct, the p-value = 0.915 >0.05, we fail to reject the null hypothesis, meaning there is no significant association between gender and treatment assignment at the 5% significance level.
(a) Your senior created a new column called
m12_diff in the clean dataset. The column contains the
difference in polyp number 12 months after treatment initiation
(calculated as baseline - m12). What could be the rationale
for creating this column? (2 marks)
Write your answers for Statement 2 part (a) here: The column summarises each participant’s change in polyp number over 12 months into a single quantitative outcome. A positive value indicates a reduction in polyps (improvement), while a negative value indicates an increase. This makes it easy to interpret the treatment's effect. This calculated difference can be further used to perform statistical tests such as paired t-test to determine if the observed change is significant
(b) Conduct the F-test for equal variances using the
m12_diff variance values for placebo and Sulindac treatment
groups. (4 marks)
# Write your codes for part b here
var(processed$m12_diff[processed$treatment == "placebo"])
## [1] 135.4848
var(processed$m12_diff[processed$treatment == "sulindac"])
## [1] 156.057
var.test(processed$m12_diff[processed$treatment == "sulindac"],
processed$m12_diff[processed$treatment == "placebo"],
alternative = c("greater"))
##
## F test to compare two variances
##
## data: processed$m12_diff[processed$treatment == "sulindac"] and processed$m12_diff[processed$treatment == "placebo"]
## F = 1.1518, num df = 33, denom df = 32, p-value = 0.3454
## alternative hypothesis: true ratio of variances is greater than 1
## 95 percent confidence interval:
## 0.6403017 Inf
## sample estimates:
## ratio of variances
## 1.151841
(c) Name a test to be performed with
m12_diff values to assess the efficacy of Sulindac. Are the
assumptions for the test fulfilled? Briefly explain. (3 marks)
Write your answers for Statement 2 part (c) here: two-sample t-test, the assumptions are fulfilled: as with n>30 for each group,each group are approximated to be normal disrtibution by central limit theorem, and they have equal variance is the p-value of F-test for equal variance =0.3454>0.05, there is no strong statistical evidence that the variance differ
(d) Assuming the test assumptions are fulfilled, use
the m12_diff values, to perform the above-named test to
assess the efficacy of Sulindac. (4 marks)
# Write your codes for part d here
t_m12diff = t.test(m12_diff ~ treatment,
data = processed,
alternative = "less",
var.equal = TRUE)
t_m12diff
##
## Two Sample t-test
##
## data: m12_diff by treatment
## t = -6.3754, df = 65, p-value = 1.084e-08
## alternative hypothesis: true difference in means between group placebo and group sulindac is less than 0
## 95 percent confidence interval:
## -Inf -13.8942
## sample estimates:
## mean in group placebo mean in group sulindac
## -8.878788 9.941176
(e) Is your senior’s statement correct? Briefly explain. (2 marks)
Write your answers for Statement 2 part (e) here:the p value of two sample t test=10.084e-08<0.05,so we reject the null hypothesis of no difference in mean m12_diff between the groups and accept the alternative that the mean m12_diff is greater in the Sulindac group. The 95% confidence interval for the difference (placebo − Sulindac) is (−∞, −13.9), entirely below 0, which also indicates that Sulindac produces a larger reduction in polyp number; therefore, the senior’s statement is supported to be correct.
(a) Perform an appropriate ANOVA test to assess the effect of family history and gender on baseline polyp number. Display the summary output for the ANOVA test. (4 marks)
# Write your codes for part a here
baseline_famhist_gender.aov = aov(
baseline ~ family_history * gender,
data = processed)
summary(baseline_famhist_gender.aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## family_history 3 2603 867.6 2.091 0.111
## gender 1 52 51.5 0.124 0.726
## family_history:gender 3 573 191.0 0.460 0.711
## Residuals 59 24475 414.8
(b) Based on the ANOVA summary output in (a), what would you conclude about the effect of family history and gender on baseline polyp number? (2 marks)
Write your answers for Statement 3 part (b) here:
All three p-values for the effects are > 0.05. There is no statistical evidence that baseline polyp number differs by family history.There is no statistical evidence that baseline polyp number differs by gender.There is no statistical evidence of an interaction between family history and gender.
(c) Generate the residual plots (that are necessary for assessing residuals assumptions) from your ANOVA test. (4 marks)
# Write your codes for part c here
plot(baseline_famhist_gender.aov, 1, cex = 0.8)
plot(baseline_famhist_gender.aov, 2, cex = 0.1)
(d) Are the residual assumptions for your ANOVA test fulfilled? Explain in relation to the residual plots. (4 marks)
Write your answers for Statement 3 part (d) here:No,the residual do not have equal variance as there are some obvious influential points such as points labeled as 22 and 63 in the 'Residual v.s. Fitted'. Also the QQ plot also shows deviation from the diagonal straight line at both left end and right end so the residuals are not normally distributed as well
(e) Is your senior’s statement correct? Briefly explain. (2 marks)
Write your answers for Statement 3 part (e) here:Incorrect, as the assumption of anova is not fulfilled, so even the anova result show hows that family history and gender do not significantly influence the baseline polyp number, the test result is not reliable.
(a) Plot a scatter plot to visualize the above-mentioned relationship between baseline polyp number and age. The plot should have a suitable title, and the x-/y-axes should also be labelled. (2 marks)
# Write your codes for part a here
processed %>%
ggplot(aes(x = age, y = baseline)) +
geom_point(size=0.5) +
labs( title = "Baseline polyp number vs age", x = "Age (years)", y = "Baseline polyp count")
(b) Generate the linear regression model to represent the relationship between baseline polyp number and age in (a). Display the summary output for the linear regression model. (3 marks)
# Write your codes for part b here
baseline_age.lm = lm(baseline ~ age, data = processed)
summary(baseline_age.lm)
##
## Call:
## lm(formula = baseline ~ age, data = processed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.583 -13.110 -5.227 5.191 64.047
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.8430 10.0607 1.475 0.145
## age 0.2329 0.2305 1.010 0.316
##
## Residual standard error: 20.48 on 65 degrees of freedom
## Multiple R-squared: 0.01546, Adjusted R-squared: 0.0003098
## F-statistic: 1.02 on 1 and 65 DF, p-value: 0.3162
(c) Assuming the assumptions of linear regression are all met, is your senior’s statement correct based on your outputs in (a) and (b)? Briefly explain. (3 marks)
Write your answers for Statement 4 part (c) here:Not correct.In (a),the scatter plot shows no clear linear trend between age and baseline polyp number as the counts are very spread out at all ages.From (b), the p-value for age is 0.3162 which is much larger than 0.05,showing no statisticaly signifcant between age and baseline polyp level number, and with adjusted R-squared = 0.0003098, meaning the model explains almost none of the variation.
(d) Your senior asked you to further test their statement with data transformation. Identify the appropriate data transformation approaches (if applicable) for baseline polyp number and age. (3 marks)
Hint: Transformation of one variable is independent of the other variable(s) i.e. Different variables can have different transformation approaches.
# Write your codes for part d here
# Untransformed
processed %>% ggplot(aes(baseline)) +
geom_histogram(bins = 200) +
labs(title = "Distribution of baseline polyp count",
x = "Baseline polyp count",
y = "Frequency")
# Log transformation
processed %>% ggplot(aes(log(baseline))) +
geom_histogram(bins = 200) +
labs(title = "Distribution of log(baseline)",
x = "log(baseline)",
y = "Frequency")
# Untransformed
processed %>% ggplot(aes(age)) +
geom_histogram(bins = 200) +
labs(title = "Distribution of age",
x = "Age (years)",
y = "Frequency")
(e) Perform the visualization and analyses steps needed to generate and assess the new linear regression model. (4 marks)
Hint: Remember to also check the fulfillment of assumptions when assessing the model.
# Write your codes for part e here
processed %>%
ggplot(aes(age,
log(baseline))) +
geom_point(size = 0.5)
log_baseline_age_lm = lm(log(baseline) ~ age,
data = processed)
summary(log_baseline_age_lm)
##
## Call:
## lm(formula = log(baseline) ~ age, data = processed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.33490 -0.58349 0.03342 0.46006 1.57387
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.449692 0.360713 6.791 4.05e-09 ***
## age 0.011313 0.008266 1.369 0.176
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7344 on 65 degrees of freedom
## Multiple R-squared: 0.02801, Adjusted R-squared: 0.01306
## F-statistic: 1.873 on 1 and 65 DF, p-value: 0.1758
qqnorm(resid(log_baseline_age_lm), main = "QQ-Plot of Residuals")
qqline(resid(log_baseline_age_lm), col = "red")
plot(predict(log_baseline_age_lm), resid(log_baseline_age_lm))
(f) Justify your choice of transformation approach(es). Based on your outputs relating to the new linear regression model in (e), is your senior’s statement correct? Briefly explain. (5 marks)
Write your answers for Statement 4 part (f) here: For baseline polyp number, the untransformed histogram in (d) showed a strong right-skew, so i choose to log-transform the graph after which the distribution looked much closer to normal and the QQ-plot of residuals in (e) lay roughly on the straight line, and the residuals–vs–fitted plot showed equal variance.This suggests that the log transformation is appropriate and improves the linear regression assumptions.For age, the original histogram was already reasonably symmetric and can be regraded as normally distribution, therefore no further transformation approach is needed. In the new model log(baseline) ~ age, the estimated slope for age is small (≈ 0.011) and not statistically significant (p ≈ 0.18<0.05), and the adjuste) R-squared remains very low and lose to 0 (0.01306) , meaning age explains only a tiny proportion of the variation in log baseline polyp number. Even after using an appropriate transformation and checking that the model assumptions are reasonably met, there is no strong evidence of a linear association between baseline polyp number and age. Therefore, the statement is not correct.