SID: 520455597

Instructions

  • The exam will run for 2 hours.
  • Use the pre-filled rmarkdown file from Canvas to help arrange your answers. You are welcome to use or ignore the code chunks I have inserted or add additional ones.
  • Submit a final html file to turnitin in Canvas to be marked.
  • There are 3 questions. The first two are worth \(40\%\) each, the third is worth \(20\%\).
  • The exam question sheet is 6 pages long.

Question 1 (\(40\%\))

Question 1 has 3 parts.

Part 1

melanoma25 = read.csv("https://wimr-genomics.vip.sydney.edu.au/AMED3002/data/e5/melanoma2025.csv")
library(ggplot2)
library(tidyr)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## âś” dplyr     1.1.4     âś” readr     2.1.5
## âś” forcats   1.0.0     âś” stringr   1.5.1
## âś” lubridate 1.9.4     âś” tibble    3.2.1
## âś” purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## âś– dplyr::filter() masks stats::filter()
## âś– dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(naniar)
library(dendextend)
## 
## ---------------------
## Welcome to dendextend version 1.19.0
## Type citation('dendextend') for how to cite the package.
## 
## Type browseVignettes(package = 'dendextend') for the package vignette.
## The github page is: https://github.com/talgalili/dendextend/
## 
## Suggestions and bug-reports can be submitted at: https://github.com/talgalili/dendextend/issues
## You may ask questions at stackoverflow, use the r and dendextend tags: 
##   https://stackoverflow.com/questions/tagged/dendextend
## 
##  To suppress this message use:  suppressPackageStartupMessages(library(dendextend))
## ---------------------
## 
## 
## Attaching package: 'dendextend'
## 
## The following object is masked from 'package:stats':
## 
##     cutree
library(MASS)
## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:dplyr':
## 
##     select
library(datasets)
library(survival)
  1. How many variables and observations are in the dataset?

Answer:

190 observations and 8 variables

head(melanoma25)
##   X status sex age year thickness ulcer location
## 1 1      2   1  41 1977  1.157584     0    torso
## 2 2      1   1  52 1965  3.475629     1     limb
## 3 3      1   1  28 1971  2.200000     1     limb
## 4 4      1   1  77 1972  2.271563     1     limb
## 5 5      1   1  49 1968  3.588872     1    torso
## 6 6      1   0  68 1971  2.722132     1    torso
dim(melanoma25)
## [1] 190   8
  1. Comment on the class of these variables and how they are stored in R.

Answer: 6 of the variables are stored as integers (X, status, sex, age, year and ulcer), One is stored as numerical (thickness) and one is stored as character (location)

Ulcer, sex and status are categorical despite being stored as integers. Age, year and thickness are numeric. Location is a character

str(melanoma25)
## 'data.frame':    190 obs. of  8 variables:
##  $ X        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ status   : int  2 1 1 1 1 1 1 1 1 1 ...
##  $ sex      : int  1 1 1 1 1 0 0 0 1 0 ...
##  $ age      : int  41 52 28 77 49 68 53 68 63 14 ...
##  $ year     : int  1977 1965 1971 1972 1968 1971 1969 1965 1970 1969 ...
##  $ thickness: num  1.16 3.48 2.2 2.27 3.59 ...
##  $ ulcer    : int  0 1 1 1 1 1 1 1 1 1 ...
##  $ location : chr  "torso" "limb" "limb" "limb" ...
sapply(melanoma25, class)
##           X      status         sex         age        year   thickness 
##   "integer"   "integer"   "integer"   "integer"   "integer"   "numeric" 
##       ulcer    location 
##   "integer" "character"
  1. Is there any missing data in this dataset?

Answer: no

sum(is.na(melanoma25))
## [1] 0
vis_miss(melanoma25)

Part 2

The researchers would like to understand the relationships between tumour thickness and the other variables.

  1. Test to see if tumour thickness is different between people with and without ulceration. Check your assumptions and make a conclusion using a signficance threshold of 0.05.

Answer: Tumour thickness differs significantly between people with and without ulceration (p<0.05), being thicker in those with an ulcer. The assumptions are independence, normality and equal variance. From residuals vs plotted, distribution implies homoscedasticity, from qq plot, residuals are relatively normally distributed however there is a seemingly right skew. Overall the assumptions seem to be valid

boxplot(thickness ~ factor(ulcer), melanoma25)

fit = lm(thickness ~ factor(ulcer), melanoma25)
anova(fit)
## Analysis of Variance Table
## 
## Response: thickness
##                Df Sum Sq Mean Sq F value    Pr(>F)    
## factor(ulcer)   1 24.394 24.3940   63.64 1.425e-13 ***
## Residuals     188 72.063  0.3833                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(fit)

  1. Test to see if tumour thickness is higher in females than males. Check your assumptions and make a conclusion using a signficance threshold of 0.05.

Answer: Tumour thickness is greater in males (p<0.05). The assumptions are independence, normality, equal variance and random sampling. Population variances appear equal from the plots. QQ plot is slightly skewed right suggesting the data may not be normal however the skew is minor.

boxplot(thickness ~ factor(sex), melanoma25)

fit = lm(thickness ~ factor(sex), melanoma25)
anova(fit)
## Analysis of Variance Table
## 
## Response: thickness
##              Df Sum Sq Mean Sq F value   Pr(>F)   
## factor(sex)   1  4.087  4.0872  8.3188 0.004382 **
## Residuals   188 92.369  0.4913                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(fit)

  1. Fit a two-way ANOVA model without an interaction effect to assess if ulceration and sex both have a relationship with tumour thickness when included in the same model. Be sure to comment on the assumptions needed for this test.

Answer: This model requires the usual assumptions of independence, normality, homoscedasticity and random sampling but also includes the assumption that the two factors have an additive effect rather than interactive.

Melanoma.additive = aov(thickness ~ sex + ulcer, melanoma25)
summary(Melanoma.additive)
##              Df Sum Sq Mean Sq F value   Pr(>F)    
## sex           1   4.09   4.087   10.78  0.00122 ** 
## ulcer         1  21.47  21.466   56.62 2.16e-12 ***
## Residuals   187  70.90   0.379                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  1. Create an interaction plot with sex and ulceration as factors and thickness as the response. From this plot, is there evidence that there is an interaction effect? Why?

Answer: Yes there seems to be interaction as the lines of best fit are not parallel and are interacting.

melanoma25$sex = as.factor(melanoma25$sex)
melanoma25$ulcer = as.factor(melanoma25$ulcer)

ggplot(melanoma25, aes(x = factor(ulcer), y = thickness, colour = sex)) + geom_boxplot() +
    stat_summary(fun.y = mean, geom = "line", aes(group = sex)) + theme_classic() +
    xlab("ulcer") + ylab("thickness") + labs(colour = "sex")
## Warning: The `fun.y` argument of `stat_summary()` is deprecated as of ggplot2 3.3.0.
## ℹ Please use the `fun` argument instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

  1. Include an interaction term in the two-way ANOVA model to assess if ulceration and sex both have a relationship with tumour thickness. What is your conclusion?

Answer: No they dont have a significant interactive effect as the p value for sex:ulcer is just above the significance threshold (p = 0.058).

Melanoma.interaction = aov(thickness ~ sex * ulcer, melanoma25)
summary(Melanoma.interaction)
##              Df Sum Sq Mean Sq F value   Pr(>F)    
## sex           1   4.09   4.087  10.931  0.00113 ** 
## ulcer         1  21.47  21.466  57.412 1.61e-12 ***
## sex:ulcer     1   1.36   1.358   3.632  0.05822 .  
## Residuals   186  69.55   0.374                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Part 3

The researchers would like to test if age has an effect on tumour thickness.

  1. If there was no relationship between tumour thinkess and age, what value should the slope of an appropriate regression model be?

Answer: R^2 = 0

  1. Fit a regression model and then assess and comment on the model assumptions.

Answer: The model assumptions are random sampling, homoscedastic values, normally distributed values and independence of values. From the QQ, the values are normally distributed. From the residuals vs fitted and scale vs location the values appear homoscedastic. From the residuals vs leveraged there seems to be very few outliers.

fit = lm(age ~ thickness, melanoma25)
summary(fit)
## 
## Call:
## lm(formula = age ~ thickness, data = melanoma25)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -48.059 -10.929   2.145  10.713  40.097 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   45.195      2.794  16.174   <2e-16 ***
## thickness      4.147      1.674   2.477   0.0141 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.45 on 188 degrees of freedom
## Multiple R-squared:  0.03159,    Adjusted R-squared:  0.02644 
## F-statistic: 6.134 on 1 and 188 DF,  p-value: 0.01415
plot(fit)

  1. What would you conclude from the test?

Answer: p<0.05 and adjusted R-square value is 0.02. These indicate there is a significant effect of age on tumour thickness, however gradual, with tumour thickness increasing with age

  1. What proportion of variation in tumour thickness can be explained by age?

Answer: ~2%

Question 2 (\(40\%\))

Part 1

heart25 = read.csv("https://wimr-genomics.vip.sydney.edu.au/AMED3002/data/e5/heartDiseaseV25.csv")
str(heart25)
## 'data.frame':    297 obs. of  13 variables:
##  $ age     : int  63 67 67 37 41 56 62 57 63 53 ...
##  $ sex     : int  1 1 1 1 0 1 0 0 1 1 ...
##  $ cp      : int  1 4 4 3 2 2 4 4 4 4 ...
##  $ trestbps: int  145 160 120 130 130 120 140 120 130 140 ...
##  $ chol    : int  233 286 229 250 204 236 268 354 254 203 ...
##  $ fbs     : int  1 0 0 0 0 0 0 0 0 1 ...
##  $ restecg : int  2 2 2 0 2 0 2 0 2 2 ...
##  $ thalach : int  150 108 129 187 172 178 160 163 147 155 ...
##  $ exang   : int  0 1 1 0 0 0 0 1 0 1 ...
##  $ oldpeak : num  2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
##  $ slope   : int  3 2 2 3 1 1 3 1 2 3 ...
##  $ ca      : int  0 3 2 0 0 0 2 0 1 0 ...
##  $ disease : int  0 1 1 0 0 0 1 0 1 1 ...

From clinical experience the researchers believe that there may be a relationship between sex and whether an individual had heart disease.

  1. What types of variables should sex and whether they had heart disease be?

Answer: Sex and heart disease status should be categorical variables

  1. What is an appropriate statistical test that could be used by the researchers to test this question?

Answer: a chi-squared test

  1. What is the corresponding null and alternate hypothesis?

Answer: Null: There is no relationship between sex and whether an individual has heart disease. Alternate: There is a relationship between sex and whether and individual had heart disease

  1. Construct a contingency table using the variables sex and disease.

Answer:

tab <- table(heart25$sex, heart25$disease)
tab
##    
##       0   1
##   0  71  25
##   1  89 112
  1. Perform the appropriate test.

Answer:

chisq.test(tab)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  tab
## X-squared = 21.852, df = 1, p-value = 2.946e-06
test = chisq.test(tab)
test$expected >= 5
##    
##        0    1
##   0 TRUE TRUE
##   1 TRUE TRUE
  1. Using a siginficance threshold of 0.05 what would you conclude from this test?

Answer: There is a highly significant p value, suggesting that sex (female) has a significant effect on whether an individual has heart disease, confirming the alternate hypothesis

  1. What were the assumptions for this test? Comment on them in the context of the observed data.

Answer: That the expected frequency in each cell is above 5, that random sampling has occurred, that the observations are independent of each other and that the results are ordinal

  1. Report and interpret the corresponding Odds Ratio.

Answer: OR = 3.57. Since it is greater than 1, it means there is a greater likelihood of someone female having heart disease.

OR <- (tab[1, 1] * tab[2, 2])/(tab[2, 1] * tab[1, 2])
OR
## [1] 3.573933

Part 2

Use logistic regression to model the chance of having heart disease. For the following do not check model assumptions:

  1. Fit a logistic regression model that uses all of the variables to model heart disease. Comment on which variables appear to be informative.

Answer: Sex, chest pain(cp), resting blood pressure(trestbps), maximum heart rate achieved(thalach), exersice induced angina(exang), number of major vessel (ca) all appear informative as they are below the significance threshold (0.05).

fit = glm(disease ~ ., heart25, family = binomial())
plot(fit)

summary(fit)
## 
## Call:
## glm(formula = disease ~ ., family = binomial(), data = heart25)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -7.237063   2.785366  -2.598 0.009370 ** 
## age         -0.010452   0.023133  -0.452 0.651412    
## sex          1.919064   0.441491   4.347 1.38e-05 ***
## cp           0.678500   0.188934   3.591 0.000329 ***
## trestbps     0.027149   0.010463   2.595 0.009466 ** 
## chol         0.006237   0.003747   1.665 0.095980 .  
## fbs         -1.047147   0.537209  -1.949 0.051267 .  
## restecg      0.180479   0.177738   1.015 0.309903    
## thalach     -0.023469   0.010122  -2.319 0.020417 *  
## exang        1.096381   0.397986   2.755 0.005872 ** 
## oldpeak      0.264152   0.205372   1.286 0.198369    
## slope        0.712802   0.358094   1.991 0.046531 *  
## ca           1.298607   0.258345   5.027 4.99e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 409.95  on 296  degrees of freedom
## Residual deviance: 216.76  on 284  degrees of freedom
## AIC: 242.76
## 
## Number of Fisher Scoring iterations: 6
  1. Use backwards step-wise variable selection to fit a simpler model. How many variables are included in the model?

Answer: 9 variables sex + cp + trestbps + chol + fbs + thalach + exang + slope + ca

fit = step(glm(disease ~ ., heart25, family = binomial()), direction = "backward")
## Start:  AIC=242.76
## disease ~ age + sex + cp + trestbps + chol + fbs + restecg + 
##     thalach + exang + oldpeak + slope + ca
## 
##            Df Deviance    AIC
## - age       1   216.96 240.96
## - restecg   1   217.79 241.79
## - oldpeak   1   218.45 242.45
## <none>          216.76 242.76
## - chol      1   219.45 243.45
## - slope     1   220.71 244.71
## - fbs       1   220.75 244.75
## - thalach   1   222.46 246.46
## - trestbps  1   223.78 247.78
## - exang     1   224.35 248.35
## - cp        1   230.75 254.75
## - sex       1   239.10 263.10
## - ca        1   249.78 273.79
## 
## Step:  AIC=240.96
## disease ~ sex + cp + trestbps + chol + fbs + restecg + thalach + 
##     exang + oldpeak + slope + ca
## 
##            Df Deviance    AIC
## - restecg   1   218.00 240.00
## - oldpeak   1   218.78 240.78
## <none>          216.96 240.96
## - chol      1   219.52 241.52
## - slope     1   220.83 242.83
## - fbs       1   221.06 243.06
## - thalach   1   222.72 244.72
## - trestbps  1   223.87 245.87
## - exang     1   224.76 246.76
## - cp        1   231.39 253.39
## - sex       1   240.23 262.23
## - ca        1   251.13 273.13
## 
## Step:  AIC=240
## disease ~ sex + cp + trestbps + chol + fbs + thalach + exang + 
##     oldpeak + slope + ca
## 
##            Df Deviance    AIC
## - oldpeak   1   219.78 239.78
## <none>          218.00 240.00
## - chol      1   221.39 241.39
## - fbs       1   221.98 241.98
## - slope     1   222.43 242.43
## - thalach   1   223.97 243.97
## - trestbps  1   225.44 245.44
## - exang     1   225.77 245.77
## - cp        1   232.15 252.15
## - sex       1   242.23 262.23
## - ca        1   252.67 272.67
## 
## Step:  AIC=239.78
## disease ~ sex + cp + trestbps + chol + fbs + thalach + exang + 
##     slope + ca
## 
##            Df Deviance    AIC
## <none>          219.78 239.78
## - chol      1   223.46 241.46
## - fbs       1   224.15 242.15
## - thalach   1   226.56 244.56
## - exang     1   228.30 246.30
## - trestbps  1   228.43 246.43
## - slope     1   230.65 248.65
## - cp        1   233.98 251.98
## - sex       1   247.90 265.90
## - ca        1   260.82 278.82
  1. When the step-wise variable selection was performed, which model fit criteria was used to decide whether a variable should be included or not? Feel free to use an acronym.

Answer: AIC

  1. From either one of the logistic regression models, calculate odds ratios and comment on the relationship between heart disease and the maximum heart rate achieved.

Answer:

Question 3 (\(20\%\))

Question 3 has 2 parts

Part 1

urine25 = read.csv("https://wimr-genomics.vip.sydney.edu.au/AMED3002/data/e5/UrineDataV25.csv")
  1. How many variables and observations are in the dataset?

Answer: 79 observations and 7 variables

head(urine25)
##   crystals gravity   ph osmo cond urea calc
## 1        0   1.021 4.91  725   NA  443 2.45
## 2        0   1.017 5.74  577 20.0  296 4.49
## 3        0   1.008 7.20  321 14.9  101 2.36
## 4        0   1.011 5.51  408 12.6  224 2.15
## 5        0   1.005 6.52  187  7.5   91 1.16
## 6        0   1.020 5.27  668 25.3  252 3.34
dim(urine25)
## [1] 79  7
  1. Comment on the class of these variables and how they are stored in R.

Answer: Crystals: stored as integer, class is categorical Gravity: stored as numerical, class continuous numerical ph: stored as numerical, class continuous numerical osmo: stored as integer, class continuous numerical cond: stored as numerical, class continuous numerical urea: stored as integer, class continuous numerical calc: stored as numerical, class continuous numerical

str(urine25)
## 'data.frame':    79 obs. of  7 variables:
##  $ crystals: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ gravity : num  1.02 1.02 1.01 1.01 1 ...
##  $ ph      : num  4.91 5.74 7.2 5.51 6.52 5.27 5.62 5.67 5.41 6.13 ...
##  $ osmo    : int  725 577 321 408 187 668 461 1107 543 779 ...
##  $ cond    : num  NA 20 14.9 12.6 7.5 25.3 17.4 35.9 21.9 25.7 ...
##  $ urea    : int  443 296 101 224 91 252 195 550 170 382 ...
##  $ calc    : num  2.45 4.49 2.36 2.15 1.16 3.34 1.4 8.48 1.16 2.21 ...
  1. Use a visualization to check if the dataset has any missing data?

Answer:

vis_miss(urine25)

  1. In a sentence, explain why you would conclude that the data is either MCAR, MAR or MNAR?

Answer: missing completely at random as there is no structural or known reason why data is missing, instead it is random.

  1. Perform case deletion.

Answer:

Data <- na.omit(urine25)

Part 2

  1. Use hierarchical clustering to cluster the variables. Hint: you may need to use t().

Answer:

DataScaled <- Data %>%
    scale()

hC <- hclust(dist(DataScaled))
hCC <- cutree(hC, k = 2)
hC %>%
    as.dendrogram() %>%
    set("branches_k_color", k = 2) %>%
    plot()

  1. Comment on why you did or did not decide to scale the data when performing the analysis.

Answer: I chose to scale the data as the variables use different forms of measurements and may appear far apart due to this.

  1. How does this clustering inform the researchers’ primary question?

Answer: It provides x number of distinct groups that variables fall under, indicating which variables are closely related/have an effect on each other

  1. Use k-means clustering to cluster the observations in the dataset. Use all of the variables except for crystals to cluster, and, set a seed of 51773 before you cluster.

Answer:

DataScaled <- Data %>%
    dplyr::select(-crystals) %>%
    scale()
set.seed(51773)
kM <- kmeans(DataScaled, 2)

Data$crystals = as.factor(Data$crystals)
pca <- prcomp(DataScaled)
df <- data.frame(pca$x, cluster = paste("cluster", kM$cluster, sep = "_"), Data)
ggplot(df, aes(x = PC1, y = PC2, colour = cluster, shape = crystals)) + geom_point()

  1. Why is it advisable to set a seed?

Answer: so that the random assignment of points for k-means clustering will be the same each time you close/reopen the file

  1. Is there any evidence of a relationship between this k-means clustering and the formation of crystals?

Answer: No, from the chi-square test, p > 0.05 therefore there is no relationship

DATA <- data.frame(Data, kmeans = kM$cluster)
(tabk <- table(DATA$kmeans, DATA$crystals))
##    
##      0  1
##   1 26 12
##   2 18 21
chisq.test(tabk)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  tabk
## X-squared = 3.0406, df = 1, p-value = 0.08121
  1. How does this clustering inform the researchers’ primary question?

Answer: It allows researchers to compile the effects of all variables using principle components analysis, so that groups/clusters can be assigned to samples. This then allows the researchers compare the clustering to crystallization to determine whether the combined effects of the other variables is related to the formation of crystals.