melanoma25 = read.csv("https://wimr-genomics.vip.sydney.edu.au/AMED3002/data/e5/melanoma2025.csv")
library(ggplot2)
library(tidyr)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## âś” dplyr 1.1.4 âś” readr 2.1.5
## âś” forcats 1.0.0 âś” stringr 1.5.1
## âś” lubridate 1.9.4 âś” tibble 3.2.1
## âś” purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## âś– dplyr::filter() masks stats::filter()
## âś– dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(naniar)
library(dendextend)
##
## ---------------------
## Welcome to dendextend version 1.19.0
## Type citation('dendextend') for how to cite the package.
##
## Type browseVignettes(package = 'dendextend') for the package vignette.
## The github page is: https://github.com/talgalili/dendextend/
##
## Suggestions and bug-reports can be submitted at: https://github.com/talgalili/dendextend/issues
## You may ask questions at stackoverflow, use the r and dendextend tags:
## https://stackoverflow.com/questions/tagged/dendextend
##
## To suppress this message use: suppressPackageStartupMessages(library(dendextend))
## ---------------------
##
##
## Attaching package: 'dendextend'
##
## The following object is masked from 'package:stats':
##
## cutree
library(MASS)
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:dplyr':
##
## select
library(datasets)
library(survival)
Answer:
190 observations and 8 variables
head(melanoma25)
## X status sex age year thickness ulcer location
## 1 1 2 1 41 1977 1.157584 0 torso
## 2 2 1 1 52 1965 3.475629 1 limb
## 3 3 1 1 28 1971 2.200000 1 limb
## 4 4 1 1 77 1972 2.271563 1 limb
## 5 5 1 1 49 1968 3.588872 1 torso
## 6 6 1 0 68 1971 2.722132 1 torso
dim(melanoma25)
## [1] 190 8
Answer: 6 of the variables are stored as integers (X, status, sex, age, year and ulcer), One is stored as numerical (thickness) and one is stored as character (location)
Ulcer, sex and status are categorical despite being stored as integers. Age, year and thickness are numeric. Location is a character
str(melanoma25)
## 'data.frame': 190 obs. of 8 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ status : int 2 1 1 1 1 1 1 1 1 1 ...
## $ sex : int 1 1 1 1 1 0 0 0 1 0 ...
## $ age : int 41 52 28 77 49 68 53 68 63 14 ...
## $ year : int 1977 1965 1971 1972 1968 1971 1969 1965 1970 1969 ...
## $ thickness: num 1.16 3.48 2.2 2.27 3.59 ...
## $ ulcer : int 0 1 1 1 1 1 1 1 1 1 ...
## $ location : chr "torso" "limb" "limb" "limb" ...
sapply(melanoma25, class)
## X status sex age year thickness
## "integer" "integer" "integer" "integer" "integer" "numeric"
## ulcer location
## "integer" "character"
Answer: no
sum(is.na(melanoma25))
## [1] 0
vis_miss(melanoma25)
The researchers would like to understand the relationships between tumour thickness and the other variables.
Answer: Tumour thickness differs significantly between people with and without ulceration (p<0.05), being thicker in those with an ulcer. The assumptions are independence, normality and equal variance. From residuals vs plotted, distribution implies homoscedasticity, from qq plot, residuals are relatively normally distributed however there is a seemingly right skew. Overall the assumptions seem to be valid
boxplot(thickness ~ factor(ulcer), melanoma25)
fit = lm(thickness ~ factor(ulcer), melanoma25)
anova(fit)
## Analysis of Variance Table
##
## Response: thickness
## Df Sum Sq Mean Sq F value Pr(>F)
## factor(ulcer) 1 24.394 24.3940 63.64 1.425e-13 ***
## Residuals 188 72.063 0.3833
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(fit)
Answer: Tumour thickness is greater in males (p<0.05). The assumptions are independence, normality, equal variance and random sampling. Population variances appear equal from the plots. QQ plot is slightly skewed right suggesting the data may not be normal however the skew is minor.
boxplot(thickness ~ factor(sex), melanoma25)
fit = lm(thickness ~ factor(sex), melanoma25)
anova(fit)
## Analysis of Variance Table
##
## Response: thickness
## Df Sum Sq Mean Sq F value Pr(>F)
## factor(sex) 1 4.087 4.0872 8.3188 0.004382 **
## Residuals 188 92.369 0.4913
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(fit)
Answer: This model requires the usual assumptions of independence, normality, homoscedasticity and random sampling but also includes the assumption that the two factors have an additive effect rather than interactive.
Melanoma.additive = aov(thickness ~ sex + ulcer, melanoma25)
summary(Melanoma.additive)
## Df Sum Sq Mean Sq F value Pr(>F)
## sex 1 4.09 4.087 10.78 0.00122 **
## ulcer 1 21.47 21.466 56.62 2.16e-12 ***
## Residuals 187 70.90 0.379
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Answer: Yes there seems to be interaction as the lines of best fit are not parallel and are interacting.
melanoma25$sex = as.factor(melanoma25$sex)
melanoma25$ulcer = as.factor(melanoma25$ulcer)
ggplot(melanoma25, aes(x = factor(ulcer), y = thickness, colour = sex)) + geom_boxplot() +
stat_summary(fun.y = mean, geom = "line", aes(group = sex)) + theme_classic() +
xlab("ulcer") + ylab("thickness") + labs(colour = "sex")
## Warning: The `fun.y` argument of `stat_summary()` is deprecated as of ggplot2 3.3.0.
## ℹ Please use the `fun` argument instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Answer: No they dont have a significant interactive effect as the p value for sex:ulcer is just above the significance threshold (p = 0.058).
Melanoma.interaction = aov(thickness ~ sex * ulcer, melanoma25)
summary(Melanoma.interaction)
## Df Sum Sq Mean Sq F value Pr(>F)
## sex 1 4.09 4.087 10.931 0.00113 **
## ulcer 1 21.47 21.466 57.412 1.61e-12 ***
## sex:ulcer 1 1.36 1.358 3.632 0.05822 .
## Residuals 186 69.55 0.374
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The researchers would like to test if age has an effect on tumour thickness.
Answer: R^2 = 0
Answer: The model assumptions are random sampling, homoscedastic values, normally distributed values and independence of values. From the QQ, the values are normally distributed. From the residuals vs fitted and scale vs location the values appear homoscedastic. From the residuals vs leveraged there seems to be very few outliers.
fit = lm(age ~ thickness, melanoma25)
summary(fit)
##
## Call:
## lm(formula = age ~ thickness, data = melanoma25)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.059 -10.929 2.145 10.713 40.097
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 45.195 2.794 16.174 <2e-16 ***
## thickness 4.147 1.674 2.477 0.0141 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.45 on 188 degrees of freedom
## Multiple R-squared: 0.03159, Adjusted R-squared: 0.02644
## F-statistic: 6.134 on 1 and 188 DF, p-value: 0.01415
plot(fit)
Answer: p<0.05 and adjusted R-square value is 0.02. These indicate there is a significant effect of age on tumour thickness, however gradual, with tumour thickness increasing with age
Answer: ~2%
heart25 = read.csv("https://wimr-genomics.vip.sydney.edu.au/AMED3002/data/e5/heartDiseaseV25.csv")
str(heart25)
## 'data.frame': 297 obs. of 13 variables:
## $ age : int 63 67 67 37 41 56 62 57 63 53 ...
## $ sex : int 1 1 1 1 0 1 0 0 1 1 ...
## $ cp : int 1 4 4 3 2 2 4 4 4 4 ...
## $ trestbps: int 145 160 120 130 130 120 140 120 130 140 ...
## $ chol : int 233 286 229 250 204 236 268 354 254 203 ...
## $ fbs : int 1 0 0 0 0 0 0 0 0 1 ...
## $ restecg : int 2 2 2 0 2 0 2 0 2 2 ...
## $ thalach : int 150 108 129 187 172 178 160 163 147 155 ...
## $ exang : int 0 1 1 0 0 0 0 1 0 1 ...
## $ oldpeak : num 2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
## $ slope : int 3 2 2 3 1 1 3 1 2 3 ...
## $ ca : int 0 3 2 0 0 0 2 0 1 0 ...
## $ disease : int 0 1 1 0 0 0 1 0 1 1 ...
From clinical experience the researchers believe that there may be a relationship between sex and whether an individual had heart disease.
Answer: Sex and heart disease status should be categorical variables
Answer: a chi-squared test
Answer: Null: There is no relationship between sex and whether an individual has heart disease. Alternate: There is a relationship between sex and whether and individual had heart disease
Answer:
tab <- table(heart25$sex, heart25$disease)
tab
##
## 0 1
## 0 71 25
## 1 89 112
Answer:
chisq.test(tab)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: tab
## X-squared = 21.852, df = 1, p-value = 2.946e-06
test = chisq.test(tab)
test$expected >= 5
##
## 0 1
## 0 TRUE TRUE
## 1 TRUE TRUE
Answer: There is a highly significant p value, suggesting that sex (female) has a significant effect on whether an individual has heart disease, confirming the alternate hypothesis
Answer: That the expected frequency in each cell is above 5, that random sampling has occurred, that the observations are independent of each other and that the results are ordinal
Answer: OR = 3.57. Since it is greater than 1, it means there is a greater likelihood of someone female having heart disease.
OR <- (tab[1, 1] * tab[2, 2])/(tab[2, 1] * tab[1, 2])
OR
## [1] 3.573933
Use logistic regression to model the chance of having heart disease. For the following do not check model assumptions:
Answer: Sex, chest pain(cp), resting blood pressure(trestbps), maximum heart rate achieved(thalach), exersice induced angina(exang), number of major vessel (ca) all appear informative as they are below the significance threshold (0.05).
fit = glm(disease ~ ., heart25, family = binomial())
plot(fit)
summary(fit)
##
## Call:
## glm(formula = disease ~ ., family = binomial(), data = heart25)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.237063 2.785366 -2.598 0.009370 **
## age -0.010452 0.023133 -0.452 0.651412
## sex 1.919064 0.441491 4.347 1.38e-05 ***
## cp 0.678500 0.188934 3.591 0.000329 ***
## trestbps 0.027149 0.010463 2.595 0.009466 **
## chol 0.006237 0.003747 1.665 0.095980 .
## fbs -1.047147 0.537209 -1.949 0.051267 .
## restecg 0.180479 0.177738 1.015 0.309903
## thalach -0.023469 0.010122 -2.319 0.020417 *
## exang 1.096381 0.397986 2.755 0.005872 **
## oldpeak 0.264152 0.205372 1.286 0.198369
## slope 0.712802 0.358094 1.991 0.046531 *
## ca 1.298607 0.258345 5.027 4.99e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 409.95 on 296 degrees of freedom
## Residual deviance: 216.76 on 284 degrees of freedom
## AIC: 242.76
##
## Number of Fisher Scoring iterations: 6
Answer: 9 variables sex + cp + trestbps + chol + fbs + thalach + exang + slope + ca
fit = step(glm(disease ~ ., heart25, family = binomial()), direction = "backward")
## Start: AIC=242.76
## disease ~ age + sex + cp + trestbps + chol + fbs + restecg +
## thalach + exang + oldpeak + slope + ca
##
## Df Deviance AIC
## - age 1 216.96 240.96
## - restecg 1 217.79 241.79
## - oldpeak 1 218.45 242.45
## <none> 216.76 242.76
## - chol 1 219.45 243.45
## - slope 1 220.71 244.71
## - fbs 1 220.75 244.75
## - thalach 1 222.46 246.46
## - trestbps 1 223.78 247.78
## - exang 1 224.35 248.35
## - cp 1 230.75 254.75
## - sex 1 239.10 263.10
## - ca 1 249.78 273.79
##
## Step: AIC=240.96
## disease ~ sex + cp + trestbps + chol + fbs + restecg + thalach +
## exang + oldpeak + slope + ca
##
## Df Deviance AIC
## - restecg 1 218.00 240.00
## - oldpeak 1 218.78 240.78
## <none> 216.96 240.96
## - chol 1 219.52 241.52
## - slope 1 220.83 242.83
## - fbs 1 221.06 243.06
## - thalach 1 222.72 244.72
## - trestbps 1 223.87 245.87
## - exang 1 224.76 246.76
## - cp 1 231.39 253.39
## - sex 1 240.23 262.23
## - ca 1 251.13 273.13
##
## Step: AIC=240
## disease ~ sex + cp + trestbps + chol + fbs + thalach + exang +
## oldpeak + slope + ca
##
## Df Deviance AIC
## - oldpeak 1 219.78 239.78
## <none> 218.00 240.00
## - chol 1 221.39 241.39
## - fbs 1 221.98 241.98
## - slope 1 222.43 242.43
## - thalach 1 223.97 243.97
## - trestbps 1 225.44 245.44
## - exang 1 225.77 245.77
## - cp 1 232.15 252.15
## - sex 1 242.23 262.23
## - ca 1 252.67 272.67
##
## Step: AIC=239.78
## disease ~ sex + cp + trestbps + chol + fbs + thalach + exang +
## slope + ca
##
## Df Deviance AIC
## <none> 219.78 239.78
## - chol 1 223.46 241.46
## - fbs 1 224.15 242.15
## - thalach 1 226.56 244.56
## - exang 1 228.30 246.30
## - trestbps 1 228.43 246.43
## - slope 1 230.65 248.65
## - cp 1 233.98 251.98
## - sex 1 247.90 265.90
## - ca 1 260.82 278.82
Answer: AIC
Answer:
urine25 = read.csv("https://wimr-genomics.vip.sydney.edu.au/AMED3002/data/e5/UrineDataV25.csv")
Answer: 79 observations and 7 variables
head(urine25)
## crystals gravity ph osmo cond urea calc
## 1 0 1.021 4.91 725 NA 443 2.45
## 2 0 1.017 5.74 577 20.0 296 4.49
## 3 0 1.008 7.20 321 14.9 101 2.36
## 4 0 1.011 5.51 408 12.6 224 2.15
## 5 0 1.005 6.52 187 7.5 91 1.16
## 6 0 1.020 5.27 668 25.3 252 3.34
dim(urine25)
## [1] 79 7
Answer: Crystals: stored as integer, class is categorical Gravity: stored as numerical, class continuous numerical ph: stored as numerical, class continuous numerical osmo: stored as integer, class continuous numerical cond: stored as numerical, class continuous numerical urea: stored as integer, class continuous numerical calc: stored as numerical, class continuous numerical
str(urine25)
## 'data.frame': 79 obs. of 7 variables:
## $ crystals: int 0 0 0 0 0 0 0 0 0 0 ...
## $ gravity : num 1.02 1.02 1.01 1.01 1 ...
## $ ph : num 4.91 5.74 7.2 5.51 6.52 5.27 5.62 5.67 5.41 6.13 ...
## $ osmo : int 725 577 321 408 187 668 461 1107 543 779 ...
## $ cond : num NA 20 14.9 12.6 7.5 25.3 17.4 35.9 21.9 25.7 ...
## $ urea : int 443 296 101 224 91 252 195 550 170 382 ...
## $ calc : num 2.45 4.49 2.36 2.15 1.16 3.34 1.4 8.48 1.16 2.21 ...
Answer:
vis_miss(urine25)
Answer: missing completely at random as there is no structural or known reason why data is missing, instead it is random.
Answer:
Data <- na.omit(urine25)
t().Answer:
DataScaled <- Data %>%
scale()
hC <- hclust(dist(DataScaled))
hCC <- cutree(hC, k = 2)
hC %>%
as.dendrogram() %>%
set("branches_k_color", k = 2) %>%
plot()
Answer: I chose to scale the data as the variables use different forms of measurements and may appear far apart due to this.
Answer: It provides x number of distinct groups that variables fall under, indicating which variables are closely related/have an effect on each other
crystals to cluster,
and, set a seed of 51773 before you cluster.Answer:
DataScaled <- Data %>%
dplyr::select(-crystals) %>%
scale()
set.seed(51773)
kM <- kmeans(DataScaled, 2)
Data$crystals = as.factor(Data$crystals)
pca <- prcomp(DataScaled)
df <- data.frame(pca$x, cluster = paste("cluster", kM$cluster, sep = "_"), Data)
ggplot(df, aes(x = PC1, y = PC2, colour = cluster, shape = crystals)) + geom_point()
Answer: so that the random assignment of points for k-means clustering will be the same each time you close/reopen the file
Answer: No, from the chi-square test, p > 0.05 therefore there is no relationship
DATA <- data.frame(Data, kmeans = kM$cluster)
(tabk <- table(DATA$kmeans, DATA$crystals))
##
## 0 1
## 1 26 12
## 2 18 21
chisq.test(tabk)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: tabk
## X-squared = 3.0406, df = 1, p-value = 0.08121
Answer: It allows researchers to compile the effects of all variables using principle components analysis, so that groups/clusters can be assigned to samples. This then allows the researchers compare the clustering to crystallization to determine whether the combined effects of the other variables is related to the formation of crystals.