HIV = read.csv('http://www.maths.usyd.edu.au/u/ellisp/AMED3002/data/HIV.csv')
How many variables and observations are in the dataset? a. 2843 observations of 7 variables
Answer:
dim(HIV)
## [1] 2843 7
Comment on the class of these variables and how they are stored in R.
Answer:
str(HIV)
## 'data.frame': 2843 obs. of 7 variables:
## $ state : Factor w/ 4 levels "NSW","Other",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ sex : Factor w/ 2 levels "F","M": 2 2 2 2 2 2 2 2 2 2 ...
## $ diag : int 10905 11029 9551 9577 10015 9971 10746 10042 10464 10439 ...
## $ death : int 11081 11096 9983 9654 10290 10344 11135 11069 10956 10873 ...
## $ status : Factor w/ 2 levels "A","D": 2 2 2 2 2 2 2 2 2 2 ...
## $ T.categ: Factor w/ 8 levels "blood","haem",..: 4 4 4 2 4 4 8 4 4 5 ...
## $ age : int 35 53 42 44 39 36 36 31 26 27 ...
Answer: No
sum(is.na(HIV))
## [1] 0
Answer: Sex and state should both be categorical
Answer: There are clearly more men than women in every state. There are more people in NSW than in any other state. It is hard to see in this plot whether the pattern across states is the same for women as for men.
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.0 ✓ purrr 0.3.3
## ✓ tibble 2.1.3 ✓ dplyr 0.8.5
## ✓ tidyr 1.0.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ───────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
ggplot(HIV, aes(x = state, fill = sex)) + geom_bar(position = "dodge")
f. What is an alternative plot you could use to assess this question? What is a benefit and a disadvantage of using this alternative visualisation?
Answer: A stacked barplot would be an alternative. A stacked barplot puts more emphasis on the primary variable labelled on the x-axis. It is often hard to see changes in the relative proportions of the secondary variable (labelled by colour).
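As a rough sketch of this alternative, the counts of sex by state for this dataset can be drawn as a stacked barplot, and also as a filled (proportional) barplot, which makes the relative share of each sex easier to compare. The data frame `counts` and the names `p_stack` and `p_fill` are illustrative assumptions; with the full data, `geom_bar(position = "stack")` or `geom_bar(position = "fill")` on `HIV` should give the same pictures.

```r
library(ggplot2)

# Counts of sex by state, as given by table(HIV$sex, HIV$state)
counts <- data.frame(
  state = rep(c("NSW", "Other", "QLD", "VIC"), each = 2),
  sex   = rep(c("F", "M"), times = 4),
  n     = c(54, 1726, 13, 236, 9, 217, 13, 575)
)

# Stacked barplot: bar height is the state total, colour splits it by sex
p_stack <- ggplot(counts, aes(x = state, y = n, fill = sex)) +
  geom_col(position = "stack")

# Filled barplot: every bar is rescaled to 1, so proportions are comparable
p_fill <- ggplot(counts, aes(x = state, y = n, fill = sex)) +
  geom_col(position = "fill") +
  ylab("proportion")

print(p_stack)
print(p_fill)
```

The filled version trades away the state totals in exchange for comparable proportions, which is the same trade-off described in the answer above.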
Answer: A chi-square test of independence, because all of our observations came from a single sample (prospective study).
Answer: The null hypothesis is that there is no relationship between sex and state, and the alternative is that there is a relationship.
tab <- table(HIV$sex, HIV$state)
tab
##
## NSW Other QLD VIC
## F 54 13 9 13
## M 1726 236 217 575
chisq.test(tab)
##
## Pearson's Chi-squared test
##
## data: tab
## X-squared = 5.8235, df = 3, p-value = 0.1205
Answer: As the p-value is larger than the significance threshold, we would conclude that there is not enough evidence to reject the null. There is not enough evidence to say that there is any relationship between sex and state.
Answer: We assume that at least 80% of expected cell counts are greater than 5. This is the case in the data, as every expected count is greater than 5.
chisq.test(tab)$expected
##
## NSW Other QLD VIC
## F 55.72283 7.794935 7.074921 18.40732
## M 1724.27717 241.205065 218.925079 569.59268
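To make the link between the expected counts and the test statistic explicit, here is a minimal sketch that recomputes both by hand from the observed table above. The names `obs`, `expected`, `x2` and `p_value` are illustrative and not part of the original solution.

```r
# Observed counts of sex by state, copied from the table above
obs <- matrix(c(54, 1726, 13, 236, 9, 217, 13, 575), nrow = 2,
              dimnames = list(sex = c("F", "M"),
                              state = c("NSW", "Other", "QLD", "VIC")))

# Expected count under independence: row total * column total / grand total
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Pearson statistic, its degrees of freedom and the upper-tail p-value
x2 <- sum((obs - expected)^2 / expected)
df <- (nrow(obs) - 1) * (ncol(obs) - 1)
p_value <- pchisq(x2, df = df, lower.tail = FALSE)

round(expected, 2)
c(x2 = x2, df = df, p_value = p_value)

# Share of cells with an expected count above 5 (rule of thumb: at least 0.8)
mean(expected > 5)
```

These values should agree with the output of `chisq.test(tab)` shown earlier, up to rounding.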
The researchers would like to know if there is some difference between the states in the outcomes for HIV patients. They decide that they would like to see if the time between diagnosis and death differs between states.
HIV2 <- filter(HIV, status == "D")
HIV2 <- mutate(HIV2, timeSurvived = death-diag)
# or: HIV2$timeSurvived <- HIV2$death - HIV2$diag
Answer: The data is right skewed. The variances are pretty similar. The means of each group seem reasonably similar, though Victoria may be slightly higher.
ggplot(HIV2, aes(state, timeSurvived)) + geom_boxplot()
fit <- aov(timeSurvived ~ state, HIV2)
summary(fit)
## Df Sum Sq Mean Sq F value Pr(>F)
## state 3 1477348 492449 5.124 0.00157 **
## Residuals 1757 168870047 96113
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Answer: At a significance threshold of 0.05, given that our p-value is 0.0016, we would conclude that there is enough evidence to reject the null hypothesis in favour of the alternative. So there is enough evidence to suggest there is a relationship between state and time survived, i.e. the mean time survived is not the same in every state.
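As a small worked check, the F value and p-value can be recovered from the sums of squares and degrees of freedom printed in the ANOVA table above. The variable names here are illustrative only.

```r
# Values copied from the ANOVA table above
ss_state <- 1477348
df_state <- 3
ss_resid <- 168870047
df_resid <- 1757

ms_state <- ss_state / df_state   # mean square between states
ms_resid <- ss_resid / df_resid   # mean square within states (residual)
f_value  <- ms_state / ms_resid   # ratio of between to within variation

# Upper-tail probability of the F distribution gives the p-value
p_value <- pf(f_value, df_state, df_resid, lower.tail = FALSE)
c(F = f_value, p = p_value)
```

A large F value means the variation between state means is large relative to the variation within states, which is what drives the small p-value.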
Answer: The assumptions are independence of the observations, equal variances and normality. From our boxplots it would appear that the variances are roughly equal. As there is some evidence that the data is right skewed, the assumption of normality may not be appropriate.
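One way these assumptions might be inspected beyond the boxplot is sketched below. It uses simulated right-skewed data, not the HIV data, so the numbers it produces say nothing about the real study; `skewed` and `fit_skewed` are hypothetical names. With the real data the same calls could be applied to `HIV2` and `fit`.

```r
set.seed(1)

# Hypothetical right-skewed survival times for four groups (NOT the HIV data)
skewed <- data.frame(
  state = factor(rep(c("NSW", "Other", "QLD", "VIC"), each = 50)),
  timeSurvived = rexp(200, rate = 1 / 400)
)

fit_skewed <- aov(timeSurvived ~ state, data = skewed)

# Equal variances: compare the standard deviation within each group
tapply(skewed$timeSurvived, skewed$state, sd)

# Normality: residuals should lie close to the line if they are normal;
# right-skewed data typically curves away from it in the upper tail
qqnorm(residuals(fit_skewed))
qqline(residuals(fit_skewed))
```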
Answer: A Tukey HSD test to perform pairwise comparisons. This may allow them to see which group means were most different.
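A minimal sketch of how such a follow-up could be run with `TukeyHSD()` from base R is given below. The data are simulated for illustration (one group is given a shifted mean), so `sim` and `fit_sim` are hypothetical; on the real data the call would presumably be `TukeyHSD(fit)`, provided `state` is stored as a factor.

```r
set.seed(2)

# Hypothetical data: four groups, the last with a higher mean (NOT the HIV data)
sim <- data.frame(
  state = factor(rep(c("NSW", "Other", "QLD", "VIC"), each = 50)),
  timeSurvived = rnorm(200,
                       mean = rep(c(400, 400, 400, 500), each = 50),
                       sd = 100)
)

fit_sim <- aov(timeSurvived ~ state, data = sim)

# One row per pair of states: difference in means, 95% interval, adjusted p-value
tukey <- TukeyHSD(fit_sim, "state")
tukey
```

With four states there are six pairwise comparisons, and the adjusted p-values account for making all six at once.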