SID:

Instructions

  • The exam is expected to run for 1 hour and 50 minutes.
  • You will additionally be given 5 minutes to download your Rmd file off Canvas and 5 minutes to submit your html file to Canvas.
  • The exam will become active on Canvas at 12pm at which point you can begin downloading the Rmd file.
  • You should submit your html file on Canvas at 1:55pm. Submissions after 2pm will incur a 5% penalty increasing every 5 minutes.
  • Use the pre-filled rmarkdown file from Canvas to help arrange your answers. You are welcome to use or ignore the code chunks I have inserted or add additional ones.
  • I recommend that you knit your file every 20-30 minutes to ensure it knits. If you get to 1:55 and your file won’t knit, do not stress, just submit your Rmd file.
  • There are 3 questions. The first is worth \(40\%\), the second and third are worth \(30\%\) each.
  • The exam is ‘open computer’. You can use all your notes, google and appropriate help websites. You are not allowed to communicate with others.
  • Academic integrity is a core value of the University of Sydney. We expect you to be familiar with the policies and codes covering academic honesty and conduct at the University. You should be aware that The University of Sydney does not tolerate any form of breach or academic dishonesty https://www.sydney.edu.au/students/academic-dishonesty.html.

Question 1 - HIV (40% of total mark)

HIV = read.csv('http://www.maths.usyd.edu.au/u/ellisp/AMED3002/data/HIV.csv')

Part 1

How many variables and observations are in the dataset?

Answer: 2843 observations of 7 variables.

dim(HIV)
## [1] 2843    7

Comment on the class of these variables and how they are stored in R.

Answer: state, sex, status and T.categ are all categorical and are stored as factors in R. diag, death and age are all numerical and are stored as integers in R.

str(HIV)
## 'data.frame':    2843 obs. of  7 variables:
##  $ state  : Factor w/ 4 levels "NSW","Other",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex    : Factor w/ 2 levels "F","M": 2 2 2 2 2 2 2 2 2 2 ...
##  $ diag   : int  10905 11029 9551 9577 10015 9971 10746 10042 10464 10439 ...
##  $ death  : int  11081 11096 9983 9654 10290 10344 11135 11069 10956 10873 ...
##  $ status : Factor w/ 2 levels "A","D": 2 2 2 2 2 2 2 2 2 2 ...
##  $ T.categ: Factor w/ 8 levels "blood","haem",..: 4 4 4 2 4 4 8 4 4 5 ...
##  $ age    : int  35 53 42 44 39 36 36 31 26 27 ...
  1. Is there any missing data in this dataset?

Answer: No

sum(is.na(HIV))
## [1] 0
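
A per-column view can make the class and missing-data checks above easier to read. The sketch below uses a small made-up data frame (`toy` and its values are hypothetical, not taken from HIV.csv). Note that from R 4.0.0 `read.csv()` no longer converts strings to factors by default, so `stringsAsFactors = TRUE` would be needed there to reproduce the factor columns shown in the `str()` output.

```r
# Hypothetical toy data standing in for HIV.csv; the values are invented.
toy <- data.frame(
  state = c("NSW", "VIC", "QLD", NA),
  sex   = c("M", "F", "M", "M"),
  age   = c(35L, 53L, NA, 44L),
  stringsAsFactors = TRUE  # mimic the pre-4.0.0 read.csv() default
)

# Class of each column (factor or integer here).
col_classes <- sapply(toy, class)
print(col_classes)

# Number of missing values in each column, rather than one overall total.
na_per_col <- colSums(is.na(toy))
print(na_per_col)
```

Counting missing values per column shows where any gaps are, which a single `sum(is.na(...))` total cannot.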

Part 2

  1. What types of variables should sex and state be?

Answer: Sex and state should both be categorical

  1. Generate a grouped bar-plot, comment on any striking features and what they tell you about the data.

Answer: There are clearly more men than women in every state. There are more people in NSW than in any other state. It is hard to see in this plot whether the pattern across states is the same for women as for men.

library(tidyverse)
ggplot(HIV, aes(x = state, fill = sex)) + geom_bar(position = "dodge")

  1. What is an alternative plot you could use to assess this question? What is a benefit and a disadvantage of using this alternative visualisation?

Answer: A stacked bar plot would be an alternative. Its benefit is that it puts more emphasis on the primary variable labelled on the x-axis, so the state totals are easy to compare. Its disadvantage is that it is often hard to see changes in the relative proportions of the secondary variable (labelled by colour).
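
The relative proportions that are hard to judge by eye can also be computed directly. This is a minimal sketch that rebuilds the table from the counts reported in the contingency table for this question; the object names `tab` and `prop_by_state` are only illustrative.

```r
# Counts of sex by state, copied from the contingency table in this question.
tab <- matrix(
  c(54, 1726,   # NSW:   F, M
    13,  236,   # Other: F, M
     9,  217,   # QLD:   F, M
    13,  575),  # VIC:   F, M
  nrow = 2,
  dimnames = list(sex = c("F", "M"),
                  state = c("NSW", "Other", "QLD", "VIC"))
)

# Proportion of each sex within each state (each column sums to 1).
prop_by_state <- prop.table(tab, margin = 2)
print(round(prop_by_state, 3))

# In ggplot2, geom_bar(position = "fill") draws these proportions as bars.
```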

  1. What is an appropriate statistical test that could be used by the researchers to test this question and why?

Answer: A chi-square test of independence, because both variables are categorical and all of our observations came from one sample (prospective study).

  1. What is the corresponding null and alternate hypothesis?

Answer: The null hypothesis is that there is no relationship between sex and state (they are independent). The alternative hypothesis is that there is a relationship.

  1. Construct a contingency table using the variables sex and state.
tab <- table(HIV$sex, HIV$state)
tab
##    
##      NSW Other  QLD  VIC
##   F   54    13    9   13
##   M 1726   236  217  575
  1. Perform the appropriate test.
chisq.test(tab)
## 
##  Pearson's Chi-squared test
## 
## data:  tab
## X-squared = 5.8235, df = 3, p-value = 0.1205
  1. Using a significance threshold of 0.05 what would you conclude from this test and how does this inform the researchers’ question?

Answer: As the p-value (0.1205) is larger than the significance threshold, we would conclude that there is not enough evidence to reject the null hypothesis. There is not enough evidence to say that there is any relationship between sex and state.

  1. What were the assumptions for this test? Comment on them in the context of the observed data.

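Answer: We assume that the observations are independent and that at least 80% of the expected cell counts are greater than 5. The expected counts condition is met in these data: all eight expected counts are greater than 5 (the smallest is about 7.07).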

chisq.test(tab)$expected
##    
##            NSW      Other        QLD       VIC
##   F   55.72283   7.794935   7.074921  18.40732
##   M 1724.27717 241.205065 218.925079 569.59268
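
The expected counts and the test statistic above can be reproduced by hand, which makes it clearer what `chisq.test()` is doing. This is a minimal sketch using the observed counts from the contingency table; the object names are illustrative only.

```r
# Observed counts, copied from the contingency table above.
observed <- matrix(
  c(54, 1726, 13, 236, 9, 217, 13, 575),
  nrow = 2,
  dimnames = list(sex = c("F", "M"),
                  state = c("NSW", "Other", "QLD", "VIC"))
)

# Expected count under independence: row total * column total / grand total.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson statistic and its degrees of freedom, (rows - 1) * (cols - 1).
x_squared <- sum((observed - expected)^2 / expected)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail probability from the chi-squared distribution.
p_value <- pchisq(x_squared, df = df, lower.tail = FALSE)

print(round(expected, 3))
print(c(x_squared = x_squared, df = df, p_value = p_value))
```

The same `expected` matrix is what the 80% rule is checked against.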

Part 3

The researchers would like to know if there is some difference between the states in the outcomes for HIV patients. They decide that they would like to see if the time between diagnosis and death is different between states.

  1. Create a new dataset containing only patients that died using the status variable.
HIV2 <- filter(HIV, status == "D")
  1. Create a new variable for the time that patients survived by subtracting diag from death.
HIV2 <- mutate(HIV2, timeSurvived = death-diag)

# or: HIV2$timeSurvived <- HIV2$death - HIV2$diag
  1. Visualise the time to death for the patients in each state using a boxplot. Comment on any striking features.

Answer: The data are right skewed in every state. The spreads (variances) are fairly similar. The centres of the groups (the medians shown in the boxplot) seem reasonably similar, although Victoria is perhaps slightly higher.

ggplot(HIV2, aes(state, timeSurvived)) + geom_boxplot()

  1. Use one-way ANOVA to test whether the time between diagnosis and death is different between states.
fit <- aov(timeSurvived ~ state, HIV2)
summary(fit)
##               Df    Sum Sq Mean Sq F value  Pr(>F)   
## state          3   1477348  492449   5.124 0.00157 **
## Residuals   1757 168870047   96113                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  1. Using a significance threshold of 0.05 what would you conclude from this test and how does this inform the researchers’ question?

Answer: At a significance threshold of 0.05, given that our p-value is 0.0016, we would conclude that there is enough evidence to reject the null hypothesis in favour of the alternative hypothesis. So there is enough evidence to suggest that the mean time survived differs between at least two of the states.
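
The F value and p-value in the ANOVA table can be reproduced from its sums of squares and degrees of freedom. This is a minimal sketch with the numbers copied from the `summary(fit)` output; the variable names are illustrative.

```r
# Values copied from the ANOVA table above.
ss_state <- 1477348    # between-state sum of squares
ss_resid <- 168870047  # residual sum of squares
df_state <- 3
df_resid <- 1757

# Mean squares are sums of squares divided by their degrees of freedom.
ms_state <- ss_state / df_state
ms_resid <- ss_resid / df_resid

# F statistic and its upper-tail p-value from the F distribution.
f_value <- ms_state / ms_resid
p_value <- pf(f_value, df_state, df_resid, lower.tail = FALSE)

print(c(F = f_value, p = p_value))
```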

  1. What were the assumptions for this test? Comment on them in the context of the observed data and fitted model.

Answer: The assumptions are that the observations are independent, that the groups have equal variances, and that the residuals are normally distributed. From our boxplots it would appear that the variances are roughly equal. As there is some evidence that the data are right skewed, the assumption of normality may not be appropriate.
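
These assumptions can also be checked numerically on the fitted model. The sketch below runs on simulated right-skewed data, because the exam data are not reproduced here; `sim`, `fit_sim` and the gamma parameters are hypothetical stand-ins for `HIV2` and `fit`. The Shapiro-Wilk test is one possible normality check and is not part of the original answer.

```r
set.seed(1)

# Simulated stand-in for HIV2: right-skewed survival times in four states.
# These values are invented for illustration, not the exam data.
sim <- data.frame(
  state = factor(rep(c("NSW", "Other", "QLD", "VIC"), each = 50)),
  timeSurvived = rgamma(200, shape = 2, scale = 150)
)

fit_sim <- aov(timeSurvived ~ state, data = sim)

# Equal variances: compare the standard deviation of each group.
group_sd <- tapply(sim$timeSurvived, sim$state, sd)
print(group_sd)

# Normality: Shapiro-Wilk test on the model residuals
# (a Q-Q plot via qqnorm(res); qqline(res) is a common visual alternative).
res <- residuals(fit_sim)
sw <- shapiro.test(res)
print(sw)
```

On the real data, the same calls would be applied to `HIV2` and `fit`.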

  1. Are there any other tests that you could perform to help the researchers interpret these results? If yes, what would you tell the researchers?

Answer: A Tukey HSD test could be used to perform pairwise comparisons between the states. This may allow them to see which group means differ most.
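
A minimal sketch of that follow-up, again on simulated data rather than the exam data: `sim`, `fit_sim` and the group means are hypothetical, with one group deliberately given a higher mean. `TukeyHSD()` is the base R function that takes an `aov` fit.

```r
set.seed(2)

# Simulated stand-in for HIV2 (invented values, not the exam data);
# the VIC group is given a higher mean so that some pairs can differ.
sim <- data.frame(
  state = factor(rep(c("NSW", "Other", "QLD", "VIC"), each = 50)),
  timeSurvived = rnorm(200,
                       mean = rep(c(400, 400, 400, 480), each = 50),
                       sd = 100)
)

fit_sim <- aov(timeSurvived ~ state, data = sim)

# One row per pair of states: difference in means, 95% confidence
# interval, and p-value adjusted for multiple comparisons.
tukey <- TukeyHSD(fit_sim)
print(tukey)
```

With four states there are six pairwise comparisons; pairs whose interval excludes zero are the ones the researchers would report as different.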