DATA2002 REPORT

Author

510608820

Published

March 9, 2022

Code

knitr::opts_chunk$set(echo = TRUE)
library(gridExtra)
library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.8     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::combine() masks gridExtra::combine()
✖ dplyr::filter()  masks stats::filter()
✖ dplyr::lag()     masks stats::lag()

Code

library(ggplot2)
survey = read_tsv("DATA2x02 survey (2022) - Form responses 1.tsv")

Rows: 207 Columns: 38
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr  (23): Timestamp, Have you ever tested positive to COVID-19?, What are y...
dbl  (13): How do you feel about the idea of travelling overseas?, How often...
time  (2): What time did you go to sleep last night?, What time did you wake...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

1 Introduction

1.1 Is this a random sample of DATA2X02 students?

To be regarded as a random sample of DATA2X02 students, the survey should give every member of the target population an equal chance of being selected. The survey was conducted using Google Forms and, presumably, the link to the form was sent to every student enrolled in DATA2X02 through their Microsoft Outlook accounts, so all DATA2X02 students had the opportunity to take part. However, the respondents do not make up the whole population, as some students may not have filled out the form. Moreover, the survey may only gather responses from one year, so it could miss variation in the data over time. Nevertheless, in short, this is considered a random sample of DATA2X02 students.

1.2 What are the potential biases? Which variables are most likely to be subjected to this bias?

Since, presumably, this survey was sent to every DATA2X02 student, it should avoid sampling bias. However, it may suffer from non-response bias simply because some students decided not to participate. Moreover, some students did not take the survey seriously and responded inaccurately, which results in response bias. Finally, a few questions are poorly constructed, which could confuse respondents and lead to measurement bias.

1.3 Which questions needed improvement to generate useful data?

Some questions in the survey need improvement. For example, “Which sports do you play most often?” contained irrelevant options: the question asks about types of sport but also includes NRL and AFL, which are sports leagues. Moreover, questions like “What is your shoe size?” or “How tall are you?” could specify a common measurement scale, such as US shoe size or height in centimeters. Finally, the questions about gender or WAM could be too personal for some students. Overall, in my opinion, the survey did come up with many exciting questions and gathered insightful data from DATA2X02 students.

2 Results

2.1 Is choosing whether to study DATA2002 or DATA2902 related to previous experience in R?

Code

# Keep only the two columns of interest and give them short names
new_survey <- survey[c("Which unit are you enrolled in?", "Have you ever used R before starting DATA2x02?")]
colnames(new_survey) <- c("Unit_enrolled", "Used_R")

# Count students in each unit-by-experience cell; filter() drops rows with missing answers
# a, b: DATA2002 students who have / have not used R
easy_used_r <- filter(new_survey, Unit_enrolled == "DATA2002" & Used_R == "Yes")
a = nrow(easy_used_r)
easy_not_used_r <- filter(new_survey, Unit_enrolled == "DATA2002" & Used_R == "No")
b = nrow(easy_not_used_r)
# c, d: DATA2902 students who have / have not used R
hard_used_r <- filter(new_survey, Unit_enrolled == "DATA2902" & Used_R == "Yes")
c = nrow(hard_used_r)
hard_not_used_r <- filter(new_survey, Unit_enrolled == "DATA2902" & Used_R == "No")
d = nrow(hard_not_used_r)

Figure 1 shows that students who have used R before make up the majority of DATA2X02 students. Moreover, the number of DATA2902 students is relatively smaller than that of DATA2002. To test whether previous experience in R influences the choice of unit, we apply Fisher’s exact test to the following contingency table.

Code

table <- matrix(c(c,d,a,b), ncol = 2)
colnames(table) <- c("DATA2902", "DATA2002")
rownames(table) <- c("Have used R", "Have not used R")
df <- as.data.frame(table) %>%
  rownames_to_column(var = "R_Experience") %>%
  pivot_longer(cols = c("DATA2902", "DATA2002"), names_to = "Unit", values_to = "No_Students")
knitr::kable(table)

	DATA2902	DATA2002
Have used R	40	147
Have not used R	0	18
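
A quick check of the expected cell counts under independence helps explain why an exact test is a sensible choice for this table. The sketch below recomputes them from the counts shown above. The object names (`counts`, `expected`, `small_cells`) are illustrative, and the rule of thumb that a chi-squared approximation wants expected counts of at least 5 is an assumption about the motivation, not something stated in the analysis.

```r
# Counts copied from the contingency table above (columns: DATA2902, DATA2002)
counts <- matrix(c(40, 0, 147, 18), ncol = 2,
                 dimnames = list(c("Have used R", "Have not used R"),
                                 c("DATA2902", "DATA2002")))

# Expected count for each cell under independence: row total * column total / grand total
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)
print(round(expected, 2))

# Number of cells whose expected count falls below the usual threshold of 5
small_cells <- sum(expected < 5)
print(small_cells)
```

The smallest expected count is 18 × 40 / 205 ≈ 3.51, which is below 5, so an exact test avoids relying on a large-sample approximation.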

Code

p1 <- ggplot(df, aes(y = No_Students, x = Unit, fill = R_Experience)) +
  geom_col(width = 0.75) +
  xlab("Unit") + ylab("Number of Students") +
  labs(fill = "Prior experience in R") +
  ggtitle("Figure 1: Numbers of students in DATA2X02 and their experience in R") +
  scale_fill_brewer(palette = "Set2")

print(p1)

Code

t1 <- fisher.test(table)

Hypothesis: H0: Choice of unit and previous experience in R are independent vs H1: Choice of unit and previous experience in R are related.
Assumptions: The contingency table is 2×2, and the row and column totals are treated as fixed.
Test statistic: Not applicable; Fisher’s exact test computes the p-value directly from the hypergeometric distribution.
p-value: 0.0269
Decision: Since the p-value is less than 0.05, we reject the null hypothesis and conclude that there is evidence of a relationship between the choice of unit and previous experience in R.
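
To make the source of the p-value concrete, the following sketch rebuilds it by hand from the hypergeometric distribution and compares it with `fisher.test()`. It uses the counts from the contingency table above. The variable names are illustrative, and the small tolerance factor is an assumption chosen to mirror how ties are usually handled in the two-sided calculation.

```r
# Counts copied from the contingency table (columns: DATA2902, DATA2002)
tab <- matrix(c(40, 0, 147, 18), ncol = 2,
              dimnames = list(c("Have used R", "Have not used R"),
                              c("DATA2902", "DATA2002")))

m <- sum(tab[1, ])  # students who have used R
n <- sum(tab[2, ])  # students who have not used R
k <- sum(tab[, 1])  # students enrolled in DATA2902

# With fixed margins, the top-left cell follows a hypergeometric distribution under H0
support <- max(0, k - n):min(k, m)
probs <- dhyper(support, m, n, k)
p_obs <- dhyper(tab[1, 1], m, n, k)

# Two-sided p-value: total probability of tables no more likely than the observed one
p_manual <- sum(probs[probs <= p_obs * (1 + 1e-7)])
p_fisher <- fisher.test(tab)$p.value

print(c(manual = p_manual, fisher = p_fisher))
```

Both values should agree with the p-value reported above after rounding to four decimal places.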

Code

new_sec_survey <- survey[c("Have you ever tested positive to COVID-19?", "How do you feel about the idea of travelling overseas?")]
colnames(new_sec_survey) <- c("cov19","travel_rate")

cov_19_pos <- filter(new_sec_survey, cov19 == "Yes") %>% na.omit

cov_19_neg <- filter(new_sec_survey, cov19 == "No") %>% na.omit

Code

Cov19_positive <- c("Have", "Haven't")
Mean <- c(round(mean(cov_19_pos$travel_rate),2), round(mean(cov_19_neg$travel_rate),2))
SD <- c(round(sd(cov_19_pos$travel_rate),2), round(sd(cov_19_neg$travel_rate),2))
n <- c(nrow(cov_19_pos),nrow(cov_19_neg))

df <- data.frame(Cov19_positive, Mean, SD, n)

knitr::kable(df)

Cov19_positive	Mean	SD	n
Have	8.34	1.90	87
Haven’t	7.46	2.37	114

Code

dat <- rbind(cov_19_pos, cov_19_neg)

p1 <- ggplot(dat, aes(x = cov19, y = travel_rate, color = cov19)) +
  geom_boxplot() + geom_jitter(width = 0.15, size = 1) + ylab("Ratings for travelling") + xlab("") +
  scale_color_brewer(palette = "Set2") + labs(color = "Cov19 +") + ggtitle("Figure 2:")
print(p1)

Code

p2 <- ggplot(dat, aes(sample = travel_rate, colour = cov19)) + stat_qq() + stat_qq_line() +
      scale_color_brewer(palette = "Set2") + facet_wrap(~cov19) + coord_fixed() + labs(color = "Cov19 +") + ggtitle("Figure 3:")
print(p2)

Code

grid.arrange(p1, p2, ncol = 2)

Code

t2 <- t.test(cov_19_pos$travel_rate, cov_19_neg$travel_rate, alternative = "two.sided", var.equal = TRUE)
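
As a rough cross-check of the call above, the pooled two-sample t statistic can be rebuilt from the rounded summary table (means 8.34 and 7.46, SDs 1.90 and 2.37, group sizes 87 and 114). This is only a sketch: the inputs are rounded to two decimals, so the result will differ slightly from what `t.test()` returns on the raw data, and the variable names are illustrative.

```r
# Rounded summary statistics copied from the table (tested positive vs not)
m1 <- 8.34; s1 <- 1.90; n1 <- 87
m2 <- 7.46; s2 <- 2.37; n2 <- 114

# Pooled standard deviation, matching the equal-variance choice var.equal = TRUE
df_pooled <- n1 + n2 - 2
sp <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / df_pooled)

# Standard error of the difference in means and the resulting t statistic
se <- sp * sqrt(1 / n1 + 1 / n2)
t_stat <- (m1 - m2) / se

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = df_pooled)

cat("t =", round(t_stat, 2), " df =", df_pooled, " p =", signif(p_value, 2), "\n")
```

The degrees of freedom here are 87 + 114 − 2 = 199. If the two sample variances looked very different, dropping `var.equal = TRUE` would give the Welch version of the test instead.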