DATA2002 REPORT

Author

510608820

Published

March 9, 2022

Code

knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.8     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Code

library(ggplot2)
survey = read_tsv("DATA2x02 survey (2022) - Form responses 1.tsv")

Rows: 207 Columns: 38
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr  (23): Timestamp, Have you ever tested positive to COVID-19?, What are y...
dbl  (13): How do you feel about the idea of travelling overseas?, How often...
time  (2): What time did you go to sleep last night?, What time did you wake...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Code

survey

# A tibble: 207 × 38
   Timestamp     Have …¹ What …² How t…³ How d…⁴ What …⁵ If yo…⁶ How d…⁷ How o…⁸
   <chr>         <chr>   <chr>   <chr>   <chr>   <chr>   <chr>     <dbl>   <dbl>
 1 15/08/2022 1… No      With p… "1,58"  I neve… I neve… 50-100…      10       3
 2 15/08/2022 1… Yes     Colleg… "165"   Bus, W… Music,… 100           4       8
 3 15/08/2022 1… Yes     Share … "180"   Walk    Music   $700          7       8
 4 15/08/2022 1… Yes     With p… "182"   Train,… Music,… 200          10       2
 5 15/08/2022 1… Yes     With p… "173"   Train   Music,… 5             7       2
 6 15/08/2022 1… No      With p… "163"   Bus     Music   dont k…       2      10
 7 15/08/2022 1… No      Share … "183cm" Bus     I don'… 2000          5       6
 8 15/08/2022 1… Yes     With p… "6'3\"" Walk    I don'… 150           9       3
 9 15/08/2022 1… No      With p… "166"   Train   I don'… on ave…      10       6
10 15/08/2022 1… Yes     With p… "165"   Train,… Music,… 3000          8       4
# … with 197 more rows, 29 more variables:
#   `How many hours a week do you spend studying?` <dbl>,
#   `Do you watch and/or read the news regularly?` <chr>,
#   `What is your study load?` <chr>, `Do you work?` <chr>,
#   `When you're in a Zoom lab, how often do you turn your camera on?` <chr>,
#   `What is your favourite social media platform?` <chr>,
#   `What is your gender?` <chr>, …
# ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names

1 Introduction

1.1 Is this a random sample of DATA2X02 students?

For the responses to be regarded as a random sample of DATA2X02 students, every student in the target population must have had an equal chance of being selected. The survey was conducted using Google Forms, and presumably the link to the form was sent to every student enrolled in DATA2X02 through their Microsoft Outlook accounts, so all DATA2X02 students had the opportunity to take part. However, the sample does not cover the whole population, because some students did not fill out the form. Moreover, the survey may only gather responses from one year, so it could miss variation in the data over time. Nevertheless, in this report the responses are treated as a random sample of DATA2X02 students.

1.2 What are the potential biases? Which variables are most likely to be subjected to this bias?

Since the survey was presumably sent to every DATA2X02 student, it should avoid sampling bias. However, it may suffer from non-response bias, simply because some students decided not to participate. Moreover, some students did not take the survey seriously and responded inaccurately, which results in response bias. Finally, a few questions are poorly constructed, which could confuse respondents and lead to measurement bias.

1.3 Which questions needed improvement to generate useful data?

Some questions in the survey need improvement. For example, “Which sports do you play most often?” included irrelevant options: the question asks about types of sport, but the options included NRL and AFL, which are sports leagues. Moreover, questions like “What is your shoe size?” or “How tall are you?” should specify a standard unit, such as US shoe size or height in centimetres. Finally, the questions about gender and WAM could be too personal for some students. Overall, in my opinion, the survey asked many interesting questions and gathered insightful data from DATA2X02 students.
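The data preview above shows why a stated unit matters: the height column contains values such as "1,58", "165", "183cm" and 6'3". The following is only a minimal sketch of how such mixed responses might be converted to centimetres. The helper `parse_height_cm` is hypothetical (it is not part of the original analysis), and the rule that values below 3 are metres is an assumption.

```r
# Hypothetical helper (not in the original report): convert one free-text
# height response to centimetres.
parse_height_cm <- function(x) {
  x <- trimws(tolower(x))

  # Feet and inches, e.g. 6'3"
  ft_in <- regmatches(x, regexec("^([0-9]+)' *([0-9]*)\"?$", x))[[1]]
  if (length(ft_in) == 3) {
    feet <- as.numeric(ft_in[2])
    inches <- if (nzchar(ft_in[3])) as.numeric(ft_in[3]) else 0
    return((feet * 12 + inches) * 2.54)
  }

  # Drop a trailing "cm" or "m" and treat "," as a decimal mark
  cleaned <- gsub(",", ".", gsub(" *(cm|m)$", "", x))
  num <- suppressWarnings(as.numeric(cleaned))
  if (is.na(num)) return(NA_real_)

  # Assumption: values below 3 were entered in metres
  if (num < 3) num * 100 else num
}

# Example responses taken from the data preview
heights <- c("1,58", "165", "183cm", "6'3\"")
heights_cm <- vapply(heights, parse_height_cm, numeric(1), USE.NAMES = FALSE)
heights_cm
```

Responses that do not fit any of these patterns come back as `NA` and would still need manual review.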

2 Results

Code

# Keep only the two variables of interest and give them short names
new_survey <- survey[c("Which unit are you enrolled in?", "Have you ever used R before starting DATA2x02?")]
colnames(new_survey) <- c("Unit_enrolled", "Used_R")

# Count students in each unit by prior R experience
n_2002_yes <- nrow(filter(new_survey, Unit_enrolled == "DATA2002" & Used_R == "Yes"))
n_2002_no  <- nrow(filter(new_survey, Unit_enrolled == "DATA2002" & Used_R == "No"))
n_2902_yes <- nrow(filter(new_survey, Unit_enrolled == "DATA2902" & Used_R == "Yes"))
n_2902_no  <- nrow(filter(new_survey, Unit_enrolled == "DATA2902" & Used_R == "No"))

# 2 x 2 contingency table, filled column by column:
# rows = prior R experience, columns = unit
table <- matrix(c(n_2902_yes, n_2902_no, n_2002_yes, n_2002_no), ncol = 2)
colnames(table) <- c("DATA2902", "DATA2002")
rownames(table) <- c("Yes", "No")

# Reshape to long format for plotting
df <- as.data.frame(table) %>%
  rownames_to_column(var = "R_Experience") %>%
  pivot_longer(cols = c("DATA2902", "DATA2002"),
               names_to = "Unit", values_to = "No_Students")

# Stacked bar chart of student counts by unit and prior R experience
p1 <- ggplot(df, aes(y = No_Students, x = Unit, fill = R_Experience)) +
  geom_col() +
  xlab("Unit") +
  ylab("Number of Students") +
  labs(fill = "Prior experience in R") +
  scale_fill_brewer(palette = "Set2")

knitr::kable(table)

	DATA2902	DATA2002
Yes	40	147
No	0	18

Code

print(p1)

Code

t1 <- fisher.test(table)
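The result stored in `t1` is not printed in the report. As a hedged sketch, the same test can be reproduced without the survey file by re-entering the counts from the table above by hand. The names `r_table`, `expected` and `fisher_result` are introduced here for illustration only. The expected counts are included because a small expected cell is a common reason to prefer Fisher's exact test over a chi-squared approximation; the report itself does not state its reason.

```r
# Counts copied by hand from the contingency table shown above
r_table <- matrix(
  c(40, 0, 147, 18),
  ncol = 2,
  dimnames = list(R_experience = c("Yes", "No"),
                  Unit = c("DATA2902", "DATA2002"))
)

# Expected counts under independence: (row total * column total) / grand total
expected <- outer(rowSums(r_table), colSums(r_table)) / sum(r_table)
expected

# Fisher's exact test of independence between unit and prior R experience
fisher_result <- fisher.test(r_table)
fisher_result$p.value
```

With these counts, the expected number of DATA2902 students without R experience is 18 × 40 / 205, which is below 5.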

Code

# Keep COVID-19 status and the rating for travelling overseas
new_sec_survey <- survey[c("Have you ever tested positive to COVID-19?", "How do you feel about the idea of travelling overseas?")]
colnames(new_sec_survey) <- c("cov19", "travel_rate")

# Split by COVID-19 status, dropping rows with missing values
cov_19_pos <- filter(new_sec_survey, cov19 == "Yes") %>% na.omit()

cov_19_neg <- filter(new_sec_survey, cov19 == "No") %>% na.omit()

status <- c("Yes", "No")
Mean <- c(mean(cov_19_pos$travel_rate), mean(cov_19_neg$travel_rate))
SD <- c(sd(cov_19_pos$travel_rate), sd(cov_19_neg$travel_rate))
n <- c(nrow(cov_19_pos),nrow(cov_19_neg))

df <- data.frame(status, Mean, SD, n)
df

  status     Mean       SD   n
1    Yes 8.344828 1.897536  87
2     No 7.456140 2.365461 114

Code

dat <- rbind(cov_19_pos, cov_19_neg)

p1 <- ggplot(dat, aes(x=cov19, y=travel_rate)) + geom_boxplot() + 
     geom_jitter(width = 0.15, size = 1) + 
     ylab("Ratings for travelling") + xlab("Covid-19 Test")
p1

Code

p2 <- ggplot(dat, aes(sample = travel_rate, colour = cov19)) + stat_qq() + stat_qq_line()
p2

Code

t.test(cov_19_pos$travel_rate, cov_19_neg$travel_rate, alternative = "two.sided", var.equal = TRUE)


    Two Sample t-test

data:  cov_19_pos$travel_rate and cov_19_neg$travel_rate
t = 2.8693, df = 199, p-value = 0.004558
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.2779316 1.4994428
sample estimates:
mean of x mean of y 
 8.344828  7.456140
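As a check on the output above, the equal-variance (pooled) two-sample t statistic can be recomputed from the summary table of means, standard deviations and group sizes alone. This is a sketch that uses the rounded values printed in the report, so the results agree only up to rounding. The variable names are introduced here for illustration.

```r
# Summary statistics copied from the table of means, SDs and group sizes above
mean_yes <- 8.344828; sd_yes <- 1.897536; n_yes <- 87
mean_no  <- 7.456140; sd_no  <- 2.365461; n_no  <- 114

# Degrees of freedom for the pooled (var.equal = TRUE) test
df_pooled <- n_yes + n_no - 2

# Pooled variance: weighted average of the two sample variances
pooled_var <- ((n_yes - 1) * sd_yes^2 + (n_no - 1) * sd_no^2) / df_pooled

# Standard error of the difference in means
std_error <- sqrt(pooled_var * (1 / n_yes + 1 / n_no))

# Test statistic and two-sided p-value
t_stat <- (mean_yes - mean_no) / std_error
p_value <- 2 * pt(-abs(t_stat), df = df_pooled)

# 95% confidence interval for the difference in means
ci <- (mean_yes - mean_no) + c(-1, 1) * qt(0.975, df_pooled) * std_error

round(c(t = t_stat, df = df_pooled, p = p_value), 4)
round(ci, 3)
```

These values reproduce the reported t = 2.8693 on 199 degrees of freedom, the p-value of 0.004558, and the confidence interval from about 0.278 to 1.499.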