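library(gridExtra)
library(tidyverse)  # provides read_tsv(), filter(), %>%, rownames_to_column(), pivot_longer()
library(ggplot2)
survey <- read_tsv("DATA2x02 survey (2022) - Form responses 1.tsv")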
Rows: 207 Columns: 38
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (23): Timestamp, Have you ever tested positive to COVID-19?, What are y...
dbl (13): How do you feel about the idea of travelling overseas?, How often...
time (2): What time did you go to sleep last night?, What time did you wake...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
1 Introduction
1.1 Is this a random sample of DATA2X02 students?
To be regarded as a random sample of DATA2X02 students, the survey must give every member of the target population an equal chance of being selected. The survey was conducted using Google Forms and, presumably, the link to the form was sent to every student enrolled in DATA2X02 through their Microsoft Outlook accounts, so all DATA2X02 students had the opportunity to take part. However, this does not mean the sample covers the whole population, as there may be students who did not fill out the form. Moreover, the survey gathers responses from a single year only, so it may miss variation over time. Nevertheless, for the purposes of this report, it is treated as a random sample of DATA2X02 students.
1.2 What are the potential biases? Which variables are most likely to be subjected to this bias?
Since the survey was presumably sent to every DATA2X02 student, it should avoid sampling bias. However, it may suffer from non-response bias, because some students chose not to participate. Moreover, some students did not take the survey seriously and responded to questions inaccurately, which results in response bias. Finally, a few questions are poorly constructed, which could confuse respondents and lead to measurement bias.
1.3 Which questions needed improvement to generate useful data?
Some questions in the survey needed improvement. For example, “Which sports do you play most often?” contained irrelevant options: the question asked about types of sport but also listed NRL and AFL, which are sports leagues. Moreover, questions like “What is your shoe size?” or “How tall are you?” should specify a standard unit of measurement, such as US shoe size or height in centimeters. Finally, the questions about gender and WAM could be too personal for some students. Overall, in my opinion, the survey asked many interesting questions and gathered insightful data from DATA2X02 students.
2 Result
2.1 Is choosing whether to study DATA2002 or DATA2902 related to previous experience in R?
Code
new_survey <- survey[c("Which unit are you enrolled in?", "Have you ever used R before starting DATA2x02?")]
colnames(new_survey) <- c("Unit_enrolled", "Used_R")

# Count the students in each unit / R-experience combination
easy_used_r <- filter(new_survey, Unit_enrolled == "DATA2002" & Used_R == "Yes")
a <- nrow(easy_used_r)
easy_not_used_r <- filter(new_survey, Unit_enrolled == "DATA2002" & Used_R == "No")
b <- nrow(easy_not_used_r)
hard_used_r <- filter(new_survey, Unit_enrolled == "DATA2902" & Used_R == "Yes")
c <- nrow(hard_used_r)
hard_not_used_r <- filter(new_survey, Unit_enrolled == "DATA2902" & Used_R == "No")
d <- nrow(hard_not_used_r)
Figure 1 shows that students who had used R before make up the majority of DATA2X02 students. Moreover, the number of DATA2902 students is relatively small compared with that of DATA2002. To test whether previous experience in R is related to the choice of unit, we apply Fisher’s exact test to the following contingency table.
Code
# Columns are filled in order: DATA2902 (used, not used), then DATA2002 (used, not used)
table <- matrix(c(c, d, a, b), ncol = 2)
colnames(table) <- c("DATA2902", "DATA2002")
rownames(table) <- c("Have used R", "Have not used R")

# Long format for plotting
df <- as.data.frame(table) %>%
  rownames_to_column(var = "R_Experience") %>%
  pivot_longer(cols = c("DATA2902", "DATA2002"), names_to = "Unit", values_to = "No_Students")

knitr::kable(table)
|                | DATA2902| DATA2002|
|:---------------|--------:|--------:|
|Have used R     |       40|      147|
|Have not used R |        0|       18|
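As an alternative to counting each cell with a separate `filter()` call, the same contingency table could be built in one step with base R's `table()`. The sketch below is only illustrative: the two vectors are rebuilt from the counts reported above rather than read from the survey file, and the names `unit`, `used_r` and `tab` are hypothetical.

```r
# Rebuild one entry per student from the reported cell counts
# (an illustrative stand-in for the two survey columns).
unit   <- rep(c("DATA2902", "DATA2002", "DATA2002"), times = c(40, 147, 18))
used_r <- rep(c("Yes", "Yes", "No"), times = c(40, 147, 18))

# Cross-tabulate both variables; combinations with no students appear as 0
tab <- table(Used_R = used_r, Unit = unit)
print(tab)
```

With the real data, the two renamed columns of `new_survey` would take the place of `used_r` and `unit`.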
Code
p1 <- ggplot(df, aes(y = No_Students, x = Unit, fill = R_Experience)) +
  geom_col(width = 0.75) +
  xlab("Unit") +
  ylab("Number of Students") +
  labs(fill = "Prior experience in R") +
  ggtitle("Figure 1: Numbers of students in DATA2X02 and their experience in R") +
  scale_fill_brewer(palette = "Set2")
print(p1)
Code
t1 <- fisher.test(table)
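1. Hypothesis: H0: Choice of unit and previous experience in R are independent vs H1: Choice of unit and previous experience in R are related.
2. Assumptions: The contingency table is 2x2, and the row and column totals are treated as fixed. Fisher’s exact test is used rather than a chi-squared test because the expected count of DATA2902 students who have not used R is 40 × 18 / 205 ≈ 3.5, which is below 5.
3. Test statistic: Not applicable. Fisher’s exact test computes the p-value directly from the hypergeometric distribution.
4. p-value: 0.0269
5. Decision: Since the p-value is less than 0.05, we reject the null hypothesis and conclude that there is evidence of a relationship between the choice of unit and previous experience in R.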
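The choice of Fisher’s exact test can be checked with a short sketch. It re-enters the counts from the contingency table by hand (the object names `counts`, `expected` and `res` are illustrative, not the ones used in the report), computes the expected counts under independence, and reruns the test.

```r
# Counts copied from the contingency table, filled column by column
counts <- matrix(
  c(40, 0, 147, 18),
  ncol = 2,
  dimnames = list(
    c("Have used R", "Have not used R"),
    c("DATA2902", "DATA2002")
  )
)

# Expected counts under independence: (row total * column total) / grand total
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)
print(round(expected, 2))

# Two-sided Fisher's exact test on the same table
res <- fisher.test(counts)
print(res$p.value)
```

The smallest expected count (DATA2902 students who have not used R) is below 5, the usual rule of thumb for preferring an exact test over the chi-squared approximation.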
Code
new_sec_survey <- survey[c("Have you ever tested positive to COVID-19?", "How do you feel about the idea of travelling overseas?")]
colnames(new_sec_survey) <- c("cov19", "travel_rate")

# Split by COVID-19 status and drop rows with missing values
cov_19_pos <- filter(new_sec_survey, cov19 == "Yes") %>% na.omit()
cov_19_neg <- filter(new_sec_survey, cov19 == "No") %>% na.omit()
Code
# Summary statistics of the travel ratings for each group
Cov19_postive <- c("Have", "Haven't")
Mean <- c(round(mean(cov_19_pos$travel_rate), 2), round(mean(cov_19_neg$travel_rate), 2))
SD <- c(round(sd(cov_19_pos$travel_rate), 2), round(sd(cov_19_neg$travel_rate), 2))
n <- c(nrow(cov_19_pos), nrow(cov_19_neg))
df <- data.frame(Cov19_postive, Mean, SD, n)
knitr::kable(df)
Code
dat <- rbind(cov_19_pos, cov_19_neg)

# Box plots of the ratings by group
p1 <- ggplot(dat, aes(x = cov19, y = travel_rate, color = cov19)) +
  geom_boxplot() +
  geom_jitter(width = 0.15, size = 1) +
  ylab("Ratings for travelling") +
  xlab("") +
  scale_color_brewer(palette = "Set2") +
  labs(color = "Cov19 +") +
  ggtitle("Figure 2:")
print(p1)

# Normal Q-Q plots of the ratings by group
p2 <- ggplot(dat, aes(sample = travel_rate, colour = cov19)) +
  stat_qq() +
  stat_qq_line() +
  scale_color_brewer(palette = "Set2") +
  facet_wrap(~cov19) +
  coord_fixed() +
  labs(color = "Cov19 +") +
  ggtitle("Figure 3:")
print(p2)

gridExtra::grid.arrange(p1, p2, ncol = 2)
Code
# Two-sample t-test with pooled variance
t2 <- t.test(cov_19_pos$travel_rate, cov_19_neg$travel_rate, alternative = "two.sided", var.equal = TRUE)