DATA2902 Assignment

Introduction

General discussion of the data as a whole

Questions 1, 2 and 3

Although the professor distributed the survey to all students in DATA2X02, not all of them responded. The data set is therefore a voluntary response sample: it consists of people who selected themselves by responding to a general invitation. Because inclusion depends on each person's decision to take part, the sample is subject to self-selection bias.

Students who are more enthusiastic about the unit may be more likely to respond to the survey and so become part of the sample. In addition, because the unit code distinguishes DATA2902 as the advanced level, DATA2002 students may tend to rate their own statistical ability lower than the advanced-level students do.

The data set shows inconsistencies in the units used for height, in the pay period students assumed when predicting their entry salary, and in how they wrote their major. The survey should have stated the unit for height (for example metres or inches) and the period for the salary (per hour, month, or year). For the major, it would be better to offer a list of options instead of letting students respond in their own words.
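The mixed height units could be tidied before analysis along the following lines. This is only a sketch: `standardise_height` is a hypothetical helper, and the cut-offs (values below 3 read as metres, values from 3 up to 100 read as inches) are assumptions chosen for illustration, not rules taken from the survey. Heights entered in feet are not handled.

```r
# Hypothetical helper: convert self-reported heights to centimetres.
# Assumptions (not from the survey): values below 3 are metres,
# values from 3 up to 100 are inches, anything larger is already in cm.
standardise_height <- function(height) {
  ifelse(is.na(height), NA_real_,
    ifelse(height < 3, height * 100,        # metres -> centimetres
      ifelse(height < 100, height * 2.54,   # inches -> centimetres
        height)))                           # assumed to be centimetres
}

# Example with one value per assumed unit, plus a missing response
standardise_height(c(1.75, 68, 180, NA))
```

Applied to the survey, this would be used as `standardise_height(DATA_2X02_data$height)` once the columns have been renamed; the cut-offs should be checked against the actual responses first.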

Preparations

Package import

library(tidyverse)
library(ggplot2)
library(dplyr)

Dataset import

Original_data <- readr::read_csv("D:/R_studio_data2912/DATA2902_R_data/DATA2x02_survey_data.csv")

Data rename

DATA_2X02_data <- rename(Original_data,
  time = 1, num_covid_test = 2, living_arrangement = 3, height = 4,
  guess_day_of_event = 5, in_Australia = 6, self_math_ability = 7,
  self_R_coding_ability = 8, diff_easy_of_data2002 = 9, year_of_uni = 10,
  fre_zoom_cam_on = 11, vac_status = 12, fav_social_media = 13, gender = 14,
  steak_cooked = 15, domi_hand = 16, stress = 17, loneliness = 18,
  non_spam_on_Fri = 19, end_email = 20, guess_entry_salary = 21, unit = 22,
  major = 23, hr_exercise_per_week = 24)

number <- DATA_2X02_data %>%
  filter(unit == "DATA2902 (Advanced)")

omit_na_covid<-data.frame(num_covid_test = DATA_2X02_data$num_covid_test) %>%
  drop_na() 

# Total tests, number of respondents with a non-missing answer, and their ratio
omit_na_covid %>%
  summarize(sum_of_test = sum(num_covid_test),
            sum_of_people_cotest = n(),
            lambda_covid_test = sum_of_test / sum_of_people_cotest)

##   sum_of_test sum_of_people_cotest lambda_covid_test
## 1         214                  208          1.028846

# Store the totals and the estimated Poisson rate for later use
sum_of_test <- sum(omit_na_covid$num_covid_test)
sum_of_people_cotest <- sum(!is.na(DATA_2X02_data$num_covid_test))
lambda_covid_test <- sum_of_test / sum_of_people_cotest

214 / 208 = 1.0288462
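This ratio is the sample mean of the reported test counts, which is the standard maximum likelihood estimate of a Poisson rate. Written out, with x_i denoting the number of tests reported by respondent i and n the number of non-missing responses (notation introduced here for illustration):

```latex
\hat{\lambda} \;=\; \frac{1}{n}\sum_{i=1}^{n} x_i \;=\; \frac{214}{208} \;\approx\; 1.0288
```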

# Plot the observed distribution of COVID test counts against the fitted Poisson distribution
y=omit_na_covid
colnames(y) = c('num_covid_test')
n = nrow(y) # sample size

covid_counts = y %>% 
  group_by(num_covid_test) %>% 
  summarise(count = n()) 

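# Expected Poisson counts, evaluated at the observed numbers of tests so that
# each expected count lines up with its observed count
pois <- dpois(covid_counts$num_covid_test, lambda = lambda_covid_test) * sum_of_people_cotest

df <- data.frame(covid_counts, pois)

colnames(df) <- c("Tests_done", "Observed_counts", "Poisson_counts")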

ggplot(df, aes(x = Tests_done, y = Observed_counts)) + geom_col(alpha = 0.8) +
  geom_point(data = df, aes(x = Tests_done, y = Poisson_counts), alpha = 0.6, color = "blue") +
  geom_line(data = df, aes(x = Tests_done, y = Poisson_counts),  alpha = 0.6, color = "blue") +
  ylab("Counts") + xlab("Covid tests done") +
  scale_x_continuous(breaks = 0:10)

Drop_na_unit_stress<- drop_na(DATA_2X02_data,unit,stress)
ggplot(Drop_na_unit_stress, aes(x = unit, y = stress)) + 
  geom_boxplot(coef = 10) +
  geom_jitter(width = 0.15, size = 1) +
  theme_linedraw(base_size = 10) + 
  labs(x = "Unit",  y = "Stress Levels Felt in Past Week")