Loading the relevant packages.
library(readr)
library(ggplot2)
library(tidyverse)
library(janitor)
library(kableExtra)
library(gendercoder)
library(gridExtra)
options(knitr.table.format = "html")
This data set was provided by the course; the data come from the students who answered the survey posted on Ed. There are 696 students in DATA2002 and 58 students in DATA2902.
The survey has 24 variables and a sample size of 574: 7 of the variables are quantitative and 17 are qualitative.
General discussion of the data:
This is not a random sample of DATA2X02 students, because the survey was only mentioned in lectures and posted on Ed. The respondents are therefore biased towards students who regularly visit Ed and attend every lecture.
There is also potential non-response bias. It is likely to affect variables such as stress and whether the unit is core or elective: students who only use Ed occasionally would not have seen the survey. Those students may spend less time on this course, so the unit is more likely to be an elective for them and they may feel less stressed.
Firstly, we read the data and do a simple clean: the values "", "n/a" and " " are converted to NA.
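A minimal sketch of this reading step, assuming the survey export is saved as DATA2x02_survey.csv (the file name is our assumption); the na argument of read_csv() does the replacement while reading:
# Read the survey; treat empty strings, a single space and "n/a" as missing.
# The file name below is an assumption about how the export was saved.
data = readr::read_csv("DATA2x02_survey.csv", na = c("", " ", "n/a"))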
Quick overview of the question and structure
glimpse(data)
colnames(data)
Each row represents the data from a particular student.
Each column represents all data of a single attribute.
To make the analysis clearer, we treat blank responses and 'n/a' as NA. We also rename each column so that the variables are easier to refer to.
questions = colnames(data)
short_names = c("time","covid_tests","living_arrangements","height",
"event_day","in_aus","math_ability","r_code_ability",
"data2002","enrolled_year","webcam","vaccination_status","social_media",
"gender","steak_preference","dominant_hand","stress",
"loneliness","emails","sign_off","salary","units","major/selective","exercise")
colnames(data) = short_names
It is better not to drop every row that contains an NA: a row may be missing only one value, and dropping the whole row would also discard the useful responses in it.
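As a quick illustration (a sketch, not used elsewhere), comparing the full sample size with what would remain after a blanket drop_na() shows how much data row-wise deletion would cost:
nrow(data)                  # full sample size
nrow(tidyr::drop_na(data))  # rows left if every row containing any NA were dropped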
Gender
For the gender data there are many different ways of writing the same response, such as upper case letters or abbreviations. Therefore, we need to clean the data and unify the spelling, which is useful for the research questions below.
We use the gendercoder package to unify the gender data, for example recoding "f" and "Female" to "female", so that we end up with fewer gender categories. We then add a new column called gender_clean to the data set.
data = data %>%
mutate(gender_clean = gendercoder::recode_gender(gender))
data %>% tabyl(gender_clean)
After that, we find that gender_clean still has some missing values. We therefore keep only the non-missing values of gender_clean and store their frequency table in gender.
gender = data %>%
select(gender_clean) %>%
filter(!is.na(gender_clean))
gender = as.data.frame(table(gender))
Then we clean the height data, adding a new column called height_clean to the data set and removing some impossible values. We first use parse_number() to convert every response to a number, then use case_when() to handle students who answered in different units and to exclude implausible values.
data = data %>%
dplyr::mutate(
height_clean = readr::parse_number(height),
height_clean = case_when(
height_clean <= 2.2 ~ height_clean * 100, # responses given in metres: convert to centimetres
height_clean <= 140 ~ NA_real_, # values between 2.2 and 140 are implausible in either unit, so treat as missing
TRUE ~ height_clean
)
)
h = data %>% select(height_clean)
h
Firstly, we convert all responses to lowercase; this solves the problem of the same word being written in different cases.
data$event_day = tolower(data$event_day)
Then we combine responses that refer to the same day by matching the start of each string.
data = data %>%
mutate(
event_day = case_when(
startsWith(event_day, "fri") ~ "friday",
startsWith(event_day, "sun") ~ "sunday",
startsWith(event_day, "wed") ~ "wednesday",
startsWith(event_day, "mon") ~ "monday",
startsWith(event_day, "chri") ~ "christmasday",
startsWith(event_day, "lol") ~ "monday",
TRUE ~ event_day
))
day = data %>% select(event_day)
day
List the salary responses, then keep only the numeric part.
data %>% tabyl(salary)
data = data %>% mutate(
salary_clean = readr::parse_number(salary)
)
Convert the exercise responses to numeric values. Even exercising all the time would give at most 168 hours per week, so any value below 0 or above 168 hours is excluded.
data = data %>%
mutate(
exercise = as.numeric(exercise),
exercise = case_when(
exercise > 168 ~ NA_real_,
exercise < 0 ~ NA_real_,
TRUE ~ exercise
)
)
Does the number of COVID tests follow a Poisson distribution?
\[P(X = x) = \frac{e^{-\lambda} \lambda^{x}}{x!}\]
par(mfrow=c(1,2))
# lambda = 2
plot(table(rpois(n=10000, lambda=2)), ylab = "Count")
# lambda = 6
plot(table(rpois(n=10000, lambda=6)), ylab = "Count")
Select the covid_tests data and exclude NAs.
covid_tests = data %>%
select(covid_tests) %>%
filter(!is.na(covid_tests))
covid = as.data.frame(table(covid_tests))
covid
data visualisation:
ggplot(covid,aes(x = covid_tests, y = Freq)) +
geom_bar(stat = "identity", fill="steelblue",alpha=0.8)+
labs(x = "Number of tests",y = "Number of people")
This plot looks roughly consistent with a Poisson distribution, although there is a small bump at 5 tests.
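To make the visual comparison more concrete, the expected counts under a Poisson distribution fitted by the sample mean could be overlaid on the bar chart (a sketch; the names tests, lam_hat and expected are ours):
# Overlay expected Poisson counts (lambda estimated by the sample mean).
tests = as.numeric(as.character(covid$covid_tests))
lam_hat = sum(tests * covid$Freq) / sum(covid$Freq)
covid$expected = sum(covid$Freq) * dpois(tests, lam_hat)
ggplot(covid, aes(x = covid_tests, y = Freq)) +
  geom_col(fill = "steelblue", alpha = 0.8) +
  geom_point(aes(y = expected), colour = "red") +
  labs(x = "Number of tests", y = "Number of people")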
Hypothesis \(H_{0}:\) the number of COVID tests comes from a Poisson distribution vs. \(H_{1}:\) the number of COVID tests does not come from a Poisson distribution.
Assumptions - The expected frequencies, \(e_i = np_i \geq 5\). - Observations are independent.
y = covid$Freq
x = 0:9
n = sum(y) # total number of samples (sample size)
k = length(y) # number of groups
lam = sum(y * x)/n # estimate the lambda parameter
p = dpois(x, lambda = lam) # obtain the p_i from the Poisson pmf
p
## [1] 3.608724e-01 3.678123e-01 1.874428e-01 6.368249e-02 1.622679e-02
## [6] 3.307768e-03 5.618965e-04 8.181461e-05 1.042350e-05 1.180439e-06
p[9] = 1 - sum(p[1:8])
round(p, 5)
## [1] 0.36087 0.36781 0.18744 0.06368 0.01623 0.00331 0.00056 0.00008 0.00001
## [10] 0.00000
p
## [1] 3.608724e-01 3.678123e-01 1.874428e-01 6.368249e-02 1.622679e-02
## [6] 3.307768e-03 5.618965e-04 8.181461e-05 1.173642e-05 1.180439e-06
(ey = n * p) # calculate the expected frequencies
## [1] 7.506146e+01 7.650495e+01 3.898810e+01 1.324596e+01 3.375172e+00
## [6] 6.880158e-01 1.168745e-01 1.701744e-02 2.441176e-03 2.455312e-04
ey >= 5 #check assumption e_i >= 5 not all satisfied
## [1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
Only the first three expected frequencies satisfy the assumption, so we have to combine the remaining adjacent classes.
yr = c(y[1:3], sum(y[4:9]))
yr
## [1] 126 40 16 24
Check that the combined classes now satisfy the assumption that all frequencies, in particular the expected frequencies, are at least 5.
yr >= 5
## [1] TRUE TRUE TRUE TRUE
(eyr = c(ey[1:3], sum(ey[4:9])))
## [1] 75.06146 76.50495 38.98810 17.44548
eyr >= 5
## [1] TRUE TRUE TRUE TRUE
(pr = c(p[1:3], sum(p[4:9])))
## [1] 0.36087243 0.36781228 0.18744280 0.08387249
Test statistic
\[T = \sum_{i=1}^k \frac{(Y_i - np_i)^2}{np_i}\]
kr = length(yr) # number of combined classes
(t0 = sum((yr - eyr)^2/eyr)) # test statistic
## [1] 68.0036
P-value and Conclusion
(pval = 1 - pchisq(t0, df = kr - 1 - 1)) # p-value
## [1] 1.665335e-15
round(pval, 5)
## [1] 0
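As a cross-check, the same statistic can be obtained from chisq.test() (a sketch; note that chisq.test() reports a p-value on k - 1 degrees of freedom and does not account for the estimated lambda, so we recompute the p-value with one further degree of freedom removed):
gof = chisq.test(yr, p = pr, rescale.p = TRUE)  # rescale.p guards against tiny rounding error
gof$statistic                                   # should match t0
pchisq(gof$statistic, df = kr - 2, lower.tail = FALSE)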
The p-value is smaller than 0.05, so there is strong evidence in the data against \(H_{0}\). Hence, the number of COVID tests does not follow a Poisson distribution.
xr = c("0","1","2",'>=3') # group labels
par(mfrow = c(1, 2), cex = 1.5) # plot options
barplot(yr, names.arg = xr, main = "Observed frequency")
barplot(eyr, names.arg = xr, main = "Expected frequency")
The plots above also show clear differences between the observed frequencies and the expected frequencies under the fitted Poisson distribution.
Relationship between gender and social media. The reason for studying the relationship between gender and preferred social media is that, from informal observation of friends, Chinese male students seemed to use social media less because they prefer to go out. The social media platforms in the survey responses are Bilibili, QQ, WeChat, Weibo and TikTok.
Hypothesis \(H_{0}:\) gender and social media are independent. vs. \(H_{1}:\) gender and social media are dependent.
Assumptions - The expected frequencies, \(e_{ij} = y_{i\bullet} y_{\bullet j}/n \ge 5\). - Observations are independent.
r1 = as.data.frame(table(data$gender_clean, data$social_media))
r1
y = c(0,4,1,1,3,1,5,3,17,23)
n = sum(y)
c = 5
r = 2
y.mat = matrix(y, nrow = r, ncol = c)
colnames(y.mat) = c("bilibili", "qq", "weibo", "tiktok", "wechat")
rownames(y.mat) = c("female", "male")
y.mat
## bilibili qq weibo tiktok wechat
## female 0 1 3 5 17
## male 4 1 1 3 23
chisq.test(y.mat, correct = FALSE)
##
## Pearson's Chi-squared test
##
## data: y.mat
## X-squared = 5.8418, df = 4, p-value = 0.2113
(yr = apply(y.mat, 1, sum)) # rowSums(y.mat)
## female male
## 26 32
(yc = apply(y.mat, 2, sum)) # colSums(y.mat)
## bilibili qq weibo tiktok wechat
## 4 2 4 8 40
(yr.mat = matrix(yr, r, c, byrow = FALSE))
## [,1] [,2] [,3] [,4] [,5]
## [1,] 26 26 26 26 26
## [2,] 32 32 32 32 32
(yc.mat = matrix(yc, r, c, byrow = TRUE))
## [,1] [,2] [,3] [,4] [,5]
## [1,] 4 2 4 8 40
## [2,] 4 2 4 8 40
# matrix mult: ey.mat = yr %*% t(yc) / n
(ey.mat = yr.mat * yc.mat / sum(y.mat))
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1.793103 0.8965517 1.793103 3.586207 17.93103
## [2,] 2.206897 1.1034483 2.206897 4.413793 22.06897
all(ey.mat>=5) # check all e_{ij} >= 5
## [1] FALSE
(t0 = sum((y.mat - ey.mat)^2 / ey.mat))
## [1] 5.841827
(pval = pchisq(t0, (r - 1) * (c - 1),
lower.tail=FALSE))
## [1] 0.2112762
Decision
Because our p-value (0.21) is greater than 0.05, we do not reject the null hypothesis at the 5% significance level. The data therefore provide no evidence of an association between gender and preferred social media. Note, however, that several expected cell counts are below 5, so the chi-squared approximation is questionable; see the sketch below.
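Since the expected-frequency assumption fails here, a simulated p-value or Fisher's exact test avoids the asymptotic approximation (a sketch; output not shown):
chisq.test(y.mat, simulate.p.value = TRUE, B = 10000)  # Monte Carlo p-value
fisher.test(y.mat)                                     # exact test on the 2 x 5 table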
Relationship between gender and stress. It is sometimes suggested that female students experience more stress than male students, so we examine whether gender and reported stress level are associated.
Hypothesis \(H_{0}:\) gender and stress are independent. vs. \(H_{1}:\) gender and stress are dependent.
Assumptions - The expected frequencies, \(e_{ij} = y_{i\bullet} y_{\bullet j}/n \ge 5\). - Observations are independent.
r2 = as.data.frame(table(data$stress, data$gender_clean))
r2
y.mat = xtabs(Freq ~ Var1 + Var2, r2)
y.mat
## Var2
## Var1 female male non-binary
## 0 0 1 0
## 1 0 2 0
## 2 1 4 0
## 3 9 16 0
## 4 6 12 0
## 5 7 20 0
## 6 14 30 0
## 7 12 22 0
## 8 12 7 0
## 9 8 8 0
## 10 5 7 1
chisq.test(y.mat, correct = FALSE)
##
## Pearson's Chi-squared test
##
## data: y.mat
## X-squared = 26.17, df = 20, p-value = 0.1603
r = 11
c = 3
(yr = apply(y.mat, 1, sum)) # rowSums(y.mat)
## 0 1 2 3 4 5 6 7 8 9 10
## 1 2 5 25 18 27 44 34 19 16 13
(yc = apply(y.mat, 2, sum)) # colSums(y.mat)
## female male non-binary
## 74 129 1
(yr.mat = matrix(yr, r, c, byrow = FALSE))
## [,1] [,2] [,3]
## [1,] 1 1 1
## [2,] 2 2 2
## [3,] 5 5 5
## [4,] 25 25 25
## [5,] 18 18 18
## [6,] 27 27 27
## [7,] 44 44 44
## [8,] 34 34 34
## [9,] 19 19 19
## [10,] 16 16 16
## [11,] 13 13 13
(yc.mat = matrix(yc, r, c, byrow = TRUE))
## [,1] [,2] [,3]
## [1,] 74 129 1
## [2,] 74 129 1
## [3,] 74 129 1
## [4,] 74 129 1
## [5,] 74 129 1
## [6,] 74 129 1
## [7,] 74 129 1
## [8,] 74 129 1
## [9,] 74 129 1
## [10,] 74 129 1
## [11,] 74 129 1
# matrix mult: ey.mat = yr %*% t(yc) / n
(ey.mat = yr.mat * yc.mat / sum(y.mat))
## [,1] [,2] [,3]
## [1,] 0.3627451 0.6323529 0.004901961
## [2,] 0.7254902 1.2647059 0.009803922
## [3,] 1.8137255 3.1617647 0.024509804
## [4,] 9.0686275 15.8088235 0.122549020
## [5,] 6.5294118 11.3823529 0.088235294
## [6,] 9.7941176 17.0735294 0.132352941
## [7,] 15.9607843 27.8235294 0.215686275
## [8,] 12.3333333 21.5000000 0.166666667
## [9,] 6.8921569 12.0147059 0.093137255
## [10,] 5.8039216 10.1176471 0.078431373
## [11,] 4.7156863 8.2205882 0.063725490
all(ey.mat>=5) # check all e_{ij} >= 5
## [1] FALSE
(t0 = sum((y.mat - ey.mat)^2 / ey.mat))
## [1] 26.16993
(pval = pchisq(t0, (r - 1) * (c - 1),
lower.tail=FALSE))
## [1] 0.1602726
Decision
Because our p-value (0.16) is greater than 0.05, we do not reject the null hypothesis at the 5% significance level. The data therefore provide no evidence of an association between gender and stress level. As with the previous test, many expected cell counts are below 5, so the chi-squared approximation is doubtful; see the sketch below.
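One way around the small expected counts, as a sketch, is to drop the single non-binary response and collapse the stress scores into a few bands before re-testing (the band boundaries 0-3, 4-7 and 8-10 are our assumption):
# Collapse stress scores into three bands and keep only the female/male columns.
y.small = rbind(
  "low (0-3)"    = colSums(y.mat[1:4,  c("female", "male")]),
  "medium (4-7)" = colSums(y.mat[5:8,  c("female", "male")]),
  "high (8-10)"  = colSums(y.mat[9:11, c("female", "male")])
)
y.small
chisq.test(y.small, correct = FALSE)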
Whether the weekly exercise volume meets the recommendation. The CDC recommends that adults do at least 150 minutes (2.5 hours) of exercise per week ("Move More; Sit Less", 2021). Therefore, to find out whether students' exercise meets this recommendation, we carry out a one-sample t-test.
Hypothesis \(H_{0}: \mu = 2.5\) vs. \(H_{1}: \mu < 2.5\), where \(\mu\) is the mean weekly exercise time in hours (matching the alternative = "less" used in the test below).
Assumptions - The \(X_{i}\) are independent and identically distributed random variables following \(N(\mu, \sigma^2)\).
exercise = data %>%
select(exercise) %>%
filter(!is.na(exercise))
r3 = as.data.frame(table(exercise))
r3
x = exercise$exercise
mean(x)
## [1] 4.749261
sd(x)
## [1] 6.382117
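Before running the test, the normality assumption stated above can be checked informally (a sketch; plots not shown):
# Informal check of the normality assumption for the exercise data.
par(mfrow = c(1, 2))
hist(x, main = "Exercise hours per week", xlab = "Hours")
qqnorm(x); qqline(x)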
t.test(x, mu = 2.5, alternative = "less")
##
## One Sample t-test
##
## data: x
## t = 5.0214, df = 202, p-value = 1
## alternative hypothesis: true mean is less than 2.5
## 95 percent confidence interval:
## -Inf 5.489446
## sample estimates:
## mean of x
## 4.749261
n = length(x)
t0 = (mean(x) - 2.5)/(sd(x)/sqrt(n))
t0
## [1] 5.02138
pval = pt(t0, n - 1)
pval
## [1] 0.9999994
Decision
Because our p-value is much greater than 0.05, we do not reject the null hypothesis at the 5% significance level: there is no evidence that students' mean weekly exercise time is less than the recommended 2.5 hours. Indeed, the sample mean of about 4.7 hours per week is well above the recommendation.
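If one instead wanted to test directly whether the mean exceeds the recommended 2.5 hours (\(H_{1}: \mu > 2.5\)), only the direction of the alternative changes (a sketch; output not shown):
t.test(x, mu = 2.5, alternative = "greater")
pt(t0, n - 1, lower.tail = FALSE)  # equivalent manual p-value for the "greater" alternative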
The first research question was "Does the number of COVID tests a student has taken in the past two months follow a Poisson distribution?". We found that it does not follow a Poisson distribution.
The second and third research questions asked whether gender and social media are independent and whether gender and stress are independent. In both cases the data provide no evidence against independence.
The final research question asked whether students do the recommended 2.5 hours of exercise per week. We found no evidence that the mean weekly exercise time is below 2.5 hours; the sample mean was about 4.7 hours per week.
Some of the data are still not perfectly clean. For instance, one respondent reported exercising 80 hours per week, which is unlikely to be genuine.
DATA2002 (2021). Lecture 4 slides. Retrieved 18 September 2021, from https://pages.github.sydney.edu.au/DATA2002/2021/lectures/lec04.html#14
DATA2002 (2021). Lecture 8 slides. Retrieved 18 September 2021, from https://pages.github.sydney.edu.au/DATA2002/2021/lectures/lec08.html#29
DATA2002 (2021). Lecture 10 slides. Retrieved 18 September 2021, from https://pages.github.sydney.edu.au/DATA2002/2021/lectures/lec10.html#17
Move More; Sit Less. (2021). Retrieved 18 September 2021, from https://www.cdc.gov/physicalactivity/basics/adults/index.htm