Loading the relevant packages.
library(readr)
library(ggplot2)
library(tidyverse)
library(janitor)
library(kableExtra)
library(gendercoder)
library(gridExtra)
options(knitr.table.format = "html")
This data set was provided by the course; the data come from the students who answered the survey posted on Ed. There are 696 students in DATA2002 and 58 students in DATA2902.
The survey has 24 variables and a sample size of 574: 7 of the variables are quantitative and 17 are qualitative.
General discussion of the data:
This is not a random sample of DATA2X02 students, because the survey was only mentioned in lectures and posted on Ed. The respondents are therefore biased towards students who regularly visit Ed and attend every lecture.
There is also potential non-response bias. It is likely to affect variables such as stress and whether the unit is core or elective: students who only use Ed occasionally would not have seen the survey. Those students may spend less time on this course, so the unit is more likely to be an elective for them and they may feel less stressed.
Firstly, we read the data and do a simple clean: the values "", "n/a" and " " are converted to NA.
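A minimal sketch of this reading step, assuming the survey export is saved as DATA2x02_survey.csv (the file name is our assumption); the na argument of read_csv() does the replacement while reading:
# Read the survey; treat empty strings, a single space and "n/a" as missing.
# The file name below is an assumption about how the export was saved.
data = readr::read_csv("DATA2x02_survey.csv", na = c("", " ", "n/a"))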
Quick overview of the question and structure
glimpse(data)
colnames(data)
Each row represents the data from a particular student.
Each column represents all data of a single attribute.
To make the analysis clearer, we treat blank responses and 'n/a' as NA. We also rename each column so that the variables are easier to refer to.
questions = colnames(data)
short_names = c("time","covid_tests","living_arrangements","height",
"event_day","in_aus","math_ability","r_code_ability",
"data2002","enrolled_year","webcam","vaccination_status","social_media",
"gender","steak_preference","dominant_hand","stress",
"loneliness","emails","sign_off","salary","units","major/selective","exercise")
colnames(data) = short_names
It is better not to drop every row that contains an NA: a row may be missing only one value, and dropping the whole row would also discard the useful responses in it.
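As a quick illustration (a sketch, not used elsewhere), comparing the full sample size with what would remain after a blanket drop_na() shows how much data row-wise deletion would cost:
nrow(data)                  # full sample size
nrow(tidyr::drop_na(data))  # rows left if every row containing any NA were dropped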
Gender
For the gender data there are many different ways of writing the same response, such as upper case letters or abbreviations. Therefore, we need to clean the data and unify the spelling, which is useful for the research questions below.
We use the gendercoder package to unify the gender data, for example recoding "f" and "Female" to "female", so that we end up with fewer gender categories. We then add a new column called gender_clean to the data set.
data = data %>%
mutate(gender_clean = gendercoder::recode_gender(gender))
data %>% tabyl(gender_clean)
After that, we find that gender_clean still has some missing values. We therefore keep only the non-missing values of gender_clean and store their frequency table in gender.
gender = data %>%
select(gender_clean) %>%
filter(!is.na(gender_clean))
gender = as.data.frame(table(gender))
Then we clean the height data, adding a new column called height_clean to the data set and removing some impossible values. We first use parse_number() to convert every response to a number, then use case_when() to handle students who answered in different units and to exclude implausible values.
data = data %>%
dplyr::mutate(
height_clean = readr::parse_number(height),
height_clean = case_when(
height_clean <= 2.2 ~ height_clean * 100, # responses given in metres: convert to centimetres
height_clean <= 140 ~ NA_real_, # values between 2.2 and 140 are implausible in either unit, so treat as missing
TRUE ~ height_clean
)
)
h = data %>% select(height_clean)
h
Firstly, we convert all responses to lowercase; this solves the problem of the same word being written in different cases.
data$event_day = tolower(data$event_day)
Then we combine responses that refer to the same day by matching the start of each string.
data = data %>%
mutate(
event_day = case_when(
startsWith(event_day, "fri") ~ "friday",
startsWith(event_day, "sun") ~ "sunday",
startsWith(event_day, "wed") ~ "wednesday",
startsWith(event_day, "mon") ~ "monday",
startsWith(event_day, "chri") ~ "christmasday",
startsWith(event_day, "lol") ~ "monday",
TRUE ~ event_day
))
day = data %>% select(event_day)
day
List the salary responses, then keep only the numeric part.
data %>% tabyl(salary)
data = data %>% mutate(
salary_clean = readr::parse_number(salary)
)
Convert the exercise responses to numeric values. Even exercising all the time would give at most 168 hours per week, so any value below 0 or above 168 hours is excluded.
data = data %>%
mutate(
exercise = as.numeric(exercise),
exercise = case_when(
exercise > 168 ~ NA_real_,
exercise < 0 ~ NA_real_,
TRUE ~ exercise
)
)
Does the number of COVID tests follow a Poisson distribution?
\[P(X = x) = \frac{e^{-\lambda} \lambda^{x}}{x!}\]
par(mfrow=c(1,2))
# lambda = 2
plot(table(rpois(n=10000, lambda=2)), ylab = "Count")
# lambda = 6
plot(table(rpois(n=10000, lambda=6)), ylab = "Count")
Select the covid_tests data and exclude NAs.
covid_tests = data %>%
select(covid_tests) %>%
filter(!is.na(covid_tests))
covid = as.data.frame(table(covid_tests))
covid
data visualisation:
ggplot(covid,aes(x = covid_tests, y = Freq)) +
geom_bar(stat = "identity", fill="steelblue",alpha=0.8)+
labs(x = "Number of tests",y = "Number of people")
This plot looks roughly consistent with a Poisson distribution, although there is a small bump at 5 tests.
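To make the visual comparison more concrete, the expected counts under a Poisson distribution fitted by the sample mean could be overlaid on the bar chart (a sketch; the names tests, lam_hat and expected are ours):
# Overlay expected Poisson counts (lambda estimated by the sample mean).
tests = as.numeric(as.character(covid$covid_tests))
lam_hat = sum(tests * covid$Freq) / sum(covid$Freq)
covid$expected = sum(covid$Freq) * dpois(tests, lam_hat)
ggplot(covid, aes(x = covid_tests, y = Freq)) +
  geom_col(fill = "steelblue", alpha = 0.8) +
  geom_point(aes(y = expected), colour = "red") +
  labs(x = "Number of tests", y = "Number of people")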
Hypothesis \(H_{0}:\) the number of COVID tests comes from a Poisson distribution vs. \(H_{1}:\) the number of COVID tests does not come from a Poisson distribution.
Assumptions - The expected frequencies, \(e_i = np_i \geq 5\). - Observations are independent.
y = covid$Freq
x = 0:9
n = sum(y) # total number of samples (sample size)
k = length(y) # number of groups
lam = sum(y * x)/n # estimate the lambda parameter
p = dpois(x, lambda = lam) # obtain the p_i from the Poisson pmf
p
## [1] 3.608724e-01 3.678123e-01 1.874428e-01 6.368249e-02 1.622679e-02
## [6] 3.307768e-03 5.618965e-04 8.181461e-05 1.042350e-05 1.180439e-06
p[9] = 1 - sum(p[1:8])
round(p, 5)
## [1] 0.36087 0.36781 0.18744 0.06368 0.01623 0.00331 0.00056 0.00008 0.00001
## [10] 0.00000
p
## [1] 3.608724e-01 3.678123e-01 1.874428e-01 6.368249e-02 1.622679e-02
## [6] 3.307768e-03 5.618965e-04 8.181461e-05 1.173642e-05 1.180439e-06
(ey = n * p) # calculate the expected frequencies
## [1] 7.506146e+01 7.650495e+01 3.898810e+01 1.324596e+01 3.375172e+00
## [6] 6.880158e-01 1.168745e-01 1.701744e-02 2.441176e-03 2.455312e-04
ey >= 5 #check assumption e_i >= 5 not all satisfied
## [1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
Only the first three expected frequencies satisfy the assumption, so we have to combine the remaining adjacent classes.
yr = c(y[1:3], sum(y[4:9]))
yr
## [1] 126 40 16 24
Check that the combined classes now satisfy the assumption that all frequencies, in particular the expected frequencies, are at least 5.
yr >= 5
## [1] TRUE TRUE TRUE TRUE
(eyr = c(ey[1:3], sum(ey[4:9])))
## [1] 75.06146 76.50495 38.98810 17.44548
eyr >= 5
## [1] TRUE TRUE TRUE TRUE
(pr = c(p[1:3], sum(p[4:9])))
## [1] 0.36087243 0.36781228 0.18744280 0.08387249
Test statistic
\[T = \sum_{i=1}^k \frac{(Y_i - np_i)^2}{np_i}\]
kr = length(yr) # number of combined classes
(t0 = sum((yr - eyr)^2/eyr)) # test statistic
## [1] 68.0036
P-value and Conclusion
(pval = 1 - pchisq(t0, df = kr - 1 - 1)) # p-value
## [1] 1.665335e-15
round(pval, 5)
## [1] 0
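As a cross-check, the same statistic can be obtained from chisq.test() (a sketch; note that chisq.test() reports a p-value on k - 1 degrees of freedom and does not account for the estimated lambda, so we recompute the p-value with one further degree of freedom removed):
gof = chisq.test(yr, p = pr, rescale.p = TRUE)  # rescale.p guards against tiny rounding error
gof$statistic                                   # should match t0
pchisq(gof$statistic, df = kr - 2, lower.tail = FALSE)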
The p-value is smaller than 0.05, so there is strong evidence in the data against \(H_{0}\). Hence, the number of COVID tests does not follow a Poisson distribution.
xr = c("0","1","2",'>=3') # group labels
par(mfrow = c(1, 2), cex = 1.5) # plot options
barplot(yr, names.arg = xr, main = "Observed frequency")
barplot(eyr, names.arg = xr, main = "Expected frequency")
The plots above also show clear differences between the observed frequencies and the expected frequencies under the fitted Poisson distribution.
Relationship between gender and social media. The reason for studying the relationship between gender and preferred social media is that, from informal observation of friends, Chinese male students seemed to use social media less because they prefer to go out. The social media platforms in the survey responses are Bilibili, QQ, WeChat, Weibo and TikTok.
Hypothesis \(H_{0}:\) gender and social media are independent. vs. \(H_{1}:\) gender and social media are dependent.
Assumptions - The expected frequencies, \(e_{ij} = y_{i\bullet} y_{\bullet j}/n \ge 5\). - Observations are independent.
r1 = as.data.frame(table(data$gender_clean, data$social_media))
r1
y = c(0,4,1,1,3,1,5,3,17,23)
n = sum(y)
c = 5
r = 2
y.mat = matrix(y, nrow = r, ncol = c)
colnames(y.mat) = c("bilibili", "qq", "weibo", "tiktok", "wechat")
rownames(y.mat) = c("female", "male")
y.mat
## bilibili qq weibo tiktok wechat
## female 0 1 3 5 17
## male 4 1 1 3 23
chisq.test(y.mat, correct = FALSE)
##
## Pearson's Chi-squared test
##
## data: y.mat
## X-squared = 5.8418, df = 4, p-value = 0.2113
(yr = apply(y.mat, 1, sum)) # rowSums(y.mat)
## female male
## 26 32
(yc = apply(y.mat, 2, sum)) # colSums(y.mat)
## bilibili qq weibo tiktok wechat
## 4 2 4 8 40
(yr.mat = matrix(yr, r, c, byrow = FALSE))
## [,1] [,2] [,3] [,4] [,5]
## [1,] 26 26 26 26 26
## [2,] 32 32 32 32 32
(yc.mat = matrix(yc, r, c, byrow = TRUE))
## [,1] [,2] [,3] [,4] [,5]
## [1,] 4 2 4 8 40
## [2,] 4 2 4 8 40
# matrix mult: ey.mat = yr %*% t(yc) / n
(ey.mat = yr.mat * yc.mat / sum(y.mat))
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1.793103 0.8965517 1.793103 3.586207 17.93103
## [2,] 2.206897 1.1034483 2.206897 4.413793 22.06897
all(ey.mat>=5) # check all e_{ij} >= 5
## [1] FALSE
(t0 = sum((y.mat - ey.mat)^2 / ey.mat))
## [1] 5.841827
(pval = pchisq(t0, (r - 1) * (c - 1),
lower.tail=FALSE))
## [1] 0.2112762
Decision
Because our p-value (0.21) is greater than 0.05, we do not reject the null hypothesis at the 5% significance level. The data therefore provide no evidence of an association between gender and preferred social media. Note, however, that several expected cell counts are below 5, so the chi-squared approximation is questionable; see the sketch below.
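Since the expected-frequency assumption fails here, a simulated p-value or Fisher's exact test avoids the asymptotic approximation (a sketch; output not shown):
chisq.test(y.mat, simulate.p.value = TRUE, B = 10000)  # Monte Carlo p-value
fisher.test(y.mat)                                     # exact test on the 2 x 5 table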
Relationship between gender and stress. It is sometimes suggested that female students experience more stress than male students, so we examine whether gender and reported stress level are associated.
Hypothesis \(H_{0}:\) gender and stress are independent. vs. \(H_{1}:\) gender and stress are dependent.
Assumptions - The expected frequencies, \(e_{ij} = y_{i\bullet} y_{\bullet j}/n \ge 5\). - Observations are independent.
r2 = as.data.frame(table(data$stress, data$gender_clean))
r2
y.mat = xtabs(Freq ~ Var1 + Var2, r2)
y.mat
## Var2
## Var1 female male non-binary
## 0 0 1 0
## 1 0 2 0
## 2 1 4 0
## 3 9 16 0
## 4 6 12 0
## 5 7 20 0
## 6 14 30 0
## 7 12 22 0
## 8 12 7 0
## 9 8 8 0
## 10 5 7 1
chisq.test(y.mat, correct = FALSE)
##
## Pearson's Chi-squared test
##
## data: y.mat
## X-squared = 26.17, df = 20, p-value = 0.1603
r = 11
c = 3
(yr = apply(y.mat, 1, sum)) # rowSums(y.mat)
## 0 1 2 3 4 5 6 7 8 9 10
## 1 2 5 25 18 27 44 34 19 16 13
(yc = apply(y.mat, 2, sum)) # colSums(y.mat)
## female male non-binary
## 74 129 1
(yr.mat = matrix(yr, r, c, byrow = FALSE))
## [,1] [,2] [,3]
## [1,] 1 1 1
## [2,] 2 2 2
## [3,] 5 5 5
## [4,] 25 25 25
## [5,] 18 18 18
## [6,] 27 27 27
## [7,] 44 44 44
## [8,] 34 34 34
## [9,] 19 19 19
## [10,] 16 16 16
## [11,] 13 13 13
(yc.mat = matrix(yc, r, c, byrow = TRUE))
## [,1] [,2] [,3]
## [1,] 74 129 1
## [2,] 74 129 1
## [3,] 74 129 1
## [4,] 74 129 1
## [5,] 74 129 1
## [6,] 74 129 1
## [7,] 74 129 1
## [8,] 74 129 1
## [9,] 74 129 1
## [10,] 74 129 1
## [11,] 74 129 1
# matrix mult: ey.mat = yr %*% t(yc) / n
(ey.mat = yr.mat * yc.mat / sum(y.mat))
## [,1] [,2] [,3]
## [1,] 0.3627451 0.6323529 0.004901961
## [2,] 0.7254902 1.2647059 0.009803922
## [3,] 1.8137255 3.1617647 0.024509804
## [4,] 9.0686275 15.8088235 0.122549020
## [5,] 6.5294118 11.3823529 0.088235294
## [6,] 9.7941176 17.0735294 0.132352941
## [7,] 15.9607843 27.8235294 0.215686275
## [8,] 12.3333333 21.5000000 0.166666667
## [9,] 6.8921569 12.0147059 0.093137255
## [10,] 5.8039216 10.1176471 0.078431373
## [11,] 4.7156863 8.2205882 0.063725490
all(ey.mat>=5) # check all e_{ij} >= 5
## [1] FALSE
(t0 = sum((y.mat - ey.mat)^2 / ey.mat))
## [1] 26.16993
(pval = pchisq(t0, (r - 1) * (c - 1),
lower.tail=FALSE))
## [1] 0.1602726
Decision
Because our p-value (0.16) is greater than 0.05, we do not reject the null hypothesis at the 5% significance level. The data therefore provide no evidence of an association between gender and stress level. As with the previous test, many expected cell counts are below 5, so the chi-squared approximation is doubtful; see the sketch below.
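One way around the small expected counts, as a sketch, is to drop the single non-binary response and collapse the stress scores into a few bands before re-testing (the band boundaries 0-3, 4-7 and 8-10 are our assumption):
# Collapse stress scores into three bands and keep only the female/male columns.
y.small = rbind(
  "low (0-3)"    = colSums(y.mat[1:4,  c("female", "male")]),
  "medium (4-7)" = colSums(y.mat[5:8,  c("female", "male")]),
  "high (8-10)"  = colSums(y.mat[9:11, c("female", "male")])
)
y.small
chisq.test(y.small, correct = FALSE)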
Whether the weekly exercise volume meets the recommendation. The CDC recommends that adults do at least 150 minutes (2.5 hours) of exercise per week ("Move More; Sit Less", 2021). Therefore, to find out whether students' exercise meets this recommendation, we carry out a one-sample t-test.
Hypothesis \(H_{0}: \mu = 2.5\) vs. \(H_{1}: \mu < 2.5\), where \(\mu\) is the mean weekly exercise time in hours (matching the alternative = "less" used in the test below).
Assumptions - The \(X_{i}\) are independent and identically distributed random variables following \(N(\mu, \sigma^2)\).
exercise = data %>%
select(exercise) %>%
filter(!is.na(exercise))
r3 = as.data.frame(table(exercise))
r3
x = exercise$exercise
mean(x)
## [1] 4.749261
sd(x)
## [1] 6.382117
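Before running the test, the normality assumption stated above can be checked informally (a sketch; plots not shown):
# Informal check of the normality assumption for the exercise data.
par(mfrow = c(1, 2))
hist(x, main = "Exercise hours per week", xlab = "Hours")
qqnorm(x); qqline(x)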
t.test(x, mu = 2.5, alternative = "less")
##
## One Sample t-test
##
## data: x
## t = 5.0214, df = 202, p-value = 1
## alternative hypothesis: true mean is less than 2.5
## 95 percent confidence interval:
## -Inf 5.489446
## sample estimates:
## mean of x
## 4.749261
n = length(x)
t0 = (mean(x) - 2.5)/(sd(x)/sqrt(n))
t0
## [1] 5.02138
pval = pt(t0, n - 1)
pval
## [1] 0.9999994
Decision
Because our p-value is much greater than 0.05, we do not reject the null hypothesis at the 5% significance level: there is no evidence that students' mean weekly exercise time is less than the recommended 2.5 hours. Indeed, the sample mean of about 4.7 hours per week is well above the recommendation.
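If one instead wanted to test directly whether the mean exceeds the recommended 2.5 hours (\(H_{1}: \mu > 2.5\)), only the direction of the alternative changes (a sketch; output not shown):
t.test(x, mu = 2.5, alternative = "greater")
pt(t0, n - 1, lower.tail = FALSE)  # equivalent manual p-value for the "greater" alternative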
The first research question was "Does the number of COVID tests a student has taken in the past two months follow a Poisson distribution?". We found that it does not follow a Poisson distribution.
The second and third research questions asked whether gender and social media are independent and whether gender and stress are independent. In both cases the data provide no evidence against independence.
The final research question asked whether students do the recommended 2.5 hours of exercise per week. We found no evidence that the mean weekly exercise time is below 2.5 hours; the sample mean was about 4.7 hours per week.
Some of the data are still not perfectly clean. For instance, one respondent reported exercising 80 hours per week, which is unlikely to be genuine.
DATA2002 (2021). Lecture 4 slides. Retrieved 18 September 2021, from https://pages.github.sydney.edu.au/DATA2002/2021/lectures/lec04.html#14
DATA2002 (2021). Lecture 8 slides. Retrieved 18 September 2021, from https://pages.github.sydney.edu.au/DATA2002/2021/lectures/lec08.html#29
DATA2002 (2021). Lecture 10 slides. Retrieved 18 September 2021, from https://pages.github.sydney.edu.au/DATA2002/2021/lectures/lec10.html#17
Move More; Sit Less. (2021). Retrieved 18 September 2021, from https://www.cdc.gov/physicalactivity/basics/adults/index.htm