Research Question:
Are there differences in math, reading and writing exams?
mydata <- read.table("./original_data4.csv", header = TRUE, sep = ",")
head(mydata)
## ID Gender MathScore ReadingScore WritingScore
## 1 0 female 72 72 74
## 2 1 female 69 90 88
## 3 2 female 90 95 93
## 4 3 male 47 57 44
## 5 4 male 76 78 75
## 6 5 female 71 83 78
Unit of observation: one student
Description of data:
The data set was taken from Kaggle.com (Students exam scores: Extended dataset) and the sample size is 999 students. I will choose a random sample of 200 students.
#Random sample of 200 units. Needed for accurate hypothesis testing later on.
set.seed(1)
mydata <- mydata[sample(nrow(mydata), 200), ]
library(psych)
mydata$GenderF <- factor(mydata$Gender,
levels = c("male", "female"),
labels = c(0, 1))
head(mydata)
## ID Gender MathScore ReadingScore WritingScore GenderF
## 836 835 female 60 64 74 1
## 679 678 male 81 75 78 0
## 129 128 male 82 82 74 0
## 930 929 female 48 56 51 1
## 509 508 male 79 78 77 0
## 471 470 female 83 85 90 1
library(rstatix)
## Warning: package 'rstatix' was built under R version 4.3.2
##
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
##
## filter
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.2
## Warning: package 'ggplot2' was built under R version 4.3.2
## Warning: package 'lubridate' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ ggplot2::%+%() masks psych::%+%()
## ✖ ggplot2::alpha() masks psych::alpha()
## ✖ dplyr::filter() masks rstatix::filter(), stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
#Wide format of data is changed to long format.
mydata_long <- mydata %>%
pivot_longer(
cols = c("MathScore", "ReadingScore", "WritingScore"),
names_to = "Exams",
values_to = "Points") %>%
convert_as_factor(Exams)
mydata_long <- as.data.frame(mydata_long)
tail(mydata_long)
## ID Gender GenderF Exams Points
## 595 586 female 1 MathScore 55
## 596 586 female 1 ReadingScore 73
## 597 586 female 1 WritingScore 73
## 598 15 female 1 MathScore 69
## 599 15 female 1 ReadingScore 75
## 600 15 female 1 WritingScore 78
library(ggpubr)
## Warning: package 'ggpubr' was built under R version 4.3.2
#Boxplot for each variable.
ggboxplot(mydata_long,
x = "Exams",
y = "Points",
add = "jitter")
library(tidyverse)
library(ggpubr)
library(rstatix)
#Finding outliers.
mydata_long %>%
group_by(Exams) %>%
identify_outliers(`Points`)
## # A tibble: 5 × 7
## Exams ID Gender GenderF Points is.outlier is.extreme
## <fct> <int> <chr> <fct> <int> <lgl> <lgl>
## 1 MathScore 842 female 1 23 TRUE FALSE
## 2 ReadingScore 596 male 0 24 TRUE FALSE
## 3 ReadingScore 76 male 0 26 TRUE FALSE
## 4 WritingScore 596 male 0 15 TRUE FALSE
## 5 WritingScore 76 male 0 22 TRUE FALSE
#Removing outliers.
mydata_long <- mydata_long %>%
filter(!ID == (842))
mydata_long <- mydata_long %>%
filter(!ID == (76))
mydata_long <- mydata_long %>%
filter(!ID == (596))
#Checking if there are anymore outliers
mydata_long %>%
group_by(Exams) %>%
identify_outliers(`Points`)
## # A tibble: 1 × 7
## Exams ID Gender GenderF Points is.outlier is.extreme
## <fct> <int> <chr> <fct> <int> <lgl> <lgl>
## 1 MathScore 338 female 1 24 TRUE FALSE
mydata_long <- mydata_long %>%
filter(!ID == (338))
#Checking if there are anymore outliers #2
mydata_long %>%
group_by(Exams) %>%
identify_outliers(`Points`)
## # A tibble: 1 × 7
## Exams ID Gender GenderF Points is.outlier is.extreme
## <fct> <int> <chr> <fct> <int> <lgl> <lgl>
## 1 MathScore 466 female 1 26 TRUE FALSE
mydata_long <- mydata_long %>%
filter(!ID == (466))
#Checking if there are anymore outliers #3
mydata_long %>%
group_by(Exams) %>%
identify_outliers(`Points`)
## [1] Exams ID Gender GenderF Points is.outlier is.extreme
## <0 rows> (or 0-length row.names)
Assumptions to perform a parametric test
(rANOVA):
- All variables are numeric.
- Normality: The variables in the population are normally distributed
(Shapiro-Wilk test).
- Sphericity: the variances of the differences between the examines
variables are equal (Mauchly test for sphericity).
- No significant outliers.
library(rstatix)
#Checking normality with Shapiro-Wilk test
mydata_long %>%
group_by(Exams) %>%
shapiro_test(Points)
## # A tibble: 3 × 4
## Exams variable statistic p
## <fct> <chr> <dbl> <dbl>
## 1 MathScore Points 0.994 0.587
## 2 ReadingScore Points 0.992 0.378
## 3 WritingScore Points 0.990 0.183
Variable Points needs to be normally distributed within each of the
stated Exams.
Example of MathScore:
H0: Variable Points in normally distributed within
MathScore.
H1: Variable Points in NOT normally distributed within
MathScore.
H0 of WritingScore is rejected at p<0.05, therefore normality is violated and I will use Friedman ANOVA.
I will still first show the parametric test - rANOVA
H0: Mean of all the variables (MathScore, ReadingScore and WritingScore) are the same. H1: At least one mean out of those 3 is different.
library(rstatix)
#rANOVA
ANOVA_results <- anova_test(dv = Points, #Dependent variable
wid = ID, #Subject identifier
within = Exams, #Within-subject factor variable
data = mydata_long)
ANOVA_results #Summary of results.
## ANOVA Table (type III tests)
##
## $ANOVA
## Effect DFn DFd F p p<.05 ges
## 1 Exams 2 388 18.285 2.58e-08 * 0.009
##
## $`Mauchly's Test for Sphericity`
## Effect W p p<.05
## 1 Exams 0.54 1.4e-26 *
##
## $`Sphericity Corrections`
## Effect GGe DF[GG] p[GG] p[GG]<.05 HFe DF[HF] p[HF]
## 1 Exams 0.685 1.37, 265.69 2.18e-06 * 0.688 1.38, 266.94 2.08e-06
## p[HF]<.05
## 1 *
get_anova_table(ANOVA_results, correction = "auto")
## ANOVA Table (type III tests)
##
## Effect DFn DFd F p p<.05 ges
## 1 Exams 1.37 265.69 18.285 2.18e-06 * 0.009
I reject H0 at p<0.05.
After the show of the descriptive statistics, I will show the more accurate test my case - Friedman ANOVA.
library(rstatix)
#Descriptive statistics for each variable.
mydata_long %>%
group_by(Exams) %>%
get_summary_stats(Points, type = "common")
## # A tibble: 3 × 11
## Exams variable n min max median iqr mean sd se ci
## <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 MathScore Points 195 29 99 66 19.5 65.8 14.6 1.05 2.07
## 2 ReadingScore Points 195 31 100 69 21 69.0 14.1 1.01 2.00
## 3 WritingScore Points 195 33 100 68 22 68.4 14.6 1.05 2.06
It is more accurate in my case if I compare medians and not means, because normallity was violated, therefore Friedman ANOVA is the more appropriate test (non-parametric test).
library(rstatix)
#Friedman ANOVA.
FriedmanANOVA <- friedman_test(Points ~ Exams | ID,
data = mydata_long)
#Results
FriedmanANOVA
## # A tibble: 1 × 6
## .y. n statistic df p method
## * <chr> <int> <dbl> <dbl> <dbl> <chr>
## 1 Points 195 22.8 2 0.0000114 Friedman test
H0: Location distribution of all 3 variables are the
same.
H1: At least 1 location distribution is different.
I reject H0 at p<0.001.
library(effectsize)
##
## Attaching package: 'effectsize'
## The following objects are masked from 'package:rstatix':
##
## cohens_d, eta_squared
## The following object is masked from 'package:psych':
##
## phi
effectsize::kendalls_w(Points ~ Exams | ID,
data = mydata_long)
## Warning: 32 block(s) contain ties.
## Kendall's W | 95% CI
## --------------------------
## 0.06 | [0.02, 1.00]
##
## - One-sided CIs: upper bound fixed at [1.00].
interpret_kendalls_w(0.05)
## [1] "slight agreement"
## (Rules: landis1977)
There is slight agrement in effect size, which just confirms my hyposthesis testing where I rejected H0.
library(rstatix)
#Wilcoxon signed rank tests - comparing all possible paires.
paires_nonpar <- wilcox_test(Points ~ Exams,
paired = TRUE,
p.adjust.method = "bonferroni",
data = mydata_long)
paires_nonpar
## # A tibble: 3 × 9
## .y. group1 group2 n1 n2 statistic p p.adj p.adj.signif
## * <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <chr>
## 1 Points MathScore Readin… 195 195 5126. 1.87e-6 5.61e-6 ****
## 2 Points MathScore Writin… 195 195 6073 2.47e-4 7.41e-4 ***
## 3 Points ReadingScore Writin… 195 195 9519 6.8 e-2 2.05e-1 ns
All the p values are statistically significant. No location distributions in all 3 combinations are the same.
library(rstatix)
comparisons <- paires_nonpar %>%
add_y_position(fun = "median", step.increase = 0.35)
library(ggpubr)
ggboxplot(mydata_long, x = "Exams", y = "Points", add = "point", ylim=c(0, 18)) +
stat_pvalue_manual(comparisons, hide.ns = FALSE) +
stat_summary(fun = median, geom = "point", shape = 16, size = 4,
aes(group = Exams), color = "darkred",
position = position_dodge(width = 0.8)) +
stat_summary(fun = median, colour = "darkred",
position = position_dodge(width = 0.8),
geom = "text", vjust = 0.5, hjust = -8,
aes(label = round(after_stat(y), digits = 2), group = Exams)) +
labs(subtitle = get_test_label(FriedmanANOVA, detailed = TRUE),
caption = get_pwc_label(comparisons))
Conclusion:
Based on the sample data, I found that there are differences between
math, reading and writing exams (p<0.001). The effect size is 0,05,
which means there is slight agreement in the statement that there is no
difference between those 3 exams.