mydata <- read.table("~/Documents/Šola/IMB/2. semester/Multivariate analysis/HW1/StudentsPerformance.csv", header=TRUE, sep=",")
head(mydata)
## gender race.ethnicity parental.level.of.education lunch
## 1 female group B bachelor's degree standard
## 2 female group C some college standard
## 3 female group B master's degree standard
## 4 male group A associate's degree free/reduced
## 5 male group C some college standard
## 6 female group B associate's degree standard
## test.preparation.course math.score reading.score writing.score
## 1 none 72 72 74
## 2 completed 69 90 88
## 3 none 90 95 93
## 4 none 47 57 44
## 5 none 76 78 75
## 6 none 71 83 78
colnames(mydata) <- c("Gender", "Race/ethnicity", "Parental level of education", "Lunch",
"Test preparation course", "Math score", "Reading score", "Writing score")
head(mydata)
## Gender Race/ethnicity Parental level of education Lunch
## 1 female group B bachelor's degree standard
## 2 female group C some college standard
## 3 female group B master's degree standard
## 4 male group A associate's degree free/reduced
## 5 male group C some college standard
## 6 female group B associate's degree standard
## Test preparation course Math score Reading score Writing score
## 1 none 72 72 74
## 2 completed 69 90 88
## 3 none 90 95 93
## 4 none 47 57 44
## 5 none 76 78 75
## 6 none 71 83 78
Unit of observation: individual student
Sample size: 1000 students
Definition of variables:
Gender: female, male
Race/ethnicity: student’s racial or ethnic group (group A, B, C, D, E)
Parental level of education: highest level of education completed by the student’s parents (some high school, high school, some college, associate’s degree, bachelor’s degree, master’s degree)
Lunch: type of lunch the student receives (standard or free/reduced)
Test preparation course: whether the student completed a test preparation course (completed or none)
Math score: number representing the student’s score in mathematics (0-100)
Reading score: number representing the student’s score in reading (0-100)
Writing score: number representing the student’s score in writing (0-100)
anyNA(mydata)
## [1] FALSE
There is no missing data in my data frame.
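If `anyNA()` had returned TRUE, a per-column count would show where the gaps are. A minimal sketch on a small made-up data frame (`toy` and its values are hypothetical, not taken from the data set):

```r
# Hypothetical toy data frame with two missing values
toy <- data.frame(
  math    = c(72, NA, 90),
  reading = c(72, 90, NA),
  writing = c(74, 88, 93)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
na_counts
```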
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
mydata <- mydata %>%
mutate(
Gender = as.factor(Gender),
`Race/ethnicity` = as.factor(`Race/ethnicity`),
`Parental level of education` = as.factor(`Parental level of education`),
Lunch = as.factor(Lunch),
`Test preparation course` = as.factor(`Test preparation course`)
)
summary(mydata)
## Gender Race/ethnicity Parental level of education Lunch
## female:518 group A: 89 associate's degree:222 free/reduced:355
## male :482 group B:190 bachelor's degree :118 standard :645
## group C:319 high school :196
## group D:262 master's degree : 59
## group E:140 some college :226
## some high school :179
## Test preparation course Math score Reading score Writing score
## completed:358 Min. : 0.00 Min. : 17.00 Min. : 10.00
## none :642 1st Qu.: 57.00 1st Qu.: 59.00 1st Qu.: 57.75
## Median : 66.00 Median : 70.00 Median : 69.00
## Mean : 66.09 Mean : 69.17 Mean : 68.05
## 3rd Qu.: 77.00 3rd Qu.: 79.00 3rd Qu.: 79.00
## Max. :100.00 Max. :100.00 Max. :100.00
The average math score of students in the data frame is 66.09.
The median reading score is 70.00, meaning that half of the students in the data frame scored 70.00 or less, while the other half scored 70.00 or more.
The minimum writing score in the data frame is 10.00 points, meaning that the student with the lowest writing score scored 10.00.
My research question: can we claim that the average score in math and the average score in reading are different?
mydata$Diference <- mydata$`Math score` - mydata$`Reading score`
library(ggplot2)
ggplot(mydata, aes(x = Diference)) +
geom_histogram(binwidth = 3, colour = "coral3", fill = "salmon") +
ylab("Frequency") +
xlab("Difference (math - reading)")
From the histogram, the differences look approximately normally distributed. I will check this formally with the Shapiro-Wilk test.
shapiro.test(mydata$Diference)
##
## Shapiro-Wilk normality test
##
## data: mydata$Diference
## W = 0.99577, p-value = 0.007536
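H0: the differences are normally distributed. H1: the differences are not normally distributed. We can reject H0 at p = 0.008, so the normality assumption is violated.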
t.test(mydata$`Math score`, mydata$`Reading score`,
paired = TRUE,
alternative = "two.sided")
##
## Paired t-test
##
## data: mydata$`Math score` and mydata$`Reading score`
## t = -10.816, df = 999, p-value < 2.2e-16
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## -3.638791 -2.521209
## sample estimates:
## mean difference
## -3.08
We can reject H0 (the mean difference is 0) at p < 0.001.
The mean difference (math minus reading) is -3.08, so on average students scored 3.08 points higher in reading than in math.
library(effectsize)
effectsize::cohens_d(mydata$Diference)
## Cohen's d | 95% CI
## --------------------------
## -0.34 | [-0.41, -0.28]
interpret_cohens_d(0.34, rules = "sawilowsky2009")
## [1] "small"
## (Rules: sawilowsky2009)
We can reject the null hypothesis at p < 0.001: there is a difference between the average score in math and the average score in reading among the students. The mean difference (math minus reading) is -3.08, so students score higher in reading than in math. The effect size is small (Cohen's d = 0.34 in absolute value).
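The reported numbers can be cross-checked by hand. The sketch below uses only values printed in the paired t-test output (t = -10.816, mean difference = -3.08, n = 1000) and the standard relations for a paired design: SE = mean / t, d = t / sqrt(n), and CI = mean ± t_crit × SE. The variable names are my own.

```r
# Values copied from the paired t-test output
t_stat    <- -10.816
mean_diff <- -3.08
n         <- 1000

# Standard error of the mean difference, recovered from t = mean / SE
se <- mean_diff / t_stat

# Cohen's d for paired data: mean difference divided by the SD of the
# differences, which is algebraically equal to t / sqrt(n)
d <- t_stat / sqrt(n)

# 95% confidence interval for the mean difference
ci <- mean_diff + c(-1, 1) * qt(0.975, df = n - 1) * se

round(d, 2)
round(ci, 2)
```

Up to rounding, both results agree with the `t.test()` and `cohens_d()` output.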
wilcox.test(mydata$`Math score`, mydata$`Reading score`,
paired = TRUE,
correct = FALSE,
exact = FALSE,
alternative = "two.sided")
##
## Wilcoxon signed rank test
##
## data: mydata$`Math score` and mydata$`Reading score`
## V = 140794, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
We can reject H0 at p < 0.001.
library(effectsize)
effectsize(wilcox.test(mydata$`Math score`, mydata$`Reading score`,
paired = TRUE,
correct = FALSE,
exact = FALSE,
alternative = "two.sided"))
## r (rank biserial) | 95% CI
## ----------------------------------
## -0.38 | [-0.44, -0.32]
interpret_rank_biserial(0.38)
## [1] "large"
## (Rules: funder2019)
Using the sample data, we find that math and reading scores differ among the students (p < 0.001). Scores are higher for reading: the median reading score is 70.00, while the median math score is 66.00. The effect size is large (rank-biserial r = 0.38 in absolute value).
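To make the rank-biserial effect size less of a black box, here is a minimal base-R sketch of the matched-pairs version on made-up numbers (`x` and `y` are hypothetical, not from the data set). It drops zero differences, as `wilcox.test()` does. How `effectsize` handles zeros and ties may differ, so treat this as an illustration rather than the package's implementation.

```r
# Hypothetical paired scores (not from the data set)
x <- c(5, 7, 3, 9, 6)
y <- c(6, 9, 2, 13, 11)

d <- x - y
d <- d[d != 0]            # zero differences carry no sign and are dropped
r_abs <- rank(abs(d))     # rank absolute differences; ties get average ranks

w_pos <- sum(r_abs[d > 0])   # rank sum of positive differences (the V statistic)
w_neg <- sum(r_abs[d < 0])   # rank sum of negative differences

# Matched-pairs rank-biserial correlation
r_rb <- (w_pos - w_neg) / (w_pos + w_neg)
r_rb
```

A negative value means the negative differences dominate, which is the same direction as in the analysis, where reading scores tend to exceed math scores.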
For paired samples the assumptions are:
variable is numeric
the differences are normally distributed in the population
In my data frame the variable is numeric; however, the Shapiro-Wilk test showed that normality is violated. Therefore the non-parametric test (Wilcoxon signed rank test) is more suitable. All in all, I conclude that there is a difference between math and reading scores among the students.
My research question: is there a correlation between the scores achieved in reading and writing?
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
ggpairs(mydata, columns = c(7,8))
There seems to be a positive and very strong linear relationship between reading and writing scores (r = 0.955). I will test this with a correlation matrix.
library(Hmisc)
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
rcorr(as.matrix(mydata[ , c(-1,-2,-3,-4,-5,-6)]),
type = "pearson")
## Reading score Writing score Diference
## Reading score 1.00 0.95 -0.24
## Writing score 0.95 1.00 -0.20
## Diference -0.24 -0.20 1.00
##
## n= 1000
##
##
## P
## Reading score Writing score Diference
## Reading score 0 0
## Writing score 0 0
## Diference 0 0
There is a linear relationship between reading and writing score, which is positive and very strong (r=0.95). We reject H0 at p < 0.001.
library(Hmisc)
rcorr(as.matrix(mydata[ , c(-1,-2,-3,-4,-5,-6)]),
type = "spearman")
## Reading score Writing score Diference
## Reading score 1.00 0.95 -0.25
## Writing score 0.95 1.00 -0.22
## Diference -0.25 -0.22 1.00
##
## n= 1000
##
##
## P
## Reading score Writing score Diference
## Reading score 0 0
## Writing score 0 0
## Diference 0 0
There is a monotonic relationship between reading and writing scores, which is positive and very strong (Spearman's rho = 0.95). We reject H0 at p < 0.001.
I computed both Pearson and Spearman correlation matrices because I am not completely sure whether reading and writing scores are numerical or ordinal data. There is no information on whether it takes the same amount of study to improve from 50 to 60 points as it does to improve from 90 to 100 points. If it does, the scores are numerical variables and the Pearson correlation is the better choice. If it does not, they are ordinal variables and the Spearman correlation is the better choice. In this case both coefficients are practically the same (0.95), so the choice does not change the conclusion.
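The difference between the two coefficients can be illustrated with the six reading and writing scores printed by `head(mydata)`. Spearman's coefficient is the Pearson correlation computed on ranks, so it only looks at ordering. The sketch below is an illustration on those six rows, not a re-analysis of the full data set.

```r
# Reading and writing scores of the first six students, copied from head(mydata)
reading <- c(72, 90, 95, 57, 78, 83)
writing <- c(74, 88, 93, 44, 75, 78)

pearson  <- cor(reading, writing)                       # linear association
spearman <- cor(reading, writing, method = "spearman")  # monotonic association
on_ranks <- cor(rank(reading), rank(writing))           # Pearson on the ranks

c(pearson = pearson, spearman = spearman, on_ranks = on_ranks)
```

In these six rows the two variables order the students identically, so Spearman's coefficient is exactly 1, while Pearson's stays slightly below 1 because the points do not lie on a straight line.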
Is there a significant association between completion of a test preparation course and the type of lunch a student receives?
results <- chisq.test(mydata$`Test preparation course`, mydata$Lunch,
correct = TRUE)
results
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: mydata$`Test preparation course` and mydata$Lunch
## X-squared = 0.22095, df = 1, p-value = 0.6383
addmargins(results$observed)
## mydata$Lunch
## free/reduced standard Sum
## completed 131 227 358
## none 224 418 642
## Sum 355 645 1000
round(results$expected, 2)
## mydata$Lunch
## free/reduced standard
## completed 127.09 230.91
## none 227.91 414.09
round(results$residuals, 2)
## mydata$Lunch
## free/reduced standard
## completed 0.35 -0.26
## none -0.26 0.19
The χ² test of association between two categorical variables has two assumptions:
The observations are independent of each other, which is true in my case.
All expected frequencies are greater than 5, which is also true in my case. We see this from the expected frequencies table, where all numbers are > 5 (127.09, 230.91, 227.91, 414.09).
Since both assumptions are met, we can continue with the χ² test.
We cannot reject H0 (p = 0.638), therefore we cannot say that there is an association between the two categorical variables.
Explanations:
1. The observed (empirical) frequency for completed preparation course and free/reduced lunch is 131.
2. The expected (theoretical) frequency is 127.09: if there were no association, we would expect 127.09 students in the category completed preparation course & free/reduced lunch. The difference between the two values is small, which is consistent with no association between the two categorical variables.
3. For the combination completed preparation course and free/reduced lunch we cannot draw any conclusion, because the observed and expected frequencies do not differ significantly (std. residual = 0.35 < 1.96).
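The expected frequencies, residuals and test statistic can be reproduced by hand from the observed counts. The sketch below assumes only the 2×2 table of observed counts from the `chisq.test()` output. The object names are my own, and the formulas are the textbook ones (expected = row total × column total / n, Yates' correction subtracts 0.5 from each absolute deviation).

```r
# Observed counts copied from the contingency table (filled column by column)
obs <- matrix(c(131, 224, 227, 418), nrow = 2,
              dimnames = list(course = c("completed", "none"),
                              lunch  = c("free/reduced", "standard")))
n <- sum(obs)

# Expected counts under independence: row total * column total / n
expected <- outer(rowSums(obs), colSums(obs)) / n

# Pearson residuals: (observed - expected) / sqrt(expected)
pearson_res <- (obs - expected) / sqrt(expected)

# Chi-squared statistic with Yates' continuity correction, and its p-value
chi_yates <- sum((abs(obs - expected) - 0.5)^2 / expected)
p_value   <- pchisq(chi_yates, df = 1, lower.tail = FALSE)

round(expected, 2)
round(pearson_res, 2)
round(c(X_squared = chi_yates, p = p_value), 4)
```

Up to rounding, the results match the expected frequencies, residuals, X-squared and p-value reported by `chisq.test()`.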
library(effectsize)
effectsize::cramers_v(mydata$`Test preparation course`, mydata$Lunch)
## Cramer's V (adj.) | 95% CI
## --------------------------------
## 0.00 | [0.00, 1.00]
##
## - One-sided CIs: upper bound fixed at [1.00].
interpret_cramers_v(0.00)
## [1] "tiny"
## (Rules: funder2019)
We cannot say there is an association between completion of a test preparation course and the type of lunch a student receives, with the effect size being tiny (Cramér's V = 0.00).
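A value of exactly 0.00 may look surprising, since the observed and expected counts are not identical. The sketch below is a hand calculation under the assumption that the "(adj.)" label refers to the usual small-sample bias correction attributed to Bergsma (2013); it is not the package's own code. The plain Cramér's V is small but positive, and the correction truncates it to zero.

```r
# Observed counts copied from the contingency table (filled column by column)
obs <- matrix(c(131, 224, 227, 418), nrow = 2)
n <- sum(obs)
r <- nrow(obs)
k <- ncol(obs)

# Chi-squared statistic without Yates' continuity correction
chi2 <- unname(chisq.test(obs, correct = FALSE)$statistic)

# Plain Cramér's V
v_plain <- sqrt(chi2 / (n * min(r - 1, k - 1)))

# Bias-corrected Cramér's V (assumed to be what "(adj.)" reports)
phi2_adj <- max(0, chi2 / n - (r - 1) * (k - 1) / (n - 1))
r_adj    <- r - (r - 1)^2 / (n - 1)
k_adj    <- k - (k - 1)^2 / (n - 1)
v_adj    <- sqrt(phi2_adj / min(r_adj - 1, k_adj - 1))

round(c(plain = v_plain, adjusted = v_adj), 3)
```

Either way the effect is negligible, which supports the conclusion of no detectable association.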