Descriptive statistics

mydata <- read.table("~/Documents/Šola/IMB/2. semester/Multivariate analysis/HW1/StudentsPerformance.csv", header=TRUE, sep=",")

head(mydata)

##   gender race.ethnicity parental.level.of.education        lunch
## 1 female        group B           bachelor's degree     standard
## 2 female        group C                some college     standard
## 3 female        group B             master's degree     standard
## 4   male        group A          associate's degree free/reduced
## 5   male        group C                some college     standard
## 6 female        group B          associate's degree     standard
##   test.preparation.course math.score reading.score writing.score
## 1                    none         72            72            74
## 2               completed         69            90            88
## 3                    none         90            95            93
## 4                    none         47            57            44
## 5                    none         76            78            75
## 6                    none         71            83            78

# Rename all eight columns in one assignment
colnames(mydata) <- c("Gender", "Race/ethnicity", "Parental level of education",
                      "Lunch", "Test preparation course",
                      "Math score", "Reading score", "Writing score")

head(mydata)

##   Gender Race/ethnicity Parental level of education        Lunch
## 1 female        group B           bachelor's degree     standard
## 2 female        group C                some college     standard
## 3 female        group B             master's degree     standard
## 4   male        group A          associate's degree free/reduced
## 5   male        group C                some college     standard
## 6 female        group B          associate's degree     standard
##   Test preparation course Math score Reading score Writing score
## 1                    none         72            72            74
## 2               completed         69            90            88
## 3                    none         90            95            93
## 4                    none         47            57            44
## 5                    none         76            78            75
## 6                    none         71            83            78

Unit of observation: individual student
Sample size: 1000 students
Definition of variables:

Gender: female, male
Race/ethnicity: student’s racial or ethnic group (groups A, B, C, D, E)
Parental level of education: highest level of education completed by the student’s parents (some high school, high school, some college, associate’s degree, bachelor’s degree, master’s degree)
Lunch: type of lunch the student receives (standard or free/reduced)
Test preparation course: whether the student completed a test preparation course (completed or none)
Math score: number representing the student’s score in mathematics (0-100)
Reading score: number representing the student’s score in reading (0-100)
Writing score: number representing the student’s score in writing (0-100)

Source: Seshapanpu, J. (2018, November 9). Students performance in exams. Kaggle. https://www.kaggle.com/datasets/spscientist/students-performance-in-exams/data.

anyNA(mydata)

## [1] FALSE

There is no missing data in my data frame.
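
anyNA() only answers yes or no for the whole data frame. When it returns TRUE, a per-column count is a natural follow-up. A minimal sketch on a small made-up data frame (the `toy` object and its values are illustrative and not taken from the data set):

```r
# Hypothetical toy data frame with two missing values (not from StudentsPerformance.csv)
toy <- data.frame(
  score = c(72, NA, 90),
  lunch = c("standard", "standard", NA)
)

anyNA(toy)                             # TRUE: at least one value is missing
na_per_column <- colSums(is.na(toy))   # number of missing values in each column
na_per_column
```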

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

mydata <- mydata %>%
  mutate(
    Gender = as.factor(Gender),
    `Race/ethnicity` = as.factor(`Race/ethnicity`),
    `Parental level of education` = as.factor(`Parental level of education`),
    Lunch = as.factor(Lunch),
    `Test preparation course` = as.factor(`Test preparation course`)
  )

summary(mydata)

##     Gender    Race/ethnicity     Parental level of education          Lunch    
##  female:518   group A: 89    associate's degree:222          free/reduced:355  
##  male  :482   group B:190    bachelor's degree :118          standard    :645  
##               group C:319    high school       :196                            
##               group D:262    master's degree   : 59                            
##               group E:140    some college      :226                            
##                              some high school  :179                            
##  Test preparation course   Math score     Reading score    Writing score   
##  completed:358           Min.   :  0.00   Min.   : 17.00   Min.   : 10.00  
##  none     :642           1st Qu.: 57.00   1st Qu.: 59.00   1st Qu.: 57.75  
##                          Median : 66.00   Median : 70.00   Median : 69.00  
##                          Mean   : 66.09   Mean   : 69.17   Mean   : 68.05  
##                          3rd Qu.: 77.00   3rd Qu.: 79.00   3rd Qu.: 79.00  
##                          Max.   :100.00   Max.   :100.00   Max.   :100.00

The average math score of students in the data frame is 66.09.
The median reading score is 70.00, meaning that half of the students scored 70.00 or lower, while the other half scored 70.00 or higher.
The minimum writing score is 10.00, meaning that the student with the lowest writing score in this data frame scored 10.00 points.

RQ1

My research question: can we claim that the average math score and the average reading score differ?

mydata$Diference <- mydata$`Math score` - mydata$`Reading score`

library(ggplot2)
ggplot(mydata, aes(x = Diference)) +
  geom_histogram(binwidth = 3, colour = "coral3", fill = "salmon") +
  ylab("Frequency") +
  xlab("Difference (math - reading)")

From the histogram it looks like the differences are normally distributed. I will check this formally with the Shapiro-Wilk test.

shapiro.test(mydata$Diference)

## 
##  Shapiro-Wilk normality test
## 
## data:  mydata$Diference
## W = 0.99577, p-value = 0.007536

H0: Differences are normally distributed.
H1: Differences are not normally distributed.

We can reject H0 at p = 0.008: the differences are not normally distributed.

Parametric test

t.test(mydata$`Math score`, mydata$`Reading score`, 
       paired = TRUE, 
       alternative = "two.sided")

## 
##  Paired t-test
## 
## data:  mydata$`Math score` and mydata$`Reading score`
## t = -10.816, df = 999, p-value < 2.2e-16
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  -3.638791 -2.521209
## sample estimates:
## mean difference 
##           -3.08

H0: μd = 0, i.e. μ(math) = μ(reading)
H1: μd ≠ 0, i.e. μ(math) ≠ μ(reading)

We can reject H0 at p < 0.001.

The mean difference (math minus reading) is -3.08 points, so on average students score higher in reading than in math.

library(effectsize)
effectsize::cohens_d(mydata$Diference)

## Cohen's d |         95% CI
## --------------------------
## -0.34     | [-0.41, -0.28]

interpret_cohens_d(0.34, rules = "sawilowsky2009")

## [1] "small"
## (Rules: sawilowsky2009)

We can reject the null hypothesis at p < 0.001: the average math score differs from the average reading score among students. The mean difference is -3.08, so students score higher in reading than in math. The effect size is small (Cohen's d = -0.34).
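
To make the link between the paired t statistic and Cohen's d explicit, both can be computed by hand from the differences. The sketch below uses simulated scores (made up for illustration, not the study data) and assumes the one-sample form d = mean(differences) / sd(differences), which is consistent with the values reported above.

```r
# Simulated paired scores; the values are made up for illustration only
set.seed(1)
math    <- round(rnorm(200, mean = 66, sd = 15))
reading <- round(math + rnorm(200, mean = 3, sd = 9))

d <- math - reading                       # paired differences
n <- length(d)

t_manual <- mean(d) / (sd(d) / sqrt(n))   # paired t statistic
d_manual <- mean(d) / sd(d)               # Cohen's d for paired differences

# Built-in paired t-test for comparison
t_builtin <- unname(t.test(math, reading, paired = TRUE)$statistic)

c(t_manual = t_manual, t_builtin = t_builtin, cohens_d = d_manual)

# The two quantities are linked by d = t / sqrt(n). With the values
# reported above, -10.816 / sqrt(1000) is about -0.34.
d_from_report <- -10.816 / sqrt(1000)
d_from_report
```

Because d equals t divided by the square root of n, a very large t statistic in a sample of 1000 can still correspond to a small effect.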

Non-parametric test

wilcox.test(mydata$`Math score`,  mydata$`Reading score`,
            paired = TRUE,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided")

## 
##  Wilcoxon signed rank test
## 
## data:  mydata$`Math score` and mydata$`Reading score`
## V = 140794, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0

H0: Distribution location of reading score is equal to the distribution location of math score.
H1: Distribution location of reading score is not equal to the distribution location of math score.

We can reject H0 at p < 0.001.

library(effectsize)
effectsize(wilcox.test(mydata$`Math score`, mydata$`Reading score`,
                       paired = TRUE,
                       correct = FALSE,
                       exact = FALSE,
                       alternative = "two.sided"))

## r (rank biserial) |         95% CI
## ----------------------------------
## -0.38             | [-0.44, -0.32]

interpret_rank_biserial(0.38)

## [1] "large"
## (Rules: funder2019)

Using the sample data, we find that the location of math scores differs from that of reading scores among the students (p < 0.001). Scores are higher for reading: the median reading score is 70, while the median math score is 66. The effect size is large (rank-biserial r = -0.38).
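
The rank-biserial correlation for paired data can be reproduced from the signed ranks. This is a minimal base-R sketch on simulated scores (made up for illustration), assuming the matched-pairs definition r = (R+ - R-) / (R+ + R-), where R+ and R- are the rank sums of the positive and negative differences.

```r
# Simulated paired scores; the values are made up for illustration only
set.seed(2)
math    <- round(rnorm(200, mean = 66, sd = 15))
reading <- round(math + rnorm(200, mean = 3, sd = 9))

d <- math - reading
d <- d[d != 0]                  # zero differences carry no sign and are dropped
r <- rank(abs(d))               # ranks of absolute differences (ties get midranks)

r_plus  <- sum(r[d > 0])        # rank sum where math > reading
r_minus <- sum(r[d < 0])        # rank sum where reading > math

rank_biserial <- (r_plus - r_minus) / (r_plus + r_minus)

# The V statistic printed by wilcox.test() is the positive rank sum
V <- unname(wilcox.test(math, reading, paired = TRUE,
                        exact = FALSE, correct = FALSE)$statistic)

c(V = V, r_plus = r_plus, rank_biserial = rank_biserial)
```

A negative value means the negative differences (reading higher than math) dominate the ranks.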

For the paired-samples t-test the assumptions are:

the variable is numeric
the differences are normally distributed in the population

In my data frame the variable is numeric, but the Shapiro-Wilk test showed that normality of the differences is violated. Therefore the non-parametric test is more suitable. All in all, I conclude that math and reading scores differ among the students, with reading scores being higher.

RQ2

My research question: is there a correlation between the reading and writing scores?

library(GGally)

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

ggpairs(mydata, columns = c(7,8))

There seems to be a linear relationship between reading and writing scores, which is positive and very strong (r = 0.955). We will test this with a correlation matrix.

Pearson correlation matrix

library(Hmisc)

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:dplyr':
## 
##     src, summarize

## The following objects are masked from 'package:base':
## 
##     format.pval, units

# Drop the first six columns; Reading score, Writing score and Diference remain
rcorr(as.matrix(mydata[ , -(1:6)]), 
      type = "pearson")

##               Reading score Writing score Diference
## Reading score          1.00          0.95     -0.24
## Writing score          0.95          1.00     -0.20
## Diference             -0.24         -0.20      1.00
## 
## n= 1000 
## 
## 
## P
##               Reading score Writing score Diference
## Reading score                0             0       
## Writing score  0                           0       
## Diference      0             0

H0: ρ(reading, writing) = 0
H1: ρ(reading, writing) ≠ 0

We reject H0 at p < 0.001. There is a linear relationship between reading and writing scores, which is positive and very strong (r = 0.95).
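
The correlation matrix prints p-values rounded to 0. For a single pair of variables, base R's cor.test() returns the unrounded p-value together with a confidence interval. A minimal sketch on simulated scores (made up for illustration, not the study data):

```r
# Simulated reading and writing scores; values are made up for illustration only
set.seed(3)
reading <- rnorm(100, mean = 69, sd = 14)
writing <- reading + rnorm(100, mean = -1, sd = 4)

res <- cor.test(reading, writing, method = "pearson")

res$estimate   # sample correlation r
res$p.value    # unrounded p-value for H0: rho = 0
res$conf.int   # 95% confidence interval for rho
```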

Spearman correlation matrix

library(Hmisc)
# Same three columns as above, now with rank-based (Spearman) correlation
rcorr(as.matrix(mydata[ , -(1:6)]), 
      type = "spearman")

##               Reading score Writing score Diference
## Reading score          1.00          0.95     -0.25
## Writing score          0.95          1.00     -0.22
## Diference             -0.25         -0.22      1.00
## 
## n= 1000 
## 
## 
## P
##               Reading score Writing score Diference
## Reading score                0             0       
## Writing score  0                           0       
## Diference      0             0

H0: ρs(reading, writing) = 0
H1: ρs(reading, writing) ≠ 0

We reject H0 at p < 0.001. There is a monotonic relationship between reading and writing scores, which is positive and very strong (ρs = 0.95).

I used both Pearson and Spearman correlation matrices, because I am not completely sure whether reading and writing scores are numerical or ordinal data. There is no information on whether it takes the same amount of study time to move from 50 to 60 points as from 90 to 100 points. If it does, the scores are numerical and the Pearson correlation is the better choice. If it does not, the scores are ordinal and the Spearman correlation is the better choice. In our case both coefficients are practically the same (0.95), so the choice does not change the conclusion.
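
The practical difference between the two coefficients shows up when a relationship is monotonic but not linear. Spearman's coefficient is Pearson's coefficient applied to the ranks, so it only uses the ordering of the values. A minimal sketch on simulated data (the variables `x` and `y` are hypothetical and unrelated to the data set):

```r
# Simulated data: y increases with x, but not linearly
set.seed(4)
x <- runif(100, min = 0, max = 100)
y <- exp(x / 20)

pearson  <- cor(x, y, method = "pearson")
spearman <- cor(x, y, method = "spearman")
on_ranks <- cor(rank(x), rank(y))   # Pearson correlation of the ranks

c(pearson = pearson, spearman = spearman, on_ranks = on_ranks)
```

Here Spearman's coefficient reaches 1 because the ordering is preserved perfectly, while Pearson's coefficient stays below 1 because the points do not lie on a straight line.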

RQ3

Is there a significant association between completion of a test preparation course and the type of lunch a student receives?

results <- chisq.test(mydata$`Test preparation course`, mydata$Lunch, 
                      correct = TRUE)

results

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  mydata$`Test preparation course` and mydata$Lunch
## X-squared = 0.22095, df = 1, p-value = 0.6383

addmargins(results$observed)

##            mydata$Lunch
##             free/reduced standard  Sum
##   completed          131      227  358
##   none               224      418  642
##   Sum                355      645 1000

round(results$expected, 2)

##            mydata$Lunch
##             free/reduced standard
##   completed       127.09   230.91
##   none            227.91   414.09

round(results$residuals, 2)  # Pearson residuals (full element name instead of partial matching)

##            mydata$Lunch
##             free/reduced standard
##   completed         0.35    -0.26
##   none             -0.26     0.19

The χ² test of association between two categorical variables has 2 assumptions.

The observations are independent of each other, which is true in my case.
All expected frequencies are greater than 5, which is also true in my case. We see this from the expected frequencies table, where all numbers are > 5 (127.09, 230.91, 227.91, 414.09).

Since both assumptions are met, we can continue with the χ² test.

H0: There is no association between completion of a test preparation course and the type of lunch a student receives.
H1: There is an association between completion of a test preparation course and the type of lunch a student receives.

We cannot reject H0 (p = 0.638), therefore we cannot say that there is an association between the two categorical variables.

Explanations:
1. The observed (empirical) frequency for completed preparation course and free/reduced lunch is 131.
2. The expected (theoretical) frequency is 127.09: if there were no association, we would expect 127.09 students in the category completed preparation course & free/reduced lunch. The difference between the two values is small, which is consistent with no association between the two categorical variables.
3. For the combination completed preparation course and free/reduced lunch we cannot draw any conclusion, because the observed frequency does not differ significantly from the expected one (Pearson residual = 0.35, which is below 1.96 in absolute value, so the result is not significant).
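
The expected frequencies and residuals discussed above can be reproduced by hand from the observed table. The sketch below copies the observed counts from the output and uses only base R. It also prints the adjusted standardized residuals (`stdres`), which are the ones that are approximately standard normal under H0 and so the most natural to compare with 1.96.

```r
# Observed counts copied from the contingency table above
observed <- matrix(c(131, 227,
                     224, 418),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(course = c("completed", "none"),
                                   lunch  = c("free/reduced", "standard")))

n <- sum(observed)

# Expected count under independence: row total * column total / n
expected <- outer(rowSums(observed), colSums(observed)) / n

# Pearson residuals, as stored in chisq.test()$residuals
pearson_res <- (observed - expected) / sqrt(expected)

# Adjusted standardized residuals, as stored in chisq.test()$stdres
std_res <- chisq.test(observed)$stdres

round(expected, 2)
round(pearson_res, 2)
round(std_res, 2)
```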

library(effectsize)
effectsize::cramers_v(mydata$`Test preparation course`, mydata$Lunch)

## Cramer's V (adj.) |       95% CI
## --------------------------------
## 0.00              | [0.00, 1.00]
## 
## - One-sided CIs: upper bound fixed at [1.00].

interpret_cramers_v(0.00)

## [1] "tiny"
## (Rules: funder2019)

We cannot say there is an association between completion of a test preparation course and the type of lunch a student receives, with the effect size being tiny (Cramér's V = 0.00).
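
A value of exactly 0.00 can look surprising, so it may help to compute Cramér's V by hand. The sketch below uses the observed counts from the table above and base R only. The bias-corrected formula is an assumption about what "adj." in the output refers to (a small-sample bias correction), not something confirmed by the output itself.

```r
# Observed counts copied from the contingency table above
observed <- matrix(c(131, 227,
                     224, 418), nrow = 2, byrow = TRUE)

n    <- sum(observed)
chi2 <- unname(chisq.test(observed, correct = FALSE)$statistic)  # no Yates correction

phi2 <- chi2 / n
k    <- min(dim(observed))

v_plain <- sqrt(phi2 / (k - 1))     # unadjusted Cramer's V

# Bias-corrected version (assumed to be what "adj." refers to)
r  <- nrow(observed)
cc <- ncol(observed)
phi2_adj <- max(0, phi2 - (r - 1) * (cc - 1) / (n - 1))
r_adj    <- r  - (r  - 1)^2 / (n - 1)
c_adj    <- cc - (cc - 1)^2 / (n - 1)
v_adj    <- sqrt(phi2_adj / min(r_adj - 1, c_adj - 1))

c(v_plain = v_plain, v_adj = v_adj)
```

Under this assumption, the correction subtracts more than the observed phi-squared, so the adjusted value is truncated at zero, while the unadjusted V is small but positive.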

HW1

Manja Modic

2025-01-11
