Luka Černila

Two categorical variables

Research question 1
Is there any relationship between gender and test preparation course?

mydata <- read.table("./StudentsPerformance.csv", header = TRUE, sep = ",")

head(mydata)
##   gender race.ethnicity parental.level.of.education        lunch test.preparation.course math.score reading.score
## 1 female        group B           bachelor's degree     standard                    none         72            72
## 2 female        group C                some college     standard               completed         69            90
## 3 female        group B             master's degree     standard                    none         90            95
## 4   male        group A          associate's degree free/reduced                    none         47            57
## 5   male        group C                some college     standard                    none         76            78
## 6 female        group B          associate's degree     standard                    none         71            83
##   writing.score
## 1            74
## 2            88
## 3            93
## 4            44
## 5            75
## 6            78
mydata$race.ethnicity <- NULL

mydata$parental.level.of.education <- NULL

mydata$lunch <- NULL

mydata$math.score <- NULL

mydata$writing.score <- NULL

mydata$reading.score <- NULL
head(mydata, 10)
##    gender test.preparation.course
## 1  female                    none
## 2  female               completed
## 3  female                    none
## 4    male                    none
## 5    male                    none
## 6  female                    none
## 7  female               completed
## 8    male                    none
## 9    male               completed
## 10 female                    none

Unit of observation: one student

  • gender: male or female
  • test preparation course: the student either completed it or did not take it (none)

I found this dataset on Kaggle.com under the title Students Performance on Exams.

The dataset contains 1000 units, and I decided to randomly select 200 of them, since we used a similar sample size at university.

set.seed(1)
mydata <- mydata[sample(nrow(mydata), 200), ]
head(mydata)
##     gender test.preparation.course
## 836 female               completed
## 679   male                    none
## 129   male                    none
## 930 female                    none
## 509   male                    none
## 471 female               completed
mydata$ID <- seq(1, nrow(mydata))

head(mydata, 3)
##     gender test.preparation.course ID
## 836 female               completed  1
## 679   male                    none  2
## 129   male                    none  3
mydata$GenderF <- factor(mydata$gender, 
                                levels = c("male", "female"), 
                                labels = c("male", "female"))

mydata$TestF <- factor(mydata$test.preparation.course, 
                                levels = c("none", "completed"), 
                                labels = c("none", "completed"))
   
head(mydata, 3)
##     gender test.preparation.course ID GenderF     TestF
## 836 female               completed  1  female completed
## 679   male                    none  2    male      none
## 129   male                    none  3    male      none

Now I am going to check whether the assumptions of the test hold.

The first one holds, since the observations are independent: each student appears in the data only once and belongs to exactly one category of each variable. A student cannot be both male and female, nor both complete and not complete the preparation course.

Now I will conduct Pearson's chi-squared test and afterwards check the second assumption using the expected frequencies.

Pearson Chi2 test

H0: There is no association between the two categorical variables.
H1: There is an association between the two categorical variables.

results <- chisq.test(mydata$GenderF, mydata$TestF, 
                      correct = TRUE)

results
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  mydata$GenderF and mydata$TestF
## X-squared = 6.9975e-31, df = 1, p-value = 1

We cannot reject H0 (p = 1). This means the sample gives no evidence of an association between the two categorical variables; it does not prove that there is none. Since a p-value of exactly 1 is possible but not very likely, I also need to check whether the second assumption holds.
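The p-value of exactly 1 is probably a side effect of the continuity correction rather than a sign of a perfect fit. A minimal sketch, assuming the observed counts from the contingency table in this report are typed in by hand (the matrix name `obs` and the result names are illustrative, not part of the original analysis), compares the test with and without Yates' correction:

```r
# Observed counts from the 200-student sample (rows: gender, columns: course)
obs <- matrix(c(58, 31,
                72, 39),
              nrow = 2, byrow = TRUE,
              dimnames = list(gender = c("male", "female"),
                              course = c("none", "completed")))

# With Yates' continuity correction (the default for 2x2 tables)
res_yates <- chisq.test(obs, correct = TRUE)

# Without the correction
res_plain <- chisq.test(obs, correct = FALSE)

print(c(statistic = unname(res_yates$statistic), p = res_yates$p.value))
print(c(statistic = unname(res_plain$statistic), p = res_plain$p.value))
```

R subtracts min(0.5, |O - E|) from every absolute difference. Here each |O - E| is 0.15, so the corrected statistic collapses to about zero and the p-value to 1. The uncorrected test gives a p-value of roughly 0.96, which leads to the same decision about H0.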

addmargins(results$observed)
##               mydata$TestF
## mydata$GenderF none completed Sum
##         male     58        31  89
##         female   72        39 111
##         Sum     130        70 200
round(results$expected, 2)
##               mydata$TestF
## mydata$GenderF  none completed
##         male   57.85     31.15
##         female 72.15     38.85

We can see that all of the expected frequencies are larger than 5, which means that the second assumption is met and Pearson's chi-squared test is appropriate.
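The expected frequencies can also be reproduced by hand: under H0 each cell equals (row total × column total) / n. A short sketch, assuming the observed counts are entered manually (`obs` and `expected` are illustrative names):

```r
# Observed counts from the 200-student sample
obs <- matrix(c(58, 31,
                72, 39),
              nrow = 2, byrow = TRUE,
              dimnames = list(gender = c("male", "female"),
                              course = c("none", "completed")))

n <- sum(obs)

# Expected count per cell under independence: row total * column total / n
expected <- outer(rowSums(obs), colSums(obs)) / n

print(round(expected, 2))

# Rule-of-thumb check used in this report: every expected count above 5
print(all(expected > 5))
```

For example, the (male, none) cell is 89 × 130 / 200 = 57.85, matching the output of `results$expected`.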

round(results$res, 2)
##               mydata$TestF
## mydata$GenderF  none completed
##         male    0.02     -0.03
##         female -0.02      0.02

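(Male, Completed) -0.03:
The actual number of males in our sample who completed the test preparation course is slightly lower than expected under H0, but the residual is far below 1.96 in absolute value, so the difference is not statistically significant (alpha = 0.05).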
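The residuals printed above are Pearson residuals, (O - E) / sqrt(E). A minimal sketch that recomputes them from the observed counts (variable names are illustrative; using 1.96 as the cut-off is a common rule of thumb and is treated here as an assumption):

```r
# Observed counts from the 200-student sample
obs <- matrix(c(58, 31,
                72, 39),
              nrow = 2, byrow = TRUE,
              dimnames = list(gender = c("male", "female"),
                              course = c("none", "completed")))

# Expected counts under independence
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Pearson residual for each cell
pearson_res <- (obs - expected) / sqrt(expected)

print(round(pearson_res, 2))

# Is any cell beyond the approximate 5% cut-off?
print(any(abs(pearson_res) > 1.96))
```

All four residuals are close to zero, so no single cell deviates noticeably from what independence would predict.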

addmargins(round(prop.table(results$observed), 3))
##               mydata$TestF
## mydata$GenderF  none completed   Sum
##         male   0.290     0.155 0.445
##         female 0.360     0.195 0.555
##         Sum    0.650     0.350 1.000

(Male, Completed) 0.155:
Out of 200 students, 15.5% were male and completed the test preparation course.

addmargins(round(prop.table(results$observed, 1), 3), 2) 
##               mydata$TestF
## mydata$GenderF  none completed   Sum
##         male   0.652     0.348 1.000
##         female 0.649     0.351 1.000

(Male, Completed) 0.348:
Out of all males, 34.8% completed the test preparation course.

addmargins(round(prop.table(results$observed, 2), 3), 1)
##               mydata$TestF
## mydata$GenderF  none completed
##         male   0.446     0.443
##         female 0.554     0.557
##         Sum    1.000     1.000

(Male, Completed) 0.443:
Out of all students that completed the test preparation course, 44.3% were male.

Now I will show some descriptive statistics for my sample.

library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:DescTools':
## 
##     AUC, ICC, SD
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
describe(mydata)
##                          vars   n   mean    sd median trimmed   mad min max range  skew kurtosis   se
## gender*                     1 200   1.45  0.50    1.0    1.43  0.00   1   2     1  0.22    -1.96 0.04
## test.preparation.course*    2 200   1.65  0.48    2.0    1.69  0.00   1   2     1 -0.62    -1.62 0.03
## ID                          3 200 100.50 57.88  100.5  100.50 74.13   1 200   199  0.00    -1.22 4.09
## GenderF*                    4 200   1.55  0.50    2.0    1.57  0.00   1   2     1 -0.22    -1.96 0.04
## TestF*                      5 200   1.35  0.48    1.0    1.31  0.00   1   2     1  0.62    -1.62 0.03
#(effectsize)
effectsize::cramers_v(mydata$GenderF, mydata$TestF)
## Cramer's V (adj.) |       95% CI
## --------------------------------
## 0.00              | [0.00, 1.00]
## 
## - One-sided CIs: upper bound fixed at [1.00].

The effect size is 0, which is classified as tiny. That is consistent with our earlier findings.
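For a 2x2 table, the unadjusted Cramér's V is sqrt(chi-squared / (n × (k - 1))), where k is the smaller of the number of rows and columns. A base-R sketch under the assumption that the uncorrected chi-squared statistic is used (the names `obs`, `chi2`, and `v` are illustrative):

```r
# Observed counts from the 200-student sample
obs <- matrix(c(58, 31,
                72, 39),
              nrow = 2, byrow = TRUE,
              dimnames = list(gender = c("male", "female"),
                              course = c("none", "completed")))

# Uncorrected Pearson chi-squared statistic
chi2 <- unname(chisq.test(obs, correct = FALSE)$statistic)

n <- sum(obs)
k <- min(nrow(obs), ncol(obs))

# Unadjusted Cramer's V
v <- sqrt(chi2 / (n * (k - 1)))
print(round(v, 3))
```

The unadjusted value is slightly above zero. The `effectsize` output above is labelled "(adj.)", meaning a bias adjustment is applied, which is presumably why it reports exactly 0.00.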

#install.packages("epitools")
library(epitools)
oddsratio(mydata$GenderF, mydata$TestF)
## $data
##          Outcome
## Predictor none completed Total
##    male     58        31    89
##    female   72        39   111
##    Total   130        70   200
## 
## $measure
##          odds ratio with 95% C.I.
## Predictor estimate     lower    upper
##    male   1.000000        NA       NA
##    female 1.012871 0.5634129 1.828644
## 
## $p.value
##          two-sided
## Predictor midp.exact fisher.exact chi.square
##    male           NA           NA         NA
##    female  0.9659982            1  0.9643094
## 
## $correction
## [1] FALSE
## 
## attr(,"method")
## [1] "median-unbiased estimate & mid-p exact CI"

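The odds ratio between gender and completion of the test preparation course is 1.01. The odds of completing the course are 1.01 times as high for females as for males; since the 95% confidence interval (0.56 to 1.83) includes 1, this difference is not statistically significant.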
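The odds ratio can be checked by hand from the counts: odds are completed / none within each gender, and the ratio compares females to males. A small sketch (the variable names are illustrative):

```r
# Observed counts from the 200-student sample
obs <- matrix(c(58, 31,
                72, 39),
              nrow = 2, byrow = TRUE,
              dimnames = list(gender = c("male", "female"),
                              course = c("none", "completed")))

# Odds of completing the course within each gender
odds_male   <- obs["male", "completed"]   / obs["male", "none"]
odds_female <- obs["female", "completed"] / obs["female", "none"]

# Simple cross-product odds ratio, females relative to males
or_female_vs_male <- odds_female / odds_male

print(round(c(odds_male = odds_male,
              odds_female = odds_female,
              odds_ratio = or_female_vs_male), 4))
```

This simple ratio (about 1.013) is very close to, but not identical with, the 1.0129 reported by `oddsratio()`, because that function's output states it uses a median-unbiased estimate.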

Conclusion
Based on the sample data, I found no evidence of an association between gender and completion of the test preparation course (p = 1). The effect size is tiny, and although females in the sample have slightly higher odds of completing the course (OR = 1.01), the difference is negligible.

Since all the assumptions were met, Pearson's chi-squared test is appropriate.

Two numerical variables

Research question 2
Is there any linear correlation between reading and writing test scores?

mydata1 <- read.table("./StudentsPerformance.csv", header = TRUE, sep = ",")

head(mydata)
##     gender test.preparation.course ID GenderF     TestF
## 836 female               completed  1  female completed
## 679   male                    none  2    male      none
## 129   male                    none  3    male      none
## 930 female                    none  4  female      none
## 509   male                    none  5    male      none
## 471 female               completed  6  female completed
mydata1$race.ethnicity <- NULL

mydata1$parental.level.of.education <- NULL

mydata1$lunch <- NULL

mydata1$math.score <- NULL

mydata1$test.preparation.course <- NULL

mydata1$gender <- NULL
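set.seed(1)
# NOTE: this resamples mydata (the 200-row sample from research question 1), not mydata1,
# so mydata1 keeps all 1000 rows.
mydata <- mydata[sample(nrow(mydata), 200), ]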
head(mydata1)
##   reading.score writing.score
## 1            72            74
## 2            90            88
## 3            95            93
## 4            57            44
## 5            78            75
## 6            83            78

Unit of observation: one student

  • reading.score: student’s result on a reading comprehension exam in points up to 100
  • writing.score: student’s result on a writing comprehension exam in points up to 100

I have found this dataset on Kaggle.com with the title Students Performance on Exams.

The dataset contains 1000 units. Unlike in the first research question, the results below are based on all 1000 units (n = 1000 in the outputs), because the sampling step was applied to mydata rather than mydata1.
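For a sample to be drawn from the score data, the sampling step has to index and assign the same data frame. A minimal sketch using a simulated stand-in (`scores_demo` is hypothetical and filled with made-up values, not the real CSV):

```r
set.seed(1)

# Hypothetical stand-in for mydata1: 1000 rows of made-up scores
scores_demo <- data.frame(
  reading.score = sample(0:100, 1000, replace = TRUE),
  writing.score = sample(0:100, 1000, replace = TRUE)
)

# Draw 200 rows without replacement from the SAME data frame that is indexed
sampled <- scores_demo[sample(nrow(scores_demo), 200), ]

print(nrow(sampled))
print(anyDuplicated(rownames(sampled)))  # 0 means no row was drawn twice
```

Applying the same pattern to `mydata1` would change the outputs in this section, which are currently based on n = 1000.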

describe(mydata1)
##               vars    n  mean   sd median trimmed   mad min max range  skew kurtosis   se
## reading.score    1 1000 69.17 14.6     70   69.50 14.83  17 100    83 -0.26    -0.08 0.46
## writing.score    2 1000 68.05 15.2     69   68.41 16.31  10 100    90 -0.29    -0.05 0.48

From these descriptive statistics we can see that the main statistics, such as the mean, median, and maximum, are very close for the two variables.

#install.packages("car")
library(psych)
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:psych':
## 
##     logit
## The following object is masked from 'package:DescTools':
## 
##     Recode
scatterplotMatrix(mydata1, smooth=FALSE)

Based on the scatterplot, I can assume that there is a positive and very strong linear correlation between reading score and writing score.

#install.packages("Hmisc")
library(Hmisc)
## 
## Attaching package: 'Hmisc'
## The following object is masked from 'package:psych':
## 
##     describe
## The following objects are masked from 'package:DescTools':
## 
##     %nin%, Label, Mean, Quantile
## The following objects are masked from 'package:base':
## 
##     format.pval, units
rcorr(as.matrix(mydata1), 
      type = "pearson")
##               reading.score writing.score
## reading.score          1.00          0.95
## writing.score          0.95          1.00
## 
## n= 1000 
## 
## 
## P
##               reading.score writing.score
## reading.score                0           
## writing.score  0

The coefficient of 0.95 means that the linear relationship between reading.score and writing.score is positive and strong.

Just in case we can also check the relationship with ggplot.

#install.packages("ggplot2")
library(ggplot2)
ggplot(mydata1, aes(x = reading.score, y = writing.score)) +
  geom_point()

The ggplot shows the same positive relationship between the two variables.

cor(mydata1$reading.score, mydata1$writing.score,
    method = "pearson",
    use = "complete.obs")
## [1] 0.9545981

This calculation gave us the result we expected. The linear correlation between writing.score and reading.score is strong and positive.
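Pearson's r is the covariance of the two variables divided by the product of their standard deviations. A small sketch using only the six rows printed by `head(mydata1)` earlier in this section (so the value will differ from the full-data 0.95; the vector names are illustrative):

```r
# First six students from the head(mydata1) output
reading <- c(72, 90, 95, 57, 78, 83)
writing <- c(74, 88, 93, 44, 75, 78)

# Pearson correlation from its definition
r_manual <- cov(reading, writing) / (sd(reading) * sd(writing))

# Built-in computation for comparison
r_builtin <- cor(reading, writing, method = "pearson")

print(r_manual)
print(all.equal(r_manual, r_builtin))
```

Both routes give the same number, which shows that `cor()` with `method = "pearson"` is just this ratio.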

Finally, let's also test the hypothesis.

H0: The correlation is equal to zero
H1: The correlation is not equal to zero

cor.test(mydata1$reading.score, mydata1$writing.score,
         method = "pearson",
         use = "complete.obs")
## 
##  Pearson's product-moment correlation
## 
## data:  mydata1$reading.score and mydata1$writing.score
## t = 101.23, df = 998, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9487506 0.9597921
## sample estimates:
##       cor 
## 0.9545981

Based on this result, we can reject H0 (p < 0.001).
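The test statistic reported by `cor.test()` follows from r and n alone: t = r × sqrt(n - 2) / sqrt(1 - r²), with n - 2 degrees of freedom. A short sketch that plugs in the values from the output above (variable names are illustrative):

```r
# Values taken from the cor.test() output
r <- 0.9545981
n <- 1000

# t statistic for H0: correlation = 0
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_value <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

print(round(t_stat, 2))
print(p_value < 0.001)
```

The result reproduces t = 101.23 with df = 998, and the p-value is far below 0.001.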

Conclusion
Based on the tests conducted on the sample data, I can conclude that there is a linear correlation between reading score and writing score (p < 0.001). This correlation is strong and positive (r = 0.95).