Research question 1
Is there an association between gender and completion of the test
preparation course?
mydata <- read.table("./StudentsPerformance.csv", header = TRUE, sep = ",")
head(mydata)
##   gender race.ethnicity parental.level.of.education        lunch test.preparation.course math.score reading.score
## 1 female        group B           bachelor's degree     standard                    none         72            72
## 2 female        group C                some college     standard               completed         69            90
## 3 female        group B             master's degree     standard                    none         90            95
## 4   male        group A          associate's degree free/reduced                    none         47            57
## 5   male        group C                some college     standard                    none         76            78
## 6 female        group B          associate's degree     standard                    none         71            83
##   writing.score
## 1            74
## 2            88
## 3            93
## 4            44
## 5            75
## 6            78
# Keep only the two variables needed for this research question.
mydata <- mydata[, c("gender", "test.preparation.course")]
head(mydata, 10)
## gender test.preparation.course
## 1 female none
## 2 female completed
## 3 female none
## 4 male none
## 5 male none
## 6 female none
## 7 female completed
## 8 male none
## 9 male completed
## 10 female none
Unit of observation: one student
I found this data set on Kaggle.com under the title Students Performance on Exams.
The data set contains 1000 units, from which I randomly selected a sample of 200 units, a sample size similar to those we used at university.
set.seed(1)
mydata <- mydata[sample(nrow(mydata), 200), ]
head(mydata)
## gender test.preparation.course
## 836 female completed
## 679 male none
## 129 male none
## 930 female none
## 509 male none
## 471 female completed
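Setting the seed makes the random selection reproducible: rerunning the same two lines always picks the same 200 rows. A minimal sketch of that idea, assuming a plain count of 1000 row numbers instead of the real data frame (the names `n_rows`, `first_draw` and `second_draw` are only illustrative):

```r
# Illustration only: draw 200 of 1000 row indices twice with the same seed.
n_rows <- 1000                      # number of units in the full data set

set.seed(1)
first_draw <- sample(n_rows, 200)

set.seed(1)
second_draw <- sample(n_rows, 200)

# Same seed and same call: the two draws are identical.
identical(first_draw, second_draw)

# sample() draws without replacement by default, so no row is picked twice.
anyDuplicated(first_draw) == 0
```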
mydata$ID <- seq(1, nrow(mydata))
head(mydata, 3)
## gender test.preparation.course ID
## 836 female completed 1
## 679 male none 2
## 129 male none 3
mydata$GenderF <- factor(mydata$gender,
levels = c("male", "female"),
labels = c("male", "female"))
mydata$TestF <- factor(mydata$test.preparation.course,
levels = c("none", "completed"),
labels = c("none", "completed"))
head(mydata, 3)
## gender test.preparation.course ID GenderF TestF
## 836 female completed 1 female completed
## 679 male none 2 male none
## 129 male none 3 male none
Now I am going to check whether the assumptions of the test are met.
The first assumption holds, since the observations are independent: each student appears only once in the data and belongs to exactly one category of each variable. A student cannot be male and female at the same time, and cannot both complete and not complete the preparation course.
Next, I conduct Pearson's chi-squared test and afterwards check the second assumption using the expected frequencies.
H0: There is no association between the two categorical variables.
H1: There is an association between the two categorical variables.
results <- chisq.test(mydata$GenderF, mydata$TestF,
correct = TRUE)
results
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: mydata$GenderF and mydata$TestF
## X-squared = 6.9975e-31, df = 1, p-value = 1
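We cannot reject H0 (p = 1). The sample therefore gives no evidence of an association between the two categorical variables, which is not the same as proof that no association exists. A p-value of exactly 1 is possible but unusual; it arises here because the Yates continuity correction shrinks the test statistic to practically zero when the observed and expected frequencies are almost identical. I therefore also check whether the second assumption holds.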
addmargins(results$observed)
## mydata$TestF
## mydata$GenderF none completed Sum
## male 58 31 89
## female 72 39 111
## Sum 130 70 200
round(results$expected, 2)
## mydata$TestF
## mydata$GenderF none completed
## male 57.85 31.15
## female 72.15 38.85
All of the expected frequencies are larger than 5, which means that the second assumption is met and Pearson's chi-squared test is appropriate.
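Both the check of the second assumption and the reason for the p-value of exactly 1 can be reproduced from the observed table alone. The sketch below retypes the observed frequencies from the output above into a matrix (the object names `observed`, `with_yates` and `without_yates` are my own) and compares the test with and without the continuity correction. In R's `chisq.test()`, the correction subtracts at most 0.5 from each |observed - expected|, and never more than the difference itself. Here every difference is only 0.15, so the corrected statistic collapses to zero.

```r
# Observed frequencies retyped from the contingency table above.
observed <- matrix(c(58, 72, 31, 39), nrow = 2,
                   dimnames = list(gender = c("male", "female"),
                                   course = c("none", "completed")))

with_yates    <- chisq.test(observed, correct = TRUE)
without_yates <- chisq.test(observed, correct = FALSE)

# Second assumption: every expected frequency should be larger than 5.
all(with_yates$expected > 5)

# Each observed count differs from its expected count by only 0.15.
abs(observed - with_yates$expected)

# The corrected statistic is practically 0 (p = 1); the uncorrected one is
# small but not zero.
c(corrected   = unname(with_yates$statistic),
  uncorrected = unname(without_yates$statistic))
c(corrected   = with_yates$p.value,
  uncorrected = without_yates$p.value)
```

The uncorrected p-value (about 0.96) agrees with the chi.square p-value that `oddsratio()` reports further down, so both versions of the test lead to the same decision.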
round(results$residuals, 2)
## mydata$TestF
## mydata$GenderF none completed
## male 0.02 -0.03
## female -0.02 0.02
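(Male, Completed) -0.03:
The observed number of males in the sample who completed the test
preparation course is slightly lower than expected under H0, but the
residual is far from the critical values of ±1.96, so the difference is
not statistically significant (α = 0.05).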
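Whether a single cell deviates significantly from what H0 predicts is usually judged by comparing its residual with ±1.96 (for α = 0.05). A minimal sketch, assuming the observed table is retyped as a matrix; it uses the `stdres` component of `chisq.test()`, which holds standardized residuals, alongside the Pearson residuals printed above:

```r
# Observed frequencies retyped from the contingency table above.
observed <- matrix(c(58, 72, 31, 39), nrow = 2,
                   dimnames = list(gender = c("male", "female"),
                                   course = c("none", "completed")))

fit <- chisq.test(observed, correct = FALSE)

# Pearson residuals: (observed - expected) / sqrt(expected).
round(fit$residuals, 2)

# Standardized residuals, approximately standard normal under H0.
round(fit$stdres, 2)

# Is any cell beyond the critical value for alpha = 0.05?
any(abs(fit$stdres) > 1.96)
```

No cell comes close to the critical value, which matches the non-significant overall test.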
addmargins(round(prop.table(results$observed), 3))
## mydata$TestF
## mydata$GenderF none completed Sum
## male 0.290 0.155 0.445
## female 0.360 0.195 0.555
## Sum 0.650 0.350 1.000
(Male, Completed) 0.155:
Out of all 200 students, 15.5% were male and completed the test
preparation course.
addmargins(round(prop.table(results$observed, 1), 3), 2)
## mydata$TestF
## mydata$GenderF none completed Sum
## male 0.652 0.348 1.000
## female 0.649 0.351 1.000
(Male, Completed) 0.348:
Out of all males, 34.8% completed the test preparation course.
addmargins(round(prop.table(results$observed, 2), 3), 1)
## mydata$TestF
## mydata$GenderF none completed
## male 0.446 0.443
## female 0.554 0.557
## Sum 1.000 1.000
(Male, Completed) 0.443:
Out of all students who completed the test preparation course, 44.3%
were male.
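Next, I show some descriptive statistics for the sample. For the categorical variables (marked with *), describe() computes the statistics on the numeric category codes, so only n is directly interpretable.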
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:DescTools':
##
## AUC, ICC, SD
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
describe(mydata)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## gender* 1 200 1.45 0.50 1.0 1.43 0.00 1 2 1 0.22 -1.96 0.04
## test.preparation.course* 2 200 1.65 0.48 2.0 1.69 0.00 1 2 1 -0.62 -1.62 0.03
## ID 3 200 100.50 57.88 100.5 100.50 74.13 1 200 199 0.00 -1.22 4.09
## GenderF* 4 200 1.55 0.50 2.0 1.57 0.00 1 2 1 -0.22 -1.96 0.04
## TestF* 5 200 1.35 0.48 1.0 1.31 0.00 1 2 1 0.62 -1.62 0.03
#(effectsize)
effectsize::cramers_v(mydata$GenderF, mydata$TestF)
## Cramer's V (adj.) | 95% CI
## --------------------------------
## 0.00 | [0.00, 1.00]
##
## - One-sided CIs: upper bound fixed at [1.00].
The effect size is 0.00, which is classified as tiny. This is consistent with the result of the chi-squared test.
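For reference, the unadjusted Cramér's V can be computed directly from the uncorrected chi-squared statistic as V = sqrt(χ² / (n · min(r - 1, c - 1))). The sketch below does this in base R for the observed table retyped from the output above. It is an illustration only, not the bias-adjusted estimate that `effectsize::cramers_v()` printed, so the value is slightly above zero rather than exactly 0.00.

```r
# Observed frequencies retyped from the contingency table above.
observed <- matrix(c(58, 72, 31, 39), nrow = 2,
                   dimnames = list(gender = c("male", "female"),
                                   course = c("none", "completed")))

# Uncorrected Pearson chi-squared statistic.
chi2 <- unname(chisq.test(observed, correct = FALSE)$statistic)

n <- sum(observed)              # sample size (200)
k <- min(dim(observed)) - 1     # min(rows - 1, columns - 1), here 1

cramers_v <- sqrt(chi2 / (n * k))
round(cramers_v, 3)
```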
#install.packages("epitools")
library(epitools)
oddsratio(mydata$GenderF, mydata$TestF)
## $data
## Outcome
## Predictor none completed Total
## male 58 31 89
## female 72 39 111
## Total 130 70 200
##
## $measure
## odds ratio with 95% C.I.
## Predictor estimate lower upper
## male 1.000000 NA NA
## female 1.012871 0.5634129 1.828644
##
## $p.value
## two-sided
## Predictor midp.exact fisher.exact chi.square
## male NA NA NA
## female 0.9659982 1 0.9643094
##
## $correction
## [1] FALSE
##
## attr(,"method")
## [1] "median-unbiased estimate & mid-p exact CI"
The odds ratio between gender and completion of the test preparation course is 1.01: the odds of completing the course for females are 1.01 times the odds for males. The 95% confidence interval (0.56 to 1.83) includes 1, so this difference is not statistically significant.
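The odds ratio can be checked by hand from the 2×2 table. The sketch below computes the plain cross-product odds ratio and a Wald 95% confidence interval on the log scale. This is a simpler method than the median-unbiased estimate with mid-p exact interval that `oddsratio()` uses by default, so the numbers are close to, but not identical with, the output above.

```r
# Observed frequencies retyped from the $data table above.
observed <- matrix(c(58, 72, 31, 39), nrow = 2,
                   dimnames = list(gender = c("male", "female"),
                                   course = c("none", "completed")))

# Odds of completing the course within each gender.
odds_male   <- observed["male", "completed"]   / observed["male", "none"]
odds_female <- observed["female", "completed"] / observed["female", "none"]

# Odds ratio with males as the reference group.
odds_ratio <- odds_female / odds_male

# Wald interval: log(OR) +/- z * sqrt(sum of reciprocal cell counts).
se_log_or <- sqrt(sum(1 / observed))
wald_ci   <- exp(log(odds_ratio) + c(-1, 1) * qnorm(0.975) * se_log_or)

round(c(estimate = odds_ratio, lower = wald_ci[1], upper = wald_ci[2]), 3)
```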
Conclusion
Based on the sample data, I found no statistically significant association
between gender and completion of the test preparation course (p = 1).
The effect size is tiny, and the odds of completing the course are
practically equal for females and males (odds ratio = 1.01).
Since all the assumptions were met, Pearson's chi-squared test is appropriate.
Research question 2
Is there a linear correlation between reading and writing test
scores?
mydata1 <- read.table("./StudentsPerformance.csv", header = TRUE, sep = ",")
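head(mydata1)
##   gender race.ethnicity parental.level.of.education        lunch test.preparation.course math.score reading.score
## 1 female        group B           bachelor's degree     standard                    none         72            72
## 2 female        group C                some college     standard               completed         69            90
## 3 female        group B             master's degree     standard                    none         90            95
## 4   male        group A          associate's degree free/reduced                    none         47            57
## 5   male        group C                some college     standard                    none         76            78
## 6 female        group B          associate's degree     standard                    none         71            83
##   writing.score
## 1            74
## 2            88
## 3            93
## 4            44
## 5            75
## 6            78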
# Keep only the two variables needed for this research question.
mydata1 <- mydata1[, c("reading.score", "writing.score")]
# No subsample is drawn for this question: mydata1 keeps all 1000 students.
head(mydata1)
## reading.score writing.score
## 1 72 74
## 2 90 88
## 3 95 93
## 4 57 44
## 5 78 75
## 6 83 78
Unit of observation: one student
I found this data set on Kaggle.com under the title Students Performance on Exams.
The data set contains 1000 units, and all of them are used for this research question (n = 1000 in the outputs below).
describe(mydata1)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## reading.score 1 1000 69.17 14.6 70 69.50 14.83 17 100 83 -0.26 -0.08 0.46
## writing.score 2 1000 68.05 15.2 69 68.41 16.31 10 100 90 -0.29 -0.05 0.48
From these descriptive statistics we can see that the main statistics of the two variables, such as the mean, the median and the maximum value, are very close to each other.
#install.packages("car")
library(psych)
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
## The following object is masked from 'package:DescTools':
##
## Recode
scatterplotMatrix(mydata1, smooth=FALSE)
Based on the scatterplot, there appears to be a positive and very strong linear correlation between reading score and writing score.
#install.packages("Hmisc")
library(Hmisc)
##
## Attaching package: 'Hmisc'
## The following object is masked from 'package:psych':
##
## describe
## The following objects are masked from 'package:DescTools':
##
## %nin%, Label, Mean, Quantile
## The following objects are masked from 'package:base':
##
## format.pval, units
rcorr(as.matrix(mydata1),
type = "pearson")
## reading.score writing.score
## reading.score 1.00 0.95
## writing.score 0.95 1.00
##
## n= 1000
##
##
## P
## reading.score writing.score
## reading.score 0
## writing.score 0
The coefficient of 0.95 means that the linear relationship between reading.score and writing.score is positive and strong.
As an additional check, we can also plot the relationship with ggplot2.
#install.packages("ggplot2")
library(ggplot2)
ggplot(mydata1, aes(x = reading.score, y = writing.score)) +
geom_point()
The ggplot2 scatterplot shows the same positive relationship between the two variables.
cor(mydata1$reading.score, mydata1$writing.score,
method = "pearson",
use = "complete.obs")
## [1] 0.9545981
The coefficient is the same as before (r = 0.95). The linear correlation between writing.score and reading.score is strong and positive.
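Pearson's r is the covariance of the two variables divided by the product of their standard deviations. A minimal sketch of that formula, using only the six rows printed by `head(mydata1)` above as illustrative input (so the resulting value is not the coefficient for the full data set):

```r
# The six reading/writing pairs shown by head(mydata1); illustration only.
reading <- c(72, 90, 95, 57, 78, 83)
writing <- c(74, 88, 93, 44, 75, 78)

# Pearson's r from its definition.
r_manual <- cov(reading, writing) / (sd(reading) * sd(writing))

# The built-in function gives the same value.
r_builtin <- cor(reading, writing, method = "pearson")

all.equal(r_manual, r_builtin)
```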
Finally, let us also test the hypothesis.
H0: The correlation is equal to zero.
H1: The correlation is not equal to zero.
cor.test(mydata1$reading.score, mydata1$writing.score,
         method = "pearson")
##
## Pearson's product-moment correlation
##
## data: mydata1$reading.score and mydata1$writing.score
## t = 101.23, df = 998, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9487506 0.9597921
## sample estimates:
## cor
## 0.9545981
Based on the test, we can reject H0 (p < 0.001).
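The test statistic and the confidence interval reported by `cor.test()` can be reproduced from r and n alone: t = r · sqrt(n - 2) / sqrt(1 - r²) with n - 2 degrees of freedom, and the interval comes from Fisher's z-transformation. A short sketch using the values from the output above (the variable names are my own):

```r
r <- 0.9545981   # sample correlation from the output above
n <- 1000        # number of students

# t statistic and two-sided p-value with n - 2 degrees of freedom.
t_stat  <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_value <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

# 95% confidence interval via Fisher's z-transformation.
z  <- atanh(r)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) / sqrt(n - 3))

round(t_stat, 2)
round(ci, 4)
p_value < 0.001
```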
Conclusion
Based on the tests conducted on the sample data, I can conclude that
there is a linear correlation between reading score and writing score
(p < 0.001). The correlation is strong and positive (r = 0.95).