Research Question:
Is there an association between gender and holding a scholarship?
mydata <- read.table("./dataset 2.hw.csv", header=TRUE, sep=",")
mydata$ID <- seq_len(nrow(mydata)) # Create an ID variable in case it is needed later
head(mydata)
## Gender Scholarship.holder ID
## 1 1 0 1
## 2 1 0 2
## 3 1 0 3
## 4 0 0 4
## 5 0 0 5
## 6 1 0 6
Unit of observation: one student
Description of data:
The data set was taken from Kaggle.com (Predict students’ dropout and academic success) and contains 4424 students. I will draw a random sample of 500 students, since working with all 4424 units is more than this analysis needs.
#Random sample of 500 units.
set.seed(1)
mydata <- mydata[sample(nrow(mydata), 500), ]
head(mydata)
## Gender Scholarship.holder ID
## 1017 0 1 1017
## 2177 0 0 2177
## 1533 0 0 1533
## 2347 1 0 2347
## 270 0 1 270
## 4050 0 1 4050
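The role of set.seed() in the sampling step can be illustrated with a minimal sketch. It draws row indices only and does not need the data file; the names n_total, n_sample, first_draw and second_draw are hypothetical and used only for this illustration.

```r
# Hypothetical illustration: draw row indices only, no data file needed.
n_total  <- 4424  # number of students in the full data set
n_sample <- 500   # size of the random sample

set.seed(1)
first_draw <- sample(n_total, n_sample)

set.seed(1)
second_draw <- sample(n_total, n_sample)

# Resetting the seed reproduces exactly the same selection of rows.
identical(first_draw, second_draw)

# sample() draws without replacement by default, so no row is picked twice.
anyDuplicated(first_draw)  # 0 means no duplicates
```

Without the seed, every run would select a different set of 500 students and all results below would change slightly.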
#Creating factors
mydata$GenderF <- factor(mydata$Gender,
levels = c(0, 1),
labels = c("Male", "Female"))
mydata$Scholarship.holderF <- factor(mydata$Scholarship.holder,
levels = c(0, 1),
labels = c("NO", "YES"))
head(mydata, 4)
## Gender Scholarship.holder ID GenderF Scholarship.holderF
## 1017 0 1 1017 Male YES
## 2177 0 0 2177 Male NO
## 1533 0 0 1533 Male NO
## 2347 1 0 2347 Female NO
Assumptions:
- Observations must be independent.
- All expected frequencies must be greater than 5 (that’s what we
said in class with Denis).
- In larger contingency tables (at least one categorical variable has
more than two categories), up to 20% of the expected frequencies can be
between 1 and 5, but this will reduce the power of the test.
If conditions 2 and 3 are not met, or if any of the expected frequencies is less than 1, only Fisher’s Exact Probability Test of Independence (a nonparametric test) should be used.
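The first assumption is met, because each student appears in the data only once and falls into exactly one category of each variable: a student is either male or female (the same as in the class example with cats, where either “Love” or “Food” was taken as the approach).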
I will check the second assumption later, once I have the results of the Pearson Chi2 test and can verify that all expected frequencies are greater than 5.
The third assumption does not apply here, because neither of my two categorical variables has more than two categories.
H0: There is no association between the two categorical variables.
H1: There is an association between the two categorical variables.
results <- chisq.test(mydata$GenderF, mydata$Scholarship.holderF,
correct = TRUE)
results
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: mydata$GenderF and mydata$Scholarship.holderF
## X-squared = 17.238, df = 1, p-value = 3.297e-05
I reject H0 (p<0,001). I conclude that there is an association between the two categorical variables.
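As a rough check, the p-value can be recomputed from the test statistic and the degrees of freedom printed in the output. This is only a sketch: it uses the rounded statistic 17.238, so the result agrees with the printed p-value only approximately, and the names x_squared and p_value are hypothetical.

```r
# Statistic and degrees of freedom as printed in the test output (rounded).
x_squared <- 17.238
df <- 1

# Upper-tail probability of the chi-squared distribution.
p_value <- pchisq(x_squared, df = df, lower.tail = FALSE)
p_value

# The p-value is far below the usual 0,05 threshold.
p_value < 0.001
```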
addmargins(results$observed)
## mydata$Scholarship.holderF
## mydata$GenderF NO YES Sum
## Male 226 107 333
## Female 143 24 167
## Sum 369 131 500
round(results$expected, 2)
## mydata$Scholarship.holderF
## mydata$GenderF NO YES
## Male 245.75 87.25
## Female 123.25 43.75
All expected frequencies are larger than 5, so the second assumption is met.
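The expected frequencies can also be computed by hand as (row total × column total) / n. The sketch below re-enters the observed counts from the table printed above instead of reading the data file; the object names observed and expected are hypothetical.

```r
# Observed counts copied from the printed contingency table.
observed <- matrix(c(226, 107,
                     143,  24),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Gender = c("Male", "Female"),
                                   Scholarship = c("NO", "YES")))

# Expected count in each cell = row total * column total / grand total.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
round(expected, 2)

# Check of the second assumption: every expected frequency above 5.
all(expected > 5)
```

For example, the expected count for (Male, NO) is 333 × 369 / 500 = 245,75.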
round(results$residuals, 2) # Pearson residuals
## mydata$Scholarship.holderF
## mydata$GenderF NO YES
## Male -1.26 2.11
## Female 1.78 -2.99
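The residuals printed above are Pearson residuals, (observed − expected) / sqrt(expected). A minimal sketch of that calculation, again with the observed counts re-entered by hand and with hypothetical object names, also shows how the Yates-corrected statistic is built from the same quantities.

```r
# Observed counts copied from the printed contingency table.
observed <- matrix(c(226, 107,
                     143,  24),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Gender = c("Male", "Female"),
                                   Scholarship = c("NO", "YES")))
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson residual per cell: positive means more cases than expected under H0.
pearson_res <- (observed - expected) / sqrt(expected)
round(pearson_res, 2)

# Chi-squared statistic with Yates' continuity correction for a 2x2 table.
x_squared_yates <- sum((abs(observed - expected) - 0.5)^2 / expected)
round(x_squared_yates, 3)
```

The largest residual in absolute value belongs to (Female, YES), so that cell deviates most from what H0 would predict.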
addmargins(round(prop.table(results$observed), 3))
## mydata$Scholarship.holderF
## mydata$GenderF NO YES Sum
## Male 0.452 0.214 0.666
## Female 0.286 0.048 0.334
## Sum 0.738 0.262 1.000
Explanation of the number 0,214 (Male, YES): Out of all 500 students, 21,4% were male and were awarded the scholarship.
addmargins(round(prop.table(results$observed, 1), 3), 2)
## mydata$Scholarship.holderF
## mydata$GenderF NO YES Sum
## Male 0.679 0.321 1.000
## Female 0.856 0.144 1.000
Explanation of the number 0,321 (Male, YES):
Out of all male students, 32,1% were awarded the scholarship.
addmargins(round(prop.table(results$observed, 2), 3), 1)
## mydata$Scholarship.holderF
## mydata$GenderF NO YES
## Male 0.612 0.817
## Female 0.388 0.183
## Sum 1.000 1.000
Explanation of the number 0,817 (Male, YES):
Out of all the students who were awarded the scholarship, 81,7% were
male.
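The three proportions explained above use the same count (107 male scholarship holders) but different denominators. A short sketch with the counts taken from the observed table makes the difference explicit; the variable names are hypothetical.

```r
# Counts taken from the observed contingency table.
male_yes      <- 107  # male students with a scholarship
n_total       <- 500  # all students in the sample
n_male        <- 333  # all male students
n_scholarship <- 131  # all scholarship holders

proportions <- c(
  overall        = male_yes / n_total,       # share of the whole sample
  within_males   = male_yes / n_male,        # row proportion
  within_holders = male_yes / n_scholarship  # column proportion
)
round(proportions, 3)
```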
library(effectsize)
effectsize::cramers_v(mydata$GenderF, mydata$Scholarship.holderF)
## Cramer's V (adj.) | 95% CI
## --------------------------------
## 0.19 | [0.11, 1.00]
##
## - One-sided CIs: upper bound fixed at [1.00].
interpret_cramers_v(0.19)
## [1] "small"
## (Rules: funder2019)
The effect size (Cramér’s V) is 0,19, which is considered small.
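For a rough cross-check, the unadjusted Cramér’s V can be computed by hand as sqrt(chi2 / (n × (min(rows, columns) − 1))), using the chi-squared statistic without the continuity correction. This is a sketch under that assumption: effectsize reports an adjusted version ("adj." in the output), so the two values need not match exactly, only after rounding here. Object names are hypothetical.

```r
# Observed counts copied from the printed contingency table.
observed <- matrix(c(226, 107,
                     143,  24),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Gender = c("Male", "Female"),
                                   Scholarship = c("NO", "YES")))

# Chi-squared statistic without Yates' continuity correction.
chi2 <- unname(chisq.test(observed, correct = FALSE)$statistic)

# Unadjusted Cramer's V; for a 2x2 table the denominator is simply n.
n <- sum(observed)
cramers_v_unadj <- sqrt(chi2 / (n * (min(dim(observed)) - 1)))
round(cramers_v_unadj, 2)
```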
Conclusion:
Based on the sample data, I found that there is an association between
gender and scholarships being awarded (p<0,001). Even though the
effect size is small (Cramér’s V=0,19), males in the sample were more
likely to get a scholarship than females (32,1% compared with 14,4%).
Because all assumptions were met, the Pearson Chi2 test was the most appropriate test to perform, but I will also show the nonparametric alternative (Fisher’s exact probability test).
H0: Odds ratio is equal to 1.
H1: Odds ratio is not equal to 1.
fisher.test(mydata$GenderF, mydata$Scholarship.holderF)
##
## Fisher's Exact Test for Count Data
##
## data: mydata$GenderF and mydata$Scholarship.holderF
## p-value = 1.427e-05
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.2077130 0.5886365
## sample estimates:
## odds ratio
## 0.3551537
I reject H0 (p<0,001).
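To see what the odds ratio in the output refers to, the sample odds ratio can be computed directly from the observed counts. This is only an approximate check: fisher.test() reports a conditional maximum likelihood estimate, which differs slightly from the simple sample odds ratio computed here. Variable names are hypothetical.

```r
# Odds of holding a scholarship (YES vs. NO) within each gender group,
# using the counts from the observed contingency table.
odds_male   <- 107 / 226
odds_female <-  24 / 143

# Sample odds ratio: odds for females relative to odds for males.
sample_or <- odds_female / odds_male
round(sample_or, 3)

# A value below 1 means lower odds of a scholarship for females than for males.
sample_or < 1
```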
Conclusion (Fisher’s exact probability test):
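Based on the sample data, I can conclude that gender and being a
scholarship holder are associated among students (p<0,001). The odds of
holding a scholarship for females are about 0,36 times the odds for
males (95% CI: 0,21 to 0,59).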