Matej Suhalj

Research Question:
Is there an association between gender and holding a scholarship?

mydata <- read.table("./dataset 2.hw.csv", header=TRUE, sep=",")
mydata$ID <- seq_len(nrow(mydata)) # Create an ID variable in case it is needed later
head(mydata)
##   Gender Scholarship.holder ID
## 1      1                  0  1
## 2      1                  0  2
## 3      1                  0  3
## 4      0                  0  4
## 5      0                  0  5
## 6      1                  0  6

Unit of observation: one student

Description of data:

The data set was taken from Kaggle.com (Predict students’ dropout and academic success) and contains 4424 students. I will draw a random sample of 500 students, since the full data set of 4424 units is larger than needed for this analysis.

#Random sample of 500 units. 
set.seed(1) 
mydata <- mydata[sample(nrow(mydata), 500), ]
head(mydata)
##      Gender Scholarship.holder   ID
## 1017      0                  1 1017
## 2177      0                  0 2177
## 1533      0                  0 1533
## 2347      1                  0 2347
## 270       0                  1  270
## 4050      0                  1 4050
#Creating factors (label coding assumed here: 0 = Male, 1 = Female; verify against the data set's codebook)
mydata$GenderF <- factor(mydata$Gender, 
                                levels = c(0, 1), 
                                labels = c("Male", "Female"))

mydata$Scholarship.holderF <- factor(mydata$Scholarship.holder, 
                                levels = c(0, 1), 
                                labels = c("NO", "YES"))
   
head(mydata, 4)
##      Gender Scholarship.holder   ID GenderF Scholarship.holderF
## 1017      0                  1 1017    Male                 YES
## 2177      0                  0 2177    Male                  NO
## 1533      0                  0 1533    Male                  NO
## 2347      1                  0 2347  Female                  NO

Assumptions:
- Observations must be independent.
- All expected frequencies should be greater than 5 (as discussed in class with Denis).
- In larger contingency tables (at least one categorical variable has more than two categories), up to 20% of the expected frequencies can be between 1 and 5, but this will reduce the power of the test.

If conditions 2 and 3 are not met, or if any of the expected frequencies is less than 1, Fisher’s exact probability test of independence (a nonparametric test) should be used instead.

The first assumption is met because each student appears in exactly one cell of the table: every student is either male or female and either holds a scholarship or does not (the same as in the class example with cats, where either “Love” or “Food” was used as the approach).

I will check the second assumption later, once I have the results of the Pearson Chi2 test and can verify that all expected frequencies are greater than 5.

The third assumption is met, because neither of my two categorical variables has more than two categories.

Pearson Chi2 test

H0: There is no association between the two categorical variables.
H1: There is an association between the two categorical variables.

results <- chisq.test(mydata$GenderF, mydata$Scholarship.holderF, 
                      correct = TRUE)

results
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  mydata$GenderF and mydata$Scholarship.holderF
## X-squared = 17.238, df = 1, p-value = 3.297e-05

I reject H0 (p<0.001). I conclude that there is an association between the two categorical variables.
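As a rough cross-check of the reported statistic, the Yates-corrected value can be recomputed by hand from the observed counts. This is only a sketch: the object names `observed`, `expected`, `chi2_yates` and `p_value` are introduced here for illustration, and the counts are typed in from the contingency table of this report rather than taken from `mydata`.

```r
# Observed counts typed in from the report's contingency table (illustrative object)
observed <- matrix(c(226, 143, 107, 24), nrow = 2,
                   dimnames = list(Gender = c("Male", "Female"),
                                   Scholarship = c("NO", "YES")))

# Expected counts under independence: row total * column total / n
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Yates' correction subtracts 0.5 from each |O - E| before squaring
chi2_yates <- sum((abs(observed - expected) - 0.5)^2 / expected)

# Upper-tail p-value with (2 - 1) * (2 - 1) = 1 degree of freedom
p_value <- pchisq(chi2_yates, df = 1, lower.tail = FALSE)

round(chi2_yates, 3)
p_value
```

The hand-computed statistic should agree with the `X-squared` value printed by `chisq.test()` up to rounding.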

addmargins(results$observed)
##               mydata$Scholarship.holderF
## mydata$GenderF  NO YES Sum
##         Male   226 107 333
##         Female 143  24 167
##         Sum    369 131 500
round(results$expected, 2)
##               mydata$Scholarship.holderF
## mydata$GenderF     NO   YES
##         Male   245.75 87.25
##         Female 123.25 43.75

All expected frequencies are greater than 5 (the smallest is 43.75), so the second assumption is met.
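The check of the expected frequencies can also be automated. The sketch below uses a hypothetical helper, `check_expected()`, which is not part of base R or of any package; it is defined here only to illustrate the rule of thumb from the assumptions list, using the observed counts typed in from this report.

```r
# Hypothetical helper: summarises the expected counts of a contingency table
check_expected <- function(tab) {
  # Expected counts under independence: row total * column total / n
  expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
  list(
    min_expected  = min(expected),       # smallest expected count
    share_below_5 = mean(expected < 5),  # proportion of cells below 5
    all_above_5   = all(expected > 5)    # TRUE if the rule of thumb holds
  )
}

# Observed counts typed in from the report (illustrative object)
tab <- matrix(c(226, 143, 107, 24), nrow = 2,
              dimnames = list(Gender = c("Male", "Female"),
                              Scholarship = c("NO", "YES")))

res <- check_expected(tab)
res
```

If `all_above_5` were `FALSE` for a 2x2 table, Fisher’s exact test would be the safer choice.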

round(results$residuals, 2)
##               mydata$Scholarship.holderF
## mydata$GenderF    NO   YES
##         Male   -1.26  2.11
##         Female  1.78 -2.99
addmargins(round(prop.table(results$observed), 3))
##               mydata$Scholarship.holderF
## mydata$GenderF    NO   YES   Sum
##         Male   0.452 0.214 0.666
##         Female 0.286 0.048 0.334
##         Sum    0.738 0.262 1.000

Explanation of the number 0.214 (Male, YES): Out of all 500 students, 21.4% were male and held a scholarship.

addmargins(round(prop.table(results$observed, 1), 3), 2) 
##               mydata$Scholarship.holderF
## mydata$GenderF    NO   YES   Sum
##         Male   0.679 0.321 1.000
##         Female 0.856 0.144 1.000

Explanation of the number 0.321 (Male, YES):
Out of all the males, 32.1% held a scholarship.

addmargins(round(prop.table(results$observed, 2), 3), 1) 
##               mydata$Scholarship.holderF
## mydata$GenderF    NO   YES
##         Male   0.612 0.817
##         Female 0.388 0.183
##         Sum    1.000 1.000

Explanation of the number 0.817 (Male, YES):
Out of all the students who held a scholarship, 81.7% were males.
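The three proportions explained above differ only in their denominator. A minimal sketch, assuming the observed counts are typed in by hand into an illustrative object `tab` (the names `overall`, `by_row` and `by_col` are also introduced here):

```r
# Observed counts typed in from the report (illustrative object)
tab <- matrix(c(226, 143, 107, 24), nrow = 2,
              dimnames = list(Gender = c("Male", "Female"),
                              Scholarship = c("NO", "YES")))

overall <- prop.table(tab)     # each cell divided by the grand total (500)
by_row  <- prop.table(tab, 1)  # each cell divided by its row total (gender)
by_col  <- prop.table(tab, 2)  # each cell divided by its column total (scholarship)

# The (Male, YES) cell under the three denominators
male_yes <- c(overall = overall["Male", "YES"],  # 107 / 500
              by_row  = by_row["Male", "YES"],   # 107 / 333
              by_col  = by_col["Male", "YES"])   # 107 / 131
round(male_yes, 3)
```

Row proportions answer "what share of each gender holds a scholarship", while column proportions answer "what share of scholarship holders belongs to each gender".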

library(effectsize)
effectsize::cramers_v(mydata$GenderF, mydata$Scholarship.holderF)
## Cramer's V (adj.) |       95% CI
## --------------------------------
## 0.19              | [0.11, 1.00]
## 
## - One-sided CIs: upper bound fixed at [1.00].
interpret_cramers_v(0.19)
## [1] "small"
## (Rules: funder2019)

The effect size is 0.19, which is considered small.
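For reference, the plain (unadjusted) Cramér's V can be computed from the uncorrected chi-square statistic as sqrt(chi2 / (n * (k - 1))), where k is the smaller of the number of rows and columns. The sketch below is an illustration only: the `effectsize` output above is labelled "(adj.)", so it reports a bias-adjusted estimate, and the plain formula is expected to give a very similar but not necessarily identical value. The object names are introduced here for illustration.

```r
# Observed counts typed in from the report (illustrative object)
tab <- matrix(c(226, 143, 107, 24), nrow = 2,
              dimnames = list(Gender = c("Male", "Female"),
                              Scholarship = c("NO", "YES")))

# Chi-square statistic without Yates' continuity correction
chi2 <- unname(chisq.test(tab, correct = FALSE)$statistic)

n <- sum(tab)       # sample size
k <- min(dim(tab))  # smaller of the number of rows and columns

# Plain (unadjusted) Cramer's V
cramers_v_plain <- sqrt(chi2 / (n * (k - 1)))
round(cramers_v_plain, 2)
```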

Conclusion:
Based on the sample data, I found that there is an association between gender and holding a scholarship (p<0.001). Although the effect size is small (Cramér's V=0.19), males in the sample were more likely to hold a scholarship than females (32.1% vs. 14.4%).

Because all assumptions were met, the Pearson Chi2 test was the most appropriate test to perform, but I will also show the nonparametric alternative (Fisher’s exact probability test).

Fisher’s exact probability test

H0: Odds ratio is equal to 1.
H1: Odds ratio is not equal to 1.

fisher.test(mydata$GenderF, mydata$Scholarship.holderF)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  mydata$GenderF and mydata$Scholarship.holderF
## p-value = 1.427e-05
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.2077130 0.5886365
## sample estimates:
## odds ratio 
##  0.3551537

I reject H0 (p<0.001).

Conclusion (Fisher’s exact probability test):
Based on the sample data, I can conclude that gender and being a scholarship holder are associated among students (p<0.001).
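To make the odds ratio from the Fisher output easier to read, the simple sample odds ratio can be computed directly from the observed counts. This is a sketch with illustrative object names; note that `fisher.test()` reports a conditional maximum likelihood estimate, so its value can differ slightly from the simple cross-product ratio below.

```r
# Observed counts typed in from the report (illustrative objects)
male_yes   <- 107; male_no   <- 226
female_yes <- 24;  female_no <- 143

# Odds of holding a scholarship within each gender
odds_male   <- male_yes / male_no
odds_female <- female_yes / female_no

# Sample odds ratio, with Female in the numerator (second row of the table)
or_sample <- odds_female / odds_male
round(or_sample, 3)
```

With the labels used in this report, a value below 1 means that the odds of holding a scholarship are lower for females than for males, which is consistent with the row proportions shown earlier.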