Research Question 1:
Is there any relationship between gender and having a scholarship?
mydata <- read.table("./dataset 2.hw.csv", header=TRUE, sep=",")
mydata$ID <- seq_len(nrow(mydata)) #Creating a new variable ID in case it is needed later
head(mydata)
## Gender Scholarship.holder ID
## 1 1 0 1
## 2 1 0 2
## 3 1 0 3
## 4 0 0 4
## 5 0 0 5
## 6 1 0 6
Unit of observation: one student
Description of data:
The data set was taken from Kaggle.com (Predict students’ dropout and academic success) and contains 4424 students. I will draw a random sample of 500 students, since working with all 4424 units is more than this analysis needs.
#Random sample of 500 units.
set.seed(1)
mydata <- mydata[sample(nrow(mydata), 500), ]
head(mydata)
## Gender Scholarship.holder ID
## 1017 0 1 1017
## 2177 0 0 2177
## 1533 0 0 1533
## 2347 1 0 2347
## 270 0 1 270
## 4050 0 1 4050
#Creating factors
mydata$GenderF <- factor(mydata$Gender,
levels = c(0, 1),
labels = c("Male", "Female"))
mydata$Scholarship.holderF <- factor(mydata$Scholarship.holder,
levels = c(0, 1),
labels = c("NO", "YES"))
head(mydata, 4)
## Gender Scholarship.holder ID GenderF Scholarship.holderF
## 1017 0 1 1017 Male YES
## 2177 0 0 2177 Male NO
## 1533 0 0 1533 Male NO
## 2347 1 0 2347 Female NO
summary(mydata[c("GenderF", "Scholarship.holderF")]) #Some descriptive statistics
## GenderF Scholarship.holderF
## Male :333 NO :369
## Female:167 YES:131
Assumptions:
- Observations must be independent.
- All expected frequencies must be greater than 5 (as discussed in class with Denis).
- In larger contingency tables (at least one categorical variable has
more than two categories), up to 20% of the expected frequencies can be
between 1 and 5, but this reduces the power of the test.
If conditions 2 and 3 are not met, or if any of the expected frequencies is less than 1, only Fisher’s exact probability test of independence (a nonparametric test) should be used.
The first assumption is met, because each student appears only once in the data and is either male or female (the same as in the class example with cats, where either “Love” or “Food” was used as the approach).
I will check the second assumption later, once I have the results of the Pearson Chi2 test and can verify that all expected frequencies are greater than 5.
The third assumption is met, because neither of my two categorical variables has more than two categories.
H0: There is no association between gender and holding a scholarship.
H1: There is an association between gender and holding a scholarship.
results <- chisq.test(mydata$GenderF, mydata$Scholarship.holderF,
correct = TRUE)
results
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: mydata$GenderF and mydata$Scholarship.holderF
## X-squared = 17.238, df = 1, p-value = 3.297e-05
I reject H0 (p < 0.001) and conclude that there is an association between the two categorical variables.
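To make the test statistic less of a black box, the sketch below recomputes it by hand from the observed counts reported by addmargins(results$observed) in this report. The names obs, expected, x2 and p are illustrative and are not part of the original analysis.

```r
# Observed counts copied from the contingency table in this report
obs <- matrix(c(226, 107,
                143,  24),
              nrow = 2, byrow = TRUE,
              dimnames = list(Gender = c("Male", "Female"),
                              Scholarship = c("NO", "YES")))

# Expected counts under independence: row total * column total / n
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Yates' continuity correction subtracts 0.5 from each |O - E| before squaring
x2 <- sum((abs(obs - expected) - 0.5)^2 / expected)

# Upper-tail probability of the chi-squared distribution with 1 degree of freedom
p <- pchisq(x2, df = 1, lower.tail = FALSE)

round(x2, 3)
p
```

The hand-computed value should agree with the X-squared reported by chisq.test(..., correct = TRUE) on the same table.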
library(psych)
describe(mydata)
## vars n mean sd median trimmed mad min max
## Gender 1 500 0.33 0.47 0.0 0.29 0.00 0 1
## Scholarship.holder 2 500 0.26 0.44 0.0 0.20 0.00 0 1
## ID 3 500 2232.17 1300.02 2179.5 2234.95 1653.84 15 4411
## GenderF* 4 500 1.33 0.47 1.0 1.29 0.00 1 2
## Scholarship.holderF* 5 500 1.26 0.44 1.0 1.20 0.00 1 2
## range skew kurtosis se
## Gender 1 0.70 -1.51 0.02
## Scholarship.holder 1 1.08 -0.84 0.02
## ID 4396 0.00 -1.23 58.14
## GenderF* 1 0.70 -1.51 0.02
## Scholarship.holderF* 1 1.08 -0.84 0.02
round(results$res, 2)
## mydata$Scholarship.holderF
## mydata$GenderF NO YES
## Male -1.26 2.11
## Female 1.78 -2.99
Explanation of the number 2.11 (Male, YES):
The actual number of males in our sample who were awarded a scholarship
is significantly higher than expected under independence (α = 5%), because the residual exceeds 1.96 in absolute value.
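The residual for a single cell can be reproduced by hand. A minimal sketch, assuming the counts for the (Male, YES) cell and its margins as printed in the tables of this report; the variable names are illustrative.

```r
# Counts taken from the observed table in this report
observed_cell <- 107   # males with a scholarship
row_total     <- 333   # all males
col_total     <- 131   # all scholarship holders
n             <- 500   # sample size

# Expected count under independence
expected_cell <- row_total * col_total / n

# Pearson residual: (observed - expected) / sqrt(expected)
residual <- (observed_cell - expected_cell) / sqrt(expected_cell)
round(residual, 2)
```

A positive residual means more observations in the cell than independence would predict, and a negative one means fewer.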
addmargins(results$observed)
## mydata$Scholarship.holderF
## mydata$GenderF NO YES Sum
## Male 226 107 333
## Female 143 24 167
## Sum 369 131 500
round(results$expected, 2)
## mydata$Scholarship.holderF
## mydata$GenderF NO YES
## Male 245.75 87.25
## Female 123.25 43.75
All expected frequencies are larger than 5, so the second assumption is met.
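The expected-frequency rules listed under the assumptions can be written down as a small decision helper. This is only a sketch: check_expected() is a hypothetical function that encodes the rules as stated in this report, and it is not part of any package.

```r
# Hypothetical helper encoding the expected-frequency rules from this report
check_expected <- function(expected) {
  # Any expected frequency below 1: only Fisher's exact test should be used
  if (any(expected < 1)) return("Fisher")
  if (all(dim(expected) == 2)) {
    # 2x2 table: all expected frequencies must be greater than 5
    if (all(expected > 5)) "Chi-squared" else "Fisher"
  } else {
    # Larger table: at most 20% of the expected frequencies may be 5 or less
    if (mean(expected <= 5) <= 0.20) "Chi-squared" else "Fisher"
  }
}

# Observed counts from this report and the expected counts they imply
obs <- matrix(c(226, 107, 143, 24), nrow = 2, byrow = TRUE,
              dimnames = list(c("Male", "Female"), c("NO", "YES")))
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

check_expected(expected)
```

For a small table such as matrix(c(3, 1, 1, 3), nrow = 2), every expected count is 2, so the same helper would point to Fisher's exact test.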
addmargins(round(prop.table(results$observed), 3))
## mydata$Scholarship.holderF
## mydata$GenderF NO YES Sum
## Male 0.452 0.214 0.666
## Female 0.286 0.048 0.334
## Sum 0.738 0.262 1.000
Explanation of the number 0.214 (Male, YES): Out of all 500 students, 21.4% were male and were awarded a scholarship.
addmargins(round(prop.table(results$observed, 1), 3), 2)
## mydata$Scholarship.holderF
## mydata$GenderF NO YES Sum
## Male 0.679 0.321 1.000
## Female 0.856 0.144 1.000
Explanation of the number 0.321 (Male, YES):
Out of all the males, 32.1% were awarded a scholarship.
addmargins(round(prop.table(results$observed, 2), 3), 1)
## mydata$Scholarship.holderF
## mydata$GenderF NO YES
## Male 0.612 0.817
## Female 0.388 0.183
## Sum 1.000 1.000
Explanation of the number 0.817 (Male, YES):
Out of all the students who were awarded a scholarship, 81.7% were male.
library(effectsize)
##
## Attaching package: 'effectsize'
## The following object is masked from 'package:psych':
##
## phi
effectsize::cramers_v(mydata$GenderF, mydata$Scholarship.holderF)
## Cramer's V (adj.) | 95% CI
## --------------------------------
## 0.19 | [0.11, 1.00]
##
## - One-sided CIs: upper bound fixed at [1.00].
interpret_cramers_v(0.19)
## [1] "small"
## (Rules: funder2019)
Cramér's V is 0.19, which means that the effect size (the strength of the association) is small.
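For reference, the classical (unadjusted) Cramér's V can be computed directly from the chi-squared statistic without continuity correction. The sketch below assumes the observed counts shown earlier in this report. Note that cramers_v() above reports a bias-adjusted value ("adj."), so the two numbers need not match beyond rounding.

```r
# Observed counts copied from the contingency table in this report
obs <- matrix(c(226, 107, 143, 24), nrow = 2, byrow = TRUE)
n <- sum(obs)

# Chi-squared statistic without continuity correction
x2 <- unname(chisq.test(obs, correct = FALSE)$statistic)

# Cramér's V = sqrt(chi2 / (n * (min(rows, cols) - 1)))
v <- sqrt(x2 / (n * (min(dim(obs)) - 1)))
round(v, 2)
```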
oddsratio(mydata$GenderF, mydata$Scholarship.holderF)
## Odds ratio | 95% CI
## -------------------------
## 0.35 | [0.22, 0.58]
interpret_oddsratio(0.35)
## [1] "small"
## (Rules: chen2010)
The odds ratio between gender and being awarded a scholarship is 0.35. The odds of getting a scholarship for females are 0.35 times the odds for males, i.e. about 65% lower.
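The odds ratio can be traced back to the four cell counts. A minimal sketch, assuming the observed table from this report; the Wald-type interval on the log scale is one common way to form a confidence interval and is shown as an assumption, not necessarily the method oddsratio() uses internally.

```r
# Observed counts copied from the contingency table in this report
obs <- matrix(c(226, 107, 143, 24), nrow = 2, byrow = TRUE,
              dimnames = list(c("Male", "Female"), c("NO", "YES")))

# Odds of holding a scholarship within each gender
odds_male   <- obs["Male", "YES"]   / obs["Male", "NO"]
odds_female <- obs["Female", "YES"] / obs["Female", "NO"]

# Odds ratio: females relative to males
or <- odds_female / odds_male

# Wald-type 95% interval on the log scale (assumed method)
se_log_or <- sqrt(sum(1 / obs))
ci <- exp(log(or) + c(-1, 1) * qnorm(0.975) * se_log_or)

round(c(OR = or, lower = ci[1], upper = ci[2]), 2)
```

Values below 1 mean lower odds for females than for males, and an interval that excludes 1 is consistent with rejecting independence.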
Conclusion:
Based on the sample data, I found that there is an association between
gender and being awarded a scholarship (p < 0.001). Even though the
effect size is small (Cramér's V = 0.19), males are more likely to get
a scholarship than females.
Because all assumptions were met, the Pearson Chi2 test was the most appropriate one to perform, but I will also show the nonparametric test (Fisher’s exact probability test).
H0: Odds ratio is equal to 1.
H1: Odds ratio is not equal to 1.
fisher.test(mydata$GenderF, mydata$Scholarship.holderF)
##
## Fisher's Exact Test for Count Data
##
## data: mydata$GenderF and mydata$Scholarship.holderF
## p-value = 1.427e-05
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.2077130 0.5886365
## sample estimates:
## odds ratio
## 0.3551537
I reject H0 (p < 0.001). The odds ratio is not equal to 1 (OR = 0.36).
Conclusion (Fisher’s exact probability test):
Based on the sample data, I can conclude that there is an association between
gender and being a scholarship holder among students (p < 0.001).
Research Question 2:
Is there any linear correlation between points on the math exam and
reading exam?
mydata1 <- read.table("./original_data4.csv", header=TRUE, sep=",")
mydata1$Gender <- NULL #Removing a variable
mydata1$WritingScore <- NULL #Removing a variable
head(mydata1)
## ID MathScore ReadingScore
## 1 0 72 72
## 2 1 69 90
## 3 2 90 95
## 4 3 47 57
## 5 4 76 78
## 6 5 71 83
#Random sample of 200 units.
set.seed(1)
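#Note: the indices are drawn from 1 to nrow(mydata), which is 500 after the sampling in Research Question 1,
#so only the first 500 rows of mydata1 can be selected; nrow(mydata1) would sample from the whole data set.
mydata1 <- mydata1[sample(nrow(mydata), 200), ]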
tail(mydata1)
## ID MathScore ReadingScore
## 468 467 72 67
## 338 337 49 51
## 437 436 75 68
## 212 211 35 28
## 127 126 72 68
## 133 132 87 74
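Because the row indices above come from sample(nrow(mydata), 200), it may help to see how the first argument of sample() limits which rows can be drawn. A small sketch with hypothetical toy data frames (small_df and big_df are not part of the analysis):

```r
# Hypothetical toy data frames of different sizes
small_df <- data.frame(id = 1:5)
big_df   <- data.frame(id = 1:20)

# Indices bounded by the smaller data frame: only values from 1 to 5 are possible
set.seed(1)
idx_small <- sample(nrow(small_df), 3)

# Indices bounded by the larger data frame: any value from 1 to 20 is possible
set.seed(1)
idx_big <- sample(nrow(big_df), 3)

# Subsetting big_df with idx_small can only ever return its first five rows
big_df[idx_small, , drop = FALSE]
range(idx_small)
range(idx_big)
```

To sample from all rows of a data frame, the size passed to sample() should be the number of rows of that same data frame.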
Unit of observation: one student
Description of data:
The data set was taken from Kaggle.com (Students exam scores: Extended dataset) and contains 999 students. I will draw a random sample of 200 students. I also used this data set for HW1.
library(psych)
psych::describe(mydata1[ , c("MathScore", "ReadingScore")]) #Descriptive statistics
## vars n mean sd median trimmed mad min max range skew
## MathScore 1 200 66.92 15.10 69.0 67.62 14.83 24 100 76 -0.39
## ReadingScore 2 200 69.04 14.32 70.5 69.39 13.34 26 100 74 -0.28
## kurtosis se
## MathScore -0.11 1.07
## ReadingScore 0.02 1.01
summary(mydata1[c("MathScore", "ReadingScore")]) #Additional descriptive statistics
## MathScore ReadingScore
## Min. : 24.00 Min. : 26.00
## 1st Qu.: 57.00 1st Qu.: 59.75
## Median : 69.00 Median : 70.50
## Mean : 66.92 Mean : 69.04
## 3rd Qu.: 77.25 3rd Qu.: 77.25
## Max. :100.00 Max. :100.00
Here I can see that the means, medians, minimums and other descriptive statistics of the two variables are very similar.
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
scatterplotMatrix(mydata1[ , -1], smooth=FALSE)
Based on the scatter plot, I can assume that there is a strong or very strong positive linear correlation between the points received on the two exams.
library(Hmisc)
##
## Attaching package: 'Hmisc'
## The following object is masked from 'package:psych':
##
## describe
## The following objects are masked from 'package:base':
##
## format.pval, units
rcorr(as.matrix(mydata1[ , -1]),
type = "pearson")
## MathScore ReadingScore
## MathScore 1.00 0.81
## ReadingScore 0.81 1.00
##
## n= 200
##
##
## P
## MathScore ReadingScore
## MathScore 0
## ReadingScore 0
Interpretation of the number 0.81:
The linear relationship between MathScore and ReadingScore is positive and
strong.
Now I will check the same thing with ggplot and the function cor (as a robustness check).
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.2
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
ggplot(mydata1, aes(x = MathScore, y = ReadingScore)) +
geom_point()
The ggplot graph also suggests a positive relationship between the two variables.
cor(mydata1$MathScore, mydata1$ReadingScore,
method = "pearson",
use = "complete.obs")
## [1] 0.8147229
The result is the same, as expected. The correlation between MathScore and ReadingScore is positive and strong. Both the scatterplot matrix and the ggplot graph show that the linearity assumption is met.
Normality of variables:
For both normality tests, the hypotheses are the same.
H0: The variable is normally distributed.
H1: The variable is not normally distributed.
shapiro.test(mydata1$MathScore)
##
## Shapiro-Wilk normality test
##
## data: mydata1$MathScore
## W = 0.98594, p-value = 0.04422
shapiro.test(mydata1$ReadingScore)
##
## Shapiro-Wilk normality test
##
## data: mydata1$ReadingScore
## W = 0.99041, p-value = 0.2051
For the variable MathScore I can reject the null hypothesis and
conclude that this variable is not normally distributed (p = 0.044).
For the variable ReadingScore I cannot reject the null hypothesis (p = 0.205).
Because not both variables are normally distributed, I will use
the Spearman correlation.
H0: The correlation is equal to 0.
H1: The correlation is not equal to 0.
cor.test(mydata1$MathScore, mydata1$ReadingScore,
method = "spearman",
use = "complete.obs")
## Warning in cor.test.default(mydata1$MathScore, mydata1$ReadingScore, method =
## "spearman", : Cannot compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: mydata1$MathScore and mydata1$ReadingScore
## S = 286731, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.784946
I reject H0 (p < 0.001). There is a correlation between the variables.
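Spearman's rho is the Pearson correlation computed on ranks, which is why it does not require normally distributed variables. A minimal sketch, using only the six rows printed by head(mydata1) earlier in this report as illustration (not the 200-unit sample, so the value differs from the result above); the vector names are illustrative.

```r
# First six rows of MathScore and ReadingScore as printed by head(mydata1)
math    <- c(72, 69, 90, 47, 76, 71)
reading <- c(72, 90, 95, 57, 78, 83)

# Spearman's rho as reported by cor()
rho_builtin <- cor(math, reading, method = "spearman")

# The same value obtained as the Pearson correlation of the ranks
rho_by_rank <- cor(rank(math), rank(reading), method = "pearson")

c(builtin = rho_builtin, by_rank = rho_by_rank)
```

Because only the ordering of the values matters, rho captures monotonic relationships and is less sensitive to skewness or outliers than Pearson's r.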
Conclusion:
Based on the sample data, I can conclude that there is a correlation
between points on the math and reading exams (p < 0.001). This
correlation is positive and strong (Spearman's rho = 0.78).