Matej Suhalj

1. Part (two categorical variables)

Research Question 1:
Is there any relationship between gender and having a scholarship?

mydata <- read.table("./dataset 2.hw.csv", header=TRUE, sep=",")
mydata$ID <- seq_len(nrow(mydata)) #Creating a new variable ID, in case I need it later
head(mydata)

##   Gender Scholarship.holder ID
## 1      1                  0  1
## 2      1                  0  2
## 3      1                  0  3
## 4      0                  0  4
## 5      0                  0  5
## 6      1                  0  6

Unit of observation: one student

Description of data:

ID: student identifier
Gender: male, female (0=Male, 1=Female)
Scholarship.holder: (0=NO, 1=YES)

The data set was taken from Kaggle.com (Predict students’ dropout and academic success) and contains 4424 students. I will draw a random sample of 500 students, since 4424 units would be a bit too much.

#Random sample of 500 units. 
set.seed(1) 
mydata <- mydata[sample(nrow(mydata), 500), ]
head(mydata)

##      Gender Scholarship.holder   ID
## 1017      0                  1 1017
## 2177      0                  0 2177
## 1533      0                  0 1533
## 2347      1                  0 2347
## 270       0                  1  270
## 4050      0                  1 4050

#Creating factors
mydata$GenderF <- factor(mydata$Gender, 
                                levels = c(0, 1), 
                                labels = c("Male", "Female"))

mydata$Scholarship.holderF <- factor(mydata$Scholarship.holder, 
                                levels = c(0, 1), 
                                labels = c("NO", "YES"))
   
head(mydata, 4)

##      Gender Scholarship.holder   ID GenderF Scholarship.holderF
## 1017      0                  1 1017    Male                 YES
## 2177      0                  0 2177    Male                  NO
## 1533      0                  0 1533    Male                  NO
## 2347      1                  0 2347  Female                  NO

summary(mydata[c("GenderF", "Scholarship.holderF")]) #Some descriptive statistics

##    GenderF    Scholarship.holderF
##  Male  :333   NO :369            
##  Female:167   YES:131

Assumptions:
- Observations must be independent.
- Check that all expected frequencies are greater than 5 (that’s what we said in class with Denis).
- In larger contingency tables (at least one categorical variable has more than two categories), up to 20% of the expected frequencies can be between 1 and 5, but this will reduce the power of the test.

If the second and third assumptions are not met, or if any of the expected frequencies is less than 1, only Fisher’s Exact Probability Test of Independence (a nonparametric test) should be used.

The first assumption is met, because each student is counted in exactly one cell of the table: a student is either male or female and either holds a scholarship or does not (the same as in the class example with cats, where either “Love” or “Food” was used as the approach).

I will check the second assumption later, once I have the results of the Pearson Chi2 test and can see whether all expected frequencies are greater than 5.

The third assumption is met, because neither of my two categorical variables has more than two categories.

Pearson Chi2 test

H0: There is no association between gender and holding a scholarship.
H1: There is an association between gender and holding a scholarship.

results <- chisq.test(mydata$GenderF, mydata$Scholarship.holderF, 
                      correct = TRUE)

results

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  mydata$GenderF and mydata$Scholarship.holderF
## X-squared = 17.238, df = 1, p-value = 3.297e-05

I reject H0 (p < 0.001). I conclude that there is an association between gender and holding a scholarship.
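As a rough check, the test statistic can be recomputed by hand from the observed counts of this sample (226, 107, 143 and 24, as in the contingency table of this analysis). This is only a minimal sketch: the object names `observed`, `expected`, `chi2_yates` and `p_value` are my own and are not part of the original analysis.

```r
# Observed counts of the sample (rows: Gender, columns: Scholarship.holder)
observed <- matrix(c(226, 107,
                     143,  24),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Gender = c("Male", "Female"),
                                   Scholarship = c("NO", "YES")))

n <- sum(observed)

# Expected count under independence: row total * column total / n
expected <- outer(rowSums(observed), colSums(observed)) / n

# Yates' continuity correction subtracts 0.5 from each absolute deviation
chi2_yates <- sum((abs(observed - expected) - 0.5)^2 / expected)

# A 2x2 table has (2 - 1) * (2 - 1) = 1 degree of freedom
p_value <- pchisq(chi2_yates, df = 1, lower.tail = FALSE)

round(chi2_yates, 3)
signif(p_value, 4)
```

The statistic should agree with the X-squared value printed by chisq.test above.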

library(psych)
describe(mydata)

##                      vars   n    mean      sd median trimmed     mad min  max
## Gender                  1 500    0.33    0.47    0.0    0.29    0.00   0    1
## Scholarship.holder      2 500    0.26    0.44    0.0    0.20    0.00   0    1
## ID                      3 500 2232.17 1300.02 2179.5 2234.95 1653.84  15 4411
## GenderF*                4 500    1.33    0.47    1.0    1.29    0.00   1    2
## Scholarship.holderF*    5 500    1.26    0.44    1.0    1.20    0.00   1    2
##                      range skew kurtosis    se
## Gender                   1 0.70    -1.51  0.02
## Scholarship.holder       1 1.08    -0.84  0.02
## ID                    4396 0.00    -1.23 58.14
## GenderF*                 1 0.70    -1.51  0.02
## Scholarship.holderF*     1 1.08    -0.84  0.02

round(results$res, 2)

##               mydata$Scholarship.holderF
## mydata$GenderF    NO   YES
##         Male   -1.26  2.11
##         Female  1.78 -2.99

Explanation of the number 2.11 (Male, YES):
The residual is larger than 1.96, so the actual number of males in our sample who were awarded a scholarship is significantly higher than expected under independence (alpha = 5%).
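The residuals above can be reproduced by hand, which shows where a number like 2.11 comes from. The sketch below assumes the observed counts of this sample; the names `pearson_res` and `std_res` are my own. Note that `results$res` holds Pearson residuals, while the 1.96 cut-off strictly applies to the standardized residuals (available as `results$stdres`), which are never smaller in absolute value, so the conclusion here does not change.

```r
# Observed counts of the sample (rows: Gender, columns: Scholarship.holder)
observed <- matrix(c(226, 107,
                     143,  24),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Gender = c("Male", "Female"),
                                   Scholarship = c("NO", "YES")))

n <- sum(observed)
expected <- outer(rowSums(observed), colSums(observed)) / n

# Pearson residual: (observed - expected) / sqrt(expected)
pearson_res <- (observed - expected) / sqrt(expected)
round(pearson_res, 2)

# Standardized residual: additionally divide by
# sqrt((1 - row proportion) * (1 - column proportion))
row_p <- rowSums(observed) / n
col_p <- colSums(observed) / n
std_res <- pearson_res / sqrt(outer(1 - row_p, 1 - col_p))
round(std_res, 2)
```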

addmargins(results$observed)

##               mydata$Scholarship.holderF
## mydata$GenderF  NO YES Sum
##         Male   226 107 333
##         Female 143  24 167
##         Sum    369 131 500

round(results$expected, 2)

##               mydata$Scholarship.holderF
## mydata$GenderF     NO   YES
##         Male   245.75 87.25
##         Female 123.25 43.75

All expected frequencies are larger than 5, so the second assumption is met.

addmargins(round(prop.table(results$observed), 3))

##               mydata$Scholarship.holderF
## mydata$GenderF    NO   YES   Sum
##         Male   0.452 0.214 0.666
##         Female 0.286 0.048 0.334
##         Sum    0.738 0.262 1.000

Explanation of the number 0.214 (Male, YES): Of the 500 students, 21.4% were males who were awarded a scholarship.

addmargins(round(prop.table(results$observed, 1), 3), 2)

##               mydata$Scholarship.holderF
## mydata$GenderF    NO   YES   Sum
##         Male   0.679 0.321 1.000
##         Female 0.856 0.144 1.000

Explanation of the number 0.321 (Male, YES):
Out of all the males, 32.1% were awarded a scholarship.

addmargins(round(prop.table(results$observed, 2), 3), 1)

##               mydata$Scholarship.holderF
## mydata$GenderF    NO   YES
##         Male   0.612 0.817
##         Female 0.388 0.183
##         Sum    1.000 1.000

Explanation of the number 0.817 (Male, YES):
Out of all the students who were awarded a scholarship, 81.7% were males.

library(effectsize)

## 
## Attaching package: 'effectsize'

## The following object is masked from 'package:psych':
## 
##     phi

effectsize::cramers_v(mydata$GenderF, mydata$Scholarship.holderF)

## Cramer's V (adj.) |       95% CI
## --------------------------------
## 0.19              | [0.11, 1.00]
## 
## - One-sided CIs: upper bound fixed at [1.00].

interpret_cramers_v(0.19)

## [1] "small"
## (Rules: funder2019)

Cramér's V is 0.19, which indicates a small effect size.
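To see what this number measures, Cramér's V can be computed by hand from the uncorrected chi-square statistic. The sketch below assumes the observed counts of this sample. It also computes a bias-corrected version (Bergsma's correction); I am assuming, not confirming, that this is the adjustment behind the "(adj.)" label in the output above. All object names are my own.

```r
# Observed counts of the sample (rows: Gender, columns: Scholarship.holder)
observed <- matrix(c(226, 107,
                     143,  24),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Gender = c("Male", "Female"),
                                   Scholarship = c("NO", "YES")))

n <- sum(observed)
r <- nrow(observed)
k <- ncol(observed)

# Chi-square statistic without Yates' continuity correction
chi2 <- unname(chisq.test(observed, correct = FALSE)$statistic)

# Unadjusted Cramer's V: sqrt(chi2 / (n * min(r - 1, k - 1)))
phi2 <- chi2 / n
v <- sqrt(phi2 / min(r - 1, k - 1))

# Bias-corrected version (assumed to correspond to the "(adj.)" output)
phi2_adj <- max(0, phi2 - (r - 1) * (k - 1) / (n - 1))
r_adj <- r - (r - 1)^2 / (n - 1)
k_adj <- k - (k - 1)^2 / (n - 1)
v_adj <- sqrt(phi2_adj / min(r_adj - 1, k_adj - 1))

round(c(unadjusted = v, adjusted = v_adj), 2)
```

Both versions round to 0.19 for this table, so the interpretation as a small effect does not depend on the adjustment.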

oddsratio(mydata$GenderF, mydata$Scholarship.holderF)

## Odds ratio |       95% CI
## -------------------------
## 0.35       | [0.22, 0.58]

interpret_oddsratio(0.35)

## [1] "small"
## (Rules: chen2010)

The odds ratio between gender and being awarded a scholarship is 0.35. The odds of holding a scholarship for females are 0.35 times the odds for males, i.e. about 65% lower.
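The odds ratio can be verified directly from the observed counts of this sample (107 of 333 males and 24 of 167 females hold a scholarship). This is a minimal sketch with my own object names; the interval is a simple Wald interval on the log scale, which I am only assuming is comparable to the one reported by oddsratio().

```r
# Odds of holding a scholarship = holders / non-holders, per gender
odds_male   <- 107 / 226
odds_female <-  24 / 143

# Odds ratio with males as the reference group
or <- odds_female / odds_male
round(or, 2)

# The same comparison from the males' point of view
round(1 / or, 2)

# Wald 95% confidence interval, computed on the log scale
se_log_or <- sqrt(1 / 226 + 1 / 107 + 1 / 143 + 1 / 24)
ci <- exp(log(or) + c(-1, 1) * qnorm(0.975) * se_log_or)
round(ci, 2)
```

Read the other way round, the odds of holding a scholarship for males are roughly 2.8 times the odds for females.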

Conclusion:
Based on the sample data, I found that there is an association between gender and being awarded a scholarship (p < 0.001). Even though the effect size is small (Cramér's V = 0.19), males are more likely to hold a scholarship than females.

Because all assumptions were met, the Pearson Chi2 test was the most appropriate one to perform, but I will also show the nonparametric alternative (Fisher’s exact probability test).

Fisher’s exact probability test

H0: Odds ratio is equal to 1.
H1: Odds ratio is not equal to 1.

fisher.test(mydata$GenderF, mydata$Scholarship.holderF)

## 
##  Fisher's Exact Test for Count Data
## 
## data:  mydata$GenderF and mydata$Scholarship.holderF
## p-value = 1.427e-05
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.2077130 0.5886365
## sample estimates:
## odds ratio 
##  0.3551537

I reject H0 (p < 0.001). The odds ratio is not equal to 1 (OR = 0.36).

Conclusion (Fisher’s exact probability test):
Based on the sample data I can conclude that there is an association between gender and being a scholarship holder among students (p < 0.001).

2. Part (two numerical variables)

Research Question 2:
Is there any linear correlation between points on the math exam and reading exam?

mydata1 <- read.table("./original_data4.csv", header=TRUE, sep=",")
mydata1$Gender <- NULL #Removing a variable
mydata1$WritingScore <- NULL #Removing a variable
head(mydata1)

##   ID MathScore ReadingScore
## 1  0        72           72
## 2  1        69           90
## 3  2        90           95
## 4  3        47           57
## 5  4        76           78
## 6  5        71           83

#Random sample of 200 units.
set.seed(1) 
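#Note: nrow(mydata) is 500 here (the sample from Part 1), so the 200 rows are drawn from the first 500 rows of mydata1
mydata1 <- mydata1[sample(nrow(mydata), 200), ]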
tail(mydata1)

##      ID MathScore ReadingScore
## 468 467        72           67
## 338 337        49           51
## 437 436        75           68
## 212 211        35           28
## 127 126        72           68
## 133 132        87           74
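When drawing a random sample of rows, the indices should come from the same data frame that is being subset, so that every row has a chance of being selected. A minimal sketch with a hypothetical data frame `scores` (invented for illustration, not the exam data):

```r
# Hypothetical data frame standing in for a data set with 999 rows
set.seed(1)
scores <- data.frame(ID = 0:998,
                     MathScore = sample(0:100, 999, replace = TRUE))

# Draw 200 distinct row indices from the rows of 'scores' itself
rows <- sample(nrow(scores), 200)
scores_sample <- scores[rows, ]

nrow(scores_sample)  # 200 rows
range(rows)          # indices can fall anywhere between 1 and 999
```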

Unit of observation: one student

Description of data:

ID: student identifier
MathScore: results on a math exam
ReadingScore: results on a reading exam

The data set was taken from Kaggle.com (Students exam scores: Extended dataset) and contains 999 students. I will draw a random sample of 200 students. I also used this data for HW1.

library(psych)
psych::describe(mydata1[ , c("MathScore", "ReadingScore")]) #Descriptive statistics

##              vars   n  mean    sd median trimmed   mad min max range  skew
## MathScore       1 200 66.92 15.10   69.0   67.62 14.83  24 100    76 -0.39
## ReadingScore    2 200 69.04 14.32   70.5   69.39 13.34  26 100    74 -0.28
##              kurtosis   se
## MathScore       -0.11 1.07
## ReadingScore     0.02 1.01

summary(mydata1[c("MathScore", "ReadingScore")]) #Additional descriptive statistics

##    MathScore       ReadingScore   
##  Min.   : 24.00   Min.   : 26.00  
##  1st Qu.: 57.00   1st Qu.: 59.75  
##  Median : 69.00   Median : 70.50  
##  Mean   : 66.92   Mean   : 69.04  
##  3rd Qu.: 77.25   3rd Qu.: 77.25  
##  Max.   :100.00   Max.   :100.00

Here I can see that the arithmetic means, medians, minimums and maximums of the two variables are very similar.

library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:psych':
## 
##     logit

scatterplotMatrix(mydata1[ , -1], smooth=FALSE)

Based on the scatter plot, I expect a strong or very strong positive linear correlation between the points received on the two exams.

library(Hmisc)

## 
## Attaching package: 'Hmisc'

## The following object is masked from 'package:psych':
## 
##     describe

## The following objects are masked from 'package:base':
## 
##     format.pval, units

rcorr(as.matrix(mydata1[ , -1]), 
      type = "pearson")

##              MathScore ReadingScore
## MathScore         1.00         0.81
## ReadingScore      0.81         1.00
## 
## n= 200 
## 
## 
## P
##              MathScore ReadingScore
## MathScore               0          
## ReadingScore  0

Interpretation of the number 0.81:
The linear relationship between MathScore and ReadingScore is positive and strong.

Now I will just check the same thing with ggplot and function cor (just as a robustness check).

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.3.2

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

ggplot(mydata1, aes(x = MathScore, y = ReadingScore)) +
  geom_point()

The ggplot graph also suggests a positive relationship between the two variables.

cor(mydata1$MathScore, mydata1$ReadingScore,
    method = "pearson",
    use = "complete.obs")

## [1] 0.8147229

The result is the same, as expected. The correlation between MathScore and ReadingScore is positive and strong. Both the scatterplot matrix and the ggplot scatter plot indicate that the linearity assumption is met.

Normality of variables:
For both normality tests, the hypotheses are the same.
H0: The variable is normally distributed.
H1: The variable is not normally distributed.

shapiro.test(mydata1$MathScore)

## 
##  Shapiro-Wilk normality test
## 
## data:  mydata1$MathScore
## W = 0.98594, p-value = 0.04422

shapiro.test(mydata1$ReadingScore)

## 
##  Shapiro-Wilk normality test
## 
## data:  mydata1$ReadingScore
## W = 0.99041, p-value = 0.2051

For the variable MathScore I reject the null hypothesis and conclude that this variable is not normally distributed (p = 0.044).
For the variable ReadingScore I cannot reject the null hypothesis (p = 0.205).

Because MathScore is not normally distributed, I will use the Spearman correlation.
H0:The correlation is equal to 0.
H1:The correlation is not equal to 0.

cor.test(mydata1$MathScore, mydata1$ReadingScore,
         method = "spearman",
         use = "complete.obs")

## Warning in cor.test.default(mydata1$MathScore, mydata1$ReadingScore, method =
## "spearman", : Cannot compute exact p-value with ties

## 
##  Spearman's rank correlation rho
## 
## data:  mydata1$MathScore and mydata1$ReadingScore
## S = 286731, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##      rho 
## 0.784946

I reject H0 (p < 0.001). There is a correlation between the variables.
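Spearman's rho is the Pearson correlation computed on the ranks of the values, which is why it does not require normally distributed variables. The sketch below illustrates this on a few score pairs: the first six are taken from head(mydata1) above, and the last two are invented to create ties. The object names are my own.

```r
# Six pairs from head(mydata1) plus two invented pairs (to create ties)
math    <- c(72, 69, 90, 47, 76, 71, 69, 88)
reading <- c(72, 90, 95, 57, 78, 83, 60, 95)

# Built-in Spearman correlation
rho_builtin <- cor(math, reading, method = "spearman")

# The same value by hand: Pearson correlation of the ranks
# (rank() assigns average ranks to tied values by default)
rho_by_hand <- cor(rank(math), rank(reading), method = "pearson")

all.equal(rho_builtin, rho_by_hand)
```

Because only the ranks are used, rho describes a monotonic relationship rather than a strictly linear one.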

Conclusion:
Based on the sample data I can conclude that there is a correlation between points on the math and reading exams (p < 0.001). This correlation is positive and strong (Spearman's rho = 0.78).

Homework 2

2024-01-16
