Stella Maria Lourdes Kusumawardhani

2) Import data

mydata <- read.table("./student-mat.csv", header = TRUE, sep = ";", dec = ",")
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
mydata <- mydata %>%
  select(sex, age, studytime, absences, G1)
  • because the original dataset contains a large number of variables, 5 variables of interest are selected from it

3) Data display

head(mydata)
##   sex age studytime absences G1
## 1   F  18         2        6  5
## 2   F  17         2        4  5
## 3   F  15         2       10  7
## 4   F  15         3        2 15
## 5   F  16         2        4  6
## 6   M  16         2       10 15

4) Data explanation

  • The processed dataset contains 395 observations and 5 variables
  • The unit of observation is an individual student
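
The size and structure described above can be checked directly in R. The following is a minimal sketch, assuming a small stand-in data frame (`toy`, a hypothetical name) built from the six rows printed by head(mydata); on the full data, dim(mydata) should report 395 rows and 5 columns.

```r
# Stand-in for mydata: the six rows shown by head(mydata) above.
toy <- data.frame(
  sex       = c("F", "F", "F", "F", "F", "M"),
  age       = c(18, 17, 15, 15, 16, 16),
  studytime = c(2, 2, 2, 3, 2, 2),
  absences  = c(6, 4, 10, 2, 4, 10),
  G1        = c(5, 5, 7, 15, 6, 15)
)

dim(toy)              # rows (observations) and columns (variables)
str(toy)              # storage type of each column
colSums(is.na(toy))   # number of missing values per column
```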

Variables:

  • sex
    • sex of student
    • type: categorical, nominal
      • F : Female
      • M : Male
  • age
    • age of student
    • type: numeric, ratio
    • unit: years
  • studytime
    • weekly study time of student
    • type: categorical, ordinal
      • 1 : less than 2 hours
      • 2 : 2 to 5 hours
      • 3 : 5 to 10 hours
      • 4 : more than 10 hours
  • absences
    • number of student’s school absences
    • type: numeric, ratio
    • unit: days
  • G1 (renamed to grade in step 6)
    • student’s mathematics grade at the first evaluation
    • type: numeric, ratio
    • unit: points

5) Data source

Cortez, P. (2008). Student Performance [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5TG7T.

6) Data manipulation

mydata <- mydata %>%
  rename(gender = sex, grade = G1)
mydata$gender <- factor(mydata$gender,
                        levels = c("F", "M"),
                        labels = c("Female", "Male"))
mydata$studytime <- factor(mydata$studytime, 
                            levels = c(1, 2, 3, 4), 
                            labels = c("< 2 hours", "2 to 5 hours", "5 to 10 hours", "> 10 hours"), 
                            ordered = TRUE)
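
A quick way to confirm that the recoding worked as intended is to inspect the resulting factor. The sketch below uses hypothetical stand-in codes (`studytime_raw`) rather than the real column, and only base R functions.

```r
# Hypothetical stand-in codes; in the report this would be the raw studytime column.
studytime_raw <- c(2, 2, 2, 3, 2, 2, 1, 4)

studytime_f <- factor(studytime_raw,
                      levels = c(1, 2, 3, 4),
                      labels = c("< 2 hours", "2 to 5 hours", "5 to 10 hours", "> 10 hours"),
                      ordered = TRUE)

is.ordered(studytime_f)             # checks that the factor carries an order
levels(studytime_f)                 # labels in increasing order of study time
table(studytime_raw, studytime_f)   # each code should map to exactly one label
studytime_f[1] < studytime_f[4]     # comparisons are defined for ordered factors
```

Declaring the factor as ordered matters for an ordinal variable, because it allows comparisons such as `<` between categories.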

Descriptive Statistics

summary(mydata)
##     gender         age               studytime      absences     
##  Female:208   Min.   :15.0   < 2 hours    :105   Min.   : 0.000  
##  Male  :187   1st Qu.:16.0   2 to 5 hours :198   1st Qu.: 0.000  
##               Median :17.0   5 to 10 hours: 65   Median : 4.000  
##               Mean   :16.7   > 10 hours   : 27   Mean   : 5.709  
##               3rd Qu.:18.0                       3rd Qu.: 8.000  
##               Max.   :22.0                       Max.   :75.000  
##      grade      
##  Min.   : 3.00  
##  1st Qu.: 8.00  
##  Median :11.00  
##  Mean   :10.91  
##  3rd Qu.:13.00  
##  Max.   :19.00

Interpretation

  • age
    • mean: the average age of students is 16.7 years old
    • median: half of the students are aged less than or equal to 17 years old, while the other half are aged over 17 years old
    • min, max: the youngest student in the sample is aged 15 years old, the oldest student is aged 22 years old
  • absences
    • mean: the average number of days absent per student is 5.7 days
    • median: half of the students were absent for less than or equal to 4 days, while the other half were absent for over 4 days
    • min, max: the fewest days of absence recorded for a student is 0 days, the most is 75 days
  • grade
    • mean: the average grade of students is 10.91 points
    • median: half of the students received a grade of less than or equal to 11 points, while the other half received a grade of over 11 points
    • min, max: the student with the lowest grade scored 3 points, while the student with the highest grade scored 19 points
library(psych)
describeBy(mydata$grade, group = mydata$gender)
## 
##  Descriptive statistics by group 
## group: Female
##    vars   n  mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 208 10.62 3.23     10   10.46 2.97   4  19    15 0.36    -0.68 0.22
## ------------------------------------------------------------ 
## group: Male
##    vars   n  mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 187 11.23 3.39     11   11.19 4.45   3  19    16  0.1    -0.72 0.25

Interpretation

  • Female
    • mean: the average grade for female students is 10.62 points
    • median: half of the female students received a grade of less than or equal to 10 points, while the other half received a grade of over 10 points
    • min, max: the female student who received the lowest grade scored 4 points, while the female student who received the highest grade scored 19 points
  • Male
    • mean: the average grade for male students is 11.23 points
    • median: half of the male students received a grade of less than or equal to 11 points, while the other half received a grade of over 11 points
    • min, max: the male student who received the lowest grade scored 3 points, while the male student who received the highest grade scored 19 points
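
The same per-group statistics can be reproduced without the psych package. This is a minimal base R sketch, assuming a stand-in data frame (`toy`, a hypothetical name) made from the six rows printed by head(mydata), with the columns already renamed and relabelled.

```r
# Stand-in for mydata after renaming: the six rows shown by head(mydata).
toy <- data.frame(
  gender = c("Female", "Female", "Female", "Female", "Female", "Male"),
  grade  = c(5, 5, 7, 15, 6, 15)
)

# Split grades by gender and compute the statistics interpreted above.
group_stats <- sapply(split(toy$grade, toy$gender), function(x) {
  c(n = length(x), mean = mean(x), median = median(x), min = min(x), max = max(x))
})
group_stats   # one column per gender, one row per statistic
```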

7) Research Question and Hypothesis Testing

Parametric Test

Research Question: Is there a difference in average mathematics grade between male and female students?

First assumption: Grade is a numeric variable

  • Assumption fulfilled

Second assumption: Grade is normally distributed within the populations of male and female students

library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
# Histogram of grades for male students (x axis: grade in points, y axis: count)
Grade_Male <- ggplot(mydata[mydata$gender == "Male", ], aes(x = grade)) +
  geom_histogram(binwidth = 1, col = "black", fill = "purple") +
  theme_linedraw() +
  xlab("Points") +
  ylab("Frequency") +
  ggtitle("Male Students' Grades")

# Histogram of grades for female students, same layout
Grade_Female <- ggplot(mydata[mydata$gender == "Female", ], aes(x = grade)) +
  geom_histogram(binwidth = 1, col = "black", fill = "purple") +
  theme_linedraw() +
  xlab("Points") +
  ylab("Frequency") +
  ggtitle("Female Students' Grades")
library(ggpubr)
ggarrange(Grade_Male, Grade_Female,
          ncol = 2, nrow = 1)

library(ggpubr)
ggqqplot(mydata,
         "grade",
         facet.by = "gender")

  • Normality assumption is not fulfilled: judging from the histograms and Q-Q plots, grade is not normally distributed within the female student population
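
A formal test can complement the visual check. The sketch below runs the Shapiro-Wilk test within each group using base R's shapiro.test. It runs on simulated stand-in data (`sim`, a hypothetical name) that only borrows the group sizes and approximate means and SDs reported earlier, so its p-values say nothing about the real grades; with the report's data, mydata$grade and mydata$gender would take the place of the simulated columns.

```r
set.seed(1)  # simulated stand-in data, NOT the student grades
sim <- data.frame(
  gender = rep(c("Female", "Male"), times = c(208, 187)),
  grade  = c(rnorm(208, mean = 10.6, sd = 3.2),
             rnorm(187, mean = 11.2, sd = 3.4))
)

# Shapiro-Wilk test per group; H0: the variable is normally distributed in that group
sw <- lapply(split(sim$grade, sim$gender), shapiro.test)

# Collect the test statistic and p-value of each group into one small table
sw_table <- sapply(sw, function(res) c(W = unname(res$statistic), p = res$p.value))
sw_table
```

A p-value below the chosen significance level in a group would point against normality in that group.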

Third assumption: Grade has the same variance in the populations of male and female students

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:psych':
## 
##     logit
## The following object is masked from 'package:dplyr':
## 
##     recode
leveneTest(mydata$grade, group = mydata$gender)
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value Pr(>F)
## group   1  0.4903 0.4842
##       393
  • Levene's test does not reject the null hypothesis of equal variances (p = 0.48), so the variance of grade is assumed to be the same in both populations
  • third assumption is fulfilled
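
To make clear what the median-centred Levene test measures, it can be rebuilt by hand as a one-way ANOVA on the absolute deviations from each group's median. This is a sketch on simulated stand-in data (the vectors `grade` and `gender` below are generated, not the student data), so the F value will differ from the output above; only the degrees of freedom (1 and 393) should match.

```r
set.seed(2)  # simulated stand-in data, NOT the student grades
grade  <- c(rnorm(208, mean = 10.6, sd = 3.2), rnorm(187, mean = 11.2, sd = 3.4))
gender <- factor(rep(c("Female", "Male"), times = c(208, 187)))

# Absolute deviation of each observation from the median of its own group
group_median <- tapply(grade, gender, median)
abs_dev <- abs(grade - unname(group_median[as.character(gender)]))

# One-way ANOVA on those deviations; its F test is the median-centred Levene test
levene_by_hand <- anova(lm(abs_dev ~ gender))
levene_by_hand
```

If the groups have similar spread, their average absolute deviations are similar and the F value stays small.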
t.test(mydata$grade ~ mydata$gender,
       var.equal = TRUE,
       alternative = "two.sided")
## 
##  Two Sample t-test
## 
## data:  mydata$grade by mydata$gender
## t = -1.8284, df = 393, p-value = 0.06825
## alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
## 95 percent confidence interval:
##  -1.26541466  0.04590623
## sample estimates:
## mean in group Female   mean in group Male 
##             10.62019             11.22995
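
As a cross-check, the t statistic can be recomputed from the group summaries reported earlier (n, mean, SD). The sketch below applies the pooled-variance formula; because the SDs are taken from the rounded describeBy output, the result should land close to, but not exactly on, the reported t = -1.8284.

```r
# Group summaries taken from the outputs above (SDs are rounded to 2 decimals)
n_f <- 208; mean_f <- 10.62019; sd_f <- 3.23
n_m <- 187; mean_m <- 11.22995; sd_m <- 3.39

# Pooled variance, as used when var.equal = TRUE
df         <- n_f + n_m - 2
pooled_var <- ((n_f - 1) * sd_f^2 + (n_m - 1) * sd_m^2) / df

# Standard error of the difference in means, then the t statistic
se_diff <- sqrt(pooled_var * (1 / n_f + 1 / n_m))
t_stat  <- (mean_f - mean_m) / se_diff

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df)
c(t = t_stat, df = df, p = p_value)
```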
library(effectsize)
## 
## Attaching package: 'effectsize'
## The following object is masked from 'package:psych':
## 
##     phi
effectsize::cohens_d(mydata$grade ~ mydata$gender,
                     pooled_sd = FALSE)
## Cohen's d |        95% CI
## -------------------------
## -0.18     | [-0.38, 0.01]
## 
## - Estimated using un-pooled SD.
interpret_cohens_d(0.18, rules = "sawilowsky2009")
## [1] "very small"
## (Rules: sawilowsky2009)
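
The effect size can also be verified by hand. This sketch assumes that the un-pooled version of Cohen's d divides the mean difference by the square root of the average of the two group variances, and uses the rounded group summaries reported earlier.

```r
# Group summaries taken from the outputs above (SDs are rounded to 2 decimals)
mean_f <- 10.62019; sd_f <- 3.23
mean_m <- 11.22995; sd_m <- 3.39

# Un-pooled standardiser: square root of the mean of the two variances
sd_unpooled <- sqrt((sd_f^2 + sd_m^2) / 2)

# Cohen's d: mean difference (Female minus Male) in SD units
d <- (mean_f - mean_m) / sd_unpooled
round(d, 2)
```

The negative sign only reflects the order of the groups (Female minus Male); the interpretation rules are applied to the absolute value.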

Results: Based on the sample data, at the 5% significance level we cannot reject the null hypothesis that the average mathematics grades of male and female students are equal (p = 0.07). The effect size is very small (d = 0.18).

Non-Parametric Test

Research Question: Is there a difference in mathematics grade between male and female students?

wilcox.test(mydata$grade ~ mydata$gender,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided")
## 
##  Wilcoxon rank sum test
## 
## data:  mydata$grade by mydata$gender
## W = 17322, p-value = 0.05951
## alternative hypothesis: true location shift is not equal to 0
library(effectsize)
effectsize(wilcox.test(mydata$grade ~ mydata$gender,
                       correct = FALSE,
                       exact = FALSE,
                       alternative = "two.sided"))
## r (rank biserial) |        95% CI
## ---------------------------------
## -0.11             | [-0.22, 0.00]
interpret_rank_biserial(0.11)
## [1] "small"
## (Rules: funder2019)
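
The rank-biserial correlation follows directly from the W statistic and the group sizes. A minimal sketch of the arithmetic, assuming W is the Mann-Whitney U statistic of the first group (Female), as reported by wilcox.test:

```r
# Values taken from the outputs above
W   <- 17322   # Wilcoxon rank sum statistic for the first group (Female)
n_f <- 208     # number of female students
n_m <- 187     # number of male students

# W / (n_f * n_m) is the share of Female-Male pairs in which the female
# student has the higher grade (ties counted as one half)
share_female_higher <- W / (n_f * n_m)

# Rank-biserial correlation: that share minus the share of pairs going the other way
r_rb <- 2 * share_female_higher - 1
round(r_rb, 2)
```

A value near 0 means that neither group tends to rank above the other; the negative sign indicates that female students' grades tend to rank slightly lower.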

Results: Based on the sample data, at the 5% significance level we cannot reject the null hypothesis that the distribution of mathematics grades has the same location for male and female students (p = 0.06). The effect size is small (r = 0.11).

Conclusion

  • The non-parametric test is more suitable for this study because not all assumptions of the parametric test are met: the normality assumption is violated for female students