Analysis of the relationship between school, study time, absences, and student grade

Yoga Pratama (s3756007) and Qi Shen (s3734247)

Last updated: 01 June, 2019

Introduction

Problem Statement

Research question:

Statistical methods:

Data

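# Packages assumed by the code in this report (readr, dplyr, lattice, car, Hmisc)
library(readr); library(dplyr); library(lattice); library(car); library(Hmisc)
student_mat <- read_delim("C:/Users/Yoga Pratama S/Desktop/Intro to Statistic/Assignment 3/student-mat.csv", 
    ";", escape_double = FALSE, trim_ws = TRUE)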

Important notes from data

Descriptive Statistics and Visualisation

Descriptive Statistics and Visualisation for Study Time

# Summary statistics of study time by school
student_mat %>% group_by(school) %>% summarise(
  Min = min(studytime, na.rm = TRUE),
  Q1 = quantile(studytime, probs = .25, na.rm = TRUE) %>% round(3),
  Median = median(studytime, na.rm = TRUE) %>% round(3),
  Q3 = quantile(studytime, probs = .75, na.rm = TRUE) %>% round(3),
  IQR = IQR(studytime, na.rm = TRUE) %>% round(3),
  Max = max(studytime, na.rm = TRUE),
  Mean = mean(studytime, na.rm = TRUE) %>% round(3),
  SD = sd(studytime, na.rm = TRUE) %>% round(3),
  n = n(),
  Missing = sum(is.na(studytime)))
# Histogram of study time, one panel per school
histogram(~ studytime | school, data = student_mat, col = "dodgerblue3",
          xlab = "studytime")
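The same summary table is computed for study time, grade, and absences, so the pattern could be wrapped in a small function. The sketch below is only an illustration: `summarise_by_group` is a hypothetical helper that is not part of the original analysis, the toy data frame stands in for `student_mat`, and the `{{ }}` syntax assumes a reasonably recent version of dplyr.

```r
library(dplyr)

# Hypothetical helper (not in the original analysis): summary statistics,
# group size, and missing count of one numeric column per group.
summarise_by_group <- function(data, group, var) {
  data %>%
    group_by({{ group }}) %>%
    summarise(
      Min     = min({{ var }}, na.rm = TRUE),
      Q1      = unname(quantile({{ var }}, probs = 0.25, na.rm = TRUE)),
      Median  = median({{ var }}, na.rm = TRUE),
      Q3      = unname(quantile({{ var }}, probs = 0.75, na.rm = TRUE)),
      IQR     = IQR({{ var }}, na.rm = TRUE),
      Max     = max({{ var }}, na.rm = TRUE),
      Mean    = round(mean({{ var }}, na.rm = TRUE), 3),
      SD      = round(sd({{ var }}, na.rm = TRUE), 3),
      n       = n(),
      Missing = sum(is.na({{ var }}))
    )
}

# Toy data standing in for student_mat (values are made up for illustration)
toy <- data.frame(
  school    = c("GP", "GP", "GP", "MS", "MS"),
  studytime = c(1, 2, NA, 3, 4)
)
toy_summary <- summarise_by_group(toy, school, studytime)
toy_summary
```

With the real data, a call such as `summarise_by_group(student_mat, school, G3)` would be the intended use, assuming the column names shown in this report.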

Descriptive Statistics and Visualisation for Grade

# Summary statistics of final grade (G3) by school
student_mat %>% group_by(school) %>% summarise(
  Min = min(G3, na.rm = TRUE),
  Q1 = quantile(G3, probs = .25, na.rm = TRUE) %>% round(3),
  Median = median(G3, na.rm = TRUE) %>% round(3),
  Q3 = quantile(G3, probs = .75, na.rm = TRUE) %>% round(3),
  IQR = IQR(G3, na.rm = TRUE) %>% round(3),
  Max = max(G3, na.rm = TRUE),
  Mean = mean(G3, na.rm = TRUE) %>% round(3),
  SD = sd(G3, na.rm = TRUE) %>% round(3),
  n = n(),
  Missing = sum(is.na(G3)))
# Density histogram of G3 for GP with a normal curve overlaid
GPdata <- filter(student_mat, school == "GP")
hist(GPdata$G3, xlab = "G3", freq = FALSE,
     main = "histogram with normal distribution for GPschool(G3)")
meannum <- mean(GPdata$G3, na.rm = TRUE)
sdnum <- sd(GPdata$G3, na.rm = TRUE)
d <- seq(from = min(GPdata$G3, na.rm = TRUE), to = max(GPdata$G3, na.rm = TRUE), by = 0.1)
lines(x = d, y = dnorm(d, meannum, sdnum), lty = 2, col = 2)

# Density histogram of G3 for MS with a normal curve overlaid
MSdata <- filter(student_mat, school == "MS")
hist(MSdata$G3, xlab = "G3", freq = FALSE,
     main = "histogram with normal distribution for MSdata(G3)")
meannum <- mean(MSdata$G3, na.rm = TRUE)
sdnum <- sd(MSdata$G3, na.rm = TRUE)
d <- seq(from = min(MSdata$G3, na.rm = TRUE), to = max(MSdata$G3, na.rm = TRUE), by = 0.1)
lines(x = d, y = dnorm(d, meannum, sdnum), lty = 2, col = 2)

GPdata$G3 %>% qqPlot(pch = 1, ylim = c(-5, 30), distribution = "norm",
                     main = "qq plot for GPschool (G3)", col = "blue", col.lines = "red")

## [1] 129 131
MSdata$G3 %>% qqPlot(pch = 1, ylim = c(-5, 30), distribution = "norm",
                     main = "qq plot for MSdata (G3)", col = "blue", col.lines = "red")

## [1] 19 35

Descriptive Statistics and Visualisation for Absences

# Summary statistics of absences by school
student_mat %>% group_by(school) %>% summarise(
  Min = min(absences, na.rm = TRUE),
  Q1 = quantile(absences, probs = .25, na.rm = TRUE) %>% round(3),
  Median = median(absences, na.rm = TRUE) %>% round(3),
  Q3 = quantile(absences, probs = .75, na.rm = TRUE) %>% round(3),
  IQR = IQR(absences, na.rm = TRUE) %>% round(3),
  Max = max(absences, na.rm = TRUE),
  Mean = mean(absences, na.rm = TRUE) %>% round(3),
  SD = sd(absences, na.rm = TRUE) %>% round(3),
  n = n(),
  Missing = sum(is.na(absences)))
# Density histogram of absences for GP with a normal curve overlaid
hist(GPdata$absences, xlab = "absences", freq = FALSE, xlim = c(0, 90), ylim = c(0, 0.10),
     main = "histogram with normal distribution for GPschool(absences)")
meannum <- mean(GPdata$absences, na.rm = TRUE)
sdnum <- sd(GPdata$absences, na.rm = TRUE)
d <- seq(from = min(GPdata$absences, na.rm = TRUE), to = max(GPdata$absences, na.rm = TRUE), by = 0.1)
lines(x = d, y = dnorm(d, meannum, sdnum), lty = 2, col = 2)

# Density histogram of absences for MS with a normal curve overlaid
hist(MSdata$absences, xlab = "absences", freq = FALSE, xlim = c(0, 20), ylim = c(0, 0.30),
     main = "histogram with normal distribution for MSschool(absences)")
meannum <- mean(MSdata$absences, na.rm = TRUE)
sdnum <- sd(MSdata$absences, na.rm = TRUE)
d <- seq(from = min(MSdata$absences, na.rm = TRUE), to = max(MSdata$absences, na.rm = TRUE), by = 0.1)
lines(x = d, y = dnorm(d, meannum, sdnum), lty = 2, col = 2)

GPdata$absences %>% qqPlot(pch = 1, ylim = c(-5, 30), distribution = "norm",
                           main = "qq plot for GPschool (absences)", col = "blue", col.lines = "red")

## [1] 277 184
MSdata$absences %>% qqPlot(pch = 1, ylim = c(-5, 30), distribution = "norm",
                           main = "qq plot for MSdata (absences)", col = "blue", col.lines = "red")

## [1] 31 25

Hypothesis Testing

Independent sample t-test:

  1. Homogeneity of variance using Levene’s test

\[H_0: \sigma^2_1 = \sigma^2_2 \] \[H_1: \sigma^2_1 \neq \sigma^2_2 \]

For grade (G3) across the two schools, the p-value is higher than 0.05, so we fail to reject H0 and treat the variances as equal.

leveneTest(G3 ~ school, data = student_mat)

For absences across the two schools, the p-value is also higher than 0.05, so we again fail to reject H0 and treat the variances as equal.

leveneTest(absences ~ school, data = student_mat)

Hypothesis Testing Cont.

  1. Independent sample t-test assuming equal variance

\[H_0: \mu_1 - \mu_2 = 0\] \[H_A: \mu_1 - \mu_2 \neq 0\]

t.test(
  G3 ~ school,
  data = student_mat,
  var.equal = TRUE,
  alternative = "two.sided"
  )
## 
##  Two Sample t-test
## 
## data:  G3 by school
## t = 0.89333, df = 393, p-value = 0.3722
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.7710694  2.0553599
## sample estimates:
## mean in group GP mean in group MS 
##        10.489971         9.847826

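The p-value (0.3722) is higher than 0.05 and the 95% CI (-0.771, 2.055) contains 0, the value under H0, so we fail to reject H0: there is no statistically significant difference in mean final grade between the two schools.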
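
As a rough check on the output above, the pooled two-sample statistic has the standard textbook form below. The group sizes \(n_1 = 349\) (GP) and \(n_2 = 46\) (MS) are taken from the row totals of the school by study time table reported later in this report; \(s_p\) denotes the pooled standard deviation, which is not printed in the output.

\[ t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}, \qquad s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} \]

\[ df = n_1 + n_2 - 2 = 349 + 46 - 2 = 393 \]

The numerator is 10.489971 - 9.847826 = 0.642145, and dividing it by the reported t = 0.89333 implies a standard error of about 0.719.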

t.test(
  absences ~ school,
  data = student_mat,
  var.equal = TRUE,
  alternative = "two.sided"
  )
## 
##  Two Sample t-test
## 
## data:  absences by school
## t = 1.761, df = 393, p-value = 0.07902
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.2567415  4.6662344
## sample estimates:
## mean in group GP mean in group MS 
##         5.965616         3.760870

The p-value (0.07902) is higher than 0.05 and the 95% CI (-0.257, 4.666) contains 0, so we again fail to reject H0: the difference in mean absences between the two schools is not statistically significant.

Hypothesis Testing Cont.

  1. Chi Square Test of Association

    H0: There is no association in the population between the categorical variables school and study time

    HA: There is an association in the population between the categorical variables school and study time

chi2 <- chisq.test(table(student_mat$school,student_mat$studytime))
chi2
## 
##  Pearson's Chi-squared test
## 
## data:  table(student_mat$school, student_mat$studytime)
## X-squared = 4.9584, df = 3, p-value = 0.1749
# Observed
chi2$observed
##     
##        1   2   3   4
##   GP  89 176  57  27
##   MS  16  22   8   0
# Expected
chi2$expected %>%round(3)
##     
##           1       2     3      4
##   GP 92.772 174.942 57.43 23.856
##   MS 12.228  23.058  7.57  3.144

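The p-value (0.1749) is higher than 0.05, so we fail to reject the null hypothesis: there is no evidence of an association between school and study time. Note that one expected count (3.144 for MS, study time 4) is below 5, so the chi-square approximation should be read with some caution.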
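
To make the link between the observed and expected tables explicit, the sketch below recomputes the expected counts and the Pearson statistic by hand. The observed counts are copied from the `chi2$observed` output above; the variable names are chosen here for illustration and are not part of the original analysis.

```r
# Observed counts copied from the chi2$observed output in this report
observed <- matrix(
  c(89, 176, 57, 27,
    16,  22,  8,  0),
  nrow = 2, byrow = TRUE,
  dimnames = list(school = c("GP", "MS"), studytime = as.character(1:4))
)

# Expected count under independence: row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson statistic and degrees of freedom (rows - 1) * (columns - 1)
x_squared <- sum((observed - expected)^2 / expected)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value <- pchisq(x_squared, df = df, lower.tail = FALSE)

round(expected, 3)
c(X_squared = x_squared, df = df, p_value = p_value)
```

These values should agree with the `chisq.test` output above up to rounding, since no continuity correction is applied to tables larger than 2 x 2.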

Hypothesis Testing Cont.

  1. Linear correlation between absences and grade

Hypothesis:

  H0: r = 0
  
  HA: r ≠ 0
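
Under H0, a Pearson correlation is commonly tested with the statistic below, which follows a t distribution with n - 2 degrees of freedom. This is the standard textbook form, given here as background rather than taken from the original analysis.

\[ t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}, \qquad df = n - 2 \]

With the 395 students in this data set, df = 393.
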
# Scatter plot of grade against absences with the fitted regression line
absencesG3model <- lm(G3 ~ absences, data = student_mat)
plot(G3 ~ absences, data = student_mat, xlab = "Absences", ylab = "Grade")
abline(absencesG3model, col = "red")

- The scatter plot shows no clear linear relationship between absences and grade

bivariate<-as.matrix(dplyr::select(student_mat, G3,absences)) #Create a matrix of the variables to be correlated
rcorr(bivariate, type = "pearson")
##            G3 absences
## G3       1.00     0.03
## absences 0.03     1.00
## 
## n= 395 
## 
## 
## P
##          G3     absences
## G3              0.4973  
## absences 0.4973

The sample correlation is r = 0.03 with p = 0.4973 (n = 395). Because the p-value is higher than 0.05, we fail to reject H0: there is no statistically significant linear correlation between absences and final grade.

Discussion

These results show no statistically significant difference between the two schools in student grade or absences, and no significant association between school and study time. Moreover, there is no significant linear correlation between absences and grade. Parents who are unsure where to send their children can therefore choose either Gabriel Pereira or Mousinho da Silveira, because this study found no significant differences between the schools in grade, study time, or absences. The study also addresses parents' concern about the link between their children's absences and their grades: no significant linear correlation was found between the two.

An advantage of this study is that the methods were chosen to suit both numerical and categorical data. The independent samples t-test handles the numerical variables, the chi-square test handles the categorical variable, and Pearson correlation, with a fitted regression line, assesses the linear relationship between two numerical variables.

A limitation of this study is that the data contain no measure of the individual ability of the sampled students. The results can be biased if, for example, a student with a higher IQ and many absences is compared with a student with a lower IQ who always attends school. The difficulty of the tests behind the grades is also not described in the data set, so the grades themselves may be biased.

Further improvement could come from a pre-test to gauge each student's ability and from considering more attributes such as health, school activities, and the reason for choosing the school.

References

[1]. UCI Machine Learning Repository, Student Performance Data Set, viewed 1 June 2019, http://archive.ics.uci.edu/ml/datasets/Student+Performance