Analysis of the relationship between school, study time, absences, and student grade

Yoga Pratama (s3756007) and Qi Shen (s3734247)

Last updated: 01 June, 2019

RPubs link information

Rpubs link comes here: www………

Introduction

A good education is one of the keys to future success.
However, choosing the right school is a recurring question for parents.
Several attributes can indicate school quality; in this study we look at student grades, weekly study time, and number of absences.
We explore how mathematics grades, study time, and number of absences differ between two schools in Portugal, and whether grades are related to absences.
Our null hypothesis is that student grades, study time, and number of absences do not differ between the two schools.

Problem Statement

Research question:

Is there a relationship between school and student grade (numerical variable)?
Is there a relationship between school and study time (categorical variable)?
Is there a relationship between school and number of absences (numerical variable)?
Is there a correlation between student grade and number of absences?

Statistical methods:

An independent-samples t-test, preceded by a test of equal variances, is used to assess the relationship between school and student grade and between school and number of absences.
A chi-square test of association is used to assess the relationship between school and study time.
A scatter plot with a fitted linear regression line and the Pearson correlation are used to assess the relationship between grade and number of absences.

Data

The dataset was imported from http://archive.ics.uci.edu/ml/datasets/Student+Performance

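library(readr); library(dplyr); library(lattice)  # assumed packages for read_delim, %>%/summarise, histogram
library(car); library(Hmisc)  # assumed packages for qqPlot/leveneTest, rcorr
student_mat <- read_delim("C:/Users/Yoga Pratama S/Desktop/Intro to Statistic/Assignment 3/student-mat.csv", 
    ";", escape_double = FALSE, trim_ws = TRUE)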

These data describe student achievement in Mathematics in secondary education at two Portuguese schools.
The attributes include student grades and demographic, social, and school-related features, collected from school reports and questionnaires.

Important notes from data

school - student’s school (binary: ‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira)
studytime - weekly study time (categorical: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
absences - number of student absences (numeric: from 0 to 93)
G3 - final grade (numeric: from 0 to 20, output target)
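
Because studytime is stored as the codes 1 to 4, it can help to attach the category labels before tabulating or plotting. A minimal sketch, using a made-up vector of codes rather than the real data (example_codes, studytime_labels, and studytime_factor are illustrative names, not part of the original analysis):

```r
# Category labels taken from the variable description above
studytime_labels <- c("<2 hours", "2 to 5 hours", "5 to 10 hours", ">10 hours")

# Hypothetical codes standing in for student_mat$studytime
example_codes <- c(1, 2, 2, 4, 3, 1)

# Ordered factor: the categories have a natural order but are not a measured quantity
studytime_factor <- factor(example_codes, levels = 1:4,
                           labels = studytime_labels, ordered = TRUE)

# Frequency of each category, including categories with zero observations
table(studytime_factor)
```

Treating study time as a factor keeps it from being averaged by accident and matches its use as a categorical variable in the chi-square test.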

Descriptive Statistics and Visualisation

To obtain the descriptive statistics, we grouped the data by school and summarised each attribute.
For each group, missing values were checked with R code, and outliers and the shape of the distribution were inspected using histograms and Q-Q plots.
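
The grouped summaries in this section all compute the same statistics, so they could be produced by one helper. A minimal base-R sketch on a toy data frame (describe_by_group and toy are hypothetical names, and the toy values are made up; this is not the code used in the report):

```r
# Summarise a numeric vector x separately for each level of group
describe_by_group <- function(x, group) {
  one_group <- function(v) {
    c(Min     = min(v, na.rm = TRUE),
      Q1      = unname(quantile(v, 0.25, na.rm = TRUE)),
      Median  = median(v, na.rm = TRUE),
      Q3      = unname(quantile(v, 0.75, na.rm = TRUE)),
      IQR     = IQR(v, na.rm = TRUE),
      Max     = max(v, na.rm = TRUE),
      Mean    = mean(v, na.rm = TRUE),
      SD      = sd(v, na.rm = TRUE),
      n       = length(v),          # group size, counting missing values
      Missing = sum(is.na(v)))      # number of missing values
  }
  # split() gives one vector per group; rbind stacks the summaries into a table
  round(do.call(rbind, lapply(split(x, group), one_group)), 3)
}

# Made-up data with one missing grade
toy <- data.frame(school = c("GP", "GP", "GP", "MS", "MS", "MS"),
                  G3     = c(10, 12, NA, 8, 15, 11))
summary_table <- describe_by_group(toy$G3, toy$school)
summary_table
```

With the real data, the same call would be describe_by_group(student_mat$G3, student_mat$school), and likewise for studytime and absences.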

Descriptive Statistics and Visualisation for Study Time

# Summary statistics of study time by school
student_mat %>% group_by(school) %>% summarise(
  Min = min(studytime, na.rm = TRUE),
  Q1 = quantile(studytime, probs = .25, na.rm = TRUE) %>% round(3),
  Median = median(studytime, na.rm = TRUE) %>% round(3),
  Q3 = quantile(studytime, probs = .75, na.rm = TRUE) %>% round(3),
  IQR = IQR(studytime, na.rm = TRUE) %>% round(3),
  Max = max(studytime, na.rm = TRUE),
  Mean = mean(studytime, na.rm = TRUE) %>% round(3),
  SD = sd(studytime, na.rm = TRUE) %>% round(3),
  n = n(),
  Missing = sum(is.na(studytime)))

The data contain no missing values for study time.

student_mat %>% histogram(~ studytime|school, col="dodgerblue3",
                       data=., xlab="studytime")

The histogram shows that only GP students report study time category 4 (more than 10 hours per week).

Descriptive Statistics and Visualisation for Grade

# Summary statistics of final grade (G3) by school
student_mat %>% group_by(school) %>% summarise(
  Min = min(G3, na.rm = TRUE),
  Q1 = quantile(G3, probs = .25, na.rm = TRUE) %>% round(3),
  Median = median(G3, na.rm = TRUE) %>% round(3),
  Q3 = quantile(G3, probs = .75, na.rm = TRUE) %>% round(3),
  IQR = IQR(G3, na.rm = TRUE) %>% round(3),
  Max = max(G3, na.rm = TRUE),
  Mean = mean(G3, na.rm = TRUE) %>% round(3),
  SD = sd(G3, na.rm = TRUE) %>% round(3),
  n = n(),
  Missing = sum(is.na(G3)))

The data contain no missing values for grade.

GPdata <- filter(student_mat, school == 'GP')
# Density histogram of G3 for GP with a normal curve overlaid
hist1 <- hist(GPdata$G3, xlab = "G3", freq = FALSE,
  main = "Histogram with normal curve for GP (G3)")
meannum <- mean(GPdata$G3, na.rm = TRUE)  # na.rm = TRUE drops missing values
sdnum <- sd(GPdata$G3, na.rm = TRUE)
d <- seq(from = min(GPdata$G3), to = max(GPdata$G3), by = 0.1)
lines(x = d, y = dnorm(d, meannum, sdnum), lty = 2, col = 2)

MSdata <- filter(student_mat, school == 'MS')
# Density histogram of G3 for MS with a normal curve overlaid
hist1 <- hist(MSdata$G3, xlab = "G3", freq = FALSE,
  main = "Histogram with normal curve for MS (G3)")
meannum <- mean(MSdata$G3, na.rm = TRUE)  # na.rm = TRUE drops missing values
sdnum <- sd(MSdata$G3, na.rm = TRUE)
d <- seq(from = min(MSdata$G3), to = max(MSdata$G3), by = 0.1)
lines(x = d, y = dnorm(d, meannum, sdnum), lty = 2, col = 2)

GPdata$G3 %>% qqPlot(pch = 1, ylim = c(-5, 30), dist = "norm",
  main = "qq plot for GPschool (G3)", col = "blue", col.lines = "red")

## [1] 129 131

MSdata$G3 %>% qqPlot(pch = 1, ylim = c(-5, 30), dist = "norm",
  main = "qq plot for MSdata (G3)", col = "blue", col.lines = "red")

## [1] 19 35

The histograms and Q-Q plots suggest that G3 is approximately normally distributed in both schools.

Descriptive Statistics and Visualisation for Absences

# Summary statistics of absences by school
student_mat %>% group_by(school) %>% summarise(
  Min = min(absences, na.rm = TRUE),
  Q1 = quantile(absences, probs = .25, na.rm = TRUE) %>% round(3),
  Median = median(absences, na.rm = TRUE) %>% round(3),
  Q3 = quantile(absences, probs = .75, na.rm = TRUE) %>% round(3),
  IQR = IQR(absences, na.rm = TRUE) %>% round(3),
  Max = max(absences, na.rm = TRUE),
  Mean = mean(absences, na.rm = TRUE) %>% round(3),
  SD = sd(absences, na.rm = TRUE) %>% round(3),
  n = n(),
  Missing = sum(is.na(absences)))

The data contain no missing values for absences.

# Density histogram of absences for GP with a normal curve overlaid
hist1 <- hist(GPdata$absences, xlab = "absences", freq = FALSE, xlim = c(0, 90), ylim = c(0, 0.10),
  main = "Histogram with normal curve for GP (absences)")
meannum <- mean(GPdata$absences, na.rm = TRUE)  # na.rm = TRUE drops missing values
sdnum <- sd(GPdata$absences, na.rm = TRUE)
d <- seq(from = min(GPdata$absences), to = max(GPdata$absences), by = 0.1)
lines(x = d, y = dnorm(d, meannum, sdnum), lty = 2, col = 2)

# Density histogram of absences for MS with a normal curve overlaid
hist1 <- hist(MSdata$absences, xlab = "absences", freq = FALSE, xlim = c(0, 20), ylim = c(0, 0.30),
  main = "Histogram with normal curve for MS (absences)")
meannum <- mean(MSdata$absences, na.rm = TRUE)  # na.rm = TRUE drops missing values
sdnum <- sd(MSdata$absences, na.rm = TRUE)
d <- seq(from = min(MSdata$absences), to = max(MSdata$absences), by = 0.1)
lines(x = d, y = dnorm(d, meannum, sdnum), lty = 2, col = 2)

GPdata$absences %>% qqPlot(pch = 1, ylim = c(-5, 30), dist = "norm",
  main = "qq plot for GPschool (absences)", col = "blue", col.lines = "red")

## [1] 277 184

MSdata$absences %>% qqPlot(pch = 1, ylim = c(-5, 30), dist = "norm",
  main = "qq plot for MSdata (absences)", col = "blue", col.lines = "red")

## [1] 31 25

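The histograms and Q-Q plots show that absences are right-skewed in both schools. However, both samples are larger than 30 (n = 349 for GP and n = 46 for MS), so by the central limit theorem the sampling distribution of the mean can be treated as approximately normal and the t-test remains applicable.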
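
The appeal to sample size can be illustrated with a small simulation. A minimal sketch, assuming an exponential distribution as a stand-in for right-skewed data (the distribution, its rate, the seed, and all names here are illustrative choices, not estimates from the student data):

```r
set.seed(1)

# Hypothetical right-skewed population: exponential with mean 5
draw_skewed <- function(n) rexp(n, rate = 1 / 5)

# Simple moment-based skewness (0 for a symmetric distribution)
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

n_small <- 46                                   # size of the smaller school sample
raw_values   <- draw_skewed(5000)               # individual observations
sample_means <- replicate(5000, mean(draw_skewed(n_small)))  # means of repeated samples

raw_skew  <- skewness(raw_values)
mean_skew <- skewness(sample_means)
c(raw = raw_skew, means = mean_skew)
```

The sample means are far less skewed than the raw values, which is the property the t-test relies on; it does not make the absences themselves normally distributed.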

Hypothesis Testing

Independent sample t-test:

Homogeneity of variance using Levene’s test:

\[H_0: \sigma^2_1 = \sigma^2_2 \] \[H_A: \sigma^2_1 \neq \sigma^2_2 \]

Homogeneity of variance between school and grade: the p-value is greater than 0.05, so we fail to reject H0 and assume equal variances.

leveneTest(G3 ~ school, data = student_mat)

Homogeneity of variance between school and absences: the p-value is greater than 0.05, so we fail to reject H0 and assume equal variances.

leveneTest(absences ~ school, data = student_mat)
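
With its default median centring, leveneTest() from the car package amounts to a one-way ANOVA on the absolute deviations from each group's median. A minimal base-R sketch of that idea on simulated data (the toy data frame, its columns, and the seed are made up for illustration and are not the student data):

```r
set.seed(42)

# Two hypothetical groups of unequal size drawn with the same spread
toy <- data.frame(
  group = factor(rep(c("A", "B"), times = c(40, 25))),
  value = c(rnorm(40, mean = 10, sd = 4), rnorm(25, mean = 10, sd = 4))
)

# Absolute deviation of each observation from its own group's median
group_medians <- ave(toy$value, toy$group, FUN = median)
toy$abs_dev   <- abs(toy$value - group_medians)

# One-way ANOVA on the absolute deviations: the F test is the Levene-type test
levene_fit <- anova(lm(abs_dev ~ group, data = toy))
levene_F   <- levene_fit[["F value"]][1]
levene_p   <- levene_fit[["Pr(>F)"]][1]
c(F = levene_F, p = levene_p)
```

A large p-value here means the spreads of the two groups are compatible with equal variances, which is the condition for the pooled t-test.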

Hypothesis Testing Cont.

Independent sample t-test assuming equal variance

\[H_0: \mu_1 - \mu_2 = 0\] \[H_A: \mu_1 - \mu_2 \neq 0\]
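
For reference, the equal-variance test uses the standard pooled t statistic, written here with \(n_1, n_2\) for the group sizes, \(\bar{x}_1, \bar{x}_2\) for the sample means, and \(s_1^2, s_2^2\) for the sample variances (notation introduced for this note):

\[ s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}, \qquad t = \frac{\bar{x}_1 - \bar{x}_2}{s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}, \qquad df = n_1 + n_2 - 2 \]

With 349 GP students and 46 MS students, df = 349 + 46 - 2 = 393, which matches the degrees of freedom in the t-test output.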

Independent sample t-test between school and grade

t.test(
  G3 ~ school,
  data = student_mat,
  var.equal = TRUE,
  alternative = "two.sided"
  )

## 
##  Two Sample t-test
## 
## data:  G3 by school
## t = 0.89333, df = 393, p-value = 0.3722
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.7710694  2.0553599
## sample estimates:
## mean in group GP mean in group MS 
##        10.489971         9.847826

The p-value (0.372) is greater than 0.05 and the 95% CI [-0.77, 2.06] contains 0, so we fail to reject H0: there is no statistically significant difference in mean final grade between the two schools.
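
As a consistency check, the confidence interval and p-value can be rebuilt from the means, t statistic, and degrees of freedom printed in the output above. A minimal sketch that uses only those reported numbers (the variable names are illustrative):

```r
# Values copied from the t-test output for G3 by school
mean_gp <- 10.489971
mean_ms <- 9.847826
t_stat  <- 0.89333
df      <- 393

diff_means <- mean_gp - mean_ms        # estimated difference, GP minus MS
std_error  <- diff_means / t_stat      # because t = difference / standard error
t_crit     <- qt(0.975, df)            # two-sided 95% critical value

ci      <- diff_means + c(-1, 1) * t_crit * std_error
p_value <- 2 * pt(-abs(t_stat), df)    # two-sided p-value

round(ci, 3)
round(p_value, 4)
```

The rebuilt interval agrees with the reported one to about three decimals (the printed t statistic is itself rounded), and it contains 0, which is the same conclusion as a p-value above 0.05.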

Independent sample t-test between school and absences

t.test(
  absences ~ school,
  data = student_mat,
  var.equal = TRUE,
  alternative = "two.sided"
  )

## 
##  Two Sample t-test
## 
## data:  absences by school
## t = 1.761, df = 393, p-value = 0.07902
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.2567415  4.6662344
## sample estimates:
## mean in group GP mean in group MS 
##         5.965616         3.760870

The p-value (0.079) is greater than 0.05 and the 95% CI [-0.26, 4.67] contains 0, so we fail to reject H0: there is no statistically significant difference in mean number of absences between the two schools.

Hypothesis Testing Cont.

Chi Square Test of Association

H0 = There is no association in the population between the categorical variables school and study time

HA = There is an association in the population between the categorical variables school and study time

chi2 <- chisq.test(table(student_mat$school,student_mat$studytime))
chi2

## 
##  Pearson's Chi-squared test
## 
## data:  table(student_mat$school, student_mat$studytime)
## X-squared = 4.9584, df = 3, p-value = 0.1749

# Observed
chi2$observed

##     
##        1   2   3   4
##   GP  89 176  57  27
##   MS  16  22   8   0

# Expected
chi2$expected %>%round(3)

##     
##           1       2     3      4
##   GP 92.772 174.942 57.43 23.856
##   MS 12.228  23.058  7.57  3.144

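The p-value (0.1749) is greater than 0.05, so we fail to reject the null hypothesis: there is no evidence of an association between school and study time. Note that one expected count (MS, category 4: 3.144) is below 5, so the chi-square approximation should be read with some caution.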
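
The expected counts and the test statistic can be recomputed by hand from the observed table. A minimal sketch that re-enters the observed counts printed above (the object names are illustrative):

```r
# Observed counts copied from chi2$observed
observed <- matrix(c(89, 176, 57, 27,
                     16,  22,  8,  0),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(school = c("GP", "MS"),
                                   studytime = as.character(1:4)))

n_total  <- sum(observed)
# Expected count under independence: row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / n_total

# Pearson chi-square statistic and its degrees of freedom
x_squared <- sum((observed - expected)^2 / expected)
df        <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value   <- pchisq(x_squared, df, lower.tail = FALSE)

round(expected, 3)
c(X_squared = x_squared, df = df, p = p_value)
sum(expected < 5)   # number of cells with a small expected count
```

Most of the statistic comes from the single MS cell for category 4, where no students were observed against about 3 expected.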

Hypothesis Testing Cont.

Linear correlation between absences and grade

Hypothesis:

  H0: r = 0
  
  HA: r ≠ 0

# Scatter plot of grade against absences with the fitted regression line
absencesG3model <- lm(G3 ~ absences, data = student_mat)
plot(G3 ~ absences, data = student_mat, xlab = "Absences", ylab = "Grade")
abline(absencesG3model, col = "red")

- The plot shows no apparent linear relationship between absences and grade

bivariate<-as.matrix(dplyr::select(student_mat, G3,absences)) #Create a matrix of the variables to be correlated
rcorr(bivariate, type = "pearson")

##            G3 absences
## G3       1.00     0.03
## absences 0.03     1.00
## 
## n= 395 
## 
## 
## P
##          G3     absences
## G3              0.4973  
## absences 0.4973

R reports the correlation between student grade and absences as r = 0.03 with a p-value of 0.497, which is greater than 0.05.
We therefore fail to reject the null hypothesis that there is no correlation between these two attributes.
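
The printed r is rounded to two decimals, so it cannot reproduce the p-value exactly. A minimal sketch that works backwards instead, recovering the size of r implied by the reported p-value and n through the usual t statistic for a Pearson correlation, t = r * sqrt(n - 2) / sqrt(1 - r^2) (the variable names are illustrative):

```r
# Values copied from the rcorr output
n       <- 395
p_value <- 0.4973

df    <- n - 2
t_abs <- qt(1 - p_value / 2, df)       # |t| that gives this two-sided p-value
r_abs <- t_abs / sqrt(t_abs^2 + df)    # invert t = r * sqrt(df) / sqrt(1 - r^2)

round(r_abs, 2)

# Forward check: the recovered r reproduces the p-value
t_check <- r_abs * sqrt(df) / sqrt(1 - r_abs^2)
p_check <- 2 * pt(-t_check, df)
```

The implied correlation rounds to the reported 0.03, so the printed r and p-value are consistent with each other.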

Discussion

The results show no statistically significant difference between the two schools in student grade, study time, or number of absences, and no linear correlation between absences and grade. On these three measures, parents deciding between Gabriel Pereira and Mousinho da Silveira therefore have no evidence for preferring one school over the other. The study also addresses parents' concern about absences: in this sample, the number of absences is not linearly correlated with the final grade.

A strength of this study is that the methods were matched to the type of data. The independent-samples t-test handles the numerical variables, the chi-square test handles the categorical variable, and linear regression with Pearson correlation describes the linear relationship between two numerical variables.

A limitation of this study is that the data contain no measure of each student's individual ability. The result can be biased if, for example, a student with a higher IQ who has many absences is compared with a student with a lower IQ who is always present. The data set also does not describe the difficulty of the tests on which the grades are based, so the grades themselves may be biased.

Further work could include a pre-test to measure each student's ability, and additional attributes such as health, school activities, and reason for choosing the school.

References

[1]. UCI Machine Learning Repository, Student Performance Data Set, viewed 1 June 2019, http://archive.ics.uci.edu/ml/datasets/Student+Performance