Yoga Pratama (s3756007) and Qi Shen (s3734247)
Last updated: 01 June, 2019
Research question:
Statistical methods:
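# Packages used below: readr, dplyr, lattice (histogram), car (qqPlot, leveneTest), Hmisc (rcorr)
library(readr); library(dplyr); library(lattice); library(car); library(Hmisc)
student_mat <- read_delim("C:/Users/Yoga Pratama S/Desktop/Intro to Statistic/Assignment 3/student-mat.csv",
";", escape_double = FALSE, trim_ws = TRUE)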
Descriptive Statistics and Visualisation for Study Time
student_mat %>% group_by(school) %>% summarise (Min = min(studytime,na.rm = TRUE),
Q1 = quantile(studytime,probs = .25,na.rm = TRUE) %>% round (3),
Median = median(studytime, na.rm = TRUE)%>% round (3),
Q3 = quantile(studytime,probs = .75,na.rm = TRUE)%>% round (3),
IQR = IQR(studytime, na.rm = TRUE) %>% round (3),
Max = max(studytime,na.rm = TRUE),
Mean = mean(studytime, na.rm = TRUE)%>% round (3),
SD = sd(studytime, na.rm = TRUE)%>% round (3),
n = n(),
Missing = sum(is.na(studytime)))
student_mat %>% histogram(~ studytime|school, col="dodgerblue3",
data=., xlab="studytime")
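The same set of summary statistics is computed for study time, G3, and absences. A minimal sketch of how that repetition could be wrapped in one function is given here. The helper name `describe_by` and the toy data frame are assumptions made for illustration; they are not part of the original analysis, and the sketch uses base R rather than the dplyr pipeline above.

```r
# Hypothetical helper: summary statistics for one numeric column, split by a grouping column.
describe_by <- function(data, var, group) {
  pieces <- lapply(split(data[[var]], data[[group]]), function(x) {
    data.frame(
      Min     = min(x, na.rm = TRUE),
      Q1      = unname(quantile(x, 0.25, na.rm = TRUE)),
      Median  = median(x, na.rm = TRUE),
      Q3      = unname(quantile(x, 0.75, na.rm = TRUE)),
      IQR     = IQR(x, na.rm = TRUE),
      Max     = max(x, na.rm = TRUE),
      Mean    = round(mean(x, na.rm = TRUE), 3),
      SD      = round(sd(x, na.rm = TRUE), 3),
      n       = length(x),          # group size, including missing values
      Missing = sum(is.na(x))
    )
  })
  out <- do.call(rbind, pieces)
  data.frame(group = rownames(out), out, row.names = NULL)
}

# Toy data, made up for illustration (not the student data set)
toy <- data.frame(
  school = c("GP", "GP", "GP", "MS", "MS", "MS"),
  G3     = c(10, 12, NA, 8, 9, 13)
)
res <- describe_by(toy, "G3", "school")
print(res)
```

With the real data, a call such as `describe_by(student_mat, "absences", "school")` would be expected to give one row per school.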
student_mat %>% group_by(school) %>% summarise (Min = min(G3,na.rm = TRUE),
Q1 = quantile(G3,probs = .25,na.rm = TRUE) %>% round (3),
Median = median(G3, na.rm = TRUE)%>% round (3),
Q3 = quantile(G3,probs = .75,na.rm = TRUE)%>% round (3),
IQR = IQR(G3, na.rm = TRUE) %>% round (3),
Max = max(G3,na.rm = TRUE),
Mean = mean(G3, na.rm = TRUE)%>% round (3),
SD = sd(G3, na.rm = TRUE)%>% round (3),
n = n(),
Missing = sum(is.na(G3)))
GPdata <-filter(student_mat,school=='GP')
hist1 <- hist(GPdata$G3,xlab="G3",freq = FALSE,
main="histogram with normal distribution for GPschool(G3)")
meannum <- mean(GPdata$G3, na.rm = TRUE)
sdnum <- sd(GPdata$G3, na.rm = TRUE)
d <- seq(from=min(GPdata$G3),to=max(GPdata$G3),by=0.1)
lines(x=d,y=dnorm(d,meannum,sdnum),lty=2,col=2)
MSdata <-filter(student_mat,school=='MS')
hist1 <- hist(MSdata$G3,xlab="G3",freq = FALSE,
main="histogram with normal distribution for MSdata(G3)")
meannum <- mean(MSdata$G3, na.rm = TRUE)
sdnum <- sd(MSdata$G3, na.rm = TRUE)
d <- seq(from=min(MSdata$G3),to=max(MSdata$G3),by=0.1)
lines(x=d,y=dnorm(d,meannum,sdnum),lty=2,col=2)
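The histogram with a fitted normal curve is drawn the same way for each school and variable. As a sketch only, the steps (density-scaled histogram, sample mean and SD, dashed normal curve) could be collected in one function. The name `hist_with_normal` and the toy vector are hypothetical and not taken from the original report.

```r
# Hypothetical helper: density histogram with a normal curve fitted by sample mean and SD.
hist_with_normal <- function(x, label, ...) {
  x <- x[!is.na(x)]                      # drop missing values once, up front
  m <- mean(x)
  s <- sd(x)
  hist(x, freq = FALSE, xlab = label,
       main = paste("Histogram with normal curve:", label), ...)
  grid <- seq(from = min(x), to = max(x), by = 0.1)
  lines(x = grid, y = dnorm(grid, m, s), lty = 2, col = 2)
  invisible(c(mean = m, sd = s))         # return the fitted parameters for inspection
}

# Toy grades, made up for illustration (not the student data set)
toy_grades <- c(0, 8, 9, 10, 10, 11, 12, 13, 15, NA)
fit <- hist_with_normal(toy_grades, "toy grades")
print(fit)
```

With the real data, the calls would look like `hist_with_normal(GPdata$G3, "G3 (GP)")`; extra arguments such as `xlim` and `ylim` are passed through to `hist()`.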
GPdata$G3 %>% qqPlot(pch = 1,ylim=c(-5,30),dist="norm",main="qq plot for GPschool
(G3)", col="blue", col.lines="red")
## [1] 129 131
MSdata$G3 %>% qqPlot(pch = 1,ylim=c(-5,30),dist="norm",main="qq plot for MSdata
(G3)", col="blue", col.lines="red")
## [1] 19 35
student_mat %>% group_by(school) %>% summarise (Min = min(absences,na.rm = TRUE),
Q1 = quantile(absences,probs = .25,na.rm = TRUE) %>% round (3),
Median = median(absences, na.rm = TRUE)%>% round (3),
Q3 = quantile(absences,probs = .75,na.rm = TRUE)%>% round (3),
IQR = IQR(absences, na.rm = TRUE) %>% round (3),
Max = max(absences,na.rm = TRUE),
Mean = mean(absences, na.rm = TRUE)%>% round (3),
SD = sd(absences, na.rm = TRUE)%>% round (3),
n = n(),
Missing = sum(is.na(absences)))
hist1 <- hist(GPdata$absences,xlab="absences",freq = FALSE,xlim = c(0,90),ylim = c(0,0.10),
main="histogram with normal distribution for GPschool(absences)")
meannum <- mean(GPdata$absences, na.rm = TRUE)
sdnum <- sd(GPdata$absences, na.rm = TRUE)
d <- seq(from=min(GPdata$absences),to=max(GPdata$absences),by=0.1)
lines(x=d,y=dnorm(d,meannum,sdnum),lty=2,col=2)
hist1 <- hist(MSdata$absences,xlab="absences",freq = FALSE,xlim = c(0,20),ylim = c(0,0.30),
main="histogram with normal distribution for MSschool(absences)")
meannum <- mean(MSdata$absences, na.rm = TRUE)
sdnum <- sd(MSdata$absences, na.rm = TRUE)
d <- seq(from=min(MSdata$absences),to=max(MSdata$absences),by=0.1)
lines(x=d,y=dnorm(d,meannum,sdnum),lty=2,col=2)
GPdata$absences %>% qqPlot(pch = 1,ylim=c(-5,30),dist="norm",main="qq plot for GPschool
(absences)", col="blue", col.lines="red")
## [1] 277 184
MSdata$absences %>% qqPlot(pch = 1,ylim=c(-5,30),dist="norm",main="qq plot for MSdata
(absences)", col="blue", col.lines="red")
## [1] 31 25
Independent samples t-test:
leveneTest(G3 ~ school, data = student_mat)
Homogeneity of variance of absences between schools: the p-value is higher than 0.05, so we fail to reject H0 (equal variances).
leveneTest(absences ~ school, data = student_mat)
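To make the idea behind the test concrete: assuming the default median centring of `car::leveneTest`, Levene's test amounts to a one-way ANOVA on the absolute deviations of each observation from its group median. The sketch below illustrates this on made-up toy data in base R; it is not the authors' code, and the object names are hypothetical.

```r
# Toy data, made up for illustration (not the student data set)
toy <- data.frame(
  value = c(1, 2, 3, 4, 5, 2, 4, 6, 8, 10),
  group = factor(rep(c("A", "B"), each = 5))
)

# Absolute deviation of each value from the median of its own group
toy$abs_dev <- abs(toy$value - ave(toy$value, toy$group, FUN = median))

# One-way ANOVA on the absolute deviations: a large F suggests unequal spread
levene_sketch <- anova(lm(abs_dev ~ group, data = toy))
f_value <- levene_sketch[["F value"]][1]
p_value <- levene_sketch[["Pr(>F)"]][1]
print(levene_sketch)
```

A p-value above 0.05 from this ANOVA is what justifies `var.equal = TRUE` in the t-tests that follow.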
\[H_0: \mu_1 - \mu_2 = 0\]
\[H_A: \mu_1 - \mu_2 \neq 0\]
t.test(
G3 ~ school,
data = student_mat,
var.equal = TRUE,
alternative = "two.sided"
)
##
## Two Sample t-test
##
## data: G3 by school
## t = 0.89333, df = 393, p-value = 0.3722
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.7710694 2.0553599
## sample estimates:
## mean in group GP mean in group MS
## 10.489971 9.847826
The p-value (0.372) is higher than 0.05 and the 95% CI for the difference in means (-0.771 to 2.055) contains 0, so we fail to reject H0: mean grade (G3) does not differ significantly between the two schools.
t.test(
absences ~ school,
data = student_mat,
var.equal = TRUE,
alternative = "two.sided"
)
##
## Two Sample t-test
##
## data: absences by school
## t = 1.761, df = 393, p-value = 0.07902
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.2567415 4.6662344
## sample estimates:
## mean in group GP mean in group MS
## 5.965616 3.760870
The p-value (0.079) is higher than 0.05 and the 95% CI for the difference in means (-0.257 to 4.666) contains 0, so we fail to reject H0: mean absences do not differ significantly between the two schools.
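As a consistency check, the p-value and confidence interval in both outputs can be recomputed from the printed group means, t statistic, and degrees of freedom, using t = difference / standard error. The helper `check_t_output` is a hypothetical name introduced here for illustration; the numbers passed to it are copied from the outputs above.

```r
# Hypothetical helper: recompute the p-value and CI from values printed by t.test().
check_t_output <- function(mean1, mean2, t_stat, df, level = 0.95) {
  diff <- mean1 - mean2
  se   <- diff / t_stat                   # standard error implied by t = diff / se
  crit <- qt(1 - (1 - level) / 2, df)     # two-sided critical value
  list(
    p_value = 2 * pt(-abs(t_stat), df),   # two-sided p-value
    ci      = diff + c(-1, 1) * crit * se
  )
}

# Values copied from the two t-test outputs above
g3       <- check_t_output(10.489971, 9.847826, t_stat = 0.89333, df = 393)
absences <- check_t_output(5.965616, 3.760870, t_stat = 1.761, df = 393)
print(g3)
print(absences)
```

The recomputed values should agree with the printed output up to rounding of the t statistic.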
Chi-Square Test of Association
H0: There is no association in the population between the categorical variables school and study time
HA: There is an association in the population between the categorical variables school and study time
chi2 <- chisq.test(table(student_mat$school,student_mat$studytime))
chi2
##
## Pearson's Chi-squared test
##
## data: table(student_mat$school, student_mat$studytime)
## X-squared = 4.9584, df = 3, p-value = 0.1749
# Observed
chi2$observed
##
## 1 2 3 4
## GP 89 176 57 27
## MS 16 22 8 0
# Expected
chi2$expected %>%round(3)
##
## 1 2 3 4
## GP 92.772 174.942 57.43 23.856
## MS 12.228 23.058 7.57 3.144
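The p-value (0.175) is higher than 0.05, so we fail to reject the null hypothesis: there is no statistically significant evidence of an association between school and study time. Note that one expected count (MS, study time 4: 3.144) is below 5, so the chi-square approximation should be interpreted with caution.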
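The test statistic can be reproduced by hand from the observed table: each expected count is row total × column total / grand total, and the statistic sums (observed − expected)² / expected over all cells. The sketch below rebuilds the observed counts printed above. The Monte Carlo p-value at the end is an alternative that the original analysis did not use; it is included only because one expected count is small.

```r
# Observed counts copied from chi2$observed above
observed <- matrix(c(89, 176, 57, 27,
                     16,  22,  8,  0),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(school = c("GP", "MS"),
                                   studytime = c("1", "2", "3", "4")))

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

statistic   <- sum((observed - expected)^2 / expected)
df          <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value     <- pchisq(statistic, df, lower.tail = FALSE)
small_cells <- sum(expected < 5)   # cells where the approximation may be poor

# Alternative (not in the original analysis): simulated p-value, no large-sample approximation
set.seed(1)
simulated <- chisq.test(observed, simulate.p.value = TRUE, B = 2000)

print(round(expected, 3))
print(c(statistic = statistic, df = df, p_value = p_value))
print(simulated$p.value)
```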
Hypothesis:
H0: r = 0
HA: r ≠ 0
# Correlation: fit a linear model of grade on absences and plot it
absencesG3model <- lm(G3 ~ absences, data = student_mat)
plot(G3 ~ absences, data = student_mat, xlab = "Absences", ylab = "Grade")
abline(absencesG3model, col = "red")
- The plot shows no apparent linear relationship between absences and grade
bivariate<-as.matrix(dplyr::select(student_mat, G3,absences)) #Create a matrix of the variables to be correlated
rcorr(bivariate, type = "pearson")
## G3 absences
## G3 1.00 0.03
## absences 0.03 1.00
##
## n= 395
##
##
## P
## G3 absences
## G3 0.4973
## absences 0.4973
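The p-value for a Pearson correlation comes from the statistic t = r·sqrt(n − 2) / sqrt(1 − r²), which follows a t distribution with n − 2 degrees of freedom under H0. Because the printed r is rounded to 0.03, the reported p-value cannot be reproduced exactly from it, but the relationship can be inverted to recover the |r| implied by p = 0.4973 and n = 395. This is a sketch for checking consistency, with hypothetical variable names.

```r
n          <- 395       # sample size reported by rcorr()
p_reported <- 0.4973    # p-value reported by rcorr()
df         <- n - 2

# Forward direction: two-sided p-value for a given r and n
p_from_r <- function(r, n) {
  t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
  2 * pt(-abs(t_stat), n - 2)
}

# Inverse direction: |t| implied by the reported p-value, then |r| from r^2 = t^2 / (t^2 + df)
t_from_p <- qt(1 - p_reported / 2, df)
r_from_p <- t_from_p / sqrt(t_from_p^2 + df)

print(r_from_p)               # should round to the 0.03 shown in the matrix
print(p_from_r(r_from_p, n))  # recovers the reported p-value
```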
These results show no significant difference between the two schools in student grade or absences, and no significant association between school and study time. Moreover, there is no significant linear correlation between absences and grade. Parents who are unsure where to send their children can therefore choose either Gabriel Pereira or Mousinho da Silveira, because the schools do not differ significantly in grade, study time, or absences. This study also addresses parents' concern about their children's absences and grades: no significant linear correlation was found between them.
An advantage of this study is that the methods were chosen to suit the type of data, whether numerical or categorical. The independent samples t-test handles the numerical variables, the chi-square test handles the categorical variable, and linear regression with Pearson correlation is used to examine the linear relationship between two numerical variables.
A limitation of this study is that the data contain no information about the individual ability of the sampled students. The results can be biased if we compare grades between a student with a higher IQ who has many absences and a student with a lower IQ who is always present at school. The difficulty of the tests that determine the grade is also not described in the data set, so the grades may be biased.
Further improvement could come from running a pre-test to assess each student's ability and from considering additional attributes such as health, school activities, and reason for choosing the school.
[1]. UCI Machine Learning Repository, Student Performance Data Set, viewed 1 June 2019, http://archive.ics.uci.edu/ml/datasets/Student+Performance