Introduction

The data describe students aged 15 to 22 from two Portuguese schools in two subjects, Maths and Portuguese; achievement was taken from mark reports and questionnaires.
The data give information about student performance together with student attributes such as demographics, family status/background and school-related attributes.
For this study, only the Maths subject was analysed.
The aim is to analyse various demographics and their effect on overall student performance.
The investigation is supported by answering a research question through hypothesis testing.

Problem Statement

It is sometimes perceived that a student’s score depends on gender.
Some subjects might be more strongly associated with gender than others. This study, however, focuses only on students’ Maths scores to see how far they are associated with gender.
We investigate whether there is any association between the Maths exam score and the gender of the student.
If there is, we ask which gender is more likely to score better.
We use Pearson’s chi-square test of association for this investigation.
We also test whether there is any correlation between the G1 (Semester-1) and G3 (Final) scores using the correlation coefficient ‘r’.

Data

The data were collected from https://archive.ics.uci.edu/ml/datasets/Student+Performance, which is open data, free to use, reuse and share.
The data were available in CSV format.
The data set has 395 rows and 33 variables.
It holds student records from two schools: their Maths results along with other variables describing demographics, social status, family background and school-related attributes.
Most variables are categorical, with a few exceptions such as age, scores and absences, which are quantitative.
The data contained no missing values (NA).
For this study, Gender (sex), G1 and G3 score will be used for necessary analysis. Note that G1 and G3 are numerical variables and sex is a categorical (nominal) variable.

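library(readr)  # provides read_delim()
# Semicolon-delimited file; the path is specific to the author's machine
Student_d1 <- read_delim("D:/Intro to Statistics/Assignment 4/student-mat.csv",
    delim = ";", escape_double = FALSE, trim_ws = TRUE)
print(nrow(Student_d1))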

## [1] 395

Summary statistics

Summary statistics by sex: the first table is for the G3 score and the second for the G1 score

## # A tibble: 2 × 10
##     sex   Min    Q1 Median    Q3   Max      Mean       SD     n Missing
##   <chr> <int> <dbl>  <dbl> <dbl> <int>     <dbl>    <dbl> <int>   <int>
## 1     F     0     8     10    13    19  9.966346 4.622338   208       0
## 2     M     0     9     11    14    20 10.914439 4.495297   187       0

## # A tibble: 2 × 10
##     sex   Min    Q1 Median    Q3   Max     Mean       SD     n Missing
##   <chr> <int> <dbl>  <dbl> <dbl> <int>    <dbl>    <dbl> <int>   <int>
## 1     F     4     8     10    13    19 10.62019 3.232530   208       0
## 2     M     3     9     11    14    19 11.22995 3.392839   187       0

#BoxPlot and summary statistics for sex vs G3 Score

Student_d1 %>% boxplot(G3 ~ sex, data = ., ylab = "Grade3 Score", xlab="Sex")

Data Structuring and processing

The dataset was carefully checked for quality issues such as missing data and N/A cases.
For the analysis, students’ final grade score (G3) was used to create a categorical variable ‘Result’.
Scores were banded as: below 5 “Very Poor”, 5 to 10 “Poor”, above 10 up to 15 “Good”, above 15 up to 20 “Very Good”.
Gender (sex) already had two levels, “M” and “F”.
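
The missing-data check described above can be sketched as follows. This is a minimal illustration on a small made-up data frame (`toy` is hypothetical, not the study data); the same calls could be applied to `Student_d1`.

```r
# Toy stand-in for the student data; values are invented for illustration
toy <- data.frame(
  sex = c("F", "M", "F", "M"),
  G1  = c(10, 12, NA, 8),
  G3  = c(11, 13, 9, 0)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# TRUE only when the whole data frame has no missing values
no_missing <- !anyNA(toy)
print(no_missing)
```

A column total of zero for every variable corresponds to the "no missing information" finding reported for this dataset.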

Data factorizing and Ordering factors

Student_d1$Result[Student_d1$G3<5] <- "Very Poor"
Student_d1$Result[Student_d1$G3>=5 & Student_d1$G3<=10] <- "Poor"
Student_d1$Result[Student_d1$G3>10 & Student_d1$G3<=15] <- "Good"
Student_d1$Result[Student_d1$G3>15 & Student_d1$G3<=20] <- "Very Good"
table(Student_d1$Result, Student_d1$sex)

##            
##              F  M
##   Good      87 82
##   Poor      81 66
##   Very Good 16 24
##   Very Poor 24 15

Student_d1$Result <- Student_d1$Result %>% fct_relevel('Very Good','Good','Poor','Very Poor')
table(Student_d1$Result, Student_d1$sex)

##            
##              F  M
##   Very Good 16 24
##   Good      87 82
##   Poor      81 66
##   Very Poor 24 15
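
As an alternative sketch, the same banding can be written with base R `cut()`, which assigns the bands and builds the factor in one step. The vector `g3` below is made-up example data, and the break points assume integer scores, as is the case for G3.

```r
# Made-up G3 scores sitting on each band boundary
g3 <- c(0, 4, 5, 10, 11, 15, 16, 20)

# Intervals are right-closed: (-Inf,4], (4,10], (10,15], (15,20]
# For integer scores this matches: <5, 5-10, 11-15, 16-20
result <- cut(g3,
              breaks = c(-Inf, 4, 10, 15, 20),
              labels = c("Very Poor", "Poor", "Good", "Very Good"))

# Reorder the levels from best to worst, as in the tables above
result <- factor(result, levels = c("Very Good", "Good", "Poor", "Very Poor"))
print(table(result))
```

Writing the breaks in one place makes it easier to see that a score of exactly 5, 10 or 15 falls into one band only.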

Descriptive Statistics and Proportion table

A proportion table was created to view the distribution of results by sex.
In the sample, the male-female split was close to even (47% vs 53%).

tb <- table(Student_d1$Result,Student_d1$sex) %>% addmargins()
names(attributes(tb)$dimnames) <- c("Result","sex")
kable(tb,padding=0,format="html")

	F	M	Sum
Very Good	16	24	40
Good	87	82	169
Poor	81	66	147
Very Poor	24	15	39
Sum	208	187	395

tb2 <- table(Student_d1$Result,Student_d1$sex) %>% prop.table(margin=1) %>% round(2)
names(attributes(tb2)$dimnames) <- c("Result","sex")
kable(tb2,padding=0,format = "html")

	F	M
Very Good	0.40	0.60
Good	0.51	0.49
Poor	0.55	0.45
Very Poor	0.62	0.38
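
The table above gives proportions within each result band (each row sums to 1). As a hedged sketch, the same counts can also be turned into proportions within each sex (each column sums to 1), which shows how females and males are spread across the bands. The matrix `counts` re-enters the observed counts by hand and is a name chosen here for illustration.

```r
# Observed counts copied from the Result by sex table
counts <- matrix(c(16, 87, 81, 24,    # F
                   24, 82, 66, 15),   # M
                 ncol = 2,
                 dimnames = list(Result = c("Very Good", "Good", "Poor", "Very Poor"),
                                 sex = c("F", "M")))

# margin = 1: proportions within each Result band (rows sum to 1)
row_prop <- prop.table(counts, margin = 1)

# margin = 2: proportions within each sex (columns sum to 1)
col_prop <- prop.table(counts, margin = 2)

print(round(row_prop, 2))
print(round(col_prop, 2))
```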

Visualisation of the distribution

The chart suggests that males did somewhat better in this sample. In the highest band (Very Good), males account for 60% of students and females for 40%, whereas in the lowest band (Very Poor) females account for 62%.

barplot(tb2,ylab="Proportion Within Result",
          ylim=c(0,1),legend=rownames(tb2),beside=TRUE,
          args.legend=c(x = "top",horiz=TRUE,title="Result category"),
          xlab="Sex", col = c("Red","Yellow","orange","Green"))

Hypothesis Testing (Chi-square test of Association)

\(H0\): There is no association in the population between the Grade3 results of students and Gender.
\(HA\): There is an association in the population between the Grade3 results of students and Gender.

Testing assumptions and decision rules

Assumptions:

No more than 25% of the cells have expected counts below 5.
Observations are independent: each student is counted in exactly one cell of the Result by sex table.
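
The expected-count assumption can be checked directly. A minimal sketch, with the observed counts re-entered by hand in a matrix named `counts` (a name chosen here for illustration):

```r
# Observed counts copied from the Result by sex table
counts <- matrix(c(16, 87, 81, 24,    # F
                   24, 82, 66, 15),   # M
                 ncol = 2,
                 dimnames = list(Result = c("Very Good", "Good", "Poor", "Very Poor"),
                                 sex = c("F", "M")))

# Expected counts under independence, as computed by chisq.test()
expected <- chisq.test(counts)$expected

# Share of cells with an expected count below 5
share_below_5 <- mean(expected < 5)
print(share_below_5)

# The rule of thumb used in this report: at most 25% of cells below 5
assumption_ok <- share_below_5 <= 0.25
print(assumption_ok)
```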

Decision rules:

Reject the null hypothesis if p-value < 0.05 significance level.
Otherwise, fail to reject the null hypothesis.

Hypothesis Testing Results

chi2 <- chisq.test(table(Student_d1$Result, Student_d1$sex)) #Chi-square test between Result and Sex
chi2

## 
##  Pearson's Chi-squared test
## 
## data:  table(Student_d1$Result, Student_d1$sex)
## X-squared = 4.251, df = 3, p-value = 0.2356

round(pchisq(q = unname(chi2$statistic), df = unname(chi2$parameter), lower.tail = FALSE), 4) #P-value recomputed from the test's own X2 and df

## [1] 0.2356

qchisq(p = .95,df = 3) # Critical value identification

## [1] 7.814728
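
To make the test statistic less of a black box, the sketch below recomputes it by hand from the observed counts in this report: each expected count is row total × column total / n, and the statistic is the sum of (observed − expected)² / expected. The matrix `counts` is re-entered manually for illustration; `chisq.test()` remains the reference.

```r
# Observed counts copied from the Result by sex table
counts <- matrix(c(16, 87, 81, 24,    # F
                   24, 82, 66, 15),   # M
                 ncol = 2,
                 dimnames = list(Result = c("Very Good", "Good", "Poor", "Very Poor"),
                                 sex = c("F", "M")))

n <- sum(counts)

# Expected counts under independence
expected <- outer(rowSums(counts), colSums(counts)) / n

# Pearson chi-square statistic and its degrees of freedom
x2 <- sum((counts - expected)^2 / expected)
df <- (nrow(counts) - 1) * (ncol(counts) - 1)

# Upper-tail p-value and the 5% critical value
p_value  <- pchisq(x2, df = df, lower.tail = FALSE)
critical <- qchisq(0.95, df = df)

print(c(x2 = x2, df = df, p_value = p_value, critical = critical))
```

The statistic should agree with the X-squared value printed by `chisq.test()`, and it lies below the critical value, which is the same decision as comparing the p-value with 0.05.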

Observed values

chi2$observed %>% addmargins()

##            
##               F   M Sum
##   Very Good  16  24  40
##   Good       87  82 169
##   Poor       81  66 147
##   Very Poor  24  15  39
##   Sum       208 187 395

Expected values

chi2$expected %>% addmargins() %>% round(2)

##            
##                  F      M Sum
##   Very Good  21.06  18.94  40
##   Good       88.99  80.01 169
##   Poor       77.41  69.59 147
##   Very Poor  20.54  18.46  39
##   Sum       208.00 187.00 395
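
Each expected count is the row total times the column total divided by the sample size. As a worked example for the Very Good / F cell, using the margins of the observed table:

\[E_{\text{Very Good},F} = \frac{40 \times 208}{395} \approx 21.06\]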

Raw residuals

(chi2$observed - chi2$expected) %>% round(2) # parentheses so the difference, not only the expected counts, is rounded

##            
##                 F     M
##   Very Good -5.06  5.06
##   Good      -1.99  1.99
##   Poor       3.59 -3.59
##   Very Poor  3.46 -3.46

Standardized residuals

chi2$stdres

##            
##                      F          M
##   Very Good -1.6913438  1.6913438
##   Good      -0.4058105  0.4058105
##   Poor       0.7489347 -0.7489347
##   Very Poor  1.1699704 -1.1699704
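
Standardized residuals divide each raw residual by its estimated standard error. A sketch of that calculation, assuming the usual formula (O − E) / sqrt(E × (1 − row share) × (1 − column share)) and with the counts re-entered by hand in `counts`:

```r
# Observed counts copied from the Result by sex table
counts <- matrix(c(16, 87, 81, 24,    # F
                   24, 82, 66, 15),   # M
                 ncol = 2,
                 dimnames = list(Result = c("Very Good", "Good", "Poor", "Very Poor"),
                                 sex = c("F", "M")))

n <- sum(counts)
row_share <- rowSums(counts) / n   # share of students in each Result band
col_share <- colSums(counts) / n   # share of students of each sex

# Expected counts under independence
expected <- outer(rowSums(counts), colSums(counts)) / n

# Raw residual scaled by its estimated standard error
stdres <- (counts - expected) /
  sqrt(expected * outer(1 - row_share, 1 - col_share))

print(round(stdres, 4))
```

No cell exceeds 2 in absolute value, which is consistent with the non-significant overall test.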

Hypothesis Test Summary

0% of expected cell counts were less than 5.
\(X2\) : 4.251, df=3
\(p\)-value : 0.2356 which is >0.05
Decision : Failed to reject \(H0\)

Correlation between G1 and G3 score

Measure the strength and direction of the linear relationship between G1 and G3 scores.
The correlation coefficient ranges from -1 to +1; 0 means no linear correlation.

#Scatter plot for G1 and G3 scores
plot(G3~ G1, data =Student_d1, xlab = "G1 Score", ylab = "G3 Score")

#Correlation between G1 and G3 scores
cor(Student_d1$G1,Student_d1$G3,use = "complete.obs")

## [1] 0.8014679

#Full correlation analysis

bivariate<-as.matrix(dplyr::select(Student_d1, G1,G3)) #Create a matrix of the variables to be correlated
rcorr(bivariate, type = "pearson")

##     G1  G3
## G1 1.0 0.8
## G3 0.8 1.0
## 
## n= 395 
## 
## 
## P
##    G1 G3
## G1     0
## G3  0

#95% confidence interval for r (CIr() from the psychometric package); it does not contain 0
r=cor(Student_d1$G1,Student_d1$G3)
CIr(r=r,n=395,level=.95)

## [1] 0.7631479 0.8341713
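
A confidence interval for r is commonly built with the Fisher z-transformation. The sketch below assumes that approach (transform r with `atanh()`, use a standard error of 1/sqrt(n − 3), then back-transform with `tanh()`) and plugs in the r and n reported above. It is an illustration of the idea, not a replacement for `CIr()`.

```r
r <- 0.8014679   # sample correlation between G1 and G3 reported above
n <- 395         # number of students

# Fisher z-transformation and its approximate standard error
z  <- atanh(r)
se <- 1 / sqrt(n - 3)

# 95% interval on the z scale, back-transformed to the r scale
z_crit <- qnorm(0.975)
ci <- tanh(c(lower = z - z_crit * se, upper = z + z_crit * se))

print(round(ci, 4))
```

The resulting limits should be close to the interval printed above, and both lie well above 0.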

Correlation results

R reports a strong positive correlation between G1 and G3 scores, r=0.80, with a p-value printed as 0, which we write as p<.001.
A hypothesis test for the population correlation \(\rho\) has the following statistical hypotheses \[H0:\rho=0\] \[HA:\rho \neq 0 \]
Using the 0.05 level of significance we can reject \(H0\).
The 95% confidence interval (0.763, 0.834) does not contain 0, the value under \(H0\); therefore, \(H0\) was rejected.
There was a statistically significant positive correlation between the G1 and G3 scores of the students.
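
The test of \(H0\) for a Pearson correlation is usually based on a t statistic with n − 2 degrees of freedom. A sketch, plugging in the r and n reported above (the same test is available from `cor.test()` on the raw data):

```r
r <- 0.8014679   # sample correlation between G1 and G3 reported above
n <- 395         # number of students

# t statistic for testing a zero population correlation
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_value <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

print(c(t = t_stat, df = n - 2, p_value = p_value))
```

The p-value is far below 0.001, which is why the correlation matrix output prints it as 0.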

Discussion

The test failed to find a statistically significant association between students’ Grade3 result and students’ gender, \(X2\) (df=3)=4.25, p=0.236.
As this is a test of association, it indicates only whether gender and the final Maths result are associated; no evidence of an association was found in this data.
There was a statistically significant positive correlation between G1 and G3 scores; the null hypothesis was rejected, p<0.001.
Like gender, other categorical variables could also be tested for association with the result.
This was attempted with some other variables, but several of their categories had too few students, which violated the assumption that no more than 25% of the cells have expected counts below 5.
Further analysis such as multiple regression could be used to build a predictive model with more than one independent variable.
As the data contained mainly categorical predictors (independent variables), that level of analysis was not attempted here.
More research is also needed (e.g. sociological studies) in order to understand why and how some variables (e.g. reason to choose school, parent’s job or alcohol consumption) affect student performance.
There is a potential for an automatic on-line learning environment, by using a student prediction engine as part of a school management support system.
This would allow the collection of additional features (e.g. grades from previous school years) and valuable feedback from school professionals.
Since only a small portion of the input variables considered seem to be relevant, further investigation could gather more data and re-examine the other variables that may affect performance.

Analysis for student performance

Factors affecting student performance
