Applying Statistical Methods to Answer Research Questions about the Student Performance Data Set from the University of California Irvine’s Machine Learning Repository

(work in progress)

Introduction

Purposes of this report include:

Exploring relationships between a large number of student performance, socioeconomic, individual, school, and other variables.
Integrating original research questions with the statistical methods that can answer them.
Providing R code that others can study and copy in order to apply these analyses to similar research questions.
Providing examples of interpretations and conclusions that can be drawn from the results of these methods.
Linking statistical methods to meaningful language and concepts, so that quantitative procedures connect with qualitative and conceptual understanding.

Data Setup

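# Load the Portuguese and mathematics course files (semicolon-delimited).
dpor <- read.table("student-por.csv", sep = ";", header = TRUE)
dmath <- read.table("student-mat.csv", sep = ";", header = TRUE)
# Match students across the two courses on their shared background attributes.
dmerge <- merge(dpor, dmath, by = c("school", "sex", "age", "address", "famsize", "Pstatus",
                                    "Medu", "Fedu", "Mjob", "Fjob", "reason", "nursery", "internet"))
# Split each course by school. The Portuguese subsets drop the now-constant school column.
dpor1 <- subset(dpor, school == "GP", select = -school)
dpor2 <- subset(dpor, school == "MS", select = -school)
dmath1 <- subset(dmath, school == "GP")
dmath2 <- subset(dmath, school == "MS")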

Descriptive Statistics

What are the mean and standard deviation of end-of-year Portuguese grades?
What grades correspond to the 25th, 50th, and 75th percentiles?
What are the lowest and highest grades? What is the range of the grades?
What percentage of students score at each grade? What percentage of students score at or below each grade?
How are the grades distributed? Are they normally distributed?
What is the skewness and kurtosis of the grades?
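
The descriptive questions above map onto a handful of base R calls. The following is a minimal sketch, not the report's final analysis: it runs on a simulated vector `g3` (a hypothetical stand-in for `dpor$G3`), and skewness and kurtosis are computed by hand from moments rather than with a package.

```r
# Simulated stand-in for dpor$G3 (grades on the 0-20 scale). NOT real data;
# replace `g3` with dpor$G3 once the files are loaded.
set.seed(1)
g3 <- pmin(pmax(round(rnorm(200, mean = 12, sd = 3)), 0), 20)

# Center and spread
g3_mean <- mean(g3)
g3_sd <- sd(g3)

# Quartiles, extremes, and range
g3_quartiles <- quantile(g3, probs = c(0.25, 0.50, 0.75))
g3_min_max <- range(g3)
g3_range <- diff(g3_min_max)

# Percentage of students at each grade, and at or below each grade
pct_at <- 100 * prop.table(table(g3))
pct_at_or_below <- cumsum(pct_at)

# Moment-based skewness and excess kurtosis (0 for a normal distribution)
z <- (g3 - g3_mean) / sqrt(mean((g3 - g3_mean)^2))
g3_skewness <- mean(z^3)
g3_excess_kurtosis <- mean(z^4) - 3

# Shape of the distribution: histogram, normal Q-Q plot, Shapiro-Wilk test
hist(g3, breaks = seq(-0.5, 20.5, by = 1), main = "Final grades", xlab = "G3")
qqnorm(g3)
qqline(g3)
g3_shapiro <- shapiro.test(g3)
```

Because grades are whole numbers with many ties, the Shapiro-Wilk p-value is best read alongside the histogram and Q-Q plot rather than on its own.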

T Tests, Effect Sizes, Confidence Intervals

Does the average grade differ between schools Gabriel Pereira ‘GP’ and Mousinho da Silveira ‘MS’? By how much?
At GP (the school used for the remaining questions), does the average grade differ between females and males? By how much?
Does the average grade differ between students who want to take higher education and those who don’t? By how much?
Do students living in urban homes have different average grades than students living in rural homes? By how much?
Do students whose parents live together differ in average grade from students whose parents live apart? By how much?
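
Each question above has the same shape: a two-group comparison of means, a confidence interval for the difference, and a standardized effect size. A minimal sketch follows, assuming a simulated data frame `toy` with the columns `sex` and `G3`. The helper `cohens_d` is a hypothetical name defined here, not a function from a package.

```r
# Simulated stand-in for the GP subset (dpor1). NOT real data.
set.seed(2)
toy <- data.frame(
  sex = rep(c("F", "M"), each = 60),
  G3  = c(rnorm(60, mean = 12.5, sd = 3), rnorm(60, mean = 11.5, sd = 3))
)

# Welch two-sample t test (does not assume equal variances)
tt <- t.test(G3 ~ sex, data = toy)
mean_diff <- unname(tt$estimate[1] - tt$estimate[2])  # mean(F) - mean(M)
ci95 <- tt$conf.int                                   # 95% CI for that difference

# Cohen's d using the pooled standard deviation
cohens_d <- function(x, y) {
  nx <- length(x)
  ny <- length(y)
  pooled_sd <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
  (mean(x) - mean(y)) / pooled_sd
}
d <- cohens_d(toy$G3[toy$sex == "F"], toy$G3[toy$sex == "M"])
```

The same pattern should apply to `higher`, `address`, and `Pstatus` by swapping the grouping variable in the formula.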

One-Way ANOVA, Two-Way ANOVA, ANCOVA, MANOVA, Post Hoc Comparisons/Tukey Honest Significant Difference Tests

Do average grades differ between at least two of the groups of students who chose their school because it is close to home, because of its reputation, because of a course preference, or for another reason? By how much do average grades differ for each pair of these groups?
Do the differences in average grades between school-reason groups depend on whether students want to take higher education? That is, is there an interaction between school reason and higher education interest on final grades? By how much does each pair of levels differ?
Do average grades or the average number of absences differ between levels of students’ mothers’ education (none / up to 4th grade / 5th-9th grade / secondary education / higher education)?
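
The three questions above correspond to a one-way ANOVA with Tukey HSD comparisons, a two-way ANOVA with an interaction term, and a MANOVA on two outcomes. The sketch below uses a simulated data frame `toy` that borrows the data set's column names. With simulated data no real effects are expected, so it only illustrates the calls.

```r
# Simulated stand-in with the data set's column names. NOT real data.
set.seed(3)
toy <- data.frame(
  reason   = factor(rep(c("course", "home", "other", "reputation"), each = 40)),
  higher   = factor(rep(c("no", "yes"), times = 80)),
  Medu     = factor(sample(0:4, 160, replace = TRUE)),
  G3       = rnorm(160, mean = 12, sd = 3),
  absences = rpois(160, lambda = 4)
)

# One-way ANOVA: does mean G3 differ between school-reason groups?
fit1 <- aov(G3 ~ reason, data = toy)
summary(fit1)
tukey1 <- TukeyHSD(fit1)  # all pairwise differences with adjusted CIs

# Two-way ANOVA with interaction: does the reason effect depend on `higher`?
fit2 <- aov(G3 ~ reason * higher, data = toy)
summary(fit2)
tukey2 <- TukeyHSD(fit2, which = "reason:higher")  # pairwise cell comparisons

# MANOVA: do G3 and absences jointly differ across mother's education levels?
fit3 <- manova(cbind(G3, absences) ~ Medu, data = toy)
summary(fit3, test = "Pillai")
summary.aov(fit3)  # follow-up univariate ANOVAs, one per outcome
```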

Linear Regression, Scatterplots

What is the equation of the line of best fit predicting students’ final grades from absences? What is the slope of grades on absences? What is the intercept? Are they significant?
Does the linear model meet its assumptions?
What are the direction and strength of the correlation between grades and absences, as measured by the correlation coefficient?
Does a transformation of the distribution of students’ absences allow for a better linear fit?
Does a transformation of the distribution of students’ grades allow for a better linear fit?
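
A minimal sketch of the simple regression workflow behind these questions, on simulated data. The choice of `log1p` as the transformation is an assumption made here because absences include zeros. It is one reasonable option, not necessarily the one the finished report will use.

```r
# Simulated stand-in for G3 and absences. NOT real data.
set.seed(4)
absences <- rpois(150, lambda = 4)
G3 <- pmin(pmax(round(13 - 0.15 * absences + rnorm(150, sd = 2.5)), 0), 20)
toy <- data.frame(G3, absences)

# Line of best fit: intercept, slope, standard errors, and p-values
fit <- lm(G3 ~ absences, data = toy)
coef(summary(fit))

# Correlation coefficient with its test and confidence interval
r <- cor.test(toy$G3, toy$absences)

# Scatterplot with the fitted line, then the four standard diagnostic plots
plot(G3 ~ absences, data = toy)
abline(fit)
op <- par(mfrow = c(2, 2))
plot(fit)
par(op)

# Transforming the predictor: log1p() handles the zero counts
fit_log <- lm(G3 ~ log1p(absences), data = toy)
# AIC is comparable here because both models have the same response.
# It would not be comparable after transforming G3 itself.
AIC(fit, fit_log)
```

In a one-predictor model the squared correlation coefficient equals the model's R-squared, which gives a quick consistency check between the two outputs.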

Multiple Regression, Generalized Linear Models, Model Comparisons

What proportion of the variance in students’ final grades can be explained by the following variables:

Gender M/F
Age 15-22
Address type: Urban/Rural
Family size: Less than or equal to 3, more than 3
Parents’ cohabitation status: Together/Apart
Mother’s education level: None/Primary-4th Grade/5th-9th Grade/Secondary Education/Higher Education
Father’s education level: None/Primary-4th Grade/5th-9th Grade/Secondary Education/Higher Education
School Reason: close to home/ school reputation/ course preference/ other
Student’s guardian: mother/father/other
Commute time: <15min/15-30min/ 30min-1hour/ >1hour
Weekly study time: <2hours, 2-5 hours, 5-10 hours, >10hours
Number of past course failures: 0, 1, 2, 3, >3
Receiving extra in school support: Y/N
Receiving family educational support: Y/N
Receiving extra paid classes/tutoring: Y/N
Involved in extra-curricular activities: Y/N
Attended nursery school: Y/N
Interested in higher education: Y/N
Internet access at home: Y/N
In a romantic relationship: Y/N
Quality of family relationships: 1-very bad to 5 excellent
Free time after school: 1-very low to 5- very high
Going out with friends: 1-very low to 5- very high
Workday alcohol consumption: 1-very low to 5-very high
Weekend alcohol consumption: 1- very low to 5- very high
Current health status: 1- very bad to 5-very good
Number of absences: 0-93

What is the predicted change in students’ grade with a one unit increase in a numerical variable, after controlling for all other variables?
What is the difference in mean grades of one factor level compared to a default factor level after controlling for all other variables?
What variables have significant effects on grades?
What variables have the most and least predictive power on grades?
Is there multicollinearity in the model? Which variables contribute to it?
What variables could be removed to make a simpler model that accounts for a comparable amount of the variance?
Are there any significant interactions between variables, in which the effect of one variable on grades depends on the value of another variable?
How do various models made compare with each other?
How well does a generalized linear model with a binomial/logistic link model whether a student is interested in higher education?
How do the questions above about students’ grades relate to this new variable of interest in higher education?
How can the coefficients of a binomial/logistic generalized linear model be interpreted on the log-odds (logit), odds, and probability scales, and converted from one scale to another?
How well does a Poisson generalized linear model account for the variation in the number of absences for a student?
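
The conversion question above can be illustrated with a small sketch. It fits a logistic model on simulated data, with `higher` coded 0/1 and `studytime` as the only predictor (both simplifying assumptions), then moves one prediction between the log-odds, odds, and probability scales.

```r
# Simulated stand-in: interest in higher education (0/1) by study time. NOT real data.
set.seed(5)
toy <- data.frame(studytime = sample(1:4, 300, replace = TRUE))
toy$higher <- rbinom(300, 1, plogis(-0.5 + 0.8 * toy$studytime))

fit <- glm(higher ~ studytime, data = toy, family = binomial())

# Coefficients are on the log-odds (logit) scale
log_odds_coef <- coef(fit)
# Exponentiating gives odds ratios: the multiplicative change in the odds
# for a one-unit increase in the predictor
odds_ratios <- exp(log_odds_coef)

# Prediction for studytime = 2 on all three scales
eta  <- sum(log_odds_coef * c(1, 2))  # log-odds
odds <- exp(eta)                      # odds = p / (1 - p)
p    <- plogis(eta)                   # probability = odds / (1 + odds)

# Going back: qlogis() is the logit function, log(p / (1 - p))
eta_back <- qlogis(p)
```

An odds ratio is constant across the range of a predictor, but the corresponding change in probability is not. It is largest near p = 0.5 and shrinks toward the extremes.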

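# Full model: final grade (G3) on all predictors except the earlier period grades.
# The object is named fullLm so that it does not mask the lm() function.
fullLm <- lm(G3 ~ . - G1 - G2, data = dpor1)
summary(fullLm)
# Show the four diagnostic plots in a 2 x 2 grid.
layout(matrix(c(1, 2, 3, 4), 2, 2))
plot(fullLm)

library(MASS)
# Stepwise selection by AIC, both adding and dropping terms.
steplm <- stepAIC(fullLm, direction = "both")
summary(steplm)

# Refit the reduced model with its terms written out explicitly.
steplm <- lm(G3 ~ sex + age + Mjob + Fjob + studytime + failures + 
               schoolsup + activities + higher + romantic + goout + health + 
               absences, data = dpor1)
summary(steplm)
# Add all two-way interactions among the reduced set of predictors.
steplmInt <- lm(G3 ~ (sex + age + Mjob + Fjob + studytime + failures + 
               schoolsup + activities + higher + romantic + goout + health + 
               absences)^2, data = dpor1)

steplmIntStep <- stepAIC(steplmInt, direction = "both")
summary(steplmIntStep)
# Compare all four models; a lower AIC indicates a better fit/complexity trade-off.
AIC(fullLm, steplm, steplmInt, steplmIntStep)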

library(relaimpo)
calc.relimp(steplm, type=c("lmg"),rela=TRUE, rank=TRUE)

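# Recode past failures as binary: 1 = at least one failure, 0 = none.
dpor1$failures <- ifelse(dpor1$failures > 0, 1, 0)
table(dpor1$failures)

# Logistic regression for any past failure, excluding the grade variables.
# The object is named fullGlm so that it does not mask the glm() function.
fullGlm <- glm(failures ~ . - G1 - G2 - G3, data = dpor1, family = binomial())
summary(fullGlm)
library(MASS)
glmstep <- stepAIC(fullGlm, direction = "both")
summary(glmstep)
# Add all two-way interactions among the reduced set of predictors.
glmstepInt <- glm(failures ~ (age + famsize + Medu + reason + studytime + 
                    schoolsup + paid + higher + absences)^2, family = binomial(), 
                  data = dpor1)
summary(glmstepInt)
# The argument name was misspelled as "direciton", so it was not used as intended.
glmstepIntStep <- stepAIC(glmstepInt, direction = "both")
AIC(fullGlm, glmstep, glmstepInt, glmstepIntStep)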
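
The Poisson question in the list above is not yet covered by the code. A minimal sketch of how it might be approached follows, on simulated counts with two hypothetical predictors. The deviance-based measure and the Pearson dispersion statistic are common summaries, chosen here as assumptions rather than taken from the report.

```r
# Simulated stand-in for absences with two predictors. NOT real data.
set.seed(6)
toy <- data.frame(
  studytime = sample(1:4, 300, replace = TRUE),
  age       = sample(15:19, 300, replace = TRUE)
)
toy$absences <- rpois(300, exp(1.5 - 0.2 * toy$studytime + 0.05 * (toy$age - 15)))

fit_pois <- glm(absences ~ studytime + age, data = toy, family = poisson())
summary(fit_pois)

# Coefficients are on the log scale; exponentiate to get rate ratios
rate_ratios <- exp(coef(fit_pois))

# Proportion of the null deviance accounted for by the model
dev_explained <- 1 - fit_pois$deviance / fit_pois$null.deviance

# Pearson dispersion: values well above 1 suggest overdispersion, in which
# case family = quasipoisson() or MASS::glm.nb() may be more appropriate
dispersion <- sum(residuals(fit_pois, type = "pearson")^2) / df.residual(fit_pois)
```

Real absence counts are often far more variable than a Poisson model allows, so the dispersion check is worth running before interpreting standard errors.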

Mixed Effect/Hierarchical/Multilevel Modeling

Using ‘G1’, ‘G2’, and ‘G3’ as repeated measures of ‘grade’, how well does a mixed effects model adding random effects for student and school and fixed effects for all other variables predict students’ grades?
How does this mixed effect model compare with a generalized linear model using only fixed effects?
What variables could be removed from the mixed effect model to make a simpler model that accounts for a comparable proportion of the variation in grades?
How does this reduced model compare with a reduced model using only fixed effects?

Classification/Prediction, ROC Curves

What percentage of cases can logistic regression/classification models, fitted on one portion of the data, accurately predict for binary or multinomial variables in the remainder of the data?
How do changes in the cutoff criterion for predicting binary variables affect the model’s sensitivity and specificity?
How does the prediction accuracy of various machine learning models compare with intuitive or baseline strategies such as random chance, measures of central tendency, or expert/clinician judgement?
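
A minimal sketch of the train/test workflow these questions describe, using only base R on simulated data. The binary outcome `pass`, the 70/30 split, and the helper `confusion_rates` are all assumptions made for illustration.

```r
# Simulated stand-in with a hypothetical binary outcome `pass`. NOT real data.
set.seed(7)
n <- 400
toy <- data.frame(studytime = sample(1:4, n, replace = TRUE),
                  absences  = rpois(n, lambda = 4))
toy$pass <- rbinom(n, 1, plogis(-0.5 + 0.6 * toy$studytime - 0.15 * toy$absences))

# Fit on 70% of the rows, predict the remaining 30%
train_idx <- sample(n, size = round(0.7 * n))
train_set <- toy[train_idx, ]
test_set  <- toy[-train_idx, ]
fit <- glm(pass ~ studytime + absences, data = train_set, family = binomial())
p_hat <- predict(fit, newdata = test_set, type = "response")

# Accuracy, sensitivity, and specificity at a given probability cutoff
confusion_rates <- function(p, y, cutoff) {
  pred <- as.integer(p >= cutoff)
  c(cutoff      = cutoff,
    accuracy    = mean(pred == y),
    sensitivity = sum(pred == 1 & y == 1) / sum(y == 1),
    specificity = sum(pred == 0 & y == 0) / sum(y == 0))
}

# Sweep the cutoff to trace an empirical ROC curve
cutoffs <- seq(0, 1, by = 0.05)
roc <- t(sapply(cutoffs, function(k) confusion_rates(p_hat, test_set$pass, k)))

# Baseline: always predict the more common class in the test set
baseline_acc <- max(mean(test_set$pass), 1 - mean(test_set$pass))

plot(1 - roc[, "specificity"], roc[, "sensitivity"], type = "b",
     xlab = "1 - specificity", ylab = "Sensitivity", main = "ROC curve")
abline(0, 1, lty = 2)  # chance line
```

Raising the cutoff can only lower sensitivity and raise specificity, which is the trade-off the ROC curve makes visible. Comparing `roc[, "accuracy"]` against `baseline_acc` shows whether the model beats the majority-class strategy.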

Principal Components Analysis, Exploratory Factor Analysis

What groupings of variables can the full set of variables be reduced to?
How many independent groupings of variables are statistically significant?
Do certain variables load so strongly or so weakly onto a grouping that they provide information about its substantive meaning?
How might variables within a grouping be connected substantively?
How could extracted groupings/components/factors be hypothetically labeled? What other variables could be measured to provide more evidence towards supporting or disconfirming these hypotheses?
Given a certain label,
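
A minimal sketch of how the component questions above might be explored with base R. The data are simulated continuous variables that borrow a few of the data set's column names, whereas the real items are ordinal 1-5 ratings. The Kaiser rule (eigenvalue greater than 1) is used as one common retention heuristic, not as a significance test.

```r
# Simulated stand-in: three variables share one latent source, two do not. NOT real data.
set.seed(8)
n <- 300
social <- rnorm(n)  # hypothetical latent "socializing" tendency
toy <- data.frame(
  goout     = social + rnorm(n, sd = 0.6),
  Dalc      = social + rnorm(n, sd = 0.8),
  Walc      = social + rnorm(n, sd = 0.6),
  studytime = rnorm(n),
  famrel    = rnorm(n)
)

# PCA on standardized variables
pca <- prcomp(toy, center = TRUE, scale. = TRUE)
eigenvalues <- pca$sdev^2
var_explained <- eigenvalues / sum(eigenvalues)

# Kaiser rule: keep components whose eigenvalue exceeds 1
n_keep <- sum(eigenvalues > 1)
screeplot(pca, type = "lines")

# Loadings show which variables define each retained component
round(pca$rotation[, seq_len(n_keep), drop = FALSE], 2)

# Exploratory factor analysis with one factor, for comparison
efa <- factanal(toy, factors = 1)
print(efa$loadings, cutoff = 0.3)
```

Variables with large loadings on the same component are candidates for a shared label, while variables with near-zero loadings help to rule labels out.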

Confirmatory Factor Analysis, Structural Equation Modeling

How well do the data on measured variables support a theoretical latent variable/construct composed of those variables?
How well do the collected data support a theoretical model specifying pathways of latent and measured variables’ direct and indirect effects on each other as well as on one or more criterion variables?
What indexes of fit can be used to indicate how well the measured data fit the theoretical model?
What proportion of variance in the criterion variable(s) does the structural equation model account for?