Data

This dataset comes from the UCI Machine Learning Repository and consists of student performance information. The data contains the following features:

The task associated with this dataset is regression to predict the grades for each course subject:

# Both files are semicolon-delimited with a header row
df_mat <- read.table("https://raw.githubusercontent.com/mkivenson/Business-Analytics-Data-Mining/master/Final%20Project/student-mat.csv", sep = ";", header = TRUE)
df_por <- read.table("https://raw.githubusercontent.com/mkivenson/Business-Analytics-Data-Mining/master/Final%20Project/student-por.csv", sep = ";", header = TRUE)

Abstract

Key Words

Introduction

The goal of this project is to use features from a student’s personal life and activities to predict grades in Math and Portuguese classes. Feature importance and explainability will be particularly useful, as they will indicate which factors are most strongly associated with grades. Two datasets containing the same features will be used: one for Math grades (df_mat) and one for Portuguese grades (df_por). Some students appear in both the Math and Portuguese classes, so a separate model will be created for each dataset. Comparing the two models will also indicate whether feature importance varies between math and language skills.

Methodology

The first step in predicting grades for the Math and Portuguese classes is data exploration and preprocessing. There are no missing values in the dataset, so no imputation is needed. However, the dataset contains many categorical and ordinal variables. Categorical variables will be one-hot encoded into dummy variables prior to use in linear models, while ordinal variables will be kept as-is.
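The preprocessing described above can be sketched in base R as follows. This is a minimal illustration, not the project's actual code: the toy data frame and its values are made up, and the column names (school, Mjob, studytime, G3) are assumed to match the UCI files. Note that model.matrix produces treatment-coded dummy variables, dropping one reference level per factor, which is a common variant of one-hot encoding for linear models; a full one-hot encoding would keep every level.

```r
# Toy stand-in for df_mat; values are invented for illustration only.
toy <- data.frame(
  school    = factor(c("GP", "MS", "GP", "MS")),
  Mjob      = factor(c("teacher", "health", "other", "teacher")),
  studytime = c(2L, 1L, 3L, 2L),   # ordinal variable, kept as-is
  G3        = c(12L, 9L, 15L, 11L) # assumed target column name
)

# Count missing values; zero means no imputation is needed
n_missing <- sum(is.na(toy))

# Separate the predictors from the target
predictors <- toy[, setdiff(names(toy), "G3"), drop = FALSE]

# Expand factor columns into dummy variables.
# The first column of the result is the intercept, which is dropped here.
encoded <- model.matrix(~ ., data = predictors)[, -1, drop = FALSE]

print(n_missing)
print(colnames(encoded))
```

Dropping a reference level avoids perfect collinearity between the dummy columns and the intercept of a linear model.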

Experimentation and Results

Data Exploration

Correlation

The correlation plots below cover numeric columns only and indicate very limited collinearity among the predictors. The last three columns/rows are the test scores, which are highly correlated with each other.

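library(corrplot) # corrplot()
library(dplyr)    # select_if()

# Math dataset: correlations between numeric columns
corrplot(cor(select_if(df_mat, is.numeric), use = "complete.obs"), method = "circle", tl.pos = "n")

# Portuguese dataset: correlations between numeric columns
corrplot(cor(select_if(df_por, is.numeric), use = "complete.obs"), method = "circle", tl.pos = "n")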
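Because the plots omit axis labels, it can help to read the same correlations numerically. The sketch below uses a toy data frame with made-up values; the column names (absences, G1, G2, G3) are assumptions based on the UCI files, and the same calls could be applied to df_mat or df_por.

```r
# Toy stand-in for the numeric columns; values are invented for illustration.
toy <- data.frame(
  absences = c(0, 4, 2, 10, 6, 1),
  G1       = c(10, 8, 14, 6, 12, 15),
  G2       = c(11, 8, 13, 5, 12, 16),
  G3       = c(11, 9, 14, 5, 13, 16)
)

# Keep numeric columns only, mirroring select_if(df, is.numeric)
num_cols <- toy[, sapply(toy, is.numeric), drop = FALSE]

# Pairwise Pearson correlations, using complete rows only
cor_mat <- cor(num_cols, use = "complete.obs")

# Correlation of every column with the final grade, strongest first
g3_cor <- sort(cor_mat[, "G3"], decreasing = TRUE)
print(round(g3_cor, 2))
```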