Preface

This report is part of the requirements of HarvardX's Data Science Professional Certificate Program with R Capstone project1.

The R Markdown code used to generate this report and the PDF version are available on GitHub.

CHAPTER ONE

INTRODUCTION

Over the years, different methods have been used to measure academic performance, including report card grades, grade point averages, standardized test scores, teacher ratings, other cognitive test scores, and grade retention and dropout rates. Academic performance has always been a subject of interest to every educational institution and educationist; the author, being an educator, is hence interested in the student performance data. There is a consensus that schools should play a major role in student performance; others, however, believe that the efforts of schools should be integrated with those of other social institutions, such as the family and the community, in educating children. Heads of educational institutions, teachers and parents are therefore primarily responsible for students' academic performance, and schools should organize themselves efficiently and effectively towards this task. The present study made use of both academic performance (maths, reading and writing scores) and other domestic attributes such as parents' educational levels and the type of lunch taken.

Some uses of student performance data analysis include:

  1. Identifying potential factors responsible for students' success or failure
  2. Benchmarking against similar schools or classes
  3. Setting annual goals
  4. Identifying low-performing students and creating assistance plans for them
  5. Generally improving students' performance in all fields

This study follows the following pattern: Chapter 1 tells the story of the dataset, summarizes the study goal and highlights the key steps performed. Chapter 2 illustrates the process, methods and techniques used, including the data cleaning, data exploration and visualization, the insights derived, and the modeling approach. In Chapter 3 the models are executed, the results are presented and the model performance is evaluated. Chapter 4 gives a brief summary of the report, the limitations and future work.

Student Performance Dataset

The student performance dataset downloaded from Kaggle contains the math score, reading score and writing score of students. Other domestic attributes of the students were also included in the data, such as the type of lunch they brought to school, the educational level of their parents, their ethnicity, and their gender.

The performance measure was computed as the standardized sum of all the measured subject scores. The standardization was done to account for differences and variances (or biases) in the scores: some students may not perform well in, or may not like, particular subjects, while finding others favorites. Standardizing removes the biases attached to liking or disliking, or performing low or high in, individual subjects.
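Concretely, if \(g_i\) denotes a student's summed score, \(\bar{g}\) the mean and \(s_g\) the standard deviation of the summed scores, the standardized score is

\[ z_i = \frac{g_i - \bar{g}}{s_g} \]

which is exactly the transformation applied in the data preparation step later in this report.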

Model Evaluation

In machine learning, after a model is fitted, it is further evaluated to examine its level of accuracy by comparing the predicted observations with the actual observations. The loss function, a measure of the misses or inaccuracies between the predicted and the actual values, is used in this study, among others, to reference the goodness of the model. The loss functions used in this study, which are also the most common in machine learning, are the mean absolute error (MAE), mean squared error (MSE) and root mean squared error (RMSE). A perfect model is achieved when the RMSE is zero, which implies that every value the model predicted is true.

Root Mean Squared Error (RMSE)

The MAE and MSE are defined first; the root mean squared error is then obtained as the square root of the MSE.

The Adjusted R-Squared

The adjusted R-squared is an estimate that measures the proportion of the variability of the dependent attribute explained by all the independent variables combined, adjusted for the number of predictors.
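For reference, with \(n\) observations and \(p\) predictors, the adjusted R-squared is obtained from the ordinary \(R^2\) as

\[ R^2_{adj} = 1 - (1 - R^2)\frac{n - 1}{n - p - 1} \]

so it penalizes models that add predictors without improving the fit.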

Process and Workflow

The dataset was first downloaded from Kaggle using this link.2
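A minimal sketch of loading the data, assuming the downloaded CSV was saved locally as 'StudentsPerformance.csv' (the file name and path are assumptions, as the loading code is not shown here):

# Assumed local file name for the Kaggle download
StudentsPerf <- read.csv("StudentsPerformance.csv")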

CHAPTER TWO

Methods and Analysis

Regression, in its simplest setup, assumes that the response variable \(Y\) is some function of the features or attributes, plus some error.

\[ Y = f(\boldsymbol{X}) + \epsilon \]

  • \(f(\boldsymbol{X})\) is the function of the features that we would like to learn or model.
  • The error term \(\epsilon\) is random noise and is not learned; learning the error along with the model is what brings about over-fitting. Therefore, the better a model avoids fitting the errors, the better it is.

Here \(\boldsymbol{x}\) represents the observed attribute values of the random variables \(\boldsymbol{X}\):

\[ \boldsymbol{x} = (x_1, x_2, \ldots, x_p) \]

Therefore, the goal is to find some \(f\), or model, such that \(f(\boldsymbol{X})\) is close to \(Y\). The closer \(f(\boldsymbol{X})\) gets to \(Y\), the smaller the loss function (loss functions are defined later in this study).

If the attributes are added one after the other, the simple regression function with one independent variable/attribute, called a degree one polynomial, is expressed as:

Degree 1 Polynomial

\[ \mu(x) = \beta_0 + \beta_1 x \]

A regression model with two independent variables is called a multiple regression function, shown below as Model 2 (mod_2 in R):

\[ \mu_2(\boldsymbol{x}) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 \]

The multiple regression model used in this study is described below.

Model 3 or mod_3 in R

\[ \mu_3(\boldsymbol{x}) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 \]

The five independent variables used in this model are gender, lunch, parental level of education, test preparation course and race/ethnicity. These attributes of the students were used to predict the standardized performance score of the students in this study.

The loss function used in this study is the RMSE, which is estimated for the training set, the test set and the validation set. The validation set is the final holdout on which the chosen model is evaluated. To perform the remaining split, we randomly select some observations from the train set for the estimation (est) set; the remainder is used for the validation (val) set:

\[ \mathcal{D}_{\texttt{trn}} = \mathcal{D}_{\texttt{est}} \cup \mathcal{D}_{\texttt{val}} \]

Regression Metrics

If our goal is to “predict”, then we want small errors. In general there are two types of errors we consider:

  • Squared errors: \(\left(y_i - \hat{\mu}(\boldsymbol{x}_i)\right)^2\)
  • Absolute errors: \(|y_i - \hat{\mu}(\boldsymbol{x}_i)|\)

In both cases, we want to consider the average error made. We define three metrics based on them: the MAE, the MSE and the RMSE.

Mean Absolute Error - MAE

The mean absolute error is the average of the absolute differences between the predicted and actual values. The MAE is given by this formula:

\[MAE=\frac{1}{N}\sum_{i=1}^{N}|\hat y_i - y_i|\]

where \(N\) is the number of observations, \(\hat y_i\) is the predicted value and \(y_i\) is the true value.

Mean Squared Error - MSE

The mean squared error is the mean of the squared differences between the predicted and actual values:

\[MSE=\frac{1}{N}\sum_{i=1}^{N}(\hat{y}_i-y_i)^2\]

Root Mean Squared Error - RMSE

The Root Mean Squared Error, RMSE, is the square root of the MSE. It is the typical metric to evaluate models, and is defined by the formula:

\[RMSE=\sqrt{\frac{1}{N}\sum_{i=1}^{N}(\hat{y}_i-y_i)^2}\]

Like the MSE, the RMSE penalizes large deviations and is appropriate when large errors are undesirable. Unlike the MSE, the error has the same unit as the measurement.
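As a quick toy illustration (the numbers below are invented for demonstration and are not from the dataset), the three metrics can be computed directly in R:

# Hypothetical actual and predicted values, for illustration only
actual <- c(3, 5, 7)
predicted <- c(2.5, 5, 8)
mean(abs(predicted - actual)) # MAE = 0.5
mean((predicted - actual)^2) # MSE = 0.4167
sqrt(mean((predicted - actual)^2)) # RMSE = 0.6455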

Data Preparation

For student performance prediction, the three scores were summed to create a new column in the dataset called grandscore. The independent variables will be used to predict the standardized grandscore (stdscore).

#############>>>>>>Sum the scores to create grandscore
StudentsPerf <- StudentsPerf %>%
  mutate(grandscore = math.score + reading.score + writing.score)

############>>>>>>>>Standardize the grandscore
StudentsPerf <- StudentsPerf %>%
  mutate(stdscore = (grandscore - mean(grandscore)) / sd(grandscore))
dim(StudentsPerf)
## [1] 1000   10

There are 1000 observations and 10 variables. Among these variables, we will perform some tests to identify the important ones using two methods: the Boruta method, which performs several iterations to select important variables, and the nearZeroVar function, which examines the variance of the variables and flags those whose variance is almost zero. These methods will guide our choice of predictor variables in the model.

Variable Importance

Excluding the standardized score (the target), there are nine (9) candidate variables in this dataset. In order to keep the work simple and straightforward, we check which of these variables are important predictors of the dependent variable. Two methods were used to do this.

Variable Importance Method 1 - Boruta Library

This library runs several iterations to select which variables are good enough to predict the dependent variable. It gives detailed yet simple output in summary tables and visualizations, and it assists the model development process by suggesting ready-made model specification formulas. For reproducibility, the set.seed command has to be executed before running Boruta.
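The Boruta object ‘bor’ used below was presumably created along the following lines; the seed value shown is an assumption, since the actual call is not reproduced here:

library(Boruta)
set.seed(123) # assumed seed; the report does not show the value used
bor <- Boruta(stdscore ~ ., data = StudentsPerf) # runs iterations against shadow attributes to confirm or reject each variable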

Parental level of education and race/ethnicity are considered not important; these may not be included in the model.

bor #7 important attributes
## Boruta performed 32 iterations in 29.71796 secs.
##  7 attributes confirmed important: gender, grandscore, lunch,
## math.score, reading.score and 2 more;
##  2 attributes confirmed unimportant: parental.level.of.education,
## race.ethnicity;
plot(bor, las = 2, cex.axis = 0.7)

The variables/boxes in green are confirmed to be significant attributes for predicting the dependent variable, while the red boxes are rejected. To examine whether there are tentative attributes, use the ‘TentativeRoughFix’ command.

borten<-TentativeRoughFix(bor) #View only the tentative attributes
## Warning in TentativeRoughFix(bor): There are no Tentative attributes! Returning
## original object.
borten
## Boruta performed 32 iterations in 29.71796 secs.
##  7 attributes confirmed important: gender, grandscore, lunch,
## math.score, reading.score and 2 more;
##  2 attributes confirmed unimportant: parental.level.of.education,
## race.ethnicity;

As seen from the graph, there are no tentative attributes, only rejected ones. To view a detailed report in tabular form, the ‘attStats’ command is used, while ‘getNonRejectedFormula’ returns the non-rejected attributes as a model formula.

attStats(bor) #detailed report
##                                meanImp  medianImp     minImp    maxImp normHits
## gender                      13.5443661 13.7742215 11.4908395 15.159592  1.00000
## race.ethnicity               1.1439511  0.9174042 -0.2903372  4.206549  0.15625
## parental.level.of.education  0.7693899  0.6447008 -1.3248907  3.007355  0.21875
## lunch                        6.1336775  5.9623852  4.1351947  7.461219  1.00000
## test.preparation.course      5.7285073  5.9389954  4.1405425  6.759400  1.00000
## math.score                  19.0732639 19.0263362 17.6721824 20.301525  1.00000
## reading.score               21.6313467 21.5991777 20.1474032 22.779106  1.00000
## writing.score               20.7362906 20.7036012 18.9383239 22.038797  1.00000
## grandscore                  29.1267178 28.9696051 27.5271532 31.254614  1.00000
##                              decision
## gender                      Confirmed
## race.ethnicity               Rejected
## parental.level.of.education  Rejected
## lunch                       Confirmed
## test.preparation.course     Confirmed
## math.score                  Confirmed
## reading.score               Confirmed
## writing.score               Confirmed
## grandscore                  Confirmed
#to get the non-rejected attributes
getNonRejectedFormula(bor) #returns all important attributes in a way that can be used in a model
## stdscore ~ gender + lunch + test.preparation.course + math.score + 
##     reading.score + writing.score + grandscore
## <environment: 0x000000002050cce8>

The variables whose normHits value tends to one are very important for predicting the score. Therefore we will focus on exploring the ‘gender’, ‘lunch’ and ‘test.preparation.course’ variables; the math, reading and writing scores will also be inspected thoroughly. To show only the confirmed attributes as a formula we can use directly in our model, the ‘getConfirmedFormula’ function is used.

getConfirmedFormula(bor) #among the accepted attributes, returns the confirmed ones
## stdscore ~ gender + lunch + test.preparation.course + math.score + 
##     reading.score + writing.score + grandscore
## <environment: 0x000000001635dee0>

Method 2 of Variable Selection or Inspection - The nearZeroVar Function

The nearZeroVar function is used to inspect variables which are constant or close to constant, so that they can be removed from the model.

###########>>>To check variables that are close to constants, i.e. have near-zero variance
nearZeroVar(StudentsPerf, saveMetrics = TRUE)
##                             freqRatio percentUnique zeroVar   nzv
## gender                       1.074689           0.2   FALSE FALSE
## race.ethnicity               1.217557           0.5   FALSE FALSE
## parental.level.of.education  1.018018           0.6   FALSE FALSE
## lunch                        1.816901           0.2   FALSE FALSE
## test.preparation.course      1.793296           0.2   FALSE FALSE
## math.score                   1.028571           8.1   FALSE FALSE
## reading.score                1.030303           7.2   FALSE FALSE
## writing.score                1.060606           7.7   FALSE FALSE
## grandscore                   1.071429          19.4   FALSE FALSE
## stdscore                     1.071429          19.4   FALSE FALSE

All the variables returned FALSE in the zeroVar and nzv columns; that is, their variances are not zero and they are not constants. This test indicates that all the variables may be used in model building.

Exploratory Data Analysis

To explore the student performance data, different techniques were used to facilitate a better understanding of the dataset and of how the variables interact with each other.

The ‘glimpse’ command functions like the ‘summary’ command; here, the type of each variable is identified.

glimpse(StudentsPerf)
## Rows: 1,000
## Columns: 10
## $ gender                      <chr> "female", "female", "female", "male", "mal~
## $ race.ethnicity              <chr> "group B", "group C", "group B", "group A"~
## $ parental.level.of.education <chr> "bachelor's degree", "some college", "mast~
## $ lunch                       <chr> "standard", "standard", "standard", "free/~
## $ test.preparation.course     <chr> "none", "completed", "none", "none", "none~
## $ math.score                  <int> 72, 69, 90, 47, 76, 71, 88, 40, 64, 38, 58~
## $ reading.score               <int> 72, 90, 95, 57, 78, 83, 95, 43, 64, 60, 54~
## $ writing.score               <int> 74, 88, 93, 44, 75, 78, 92, 39, 67, 50, 52~
## $ grandscore                  <int> 218, 247, 278, 148, 229, 232, 275, 122, 19~
## $ stdscore                    <dbl> 0.3434024, 1.0214164, 1.7461900, -1.293183~
#There are five character variables, four integers and one double

To create a good model, the character variables must be transformed to factors, while the integers are transformed to numeric (double).

#########<<<<<<Turn the character variables to factor variables
StudentsPerf <- StudentsPerf %>%
  mutate_if(is.character, as.factor) #this turns all the character variables to factors
StudentsPerf <- StudentsPerf%>%
  mutate_if(is.integer, as.numeric) #this turns all the integer variables to numeric
StudentsPerf <- StudentsPerf %>%
  mutate_if(is.factor, as.numeric)

Again we ‘glimpse’ to ensure that the transformations have taken place.

glimpse(StudentsPerf)
## Rows: 1,000
## Columns: 10
## $ gender                      <dbl> 1, 1, 1, 2, 2, 1, 1, 2, 2, 1, 2, 2, 1, 2, ~
## $ race.ethnicity              <dbl> 2, 3, 2, 1, 3, 2, 2, 2, 4, 2, 3, 4, 2, 1, ~
## $ parental.level.of.education <dbl> 2, 5, 4, 1, 5, 1, 5, 5, 3, 3, 1, 1, 3, 5, ~
## $ lunch                       <dbl> 2, 2, 2, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 2, ~
## $ test.preparation.course     <dbl> 2, 1, 2, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2, 1, ~
## $ math.score                  <dbl> 72, 69, 90, 47, 76, 71, 88, 40, 64, 38, 58~
## $ reading.score               <dbl> 72, 90, 95, 57, 78, 83, 95, 43, 64, 60, 54~
## $ writing.score               <dbl> 74, 88, 93, 44, 75, 78, 92, 39, 67, 50, 52~
## $ grandscore                  <dbl> 218, 247, 278, 148, 229, 232, 275, 122, 19~
## $ stdscore                    <dbl> 0.3434024, 1.0214164, 1.7461900, -1.293183~

Now the variables are all numeric (the factors were further coerced to numeric codes), hence ready for model building. To answer the question of whether there are missing values, we sum the logical values returned by is.na().

sum(is.na(StudentsPerf))
## [1] 0
#No missing values

The data contains no missing observations in any of the variables.

VISUALIZING THE DATA

This section provides a graphical representation of each variable in the data. Visualizing the distribution of the gender variable, we have:

Tab1<- table(StudentsPerf$gender)
Tab1
## 
##   1   2 
## 518 482
barchart(Tab1, horizontal = FALSE)

There are slightly more females (518) than males (482) in this dataset.

grid <- matrix(c(1,2,3,4),
               nrow = 2, ncol = 2, byrow = TRUE)

layout(grid)
hist(StudentsPerf$math.score)
hist(StudentsPerf$reading.score)
hist(StudentsPerf$writing.score)
hist(StudentsPerf$grandscore)

The histograms of the four numeric variables show that the variables are approximately normally distributed; therefore there is no need for any transformation. We further check the actual skewness values of the four variables.

To compute skewness

skewness(StudentsPerf$math.score, na.rm = TRUE)
## [1] -0.2785166
skewness(StudentsPerf$reading.score, na.rm = TRUE)
## [1] -0.2587157
skewness(StudentsPerf$writing.score, na.rm = TRUE)
## [1] -0.2890096
skewness(StudentsPerf$grandscore, na.rm = TRUE)
## [1] -0.2986083

Although they are all slightly tilted to the left (negatively skewed), the absolute skewness values are well below 1, so the scale variables can be treated as approximately normally distributed.

Relationship Between Variables

spcor<-cor(StudentsPerf[,-c(1:5)])
corrplot(spcor, method = "number", type = "lower")

#Correlation between lunch and high scores 
ggplot(StudentsPerf, aes(x=math.score, y=grandscore, color=lunch)) + geom_point(alpha=0.5) 

ggplot(StudentsPerf, aes(x=reading.score, y=grandscore, color=lunch)) + geom_point(alpha=0.5)

ggplot(StudentsPerf, aes(x=writing.score, y=grandscore, color=lunch)) + geom_point(alpha=0.5)

There is a strong positive relationship among the scores. The commands below show both the strength of these relationships and their significance levels; the relationships are significantly correlated.

cor.test(StudentsPerf$math.score, StudentsPerf$reading.score) #significant and strongly correlated
## 
##  Pearson's product-moment correlation
## 
## data:  StudentsPerf$math.score and StudentsPerf$reading.score
## t = 44.855, df = 998, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7959276 0.8371428
## sample estimates:
##       cor 
## 0.8175797
cor.test(StudentsPerf$math.score, StudentsPerf$writing.score) #significant and strongly correlated
## 
##  Pearson's product-moment correlation
## 
## data:  StudentsPerf$math.score and StudentsPerf$writing.score
## t = 42.511, df = 998, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7794321 0.8236517
## sample estimates:
##      cor 
## 0.802642
cor.test(StudentsPerf$reading.score, StudentsPerf$writing.score) #significant and strongly correlated
## 
##  Pearson's product-moment correlation
## 
## data:  StudentsPerf$reading.score and StudentsPerf$writing.score
## t = 101.23, df = 998, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9487506 0.9597921
## sample estimates:
##       cor 
## 0.9545981

The Pearson correlation estimate is used because all the variables are numeric. The output shows strong and significant relationships among the variables.

#There is a clear divide between the score on each test and gender. Males tended to have higher math scores and lower reading and writing scores than females.
ggplot(StudentsPerf, aes(x=math.score, y=grandscore, color=gender)) + geom_point(alpha=0.5) + theme_wsj()

ggplot(StudentsPerf, aes(x=reading.score, y=grandscore, color=gender)) + geom_point(alpha=0.5)+theme_wsj()

ggplot(StudentsPerf, aes(x=writing.score, y=grandscore, color=gender)) + geom_point(alpha=0.5)+theme_wsj()

#Hence let's investigate further using the lunch variable
#It can be observed that more students with standard lunch are high performers while more students with free/reduced lunch are low performers

classbylunch <- StudentsPerf %>%
  group_by(lunch) %>%
  summarize(math.avg=mean(math.score), reading.avg=mean(reading.score), writing.avg=mean(writing.score), grandscoreavg=mean(grandscore))
classbylunch
## # A tibble: 2 x 5
##   lunch math.avg reading.avg writing.avg grandscoreavg
##   <dbl>    <dbl>       <dbl>       <dbl>         <dbl>
## 1     1     58.9        64.7        63.0          187.
## 2     2     70.0        71.7        70.8          213.
sd(StudentsPerf$grandscore)
## [1] 42.77198
mean(StudentsPerf$grandscore)
## [1] 203.312
min(StudentsPerf$grandscore)
## [1] 27
max(StudentsPerf$grandscore)
## [1] 300
meanlesssd = mean(StudentsPerf$grandscore) - sd(StudentsPerf$grandscore) #160.54
meanplusssd = mean(StudentsPerf$grandscore) + sd(StudentsPerf$grandscore) #246.084
meanlesssd
## [1] 160.54
meanplusssd
## [1] 246.084
highlowbylunchviz <- StudentsPerf %>%
  group_by(lunch) %>%
  filter(grandscore < 160.54 | grandscore > 246.084)

Students who had standard lunch performed better and cluster at the high-score side of the graph, while those who had free/reduced lunch sit at the bottom of the class. Total score, gender and lunch were used, along with writing score, because writing had the largest difference in average score between males and females compared to the other tests.

Visualizing scores by gender

#The females had higher scores than the males in general, except in mathematics
scoresbysex <- StudentsPerf %>%
  group_by(gender) %>%
  summarise(math.avg=mean(math.score), reading.avg=mean(reading.score), writing.avg=mean(writing.score), grandscoreavg=mean(grandscore))
scoresbysex
## # A tibble: 2 x 5
##   gender math.avg reading.avg writing.avg grandscoreavg
##    <dbl>    <dbl>       <dbl>       <dbl>         <dbl>
## 1      1     63.6        72.6        72.5          209.
## 2      2     68.7        65.5        63.3          198.
#########>>>>Plot the sex by how the students prepared - completed or not
####There is not much variation in test preparation among the sexes....this variable might be excluded from the model
ggplot(StudentsPerf, aes(x=gender, fill=test.preparation.course)) + geom_bar(position = 'dodge') + theme_wsj()

table(StudentsPerf$gender, StudentsPerf$test.preparation.course)
##    
##       1   2
##   1 184 334
##   2 174 308
#########>>>>Plot the sex by lunch type - free/reduced or standard
####There is not much variation in lunch type among the sexes

ggplot(StudentsPerf, aes(x=gender, fill=lunch)) + geom_bar(position = 'dodge')

table(StudentsPerf$gender, StudentsPerf$lunch)
##    
##       1   2
##   1 189 329
##   2 166 316
############>>>>>>Visualizing sex by parent's education for variations
ggplot(StudentsPerf, aes(x=gender, fill=parental.level.of.education)) + geom_bar(position = 'dodge')+ theme_wsj()

table(StudentsPerf$gender, StudentsPerf$parental.level.of.education)
##    
##       1   2   3   4   5   6
##   1 116  63  94  36 118  91
##   2 106  55 102  23 108  88
#####Not much variation across gender
###########>>>>>>>>>>>>> Gender by race
ggplot(StudentsPerf, aes(x=gender, fill=race.ethnicity)) + geom_bar(position = 'dodge') + theme_wsj()

table(StudentsPerf$gender, StudentsPerf$race.ethnicity)
##    
##       1   2   3   4   5
##   1  36 104 180 129  69
##   2  53  86 139 133  71
#####Females had a much higher number of students in racial group C than males. Racial group C has the median average total score,
#######likely bringing the female score up. The males had a higher number of students in racial group A, which has the lowest average score, likely bringing their score down

The women in the Group B and Group C ethnicity groups had higher scores than those in the others. From the EDA, the only clear differences among the factor variables were observed in race/ethnicity, gender, writing score and lunch.

CHAPTER THREE

MODEL FITTING AND RESULTS

The model assessment estimate used is the RMSE. The RMSE is a function of the errors produced by the model; the smaller the RMSE value (tending to zero), the better the model. The RMSE and its companion loss functions are defined below.

##############>>>>>>>Define the loss functions<<<<<<######
# Define Mean Absolute Error (MAE)
MAE <- function(true_values, predicted_values){
  mean(abs(true_values - predicted_values))
}

# Define Mean Squared Error (MSE)
MSE <- function(true_values, predicted_values){
  mean((true_values - predicted_values)^2)
}

# Define Root Mean Squared Error (RMSE)
RMSE <- function(true_values, predicted_values){
  sqrt(mean((true_values - predicted_values)^2))
}

Create Data Partitioning

The ‘createDataPartition’ function from the caret library is used to partition the data into three parts. The validation set, the final holdout set, is 10% of the entire dataset. The remaining 90% is the training set, which is further partitioned 90:10; the new 10% is the testset while the new 90% is the trainset. The 0.9:0.1 ratio is used because the dataset is not large, so more of it is assigned to training the model.
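The partitioning code itself is not shown in this report; a sketch consistent with the object names and sample sizes used later (the seed value is an assumption) would be:

library(caret)
set.seed(1) # assumed seed; the actual value is not shown
# 90% training pool, 10% final holdout (validation)
train_index <- createDataPartition(StudentsPerf$stdscore, times = 1, p = 0.9, list = FALSE)
StudentsPerf.training <- StudentsPerf[train_index, ]
studentPerf.validation <- StudentsPerf[-train_index, ]
# split the training pool again 90:10 into the trainset and testset
tt_index <- createDataPartition(StudentsPerf.training$stdscore, times = 1, p = 0.9, list = FALSE)
studentPerf.trainset <- StudentsPerf.training[tt_index, ]
studentPerf.testset <- StudentsPerf.training[-tt_index, ]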

Model Fitting

The first model used in this study is the linear regression model.

stdpf.lm <- lm(stdscore ~ test.preparation.course + lunch + parental.level.of.education +race.ethnicity + gender,
               data = studentPerf.trainset)
summary(stdpf.lm) # To examine the Adjusted R square
## 
## Call:
## lm(formula = stdscore ~ test.preparation.course + lunch + parental.level.of.education + 
##     race.ethnicity + gender, data = studentPerf.trainset)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5383 -0.5550  0.0231  0.5983  2.1300 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  0.07932    0.20597   0.385   0.7003    
## test.preparation.course     -0.57247    0.06530  -8.767  < 2e-16 ***
## lunch                        0.58340    0.06550   8.907  < 2e-16 ***
## parental.level.of.education -0.04192    0.01709  -2.453   0.0144 *  
## race.ethnicity               0.13553    0.02709   5.004 6.91e-07 ***
## gender                      -0.25666    0.06268  -4.095 4.65e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8906 on 804 degrees of freedom
## Multiple R-squared:  0.2024, Adjusted R-squared:  0.1974 
## F-statistic:  40.8 on 5 and 804 DF,  p-value: < 2.2e-16
stdpf.pred <- predict(stdpf.lm, newdata = studentPerf.testset)

comparetest.pred <- cbind(studentPerf.testset$stdscore, stdpf.pred)
par(mar = rep(2, 4))
plot(comparetest.pred) #The predicted scores and the test scores are plotted

#Checking for the RMSE
result <- tibble(Method = "Model", RMSE = NA , MSE = NA, MAE = NA)
result <- bind_rows(result, 
                    tibble(Method = " Linear Model for standardized Grand Score",
                RMSE = RMSE(studentPerf.testset$stdscore, stdpf.pred),
                MSE = MSE(studentPerf.testset$stdscore, stdpf.pred),
                MAE = MAE(studentPerf.testset$stdscore, stdpf.pred)))
result
## # A tibble: 2 x 4
##   Method                                         RMSE    MSE    MAE
##   <chr>                                         <dbl>  <dbl>  <dbl>
## 1 "Model"                                      NA     NA     NA    
## 2 " Linear Model for standardized Grand Score"  0.929  0.863  0.779
k <- data.frame(studentPerf.trainset$gender, studentPerf.trainset$race.ethnicity , studentPerf.trainset$parental.level.of.education , studentPerf.trainset$lunch, studentPerf.trainset$test.preparation.course)

j <- data.frame(studentPerf.testset$gender, studentPerf.testset$race.ethnicity , studentPerf.testset$parental.level.of.education , studentPerf.testset$lunch, studentPerf.testset$test.preparation.course)


x = as.matrix(k)
y_train = studentPerf.trainset$stdscore

x_test = as.matrix(j)
y_test = studentPerf.testset$stdscore

lambdas <- 10^seq(2, -3, by = -.1)
ridge_regl2 = glmnet(x, y_train, nlambda = 25, alpha = 0, family = 'gaussian', lambda = lambdas)

summary(ridge_regl2)
##           Length Class     Mode   
## a0         51    -none-    numeric
## beta      255    dgCMatrix S4     
## df         51    -none-    numeric
## dim         2    -none-    numeric
## lambda     51    -none-    numeric
## dev.ratio  51    -none-    numeric
## nulldev     1    -none-    numeric
## npasses     1    -none-    numeric
## jerr        1    -none-    numeric
## offset      1    -none-    logical
## call        7    -none-    call   
## nobs        1    -none-    numeric
cv_ridge <- cv.glmnet(x, y_train, alpha = 0, lambda = lambdas)
optimal_lambda <- cv_ridge$lambda.min
optimal_lambda
## [1] 0.01995262

The optimal lambda value comes out to be 0.01995262 and will be used to build the ridge regression predictions.

predictions_train <- predict(ridge_regl2, s = optimal_lambda, newx = x)

predictions_test <- predict(ridge_regl2, s = optimal_lambda, newx = x_test)
#Checking for the RMSE
result <- bind_rows(result, 
                    tibble(Method = "Regularization Model for Standardized Score",
                RMSE = RMSE(predictions_test, studentPerf.testset$stdscore),
                MSE = MSE(predictions_test, studentPerf.testset$stdscore),
                MAE = MAE(predictions_test, studentPerf.testset$stdscore)))
result
## # A tibble: 3 x 4
##   Method                                          RMSE    MSE    MAE
##   <chr>                                          <dbl>  <dbl>  <dbl>
## 1 "Model"                                       NA     NA     NA    
## 2 " Linear Model for standardized Grand Score"   0.929  0.863  0.779
## 3 "Regularization Model for Standardized Score"  0.930  0.864  0.779

USING A TREE-BASED MODEL

Although this model is labelled ‘Random Forest’ in the code and the results table, it is fitted with rpart, which builds a single regression tree. The subject scores and the grandscore are excluded from the formula so that only the remaining attributes predict the standardized score.

##########>>>Regression tree model (rpart)<<<<<<<<<#########
fit_rf <- rpart(stdscore ~ . -math.score -reading.score -writing.score -grandscore , data = studentPerf.trainset)

fit_rf$variable.importance
##     test.preparation.course                       lunch 
##                  63.0916191                  61.7739496 
##                      gender              race.ethnicity 
##                  11.3163140                   1.3498422 
## parental.level.of.education 
##                   0.1414539
# plot our regression tree 
plot(fit_rf, uniform=TRUE)
# add text labels & make them 60% as big as they are by default
text(fit_rf, cex=.6)

#Making Predictions

stdpf.rfpred <- predict(fit_rf, newdata = studentPerf.testset)

#Checking for the RMSE
result <- bind_rows(result, 
                    tibble(Method = "Random Forest Model for the standardized Score",
                RMSE = RMSE(studentPerf.testset$stdscore, stdpf.rfpred),
                MSE = MSE(studentPerf.testset$stdscore, stdpf.rfpred),
                MAE = MAE(studentPerf.testset$stdscore, stdpf.rfpred)))
result
## # A tibble: 4 x 4
##   Method                                             RMSE    MSE    MAE
##   <chr>                                             <dbl>  <dbl>  <dbl>
## 1 "Model"                                          NA     NA     NA    
## 2 " Linear Model for standardized Grand Score"      0.929  0.863  0.779
## 3 "Regularization Model for Standardized Score"     0.930  0.864  0.779
## 4 "Random Forest Model for the standardized Score"  0.980  0.961  0.811

Validation Set

Since the linear model gave the best (smallest) RMSE value on the test set, 0.929, the validation set (holdout set) will be tested using this model.

val.lm <- lm(stdscore ~ test.preparation.course + lunch + parental.level.of.education +race.ethnicity + gender,
               data = StudentsPerf.training)
summary(val.lm) #Adjusted R square is 0.1985
## 
## Call:
## lm(formula = stdscore ~ test.preparation.course + lunch + parental.level.of.education + 
##     race.ethnicity + gender, data = StudentsPerf.training)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5400 -0.5754  0.0164  0.6075  2.1418 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  0.04788    0.19759   0.242  0.80861    
## test.preparation.course     -0.55174    0.06219  -8.872  < 2e-16 ***
## lunch                        0.59409    0.06243   9.516  < 2e-16 ***
## parental.level.of.education -0.04562    0.01631  -2.797  0.00527 ** 
## race.ethnicity               0.14392    0.02602   5.531 4.17e-08 ***
## gender                      -0.27873    0.05973  -4.667 3.53e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8943 on 894 degrees of freedom
## Multiple R-squared:  0.2029, Adjusted R-squared:  0.1985 
## F-statistic: 45.52 on 5 and 894 DF,  p-value: < 2.2e-16
val.pred <- predict(val.lm, newdata = studentPerf.validation)

comparetest.valpred <- cbind(studentPerf.validation$stdscore, val.pred)
par(mar = rep(2, 4))
plot(comparetest.valpred) #The predicted scores and the validation scores are plotted

#Checking for the RMSE
result <- bind_rows(result, 
                    tibble(Method = " Linear Model Validation for standardized Grand Score",
                RMSE = RMSE(studentPerf.validation$stdscore, val.pred),
                MSE = MSE(studentPerf.validation$stdscore, val.pred),
                MAE = MAE(studentPerf.validation$stdscore, val.pred)))
result
## # A tibble: 5 x 4
##   Method                                                    RMSE    MSE    MAE
##   <chr>                                                    <dbl>  <dbl>  <dbl>
## 1 "Model"                                                 NA     NA     NA    
## 2 " Linear Model for standardized Grand Score"             0.929  0.863  0.779
## 3 "Regularization Model for Standardized Score"            0.930  0.864  0.779
## 4 "Random Forest Model for the standardized Score"         0.980  0.961  0.811
## 5 " Linear Model Validation for standardized Grand Score"  0.883  0.779  0.734

The validation run gave an even better estimate, RMSE = 0.8827046. The linear model is hence a good predictor of the student performance data and is recommended.

CHAPTER FOUR

The study made use of some basic characteristics of students to predict their general performance in mathematics, reading and writing as provided by the data. Several models were used to execute the machine learning algorithms, and the model which gave the least error was validated.

The study is applicable to decision makers and education managers who wish to promote the factors that are significantly responsible for high grades. Student performance is an important construct in education, as attaining high student performance is one of the fundamental objectives of academic institutions. Investors would also be interested in schools that have high student performance, and more so in schools that implement machine learning algorithms to consciously measure and improve the performance of their students.

Conclusion

According to the data available3, and among the models used, the ordinary linear model had the least RMSE for predicting student performance using the attributes and standardized scores. In this model, all the independent variables, comprising lunch type, parental level of education, race/ethnicity, test preparation course and gender, are significant predictors of student performance.

It is also noted that although the Boruta package rejected race/ethnicity and parental level of education, the visualization, the near-zero-variance check and the fitted model showed that they are significant and can be used in predicting students' performance.

Limitation

The dataset had previously been used only for exploratory data analysis; the whole dataset is not sufficiently large, hence the train and test sets are also small. The validation test showed an improved RMSE; this could be because the training and validation sets are larger than the ones used in model development. Gathering more data could therefore improve these models and perhaps yield more than one good model.

The data is historical and was not generated in real time. There can be real-time changes and seasonal effects, not measured here, which could affect the distribution and composition of the data. Data gathered live may not be normally distributed (standardizing here only rescales the scores), so other models might express the data as well as, or better than, the linear model.

Future Work

Future studies can examine this data in an automated, online, real-time environment with the purpose of integrating it into a school management system. Also, more research needs to be done on how and why variables like lunch, sex and level of preparation affect the performance of students.

Student performance is a continuous quest as cultures and civilizations evolve. Future work should consider factors such as daily time spent on social media, hobbies and other domestic factors related to urbanization.

References

  1. Bickel Peter J and Li Bo (2006), Regularization in Statistics

  2. Jason Brownlee (2019), A Gentle Introduction to Matrix Factorization for Machine Learning

  3. Jason Brownlee (2020), Machine Learning Mastery with R, Get started, Build Accurate Models, and Work through Projects step-by-step

  4. Ong Cheng Soon (2005), Kernels: Regularization and Optimization

  5. Rafael A. Irizarry (2019), Introduction to Data Science: Data Analysis and Prediction Algorithms with R

  6. Vijay Kotu, Bala Deshpande (2019), Recommendation System: Matrix Factorization (https://www.sciencedirect.com/topics/computer-science/matrix-factorization)

  7. Yixuan Qiu, et al. (2020), recosystem: Recommendation System Using Parallel Matrix Factorization

  8. Miron Bartosz Kursa, (2020), Boruta: Wrapper Algorithm for All Relevant Feature Selection

  9. Taiyun Wei, (2021), corrplot: Visualization of a Correlation Matrix


  1. https://www.edx.org/professional-certificate/harvardx-data-science↩︎

  2. https://www.kaggle.com/adithyabshetty100/student-performance↩︎

  3. ↩︎