knitr::opts_chunk$set(
  message = FALSE,
  warning = FALSE,
  fig.align = "center",
  comment = "#>"
)

Introduction

The performance index represents each student’s academic performance and has been rounded to the nearest integer. The index ranges from 10 to 100, with higher values indicating better performance. In this Programming for Data Science with R project, we build a regression model to predict the students' performance index. The data was collected from Kaggle.

Data Preparation

Import Library

library(tidyverse)
library(caret)
library(plotly)
library(data.table)
library(GGally)
library(tidymodels)
library(car)
library(scales)
library(lmtest)
library(MLmetrics)
library(inspectdf)
library(performance)

options(scipen = 100, max.print = 1e+06)

Load Dataset

student <- read.csv("datainput/Student_Performance.csv")
head(student)

#>   Hours.Studied Previous.Scores Extracurricular.Activities Sleep.Hours
#> 1             7              99                        Yes           9
#> 2             4              82                         No           4
#> 3             8              51                        Yes           7
#> 4             5              52                        Yes           5
#> 5             7              75                         No           8
#> 6             3              78                         No           9
#>   Sample.Question.Papers.Practiced Performance.Index
#> 1                                1                91
#> 2                                2                65
#> 3                                2                45
#> 4                                2                36
#> 5                                5                66
#> 6                                6                61

Column Description

Hours Studied : The total number of hours spent studying by each student.
Previous Scores : The scores obtained by students in previous tests.
Extracurricular Activities : Whether the student participates in extracurricular activities (Yes or No).
Sleep Hours : The average number of hours of sleep the student had per day.
Sample Question Papers Practiced : The number of sample question papers the student practiced.

Target Variable:

Performance Index : A measure of the overall performance of each student.

Data Processing

Check general data information

glimpse(student)

#> Rows: 10,000
#> Columns: 6
#> $ Hours.Studied                    <int> 7, 4, 8, 5, 7, 3, 7, 8, 5, 4, 8, 8, 3…
#> $ Previous.Scores                  <int> 99, 82, 51, 52, 75, 78, 73, 45, 77, 8…
#> $ Extracurricular.Activities       <chr> "Yes", "No", "Yes", "Yes", "No", "No"…
#> $ Sleep.Hours                      <int> 9, 4, 7, 5, 8, 9, 5, 4, 8, 4, 4, 6, 9…
#> $ Sample.Question.Papers.Practiced <int> 1, 2, 2, 2, 5, 6, 6, 6, 2, 0, 5, 2, 2…
#> $ Performance.Index                <dbl> 91, 65, 45, 36, 66, 61, 63, 42, 61, 6…

The output above shows that the dataset has 6 columns and 10,000 rows, along with the data type of each column. Checking the data types is a crucial step because each column must have a type that is appropriate for the analysis.

Missing Value

colSums(is.na(student))

#>                    Hours.Studied                  Previous.Scores 
#>                                0                                0 
#>       Extracurricular.Activities                      Sleep.Hours 
#>                                0                                0 
#> Sample.Question.Papers.Practiced                Performance.Index 
#>                                0                                0

Missing values in a dataset can significantly affect the results of model prediction. The output above shows that no column has missing values.

Data Cleaning

student_clean <- student %>% 
  mutate_if(is.character, as.factor) #  change the data type of characters to a factor 
head(student_clean)

#>   Hours.Studied Previous.Scores Extracurricular.Activities Sleep.Hours
#> 1             7              99                        Yes           9
#> 2             4              82                         No           4
#> 3             8              51                        Yes           7
#> 4             5              52                        Yes           5
#> 5             7              75                         No           8
#> 6             3              78                         No           9
#>   Sample.Question.Papers.Practiced Performance.Index
#> 1                                1                91
#> 2                                2                65
#> 3                                2                45
#> 4                                2                36
#> 5                                5                66
#> 6                                6                61

Exploratory Data Analysis

Exploratory data analysis is the phase where we explore the variables and look for patterns that indicate correlation between them.

Find the Pearson correlation between features.

ggcorr(student_clean, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)

Insight :

The plot shows that Previous.Scores has a strong positive correlation with the Performance.Index variable.
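The correlation matrix behind a plot like this can be sketched in base R. This is a minimal illustration on simulated data, not the student dataset: the data frame `toy` and its columns are hypothetical. Pearson correlation is only defined for numeric columns, so the factor column is left out before calling cor().

```r
set.seed(1)

# Hypothetical toy data: two numeric columns and one factor column
toy <- data.frame(x = rnorm(50))
toy$y <- 2 * toy$x + rnorm(50)
toy$group <- factor(sample(c("Yes", "No"), 50, replace = TRUE))

# Keep only the numeric columns, then compute the Pearson correlation matrix
num_cols <- vapply(toy, is.numeric, logical(1))
cor_mat <- cor(toy[, num_cols], method = "pearson")
round(cor_mat, 2)
```

The same two steps applied to student_clean would give the numbers that the correlation plot displays as labels.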

Here is the distribution of values for each variable.

boxplot(student_clean)

Insight :

Based on the boxplot visualization, no outliers are found in any column and no variable shows an obviously skewed distribution, so the data can be analyzed further.
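The outlier reading of a boxplot can also be checked numerically. The sketch below uses boxplot.stats() from base R, which returns the points lying more than 1.5 times the box length beyond the hinges, the same points a boxplot draws individually. The data frame `toy` is hypothetical and only illustrates the idea.

```r
# Hypothetical toy data: "hours" contains one extreme value, "sleep" does not
toy <- data.frame(
  hours = c(2, 3, 4, 5, 6, 7, 8, 50),
  sleep = c(4, 5, 6, 7, 8, 9, 6, 7)
)

# For each column, list the values a boxplot would flag as outliers
outliers <- lapply(toy, function(col) boxplot.stats(col)$out)
outliers
```

An empty result for every numeric column corresponds to the "no outliers" reading of the plot.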

Modeling

Train-Test Split

Before building the predictive model, we split the data into a training dataset and a test dataset. We use the training dataset to fit the linear regression model. The test dataset serves as a comparison, to see whether the model overfits and fails to predict new data that it has not seen during training. We use 70% of the data for training and the remaining 30% for testing.

set.seed(100)
samplesize <- round(0.7 * nrow(student_clean), 0)
index <- sample(seq_len(nrow(student_clean)), size = samplesize)

data_train <- student_clean[index, ]
data_test <- student_clean[-index, ]
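A quick sanity check on a split like this one is to confirm the subset sizes and that no row appears in both subsets. The sketch below repeats the same indexing logic on a small hypothetical data frame (`toy`, with an `id` column added only to make the overlap check easy).

```r
set.seed(100)

# Hypothetical toy data with an explicit row identifier
toy <- data.frame(id = 1:10, y = rnorm(10))

n_train <- round(0.7 * nrow(toy))
idx <- sample(seq_len(nrow(toy)), size = n_train)

train <- toy[idx, ]
test <- toy[-idx, ]

c(train = nrow(train), test = nrow(test))  # sizes of the two subsets
length(intersect(train$id, test$id))       # number of shared rows
```

Because `-idx` drops exactly the sampled rows, the two subsets are disjoint and together cover the whole data frame.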

Linear Regression

We first fit a simple linear regression with Performance.Index as the target variable and Previous.Scores as the only predictor, because it has a strong positive correlation with the target.

set.seed(100)
student_lm <- lm(Performance.Index ~ Previous.Scores, 
                 data = data_train)

summary(student_lm)

#> 
#> Call:
#> lm(formula = Performance.Index ~ Previous.Scores, data = data_train)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -17.332  -6.455  -0.054   6.518  19.481 
#> 
#> Coefficients:
#>                   Estimate Std. Error t value            Pr(>|t|)    
#> (Intercept)     -15.637037   0.382547  -40.88 <0.0000000000000002 ***
#> Previous.Scores   1.021405   0.005344  191.14 <0.0000000000000002 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 7.762 on 6998 degrees of freedom
#> Multiple R-squared:  0.8393, Adjusted R-squared:  0.8392 
#> F-statistic: 3.654e+04 on 1 and 6998 DF,  p-value: < 0.00000000000000022

Insight :

The model with Previous.Scores as the only predictor has an adjusted R-squared of 0.8392, meaning that it explains 83.92% of the variance in the target variable (Performance.Index).
The p-value is < 0.00000000000000022. Because it is below 0.05, the predictor has a statistically significant effect on the target.
Previous.Scores and Performance.Index have a strong positive correlation, which we can confirm with the plot below.

plot(x = student_clean$Previous.Scores, y = student_clean$Performance.Index )
abline(student_lm, col = "red")
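For a regression with a single predictor, R-squared equals the squared Pearson correlation between predictor and target. With the multiple R-squared of 0.8393 reported in the summary, this implies a correlation of about 0.92 in the training data. The sketch below illustrates the identity on simulated data; the variables `x` and `y` and their coefficients are hypothetical stand-ins, not the student dataset.

```r
set.seed(42)

# Hypothetical data loosely shaped like a score-versus-score relationship
x <- runif(200, min = 40, max = 99)
y <- -15 + 1.02 * x + rnorm(200, sd = 8)

fit <- lm(y ~ x)
r <- cor(x, y)

# The two values agree up to floating-point error
c(r_squared = summary(fit)$r.squared, cor_squared = r^2)
```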

Stepwise Regression

The first step is to fit a linear regression model with Performance.Index as the target variable and all other columns as predictors.

The next step applies automated predictor selection using stepwise regression with the backward elimination method.

library(MASS)
student_all <- lm(Performance.Index ~ .,
                  data = data_train)
student_back <- stepAIC(student_all,
                        direction = "backward")

#> Start:  AIC=10032.3
#> Performance.Index ~ Hours.Studied + Previous.Scores + Extracurricular.Activities + 
#>     Sleep.Hours + Sample.Question.Papers.Practiced
#> 
#>                                    Df Sum of Sq     RSS   AIC
#> <none>                                            29294 10032
#> - Extracurricular.Activities        1       573   29867 10166
#> - Sample.Question.Papers.Practiced  1      2115   31409 10518
#> - Sleep.Hours                       1      4637   33931 11059
#> - Hours.Studied                     1    382486  411780 28532
#> - Previous.Scores                   1   2186947 2216241 40314

summary(student_back)

#> 
#> Call:
#> lm(formula = Performance.Index ~ Hours.Studied + Previous.Scores + 
#>     Extracurricular.Activities + Sleep.Hours + Sample.Question.Papers.Practiced, 
#>     data = data_train)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -7.0763 -1.3886 -0.0439  1.3599  8.7374 
#> 
#> Coefficients:
#>                                    Estimate Std. Error t value
#> (Intercept)                      -33.983798   0.151598 -224.17
#> Hours.Studied                      2.852506   0.009439  302.19
#> Previous.Scores                    1.018351   0.001409  722.59
#> Extracurricular.ActivitiesYes      0.572839   0.048963   11.70
#> Sleep.Hours                        0.476932   0.014334   33.27
#> Sample.Question.Papers.Practiced   0.191528   0.008523   22.47
#>                                             Pr(>|t|)    
#> (Intercept)                      <0.0000000000000002 ***
#> Hours.Studied                    <0.0000000000000002 ***
#> Previous.Scores                  <0.0000000000000002 ***
#> Extracurricular.ActivitiesYes    <0.0000000000000002 ***
#> Sleep.Hours                      <0.0000000000000002 ***
#> Sample.Question.Papers.Practiced <0.0000000000000002 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 2.047 on 6994 degrees of freedom
#> Multiple R-squared:  0.9888, Adjusted R-squared:  0.9888 
#> F-statistic: 1.239e+05 on 5 and 6994 DF,  p-value: < 0.00000000000000022

Insight :

The model with all five predictors has an adjusted R-squared of 0.9888, meaning that it explains 98.88% of the variance in the target variable (Performance.Index).
The p-value is < 0.00000000000000022. Because it is below 0.05, the predictors jointly have a statistically significant effect on the target.
Backward elimination removed no predictor (the <none> row has the lowest AIC), so “student_back” keeps all five predictors. It fits markedly better than the “student_lm” model.
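Backward elimination can be sketched with step() from base R, which follows the same idea as stepAIC(): start from the full model and drop a predictor only if doing so lowers the AIC. The example uses the built-in mtcars data and an arbitrarily chosen set of predictors, purely as an illustration.

```r
# Full model on the built-in mtcars data (predictors chosen only for illustration)
full <- lm(mpg ~ wt + hp + qsec + drat + disp, data = mtcars)

# Backward elimination; trace = 0 suppresses the step-by-step printout
reduced <- step(full, direction = "backward", trace = 0)

formula(reduced)

# AIC of the starting model and of the selected model (second element of extractAIC)
c(full = extractAIC(full)[2], reduced = extractAIC(reduced)[2])
```

If no removal lowers the AIC, the procedure returns the starting model unchanged, which is the situation in the stepAIC() output above.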

Model Evaluation

Compare Model Performance

compare_performance(student_lm, student_back)

#> # Comparison of Model Performance Indices
#> 
#> Name         | Model |   AIC (weights) |  AICc (weights) |   BIC (weights) |    R2 | R2 (adj.) |  RMSE | Sigma
#> --------------------------------------------------------------------------------------------------------------
#> student_lm   |    lm | 48559.0 (<.001) | 48559.0 (<.001) | 48579.6 (<.001) | 0.839 |     0.839 | 7.761 | 7.762
#> student_back |    lm | 29899.4 (>.999) | 29899.5 (>.999) | 29947.4 (>.999) | 0.989 |     0.989 | 2.046 | 2.047

Insight :

student_lm : R2 and adjusted R2 have the same value, 0.839.
student_back : the more complex model, with higher R2 and adjusted R2 values and a lower AIC value than the “student_lm” model.
Stepwise regression selects the formula with the lowest AIC value. A lower AIC means the model loses less information, balancing goodness of fit against the number of parameters.
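The AIC of the full model appears as 10032.3 in the stepAIC() output and as 29899.4 in the comparison table. Both can be reproduced from the numbers printed in this report, assuming the usual Gaussian definitions: stepAIC() relies on extractAIC(), which drops the constant terms of the log-likelihood, while the comparison table uses the full log-likelihood and also counts the residual variance as a parameter. The variable names below are hypothetical; the inputs are copied from the outputs above.

```r
n   <- 7000    # rows in data_train
rss <- 29294   # residual sum of squares, from the <none> row of the stepAIC() table
k   <- 6       # estimated coefficients: intercept + 5 slopes

# AIC as computed by extractAIC() for a linear model (constants dropped)
aic_step <- n * log(rss / n) + 2 * k

# AIC from the full Gaussian log-likelihood (adds the constants and one variance parameter)
aic_full <- n * (log(2 * pi) + 1) + n * log(rss / n) + 2 * (k + 1)

round(c(stepAIC_scale = aic_step, AIC_scale = aic_full), 1)
```

The two results land close to 10032.3 and 29899.4. The scales differ only by a constant for a fixed sample size, so the ranking of models fitted to the same data is the same under both.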

Root Mean Squared Error (RMSE)

The performance of our model (how well it predicts the target variable) can be measured with the root mean squared error:

RMSE = sqrt( (1/n) * sum( (predicted_i - actual_i)^2 ) )

RMSE squares the difference between the actual and predicted values, so predictions with larger errors are penalized more heavily than under MAE (mean absolute error). This metric is often used to compare two or more alternative models, even though it is harder to interpret than MAE. We can use the RMSE() function from the MLmetrics package. Below is the performance of the model with all predictors.

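# y_true must come from the same rows as the predictions; a longer vector such as
# student_clean$Performance.Index would be recycled against the shorter prediction vector
RMSE(y_true = data_train$Performance.Index, # actual values in the training data
     y_pred = predict(student_back, data_train)) # predictions for data_train
# compare_performance() above reports 2.046 for this training RMSE

RMSE(y_true = data_test$Performance.Index, # actual values in the test data
     y_pred = predict(student_back, data_test)) # predictions for data_test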

Insight :

For the “student_back” model, the RMSE is almost the same on the training dataset and the test dataset.
The “student_back” model has relatively small error values.
The model can be considered a good fit because it captures the pattern of the training data and still predicts new data well.
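To make the difference between RMSE and MAE concrete, here is a minimal base R sketch. The helper functions `rmse()` and `mae()` and the example vectors are hypothetical, written only for illustration; they also refuse vectors of different lengths, since R would otherwise recycle the shorter one and return a misleading number.

```r
# Hypothetical helpers; both stop if the two vectors differ in length
rmse <- function(actual, predicted) {
  stopifnot(length(actual) == length(predicted))
  sqrt(mean((actual - predicted)^2))
}

mae <- function(actual, predicted) {
  stopifnot(length(actual) == length(predicted))
  mean(abs(actual - predicted))
}

actual    <- c(10, 20, 30, 40)
predicted <- c(12, 18, 30, 50)   # errors: -2, 2, 0, -10

rmse(actual, predicted)  # sqrt(27), about 5.2
mae(actual, predicted)   # 3.5
```

The single error of 10 pulls RMSE well above MAE, which is the penalty on large errors described above.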

Assumptions Test

Normality Test

hist(student_back$residuals, breaks = 20)

Based on the histogram above, the residuals are approximately normally distributed. They are concentrated around 0, indicating that there are no extreme errors.
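A histogram is a visual check; a formal test can complement it. The sketch below uses shapiro.test() from base R on simulated values standing in for model residuals (the vector `res` is hypothetical). shapiro.test() accepts between 3 and 5000 values, so with 7,000 training residuals a random subsample would be needed.

```r
set.seed(100)

# Hypothetical stand-in for model residuals
res <- rnorm(7000, mean = 0, sd = 2)

# shapiro.test() accepts 3 to 5000 values, so test a random subsample
res_sample <- sample(res, 5000)
sw <- shapiro.test(res_sample)

sw$p.value  # a p-value above 0.05 means normality is not rejected
```

With samples this large the test can flag deviations too small to matter in practice, so it is best read together with the histogram rather than instead of it.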

Homoscedasticity Test

We expect the residuals of the model to have a variance that shows no pattern (they should spread randomly).

We can check this visually with a scatterplot of the predicted values (fitted values) against the residuals.

plot(student_back$fitted.values, student_back$residuals)
abline(h = 0, col = "red")

The residuals should be randomly scattered, with no visible pattern. To check this objectively, we run the Breusch-Pagan hypothesis test using the bptest() function from the ‘lmtest’ library, passing the model object as input.

bptest(student_back)

#> 
#>  studentized Breusch-Pagan test
#> 
#> data:  student_back
#> BP = 5.4591, df = 5, p-value = 0.3624

The “student_back” model has a p-value > 0.05 (0.3624), so we fail to reject the null hypothesis (H0) of constant residual variance. The residuals are homoscedastic: their spread does not change systematically with the predictors.
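The idea behind the test can be sketched by hand in base R. The version below regresses the squared residuals on the predictors and uses n times the R-squared of that auxiliary regression as the test statistic, which to my understanding corresponds to the studentized (Koenker) form that bptest() reports by default. The data are simulated and all names are hypothetical.

```r
set.seed(100)

# Hypothetical data with constant error variance
n  <- 500
x1 <- runif(n)
x2 <- runif(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)

# Auxiliary regression: squared residuals on the same predictors
aux <- lm(residuals(fit)^2 ~ x1 + x2)

bp_stat <- n * summary(aux)$r.squared   # test statistic
bp_df   <- length(coef(aux)) - 1        # one degree of freedom per predictor
p_value <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

c(BP = bp_stat, df = bp_df, p_value = p_value)
```

A large p-value means the predictors explain almost none of the variation in the squared residuals, which is what constant variance looks like.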

Multicollinearity Test

A good regression model should not have strongly correlated independent variables, so that the estimated effect of each predictor on the target is not distorted. We check this with the variance inflation factor (VIF) from the vif() function. No VIF value should be equal to or greater than 10; if that holds, no multicollinearity is found among the predictor variables.

vif(student_back)

#>                    Hours.Studied                  Previous.Scores 
#>                         1.000647                         1.000590 
#>       Extracurricular.Activities                      Sleep.Hours 
#>                         1.001566                         1.001298 
#> Sample.Question.Papers.Practiced 
#>                         1.000492

Based on the result above, all variables have VIF values under 10 (all are close to 1). No multicollinearity is found in this model: the predictor variables are not strongly correlated with one another.
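The VIF of a predictor is 1 / (1 - R2), where R2 comes from regressing that predictor on all the other predictors. The sketch below computes it by hand in base R on simulated data, where the hypothetical column `c` is built as a near copy of `a` so that the effect of multicollinearity is visible.

```r
set.seed(100)

# Hypothetical predictors: "c" is almost a copy of "a", "b" is unrelated
n <- 300
d <- data.frame(a = rnorm(n), b = rnorm(n))
d$c <- d$a + rnorm(n, sd = 0.1)

# VIF for each predictor: regress it on the others, then apply 1 / (1 - R2)
vif_manual <- sapply(names(d), function(v) {
  others <- setdiff(names(d), v)
  r2 <- summary(lm(reformulate(others, response = v), data = d))$r.squared
  1 / (1 - r2)
})

round(vif_manual, 2)
```

The values for `a` and `c` come out far above 10, while `b` stays close to 1, the same range as every predictor in the student model.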

Conclusion

We obtained the best model after applying the stepwise regression method. The “student_back” model has higher R2 and adjusted R2 values and a lower AIC value than the earlier “student_lm” model.
The accuracy of the model in predicting student performance is measured with RMSE. On the training data, “student_back” has an RMSE of 2.046 against 7.761 for “student_lm” (see the model comparison table), and the RMSE is almost the same on the training and test data.
The normality, homoscedasticity, and multicollinearity tests all gave the expected results.
The model can therefore be considered a good fit, with small error values on both the training dataset and the test dataset.

Reference

https://www.kaggle.com/datasets/nikhil7280/student-performance-multiple-linear-regression
https://rpubs.com/Argaadya/531140
https://rpubs.com/nabiilahardini/happiness
Modul Regression Model - Algoritma Data Science