Introduction
The Performance Index represents each student's overall academic performance and has been rounded to the nearest integer. The index ranges from 10 to 100, with higher values indicating better performance. In this Programming for Data Science with R project, we build a regression model to predict the students' Performance Index. The data was collected from Kaggle.
Data Preparation
Import Library
Load Dataset
#> Hours.Studied Previous.Scores Extracurricular.Activities Sleep.Hours
#> 1 7 99 Yes 9
#> 2 4 82 No 4
#> 3 8 51 Yes 7
#> 4 5 52 Yes 5
#> 5 7 75 No 8
#> 6 3 78 No 9
#> Sample.Question.Papers.Practiced Performance.Index
#> 1 1 91
#> 2 2 65
#> 3 2 45
#> 4 2 36
#> 5 5 66
#> 6 6 61
Column Description
- Hours Studied : The total number of hours spent studying by each student.
- Previous Scores : The scores obtained by students in previous tests.
- Extracurricular Activities : Whether the student participates in extracurricular activities (Yes or No).
- Sleep Hours : The average number of hours of sleep the student had per day.
- Sample Question Papers Practiced : The number of sample question papers the student practiced.
Target Variable:
- Performance Index : A measure of the overall performance of each student.
Data Processing
Check general data information
#> Rows: 10,000
#> Columns: 6
#> $ Hours.Studied <int> 7, 4, 8, 5, 7, 3, 7, 8, 5, 4, 8, 8, 3…
#> $ Previous.Scores <int> 99, 82, 51, 52, 75, 78, 73, 45, 77, 8…
#> $ Extracurricular.Activities <chr> "Yes", "No", "Yes", "Yes", "No", "No"…
#> $ Sleep.Hours <int> 9, 4, 7, 5, 8, 9, 5, 4, 8, 4, 4, 6, 9…
#> $ Sample.Question.Papers.Practiced <int> 1, 2, 2, 2, 5, 6, 6, 6, 2, 0, 5, 2, 2…
#> $ Performance.Index <dbl> 91, 65, 45, 36, 66, 61, 63, 42, 61, 6…
The output above shows that the dataset has 10,000 rows and 6 columns, along with the data type of each column. Checking the data types is a crucial step because each column must have a type that is appropriate for the analysis.
Missing Value
#> Hours.Studied Previous.Scores
#> 0 0
#> Extracurricular.Activities Sleep.Hours
#> 0 0
#> Sample.Question.Papers.Practiced Performance.Index
#> 0 0
Missing values in a dataset can significantly affect the results of the model's predictions. The dataset above has no missing values in any column.
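The per-column count of missing values shown above can be produced along these lines. This is a minimal sketch in base R on a hypothetical toy data frame (`toy` is not the student data, and its NA values are planted on purpose to show what a non-zero count looks like).

```r
# Hypothetical toy data frame standing in for the student data
toy <- data.frame(
  Hours.Studied = c(7, 4, NA, 5),
  Extracurricular.Activities = c("Yes", "No", "Yes", NA),
  stringsAsFactors = FALSE
)

# Count the missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# TRUE only when no column contains a missing value
no_missing <- !anyNA(toy)
print(no_missing)
```

On the real data, a vector of zeros from `colSums(is.na(...))` corresponds to the output shown in this section.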
Data Cleaning
student_clean <- student %>%
  mutate_if(is.character, as.factor) # convert character columns to factors
head(student_clean)
#> Hours.Studied Previous.Scores Extracurricular.Activities Sleep.Hours
#> 1 7 99 Yes 9
#> 2 4 82 No 4
#> 3 8 51 Yes 7
#> 4 5 52 Yes 5
#> 5 7 75 No 8
#> 6 3 78 No 9
#> Sample.Question.Papers.Practiced Performance.Index
#> 1 1 91
#> 2 2 65
#> 3 2 45
#> 4 2 36
#> 5 5 66
#> 6 6 61
Exploratory Data Analysis
Exploratory data analysis is the phase where we explore the variables in the data and look for patterns that may indicate a correlation between them.
Find the Pearson correlation between features.
Insight :
The graphic shows that Previous.Scores has a strong positive correlation with Performance.Index.
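A Pearson correlation matrix of this kind can be computed with base R's `cor()`. The sketch below uses simulated data with hypothetical coefficients, so the numbers it prints are not those of the student dataset; it only illustrates how to rank predictors by their correlation with the target.

```r
set.seed(100)
n <- 200

# Simulated stand-in data (coefficients are made up for illustration)
prev  <- runif(n, 40, 99)
hours <- sample(1:9, n, replace = TRUE)
perf  <- -30 + prev + 3 * hours + rnorm(n, sd = 2)

toy <- data.frame(
  Previous.Scores   = prev,
  Hours.Studied     = hours,
  Performance.Index = perf
)

# Pearson correlation between all numeric columns
cor_mat <- cor(toy, method = "pearson")
print(round(cor_mat, 2))

# Correlation of every column with the target, strongest first
target_cor <- sort(cor_mat[, "Performance.Index"], decreasing = TRUE)
print(round(target_cor, 2))
```

Only numeric columns can be passed to `cor()`, so a factor such as Extracurricular.Activities would need to be dropped or encoded first.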
Here is the distribution of values for each variable.
Insight :
Based on the boxplot visualization, no outliers are found in any column and the distributions look roughly symmetric, so the data can be analyzed further.
Modeling
Train-Test Split
Before we build the prediction model, we split the data into a train dataset and a test dataset. We use the train dataset to fit the linear regression model. The test dataset serves as a comparison, to see whether the model overfits and fails to predict new data that it has not seen during training. We use 70% of the data as the training data and the rest as the testing data.
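A 70/30 split like the one described above can be sketched with `sample()`. The data frame `toy` and the names `toy_train` and `toy_test` are hypothetical, and the seed value is only an assumption chosen for reproducibility; the report's own split code is not shown.

```r
set.seed(100)  # fix the random draw so the split is reproducible

# Hypothetical data frame with 100 rows
toy <- data.frame(id = 1:100, y = rnorm(100))

# Draw 70% of the row indices without replacement
train_idx <- sample(nrow(toy), size = round(0.7 * nrow(toy)))

toy_train <- toy[train_idx, ]   # rows used to fit the model
toy_test  <- toy[-train_idx, ]  # remaining rows, held out for evaluation

print(c(train = nrow(toy_train), test = nrow(toy_test)))
```

Because the test rows are selected with negative indexing on the same index vector, no row can appear in both sets.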
Linear Regression
Now we fit a simple linear regression model using Performance.Index as the target variable and Previous.Scores as the predictor, because it has the strongest positive correlation with the target.
set.seed(100)
student_lm <- lm(Performance.Index ~ Previous.Scores,
                 data = data_train)
summary(student_lm)
#>
#> Call:
#> lm(formula = Performance.Index ~ Previous.Scores, data = data_train)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -17.332 -6.455 -0.054 6.518 19.481
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -15.637037 0.382547 -40.88 <0.0000000000000002 ***
#> Previous.Scores 1.021405 0.005344 191.14 <0.0000000000000002 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 7.762 on 6998 degrees of freedom
#> Multiple R-squared: 0.8393, Adjusted R-squared: 0.8392
#> F-statistic: 3.654e+04 on 1 and 6998 DF, p-value: < 0.00000000000000022
Insight :
- The model with Previous.Scores as its only predictor has an adjusted R-squared of 0.8392, meaning that it explains 83.92% of the variance in the target variable (Performance.Index).
- The p-value is < 0.00000000000000022. Because it is below 0.05, the predictor has a statistically significant effect on the target.
- Previous.Scores and Performance.Index have a strong positive correlation, which we can confirm with the plot below.
plot(x = student_clean$Previous.Scores, y = student_clean$Performance.Index)
abline(student_lm, col = "red")
Stepwise Regression
First, we fit a linear regression model using Performance.Index as the target variable with all predictors.
Next, we attempt automated predictor selection using stepwise regression with the backward elimination method.
library(MASS)
student_all <- lm(Performance.Index ~ .,
                  data = data_train)
student_back <- stepAIC(student_all,
                        direction = "backward")
#> Start: AIC=10032.3
#> Performance.Index ~ Hours.Studied + Previous.Scores + Extracurricular.Activities +
#> Sleep.Hours + Sample.Question.Papers.Practiced
#>
#> Df Sum of Sq RSS AIC
#> <none> 29294 10032
#> - Extracurricular.Activities 1 573 29867 10166
#> - Sample.Question.Papers.Practiced 1 2115 31409 10518
#> - Sleep.Hours 1 4637 33931 11059
#> - Hours.Studied 1 382486 411780 28532
#> - Previous.Scores 1 2186947 2216241 40314
summary(student_back)
#>
#> Call:
#> lm(formula = Performance.Index ~ Hours.Studied + Previous.Scores +
#> Extracurricular.Activities + Sleep.Hours + Sample.Question.Papers.Practiced,
#> data = data_train)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -7.0763 -1.3886 -0.0439 1.3599 8.7374
#>
#> Coefficients:
#> Estimate Std. Error t value
#> (Intercept) -33.983798 0.151598 -224.17
#> Hours.Studied 2.852506 0.009439 302.19
#> Previous.Scores 1.018351 0.001409 722.59
#> Extracurricular.ActivitiesYes 0.572839 0.048963 11.70
#> Sleep.Hours 0.476932 0.014334 33.27
#> Sample.Question.Papers.Practiced 0.191528 0.008523 22.47
#> Pr(>|t|)
#> (Intercept) <0.0000000000000002 ***
#> Hours.Studied <0.0000000000000002 ***
#> Previous.Scores <0.0000000000000002 ***
#> Extracurricular.ActivitiesYes <0.0000000000000002 ***
#> Sleep.Hours <0.0000000000000002 ***
#> Sample.Question.Papers.Practiced <0.0000000000000002 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 2.047 on 6994 degrees of freedom
#> Multiple R-squared: 0.9888, Adjusted R-squared: 0.9888
#> F-statistic: 1.239e+05 on 5 and 6994 DF, p-value: < 0.00000000000000022
Insight :
- The model with all predictors has an adjusted R-squared of 0.9888, meaning that it explains 98.88% of the variance in the target variable (Performance.Index).
- The p-value is < 0.00000000000000022. Because it is below 0.05, the predictors jointly have a statistically significant effect on the target.
- Backward elimination removed no predictor, because dropping any variable would raise the AIC, so "student_back" keeps all five predictors. It performs markedly better than "student_lm".
Model Evaluation
Compare Model Performance
#> # Comparison of Model Performance Indices
#>
#> Name | Model | AIC (weights) | AICc (weights) | BIC (weights) | R2 | R2 (adj.) | RMSE | Sigma
#> --------------------------------------------------------------------------------------------------------------
#> student_lm | lm | 48559.0 (<.001) | 48559.0 (<.001) | 48579.6 (<.001) | 0.839 | 0.839 | 7.761 | 7.762
#> student_back | lm | 29899.4 (>.999) | 29899.5 (>.999) | 29947.4 (>.999) | 0.989 | 0.989 | 2.046 | 2.047
Insight :
- student_lm : R2 and adjusted R2 have the same value, 0.839.
- student_back : the more complex model, with higher R2 and adjusted R2 values (0.989) and a lower AIC than the "student_lm" model.
- The stepwise regression method selects the formula with the lowest AIC. A lower AIC indicates that less information is lost by the model, balancing goodness of fit against the number of parameters.
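The AIC printed by the stepwise search (10032.3) and the AIC in the comparison table (29899.4) are on different scales. The sketch below, on simulated data, shows one likely explanation: `extractAIC()`, which `step()` and `stepAIC()` use for `lm` models, omits a constant that `AIC()` includes. For a linear model the gap is n(log(2π) + 1) + 2, which for n = 7,000 is about 19,867 and is consistent with the two values in this report. All variable names here are hypothetical.

```r
set.seed(100)
n <- 300
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)
fit <- lm(y ~ x)

aic_full <- AIC(fit)             # -2 * logLik + 2 * (p + 1); sigma counts as a parameter
aic_step <- extractAIC(fit)[2]   # n * log(RSS / n) + 2 * p; scale used by step()/stepAIC()

# Reproduce the stepwise-scale AIC by hand
rss <- sum(residuals(fit)^2)
p   <- length(coef(fit))
manual_step <- n * log(rss / n) + 2 * p

# Constant offset between the two definitions for a linear model
aic_offset <- n * (log(2 * pi) + 1) + 2

print(c(AIC = aic_full, extractAIC = aic_step, difference = aic_full - aic_step))
```

Because the offset depends only on n, both scales rank models fitted to the same data identically.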
Root Mean Squared Error (RMSE)
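The performance of our model (how well it predicts the target variable) can be calculated using the root mean squared error:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2}$$
where $y_i$ is the actual value, $\hat{y}_i$ is the predicted value, and $n$ is the number of observations.
RMSE is often preferred over MAE (mean absolute error) when large errors matter most, because it squares the difference between the actual and predicted values, so predictions with larger errors are penalized more heavily. This metric is often used to compare two or more alternative models, even though it is harder to interpret than MAE. The RMSE() calls below use the y_true and y_pred arguments, which match the RMSE() function of the MLmetrics package (the RMSE() function in caret takes pred and obs instead). Below is the performance of the model with all predictors (student_back).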
RMSE(y_true = data_train$Performance.Index, # actual values of the training rows
     y_pred = predict(student_back, data_train)) # predictions for data_train
RMSE(y_true = data_test$Performance.Index, # actual values of the test rows
     y_pred = predict(student_back, data_test)) # predictions for data_test
Insight :
- For the "student_back" model, the RMSE is almost the same on the train dataset and the test dataset.
- The "student_back" model has relatively small errors: its training RMSE in the comparison table is 2.046 on a target that ranges from 10 to 100.
- The model can be considered a good fit because it captures the pattern of the training data and still predicts new data well.
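The RMSE computation can also be sketched in base R without any package. The helper `rmse()` below is hypothetical and the data is simulated. The length check guards against a common pitfall: comparing predictions with actual values taken from a different set of rows, which R would otherwise recycle silently and which inflates the error.

```r
# Hypothetical helper: root mean squared error of predictions
rmse <- function(actual, predicted) {
  # Actual and predicted values must describe the same rows
  stopifnot(length(actual) == length(predicted))
  sqrt(mean((actual - predicted)^2))
}

set.seed(100)
toy <- data.frame(x = runif(200, 0, 10))
toy$y <- 5 + 2 * toy$x + rnorm(200, sd = 1)

# 70/30 split of the simulated data
idx <- sample(nrow(toy), 140)
toy_train <- toy[idx, ]
toy_test  <- toy[-idx, ]

fit <- lm(y ~ x, data = toy_train)

# Each set of predictions is compared with the actual values of the same rows
rmse_train <- rmse(toy_train$y, predict(fit, toy_train))
rmse_test  <- rmse(toy_test$y,  predict(fit, toy_test))
print(c(train = rmse_train, test = rmse_test))
```

A test RMSE close to the training RMSE suggests that the model generalizes; a much larger test RMSE points to overfitting.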
Assumptions Test
Normality Test
Based on the plot above, the residuals are approximately normally distributed. They are concentrated around 0, indicating that there are no extreme errors.
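A visual check of the residuals can be complemented with a formal test. The sketch below runs the Shapiro-Wilk test on the residuals of a model fitted to simulated data; this test is an assumption of the example and is not used in the report. Note that `shapiro.test()` accepts between 3 and 5,000 values, so for a 7,000-row training set the residuals would have to be subsampled first.

```r
set.seed(100)
x <- rnorm(300)
y <- 1 + 2 * x + rnorm(300)
fit <- lm(y ~ x)

res <- residuals(fit)

# Residuals of a least-squares fit with an intercept average to zero
print(mean(res))

# Shapiro-Wilk test: H0 is that the residuals come from a normal distribution
sw <- shapiro.test(res)
print(sw)
```

A p-value above 0.05 means that normality is not rejected; calling `hist(res)` would draw the corresponding histogram.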
Homoscedasticity Test
The model is expected to produce residuals whose variance is constant and does not form a pattern (they should spread randomly).
We visualize this with a scatterplot of the predicted (fitted) values against the residuals.
In the plot, the residuals should not exhibit a pattern. To check this objectively, we run the Breusch-Pagan hypothesis test using the bptest() function from the 'lmtest' library, with the model object as input.
#>
#> studentized Breusch-Pagan test
#>
#> data: student_back
#> BP = 5.4591, df = 5, p-value = 0.3624
The "student_back" model has a p-value of 0.3624, which is greater than 0.05, so we fail to reject the null hypothesis (H0) of homoscedasticity. The residuals have constant variance and spread randomly around their mean.
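To show what the studentized Breusch-Pagan statistic measures, the sketch below computes it by hand in base R: the squared residuals are regressed on the predictors, and n times the R-squared of that auxiliary regression is compared with a chi-squared distribution. The function `bp_manual()` is hypothetical and the data is simulated; on a real model, `bptest()` remains the convenient route.

```r
# Hypothetical helper: studentized (Koenker) Breusch-Pagan test for an lm fit
bp_manual <- function(fit) {
  X  <- model.matrix(fit)        # predictors, including the intercept column
  u2 <- residuals(fit)^2         # squared residuals
  aux <- lm.fit(X, u2)           # auxiliary regression of u^2 on the predictors
  r2 <- 1 - sum(aux$residuals^2) / sum((u2 - mean(u2))^2)
  stat <- length(u2) * r2        # BP = n * R^2
  df <- ncol(X) - 1              # number of predictors, intercept excluded
  c(BP = stat, df = df, p.value = pchisq(stat, df, lower.tail = FALSE))
}

set.seed(100)
n  <- 500
x1 <- runif(n, 1, 10)
x2 <- rnorm(n)

# Constant error variance (homoscedastic)
y_hom <- 1 + 2 * x1 - x2 + rnorm(n)
res_hom <- bp_manual(lm(y_hom ~ x1 + x2))

# Error standard deviation grows with x1 (heteroscedastic)
y_het <- 1 + 2 * x1 - x2 + rnorm(n, sd = x1)
res_het <- bp_manual(lm(y_het ~ x1 + x2))

print(round(res_hom, 4))
print(round(res_het, 4))
```

A small p-value, as in the second model, rejects constant variance; a large one gives no evidence of heteroscedasticity.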
Multicollinearity Test
A good regression model should not exhibit strong correlation among its independent variables, so that the relationship between the target and each predictor is not distorted. We check this with the variance inflation factor (VIF) of each predictor: no value should be equal to or greater than 10. If that holds, no multicollinearity is found between the predictor variables.
#> Hours.Studied Previous.Scores
#> 1.000647 1.000590
#> Extracurricular.Activities Sleep.Hours
#> 1.001566 1.001298
#> Sample.Question.Papers.Practiced
#> 1.000492
Based on the multicollinearity test above, all variables have values well under 10 (all are close to 1). No multicollinearity is found in this model, indicating that the predictor variables are not linearly related to one another.
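The values above behave like variance inflation factors, which can be computed by hand: each predictor is regressed on the others, and VIF = 1 / (1 - R²) of that regression. The sketch below is a base R illustration with a hypothetical helper `vif_manual()` and simulated numeric predictors, one of which is deliberately made almost identical to another.

```r
# Hypothetical helper: VIF of every column in a data frame of numeric predictors
vif_manual <- function(data) {
  sapply(names(data), function(v) {
    others <- setdiff(names(data), v)
    # Regress predictor v on all remaining predictors
    r2 <- summary(lm(reformulate(others, response = v), data = data))$r.squared
    1 / (1 - r2)
  })
}

set.seed(100)
n <- 500
toy <- data.frame(a = rnorm(n), b = rnorm(n))
toy$c <- toy$a + rnorm(n, sd = 0.1)  # c is nearly a copy of a

vif_indep  <- vif_manual(toy[, c("a", "b")])  # unrelated predictors
vif_collin <- vif_manual(toy)                 # a and c are collinear

print(round(vif_indep, 3))
print(round(vif_collin, 3))
```

Unrelated predictors give values near 1, as in this report, while the collinear pair far exceeds the threshold of 10.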
Conclusion
- The model "student_back", obtained with the stepwise regression method, has higher R2 and adjusted R2 values and a lower AIC than the previous model.
- The accuracy of the model in predicting student performance is measured with RMSE, which is 2.046 on the training data according to the model comparison table.
- The normality, homoscedasticity, and multicollinearity assumption tests all give the expected results.
- The model can therefore be considered a good fit, because its errors are small on both the training dataset and the test dataset.
Reference
- https://www.kaggle.com/datasets/nikhil7280/student-performance-multiple-linear-regression
- https://rpubs.com/Argaadya/531140
- https://rpubs.com/nabiilahardini/happiness
- Regression Model module - Algoritma Data Science