The dataset used in this project is the Student Performance Factors Dataset. It contains information related to students’ academic performance and several factors that may influence examination results. The selected variables are Hours Studied, Attendance, Sleep Hours, Previous Scores, and Exam Score. The target variable is Exam Score, while the remaining variables are used as predictors.
# 1. Load required packages
library(tidyverse)
# 19. Boxplot of Exam Score
ggplot(student_clean, aes(y = Exam_Score)) +
  geom_boxplot(fill = "lightgreen") +
  labs(title = "Boxplot of Exam Scores", y = "Exam Score")
Predictive Modeling: Machine Learning Methodology
In this section, we develop a predictive engine to forecast student exam scores based on key learning and academic factors. Since our target variable (Exam_Score) is continuous numeric data, this is a Regression Task. We implement a Multiple Linear Regression model using the structured `tidymodels` framework to ensure full reproducibility.
Methodology Explanation: The predictive workflow begins by initializing the core libraries. We load the tidyverse package for general data manipulation and tidymodels to handle the machine learning operations systematically. We then ingest student_clean.csv, which is the clean subset containing our target variable (Exam_Score) and four selected numeric predictors (Hours_Studied, Attendance, Sleep_Hours, and Previous_Scores).
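The loading step described above is not shown in the rendered output, so the following is only a minimal sketch of what it might look like. The helper name load_student_data() is hypothetical, and the default path assumes student_clean.csv sits in the working directory; the column check is a defensive addition, not something the original script is known to do.

```r
library(tidyverse)   # data manipulation (readr, dplyr, ...)
library(tidymodels)  # modeling framework used in the later steps

# Hypothetical helper: read the cleaned file and keep the five modeling columns.
load_student_data <- function(path = "student_clean.csv") {
  needed <- c("Exam_Score", "Hours_Studied", "Attendance",
              "Sleep_Hours", "Previous_Scores")
  data <- read_csv(path, show_col_types = FALSE)

  # Fail early with a readable message if an expected column is absent.
  missing_cols <- setdiff(needed, names(data))
  if (length(missing_cols) > 0) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "))
  }
  select(data, all_of(needed))
}

# Usage (assumes the file exists in the working directory):
# student_clean <- load_student_data()
```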
2. Data Splitting Strategy
# 3. Split data into training and testing sets
set.seed(123)
student_split <- initial_split(student_clean, prop = 0.8)
train_data <- training(student_split)
test_data <- testing(student_split)
Methodology Explanation:
To rigorously evaluate the model’s performance on unseen data, we implement an 80:20 Train-Test Split.
Training Set (train_data): Comprises 80% of the observations and is used exclusively to optimize the model’s regression coefficients.
Testing Set (test_data): Comprises the remaining 20% and acts as an independent evaluation set.
Reproducibility: We apply set.seed(123) before the split to ensure that the pseudo-random partition remains identical across different rendering instances of this Quarto document, securing consistent validation results.
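The effect of the seed can be checked directly. The sketch below uses a small synthetic data frame (demo_data, with made-up values) as a stand-in for student_clean, and splits it twice with the same seed; it is an illustration of the idea rather than part of the project pipeline.

```r
library(rsample)  # part of tidymodels; provides initial_split()

# Synthetic stand-in for student_clean (values are made up for illustration).
set.seed(1)
demo_data <- data.frame(
  Hours_Studied = sample(1:40, 100, replace = TRUE),
  Exam_Score    = round(rnorm(100, mean = 67, sd = 4))
)

# Resetting the same seed before each split reproduces the same partition.
set.seed(123)
split_a <- initial_split(demo_data, prop = 0.8)
set.seed(123)
split_b <- initial_split(demo_data, prop = 0.8)

# Compare the two training sets row for row.
same_partition <- identical(training(split_a), training(split_b))
same_partition

# Every row lands in exactly one of the two sets.
nrow(training(split_a)) + nrow(testing(split_a))
```

Omitting the second set.seed(123) call would generally produce a different partition, which is why the seed is set immediately before the split.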
3. Model Specification and Model Fitting
# 4. Specify linear regression model
lm_spec <- linear_reg() %>%
  set_engine("lm")

# 5. Fit the model
lm_fit <- lm_spec %>%
  fit(
    Exam_Score ~ Hours_Studied + Attendance + Sleep_Hours + Previous_Scores,
    data = train_data
  )

# 6. View model result
lm_fit
Methodology Explanation: We construct a Multiple Linear Regression model where the mathematical objective is to estimate a linear function that relates our four predictors to the target variable. Using linear_reg(), we specify the mathematical structure and set the computational engine to "lm" (Ordinary Least Squares estimation). The model is then trained using the fit() function on train_data, solving for the intercept (\(\beta_0\)) and slopes (\(\beta_1, \beta_2, \beta_3, \beta_4\)) for each respective academic feature.
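Written out, the relationship being estimated takes the following form. The subscript \(i\) (indexing students) and the error term \(\varepsilon_i\) are notation added here for illustration; the coefficients follow the order of the predictors in the model formula.

\[
\text{Exam\_Score}_i = \beta_0 + \beta_1\,\text{Hours\_Studied}_i + \beta_2\,\text{Attendance}_i + \beta_3\,\text{Sleep\_Hours}_i + \beta_4\,\text{Previous\_Scores}_i + \varepsilon_i
\]

Ordinary Least Squares chooses the coefficients that minimize the sum of squared residuals over the \(n\) training observations:

\[
\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \left( \text{Exam\_Score}_i - \beta_0 - \beta_1\,\text{Hours\_Studied}_i - \beta_2\,\text{Attendance}_i - \beta_3\,\text{Sleep\_Hours}_i - \beta_4\,\text{Previous\_Scores}_i \right)^2
\]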
4. Out-of-Sample Prediction and Performance Validation
# 7. Make prediction on test data
predictions <- predict(lm_fit, new_data = test_data) %>%
  bind_cols(test_data)

# 8. Evaluate model performance
rmse_result <- rmse(
  predictions,
  truth = Exam_Score,
  estimate = .pred
)

rsq_result <- rsq(
  predictions,
  truth = Exam_Score,
  estimate = .pred
)

rmse_result
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 2.35
rsq_result
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rsq standard 0.622
Methodology Explanation:
After fitting the model, we apply it to the unseen test_data with the predict() function to generate predicted values (.pred). To assess the accuracy of this baseline model, we report two standard regression metrics:
Root Mean Squared Error (RMSE): The square root of the mean squared difference between predicted and actual values, expressed in the same units as Exam_Score. A smaller RMSE means the predictions fall closer to the observed scores. The test-set RMSE of 2.35 indicates a typical prediction error of roughly 2.35 score points.
R-squared (\(R^2\)): The coefficient of determination, indicating the proportion of variance in Exam_Score that is explained by the combined linear influence of Hours_Studied, Attendance, Sleep_Hours, and Previous_Scores. The test-set value of 0.622 means the model accounts for about 62% of the variance in exam scores. Note that yardstick's rsq() computes this metric as the squared correlation between observed and predicted values.
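To make the two metrics concrete, the sketch below computes them by hand in base R on four toy values (made up for illustration, not taken from the dataset). It shows both the squared-correlation form of R-squared, which is what yardstick's rsq() reports, and the traditional sum-of-squares form, which yardstick exposes as rsq_trad().

```r
# Toy values, made up for illustration only.
truth <- c(60, 65, 70, 75)
pred  <- c(62, 64, 71, 73)

# RMSE: square root of the mean squared prediction error.
rmse_manual <- sqrt(mean((truth - pred)^2))

# R-squared as the squared correlation between truth and prediction.
rsq_cor <- cor(truth, pred)^2

# Traditional definition: 1 - (residual sum of squares / total sum of squares).
rsq_trad_manual <- 1 - sum((truth - pred)^2) / sum((truth - mean(truth))^2)

c(rmse = rmse_manual, rsq = rsq_cor, rsq_trad = rsq_trad_manual)
```

The two R-squared variants can differ on test data, so it is worth stating which one a reported figure refers to.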
5. Model Serialization for Deployment
# 9. Save model
saveRDS(lm_fit, "student_model.rds")
Methodology Explanation: The final phase of the machine learning pipeline exports the trained model object using saveRDS(). Serializing the fitted model (lm_fit) into a student_model.rds file means the model only has to be trained once. The web application (app.R) can load this pre-trained file with readRDS() and compute user predictions on the Prediction Model tab without retraining the model from scratch in every session.
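The save-and-reload round trip can be sketched as follows. The training data here (train_demo) is synthetic with made-up values, the model is written to a temporary file rather than student_model.rds, and the new_student values are arbitrary; the point is only that the reloaded object predicts exactly as the original does.

```r
library(tidymodels)

# Synthetic stand-in for train_data (made-up values, illustration only).
set.seed(123)
n <- 50
train_demo <- data.frame(
  Hours_Studied   = runif(n, 1, 40),
  Attendance      = runif(n, 60, 100),
  Sleep_Hours     = runif(n, 4, 10),
  Previous_Scores = runif(n, 50, 100)
)
train_demo$Exam_Score <- 40 + 0.3 * train_demo$Hours_Studied +
  0.2 * train_demo$Attendance + rnorm(n)

# Same specification and formula as in the modeling step.
fit_demo <- linear_reg() %>%
  set_engine("lm") %>%
  fit(Exam_Score ~ Hours_Studied + Attendance + Sleep_Hours + Previous_Scores,
      data = train_demo)

# Round trip: write to a temporary file, then read it back as app.R would.
model_path <- tempfile(fileext = ".rds")
saveRDS(fit_demo, model_path)
loaded_model <- readRDS(model_path)

# Arbitrary input values for one hypothetical student.
new_student <- data.frame(Hours_Studied = 20, Attendance = 85,
                          Sleep_Hours = 7, Previous_Scores = 75)
predict(loaded_model, new_data = new_student)
```

Because the saved object is a parsnip model fit, the session that reloads it needs tidymodels (or at least parsnip) loaded for predict() to work.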
Part: app.R
# 1. Load required packages
library(shiny)
library(shinydashboard)
The cleaned dataset was imported into R and split into 80% training data and 20% testing data. A Linear Regression model was developed using the tidymodels framework to predict students’ exam scores based on Hours Studied, Attendance, Sleep Hours, and Previous Scores. The trained model was then used to generate predictions on the testing dataset. Finally, model performance was evaluated using RMSE and R², where RMSE measures prediction error and R² measures how much variation in exam scores is explained by the model.
The Shiny application was designed with two main tabs: Exploratory Data Analysis and Prediction Model. In the EDA tab, users can interactively explore relationships between study-related factors and exam scores through histograms, scatter plots, and summary statistics. In the Prediction tab, users can adjust Hours Studied, Attendance, Sleep Hours, and Previous Scores using slider inputs. The trained Linear Regression model then generates a real-time predicted exam score, while RMSE and R² values are displayed to evaluate model performance.
The server function controls all reactive features of the application. It generates a histogram of exam scores, creates scatter plots that update automatically based on the user-selected variables, and displays summary statistics of the dataset. For prediction, the application collects values for Hours Studied, Attendance, Sleep Hours, and Previous Scores, then uses the trained Linear Regression model to predict the student’s exam score. Finally, RMSE and R² values are displayed to give users information about the model’s predictive accuracy.
Finally, the shinyApp(ui = ui, server = server) function was used to combine the user interface and server logic into a complete Shiny application. This allows users to interact with the system through a web browser, explore the dataset dynamically, and obtain real-time exam score predictions generated by the trained Linear Regression model.
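The full app.R source is not reproduced in this report, so the following is only a minimal sketch of how the pieces described above could fit together. The input and output IDs, the make_server() factory, and the slider ranges are all hypothetical placeholders, and the RMSE and R² display is omitted for brevity. It assumes student_clean.csv and student_model.rds sit next to app.R.

```r
library(tidymodels)      # needed so predict() works on the saved parsnip model
library(ggplot2)
library(shiny)
library(shinydashboard)

predictors <- c("Hours_Studied", "Attendance", "Sleep_Hours", "Previous_Scores")

ui <- dashboardPage(
  dashboardHeader(title = "Student Performance"),
  dashboardSidebar(sidebarMenu(
    menuItem("Exploratory Data Analysis", tabName = "eda"),
    menuItem("Prediction Model", tabName = "predict")
  )),
  dashboardBody(tabItems(
    tabItem(tabName = "eda",
      selectInput("x_var", "Predictor", choices = predictors),
      plotOutput("hist_plot"),
      plotOutput("scatter_plot"),
      verbatimTextOutput("summary_stats")
    ),
    tabItem(tabName = "predict",
      # Slider ranges are placeholders, not taken from the dataset.
      sliderInput("hours", "Hours Studied", min = 0, max = 50, value = 20),
      sliderInput("attendance", "Attendance", min = 0, max = 100, value = 80),
      sliderInput("sleep", "Sleep Hours", min = 0, max = 12, value = 7),
      sliderInput("previous", "Previous Scores", min = 0, max = 100, value = 70),
      textOutput("prediction")
    )
  ))
)

# Hypothetical factory: returns a server function bound to a model and a dataset.
make_server <- function(model, data) {
  function(input, output, session) {
    output$hist_plot <- renderPlot(
      ggplot(data, aes(x = Exam_Score)) + geom_histogram(bins = 30)
    )
    # Re-drawn whenever the selected predictor changes.
    output$scatter_plot <- renderPlot(
      ggplot(data, aes(x = .data[[input$x_var]], y = Exam_Score)) + geom_point()
    )
    output$summary_stats <- renderPrint(summary(data))
    # Re-computed whenever any slider moves.
    output$prediction <- renderText({
      new_student <- data.frame(
        Hours_Studied = input$hours, Attendance = input$attendance,
        Sleep_Hours = input$sleep, Previous_Scores = input$previous
      )
      pred <- predict(model, new_data = new_student)$.pred
      paste("Predicted exam score:", round(pred, 1))
    })
  }
}

# Launch only in an interactive session, once both files are available.
if (interactive()) {
  student_clean <- read.csv("student_clean.csv")
  lm_fit <- readRDS("student_model.rds")
  runApp(shinyApp(ui = ui, server = make_server(lm_fit, student_clean)))
}
```

Passing the model and data into a factory keeps the server logic free of file paths, which makes it easier to exercise with a stand-in model during development.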