1.0 Introduction

The dataset used in this project is the Student Performance Factors Dataset, which records students’ academic performance alongside factors that may influence examination results (6,607 observations of 20 variables). Five variables are selected for analysis: Hours Studied, Attendance, Sleep Hours, Previous Scores, and Exam Score. Exam Score is the target variable; the other four are used as predictors.

# 1. Load required packages

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.1     ✔ readr     2.2.0
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.3     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(corrplot)
corrplot 0.95 loaded
# 2. Import dataset

student <- read.csv("E:/wanmo/Documents/RSTUDIO/StudentPerformanceFactors.csv")

# 3. View dataset structure

str(student)
'data.frame':   6607 obs. of  20 variables:
 $ Hours_Studied             : int  23 19 24 29 19 19 29 25 17 23 ...
 $ Attendance                : int  84 64 98 89 92 88 84 78 94 98 ...
 $ Parental_Involvement      : chr  "Low" "Low" "Medium" "Low" ...
 $ Access_to_Resources       : chr  "High" "Medium" "Medium" "Medium" ...
 $ Extracurricular_Activities: chr  "No" "No" "Yes" "Yes" ...
 $ Sleep_Hours               : int  7 8 7 8 6 8 7 6 6 8 ...
 $ Previous_Scores           : int  73 59 91 98 65 89 68 50 80 71 ...
 $ Motivation_Level          : chr  "Low" "Low" "Medium" "Medium" ...
 $ Internet_Access           : chr  "Yes" "Yes" "Yes" "Yes" ...
 $ Tutoring_Sessions         : int  0 2 2 1 3 3 1 1 0 0 ...
 $ Family_Income             : chr  "Low" "Medium" "Medium" "Medium" ...
 $ Teacher_Quality           : chr  "Medium" "Medium" "Medium" "Medium" ...
 $ School_Type               : chr  "Public" "Public" "Public" "Public" ...
 $ Peer_Influence            : chr  "Positive" "Negative" "Neutral" "Negative" ...
 $ Physical_Activity         : int  3 4 4 4 4 3 2 2 1 5 ...
 $ Learning_Disabilities     : chr  "No" "No" "No" "No" ...
 $ Parental_Education_Level  : chr  "High School" "College" "Postgraduate" "High School" ...
 $ Distance_from_Home        : chr  "Near" "Moderate" "Near" "Moderate" ...
 $ Gender                    : chr  "Male" "Female" "Male" "Male" ...
 $ Exam_Score                : int  67 61 74 71 70 71 67 66 69 72 ...
# 4. Display summary statistics

summary(student)
 Hours_Studied     Attendance     Parental_Involvement Access_to_Resources
 Min.   : 1.00   Min.   : 60.00   Length:6607          Length:6607        
 1st Qu.:16.00   1st Qu.: 70.00   Class :character     Class :character   
 Median :20.00   Median : 80.00   Mode  :character     Mode  :character   
 Mean   :19.98   Mean   : 79.98                                           
 3rd Qu.:24.00   3rd Qu.: 90.00                                           
 Max.   :44.00   Max.   :100.00                                           
 Extracurricular_Activities  Sleep_Hours     Previous_Scores 
 Length:6607                Min.   : 4.000   Min.   : 50.00  
 Class :character           1st Qu.: 6.000   1st Qu.: 63.00  
 Mode  :character           Median : 7.000   Median : 75.00  
                            Mean   : 7.029   Mean   : 75.07  
                            3rd Qu.: 8.000   3rd Qu.: 88.00  
                            Max.   :10.000   Max.   :100.00  
 Motivation_Level   Internet_Access    Tutoring_Sessions Family_Income     
 Length:6607        Length:6607        Min.   :0.000     Length:6607       
 Class :character   Class :character   1st Qu.:1.000     Class :character  
 Mode  :character   Mode  :character   Median :1.000     Mode  :character  
                                       Mean   :1.494                       
                                       3rd Qu.:2.000                       
                                       Max.   :8.000                       
 Teacher_Quality    School_Type        Peer_Influence     Physical_Activity
 Length:6607        Length:6607        Length:6607        Min.   :0.000    
 Class :character   Class :character   Class :character   1st Qu.:2.000    
 Mode  :character   Mode  :character   Mode  :character   Median :3.000    
                                                          Mean   :2.968    
                                                          3rd Qu.:4.000    
                                                          Max.   :6.000    
 Learning_Disabilities Parental_Education_Level Distance_from_Home
 Length:6607           Length:6607              Length:6607       
 Class :character      Class :character         Class :character  
 Mode  :character      Mode  :character         Mode  :character  
                                                                  
                                                                  
                                                                  
    Gender            Exam_Score    
 Length:6607        Min.   : 55.00  
 Class :character   1st Qu.: 65.00  
 Mode  :character   Median : 67.00  
                    Mean   : 67.24  
                    3rd Qu.: 69.00  
                    Max.   :101.00  
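
The summary above reports a maximum Exam_Score of 101. If exam scores are meant to lie on a 0 to 100 scale (an assumption; the source does not state the scale), such rows can be flagged before modeling. A minimal sketch on a small made-up data frame, where `toy`, `score_min`, and `score_max` are hypothetical:

```r
# Hypothetical stand-in for the real data; values are made up for illustration
toy <- data.frame(
  Hours_Studied = c(23, 19, 24, 29),
  Exam_Score    = c(67, 61, 101, 71)
)

# Assumed valid range for an exam score (not stated in the dataset)
score_min <- 0
score_max <- 100

# Logical vector marking rows whose score falls outside the assumed range
out_of_range <- toy$Exam_Score < score_min | toy$Exam_Score > score_max

# Inspect the flagged rows before deciding whether to drop or cap them
flagged <- toy[out_of_range, ]
print(flagged)

# One option: keep only the in-range rows
toy_in_range <- toy[!out_of_range, ]
```

Whether to drop, cap, or keep such a row is a judgment call that depends on how the scores were recorded.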
# 5. Check missing values

colSums(is.na(student))
             Hours_Studied                 Attendance 
                         0                          0 
      Parental_Involvement        Access_to_Resources 
                         0                          0 
Extracurricular_Activities                Sleep_Hours 
                         0                          0 
           Previous_Scores           Motivation_Level 
                         0                          0 
           Internet_Access          Tutoring_Sessions 
                         0                          0 
             Family_Income            Teacher_Quality 
                         0                          0 
               School_Type             Peer_Influence 
                         0                          0 
         Physical_Activity      Learning_Disabilities 
                         0                          0 
  Parental_Education_Level         Distance_from_Home 
                         0                          0 
                    Gender                 Exam_Score 
                         0                          0 
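
is.na() counts only true NA values. When read.csv() reads a text column with its default settings, an empty field is stored as the empty string "" rather than NA, so blank entries in the character columns would not appear in the counts above. A minimal sketch of a check that covers both cases, using a hypothetical `toy` data frame and a hypothetical helper `count_missing`:

```r
# Hypothetical stand-in with one blank and one NA entry in a text column
toy <- data.frame(
  Teacher_Quality = c("Medium", "", "High", NA),
  Exam_Score      = c(67, 61, 74, 71),
  stringsAsFactors = FALSE
)

# Count entries that are NA or blank (after trimming whitespace) in one column
count_missing <- function(col) {
  sum(is.na(col) | trimws(as.character(col)) == "")
}

# Apply the check to every column; returns a named vector of counts
missing_counts <- sapply(toy, count_missing)
print(missing_counts)
```

Alternatively, passing na.strings = c("NA", "") to read.csv() converts blank fields to NA at import time, after which colSums(is.na(...)) reports them directly.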
# 6. Remove rows with NA values (none were found in step 5, so this is a safeguard)

student_clean <- na.omit(student)

# 7. Check duplicate rows

sum(duplicated(student_clean))
[1] 0
# 8. Remove duplicate rows (none were found in step 7, so this is a safeguard)

student_clean <- student_clean %>%
  distinct()

# 9. Select important variables

student_clean <- student_clean %>%
  select(
    Hours_Studied,
    Attendance,
    Sleep_Hours,
    Previous_Scores,
    Exam_Score
  )

# 10. Check final cleaned dataset

glimpse(student_clean)
Rows: 6,607
Columns: 5
$ Hours_Studied   <int> 23, 19, 24, 29, 19, 19, 29, 25, 17, 23, 17, 17, 21, 9,…
$ Attendance      <int> 84, 64, 98, 89, 92, 88, 84, 78, 94, 98, 80, 97, 83, 82…
$ Sleep_Hours     <int> 7, 8, 7, 8, 6, 8, 7, 6, 6, 8, 8, 6, 8, 8, 8, 8, 10, 6,…
$ Previous_Scores <int> 73, 59, 91, 98, 65, 89, 68, 50, 80, 71, 88, 87, 97, 72…
$ Exam_Score      <int> 67, 61, 74, 71, 70, 71, 67, 66, 69, 72, 68, 71, 70, 66…
summary(student_clean)
 Hours_Studied     Attendance      Sleep_Hours     Previous_Scores 
 Min.   : 1.00   Min.   : 60.00   Min.   : 4.000   Min.   : 50.00  
 1st Qu.:16.00   1st Qu.: 70.00   1st Qu.: 6.000   1st Qu.: 63.00  
 Median :20.00   Median : 80.00   Median : 7.000   Median : 75.00  
 Mean   :19.98   Mean   : 79.98   Mean   : 7.029   Mean   : 75.07  
 3rd Qu.:24.00   3rd Qu.: 90.00   3rd Qu.: 8.000   3rd Qu.: 88.00  
 Max.   :44.00   Max.   :100.00   Max.   :10.000   Max.   :100.00  
   Exam_Score    
 Min.   : 55.00  
 1st Qu.: 65.00  
 Median : 67.00  
 Mean   : 67.24  
 3rd Qu.: 69.00  
 Max.   :101.00  
# 11. Save cleaned dataset

write.csv(student_clean, "student_clean.csv", row.names = FALSE)


# 12. Histogram of Exam Score
ggplot(student_clean, aes(x = Exam_Score)) +
  geom_histogram(bins = 30, color = "black", fill = "lightblue") +
  labs(
    title = "Distribution of Exam Scores",
    x = "Exam Score",
    y = "Number of Students"
  )

# 13. Hours Studied vs Exam Score
ggplot(student_clean, aes(x = Hours_Studied, y = Exam_Score)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Hours Studied vs Exam Score",
    x = "Hours Studied",
    y = "Exam Score"
  )
`geom_smooth()` using formula = 'y ~ x'

# 14. Attendance vs Exam Score
ggplot(student_clean, aes(x = Attendance, y = Exam_Score)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Attendance vs Exam Score",
    x = "Attendance",
    y = "Exam Score"
  )
`geom_smooth()` using formula = 'y ~ x'

# 15. Previous Scores vs Exam Score
ggplot(student_clean, aes(x = Previous_Scores, y = Exam_Score)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Previous Scores vs Exam Score",
    x = "Previous Scores",
    y = "Exam Score"
  )
`geom_smooth()` using formula = 'y ~ x'

# 16. Sleep Hours vs Exam Score
ggplot(student_clean, aes(x = Sleep_Hours, y = Exam_Score)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Sleep Hours vs Exam Score",
    x = "Sleep Hours",
    y = "Exam Score"
  )
`geom_smooth()` using formula = 'y ~ x'

# 17. Correlation matrix
cor_matrix <- cor(student_clean)

# 18. Correlation plot
corrplot(
  cor_matrix,
  method = "color",
  type = "upper",
  addCoef.col = "black",
  tl.col = "black",
  tl.srt = 45
)

# 19. Boxplot of Exam Score
ggplot(student_clean, aes(y = Exam_Score)) +
  geom_boxplot(fill = "lightgreen") +
  labs(
    title = "Boxplot of Exam Scores",
    y = "Exam Score"
  )

Predictive Modeling: Machine Learning Methodology

In this section, we build a model to predict student exam scores from key learning and academic factors. Because the target variable (Exam_Score) is numeric, this is a regression task. We fit a multiple linear regression model using the `tidymodels` framework, which keeps the splitting, fitting, and evaluation steps consistent and reproducible.

1. Package Initialization and Data Ingestion

# 1. Load required packages
library(tidyverse)
library(tidymodels)
── Attaching packages ────────────────────────────────────── tidymodels 1.5.0 ──
✔ broom        1.0.13     ✔ rsample      1.3.2 
✔ dials        1.4.3      ✔ tailor       0.1.0 
✔ infer        1.1.0      ✔ tune         2.1.0 
✔ modeldata    1.5.1      ✔ workflows    1.3.0 
✔ parsnip      1.6.0      ✔ workflowsets 1.1.1 
✔ recipes      1.3.3      ✔ yardstick    1.4.0 
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step()   masks stats::step()
# 2. Import cleaned dataset
student_clean <- read.csv("student_clean.csv")

Methodology Explanation: The workflow begins by loading the core libraries: tidyverse for general data manipulation and tidymodels for the modeling steps. We then read student_clean.csv, the cleaned subset containing the target variable (Exam_Score) and the four numeric predictors (Hours_Studied, Attendance, Sleep_Hours, and Previous_Scores).

2. Data Splitting Strategy

# 3. Split data into training and testing sets

set.seed(123)

student_split <- initial_split(student_clean, prop = 0.8)

train_data <- training(student_split)
test_data <- testing(student_split)

Methodology Explanation:

To evaluate the model’s performance on unseen data, we use an 80:20 train-test split.

  • Training Set (train_data): Comprises 80% of the observations and is used exclusively to estimate the model’s regression coefficients.

  • Testing Set (test_data): Comprises the remaining 20% and acts as an independent evaluation set that the model never sees during fitting.

  • Reproducibility: We call set.seed(123) before the split so that the pseudo-random partition is identical each time this Quarto document is rendered, and the reported metrics do not change between renders.
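
To make the role of the seed concrete, here is a base-R sketch of an 80:20 random split. It illustrates the idea only and is not how initial_split() is implemented; `toy` and `split_80_20` are hypothetical names.

```r
# Hypothetical data: 100 rows with a row identifier
toy <- data.frame(id = 1:100, Exam_Score = 55 + (1:100 %% 20))

# Randomly assign 80% of rows to training and the rest to testing
split_80_20 <- function(data, seed) {
  set.seed(seed)                                   # fix the random number stream
  n_train   <- floor(0.8 * nrow(data))             # number of training rows
  train_idx <- sample(seq_len(nrow(data)), size = n_train)
  list(train = data[train_idx, ], test = data[-train_idx, ])
}

first  <- split_80_20(toy, seed = 123)
second <- split_80_20(toy, seed = 123)

# Same seed gives the same partition
same_partition <- identical(first$train$id, second$train$id)

# Training and testing sets never share a row
shared_rows <- length(intersect(first$train$id, first$test$id))
```

Changing the seed changes which rows land in each set, which is why metrics from a single split can shift slightly from one seed to another.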

3. Model Specification and Model Fitting

# 4. Specify linear regression model

lm_spec <- linear_reg() %>%
  set_engine("lm")

# 5. Fit the model

lm_fit <- lm_spec %>%
  fit(
    Exam_Score ~ Hours_Studied + Attendance + Sleep_Hours + Previous_Scores,
    data = train_data
  )

# 6. View model result
lm_fit
parsnip model object


Call:
stats::lm(formula = Exam_Score ~ Hours_Studied + Attendance + 
    Sleep_Hours + Previous_Scores, data = data)

Coefficients:
    (Intercept)    Hours_Studied       Attendance      Sleep_Hours  
       42.35642          0.29049          0.19764         -0.03149  
Previous_Scores  
        0.04681  

Methodology Explanation: We fit a multiple linear regression model, which estimates a linear function relating the four predictors to the target variable: \(\text{Exam\_Score} = \beta_0 + \beta_1\,\text{Hours\_Studied} + \beta_2\,\text{Attendance} + \beta_3\,\text{Sleep\_Hours} + \beta_4\,\text{Previous\_Scores} + \varepsilon\). linear_reg() specifies the model type, and set_engine("lm") selects R's lm() function, which estimates the coefficients by ordinary least squares. fit() then trains the model on train_data, solving for the intercept (\(\beta_0\)) and the slopes (\(\beta_1, \beta_2, \beta_3, \beta_4\)). In the output above, for example, each additional hour studied is associated with about 0.29 more exam points when the other predictors are held fixed.
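
As a worked check of what the fitted coefficients mean, the prediction for one student can be computed by hand. The coefficients below are copied from the printed lm_fit output, and the student values are the first row shown in the str(student) output; the names `coefs` and `student_1` are hypothetical. Whether that student fell in the training or the testing set is not known, so this is an arithmetic illustration rather than an evaluation.

```r
# Coefficients copied from the printed lm_fit output
coefs <- c(
  intercept       = 42.35642,
  Hours_Studied   = 0.29049,
  Attendance      = 0.19764,
  Sleep_Hours     = -0.03149,
  Previous_Scores = 0.04681
)

# First student in the dataset (values from the str() output)
student_1 <- c(
  Hours_Studied   = 23,
  Attendance      = 84,
  Sleep_Hours     = 7,
  Previous_Scores = 73
)

# Linear prediction: intercept plus the sum of coefficient * value
predicted <- coefs[["intercept"]] + sum(coefs[names(student_1)] * student_1)
round(predicted, 2)
```

The result is about 68.84, against an observed Exam_Score of 67 for that student.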

4. Out-of-Sample Prediction and Performance Validation

# 7. Make prediction on test data
predictions <- predict(lm_fit, new_data = test_data) %>%
  bind_cols(test_data)

# 8. Evaluate model performance

rmse_result <- rmse(
  predictions,
  truth = Exam_Score,
  estimate = .pred
)

rsq_result <- rsq(
  predictions,
  truth = Exam_Score,
  estimate = .pred
)

rmse_result
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard        2.35
rsq_result
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rsq     standard       0.622

Methodology Explanation:

After fitting the model, we apply it to the unseen test_data with predict() to generate predicted values (.pred). We evaluate this baseline model with two regression metrics:

  1. Root Mean Squared Error (RMSE): The square root of the mean squared residual (the difference between actual and predicted values), expressed in exam-score points. A smaller RMSE means predictions sit closer to the observed scores. The test-set RMSE of 2.35 means predictions are typically off by about 2.35 points.

  2. R-squared (\(R^2\)): The coefficient of determination, indicating the proportion of variance in Exam_Score that is explained by the combined linear effect of Hours_Studied, Attendance, Sleep_Hours, and Previous_Scores. yardstick's rsq() computes it as the squared correlation between actual and predicted values. The test-set value of 0.622 means the model accounts for about 62% of the variance.
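
Both metrics are short formulas. The sketch below computes them by hand on made-up vectors (`truth` and `pred` are hypothetical and unrelated to the real data). R² is computed as the squared correlation, which is the definition yardstick's rsq() uses.

```r
# Hypothetical actual and predicted exam scores
truth <- c(60, 65, 70, 75)
pred  <- c(62, 64, 71, 73)

# RMSE: square root of the mean squared residual
rmse_manual <- sqrt(mean((truth - pred)^2))

# R-squared as the squared correlation between actual and predicted values
rsq_manual <- cor(truth, pred)^2

print(c(rmse = rmse_manual, rsq = rsq_manual))
```

RMSE is in the same units as the target, so it can be read directly as points on the exam, whereas R² is unitless.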

5. Model Serialization for Deployment

# 9. Save model

saveRDS(lm_fit, "student_model.rds")

Methodology Explanation: The final step exports the trained model with saveRDS(). Serializing the fitted parsnip model object to student_model.rds allows the web application (app.R) to load the pre-trained model with readRDS() and compute predictions on the Prediction Model tab without refitting on every session. Note that the app.R listing in this report refits the model at startup rather than loading this file.
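
A minimal sketch of the save-and-reload round trip. To keep it self-contained, a base lm() fit on made-up data stands in for the parsnip fit, and a temporary file stands in for student_model.rds; `toy`, `toy_fit`, and `model_path` are hypothetical.

```r
# Hypothetical stand-in: a base lm() fit on made-up data
toy <- data.frame(
  Hours_Studied = c(10, 15, 20, 25, 30),
  Exam_Score    = c(62, 65, 67, 70, 72)
)
toy_fit <- lm(Exam_Score ~ Hours_Studied, data = toy)

# Save to a temporary file (the report writes "student_model.rds" instead)
model_path <- tempfile(fileext = ".rds")
saveRDS(toy_fit, model_path)

# In an app, this load would run once at startup instead of refitting
loaded_fit <- readRDS(model_path)

# The reloaded model gives the same prediction as the original
new_student   <- data.frame(Hours_Studied = 22)
pred_original <- predict(toy_fit, newdata = new_student)
pred_loaded   <- predict(loaded_fit, newdata = new_student)
all.equal(pred_original, pred_loaded)
```

With the parsnip fit saved in this report, the app would still need library(tidymodels) loaded before calling predict(), and the argument is new_data rather than newdata.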

6. Shiny Application (app.R)

#  1. Load required packages
library(shiny)

Attaching package: 'shiny'
The following object is masked from 'package:infer':

    observe
library(shinydashboard)

Attaching package: 'shinydashboard'
The following object is masked from 'package:graphics':

    box
library(tidyverse)
library(tidymodels)

# 2. Load cleaned dataset
student_clean <- read.csv("student_clean.csv")

# 3. Train model inside app

set.seed(123)

student_split <- initial_split(student_clean, prop = 0.8)

train_data <- training(student_split)
test_data <- testing(student_split)

lm_spec <- linear_reg() %>%
  set_engine("lm")

lm_fit <- lm_spec %>%
  fit(
    Exam_Score ~ Hours_Studied + Attendance + Sleep_Hours + Previous_Scores,
    data = train_data
  )

predictions <- predict(lm_fit, new_data = test_data) %>%
  bind_cols(test_data)

rmse_value <- rmse(predictions, truth = Exam_Score, estimate = .pred)
rsq_value <- rsq(predictions, truth = Exam_Score, estimate = .pred)

# ============================================================
# User Interface

ui <- fluidPage(
  
  titlePanel("Student Performance Prediction App"),
  
  tabsetPanel(
    
    # EDA TAB
    tabPanel(
      "Exploratory Data Analysis",
      
      sidebarLayout(
        sidebarPanel(
          selectInput(
            inputId = "xvar",
            label = "Choose X Variable:",
            choices = c(
              "Hours_Studied",
              "Attendance",
              "Sleep_Hours",
              "Previous_Scores"
            ),
            selected = "Hours_Studied"
          )
        ),
        
        mainPanel(
          h3("Distribution of Exam Score"),
          plotOutput("hist_plot"),
          
          h3("Relationship with Exam Score"),
          plotOutput("scatter_plot"),
          
          h3("Summary of Dataset"),
          tableOutput("summary_table")
        )
      )
    ),
    
    # MODELLING TAB
    tabPanel(
      "Prediction Model",
      
      sidebarLayout(
        sidebarPanel(
          # Defaults are rounded and step = 1 so each slider moves in whole
          # numbers, matching the integer-valued predictors in the data
          sliderInput(
            "hours",
            "Hours Studied:",
            min = min(student_clean$Hours_Studied),
            max = max(student_clean$Hours_Studied),
            value = round(mean(student_clean$Hours_Studied)),
            step = 1
          ),
          
          sliderInput(
            "attendance",
            "Attendance:",
            min = min(student_clean$Attendance),
            max = max(student_clean$Attendance),
            value = round(mean(student_clean$Attendance)),
            step = 1
          ),
          
          sliderInput(
            "sleep",
            "Sleep Hours:",
            min = min(student_clean$Sleep_Hours),
            max = max(student_clean$Sleep_Hours),
            value = round(mean(student_clean$Sleep_Hours)),
            step = 1
          ),
          
          sliderInput(
            "previous",
            "Previous Scores:",
            min = min(student_clean$Previous_Scores),
            max = max(student_clean$Previous_Scores),
            value = round(mean(student_clean$Previous_Scores)),
            step = 1
          )
        ),
        
        mainPanel(
          h3("Predicted Exam Score"),
          verbatimTextOutput("prediction_output"),
          
          h3("Model Performance"),
          verbatimTextOutput("model_performance")
        )
      )
    )
  )
)

# ============================================================
# Server

server <- function(input, output) {
  
  # 1. Histogram of Exam Score
  output$hist_plot <- renderPlot({
    ggplot(student_clean, aes(x = Exam_Score)) +
      geom_histogram(bins = 30, color = "black", fill = "lightblue") +
      labs(
        title = "Distribution of Exam Scores",
        x = "Exam Score",
        y = "Number of Students"
      )
  })
  
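  # 2. Reactive scatter plot
  # aes_string() is deprecated in ggplot2; the .data pronoun looks up the
  # column named by the selected input instead
  output$scatter_plot <- renderPlot({
    ggplot(student_clean, aes(x = .data[[input$xvar]], y = Exam_Score)) +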
      geom_point() +
      geom_smooth(method = "lm", se = FALSE) +
      labs(
        title = paste(input$xvar, "vs Exam Score"),
        x = input$xvar,
        y = "Exam Score"
      )
  })
  
  # 3. Summary table
  output$summary_table <- renderTable({
    summary(student_clean)
  })
  
  # 4. Prediction output
  output$prediction_output <- renderPrint({
    
    new_student <- data.frame(
      Hours_Studied = input$hours,
      Attendance = input$attendance,
      Sleep_Hours = input$sleep,
      Previous_Scores = input$previous
    )
    
    predicted_score <- predict(lm_fit, new_data = new_student)
    
    paste("Predicted Exam Score:", round(predicted_score$.pred, 2))
  })
  
  # 5. Model performance output
  output$model_performance <- renderPrint({
    print(rmse_value)
    print(rsq_value)
  })
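}

# Combine the user interface and server logic into a runnable app
shinyApp(ui = ui, server = server)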

Methodology Explanation:

The cleaned dataset is read into R and split into 80% training data and 20% testing data. A linear regression model is fitted with the tidymodels framework to predict students’ exam scores from Hours Studied, Attendance, Sleep Hours, and Previous Scores. The fitted model then generates predictions on the testing set, and performance is evaluated with RMSE, which measures prediction error, and R², which measures how much of the variation in exam scores the model explains. In app.R, this fitting happens once at startup, before the user interface is defined.

The Shiny application has two tabs: Exploratory Data Analysis and Prediction Model. In the EDA tab, users can explore the relationship between each study-related factor and exam scores through a histogram, a scatter plot with a selectable x variable, and summary statistics. In the Prediction Model tab, users adjust Hours Studied, Attendance, Sleep Hours, and Previous Scores with sliders; the model returns a predicted exam score that updates as the sliders move, and the test-set RMSE and R² are shown alongside it.

The server function defines the app's outputs. It renders the histogram of exam scores, the scatter plot that re-renders whenever the selected variable changes, and the summary table. For prediction, it assembles the four slider values into a one-row data frame and passes it to the fitted model. The RMSE and R² values are printed so that users can judge the model’s predictive accuracy.
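
The prediction step depends on the new row's column names matching the predictor names in the model formula. A minimal sketch of that step outside Shiny, assuming a hypothetical helper `make_new_student` and a base lm() fit on made-up data standing in for the parsnip model:

```r
# Build the one-row data frame the model expects; column names must match
# the predictor names used in the model formula
make_new_student <- function(hours, attendance, sleep, previous) {
  data.frame(
    Hours_Studied   = hours,
    Attendance      = attendance,
    Sleep_Hours     = sleep,
    Previous_Scores = previous
  )
}

# Hypothetical stand-in model: base lm() on made-up data
toy <- data.frame(
  Hours_Studied   = c(10, 15, 20, 25, 30, 12),
  Attendance      = c(60, 70, 80, 90, 100, 85),
  Sleep_Hours     = c(6, 7, 8, 7, 6, 9),
  Previous_Scores = c(55, 60, 75, 80, 95, 70),
  Exam_Score      = c(60, 63, 67, 70, 74, 65)
)
toy_fit <- lm(
  Exam_Score ~ Hours_Studied + Attendance + Sleep_Hours + Previous_Scores,
  data = toy
)

# Values that would come from input$hours, input$attendance, and so on
new_student <- make_new_student(hours = 20, attendance = 80, sleep = 7, previous = 75)
predicted_score <- predict(toy_fit, newdata = new_student)
round(unname(predicted_score), 2)
```

Keeping this logic in a plain function makes it possible to check the prediction path without launching the app.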

The shinyApp(ui = ui, server = server) call combines the user interface and server logic into a complete Shiny application. Users open it in a web browser, explore the dataset interactively, and obtain exam score predictions from the fitted linear regression model as they adjust the inputs.