2026-03-25

Introduction of Dataset

The dataset I will be using is the Student Exam Performance Dataset Analysis.

The dataset lists 19 variables that may affect a student’s exam score. They fall into three groups: socio-economic factors, lifestyle factors, and academic factors.

The question I will answer is: can we accurately predict a student’s exam score based on academic factors?

Uploading and Defining Dataset

Here we load the dataset and select the academic variables we will use. While loading, I also define which strings count as missing values, because I had trouble finding them while cleaning the data.

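# tidymodels attaches dplyr, tidyr, rsample, parsnip, broom, and yardstick,
# which the later slides rely on
library(tidymodels)

# Treat "NA", empty strings, and "N/A" as missing values while reading
StudentPerformanceFactors = 
  read.csv("C:/Users/sophi/Downloads/StudentPerformanceFactors.csv", 
    na.strings = c("NA", "", "N/A")) 
# Keep only the academic variables and the outcome, Exam_Score
df = subset(StudentPerformanceFactors, select = c(Hours_Studied, 
  Attendance, Previous_Scores, Motivation_Level, 
  Tutoring_Sessions, Teacher_Quality, Exam_Score))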
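
To see why setting na.strings matters, here is a minimal sketch on a made-up four-row CSV (the rows are hypothetical, not taken from the dataset). With the defaults, a blank field or "N/A" in a text column is read as an ordinary string, so is.na() does not find it.

```r
# Hypothetical four-row CSV; the blank field and "N/A" stand in for missing entries.
csv_text <- "Hours_Studied,Teacher_Quality,Exam_Score
23,Medium,67
19,,61
24,N/A,74
29,High,71"

# Default settings: only the literal string "NA" is treated as missing.
default_read <- read.csv(text = csv_text)

# Same na.strings as on this slide: "NA", "", and "N/A" all become NA.
explicit_read <- read.csv(text = csv_text, na.strings = c("NA", "", "N/A"))

sum(is.na(default_read$Teacher_Quality))   # missing values found with defaults
sum(is.na(explicit_read$Teacher_Quality))  # missing values found with na.strings set
```

The second count is larger because the empty field and "N/A" are now recognised as missing.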

Dataset Variables

These are the academic variables we will use to answer the question, along with six rows of the data.

Hours_Studied  Attendance  Previous_Scores  Motivation_Level  Tutoring_Sessions  Teacher_Quality  Exam_Score
23             84          73               Low               0                  Medium           67
19             64          59               Low               2                  Medium           61
24             98          91               Medium            2                  Medium           74
29             89          98               Medium            1                  Medium           71
19             92          65               Medium            3                  High             70
19             88          89               Medium            3                  Medium           71

Cleaning Dataset

We first check whether the data contains missing values and, if so, in which columns.

anyNA(df)
[1] TRUE
colSums(is.na(df))
    Hours_Studied        Attendance   Previous_Scores  Motivation_Level 
                0                 0                 0                 0 
Tutoring_Sessions   Teacher_Quality        Exam_Score 
                0                78                 0 
sum(is.na(df$Teacher_Quality))/length(df$Teacher_Quality) * 100
[1] 1.180566
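
The percentage above is computed for one column. As a sketch, the same figure can be obtained for every column at once with colMeans(); the small data frame below is hypothetical.

```r
# Hypothetical data with one missing Teacher_Quality value out of four rows.
toy <- data.frame(
  Attendance      = c(84, 64, 98, 89),
  Teacher_Quality = c("Medium", NA, "High", "Medium"),
  Exam_Score      = c(67, 61, 74, 71)
)

# is.na() gives a TRUE/FALSE matrix; the column means are the missing shares.
missing_pct <- colMeans(is.na(toy)) * 100
missing_pct
```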

Cleaning Data

Now that we have identified which column has missing values, we will remove the rows where that value is missing. We will also remove duplicate rows.

cleaned_data = drop_na(df, Teacher_Quality)
cleaned_data = cleaned_data %>%
    distinct()
colSums(is.na(cleaned_data)) 
    Hours_Studied        Attendance   Previous_Scores  Motivation_Level 
                0                 0                 0                 0 
Tutoring_Sessions   Teacher_Quality        Exam_Score 
                0                 0                 0 
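
It can help to track how many rows each cleaning step removes. The sketch below uses base R stand-ins for drop_na() and distinct() (is.na() and duplicated()) on a hypothetical data frame, so it is an illustration of the idea rather than the exact code used for the slides.

```r
# Hypothetical data: one missing Teacher_Quality and one exact duplicate row.
toy <- data.frame(
  Hours_Studied   = c(23, 19, 19, 24, 24),
  Teacher_Quality = c("Medium", NA, "High", "Medium", "Medium"),
  Exam_Score      = c(67, 61, 70, 74, 74)
)

# Step 1: drop rows where Teacher_Quality is missing.
no_missing <- toy[!is.na(toy$Teacher_Quality), ]

# Step 2: drop rows that exactly repeat an earlier row.
no_dupes <- no_missing[!duplicated(no_missing), ]

c(original = nrow(toy), after_na = nrow(no_missing), after_dupes = nrow(no_dupes))
```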

Cleaned Dataset

This is the summary of the new, cleaned dataset:

 Hours_Studied     Attendance  Previous_Scores  Motivation_Level  
 Min.   : 1.00   Min.   : 60   Min.   : 50.00   Length:6526       
 1st Qu.:16.00   1st Qu.: 70   1st Qu.: 63.00   Class :character  
 Median :20.00   Median : 80   Median : 75.00   Mode  :character  
 Mean   :19.98   Mean   : 80   Mean   : 75.05                     
 3rd Qu.:24.00   3rd Qu.: 90   3rd Qu.: 88.00                     
 Max.   :44.00   Max.   :100   Max.   :100.00                     
 Tutoring_Sessions Teacher_Quality      Exam_Score    
 Min.   :0.000     Length:6526        Min.   : 55.00  
 1st Qu.:1.000     Class :character   1st Qu.: 65.00  
 Median :1.000     Mode  :character   Median : 67.00  
 Mean   :1.494                        Mean   : 67.24  
 3rd Qu.:2.000                        3rd Qu.: 69.00  
 Max.   :8.000                        Max.   :101.00  
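
The summary shows a maximum Exam_Score of 101. If the exam is scored from 0 to 100 (an assumption, since the scale is not stated here), a quick range check like the sketch below would flag such rows for review. The scores are hypothetical, with 101 mirroring the maximum above.

```r
scores <- c(55, 67, 69, 101, 74)  # hypothetical; 101 mirrors the summary's maximum
valid_range <- c(0, 100)          # assumed scale, not stated in the source

# TRUE for any score outside the assumed range.
out_of_range <- scores < valid_range[1] | scores > valid_range[2]

which(out_of_range)  # positions of the flagged scores
sum(out_of_range)    # how many were flagged
```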

Exploratory Data Analysis

Now that the data is cleaned, we move on to exploratory data analysis. This histogram shows the distribution of the exam scores. Most students scored below 70: the third quartile in the summary is 69, so roughly three quarters of students scored 69 or lower.
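
A claim like "most students scored below 70" can be checked directly rather than read off the plot. A minimal sketch with hypothetical scores:

```r
scores <- c(61, 65, 67, 67, 68, 69, 70, 74)  # hypothetical exam scores

# The mean of a TRUE/FALSE vector is the share of TRUE values.
share_below_70 <- mean(scores < 70)
share_below_70
```

On the real data, the same line would use cleaned_data$Exam_Score in place of scores.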

Exploratory Data Analysis cont.

We can see a small positive slope for this variable, which suggests it may be a useful predictor in the model.

Exploratory Data Analysis cont.

As on the previous slide, this scatter plot also shows a small positive slope.

Exploratory Data Analysis cont.

From this plot, we see that this variable is categorical, so we will convert it to numbers before using it in the linear model.

Exploratory Data Analysis cont.

The conclusion from the last slide also applies to this categorical variable.

Exploratory Data Analysis cont.

We see little to no slope in this plot, so previous scores appear to have only a weak relationship with exam scores.

Exploratory Data Analysis cont.

The boxplot shows a slight improvement in score as the number of tutoring sessions increases. There also appear to be fewer outliers at higher numbers of tutoring sessions, but that may simply be because fewer students attended that many sessions.
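
To check whether the thinner boxes simply contain fewer students, we can count the rows in each group next to the group medians. This is a sketch on hypothetical data, not the values behind the plot.

```r
# Hypothetical scores grouped by number of tutoring sessions.
toy <- data.frame(
  Tutoring_Sessions = c(0, 0, 0, 0, 1, 1, 1, 2, 2, 3),
  Exam_Score        = c(62, 65, 66, 80, 66, 67, 69, 68, 70, 71)
)

# Number of students in each group.
group_sizes <- table(toy$Tutoring_Sessions)

# Median exam score in each group.
group_medians <- tapply(toy$Exam_Score, toy$Tutoring_Sessions, median)

group_sizes
group_medians
```

If the group sizes shrink quickly, fewer visible outliers at the high end says little about tutoring itself.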

Altering Variables

Before we create the model, we need to change the categorical variables into numerical variables so they can be used in the linear regression prediction model.
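
The levels argument is what makes this conversion meaningful. A small sketch on a hypothetical vector shows the difference: with explicit levels, Low, Medium, and High become 1, 2, and 3, while the default alphabetical ordering would scramble them. Note that coding the levels as 1, 2, 3 also assumes the steps between them are equally large.

```r
quality <- c("Low", "High", "Medium", "Low")  # hypothetical values

# Explicit ordering: Low = 1, Medium = 2, High = 3.
explicit <- as.numeric(factor(quality, levels = c("Low", "Medium", "High")))

# Default ordering is alphabetical: High = 1, Low = 2, Medium = 3.
alphabetical <- as.numeric(factor(quality))

explicit      # 1 3 2 1
alphabetical  # 2 1 3 2
```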

cleaned_data$Motivation_Level <- 
  as.numeric(factor(cleaned_data$Motivation_Level,
  levels = c("Low", "Medium", "High")))
cleaned_data$Teacher_Quality <- 
  as.numeric(factor(cleaned_data$Teacher_Quality,
  levels = c("Low", "Medium", "High")))
sapply(cleaned_data, class)
    Hours_Studied        Attendance   Previous_Scores  Motivation_Level 
        "integer"         "integer"         "integer"         "numeric" 
Tutoring_Sessions   Teacher_Quality        Exam_Score 
        "integer"         "numeric"         "integer" 

Building the Prediction Model

We split the data into a training set, used to fit the model, and a testing set, used to check it. We then store each set in its own variable.

df_split <- cleaned_data %>% initial_split(strata = Exam_Score)
df_split
<Training/Testing/Total>
<4892/1634/6526>
df_train = training(df_split)
df_test = testing(df_split)
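
The split above is random, so the exact rows change on every run unless a seed is set. The sketch below shows the idea with base R's sample() as a simplified stand-in: it is not stratified like initial_split(strata = Exam_Score), and the seed and data are hypothetical. The 75/25 proportion matches the 4892/1634 counts shown above.

```r
set.seed(123)  # hypothetical seed; any fixed value makes the split repeatable

n <- 20
toy <- data.frame(id = seq_len(n), Exam_Score = 60 + seq_len(n))

# Pick 75% of the row numbers at random for training.
train_idx <- sample(n, size = floor(0.75 * n))

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]  # everything not chosen for training

c(train = nrow(toy_train), test = nrow(toy_test))
```

Calling set.seed() before initial_split() would make the slides' split repeatable in the same way.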

Building the Prediction Model

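Then we create the linear model. Note that the call below fits on cleaned_data, the full dataset, so the testing rows also inform the coefficients; passing df_train instead would keep the test set fully held out.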

lm_fit = linear_reg() %>% fit(Exam_Score ~., data = cleaned_data)
tidy(lm_fit)
# A tibble: 7 × 5
  term              estimate std.error statistic   p.value
  <chr>                <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)        39.1      0.318       123.  0        
2 Hours_Studied       0.293    0.00504      58.1 0        
3 Attendance          0.198    0.00261      75.8 0        
4 Previous_Scores     0.0482   0.00209      23.0 9.69e-113
5 Motivation_Level    0.530    0.0433       12.2 5.38e- 34
6 Tutoring_Sessions   0.494    0.0245       20.2 7.91e- 88
7 Teacher_Quality     0.525    0.0503       10.4 2.42e- 25
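
To make the coefficient table concrete, here is a worked prediction using the rounded estimates above and the first student listed on the Dataset Variables slide (23 hours studied, 84 attendance, 73 previous score, Low motivation, 0 tutoring sessions, Medium teacher quality, actual score 67). Because the estimates are rounded, the result is approximate.

```r
# Rounded estimates copied from the tidy() output above.
coefs <- c(
  intercept = 39.1, Hours_Studied = 0.293, Attendance = 0.198,
  Previous_Scores = 0.0482, Motivation_Level = 0.530,
  Tutoring_Sessions = 0.494, Teacher_Quality = 0.525
)

# First student on the Dataset Variables slide, with the numeric encoding applied.
student <- c(
  intercept = 1, Hours_Studied = 23, Attendance = 84,
  Previous_Scores = 73, Motivation_Level = 1,   # "Low" encoded as 1
  Tutoring_Sessions = 0, Teacher_Quality = 2    # "Medium" encoded as 2
)

# A linear model's prediction is the sum of coefficient times value.
predicted <- sum(coefs * student[names(coefs)])
predicted  # about 67.57; the recorded score for this student is 67
```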

Prediction with Model

We then generate predictions from the model for both the training and testing data, and store each set of predictions alongside the true exam scores.

lm_fit %>% 
  predict(new_data = df_train)
results_train = lm_fit %>%
  predict(new_data = df_train) %>%
  mutate(truth = df_train$Exam_Score)
results_test = lm_fit %>% 
  predict(new_data = df_test) %>% 
  mutate(truth = df_test$Exam_Score)
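
For the test-set predictions to be an honest check, the model should not see the testing rows while it is being fit. A minimal sketch of that pattern, using base lm() on simulated data (the seed, sample size, and coefficients are all hypothetical):

```r
set.seed(1)  # hypothetical seed

# Simulated data with a known linear relationship plus noise.
n <- 200
sim <- data.frame(Hours_Studied = runif(n, 1, 40), Attendance = runif(n, 60, 100))
sim$Exam_Score <- 40 + 0.3 * sim$Hours_Studied + 0.2 * sim$Attendance + rnorm(n, sd = 2)

# 75/25 split into training and testing rows.
train_idx <- sample(n, size = floor(0.75 * n))
sim_train <- sim[train_idx, ]
sim_test  <- sim[-train_idx, ]

# Fit on the training rows only, then predict both sets.
fit <- lm(Exam_Score ~ ., data = sim_train)
pred_train <- predict(fit, newdata = sim_train)
pred_test  <- predict(fit, newdata = sim_test)

c(train = length(pred_train), test = length(pred_test))
```

With tidymodels, the equivalent change would be passing df_train as the data argument of fit().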

Evaluating the Prediction and Actual

Now we compute the root mean square error (RMSE): the square root of the average squared difference between the model’s predicted values and the actual observed values. It is expressed in the same units as Exam_Score, so a smaller value means predictions land closer to the real scores.

We will compute the RMSE for both the training data and the testing data, as shown on the next slide. We expect the two .estimate values to be close; a much larger testing error would suggest the model is overfitting the training data.
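
The RMSE calculation itself is short enough to do by hand. In this sketch, the true scores are four of the exam scores listed earlier, and the predictions are hypothetical.

```r
truth <- c(67, 61, 74, 71)
pred  <- c(68, 63, 73, 69)  # hypothetical predictions

# Square the errors, average them, then take the square root.
rmse_by_hand <- sqrt(mean((truth - pred)^2))
rmse_by_hand
```

The errors here are 1, 2, 1, and 2 points, so the result sits between 1 and 2, pulled toward the larger errors because they are squared.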

Evaluating the Prediction and Actual

results_train %>%
   rmse(truth = truth, estimate = .pred)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard        2.34
results_test %>%
   rmse(truth = truth, estimate = .pred)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard        2.68

Evaluating the Prediction and Actual

Plotting the training and testing results

Evaluating the Prediction and Actual

Based on the plots in the previous slide, the linear model seems to accurately predict students’ exam scores based on the academic variables in the data. The RMSE is 2.34 on the training data and 2.68 on the testing data, so the model’s predictions typically deviate from the actual score by about 2.5 points. RMSE is measured in exam-score points, not as a percentage.

There are also quite a few outliers in the plots. However, the majority of the data fits the linear regression line well, so we can say that academic variables can accurately predict a student’s exam score.