Introduction

Student performance is a multifaceted measure of the success of both students and the academic institutions they attend, and final exam scores are an important measure of academic achievement. A wide variety of factors, both individual and societal, contribute to student success, and academic achievement can in turn shape a student's future opportunities. Identifying the most important predictors of student performance is therefore valuable for helping students perform better, benefiting both the individual and the institutions they attend.

Description of the Data

The working dataset(1) has 6607 observations providing a comprehensive overview of various factors affecting student performance in exams. It includes information on study habits, attendance, parental involvement, and other aspects influencing academic success across 20 variables, which are listed below.

Hours_Studied: Number of hours spent studying per week (numeric)

Attendance: Percentage of classes attended (numeric)

Parental_Involvement: Level of parental involvement in the student’s education (Low, Medium, High)

Access_to_Resources: Availability of educational resources (Low, Medium, High)

Extracurricular_Activities: Participation in extracurricular activities (Yes, No)

Sleep_Hours: Average number of hours of sleep per night (numeric)

Previous_Scores: Average of scores from previous exams (numeric)

Motivation_Level: Student’s level of motivation (Low, Medium, High)

Internet_Access: Availability of internet access (Yes, No)

Tutoring_Sessions: Number of tutoring sessions attended per month (numeric)

Family_Income: Family income level (Low, Medium, High)

Teacher_Quality: Quality of the teachers (Low, Medium, High)

School_Type: Type of school attended (Public, Private)

Peer_Influence: Influence of peers on academic performance (Positive, Neutral, Negative)

Physical_Activity: Average number of hours of physical activity per week (numeric)

Learning_Disabilities: Presence of learning disabilities (Yes, No)

Parental_Education_Level: Highest education level of parents (High School, College, Postgraduate)

Distance_from_Home: Distance from home to school (Near, Moderate, Far)

Gender: Gender of the student (Male, Female)

Exam_Score: Final exam score (numeric)

Research Questions

The final exam score of a student is often considered a relatively good measure of how well the student learned the material and succeeded in the class, and it often makes up a significant portion of the student's final grade. As such, the final exam score will be the main response variable we consider from the working dataset.

Grading scales vary across institutions, and while a failing grade is often defined as one below 60, a more common threshold for satisfactory performance, and for a course to count as passed at many institutions, is a C or above, which translates to 70% or higher(2)(3). As such, a binary response variable will be created indicating whether the final exam score is greater than or equal to 70, and a model will be built from the working dataset to predict student performance on this variable.

Therefore, the two working research questions are:

  1. How do different predictors relate to the final exam performance of students?

  2. What factors best predict whether or not a student has a satisfactory (greater than or equal to 70%) final exam grade, and how accurately can their performance be predicted based on these factors?

Methodology

We will begin by examining the explanatory and response variables in the dataset, checking for possible violations of the assumptions of the linear and logistic models as well as for sparse categories and missing information. We will examine the underlying structure of the data, identify relationships between variables, and check for possible multicollinearity. We will look for outliers or unusual observations that may affect the final model or indicate possible data entry errors. Where possible, missing values will be imputed. Violations of assumptions will be handled through appropriate transformations. Sparse categories or nonnormal numeric explanatory variables may be handled through discretization and/or regrouping, and certain variables may be dropped or aggregated to address any issues with multicollinearity. Single-variable and pairwise distributions will be examined to check for all of the above.

Multiple linear regression is a statistical method that uses several explanatory variables to predict the outcome of a continuous response variable, and is an extension of ordinary least squares regression. It fits a linear relationship between the explanatory variables and the response. The assumptions for multiple regression include the following: the response variable is normally distributed with constant variance, the explanatory variables are nonrandom and uncorrelated with one another, the explanatory variables have a linear relationship with the response, and the data are randomly collected and independent.

Logistic regression is a statistical method for analyzing a dataset in which one or more independent variables determine an outcome measured as a binary response variable. It is therefore appropriate in situations where linear regression cannot be used because the response is a dichotomous categorical variable. The assumptions for logistic regression include the following: the dependent variable is binary, the independent variables are not correlated, the log odds (the logit of the probability) has a linear relationship with the independent variables, the sample size is sufficiently large, and the observations are independent. Unlike multiple linear regression, because the response is categorical, logistic regression does not require a normally distributed response variable with constant variance. Outliers can have a significant effect on the results of both the linear and logistic regression models, and should be examined before and after each model's creation.

An alpha of 0.05 will be used for all statistical tests, and a 95% confidence level will be used for the construction of all confidence intervals.

The Box-Cox transformation is a technique that adjusts the response variable in a model so that it more closely follows a normal distribution. We may need it to ensure that the assumptions of multiple linear regression hold; its standard form is shown below.
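For reference, the transformation indexes a family of power transformations by a parameter λ, chosen (typically by maximum likelihood) to make the transformed response as close to normal as possible; this is the standard form, not anything specific to our dataset:

$$
y^{(\lambda)} =
\begin{cases}
\dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\[4pt]
\log y, & \lambda = 0
\end{cases}
$$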

In the model building process, stepwise regression will be used to identify significant predictors and build the model through an automatic procedure. In each step, predictor variables are considered for addition or subtraction from the model based on some prespecified criterion. Bidirectional stepwise regression, or stepwise selection, is a combination of both forward selection (adding the most statistically significant predictor to the model) and backward elimination (removing the least significant predictor from the model) and will be used, with additional manual adjustments outside of the automatic process, to create the most fitting linear and logistic model while ensuring that the final model makes statistical sense.
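A minimal sketch of this procedure in R using step() from the stats package; the data frame name student_data and the use of AIC as the selection criterion are illustrative assumptions, not the exact code used for this report:

# Bidirectional stepwise selection between the intercept-only and full models
full_model <- lm(Exam_Score ~ ., data = student_data)
null_model <- lm(Exam_Score ~ 1, data = student_data)
step_model <- step(null_model,
                   scope = list(lower = ~ 1, upper = formula(full_model)),
                   direction = "both", trace = FALSE)
summary(step_model)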

K-fold cross-validation is a statistical technique where data is split into k folds and then iteratively split into testing and training data which is then used to assess the predictive power of a model. It ensures the use of all datapoints and can help adjust for variability and keep the model from overfitting any one fold, leading to a more robust final result. We will use 5-fold cross validation to test the predictive performance of both the multiple linear regression and logistic regression models.
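A minimal sketch of a 5-fold CV loop for a linear model, with illustrative object names; packages such as caret or boot can automate the same procedure:

# Assign each row to one of 5 folds, then hold each fold out in turn
set.seed(123)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(student_data)))
cv_mse <- numeric(k)
for (i in 1:k) {
  train_i <- student_data[folds != i, ]
  test_i  <- student_data[folds == i, ]
  fit     <- lm(Exam_Score ~ ., data = train_i)
  pred    <- predict(fit, newdata = test_i)
  cv_mse[i] <- mean((test_i$Exam_Score - pred)^2)  # MSE on the held-out fold
}
mean(cv_mse)  # average MSE across folds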

Exploratory Data Analysis and Feature Engineering

We begin with an initial look at the dataset.
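A sketch of the setup that produces the summary below, assuming the Kaggle file has been saved locally (the file name is illustrative); the PassFail indicator for a score of 70 or above is created here so that it appears in the summary:

# Read the data and create the binary response described earlier
student_data <- read.csv("StudentPerformanceFactors.csv", stringsAsFactors = TRUE)
student_data$PassFail <- as.numeric(student_data$Exam_Score >= 70)
summary(student_data)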

 Hours_Studied     Attendance     Parental_Involvement Access_to_Resources
 Min.   : 1.00   Min.   : 60.00   High  :1908          High  :1975        
 1st Qu.:16.00   1st Qu.: 70.00   Low   :1337          Low   :1313        
 Median :20.00   Median : 80.00   Medium:3362          Medium:3319        
 Mean   :19.98   Mean   : 79.98                                           
 3rd Qu.:24.00   3rd Qu.: 90.00                                           
 Max.   :44.00   Max.   :100.00                                           
 Extracurricular_Activities  Sleep_Hours     Previous_Scores  Motivation_Level
 No :2669                   Min.   : 4.000   Min.   : 50.00   High  :1319     
 Yes:3938                   1st Qu.: 6.000   1st Qu.: 63.00   Low   :1937     
                            Median : 7.000   Median : 75.00   Medium:3351     
                            Mean   : 7.029   Mean   : 75.07                   
                            3rd Qu.: 8.000   3rd Qu.: 88.00                   
                            Max.   :10.000   Max.   :100.00                   
 Internet_Access Tutoring_Sessions Family_Income Teacher_Quality  School_Type  
 No : 499        Min.   :0.000     High  :1269   High  :1947     Private:2009  
 Yes:6108        1st Qu.:1.000     Low   :2672   Low   : 657     Public :4598  
                 Median :1.000     Medium:2666   Medium:3925                   
                 Mean   :1.494                   NA's  :  78                   
                 3rd Qu.:2.000                                                 
                 Max.   :8.000                                                 
  Peer_Influence Physical_Activity Learning_Disabilities
 Negative:1377   Min.   :0.000     No :5912             
 Neutral :2592   1st Qu.:2.000     Yes: 695             
 Positive:2638   Median :3.000                          
                 Mean   :2.968                          
                 3rd Qu.:4.000                          
                 Max.   :6.000                          
 Parental_Education_Level Distance_from_Home    Gender       Exam_Score    
 College     :1989        Far     : 658      Female:2793   Min.   : 55.00  
 High School :3223        Moderate:1998      Male  :3814   1st Qu.: 65.00  
 Postgraduate:1305        Near    :3884                    Median : 67.00  
 NA's        :  90        NA's    :  67                    Mean   : 67.24  
                                                           3rd Qu.: 69.00  
                                                           Max.   :101.00  
    PassFail    
 Min.   :0.000  
 1st Qu.:0.000  
 Median :0.000  
 Mean   :0.246  
 3rd Qu.:0.000  
 Max.   :1.000  

Based on an initial look at the variables, we see that there are no missing values among the numeric variables. However, the categorical variables Teacher_Quality, Parental_Education_Level, and Distance_from_Home have 78, 90, and 67 missing values respectively. Of the 6607 observations in the dataset, this corresponds to 1.18%, 1.36%, and 1.01% of observations for each variable, and at most 3.56% of observations across the dataset may contain a missing value. As each of these three variables is categorical, there is no easy way to impute the missing values from the other variables in the model. Since the number of missing values is relatively small, we will proceed by omitting these rows from the dataset.
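A minimal sketch of the omission step, assuming the working data frame is named student_data:

colSums(is.na(student_data))           # 78, 90, and 67 NA's in the three variables
student_data <- na.omit(student_data)  # drops the roughly 3.5% of rows with any NA
nrow(student_data)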

Single Variable Distributions and Pairwise Variable Distributions

Continuous Response

We will continue by looking at the distributions of all of the variables in the dataset. We will begin with the response variable Exam_Score, as the normality of the response is one of the assumptions we need to check for our multiple linear regression model.

The distribution of Exam_Score shows extreme right skew, which will likely affect our final model. We will explore possible transformations to adjust for this later in the report.

Continuous Predictors

We continue with a look at the distributions of the continuous explanatory variables as well as their pairwise relationships with other explanatory variables below.

The pairwise correlations among the continuous variables other than Exam_Score are all quite low and flagged as insignificant, except for Hours_Studied and Previous_Scores, which have a correlation of 0.025. As this is still very small, we will proceed for now and check for multicollinearity issues after the models are built. Among the explanatory variables, our main concern is possible sparse categories in Tutoring_Sessions, whose distribution we examine more closely below:

# A tibble: 9 × 2
  Tutoring_Sessions `n()`
              <int> <int>
1                 0  1458
2                 1  2111
3                 2  1586
4                 3   800
5                 4   296
6                 5   101
7                 6    18
8                 7     7
9                 8     1

The counts of students with 6, 7, and 8 tutoring sessions in a month are all quite sparse, with one category containing only a single student. Therefore, we discretize Tutoring_Sessions into the levels 0, 1, 2, 3, and 4+ sessions per month.
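A minimal sketch of this discretization; the factor name f_Tutoring_Sessions matches the variable that appears in the model output later in the report:

# Bin the raw counts into 0, 1, 2, 3, and 4+ sessions per month
student_data$f_Tutoring_Sessions <- cut(student_data$Tutoring_Sessions,
                                        breaks = c(-Inf, 0, 1, 2, 3, Inf),
                                        labels = c("0", "1", "2", "3", "4+"))
table(student_data$f_Tutoring_Sessions)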

Categorical Predictors

We will continue by looking at the relationship between the categorical predictors and the continuous response through box plots.

We note a large number of outliers, which is expected given our earlier look at the distribution of Exam_Score. For most of the categorical predictors, there are no obvious differences between levels. We see small possible disparities between the two levels of Extracurricular_Activities, and slightly higher scores for high levels of Parental_Involvement and Access_to_Resources. Having internet access appears to have a positive effect on the final exam score, while having a learning disability has a negative effect. A trend appears across the levels of Tutoring_Sessions: students with more tutoring sessions per month tend to have higher exam scores.

Categorical Response

Finally, we will use mosaic plots to compare the binary response variable for the logistic model with the different categorical predictors.

Across the categorical predictors, we generally see a slight positive association with the binary response, the exceptions being Learning Disabilities with a negative association and Gender and School Type with no obvious association. Of all the categorical predictors, Tutoring Sessions has an especially notable positive association with students scoring 70 or above on the final exam across its levels.

We will create and save a copy of the analytic dataset, available on GitHub at https://github.com/xiang-a/sta551/blob/main/analytic_student_performance.csv.
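The save step itself is a one-liner (a sketch; the local path is illustrative):

write.csv(student_data, "analytic_student_performance.csv", row.names = FALSE)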

Linear Regression Modeling

Create Candidate Models

We will start with the full model.
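A sketch of the full-model fit, assuming the analytic data frame student_data, with the raw Tutoring_Sessions count (replaced by the discretized factor) and the derived PassFail indicator excluded from the predictors:

full_lm <- lm(Exam_Score ~ . - Tutoring_Sessions - PassFail, data = student_data)
summary(full_lm)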

Full Model examining Student Final Exam Scores
Estimate Std. Error t value Pr(>|t|)
(Intercept) 34.0593039 0.3506238 97.1391544 0.0000000
Hours_Studied 0.2951818 0.0043406 68.0053170 0.0000000
Attendance 0.1988267 0.0022522 88.2819301 0.0000000
Parental_InvolvementMedium 0.9200661 0.0683166 13.4676882 0.0000000
Parental_InvolvementHigh 1.9873553 0.0754642 26.3350667 0.0000000
Access_to_ResourcesMedium 1.0567475 0.0688563 15.3471352 0.0000000
Access_to_ResourcesHigh 2.0638510 0.0752078 27.4419711 0.0000000
Extracurricular_ActivitiesYes 0.5592436 0.0530058 10.5506047 0.0000000
Sleep_Hours -0.0031099 0.0177048 -0.1756549 0.8605707
Previous_Scores 0.0490476 0.0018078 27.1303692 0.0000000
Motivation_LevelMedium 0.5228284 0.0603979 8.6564051 0.0000000
Motivation_LevelHigh 1.0642365 0.0753928 14.1158855 0.0000000
Internet_AccessYes 0.9194475 0.0980986 9.3726882 0.0000000
Family_IncomeMedium 0.4937187 0.0578762 8.5306021 0.0000000
Family_IncomeHigh 1.0853227 0.0719036 15.0941294 0.0000000
Teacher_QualityMedium 0.5083142 0.0883149 5.7557021 0.0000000
Teacher_QualityHigh 1.0633314 0.0944731 11.2553881 0.0000000
School_TypePublic 0.0338177 0.0564628 0.5989383 0.5492354
Peer_InfluenceNeutral 0.5194792 0.0705107 7.3673823 0.0000000
Peer_InfluencePositive 1.0235361 0.0701717 14.5861734 0.0000000
Physical_Activity 0.1884882 0.0253228 7.4434270 0.0000000
Learning_DisabilitiesYes -0.8523793 0.0848911 -10.0408598 0.0000000
Parental_Education_LevelCollege 0.4843870 0.0599099 8.0852543 0.0000000
Parental_Education_LevelPostgraduate 0.9867580 0.0687640 14.3499174 0.0000000
Distance_from_HomeModerate 0.3852309 0.0948087 4.0632437 0.0000490
Distance_from_HomeNear 0.9075950 0.0888929 10.2099863 0.0000000
GenderMale -0.0433741 0.0526039 -0.8245404 0.4096636
f_Tutoring_Sessions1 0.5270752 0.0706850 7.4566783 0.0000000
f_Tutoring_Sessions2 1.0232187 0.0752685 13.5942458 0.0000000
f_Tutoring_Sessions3 1.4801647 0.0913167 16.2091266 0.0000000
f_Tutoring_Sessions4+ 2.2071292 0.1145570 19.2666507 0.0000000

As expected from the earlier look at the data, School_Type and Gender do not appear to be significant, and the same is true of Sleep_Hours. Otherwise, we see many significant predictor variables in the model. We will look at the residuals of the full model.

There appear to be fairly severe violations of our assumptions in this initial model, largely explainable by the distribution of the response variable Exam_Score noted earlier. The normality assumption is clearly violated, and the extreme high values we noted in the initial look at the distribution of final exam scores may also contribute to the issues seen in the residual plots. We next check the generalized variance inflation factors (GVIF) for signs of multicollinearity:

                               GVIF Df GVIF^(1/(2*Df))
Hours_Studied              1.003022  1        1.001510
Attendance                 1.005648  1        1.002820
Parental_Involvement       1.009370  2        1.002334
Access_to_Resources        1.011683  2        1.002908
Extracurricular_Activities 1.004739  1        1.002367
Sleep_Hours                1.003864  1        1.001930
Previous_Scores            1.007149  1        1.003568
Motivation_Level           1.009013  2        1.002246
Internet_Access            1.004904  1        1.002449
Family_Income              1.009531  2        1.002374
Teacher_Quality            1.008066  2        1.002010
School_Type                1.004009  1        1.002002
Peer_Influence             1.009543  2        1.002377
Physical_Activity          1.008817  1        1.004399
Learning_Disabilities      1.004286  1        1.002141
Parental_Education_Level   1.008506  2        1.002120
Distance_from_Home         1.006296  2        1.001570
Gender                     1.002999  1        1.001499
f_Tutoring_Sessions        1.016697  4        1.002072

All of the VIF values are very close to one, indicating that multicollinearity does not appear to be a large issue in this model.

Stepwise regression confirms our conclusions from looking at the full model above, removing Sleep_Hours, School_Type, and Gender as predictors.

Stepwise Regression Model examining Student Final Exam Scores
Estimate Std. Error t value Pr(>|t|)
(Intercept) 34.0396686 0.3218700 105.755962 0.00e+00
Hours_Studied 0.2951999 0.0043394 68.027872 0.00e+00
Attendance 0.1987991 0.0022509 88.320376 0.00e+00
Parental_InvolvementMedium 0.9188671 0.0682899 13.455395 0.00e+00
Parental_InvolvementHigh 1.9875105 0.0754343 26.347573 0.00e+00
Access_to_ResourcesMedium 1.0566369 0.0688313 15.351099 0.00e+00
Access_to_ResourcesHigh 2.0627549 0.0751670 27.442284 0.00e+00
Extracurricular_ActivitiesYes 0.5590663 0.0529951 10.549387 0.00e+00
Previous_Scores 0.0490696 0.0018069 27.156520 0.00e+00
Motivation_LevelMedium 0.5224736 0.0603819 8.652821 0.00e+00
Motivation_LevelHigh 1.0639331 0.0753690 14.116322 0.00e+00
Internet_AccessYes 0.9187046 0.0980625 9.368557 0.00e+00
Family_IncomeMedium 0.4939025 0.0578652 8.535393 0.00e+00
Family_IncomeHigh 1.0860538 0.0718762 15.110069 0.00e+00
Teacher_QualityMedium 0.5080882 0.0882982 5.754234 0.00e+00
Teacher_QualityHigh 1.0630287 0.0944501 11.254925 0.00e+00
Peer_InfluenceNeutral 0.5178882 0.0704830 7.347700 0.00e+00
Peer_InfluencePositive 1.0231235 0.0701487 14.585064 0.00e+00
Physical_Activity 0.1882143 0.0253171 7.434285 0.00e+00
Learning_DisabilitiesYes -0.8513129 0.0848505 -10.033089 0.00e+00
Parental_Education_LevelCollege 0.4846499 0.0598951 8.091650 0.00e+00
Parental_Education_LevelPostgraduate 0.9859052 0.0687338 14.343827 0.00e+00
Distance_from_HomeModerate 0.3847397 0.0947893 4.058894 4.99e-05
Distance_from_HomeNear 0.9073280 0.0888772 10.208785 0.00e+00
f_Tutoring_Sessions1 0.5270991 0.0706570 7.459971 0.00e+00
f_Tutoring_Sessions2 1.0235080 0.0752479 13.601816 0.00e+00
f_Tutoring_Sessions3 1.4817503 0.0912877 16.231658 0.00e+00
f_Tutoring_Sessions4+ 2.2067860 0.1145363 19.267126 0.00e+00

We will remove the nonsignificant predictors identified above and continue by performing a Box-Cox transformation to try to correct some of the issues with normality in the response variable.
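A sketch of this step with MASS::boxcox(); here step_lm denotes the stepwise-reduced linear model from above, and the object names are illustrative:

library(MASS)                         # for boxcox()
bc <- boxcox(step_lm, lambda = seq(-5, 2, by = 0.25), plotit = FALSE)
bc$x[which.max(bc$y)]                 # lambda maximizing the likelihood, about -4.5 here
trans_lm <- update(step_lm, Exam_Score^(-4.5) ~ .)  # Box-Cox transformed model
log_lm   <- update(step_lm, log(Exam_Score) ~ .)    # log model for later comparison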

Transformed Model (Y^(-4.5))
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0 0 249.030649 0
Hours_Studied 0 0 -113.653120 0
Attendance 0 0 -146.728957 0
Parental_InvolvementMedium 0 0 -24.030604 0
Parental_InvolvementHigh 0 0 -43.951740 0
Access_to_ResourcesMedium 0 0 -25.129848 0
Access_to_ResourcesHigh 0 0 -43.924726 0
Extracurricular_ActivitiesYes 0 0 -16.694296 0
Previous_Scores 0 0 -44.258904 0
Motivation_LevelMedium 0 0 -15.686697 0
Motivation_LevelHigh 0 0 -23.884951 0
Internet_AccessYes 0 0 -16.548903 0
f_Tutoring_Sessions1 0 0 -13.639796 0
f_Tutoring_Sessions2 0 0 -23.857457 0
f_Tutoring_Sessions3 0 0 -28.491143 0
f_Tutoring_Sessions4+ 0 0 -31.359251 0
Family_IncomeMedium 0 0 -14.377178 0
Family_IncomeHigh 0 0 -24.047458 0
Teacher_QualityMedium 0 0 -9.800508 0
Teacher_QualityHigh 0 0 -18.542659 0
Peer_InfluenceNeutral 0 0 -12.547745 0
Peer_InfluencePositive 0 0 -23.929231 0
Physical_Activity 0 0 -14.421992 0
Learning_DisabilitiesYes 0 0 18.885545 0
Parental_Education_LevelCollege 0 0 -14.437283 0
Parental_Education_LevelPostgraduate 0 0 -25.070817 0
Distance_from_HomeModerate 0 0 -8.704588 0
Distance_from_HomeNear 0 0 -18.682458 0

The Box-Cox procedure returns a lambda value of -4.5, so we raise the response variable to the -4.5th power for our first transformed model; the estimates and standard errors above display as 0 only because the transformed response is on a very small scale. Looking at the residual plots, there appear to be slight improvements in the residual and Q-Q plots, but still clear indications of violations of our assumptions for multiple linear regression.

We also fit a log-transformed model, to compare goodness-of-fit statistics later.

The issues with normality and the residual plots for the log transformed model appear similar to the issues with the full model.

We will compare these three candidate models according to a few goodness-of-fit measures.

Goodness-of-fit Measures of Candidate Models
                              SSE           R.sq       R.adj      AIC          BIC
Full Model                    27237.233431  0.7212230  0.7199054  9321.136     9530.715
Transformed Model (Y^(-4.5))  0.000000      0.8774757  0.8769547  -272600.366  -272411.069
Log Model                     4.521172      0.7760399  0.7750876  -46196.226   -46006.929

Based on this initial look, the transformed model based on the Box-Cox performs best, while the full and log-transformed models have closer R-squared values, with the log model slightly ahead of the full model. The AIC and BIC criteria also show better performance from the transformed model. We will further assess these models and their predictive power through cross-validation.

Cross-validation and Model Selection

We will perform five-fold cross-validation to compare the three models and assess their predictive performance. The data will be randomly split into training and test sets with an 80/20 split, and five-fold CV will then be performed on the training set to estimate the MSE of each model.

The graph shows that the MSE for the full model is by far the lowest, averaging about 4. The average MSEs for the log and transformed models are similar, with the log model performing better with a lower MSE. Based on this, the full model appears to have the best predictive performance and will therefore be recommended for practical prediction.

We can further assess the full model now using the reserved 20% of the data and find the test MSE.
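A sketch of the test-MSE calculation, where train_data/test_data denote the 80/20 split created earlier and full_lm_train is the full model refit on the training portion (names illustrative):

full_lm_train <- lm(Exam_Score ~ . - Tutoring_Sessions - PassFail, data = train_data)
pred_test     <- predict(full_lm_train, newdata = test_data)
mean((test_data$Exam_Score - pred_test)^2)   # test MSE on the reserved 20%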

[1] 4.983024

This is, as expected, quite close to the MSE estimated for the full model during the five-fold CV, and much lower than the estimated MSEs of the other two models.

Results and Conclusions for the Multiple Linear Regression

In performing the multiple linear regression, multiple violations of the model assumptions were noted in all of the candidate models. Box-Cox and log transformations were applied to create candidate models to compare with the full model, and while the two transformed models performed better across the board on R-squared, AIC, and BIC goodness-of-fit statistics, the full model had a much lower MSE when assessed through cross-validation for prediction purposes. As none of these models satisfied the assumptions necessary for multiple linear regression, all recommendations are made with caution. The right-skewed data and many extreme observations hindered the construction of a more accurate model. Considering a subset of the data might produce more accurate models, but a different statistical technique is recommended for future analysis. Among the three multiple linear regression models, the best-fitting model was the transformed model built from the Box-Cox lambda, raising the response Exam_Score to the power of -4.5. However, the model with the best predictive power under cross-validation was the full model, which is therefore cautiously recommended as the best option for practical prediction.

Logistic Regression

Creating Candidate Models

The lack of normality in the response variable noted in the multiple linear regression section suggests that logistic regression may be more appropriate. We created a binary response variable indicating whether a student received 70% or higher on the final exam, and the respective counts in the data are displayed below.

# A tibble: 2 × 2
  PassFail `n()`
     <dbl> <int>
1        0  4797
2        1  1581

In the dataset, 4797 students scored below 70, while 1581 scored 70 or above. We have already checked for multicollinearity in the data. We will proceed by building the full model.
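A sketch of the full logistic fit, assuming the same analytic data frame, with the continuous Exam_Score and the raw Tutoring_Sessions count excluded from the predictors:

full_glm <- glm(PassFail ~ . - Exam_Score - Tutoring_Sessions,
                data = student_data, family = binomial)
summary(full_glm)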

Significance tests of logistic regression model
Estimate Std. Error z value Pr(>|z|)
(Intercept) -82.0749118 2.9498382 -27.8235299 0.0000000
Hours_Studied 0.6880979 0.0260858 26.3783051 0.0000000
Attendance 0.4521552 0.0164134 27.5479637 0.0000000
Parental_InvolvementMedium 2.2732254 0.1945465 11.6847392 0.0000000
Parental_InvolvementHigh 4.4105088 0.2437843 18.0918517 0.0000000
Access_to_ResourcesMedium 2.1525976 0.1945273 11.0657875 0.0000000
Access_to_ResourcesHigh 4.5817352 0.2429337 18.8600209 0.0000000
Extracurricular_ActivitiesYes 1.1554277 0.1373981 8.4093410 0.0000000
Sleep_Hours 0.0220834 0.0435265 0.5073548 0.6119059
Previous_Scores 0.1106114 0.0058687 18.8477807 0.0000000
Motivation_LevelMedium 1.1026330 0.1557666 7.0787521 0.0000000
Motivation_LevelHigh 2.5179473 0.2062185 12.2100955 0.0000000
Internet_AccessYes 2.2717152 0.2692970 8.4357249 0.0000000
Family_IncomeMedium 1.2459385 0.1508331 8.2603788 0.0000000
Family_IncomeHigh 2.1629824 0.1871887 11.5550910 0.0000000
Teacher_QualityMedium 1.5088299 0.2398871 6.2897490 0.0000000
Teacher_QualityHigh 2.6590168 0.2612484 10.1781158 0.0000000
School_TypePublic 0.1419680 0.1404448 1.0108456 0.3120904
Peer_InfluenceNeutral 1.1422249 0.1863663 6.1289234 0.0000000
Peer_InfluencePositive 2.3561428 0.1947175 12.1003121 0.0000000
Physical_Activity 0.4936347 0.0656610 7.5179256 0.0000000
Learning_DisabilitiesYes -1.9080873 0.2433246 -7.8417356 0.0000000
Parental_Education_LevelCollege 1.2731081 0.1538173 8.2767550 0.0000000
Parental_Education_LevelPostgraduate 2.3466344 0.1810533 12.9610134 0.0000000
Distance_from_HomeModerate 0.9585171 0.2529030 3.7900581 0.0001506
Distance_from_HomeNear 2.2438366 0.2471695 9.0781275 0.0000000
GenderMale -0.0608662 0.1290146 -0.4717779 0.6370853
f_Tutoring_Sessions1 1.1382940 0.1866240 6.0993992 0.0000000
f_Tutoring_Sessions2 2.3253771 0.2012471 11.5548357 0.0000000
f_Tutoring_Sessions3 3.1782923 0.2430255 13.0780179 0.0000000
f_Tutoring_Sessions4+ 4.9471930 0.3104012 15.9380622 0.0000000

As in the multiple linear regression model, Sleep_Hours, School_Type, and Gender are the only predictors not marked as significant in the full model. We will perform stepwise regression starting from a reduced model containing only Learning_Disabilities and Tutoring_Sessions, the two categorical predictors identified as having clear trends in the earlier mosaic plots.

Summary table of significance tests
Estimate Std. Error z value Pr(>|z|)
(Intercept) -81.8608385 2.9340460 -27.900326 0.0000000
Hours_Studied 0.6884720 0.0260886 26.389720 0.0000000
Attendance 0.4518533 0.0164115 27.532678 0.0000000
Parental_InvolvementMedium 2.2782241 0.1946408 11.704758 0.0000000
Parental_InvolvementHigh 4.4153203 0.2439017 18.102867 0.0000000
Access_to_ResourcesMedium 2.1525865 0.1942602 11.080947 0.0000000
Access_to_ResourcesHigh 4.5812743 0.2428184 18.867079 0.0000000
Extracurricular_ActivitiesYes 1.1643146 0.1371816 8.487396 0.0000000
Previous_Scores 0.1106960 0.0058645 18.875684 0.0000000
Motivation_LevelMedium 1.1026390 0.1556496 7.084112 0.0000000
Motivation_LevelHigh 2.5159070 0.2057967 12.225208 0.0000000
Internet_AccessYes 2.2691498 0.2692556 8.427492 0.0000000
Family_IncomeMedium 1.2491640 0.1507814 8.284602 0.0000000
Family_IncomeHigh 2.1688944 0.1870020 11.598239 0.0000000
Teacher_QualityMedium 1.5181398 0.2394651 6.339713 0.0000000
Teacher_QualityHigh 2.6640905 0.2609433 10.209461 0.0000000
Peer_InfluenceNeutral 1.1463525 0.1861646 6.157735 0.0000000
Peer_InfluencePositive 2.3583937 0.1946989 12.113030 0.0000000
Physical_Activity 0.4935082 0.0655051 7.533885 0.0000000
Learning_DisabilitiesYes -1.8861506 0.2422159 -7.787064 0.0000000
Parental_Education_LevelCollege 1.2793946 0.1535546 8.331852 0.0000000
Parental_Education_LevelPostgraduate 2.3426961 0.1805602 12.974597 0.0000000
Distance_from_HomeModerate 0.9538673 0.2528677 3.772200 0.0001618
Distance_from_HomeNear 2.2399426 0.2471267 9.063945 0.0000000
f_Tutoring_Sessions1 1.1424932 0.1863416 6.131175 0.0000000
f_Tutoring_Sessions2 2.3250215 0.2011508 11.558601 0.0000000
f_Tutoring_Sessions3 3.1771363 0.2427745 13.086780 0.0000000
f_Tutoring_Sessions4+ 4.9440453 0.3098494 15.956287 0.0000000

The stepwise automatic variable selection process removed the predictors Sleep_Hours, School_Type, and Gender, as predicted. The remaining predictors all appear highly significant, with very low p-values.

The goodness-of-fit measures for the models are shown below.

Comparison of global goodness-of-fit statistics
               Residual Deviance   Null Deviance   AIC
Full Model     1645.635            7143.332        1707.635
Reduced Model  7030.475            7143.332        7042.475
Final Model    1647.127            7143.332        1703.127

The null deviance measures how well the response can be predicted by a model with only the intercept, while the residual deviance measures how well it can be predicted by a model with p predictors; a lower value indicates a model that better predicts the response. A chi-squared test can be used to assess a model's quality from these two quantities(4): the test statistic is the difference between the null and residual deviances, with p degrees of freedom. For the full model, this gives χ²(30) = 5497.697, p < 0.000001; for the reduced model, χ²(5) = 112.857, p < 0.000001; and for the stepwise model, χ²(27) = 5496.205, p < 0.000001. All perform significantly better than the intercept-only model, with extremely low p-values from the chi-squared test. Combined with its lower AIC, we choose the stepwise model as the better-performing model based on goodness-of-fit statistics.
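These p-values can be reproduced directly with pchisq(); a sketch using the deviances from the table above:

pchisq(7143.332 - 1645.635, df = 30, lower.tail = FALSE)  # full model
pchisq(7143.332 - 7030.475, df = 5,  lower.tail = FALSE)  # reduced model
pchisq(7143.332 - 1647.127, df = 27, lower.tail = FALSE)  # stepwise model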

Cross-Validation and Model Selection

Again, we will perform 5-fold cross-validation to assess the predictive power of our candidate models.

Summary statistics of AUC for candidate models in 5-fold CV
Min. 1st Qu. Median Mean 3rd Qu. Max.
Full Model 0.9679 0.9817 0.9882 0.98410 0.9905 0.9922
Reduced Model 0.5666 0.5813 0.5823 0.58210 0.5851 0.5952
Stepwise Model 0.9680 0.9818 0.9883 0.98416 0.9905 0.9922

Based on this look at the AUC statistics, we see that the full model and the stepwise model performed extremely similarly, while the reduced model performed considerably worse. We will also look at the AUC values using the held-out testing data.

Test AUC for candidate models
Full Model Reduced Model Stepwise Model
0.9879 0.5864 0.9879

Once again, the full and stepwise models perform very similarly. As such, either may be used for prediction, but we recommend the stepwise model on the principle of parsimony: the two models predict similarly, and the stepwise model has slightly fewer predictors.

Optimal Cut-off Probability

Moving forward with the stepwise reduced model as our final model, we will find the optimal cut-off probability using the ROC curve constructed earlier.
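A sketch of one common approach, maximizing Youden's J (sensitivity + specificity − 1) with the pROC package; the report does not state which criterion was used, so this choice and the object names (step_glm for the final model, test_data for the hold-out set) are assumptions:

library(pROC)
pred_prob <- predict(step_glm, newdata = test_data, type = "response")
roc_obj   <- roc(test_data$PassFail, pred_prob)
coords(roc_obj, x = "best", best.method = "youden")  # optimal cut-off with its sens/spec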

Results and Conclusion for the Logistic Regression

For logistic regression, both the full model and the stepwise reduced model performed similarly in goodness of fit and in predictive performance under 5-fold CV, and the stepwise reduced model is recommended on the principle of parsimony and its slightly better goodness-of-fit statistics given the simplicity of the model. In this logistic regression we encountered far fewer violations of assumptions than in the multiple linear regression, as the binary response of whether or not a student scored 70 or above meant we did not have to contend with the extreme observations and lack of normality in the original response variable. The summary and odds ratios of the final logistic model are shown below.

Summary Stats with Odds Ratios
Estimate Std. Error z value Pr(>|z|) odds.ratio
(Intercept) -81.8608385 2.9340460 -27.900326 0.0000000 0.0000000
Hours_Studied 0.6884720 0.0260886 26.389720 0.0000000 1.9906714
Attendance 0.4518533 0.0164115 27.532678 0.0000000 1.5712215
Parental_InvolvementMedium 2.2782241 0.1946408 11.704758 0.0000000 9.7593331
Parental_InvolvementHigh 4.4153203 0.2439017 18.102867 0.0000000 82.7083314
Access_to_ResourcesMedium 2.1525865 0.1942602 11.080947 0.0000000 8.6070915
Access_to_ResourcesHigh 4.5812743 0.2428184 18.867079 0.0000000 97.6387347
Extracurricular_ActivitiesYes 1.1643146 0.1371816 8.487396 0.0000000 3.2037264
Previous_Scores 0.1106960 0.0058645 18.875684 0.0000000 1.1170553
Motivation_LevelMedium 1.1026390 0.1556496 7.084112 0.0000000 3.0121044
Motivation_LevelHigh 2.5159070 0.2057967 12.225208 0.0000000 12.3778309
Internet_AccessYes 2.2691498 0.2692556 8.427492 0.0000000 9.6711748
Family_IncomeMedium 1.2491640 0.1507814 8.284602 0.0000000 3.4874264
Family_IncomeHigh 2.1688944 0.1870020 11.598239 0.0000000 8.7486062
Teacher_QualityMedium 1.5181398 0.2394651 6.339713 0.0000000 4.5637279
Teacher_QualityHigh 2.6640905 0.2609433 10.209461 0.0000000 14.3548885
Peer_InfluenceNeutral 1.1463525 0.1861646 6.157735 0.0000000 3.1466942
Peer_InfluencePositive 2.3583937 0.1946989 12.113030 0.0000000 10.5739530
Physical_Activity 0.4935082 0.0655051 7.533885 0.0000000 1.6380527
Learning_DisabilitiesYes -1.8861506 0.2422159 -7.787064 0.0000000 0.1516545
Parental_Education_LevelCollege 1.2793946 0.1535546 8.331852 0.0000000 3.5944628
Parental_Education_LevelPostgraduate 2.3426961 0.1805602 12.974597 0.0000000 10.4092636
Distance_from_HomeModerate 0.9538673 0.2528677 3.772200 0.0001618 2.5957287
Distance_from_HomeNear 2.2399426 0.2471267 9.063945 0.0000000 9.3927924
f_Tutoring_Sessions1 1.1424932 0.1863416 6.131175 0.0000000 3.1345738
f_Tutoring_Sessions2 2.3250215 0.2011508 11.558601 0.0000000 10.2269004
f_Tutoring_Sessions3 3.1771363 0.2427745 13.086780 0.0000000 23.9779886
f_Tutoring_Sessions4+ 4.9440453 0.3098494 15.956287 0.0000000 140.3368014

An odds ratio of less than one indicates a negative relationship between the predictor and scoring 70 or above on the final exam, and an odds ratio of more than one indicates a positive relationship. Looking across the table, having a learning disability is the only predictor with an odds ratio below one: holding all other factors constant, the odds of a student with a learning disability scoring 70 or above are 0.152 times those of a student without one. Among the positive predictors, the odds ratio for Tutoring Sessions rises steadily with each additional session, and the odds of a student with 4+ tutoring sessions per month scoring 70 or above are 140.34 times those of a student with 0 sessions per month, all else held constant. Other notably high odds ratios include a high amount of parental involvement and a high amount of access to resources.
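For reference, each odds ratio is simply the exponentiated coefficient; a sketch, with step_glm again denoting the final model:

exp(-1.8861506)      # = 0.1516545, the odds ratio for Learning_DisabilitiesYes above
exp(coef(step_glm))  # all odds ratios at once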

Summary and Discussion

The final models chosen for the two methods are as follows: 1) for the multiple linear regression, the Box-Cox-transformed model based on goodness of fit and the full model based on predictive performance, and 2) for the logistic regression, the stepwise reduced model based on both goodness of fit and predictive performance.

The models had very similar findings regarding the statistical significance of the predictors: Gender, Sleep_Hours, and School_Type were found to be insignificant for predicting exam scores in both the multiple linear regression and the logistic regression models. The three predictors with the strongest positive associations, from highest to lowest, were having 4+ tutoring sessions in a month, a high amount of access to resources, and a high amount of parental involvement; the same ordering holds for the logistic regression. In both models, the only predictor with a negative association was the presence of a learning disability.

The multiple linear regression was severely limited by the nonnormality of the response variable. The transformations applied to create the other candidate models were not enough to overcome the extreme observations and skewness in the original response. Bootstrapping may counter some of the issues with normality, and subgroup analysis excluding the extreme values in the right tail of the distribution might also yield better results with multiple linear regression. However, given the assumption violations present in the current models, all must be used with extreme caution.

The logistic regression performed well, with no violations of assumptions observed, unlike the linear regression, though further diagnostics should be run to confirm there are no major issues with the models. The binary nature of the logistic regression also limits its predictive power: unlike the linear regression model, it can only predict whether a student scores at or above 70 on the final exam. However, given all of the issues noted above with multiple linear regression on this dataset, we recommend logistic regression as the better-performing technique. Further analysis could establish the optimal cut-off point for accuracy, pursue additional categorical analysis, or build models for prediction with a threshold other than 70, such as the traditional 60% failing cutoff.

References and Appendix

  1. https://www.kaggle.com/datasets/lainguyn123/student-performance-factors
  2. https://www.bestcolleges.com/blog/passing-grade-college/
  3. https://www.registrar.psu.edu/grades/grading-system.cfm
  4. https://www.statology.org/null-residual-deviance/