class: center, middle, inverse, title-slide

.title[
# STA551 Final Presentation
]
.subtitle[
## A Cumulative Assessment on the Effects of Different Factors on the Final Exam Grades of Students
]
.author[
### Alice Xiang
]
.date[
### 2024-12-11
]

---

<style type="text/css">
.scroll-100 {
  max-height: 400px;
  overflow-y: auto;
}
</style>

<h2 align="center"> Table of Contents</h2>

.pull-left[
- Introduction
- Description of the Data
- Research Questions
- EDA
]

.pull-right[
- Multiple Linear Regression
- Logistic Regression
- Summary and Discussion
- References and Appendix
]

---

.pull-left[
## Introduction

- Student performance is a multifaceted measure of the success of both individual students and academic institutions.
- A large variety of factors contribute to student success, on both an individual and a societal level.
- Academic achievement is an important factor in the future success of an individual.

## Description of the Data

We chose [this dataset](https://www.kaggle.com/datasets/lainguyn123/student-performance-factors) of various factors affecting student performance in exams. It includes information on study habits, attendance, parental involvement, and other aspects influencing academic success across 20 variables.
]

.pull-right[
## Variables

The following are the variables included in the dataset (7 continuous, 13 categorical):

.pull-left[
Numeric:

- Hours_Studied
- Attendance
- Sleep_Hours
- Previous_Scores
- Tutoring_Sessions
- Physical_Activity
- Exam_Score
]

.pull-right[
Categorical:

- Parental_Involvement
- Access_to_Resources
- Extracurricular_Activities
- Motivation_Level
- Internet_Access
- Family_Income
- Teacher_Quality
- School_Type
- Peer_Influence
- Learning_Disabilities
- Parental_Education_Level
- Distance_from_Home
- Gender
]
]

---

class: inverse center middle

## Research Question 1: How do different predictors relate to the final exam performance of students?
## Research Question 2: What factors best predict whether or not a student has a satisfactory (greater than or equal to 70%) final exam grade, and how accurately can their performance be predicted based on these factors?

---

class: inverse center middle

## EDA

---

## Continuous Response for MLR

.pull-left[
The distribution of the response variable Exam_Score shows evidence of right skew:

<img src="Final-Presentation_files/figure-html/unnamed-chunk-4-1.png" width="100%" />
]

.pull-right[
Distributions of the continuous explanatory variables and pairwise correlations:

<img src="Final-Presentation_files/figure-html/unnamed-chunk-5-1.png" width="100%" />

Due to sparse categories in Tutoring_Sessions, we discretize it into five levels: 0, 1, 2, 3, and 4+ tutoring sessions per month.
]

---

## Continuous Response (cont.)

Box plots of categorical predictors against Exam_Score:

.scroll-100[
<img src="Final-Presentation_files/figure-html/unnamed-chunk-7-1.png" width="100%" /><img src="Final-Presentation_files/figure-html/unnamed-chunk-7-2.png" width="100%" /><img src="Final-Presentation_files/figure-html/unnamed-chunk-7-3.png" width="100%" /><img src="Final-Presentation_files/figure-html/unnamed-chunk-7-4.png" width="100%" />
]

---

# Categorical Response

Mosaic plots comparing the binary response variable for the logistic model with each categorical predictor:

.scroll-100[
<img src="Final-Presentation_files/figure-html/unnamed-chunk-8-1.png" width="100%" /><img src="Final-Presentation_files/figure-html/unnamed-chunk-8-2.png" width="100%" /><img src="Final-Presentation_files/figure-html/unnamed-chunk-8-3.png" width="100%" /><img src="Final-Presentation_files/figure-html/unnamed-chunk-8-4.png" width="100%" /><img src="Final-Presentation_files/figure-html/unnamed-chunk-8-5.png" width="100%" /><img src="Final-Presentation_files/figure-html/unnamed-chunk-8-6.png" width="100%" /><img src="Final-Presentation_files/figure-html/unnamed-chunk-8-7.png" width="100%" /><img src="Final-Presentation_files/figure-html/unnamed-chunk-8-8.png" width="100%" /><img src="Final-Presentation_files/figure-html/unnamed-chunk-8-9.png" width="100%" /><img src="Final-Presentation_files/figure-html/unnamed-chunk-8-10.png" width="100%" /><img src="Final-Presentation_files/figure-html/unnamed-chunk-8-11.png" width="100%" /><img src="Final-Presentation_files/figure-html/unnamed-chunk-8-12.png" width="100%" /><img src="Final-Presentation_files/figure-html/unnamed-chunk-8-13.png" width="100%" />
]

---

# EDA Takeaways:

.pull-left[
Continuous Response:

- Right skew in response variable Exam_Score
- Positive correlations of Attendance and Hours Studied with Exam Score
- Parental Involvement, Access to Resources, Internet Access, and Tutoring Sessions possibly positively associated with Exam Score
- Learning Disability possibly negatively associated with Exam Score
]

.pull-right[
Categorical Response:

- Positive association evident across levels of Tutoring Sessions
- Learning Disability negatively associated
- School Type and Gender show no obvious association
- All other predictors show a slight positive association
]

Analytic dataset available [here](https://github.com/xiang-a/sta551/blob/main/analytic_student_performance.csv)

---

class: inverse center middle

# Multiple Linear Regression

---

# Create Candidate Models

Full Model Parameter Estimates and Residual Plots:

.scroll-100[
Table: Full Model examining Student Final Exam Scores

| | Estimate| Std. Error| t value| Pr(>|t|)|
|:------------------------------------|----------:|----------:|-----------:|------------------:|
|(Intercept) | 34.0593039| 0.3506238| 97.1391544| 0.0000000|
|Hours_Studied | 0.2951818| 0.0043406| 68.0053170| 0.0000000|
|Attendance | 0.1988267| 0.0022522| 88.2819301| 0.0000000|
|Parental_InvolvementMedium | 0.9200661| 0.0683166| 13.4676882| 0.0000000|
|Parental_InvolvementHigh | 1.9873553| 0.0754642| 26.3350667| 0.0000000|
|Access_to_ResourcesMedium | 1.0567475| 0.0688563| 15.3471352| 0.0000000|
|Access_to_ResourcesHigh | 2.0638510| 0.0752078| 27.4419711| 0.0000000|
|Extracurricular_ActivitiesYes | 0.5592436| 0.0530058| 10.5506047| 0.0000000|
|Sleep_Hours | -0.0031099| 0.0177048| -0.1756549| 0.8605707|
|Previous_Scores | 0.0490476| 0.0018078| 27.1303692| 0.0000000|
|Motivation_LevelMedium | 0.5228284| 0.0603979| 8.6564051| 0.0000000|
|Motivation_LevelHigh | 1.0642365| 0.0753928| 14.1158855| 0.0000000|
|Internet_AccessYes | 0.9194475| 0.0980986| 9.3726882| 0.0000000|
|Family_IncomeMedium | 0.4937187| 0.0578762| 8.5306021| 0.0000000|
|Family_IncomeHigh | 1.0853227| 0.0719036| 15.0941294| 0.0000000|
|Teacher_QualityMedium | 0.5083142| 0.0883149| 5.7557021| 0.0000000|
|Teacher_QualityHigh | 1.0633314| 0.0944731| 11.2553881| 0.0000000|
|School_TypePublic | 0.0338177| 0.0564628| 0.5989383| 0.5492354|
|Peer_InfluenceNeutral | 0.5194792| 0.0705107| 7.3673823| 0.0000000|
|Peer_InfluencePositive | 1.0235361| 0.0701717| 14.5861734| 0.0000000|
|Physical_Activity | 0.1884882| 0.0253228| 7.4434270| 0.0000000|
|Learning_DisabilitiesYes | -0.8523793| 0.0848911| -10.0408598| 0.0000000|
|Parental_Education_LevelCollege | 0.4843870| 0.0599099| 8.0852543| 0.0000000|
|Parental_Education_LevelPostgraduate | 0.9867580| 0.0687640| 14.3499174| 0.0000000|
|Distance_from_HomeModerate | 0.3852309| 0.0948087| 4.0632437| 0.0000490|
|Distance_from_HomeNear | 0.9075950| 0.0888929| 10.2099863| 0.0000000|
|GenderMale | -0.0433741| 0.0526039| -0.8245404| 0.4096636|
|f_Tutoring_Sessions1 | 0.5270752| 0.0706850| 7.4566783| 0.0000000|
|f_Tutoring_Sessions2 | 1.0232187| 0.0752685| 13.5942458| 0.0000000|
|f_Tutoring_Sessions3 | 1.4801647| 0.0913167| 16.2091266| 0.0000000|
|f_Tutoring_Sessions4+ | 2.2071292| 0.1145570| 19.2666507| 0.0000000|

<img src="Final-Presentation_files/figure-html/unnamed-chunk-10-1.png" width="100%" /><img src="Final-Presentation_files/figure-html/unnamed-chunk-10-2.png" width="100%" /><img src="Final-Presentation_files/figure-html/unnamed-chunk-10-3.png" width="100%" /><img src="Final-Presentation_files/figure-html/unnamed-chunk-10-4.png" width="100%" />

```
                               GVIF Df GVIF^(1/(2*Df))
Hours_Studied              1.003022  1        1.001510
Attendance                 1.005648  1        1.002820
Parental_Involvement       1.009370  2        1.002334
Access_to_Resources        1.011683  2        1.002908
Extracurricular_Activities 1.004739  1        1.002367
Sleep_Hours                1.003864  1        1.001930
Previous_Scores            1.007149  1        1.003568
Motivation_Level           1.009013  2        1.002246
Internet_Access            1.004904  1        1.002449
Family_Income              1.009531  2        1.002374
Teacher_Quality            1.008066  2        1.002010
School_Type                1.004009  1        1.002002
Peer_Influence             1.009543  2        1.002377
Physical_Activity          1.008817  1        1.004399
Learning_Disabilities      1.004286  1        1.002141
Parental_Education_Level   1.008506  2        1.002120
Distance_from_Home         1.006296  2        1.001570
Gender                     1.002999  1        1.001499
f_Tutoring_Sessions        1.016697  4        1.002072
```
]

---

## Issues We See

- Normality assumption violated
- Many extreme values appear on residual plots
- Likely due to right skew of response variable

---

## Transformed Multiple Linear Regression Models

We create two candidate models with transformations. Both still show violations of the assumptions of multiple linear regression.

.pull-left[
Box-Cox transformed model:

.scroll-100[
<img src="Final-Presentation_files/figure-html/unnamed-chunk-11-1.png" width="100%" />

Estimates and standard errors display as 0 because the response is on the Y^(-4.5) scale, where values are very small; the negative exponent also reverses the sign of each effect relative to the other models.

Table: Transformed model

| | Estimate| Std. Error| t value| Pr(>|t|)|
|:------------------------------------|--------:|----------:|-----------:|------------------:|
|(Intercept) | 0| 0| 249.030649| 0|
|Hours_Studied | 0| 0| -113.653120| 0|
|Attendance | 0| 0| -146.728957| 0|
|Parental_InvolvementMedium | 0| 0| -24.030604| 0|
|Parental_InvolvementHigh | 0| 0| -43.951740| 0|
|Access_to_ResourcesMedium | 0| 0| -25.129848| 0|
|Access_to_ResourcesHigh | 0| 0| -43.924726| 0|
|Extracurricular_ActivitiesYes | 0| 0| -16.694296| 0|
|Previous_Scores | 0| 0| -44.258904| 0|
|Motivation_LevelMedium | 0| 0| -15.686697| 0|
|Motivation_LevelHigh | 0| 0| -23.884951| 0|
|Internet_AccessYes | 0| 0| -16.548903| 0|
|f_Tutoring_Sessions1 | 0| 0| -13.639796| 0|
|f_Tutoring_Sessions2 | 0| 0| -23.857457| 0|
|f_Tutoring_Sessions3 | 0| 0| -28.491143| 0|
|f_Tutoring_Sessions4+ | 0| 0| -31.359251| 0|
|Family_IncomeMedium | 0| 0| -14.377178| 0|
|Family_IncomeHigh | 0| 0| -24.047458| 0|
|Teacher_QualityMedium | 0| 0| -9.800508| 0|
|Teacher_QualityHigh | 0| 0| -18.542659| 0|
|Peer_InfluenceNeutral | 0| 0| -12.547745| 0|
|Peer_InfluencePositive | 0| 0| -23.929231| 0|
|Physical_Activity | 0| 0| -14.421992| 0|
|Learning_DisabilitiesYes | 0| 0| 18.885545| 0|
|Parental_Education_LevelCollege | 0| 0| -14.437283| 0|
|Parental_Education_LevelPostgraduate | 0| 0| -25.070817| 0|
|Distance_from_HomeModerate | 0| 0| -8.704588| 0|
|Distance_from_HomeNear | 0| 0| -18.682458| 0|

<img src="Final-Presentation_files/figure-html/unnamed-chunk-11-2.png" width="100%" /><img src="Final-Presentation_files/figure-html/unnamed-chunk-11-3.png" width="100%" /><img src="Final-Presentation_files/figure-html/unnamed-chunk-11-4.png" width="100%" /><img src="Final-Presentation_files/figure-html/unnamed-chunk-11-5.png" width="100%" />
]
]

.pull-right[
Log-transformed Model:

.scroll-100[
Table: Log model

| | Estimate| Std. Error| t value| Pr(>|t|)|
|:------------------------------------|----------:|----------:|----------:|------------------:|
|(Intercept) | 3.7109083| 0.0041466| 894.937063| 0e+00|
|Hours_Studied | 0.0044048| 0.0000559| 78.792758| 0e+00|
|Attendance | 0.0029654| 0.0000290| 102.264739| 0e+00|
|Parental_InvolvementMedium | 0.0139423| 0.0008798| 15.847852| 0e+00|
|Parental_InvolvementHigh | 0.0296470| 0.0009718| 30.507395| 0e+00|
|Access_to_ResourcesMedium | 0.0156402| 0.0008867| 17.637931| 0e+00|
|Access_to_ResourcesHigh | 0.0304995| 0.0009684| 31.496171| 0e+00|
|Extracurricular_ActivitiesYes | 0.0082189| 0.0006827| 12.038477| 0e+00|
|Previous_Scores | 0.0007273| 0.0000233| 31.242888| 0e+00|
|Motivation_LevelMedium | 0.0079308| 0.0007779| 10.195408| 0e+00|
|Motivation_LevelHigh | 0.0159342| 0.0009710| 16.410800| 0e+00|
|Internet_AccessYes | 0.0139336| 0.0012633| 11.029447| 0e+00|
|f_Tutoring_Sessions1 | 0.0079655| 0.0009103| 8.750876| 0e+00|
|f_Tutoring_Sessions2 | 0.0153688| 0.0009694| 15.853939| 0e+00|
|f_Tutoring_Sessions3 | 0.0223291| 0.0011760| 18.986792| 0e+00|
|f_Tutoring_Sessions4+ | 0.0327000| 0.0014755| 22.161389| 0e+00|
|Family_IncomeMedium | 0.0073716| 0.0007455| 9.888698| 0e+00|
|Family_IncomeHigh | 0.0159534| 0.0009260| 17.229035| 0e+00|
|Teacher_QualityMedium | 0.0076180| 0.0011375| 6.697017| 0e+00|
|Teacher_QualityHigh | 0.0157941| 0.0012168| 12.980300| 0e+00|
|Peer_InfluenceNeutral | 0.0077662| 0.0009080| 8.552951| 0e+00|
|Peer_InfluencePositive | 0.0152199| 0.0009037| 16.841631| 0e+00|
|Physical_Activity | 0.0029033| 0.0003262| 8.901676| 0e+00|
|Learning_DisabilitiesYes | -0.0129879| 0.0010931| -11.881636| 0e+00|
|Parental_Education_LevelCollege | 0.0073585| 0.0007716| 9.536494| 0e+00|
|Parental_Education_LevelPostgraduate | 0.0148786| 0.0008855| 16.802960| 0e+00|
|Distance_from_HomeModerate | 0.0060853| 0.0012211| 4.983268| 6e-07|
|Distance_from_HomeNear | 0.0138094| 0.0011450| 12.060789| 0e+00|

<img src="Final-Presentation_files/figure-html/unnamed-chunk-12-1.png" width="100%" /><img src="Final-Presentation_files/figure-html/unnamed-chunk-12-2.png" width="100%" /><img src="Final-Presentation_files/figure-html/unnamed-chunk-12-3.png" width="100%" /><img src="Final-Presentation_files/figure-html/unnamed-chunk-12-4.png" width="100%" />
]
]

---

## Multiple Linear Regression: Goodness of Fit

Table: Goodness-of-fit Measures of Candidate Models

| | SSE| R.sq| R.adj| AIC| BIC|
|:----------------------------|------------:|---------:|---------:|-----------:|-----------:|
|Full Model | 27237.233431| 0.7212230| 0.7199054| 9321.136| 9530.715|
|Transformed Model (Y^(-4.5)) | 0.000000| 0.8774757| 0.8769547| -272600.366| -272411.069|
|Log Model | 4.521172| 0.7760399| 0.7750876| -46196.226| -46006.929|

SSE, AIC, and BIC are computed on each model's own response scale, so they are not directly comparable across rows.

---

## Cross-validation and Model Selection

We perform 5-fold cross-validation to assess the predictive performance of each model.

.scroll-100[
<img src="Final-Presentation_files/figure-html/unnamed-chunk-14-1.png" width="100%" />

The full model has the lowest MSE and is recommended for practical prediction. Test MSE:

```
[1] 8.301118
```
]

---

## Multiple Linear Regression Conclusions

- Violations of the assumptions of multiple linear regression were noted in all of the candidate models
- The transformed model showed the best goodness of fit
- The full model is recommended for practical prediction
- All models should be used with caution, due to the nonnormality of the response

---

class: inverse center middle

# Logistic Regression

---

## Creating Candidate Models

The lack of normality in the response suggests that logistic regression may be the better-performing technique for making predictions on these data.

Full Model Parameter Estimates:

.scroll-100[
Table: Significance tests of logistic regression model

| | Estimate| Std. Error| z value| Pr(>|z|)|
|:------------------------------------|-----------:|----------:|-----------:|------------------:|
|(Intercept) | -82.0749118| 2.9498382| -27.8235299| 0.0000000|
|Hours_Studied | 0.6880979| 0.0260858| 26.3783051| 0.0000000|
|Attendance | 0.4521552| 0.0164134| 27.5479637| 0.0000000|
|Parental_InvolvementMedium | 2.2732254| 0.1945465| 11.6847392| 0.0000000|
|Parental_InvolvementHigh | 4.4105088| 0.2437843| 18.0918517| 0.0000000|
|Access_to_ResourcesMedium | 2.1525976| 0.1945273| 11.0657875| 0.0000000|
|Access_to_ResourcesHigh | 4.5817352| 0.2429337| 18.8600209| 0.0000000|
|Extracurricular_ActivitiesYes | 1.1554277| 0.1373981| 8.4093410| 0.0000000|
|Sleep_Hours | 0.0220834| 0.0435265| 0.5073548| 0.6119059|
|Previous_Scores | 0.1106114| 0.0058687| 18.8477807| 0.0000000|
|Motivation_LevelMedium | 1.1026330| 0.1557666| 7.0787521| 0.0000000|
|Motivation_LevelHigh | 2.5179473| 0.2062185| 12.2100955| 0.0000000|
|Internet_AccessYes | 2.2717152| 0.2692970| 8.4357249| 0.0000000|
|Family_IncomeMedium | 1.2459385| 0.1508331| 8.2603788| 0.0000000|
|Family_IncomeHigh | 2.1629824| 0.1871887| 11.5550910| 0.0000000|
|Teacher_QualityMedium | 1.5088299| 0.2398871| 6.2897490| 0.0000000|
|Teacher_QualityHigh | 2.6590168| 0.2612484| 10.1781158| 0.0000000|
|School_TypePublic | 0.1419680| 0.1404448| 1.0108456| 0.3120904|
|Peer_InfluenceNeutral | 1.1422249| 0.1863663| 6.1289234| 0.0000000|
|Peer_InfluencePositive | 2.3561428| 0.1947175| 12.1003121| 0.0000000|
|Physical_Activity | 0.4936347| 0.0656610| 7.5179256| 0.0000000|
|Learning_DisabilitiesYes | -1.9080873| 0.2433246| -7.8417356| 0.0000000|
|Parental_Education_LevelCollege | 1.2731081| 0.1538173| 8.2767550| 0.0000000|
|Parental_Education_LevelPostgraduate | 2.3466344| 0.1810533| 12.9610134| 0.0000000|
|Distance_from_HomeModerate | 0.9585171| 0.2529030| 3.7900581| 0.0001506|
|Distance_from_HomeNear | 2.2438366| 0.2471695| 9.0781275| 0.0000000|
|GenderMale | -0.0608662| 0.1290146| -0.4717779| 0.6370853|
|f_Tutoring_Sessions1 | 1.1382940| 0.1866240| 6.0993992| 0.0000000|
|f_Tutoring_Sessions2 | 2.3253771| 0.2012471| 11.5548357| 0.0000000|
|f_Tutoring_Sessions3 | 3.1782923| 0.2430255| 13.0780179| 0.0000000|
|f_Tutoring_Sessions4+ | 4.9471930| 0.3104012| 15.9380622| 0.0000000|
]

---

## Reduced and Stepwise Models

.pull-left[
Reduced Model Parameter Estimates:

.scroll-100[
We build a reduced model from Tutoring Sessions and Learning Disabilities, the two categorical predictors that showed clear trends in the earlier mosaic plots.

Table: Summary table of Reduced Model

| | Estimate| Std. Error| z value| Pr(>|z|)|
|:------------------------|----------:|----------:|----------:|------------------:|
|(Intercept) | -1.3949859| 0.0674027| -20.696290| 0.0000000|
|f_Tutoring_Sessions1 | 0.1809256| 0.0849366| 2.130124| 0.0331614|
|f_Tutoring_Sessions2 | 0.4529969| 0.0875031| 5.176925| 0.0000002|
|f_Tutoring_Sessions3 | 0.6143210| 0.1019523| 6.025572| 0.0000000|
|f_Tutoring_Sessions4+ | 0.9703233| 0.1206532| 8.042252| 0.0000000|
|Learning_DisabilitiesYes | -0.5014822| 0.1070342| -4.685253| 0.0000028|
]
]

.pull-right[
.scroll-100[
Stepwise Model Parameter Estimates:

Table: Summary table of Stepwise Model

| | Estimate| Std. Error| z value| Pr(>|z|)|
|:------------------------------------|-----------:|----------:|----------:|------------------:|
|(Intercept) | -81.8608385| 2.9340460| -27.900326| 0.0000000|
|Hours_Studied | 0.6884720| 0.0260886| 26.389720| 0.0000000|
|Attendance | 0.4518533| 0.0164115| 27.532678| 0.0000000|
|Parental_InvolvementMedium | 2.2782241| 0.1946408| 11.704758| 0.0000000|
|Parental_InvolvementHigh | 4.4153203| 0.2439017| 18.102867| 0.0000000|
|Access_to_ResourcesMedium | 2.1525865| 0.1942602| 11.080947| 0.0000000|
|Access_to_ResourcesHigh | 4.5812743| 0.2428184| 18.867079| 0.0000000|
|Extracurricular_ActivitiesYes | 1.1643146| 0.1371816| 8.487396| 0.0000000|
|Previous_Scores | 0.1106960| 0.0058645| 18.875684| 0.0000000|
|Motivation_LevelMedium | 1.1026390| 0.1556496| 7.084112| 0.0000000|
|Motivation_LevelHigh | 2.5159070| 0.2057967| 12.225208| 0.0000000|
|Internet_AccessYes | 2.2691498| 0.2692556| 8.427492| 0.0000000|
|Family_IncomeMedium | 1.2491640| 0.1507814| 8.284602| 0.0000000|
|Family_IncomeHigh | 2.1688944| 0.1870020| 11.598239| 0.0000000|
|Teacher_QualityMedium | 1.5181398| 0.2394651| 6.339713| 0.0000000|
|Teacher_QualityHigh | 2.6640905| 0.2609433| 10.209461| 0.0000000|
|Peer_InfluenceNeutral | 1.1463525| 0.1861646| 6.157735| 0.0000000|
|Peer_InfluencePositive | 2.3583937| 0.1946989| 12.113030| 0.0000000|
|Physical_Activity | 0.4935082| 0.0655051| 7.533885| 0.0000000|
|Learning_DisabilitiesYes | -1.8861506| 0.2422159| -7.787064| 0.0000000|
|Parental_Education_LevelCollege | 1.2793946| 0.1535546| 8.331852| 0.0000000|
|Parental_Education_LevelPostgraduate | 2.3426961| 0.1805602| 12.974597| 0.0000000|
|Distance_from_HomeModerate | 0.9538673| 0.2528677| 3.772200| 0.0001618|
|Distance_from_HomeNear | 2.2399426| 0.2471267| 9.063945| 0.0000000|
|f_Tutoring_Sessions1 | 1.1424932| 0.1863416| 6.131175| 0.0000000|
|f_Tutoring_Sessions2 | 2.3250215| 0.2011508| 11.558601| 0.0000000|
|f_Tutoring_Sessions3 | 3.1771363| 0.2427745| 13.086780| 0.0000000|
|f_Tutoring_Sessions4+ | 4.9440453| 0.3098494| 15.956287| 0.0000000|

The stepwise automatic variable selection process removed the predictors Sleep_Hours, School_Type, and Gender.
]
]

---

## Goodness of Fit

The goodness-of-fit measures for the models are shown below.

Table: Comparison of global goodness-of-fit statistics

| | Residual Deviance| Null Deviance| AIC|
|:--------------|-----------------:|-------------:|--------:|
|Full Model | 1645.635| 7143.332| 1707.635|
|Reduced Model | 7030.475| 7143.332| 7042.475|
|Stepwise Model | 1647.127| 7143.332| 1703.127|

- In chi-squared tests, every model performs significantly better than the intercept-only model
- Combined with its lowest AIC score, this leads us to choose the stepwise model as the best-performing model based on goodness-of-fit statistics

---

## Cross-Validation and Model Selection

We perform 5-fold cross-validation and examine ROC curves and AUC for each model, and we assess performance on a randomly selected holdout test set containing 20% of the data.
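The split and the AUC calculation described above can be sketched as follows. This is a minimal illustration in plain Python, not the R code behind these slides: the function names `auc` and `holdout_and_folds`, the seed, and the toy labels and scores are all assumptions made for the example. AUC is computed from its rank interpretation, the probability that a randomly chosen satisfactory student receives a higher predicted probability than a randomly chosen unsatisfactory one.

```python
import random


def auc(labels, scores):
    """AUC as P(score of a random positive > score of a random negative); ties count 1/2."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def holdout_and_folds(n_rows, test_frac=0.2, k=5, seed=0):
    """Shuffle row indices, hold out test_frac of them, and split the rest into k folds."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)  # hypothetical seed, for reproducibility only
    n_test = int(round(n_rows * test_frac))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    folds = [train_idx[i::k] for i in range(k)]  # every k-th index goes to fold i
    return test_idx, folds


# Toy data (made up): 1 = satisfactory grade, scores = predicted probabilities.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(auc(toy_labels, toy_scores))

test_idx, folds = holdout_and_folds(100)
print(len(test_idx), [len(f) for f in folds])
```

In each cross-validation round, one fold would serve as the validation set while the model is refit on the other four; the held-out 20% is scored only once, after model selection.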
.scroll-100[
<img src="Final-Presentation_files/figure-html/unnamed-chunk-20-1.png" width="100%" /><img src="Final-Presentation_files/figure-html/unnamed-chunk-20-2.png" width="100%" /><img src="Final-Presentation_files/figure-html/unnamed-chunk-20-3.png" width="100%" />

Table: Summary statistics of AUC for candidate models in 5-fold CV

| | Min.| 1st Qu.| Median| Mean| 3rd Qu.| Max.|
|:--------------|------:|-------:|------:|-------:|-------:|------:|
|Full Model | 0.9748| 0.9772| 0.9850| 0.98286| 0.9877| 0.9896|
|Reduced Model | 0.5637| 0.5732| 0.5866| 0.58622| 0.5977| 0.6099|
|Stepwise Model | 0.9749| 0.9773| 0.9850| 0.98288| 0.9877| 0.9895|

Test Data AUC:

Table: Test AUC for candidate models

| Full Model| Reduced Model| Stepwise Model|
|----------:|-------------:|--------------:|
| 0.9919| 0.5753| 0.9919|

The stepwise and full models perform very similarly, so the stepwise model is recommended on the principle of parsimony.
]

---

# Optimal Cut-off Probability

We find the optimal cut-off probability of the stepwise model using the ROC curve constructed earlier.

<img src="Final-Presentation_files/figure-html/unnamed-chunk-22-1.png" width="100%" />

---

## Logistic Regression Conclusions:

- Stepwise model recommended based on both goodness of fit and predictive performance
- Fewer issues with violations of regression assumptions due to binary nature of response

Summary of parameter estimates with odds ratios shown below:

- Learning Disability is the only predictor negatively associated with a satisfactory grade
- Notably high odds ratios for Tutoring Sessions, Parental Involvement, and high Access to Resources

.scroll-100[
Table: Summary Stats with Odds Ratios

| | Estimate| Std. Error| z value| Pr(>|z|)| odds.ratio|
|:------------------------------------|-----------:|----------:|----------:|------------------:|-----------:|
|(Intercept) | -81.8608385| 2.9340460| -27.900326| 0.0000000| 0.0000000|
|Hours_Studied | 0.6884720| 0.0260886| 26.389720| 0.0000000| 1.9906714|
|Attendance | 0.4518533| 0.0164115| 27.532678| 0.0000000| 1.5712215|
|Parental_InvolvementMedium | 2.2782241| 0.1946408| 11.704758| 0.0000000| 9.7593331|
|Parental_InvolvementHigh | 4.4153203| 0.2439017| 18.102867| 0.0000000| 82.7083314|
|Access_to_ResourcesMedium | 2.1525865| 0.1942602| 11.080947| 0.0000000| 8.6070915|
|Access_to_ResourcesHigh | 4.5812743| 0.2428184| 18.867079| 0.0000000| 97.6387347|
|Extracurricular_ActivitiesYes | 1.1643146| 0.1371816| 8.487396| 0.0000000| 3.2037264|
|Previous_Scores | 0.1106960| 0.0058645| 18.875684| 0.0000000| 1.1170553|
|Motivation_LevelMedium | 1.1026390| 0.1556496| 7.084112| 0.0000000| 3.0121044|
|Motivation_LevelHigh | 2.5159070| 0.2057967| 12.225208| 0.0000000| 12.3778309|
|Internet_AccessYes | 2.2691498| 0.2692556| 8.427492| 0.0000000| 9.6711748|
|Family_IncomeMedium | 1.2491640| 0.1507814| 8.284602| 0.0000000| 3.4874264|
|Family_IncomeHigh | 2.1688944| 0.1870020| 11.598239| 0.0000000| 8.7486062|
|Teacher_QualityMedium | 1.5181398| 0.2394651| 6.339713| 0.0000000| 4.5637279|
|Teacher_QualityHigh | 2.6640905| 0.2609433| 10.209461| 0.0000000| 14.3548885|
|Peer_InfluenceNeutral | 1.1463525| 0.1861646| 6.157735| 0.0000000| 3.1466942|
|Peer_InfluencePositive | 2.3583937| 0.1946989| 12.113030| 0.0000000| 10.5739530|
|Physical_Activity | 0.4935082| 0.0655051| 7.533885| 0.0000000| 1.6380527|
|Learning_DisabilitiesYes | -1.8861506| 0.2422159| -7.787064| 0.0000000| 0.1516545|
|Parental_Education_LevelCollege | 1.2793946| 0.1535546| 8.331852| 0.0000000| 3.5944628|
|Parental_Education_LevelPostgraduate | 2.3426961| 0.1805602| 12.974597| 0.0000000| 10.4092636|
|Distance_from_HomeModerate | 0.9538673| 0.2528677| 3.772200| 0.0001618| 2.5957287|
|Distance_from_HomeNear | 2.2399426| 0.2471267| 9.063945| 0.0000000| 9.3927924|
|f_Tutoring_Sessions1 | 1.1424932| 0.1863416| 6.131175| 0.0000000| 3.1345738|
|f_Tutoring_Sessions2 | 2.3250215| 0.2011508| 11.558601| 0.0000000| 10.2269004|
|f_Tutoring_Sessions3 | 3.1771363| 0.2427745| 13.086780| 0.0000000| 23.9779886|
|f_Tutoring_Sessions4+ | 4.9440453| 0.3098494| 15.956287| 0.0000000| 140.3368014|
]

---

class: inverse center middle

# Summary and Discussion

---

# Comparison of Techniques

- Both linear and logistic regression found the same factors insignificant (Gender, School Type, and Sleep Hours)
- Both found 4+ tutoring sessions per month, high access to resources, and high parental involvement highly significant
- The only predictor with a negative association in both models was the presence of a learning disability
- Multiple linear regression was severely limited by the nonnormality of the response variable; bootstrapping or subgroup analysis is recommended
- Logistic regression was limited by its binary response, but performed well
- Logistic regression is recommended for analysis of this dataset

---

# References and Appendix

(1) https://www.kaggle.com/datasets/lainguyn123/student-performance-factors

(2) https://www.bestcolleges.com/blog/passing-grade-college/

(3) https://www.registrar.psu.edu/grades/grading-system.cfm

(4) https://www.statology.org/null-residual-deviance/