class: center, middle, inverse, title-slide .title[ #
Multiple Linear Regression on the mortality of Union Soldiers
] .author[ ###
By: Johnny Zhang
] .institute[ ###
West Chester University of Pennsylvania
] .date[ ###
May 15, 2022
Prepared for
STA 321: Advanced Topic In Statistics
]

---
class: inverse, middle

## <center><b><font color = gold>Agenda</font></b></center>

### Dataset Description
### Variable Breakdown
### Final Dataset Creation
### Model Building
### Model Selection

---
class: inverse

# <center><b><font color = gold>Dataset Description</font></b></center>

Source: www.CivilWarTalk.com

Description: Initial strength and casualties for the 9 divisions of the Army of the Potomac (Union Army) at the Battle of Gettysburg, July 1-3, 1863 (Meade vs. Lee). The data set has 18 observations.

Divisions: 1-I Corps, 2-II Corps, 3-III Corps, 4-V Corps, 5-VI Corps, 6-XI Corps, 7-XII Corps, 8-Cavalry Corps, 9-Artillery Reserves.

Casualties are classified as Killed, Wounded, or Missing/Captured.

**From this Analysis:** We hope to discover whether our explanatory variables affect our response variable, the death count.

---
class: inverse

# <center><b><font color = gold>List of Variables</font></b></center>

- **Officer**: Categorical variable, 1 for officers, 0 for non-officers
- **Total Soldiers**: Total number of soldiers engaged in the battle
- **Killed, Wounded**: Number killed or wounded
- **Captured/Missing**: Non-death casualties
- **Non-Casualty**: Soldiers who were not harmed
- **Division**: Represents the specific group of soldiers who fought

**Response**:

- **Killed**: Number of soldiers killed in battle

---

# <center><b><font color = purple> Analytic Data Set</font></b></center>

<table>
<thead>
<tr> <th style="text-align:right;"> Missingrate </th> <th style="text-align:right;"> Division </th> <th style="text-align:right;"> Officer </th> <th style="text-align:right;"> Total soldiers </th> <th style="text-align:right;"> Killed </th> <th style="text-align:right;"> Wounded </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:right;"> 0.1172316 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 708 </td> <td style="text-align:right;"> 42 </td> <td style="text-align:right;"> 262 </td> </tr>
<tr>
<td style="text-align:right;"> 0.2232124 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 9314 </td> <td style="text-align:right;"> 624 </td> <td style="text-align:right;"> 2969 </td> </tr>
<tr> <td style="text-align:right;"> 0.0138151 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 941 </td> <td style="text-align:right;"> 66 </td> <td style="text-align:right;"> 270 </td> </tr>
<tr> <td style="text-align:right;"> 0.0305618 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 11943 </td> <td style="text-align:right;"> 731 </td> <td style="text-align:right;"> 2924 </td> </tr>
<tr> <td style="text-align:right;"> 0.0171779 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 815 </td> <td style="text-align:right;"> 50 </td> <td style="text-align:right;"> 251 </td> </tr>
<tr> <td style="text-align:right;"> 0.0517598 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 11109 </td> <td style="text-align:right;"> 543 </td> <td style="text-align:right;"> 2778 </td> </tr>
</tbody>
</table>

---

# <center><b><font color = purple>Full Model and Diagnostic</font></b></center>

<table>
<caption>Statistics of Regression Coefficients</caption>
<thead>
<tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> Estimate </th> <th style="text-align:right;"> Std. Error </th> <th style="text-align:right;"> t value </th> <th style="text-align:right;"> Pr(>|t|) </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 18.0939934 </td> <td style="text-align:right;"> 65.3153362 </td> <td style="text-align:right;"> 0.2770252 </td> <td style="text-align:right;"> 0.7864750 </td> </tr>
<tr> <td style="text-align:left;"> Missingrate </td> <td style="text-align:right;"> -276.8557362 </td> <td style="text-align:right;"> 162.5525244 </td> <td style="text-align:right;"> -1.7031771 </td> <td style="text-align:right;"> 0.1142706 </td> </tr>
<tr> <td style="text-align:left;"> Division </td> <td style="text-align:right;"> -1.4332399 </td> <td style="text-align:right;"> 4.6385509 </td> <td style="text-align:right;"> -0.3089844 </td> <td style="text-align:right;"> 0.7626322 </td> </tr>
<tr> <td style="text-align:left;"> Officer </td> <td style="text-align:right;"> -3.2179773 </td> <td style="text-align:right;"> 45.9309093 </td> <td style="text-align:right;"> -0.0700613 </td> <td style="text-align:right;"> 0.9452989 </td> </tr>
<tr> <td style="text-align:left;"> `Total soldiers` </td> <td style="text-align:right;"> -0.0004628 </td> <td style="text-align:right;"> 0.0040924 </td> <td style="text-align:right;"> -0.1130988 </td> <td style="text-align:right;"> 0.9118223 </td> </tr>
<tr> <td style="text-align:left;"> Wounded </td> <td style="text-align:right;"> 0.2241071 </td> <td style="text-align:right;"> 0.0145997 </td> <td style="text-align:right;"> 15.3501685 </td> <td style="text-align:right;"> 0.0000000 </td> </tr>
</tbody>
</table>

---

# <center><b><font color = purple>Residual Plots of Full Model</font></b></center>

<img src="data:image/png;base64,#final_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />

- There are some violations of the normality assumption.
- We will perform a Box-Cox transformation to correct the non-constant variance and the non-normality seen in the QQ plot.

---

# <center><b><font color = purple>Box Cox Transformations</font></b></center>

<img src="data:image/png;base64,#final_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />

- Because the variance is non-constant, we use the Box-Cox procedure to search for a transformation of the response variable.

---

# <center><b><font color = purple>Square-Root Transformation</font></b></center>

<table>
<caption>Square-root transformed model</caption>
<thead>
<tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> Estimate </th> <th style="text-align:right;"> Std. Error </th> <th style="text-align:right;"> t value </th> <th style="text-align:right;"> Pr(>|t|) </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 10.6869578 </td> <td style="text-align:right;"> 3.5020212 </td> <td style="text-align:right;"> 3.0516542 </td> <td style="text-align:right;"> 0.0100538 </td> </tr>
<tr> <td style="text-align:left;"> log(1 + Missingrate) </td> <td style="text-align:right;"> -3.2734075 </td> <td style="text-align:right;"> 9.5040551 </td> <td style="text-align:right;"> -0.3444222 </td> <td style="text-align:right;"> 0.7364895 </td> </tr>
<tr> <td style="text-align:left;"> Division </td> <td style="text-align:right;"> -0.3921241 </td> <td style="text-align:right;"> 0.2487102 </td> <td style="text-align:right;"> -1.5766304 </td> <td style="text-align:right;"> 0.1408641 </td> </tr>
<tr> <td style="text-align:left;"> Officer </td> <td style="text-align:right;"> -4.6453615 </td> <td style="text-align:right;"> 2.4612655 </td> <td style="text-align:right;"> -1.8873874 </td> <td style="text-align:right;"> 0.0835257 </td> </tr>
<tr> <td style="text-align:left;"> `Total soldiers` </td> <td style="text-align:right;"> -0.0001463 </td> <td style="text-align:right;"> 0.0002192 </td> <td style="text-align:right;"> -0.6676445 </td> <td style="text-align:right;"> 0.5169951 </td> </tr>
<tr> <td style="text-align:left;"> Wounded </td> <td style="text-align:right;"> 0.0061631 </td> <td style="text-align:right;"> 0.0007826 </td> <td style="text-align:right;"> 7.8754665 </td> <td style="text-align:right;"> 0.0000044 </td> </tr>
</tbody>
</table>

---

# <center><b><font color = purple>Residual Plots of Square-Root Transformation</font></b></center>

<img src="data:image/png;base64,#final_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" />

- There are some improvements in the residual diagnostic plots.
- Unfortunately, the violation of the normality assumption is still an issue.

---

# <center><b><font color = purple>Log Transformation</font></b></center>

<table>
<caption>log-transformed model</caption>
<thead>
<tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> Estimate </th> <th style="text-align:right;"> Std. Error </th> <th style="text-align:right;"> t value </th> <th style="text-align:right;"> Pr(>|t|) </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 6.7813426 </td> <td style="text-align:right;"> 1.7566139 </td> <td style="text-align:right;"> 3.8604628 </td> <td style="text-align:right;"> 0.0022667 </td> </tr>
<tr> <td style="text-align:left;"> Missingrate </td> <td style="text-align:right;"> 1.3906265 </td> <td style="text-align:right;"> 4.3717455 </td> <td style="text-align:right;"> 0.3180941 </td> <td style="text-align:right;"> 0.7558811 </td> </tr>
<tr> <td style="text-align:left;"> Division </td> <td style="text-align:right;"> -0.2681417 </td> <td style="text-align:right;"> 0.1247508 </td> <td style="text-align:right;"> -2.1494181 </td> <td style="text-align:right;"> 0.0526980 </td> </tr>
<tr> <td style="text-align:left;"> Officer </td> <td style="text-align:right;"> -2.7511788 </td> <td style="text-align:right;"> 1.2352822 </td> <td style="text-align:right;"> -2.2271662 </td> <td style="text-align:right;"> 0.0458439 </td> </tr>
<tr> <td style="text-align:left;"> `Total soldiers` </td> <td style="text-align:right;"> -0.0000930 </td> <td style="text-align:right;"> 0.0001101 </td> <td style="text-align:right;"> -0.8446590 </td> <td style="text-align:right;"> 0.4148242 </td> </tr>
<tr> <td style="text-align:left;"> Wounded </td> <td style="text-align:right;"> 0.0004615 </td> <td style="text-align:right;"> 0.0003926 </td> <td style="text-align:right;"> 1.1752885 </td> <td style="text-align:right;"> 0.2626724 </td> </tr>
</tbody>
</table>

---

# <center><b><font color = purple>Residual Plots of Log Transformation</font></b></center>

<img src="data:image/png;base64,#final_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" />

- The residual diagnostic plots are similar to those of the previous model, which means that the assumption of normal residuals is not satisfied by either of the two transformed models.

---

# <center><b><font color = gold>Goodness of fit measure</font></b></center>

<img src="data:image/png;base64,#https://raw.githubusercontent.com/popeyey1/bob/main/Screenshot%202023-04-30%20150326.png" style="display: block; margin: auto;" />

- We extract several goodness-of-fit measures from each of the three candidate models and see which one has the highest R^2 value.
- The goodness-of-fit measures of the untransformed model are uniformly better than those of the other two models.

---

# <center><b><font color = gold>Final Model</font></b></center>

<table>
<caption>Inferential Statistics of Final Model</caption>
<thead>
<tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> Estimate </th> <th style="text-align:right;"> Std. Error </th> <th style="text-align:right;"> t value </th> <th style="text-align:right;"> Pr(>|t|) </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 18.0939934 </td> <td style="text-align:right;"> 65.3153362 </td> <td style="text-align:right;"> 0.2770252 </td> <td style="text-align:right;"> 0.7864750 </td> </tr>
<tr> <td style="text-align:left;"> Missingrate </td> <td style="text-align:right;"> -276.8557362 </td> <td style="text-align:right;"> 162.5525244 </td> <td style="text-align:right;"> -1.7031771 </td> <td style="text-align:right;"> 0.1142706 </td> </tr>
<tr> <td style="text-align:left;"> Division </td> <td style="text-align:right;"> -1.4332399 </td> <td style="text-align:right;"> 4.6385509 </td> <td style="text-align:right;"> -0.3089844 </td> <td style="text-align:right;"> 0.7626322 </td> </tr>
<tr> <td style="text-align:left;"> Officer </td> <td style="text-align:right;"> -3.2179773 </td> <td style="text-align:right;"> 45.9309093 </td> <td style="text-align:right;"> -0.0700613 </td> <td style="text-align:right;"> 0.9452989 </td> </tr>
<tr> <td style="text-align:left;"> `Total soldiers` </td> <td style="text-align:right;"> -0.0004628 </td> <td style="text-align:right;"> 0.0040924 </td> <td style="text-align:right;"> -0.1130988 </td> <td style="text-align:right;"> 0.9118223 </td> </tr>
<tr> <td style="text-align:left;"> Wounded </td> <td style="text-align:right;"> 0.2241071 </td> <td style="text-align:right;"> 0.0145997 </td> <td style="text-align:right;"> 15.3501685 </td> <td style="text-align:right;"> 0.0000000 </td> </tr>
</tbody>
</table>

- Most of the p-values are not close to zero, so most predictors are not statistically significant.
- We will need to perform variable selection, using likelihood ratio tests to see whether any additional terms significantly reduce the error sum of squares.
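
---

# <center><b><font color = gold>Likelihood Ratio Test: A Sketch</font></b></center>

The likelihood ratio comparison of two nested linear models can be sketched as follows. This is a minimal illustration, not the code used to produce these slides. It assumes Gaussian errors and assumes the two models differ by exactly one term (1 degree of freedom). The function names are made up for illustration, and the residual sums of squares in the usage lines are hypothetical placeholders, not values from this analysis.

```python
import math

def lr_statistic(rss_reduced, rss_full, n):
    """LR statistic for nested Gaussian linear models: n * log(RSS_reduced / RSS_full)."""
    return n * math.log(rss_reduced / rss_full)

def chi2_pvalue_df1(stat):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    # For 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(stat / 2.0))

# Hypothetical residual sums of squares (NOT from the Gettysburg data); n = 18 observations.
stat = lr_statistic(rss_reduced=110.0, rss_full=100.0, n=18)
p_value = chi2_pvalue_df1(stat)
print(f"LR statistic = {stat:.3f}, p-value = {p_value:.3f}")
```

A p-value above .05 means the extra term does not significantly improve the fit, so the reduced model is retained.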
---

# <center><b><font color = gold> Full Model vs Model1</font></b></center>

<img src="data:image/png;base64,#https://raw.githubusercontent.com/popeyey1/bob/main/M.1.png.png" style="display: block; margin: auto;" />

- In the likelihood ratio test between the full model and the simpler model1, we got p > .05, so the extra terms in the full model do not significantly improve the fit and we prefer the reduced model (model1).
- We fail to reject the null hypothesis: we do not have evidence that the full model is better than the reduced model.

---

# <center><b><font color = gold>Model1 vs Model2</font></b></center>

<img src="data:image/png;base64,#https://raw.githubusercontent.com/popeyey1/bob/main/M.2.png.png" style="display: block; margin: auto;" />

- We again got p > .05, so model1 does not fit significantly better and we prefer the more reduced model (model2).
- We fail to reject the null hypothesis: we do not have evidence that model1 is better than the more reduced model.

---

# <center><b><font color = gold>Model2 vs Model3</font></b></center>

<img src="data:image/png;base64,#https://raw.githubusercontent.com/popeyey1/bob/main/M.1.png.png" style="display: block; margin: auto;" />

- In the likelihood ratio test between model2 and model3, we again got p > .05, so we prefer our most reduced model (model3).
- We fail to reject the null hypothesis: we do not have evidence that model2 is better than our most reduced model.
- Across all the likelihood ratio comparisons, our most reduced model is the preferred model.

---

# <center><b><font color = gold> Summary of the final model</font></b></center>

**Killed** = 5.506 − 260.7(**Missingrate**) + 0.224(**Wounded**)

- The model is used to estimate the number of Union soldiers killed from the missing rate and the number of soldiers wounded.

---
class: inverse, middle

# <center><b><font color = gold>Conclusion and Discussion</font></b></center>

- Since none of the variables except Wounded was significant in the full model, we carried out a variable selection procedure.
- The violation of the normality assumption for the residuals was still not completely fixed. However, the final model estimates a negative association between Killed and Missingrate.

## <center><b>Any Questions?</b></center>
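
---

# <center><b><font color = gold>Appendix: Checking the Final Model</font></b></center>

The final model equation, Killed = 5.506 − 260.7(Missingrate) + 0.224(Wounded), can be checked by hand against rows of the analytic data set. The sketch below is illustrative only: it uses the rounded coefficients from the summary slide and the first three rows of the Analytic Data Set table, and the function and variable names are hypothetical.

```python
# Rounded coefficients copied from the "Summary of the final model" slide.
INTERCEPT = 5.506
COEF_MISSINGRATE = -260.7
COEF_WOUNDED = 0.224

def predict_killed(missingrate, wounded):
    """Fitted number killed under the final model (hypothetical helper name)."""
    return INTERCEPT + COEF_MISSINGRATE * missingrate + COEF_WOUNDED * wounded

# (Missingrate, Wounded, observed Killed) for the first three rows
# of the Analytic Data Set table.
rows = [
    (0.1172316, 262, 42),
    (0.2232124, 2969, 624),
    (0.0138151, 270, 66),
]

for missingrate, wounded, killed in rows:
    fitted = predict_killed(missingrate, wounded)
    print(f"observed = {killed:4d}, fitted = {fitted:7.1f}, residual = {killed - fitted:7.1f}")
```

Because the coefficients are rounded, these fitted values may differ slightly from those of the fitted model itself.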