I obtained this data from dataworld.com and copied and pasted the raw data file into a CSV file. The data set used in this note comes from a web-based online community of data scientists that allows users to find and publish data sets and learn about models in a data-science environment.
Source: www.CivilWarTalk.com. Description: Initial strength and casualties for the 9 divisions of the Army of the Potomac (Union Army) at the Battle of Gettysburg, July 1-3, 1863 (Meade vs. Lee). Divisions: 1-I Corps, 2-II Corps, 3-III Corps, 4-V Corps, 5-VI Corps, 6-XI Corps, 7-XII Corps, 8-Cavalry Corps, 9-Artillery Reserves. Casualties are classified as Killed, Wounded, Missing/Captured.
Variables/Columns: Division, Officer, Total Soldiers, Killed, Wounded, Captured/Missing, Non-Casualty

Our categorical variables are Officer and Division. Officer is coded 1 = officer, 0 = not an officer. Division has 9 categories, numbered 1-9.

## Practical Question

Our question is whether the explanatory variables affect the response variable, the number of soldiers killed.
In this section, we combine 2 data sets created earlier to define the final analytic data set for statistical analysis.
| Missingrate | Division | Officer | Total soldiers | Killed | Wounded |
|---|---|---|---|---|---|
| 0.1172316 | 1 | 1 | 708 | 42 | 262 |
| 0.2232124 | 1 | 0 | 9314 | 624 | 2969 |
| 0.0138151 | 2 | 1 | 941 | 66 | 270 |
| 0.0305618 | 2 | 0 | 11943 | 731 | 2924 |
| 0.0171779 | 3 | 1 | 815 | 50 | 251 |
| 0.0517598 | 3 | 0 | 11109 | 543 | 2778 |
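
The text does not define Missingrate. The values shown are consistent with the Captured/Missing count divided by Total soldiers, because multiplying each rate by its total gives a near-whole number. A minimal Python sketch of that check, using only the six rows displayed above (the interpretation of Missingrate is an assumption, and the names `rows` and `implied_missing` are illustrative):

```python
# Rows copied from the table above: (Missingrate, Division, Officer, Total soldiers)
rows = [
    (0.1172316, 1, 1, 708),
    (0.2232124, 1, 0, 9314),
    (0.0138151, 2, 1, 941),
    (0.0305618, 2, 0, 11943),
    (0.0171779, 3, 1, 815),
    (0.0517598, 3, 0, 11109),
]

# Assumption: Missingrate = Captured/Missing count / Total soldiers,
# so rate * total should land very close to a whole number of soldiers.
implied_missing = [rate * total for rate, _, _, total in rows]

for (rate, division, officer, total), m in zip(rows, implied_missing):
    nearest = round(m)
    # Recompute the rate from the rounded count to see how closely it matches.
    print(division, officer, nearest, f"{nearest / total:.7f}", rate)
```

If the assumption holds, the recomputed rates agree with the tabulated ones to the displayed precision.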
We first fit a linear model for Killed that includes all the predictor variables.

| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 18.0939934 | 65.3153362 | 0.2770252 | 0.7864750 |
| Missingrate | -276.8557362 | 162.5525244 | -1.7031771 | 0.1142706 |
| Division | -1.4332399 | 4.6385509 | -0.3089844 | 0.7626322 |
| Officer | -3.2179773 | 45.9309093 | -0.0700613 | 0.9452989 |
| Total soldiers | -0.0004628 | 0.0040924 | -0.1130988 | 0.9118223 |
| Wounded | 0.2241071 | 0.0145997 | 15.3501685 | 0.0000000 |

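
Each t value in the table is the estimate divided by its standard error. The short Python sketch below recomputes that ratio from the reported numbers; the dictionary names `coef` and `t_values` are illustrative, and the values are copied from the table.

```python
# (Estimate, Std. Error, reported t value) for the full model, copied from the table
coef = {
    "(Intercept)": (18.0939934, 65.3153362, 0.2770252),
    "Missingrate": (-276.8557362, 162.5525244, -1.7031771),
    "Division": (-1.4332399, 4.6385509, -0.3089844),
    "Officer": (-3.2179773, 45.9309093, -0.0700613),
    "Total soldiers": (-0.0004628, 0.0040924, -0.1130988),
    "Wounded": (0.2241071, 0.0145997, 15.3501685),
}

# t statistic for each coefficient: estimate / standard error
t_values = {name: est / se for name, (est, se, _) in coef.items()}

for name, (_, _, reported) in coef.items():
    print(f"{name:15s} recomputed={t_values[name]:10.4f} reported={reported:10.4f}")
```

Small differences in the last digits come from the rounding of the printed estimates and standard errors.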
Next, we perform residual analysis to check the validity of the model before making inferences from it. There are some violations of the model assumptions. The variance of the residuals is not constant. The top-left plot also shows clusters of points separated by gaps. The Q-Q plot shows the residuals veering away from the line at both ends, so the normality assumption is questionable. The residual plot also shows a weak curved pattern. We will therefore apply a Box-Cox transformation to address the non-constant variance and the non-normality seen in the Q-Q plot.
Because the variance is not constant, we use the Box-Cox procedure to search for a transformation of the response variable.
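
The Box-Cox search can be sketched as a grid search over the power parameter, keeping the value that maximizes the profile log-likelihood. The Python sketch below is a toy illustration, not the analysis reported here: it uses only the six rows displayed earlier with Wounded as the sole predictor, and the function names `boxcox` and `profile_loglik` are illustrative.

```python
import numpy as np

def boxcox(y, lam):
    """Box-Cox transform: log(y) when lam is 0, otherwise (y**lam - 1) / lam."""
    y = np.asarray(y, dtype=float)
    if abs(lam) < 1e-12:
        return np.log(y)
    return (y**lam - 1.0) / lam

def profile_loglik(y, X, lam):
    """Profile log-likelihood of lam for a linear model, up to an additive constant."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    z = boxcox(y, lam)
    beta, *_ = np.linalg.lstsq(X, z, rcond=None)   # least-squares fit on the transformed scale
    rss = float(np.sum((z - X @ beta) ** 2))
    # The second term is the log-Jacobian of the transformation.
    return -0.5 * n * np.log(rss / n) + (lam - 1.0) * float(np.sum(np.log(y)))

# Toy input (assumption): only the six displayed rows, Wounded as the only predictor.
killed = np.array([42, 624, 66, 731, 50, 543], dtype=float)
wounded = np.array([262, 2969, 270, 2924, 251, 2778], dtype=float)
X = np.column_stack([np.ones_like(wounded), wounded])

grid = np.linspace(-2.0, 2.0, 81)
loglik = np.array([profile_loglik(killed, X, lam) for lam in grid])
best_lambda = grid[int(np.argmax(loglik))]
print(f"lambda maximizing the profile log-likelihood on the toy data: {best_lambda:.2f}")
```

A value near 0.5 would point to a square-root transformation and a value near 0 to a log transformation; the toy result need not match the one obtained from the full data set.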

| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 10.6869578 | 3.5020212 | 3.0516542 | 0.0100538 |
| log(1 + Missingrate) | -3.2734075 | 9.5040551 | -0.3444222 | 0.7364895 |
| Division | -0.3921241 | 0.2487102 | -1.5766304 | 0.1408641 |
| Officer | -4.6453615 | 2.4612655 | -1.8873874 | 0.0835257 |
| Total soldiers | -0.0001463 | 0.0002192 | -0.6676445 | 0.5169951 |
| Wounded | 0.0061631 | 0.0007826 | 7.8754665 | 0.0000044 |

There are some improvements in the residual diagnostic plots. First, the weak curved pattern has been removed from the residual plot. Second, the points on the Q-Q plot lie closer to the line, although the points on the right side still veer off a bit. Unfortunately, the violation of the normality assumption is still an issue.
The residual diagnostic plots below are similar to those of the previous model, which used a square-root transformation. The Q-Q plots of the models are similar, as are the residual plots, which means that the assumption of normal residuals is not satisfied for either of the two models.

| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 6.7813426 | 1.7566139 | 3.8604628 | 0.0022667 |
| Missingrate | 1.3906265 | 4.3717455 | 0.3180941 | 0.7558811 |
| Division | -0.2681417 | 0.1247508 | -2.1494181 | 0.0526980 |
| Officer | -2.7511788 | 1.2352822 | -2.2271662 | 0.0458439 |
| Total soldiers | -0.0000930 | 0.0001101 | -0.8446590 | 0.4148242 |
| Wounded | 0.0004615 | 0.0003926 | 1.1752885 | 0.2626724 |

Next, we extract several goodness-of-fit measures from each of the three candidate models and see which has the highest R^2 value. The table below shows that the untransformed model (full.model) has the highest R^2 and adjusted R^2. The SSE, AIC, SBC, and PRESS values are computed on different response scales across the three models, so they are not directly comparable. Considering interpretability and simplicity as well, we choose the first model as the final model.

| | SSE | R.sq | R.adj | Cp | AIC | SBC | PRESS |
|---|---|---|---|---|---|---|---|
| full.model | 14251.50983 | 0.9849540 | 0.9786849 | 6 | 132.136435 | 137.478665 | 38252.60131 |
| sqrt.price.log.dist | 40.98324 | 0.9653949 | 0.9509761 | 6 | 26.810245 | 32.152476 | 173.20888 |
| log.price | 10.30821 | 0.8178306 | 0.7419267 | 6 | 1.966237 | 7.308467 | 31.58792 |

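
Several columns of this table can be recomputed from SSE and R.sq alone. The Python sketch below assumes n = 18 observations (9 divisions times 2 officer levels, which is not stated explicitly in the text) and p = 6 coefficients per model, and uses the common least-squares forms AIC = n log(SSE/n) + 2p and SBC = n log(SSE/n) + p log(n). The function names are illustrative.

```python
import math

n, p = 18, 6  # assumption: 9 divisions x 2 officer levels; 6 coefficients per model

# (SSE, R.sq, reported R.adj, reported AIC, reported SBC), copied from the table above
models = {
    "full.model": (14251.50983, 0.9849540, 0.9786849, 132.136435, 137.478665),
    "sqrt.price.log.dist": (40.98324, 0.9653949, 0.9509761, 26.810245, 32.152476),
    "log.price": (10.30821, 0.8178306, 0.7419267, 1.966237, 7.308467),
}

def adjusted_r2(r2, n, p):
    """Adjusted R^2: penalizes R^2 for the number of fitted coefficients."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p)

def aic(sse, n, p):
    """Least-squares AIC, up to an additive constant."""
    return n * math.log(sse / n) + 2 * p

def sbc(sse, n, p):
    """Schwarz Bayesian criterion (BIC), same constant convention as aic()."""
    return n * math.log(sse / n) + p * math.log(n)

recomputed = {
    name: (adjusted_r2(r2, n, p), aic(sse, n, p), sbc(sse, n, p))
    for name, (sse, r2, _, _, _) in models.items()
}
for name, (r_adj, a, s) in recomputed.items():
    print(f"{name}: R.adj={r_adj:.7f} AIC={a:.4f} SBC={s:.4f}")
```

Because SSE enters AIC and SBC through log(SSE/n), these criteria shift whenever the response is rescaled, which is why they cannot rank models fitted to Killed, its square root, and its log against one another.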
Since most of the p-values are not close to zero, we use a variable selection method based on the likelihood ratio test to see whether the additional terms significantly reduce the error sum of squares. We start by removing the variables with the highest p-values one at a time and compare the models using the likelihood ratio test.

| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 18.0939934 | 65.3153362 | 0.2770252 | 0.7864750 |
| Missingrate | -276.8557362 | 162.5525244 | -1.7031771 | 0.1142706 |
| Division | -1.4332399 | 4.6385509 | -0.3089844 | 0.7626322 |
| Officer | -3.2179773 | 45.9309093 | -0.0700613 | 0.9452989 |
| Total soldiers | -0.0004628 | 0.0040924 | -0.1130988 | 0.9118223 |
| Wounded | 0.2241071 | 0.0145997 | 15.3501685 | 0.0000000 |

The likelihood ratio test between the full model and the simpler model1 gives p > .05, so we fail to reject the null hypothesis: we have no evidence that the full model fits better than the reduced model (model1). We conclude that Officer can be dropped from the model without losing predictive power.
The likelihood ratio test between model1 and model2 also gives p > .05, so we again fail to reject the null hypothesis: we have no evidence that model1 fits better than the more reduced model (model2). We conclude that Total soldiers and Officer can both be dropped from the model without losing predictive power.
The likelihood ratio test between model2 and model3 gives p > .05 once more, so we fail to reject the null hypothesis: we have no evidence that model2 fits better than our most reduced model (model3). We conclude that Total soldiers, Officer, and Division can all be dropped from the model without losing predictive power. Of the models compared with the likelihood ratio test, the most reduced model is therefore the one we keep.
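
One common form of the likelihood ratio test for nested Gaussian linear models uses the statistic n log(SSE_reduced / SSE_full), compared with a chi-square distribution whose degrees of freedom equal the number of dropped terms. The Python sketch below handles the one-term case used at each step above. It is a toy illustration on the six rows displayed earlier, not a reproduction of the reported tests, and the function names `sse` and `lr_test_one_term` are illustrative.

```python
import math
import numpy as np

def sse(X, y):
    """Error sum of squares from an ordinary least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def lr_test_one_term(sse_reduced, sse_full, n):
    """LR statistic and chi-square(1) p-value for dropping a single term."""
    stat = max(n * math.log(sse_reduced / sse_full), 0.0)  # guard against tiny negative round-off
    # For 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Toy input (assumption): only the six displayed rows.
missingrate = np.array([0.1172316, 0.2232124, 0.0138151, 0.0305618, 0.0171779, 0.0517598])
officer = np.array([1, 0, 1, 0, 1, 0], dtype=float)
wounded = np.array([262, 2969, 270, 2924, 251, 2778], dtype=float)
killed = np.array([42, 624, 66, 731, 50, 543], dtype=float)
ones = np.ones_like(killed)

X_full = np.column_stack([ones, missingrate, wounded, officer])
X_reduced = np.column_stack([ones, missingrate, wounded])  # Officer dropped

stat, p_value = lr_test_one_term(sse(X_reduced, killed), sse(X_full, killed), len(killed))
print(f"LR statistic = {stat:.4f}, p-value = {p_value:.4f}")
```

A p-value above .05 means the dropped term does not significantly improve the fit, which is the decision rule applied at each step above.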
Killed = 5.506 − 260.7(Missingrate) + .224(Wounded)
We used regression techniques such as Box-Cox transformations of the response variable and transformations of the explanatory variables to find the final model. Because no variable other than Wounded was significant, we carried out a variable selection procedure. The violation of the normality assumption for the residuals was still not completely fixed. However, we found that Killed and Missingrate are negatively associated in the final model. Because Missingrate is a proportion, a one-unit increase spans its whole range; with Wounded held fixed, a 0.01 increase in Missingrate corresponds to about 2.6 fewer soldiers killed (260.7 per unit). Likewise, with Missingrate held fixed, each additional wounded soldier corresponds to an increase of about 0.224 in the number of soldiers killed.
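
As a small worked example, the fitted equation above can be applied to the six rows displayed earlier to compare fitted and observed values. This is a sketch using the rounded coefficients as printed; the function name `predict_killed` is illustrative.

```python
def predict_killed(missingrate, wounded):
    """Reported final model: Killed = 5.506 - 260.7 * Missingrate + 0.224 * Wounded."""
    return 5.506 - 260.7 * missingrate + 0.224 * wounded

# (Missingrate, Wounded, observed Killed) for the six rows displayed earlier
rows = [
    (0.1172316, 262, 42),
    (0.2232124, 2969, 624),
    (0.0138151, 270, 66),
    (0.0305618, 2924, 731),
    (0.0171779, 251, 50),
    (0.0517598, 2778, 543),
]

for rate, wounded, killed in rows:
    fitted = predict_killed(rate, wounded)
    print(f"observed={killed:4d} fitted={fitted:8.2f} residual={killed - fitted:8.2f}")

# Change in fitted Killed for a 0.01 increase in Missingrate, Wounded held fixed
delta = predict_killed(0.06, 1000) - predict_killed(0.05, 1000)
print(f"effect of +0.01 in Missingrate: {delta:.3f}")
```

Because the coefficients are rounded, the fitted values are approximate.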