0.1 Data Description

I obtained this data set from dataworld.com, a web-based online community of data scientists that allows users to find and publish data sets and to learn about models in a data-science environment. I copied the raw data file and pasted it into a CSV file.

Source: www.CivilWarTalk.com
Description: Initial strength and casualties for the 9 divisions of the Army of the Potomac (Union Army) at the Battle of Gettysburg, July 1-3, 1863 (Meade vs. Lee).
Divisions: 1-I Corps, 2-II Corps, 3-III Corps, 4-V Corps, 5-VI Corps, 6-XI Corps, 7-XII Corps, 8-Cavalry Corps, 9-Artillery Reserves.
Casualties are classified as Killed, Wounded, or Missing/Captured.

Variables/Columns: Division, Officer, Total Soldiers, Killed, Wounded, Captured/Missing, Non-Casualty

Our categorical variables are Officer and Division. Officer is a binary indicator (1 = officer, 0 = not an officer; the Officer = 1 rows in the analytic data set are the much smaller groups). Division has 9 categories, numbered 1 through 9.

Practical Question

Our question is whether the explanatory variables affect the response variable, which is the number of soldiers killed.

0.2 Analytic Data set

In this section, we combine the two data sets created earlier to define the final analytic data set for statistical analysis. Only the first six rows (Divisions 1 to 3) are listed.

Missingrate  Division  Officer  Total soldiers  Killed  Wounded
0.1172316    1         1        708             42      262
0.2232124    1         0        9314            624     2969
0.0138151    2         1        941             66      270
0.0305618    2         0        11943           731     2924
0.0171779    3         1        815             50      251
0.0517598    3         0        11109           543     2778
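
The text does not define Missingrate. The sketch below assumes it is the Captured/Missing count divided by Total soldiers (an assumption, not stated in the source) and checks whether each listed rate corresponds to a whole number of soldiers. The row values are copied from the table above; the name `implied_missing` is hypothetical.

```python
# Rows copied from the analytic data set:
# (Missingrate, Division, Officer, Total soldiers, Killed, Wounded)
rows = [
    (0.1172316, 1, 1, 708, 42, 262),
    (0.2232124, 1, 0, 9314, 624, 2969),
    (0.0138151, 2, 1, 941, 66, 270),
    (0.0305618, 2, 0, 11943, 731, 2924),
    (0.0171779, 3, 1, 815, 50, 251),
    (0.0517598, 3, 0, 11109, 543, 2778),
]


def implied_missing(rate, total):
    """Whole-number missing count implied by a rate, ASSUMING rate = missing / total."""
    return round(rate * total)


# Largest gap between each listed rate and the rate rebuilt from the implied count.
max_gap = 0.0
for rate, division, officer, total, killed, wounded in rows:
    count = implied_missing(rate, total)
    rebuilt = count / total
    max_gap = max(max_gap, abs(rebuilt - rate))
    print(f"Division {division}, Officer {officer}: implied missing = {count}, "
          f"rebuilt rate = {rebuilt:.7f}")
```

If the assumption holds, every rebuilt rate should agree with the listed rate to the precision shown in the table.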

0.3 Full Model and Diagnostics

We first fit a linear model for Killed that includes all the predictor variables.

Statistics of Regression Coefficients
                Estimate       Std. Error    t value      Pr(>|t|)
(Intercept)     18.0939934     65.3153362    0.2770252    0.7864750
Missingrate     -276.8557362   162.5525244   -1.7031771   0.1142706
Division        -1.4332399     4.6385509     -0.3089844   0.7626322
Officer         -3.2179773     45.9309093    -0.0700613   0.9452989
Total soldiers  -0.0004628     0.0040924     -0.1130988   0.9118223
Wounded         0.2241071      0.0145997     15.3501685   0.0000000
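
Each t value in the table is the estimate divided by its standard error. As a small check, the sketch below recomputes the ratios from the printed estimates and standard errors. Small differences from the printed t values are expected because the table entries are rounded. The variable names are illustrative only.

```python
# (term, estimate, std. error, reported t value), copied from the coefficient table
coef_table = [
    ("(Intercept)", 18.0939934, 65.3153362, 0.2770252),
    ("Missingrate", -276.8557362, 162.5525244, -1.7031771),
    ("Division", -1.4332399, 4.6385509, -0.3089844),
    ("Officer", -3.2179773, 45.9309093, -0.0700613),
    ("Total soldiers", -0.0004628, 0.0040924, -0.1130988),
    ("Wounded", 0.2241071, 0.0145997, 15.3501685),
]


def t_value(estimate, std_error):
    """t statistic for testing that a single coefficient equals zero."""
    return estimate / std_error


recomputed = {term: t_value(est, se) for term, est, se, _ in coef_table}

for term, est, se, reported in coef_table:
    print(f"{term:15s} recomputed t = {recomputed[term]:10.5f}   reported t = {reported:10.5f}")
```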

Next, we perform a residual analysis to check the validity of the model before making inferences from it. The diagnostics show several violations of the model assumptions. The variance of the residuals is not constant. The top-left graph also shows clusters of points with gaps between the groups. The Q-Q plot shows the residuals departing substantially from the reference line at both ends, so the normality assumption is in doubt. The residual plot also shows a weak curved pattern. We therefore apply a Box-Cox transformation to address the non-constant variance and the non-normality seen in the Q-Q plot.

0.4 Box-Cox Transformations

Because the variance of the residuals is not constant, we use the Box-Cox procedure to search for a transformation of the response variable.
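
For reference, the Box-Cox family is (y^lambda - 1) / lambda for lambda not equal to 0 and log(y) for lambda = 0, so lambda = 0.5 corresponds to a square-root transformation and lambda = 0 to a log transformation. The following is a minimal sketch only. It uses an intercept-only normal model and the six Killed values listed in the analytic data set, whereas the analysis in this note would condition on the predictors and use all rows, so the lambda it selects is illustrative and not the note's result. The function names are hypothetical.

```python
import math


def box_cox(y, lam):
    """Box-Cox transform of a positive value y with parameter lam."""
    if y <= 0:
        raise ValueError("Box-Cox requires positive values")
    if abs(lam) < 1e-12:
        return math.log(y)
    return (y ** lam - 1.0) / lam


def profile_loglik(ys, lam):
    """Profile log-likelihood of lam for an intercept-only normal model (a simplification)."""
    n = len(ys)
    z = [box_cox(y, lam) for y in ys]
    mean_z = sum(z) / n
    sigma2 = sum((v - mean_z) ** 2 for v in z) / n
    # The second term is the log-Jacobian of the transformation.
    return -0.5 * n * math.log(sigma2) + (lam - 1.0) * sum(math.log(y) for y in ys)


# Killed values from the six rows listed in the analytic data set.
killed = [42, 624, 66, 731, 50, 543]

# Grid search over lambda in [-2, 2] in steps of 0.1.
grid = [i / 10 for i in range(-20, 21)]
best_lam = max(grid, key=lambda lam: profile_loglik(killed, lam))
print("lambda maximizing the profile log-likelihood on this grid:", best_lam)
```

In practice the chosen lambda is usually rounded to a convenient value such as 0, 0.5, or 1 when it lies inside the confidence interval for lambda.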

0.5 Square-Root Transformation

Square-root-transformed model
                      Estimate     Std. Error   t value      Pr(>|t|)
(Intercept)           10.6869578   3.5020212    3.0516542    0.0100538
log(1 + Missingrate)  -3.2734075   9.5040551    -0.3444222   0.7364895
Division              -0.3921241   0.2487102    -1.5766304   0.1408641
Officer               -4.6453615   2.4612655    -1.8873874   0.0835257
Total soldiers        -0.0001463   0.0002192    -0.6676445   0.5169951
Wounded               0.0061631    0.0007826    7.8754665    0.0000044

There are some improvements in the residual diagnostic plots. First, the weak curved pattern has disappeared from the residual plot. Second, the points on the Q-Q plot lie closer to the reference line, although the points on the right side still deviate somewhat. The violation of the normality assumption therefore remains an issue.

0.6 Log Transformation

The residual diagnostic plots for the log-transformed model are similar to those of the previous model, which used a square-root transformation. The Q-Q plots of the two models are similar, and so are the residual plots, which means that the assumption of normal residuals is not satisfied for either model.

Log-transformed model
                Estimate     Std. Error   t value      Pr(>|t|)
(Intercept)     6.7813426    1.7566139    3.8604628    0.0022667
Missingrate     1.3906265    4.3717455    0.3180941    0.7558811
Division        -0.2681417   0.1247508    -2.1494181   0.0526980
Officer         -2.7511788   1.2352822    -2.2271662   0.0458439
Total soldiers  -0.0000930   0.0001101    -0.8446590   0.4148242
Wounded         0.0004615    0.0003926    1.1752885    0.2626724
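
Coefficients of a transformed model are on the transformed scale, so a fitted value has to be back-transformed before it can be read as a number of soldiers. The sketch below assumes the response of this model is the natural log of Killed (an assumption; the text does not state the response) and evaluates the fitted equation for the first row of the analytic data set. The function names are hypothetical.

```python
import math

# Coefficients copied from the log-transformed model table.
COEF = {
    "intercept": 6.7813426,
    "missingrate": 1.3906265,
    "division": -0.2681417,
    "officer": -2.7511788,
    "total_soldiers": -0.0000930,
    "wounded": 0.0004615,
}


def predict_log_killed(missingrate, division, officer, total_soldiers, wounded):
    """Linear predictor on the log scale (ASSUMES the response is log(Killed))."""
    return (COEF["intercept"]
            + COEF["missingrate"] * missingrate
            + COEF["division"] * division
            + COEF["officer"] * officer
            + COEF["total_soldiers"] * total_soldiers
            + COEF["wounded"] * wounded)


def predict_killed(missingrate, division, officer, total_soldiers, wounded):
    """Back-transform the log-scale prediction with exp()."""
    return math.exp(predict_log_killed(missingrate, division, officer,
                                       total_soldiers, wounded))


# First row of the analytic data set: Division 1, Officer = 1, observed Killed = 42.
first_row_log = predict_log_killed(0.1172316, 1, 1, 708, 262)
first_row_prediction = predict_killed(0.1172316, 1, 1, 708, 262)
print(f"log scale: {first_row_log:.4f}, original scale: {first_row_prediction:.1f}")
```

A plain exp() back-transformation estimates the median rather than the mean of the response under a log-normal model, so it tends to understate the mean.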

0.7 Goodness of fit measure

Next, we extract several goodness-of-fit measures from each of the three candidate models and compare them. In the table below, the rows sqrt.price.log.dist and log.price are the square-root and log models, respectively. The untransformed model (full.model) has the highest R^2 and adjusted R^2. SSE, AIC, SBC, and PRESS are computed on the scale of each model's response, so they cannot be compared directly between the untransformed and the transformed models. Considering interpretability and simplicity as well, we choose the first model as the final model.

Goodness-of-fit Measures of the Candidate Models
                     SSE          R.sq       R.adj      Cp  AIC         SBC         PRESS
full.model           14251.50983  0.9849540  0.9786849  6   132.136435  137.478665  38252.60131
sqrt.price.log.dist  40.98324     0.9653949  0.9509761  6   26.810245   32.152476   173.20888
log.price            10.30821     0.8178306  0.7419267  6   1.966237    7.308467    31.58792
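
Several columns of this table can be rebuilt from SSE and R.sq alone. The sketch below assumes n = 18 observations (9 divisions times 2 officer levels, an inference from the data description rather than a stated figure) and p = 6 estimated coefficients. It uses the common least-squares forms R.adj = 1 - (1 - R^2)(n - 1)/(n - p), AIC = n log(SSE/n) + 2p, and SBC = n log(SSE/n) + p log(n). The function names are hypothetical.

```python
import math

N_OBS = 18     # ASSUMPTION: 9 divisions x 2 officer levels
N_PARAMS = 6   # intercept plus five predictors

# model name: (SSE, R.sq), copied from the goodness-of-fit table
models = {
    "full.model": (14251.50983, 0.9849540),
    "sqrt.price.log.dist": (40.98324, 0.9653949),
    "log.price": (10.30821, 0.8178306),
}


def adjusted_r2(r2, n, p):
    """Adjusted R^2 for a model with p coefficients (including the intercept)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p)


def aic(sse, n, p):
    """AIC in its SSE-based form for least-squares models."""
    return n * math.log(sse / n) + 2 * p


def sbc(sse, n, p):
    """Schwarz Bayesian criterion in its SSE-based form."""
    return n * math.log(sse / n) + p * math.log(n)


rebuilt = {
    name: (adjusted_r2(r2, N_OBS, N_PARAMS),
           aic(sse, N_OBS, N_PARAMS),
           sbc(sse, N_OBS, N_PARAMS))
    for name, (sse, r2) in models.items()
}

for name, (r_adj, a, s) in rebuilt.items():
    print(f"{name:20s} R.adj = {r_adj:.7f}  AIC = {a:.6f}  SBC = {s:.6f}")
```

Because AIC and SBC depend on SSE, and SSE is measured in the units of each model's response, these criteria are only comparable between models that share the same response scale.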

0.8 Final Model

Since most of the p-values are not close to zero, we perform variable selection, using the likelihood ratio test to check whether the additional terms meaningfully reduce the error sum of squares. We remove the variables with the highest p-values one at a time and compare the successive models using the likelihood ratio test.

Inferential Statistics of the Full Model (Starting Point for Variable Selection)
                Estimate       Std. Error    t value      Pr(>|t|)
(Intercept)     18.0939934     65.3153362    0.2770252    0.7864750
Missingrate     -276.8557362   162.5525244   -1.7031771   0.1142706
Division        -1.4332399     4.6385509     -0.3089844   0.7626322
Officer         -3.2179773     45.9309093    -0.0700613   0.9452989
Total soldiers  -0.0004628     0.0040924     -0.1130988   0.9118223
Wounded         0.2241071      0.0145997     15.3501685   0.0000000

The likelihood ratio test between the full model and the simpler model1 (Officer removed) gives p > .05. We therefore fail to reject the null hypothesis: there is no evidence that the full model fits better than the reduced model, so we prefer model1 and conclude that Officer can be dropped without a detectable loss of fit.

The likelihood ratio test between model1 and model2 (Total soldiers also removed) again gives p > .05. We fail to reject the null hypothesis, and there is no evidence that model1 fits better than the more reduced model2. We conclude that both Total soldiers and Officer can be dropped without a detectable loss of fit.

The likelihood ratio test between model2 and model3 (Division also removed) gives p > .05 once more, so there is no evidence that model2 fits better than our most reduced model, model3. We conclude that Total soldiers, Officer, and Division can all be dropped without a detectable loss of fit. As a result, the most reduced model, which keeps only Missingrate and Wounded, is the one we prefer after comparing the models with the likelihood ratio test.
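
The first comparison can be approximated from the numbers already reported. For two nested least-squares models, the likelihood ratio statistic is n log(SSE_reduced / SSE_full), and when a single term is dropped, SSE_reduced = SSE_full (1 + t^2 / (n - p)), where t is that term's t value in the larger model. The sketch below applies this to dropping Officer, assuming n = 18 observations (an inference, not a stated figure) and a chi-square reference distribution with 1 degree of freedom. It is a reconstruction under those assumptions, not necessarily the computation used in this note, and the function names are hypothetical.

```python
import math

N_OBS = 18               # ASSUMPTION: 9 divisions x 2 officer levels
N_PARAMS_FULL = 6        # coefficients in the full model
SSE_FULL = 14251.50983   # SSE of full.model from the goodness-of-fit table
T_OFFICER = -0.0700613   # t value of Officer in the full model


def sse_after_dropping(sse_full, t_value, n, p):
    """SSE of the model without one term, from that term's t value (F = t^2)."""
    return sse_full * (1.0 + t_value ** 2 / (n - p))


def lr_statistic(sse_reduced, sse_full, n):
    """Likelihood ratio statistic for nested Gaussian linear models."""
    return n * math.log(sse_reduced / sse_full)


def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


sse_reduced = sse_after_dropping(SSE_FULL, T_OFFICER, N_OBS, N_PARAMS_FULL)
lr = lr_statistic(sse_reduced, SSE_FULL, N_OBS)
p_value = chi2_sf_1df(lr)
print(f"SSE without Officer = {sse_reduced:.2f}, LR = {lr:.5f}, p = {p_value:.4f}")
```

The resulting p-value is well above .05, in line with the conclusion that Officer can be dropped.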

0.9 Summary of the final model

Killed = 5.506 − 260.7(Missingrate) + .224(Wounded)

We used regression techniques, including Box-Cox transformations of the response variable and a transformation of an explanatory variable, to search for the final model. Because no variable other than Wounded was significant, we carried out a variable selection procedure. The violation of the normality assumption for the residuals was not completely resolved. In the final model, the coefficient of Missingrate is negative: holding Wounded fixed, a one-unit increase in Missingrate is associated with 260.7 fewer soldiers killed. Because Missingrate is a proportion, a one-unit change spans its entire range; an increase of 0.01 corresponds to about 2.6 fewer soldiers killed. Holding Missingrate fixed, each additional wounded soldier is associated with an increase of 0.224 in the number of soldiers killed.
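
As a worked example of reading the final equation, the sketch below evaluates it for the first row of the analytic data set (Missingrate = 0.1172316, Wounded = 262, observed Killed = 42) and shows the change implied by a 0.01 increase in Missingrate. The function name `predict_killed_final` is hypothetical.

```python
def predict_killed_final(missingrate, wounded):
    """Fitted value from the final model: Killed = 5.506 - 260.7*Missingrate + 0.224*Wounded."""
    return 5.506 - 260.7 * missingrate + 0.224 * wounded


# First row of the analytic data set (observed Killed = 42).
fitted_first_row = predict_killed_final(0.1172316, 262)

# Change in the fitted value when Missingrate rises by 0.01 with Wounded held fixed.
effect_of_point_01 = (predict_killed_final(0.1172316 + 0.01, 262)
                      - predict_killed_final(0.1172316, 262))

print(f"fitted Killed for first row: {fitted_first_row:.2f}")
print(f"change for +0.01 in Missingrate: {effect_of_point_01:.3f}")
```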