I obtained the data from dataworld.com by copying and pasting the raw
data file into a CSV file. The data set used in this note comes from a
web-based online community of data scientists that allows users to find
and publish data sets and learn about models in a data-science
environment.
Source: www.CivilWarTalk.com
Description: Initial strength and casualties for the 9 divisions of the Army of the Potomac (Union Army) at the Battle of Gettysburg, July 1-3, 1863 (Meade vs. Lee).
Divisions: 1-I Corps, 2-II Corps, 3-III Corps, 4-V Corps, 5-VI Corps, 6-XI Corps, 7-XII Corps, 8-Cavalry Corps, 9-Artillery Reserves.
Casualties are classified as Killed, Wounded, or Missing/Captured.
Variables/Columns: Division, Officer, Total Soldiers, Killed, Wounded,
Captured/Missing, Non-Casualty
Our categorical variables are Officer and Division. Officer is coded
0 = officer, 1 = not an officer. Division has 9 categories, numbered 1
through 9.
A pairwise plot shows that some of the relationships between the variables are stronger than others. Wounded and Killed appear to have the strongest positive correlation, while the other pairs of variables show weaker positive or negative correlations.
We can see that there is a positive linear relationship between the number of soldiers killed and the number wounded.
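
As a rough illustration of how the strength of a pairwise linear relationship can be quantified, the sketch below computes the Pearson correlation coefficient from scratch. The `wounded` and `killed` values are made-up placeholders, not the Gettysburg data, and the function name `pearson_r` is hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations about the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Placeholder values for illustration only (NOT the Gettysburg data).
wounded = [10, 40, 90, 150, 300, 420]
killed = [3, 8, 21, 30, 66, 95]

r = pearson_r(wounded, killed)
print(f"Pearson r = {r:.3f}")
```

A value of `r` close to 1 corresponds to the kind of strong positive linear pattern described for Wounded and Killed.
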
The bottom-left graph shows a cluster of points along with gaps where
groups of points are missing, while the bottom-right chart indicates
that there are some outliers. Moreover, the top-right chart shows points
veering away from the line, which indicates that the normality
assumption has been violated. Also, the top-left chart shows no linear
trend.
Since the model assumptions are violated, the p-value could be wrong. Thus, since our original sample is not too small, we will perform a bootstrap sampling process. The slope parameter is interpreted as follows: for each additional wounded soldier, the expected number of soldiers killed increases by about 0.22 (the wound estimate in the table below).
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 0.4239421 | 9.9297740 | 0.042694 | 0.9664737 |
| wound | 0.2166863 | 0.0075045 | 28.874202 | 0.0000000 |
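
For readers who want to see where the columns of such a table come from, here is a minimal sketch of simple least-squares regression with standard errors and t values. The function name `simple_ols` and the data are hypothetical placeholders, not the Gettysburg records or the software used for the table above. The key relationship is that each t value is the estimate divided by its standard error; for the wound row, 0.2166863 / 0.0075045 is approximately 28.87.

```python
import math

def simple_ols(x, y):
    """Fit y = b0 + b1 * x; return (estimate, std. error, t value) per coefficient."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    # Residual variance uses n - 2 degrees of freedom (two fitted parameters).
    rss = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
    sigma2 = rss / (n - 2)
    se_b1 = math.sqrt(sigma2 / sxx)
    se_b0 = math.sqrt(sigma2 * (1.0 / n + mean_x ** 2 / sxx))
    return {
        "intercept": (b0, se_b0, b0 / se_b0),
        "slope": (b1, se_b1, b1 / se_b1),
    }

# Placeholder values for illustration only (NOT the Gettysburg data).
wounded = [10, 40, 90, 150, 300, 420]
killed = [3, 8, 21, 30, 66, 95]

fit = simple_ols(wounded, killed)
for name, (est, se, t) in fit.items():
    print(f"{name:>9}: estimate={est:.4f}  se={se:.4f}  t={t:.3f}")
```

The standard errors here rely on the usual model assumptions (constant variance, independent and roughly normal errors), which is why the bootstrap is a useful cross-check when those assumptions look doubtful.
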
The idea is to draw a bootstrap sample of the observation indices (sampling with replacement) and then use those indices to pull the corresponding records, which together form the bootstrap sample.
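
The resampling step can be sketched as follows. This is only an illustration under stated assumptions: the `records` list holds made-up (wounded, killed) pairs rather than the Gettysburg data, and the seed value is arbitrary.

```python
import random

# Placeholder records for illustration only (NOT the Gettysburg data):
# each tuple is (wounded, killed) for one row of a hypothetical data set.
records = [(10, 3), (40, 8), (90, 21), (150, 30), (300, 66), (420, 95)]

rng = random.Random(2024)  # arbitrary fixed seed so the sketch is reproducible
n = len(records)

# Step 1: resample the row indices with replacement.
boot_ids = [rng.randrange(n) for _ in range(n)]

# Step 2: use the sampled indices to pull the corresponding records.
boot_sample = [records[i] for i in boot_ids]

print(boot_ids)
print(boot_sample)
```

Because sampling is with replacement, some rows typically appear more than once in `boot_sample` and others not at all, while each (wounded, killed) pair stays intact.
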
If we repeat the bootstrap sampling and regression modeling 1000 times, we obtain 1000 sets of bootstrap regression coefficients. These bootstrap coefficients (the intercept and the slope) can be used to construct bootstrap confidence intervals for the regression coefficients. If 0 is not in the confidence interval for the slope, the slope is considered significantly different from 0.
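
A minimal sketch of the whole procedure, assuming a percentile-type interval, is given below. The helper names (`fit_line`, `percentile`, `bootstrap_slope_ci`), the seed, and the data are all hypothetical; the values are placeholders and not the Gettysburg records, so the printed interval will not match the table in this note.

```python
import random

def fit_line(pairs):
    """Least-squares (intercept, slope) for a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def percentile(sorted_vals, q):
    """Percentile by linear interpolation between order statistics (0 <= q <= 1)."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])

def bootstrap_slope_ci(pairs, n_boot=1000, seed=1):
    """2.5% and 97.5% percentiles of the slope over n_boot case resamples."""
    rng = random.Random(seed)
    n = len(pairs)
    slopes = []
    while len(slopes) < n_boot:
        sample = [pairs[rng.randrange(n)] for _ in range(n)]
        # Skip degenerate resamples where every x is identical (slope undefined).
        if len({x for x, _ in sample}) < 2:
            continue
        slopes.append(fit_line(sample)[1])
    slopes.sort()
    return percentile(slopes, 0.025), percentile(slopes, 0.975)

# Placeholder (wounded, killed) pairs for illustration only (NOT the Gettysburg data).
pairs = [(10, 3), (25, 5), (40, 8), (60, 14), (90, 21), (120, 25),
         (150, 30), (200, 46), (260, 55), (300, 66), (360, 80), (420, 95)]

lower, upper = bootstrap_slope_ci(pairs)
significant = not (lower <= 0 <= upper)
print(f"95% bootstrap CI for slope: ({lower:.4f}, {upper:.4f}); significant: {significant}")
```

Resampling whole (x, y) records, rather than residuals, avoids leaning on the constant-variance and normality assumptions that the diagnostic plots called into question.
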
Since 0 is not in the bootstrap interval for the slope (0.194 to 0.248), the slope is not equal to 0, which means it is statistically significant. Since both limits are positive, the number of soldiers wounded and the number of soldiers killed are positively associated. Both the parametric and bootstrap regression models show that the slope coefficient is different from 0. This means the numbers of wounded and killed soldiers are statistically correlated.
This could be because, during the 1800s, there were no very effective methods of treating wound infections, and medical technology was not as advanced. Moreover, soldiers of that period did not wear as much protective gear as soldiers in the modern era, which made wounds more fatal.
| | Estimate | Std. Error | t value | Pr(>\|t\|) | per.025 | per.975 |
|---|---|---|---|---|---|---|
| (Intercept) | 0.4239421 | 9.9297740 | 0.042694 | 0.9664737 | -8.8756441 | 10.0708532 |
| wound | 0.2166863 | 0.0075045 | 28.874202 | 0.0000000 | 0.1936989 | 0.2475672 |