Bootstrap Distribution

I obtained this data from dataworld.com and copied and pasted the raw data file into a CSV file. The data set used in this note comes from a web-based online community of data scientists that allows users to find and publish data sets and learn about models in a data-science environment.

Source: www.CivilWarTalk.com. Description: Initial strength and casualties for the 9 divisions of the Army of the Potomac (Union Army) at the Battle of Gettysburg, July 1-3, 1863 (Meade vs. Lee). Divisions: 1-I Corps, 2-II Corps, 3-III Corps, 4-V Corps, 5-VI Corps, 6-XI Corps, 7-XII Corps, 8-Cavalry Corps, 9-Artillery Reserves. Casualties are classified as Killed, Wounded, or Missing/Captured.

Variables/Columns: Division, Officer, Total Soldiers, Killed, Wounded, Captured/Missing, Non-Casualty
Our categorical variables are Officer and Division. Officer is coded 0 = officer, 1 = not an officer. Division has 9 categories, numbered 1 through 9.

A pairwise plot shows that some of the relationships between the variables are stronger than others. Wounded and Killed appear to have the strongest positive correlation, while the other pairs of variables have weaker positive or negative correlations.

We can see that there is a positive linear relationship between the number of soldiers killed and the number wounded.

The bottom-left plot shows the observations clustered together, with gaps where some groups are missing. The bottom-right plot indicates that there are some outliers. The top-right plot shows points veering away from the line, which indicates that the normality assumption has been violated. The top-left plot shows no linear trend.

Since the model assumptions are violated, the p-value could be wrong. Because our original sample is not too small, we will perform a bootstrap sampling process. The slope parameter is interpreted as follows: for each additional wounded soldier, the expected number of soldiers killed increases by about 0.22.

Inferential statistics for the parametric linear regression model: number of soldiers killed regressed on number of soldiers wounded
	Estimate	Std. Error	t value	Pr(>|t|)
(Intercept)	0.4239421	9.9297740	0.042694	0.9664737
wound	0.2166863	0.0075045	28.874202	0.0000000
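
As a rough illustration of where the columns in a table like this come from, the sketch below fits a simple linear regression by least squares using only the Python standard library. The function name `simple_ols` and the small data lists are hypothetical and are not the Gettysburg data; the point is only that each t value is the estimate divided by its standard error (for the slope above, 0.2166863 / 0.0075045 ≈ 28.87).

```python
import math

def simple_ols(x, y):
    """Fit y = b0 + b1 * x by least squares.

    Returns, for each coefficient, a tuple (estimate, std. error, t value).
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    b1 = sxy / sxx              # slope estimate
    b0 = y_bar - b1 * x_bar     # intercept estimate

    # Residual variance with n - 2 degrees of freedom.
    rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    sigma2 = rss / (n - 2)

    se_b1 = math.sqrt(sigma2 / sxx)
    se_b0 = math.sqrt(sigma2 * (1.0 / n + x_bar ** 2 / sxx))

    return {
        "intercept": (b0, se_b0, b0 / se_b0),
        "slope": (b1, se_b1, b1 / se_b1),
    }

# Hypothetical illustrative values, NOT the Gettysburg data.
wound = [10, 50, 120, 300, 450, 800]
kill = [3, 12, 25, 70, 95, 180]

fit = simple_ols(wound, kill)
for name, (est, se, t) in fit.items():
    print(f"{name:>9}: estimate={est:.4f}  se={se:.4f}  t={t:.3f}")
```

The p-value column would additionally require the t distribution with n - 2 degrees of freedom, which is omitted here to keep the sketch free of external libraries.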

The idea is to resample the observations (rows) with replacement and then take the corresponding records, keeping each record's killed and wounded values together, to create a bootstrap sample of the same size as the original data.

If we repeat the bootstrap sampling and regression modeling 1000 times, we obtain 1000 sets of bootstrap regression coefficients. These bootstrap estimates of the intercept and slope can be used to construct bootstrap confidence intervals for the regression coefficients. If 0 is not in the confidence interval for the slope, then the slope is considered significantly different from 0.
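
The procedure described above can be sketched as follows, using only the Python standard library. This is a minimal illustration under stated assumptions, not the code used to produce the results in this note: the helper names (`ols_coefs`, `percentile`, `bootstrap_ci`), the random seed, and the small data lists are all hypothetical, and the interval is the simple percentile interval (2.5th and 97.5th percentiles of the bootstrap coefficients).

```python
import random

def ols_coefs(x, y):
    """Return (intercept, slope) of the least-squares line y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

def percentile(sorted_vals, p):
    """Percentile (p in [0, 1]) of a sorted list, by linear interpolation."""
    k = (len(sorted_vals) - 1) * p
    lo = int(k)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (k - lo)

def bootstrap_ci(x, y, n_boot=1000, seed=0):
    """Case-resampling bootstrap: 95% percentile intervals for both coefficients."""
    rng = random.Random(seed)
    n = len(x)
    intercepts, slopes = [], []
    for _ in range(n_boot):
        # Draw n row indices with replacement, then take the matching records.
        ids = [rng.randrange(n) for _ in range(n)]
        xb = [x[i] for i in ids]
        yb = [y[i] for i in ids]
        if len(set(xb)) < 2:
            continue  # slope is undefined if every resampled x is identical
        b0, b1 = ols_coefs(xb, yb)
        intercepts.append(b0)
        slopes.append(b1)
    intercepts.sort()
    slopes.sort()
    return {
        "intercept": (percentile(intercepts, 0.025), percentile(intercepts, 0.975)),
        "slope": (percentile(slopes, 0.025), percentile(slopes, 0.975)),
    }

# Hypothetical illustrative values, NOT the Gettysburg data.
wound = [10, 50, 120, 300, 450, 800, 640, 210]
kill = [3, 12, 25, 70, 95, 180, 150, 40]

ci = bootstrap_ci(wound, kill)
for name, (low, high) in ci.items():
    print(f"{name:>9}: per.025={low:.4f}  per.975={high:.4f}")
```

Resampling whole rows rather than residuals does not rely on the normality or constant-variance assumptions, which is why it suits a model whose diagnostic plots look questionable.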

Since 0 is not in the 95% bootstrap interval for the slope, the slope is statistically significantly different from 0. Since both limits are positive, the number of soldiers wounded and the number of soldiers killed are positively associated. Both the parametric and bootstrap regression models show that the slope coefficient is different from 0. This means the numbers of wounded and killed soldiers are statistically correlated. This could be because, during the 1800s, methods of treating wound infections were not very effective and medical technology was less advanced. Moreover, soldiers of that period did not wear as much protective gear as soldiers in the modern era, which made wounds more fatal.

	Estimate	Std. Error	t value	Pr(>|t|)	per.025	per.975
(Intercept)	0.4239421	9.9297740	0.042694	0.9664737	-8.8756441	10.0708532
wound	0.2166863	0.0075045	28.874202	0.0000000	0.1936989	0.2475672


Yuanqi Zhang

Homework