0.1 Intro

Inferential Statistics of Final Model
	Estimate	Std. Error	t value	Pr(>|t|)
(Intercept)	5.5069838	9.6070759	0.5732216	0.5749883
Missingrate	-260.7794011	138.3239661	-1.8852800	0.0789136
Wounded	0.2248974	0.0082179	27.3668544	0.0000000

0.2 Explicit Expression of our model

Killed = 5.507 − 260.779(Missingrate) + 0.225(Wounded)
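To show how the fitted equation is used, here is a minimal Python sketch that evaluates it for one observation. The coefficients are the estimates from the table above, rounded to three decimals; the function name `predict_killed` and the input values are hypothetical and chosen only for illustration.

```python
# Coefficients from the final-model table, rounded to three decimals.
INTERCEPT = 5.507
B_MISSINGRATE = -260.779
B_WOUNDED = 0.225


def predict_killed(missingrate, wounded):
    """Return the fitted value of Killed for one observation."""
    return INTERCEPT + B_MISSINGRATE * missingrate + B_WOUNDED * wounded


# Hypothetical inputs, not taken from the data set used in this report.
example = predict_killed(missingrate=0.02, wounded=400)
print(round(example, 3))
```

Holding Missingrate fixed, each additional wounded person raises the fitted value of Killed by about 0.225.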

The bottom-left plot shows the points clustered together, with gaps where groups of points are missing. The bottom-right plot indicates that there are some outliers. The top-right plot shows points veering away from the line at both ends, which indicates that the normality assumption has been violated. The top-left plot shows a weak linear trend.

We can see that the explanatory variables are also not normally distributed: both are skewed to the right.

1 Transformations

Because the diagnostic plots showed non-constant variance and violations of the normality assumption, we perform various transformations of the response variable.
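As a small illustration of what transforming the response means, the Python sketch below applies a square root and a log transformation to a response vector and maps values back to the original scale. The counts in `killed` are made up for the example, and the function names are hypothetical; this is not the code used for the models in this report.

```python
import math


def sqrt_response(y):
    """Square-root transform of the response (requires y >= 0)."""
    return [math.sqrt(v) for v in y]


def log_response(y):
    """Natural-log transform of the response (requires y > 0)."""
    return [math.log(v) for v in y]


def from_sqrt(value):
    """Map a value on the square-root scale back to the original scale."""
    return value ** 2


def from_log(value):
    """Map a value on the log scale back to the original scale."""
    return math.exp(value)


# Made-up counts, only to show the effect of each transformation.
killed = [12, 40, 95, 230, 610]
print([round(v, 3) for v in sqrt_response(killed)])
print([round(v, 3) for v in log_response(killed)])
```

Both transformations compress large values more than small ones, which is why they are common first attempts when the variance grows with the fitted values.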

1.1 Square root transformation

Square-root-transformed model
	Estimate	Std. Error	t value	Pr(>|t|)
(Intercept)	4.4608743	0.6358428	7.0156876	0.0000042
Missingrate	-3.5363246	9.1549495	-0.3862746	0.7047172
Wounded	0.0076354	0.0005439	14.0381886	0.0000000

There are some improvements in the residual diagnostic plots, but problems remain. First, the weak curved pattern still remains in the residual plot. Second, the points on the Q-Q plot are closer to the line but still veer off a bit. There is also a pattern in the points of the residual plot, which shows that the variance is not constant; under constant variance the points should look like a patternless cloud. Therefore, unfortunately, the violation of the normality assumption is still an issue.

1.2 Log Transformation

The residual diagnostic plots below are similar to those of the previous model, which used a square root transformation: both the Q-Q plots and the residual plots of the two models look alike. There is again a pattern in the points of the residual plot, which shows that the variance is still not constant. Therefore, the assumption of normal residuals is not satisfied for either model.

log-transformed model
	Estimate	Std. Error	t value	Pr(>|t|)
(Intercept)	5.5069838	9.6070759	0.5732216	0.5749883
Missingrate	-260.7794011	138.3239661	-1.8852800	0.0789136
Wounded	0.2248974	0.0082179	27.3668544	0.0000000

1.3 Bootstrap Confidence intervals

In this section, we bootstrap cases, that is, we resample whole observations with replacement, to find confidence intervals for the coefficients of our final regression model.

We made an R function to make histograms of the bootstrap coefficients. This function will also be used to make histograms for the residual bootstrap estimates of the regression coefficients.

These histograms of the bootstrap estimates of the regression coefficients approximate the sampling distributions of the corresponding estimates from our final model.

The histograms above show that the red and blue curves are close in all of them. However, the distribution for the variable Wounded is skewed to the left a bit. The significance test results and the corresponding confidence intervals should be consistent. Afterwards, we calculate the 95% bootstrap confidence intervals of each regression coefficient and combine them with the output of the final model.
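The case bootstrap calculation can be sketched as follows. This is a minimal Python illustration with standard-library tools only, not the R code used for this report: the helper names (`solve`, `ols`, `percentile`, `case_bootstrap_ci`) are hypothetical, the data are randomly generated toy values, and percentile intervals are an assumption about the type of interval reported.

```python
import random


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ols(rows):
    """Least-squares fit of y = b0 + b1*x1 + b2*x2; each row is (x1, x2, y)."""
    design = [(1.0, x1, x2) for x1, x2, _ in rows]
    y = [row[2] for row in rows]
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(3)] for i in range(3)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(3)]
    return solve(xtx, xty)


def percentile(sorted_vals, q):
    """Percentile by linear interpolation; q is between 0 and 1."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])


def case_bootstrap_ci(rows, n_boot=1000, level=0.95, seed=1):
    """Percentile bootstrap intervals for (b0, b1, b2), resampling whole rows."""
    rng = random.Random(seed)
    draws = []
    while len(draws) < n_boot:
        sample = [rng.choice(rows) for _ in rows]  # resample cases with replacement
        try:
            draws.append(ols(sample))
        except ZeroDivisionError:
            continue  # singular design in this resample; draw again
    tail = (1.0 - level) / 2.0
    cis = []
    for k in range(3):
        vals = sorted(d[k] for d in draws)
        cis.append((percentile(vals, tail), percentile(vals, 1.0 - tail)))
    return cis


# Made-up toy data, NOT the data from this report.
toy_rng = random.Random(0)
toy_rows = []
for _ in range(30):
    x1 = toy_rng.uniform(0.0, 0.1)
    x2 = toy_rng.uniform(50.0, 1000.0)
    y = 5.0 - 200.0 * x1 + 0.2 * x2 + toy_rng.gauss(0.0, 10.0)
    toy_rows.append((x1, x2, y))

for name, (lo, hi) in zip(("b0", "b1", "b2"), case_bootstrap_ci(toy_rows)):
    print(f"{name}: [{lo:.4f}, {hi:.4f}]")
```

Because whole rows are resampled, each bootstrap sample keeps every response paired with its own explanatory values, so the method does not rely on the residuals having constant variance.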

Regression Coefficient Matrix
	Estimate	Std. Error	t value	Pr(>|t|)	btc.ci.95
(Intercept)	5.5070	9.6071	0.5732	0.5750	[ -3.1679 , 14.5639 ]
Missingrate	-260.7794	138.3240	-1.8853	0.0789	[ -620.8353 , 103.6062 ]
Wounded	0.2249	0.0082	27.3669	0.0000	[ 0.1947 , 0.2507 ]

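The case bootstrap confidence intervals are consistent with the p-values: the intervals for the intercept and Missingrate contain 0 (p > .05), while the interval for Wounded excludes 0 (p < .05).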

1.4 Bootstrap residuals

Here we use the residual bootstrap method to estimate confidence intervals for the regression coefficients.

We can see that this distribution has outliers and is skewed to the left.

Now, we will make histograms of the bootstrap residuals.

After resampling the residuals, the distribution looks closer to a normal distribution. As the number of trials increased, the distribution became smoother and narrower. Next, we calculate the 95% residual bootstrap confidence intervals.
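The residual bootstrap calculation can be sketched as follows. This is a Python illustration rather than the R code used for this report, and it is simplified to a single explanatory variable so that the fit has a closed form. The function names and the toy data are hypothetical, and the interval is taken from order statistics of the bootstrap slopes, which is an assumption about the interval type.

```python
import random


def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def residual_bootstrap_slope_ci(x, y, n_boot=1000, level=0.95, seed=1):
    """Return the fitted slope and a percentile-type interval for it."""
    rng = random.Random(seed)
    b0, b1 = fit_line(x, y)
    fitted = [b0 + b1 * xi for xi in x]
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    slopes = []
    for _ in range(n_boot):
        # Keep x fixed; rebuild the response from fitted values plus
        # residuals drawn with replacement, then refit the model.
        y_star = [fi + rng.choice(resid) for fi in fitted]
        slopes.append(fit_line(x, y_star)[1])
    slopes.sort()
    tail = (1.0 - level) / 2.0
    lower = slopes[int(tail * (n_boot - 1))]
    upper = slopes[int(round((1.0 - tail) * (n_boot - 1)))]
    return b1, (lower, upper)


# Made-up toy data, NOT the data from this report.
toy_rng = random.Random(0)
toy_x = [toy_rng.uniform(50.0, 1000.0) for _ in range(30)]
toy_y = [5.0 + 0.2 * xi + toy_rng.gauss(0.0, 10.0) for xi in toy_x]

slope, (lower, upper) = residual_bootstrap_slope_ci(toy_x, toy_y)
print(f"slope = {slope:.4f}, 95% interval = [{lower:.4f}, {upper:.4f}]")
```

Unlike the case bootstrap, this scheme treats the explanatory values as fixed and assumes the residuals are exchangeable, so it leans on the constant-variance assumption.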

Regression Coefficient Matrix with 95% Residual Bootstrap CI
	Estimate	Std. Error	t value	Pr(>|t|)	btr.ci.95
(Intercept)	5.5070	9.6071	0.5732	0.5750	[ -11.1114 , 22.3416 ]
Missingrate	-260.7794	138.3240	-1.8853	0.0789	[ -502.2863 , -21.5395 ]
Wounded	0.2249	0.0082	27.3669	0.0000	[ 0.2095 , 0.2403 ]

The residual bootstrap confidence intervals do not yield the same result as the p-value for the variable Missingrate: p > .05, yet 0 is outside the confidence interval [ -502.2863 , -21.5395 ]. However, Wounded's p-value and confidence interval do match, because 0 is outside the interval and p < .05, so Wounded is statistically significant under both. The intercept is also consistent, since p > .05 and 0 is inside its interval. The sample size is not very big, so the sampling distributions of the estimated coefficients are not well approximated by normal distributions.

1.5 Combining all the Inferential Statistics

Final Combined Inferential Statistics: p-values and Bootstrap CIs
	Estimate	Std. Error	Pr(>|t|)	btc.ci.95	btr.ci.95
(Intercept)	5.5070	9.6071	0.5750	[ -3.1679 , 14.5639 ]	[ -11.1114 , 22.3416 ]
Missingrate	-260.7794	138.3240	0.0789	[ -620.8353 , 103.6062 ]	[ -502.2863 , -21.5395 ]
Wounded	0.2249	0.0082	0.0000	[ 0.1947 , 0.2507 ]	[ 0.2095 , 0.2403 ]

The three methods do not all give the same results in terms of the significance of the individual explanatory variables: Missingrate is significant only under the residual bootstrap interval. This may be because our final model seriously violates the model assumptions.
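The comparison can be made explicit with a short Python sketch that applies the usual 5% rule to the values reported in the table above: a coefficient counts as significant if p < .05, or if its 95% interval excludes 0. The names `reported`, `excludes_zero`, and `significance` are hypothetical.

```python
ALPHA = 0.05

# p-values and 95% intervals copied from the combined table above.
reported = {
    "(Intercept)": {"p": 0.5750, "btc": (-3.1679, 14.5639), "btr": (-11.1114, 22.3416)},
    "Missingrate": {"p": 0.0789, "btc": (-620.8353, 103.6062), "btr": (-502.2863, -21.5395)},
    "Wounded": {"p": 0.0000, "btc": (0.1947, 0.2507), "btr": (0.2095, 0.2403)},
}


def excludes_zero(ci):
    """True when the interval lies entirely above or entirely below 0."""
    lower, upper = ci
    return lower > 0 or upper < 0


def significance(row):
    """Significance verdict of each of the three methods for one coefficient."""
    return {
        "p_value": row["p"] < ALPHA,
        "case_bootstrap": excludes_zero(row["btc"]),
        "residual_bootstrap": excludes_zero(row["btr"]),
    }


verdicts = {name: significance(row) for name, row in reported.items()}
for name, verdict in verdicts.items():
    agree = len(set(verdict.values())) == 1
    print(name, verdict, "agree" if agree else "disagree")
```

With the reported values, the intercept is non-significant and Wounded is significant under all three methods; Missingrate is the only coefficient on which the verdicts differ.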

Width of the two bootstrap confidence intervals
	btc.wd	btr.wd
(Intercept)	17.7317276	33.4530050
Missingrate	724.4415366	480.7468021
Wounded	0.0560342	0.0307326

We can also see that the widths differ between the two methods: the residual bootstrap interval is wider for the intercept (33.45 vs. 17.73) but narrower for Missingrate (480.75 vs. 724.44) and Wounded (0.031 vs. 0.056). However, the two methods agree for the intercept: both bootstrap confidence intervals contain 0, so the intercept is not significantly different from 0.

Since the final model violates its assumptions, and the various transformations failed to fix the normality plots, the bootstrap confidence intervals of the regression coefficients could be more reliable than the parametric p-values, because the bootstrap method gives us a nonparametric inference about the population. Moreover, the fact that the histograms of both explanatory variables are skewed to the right is another reason why the bootstrap method is more reliable.

Bootstrap Distribution

Yuanqi Zhang

Homework