Answer (Part 1): An MLR model was fit using the cars data (mpg regressed on cylinders, size, hp, and weight). The output from the summary() function is displayed below:
Call:
lm(formula = mpg ~ cylinders + size + hp + weight, data = cars)
Residuals:
Min 1Q Median 3Q Max
-3.0166 -1.6324 -0.5795 1.3890 5.2958
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 59.29187 4.60156 12.885 3.36e-10 ***
cylinders -1.52024 1.06901 -1.422 0.17309
size 0.06595 0.02756 2.393 0.02854 *
hp -0.06502 0.05948 -1.093 0.28952
weight -10.66719 3.02130 -3.531 0.00257 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.478 on 17 degrees of freedom
(16 observations deleted due to missingness)
Multiple R-squared: 0.8989, Adjusted R-squared: 0.8751
F-statistic: 37.77 on 4 and 17 DF, p-value: 3.008e-08
[1] 22
Answer (Part 2): The cars data set has 38 rows. However, 16 of those 38 rows contained at least one NA and were removed due to missingness, so only 22 rows were used in the analysis.
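The [1] 22 printed above is the number of complete cases. A minimal sketch of how that count can be produced, assuming the data frame is named cars as in the lm() call (the exact code was not shown):
#Count the rows with no missing values; lm() silently drops the other 16 rows
nrow(cars) #38 rows in total
sum(complete.cases(cars)) #22 rows with no NAs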
Question 2
Answer: The number of rows and the total number of missing values in the cars data set are found below using three R functions: md.pattern(), md.pairs(), and marginplot():
### md.pairs() table - mm (both variables missing)
We see 18 missing values in total in the first plot, produced by the md.pattern() function [the total number of missing values appears in the bottom-right corner]. The number of complete rows used in the analysis (the rows with no missing data), 22, appears in the top-left corner. Using the md.pairs() output we can also recover the 18 missing values by summing the diagonal of the mm table (4+3+5+6). The marginplot() helps us visualize pairwise missingness. For example, in the marginplot of cylinders vs. weight, we see on the left that one row is missing both cylinders and weight. Overall, there are 4 missing values for cylinders and 6 missing values for weight.
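A minimal sketch of the calls behind these summaries, assuming the data frame is named cars; md.pattern() and md.pairs() come from the mice package, while marginplot() comes from the VIM package:
##Inspect the missingness structure
library(mice)
library(VIM)
md.pattern(cars) #pattern plot/table; total missing values in the bottom-right corner
md.pairs(cars) #pairwise counts: the rr, rm, mr, and mm tables
##Margin plot for one pair of variables, e.g. cylinders vs. weight
marginplot(cars[, c("cylinders", "weight")])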
Question 3
Parts A and B: See code below.
Part C: Below, two strip plots are presented: pmm (m=100) and blr (m=100). We compare the two strip plots variable by variable. First, thankfully, mpg is not imputed under either method because there were no missing mpg values. Second, the pmm method imputes most of the cylinder values at 4, 6, and 8 (none at 5). Cylinders is discrete, so these values make sense. The Bayesian method gives nonsensical values scattered between 4 and 5 and between 6 and 8. Third, the pmm approach imputes values along the blue bands for size (size is discrete), with the densest region above 150 and at the top of the blue band. The Bayesian approach again imputes values off the bands; it is imputing non-discrete values. However, the blr points seem less scattered and variable than the pmm points. Fourth, the same phenomenon occurs for hp, another discrete variable: pmm imputes along the bands, while blr imputes both inside and outside of them. Finally, the same pattern (pmm imputing along the blue bands, blr imputing off the bands) appears for the weight variable, which is surprising because weight is “continuous”; I would have expected the pmm method to impute values off the bands there as well.
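A sketch of how these imputations and strip plots can be produced, assuming blr corresponds to mice's method = "norm" (Bayesian linear regression) and using an illustrative seed:
##Impute with predictive mean matching (pmm) and Bayesian linear regression (norm)
cars.impute100_pmm <- mice(cars, m = 100, method = "pmm", seed = 123, printFlag = FALSE)
cars.impute100_blr <- mice(cars, m = 100, method = "norm", seed = 123, printFlag = FALSE)
##Strip plots: observed values in blue, imputed values in red
stripplot(cars.impute100_pmm, pch = 20)
stripplot(cars.impute100_blr, pch = 20)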
An xyplot() of weight vs. size using the data from the pmm (m=5) method is also displayed. In this plot, we see a fairly linear relationship between weight and size. Across the different imputed data sets, there is a cluster of imputed values in the bottom-left corner of the plot that never goes away; it seems that the pmm method is imputing values for small, lightweight cars. There are subtle changes from one imputed data set to the next, but the pmm method regularly imputes a couple of medium size/weight values and one large size/weight value.
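A sketch of the corresponding xyplot() call, assuming the m=5 pmm object is named cars.impute5_pmm as in the later code:
##Scatter plot of weight vs. size, one panel per imputed data set
cars.impute5_pmm <- mice(cars, m = 5, method = "pmm", seed = 123, printFlag = FALSE)
xyplot(cars.impute5_pmm, weight ~ size | .imp, pch = 20)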
Part D: Below we also see the convergence plot for the pmm method (m=20). Something funky is going on with cylinders: the lines for the mean do not overlap, and the standard deviation lines are doing something wonky. This is likely due to the discrete nature of the variable and the small number of possible values. The convergence lines for the other three variables overlap better than the cylinders lines, but even these are still a bit “all over the place”, likely due to the small number of imputations. The standard deviation lines are definitely wonkier than the mean lines, especially for cylinders.
###Part D. Check convergence
#plot(cars.impute100_pmm)
#plot(cars.impute5_blr)
#plot(cars.impute20_blr)
#plot(cars.impute100_blr)
#plot(cars.impute5_pmm)
plot(cars.impute20_pmm)
Question 4
Answer: Below we see the pooled regression estimates using Rubin's rules and the pmm (m=5) imputed data set. The estimates and standard errors were relatively close within method type (pmm or blr) and across the number of imputed data sets (5, 20, or 100). Using the pmm method, the coefficients and standard errors for cylinders, size, hp, and weight were approximately -1.4 and 0.9 (not significant), 0.06 and 0.02 (significant), -0.06 and 0.04 (not significant), and -11 and 2.5 (significant). Interestingly, the blr method gave very similar results, but the estimates were all slightly shifted (the standard errors did not seem to shift). Using the blr method, the coefficient estimates and standard errors for cylinders, size, hp, and weight were approximately -1.5 and 0.9 (not significant), 0.07 and 0.02 (significant), -0.05 and 0.04 (not significant), and -12 and 2.4 (significant). Notice that the significance of the estimates was essentially the same regardless of imputation method; the estimates were just shifted slightly.
##Fit a linear regression model to each imputed data set
cars.imp.lm.pmm5 <- with(cars.impute5_pmm, lm(mpg ~ cylinders + size + hp + weight))
cars.imp.lm.pmm20 <- with(cars.impute20_pmm, lm(mpg ~ cylinders + size + hp + weight))
cars.imp.lm.pmm100 <- with(cars.impute100_pmm, lm(mpg ~ cylinders + size + hp + weight))
cars.imp.lm.blr5 <- with(cars.impute5_blr, lm(mpg ~ cylinders + size + hp + weight))
cars.imp.lm.blr20 <- with(cars.impute20_blr, lm(mpg ~ cylinders + size + hp + weight))
cars.imp.lm.blr100 <- with(cars.impute100_blr, lm(mpg ~ cylinders + size + hp + weight))
##Pool the regression estimates using Rubin's rules
#summary(pool(cars.imp.lm.pmm20))
#summary(pool(cars.imp.lm.pmm100))
#summary(pool(cars.imp.lm.blr5))
#summary(pool(cars.imp.lm.blr20))
#summary(pool(cars.imp.lm.blr100))
summary(pool(cars.imp.lm.pmm5))
Below we have four line graphs, one for each of the variables (other than the intercept). The number of imputations (m = 5, 20, and 100) is on the x-axis and the corresponding coefficient estimates are on the y-axis. The lines are colored by group (blr, pmm, and complete case); note that complete case is a horizontal reference line. First, looking at the cylinders plot: when m=5, the pmm method results in a larger coefficient than the complete case analysis, while the blr method results in a smaller coefficient. When m=20, the blr method aligns closely with the pmm method, and both coefficients are larger than the complete case coefficient. When m=100, the blr and pmm coefficients are still larger than the complete case coefficient, but the blr estimate is much closer to it.
Next, looking at the hp plot, we see that the pmm estimate is the same as the complete case estimate when m=5 and when m=20. When m=100, the pmm estimate grows and is much larger than the complete case estimate. The blr estimate is consistently larger than the complete case estimate across all three imputation sizes.
Third, looking at the size plot, we see that the blr and complete case methods result in the same coefficient estimates across all three imputation sizes. When m=5, the pmm method results in an estimate that is smaller than the other two, but when m=20, the pmm method results in the same estimates as the blr and complete case methods. When m=100, all three methods produce the same coefficient estimates.
Finally, looking at the weight plot, we see that the pmm method results in a slightly larger estimate than the complete case analysis when m=5. When m=20, it decreases to an estimate smaller than the complete case estimate, and it continues to decrease when m=100. The blr method results in estimates that are consistently smaller than the complete case estimate.
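A sketch of how such a comparison figure could be assembled, assuming the pooled model objects from the code above, ggplot2 for plotting, and the mice 3.x column names returned by summary(pool()); the complete-case reference comes from the lm() fit in Question 1:
##Gather pooled coefficient estimates for each method and number of imputations
library(ggplot2)
fits <- list(pmm_5 = cars.imp.lm.pmm5, pmm_20 = cars.imp.lm.pmm20,
             pmm_100 = cars.imp.lm.pmm100, blr_5 = cars.imp.lm.blr5,
             blr_20 = cars.imp.lm.blr20, blr_100 = cars.imp.lm.blr100)
est <- do.call(rbind, lapply(names(fits), function(nm) {
  s <- summary(pool(fits[[nm]]))
  data.frame(term = s$term, estimate = s$estimate,
             method = sub("_.*", "", nm), m = as.numeric(sub(".*_", "", nm)))
}))
est <- subset(est, term != "(Intercept)")
##Complete-case estimates give the horizontal reference lines
cc <- coef(lm(mpg ~ cylinders + size + hp + weight, data = cars))[-1]
cc <- data.frame(term = names(cc), estimate = cc)
ggplot(est, aes(m, estimate, colour = method)) +
  geom_line() + geom_point() +
  geom_hline(data = cc, aes(yintercept = estimate), linetype = 2) +
  facet_wrap(~ term, scales = "free_y")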
Question 6
Based on everything that I just presented, I would use the pmm method with 100 imputations. I recommend pmm for a few reasons. First, due to the discrete nature of the variables, specifically cylinders, size, and hp, I would not feel comfortable using the Bayesian approach: we saw in the strip plots that it imputed non-integer values, which are nonsensical. Second, although the Bayesian method seemed to converge better (more overlap in the lines; un-comment my code to see for yourself!), the data it imputed were nonsensical, so the increase in efficiency is moot. Finally, I chose 100 imputations rather than 5 or 20 because 5 was too few (not enough overlap), and running 100 imputations did not take much longer than running 20, as seen below. Yes, it is slower (5x the number of imputations), but it is still relatively fast and the convergence plots are much tighter.
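The relative run times can be checked directly with system.time(); a minimal sketch, assuming the same mice() settings used above:
##Compare run times for m = 20 vs. m = 100 pmm imputations
system.time(mice(cars, m = 20, method = "pmm", seed = 123, printFlag = FALSE))
system.time(mice(cars, m = 100, method = "pmm", seed = 123, printFlag = FALSE))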
Using the pmm method with m=100, the only significant variables (at the 0.05 significance level) are size (p-value of 0.0039) and weight (p-value of 0.0001); cylinders and hp have p-values of 0.109 and 0.206, respectively. The confidence intervals (using t critical values rather than the standard 1.96) for all four variables are displayed below. The intervals for cylinders and hp contain 0, which is again evidence of non-significance. The confidence interval for size is (0.02, 0.11) and the confidence interval for weight is (-16.4, -6.3); neither contains 0, which is evidence of significance.
Based on this output, we can say that, holding the other variables constant, mpg increases by 0.067 for every one-unit increase in size and decreases by 11.35 for every one-unit increase in weight. The size result seems a little counter-intuitive, because I would expect increased size to lead to decreased mpg, but there could be some sort of interaction here. A future direction for research!
#calculate confidence intervals using the t-cv because I am TAing STAT 2331 and we teach them this
data <- summary(pool(cars.imp.lm.pmm100))
data$t_critical <- qt(0.975, data$df)
data$CI_lower <- data$estimate - data$t_critical * data$std.error
data$CI_upper <- data$estimate + data$t_critical * data$std.error
print(data[, c("term", "estimate", "std.error", "CI_lower", "CI_upper")])