Raven Shan


Introduction

The objective of this exercise is to use the Amelia package to perform multiple imputation as a way of handling missing data. Compared with discarding incomplete cases, generating imputed values helps reduce bias and increase efficiency. For this assignment, I selected a cross-sectional time-series dataset with a considerable number of missing values, which could affect the reliability of my results if not properly addressed.

The dataset I will be using, “USSeatBelts” from the AER package, contains panel data on traffic fatalities for the 50 US states and the District of Columbia from 1983 to 1997 (15 years per state, 765 state-year observations). The outcome variable is fatalities, representing the number of fatalities per 1,000 traffic miles by state. While there are multiple predictors to choose from, this analysis examines the relationship between fatalities and two state-level predictors, seatbelt usage rate and median per capita income. In addition, I examine the effect of two driving-related laws on fatalities. The goal is to answer the following questions. First, what effect does a state’s seatbelt usage rate have on fatalities? Second, what is the relationship between median per capita income and the number of fatalities in each state: do wealthier states have fewer or more vehicular deaths? Third, does a maximum blood alcohol content of .08 substantially reduce fatalities, and if so, to what extent? Lastly, what impact does a speed limit of 70 miles per hour or higher have?

First, I will estimate the regression model using listwise deletion, then I will estimate the same model after using Amelia to generate multiple imputed values. Finally, the two sets of results will be compared.


Data and Variables

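The analysis uses seven variables. The variables state and year identify the panel units and time periods; fatalities is the outcome; seatbelt is the seatbelt usage rate; income is median per capita income; alcohol indicates whether a maximum .08 blood alcohol content law is in effect; and speed70 indicates a speed limit of 70 miles per hour or higher.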

##Viewing the total number of observations and variables 
dim(USSeatBelts2)
[1] 765   7

There are 765 observations.

summary(USSeatBelts2)
     state          year      fatalities        seatbelt      speed70  
 AK     : 15   Min.   : 1   Min.   : 8.327   Min.   :0.0600   no :711  
 AL     : 15   1st Qu.: 4   1st Qu.:17.341   1st Qu.:0.4200   yes: 54  
 AR     : 15   Median : 8   Median :21.199   Median :0.5500            
 AZ     : 15   Mean   : 8   Mean   :21.490   Mean   :0.5289            
 CA     : 15   3rd Qu.:12   3rd Qu.:24.774   3rd Qu.:0.6500            
 CO     : 15   Max.   :15   Max.   :45.470   Max.   :0.8700            
 (Other):675                                 NA's   :209               
     income      alcohol  
 Min.   : 8372   no :676  
 1st Qu.:14266   yes: 89  
 Median :17624            
 Mean   :17993            
 3rd Qu.:21080            
 Max.   :35863            
                          

The summary output above shows 209 NAs (missing values) for the variable ‘seatbelt’, about 27% of the 765 observations. The remaining variables have no missing observations.
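The missing-value count reported by summary() can also be tabulated directly. The following is a minimal base R sketch; the data frame `toy` is a hypothetical stand-in with made-up values, not USSeatBelts2 itself.

```r
# Hypothetical toy data standing in for USSeatBelts2 (values are made up).
toy <- data.frame(
  fatalities = c(21.2, 18.4, 25.1, 30.3, 17.9),
  seatbelt   = c(0.55, NA, 0.62, NA, 0.48),
  income     = c(17600, 21000, 14300, 15900, 23800)
)

# Number of missing values per variable.
na_count <- colSums(is.na(toy))

# Share of missing values per variable.
na_share <- colMeans(is.na(toy))

print(na_count)
print(na_share)

# The same share computed from the counts reported above: 209 of 765 rows.
reported_share <- 209 / 765
print(round(reported_share, 4))
```

Applied to the real data, the same two calls on USSeatBelts2 would be expected to show missingness only in seatbelt.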

missmap(USSeatBelts2, legend=FALSE)

The map above was generated with the missmap() function in Amelia, which offers a visual alternative for showing where missingness occurs in the data. It confirms that all of the missing values occur in the variable seatbelt, while the remaining variables are fully observed.


Results

Estimating Model with Listwise Deletion

Below, I estimate a regression model under listwise deletion, which Zelig performs automatically. I began with a simple model of the effect of seatbelt usage rate on the number of fatalities per 1,000 traffic miles. Although that relationship is not statistically significant, “seatbelt” is kept in the model for the sake of this exercise because it is the only variable in the dataset with missing observations. I then added further variables and tested various interactions, none of which were significant. The analysis (for both methods) is therefore based on the best-fitting model below.

This model estimates the effect of the variables seatbelt, alcohol, income and speed70 on the number of fatalities (per 1,000 traffic miles). Contrary to the common assumption, the results indicate that a one-unit increase in seatbelt usage rate increases the number of fatalities by 1.72; however, this estimate is not statistically significant. (Because seatbelt is recorded as a proportion, one unit spans the full range from 0% to 100% usage.) The remaining predictors are significant. A one-unit increase in median income is associated with a very small reduction in fatalities, .0008 (this value was taken from the summary output; the html table rounds it to -0.00). Fatalities are also predicted to drop by 1.98 in states with a maximum .08 blood alcohol content law, and to increase by 2.24 in states with a speed limit of 70 miles per hour or higher.

Statistical models

              Model 1    (Std. Error)
(Intercept)   35.23***   (0.69)
seatbelt       1.72      (1.16)
income        -0.00***   (0.00)
alcoholyes    -1.98***   (0.46)
speed70yes     2.24***   (0.53)

R2             0.51
Adj. R2        0.51
Num. obs.      556
RMSE           3.54

***p < 0.001, **p < 0.01, *p < 0.05
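The income coefficient displays as -0.00 because income is measured in dollars, so the effect of a single dollar is tiny. One common remedy is to express income in thousands of dollars before fitting, which multiplies the coefficient by 1,000 without changing the fit. The sketch below illustrates this with simulated data; every value, and the name `income_k`, is made up for illustration and is not taken from USSeatBelts2.

```r
# Simulated data for illustration only (not the USSeatBelts2 values).
set.seed(1)
n <- 200
income <- runif(n, min = 8000, max = 36000)
fatalities <- 35 - 0.0008 * income + rnorm(n, sd = 3)

# Model with income in dollars: the slope is very small.
fit_dollars <- lm(fatalities ~ income)

# Same model with income in thousands of dollars (hypothetical name income_k).
income_k <- income / 1000
fit_thousands <- lm(fatalities ~ income_k)

coef_dollars <- coef(fit_dollars)[["income"]]
coef_thousands <- coef(fit_thousands)[["income_k"]]

print(coef_dollars)
print(coef_thousands)

# The rescaled slope is 1000 times the original; R-squared is unchanged.
print(all.equal(coef_thousands, 1000 * coef_dollars))
```

With this rescaling, a coefficient of -0.0008 per dollar would be reported as -0.8 per thousand dollars, which survives rounding to two decimals.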
zlist3$setx()
zlist3$sim()
plot(zlist3)
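To make the listwise deletion step concrete, the sketch below shows the same idea in base R rather than Zelig: rows with any missing value are dropped before the model is fit. The data frame `toy` and its values are hypothetical, and lm() with na.action = na.omit is used here as a stand-in for what the text describes Zelig doing automatically.

```r
# Hypothetical toy data (made-up values); seatbelt has two missing entries.
toy <- data.frame(
  fatalities = c(21.2, 18.4, 25.1, 30.3, 17.9, 22.7, 19.5, 27.0),
  seatbelt   = c(0.55, NA, 0.62, NA, 0.48, 0.70, 0.66, 0.41),
  income     = c(17600, 21000, 14300, 15900, 23800, 18200, 20500, 13900)
)

# Listwise deletion done by hand: keep only rows with no missing values.
toy_complete <- toy[complete.cases(toy), ]

# lm() drops incomplete rows itself when na.action = na.omit.
fit_auto <- lm(fatalities ~ seatbelt + income, data = toy, na.action = na.omit)

# Fitting on the hand-filtered data gives the same coefficients.
fit_manual <- lm(fatalities ~ seatbelt + income, data = toy_complete)

print(nrow(toy))       # rows before deletion
print(nobs(fit_auto))  # rows actually used in the fit
print(all.equal(coef(fit_auto), coef(fit_manual)))
```

In the analysis above, the analogous drop is from 765 rows to the 556 reported as "Num. obs." in the regression table.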


Estimating the Model with Multiple Imputation

Creating Multiple Imputations with Amelia

##Imputation output excluded due to length 

a.out <- amelia(USSeatBelts2, m=10, ts="year", cs = "state", logs = "income", 
                noms = c("speed70", "alcohol"),
                polytime = 2, intercs=TRUE)
summary(a.out)

Amelia output with 10 imputed datasets.
Return code:  1 
Message:  Normal EM convergence. 

Chain Lengths:
--------------
Imputation 1:  1934
Imputation 2:  2
Imputation 3:  2
Imputation 4:  2
Imputation 5:  2
Imputation 6:  1431
Imputation 7:  5029
Imputation 8:  2
Imputation 9:  4625
Imputation 10:  2

Rows after Listwise Deletion:  556 
Rows after Imputation:  765 
Patterns of missingness in the data:  2 

Fraction Missing for original variables: 
-----------------------------------------

           Fraction Missing
state             0.0000000
year              0.0000000
fatalities        0.0000000
seatbelt          0.2732026
speed70           0.0000000
income            0.0000000
alcohol           0.0000000

MI model Results

With multiple imputation, the results below indicate that a one-unit increase in seatbelt usage rate decreases fatalities by .95; however, the variable is not significant. Note that the sign is reversed relative to the listwise deletion model, in which an increase in seatbelt use increased the number of fatalities. A one-unit increase in median income reduces the number of fatalities by about .0009 (-0.000883 in the output). Fatalities are also predicted to drop by 1.92 in states with a maximum .08 blood alcohol content law in effect, and to increase by 2.34 in states with a speed limit of 70 miles per hour or higher. I will later compare these results with those of the listwise deletion model.

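# Assumption: z.out is a Zelig 5 least squares object created before the call below.
z.out <- zls$new()
z.out$zelig(fatalities ~ seatbelt + income + alcohol + speed70, data = a.out)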
summary(z.out)
Model: Combined Imputations 

             Estimate Std.Error z value Pr(>|z|)
(Intercept) 37.863550  0.646498   58.57  < 2e-16
seatbelt    -0.950882  1.532796   -0.62  0.53502
income      -0.000883  0.000056  -15.78  < 2e-16
alcoholyes  -1.915860  0.515520   -3.72  0.00020
speed70yes   2.343042  0.642672    3.65  0.00027

For results from individual imputed datasets, use summary(x, subset = i:j)
Next step: Use 'setx' method
z.out$setx()
z.out$sim()
plot(z.out)
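The "Combined Imputations" table pools one estimate per imputed dataset into a single coefficient and standard error. The standard way to do this is Rubin's rules; whether Zelig applies exactly this formula internally is an assumption here, not something the output confirms. The sketch below implements the rules in base R for a single coefficient. The function name `combine_rubin` and the per-imputation numbers are hypothetical.

```r
# Pool one coefficient across m imputed datasets using Rubin's rules.
# combine_rubin is a hypothetical helper name, not part of Amelia or Zelig.
combine_rubin <- function(estimates, std_errors) {
  m <- length(estimates)
  pooled_estimate <- mean(estimates)        # average of the m estimates
  within_var <- mean(std_errors^2)          # average sampling variance
  between_var <- var(estimates)             # variance across imputations
  total_var <- within_var + (1 + 1 / m) * between_var
  c(estimate = pooled_estimate, std_error = sqrt(total_var))
}

# Made-up per-imputation results for one coefficient (illustration only).
est <- c(-1.10, -0.80, -0.95, -1.20, -0.70)
se  <- c(1.20, 1.25, 1.18, 1.22, 1.21)

pooled <- combine_rubin(est, se)
print(pooled)
```

The pooled standard error is larger than the average per-imputation standard error because it adds the between-imputation variance, which reflects uncertainty about the imputed values. Amelia also ships a function, mi.meld(), for combining estimates and standard errors across imputations.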


Comparing Results By Missing Data Methods: Listwise Deletion vs MI Method

The results from the two methods differ most in the variable ‘seatbelt’ (seatbelt usage rate per state), the only variable with missing values. Although the relationship is not statistically significant under either method, the listwise deletion model estimates that higher seatbelt usage increases fatalities (1.72), whereas the multiple imputation model estimates that it decreases fatalities (-0.95), which is the more plausible direction. This reversal in sign illustrates why the MI method is preferred. For the remaining independent variables, the differences in coefficients are small (for example, -1.98 versus -1.92 for alcohol and 2.24 versus 2.34 for speed70), as these variables had no missing observations to begin with. Ultimately, the multiple imputation method produces more reliable results because it uses all 765 observations and combines estimates across multiple imputed datasets, taking the distribution of the imputed values into account. By comparison, listwise deletion discards the 209 incomplete cases, which could contain important information, making its findings less convincing.
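One way to put the two seatbelt estimates side by side is to compute approximate 95% confidence intervals from the coefficients and standard errors reported above. The sketch below does this in base R under a normal approximation, which is an assumption; the helper name `wald_ci` is hypothetical.

```r
# Approximate confidence interval from an estimate and its standard error,
# assuming a normal sampling distribution (hypothetical helper name).
wald_ci <- function(estimate, std_error, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)
  c(lower = estimate - z * std_error, upper = estimate + z * std_error)
}

# Seatbelt coefficient and standard error as reported for each method.
ci_listwise <- wald_ci(1.72, 1.16)          # listwise deletion table
ci_mi <- wald_ci(-0.950882, 1.532796)       # combined imputations output

print(round(ci_listwise, 2))
print(round(ci_mi, 2))
```

Both intervals include zero, consistent with neither estimate being statistically significant, so the sign reversal is best read as suggestive and not as a firmly established difference.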

