Loading the States.dta dataset, which my Professor provided me. This dataset is in Stata format. I was not provided a codebook, but when I load the data into R, it provides me with the definitions of each variable. The data are taken from 1990 and provide us with various characteristics, including:
First, I need to set my working directory to the folder that contains the file so I can import it into R. Then I will use the print and names functions to confirm that the data loaded correctly and to show the variables for anyone viewing it.
setwd("~/Desktop/Fall 19/POGO 611/Assignments/HW2")
library(haven)
states <- read_dta("states.dta")
print(states, n = 51)
## # A tibble: 51 x 21
## state region pop area density metro waste energy miles toxic
## <chr> <dbl+lb> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Alab… 3 [Sou… 4.04e6 52423 77.1 67.4 1.11 393 10.5 27.9
## 2 Alas… 1 [Wes… 5.50e5 570374 0.960 41.1 0.910 991 7.20 37.4
## 3 Ariz… 1 [Wes… 3.66e6 113642 32.2 79 0.790 258 9.70 19.6
## 4 Arka… 3 [Sou… 2.35e6 52075 45.2 40.1 0.850 330 8.90 24.6
## 5 Cali… 1 [Wes… 2.98e7 155973 191. 95.7 1.51 246 8.70 3.26
## 6 Colo… 1 [Wes… 3.29e6 103730 31.8 81.5 0.730 273 8.30 2.25
## 7 Conn… 2 [N. … 3.29e6 4845 678. 92.4 0.880 234 8 6.49
## 8 Dela… 3 [Sou… 6.66e5 1955 341. 66.3 1.13 349 9.80 9.88
## 9 Dist… NA NA NA NA NA NA NA NA NA
## 10 Flor… 3 [Sou… 1.29e7 53997 240. 90.8 1.45 237 8.5 8.24
## 11 Geor… 3 [Sou… 6.48e6 57919 112. 65 0.680 322 11.2 12.0
## 12 Hawa… 1 [Wes… 1.11e6 6423 173. 75.5 1.17 243 7.30 0.770
## 13 Idaho 1 [Wes… 1.01e6 82751 12.2 20.4 0.840 354 9.80 11.5
## 14 Illi… 4 [Mid… 1.14e7 55593 206. 82.7 1.28 308 7.30 10.4
## 15 Indi… 4 [Mid… 5.54e6 35870 155. 68.5 1.03 446 9.70 30.5
## 16 Iowa 4 [Mid… 2.78e6 55875 49.7 44 0.830 333 8.30 14.6
## 17 Kans… 4 [Mid… 2.48e6 81823 30.3 53.8 0.970 423 9.20 36.3
## 18 Kent… 3 [Sou… 3.68e6 39732 92.8 46.5 0.950 376 9.10 11.5
## 19 Loui… 3 [Sou… 4.22e6 43566 96.9 69.5 0.830 783 8.90 101.
## 20 Maine 2 [N. … 1.23e6 30865 39.8 35.9 0.770 309 9.70 12.1
## 21 Mary… 3 [Sou… 4.78e6 9775 489. 92.8 1.07 268 8.5 3.36
## 22 Mass… 2 [N. … 6.02e6 7838 768. 90.4 1.13 229 7.70 3.46
## 23 Mich… 4 [Mid… 9.30e6 56809 164. 80.1 1.26 298 8.70 12.9
## 24 Minn… 4 [Mid… 4.38e6 79617 55.0 67.7 1.01 305 8.90 11.9
## 25 Miss… 3 [Sou… 2.57e6 46914 54.8 30.1 0.540 358 9.5 40.3
## 26 Miss… 4 [Mid… 5.12e6 68898 74.3 66.2 1.47 294 9.90 13.4
## 27 Mont… 1 [Wes… 7.99e5 145556 5.49 23.9 0.75 415 10.4 53.4
## 28 Nebr… 4 [Mid… 1.58e6 76878 20.5 48.5 0.820 334 8.80 11.0
## 29 Neva… 1 [Wes… 1.20e6 109806 10.9 82.9 0.830 337 8.5 2.72
## 30 New … 2 [N. … 1.11e6 8969 124. 56.1 0.990 224 8.90 7.48
## 31 New … 2 [N. … 7.73e6 7419 1042. 100 0.920 296 7.60 3.36
## 32 New … 1 [Wes… 1.52e6 121365 12.5 48.4 0.990 347 10.7 21.5
## 33 New … 2 [N. … 1.80e7 47224 381. 91.1 1.22 200 5.90 3.36
## 34 Nort… 3 [Sou… 6.63e6 48718 136. 56.7 0.910 300 9.5 18.7
## 35 Nort… 4 [Mid… 6.39e5 68994 9.26 40.3 0.630 467 9.20 3.30
## 36 Ohio 4 [Mid… 1.08e7 40953 265. 79 1.45 348 8 15.5
## 37 Okla… 3 [Sou… 3.15e6 68679 45.8 59.4 0.950 396 10.5 10.5
## 38 Oreg… 1 [Wes… 2.84e6 96003 29.6 68.5 1.16 318 9.40 7.99
## 39 Penn… 2 [N. … 1.19e7 44820 265. 84.8 0.800 300 7.20 7.52
## 40 Rhod… 2 [N. … 1.00e6 1045 960. 92.5 1.20 218 7 5.25
## 41 Sout… 3 [Sou… 3.49e6 30111 116. 60.6 1.15 330 9.90 19.3
## 42 Sout… 4 [Mid… 6.96e5 75898 9.17 29.5 1.15 284 10 4.22
## 43 Tenn… 3 [Sou… 4.88e6 41220 118. 67.7 1.03 350 9.60 42.5
## 44 Texas 3 [Sou… 1.70e7 261914 64.9 81.6 1.06 569 9.60 24.6
## 45 Utah 1 [Wes… 1.72e6 82168 21.0 77.5 0.700 316 8.5 72.8
## 46 Verm… 2 [N. … 5.63e5 9249 60.9 23.4 0.690 232 10.4 1.81
## 47 Virg… 3 [Sou… 6.19e6 39598 156. 72.5 1.45 306 9.70 12.9
## 48 Wash… 1 [Wes… 4.87e6 66582 73.1 81.7 1.05 389 9.20 8.51
## 49 West… 3 [Sou… 1.79e6 24087 74.4 36.4 0.950 415 8.60 21.3
## 50 Wisc… 4 [Mid… 4.89e6 54314 90.1 67.4 0.700 288 9.10 9.20
## 51 Wyom… 1 [Wes… 4.54e5 97105 4.68 29.6 0.700 786 12.8 25.5
## # … with 11 more variables: green <dbl>, house <dbl>, senate <dbl>,
## # csat <dbl>, vsat <dbl>, msat <dbl>, percent <dbl>, expense <dbl>,
## # income <dbl>, high <dbl>, college <dbl>
names(states)
## [1] "state" "region" "pop" "area" "density" "metro" "waste"
## [8] "energy" "miles" "toxic" "green" "house" "senate" "csat"
## [15] "vsat" "msat" "percent" "expense" "income" "high" "college"
The first task in Question 1 is to run a regression model with Green (per capita greenhouse gas emissions in tons) as the Dependent Variable and Miles, House, Income, Density, Senate, High, College, and Pop as the Independent Variables. But first, I want to look at the greenhouse gas emissions by state to see if anything is missing or if there are any outliers.
states[,c("state", "green")]
## # A tibble: 51 x 2
## state green
## <chr> <dbl>
## 1 Alabama 29.2
## 2 Alaska NA
## 3 Arizona 18.4
## 4 Arkansas 26.0
## 5 California 15.6
## 6 Colorado 21.9
## 7 Connecticut 14.6
## 8 Delaware 21.5
## 9 District of Columbia NA
## 10 Florida 13.6
## # … with 41 more rows
tail(states[,c("state", "green")])
## # A tibble: 6 x 2
## state green
## <chr> <dbl>
## 1 Vermont 15.2
## 2 Virginia 18.7
## 3 Washington 16.5
## 4 West Virginia 51.1
## 5 Wisconsin 20.6
## 6 Wyoming 114.
Alaska, DC, and Hawaii have NA values for green, and Wyoming is quite large at roughly 114 tons per capita.
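As a quick check on exactly which states are missing green, a simple base-R sketch (output not shown) would be:
# List the states whose green value is missing
states$state[is.na(states$green)]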
Now I will run the model.
attach(states)
x<- lm(green ~ miles + house + income + density + senate + high + college + pop)
summary(x)
##
## Call:
## lm(formula = green ~ miles + house + income + density + senate +
## high + college + pop)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.624 -6.110 -1.033 2.477 38.594
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.458e+01 3.828e+01 -1.948 0.058579 .
## miles 7.887e+00 2.086e+00 3.781 0.000524 ***
## house -5.347e-01 1.291e-01 -4.143 0.000178 ***
## income -7.296e-01 6.977e-01 -1.046 0.302167
## density 3.715e-02 1.438e-02 2.584 0.013628 *
## senate 1.425e-01 1.110e-01 1.284 0.206584
## high 1.157e+00 5.378e-01 2.151 0.037765 *
## college -1.342e+00 9.244e-01 -1.452 0.154573
## pop 2.031e-07 4.027e-07 0.504 0.616886
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.11 on 39 degrees of freedom
## (3 observations deleted due to missingness)
## Multiple R-squared: 0.5527, Adjusted R-squared: 0.4609
## F-statistic: 6.023 on 8 and 39 DF, p-value: 4.846e-05
The Adjusted R-squared statistic - 0.4609 - indicates that roughly 46 percent of the variation in Green can be attributed to the Independent Variables. In terms of the coefficients, 4 are significant (Miles, House, Density, and High) and 4 are not significant (Income, Senate, College, and Pop).
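For reference, both the adjusted R-squared and the coefficient p-values can be pulled straight out of the summary object; a minimal sketch:
# Extract the adjusted R-squared and the coefficient p-values from the fitted model
summary(x)$adj.r.squared
summary(x)$coefficients[, "Pr(>|t|)"]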
So, with a decently high R-squared statistic and only half of the IVs being significant, multicollinearity may be present.
First, I will check the pairwise correlations. To do this, I need to load the Hmisc package (Harrell Miscellaneous), created by Frank Harrell of Vanderbilt University; I use it because it contains the rcorr function. I will also create a new data frame called states1, which contains all of the variables I just used in the regression model saved in x.
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
states1 <- states[,c("green", "miles", "house", "income", "density", "senate", "high", "college", "pop")]
Running a Pearson correlation matrix on these variables.
rcorr(as.matrix(states1), type = "pearson")
## green miles house income density senate high college pop
## green 1.00 0.50 -0.56 -0.37 -0.31 -0.37 -0.01 -0.30 -0.24
## miles 0.50 1.00 -0.27 -0.50 -0.51 -0.44 -0.12 -0.30 -0.31
## house -0.56 -0.27 1.00 0.41 0.53 0.65 0.07 0.37 0.06
## income -0.37 -0.50 0.41 1.00 0.59 0.54 0.51 0.72 0.21
## density -0.31 -0.51 0.53 0.59 1.00 0.61 -0.05 0.46 0.20
## senate -0.37 -0.44 0.65 0.54 0.61 1.00 0.22 0.47 0.11
## high -0.01 -0.12 0.07 0.51 -0.05 0.22 1.00 0.53 -0.16
## college -0.30 -0.30 0.37 0.72 0.46 0.47 0.53 1.00 0.13
## pop -0.24 -0.31 0.06 0.21 0.20 0.11 -0.16 0.13 1.00
##
## n
## green miles house income density senate high college pop
## green 48 48 48 48 48 48 48 48 48
## miles 48 50 50 50 50 50 50 50 50
## house 48 50 50 50 50 50 50 50 50
## income 48 50 50 51 50 50 51 51 50
## density 48 50 50 50 50 50 50 50 50
## senate 48 50 50 50 50 50 50 50 50
## high 48 50 50 51 50 50 51 51 50
## college 48 50 50 51 50 50 51 51 50
## pop 48 50 50 50 50 50 50 50 50
##
## P
## green miles house income density senate high college pop
## green 0.0003 0.0000 0.0106 0.0340 0.0101 0.9596 0.0387 0.1059
## miles 0.0003 0.0602 0.0002 0.0002 0.0013 0.3918 0.0372 0.0287
## house 0.0000 0.0602 0.0033 0.0000 0.0000 0.6070 0.0082 0.6831
## income 0.0106 0.0002 0.0033 0.0000 0.0000 0.0001 0.0000 0.1498
## density 0.0340 0.0002 0.0000 0.0000 0.0000 0.7377 0.0009 0.1714
## senate 0.0101 0.0013 0.0000 0.0000 0.0000 0.1308 0.0006 0.4268
## high 0.9596 0.3918 0.6070 0.0001 0.7377 0.1308 0.0000 0.2727
## college 0.0387 0.0372 0.0082 0.0000 0.0009 0.0006 0.0000 0.3603
## pop 0.1059 0.0287 0.6831 0.1498 0.1714 0.4268 0.2727 0.3603
The output above has three different blocks. The first contains the Pearson correlation coefficients between each pair of variables. The second is n, the number of observations for each pair. The third and final block contains the p-values.
At first glance, Income and College are problematic. The correlation coefficient between them is 0.72, which is quite high and makes intuitive sense, as those who go to college are likely to have higher income levels, and vice-versa.
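To isolate that single pairwise relationship, a base-R sketch using only the complete cases would be:
# Pearson correlation between income and college on complete cases
cor(states$income, states$college, use = "complete.obs")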
Now I will check the VIF scores. To do this, I need to load the “car” package, which contains the vif function. The VIF (variance inflation factor) quantifies how much a coefficient's variance is inflated by correlation among the predictors.
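Concretely, the VIF for a given predictor j is 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on all of the other predictors. As a sketch of that calculation done by hand for income on the complete cases (it should roughly match the value vif() reports below; output not shown):
# Auxiliary regression: income on the remaining predictors, complete cases only
aux <- lm(income ~ miles + house + density + senate + high + college + pop,
          data = na.omit(states1))
1 / (1 - summary(aux)$r.squared)  # manual VIF for income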
library(car)
## Loading required package: carData
vif(x)
## miles house income density senate high college pop
## 1.930084 1.977911 5.681471 3.778719 2.754703 2.815526 3.883805 1.575304
Income’s VIF score is the highest (5.68), which suggests multicollinearity involving income. I will now remove this variable and test another model, saving the regression into an object called y.
y <- lm(green ~ miles + house + density + senate + high + college + pop)
summary(y)
##
## Call:
## lm(formula = green ~ miles + house + density + senate + high +
## college + pop)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.644 -7.064 -1.865 2.875 38.490
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.090e+01 3.816e+01 -1.858 0.070548 .
## miles 7.639e+00 2.075e+00 3.682 0.000684 ***
## house -5.381e-01 1.292e-01 -4.165 0.000161 ***
## density 2.948e-02 1.238e-02 2.381 0.022100 *
## senate 1.149e-01 1.079e-01 1.065 0.293427
## high 9.720e-01 5.086e-01 1.911 0.063180 .
## college -1.753e+00 8.377e-01 -2.092 0.042788 *
## pop 4.201e-08 3.725e-07 0.113 0.910774
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.12 on 40 degrees of freedom
## (3 observations deleted due to missingness)
## Multiple R-squared: 0.5401, Adjusted R-squared: 0.4596
## F-statistic: 6.711 on 7 and 40 DF, p-value: 2.819e-05
The adjusted R-squared dropped to 0.4596. I’ll now check the VIF scores to gauge the severity of any remaining multicollinearity.
vif(y)
## miles house density senate high college pop
## 1.905113 1.976671 2.795565 2.598198 2.511804 3.182043 1.344795
The highest VIF score is now 3.18 (college). Ideally, I’d like a model where the VIF scores are as low as possible, but at the same time I do not want to keep removing variables just to push the VIF scores lower, because the variables I would be removing are substantively of interest.
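As one further check on whether dropping income cost meaningful explanatory power, a nested-model F-test could compare the two fits (a sketch, assuming both models were estimated on the same 48 complete cases, as the summaries above indicate; output not shown):
# Nested F-test: does adding income back significantly improve the fit?
anova(y, x)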