This is the R code I used to complete the first question of a recent homework assignment for the Data Analysis class I am currently taking at GMU.

Question 1

I start by loading the States.dta dataset, which my professor provided. The dataset is in Stata format. I was not given a codebook, but when I load the data into R the variable labels come through, so I can see what each variable measures. The data are from 1990 and cover a variety of state-level characteristics.

First, I need to set my working directory to the folder that contains the file so I can import it into R. Then I use the print and names functions so anyone viewing this can confirm the data loaded correctly.

setwd("~/Desktop/Fall 19/POGO 611/Assignments/HW2")
library(haven)
states <- read_dta("states.dta")
print(states, n = 51)
## # A tibble: 51 x 21
##    state   region     pop   area  density metro  waste energy miles   toxic
##    <chr> <dbl+lb>   <dbl>  <dbl>    <dbl> <dbl>  <dbl>  <dbl> <dbl>   <dbl>
##  1 Alab…  3 [Sou…  4.04e6  52423   77.1    67.4  1.11     393 10.5   27.9  
##  2 Alas…  1 [Wes…  5.50e5 570374    0.960  41.1  0.910    991  7.20  37.4  
##  3 Ariz…  1 [Wes…  3.66e6 113642   32.2    79    0.790    258  9.70  19.6  
##  4 Arka…  3 [Sou…  2.35e6  52075   45.2    40.1  0.850    330  8.90  24.6  
##  5 Cali…  1 [Wes…  2.98e7 155973  191.     95.7  1.51     246  8.70   3.26 
##  6 Colo…  1 [Wes…  3.29e6 103730   31.8    81.5  0.730    273  8.30   2.25 
##  7 Conn…  2 [N. …  3.29e6   4845  678.     92.4  0.880    234  8      6.49 
##  8 Dela…  3 [Sou…  6.66e5   1955  341.     66.3  1.13     349  9.80   9.88 
##  9 Dist… NA       NA          NA   NA      NA   NA         NA NA     NA    
## 10 Flor…  3 [Sou…  1.29e7  53997  240.     90.8  1.45     237  8.5    8.24 
## 11 Geor…  3 [Sou…  6.48e6  57919  112.     65    0.680    322 11.2   12.0  
## 12 Hawa…  1 [Wes…  1.11e6   6423  173.     75.5  1.17     243  7.30   0.770
## 13 Idaho  1 [Wes…  1.01e6  82751   12.2    20.4  0.840    354  9.80  11.5  
## 14 Illi…  4 [Mid…  1.14e7  55593  206.     82.7  1.28     308  7.30  10.4  
## 15 Indi…  4 [Mid…  5.54e6  35870  155.     68.5  1.03     446  9.70  30.5  
## 16 Iowa   4 [Mid…  2.78e6  55875   49.7    44    0.830    333  8.30  14.6  
## 17 Kans…  4 [Mid…  2.48e6  81823   30.3    53.8  0.970    423  9.20  36.3  
## 18 Kent…  3 [Sou…  3.68e6  39732   92.8    46.5  0.950    376  9.10  11.5  
## 19 Loui…  3 [Sou…  4.22e6  43566   96.9    69.5  0.830    783  8.90 101.   
## 20 Maine  2 [N. …  1.23e6  30865   39.8    35.9  0.770    309  9.70  12.1  
## 21 Mary…  3 [Sou…  4.78e6   9775  489.     92.8  1.07     268  8.5    3.36 
## 22 Mass…  2 [N. …  6.02e6   7838  768.     90.4  1.13     229  7.70   3.46 
## 23 Mich…  4 [Mid…  9.30e6  56809  164.     80.1  1.26     298  8.70  12.9  
## 24 Minn…  4 [Mid…  4.38e6  79617   55.0    67.7  1.01     305  8.90  11.9  
## 25 Miss…  3 [Sou…  2.57e6  46914   54.8    30.1  0.540    358  9.5   40.3  
## 26 Miss…  4 [Mid…  5.12e6  68898   74.3    66.2  1.47     294  9.90  13.4  
## 27 Mont…  1 [Wes…  7.99e5 145556    5.49   23.9  0.75     415 10.4   53.4  
## 28 Nebr…  4 [Mid…  1.58e6  76878   20.5    48.5  0.820    334  8.80  11.0  
## 29 Neva…  1 [Wes…  1.20e6 109806   10.9    82.9  0.830    337  8.5    2.72 
## 30 New …  2 [N. …  1.11e6   8969  124.     56.1  0.990    224  8.90   7.48 
## 31 New …  2 [N. …  7.73e6   7419 1042.    100    0.920    296  7.60   3.36 
## 32 New …  1 [Wes…  1.52e6 121365   12.5    48.4  0.990    347 10.7   21.5  
## 33 New …  2 [N. …  1.80e7  47224  381.     91.1  1.22     200  5.90   3.36 
## 34 Nort…  3 [Sou…  6.63e6  48718  136.     56.7  0.910    300  9.5   18.7  
## 35 Nort…  4 [Mid…  6.39e5  68994    9.26   40.3  0.630    467  9.20   3.30 
## 36 Ohio   4 [Mid…  1.08e7  40953  265.     79    1.45     348  8     15.5  
## 37 Okla…  3 [Sou…  3.15e6  68679   45.8    59.4  0.950    396 10.5   10.5  
## 38 Oreg…  1 [Wes…  2.84e6  96003   29.6    68.5  1.16     318  9.40   7.99 
## 39 Penn…  2 [N. …  1.19e7  44820  265.     84.8  0.800    300  7.20   7.52 
## 40 Rhod…  2 [N. …  1.00e6   1045  960.     92.5  1.20     218  7      5.25 
## 41 Sout…  3 [Sou…  3.49e6  30111  116.     60.6  1.15     330  9.90  19.3  
## 42 Sout…  4 [Mid…  6.96e5  75898    9.17   29.5  1.15     284 10      4.22 
## 43 Tenn…  3 [Sou…  4.88e6  41220  118.     67.7  1.03     350  9.60  42.5  
## 44 Texas  3 [Sou…  1.70e7 261914   64.9    81.6  1.06     569  9.60  24.6  
## 45 Utah   1 [Wes…  1.72e6  82168   21.0    77.5  0.700    316  8.5   72.8  
## 46 Verm…  2 [N. …  5.63e5   9249   60.9    23.4  0.690    232 10.4    1.81 
## 47 Virg…  3 [Sou…  6.19e6  39598  156.     72.5  1.45     306  9.70  12.9  
## 48 Wash…  1 [Wes…  4.87e6  66582   73.1    81.7  1.05     389  9.20   8.51 
## 49 West…  3 [Sou…  1.79e6  24087   74.4    36.4  0.950    415  8.60  21.3  
## 50 Wisc…  4 [Mid…  4.89e6  54314   90.1    67.4  0.700    288  9.10   9.20 
## 51 Wyom…  1 [Wes…  4.54e5  97105    4.68   29.6  0.700    786 12.8   25.5  
## # … with 11 more variables: green <dbl>, house <dbl>, senate <dbl>,
## #   csat <dbl>, vsat <dbl>, msat <dbl>, percent <dbl>, expense <dbl>,
## #   income <dbl>, high <dbl>, college <dbl>
names(states)
##  [1] "state"   "region"  "pop"     "area"    "density" "metro"   "waste"  
##  [8] "energy"  "miles"   "toxic"   "green"   "house"   "senate"  "csat"   
## [15] "vsat"    "msat"    "percent" "expense" "income"  "high"    "college"

The first task in Question 1 is to run a regression model with Green (per capita greenhouse gas emissions in tons) as the dependent variable and Miles, House, Income, Density, Senate, High, College, and Pop as the independent variables. But first, I want to look at the green values by state to check for missing observations or obvious outliers.

states[,c("state", "green")]
## # A tibble: 51 x 2
##    state                green
##    <chr>                <dbl>
##  1 Alabama               29.2
##  2 Alaska                NA  
##  3 Arizona               18.4
##  4 Arkansas              26.0
##  5 California            15.6
##  6 Colorado              21.9
##  7 Connecticut           14.6
##  8 Delaware              21.5
##  9 District of Columbia  NA  
## 10 Florida               13.6
## # … with 41 more rows
tail(states[,c("state", "green")])
## # A tibble: 6 x 2
##   state         green
##   <chr>         <dbl>
## 1 Vermont        15.2
## 2 Virginia       18.7
## 3 Washington     16.5
## 4 West Virginia  51.1
## 5 Wisconsin      20.6
## 6 Wyoming       114.

Alaska, DC, and Hawaii have missing (NA) values for green, and Wyoming's value is quite large at 114.
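To confirm exactly which states are missing a value for green, a quick base-R check (a minimal sketch) is:

states$state[is.na(states$green)]   # states with NA for per capita emissions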

Now I will run the model.

attach(states)  # attach the data so the columns can be referenced by name in the formula
x <- lm(green ~ miles + house + income + density + senate + high + college + pop)
summary(x)
## 
## Call:
## lm(formula = green ~ miles + house + income + density + senate + 
##     high + college + pop)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.624  -6.110  -1.033   2.477  38.594 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -7.458e+01  3.828e+01  -1.948 0.058579 .  
## miles        7.887e+00  2.086e+00   3.781 0.000524 ***
## house       -5.347e-01  1.291e-01  -4.143 0.000178 ***
## income      -7.296e-01  6.977e-01  -1.046 0.302167    
## density      3.715e-02  1.438e-02   2.584 0.013628 *  
## senate       1.425e-01  1.110e-01   1.284 0.206584    
## high         1.157e+00  5.378e-01   2.151 0.037765 *  
## college     -1.342e+00  9.244e-01  -1.452 0.154573    
## pop          2.031e-07  4.027e-07   0.504 0.616886    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.11 on 39 degrees of freedom
##   (3 observations deleted due to missingness)
## Multiple R-squared:  0.5527, Adjusted R-squared:  0.4609 
## F-statistic: 6.023 on 8 and 39 DF,  p-value: 4.846e-05

The adjusted R-squared statistic, 0.4609, indicates that roughly 46 percent of the variation in Green is explained by the independent variables (after adjusting for the number of predictors). In terms of the coefficients, four are significant at the 0.05 level (Miles, House, Density, and High) and four are not (Income, Senate, College, and Pop).
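To pull out just the coefficients that clear the 0.05 threshold, a minimal sketch using the coefficient matrix that summary() returns:

coefs <- summary(x)$coefficients      # Estimate, Std. Error, t value, Pr(>|t|)
coefs[coefs[, "Pr(>|t|)"] < 0.05, ]   # keep only the rows with p < 0.05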

So, with a decently high R-squared statistic but only half of the IVs significant, multicollinearity may be present.

First I will check the pairwise correlations. To do this, I load the Hmisc package (Harrell Miscellaneous), created by Frank Harrell of Vanderbilt University; I use it because it provides the rcorr function. I also create a new data frame called states1, which contains all the variables used in the regression model saved in x.

library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, units
states1 <- states[,c("green", "miles", "house", "income", "density", "senate", "high", "college", "pop")]

I run a Pearson correlation on states1, converted to a matrix as rcorr requires.

rcorr(as.matrix(states1), type = "pearson")
##         green miles house income density senate  high college   pop
## green    1.00  0.50 -0.56  -0.37   -0.31  -0.37 -0.01   -0.30 -0.24
## miles    0.50  1.00 -0.27  -0.50   -0.51  -0.44 -0.12   -0.30 -0.31
## house   -0.56 -0.27  1.00   0.41    0.53   0.65  0.07    0.37  0.06
## income  -0.37 -0.50  0.41   1.00    0.59   0.54  0.51    0.72  0.21
## density -0.31 -0.51  0.53   0.59    1.00   0.61 -0.05    0.46  0.20
## senate  -0.37 -0.44  0.65   0.54    0.61   1.00  0.22    0.47  0.11
## high    -0.01 -0.12  0.07   0.51   -0.05   0.22  1.00    0.53 -0.16
## college -0.30 -0.30  0.37   0.72    0.46   0.47  0.53    1.00  0.13
## pop     -0.24 -0.31  0.06   0.21    0.20   0.11 -0.16    0.13  1.00
## 
## n
##         green miles house income density senate high college pop
## green      48    48    48     48      48     48   48      48  48
## miles      48    50    50     50      50     50   50      50  50
## house      48    50    50     50      50     50   50      50  50
## income     48    50    50     51      50     50   51      51  50
## density    48    50    50     50      50     50   50      50  50
## senate     48    50    50     50      50     50   50      50  50
## high       48    50    50     51      50     50   51      51  50
## college    48    50    50     51      50     50   51      51  50
## pop        48    50    50     50      50     50   50      50  50
## 
## P
##         green  miles  house  income density senate high   college pop   
## green          0.0003 0.0000 0.0106 0.0340  0.0101 0.9596 0.0387  0.1059
## miles   0.0003        0.0602 0.0002 0.0002  0.0013 0.3918 0.0372  0.0287
## house   0.0000 0.0602        0.0033 0.0000  0.0000 0.6070 0.0082  0.6831
## income  0.0106 0.0002 0.0033        0.0000  0.0000 0.0001 0.0000  0.1498
## density 0.0340 0.0002 0.0000 0.0000         0.0000 0.7377 0.0009  0.1714
## senate  0.0101 0.0013 0.0000 0.0000 0.0000         0.1308 0.0006  0.4268
## high    0.9596 0.3918 0.6070 0.0001 0.7377  0.1308        0.0000  0.2727
## college 0.0387 0.0372 0.0082 0.0000 0.0009  0.0006 0.0000         0.3603
## pop     0.1059 0.0287 0.6831 0.1498 0.1714  0.4268 0.2727 0.3603

The output above has three blocks. The first contains the Pearson correlation coefficients between each pair of variables. The second, n, gives the number of observations used for each pairwise correlation. The third and final block contains the p-values.

At first glance, Income and College look problematic. The correlation between them is 0.72, which is quite high and makes intuitive sense: states with more college-educated residents tend to have higher income levels, and vice versa.
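rcorr returns these three blocks as components of a list (r, n, and P), so a single value, such as the correlation between Income and College and its p-value, can be pulled out directly. A minimal sketch, assuming the states1 data frame created above:

cors <- rcorr(as.matrix(states1), type = "pearson")
cors$r["income", "college"]   # correlation coefficient
cors$P["income", "college"]   # corresponding p-value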

Now I will check the VIF scores. To do this I load the “car” package, which contains the vif function. VIF (variance inflation factor) quantifies how much collinearity inflates the variance of each coefficient: for a given predictor, VIF = 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors.
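As a sanity check on whatever vif() reports, the VIF for a single predictor can be computed by hand from that formula. A minimal sketch for income, using the same rows that went into the model x:

mf <- model.frame(x)   # exactly the observations used to fit x (NAs already dropped)
r2_income <- summary(lm(income ~ miles + house + density + senate + high + college + pop,
                        data = mf))$r.squared
1 / (1 - r2_income)    # should match the value car::vif(x) reports for income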

library(car)
## Loading required package: carData
vif(x)
##    miles    house   income  density   senate     high  college      pop 
## 1.930084 1.977911 5.681471 3.778719 2.754703 2.815526 3.883805 1.575304

Income’s VIF score is the highest (5.68); by a common rule of thumb, values above 5 are a warning sign of multicollinearity. I will now remove income and fit another model, which I save to an object called y.
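As an aside, the same model can be fit more concisely by updating x rather than retyping the formula; a sketch of the equivalent call:

y <- update(x, . ~ . - income)   # refit x with income dropped from the formula

Below I write the formula out in full so the model specification is explicit.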

y <- lm(green ~ miles + house + density + senate + high + college + pop)
summary(y)
## 
## Call:
## lm(formula = green ~ miles + house + density + senate + high + 
##     college + pop)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.644  -7.064  -1.865   2.875  38.490 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -7.090e+01  3.816e+01  -1.858 0.070548 .  
## miles        7.639e+00  2.075e+00   3.682 0.000684 ***
## house       -5.381e-01  1.292e-01  -4.165 0.000161 ***
## density      2.948e-02  1.238e-02   2.381 0.022100 *  
## senate       1.149e-01  1.079e-01   1.065 0.293427    
## high         9.720e-01  5.086e-01   1.911 0.063180 .  
## college     -1.753e+00  8.377e-01  -2.092 0.042788 *  
## pop          4.201e-08  3.725e-07   0.113 0.910774    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.12 on 40 degrees of freedom
##   (3 observations deleted due to missingness)
## Multiple R-squared:  0.5401, Adjusted R-squared:  0.4596 
## F-statistic: 6.711 on 7 and 40 DF,  p-value: 2.819e-05

The adjusted R-squared dropped slightly to 0.4596. I'll check the VIF scores again to see how much the multicollinearity has been reduced.

vif(y)
##    miles    house  density   senate     high  college      pop 
## 1.905113 1.976671 2.795565 2.598198 2.511804 3.182043 1.344795

The highest VIF score is now 3.18 (college). Ideally I would like a model where the VIF scores are as low as possible, but at the same time I do not want to keep removing variables just to drive the VIF scores down, because the variables I would be removing are substantively of interest.