rm(list=ls())
library(ggplot2)
library(stargazer)
##
## Please cite as:
## Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.2.3
## corrplot 0.92 loaded
library(AER)
## Warning: package 'AER' was built under R version 4.2.3
## Loading required package: car
## Loading required package: carData
## Loading required package: lmtest
## Warning: package 'lmtest' was built under R version 4.2.3
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 4.2.3
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: sandwich
## Warning: package 'sandwich' was built under R version 4.2.3
## Loading required package: survival
library(rsconnect)
The bias of an estimator can be defined as positive, negative, or zero. When the bias of an estimator is positive, we know that the estimator is over-predicting the target variable. When the bias of an estimator is negative, the estimator is under-predicting the target variable. Lastly, when the bias of an estimator is zero, we can say that the estimator is unbiased and it is equal to the true parameter value. Unbiased estimators are desirable, however, bias is not always avoidable, and sometimes biased estimators might be preferred due to other properties, such as lower variance or simplicity. Bias quantifies the systematic deviation of an estimator from its true parameter. This helps to assess the accuracy and reliability of the estimator.
By increasing the sample population (n) we are able to make the bias go away due to the fact that the estimator will startto converge toward the true parameter value (creates lower bias).
By adding more variables we are able to reduce the bias of estimators, as long as the added variable is not orthogonal to the previous variables. If these variables are highly correlated, we can create more bias.
data("Guns")
fm1 <- lm(violent ~ state + year + murder + robbery + prisoners + afam + cauc + male + population + income + density + law, data = Guns)
The data set Guns is a set of 1,173 observations of 51 states over 23 years. This data is an example of panel data, and contains 13 variables.
state: factor indicating state
year: factor indicating year
violent: violent crime rate (incidents per 100,000 members of the population)
murder: murder rate (incidents per 100,000)
robbery: robbery rate (incidents per 100,000)
prisoners: incarceration rate in state in the previous year (sentenced prisoners per 100,000 residents; value for the previous year)
afam: percent of the state population that is African American
cauc: percent of state that is Caucasian, ages 10 to 64
male: percent of state population that is male, ages 10 to 29
population: state population, in millions of people
income: real per capita personal income in the state (US dollars)
density: population per square mile of land area, divided by 1,000
law: factor. Does the state have a shall carry law in effect that year?
\[ violent_i = \beta_0 + \beta_1state + \beta_2year + \beta_3murder + \beta_4robbery + \beta_5prisoners + \beta_6afam + \beta_7cauc + \]
\[\beta_8male + \beta_9population + \beta_10income + \beta_11density + \beta_12law + \epsilon_i\]
The key independent variable in this model is violent, which measures the violent crime rate in incidents per 100,000 members of the population
\[ violent_i = \beta_0 + \beta_1*state + \beta_2*year + \beta_3*murder + \beta_4*robbery + \beta_5prisoners + \beta_6afam + \beta_7cauc + \]
\[\beta_8*male + \beta_9*population + \beta_10*income + \beta_11*density + \epsilon_i\]
fm2 <- lm(violent ~ state + year + murder + prisoners + afam + cauc + male + population + income + density + law, data = Guns)
summary(fm2)
##
## Call:
## lm(formula = violent ~ state + year + murder + prisoners + afam +
## cauc + male + population + income + density + law, data = Guns)
##
## Residuals:
## Min 1Q Median 3Q Max
## -310.89 -40.40 3.00 36.01 448.61
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.033e+01 2.466e+02 0.366 0.714232
## stateAlaska 1.483e+02 3.686e+01 4.023 6.13e-05 ***
## stateArizona 1.737e+02 4.311e+01 4.029 5.99e-05 ***
## stateArkansas -2.000e+01 3.277e+01 -0.610 0.541735
## stateCalifornia 5.785e+01 1.089e+02 0.531 0.595453
## stateColorado 1.518e+02 5.398e+01 2.812 0.005016 **
## stateConnecticut 1.582e+02 6.640e+01 2.382 0.017396 *
## stateDelaware 2.372e+02 3.831e+01 6.192 8.38e-10 ***
## stateDistrict of Columbia 8.014e+02 3.890e+02 2.060 0.039599 *
## stateFlorida 4.451e+02 5.386e+01 8.264 4.04e-16 ***
## stateGeorgia -1.037e+01 2.641e+01 -0.393 0.694728
## stateHawaii -3.357e+02 1.250e+02 -2.686 0.007344 **
## stateIdaho -2.836e+01 6.346e+01 -0.447 0.654982
## stateIllinois 2.717e+02 5.028e+01 5.403 8.02e-08 ***
## stateIndiana -1.065e+01 5.218e+01 -0.204 0.838394
## stateIowa -1.150e+01 6.390e+01 -0.180 0.857210
## stateKansas 4.809e+01 5.065e+01 0.949 0.342594
## stateKentucky -7.286e+01 5.021e+01 -1.451 0.147091
## stateLouisiana 9.668e+01 2.675e+01 3.615 0.000315 ***
## stateMaine -7.433e+01 6.492e+01 -1.145 0.252448
## stateMaryland 3.538e+02 3.677e+01 9.621 < 2e-16 ***
## stateMassachusetts 3.169e+02 7.074e+01 4.480 8.26e-06 ***
## stateMichigan 1.660e+02 4.623e+01 3.591 0.000344 ***
## stateMinnesota -6.633e+00 6.104e+01 -0.109 0.913491
## stateMississippi -2.855e+02 3.669e+01 -7.779 1.68e-14 ***
## stateMissouri 1.358e+02 4.336e+01 3.133 0.001777 **
## stateMontana -1.052e+02 5.217e+01 -2.016 0.044078 *
## stateNebraska 2.505e+01 5.821e+01 0.430 0.666979
## stateNevada 3.313e+02 4.242e+01 7.809 1.34e-14 ***
## stateNew Hampshire -6.473e+01 6.802e+01 -0.952 0.341501
## stateNew Jersey 1.767e+02 6.437e+01 2.745 0.006153 **
## stateNew Mexico 2.828e+02 3.852e+01 7.344 4.06e-13 ***
## stateNew York 2.978e+02 7.236e+01 4.116 4.15e-05 ***
## stateNorth Carolina -2.666e+01 2.590e+01 -1.029 0.303475
## stateNorth Dakota -1.988e+02 6.104e+01 -3.257 0.001161 **
## stateOhio -1.636e+01 5.863e+01 -0.279 0.780221
## stateOklahoma 4.741e+01 3.277e+01 1.447 0.148195
## stateOregon 2.024e+02 5.552e+01 3.645 0.000280 ***
## statePennsylvania -6.536e+01 6.411e+01 -1.019 0.308250
## stateRhode Island 1.156e+02 7.440e+01 1.554 0.120428
## stateSouth Carolina 2.277e+02 2.673e+01 8.519 < 2e-16 ***
## stateSouth Dakota -1.230e+02 5.314e+01 -2.315 0.020797 *
## stateTennessee 9.732e+01 3.327e+01 2.925 0.003514 **
## stateTexas -9.103e+01 6.662e+01 -1.366 0.172105
## stateUtah -7.069e+01 6.802e+01 -1.039 0.298919
## stateVermont -9.099e+01 6.628e+01 -1.373 0.170088
## stateVirginia -1.611e+02 3.159e+01 -5.100 4.01e-07 ***
## stateWashington 1.042e+02 4.974e+01 2.094 0.036450 *
## stateWest Virginia -1.265e+02 5.890e+01 -2.147 0.031994 *
## stateWisconsin -1.030e+02 5.701e+01 -1.807 0.071027 .
## stateWyoming 5.761e+00 6.145e+01 0.094 0.925325
## year1978 2.455e+01 1.467e+01 1.673 0.094556 .
## year1979 6.131e+01 1.492e+01 4.109 4.27e-05 ***
## year1980 8.549e+01 1.522e+01 5.615 2.49e-08 ***
## year1981 9.655e+01 1.575e+01 6.129 1.24e-09 ***
## year1982 9.044e+01 1.657e+01 5.458 5.95e-08 ***
## year1983 8.063e+01 1.771e+01 4.552 5.92e-06 ***
## year1984 9.987e+01 1.955e+01 5.109 3.82e-07 ***
## year1985 1.193e+02 2.137e+01 5.581 3.01e-08 ***
## year1986 1.466e+02 2.344e+01 6.255 5.69e-10 ***
## year1987 1.471e+02 2.559e+01 5.749 1.16e-08 ***
## year1988 1.746e+02 2.799e+01 6.239 6.30e-10 ***
## year1989 1.963e+02 3.022e+01 6.498 1.24e-10 ***
## year1990 2.553e+02 3.708e+01 6.885 9.76e-12 ***
## year1991 2.760e+02 3.892e+01 7.091 2.39e-12 ***
## year1992 3.001e+02 4.106e+01 7.308 5.21e-13 ***
## year1993 3.040e+02 4.268e+01 7.123 1.92e-12 ***
## year1994 2.966e+02 4.463e+01 6.646 4.76e-11 ***
## year1995 2.940e+02 4.651e+01 6.321 3.77e-10 ***
## year1996 2.661e+02 4.829e+01 5.511 4.44e-08 ***
## year1997 2.612e+02 5.000e+01 5.223 2.11e-07 ***
## year1998 2.426e+02 5.202e+01 4.664 3.49e-06 ***
## year1999 2.226e+02 5.388e+01 4.131 3.89e-05 ***
## murder 1.729e+01 8.805e-01 19.638 < 2e-16 ***
## prisoners 1.026e-02 4.797e-02 0.214 0.830693
## afam 4.338e+00 1.201e+01 0.361 0.718019
## cauc -4.691e+00 4.111e+00 -1.141 0.254125
## male 2.394e+01 8.227e+00 2.911 0.003681 **
## population 1.341e+01 4.143e+00 3.238 0.001241 **
## income -1.018e-02 3.466e-03 -2.937 0.003389 **
## density -5.827e+00 4.003e+01 -0.146 0.884298
## lawyes 7.590e-01 8.946e+00 0.085 0.932409
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 73.01 on 1091 degrees of freedom
## Multiple R-squared: 0.9556, Adjusted R-squared: 0.9523
## F-statistic: 289.8 on 81 and 1091 DF, p-value: < 2.2e-16
stargazer(fm1, fm2,
type = "text"
)
##
## ===============================================================================
## Dependent variable:
## -----------------------------------------------------
## violent
## (1) (2)
## -------------------------------------------------------------------------------
## stateAlaska 86.744*** 148.294***
## (25.068) (36.858)
##
## stateArizona -31.880 173.695***
## (29.811) (43.112)
##
## stateArkansas -64.629*** -20.000
## (22.267) (32.766)
##
## stateCalifornia -135.755* 57.847
## (74.101) (108.919)
##
## stateColorado -59.666 151.774***
## (37.099) (53.979)
##
## stateConnecticut -177.362*** 158.162**
## (46.021) (66.403)
##
## stateDelaware 27.685 237.229***
## (26.646) (38.310)
##
## stateDistrict of Columbia 129.577 801.437**
## (264.593) (388.977)
##
## stateFlorida 134.715*** 445.100***
## (37.559) (53.858)
##
## stateGeorgia -61.312*** -10.369
## (17.978) (26.414)
##
## stateHawaii 30.791 -335.689***
## (85.419) (124.983)
##
## stateIdaho -151.606*** -28.365
## (43.195) (63.459)
##
## stateIllinois -41.351 271.702***
## (35.222) (50.283)
##
## stateIndiana -139.874*** -10.645
## (35.592) (52.185)
##
## stateIowa -177.910*** -11.500
## (43.607) (63.901)
##
## stateKansas -128.971*** 48.094
## (34.724) (50.653)
##
## stateKentucky -179.145*** -72.855
## (34.200) (50.213)
##
## stateLouisiana 62.670*** 96.685***
## (18.174) (26.749)
##
## stateMaine -200.732*** -74.335
## (44.190) (64.920)
##
## stateMaryland 47.579* 353.763***
## (26.376) (36.770)
##
## stateMassachusetts 23.361 316.915***
## (48.696) (70.743)
##
## stateMichigan -65.662** 166.028***
## (32.030) (46.232)
##
## stateMinnesota -210.383*** -6.633
## (41.806) (61.040)
##
## stateMississippi -120.194*** -285.459***
## (25.322) (36.694)
##
## stateMissouri -84.109*** 135.837***
## (30.055) (43.359)
##
## stateMontana -190.735*** -105.167**
## (35.482) (52.174)
##
## stateNebraska -140.674*** 25.053
## (39.765) (58.208)
##
## stateNevada -96.602*** 331.275***
## (31.169) (42.421)
##
## stateNew Hampshire -220.970*** -64.727
## (46.357) (68.017)
##
## stateNew Jersey -131.260*** 176.698***
## (44.519) (64.374)
##
## stateNew Mexico 182.425*** 282.848***
## (26.284) (38.516)
##
## stateNew York -189.162*** 297.805***
## (50.948) (72.357)
##
## stateNorth Carolina -21.718 -26.663
## (17.573) (25.899)
##
## stateNorth Dakota -273.383*** -198.818***
## (41.472) (61.044)
##
## stateOhio -219.976*** -16.365
## (40.190) (58.635)
##
## stateOklahoma -60.255*** 47.414
## (22.436) (32.768)
##
## stateOregon -59.087 202.356***
## (38.373) (55.519)
##
## statePennsylvania -236.312*** -65.357
## (43.764) (64.115)
##
## stateRhode Island -99.087* 115.626
## (50.834) (74.396)
##
## stateSouth Carolina 249.269*** 227.672***
## (18.144) (26.726)
##
## stateSouth Dakota -203.141*** -123.020**
## (36.126) (53.140)
##
## stateTennessee -43.347* 97.320***
## (22.914) (33.270)
##
## stateTexas -208.368*** -91.031
## (45.323) (66.622)
##
## stateUtah -204.625*** -70.693
## (46.306) (68.023)
##
## stateVermont -215.010*** -90.991
## (45.105) (66.280)
##
## stateVirginia -198.054*** -161.106***
## (21.460) (31.592)
##
## stateWashington -68.126** 104.177**
## (34.091) (49.739)
##
## stateWest Virginia -208.072*** -126.469**
## (40.028) (58.898)
##
## stateWisconsin -266.551*** -103.024*
## (38.952) (57.012)
##
## stateWyoming -133.177*** 5.761
## (41.877) (61.453)
##
## year1978 13.840 24.549*
## (9.959) (14.671)
##
## year1979 33.720*** 61.310***
## (10.153) (14.920)
##
## year1980 35.930*** 85.485***
## (10.423) (15.225)
##
## year1981 33.268*** 96.547***
## (10.834) (15.753)
##
## year1982 41.773*** 90.443***
## (11.325) (16.570)
##
## year1983 42.516*** 80.626***
## (12.066) (17.713)
##
## year1984 62.836*** 99.870***
## (13.304) (19.548)
##
## year1985 77.153*** 119.253***
## (14.545) (21.367)
##
## year1986 93.961*** 146.641***
## (15.974) (23.443)
##
## year1987 98.752*** 147.093***
## (17.413) (25.586)
##
## year1988 114.960*** 174.613***
## (19.064) (27.989)
##
## year1989 122.089*** 196.339***
## (20.607) (30.217)
##
## year1990 166.985*** 255.265***
## (25.278) (37.078)
##
## year1991 174.522*** 275.960***
## (26.558) (38.917)
##
## year1992 194.199*** 300.088***
## (28.017) (41.060)
##
## year1993 200.630*** 304.029***
## (29.106) (42.685)
##
## year1994 192.383*** 296.578***
## (30.420) (44.628)
##
## year1995 183.995*** 294.024***
## (31.709) (46.513)
##
## year1996 164.913*** 266.142***
## (32.887) (48.290)
##
## year1997 166.496*** 261.181***
## (34.032) (50.004)
##
## year1998 154.455*** 242.634***
## (35.386) (52.025)
##
## year1999 136.573*** 222.577***
## (36.639) (53.883)
##
## murder 7.517*** 17.291***
## (0.657) (0.881)
##
## robbery 1.303***
## (0.036)
##
## prisoners 0.263*** 0.010
## (0.033) (0.048)
##
## afam -24.922*** 4.338
## (8.190) (12.009)
##
## cauc -5.894** -4.691
## (2.790) (4.111)
##
## male 21.546*** 23.944***
## (5.582) (8.227)
##
## population 7.029** 13.413***
## (2.817) (4.143)
##
## income 0.002 -0.010***
## (0.002) (0.003)
##
## density -11.702 -5.827
## (27.164) (40.033)
##
## lawyes -19.013*** 0.759
## (6.095) (8.946)
##
## Constant 257.806 90.326
## (167.392) (246.607)
##
## -------------------------------------------------------------------------------
## Observations 1,173 1,173
## R2 0.980 0.956
## Adjusted R2 0.978 0.952
## Residual Std. Error 49.538 (df = 1090) 73.010 (df = 1091)
## F Statistic 637.504*** (df = 82; 1090) 289.844*** (df = 81; 1091)
## ===============================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
In this case we can see that the variable robbery is statistically significant at p<0.01 with a positive influence on the target variable of 1.303
the omitted variables needs to be correlated with an independent variable from the regression function
cor(Guns[,c(2:4)])
## violent murder robbery
## violent 1.0000000 0.8265086 0.9070773
## murder 0.8265086 1.0000000 0.7976060
## robbery 0.9070773 0.7976060 1.0000000
cortest4 <- cor.test(Guns$robbery, Guns$murder)
cortest4
##
## Pearson's product-moment correlation
##
## data: Guns$robbery and Guns$murder
## t = 45.25, df = 1171, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7757854 0.8175212
## sample estimates:
## cor
## 0.797606
As we can see from this correlation table above, robbery is highly correlated with murder, therefore the omitted variable passes condition 1. We can also see that the correlation is statistically significant with a p-value < 2.2e-16.
the omitted variable is a determinant of the target variable y
cortest1 <- cor.test(Guns$robbery, Guns$violent)
cortest1
##
## Pearson's product-moment correlation
##
## data: Guns$robbery and Guns$violent
## t = 73.736, df = 1171, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8963788 0.9167198
## sample estimates:
## cor
## 0.9070773
In this case, we can see that the correlation of the variables robbery and violent is (+) at 0.907, in addition there is a very low p-value at 2.2e-16 showing that this is statistically significant.
Based on the tests above, we know that the correlation between the omitted variable and key variable murder is positive at 0.797, and we know that the impact of the omitted variable on the target value y is positive at 1.303, so we can tell that the omitted variable is positively biased.
Intuition: the impact of a robbery on violent crime rates is often positive, as less often a peaceful crime, and people can be willing to fight back against a robbery. We also know that murder and robbery can be positively correlated, as murders from robbery are an intrinsic byproduct of the crime itself.
data("CigarettesB")
fm3 <- lm(packs ~ price + income, data = CigarettesB)
fm4 <- lm(packs ~ price, data = CigarettesB)
summary(fm3)
##
## Call:
## lm(formula = packs ~ price + income, data = CigarettesB)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.41867 -0.10683 0.00757 0.11738 0.32868
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.2997 0.9089 4.730 2.43e-05 ***
## price -1.3383 0.3246 -4.123 0.000168 ***
## income 0.1724 0.1968 0.876 0.385818
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1634 on 43 degrees of freedom
## Multiple R-squared: 0.3037, Adjusted R-squared: 0.2713
## F-statistic: 9.378 on 2 and 43 DF, p-value: 0.0004168
The second data set that I am using is titled CigarettesB, where there is a data frame consisting of 46 observations on 3 variables. This data is cross-sectional for US states in the year 1992.
packs: log of cigarette consumption (in packs) per person of smoking age (>16 years)
price: lg of real price of cigarette in each state
income: log of real disposable income (per capita) in each state
\[ packs_i = \beta_0 + \beta_1price + \beta_2income + \epsilon_i \]
\[ packs_i = \beta_0 + \beta_1price + \epsilon_i \]
stargazer(fm3, fm4,
type = "text"
)
##
## ================================================================
## Dependent variable:
## --------------------------------------------
## packs
## (1) (2)
## ----------------------------------------------------------------
## price -1.338*** -1.198***
## (0.325) (0.282)
##
## income 0.172
## (0.197)
##
## Constant 4.300*** 5.094***
## (0.909) (0.063)
##
## ----------------------------------------------------------------
## Observations 46 46
## R2 0.304 0.291
## Adjusted R2 0.271 0.275
## Residual Std. Error 0.163 (df = 43) 0.163 (df = 44)
## F Statistic 9.378*** (df = 2; 43) 18.084*** (df = 1; 44)
## ================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Based on the table above, we can see that the omitted variable is statistically significant, and it does have a negative impact on price
the omitted variables needs to be correlated with an independent variable from the regression function
cor(CigarettesB)
## packs price income
## packs 1.0000000 -0.5397069 -0.1686728
## price -0.5397069 1.0000000 0.4923318
## income -0.1686728 0.4923318 1.0000000
cortest3 <- cor.test(CigarettesB$price, CigarettesB$income)
cortest3
##
## Pearson's product-moment correlation
##
## data: CigarettesB$price and CigarettesB$income
## t = 3.752, df = 44, p-value = 0.0005099
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2357241 0.6847616
## sample estimates:
## cor
## 0.4923318
Based on the correlation table above we can see that the variable price has a positive correlation of 0.492 with the variable income. We can pass condition 1. Additionally, we can see that the correlation is statistically significant with a p-value of 0.0005099
the omitted variable is a determinant of the target variable y
cortest2 <- cor.test(CigarettesB$price, CigarettesB$packs)
cortest2
##
## Pearson's product-moment correlation
##
## data: CigarettesB$price and CigarettesB$packs
## t = -4.2525, df = 44, p-value = 0.0001085
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.7175778 -0.2957450
## sample estimates:
## cor
## -0.5397069
Based on the correlation test above, we can see that price and packs are negatively correlated at -0.54 at a p-value of 0.0001085, therefore the correlation of the omitted variable on y is statistically significant.
We can see from the tests run above for model 2 that the impact of the omitted variable on y is negative at -1.338. We can also see that the correlation of the omitted variable on a key independent variable “income” is positive at 0.492. Therefore, we know that the omitted variable has a negative bias.
Intuition: It makes sense that the omitted variable would cause a negative bias, as the price of a pack of cigarettes could deter whether someone buys a pack or not, and those who have a higher income are less likely to be deterred from purchasing cigarettes, therefore more packs would be consumed.