Question on Dummy Variables

The question

The question was whether modeling a binary category with two distinct variables, rather than a single variable coded zero and one, would produce different coefficients or a different model fit.

Loading and preparing data

The data come from the V-Dem Project, version 12¹.

I keep only rows with no NAs in any of the selected variables, so that all models are fit to identical data. (Dropping incomplete rows might be problematic for inference, but the purpose here is only to compare models fit to the same dataset.)

library(here)

here() starts at C:/Users/tomha/Documents/3 - R Studio Projects/dummy_variable_question

library(readr)

data1 <- read_csv(here("data","vdem_plus_w.csv"))

New names:
• `` -> `...1`

Rows: 12546 Columns: 180
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (3): namedrows, country_name, v2lpname
dbl (177): ...1, year, COWcode, v2x_polyarchy, v2x_libdem, v2x_partipdem, v2...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(data1)

# A tibble: 6 × 180
   ...1 namedrows   year COWcode count…¹ v2x_p…² v2x_l…³ v2x_p…⁴ v2x_d…⁵ v2x_e…⁶
  <dbl> <chr>      <dbl>   <dbl> <chr>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1   158 Mexico_19…  1946      70 Mexico    0.194   0.095   0.126   0.133   0.095
2   159 Mexico_19…  1947      70 Mexico    0.195   0.095   0.127   0.133   0.102
3   160 Mexico_19…  1948      70 Mexico    0.195   0.095   0.123   0.133   0.102
4   161 Mexico_19…  1949      70 Mexico    0.196   0.096   0.123   0.133   0.1  
5   162 Mexico_19…  1950      70 Mexico    0.197   0.096   0.123   0.134   0.101
6   163 Mexico_19…  1951      70 Mexico    0.197   0.096   0.123   0.134   0.102
# … with 170 more variables: v2x_regime <dbl>, v2x_regime_amb <dbl>,
#   v2x_ex_military <dbl>, v2x_ex_confidence <dbl>, v2x_ex_direlect <dbl>,
#   v2x_ex_hereditary <dbl>, v2x_ex_party <dbl>, v2x_neopat <dbl>,
#   v2xnp_client <dbl>, v2x_frassoc_thick <dbl>, v2xcl_rol <dbl>,
#   v2x_jucon <dbl>, v2xnp_pres <dbl>, v2xlg_legcon <dbl>, v2lgoppart <dbl>,
#   v2dlencmps <dbl>, v2juhcind <dbl>, v2juncind <dbl>, v2juhccomp <dbl>,
#   v2jucomp <dbl>, v2jureview <dbl>, v2stfisccap <dbl>, v2svstterr <dbl>, …

summary(data1$v2x_regime)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  0.000   0.000   1.000   1.124   2.000   3.000       5

summary(data1$e_gdppc)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  0.286   2.029   4.666  10.099  12.233 156.628     996

summary(data1$v2clprptym)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
-4.2730 -0.1930  0.9320  0.6599  1.7350  2.7010       3

summary(data1$v2clprptyw)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
-3.7240 -0.2440  0.8800  0.6872  1.7350  3.2280       3

myvars <- c("year","COWcode","country_name","v2x_regime","e_gdppc","v2clprptym","v2clprptyw")

new_data <- data1[myvars]

new_data <- new_data[complete.cases(new_data),]
head(new_data)

# A tibble: 6 × 7
   year COWcode country_name v2x_regime e_gdppc v2clprptym v2clprptyw
  <dbl>   <dbl> <chr>             <dbl>   <dbl>      <dbl>      <dbl>
1  1946      70 Mexico                1    3.17       1.16      0.941
2  1947      70 Mexico                1    3.24       1.16      0.941
3  1948      70 Mexico                1    3.37       1.16      0.941
4  1949      70 Mexico                1    3.57       1.16      0.941
5  1950      70 Mexico                1    3.96       1.16      0.941
6  1951      70 Mexico                1    4.18       1.16      0.941
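
As a small illustration of the row filtering above, the following sketch applies complete.cases() to a made-up data frame. The object toy and its values are hypothetical and are not taken from the V-Dem file.

```r
# Hypothetical toy data: two of the four rows contain an NA
toy <- data.frame(
  gdp    = c(1.2, NA, 3.4, 5.6),
  regime = c(0, 1, NA, 3)
)

# TRUE for rows with no missing value in any column
keep <- complete.cases(toy)
keep

# Keep only the complete rows, as done for new_data above
toy_complete <- toy[keep, ]
nrow(toy_complete)
```

Only the first and last rows survive, because each of the other two has a missing value in one column.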

Create variables for democracy and nondemocracy

I create a variable for democracy where 0 equals nondemocracy and 1 equals democracy.

I create a second variable, nondemocracy, where 0 equals democracy and 1 equals nondemocracy.

I then show summary values for each. Note that the two means sum to 1, because nondemocracy is defined as 1 minus democracy.

# Code the V-Dem regime category for liberal democracies (v2x_regime == 3) as 1, everything else as 0

new_data$democracy <- ifelse(new_data$v2x_regime == 3, 1, 0)

# nondemocracy is the complement: 1 for everything that is not a democracy, 0 for democracies

new_data$nondemocracy <- 1 - new_data$democracy

summary(new_data$democracy)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0000  0.0000  0.0000  0.1723  0.0000  1.0000

summary(new_data$nondemocracy)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0000  1.0000  1.0000  0.8277  1.0000  1.0000
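
The reason the two means sum to 1 can be checked on a few made-up values. This is a minimal sketch; the vector below is hypothetical and stands in for new_data$democracy.

```r
# Hypothetical 0/1 indicator (not the V-Dem data)
democracy    <- c(1, 0, 0, 1, 0)
nondemocracy <- 1 - democracy

# Every observation has exactly one of the two indicators switched on
all(democracy + nondemocracy == 1)

# So the two means are complements of each other
mean(democracy) + mean(nondemocracy)
```

The same identity, democracy + nondemocracy = 1 for every row, is what matters in the models that follow.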

First model

In all three models, I use a simple linear model for demonstration purposes². The explanatory variables are the V-Dem measures of property rights for men (v2clprptym) and property rights for women (v2clprptyw), plus the democracy variable(s). The dependent variable is Gross Domestic Product per capita (e_gdppc).

In the first model, I follow the standard econometric practice of using a single variable to represent the two categories: democracy, with a value of 1 for democracies. Nondemocracy is then the omitted reference category.

model1 <- lm(e_gdppc ~ v2clprptym + v2clprptyw + democracy, data = new_data)
summary(model1)


Call:
lm(formula = e_gdppc ~ v2clprptym + v2clprptyw + democracy, data = new_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-22.532  -4.974  -2.636   1.163 150.944 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   6.4150     0.1235  51.943  < 2e-16 ***
v2clprptym   -1.1816     0.1555  -7.601 3.18e-14 ***
v2clprptyw    2.4857     0.1576  15.772  < 2e-16 ***
democracy    15.9993     0.3420  46.777  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.64 on 11544 degrees of freedom
Multiple R-squared:  0.2903,    Adjusted R-squared:  0.2901 
F-statistic:  1574 on 3 and 11544 DF,  p-value: < 2.2e-16
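
With nondemocracy as the reference category, the intercept describes nondemocracies (at zero values of the other covariates) and the democracy coefficient is the estimated difference between democracies and nondemocracies, holding the property-rights variables fixed. The sketch below shows the simplest version of this on simulated data with the dummy as the only predictor, where the correspondence with group means is exact. The data frame toy and the numbers used to generate it are assumptions for illustration only.

```r
set.seed(1)

# Hypothetical data: 50 nondemocracies (0) and 50 democracies (1)
toy <- data.frame(democracy = rep(c(0, 1), each = 50))
toy$y <- 5 + 10 * toy$democracy + rnorm(100)

fit <- lm(y ~ democracy, data = toy)

# Mean outcome within each group, named "0" and "1"
group_means <- tapply(toy$y, toy$democracy, mean)

# The intercept equals the mean of the reference group (democracy == 0)
all.equal(unname(coef(fit)["(Intercept)"]), unname(group_means["0"]))

# The dummy coefficient equals the difference between the two group means
all.equal(unname(coef(fit)["democracy"]),
          unname(group_means["1"] - group_means["0"]))
```

Once other covariates are included, as in model1, the intercept and the dummy coefficient are adjusted for those covariates and no longer equal the raw group means.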

Second model

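In the second model, I add the second variable, nondemocracy, alongside democracy. Because nondemocracy equals 1 minus democracy, it is perfectly collinear with the intercept and democracy; lm() reports its coefficient as NA ("not defined because of singularities"), and every other estimate and fit statistic matches the first model.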

model2 <- lm(e_gdppc ~ v2clprptym + v2clprptyw + democracy + nondemocracy, data = new_data)
summary(model2)


Call:
lm(formula = e_gdppc ~ v2clprptym + v2clprptyw + democracy + 
    nondemocracy, data = new_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-22.532  -4.974  -2.636   1.163 150.944 

Coefficients: (1 not defined because of singularities)
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)    6.4150     0.1235  51.943  < 2e-16 ***
v2clprptym    -1.1816     0.1555  -7.601 3.18e-14 ***
v2clprptyw     2.4857     0.1576  15.772  < 2e-16 ***
democracy     15.9993     0.3420  46.777  < 2e-16 ***
nondemocracy       NA         NA      NA       NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.64 on 11544 degrees of freedom
Multiple R-squared:  0.2903,    Adjusted R-squared:  0.2901 
F-statistic:  1574 on 3 and 11544 DF,  p-value: < 2.2e-16
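
The singularity can be seen directly in the design matrix. The following sketch uses simulated data (the data frame toy, the covariate x, and all generating values are hypothetical) to show that the matrix with an intercept and both dummies is rank deficient. It also shows, as an assumption-laden aside, that both dummies can be estimated if the intercept is removed, in which case the fitted values are the same as under reference-category coding.

```r
set.seed(42)
n <- 200

# Hypothetical data with one continuous covariate and a 0/1 indicator
toy <- data.frame(x = rnorm(n), democracy = rbinom(n, 1, 0.3))
toy$nondemocracy <- 1 - toy$democracy
toy$y <- 2 + 0.5 * toy$x + 3 * toy$democracy + rnorm(n)

# Intercept plus both dummies: four columns, but only three are independent
X <- model.matrix(~ x + democracy + nondemocracy, data = toy)
ncol(X)
qr(X)$rank

# Reference-category coding versus both dummies with the intercept removed
fit_ref  <- lm(y ~ x + democracy, data = toy)
fit_both <- lm(y ~ 0 + x + democracy + nondemocracy, data = toy)

# Same fitted values, only the parameterisation differs
all.equal(fitted(fit_ref), fitted(fit_both))

# The nondemocracy coefficient plays the role of the intercept
all.equal(unname(coef(fit_both)["nondemocracy"]),
          unname(coef(fit_ref)["(Intercept)"]))
```

Note that summary() computes R-squared differently for models without an intercept, so the reported R-squared of the two fits can differ even though the fitted values and residuals are the same.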

Third model

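In this third model, I leave democracy coded with values of 0 for nondemocracy and 1 for democracy. I create a new variable, nondemocracy2, with a value of 1 for democracy and 2 for nondemocracy. Shifting the coding by a constant does not remove the linear dependence on the intercept and democracy, so its coefficient is again reported as NA.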

Using democracy as 0/1 and nondemocracy as 1/2

new_data$nondemocracy2 <- new_data$nondemocracy + 1
summary(new_data$nondemocracy2)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   2.000   2.000   1.828   2.000   2.000

model3 <- lm(e_gdppc ~ v2clprptym + v2clprptyw + democracy + nondemocracy2, data = new_data)
summary(model3)


Call:
lm(formula = e_gdppc ~ v2clprptym + v2clprptyw + democracy + 
    nondemocracy2, data = new_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-22.532  -4.974  -2.636   1.163 150.944 

Coefficients: (1 not defined because of singularities)
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)     6.4150     0.1235  51.943  < 2e-16 ***
v2clprptym     -1.1816     0.1555  -7.601 3.18e-14 ***
v2clprptyw      2.4857     0.1576  15.772  < 2e-16 ***
democracy      15.9993     0.3420  46.777  < 2e-16 ***
nondemocracy2       NA         NA      NA       NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.64 on 11544 degrees of freedom
Multiple R-squared:  0.2903,    Adjusted R-squared:  0.2901 
F-statistic:  1574 on 3 and 11544 DF,  p-value: < 2.2e-16
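
Why the recoding makes no difference can be written out in one line. Using the definitions from the code above, with 1 denoting the constant column that lm() adds for the intercept:

```latex
\[
\text{nondemocracy2}_i
  = \text{nondemocracy}_i + 1
  = (1 - \text{democracy}_i) + 1
  = 2 \cdot 1 - \text{democracy}_i
\]
```

The new variable is therefore an exact linear combination of the intercept column and democracy, just as nondemocracy was in the second model. The same holds for any recoding of the form a + b × democracy, so no choice of two numeric labels for the two categories adds information to the model.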

Footnotes

¹ Coppedge, Michael, John Gerring, Carl Henrik Knutsen, Staffan I. Lindberg, Jan Teorell, Nazifa Alizada, David Altman, Michael Bernhard, Agnes Cornell, M. Steven Fish, Lisa Gastaldi, Haakon Gjerløw, Adam Glynn, Sandra Grahn, Allen Hicken, Garry Hindle, Nina Ilchenko, Katrin Kinzelbach, Joshua Krusell, Kyle L. Marquardt, Kelly McMann, Valeriya Mechkova, Juraj Medzihorsky, Pamela Paxton, Daniel Pemstein, Josefine Pernes, Oskar Rydén, Johannes von Römer, Brigitte Seim, Rachel Sigman, Svend-Erik Skaaning, Jeffrey Staton, Aksel Sundström, Eitan Tzelgov, Yi-ting Wang, Tore Wig, Steven Wilson and Daniel Ziblatt. 2022. “V-Dem [Country-Year/Country-Date] Dataset v12.” Varieties of Democracy (V-Dem) Project. https://doi.org/10.23696/vdemds22.
² This model is likely not the best model for inference or prediction, but it suffices to show the effect of the different methods of treating the binary variable.