The question was whether modeling a binary category with two separate dummy variables, instead of a single variable coded 0 and 1, would produce different coefficients or a different model fit.
I keep only observations with no NAs in any of the variables used, so every model is fit to identical data. (This might be problematic for inference, but our purpose here is only to compare the results of different codings on the same dataset.)
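The filtering code itself is not shown here, so the following is only a minimal sketch of one way to keep complete cases in base R. The data frame `toy`, the vector `model_vars`, and all values are made up for illustration; only the column names are taken from the models used in this post.

```r
# Toy data frame standing in for the V-Dem extract (values are made up).
toy <- data.frame(
  e_gdppc    = c(1.2, NA, 3.4, 5.6),
  v2clprptym = c(0.5, 0.7, NA, 0.9),
  v2clprptyw = c(0.4, 0.6, 0.8, 1.0),
  v2x_regime = c(0, 3, 2, 3)
)

# Columns that every model will use (hypothetical helper vector).
model_vars <- c("e_gdppc", "v2clprptym", "v2clprptyw", "v2x_regime")

# Keep only the rows with no NA in any of those columns.
toy_complete <- toy[complete.cases(toy[, model_vars]), ]

nrow(toy_complete)
```

Filtering once, before fitting, means each model sees exactly the same rows, so any difference in output can only come from how the binary variable is coded.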
library(here)
here() starts at C:/Users/tomha/Documents/3 - R Studio Projects/dummy_variable_question
Rows: 12546 Columns: 180
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): namedrows, country_name, v2lpname
dbl (177): ...1, year, COWcode, v2x_polyarchy, v2x_libdem, v2x_partipdem, v2...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
I create a variable for democracy where 0 equals nondemocracy and 1 equals democracy.
I create a second variable, nondemocracy, where 0 equals democracy and 1 equals nondemocracy.
I then show summary values for each. Note that the two means sum to 1.
# This is using the VDem data for the category equal to liberal democracies as a 1
new_data$democracy <- ifelse(new_data$v2x_regime == 3, 1, 0)
# This sets everything not a democracy as a 0
new_data$nondemocracy <- 1 - new_data$democracy
summary(new_data$democracy)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000 0.0000 0.0000 0.1723 0.0000 1.0000
summary(new_data$nondemocracy)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000 1.0000 1.0000 0.8277 1.0000 1.0000
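The means sum to 1 because every observation falls in exactly one of the two categories. A small sketch with made-up regime codes (not V-Dem data) shows the same property; the vector `regime` is a hypothetical stand-in for `v2x_regime`.

```r
# Made-up regime codes; 3 stands for liberal democracy, as in the code above.
regime <- c(0, 1, 3, 2, 3, 1)

democracy    <- ifelse(regime == 3, 1, 0)
nondemocracy <- 1 - democracy

# Every row is in exactly one category, so the two dummies add up to 1 ...
all(democracy + nondemocracy == 1)

# ... and therefore so do their means.
mean(democracy) + mean(nondemocracy)
```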
First model
In all three models, I use a simple linear model for demonstration purposes (see footnote 2). I use values from VDem for property rights for men (v2clprptym) and property rights for women (v2clprptyw) plus the democracy variable(s) as explanatory variables. I use Gross Domestic Product per capita (e_gdppc) as the dependent variable.
In the first model, I use the standard econometric approach of representing the two categories with a single variable. I use democracy, coded 1 for democracy. Nondemocracy is then the omitted reference category.
Call:
lm(formula = e_gdppc ~ v2clprptym + v2clprptyw + democracy +
nondemocracy, data = new_data)
Residuals:
Min 1Q Median 3Q Max
-22.532 -4.974 -2.636 1.163 150.944
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.4150 0.1235 51.943 < 2e-16 ***
v2clprptym -1.1816 0.1555 -7.601 3.18e-14 ***
v2clprptyw 2.4857 0.1576 15.772 < 2e-16 ***
democracy 15.9993 0.3420 46.777 < 2e-16 ***
nondemocracy NA NA NA NA
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11.64 on 11544 degrees of freedom
Multiple R-squared: 0.2903, Adjusted R-squared: 0.2901
F-statistic: 1574 on 3 and 11544 DF, p-value: < 2.2e-16
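The NA row and the "1 not defined because of singularities" note in the output above can be reproduced on simulated data. The sketch below does not use the V-Dem data; the variables `x` and `y` and every coefficient value are invented for illustration. It compares a fit with one dummy against a fit with both dummies.

```r
set.seed(1)
n <- 200

# Simulated stand-ins for the real variables (not V-Dem data).
democracy    <- rbinom(n, 1, 0.2)
nondemocracy <- 1 - democracy
x            <- rnorm(n)
y            <- 5 + 2 * x + 10 * democracy + rnorm(n)

fit_one  <- lm(y ~ x + democracy)                 # one dummy
fit_both <- lm(y ~ x + democracy + nondemocracy)  # both dummies

# lm() cannot estimate the redundant column and reports NA for it.
coef(fit_both)

# The coefficients that are estimated match the one-dummy model,
# and the fit is the same.
all.equal(coef(fit_one), coef(fit_both)[names(coef(fit_one))])
all.equal(summary(fit_one)$r.squared, summary(fit_both)$r.squared)
```

Because nondemocracy equals 1 minus democracy, it adds no information beyond the intercept and democracy, so lm() drops it and the remaining estimates are unchanged.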
Third model
In this third model, I leave democracy coded with values of 0 for nondemocracy and 1 for democracy. I create a new variable nondemocracy2 with a value equal to 1 for democracy and 2 for nondemocracy.
Call:
lm(formula = e_gdppc ~ v2clprptym + v2clprptyw + democracy +
nondemocracy2, data = new_data)
Residuals:
Min 1Q Median 3Q Max
-22.532 -4.974 -2.636 1.163 150.944
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.4150 0.1235 51.943 < 2e-16 ***
v2clprptym -1.1816 0.1555 -7.601 3.18e-14 ***
v2clprptyw 2.4857 0.1576 15.772 < 2e-16 ***
democracy 15.9993 0.3420 46.777 < 2e-16 ***
nondemocracy2 NA NA NA NA
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11.64 on 11544 degrees of freedom
Multiple R-squared: 0.2903, Adjusted R-squared: 0.2901
F-statistic: 1574 on 3 and 11544 DF, p-value: < 2.2e-16
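Recoding the second variable as 1 and 2 does not remove the redundancy, since nondemocracy2 is still an exact linear function of the intercept and democracy. A minimal sketch with made-up values checks this by computing the rank of a small design matrix; the matrix `X` is a hypothetical illustration, not the model matrix of the fit above.

```r
# Made-up dummy values for illustration.
democracy     <- c(0, 1, 0, 0, 1, 0)
nondemocracy2 <- 2 - democracy   # 1 for democracy, 2 for nondemocracy

# Design matrix with an intercept column plus both codings.
X <- cbind(intercept = 1, democracy, nondemocracy2)

# Three columns, but only two are linearly independent.
ncol(X)
qr(X)$rank

# The exact dependence: nondemocracy2 = 2 * intercept - democracy.
all(X[, "nondemocracy2"] == 2 * X[, "intercept"] - X[, "democracy"])
```

A rank below the number of columns is what lm() reports as a singularity, which is why the nondemocracy2 row is NA and the other estimates are unchanged.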
Footnotes
1. Coppedge, Michael, John Gerring, Carl Henrik Knutsen, Staffan I. Lindberg, Jan Teorell, Nazifa Alizada, David Altman, Michael Bernhard, Agnes Cornell, M. Steven Fish, Lisa Gastaldi, Haakon Gjerløw, Adam Glynn, Sandra Grahn, Allen Hicken, Garry Hindle, Nina Ilchenko, Katrin Kinzelbach, Joshua Krusell, Kyle L. Marquardt, Kelly McMann, Valeriya Mechkova, Juraj Medzihorsky, Pamela Paxton, Daniel Pemstein, Josefine Pernes, Oskar Rydén, Johannes von Römer, Brigitte Seim, Rachel Sigman, Svend-Erik Skaaning, Jeffrey Staton, Aksel Sundström, Eitan Tzelgov, Yi-ting Wang, Tore Wig, Steven Wilson and Daniel Ziblatt. 2022. “V-Dem [Country-Year/Country-Date] Dataset v12” Varieties of Democracy (V-Dem) Project. https://doi.org/10.23696/vdemds22.
2. This model is likely not the best model for inference or prediction, but suffices to show the effect of the different methods of treating the binary variable.