INTRODUCTION

The data used in this analysis come from a randomized controlled trial (RCT) aimed at increasing voter turnout.

The RCT was conducted in 27 towns, with approximately half of the polling booths in each town randomly selected for the intervention (a value of “1” in the intervention column marks a booth that received it). The outcomes of interest were total turnout (the number of votes cast at each polling booth) and female turnout (the number of votes cast by women at each polling booth). Data were also collected on the number of registered voters at each polling booth, disaggregated by gender. For some polling booths this information could not be obtained, and data entry operators entered “-999” wherever it was missing.

I want to run a fixed effects regression to determine whether the intervention has a statistically significant effect on voter turnout.

I use town fixed effects because booths were randomized within each town, so the fixed effects absorb differences in turnout between towns. I also want to control for variables such as the total number of registered voters, which may affect turnout but are not being directly investigated.

I also treat town_id as a categorical variable so that I can compare turnout across the different towns.
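As a sketch, the model described above can be written as follows. The symbols are notation introduced here for illustration and do not appear in the data set.

```latex
y_{ij} = \alpha_j + \beta\, D_{ij} + \gamma\, R_{ij} + \varepsilon_{ij}
```

Here y_{ij} is total turnout at polling booth i in town j, \alpha_j is the fixed effect for town j, D_{ij} equals 1 if the booth received the intervention, R_{ij} is the number of registered voters at the booth, and \varepsilon_{ij} is the error term. The coefficient of interest is \beta.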

DATA PREPARATION

Clearing object environment

rm(list = ls())

Loading required packages

library(rmarkdown)
library(readxl)
library(skimr)
library(dplyr)
library(tidyverse)
library(fastDummies) # Needed for generating our dummy variables
library(janitor) # Cleaning
library(plm)
library(stats)
library(ggplot2)
library(forcats)
library(scales)
library(kableExtra)

Loading our datasets

df_test <- read_excel("Test Data.xlsx") %>% clean_names

df_names <- read_excel("Town Names.xlsx") %>% clean_names

DATA QUALITY CHECK

Using skimr package to do a data quality check

skim(df_test)

Data summary
Name	df_test
Number of rows	6991
Number of columns	8
_______________________
Column type frequency:
character	3
numeric	5
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
turnout_total	1	1	4	765
turnout_male	1	1	3	471
turnout_female	1	1	3	402

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
town_id	1	200.99	21.33	171	190	198	212.0	239	▇▇▆▂▆
registered_total	1	973.60	469.42	-999	592	870	1331.5	1999	▁▁▇▆▃
registered_male	1	531.83	266.28	-999	321	479	729.5	1194	▁▁▃▇▃
registered_female	1	438.62	222.01	-999	268	394	600.5	991	▁▁▁▇▃
intervention	1	0.50	0.50	0	0	0	1.0	1	▇▁▁▁▇

skim() reports no missing values in our test data (complete_rate is 1 for every column), but this is misleading.

The registered_total, registered_male and registered_female variables each have a minimum of -999. These values are not genuine outliers: -999 is the placeholder that data entry operators entered where data was missing.

What is the structure of our variables?

str(df_test)

## tibble [6,991 × 8] (S3: tbl_df/tbl/data.frame)
##  $ town_id          : num [1:6991] 171 171 171 171 171 171 171 171 171 171 ...
##  $ turnout_total    : chr [1:6991] "203" "87" "215" "274" ...
##  $ turnout_male     : chr [1:6991] "107" "46" "110" "158" ...
##  $ turnout_female   : chr [1:6991] "96" "41" "105" "116" ...
##  $ registered_total : num [1:6991] 219 241 271 277 296 306 319 343 350 351 ...
##  $ registered_male  : num [1:6991] 116 125 139 160 166 171 162 188 206 207 ...
##  $ registered_female: num [1:6991] 103 116 132 117 130 135 157 155 144 144 ...
##  $ intervention     : num [1:6991] 1 1 1 0 0 1 1 1 0 0 ...

The turnout variables in our test data were read in as character. They need to be coerced to numeric for the analysis.

I also convert town_id to character, so that it is treated as a categorical identifier rather than a number, and intervention to a two-level factor.

df_test$turnout_female <- as.numeric(df_test$turnout_female)
df_test$turnout_male <- as.numeric(df_test$turnout_male)
df_test$turnout_total <- as.numeric(df_test$turnout_total)
df_test$intervention <- as.factor(df_test$intervention)
df_test$town_id <- as.character(df_test$town_id)
df_names$town_id <- as.character(df_names$town_id)

Next I replace the -999 placeholders in the registered voter variables with NA so that they do not skew my results.

df_test$registered_total[df_test$registered_total == -999] <- NA
df_test$registered_female[df_test$registered_female== -999] <- NA
df_test$registered_male[df_test$registered_male == -999] <- NA
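The same replacement can be applied to all three columns at once, with a count of the placeholder taken first so the number of affected rows is on record. This is a minimal base R sketch on made-up data; `toy` and `sentinel_cols` are illustrative names, not objects from the analysis.

```r
# Toy data standing in for df_test; the values are made up for illustration.
toy <- data.frame(
  registered_total  = c(219, -999, 271),
  registered_male   = c(116, -999, 139),
  registered_female = c(103, -999, 132)
)

sentinel_cols <- c("registered_total", "registered_male", "registered_female")

# Count the -999 placeholder per column before changing anything.
n_sentinel <- colSums(toy[sentinel_cols] == -999)

# Replace -999 with NA in each registered_* column.
toy[sentinel_cols] <- lapply(toy[sentinel_cols],
                             function(x) replace(x, x == -999, NA))

n_sentinel
colSums(is.na(toy[sentinel_cols]))
```

If the two counts agree, every placeholder was converted and no other value was touched.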

Reexamining our data types

str(df_test)

## tibble [6,991 × 8] (S3: tbl_df/tbl/data.frame)
##  $ town_id          : chr [1:6991] "171" "171" "171" "171" ...
##  $ turnout_total    : num [1:6991] 203 87 215 274 225 281 279 320 348 328 ...
##  $ turnout_male     : num [1:6991] 107 46 110 158 126 156 141 176 205 195 ...
##  $ turnout_female   : num [1:6991] 96 41 105 116 99 125 138 144 143 133 ...
##  $ registered_total : num [1:6991] 219 241 271 277 296 306 319 343 350 351 ...
##  $ registered_male  : num [1:6991] 116 125 139 160 166 171 162 188 206 207 ...
##  $ registered_female: num [1:6991] 103 116 132 117 130 135 157 155 144 144 ...
##  $ intervention     : Factor w/ 2 levels "0","1": 2 2 2 1 1 2 2 2 1 1 ...

Data types are in the right format

Examining missing values

skim(df_test)

Data summary
Name	df_test
Number of rows	6991
Number of columns	8
_______________________
Column type frequency:
character	1
factor	1
numeric	6
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
town_id	0	1	3	3	0	28	0

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
intervention	0	1	FALSE	2	0: 3510, 1: 3481

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
turnout_total	2	1	465.31	155.27	0	352	456	564	1675	▂▇▁▁▁
turnout_male	2	1	253.47	86.57	0	190	247	308	999	▃▇▁▁▁
turnout_female	2	1	211.86	71.47	0	161	207	257	884	▅▇▁▁▁
registered_total	22	1	979.82	456.87	204	593	873	1333	1999	▆▇▅▅▃
registered_male	22	1	536.67	252.40	106	322	481	730	1194	▆▇▆▅▂
registered_female	22	1	443.16	207.12	88	269	395	601	991	▆▇▅▅▂

Each turnout variable now has 2 missing values and each registered voter variable has 22.

sum(is.na(df_test))

## [1] 72

There are a total of 72 missing values.

Coercing the turnout variables to numeric also exposed non-numeric entries, which as.numeric() converts to NA. These account for the 2 missing values in each turnout variable.
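To see which raw entries fail to parse, the check has to run on the character column before it is overwritten. The sketch below uses made-up values; the string "abc" is a stand-in, since the actual non-numeric entries in the file are not shown here.

```r
# Made-up raw values; "abc" stands in for a non-numeric entry in the file.
raw <- c("203", "87", "abc", "274")

# as.numeric() warns and returns NA for entries it cannot parse.
parsed <- suppressWarnings(as.numeric(raw))

# Entries that were present in the raw data but became NA after coercion.
bad <- raw[is.na(parsed) & !is.na(raw)]
bad
```

Inspecting `bad` shows whether the offending entries are typos that can be corrected or genuine gaps that should stay as NA.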

DATA ANALYSIS AND PRESENTATION OF RESULTS

Creating a polling booth variable to rank polling booths within each town by turnout.

df_test <- df_test %>% group_by(town_id) %>%
                       arrange(turnout_total,.by_group = FALSE) %>% 
                       mutate("polling_booth_id" = row_number()) %>% 
                       arrange(town_id,polling_booth_id)

I inspect town_id 172 to check that the ranking worked correctly.

df_test %>%
  select(town_id,turnout_total,polling_booth_id) %>% 
  filter(town_id == "172") %>%
  arrange(polling_booth_id)  %>%
  head(10) %>%
  kbl() %>% 
  kable_styling()

town_id	turnout_total	polling_booth_id
172	185	1
172	185	2
172	191	3
172	194	4
172	208	5
172	211	6
172	227	7
172	233	8
172	235	9
172	235	10
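The table shows that tied turnout values (185 and 185, 235 and 235) still receive distinct consecutive ids. A base R sketch of the same within-town numbering, on made-up data with hypothetical town labels "A" and "B", behaves the same way:

```r
# Made-up turnout values for two hypothetical towns.
toy <- data.frame(
  town_id       = c("A", "A", "A", "B", "B"),
  turnout_total = c(191, 185, 185, 300, 250)
)

# Sort by town, then by turnout; ties keep their input order.
toy <- toy[order(toy$town_id, toy$turnout_total), ]

# Number the rows 1..n within each town.
toy$polling_booth_id <- ave(seq_len(nrow(toy)), toy$town_id, FUN = seq_along)
toy
```

Because ties are broken by row order, the id is a convenient label within a town rather than a stable identifier for a physical polling booth.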

Creating dummy variables for town_id

dummy_data <- df_test %>%
              select(town_id,turnout_total,intervention,registered_total) %>%
              fastDummies::dummy_cols(select_columns = "town_id")

Appending Town Names into the main dataset and dummy_data

df_test <- inner_join(df_test,df_names, 
                      by = join_by("town_id"),
                      keep = FALSE)

dummy_data <- inner_join(dummy_data,df_names, 
                         by = join_by("town_id"),
                         keep = FALSE)

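The merge excluded 21 observations, meaning that no matching town name was found for them. This is consistent with skim() reporting 28 distinct town_id values for a trial run in 27 towns. To see which town in df_names has no records in df_test, I use an anti join with df_names as the first argument (x) and df_test as the second (y). Note that this returns the unmatched row of df_names; it does not show the town_id of the 21 excluded observations.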

df_na <- anti_join(df_names, df_test, 
                   by="town_name")
df_na

## # A tibble: 1 × 2
##   town_id town_name
##   <chr>   <chr>    
## 1 250     Patna

The town in df_names that has no matching records in our df_test data is 250 (Patna).
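A mismatch can be checked in both directions by comparing the key columns directly. The sketch below uses hypothetical ids: "999" is made up to represent an id present only in the booth data, because the actual unmatched town_id is not shown in this report. On the real data the comparison would need to run before the inner join overwrites df_test.

```r
# Hypothetical keys: "999" appears only in the booth data, "250" only in the names.
test_ids  <- c("171", "172", "999")
names_ids <- c("171", "172", "250")

# Ids in the booth data with no town name (an inner join drops their rows).
in_test_only <- setdiff(test_ids, names_ids)

# Ids in the names table with no booth rows.
in_names_only <- setdiff(names_ids, test_ids)

in_test_only
in_names_only
```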

Labeling the values of the intervention variable (0 as No_intervention and 1 as Intervention) in a new variable, ‘intervention_status’.

df_test <- df_test %>% mutate("intervention_status" =
                       recode(intervention, "0" = "No_intervention", 
                                            "1" = "Intervention"))

FIXED EFFECTS REGRESSION ANALYSIS AND INTERPRETATION OF RESULTS

Regressing turnout_total on intervention

OLS Model

df_ols_model <- lm(turnout_total ~ intervention, data = df_test)

summary(df_ols_model)

## 
## Call:
## lm(formula = turnout_total ~ intervention, data = df_test)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -467.65 -112.65   -8.65   99.65  576.35 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    461.254      2.597 177.627   <2e-16 ***
## intervention1    7.396      3.682   2.008   0.0446 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 153.7 on 6968 degrees of freedom
## Multiple R-squared:  0.0005785,  Adjusted R-squared:  0.0004351 
## F-statistic: 4.034 on 1 and 6968 DF,  p-value: 0.04464

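A basic OLS regression shows a positive effect of the intervention on voter turnout: polling booths that received it recorded about 7.4 more votes on average (p = 0.045). The R-squared is very low (0.0006), so intervention status alone explains almost none of the variation in turnout.

However, I have not yet accounted for any other variables that may influence the dependent variable.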
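With a single binary regressor, the OLS slope is the difference between the two group means, which is why this model is the natural starting point for an RCT. A small sketch on made-up numbers illustrates the equivalence:

```r
# Made-up outcome and treatment indicator, for illustration only.
turnout      <- c(450, 470, 455, 480, 462, 490)
intervention <- factor(c(0, 1, 0, 1, 0, 1))

fit <- lm(turnout ~ intervention)

# Difference between the treated and control group means.
diff_means <- mean(turnout[intervention == "1"]) -
              mean(turnout[intervention == "0"])

coef(fit)[["intervention1"]]
diff_means
```

The two printed values agree, so the coefficient of 7.396 above can be read as the raw difference in average turnout between intervention and control booths.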

Regressing total turnout on intervention, with town fixed-effects

Dummy variable estimator

dummy_model <- lm(turnout_total ~ intervention + factor(town_id), 
                  data = df_test)

summary(dummy_model)

## 
## Call:
## lm(formula = turnout_total ~ intervention + factor(town_id), 
##     data = df_test)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -481.62 -106.13   -4.95   95.62  540.55 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         392.550      9.157  42.871  < 2e-16 ***
## intervention1         7.376      3.538   2.085 0.037099 *  
## factor(town_id)172   74.602     12.211   6.109 1.05e-09 ***
## factor(town_id)173    6.148     12.921   0.476 0.634232    
## factor(town_id)175   41.436     12.509   3.313 0.000929 ***
## factor(town_id)176   45.022     12.663   3.556 0.000380 ***
## factor(town_id)177   34.251     12.606   2.717 0.006603 ** 
## factor(town_id)190   56.513     12.882   4.387 1.17e-05 ***
## factor(town_id)191   40.114     12.368   3.243 0.001187 ** 
## factor(town_id)192  113.496     13.446   8.441  < 2e-16 ***
## factor(town_id)193   93.000     13.297   6.994 2.92e-12 ***
## factor(town_id)195   14.289     13.329   1.072 0.283756    
## factor(town_id)196  119.049     12.988   9.166  < 2e-16 ***
## factor(town_id)197  135.464     13.029  10.397  < 2e-16 ***
## factor(town_id)198   72.950     13.144   5.550 2.96e-08 ***
## factor(town_id)199  113.653     13.001   8.741  < 2e-16 ***
## factor(town_id)200  179.428     13.378  13.412  < 2e-16 ***
## factor(town_id)201  120.578     13.086   9.214  < 2e-16 ***
## factor(town_id)208   94.472     12.447   7.590 3.62e-14 ***
## factor(town_id)210   82.693     12.947   6.387 1.80e-10 ***
## factor(town_id)211   91.664     13.115   6.989 3.02e-12 ***
## factor(town_id)212  134.048     13.204  10.152  < 2e-16 ***
## factor(town_id)213   44.234     12.488   3.542 0.000400 ***
## factor(town_id)235   29.480     12.908   2.284 0.022408 *  
## factor(town_id)236   25.523     12.437   2.052 0.040185 *  
## factor(town_id)237   61.481     12.831   4.792 1.69e-06 ***
## factor(town_id)238   15.642     13.057   1.198 0.230961    
## factor(town_id)239   68.592     12.831   5.346 9.28e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 147.7 on 6942 degrees of freedom
## Multiple R-squared:  0.08112,    Adjusted R-squared:  0.07755 
## F-statistic:  22.7 on 27 and 6942 DF,  p-value: < 2.2e-16

Using town dummy variables, we again observe a positive effect on voter turnout where there was an intervention (about 7.4 votes, p = 0.037).

This model uses town_id 171 as the base group. The coefficients on the town dummy variables are interpreted relative to it: for example, average turnout per polling booth in town_id 172 is 74.602 votes higher than in town_id 171, holding intervention status constant.

However, the coefficients on the dummy variables tell us nothing about our variable of interest, which is the intervention. Rather, they help us understand how turnout in the different towns compares with our base group, town 171.

Regressing total turnout on intervention, with town fixed-effects and controlling for the total number of registered voters at each polling station

Within estimator

Here I am going to explicitly use a fixed effects model.

Using the fixed effects model, I also want to control for the total number of registered voters, a booth-level variable that may affect turnout and is not under direct experimental control.

df_model <- plm(turnout_total ~ intervention + registered_total, 
                index = "town_id", 
                model = "within", 
                data = df_test)

summary(df_model)

## Oneway (individual) effect Within Model
## 
## Call:
## plm(formula = turnout_total ~ intervention + registered_total, 
##     data = df_test, model = "within", index = "town_id")
## 
## Unbalanced Panel: n = 27, T = 218-319, N = 6948
## 
## Residuals:
##      Min.   1st Qu.    Median   3rd Qu.      Max. 
## -526.6604 -104.6209   -7.4254   95.2227  534.1849 
## 
## Coefficients:
##                   Estimate Std. Error t-value Pr(>|t|)    
## intervention1    8.4006438  3.4865148  2.4095    0.016 *  
## registered_total 0.0576584  0.0038244 15.0763   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    150960000
## Residual Sum of Squares: 146060000
## R-Squared:      0.032465
## Adj. R-Squared: 0.028549
## F-statistic: 116.079 on 2 and 6919 DF, p-value: < 2.22e-16

Here too I see a positive effect on voter turnout where there was an intervention.

Controlling for the total number of registered voters, the intervention coefficient is about 8.4 votes per polling booth and is still statistically significant at the 5% level (p = 0.016).

I also find that registered total has a significant effect on voter turnout: each additional registered voter is associated with about 0.058 more votes cast.
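The within estimator and the dummy variable approach give the same slope coefficients, because subtracting town means removes exactly what the town dummies absorb. The sketch below checks this on simulated data in base R. All names and numbers are made up, and the demeaned lm() is only a stand-in for what plm does with model = "within".

```r
set.seed(1)  # simulated data, for illustration only

town <- rep(c("A", "B", "C"), each = 20)
d    <- rep(c(0, 1), times = 30)        # treatment indicator
reg  <- runif(60, 200, 2000)            # registered voters
y    <- c(A = 400, B = 450, C = 500)[town] + 8 * d + 0.05 * reg +
        rnorm(60, sd = 20)

# Dummy variable fit: one intercept shift per town.
lsdv <- lm(y ~ d + reg + factor(town))

# Within fit: subtract town means from every variable, no intercept.
demean <- function(x) x - ave(x, town)
within_fit <- lm(demean(y) ~ demean(d) + demean(reg) - 1)

coef(lsdv)[c("d", "reg")]
coef(within_fit)
```

The point estimates match. The standard errors from the hand-demeaned lm() do not account for the estimated town means, so a dedicated fixed effects routine is preferable for inference.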

DATA VISUALIZATIONS TO SUPPORT REGRESSION ANALYSIS

Difference in female turnout between intervention and control polling booths

df_femint <- df_test %>% group_by(intervention_status) %>%
                summarize(female_turnout = sum(turnout_female)) %>%       
                mutate(intervention_status=fct_reorder(intervention_status,
                                                  female_turnout))

df_femint %>% ggplot(aes(intervention_status, female_turnout)) + 
              geom_col(fill = "black") + 
    ggtitle("Female Turnout for Intervention(1) and Control Groups(0)") + 
              ylab("Turn Out") + 
              xlab("Intervention Status") +
              theme_light() + 
              geom_text(aes(label = female_turnout), 
                        hjust = 1, 
                        color = "skyblue") + 
              scale_y_continuous(labels = scales::comma_format(),
                                 limits = c(0,800000)) + coord_flip()

Drilling down by town_id

What is the effect of intervention on female_turnout?

This summary gives, for each town, the net difference in total female turnout between polling booths that had an intervention and those that did not.

fem_int_effect <- df_test %>% group_by(intervention_status,town_id) %>%   
                    summarize(female_turnout = sum(turnout_female)) %>%
                    pivot_wider(names_from = "intervention_status", 
                                      values_from = "female_turnout") %>%  
                  mutate("intervention_effect" = Intervention - No_intervention)

fem_int_effect %>% arrange(desc(intervention_effect)) %>%
                   kbl() %>%
                   kable_styling()

town_id	No_intervention	Intervention	intervention_effect
199	26514	30661	4147
190	24952	28091	3139
208	31185	34062	2877
211	25396	27852	2456
235	23584	25930	2346
173	22067	24385	2318
201	27214	29412	2198
193	24249	26414	2165
200	28160	29975	1815
213	28494	30017	1523
212	27519	28880	1361
177	26535	27810	1275
239	26921	28042	1121
197	28942	30004	1062
236	27926	28770	844
176	27193	27936	743
195	20587	21321	734
237	26721	27274	553
192	25335	25387	52
238	22695	22641	-54
191	30362	29568	-794
175	29153	27987	-1166
196	29759	28175	-1584
198	26363	24361	-2002
171	25591	23367	-2224
172	35455	32943	-2512
210	28773	26108	-2665

Net effect on female turnout in each town when comparing polling booths where the intervention was carried out with those where it was not

fem_int_effect %>% ggplot(aes(intervention_effect,town_id)) +
                   geom_col(fill = "black") + 
                   ggtitle("Net effect in voter turnout") +
                   labs(subtitle = "Data for the different towns") +
                   theme_minimal() + ylab("Town ID") +
                   xlab("Intervention Effect") + 
                   scale_x_continuous(labels = scales::comma_format())

Towns with a negative net effect are those where the intervention booths did not record higher female turnout than the control booths.
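One caveat when reading these differences: a difference of sums depends on how many booths fall in each group, and the split within a town is only approximately half. The sketch below, on made-up numbers for a single hypothetical town, shows how the sum and the per-booth mean can point in opposite directions.

```r
# Made-up booths for one hypothetical town: three control, two treated.
toy <- data.frame(
  intervention   = c(0, 0, 0, 1, 1),
  turnout_female = c(200, 210, 190, 230, 220)
)

sums  <- tapply(toy$turnout_female, toy$intervention, sum)
means <- tapply(toy$turnout_female, toy$intervention, mean)

sum_diff  <- sums[["1"]]  - sums[["0"]]    # driven by the booth counts: -150
mean_diff <- means[["1"]] - means[["0"]]   # per-booth comparison: 25
```

Under these assumed numbers the treated booths have higher turnout per booth, yet the difference of sums is negative because there are fewer of them.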

I want to know how many towns benefited from increased voter turnout due to intervention.

I classify the intervention_effect variable as positive for numbers greater than 0 and negative for numbers less than 0.

fem_int_effect$sign_effect <- dplyr::case_when(
        fem_int_effect$intervention_effect < 0 ~ "Negative",  
        fem_int_effect$intervention_effect > 0 ~ "Positive")
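The classification can be checked by hand against the intervention_effect column in the table above. The sketch below is plain base R rather than the case_when() call used in the analysis, and it adds an explicit "Zero" branch, since a value of exactly 0 matches neither condition and would otherwise come out as NA.

```r
# intervention_effect values copied from the table above, in the same order.
effect <- c(4147, 3139, 2877, 2456, 2346, 2318, 2198, 2165, 1815, 1523,
            1361, 1275, 1121, 1062, 844, 743, 734, 553, 52,
            -54, -794, -1166, -1584, -2002, -2224, -2512, -2665)

# Three-way classification; "Zero" covers a net effect of exactly 0.
sign_effect <- ifelse(effect > 0, "Positive",
               ifelse(effect < 0, "Negative", "Zero"))

table(sign_effect)
```

This gives 19 towns with a positive net effect and 8 with a negative one, and no town falls into the "Zero" branch.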

In how many towns do we have a positive effect after intervention?

fem_int_effect %>% ggplot(aes(sign_effect)) +
                   geom_bar(fill = "black") + 
                   ggtitle("In how many towns do we have a positive effect after intervention?") +  
                   labs(subtitle = "Female Turnout") + 
                   theme_minimal() +
                   # The bar height is a count of towns, not a town id
                   ylab("Number of towns") + 
                   # after_stat(count) replaces the deprecated ..count.. notation
                   geom_text(aes(label = after_stat(count)), 
                             stat = "count",
                             hjust = 1.5, colour = "white") + 
                   xlab("Intervention Effect") +
                   scale_y_continuous(limits = c(0,20)) + 
                   coord_flip()

CONCLUSION

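The regressions show a positive and statistically significant effect of the intervention on total turnout: about 7 to 8 additional votes per polling booth, significant at the 5% level in all three models. The town-level comparison of female turnout points the same way, with 19 of the 27 towns showing a positive net effect and 8 a negative one. A count of towns is not in itself a test of significance, so my conclusion that intervention has a significant effect on voter turnout rests on the regression results.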
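To give a sense of the uncertainty around the estimate, an approximate 95% confidence interval can be computed from the coefficient and standard error reported in the plm summary above. This sketch uses a normal approximation, which is an assumption made here for simplicity; it is reasonable given the roughly 6,900 residual degrees of freedom.

```r
# Estimate and standard error copied from the plm summary above.
estimate  <- 8.4006438
std_error <- 3.4865148

# Normal-approximation 95% interval.
ci <- estimate + c(-1, 1) * qnorm(0.975) * std_error
round(ci, 2)
```

The interval runs from roughly 1.6 to 15.2 votes per polling booth. It excludes zero, but it is wide relative to the point estimate.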

FIXED EFFECTS REGRESSION - ELECTION TURNOUT INTERVENTION DATA