INTRODUCTION

The data used in this analysis is from a Randomized Control Trial aimed at increasing voter turnout.

The RCT was conducted in 27 towns, with approximately half of the polling booths in each town randomly selected for intervention (the value “1” in the data set represents where an intervention was made). The outcomes of interest were total turnout (the number of votes cast at each polling booth) and female turnout (the number of votes cast by women at each polling booth). Data was also collected on the number of registered voters at each polling booth, disaggregated by gender, but for some polling booths this data could not be obtained, and so data entry operators entered “-999” whenever there was missing data. (The intervention is meant to increase voter turnout)

I want to perform a fixed effects regression to determine if there is a significant effect of intervention on voter turnout.

I am using a fixed effects regression because I want to control for confounding variables such as registered total which may have an effect on turnout but are not being directly investigated.

I also want to factor the town id to understand the performance of intervention efforts within the different towns.

DATA PREPARATION

Clearing object environment

rm(list = ls())

Loading required packages

library(rmarkdown)
library(readxl)
library(skimr)
library(dplyr)
library(tidyverse)
library(fastDummies) # Needed for generating our dummy variables
library(janitor) # Cleaning
library(plm)
library(stats)
library(ggplot2)
library(forcats)
library(scales)
library(kableExtra)

Loading our datasets

df_test <- read_excel("Test Data.xlsx") %>% clean_names

df_names <- read_excel("Town Names.xlsx") %>% clean_names

DATA QUALITY CHECK

Using skimr package to do a data quality check

skim(df_test)
Data summary
Name df_test
Number of rows 6991
Number of columns 8
_______________________
Column type frequency:
character 3
numeric 5
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
turnout_total 0 1 1 4 0 765 0
turnout_male 0 1 1 3 0 471 0
turnout_female 0 1 1 3 0 402 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
town_id 0 1 200.99 21.33 171 190 198 212.0 239 ▇▇▆▂▆
registered_total 0 1 973.60 469.42 -999 592 870 1331.5 1999 ▁▁▇▆▃
registered_male 0 1 531.83 266.28 -999 321 479 729.5 1194 ▁▁▃▇▃
registered_female 0 1 438.62 222.01 -999 268 394 600.5 991 ▁▁▁▇▃
intervention 0 1 0.50 0.50 0 0 0 1.0 1 ▇▁▁▁▇

There are no missing values in our test data.

There are some outliers for registered_total,registered_male and registered_female variables. These outliers are -999 and were included where data entry operators found missing data.

What is the structure of our variables?

str(df_test)
## tibble [6,991 × 8] (S3: tbl_df/tbl/data.frame)
##  $ town_id          : num [1:6991] 171 171 171 171 171 171 171 171 171 171 ...
##  $ turnout_total    : chr [1:6991] "203" "87" "215" "274" ...
##  $ turnout_male     : chr [1:6991] "107" "46" "110" "158" ...
##  $ turnout_female   : chr [1:6991] "96" "41" "105" "116" ...
##  $ registered_total : num [1:6991] 219 241 271 277 296 306 319 343 350 351 ...
##  $ registered_male  : num [1:6991] 116 125 139 160 166 171 162 188 206 207 ...
##  $ registered_female: num [1:6991] 103 116 132 117 130 135 157 155 144 144 ...
##  $ intervention     : num [1:6991] 1 1 1 0 0 1 1 1 0 0 ...

The turnout values for our test data are in character type. They should be coerced to numeric for the purpose of analysis.

I have also made changes to town_id which should be a factor and intervention which I have converted to a binary factor.

df_test$turnout_female <- as.numeric(df_test$turnout_female)
df_test$turnout_male <- as.numeric(df_test$turnout_male)
df_test$turnout_total <- as.numeric(df_test$turnout_total)
df_test$intervention <- as.factor(df_test$intervention)
df_test$town_id <- as.character(df_test$town_id)
df_names$town_id <- as.character(df_names$town_id)

I am going to replace these values with NAs so that results for my data are not skewed.

df_test$registered_total[df_test$registered_total == -999] <- NA
df_test$registered_female[df_test$registered_female== -999] <- NA
df_test$registered_male[df_test$registered_male == -999] <- NA

Reexamining our data types

str(df_test) 
## tibble [6,991 × 8] (S3: tbl_df/tbl/data.frame)
##  $ town_id          : chr [1:6991] "171" "171" "171" "171" ...
##  $ turnout_total    : num [1:6991] 203 87 215 274 225 281 279 320 348 328 ...
##  $ turnout_male     : num [1:6991] 107 46 110 158 126 156 141 176 205 195 ...
##  $ turnout_female   : num [1:6991] 96 41 105 116 99 125 138 144 143 133 ...
##  $ registered_total : num [1:6991] 219 241 271 277 296 306 319 343 350 351 ...
##  $ registered_male  : num [1:6991] 116 125 139 160 166 171 162 188 206 207 ...
##  $ registered_female: num [1:6991] 103 116 132 117 130 135 157 155 144 144 ...
##  $ intervention     : Factor w/ 2 levels "0","1": 2 2 2 1 1 2 2 2 1 1 ...

Data types are in the right format

Examining missing values

skim(df_test)
Data summary
Name df_test
Number of rows 6991
Number of columns 8
_______________________
Column type frequency:
character 1
factor 1
numeric 6
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
town_id 0 1 3 3 0 28 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
intervention 0 1 FALSE 2 0: 3510, 1: 3481

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
turnout_total 2 1 465.31 155.27 0 352 456 564 1675 ▂▇▁▁▁
turnout_male 2 1 253.47 86.57 0 190 247 308 999 ▃▇▁▁▁
turnout_female 2 1 211.86 71.47 0 161 207 257 884 ▅▇▁▁▁
registered_total 22 1 979.82 456.87 204 593 873 1333 1999 ▆▇▅▅▃
registered_male 22 1 536.67 252.40 106 322 481 730 1194 ▆▇▆▅▂
registered_female 22 1 443.16 207.12 88 269 395 601 991 ▆▇▅▅▂

My data types are in the right format

sum(is.na(df_test))
## [1] 72

There a total of 72 null values

The coercion of my data types has identified non-numeric values in the turnout variables which are represented as NAs after coercion.

DATA ANALYSIS AND PRESENTATION OF RESULTS

Creating a polling booth variable to rank polling booths within each town by turnout.

df_test <- df_test %>% group_by(town_id) %>%
                       arrange(turnout_total,.by_group = FALSE) %>% 
                       mutate("polling_booth_id" = row_number()) %>% 
                       arrange(town_id,polling_booth_id)

I want to sample town_id 172 to identify if the ranking worked correctly

df_test %>%
  select(town_id,turnout_total,polling_booth_id) %>% 
  filter(town_id == "172") %>%
  arrange(polling_booth_id)  %>%
  head(10) %>%
  kbl() %>% 
  kable_styling()
town_id turnout_total polling_booth_id
172 185 1
172 185 2
172 191 3
172 194 4
172 208 5
172 211 6
172 227 7
172 233 8
172 235 9
172 235 10

Creating dummy variables for town_id

dummy_data <- df_test %>%
              select(town_id,turnout_total,intervention,registered_total) %>%
              fastDummies::dummy_cols(select_columns = "town_id")

Appending Town Names into the main dataset and dummy_data

df_test <- inner_join(df_test,df_names, 
                      by = join_by("town_id"),
                      keep = FALSE)

dummy_data <- inner_join(dummy_data,df_names, 
                         by = join_by("town_id"),
                         keep = FALSE)

The merging has led to 21 observations being excluded meaning no matching town name was found. To use the anti join to obtain the missing town, I change the order of arguments x and y within the function.

df_na <- anti_join(df_names, df_test, 
                   by="town_name")
df_na
## # A tibble: 1 × 2
##   town_id town_name
##   <chr>   <chr>    
## 1 250     Patna

The town that does not have matching records in our df_test data is 250 - Patna.

Labeling values for intervention variable - 0 as no intervention and 1 as intervention and introducing a new variable ‘intervention_status’.

df_test <- df_test %>% mutate("intervention_status" =
                       recode(intervention, "0" = "No_intervention", 
                                            "1" = "Intervention"))

FIXED EFFECTS REGRESSION ANALYSIS AND INTERPRETATION OF RESULTS

Regressing total_turnout on intervention

OLS Model

df_ols_model <- lm(turnout_total ~ intervention, data = df_test)

summary(df_ols_model)
## 
## Call:
## lm(formula = turnout_total ~ intervention, data = df_test)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -467.65 -112.65   -8.65   99.65  576.35 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    461.254      2.597 177.627   <2e-16 ***
## intervention1    7.396      3.682   2.008   0.0446 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 153.7 on 6968 degrees of freedom
## Multiple R-squared:  0.0005785,  Adjusted R-squared:  0.0004351 
## F-statistic: 4.034 on 1 and 6968 DF,  p-value: 0.04464

A basic ols reveals a positive impact of intervention on voter turnout.

However, I have not factored in any other variables that may influence the dependent variable.

Regressing total turnout on intervention, with town fixed-effects

Dummy Variable Simulator

dummy_model <- lm(turnout_total ~ intervention + factor(town_id), 
                  data = df_test)

summary(dummy_model)
## 
## Call:
## lm(formula = turnout_total ~ intervention + factor(town_id), 
##     data = df_test)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -481.62 -106.13   -4.95   95.62  540.55 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         392.550      9.157  42.871  < 2e-16 ***
## intervention1         7.376      3.538   2.085 0.037099 *  
## factor(town_id)172   74.602     12.211   6.109 1.05e-09 ***
## factor(town_id)173    6.148     12.921   0.476 0.634232    
## factor(town_id)175   41.436     12.509   3.313 0.000929 ***
## factor(town_id)176   45.022     12.663   3.556 0.000380 ***
## factor(town_id)177   34.251     12.606   2.717 0.006603 ** 
## factor(town_id)190   56.513     12.882   4.387 1.17e-05 ***
## factor(town_id)191   40.114     12.368   3.243 0.001187 ** 
## factor(town_id)192  113.496     13.446   8.441  < 2e-16 ***
## factor(town_id)193   93.000     13.297   6.994 2.92e-12 ***
## factor(town_id)195   14.289     13.329   1.072 0.283756    
## factor(town_id)196  119.049     12.988   9.166  < 2e-16 ***
## factor(town_id)197  135.464     13.029  10.397  < 2e-16 ***
## factor(town_id)198   72.950     13.144   5.550 2.96e-08 ***
## factor(town_id)199  113.653     13.001   8.741  < 2e-16 ***
## factor(town_id)200  179.428     13.378  13.412  < 2e-16 ***
## factor(town_id)201  120.578     13.086   9.214  < 2e-16 ***
## factor(town_id)208   94.472     12.447   7.590 3.62e-14 ***
## factor(town_id)210   82.693     12.947   6.387 1.80e-10 ***
## factor(town_id)211   91.664     13.115   6.989 3.02e-12 ***
## factor(town_id)212  134.048     13.204  10.152  < 2e-16 ***
## factor(town_id)213   44.234     12.488   3.542 0.000400 ***
## factor(town_id)235   29.480     12.908   2.284 0.022408 *  
## factor(town_id)236   25.523     12.437   2.052 0.040185 *  
## factor(town_id)237   61.481     12.831   4.792 1.69e-06 ***
## factor(town_id)238   15.642     13.057   1.198 0.230961    
## factor(town_id)239   68.592     12.831   5.346 9.28e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 147.7 on 6942 degrees of freedom
## Multiple R-squared:  0.08112,    Adjusted R-squared:  0.07755 
## F-statistic:  22.7 on 27 and 6942 DF,  p-value: < 2.2e-16

Using town dummy variables simulator we observe a positive impact on voter turnout where there was an intervention

This model uses town_id 171 as the base group. We can then interpret the coefficients for our town dummy variables as follows: i.e. town_id_172 has a higher turnout than town_id_171 by 74.602

However the coefficients for our dummy variables don’t give us any inference for our variable of interest which is voter turnout. Rather it helps us understand the performance of the different towns with reference to our base group which is town_171.

Regressing total turnout on intervention, with town fixed-effects and controlling for the total number of registered voters at each polling station

Within estimator

Here I am going to explicitly use a fixed effects model.

Using the fixed effects model I want to control for the registered voter total which is a confounding variable that is not under direct experimental control.

df_model <- plm(turnout_total ~ intervention + registered_total, 
                index = "town_id", 
                model = "within", 
                data = df_test)

summary(df_model)
## Oneway (individual) effect Within Model
## 
## Call:
## plm(formula = turnout_total ~ intervention + registered_total, 
##     data = df_test, model = "within", index = "town_id")
## 
## Unbalanced Panel: n = 27, T = 218-319, N = 6948
## 
## Residuals:
##      Min.   1st Qu.    Median   3rd Qu.      Max. 
## -526.6604 -104.6209   -7.4254   95.2227  534.1849 
## 
## Coefficients:
##                   Estimate Std. Error t-value Pr(>|t|)    
## intervention1    8.4006438  3.4865148  2.4095    0.016 *  
## registered_total 0.0576584  0.0038244 15.0763   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    150960000
## Residual Sum of Squares: 146060000
## R-Squared:      0.032465
## Adj. R-Squared: 0.028549
## F-statistic: 116.079 on 2 and 6919 DF, p-value: < 2.22e-16

Here I also see a positive impact on voter turnout where there was intervention.

Controlling for the registered total number of voters we still have a positive effect on voter turnout where there was an intervention. The intervention variable is still statistically significant.

I also find that registered total has a significant effect on voter turnout.

DATA VISUALIZATIONS TO SUPPORT REGRESSION ANALYSIS

Difference in female turnout between intervention and control polling booths

df_femint <- df_test %>% group_by(intervention_status) %>%
                summarize(female_turnout = sum(turnout_female)) %>%       
                mutate(intervention_status=fct_reorder(intervention_status,
                                                  female_turnout))
df_femint %>% ggplot(aes(intervention_status, female_turnout)) + 
              geom_col(fill = "black") + 
    ggtitle("Female Turnout for Intervention(1) and Control Groups(0)") + 
              ylab("Turn Out") + 
              xlab("Intervention Status") +
              theme_light() + 
              geom_text(aes(label = female_turnout), 
                        hjust = 1, 
                        color = "skyblue") + 
              scale_y_continuous(labels = scales::comma_format(),
                                 limits = c(0,800000)) + coord_flip()

Intervention status If we drill down by town_id

What is the effect of intervention on female_turnout?

This summary gives a net effect of turnout total for towns that had an intervention and those that did not.

fem_int_effect <- df_test %>% group_by(intervention_status,town_id) %>%   
                    summarize(female_turnout = sum(turnout_female)) %>%
                    pivot_wider(names_from = "intervention_status", 
                                      values_from = "female_turnout") %>%  
                  mutate("intervention_effect" = Intervention - No_intervention)

fem_int_effect %>% arrange(desc(intervention_effect)) %>%
                   kbl() %>%
                   kable_styling()
town_id No_intervention Intervention intervention_effect
199 26514 30661 4147
190 24952 28091 3139
208 31185 34062 2877
211 25396 27852 2456
235 23584 25930 2346
173 22067 24385 2318
201 27214 29412 2198
193 24249 26414 2165
200 28160 29975 1815
213 28494 30017 1523
212 27519 28880 1361
177 26535 27810 1275
239 26921 28042 1121
197 28942 30004 1062
236 27926 28770 844
176 27193 27936 743
195 20587 21321 734
237 26721 27274 553
192 25335 25387 52
238 22695 22641 -54
191 30362 29568 -794
175 29153 27987 -1166
196 29759 28175 -1584
198 26363 24361 -2002
171 25591 23367 -2224
172 35455 32943 -2512
210 28773 26108 -2665

Net effect in voter turnout when comparing towns on which an intervention was was effected to those with no intervention

fem_int_effect %>% ggplot(aes(intervention_effect,town_id)) +
                   geom_col(fill = "black") + 
                   ggtitle("Net effect in voter turnout") +
                   labs(subtitle = "Data for the different towns") +
                   theme_minimal() + ylab("Town ID") +
                   xlab("Intervention Effect") + 
                   scale_x_continuous(labels = scales::comma_format())

Towns with a net negative effect indicate that intervention did not yield an increased voter turnout.

I want to know how many towns benefited from increased voter turnout due to intervention.

I classify the intervention_effect variable as positive for numbers greater than 0 and negative for numbers less than 0.

fem_int_effect$sign_effect <- dplyr::case_when(
        fem_int_effect$intervention_effect < 0 ~ "Negative",  
        fem_int_effect$intervention_effect > 0 ~ "Positive")

In how many towns do we have a positive effect after intervention?

fem_int_effect %>% ggplot(aes(sign_effect)) +
                   geom_bar(fill = "black") + 
ggtitle("In how many towns do we have a positive effect after intervention?") +  
                   labs(subtitle = "Female Turnout") + 
                   theme_minimal() +
                   ylab("Town ID") + 
                   geom_text(aes(label = ..count..), 
                   stat = "count",
                   hjust = 1.5, colour = "white") + 
                   xlab("Intervention Effect") +
                   scale_y_continuous(limits = c(0,20)) + 
                   coord_flip()

CONCLUSION

There were more towns with a positive effect from intervention compared to those with a negative effect. Hence, I can go on to conclude that Intervention has a significant effect on voter turnout.