The data used in this analysis is from a Randomized Control Trial aimed at increasing voter turnout.
The RCT was conducted in 27 towns, with approximately half of the polling booths in each town randomly selected for intervention (the value “1” in the data set represents where an intervention was made). The outcomes of interest were total turnout (the number of votes cast at each polling booth) and female turnout (the number of votes cast by women at each polling booth). Data was also collected on the number of registered voters at each polling booth, disaggregated by gender, but for some polling booths this data could not be obtained, and so data entry operators entered “-999” whenever there was missing data. (The intervention is meant to increase voter turnout)
I want to perform a fixed effects regression to determine if there is a significant effect of intervention on voter turnout.
I am using a fixed effects regression because I want to control for confounding variables such as registered total which may have an effect on turnout but are not being directly investigated.
I also want to factor the town id to understand the performance of intervention efforts within the different towns.
rm(list = ls())
library(rmarkdown)
library(readxl)
library(skimr)
library(dplyr)
library(tidyverse)
library(fastDummies) # Needed for generating our dummy variables
library(janitor) # Cleaning
library(plm)
library(stats)
library(ggplot2)
library(forcats)
library(scales)
library(kableExtra)
df_test <- read_excel("Test Data.xlsx") %>% clean_names
df_names <- read_excel("Town Names.xlsx") %>% clean_names
skim(df_test)
| Name | df_test |
| Number of rows | 6991 |
| Number of columns | 8 |
| _______________________ | |
| Column type frequency: | |
| character | 3 |
| numeric | 5 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| turnout_total | 0 | 1 | 1 | 4 | 0 | 765 | 0 |
| turnout_male | 0 | 1 | 1 | 3 | 0 | 471 | 0 |
| turnout_female | 0 | 1 | 1 | 3 | 0 | 402 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| town_id | 0 | 1 | 200.99 | 21.33 | 171 | 190 | 198 | 212.0 | 239 | ▇▇▆▂▆ |
| registered_total | 0 | 1 | 973.60 | 469.42 | -999 | 592 | 870 | 1331.5 | 1999 | ▁▁▇▆▃ |
| registered_male | 0 | 1 | 531.83 | 266.28 | -999 | 321 | 479 | 729.5 | 1194 | ▁▁▃▇▃ |
| registered_female | 0 | 1 | 438.62 | 222.01 | -999 | 268 | 394 | 600.5 | 991 | ▁▁▁▇▃ |
| intervention | 0 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 1.0 | 1 | ▇▁▁▁▇ |
There are no missing values in our test data.
There are some outliers for registered_total,registered_male and registered_female variables. These outliers are -999 and were included where data entry operators found missing data.
str(df_test)
## tibble [6,991 × 8] (S3: tbl_df/tbl/data.frame)
## $ town_id : num [1:6991] 171 171 171 171 171 171 171 171 171 171 ...
## $ turnout_total : chr [1:6991] "203" "87" "215" "274" ...
## $ turnout_male : chr [1:6991] "107" "46" "110" "158" ...
## $ turnout_female : chr [1:6991] "96" "41" "105" "116" ...
## $ registered_total : num [1:6991] 219 241 271 277 296 306 319 343 350 351 ...
## $ registered_male : num [1:6991] 116 125 139 160 166 171 162 188 206 207 ...
## $ registered_female: num [1:6991] 103 116 132 117 130 135 157 155 144 144 ...
## $ intervention : num [1:6991] 1 1 1 0 0 1 1 1 0 0 ...
The turnout values for our test data are in character type. They should be coerced to numeric for the purpose of analysis.
I have also made changes to town_id which should be a factor and intervention which I have converted to a binary factor.
df_test$turnout_female <- as.numeric(df_test$turnout_female)
df_test$turnout_male <- as.numeric(df_test$turnout_male)
df_test$turnout_total <- as.numeric(df_test$turnout_total)
df_test$intervention <- as.factor(df_test$intervention)
df_test$town_id <- as.character(df_test$town_id)
df_names$town_id <- as.character(df_names$town_id)
I am going to replace these values with NAs so that results for my data are not skewed.
df_test$registered_total[df_test$registered_total == -999] <- NA
df_test$registered_female[df_test$registered_female== -999] <- NA
df_test$registered_male[df_test$registered_male == -999] <- NA
Reexamining our data types
str(df_test)
## tibble [6,991 × 8] (S3: tbl_df/tbl/data.frame)
## $ town_id : chr [1:6991] "171" "171" "171" "171" ...
## $ turnout_total : num [1:6991] 203 87 215 274 225 281 279 320 348 328 ...
## $ turnout_male : num [1:6991] 107 46 110 158 126 156 141 176 205 195 ...
## $ turnout_female : num [1:6991] 96 41 105 116 99 125 138 144 143 133 ...
## $ registered_total : num [1:6991] 219 241 271 277 296 306 319 343 350 351 ...
## $ registered_male : num [1:6991] 116 125 139 160 166 171 162 188 206 207 ...
## $ registered_female: num [1:6991] 103 116 132 117 130 135 157 155 144 144 ...
## $ intervention : Factor w/ 2 levels "0","1": 2 2 2 1 1 2 2 2 1 1 ...
Data types are in the right format
skim(df_test)
| Name | df_test |
| Number of rows | 6991 |
| Number of columns | 8 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| factor | 1 |
| numeric | 6 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| town_id | 0 | 1 | 3 | 3 | 0 | 28 | 0 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| intervention | 0 | 1 | FALSE | 2 | 0: 3510, 1: 3481 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| turnout_total | 2 | 1 | 465.31 | 155.27 | 0 | 352 | 456 | 564 | 1675 | ▂▇▁▁▁ |
| turnout_male | 2 | 1 | 253.47 | 86.57 | 0 | 190 | 247 | 308 | 999 | ▃▇▁▁▁ |
| turnout_female | 2 | 1 | 211.86 | 71.47 | 0 | 161 | 207 | 257 | 884 | ▅▇▁▁▁ |
| registered_total | 22 | 1 | 979.82 | 456.87 | 204 | 593 | 873 | 1333 | 1999 | ▆▇▅▅▃ |
| registered_male | 22 | 1 | 536.67 | 252.40 | 106 | 322 | 481 | 730 | 1194 | ▆▇▆▅▂ |
| registered_female | 22 | 1 | 443.16 | 207.12 | 88 | 269 | 395 | 601 | 991 | ▆▇▅▅▂ |
My data types are in the right format
sum(is.na(df_test))
## [1] 72
There a total of 72 null values
The coercion of my data types has identified non-numeric values in the turnout variables which are represented as NAs after coercion.
df_test <- df_test %>% group_by(town_id) %>%
arrange(turnout_total,.by_group = FALSE) %>%
mutate("polling_booth_id" = row_number()) %>%
arrange(town_id,polling_booth_id)
I want to sample town_id 172 to identify if the ranking worked correctly
df_test %>%
select(town_id,turnout_total,polling_booth_id) %>%
filter(town_id == "172") %>%
arrange(polling_booth_id) %>%
head(10) %>%
kbl() %>%
kable_styling()
| town_id | turnout_total | polling_booth_id |
|---|---|---|
| 172 | 185 | 1 |
| 172 | 185 | 2 |
| 172 | 191 | 3 |
| 172 | 194 | 4 |
| 172 | 208 | 5 |
| 172 | 211 | 6 |
| 172 | 227 | 7 |
| 172 | 233 | 8 |
| 172 | 235 | 9 |
| 172 | 235 | 10 |
dummy_data <- df_test %>%
select(town_id,turnout_total,intervention,registered_total) %>%
fastDummies::dummy_cols(select_columns = "town_id")
df_test <- inner_join(df_test,df_names,
by = join_by("town_id"),
keep = FALSE)
dummy_data <- inner_join(dummy_data,df_names,
by = join_by("town_id"),
keep = FALSE)
The merging has led to 21 observations being excluded meaning no matching town name was found. To use the anti join to obtain the missing town, I change the order of arguments x and y within the function.
df_na <- anti_join(df_names, df_test,
by="town_name")
df_na
## # A tibble: 1 × 2
## town_id town_name
## <chr> <chr>
## 1 250 Patna
The town that does not have matching records in our df_test data is 250 - Patna.
Labeling values for intervention variable - 0 as no intervention and 1 as intervention and introducing a new variable ‘intervention_status’.
df_test <- df_test %>% mutate("intervention_status" =
recode(intervention, "0" = "No_intervention",
"1" = "Intervention"))
df_ols_model <- lm(turnout_total ~ intervention, data = df_test)
summary(df_ols_model)
##
## Call:
## lm(formula = turnout_total ~ intervention, data = df_test)
##
## Residuals:
## Min 1Q Median 3Q Max
## -467.65 -112.65 -8.65 99.65 576.35
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 461.254 2.597 177.627 <2e-16 ***
## intervention1 7.396 3.682 2.008 0.0446 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 153.7 on 6968 degrees of freedom
## Multiple R-squared: 0.0005785, Adjusted R-squared: 0.0004351
## F-statistic: 4.034 on 1 and 6968 DF, p-value: 0.04464
A basic ols reveals a positive impact of intervention on voter turnout.
However, I have not factored in any other variables that may influence the dependent variable.
dummy_model <- lm(turnout_total ~ intervention + factor(town_id),
data = df_test)
summary(dummy_model)
##
## Call:
## lm(formula = turnout_total ~ intervention + factor(town_id),
## data = df_test)
##
## Residuals:
## Min 1Q Median 3Q Max
## -481.62 -106.13 -4.95 95.62 540.55
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 392.550 9.157 42.871 < 2e-16 ***
## intervention1 7.376 3.538 2.085 0.037099 *
## factor(town_id)172 74.602 12.211 6.109 1.05e-09 ***
## factor(town_id)173 6.148 12.921 0.476 0.634232
## factor(town_id)175 41.436 12.509 3.313 0.000929 ***
## factor(town_id)176 45.022 12.663 3.556 0.000380 ***
## factor(town_id)177 34.251 12.606 2.717 0.006603 **
## factor(town_id)190 56.513 12.882 4.387 1.17e-05 ***
## factor(town_id)191 40.114 12.368 3.243 0.001187 **
## factor(town_id)192 113.496 13.446 8.441 < 2e-16 ***
## factor(town_id)193 93.000 13.297 6.994 2.92e-12 ***
## factor(town_id)195 14.289 13.329 1.072 0.283756
## factor(town_id)196 119.049 12.988 9.166 < 2e-16 ***
## factor(town_id)197 135.464 13.029 10.397 < 2e-16 ***
## factor(town_id)198 72.950 13.144 5.550 2.96e-08 ***
## factor(town_id)199 113.653 13.001 8.741 < 2e-16 ***
## factor(town_id)200 179.428 13.378 13.412 < 2e-16 ***
## factor(town_id)201 120.578 13.086 9.214 < 2e-16 ***
## factor(town_id)208 94.472 12.447 7.590 3.62e-14 ***
## factor(town_id)210 82.693 12.947 6.387 1.80e-10 ***
## factor(town_id)211 91.664 13.115 6.989 3.02e-12 ***
## factor(town_id)212 134.048 13.204 10.152 < 2e-16 ***
## factor(town_id)213 44.234 12.488 3.542 0.000400 ***
## factor(town_id)235 29.480 12.908 2.284 0.022408 *
## factor(town_id)236 25.523 12.437 2.052 0.040185 *
## factor(town_id)237 61.481 12.831 4.792 1.69e-06 ***
## factor(town_id)238 15.642 13.057 1.198 0.230961
## factor(town_id)239 68.592 12.831 5.346 9.28e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 147.7 on 6942 degrees of freedom
## Multiple R-squared: 0.08112, Adjusted R-squared: 0.07755
## F-statistic: 22.7 on 27 and 6942 DF, p-value: < 2.2e-16
Using town dummy variables simulator we observe a positive impact on voter turnout where there was an intervention
This model uses town_id 171 as the base group. We can then interpret the coefficients for our town dummy variables as follows: i.e. town_id_172 has a higher turnout than town_id_171 by 74.602
However the coefficients for our dummy variables don’t give us any inference for our variable of interest which is voter turnout. Rather it helps us understand the performance of the different towns with reference to our base group which is town_171.
Here I am going to explicitly use a fixed effects model.
Using the fixed effects model I want to control for the registered voter total which is a confounding variable that is not under direct experimental control.
df_model <- plm(turnout_total ~ intervention + registered_total,
index = "town_id",
model = "within",
data = df_test)
summary(df_model)
## Oneway (individual) effect Within Model
##
## Call:
## plm(formula = turnout_total ~ intervention + registered_total,
## data = df_test, model = "within", index = "town_id")
##
## Unbalanced Panel: n = 27, T = 218-319, N = 6948
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -526.6604 -104.6209 -7.4254 95.2227 534.1849
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## intervention1 8.4006438 3.4865148 2.4095 0.016 *
## registered_total 0.0576584 0.0038244 15.0763 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 150960000
## Residual Sum of Squares: 146060000
## R-Squared: 0.032465
## Adj. R-Squared: 0.028549
## F-statistic: 116.079 on 2 and 6919 DF, p-value: < 2.22e-16
Here I also see a positive impact on voter turnout where there was intervention.
Controlling for the registered total number of voters we still have a positive effect on voter turnout where there was an intervention. The intervention variable is still statistically significant.
I also find that registered total has a significant effect on voter turnout.
df_femint <- df_test %>% group_by(intervention_status) %>%
summarize(female_turnout = sum(turnout_female)) %>%
mutate(intervention_status=fct_reorder(intervention_status,
female_turnout))
df_femint %>% ggplot(aes(intervention_status, female_turnout)) +
geom_col(fill = "black") +
ggtitle("Female Turnout for Intervention(1) and Control Groups(0)") +
ylab("Turn Out") +
xlab("Intervention Status") +
theme_light() +
geom_text(aes(label = female_turnout),
hjust = 1,
color = "skyblue") +
scale_y_continuous(labels = scales::comma_format(),
limits = c(0,800000)) + coord_flip()
This summary gives a net effect of turnout total for towns that had an intervention and those that did not.
fem_int_effect <- df_test %>% group_by(intervention_status,town_id) %>%
summarize(female_turnout = sum(turnout_female)) %>%
pivot_wider(names_from = "intervention_status",
values_from = "female_turnout") %>%
mutate("intervention_effect" = Intervention - No_intervention)
fem_int_effect %>% arrange(desc(intervention_effect)) %>%
kbl() %>%
kable_styling()
| town_id | No_intervention | Intervention | intervention_effect |
|---|---|---|---|
| 199 | 26514 | 30661 | 4147 |
| 190 | 24952 | 28091 | 3139 |
| 208 | 31185 | 34062 | 2877 |
| 211 | 25396 | 27852 | 2456 |
| 235 | 23584 | 25930 | 2346 |
| 173 | 22067 | 24385 | 2318 |
| 201 | 27214 | 29412 | 2198 |
| 193 | 24249 | 26414 | 2165 |
| 200 | 28160 | 29975 | 1815 |
| 213 | 28494 | 30017 | 1523 |
| 212 | 27519 | 28880 | 1361 |
| 177 | 26535 | 27810 | 1275 |
| 239 | 26921 | 28042 | 1121 |
| 197 | 28942 | 30004 | 1062 |
| 236 | 27926 | 28770 | 844 |
| 176 | 27193 | 27936 | 743 |
| 195 | 20587 | 21321 | 734 |
| 237 | 26721 | 27274 | 553 |
| 192 | 25335 | 25387 | 52 |
| 238 | 22695 | 22641 | -54 |
| 191 | 30362 | 29568 | -794 |
| 175 | 29153 | 27987 | -1166 |
| 196 | 29759 | 28175 | -1584 |
| 198 | 26363 | 24361 | -2002 |
| 171 | 25591 | 23367 | -2224 |
| 172 | 35455 | 32943 | -2512 |
| 210 | 28773 | 26108 | -2665 |
fem_int_effect %>% ggplot(aes(intervention_effect,town_id)) +
geom_col(fill = "black") +
ggtitle("Net effect in voter turnout") +
labs(subtitle = "Data for the different towns") +
theme_minimal() + ylab("Town ID") +
xlab("Intervention Effect") +
scale_x_continuous(labels = scales::comma_format())
Towns with a net negative effect indicate that intervention did not yield an increased voter turnout.
I want to know how many towns benefited from increased voter turnout due to intervention.
I classify the intervention_effect variable as positive for numbers greater than 0 and negative for numbers less than 0.
fem_int_effect$sign_effect <- dplyr::case_when(
fem_int_effect$intervention_effect < 0 ~ "Negative",
fem_int_effect$intervention_effect > 0 ~ "Positive")
fem_int_effect %>% ggplot(aes(sign_effect)) +
geom_bar(fill = "black") +
ggtitle("In how many towns do we have a positive effect after intervention?") +
labs(subtitle = "Female Turnout") +
theme_minimal() +
ylab("Town ID") +
geom_text(aes(label = ..count..),
stat = "count",
hjust = 1.5, colour = "white") +
xlab("Intervention Effect") +
scale_y_continuous(limits = c(0,20)) +
coord_flip()
There were more towns with a positive effect from intervention compared to those with a negative effect. Hence, I can go on to conclude that Intervention has a significant effect on voter turnout.