Packages

Loading relevant packages:

library(lme4) #Hierarchical models
library(stargazer) #Regression tables
library(dplyr) #Data wrangling and pipes
library(ggplot2) #Graphics
library(fixest) #TWFE

Data

Loading the relevant data:

load('/Users/davidhamad/Documents/Cand.Scient.Pol/2. Semester/Statistical models beyond linear regression - applied statistics for political scientists/4-5) Hierarchical data structures/MEP.rda')

df <- MEP

Main variables

Rename the three main variables for convenience:

df <- df %>%
  mutate(
    #The number of local assistants employed by each MEP in 2014
    y = ShareOfLocalAssistants,
    #The seat share that national party has in national election
    x = SeatsNatPal.prop,
    #Candidate-centered (or not) electoral system for EU elections
    z = OpenList) 

Exercise 1: Simpson’s paradox and leveraging variation

Q1: Describe the bivariate relationship between MEPs’ local staff (ShareOfLocalStaff, y) and their party’s seat share (SeatNatPal.prop, x) by fitting a linear regression. What do you find?

A1: Intercept is 2,56, meaning that when x (party size) equals zero, the number of local staff is 2,56. There’s a small negative relationship between party size (x) and local staff (y). The coefficient is -0,44, meaning for every 1,0 increase in seat share, MEPs hire about 0,44 fewer local staff. The effect is only barely significant (one star, p = 0,019). R² is basically zero. The model explains almost nothing of the variation in local staff.

model_a <- lm(y ~ x, data = df)
summary(model_a)

Call:
lm(formula = y ~ x, data = df)

Residuals:
   Min     1Q Median     3Q    Max 
-2.561 -1.561 -0.519  0.652 40.470 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.56125    0.05743  44.596   <2e-16 ***
x           -0.44420    0.18926  -2.347    0.019 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.803 on 6946 degrees of freedom
  (195 observations deleted due to missingness)
Multiple R-squared:  0.0007924, Adjusted R-squared:  0.0006486 
F-statistic: 5.509 on 1 and 6946 DF,  p-value: 0.01895

Q2: Illustrate the relationship between the two in ggplot2. You can use the geometric function for local regression and specify the method geom_smooth(aes(x, y), method = “loess”). Change the method to “lm” (linear model). What do you think?

A2: The line in plot 1 wiggles up and down. This tells you the relationship isn’t even consistently negative. It’s messy and unclear. The line in plot 2 is almost flat with a very slight downward slope. That’s our -0,44 coefficient visualized. But look at how spread out the points are. The line barely captures any pattern. Both plots confirm the same story. When you pool all MEPs together (complete pooling) and ignore groups, the relationship between party size and local staff is weak and messy. The data is screaming that something else is creating noise that hides the real pattern…

ggplot(df, aes(x, y)) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = "loess")

ggplot(df, aes(x, y)) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = "lm")


Q3: Building on your previous ggplot code, use the color = “Nationality” argument to create a separate regression for each MEP nationality. You can remove the uncertainty by adding se = FALSE if the shaded lines makes it hard to read the plot. Compare the new plot with the plot from 1b. What happened?

A3: Most country lines now slope downward. The negative effect is clearer once you separate by nationality. You can also see that countries have very different intercepts. That’s the between-group variation that was confounding the pooled model. The pooled model said barely any effect. The split-by-country model says clear negative effect, it was just hidden by country differences.

ggplot(df, aes(x, y, color = Nationality)) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = "lm", se = FALSE)


Q4: Ok. Calculate the grouped averages of both x and y using group_by() and reframe(). You now have a new data frame with 28 observations. Plot x and y against each other in a scatterplot. Using geom_smooth(aes(x,y), method = “lm”) you can also fit a regression line to the data. What do you see?

A4: When we collapse the data to country averages, the relationship between x and y is positive. Countries with bigger parties on average also have more staff on average. This is the opposite of what we found within countries (negative). That’s Simpson’s paradox. The between-country relationship goes in one direction, the within-country relationship goes in the other, and the pooled model mixed them together to produce a misleading near-zero result.

df_grouped <- df %>%
  group_by(Nationality) %>%
  reframe(x = mean(x, na.rm = TRUE),
          y = mean(y, na.rm = TRUE))

ggplot(df_grouped, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm")


Q5: Fit a fixed-effects model where you control for Nationality.

A5: x equals -1,68 (three stars, highly significant). Compare that to the pooled model’s -0.44. The effect is almost 4 times stronger once you control for nationality. R² also goes up by a lot compared to the pooled model. Just adding nationality explains a lot more of the variation. The nationality dummies show the different baseline levels compared to the reference country (Austria). For example, Poland is 4,91 and Sweden is -1.49, meaning Polish MEPs hire way more staff than Austrian ones, and Swedish MEPs hire fewer. These are exactly the country differences that were confounding the pooled estimate. So by removing the between-country variation with dummies, you isolate the within-country effect of x, and it’s strong, negative, and significant. Exactly what the theory predicts.

model_fe <- lm(y ~ x + Nationality, data = df)
summary(model_fe)

Call:
lm(formula = y ~ x + Nationality, data = df)

Residuals:
   Min     1Q Median     3Q    Max 
-7.218 -1.245 -0.330  0.826 34.782 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)                2.18226    0.17978  12.138  < 2e-16 ***
x                         -1.68283    0.16796 -10.019  < 2e-16 ***
NationalityBelgium        -1.03592    0.23712  -4.369 1.27e-05 ***
NationalityBulgaria        2.06289    0.25254   8.169 3.68e-16 ***
NationalityCroatia         1.19829    0.32032   3.741 0.000185 ***
NationalityCyprus          0.44566    0.35038   1.272 0.203436    
NationalityCzech Republic  0.39254    0.25109   1.563 0.118015    
NationalityDenmark        -0.90697    0.27318  -3.320 0.000905 ***
NationalityEstonia        -0.65138    0.36827  -1.769 0.076974 .  
NationalityFinland        -0.88978    0.26638  -3.340 0.000841 ***
NationalityFrance         -0.42375    0.20128  -2.105 0.035305 *  
NationalityGermany         0.66037    0.19056   3.465 0.000533 ***
NationalityGreece         -0.27969    0.23775  -1.176 0.239471    
NationalityHungary        -0.20723    0.23831  -0.870 0.384555    
NationalityIreland         0.43633    0.29627   1.473 0.140863    
NationalityItaly           1.45353    0.19607   7.413 1.38e-13 ***
NationalityLatvia          1.16429    0.38052   3.060 0.002224 ** 
NationalityLithuania       6.15558    0.28775  21.392  < 2e-16 ***
NationalityLuxembourg     -0.96838    0.34385  -2.816 0.004872 ** 
NationalityMalta           3.00511    0.35342   8.503  < 2e-16 ***
NationalityNetherlands    -1.30239    0.22829  -5.705 1.21e-08 ***
NationalityPoland          4.90525    0.20373  24.077  < 2e-16 ***
NationalityPortugal       -0.81120    0.23764  -3.414 0.000645 ***
NationalityRomania         0.81788    0.22212   3.682 0.000233 ***
NationalitySlovakia       -0.06289    0.26790  -0.235 0.814406    
NationalitySlovenia       -0.68207    0.31084  -2.194 0.028250 *  
NationalitySpain          -0.11428    0.20315  -0.563 0.573779    
NationalitySweden         -1.48629    0.24066  -6.176 6.95e-10 ***
NationalityUnited Kingdom  1.14512    0.19566   5.853 5.06e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.289 on 6919 degrees of freedom
  (195 observations deleted due to missingness)
Multiple R-squared:  0.3362,    Adjusted R-squared:  0.3335 
F-statistic: 125.1 on 28 and 6919 DF,  p-value: < 2.2e-16

Q6: Use stargazer() to compare your pooled model with the fixed-effects model. If you don’t want to see the regression coefficients relating to Nationality, you may use the argument omit = “Nationality”. What happened to the effect of vote share? Why do you think this happened?

A6: Model 1 (pooled): x = -0,44, two stars, R² = 0,001. Model 2 (fixed effects): x = -1,68, three stars, R² = 0,336. By adding nationality dummies, the effect became almost 4 times stronger, went to highly significant, and the model now explains a lot more of the variation. That’s the power of removing confounders. The between-country variation (positive) was cancelling out the within-country variation (negative), making the pooled estimate weak and misleading.

stargazer(model_a, model_fe, type = "text", omit = "Nationality")

=====================================================================
                                   Dependent variable:               
                    -------------------------------------------------
                                            y                        
                             (1)                      (2)            
---------------------------------------------------------------------
x                          -0.444**                -1.683***         
                           (0.189)                  (0.168)          
                                                                     
Constant                   2.561***                 2.182***         
                           (0.057)                  (0.180)          
                                                                     
---------------------------------------------------------------------
Observations                6,948                    6,948           
R2                          0.001                    0.336           
Adjusted R2                 0.001                    0.333           
Residual Std. Error   2.803 (df = 6946)        2.289 (df = 6919)     
F Statistic         5.509** (df = 1; 6946) 125.140*** (df = 28; 6919)
=====================================================================
Note:                                     *p<0.1; **p<0.05; ***p<0.01

Exercise 2: Two ways to obtain the same result

Q1: Re-fit a pooled model where you control for LaborCost. What happens?

A1: Controlling for LaborCost gets you some of the way toward the true effect of x, but nationality dummies still do a better job because they absorb everything that differs between countries.

model_labor <- lm(y ~ x + LaborCost, data = df)
summary(model_labor)

Call:
lm(formula = y ~ x + LaborCost, data = df)

Residuals:
   Min     1Q Median     3Q    Max 
-4.183 -1.691 -0.517  1.036 39.034 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.491255   0.093758  47.902  < 2e-16 ***
x           -1.162326   0.183244  -6.343 2.39e-10 ***
LaborCost   -0.075100   0.002956 -25.403  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.681 on 6945 degrees of freedom
  (195 observations deleted due to missingness)
Multiple R-squared:  0.08574,   Adjusted R-squared:  0.08548 
F-statistic: 325.7 on 2 and 6945 DF,  p-value: < 2.2e-16

Q2: What happens if you add national fixed effects to that model?

A2: if LaborCost had been perfectly constant within countries (no time variation), R would have dropped it due to perfect collinearity. It survived only because it has some within-country variation over time. But notice: x = -1,69, which is almost identical to the fixed effects model without LaborCost (x = -1,68). So adding LaborCost on top of nationality dummies barely changed the effect of x, the dummies were already doing most of the work.

model_both <- lm(y ~ x + LaborCost + Nationality, data = df)
summary(model_both)

Call:
lm(formula = y ~ x + LaborCost + Nationality, data = df)

Residuals:
   Min     1Q Median     3Q    Max 
-7.213 -1.213 -0.372  0.832 34.694 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)                7.02824    1.08422   6.482 9.66e-11 ***
x                         -1.69313    0.16774 -10.094  < 2e-16 ***
LaborCost                 -0.15417    0.03402  -4.532 5.94e-06 ***
NationalityBelgium         0.10362    0.34538   0.300 0.764182    
NationalityBulgaria       -2.18833    0.97133  -2.253 0.024296 *  
NationalityCroatia        -2.15904    0.80690  -2.676 0.007474 ** 
NationalityCyprus         -1.91302    0.62712  -3.050 0.002293 ** 
NationalityCzech Republic -2.93881    0.77665  -3.784 0.000156 ***
NationalityDenmark         0.50650    0.41435   1.222 0.221594    
NationalityEstonia        -3.99011    0.82337  -4.846 1.29e-06 ***
NationalityFinland        -0.73817    0.26810  -2.753 0.005915 ** 
NationalityFrance          0.10360    0.23225   0.446 0.655545    
NationalityGermany         0.68604    0.19038   3.604 0.000316 ***
NationalityGreece         -2.86794    0.61848  -4.637 3.60e-06 ***
NationalityHungary        -3.84642    0.83750  -4.593 4.45e-06 ***
NationalityIreland         0.20793    0.30011   0.693 0.488438    
NationalityItaly           0.92724    0.22764   4.073 4.69e-05 ***
NationalityLatvia         -2.62810    0.91902  -2.860 0.004253 ** 
NationalityLithuania       2.30738    0.89640   2.574 0.010072 *  
NationalityLuxembourg     -0.31974    0.37200  -0.860 0.390081    
NationalityMalta           0.11912    0.72805   0.164 0.870036    
NationalityNetherlands    -0.97594    0.23908  -4.082 4.51e-05 ***
NationalityPoland          1.34129    0.81227   1.651 0.098726 .  
NationalityPortugal       -3.59100    0.65766  -5.460 4.92e-08 ***
NationalityRomania        -3.30532    0.93643  -3.530 0.000419 ***
NationalitySlovakia       -3.42230    0.78804  -4.343 1.43e-05 ***
NationalitySlovenia       -3.10622    0.61843  -5.023 5.22e-07 ***
NationalitySpain          -1.69202    0.40292  -4.199 2.71e-05 ***
NationalitySweden         -0.52701    0.32024  -1.646 0.099883 .  
NationalityUnited Kingdom  0.34926    0.26270   1.330 0.183725    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.286 on 6918 degrees of freedom
  (195 observations deleted due to missingness)
Multiple R-squared:  0.3381,    Adjusted R-squared:  0.3354 
F-statistic: 121.9 on 29 and 6918 DF,  p-value: < 2.2e-16

Exercise 3: Non-identified models

Q1: Fit a pooled model with predictors for vote share (x) and electoral system (z).

A1: x = -0,23, not significant. Even weaker than the pooled model without z (-0,44). Adding z absorbed some of the variation that x was capturing before. z = 0,67, three stars, highly significant. MEPs in open-list (candidate-centered) electoral systems hire about 0,67 more local staff. That makes sense if you need personal votes, you invest more in local presence.

model_pooled_xz <- lm(y ~ x + z, data = df)
summary(model_pooled_xz)

Call:
lm(formula = y ~ x + z, data = df)

Residuals:
   Min     1Q Median     3Q    Max 
-2.871 -1.859 -0.804  0.910 40.146 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.19975    0.06762  32.533   <2e-16 ***
x           -0.23406    0.18912  -1.238    0.216    
z            0.67078    0.06740   9.952   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.783 on 6945 degrees of freedom
  (195 observations deleted due to missingness)
Multiple R-squared:  0.01484,   Adjusted R-squared:  0.01456 
F-statistic: 52.32 on 2 and 6945 DF,  p-value: < 2.2e-16

Q2: Fixed-effects model: Add national-level fixed effects to the pooled model. What happened to Sweden? Why?

A2: Sweden was dropped (NA) because z (electoral system) has no within-country variation, each country’s electoral system stays the same across all observations. Since the nationality dummies already capture all country-level differences, including electoral system, z provides no new information. This creates perfect collinearity, which R resolves by dropping one country (Sweden).

model_fe_xz <- lm(y ~ x + z + Nationality, data = df)
summary(model_fe_xz)

Call:
lm(formula = y ~ x + z + Nationality, data = df)

Residuals:
   Min     1Q Median     3Q    Max 
-7.218 -1.245 -0.330  0.826 34.782 

Coefficients: (1 not defined because of singularities)
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)                 2.1823     0.1798  12.138  < 2e-16 ***
x                          -1.6828     0.1680 -10.019  < 2e-16 ***
z                          -1.4863     0.2407  -6.176 6.95e-10 ***
NationalityBelgium          0.4504     0.2283   1.973 0.048526 *  
NationalityBulgaria         2.0629     0.2525   8.169 3.68e-16 ***
NationalityCroatia          2.6846     0.3144   8.539  < 2e-16 ***
NationalityCyprus           1.9320     0.3452   5.597 2.26e-08 ***
NationalityCzech Republic   1.8788     0.2432   7.725 1.28e-14 ***
NationalityDenmark          0.5793     0.2659   2.179 0.029357 *  
NationalityEstonia          0.8349     0.3632   2.299 0.021560 *  
NationalityFinland          0.5965     0.2589   2.304 0.021236 *  
NationalityFrance          -0.4238     0.2013  -2.105 0.035305 *  
NationalityGermany          0.6604     0.1906   3.465 0.000533 ***
NationalityGreece           1.2066     0.2296   5.255 1.52e-07 ***
NationalityHungary         -0.2072     0.2383  -0.870 0.384555    
NationalityIreland          1.9226     0.2899   6.631 3.58e-11 ***
NationalityItaly            2.9398     0.1868  15.741  < 2e-16 ***
NationalityLatvia           2.6506     0.3754   7.061 1.81e-12 ***
NationalityLithuania        7.6419     0.2809  27.208  < 2e-16 ***
NationalityLuxembourg       0.5179     0.3386   1.529 0.126230    
NationalityMalta            4.4914     0.3494  12.856  < 2e-16 ***
NationalityNetherlands      0.1839     0.2192   0.839 0.401523    
NationalityPoland           6.3915     0.1950  32.772  < 2e-16 ***
NationalityPortugal        -0.8112     0.2376  -3.414 0.000645 ***
NationalityRomania          0.8179     0.2221   3.682 0.000233 ***
NationalitySlovakia         1.4234     0.2607   5.459 4.95e-08 ***
NationalitySlovenia         0.8042     0.3043   2.643 0.008230 ** 
NationalitySpain           -0.1143     0.2032  -0.563 0.573779    
NationalitySweden               NA         NA      NA       NA    
NationalityUnited Kingdom   1.1451     0.1957   5.853 5.06e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.289 on 6919 degrees of freedom
  (195 observations deleted due to missingness)
Multiple R-squared:  0.3362,    Adjusted R-squared:  0.3335 
F-statistic: 125.1 on 28 and 6919 DF,  p-value: < 2.2e-16

Q3: Tweak your fixed-effects model such that you remove z. What happens to Sweden?

A3: Sweden is back! Exactly because we removed z. The problem before was that z and the nationality dummies contained the same information, so R had to drop one. Without z, there’s no collinearity and all 28 countries get their own dummy.

model_fe_x <- lm(y ~ x + Nationality, data = df)
summary(model_fe_x)

Call:
lm(formula = y ~ x + Nationality, data = df)

Residuals:
   Min     1Q Median     3Q    Max 
-7.218 -1.245 -0.330  0.826 34.782 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)                2.18226    0.17978  12.138  < 2e-16 ***
x                         -1.68283    0.16796 -10.019  < 2e-16 ***
NationalityBelgium        -1.03592    0.23712  -4.369 1.27e-05 ***
NationalityBulgaria        2.06289    0.25254   8.169 3.68e-16 ***
NationalityCroatia         1.19829    0.32032   3.741 0.000185 ***
NationalityCyprus          0.44566    0.35038   1.272 0.203436    
NationalityCzech Republic  0.39254    0.25109   1.563 0.118015    
NationalityDenmark        -0.90697    0.27318  -3.320 0.000905 ***
NationalityEstonia        -0.65138    0.36827  -1.769 0.076974 .  
NationalityFinland        -0.88978    0.26638  -3.340 0.000841 ***
NationalityFrance         -0.42375    0.20128  -2.105 0.035305 *  
NationalityGermany         0.66037    0.19056   3.465 0.000533 ***
NationalityGreece         -0.27969    0.23775  -1.176 0.239471    
NationalityHungary        -0.20723    0.23831  -0.870 0.384555    
NationalityIreland         0.43633    0.29627   1.473 0.140863    
NationalityItaly           1.45353    0.19607   7.413 1.38e-13 ***
NationalityLatvia          1.16429    0.38052   3.060 0.002224 ** 
NationalityLithuania       6.15558    0.28775  21.392  < 2e-16 ***
NationalityLuxembourg     -0.96838    0.34385  -2.816 0.004872 ** 
NationalityMalta           3.00511    0.35342   8.503  < 2e-16 ***
NationalityNetherlands    -1.30239    0.22829  -5.705 1.21e-08 ***
NationalityPoland          4.90525    0.20373  24.077  < 2e-16 ***
NationalityPortugal       -0.81120    0.23764  -3.414 0.000645 ***
NationalityRomania         0.81788    0.22212   3.682 0.000233 ***
NationalitySlovakia       -0.06289    0.26790  -0.235 0.814406    
NationalitySlovenia       -0.68207    0.31084  -2.194 0.028250 *  
NationalitySpain          -0.11428    0.20315  -0.563 0.573779    
NationalitySweden         -1.48629    0.24066  -6.176 6.95e-10 ***
NationalityUnited Kingdom  1.14512    0.19566   5.853 5.06e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.289 on 6919 degrees of freedom
  (195 observations deleted due to missingness)
Multiple R-squared:  0.3362,    Adjusted R-squared:  0.3335 
F-statistic: 125.1 on 28 and 6919 DF,  p-value: < 2.2e-16

Exercise 4: Demeaning and fixed effects models

Q1: Begin by calculating the number of MEPs that experienced at least one change in national party size (x) and local staff (y). In your opinion, do you have enough variation in the data to estimate within-individual effects?

A1: So about half the MEPs had changes in x and about 76 % had changes in y. That’s plenty of within-individual variation to estimate fixed effects. The model won’t use information from MEPs who never changed, but there’s still a large enough group who did.

n_total <- n_distinct(df$ID)

n_change_x <- df %>%
  group_by(ID) %>%
  mutate(change_x = x - lag(x)) %>%
  filter(change_x != 0) %>%
  summarise(n = n()) %>%
  nrow()

n_change_x
[1] 553
n_change_x / n_total * 100
[1] 47.10392
n_change_y <- df %>%
  group_by(ID) %>%
  mutate(change_y = y - lag(y)) %>%
  filter(change_y != 0) %>%
  summarise(n = n()) %>%
  nrow()

n_change_y
[1] 891
n_change_y / n_total * 100
[1] 75.89438

Q2: Leverage only the between-individual variation between MEPs’ local staff and their party size.

A2: x = 0,03, insignificant. R² is basically zero. There’s no between-individual relationship at all.

df0 <- df %>%
  group_by(ID) %>%
  reframe(x = mean(x, na.rm = TRUE),
          y = mean(y, na.rm = TRUE))

mod_between <- lm(y ~ x, data = df0)
summary(mod_between)

Call:
lm(formula = y ~ x, data = df0)

Residuals:
   Min     1Q Median     3Q    Max 
-2.530 -1.710 -0.714  0.883 36.984 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.50922    0.13725  18.282   <2e-16 ***
x            0.03076    0.46144   0.067    0.947    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.729 on 1150 degrees of freedom
  (22 observations deleted due to missingness)
Multiple R-squared:  3.865e-06, Adjusted R-squared:  -0.0008657 
F-statistic: 0.004445 on 1 and 1150 DF,  p-value: 0.9469

Q3: Leverage only the within-individual variation in a similar model.

A3: By demeaning, subtracting each MEP’s own average from x and y, we isolate only within-individual variation. The coefficient is -0,67, but not significant.

df <- df %>%
  group_by(ID) %>%
  mutate(x_id = mean(x, na.rm = TRUE),
         y_id = mean(y, na.rm = TRUE)) %>%
  ungroup()

mod_within <- lm(I(y - y_id) ~ -1 + I(x - x_id), data = df)
summary(mod_within)

Call:
lm(formula = I(y - y_id) ~ -1 + I(x - x_id), data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-11.577  -0.403   0.000   0.404  37.269 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
I(x - x_id)  -0.6673     0.4298  -1.553    0.121

Residual standard error: 1.35 on 6947 degrees of freedom
  (195 observations deleted due to missingness)
Multiple R-squared:  0.000347,  Adjusted R-squared:  0.0002031 
F-statistic: 2.411 on 1 and 6947 DF,  p-value: 0.1205

Q4: Estimate a fixed effects model where you add MEP’s id variable as a covariate.

A4: Adding a dummy for every MEP gives x = -0,667, not significant. The same coefficient as the demeaned model. This confirms that demeaning and adding individual dummies are mathematically equivalent.

mod_fe <- lm(y ~ x + as.factor(ID), data = df)
stargazer(mod_within, mod_fe, type = "text", omit = "as.factor")

====================================================================
                                  Dependent variable:               
                    ------------------------------------------------
                        I(y - y_id)                   y             
                            (1)                      (2)            
--------------------------------------------------------------------
I(x - x_id)                -0.667                                   
                          (0.430)                                   
                                                                    
x                                                  -0.667           
                                                   (0.467)          
                                                                    
Constant                                          1.731***          
                                                   (0.472)          
                                                                    
--------------------------------------------------------------------
Observations               6,948                    6,948           
R2                         0.0003                   0.772           
Adjusted R2                0.0002                   0.727           
Residual Std. Error  1.350 (df = 6947)        1.466 (df = 5795)     
F Statistic         2.411 (df = 1; 6947) 17.040*** (df = 1152; 5795)
====================================================================
Note:                                    *p<0.1; **p<0.05; ***p<0.01

Q5: Do the Mundlak move. Estimate a model on you main data frame with both the between- and within- variation variables included.

A5: The Mundlak move separates the total effect into two parts. The between effect (x_id = -0,434) is the relationship when comparing different MEPs to each other. The within effect I(x - x_id) = -0,667) is the relationship when comparing each MEP to themselves over time. Both are negative, but the within effect is larger.

mod_both <- lm(y ~ x_id + I(x - x_id), data = df)
stargazer(mod_between, mod_within, mod_fe, mod_both, type = "text", omit = "as.factor")

===============================================================================================================
                                                        Dependent variable:                                    
                    -------------------------------------------------------------------------------------------
                             y               I(y - y_id)                              y                        
                            (1)                  (2)                      (3)                      (4)         
---------------------------------------------------------------------------------------------------------------
x                          0.031                                        -0.667                                 
                          (0.461)                                       (0.467)                                
                                                                                                               
x_id                                                                                            -0.434**       
                                                                                                 (0.194)       
                                                                                                               
I(x - x_id)                                     -0.667                                           -0.667        
                                               (0.430)                                           (0.893)       
                                                                                                               
Constant                  2.509***                                     1.731***                 2.559***       
                          (0.137)                                       (0.472)                  (0.058)       
                                                                                                               
---------------------------------------------------------------------------------------------------------------
Observations               1,152                6,948                    6,948                    6,948        
R2                        0.00000               0.0003                   0.772                    0.001        
Adjusted R2                -0.001               0.0002                   0.727                    0.001        
Residual Std. Error  2.729 (df = 1150)    1.350 (df = 6947)        1.466 (df = 5795)        2.803 (df = 6945)  
F Statistic         0.004 (df = 1; 1150) 2.411 (df = 1; 6947) 17.040*** (df = 1152; 5795) 2.787* (df = 2; 6945)
===============================================================================================================
Note:                                                                               *p<0.1; **p<0.05; ***p<0.01

Q6: What did you just do? What do these models tell you?

A6: We split the effect of party size into between and within components. The demeaned model and the individual fixed effects model give the exact same coefficient (-0,667), proving they’re mathematically equivalent. The Mundlak move puts both components in one model. The between effect is -0,43 and the within effect is -0,67. The within effect is stronger and more meaningful because it removes everything that’s constant about each MEP.


Exercise 5: Variation is information

Q1 + Q2: Go back to your very first code and store in your data which MEP experienced at least one change in x (and in y). Reestimate the individual fixed-effects model on data where you filter out all MEPs that have not experienced a change (filter(change_x == 0)). Store the the result in mod_fe_var and add it to the stargazer results table. Compare the regression coefficents of mod_fe and mod_fe_var. What do you see? Why? What about the standard errors?

A1 + A2: Coefficients are almost identical. Removing the non-changers barely changed the estimate, because they weren’t contributing any information anyway. Standard errors went up. That’s because we removed ~ 2,900 observations. Fewer observations = less data = more uncertainty. The non-changers don’t affect what the answer is, but they do affect how confident we are in it (standard errors go up).

df <- df %>%
  group_by(ID) %>%
  mutate(change_x = any(x != lag(x), na.rm = TRUE)) %>%
  ungroup()

mod_fe_var <- lm(y ~ x + as.factor(ID), data = df %>% filter(change_x == TRUE))

stargazer(mod_within, mod_fe, mod_fe_var, type = "text", omit = "as.factor")

===============================================================================================
                                                Dependent variable:                            
                    ---------------------------------------------------------------------------
                        I(y - y_id)                                y                           
                            (1)                      (2)                        (3)            
-----------------------------------------------------------------------------------------------
I(x - x_id)                -0.667                                                              
                          (0.430)                                                              
                                                                                               
x                                                  -0.667                      -0.652          
                                                   (0.467)                    (0.540)          
                                                                                               
Constant                                          1.731***                    1.728***         
                                                   (0.472)                    (0.544)          
                                                                                               
-----------------------------------------------------------------------------------------------
Observations               6,948                    6,948                      4,071           
R2                         0.0003                   0.772                      0.709           
Adjusted R2                0.0002                   0.727                      0.663           
Residual Std. Error  1.350 (df = 6947)        1.466 (df = 5795)          1.686 (df = 3517)     
F Statistic         2.411 (df = 1; 6947) 17.040*** (df = 1152; 5795) 15.494*** (df = 553; 3517)
===============================================================================================
Note:                                                               *p<0.1; **p<0.05; ***p<0.01

Exercise 6: Interpret using partial scenarios

Context: Let’s theorize that MEPs will move to compensate for loss in party funding coming from an unfortunate national election. To do so, we expect that MEPs will respond to changes in their own party’s size. That’s the intuition the fixed effects model narrows down on. Now that we are focusing on the within-individual change in party size (a proxy for the party’s funding), the partial scenario for interpretation changes. We are no longer interested in the between-MEP variation in party size, but what a realistic change in x is after a national election.

Q1: Use the same group_by() and mutate() trick and calculate some metric of the typical change in party size following a national election. To do so, calculate the difference between x and the previous x for each MEP in the main data, then filter out all non-changes using filter(). This is the increment (change) you use when you interpret β, your slope coefficient for x.

A1: A typical change in party size after a national election is about 5 percentage points (the median), with a mean of about 7 percentage points. This is the realistic increment we should use when interpreting β.

df %>%
  group_by(ID) %>%
  mutate(change_x = x - lag(x)) %>%
  filter(change_x != 0) %>%
  ungroup() %>%
  summarise(
    mean = mean(abs(change_x), na.rm = TRUE),
    median = median(abs(change_x), na.rm = TRUE),
    sd = sd(abs(change_x), na.rm = TRUE)
  )

Q2: Calculate the marginal effect of such a change in MEP’s local staff.

A2: Using the median change of 0,05, the marginal effect on local staff is -0,036. Using the mean change of 0,07, the effect is -0,048. In both cases, a typical national election results in a reduction of less than 0,05 local staff members.

#Median
-0.667 * 0.05333
[1] -0.03557111
#Mean
-0.667 * 0.07214
[1] -0.04811738

Q3: Formulate this in a plain English way. In your opinion, is this a “big” effect?

A3: No, this is not a big effect. It’s essentially zero. The theory predicts that MEPs compensate for losing party support by hiring more local staff, but at the individual level, a single election barely changes anything. This makes sense. MEPs don’t reshuffle their entire staff team because their party gained or lost a few seats. Staff changes are probably driven more by other things.


Exercise 7: Consider time varying effects

Context: The general level of MEP’s local staff changed over the period for reasons unrelated to national electoral fortunes. First, the European Parliament finally adressed a phenomenon among some MEPs to invest all their funding in local presence rather than legislative work. In 2016 the European Parliament therefore reformed the system by imposing a rule that only 75% of the funding can be spent locally. After that, the general level of local staff dropped. Second, the overall local staff size was higher in the spring of 2014 because of the European election where many of them were competing for seats.

Q1: Are these elements confounders? How could you adress it?

A1: These are not classical confounders since they mainly affect y (local staff) and not x (party size). However, they are time-specific shocks that hit all MEPs at the same time, which could still bias the estimate if they coincide with changes in party size. To address this, we include period fixed effects alongside individual fixed effects (ID), giving us a two-way fixed effects model that controls for both time-invariant MEP differences and period-specific shocks.

mod_fe_time <- feols(y ~ x | ID + Period, data = df)
summary(mod_fe_time)
OLS estimation, Dep. Var.: y
Observations: 6,908
Fixed-effects: ID: 1,112,  Period: 10
Standard-errors: IID 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 1.29706     Adj. R2: 0.738068
                Within R2: 0.001317

Q2: Explore the general variation in local staff size across periods (Period) in the data.

A2: The mean local staff size varies considerably across periods. It starts around 2,9 in period 1, drops to about 2,4 by period 4, spikes back up to 2,9 in period 5 (the EP election in June 2014), then drops sharply after period 8 to around 1,9 in periods 9-10 (following the 2016 reform). This confirms that there is substantial time variation in y that is unrelated to party size and needs to be accounted for.

df %>%
  select(Period, DatePeriod) %>%
  distinct() %>%
  arrange(Period)
df %>%
  group_by(Period) %>%
  summarise(mean_y = mean(y, na.rm = TRUE))
df %>%
  group_by(Period) %>%
  summarise(mean_y = mean(y, na.rm = TRUE)) %>%
  ggplot(aes(x = Period, y = mean_y)) +
  geom_line() +
  geom_point() +
  labs(x = "Period", y = "Mean local staff")


Q3: Address the problem if you perceive any.

A3: The plot in question 2 shows substantial variation in mean local staff across periods that is unrelated to party size. To address this, we add period fixed effects alongside individual fixed effects, giving us a two-way fixed effects model, which was shown in question 1.


Q4: Re-interpret your results. What changed?

A4: After adding period fixed effects, the coefficient on x changed from -0,667 to -1,253 and became statistically significant. This means that once we control for time-specific shocks like the 2016 reform and the 2014 EP election, the effect of party size on local staff is actually stronger and significant. The earlier model underestimated the effect because time-varying factors were creating noise in the estimate.


Exercise 8: Your call

Q1: Considering all the covariates in the data, which model would you go for?

A1: I would go with the two-way fixed effects model.


Q2: What did you learn?

A2: Party size has a statistically significant negative effect on local staff hiring. However, the realistic effect of a typical election is still small. MEPs do appear to compensate for party losses, but only marginally.

---
title: "Fixed effects, variation and demeaning"
output:
  html_notebook: default
  pdf_document: default
---

```{css, echo=FALSE}
body{text-align: Justify;}
h1.title {border-bottom: 3px solid black; padding-bottom: 10px;}
```

### Packages
Loading relevant packages:

```{r}
library(lme4) #Hierarchical models
library(stargazer) #Regression tables
library(dplyr) #Data wrangling and pipes
library(ggplot2) #Graphics
library(fixest) #TWFE
```

---

### Data
Loading the relevant data:

```{r}
load('/Users/davidhamad/Documents/Cand.Scient.Pol/2. Semester/Statistical models beyond linear regression - applied statistics for political scientists/4-5) Hierarchical data structures/MEP.rda')

df <- MEP
```

---

### Main variables
Rename the three main variables for convenience:

```{r}
df <- df %>%
  mutate(
    #The number of local assistants employed by each MEP in 2014
    y = ShareOfLocalAssistants,
    #The seat share that national party has in national election
    x = SeatsNatPal.prop,
    #Candidate-centered (or not) electoral system for EU elections
    z = OpenList) 
```

---

### Exercise 1: Simpson’s paradox and leveraging variation
**Q1:** Describe the bivariate relationship between MEPs’ local staff (ShareOfLocalStaff, y) and their party’s seat share (SeatNatPal.prop, x) by fitting a linear regression. What do you find?

**A1:** Intercept is 2,56, meaning that when x (party size) equals zero, the number of local staff is 2,56. There's a small negative relationship between party size (x) and local staff (y). The coefficient is -0,44, meaning for every 1,0 increase in seat share, MEPs hire about 0,44 fewer local staff. The effect is only barely significant (one star, p = 0,019). R² is basically zero. The model explains almost nothing of the variation in local staff.

```{r}
model_a <- lm(y ~ x, data = df)
summary(model_a)
```

---

**Q2:** Illustrate the relationship between the two in ggplot2. You can use the geometric function for local regression and specify the method geom_smooth(aes(x, y), method = "loess"). Change the method to “lm” (linear model). What do you think?

**A2:** The line in plot 1 wiggles up and down. This tells you the relationship isn't even consistently negative. It's messy and unclear. The line in plot 2 is almost flat with a very slight downward slope. That's our -0,44 coefficient visualized. But look at how spread out the points are. The line barely captures any pattern. Both plots confirm the same story. When you pool all MEPs together (complete pooling) and ignore groups, the relationship between party size and local staff is weak and messy. The data is screaming that something else is creating noise that hides the real pattern...

```{r}
ggplot(df, aes(x, y)) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = "loess")
```

```{r}
ggplot(df, aes(x, y)) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = "lm")
```

---

**Q3:** Building on your previous ggplot code, use the color = "Nationality" argument to create a separate regression for each MEP nationality. You can remove the uncertainty by adding se = FALSE if the shaded lines makes it hard to read the plot. Compare the new plot with the plot from 1b. What happened?

**A3:** Most country lines now slope downward. The negative effect is clearer once you separate by nationality. You can also see that countries have very different intercepts. That's the between-group variation that was confounding the pooled model. The pooled model said barely any effect. The split-by-country model says clear negative effect, it was just hidden by country differences.

```{r}
ggplot(df, aes(x, y, color = Nationality)) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = "lm", se = FALSE)
```

---

**Q4:** Ok. Calculate the grouped averages of both x and y using group_by() and reframe(). You now have a new data frame with 28 observations. Plot x and y against each other in a scatterplot. Using geom_smooth(aes(x,y), method = “lm”) you can also fit a regression line to the data. What do you see?

**A4:** When we collapse the data to country averages, the relationship between x and y is positive. Countries with bigger parties on average also have more staff on average. This is the opposite of what we found within countries (negative). That's Simpson's paradox. The between-country relationship goes in one direction, the within-country relationship goes in the other, and the pooled model mixed them together to produce a misleading near-zero result.

```{r}
df_grouped <- df %>%
  group_by(Nationality) %>%
  reframe(x = mean(x, na.rm = TRUE),
          y = mean(y, na.rm = TRUE))

ggplot(df_grouped, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm")
```

---

**Q5:** Fit a fixed-effects model where you control for Nationality.

**A5:** x equals -1,68 (three stars, highly significant). Compare that to the pooled model's -0.44. The effect is almost 4 times stronger once you control for nationality. R² also goes up by a lot compared to the pooled model. Just adding nationality explains a lot more of the variation. The nationality dummies show the different baseline levels compared to the reference country (Austria). For example, Poland is 4,91 and Sweden is -1.49, meaning Polish MEPs hire way more staff than Austrian ones, and Swedish MEPs hire fewer. These are exactly the country differences that were confounding the pooled estimate. So by removing the between-country variation with dummies, you isolate the within-country effect of x, and it's strong, negative, and significant. Exactly what the theory predicts.

```{r}
model_fe <- lm(y ~ x + Nationality, data = df)
summary(model_fe)
```

---

**Q6:** Use stargazer() to compare your pooled model with the fixed-effects model. If you don’t want to see the regression coefficients relating to Nationality, you may use the argument omit = "Nationality". What happened to the effect of vote share? Why do you think this happened?

**A6:** Model 1 (pooled): x = -0,44, two stars, R² = 0,001. Model 2 (fixed effects): x = -1,68, three stars, R² = 0,336. By adding nationality dummies, the effect became almost 4 times stronger, went to highly significant, and the model now explains a lot more of the variation. That's the power of removing confounders. The between-country variation (positive) was cancelling out the within-country variation (negative), making the pooled estimate weak and misleading.

```{r}
stargazer(model_a, model_fe, type = "text", omit = "Nationality")
```

---

### Exercise 2: Two ways to obtain the same result
**Q1:** Re-fit a pooled model where you control for LaborCost. What happens?

**A1:** Controlling for LaborCost gets you some of the way toward the true effect of x, but nationality dummies still do a better job because they absorb everything that differs between countries.

```{r}
model_labor <- lm(y ~ x + LaborCost, data = df)
summary(model_labor)
```

---

**Q2:** What happens if you add national fixed effects to that model?

**A2:** if LaborCost had been perfectly constant within countries (no time variation), R would have dropped it due to perfect collinearity. It survived only because it has some within-country variation over time. But notice: x = -1,69, which is almost identical to the fixed effects model without LaborCost (x = -1,68). So adding LaborCost on top of nationality dummies barely changed the effect of x, the dummies were already doing most of the work.

```{r}
model_both <- lm(y ~ x + LaborCost + Nationality, data = df)
summary(model_both)
```

---

### Exercise 3: Non-identified models
**Q1:** Fit a pooled model with predictors for vote share (x) and electoral system (z).

**A1:** x = -0,23, not significant. Even weaker than the pooled model without z (-0,44). Adding z absorbed some of the variation that x was capturing before. z = 0,67, three stars, highly significant. MEPs in open-list (candidate-centered) electoral systems hire about 0,67 more local staff. That makes sense if you need personal votes, you invest more in local presence.

```{r}
model_pooled_xz <- lm(y ~ x + z, data = df)
summary(model_pooled_xz)
```

---

**Q2:** Fixed-effects model: Add national-level fixed effects to the pooled model. What happened to Sweden? Why?

**A2:** Sweden was dropped (NA) because z (electoral system) has no within-country variation, each country's electoral system stays the same across all observations. Since the nationality dummies already capture all country-level differences, including electoral system, z provides no new information. This creates perfect collinearity, which R resolves by dropping one country (Sweden).

```{r}
model_fe_xz <- lm(y ~ x + z + Nationality, data = df)
summary(model_fe_xz)
```

---

**Q3:** Tweak your fixed-effects model such that you remove z. What happens to Sweden?

**A3:** Sweden is back! Exactly because we removed z. The problem before was that z and the nationality dummies contained the same information, so R had to drop one. Without z, there's no collinearity and all 28 countries get their own dummy.

```{r}
model_fe_x <- lm(y ~ x + Nationality, data = df)
summary(model_fe_x)
```

---

### Exercise 4: Demeaning and fixed effects models

**Q1:** Begin by calculating the number of MEPs that experienced at least one change in national party size (x) and local staff (y). In your opinion, do you have enough variation in the data to estimate within-individual effects?

**A1:** So about half the MEPs had changes in x and about 76 % had changes in y. That's plenty of within-individual variation to estimate fixed effects. The model won't use information from MEPs who never changed, but there's still a large enough group who did.

```{r}
n_total <- n_distinct(df$ID)

n_change_x <- df %>%
  group_by(ID) %>%
  mutate(change_x = x - lag(x)) %>%
  filter(change_x != 0) %>%
  summarise(n = n()) %>%
  nrow()

n_change_x
n_change_x / n_total * 100

n_change_y <- df %>%
  group_by(ID) %>%
  mutate(change_y = y - lag(y)) %>%
  filter(change_y != 0) %>%
  summarise(n = n()) %>%
  nrow()

n_change_y
n_change_y / n_total * 100
```

---

**Q2:** Leverage only the between-individual variation between MEPs’ local staff and their party size.

**A2:** x = 0,03, insignificant. R² is basically zero. There's no between-individual relationship at all.

```{r}
df0 <- df %>%
  group_by(ID) %>%
  reframe(x = mean(x, na.rm = TRUE),
          y = mean(y, na.rm = TRUE))

mod_between <- lm(y ~ x, data = df0)
summary(mod_between)
```

---

**Q3:** Leverage only the within-individual variation in a similar model.

**A3:** By demeaning, subtracting each MEP's own average from x and y,  we isolate only within-individual variation. The coefficient is -0,67, but not significant.

```{r}
df <- df %>%
  group_by(ID) %>%
  mutate(x_id = mean(x, na.rm = TRUE),
         y_id = mean(y, na.rm = TRUE)) %>%
  ungroup()

mod_within <- lm(I(y - y_id) ~ -1 + I(x - x_id), data = df)
summary(mod_within)
```

---

**Q4:** Estimate a fixed effects model where you add MEP’s id variable as a covariate.

**A4:** Adding a dummy for every MEP gives x = -0,667, not significant. The same coefficient as the demeaned model. This confirms that demeaning and adding individual dummies are mathematically equivalent.

```{r}
mod_fe <- lm(y ~ x + as.factor(ID), data = df)
stargazer(mod_within, mod_fe, type = "text", omit = "as.factor")
```

---

**Q5:** Do the Mundlak move. Estimate a model on you main data frame with both the between- and within- variation variables included.

**A5:** The Mundlak move separates the total effect into two parts. The between effect (x_id = -0,434) is the relationship when comparing different MEPs to each other. The within effect I(x - x_id) = -0,667) is the relationship when comparing each MEP to themselves over time. Both are negative, but the within effect is larger. 

```{r}
mod_both <- lm(y ~ x_id + I(x - x_id), data = df)
stargazer(mod_between, mod_within, mod_fe, mod_both, type = "text", omit = "as.factor")
```

**Q6:** What did you just do? What do these models tell you?

**A6:** We split the effect of party size into between and within components. The demeaned model and the individual fixed effects model give the exact same coefficient (-0,667), proving they're mathematically equivalent. The Mundlak move puts both components in one model. The between effect is -0,43 and the within effect is -0,67. The within effect is stronger and more meaningful because it removes everything that's constant about each MEP.

---

### Exercise 5: Variation is information
**Q1 + Q2:** Go back to your very first code and store in your data which MEP experienced at least one change in x (and in y). Reestimate the individual fixed-effects model on data where you filter out all MEPs that have not experienced a change (filter(change_x == 0)). Store the the result in mod_fe_var and add it to the stargazer results table. Compare the regression coefficents of mod_fe and mod_fe_var. What do you see? Why? What about the standard errors?

**A1 + A2:** Coefficients are almost identical. Removing the non-changers barely changed the estimate, because they weren't contributing any information anyway. Standard errors went up. That's because we removed ~ 2,900 observations. Fewer observations = less data = more uncertainty. The non-changers don't affect what the answer is, but they do affect how confident we are in it (standard errors go up).

```{r}
df <- df %>%
  group_by(ID) %>%
  mutate(change_x = any(x != lag(x), na.rm = TRUE)) %>%
  ungroup()

mod_fe_var <- lm(y ~ x + as.factor(ID), data = df %>% filter(change_x == TRUE))

stargazer(mod_within, mod_fe, mod_fe_var, type = "text", omit = "as.factor")
```

---

### Exercise 6: Interpret using partial scenarios
**Context:** Let’s theorize that MEPs will move to compensate for loss in party funding coming from an unfortunate national election. To do so, we expect that MEPs will respond to changes in their own party’s size. That’s the intuition the fixed effects model narrows down on. Now that we are focusing on the within-individual change in party size (a proxy for the party’s funding), the partial scenario for interpretation changes. We are no longer interested in the between-MEP variation in party size, but what a realistic change in x is after a national election. 

**Q1:** Use the same group_by() and mutate() trick and calculate some metric of the typical change in party size following a national election. To do so, calculate the difference between x and the previous x for each MEP in the main data, then filter out all non-changes using filter(). This is the increment (change) you use when you interpret β, your slope coefficient for x.

**A1:** A typical change in party size after a national election is about 5 percentage points (the median), with a mean of about 7 percentage points. This is the realistic increment we should use when interpreting β.

```{r}
df %>%
  group_by(ID) %>%
  mutate(change_x = x - lag(x)) %>%
  filter(change_x != 0) %>%
  ungroup() %>%
  summarise(
    mean = mean(abs(change_x), na.rm = TRUE),
    median = median(abs(change_x), na.rm = TRUE),
    sd = sd(abs(change_x), na.rm = TRUE)
  )
```

---

**Q2:** Calculate the marginal effect of such a change in MEP’s local staff.

**A2:** Using the median change of 0,05, the marginal effect on local staff is -0,036. Using the mean change of 0,07, the effect is -0,048. In both cases, a typical national election results in a reduction of less than 0,05 local staff members.

```{r}
#Median
-0.667 * 0.05333

#Mean
-0.667 * 0.07214
```

---

**Q3:** Formulate this in a plain English way. In your opinion, is this a “big” effect?

**A3:** No, this is not a big effect. It's essentially zero. The theory predicts that MEPs compensate for losing party support by hiring more local staff, but at the individual level, a single election barely changes anything. This makes sense. MEPs don't reshuffle their entire staff team because their party gained or lost a few seats. Staff changes are probably driven more by other things.

---

### Exercise 7: Consider time varying effects
**Context:** The general level of MEP’s local staff changed over the period for reasons unrelated to national electoral fortunes. First, the European Parliament finally adressed a phenomenon among some MEPs to invest all their funding in local presence rather than legislative work. In 2016 the European Parliament therefore reformed the system by imposing a rule that only 75% of the funding can be spent locally. After that, the general level of local staff dropped. Second, the overall local staff size was higher in the spring of 2014 because of the European election where many of them were competing for seats.


**Q1:** Are these elements confounders? How could you adress it?

**A1:** These are not classical confounders since they mainly affect y (local staff) and not x (party size). However, they are time-specific shocks that hit all MEPs at the same time, which could still bias the estimate if they coincide with changes in party size. To address this, we include period fixed effects alongside individual fixed effects (ID), giving us a two-way fixed effects model that controls for both time-invariant MEP differences and period-specific shocks.

```{r}
mod_fe_time <- feols(y ~ x | ID + Period, data = df)
summary(mod_fe_time)
```

---

**Q2:** Explore the general variation in local staff size across periods (Period) in the data.

**A2:** The mean local staff size varies considerably across periods. It starts around 2,9 in period 1, drops to about 2,4 by period 4, spikes back up to 2,9 in period 5 (the EP election in June 2014), then drops sharply after period 8 to around 1,9 in periods 9-10 (following the 2016 reform). This confirms that there is substantial time variation in y that is unrelated to party size and needs to be accounted for.

```{r}
df %>%
  select(Period, DatePeriod) %>%
  distinct() %>%
  arrange(Period)
```

```{r}
df %>%
  group_by(Period) %>%
  summarise(mean_y = mean(y, na.rm = TRUE))
```

```{r}
df %>%
  group_by(Period) %>%
  summarise(mean_y = mean(y, na.rm = TRUE)) %>%
  ggplot(aes(x = Period, y = mean_y)) +
  geom_line() +
  geom_point() +
  labs(x = "Period", y = "Mean local staff")
```

---

**Q3:** Address the problem if you perceive any.

**A3:** The plot in question 2 shows substantial variation in mean local staff across periods that is unrelated to party size. To address this, we add period fixed effects alongside individual fixed effects, giving us a two-way fixed effects model, which was shown in question 1.

---

**Q4:** Re-interpret your results. What changed?

**A4:** After adding period fixed effects, the coefficient on x changed from -0,667 to -1,253 and became statistically significant. This means that once we control for time-specific shocks like the 2016 reform and the 2014 EP election, the effect of party size on local staff is actually stronger and significant. The earlier model underestimated the effect because time-varying factors were creating noise in the estimate.

---

### Exercise 8: Your call

**Q1:** Considering all the covariates in the data, which model would you go for?

**A1:** I would go with the two-way fixed effects model.

---

**Q2:** What did you learn?

**A2:** Party size has a statistically significant negative effect on local staff hiring. However, the realistic effect of a typical election is still small. MEPs do appear to compensate for party losses, but only marginally.