Linear regression problem set

Conceptual points

Q1: Can you express in layman’s terms what a “standard deviation” of a variable is?

A1: A standard deviation tells you how spread out the values of a variable are around the average. If the standard deviation is small, most values are close to the average. If it’s large, values are scattered far from the average.
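
As a small illustration of the idea, the standard deviation can be computed by hand and compared with R's built-in `sd()`. The income figures are made up for this sketch and do not come from any dataset used in the problem set.

```r
# Hypothetical monthly incomes in DKK (made-up numbers for illustration)
income <- c(52000, 55000, 58000, 60000, 75000)

avg <- mean(income)            # the average
deviations <- income - avg     # how far each value lies from the average

# Sample standard deviation: square the deviations, sum them,
# divide by n - 1, then take the square root
sd_by_hand <- sqrt(sum(deviations^2) / (length(income) - 1))

sd_by_hand
sd(income)   # R's built-in function gives the same number
```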

Q2: What are the “residuals” of the regression? How are they calculated?

A2: Residuals are the mistakes your model makes, calculated as Residual = Observed Y - Predicted Y. If someone actually earns 60,000 DKK but your model predicted 55,000 DKK, the residual is 5,000 DKK. Residuals are our estimate of the fundamental uncertainty: all the random stuff (weather, illness, luck, etc.) that affects Y but isn’t captured by your explanatory variable X. This is the stochastic component.
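
A minimal sketch of the calculation, using simulated data rather than the MEP data (the variables `x`, `y` and the model `fit` are invented for illustration):

```r
set.seed(1)
x <- rnorm(100)
y <- 2 + 0.5 * x + rnorm(100)   # the last term is the stochastic component

fit <- lm(y ~ x)

# Residual = observed Y - predicted Y
resid_by_hand <- y - fitted(fit)

# Same values as R's residuals() helper
all.equal(unname(resid_by_hand), unname(residuals(fit)))
```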

Q3: What is a variance-covariance matrix?

A3: A variance-covariance matrix is a table that summarizes the uncertainty of your parameter estimates. The diagonal contains the variance of each coefficient, which is just the standard error squared. The larger the variance, the less precise your estimate. The off-diagonal elements contain the covariances, which tell you whether the uncertainties of two coefficients are linked. A negative covariance means that when one coefficient is overestimated, the other tends to be underestimated.
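
The sketch below shows these pieces on simulated data (all variable names are hypothetical). The predictor `x2` is deliberately given a mean well above zero, which in this setup tends to produce a negative covariance between its coefficient and the intercept.

```r
set.seed(2)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n, mean = 5)
y  <- 1 + 0.5 * x1 - 0.3 * x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)
V   <- vcov(fit)   # variance-covariance matrix of the coefficients

# Diagonal = variances; their square roots are the standard errors
se_from_V       <- sqrt(diag(V))
se_from_summary <- summary(fit)$coefficients[, "Std. Error"]

V
cbind(se_from_V, se_from_summary)
```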

Q4: What is the role of the variance-covariance matrix in the King et al. article?

A4: In the text, the variance-covariance matrix is the engine of their entire simulation approach. Their method works in three steps. First, estimate the model and record the point estimates (γ̂) and the variance-covariance matrix V(γ̂). Second, draw simulated parameter values from a multivariate normal distribution using those two pieces of information. Third, compute the quantity of interest (for example an expected value or a first difference) from each draw and summarize the resulting distribution. The variance-covariance matrix determines how spread out and correlated these simulated draws are.
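
The simulation logic can be sketched roughly as follows. This is an illustration on simulated data, not the authors' own code. It draws only the regression coefficients (a simplification) and ends by computing a quantity of interest from each draw. The data, the model `fit` and the scenario value `x = 12` are assumptions made for the example.

```r
library(MASS)  # mvrnorm(); MASS ships with standard R installations

set.seed(3)
n <- 300
x <- rnorm(n, mean = 10, sd = 2)
y <- 4 - 0.2 * x + rnorm(n)
fit <- lm(y ~ x)

# Step 1: point estimates and their variance-covariance matrix
gamma_hat <- coef(fit)
V_hat     <- vcov(fit)

# Step 2: draw simulated parameter vectors from a multivariate normal
draws <- mvrnorm(n = 1000, mu = gamma_hat, Sigma = V_hat)

# Step 3: compute the quantity of interest for every draw,
# here the expected value of y when x = 12
ev <- draws[, 1] + draws[, 2] * 12

# Summarize the simulated distribution (median and 95% interval)
quantile(ev, c(0.025, 0.5, 0.975))
```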

Q5: Can you explain what the covariance matrix is good for in this example?

A5: In practical terms, the covariance matrix allows researchers to do something powerful. They can translate raw regression output into quantities that anyone can understand, while properly accounting for uncertainty. For example, instead of saying the coefficient on education is 0.3 with a standard error of 0.1, you can say an extra year of education is associated with 1,500 DKK more income on average, plus or minus about 500 DKK.

Q6: What is the difference between fundamental and estimation uncertainty?

A6: Fundamental uncertainty comes from the randomness of the world itself (stochastic component). Estimation uncertainty comes from not having infinite data. We estimate β and α from a sample, so our estimates are imperfect. If we had more observations, our estimates would be more precise. This type of uncertainty can be reduced by collecting more data.
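
A rough way to see the difference is to fit the same model on samples of increasing size. In this sketch (simulated data, hypothetical helper `simulate_fit`) the standard error of the slope should shrink as n grows, while the residual standard deviation stays near the value built into the simulation (2).

```r
set.seed(4)

# Fit y ~ x on n simulated observations and return both kinds of uncertainty
simulate_fit <- function(n) {
  x <- rnorm(n)
  y <- 1 + 0.5 * x + rnorm(n, sd = 2)   # fundamental uncertainty: sd = 2
  fit <- lm(y ~ x)
  c(n        = n,
    se_slope = unname(sqrt(diag(vcov(fit)))["x"]),  # estimation uncertainty
    resid_sd = sigma(fit))                          # fundamental uncertainty
}

res <- t(sapply(c(50, 500, 5000), simulate_fit))
res
```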

Q7: What is the difference between expected and predicted values of Y, and how does this relate to fundamental vs. estimation uncertainty? When am I interested in one rather than the other?

A7: Expected values give you the average outcome for a given set of X values. They only contain estimation uncertainty. The variability comes solely from not knowing the parameters perfectly. Fundamental uncertainty is averaged away. Use expected values when you care about the average effect of a variable. For example, on average, how many more assistants does a candidate-centered MEP have compared to a party-centered one? Predicted values give you a specific outcome for a given set of X values. They contain both estimation uncertainty and fundamental uncertainty. Use predicted values when you care about a specific case. For example, how many assistants will this particular MEP actually have? Here you need to account for all the random factors that could push the actual outcome away from the average.
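
In base R, one way to see the distinction is through the `interval` argument of `predict()`: `"confidence"` gives an interval for the expected value, `"prediction"` for an individual outcome. The sketch below uses simulated data and an arbitrary scenario (`x = 1`), both of which are assumptions for illustration.

```r
set.seed(5)
x <- rnorm(200)
y <- 3 + 1.5 * x + rnorm(200, sd = 2)
fit <- lm(y ~ x)

new_case <- data.frame(x = 1)

# Expected value: estimation uncertainty only
expected  <- predict(fit, newdata = new_case, interval = "confidence")
# Predicted value: estimation + fundamental uncertainty
predicted <- predict(fit, newdata = new_case, interval = "prediction")

# Same point estimate, but the prediction interval is wider
rbind(expected = expected[1, ], predicted = predicted[1, ])
```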

Exercise in R (point estimates and simulation using ggpredict)

Q1: Can you re-fit model 2 with each MEP’s national party size in the national parliament as a predictor?

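A1: Adding SeatsNatPal.prop (party size in the national parliament) to model 2 yields a coefficient of -0.184 (standard error 0.625), which is not statistically significant. The other coefficients remain largely unchanged. Note that the sample shrinks from 739 to 722 observations because SeatsNatPal.prop is missing for 17 MEPs.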

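library(dplyr)      # also provides the %>% pipe
library(stargazer)  # regression tables
library(ggeffects)  # ggpredict()
library(ggplot2)    # ggplot(), ylab(), xlab(), ggtitle()

df <- MEP2014  # assumes the MEP2014 data are already loaded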
mod2 <- lm(LocalAssistants ~ OpenList + LaborCost, df)
mod2.party <- lm(LocalAssistants ~ OpenList + LaborCost + SeatsNatPal.prop, df)
stargazer(mod2, mod2.party, type = "text")

## 
## ===================================================================
##                                   Dependent variable:              
##                     -----------------------------------------------
##                                     LocalAssistants                
##                               (1)                     (2)          
## -------------------------------------------------------------------
## OpenList                   0.829***                0.937***        
##                             (0.228)                 (0.227)        
##                                                                    
## LaborCost                  -0.070***               -0.068***       
##                             (0.010)                 (0.010)        
##                                                                    
## SeatsNatPal.prop                                    -0.184         
##                                                     (0.625)        
##                                                                    
## Constant                   4.127***                4.057***        
##                             (0.286)                 (0.352)        
##                                                                    
## -------------------------------------------------------------------
## Observations                  739                     722          
## R2                           0.081                   0.085         
## Adjusted R2                  0.079                   0.081         
## Residual Std. Error    3.083 (df = 736)        3.009 (df = 718)    
## F Statistic         32.612*** (df = 2; 736) 22.200*** (df = 3; 718)
## ===================================================================
## Note:                                   *p<0.1; **p<0.05; ***p<0.01

Q2: What is the marginal effect of party size on MEP’s local investment?

A2: The marginal effect of party size is -0.184. Because SeatsNatPal.prop is a proportion, a one-unit increase means going from no seats to all seats in the national parliament and is associated with 0.184 fewer local assistants, on average; a more realistic 10-percentage-point increase corresponds to about 0.018 fewer. However, since the effect is not statistically significant (standard error 0.625), we cannot distinguish it from zero. In substantive terms, party size does not appear to affect how many local assistants an MEP hires, once we control for the electoral system and labor costs.

Q3: Create two scenarios, justify your choice and calculate the first difference between the two.

A3: I chose two scenarios based on the quartiles of the party size variable (Q1 = 0.09 and Q3 = 0.40) to ensure they represent realistic values in the data. Holding OpenList at 0 and LaborCost at 22.96, the predicted staff size for a small party is 2.47 (95% CI: 2.09-2.85) and for a large party is 2.41 (95% CI: 2.08-2.75). The first difference is -0.06, meaning that moving from a small to a large national party is associated with essentially no change in local staff size. The confidence intervals overlap almost entirely, which is consistent with the non-significant coefficient; in a linear model without interactions the first difference is just the coefficient times the change in X, so it is significant only if the coefficient is.

summary(df$SeatsNatPal.prop)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## 0.00000 0.08769 0.29738 0.26110 0.40426 0.66834      17

eff <- ggpredict(mod2.party, terms = "SeatsNatPal.prop [0.09, 0.40]")
eff

## # Predicted values of LocalAssistants
## 
## SeatsNatPal.prop | Predicted |     95% CI
## -----------------------------------------
##             0.09 |      2.47 | 2.09, 2.85
##             0.40 |      2.41 | 2.08, 2.75
## 
## Adjusted for:
## *  OpenList =  0.00
## * LaborCost = 22.96

diff(eff$predicted)

## [1] -0.05711017
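
As a cross-check, the first difference can be reproduced by hand from the rounded regression table, since in a linear model without interactions it equals the coefficient times the change in X. The 95% interval below uses a normal approximation, which is an assumption of this sketch; small discrepancies from the output above come from rounding.

```r
b_party  <- -0.184        # coefficient on SeatsNatPal.prop (rounded, from the table)
se_party <- 0.625         # its standard error (rounded, from the table)
delta_x  <- 0.40 - 0.09   # move from the first to the third quartile

# First difference and its standard error
first_diff <- b_party * delta_x
se_fd      <- abs(delta_x) * se_party

# Approximate 95% confidence interval (normal approximation)
ci <- first_diff + c(-1, 1) * qnorm(0.975) * se_fd

first_diff
ci   # the interval spans zero
```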

Q4: Visualize the effect of party size on MEP’s local investment.

A4: The plot shows a nearly flat line with a wide confidence interval, visually confirming that party size has no meaningful effect on the number of local assistants. This is consistent with the non-significant coefficient from the regression.

eff_full <- ggpredict(mod2.party, terms = "SeatsNatPal.prop")
plot(eff_full) +
  ylab("Predicted local staff size") +
  xlab("Party size in national parliament (proportion)") +
  ggtitle("Effect of national party size on MEP local staff",
          subtitle = "Controlling for OpenList and LaborCost")

Exercise in R (fundamental variation)

Q1 + Q2: Can you calculate the residuals for model 1, then model 2 and store them as separate variables in R? Can you describe the residuals of the two models in a histogram, then in numbers by calculating the mean and standard deviation?

A1 + A2: Both models have a residual mean of essentially zero (as expected with OLS when the model includes an intercept). The standard deviation of the residuals drops slightly from 3.18 in model 1 to 3.08 in model 2. This means that adding LaborCost as a predictor reduces the unexplained variation, which is our estimate of the fundamental uncertainty: the model makes slightly smaller mistakes on average. Both histograms are right-skewed with a few extreme outliers, which means the residuals are not perfectly normally distributed.

mod1 <- lm(LocalAssistants ~ OpenList, df)
mod2 <- lm(LocalAssistants ~ OpenList + LaborCost, df)

df$resid_mod1 <- residuals(mod1)
df$resid_mod2 <- residuals(mod2)

ggplot(data.frame(r = residuals(mod1)), aes(r)) +
  geom_histogram(bins = 30) +
  ggtitle("Residuals: Model 1 (OpenList only)")

ggplot(data.frame(r = residuals(mod2)), aes(r)) +
  geom_histogram(bins = 30) +
  ggtitle("Residuals: Model 2 (OpenList + LaborCost)")

mean(residuals(mod1))

## [1] 4.254603e-16

sd(residuals(mod1))

## [1] 3.176834

mean(residuals(mod2))

## [1] 5.910172e-16

sd(residuals(mod2))

## [1] 3.078466

Q3: What is the difference between the two sets of residuals and why?

A3: Model 2’s residuals have a smaller standard deviation (3.08 vs. 3.18) because adding LaborCost as a predictor explains more of the variation in LocalAssistants. Variation that model 1 treated as random is now attributed to LaborCost, so there is less leftover randomness that the model cannot account for.

Q4 + Q5: Can you extract the variance-covariance matrix for model 2? Can you calculate the standard error for the regression coefficients (parameters) from the variance-covariance matrix?

A4 + A5: The standard errors are the square roots of the diagonal elements: Intercept = 0.286, OpenList = 0.228, and LaborCost = 0.010.

vcov(mod2)

##              (Intercept)      OpenList     LaborCost
## (Intercept)  0.081867230 -0.0284615942 -0.0024363118
## OpenList    -0.028461594  0.0519105737  0.0001759822
## LaborCost   -0.002436312  0.0001759822  0.0001031140

sqrt(diag(vcov(mod2)))

## (Intercept)    OpenList   LaborCost 
##  0.28612450  0.22783892  0.01015451

Q6: Are there any predictors that correlate more than others?

A6: The correlation between the Intercept and LaborCost estimates is -0.84, which is very strong. This means when the model overestimates the intercept, it tends to underestimate the effect of LaborCost, and vice versa. The correlation between Intercept and OpenList is moderate (-0.44), while OpenList and LaborCost are nearly independent (0.08).

cov2cor(vcov(mod2))

##             (Intercept)    OpenList   LaborCost
## (Intercept)   1.0000000 -0.43659249 -0.83853084
## OpenList     -0.4365925  1.00000000  0.07606449
## LaborCost    -0.8385308  0.07606449  1.00000000
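
To make the conversion explicit, each correlation is the covariance divided by the product of the two standard errors. The sketch below re-enters the printed variance-covariance matrix by hand (so the values carry the rounding of the printout) and reproduces what `cov2cor()` does.

```r
# Variance-covariance matrix of model 2, typed in from the printed output
nm <- c("(Intercept)", "OpenList", "LaborCost")
V <- matrix(c( 0.081867230, -0.0284615942, -0.0024363118,
              -0.028461594,  0.0519105737,  0.0001759822,
              -0.002436312,  0.0001759822,  0.0001031140),
            nrow = 3, byrow = TRUE, dimnames = list(nm, nm))

se <- sqrt(diag(V))   # standard errors

# correlation_ij = covariance_ij / (se_i * se_j)
cor_by_hand <- V / outer(se, se)

round(cor_by_hand, 2)
all.equal(cor_by_hand, cov2cor(V))
```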

Q7: How does this relate to King et al’s argument?

A7: The text argues that you need the full variance-covariance matrix, not just the individual standard errors, to properly account for uncertainty. The variance-covariance matrix ensures that when you simulate parameters to calculate quantities like predicted values or first differences, the draws respect these connections, giving you accurate confidence intervals. Without it, your uncertainty estimates would be wrong.
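
A small worked example suggests how much the covariances can matter. Instead of simulating, this sketch uses the analytic variance of a linear combination, x'Vx, to get the standard error of an expected value under model 2. It is computed once with the full matrix and once with the off-diagonal elements ignored. The matrix is typed in from the printed output, and the scenario (OpenList = 0, LaborCost = 22.96) is chosen here purely for illustration.

```r
# Variance-covariance matrix of model 2, typed in from the printed output
nm <- c("(Intercept)", "OpenList", "LaborCost")
V <- matrix(c( 0.081867230, -0.0284615942, -0.0024363118,
              -0.028461594,  0.0519105737,  0.0001759822,
              -0.002436312,  0.0001759822,  0.0001031140),
            nrow = 3, byrow = TRUE, dimnames = list(nm, nm))

# Illustrative scenario: intercept = 1, OpenList = 0, LaborCost = 22.96
x <- c(1, 0, 22.96)

# Standard error of the expected value using the full matrix: sqrt(x' V x)
se_full <- sqrt(drop(t(x) %*% V %*% x))

# Same calculation if the covariances are (wrongly) treated as zero
se_diag <- sqrt(sum(x^2 * diag(V)))

c(se_full = se_full, se_diag = se_diag)
```

In this scenario, ignoring the strongly negative covariance between the intercept and LaborCost would overstate the uncertainty of the expected value.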