rm(list = ls())
This tutorial is for my own understanding of the linear model function in R. The file works through fitting a linear model on simulated data, a real labor-market dataset, a difference-in-differences example, and a regression discontinuity example.
Estimation of unknown parameters is at the crux of social science and educational research. When we analyze the conditional distribution of an outcome (dependent) variable given one or more predictors, which may be continuous random variables, we use a LINEAR MODEL, and the way we estimate its parameters is called LINEAR REGRESSION (Duflo, 2021). “Linear models are the cornerstone of statistical methodology” (Caffo, 2017, p. 1). We often use linear models to estimate parameters of the joint distribution because most models include more than one predictor variable. Linear models can be used for:
1. Prediction: develop a model that can infer a single aspect of the data (the predicted variable) from some combination of other aspects of the data (the predictor variables).
2. Causality: what happens to something when we apply some sort of treatment? For example, does the use of iPads in first-grade classrooms increase students’ learning gains? &
3. Discovery with models, or distillation of data for human judgment: sometimes we use a regression model to understand our data better, so that we understand the phenomenon in question better; we may then take appropriate action, such as refining the existing model.
In general, linear regression attempts to model the relationship between a dependent variable and one or more independent variables by fitting a linear equation to the observed data. The simplest version of the equation looks like this:
Y = mX + c
Where Y is the dependent variable, X is the predictor/independent variable, ‘c’ is the intercept (a constant), and ‘m’ is the slope. The equation grows as we add more independent variables.
Now, I am moving towards creating some fake data and fitting a regression model.
In R, we can fit a regression model using the following syntax:
lm(formula, data, subset, weights, na.action, method = "qr", model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, contrasts = NULL, offset, ...)
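In practice only the first two arguments are needed most of the time. As a minimal, purely hypothetical illustration (df, y, and x are placeholders, not part of this tutorial's data):
# Hypothetical illustration of the basic lm() pattern
fit <- lm(y ~ x, data = df)  # fit y on x from a data frame named df
summary(fit)                 # coefficient table, R-squared, F-statistic
coef(fit)                    # just the estimated intercept and slope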
With this knowledge in my pocket, I am now ready to test this model.
# Loading Required Library
library(dplyr)
library(ggplot2)
library(UsingR)
set.seed(0417)
b <- c(120:180) # creates vector b: 61 values from 120 to 180
a <- c(50:80) + (b/100) # creates vector a; 50:80 (length 31) is recycled to match b (length 61, with a warning), giving a loose relationship between the variables
## Now creating an empty table with 100 rows and two columns
mydata_m <- matrix(ncol=2, nrow=100, data=NA)
colnames(mydata_m) <- c("Height", "Weight")
head(mydata_m)
Height Weight
[1,] NA NA
[2,] NA NA
[3,] NA NA
[4,] NA NA
[5,] NA NA
[6,] NA NA
The empty matrix has been created. It has two columns, “Height” and “Weight”, and 100 rows. I am now going to simulate the variables and fill the matrix with the fake data:
set.seed(0417)
library(data.table)
for(i in 1:100){
  mydata_m[i, 1] <- sample(b, size = 1) # draw one height for row i
  mydata_m[i, 2] <- sample(a, size = 1) # draw one weight for row i
}
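For reference, the loop above could be replaced with one vectorized sample() call per column. This is just a sketch of my own (stored in a separate, hypothetical matrix mydata_v so the loop's data above is untouched):
# Vectorized alternative to the loop above (kept in a separate matrix)
mydata_v <- matrix(ncol = 2, nrow = 100, dimnames = list(NULL, c("Height", "Weight")))
mydata_v[, 1] <- sample(b, size = 100, replace = TRUE)
mydata_v[, 2] <- sample(a, size = 100, replace = TRUE)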
# Now, I want to change my matrix to a data table
mydata <- as.data.table(mydata_m)
summary(mydata)
Height Weight
Min. :120.0 Min. :51.51
1st Qu.:133.8 1st Qu.:58.50
Median :149.5 Median :65.34
Mean :148.4 Mean :65.61
3rd Qu.:161.0 3rd Qu.:71.48
Max. :180.0 Max. :80.80
Here we go. I created a matrix with two variables and then converted it to a data table so that I can manipulate the data as I want.
Based on the output, the minimum height is 120 cm and the maximum is 180 cm, with a mean of 148.4 cm among the 100 subjects. The minimum weight is 51.51 kg and the maximum is 80.80 kg, with a mean of 65.61 kg and a median of 65.34 kg.
I am now ready to run a linear model.
I am going to fit a linear model to test whether there is any relationship between people’s height and weight. It is obvious that in real data there is a strong relationship, but for the sake of this practice I am pretending I am not sure about it in this particular sample.
My outcome (dependent) variable will be the person’s weight, and the predictor/regressor/explanatory/independent variable will be the person’s height, matching the formula Weight ~ Height below. I am going to save the fitted model in an object called ‘fit’.
fit <- lm(Weight~Height, data=mydata)
summary(fit)
Call:
lm(formula = Weight ~ Height, data = mydata)
Residuals:
Min 1Q Median 3Q Max
-13.6460 -6.9360 -0.4791 6.8876 15.1548
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 70.92645 6.99517 10.139 <2e-16 ***
Height -0.03584 0.04680 -0.766 0.446
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 8.306 on 98 degrees of freedom
Multiple R-squared: 0.00595, Adjusted R-squared: -0.004194
F-statistic: 0.5866 on 1 and 98 DF, p-value: 0.4456
The program ran successfully and I obtained the results. The intercept is statistically significantly different from zero: the predicted weight at a height of zero is 70.92645 kg, with a standard error of 6.99517 kg (so the intercept here is the fitted line’s value at Height = 0, not the average weight).
However, we fail to reject the null hypothesis that there is no linear relationship between people’s weight and height. Based on the results, weight is negatively associated with height: each one-unit increase in height (i.e., 1 cm) is associated with 0.03584 kg lower body weight (what a dataset!). It may be true, though:
But, hey, this relationship is not statistically significant (p = 0.446). So, we are okay.
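A quick numerical check of that conclusion is to look at the slope’s confidence interval, which should comfortably include zero:
coef(fit)    # intercept and slope estimates
confint(fit) # 95% confidence intervals; the Height interval straddles zero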
Now, let’s check whether we can trace the relationship in the following plots: 1. a simple scatterplot of Height against Weight, and 2. the same plot with the fit line obtained from our linear model.
plot(mydata$Height, mydata$Weight)
Yes, exactly: the data points are everywhere. I have a hard time tracing any pattern.
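Before the ggplot2 version below, note that in base graphics the fitted line could be overlaid directly with abline():
# Base-R alternative: redraw the scatterplot and overlay the fitted regression line
plot(mydata$Height, mydata$Weight)
abline(fit, col = "red", lwd = 2)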
Now, let’s include the fit line and see what it has to say.
g = ggplot(mydata, aes(x = Height, y = Weight))
g = g + xlab("Peoples' Height")
g = g + ylab("Corresponding Weight")
g = g + geom_point(size = 7, colour = "green", alpha = 0.5)
g = g + geom_point(size = 5, colour = "blue", alpha = 0.2)
g = g + geom_smooth(method = "lm", colour = "black")
g
The fit line trends slightly downward as we follow it from the y-axis. The shaded region is the confidence band around the fitted line, and the vast majority of data points fall outside it (which is expected: the band is for the fitted mean, not for individual observations). So we see a very slight, and statistically insignificant, negative relationship between people’s weight and height in the simulated data.
I could have made this a more elegant linear model exercise by simulating genuinely related data. However, my intention is to understand the linear model in R, so I do not really care about the relationship between the variables here.
The dataset I am going to use for the rest of this study covers labor market outcomes of a representative sample of women in the US. Its variables are shown in the str() output below.
I am going to read the data using the read.csv command and store it in a data frame named labor_data.
labor_data <- read.csv("nlsw88.csv")
str(labor_data)
'data.frame': 2246 obs. of 4 variables:
$ lwage : num 2.46 1.86 1.61 2.2 2.09 ...
$ yrs_school: int 12 12 12 17 12 12 12 18 14 15 ...
$ ttl_exp : num 10.3 13.6 17.7 13.2 17.8 ...
$ black : int 1 1 1 0 0 0 0 0 0 0 ...
As mentioned above, there are four variables and 2246 observations. The log of wage (lwage) and total experience (ttl_exp) are numeric variables, while years of schooling (yrs_school) and black are stored as integers.
I am now going to fit a linear regression. I want to see whether the log of wage is associated with years of schooling. We assume that people who spend more years in school are more likely to hold higher academic degrees and thus earn higher wages than those with lower academic degrees.
fit1 <- lm(lwage ~ yrs_school, data=labor_data)
summary(fit1)
Call:
lm(formula = lwage ~ yrs_school, data = labor_data)
Residuals:
Min 1Q Median 3Q Max
-2.29340 -0.32611 -0.00807 0.29471 2.20496
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.652578 0.057771 11.30 <2e-16 ***
yrs_school 0.092920 0.004333 21.45 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5236 on 2244 degrees of freedom
Multiple R-squared: 0.1701, Adjusted R-squared: 0.1697
F-statistic: 459.9 on 1 and 2244 DF, p-value: < 2.2e-16
We can see that the model is statistically significant at the 1% level, and years of schooling is a statistically significant predictor of wages. Since the outcome is in logs, the coefficient of about 0.093 means each additional year of schooling is associated with roughly 9.3% higher wages (more precisely, exp(0.0929) − 1 ≈ 9.7%).
The values of beta1 (the slope) and beta0 (the intercept) can also be computed by hand: for simple regression, beta1 = cor(X, Y) × sd(Y)/sd(X) and beta0 = mean(Y) − beta1 × mean(X). Please compare the results below with the beta0 and beta1 values reported above.
#Calculating Beta1 or the Slope for yrs-school
beta1 = cor(labor_data$yrs_school, labor_data$lwage) * sd(labor_data$lwage)/sd(labor_data$yrs_school)
# Calculating Beta0, or the Intercept for the model
beta0 = mean(labor_data$lwage) - beta1 * mean(labor_data$yrs_school)
# Displaying the values together
rbind(c(beta0, beta1),coef(lm(lwage ~ yrs_school, data = labor_data)))
(Intercept) yrs_school
[1,] 0.6525777 0.09291988
[2,] 0.6525777 0.09291988
Bam!! We got the same values from both methods. So it is worth understanding how the parameters are calculated and what they mean.
Measuring the relationship between log wages and the indicator for Black women.
fit2 <- lm(lwage ~ black, data=labor_data)
summary(fit2)
Call:
lm(formula = lwage ~ black, data = labor_data)
Residuals:
Min 1Q Median 3Q Max
-1.90667 -0.40290 -0.03418 0.37105 1.96129
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.91161 0.01398 136.739 < 2e-16 ***
black -0.16554 0.02744 -6.033 1.88e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5701 on 2244 degrees of freedom
Multiple R-squared: 0.01596, Adjusted R-squared: 0.01552
F-statistic: 36.39 on 1 and 2244 DF, p-value: 1.88e-09
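Because the only regressor here is a 0/1 dummy, this regression simply compares group means: the intercept (1.912) is the mean log wage for women with black = 0, and the black coefficient (−0.166) is the average gap for Black women. As a quick sanity check of my own, the raw group means should reproduce that gap:
# Mean log wage by the black dummy; their difference should equal the 'black' coefficient
group_means <- tapply(labor_data$lwage, labor_data$black, mean)
group_means
diff(group_means) # roughly -0.166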
A line requires two parameters to be specified and plotted in a two-dimensional plane. Regression analysis strives to give us the best-fitting line based on the estimated parameters. What happens if we force the line to pass through the origin (which means the y-intercept is 0)? One way to see this is to center both variables at their means, since the least-squares line always passes through the point of means.
As in the linear model above, whether a woman is Black goes on the x-axis. We subtract the mean of X from each individual X, and likewise create a new variable ‘wage’ by subtracting the mean of Y from each individual Y.
# Creating New Variables
color <- labor_data$black - mean(labor_data$black)
wage <- labor_data$lwage - mean(labor_data$lwage)
# Creating a New Table that Records the New Variables
freq_new <- as.data.frame(table(color,wage))
# Assigning Names to the Columns of the New Table
colnames(freq_new) <- c("color", "wage", "freq")
freq_new$color <- as.numeric(as.character(freq_new$color))
freq_new$wage <- as.numeric(as.character(freq_new$wage))
# Plotting the Centered Values, Sizing and Colouring Points by Frequency
g <- ggplot(filter(freq_new, freq > 1), aes(x = color, y = wage))
g <- g + scale_size(range = c(2,10), guide = "none")
g <- g + geom_point(aes(size = freq + 10), colour = "grey50", show.legend = FALSE)
g <- g + geom_point(aes(colour = freq, size = freq))
g <- g + scale_colour_gradient(low="lightblue", high="green")
g
As can be seen, most of the Black women (i.e., the women coded 1 on the black dummy) have a narrower wage range, roughly within −1 and +1 in centered log wages. The other women have a slightly wider range, and a good number of them have wages beyond the +1 mark.
Now, I am going to regress log wage on years of schooling and total experience.
fit3 <- lm(lwage ~ yrs_school+ttl_exp, data=labor_data)
summary(fit3)
Call:
lm(formula = lwage ~ yrs_school + ttl_exp, data = labor_data)
Residuals:
Min 1Q Median 3Q Max
-2.09807 -0.29945 -0.00571 0.25158 2.49949
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.336944 0.057308 5.88 4.73e-09 ***
yrs_school 0.079148 0.004150 19.07 < 2e-16 ***
ttl_exp 0.039559 0.002296 17.23 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4921 on 2243 degrees of freedom
Multiple R-squared: 0.2671, Adjusted R-squared: 0.2664
F-statistic: 408.7 on 2 and 2243 DF, p-value: < 2.2e-16
Next, estimate a restricted model in R that imposes the restriction that the coefficient on ttl_exp equals twice the coefficient on yrs_school. This is done by constructing newvar = yrs_school + 2 × ttl_exp and regressing lwage on it. What value do we obtain for β̂1 in the restricted model?
labor_data$newvar <- labor_data$yrs_school + 2*labor_data$ttl_exp
restricted <- lm(lwage ~ newvar, data = labor_data)
summary(restricted) # show results
Call:
lm(formula = lwage ~ newvar, data = labor_data)
Residuals:
Min 1Q Median 3Q Max
-1.79637 -0.32172 -0.02268 0.27505 2.39896
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.865430 0.042395 20.41 <2e-16 ***
newvar 0.026292 0.001075 24.47 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5106 on 2244 degrees of freedom
Multiple R-squared: 0.2106, Adjusted R-squared: 0.2102
F-statistic: 598.6 on 1 and 2244 DF, p-value: < 2.2e-16
anova_school <- anova(fit1)
anova_unrest <- anova(fit3)
anova_rest <- anova(restricted)
anova_school; anova_rest; anova_unrest
Analysis of Variance Table
Response: lwage
Df Sum Sq Mean Sq F value Pr(>F)
yrs_school 1 126.07 126.066 459.91 < 2.2e-16 ***
Residuals 2244 615.10 0.274
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Analysis of Variance Table
Response: lwage
Df Sum Sq Mean Sq F value Pr(>F)
newvar 1 156.08 156.080 598.62 < 2.2e-16 ***
Residuals 2244 585.09 0.261
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Analysis of Variance Table
Response: lwage
Df Sum Sq Mean Sq F value Pr(>F)
yrs_school 1 126.07 126.066 520.55 < 2.2e-16 ***
ttl_exp 1 71.90 71.901 296.90 < 2.2e-16 ***
Residuals 2243 543.20 0.242
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Test Statistics
# F statistic for the single restriction: ((SSR_restricted - SSR_unrestricted)/q) / (SSR_unrestricted/df_unrestricted), with q = 1
statistic_test = ((anova_rest$`Sum Sq`[2] - anova_unrest$`Sum Sq`[3]) / 1) / (anova_unrest$`Sum Sq`[3] / anova_unrest$Df[3])
statistic_test
P-value
pvalue = pf(statistic_test, 1, anova_unrest$Df[3], lower.tail = FALSE) # upper-tail probability from the F distribution
pvalue
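The same restriction can also be tested in a single call by handing both fitted models to anova(), which computes the identical F statistic from the two residual sums of squares:
# Equivalent test of the restriction via built-in model comparison
anova(restricted, fit3)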
Difference-in-Differences (DiD) is a tool used especially by empirical economists. The example here replicates results from David Card and Alan Krueger’s work, “Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania.” For this I need the data they used, fastfood.csv. The abstract of the original study reads:
On April 1, 1992, New Jersey’s minimum wage rose from $4.25 to $5.05 per hour. To evaluate the impact of the law we surveyed 410 fast-food restaurants in New Jersey and eastern Pennsylvania before and after the rise. Comparisons of employment growth at stores in New Jersey and Pennsylvania (where the minimum wage was constant) provide simple estimates of the effect of the higher minimum wage. We also compare employment changes at stores in New Jersey that were initially paying high wages (above $5) to the changes at lower-wage stores. We find no indication that the rise in the minimum wage reduced employment.
Variables:
A. Loading the data and running a linear model to compare whether fast-food restaurants located in New Jersey (NJ) and Pennsylvania (PA) differed, prior to the change in the minimum wage, in the number of full-time employees and the starting wage:
fastfood <- read.csv("fastfood.csv")
summary(fastfood)# Checking if the data is loaded
chain state southj centralj
Min. :1.000 Min. :0.0000 Min. :0.0000 Min. :0.0000
1st Qu.:1.000 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.:0.0000
Median :2.000 Median :1.0000 Median :0.0000 Median :0.0000
Mean :2.099 Mean :0.8061 Mean :0.2321 Mean :0.1505
3rd Qu.:3.000 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:0.0000
Max. :4.000 Max. :1.0000 Max. :1.0000 Max. :1.0000
northj shore pa1 pa2
Min. :0.0000 Min. :0.00000 Min. :0.00000 Min. :0.0000
1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.0000
Median :0.0000 Median :0.00000 Median :0.00000 Median :0.0000
Mean :0.4235 Mean :0.08673 Mean :0.08929 Mean :0.1046
3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.0000
Max. :1.0000 Max. :1.00000 Max. :1.00000 Max. :1.0000
empft emppt wage_st empft2
Min. : 0.000 Min. : 0.00 Min. :4.250 Min. : 0.000
1st Qu.: 2.000 1st Qu.:11.00 1st Qu.:4.250 1st Qu.: 2.000
Median : 6.000 Median :17.00 Median :4.500 Median : 6.000
Mean : 8.219 Mean :18.86 Mean :4.612 Mean : 8.251
3rd Qu.:11.625 3rd Qu.:25.00 3rd Qu.:4.950 3rd Qu.:12.000
Max. :60.000 Max. :60.00 Max. :5.750 Max. :40.000
NA's :19
emppt2 wage_st2
Min. : 0.00 Min. :4.250
1st Qu.:11.00 1st Qu.:5.050
Median :17.00 Median :5.050
Mean :18.54 Mean :4.994
3rd Qu.:25.00 3rd Qu.:5.050
Max. :60.00 Max. :6.250
NA's :1 NA's :15
Beautiful!! The data is loaded in R and we can check the structure of the data and the variables.
# Running the Linear Model and Storing the Results in an Object Called empft_state
empft_state <- lm(empft ~ state, data=fastfood)
summary(empft_state)
Call:
lm(formula = empft ~ state, data = fastfood)
Residuals:
Min 1Q Median 3Q Max
-10.316 -5.715 -2.715 3.385 52.285
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.3158 0.9894 10.43 <2e-16 ***
state -2.6006 1.1020 -2.36 0.0188 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 8.626 on 390 degrees of freedom
Multiple R-squared: 0.01408, Adjusted R-squared: 0.01155
F-statistic: 5.569 on 1 and 390 DF, p-value: 0.01877
Based on this model, NJ restaurants (state = 1) had on average about 2.6 fewer full-time employees than PA restaurants before the change, a difference that is significant at the 5% level (p = 0.019).
B. What is the average wage in Pennsylvania prior to the change?
prioravg_wage <- lm(wage_st ~ state, data=fastfood)
summary(prioravg_wage)
Call:
lm(formula = wage_st ~ state, data = fastfood)
Residuals:
Min 1Q Median 3Q Max
-0.3786 -0.3574 -0.1074 0.3426 1.1426
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.62863 0.04044 114.460 <2e-16 ***
state -0.02123 0.04509 -0.471 0.638
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3455 on 371 degrees of freedom
(19 observations deleted due to missingness)
Multiple R-squared: 0.0005972, Adjusted R-squared: -0.002097
F-statistic: 0.2217 on 1 and 371 DF, p-value: 0.638
The average wage in Pennsylvania prior to the change was $4.629.
The average wage in New Jersey prior to the change was $4.62863 - $0.02123 = $4.607.
We see that Pennsylvania had a slightly higher average starting wage than New Jersey prior to the change; however, the difference is not statistically significant (p = 0.638). Thus, we cannot reject the null hypothesis that the average starting wage was the same in NJ and PA prior to the change in the minimum wage.
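Both averages can also be recovered directly from the fitted coefficients instead of by hand:
# Average starting wage implied by the model
coef(prioravg_wage)[1]   # Pennsylvania (state = 0): the intercept
sum(coef(prioravg_wage)) # New Jersey (state = 1): intercept + state coefficient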
C. Now let’s assume that someone pointed out the northeast suburbs of Philadelphia are very different from the rest of Pennsylvania. This person claims that the model that should be used to estimate the differences between fast food restaurants located in NJ and PA prior to the change is as follows:
full time employment_i = β0 + β1 × state_i + β2 × pa1_i + β3 × pa2_i + ε_i
Now let’s run the new model based on the above information:
PA1_PA <- lm(empft ~ state + pa1, data=fastfood)
summary(PA1_PA)
Call:
lm(formula = empft ~ state + pa1, data = fastfood)
Residuals:
Min 1Q Median 3Q Max
-10.943 -5.715 -2.715 3.410 52.285
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.780 1.348 7.254 2.2e-12 ***
state -2.065 1.433 -1.441 0.150
pa1 1.162 1.987 0.585 0.559
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 8.633 on 389 degrees of freedom
Multiple R-squared: 0.01495, Adjusted R-squared: 0.009881
F-statistic: 2.951 on 2 and 389 DF, p-value: 0.05346
Here we go. The model (with the pa1 dummy only) ran successfully.
According to this model above, what would be the average difference in full time employment between the restaurants located in the northeast suburbs of Philadelphia and the rest of Pennsylvania?
The model as written above can’t be estimated, because its variables are perfectly collinear: every observation with the state dummy equal to 0 has either pa1 or pa2 equal to one, so state = 1 − pa1 − pa2. Therefore, we can’t identify this difference from that model.
D. Now, let’s run the difference in differences model to see whether the change created a difference in the employment. According to what we saw in the lecture, the model to estimate should be the following:
empft_it = β0 + β1 × state_i + β2 × post_t + β3 × state_i × post_t + ε_it (equation 1)
Where post_t is a dummy variable that takes the value of 1 after the change takes place.
E. Even though we aim to estimate equation (1), the data we have is “wide,” meaning that we have one observation per fast food restaurant, and the different yearly observations are in different variables. Model 1 would require the data to be in the “long” format, with the different years stacked. We could change the structure of the data to solve this problem.
Alternatively, someone has suggested that instead we could run the following model:
empft_i2 − empft_i1 = α0 + α1 × state_i + ν_i (equation 2)
And that our estimate for α1 in equation (2) would be equivalent to β3 in equation (1).
DiD <- lm((empft2 - empft) ~ state, data=fastfood)
summary(DiD)
Call:
lm(formula = (empft2 - empft) ~ state, data = fastfood)
Residuals:
Min 1Q Median 3Q Max
-32.257 -3.699 0.301 4.301 34.301
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.743 1.181 -2.323 0.02071 *
state 3.443 1.316 2.617 0.00921 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 10.3 on 390 degrees of freedom
Multiple R-squared: 0.01726, Adjusted R-squared: 0.01474
F-statistic: 6.849 on 1 and 390 DF, p-value: 0.009213
What value do you obtain for the DiD estimate? It would be 3.443.
Assuming that we can interpret the estimate for α1 as causal and that the minimum wage for these fast-food restaurants is binding, can you conclude that the NJ increase in the minimum wage had a negative effect on full-time employment?
No. Even though we can reject the null hypothesis that α1=0 at the 5% level, the estimate for this parameter is positive. Thus, we are finding that the increase in the minimum wage led to an increase in full-time employment.
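As a sanity check of my own on the claim in part E that α1 from equation (2) matches β3 from equation (1), here is a rough sketch that stacks the two periods into long format and estimates the interacted model directly. Assuming empft is the pre-period count and empft2 the post-period count for each store, and a balanced panel, the state × post interaction should reproduce the 3.44 estimate:
# Sketch: reshape to long format and estimate equation (1) with an interaction
fastfood_long <- data.frame(
  empft = c(fastfood$empft, fastfood$empft2),
  state = rep(fastfood$state, 2),
  post  = rep(c(0, 1), each = nrow(fastfood))
)
did_long <- lm(empft ~ state * post, data = fastfood_long)
summary(did_long) # the state:post coefficient is the DiD estimate (beta3)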
For this demonstration, I am going to replicate results from David S. Lee’s work. The data come from the “Mostly Harmless Econometrics Data Archive”.
Lee (2008) studied the effect of party incumbency on reelection probabilities. In general, Lee was interested in whether a Democratic candidate for a seat in the U.S. House of Representatives has an advantage if his party won the seat last time. Below is the abstract of the paper “The Electoral Advantage to Incumbency and Voters’ Valuation of Politicians’ Experience: A Regression Discontinuity Analysis of Elections to the U.S. House”:
Using data on elections to the United States House of Representatives (1946-1998), this paper exploits a quasi-experiment generated by the electoral system in order to determine if political incumbency provides an electoral advantage - an implicit first-order prediction of principal-agent theories of politicians and voter behavior. Candidates who just barely won an election (barely became the incumbent) are likely to be ex ante comparable in all other ways to candidates who barely lost, and so their differential electoral outcomes in the next election should represent a true incumbency advantage. The regression discontinuity analysis provides striking evidence that incumbency has a significant causal effect of raising the probability of subsequent electoral success - by about 0.4 to 0.45. Simulations - using estimates from a structural model of individual voting behavior - imply that about two-thirds of the apparent electoral success of incumbents can be attributed to voters’ valuation of politicians’ experience. The quasi-experimental analysis also suggest that heuristic “fixed effects” and “instrumental variable” modeling approaches would have led to misleading inferences in this context.
Variables:
We need package rdd to conduct Regression Discontinuity in R.
library(rdd)
#Loading the dataset
Lee <- read.csv("indiv_final.csv")
summary(Lee)
yearel myoutcomenext difshare
Min. :1946 Min. :0.0000 Min. :-1.0000
1st Qu.:1960 1st Qu.:0.0000 1st Qu.:-0.4845
Median :1974 Median :0.0000 Median :-0.1291
Mean :1972 Mean :0.3202 Mean :-0.1252
3rd Qu.:1986 3rd Qu.:1.0000 3rd Qu.: 0.1918
Max. :1996 Max. :1.0000 Max. : 1.0000
Based on the information provided, create a variable for whether the party of the candidate is the same as the incumbent party. What is the proportion of such cases in the data set?
# If difshare > 0, the candidate's party won the previous election, so it is the incumbent party
Lee$incumb_candidate <- as.numeric(Lee$difshare > 0)
# Measuring the proportion of the incumbent candidates
proportion_incumb <- sum(Lee$incumb_candidate)/nrow(Lee)
# Checking the Statistics
proportion_incumb
[1] 0.3990055
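Since incumb_candidate is coded 0/1, the same proportion is simply its mean:
mean(Lee$incumb_candidate) # identical to the sum/nrow calculation above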
One of the main assumptions in RD designs is that there are no jumps in the density of the running variable around the cutoff. The rdd package in R provides the DCdensity command for the McCrary sorting test. Let’s run the command using difshare as the running variable.
What is the estimated log difference in heights at the cutpoint?
DCdensity(Lee$difshare, verbose = TRUE, ext.out = FALSE, htest = TRUE)
Assuming cutpoint of zero.
Using calculated bin size: 0.005
Using calculated bandwidth: 0.183
Log difference in heights is -0.002 with SE 0.052
this gives a z-stat of -0.048
and a p value of 0.962
McCrary (2008) sorting test
data:
z = -0.047558, binwidth = 0.0052909, bandwidth = 0.1833466, cutpoint =
0.0000000, p-value = 0.9621
alternative hypothesis: no apparent sorting
Now, keep only the observations within 50 percentage points of the cutoff (the absolute value of ‘difshare’ is less than or equal to 0.5). Also, create in R the required variables to run the following models:
1. y_i = β0 + β1 × 1[difshare_i ≥ 0] + ε_i (model 1)
2. y_i = β0 + β1 × 1[difshare_i ≥ 0] + γ1 × difshare_i + ε_i (model 2)
3. y_i = β0 + β1 × 1[difshare_i ≥ 0] + γ1 × difshare_i + δ1 × difshare_i × 1[difshare_i ≥ 0] + ε_i (model 3)
4. y_i = β0 + β1 × 1[difshare_i ≥ 0] + γ1 × difshare_i + γ2 × difshare_i^2 + ε_i (model 4)
5. y_i = β0 + β1 × 1[difshare_i ≥ 0] + γ1 × difshare_i + γ2 × difshare_i^2 + δ1 × difshare_i × 1[difshare_i ≥ 0] + δ2 × difshare_i^2 × 1[difshare_i ≥ 0] + ε_i (model 5)
6. y_i = β0 + β1 × 1[difshare_i ≥ 0] + γ1 × difshare_i + γ2 × difshare_i^2 + γ3 × difshare_i^3 + ε_i (model 6)
7. y_i = β0 + β1 × 1[difshare_i ≥ 0] + γ1 × difshare_i + γ2 × difshare_i^2 + γ3 × difshare_i^3 + δ1 × difshare_i × 1[difshare_i ≥ 0] + δ2 × difshare_i^2 × 1[difshare_i ≥ 0] + δ3 × difshare_i^3 × 1[difshare_i ≥ 0] + ε_i (model 7)
Where y_i corresponds to the myoutcomenext variable in the data set, and 1[difshare_i ≥ 0] is a dummy variable that indicates whether the party of the candidate won the previous election.
To achieve this goal, I am going to create an empty matrix where I can dump the desired components of the results from the models above.
combined_matrix <- matrix(NA, nrow=2, ncol=7)
combined_matrix
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] NA NA NA NA NA NA NA
[2,] NA NA NA NA NA NA NA
Now, I am going to create the squared and cubic value of the difshare variable:
Lee$difshare_square <- (Lee$difshare)^2
Lee$difshare_cube <- (Lee$difshare)^3
head(Lee$difshare_square)
[1] 0.003780799 0.003780799 0.010997612 0.010997612 0.286996990 0.027048440
head(Lee$difshare_cube)
[1] 0.0002324745 -0.0002324745 0.0011533141 -0.0011533141 -0.1537503144
[6] 0.0044484974
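As an aside, these polynomial terms could also be written inline in the model formulas with I(), avoiding the extra columns; I stick with the explicit difshare_square and difshare_cube variables below, but the inline version of, say, model 4 would be:
# Inline quadratic term (sketch only; not used in what follows)
lm(myoutcomenext ~ incumb_candidate + difshare + I(difshare^2), data = Lee, subset = abs(difshare) <= 0.5)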
I will now fit the models listed above and put the desired components of their output into the ‘combined_matrix’ I created above. I do not need the full output of every model, because that would be overwhelming; the matrix stores only the required information: seven models, so seven columns, with the incumbency estimate in the first row and its p-value in the second.
For this analysis, I need only the observations within 50 percentage points of the cutoff, which I pass through the subset argument:
model1 <- lm(myoutcomenext ~ incumb_candidate, data = Lee, subset = abs(difshare) <= 0.5) #abs returns the absolute value, the positive value
combined_matrix[1,1] <- model1$coefficients[2] #the β-Estimate[2] in the coefficient table will fill the [1 X 1] of the combined_matrix
fit_model1 <- summary(model1)
combined_matrix[2,1] <- fit_model1$coefficients[2,4] #p-value in the position of [2 X 4] of the Coefficients table will fill the [2 X 1] position in my combined_matrix.
combined_matrix #Checking if everything came together.
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 0.753124 NA NA NA NA NA NA
[2,] 0.000000 NA NA NA NA NA NA
That did work. We have now the first column filled. Let’s run all the remaining models and fill the combined_matrix.
# Model 2
model2 <- lm(myoutcomenext ~ incumb_candidate + difshare, data = Lee, subset = abs(difshare) <= 0.5)
combined_matrix[1,2] <- model2$coefficients[2]
fit_model2 <- summary(model2)
combined_matrix[2,2] <- fit_model2$coefficients[2,4]
combined_matrix
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 0.753124 0.6263884 NA NA NA NA NA
[2,] 0.000000 0.0000000 NA NA NA NA NA
# Model 3
model3 <- lm(myoutcomenext ~ incumb_candidate + difshare + incumb_candidate * difshare, data = Lee, subset = abs(difshare) <= 0.5)
combined_matrix[1,3] <- model3$coefficients[2]
fit_model3 <- summary(model3)
combined_matrix[2,3] <- fit_model3$coefficients[2,4]
combined_matrix
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 0.753124 0.6263884 0.6231998 NA NA NA NA
[2,] 0.000000 0.0000000 0.0000000 NA NA NA NA
# Model 4
model4 <- lm(myoutcomenext ~ incumb_candidate + difshare + difshare_square, data = Lee, subset = abs(difshare) <= 0.5)
combined_matrix[1,4] <- model4$coefficients[2]
fit_model4 <- summary(model4)
combined_matrix[2,4] <- fit_model4$coefficients[2,4]
combined_matrix
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 0.753124 0.6263884 0.6231998 0.6227309 NA NA NA
[2,] 0.000000 0.0000000 0.0000000 0.0000000 NA NA NA
# Model 5
model5 <- lm(myoutcomenext ~ incumb_candidate + difshare + difshare_square + incumb_candidate * difshare + incumb_candidate * difshare_square, data = Lee, subset = abs(difshare) <= 0.5)
combined_matrix[1,5] <- model5$coefficients[2]
fit_model5 <- summary(model5)
combined_matrix[2,5] <- fit_model5$coefficients[2,4]
combined_matrix
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 0.753124 0.6263884 0.6231998 0.6227309 0.5301643 NA NA
[2,] 0.000000 0.0000000 0.0000000 0.0000000 0.0000000 NA NA
# Model 6
model6 <- lm(myoutcomenext ~ incumb_candidate + difshare + difshare_square + difshare_cube, data = Lee, subset = abs(difshare) <= 0.5)
combined_matrix[1,6] <- model6$coefficients[2]
fit_model6 <- summary(model6)
combined_matrix[2,6] <- fit_model6$coefficients[2,4]
combined_matrix
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 0.753124 0.6263884 0.6231998 0.6227309 0.5301643 0.5301643 NA
[2,] 0.000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 NA
# Model 7
model7 <- lm(myoutcomenext ~ incumb_candidate + difshare + difshare_square + difshare_cube + incumb_candidate * difshare + incumb_candidate * difshare_square + incumb_candidate * difshare_cube, data = Lee, subset = abs(difshare) <= 0.5)
combined_matrix[1,7] <- model7$coefficients[2]
fit_model7 <- summary(model7)
combined_matrix[2,7] <- fit_model7$coefficients[2,4]
combined_matrix
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 0.753124 0.6263884 0.6231998 0.6227309 0.5301643 0.5301643 4.764126e-01
[2,] 0.000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 8.514610e-155
A. For which of the models do you find that the effects of party incumbency over re-election is greater than 0.6?
The results show that four of the models (models 1, 2, 3, and 4) have an estimated effect of party incumbency on re-election greater than 0.6.
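The same answer can be read straight off the first row of combined_matrix:
which(combined_matrix[1, ] > 0.6) # columns (models) whose estimated effect exceeds 0.6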
B. Using the RDestimate command to estimate the effect non-parametrically: what is the point estimate that we obtain from this method?
NP_model <- RDestimate(myoutcomenext ~ difshare, data = Lee, subset = abs(Lee$difshare) <= 0.5)
summary(NP_model)
Call:
RDestimate(formula = myoutcomenext ~ difshare, data = Lee, subset = abs(Lee$difshare) <=
0.5)
Type:
sharp
Estimates:
Bandwidth Observations Estimate Std. Error z value Pr(>|z|)
LATE 0.11982 4695 0.4707 0.02695 17.47 2.578e-68
Half-BW 0.05991 2363 0.4511 0.03934 11.47 1.974e-30
Double-BW 0.23965 9182 0.5119 0.01818 28.15 2.082e-174
LATE ***
Half-BW ***
Double-BW ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
F-statistics:
F Num. DoF Denom. DoF p
LATE 878.0 3 4691 0
Half-BW 334.1 3 2359 0
Double-BW 2493.8 3 9178 0
The results show that the obtained estimate is 0.4707.
C. Plotting the model with the default (optimal) bandwidth, then with three times the optimal bandwidth, and finally with one-third of the optimal bandwidth.
To make our task easy, I create a variable named Bandwidth and assign it the LATE bandwidth reported in the Estimates table above, 0.11982. (If I read the rdd package’s return object correctly, this could also be pulled programmatically, e.g. from NP_model$bw, but hard-coding the printed value is fine here.)
Bandwidth <- 0.11982
# A. Optimal (Default) Bandwidth
Optimal_bandwidth <- RDestimate(myoutcomenext ~ difshare, data = Lee, subset = abs(Lee$difshare) <= 0.5)
plot(Optimal_bandwidth)
# B. Three-times the Optimal Bandwidth
Threetimes_bandwidth <- RDestimate(myoutcomenext ~ difshare, data = Lee, subset = abs(Lee$difshare) <= 0.5, kernel = "rectangular", bw = 3*Bandwidth)
plot(Threetimes_bandwidth)
# C. One-third of the Optimal Bandwidth
OneThird_bandwidth <- RDestimate(myoutcomenext ~ difshare, data = Lee, subset = abs(Lee$difshare) <= 0.5, kernel = "rectangular", bw = Bandwidth/3)
plot(OneThird_bandwidth)
Finally, these standard lm diagnostic plots are not really useful for regression discontinuity models, but just to see what they look like, I am plotting model 7, the most complex model.
plot(model7)
——————Thanks————————————–