rm(list = ls())
This tutorial is for my own understanding of the linear model function in R. The file works through fitting a linear model on simulated data, a real labor-market dataset, a difference-in-differences example, and a regression discontinuity example.
Estimation of unknown parameters is at the crux of social science and educational research. When we analyze the conditional distribution of an outcome (dependent) variable given one or more predictors, which may be continuous random variables, we use a LINEAR MODEL, and the way we estimate its parameters is called LINEAR REGRESSION (Duflo, 2021). “Linear models are the cornerstone of statistical methodology” (Caffo, 2017, p. 1). We often use linear models to estimate parameters of the joint distribution because most models include more than one predictor variable. Linear models can be used for:
1. Prediction: develop a model that can infer a single aspect of the data (the predicted variable) from some combination of other aspects of the data (the predictor variables).
2. Causality: what happens to something when we apply some sort of treatment? For example, does the use of iPads in first-grade classrooms increase students’ learning gains? &
3. Discovery with models, or distillation of data for human judgment: sometimes we use a regression model to understand our data better, so that we understand the phenomenon in question better; we may then take appropriate action, such as refining the existing model.
In general, linear regression attempts to model the relationship between a dependent variable and one or more independent variables by fitting a linear equation to the observed data. The simplest version of the equation looks like this:
Y = mX + c
Where Y is the dependent variable, X is the predictor/independent variable, ‘c’ is the intercept (a constant), and ‘m’ is the slope. The equation grows as we add more independent variables.
Now, I am moving towards creating some fake data and fitting a regression model.
In R, we can fit a regression model using the following syntax:
lm(formula, data, subset, weights, na.action, method = "qr", model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, contrasts = NULL, offset, ...)
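In practice only the first two arguments are needed most of the time. As a minimal, purely hypothetical illustration (df, y, and x are placeholders, not part of this tutorial's data):
# Hypothetical illustration of the basic lm() pattern
fit <- lm(y ~ x, data = df)  # fit y on x from a data frame named df
summary(fit)                 # coefficient table, R-squared, F-statistic
coef(fit)                    # just the estimated intercept and slope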
With this knowledge in my pocket, I am now ready to test this model.
# Loading Required Library
library(dplyr)
library(ggplot2)
library(UsingR)
set.seed(0417)
b <- c(120:180) # creates vector b: 61 values from 120 to 180
a <- c(50:80) + (b/100) # creates vector a; 50:80 (length 31) is recycled to match b (length 61, with a warning), giving a loose relationship between the variables
## Now creating an empty table with 100 rows and two columns
mydata_m <- matrix(ncol=2, nrow=100, data=NA)
colnames(mydata_m) <- c("Height", "Weight")
head(mydata_m)
Height Weight
[1,] NA NA
[2,] NA NA
[3,] NA NA
[4,] NA NA
[5,] NA NA
[6,] NA NA
The empty matrix has been created. It has two columns, “Height” and “Weight”, and 100 rows. I am now going to simulate the variables and fill the matrix with the fake data:
set.seed(0417)
library(data.table)
for(i in 1:100){
  mydata_m[i, 1] <- sample(b, size = 1) # draw one height for row i
  mydata_m[i, 2] <- sample(a, size = 1) # draw one weight for row i
}
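For reference, the loop above could be replaced with one vectorized sample() call per column. This is just a sketch of my own (stored in a separate, hypothetical matrix mydata_v so the loop's data above is untouched):
# Vectorized alternative to the loop above (kept in a separate matrix)
mydata_v <- matrix(ncol = 2, nrow = 100, dimnames = list(NULL, c("Height", "Weight")))
mydata_v[, 1] <- sample(b, size = 100, replace = TRUE)
mydata_v[, 2] <- sample(a, size = 100, replace = TRUE)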
# Now, I want to change my matrix to a data table
mydata <- as.data.table(mydata_m)
summary(mydata)
Height Weight
Min. :120.0 Min. :51.51
1st Qu.:133.8 1st Qu.:58.50
Median :149.5 Median :65.34
Mean :148.4 Mean :65.61
3rd Qu.:161.0 3rd Qu.:71.48
Max. :180.0 Max. :80.80
Here we go. I created a matrix with two variables and then converted it to a data table so that I can manipulate the data as I want.
Based on the output, the minimum height is 120 cm and the maximum is 180 cm, with a mean of 148.4 cm among the 100 subjects. The minimum weight is 51.51 kg and the maximum is 80.80 kg, with a mean of 65.61 kg and a median of 65.34 kg.
I am now ready to run a linear model.
I am going to fit a linear model to test whether there is any relationship between people’s height and weight. It is obvious that in real data there is a strong relationship, but for the sake of this practice I am pretending I am not sure about it in this particular sample.
My outcome (dependent) variable will be the person’s weight, and the predictor/regressor/explanatory/independent variable will be the person’s height, matching the formula Weight ~ Height below. I am going to save the fitted model in an object called ‘fit’.
fit <- lm(Weight~Height, data=mydata)
summary(fit)
Call:
lm(formula = Weight ~ Height, data = mydata)
Residuals:
Min 1Q Median 3Q Max
-13.6460 -6.9360 -0.4791 6.8876 15.1548
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 70.92645 6.99517 10.139 <2e-16 ***
Height -0.03584 0.04680 -0.766 0.446
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 8.306 on 98 degrees of freedom
Multiple R-squared: 0.00595, Adjusted R-squared: -0.004194
F-statistic: 0.5866 on 1 and 98 DF, p-value: 0.4456
The program ran successfully and I obtained the results. The intercept is statistically significantly different from zero: the predicted weight at a height of zero is 70.92645 kg, with a standard error of 6.99517 kg (so the intercept here is the fitted line’s value at Height = 0, not the average weight).
However, we fail to reject the null hypothesis that there is no linear relationship between people’s weight and height. Based on the results, weight is negatively associated with height: each one-unit increase in height (i.e., 1 cm) is associated with 0.03584 kg lower body weight (what a dataset!). It may be true, though:
But, hey, this relationship is not statistically significant (p = 0.446). So, we are okay.
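A quick numerical check of that conclusion is to look at the slope’s confidence interval, which should comfortably include zero:
coef(fit)    # intercept and slope estimates
confint(fit) # 95% confidence intervals; the Height interval straddles zero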
Now, let’s check whether we can trace the relationship in the following plots: 1. a simple scatterplot of Height against Weight, and 2. the same plot with the fit line obtained from our linear model.
plot(mydata$Height, mydata$Weight)
Yes, exactly: the data points are everywhere. I have a hard time tracing any pattern.
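Before the ggplot2 version below, note that in base graphics the fitted line could be overlaid directly with abline():
# Base-R alternative: redraw the scatterplot and overlay the fitted regression line
plot(mydata$Height, mydata$Weight)
abline(fit, col = "red", lwd = 2)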
Now, let’s include the fit line and see what it has to say.
g = ggplot(mydata, aes(x = Height, y = Weight))
g = g + xlab("Peoples' Height")
g = g + ylab("Corresponding Weight")
g = g + geom_point(size = 7, colour = "green", alpha = 0.5)
g = g + geom_point(size = 5, colour = "blue", alpha = 0.2)
g = g + geom_smooth(method = "lm", colour = "black")
g
The fit line trends slightly downward as we follow it from the y-axis. The shaded region is the confidence band around the fitted line, and the vast majority of data points fall outside it (which is expected: the band is for the fitted mean, not for individual observations). So we see a very slight, and statistically insignificant, negative relationship between people’s weight and height in the simulated data.
I could have made this a more elegant linear model exercise by simulating genuinely related data. However, my intention is to understand the linear model in R, so I do not really care about the relationship between the variables here.
The dataset I am going to use for the rest of this study covers labor market outcomes of a representative sample of women in the US. Its variables are shown in the str() output below.
I am going to read the data using the read.csv command and store it in a data frame named labor_data.
labor_data <- read.csv("nlsw88.csv")
str(labor_data)
'data.frame': 2246 obs. of 4 variables:
$ lwage : num 2.46 1.86 1.61 2.2 2.09 ...
$ yrs_school: int 12 12 12 17 12 12 12 18 14 15 ...
$ ttl_exp : num 10.3 13.6 17.7 13.2 17.8 ...
$ black : int 1 1 1 0 0 0 0 0 0 0 ...
As mentioned above, there are four variables and 2246 observations. The log of wage (lwage) and total experience (ttl_exp) are numeric variables, while years of schooling (yrs_school) and black are stored as integers.
I am now going to fit a linear regression. I want to see whether the log of wage is associated with years of schooling. We assume that people who spend more years in school are more likely to hold higher academic degrees and thus earn higher wages than those with lower academic degrees.
fit1 <- lm(lwage ~ yrs_school, data=labor_data)
summary(fit1)
Call:
lm(formula = lwage ~ yrs_school, data = labor_data)
Residuals:
Min 1Q Median 3Q Max
-2.29340 -0.32611 -0.00807 0.29471 2.20496
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.652578 0.057771 11.30 <2e-16 ***
yrs_school 0.092920 0.004333 21.45 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5236 on 2244 degrees of freedom
Multiple R-squared: 0.1701, Adjusted R-squared: 0.1697
F-statistic: 459.9 on 1 and 2244 DF, p-value: < 2.2e-16
We can see that the model is statistically significant at the 1% level, and years of schooling is a statistically significant predictor of wages. Since the outcome is in logs, the coefficient of about 0.093 means each additional year of schooling is associated with roughly 9.3% higher wages (more precisely, exp(0.0929) − 1 ≈ 9.7%).
The values of beta1 (the slope) and beta0 (the intercept) can also be computed by hand: for simple regression, beta1 = cor(X, Y) × sd(Y)/sd(X) and beta0 = mean(Y) − beta1 × mean(X). Please compare the results below with the beta0 and beta1 values reported above.
#Calculating Beta1 or the Slope for yrs-school
beta1 = cor(labor_data$yrs_school, labor_data$lwage) * sd(labor_data$lwage)/sd(labor_data$yrs_school)
# Calculating Beta0, or the Intercept for the model
beta0 = mean(labor_data$lwage) - beta1 * mean(labor_data$yrs_school)
# Displaying the values together
rbind(c(beta0, beta1),coef(lm(lwage ~ yrs_school, data = labor_data)))
(Intercept) yrs_school
[1,] 0.6525777 0.09291988
[2,] 0.6525777 0.09291988
Bam!! We got the same values from both methods. So it is worth understanding how the parameters are calculated and what they mean.
Measuring the relationship between log wages and the indicator for Black women.
fit2 <- lm(lwage ~ black, data=labor_data)
summary(fit2)
Call:
lm(formula = lwage ~ black, data = labor_data)
Residuals:
Min 1Q Median 3Q Max
-1.90667 -0.40290 -0.03418 0.37105 1.96129
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.91161 0.01398 136.739 < 2e-16 ***
black -0.16554 0.02744 -6.033 1.88e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5701 on 2244 degrees of freedom
Multiple R-squared: 0.01596, Adjusted R-squared: 0.01552
F-statistic: 36.39 on 1 and 2244 DF, p-value: 1.88e-09
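Because the only regressor here is a 0/1 dummy, this regression simply compares group means: the intercept (1.912) is the mean log wage for women with black = 0, and the black coefficient (−0.166) is the average gap for Black women. As a quick sanity check of my own, the raw group means should reproduce that gap:
# Mean log wage by the black dummy; their difference should equal the 'black' coefficient
group_means <- tapply(labor_data$lwage, labor_data$black, mean)
group_means
diff(group_means) # roughly -0.166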
A line requires two parameters to be specified and plotted in a two-dimensional plane. Regression analysis strives to give us the best-fitting line based on the estimated parameters. What happens if we force the line to pass through the origin (which means the y-intercept is 0)? One way to see this is to center both variables at their means, since the least-squares line always passes through the point of means.
As in the linear model above, whether a woman is Black goes on the x-axis. We subtract the mean of X from each individual X, and likewise create a new variable ‘wage’ by subtracting the mean of Y from each individual Y.
# Creating New Variables
color <- labor_data$black - mean(labor_data$black)
wage <- labor_data$lwage - mean(labor_data$lwage)
# Creating a New Table that Records the New Variables
freq_new <- as.data.frame(table(color,wage))
# Assigning Names to the Columns of the New Table
colnames(freq_new) <- c("color", "wage", "freq")
freq_new$color <- as.numeric(as.character(freq_new$color))
freq_new$wage <- as.numeric(as.character(freq_new$wage))
# Plotting the Centered Values, Sizing and Colouring Points by Frequency
g <- ggplot(filter(freq_new, freq > 1), aes(x = color, y = wage))
g <- g + scale_size(range = c(2,10), guide = "none")
g <- g + geom_point(aes(size = freq + 10), colour = "grey50", show.legend = FALSE)
g <- g + geom_point(aes(colour = freq, size = freq))
g <- g + scale_colour_gradient(low="lightblue", high="green")
g
As can be seen, most of the Black women (i.e., the women coded 1 on the black dummy) have a narrower wage range, roughly within −1 and +1 in centered log wages. The other women have a slightly wider range, and a good number of them have wages beyond the +1 mark.
Now, I am going to regress log wage on years of schooling and total experience.
fit3 <- lm(lwage ~ yrs_school+ttl_exp, data=labor_data)
summary(fit3)
Call:
lm(formula = lwage ~ yrs_school + ttl_exp, data = labor_data)
Residuals:
Min 1Q Median 3Q Max
-2.09807 -0.29945 -0.00571 0.25158 2.49949
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.336944 0.057308 5.88 4.73e-09 ***
yrs_school 0.079148 0.004150 19.07 < 2e-16 ***
ttl_exp 0.039559 0.002296 17.23 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4921 on 2243 degrees of freedom
Multiple R-squared: 0.2671, Adjusted R-squared: 0.2664
F-statistic: 408.7 on 2 and 2243 DF, p-value: < 2.2e-16
Next, estimate a restricted model in R that imposes the restriction that the coefficient on ttl_exp equals twice the coefficient on yrs_school. This is done by constructing newvar = yrs_school + 2 × ttl_exp and regressing lwage on it. What value do we obtain for β̂1 in the restricted model?
labor_data$newvar <- labor_data$yrs_school + 2*labor_data$ttl_exp
restricted <- lm(lwage ~ newvar, data = labor_data)
summary(restricted) # show results
Call:
lm(formula = lwage ~ newvar, data = labor_data)
Residuals:
Min 1Q Median 3Q Max
-1.79637 -0.32172 -0.02268 0.27505 2.39896
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.865430 0.042395 20.41 <2e-16 ***
newvar 0.026292 0.001075 24.47 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5106 on 2244 degrees of freedom
Multiple R-squared: 0.2106, Adjusted R-squared: 0.2102
F-statistic: 598.6 on 1 and 2244 DF, p-value: < 2.2e-16
anova_school <- anova(fit1)
anova_unrest <- anova(fit3)
anova_rest <- anova(restricted)
anova_school; anova_rest; anova_unrest
Analysis of Variance Table
Response: lwage
Df Sum Sq Mean Sq F value Pr(>F)
yrs_school 1 126.07 126.066 459.91 < 2.2e-16 ***
Residuals 2244 615.10 0.274
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Analysis of Variance Table
Response: lwage
Df Sum Sq Mean Sq F value Pr(>F)
newvar 1 156.08 156.080 598.62 < 2.2e-16 ***
Residuals 2244 585.09 0.261
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Analysis of Variance Table
Response: lwage
Df Sum Sq Mean Sq F value Pr(>F)
yrs_school 1 126.07 126.066 520.55 < 2.2e-16 ***
ttl_exp 1 71.90 71.901 296.90 < 2.2e-16 ***
Residuals 2243 543.20 0.242
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Test Statistics
# F statistic for the single restriction: ((SSR_restricted - SSR_unrestricted)/q) / (SSR_unrestricted/df_unrestricted), with q = 1
statistic_test = ((anova_rest$`Sum Sq`[2] - anova_unrest$`Sum Sq`[3]) / 1) / (anova_unrest$`Sum Sq`[3] / anova_unrest$Df[3])
statistic_test
P-value
pvalue = pf(statistic_test, 1, anova_unrest$Df[3], lower.tail = FALSE) # upper-tail probability from the F distribution
pvalue
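The same restriction can also be tested in a single call by handing both fitted models to anova(), which computes the identical F statistic from the two residual sums of squares:
# Equivalent test of the restriction via built-in model comparison
anova(restricted, fit3)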
Difference-in-Differences (DiD) is a tool used especially by empirical economists. The example here replicates results from David Card and Alan Krueger’s work, “Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania.” For this I need the data they used, fastfood.csv. The abstract of the original study reads:
On April 1, 1992, New Jersey’s minimum wage rose from $4.25 to $5.05 per hour. To evaluate the impact of the law we surveyed 410 fast-food restaurants in New Jersey and eastern Pennsylvania before and after the rise. Comparisons of employment growth at stores in New Jersey and Pennsylvania (where the minimum wage was constant) provide simple estimates of the effect of the higher minimum wage. We also compare employment changes at stores in New Jersey that were initially paying high wages (above $5) to the changes at lower-wage stores. We find no indication that the rise in the minimum wage reduced employment.
Variables:
A. Loading the data and running a linear model to compare whether fast-food restaurants located in New Jersey (NJ) and Pennsylvania (PA) differed, prior to the change in the minimum wage, in the number of full-time employees and the starting wage:
fastfood <- read.csv("fastfood.csv")
summary(fastfood)# Checking if the data is loaded
chain state southj centralj
Min. :1.000 Min. :0.0000 Min. :0.0000 Min. :0.0000
1st Qu.:1.000 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.:0.0000
Median :2.000 Median :1.0000 Median :0.0000 Median :0.0000
Mean :2.099 Mean :0.8061 Mean :0.2321 Mean :0.1505
3rd Qu.:3.000 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:0.0000
Max. :4.000 Max. :1.0000 Max. :1.0000 Max. :1.0000
northj shore pa1 pa2
Min. :0.0000 Min. :0.00000 Min. :0.00000 Min. :0.0000
1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.0000
Median :0.0000 Median :0.00000 Median :0.00000 Median :0.0000
Mean :0.4235 Mean :0.08673 Mean :0.08929 Mean :0.1046
3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.0000
Max. :1.0000 Max. :1.00000 Max. :1.00000 Max. :1.0000
empft emppt wage_st empft2
Min. : 0.000 Min. : 0.00 Min. :4.250 Min. : 0.000
1st Qu.: 2.000 1st Qu.:11.00 1st Qu.:4.250 1st Qu.: 2.000
Median : 6.000 Median :17.00 Median :4.500 Median : 6.000
Mean : 8.219 Mean :18.86 Mean :4.612 Mean : 8.251
3rd Qu.:11.625 3rd Qu.:25.00 3rd Qu.:4.950 3rd Qu.:12.000
Max. :60.000 Max. :60.00 Max. :5.750 Max. :40.000
NA's :19
emppt2 wage_st2
Min. : 0.00 Min. :4.250
1st Qu.:11.00 1st Qu.:5.050
Median :17.00 Median :5.050
Mean :18.54 Mean :4.994
3rd Qu.:25.00 3rd Qu.:5.050
Max. :60.00 Max. :6.250
NA's :1 NA's :15
Beautiful!! The data is loaded in R and we can check the structure of the data and the variables.
# Running the Linear Model and Storing the Results in an Object Called empft_state
empft_state <- lm(empft ~ state, data=fastfood)
summary(empft_state)
Call:
lm(formula = empft ~ state, data = fastfood)
Residuals:
Min 1Q Median 3Q Max
-10.316 -5.715 -2.715 3.385 52.285
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.3158 0.9894 10.43 <2e-16 ***
state -2.6006 1.1020 -2.36 0.0188 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 8.626 on 390 degrees of freedom
Multiple R-squared: 0.01408, Adjusted R-squared: 0.01155
F-statistic: 5.569 on 1 and 390 DF, p-value: 0.01877
Based on this model, NJ restaurants (state = 1) had on average about 2.6 fewer full-time employees than PA restaurants before the change, a difference that is significant at the 5% level (p = 0.019).
B. What is the average wage in Pennsylvania prior to the change?
prioravg_wage <- lm(wage_st ~ state, data=fastfood)
summary(prioravg_wage)
Call:
lm(formula = wage_st ~ state, data = fastfood)
Residuals:
Min 1Q Median 3Q Max
-0.3786 -0.3574 -0.1074 0.3426 1.1426
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.62863 0.04044 114.460 <2e-16 ***
state -0.02123 0.04509 -0.471 0.638
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3455 on 371 degrees of freedom
(19 observations deleted due to missingness)
Multiple R-squared: 0.0005972, Adjusted R-squared: -0.002097
F-statistic: 0.2217 on 1 and 371 DF, p-value: 0.638
The average wage in Pennsylvania prior to the change was $4.629.
The average wage in New Jersey prior to the change was $4.62863 - $0.02123 = $4.607.
We see that Pennsylvania had a slightly higher average starting wage than New Jersey prior to the change; however, the difference is not statistically significant (p = 0.638). Thus, we cannot reject the null hypothesis that the average starting wage was the same in NJ and PA prior to the change in the minimum wage.
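Both averages can also be recovered directly from the fitted coefficients instead of by hand:
# Average starting wage implied by the model
coef(prioravg_wage)[1]   # Pennsylvania (state = 0): the intercept
sum(coef(prioravg_wage)) # New Jersey (state = 1): intercept + state coefficient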
C. Now let’s assume that someone pointed out the northeast suburbs of Philadelphia are very different from the rest of Pennsylvania. This person claims that the model that should be used to estimate the differences between fast food restaurants located in NJ and PA prior to the change is as follows:
full time employment_i = β0 + β1 × state_i + β2 × pa1_i + β3 × pa2_i + ε_i
Now let’s run the new model based on the above information:
PA1_PA <- lm(empft ~ state + pa1, data=fastfood)
summary(PA1_PA)
Call:
lm(formula = empft ~ state + pa1, data = fastfood)
Residuals:
Min 1Q Median 3Q Max
-10.943 -5.715 -2.715 3.410 52.285
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.780 1.348 7.254 2.2e-12 ***
state -2.065 1.433 -1.441 0.150
pa1 1.162 1.987 0.585 0.559
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 8.633 on 389 degrees of freedom
Multiple R-squared: 0.01495, Adjusted R-squared: 0.009881
F-statistic: 2.951 on 2 and 389 DF, p-value: 0.05346
Here we go. The model (with the pa1 dummy only) ran successfully.
According to this model above, what would be the average difference in full time employment between the restaurants located in the northeast suburbs of Philadelphia and the rest of Pennsylvania?
The model as written above can’t be estimated, because its variables are perfectly collinear: every observation with the state dummy equal to 0 has either pa1 or pa2 equal to one, so state = 1 − pa1 − pa2. Therefore, we can’t identify this difference from that model.
D. Now, let’s run the difference in differences model to see whether the change created a difference in the employment. According to what we saw in the lecture, the model to estimate should be the following:
empft_it = β0 + β1 × state_i + β2 × post_t + β3 × state_i × post_t + ε_it (equation 1)
Where post_t is a dummy variable that takes the value of 1 after the change takes place.
E. Even though we aim to estimate equation (1), the data we have is “wide,” meaning that we have one observation per fast food restaurant, and the different yearly observations are in different variables. Model 1 would require the data to be in the “long” format, with the different years stacked. We could change the structure of the data to solve this problem.
Alternatively, someone has suggested that instead we could run the following model:
empft_i2 − empft_i1 = α0 + α1 × state_i + ν_i (equation 2)
And that our estimate for α1 in equation (2) would be equivalent to β3 in equation (1).
DiD <- lm((empft2 - empft) ~ state, data=fastfood)
summary(DiD)
Call:
lm(formula = (empft2 - empft) ~ state, data = fastfood)
Residuals:
Min 1Q Median 3Q Max
-32.257 -3.699 0.301 4.301 34.301
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.743 1.181 -2.323 0.02071 *
state 3.443 1.316 2.617 0.00921 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 10.3 on 390 degrees of freedom
Multiple R-squared: 0.01726, Adjusted R-squared: 0.01474
F-statistic: 6.849 on 1 and 390 DF, p-value: 0.009213
What value do you obtain for the DiD estimate? It would be 3.443.
Assuming that we can interpret the estimate for α1 as causal and that the minimum wage for these fast-food restaurants is binding, can you conclude that the NJ increase in the minimum wage had a negative effect on full-time employment?
No. Even though we can reject the null hypothesis that α1=0 at the 5% level, the estimate for this parameter is positive. Thus, we are finding that the increase in the minimum wage led to an increase in full-time employment.
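As a sanity check of my own on the claim in part E that α1 from equation (2) matches β3 from equation (1), here is a rough sketch that stacks the two periods into long format and estimates the interacted model directly. Assuming empft is the pre-period count and empft2 the post-period count for each store, and a balanced panel, the state × post interaction should reproduce the 3.44 estimate:
# Sketch: reshape to long format and estimate equation (1) with an interaction
fastfood_long <- data.frame(
  empft = c(fastfood$empft, fastfood$empft2),
  state = rep(fastfood$state, 2),
  post  = rep(c(0, 1), each = nrow(fastfood))
)
did_long <- lm(empft ~ state * post, data = fastfood_long)
summary(did_long) # the state:post coefficient is the DiD estimate (beta3)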
For this demonstration, I am going to replicate results from David S. Lee’s work. The data come from the “Mostly Harmless Econometrics Data Archive”.
Lee (2008) studied the effect of party incumbency on reelection probabilities. In general, Lee was interested in whether a Democratic candidate for a seat in the U.S. House of Representatives has an advantage if his party won the seat last time. Below is the abstract of the paper “The Electoral Advantage to Incumbency and Voters’ Valuation of Politicians’ Experience: A Regression Discontinuity Analysis of Elections to the U.S. House”:
Using data on elections to the United States House of Representatives (1946-1998), this paper exploits a quasi-experiment generated by the electoral system in order to determine if political incumbency provides an electoral advantage - an implicit first-order prediction of principal-agent theories of politicians and voter behavior. Candidates who just barely won an election (barely became the incumbent) are likely to be ex ante comparable in all other ways to candidates who barely lost, and so their differential electoral outcomes in the next election should represent a true incumbency advantage. The regression discontinuity analysis provides striking evidence that incumbency has a significant causal effect of raising the probability of subsequent electoral success - by about 0.4 to 0.45. Simulations - using estimates from a structural model of individual voting behavior - imply that about two-thirds of the apparent electoral success of incumbents can be attributed to voters’ valuation of politicians’ experience. The quasi-experimental analysis also suggest that heuristic “fixed effects” and “instrumental variable” modeling approaches would have led to misleading inferences in this context.
Variables:
We need package rdd to conduct Regression Discontinuity in R.
library(rdd)
#Loading the dataset
Lee <- read.csv("indiv_final.csv")
summary(Lee)
yearel myoutcomenext difshare
Min. :1946 Min. :0.0000 Min. :-1.0000
1st Qu.:1960 1st Qu.:0.0000 1st Qu.:-0.4845
Median :1974 Median :0.0000 Median :-0.1291
Mean :1972 Mean :0.3202 Mean :-0.1252
3rd Qu.:1986 3rd Qu.:1.0000 3rd Qu.: 0.1918
Max. :1996 Max. :1.0000 Max. : 1.0000
Based on the information provided, create a variable for whether the party of the candidate is the same as the incumbent party. What is the proportion of such cases in the data set?
# If difshare > 0, the candidate's party won the previous election, so it is the incumbent party
Lee$incumb_candidate <- as.numeric(Lee$difshare > 0)
# Measuring the proportion of the incumbent candidates
proportion_incumb <- sum(Lee$incumb_candidate)/nrow(Lee)
# Checking the Statistics
proportion_incumb
[1] 0.3990055
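Since incumb_candidate is coded 0/1, the same proportion is simply its mean:
mean(Lee$incumb_candidate) # identical to the sum/nrow calculation above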
One of the main assumptions in RD designs is that there are no jumps in the density of the running variable around the cutoff. The rdd package in R provides the DCdensity command for the McCrary sorting test. Let’s run the command using difshare as the running variable.
What is the estimated log difference in heights at the cutpoint?
DCdensity(Lee$difshare, verbose = TRUE, ext.out = FALSE, htest = TRUE)
Assuming cutpoint of zero.
Using calculated bin size: 0.005
Using calculated bandwidth: 0.183
Log difference in heights is -0.002 with SE 0.052
this gives a z-stat of -0.048
and a p value of 0.962
McCrary (2008) sorting test
data:
z = -0.047558, binwidth = 0.0052909, bandwidth = 0.1833466, cutpoint =
0.0000000, p-value = 0.9621
alternative hypothesis: no apparent sorting
Now, keep only the observations within 50 percentage points of the cutoff (the absolute value of ‘difshare’ is less than or equal to 0.5). Also, create in R the required variables to run the following models:
1. y_i = β0 + β1 × 1[difshare_i ≥ 0] + ε_i (model 1)
2. y_i = β0 + β1 × 1[difshare_i ≥ 0] + γ1 × difshare_i + ε_i (model 2)
3. y_i = β0 + β1 × 1[difshare_i ≥ 0] + γ1 × difshare_i + δ1 × difshare_i × 1[difshare_i ≥ 0] + ε_i (model 3)
4. y_i = β0 + β1 × 1[difshare_i ≥ 0] + γ1 × difshare_i + γ2 × difshare_i^2 + ε_i (model 4)
5. y_i = β0 + β1 × 1[difshare_i ≥ 0] + γ1 × difshare_i + γ2 × difshare_i^2 + δ1 × difshare_i × 1[difshare_i ≥ 0] + δ2 × difshare_i^2 × 1[difshare_i ≥ 0] + ε_i (model 5)
6. y_i = β0 + β1 × 1[difshare_i ≥ 0] + γ1 × difshare_i + γ2 × difshare_i^2 + γ3 × difshare_i^3 + ε_i (model 6)
7. y_i = β0 + β1 × 1[difshare_i ≥ 0] + γ1 × difshare_i + γ2 × difshare_i^2 + γ3 × difshare_i^3 + δ1 × difshare_i × 1[difshare_i ≥ 0] + δ2 × difshare_i^2 × 1[difshare_i ≥ 0] + δ3 × difshare_i^3 × 1[difshare_i ≥ 0] + ε_i (model 7)
Where y_i corresponds to the myoutcomenext variable in the data set, and 1[difshare_i ≥ 0] is a dummy variable that indicates whether the party of the candidate won the previous election.
To achieve this goal, I am going to create an empty matrix where I can dump the desired components of the results from the models above.
combined_matrix <- matrix(NA, nrow=2, ncol=7)
combined_matrix
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] NA NA NA NA NA NA NA
[2,] NA NA NA NA NA NA NA
Now, I am going to create the squared and cubic value of the difshare variable:
Lee$difshare_square <- (Lee$difshare)^2
Lee$difshare_cube <- (Lee$difshare)^3
head(Lee$difshare_square)
[1] 0.003780799 0.003780799 0.010997612 0.010997612 0.286996990 0.027048440
head(Lee$difshare_cube)
[1] 0.0002324745 -0.0002324745 0.0011533141 -0.0011533141 -0.1537503144
[6] 0.0044484974
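As an aside, these polynomial terms could also be written inline in the model formulas with I(), avoiding the extra columns; I stick with the explicit difshare_square and difshare_cube variables below, but the inline version of, say, model 4 would be:
# Inline quadratic term (sketch only; not used in what follows)
lm(myoutcomenext ~ incumb_candidate + difshare + I(difshare^2), data = Lee, subset = abs(difshare) <= 0.5)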
I will now fit the models listed above and put the desired components of their output into the ‘combined_matrix’ I created above. I do not need the full output of every model, because that would be overwhelming; the matrix stores only the required information: seven models, so seven columns, with the incumbency estimate in the first row and its p-value in the second.
For this analysis, I need only the observations within 50 percentage points of the cutoff, which I pass through the subset argument:
model1 <- lm(myoutcomenext ~ incumb_candidate, data = Lee, subset = abs(difshare) <= 0.5) #abs returns the absolute value, the positive value
combined_matrix[1,1] <- model1$coefficients[2] #the β-Estimate[2] in the coefficient table will fill the [1 X 1] of the combined_matrix
fit_model1 <- summary(model1)
combined_matrix[2,1] <- fit_model1$coefficients[2,4] #p-value in the position of [2 X 4] of the Coefficients table will fill the [2 X 1] position in my combined_matrix.
combined_matrix #Checking if everything came together.
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 0.753124 NA NA NA NA NA NA
[2,] 0.000000 NA NA NA NA NA NA
That did work. We have now the first column filled. Let’s run all the remaining models and fill the combined_matrix.
# Model 2
model2 <- lm(myoutcomenext ~ incumb_candidate + difshare, data = Lee, subset = abs(difshare) <= 0.5)
combined_matrix[1,2] <- model2$coefficients[2]
fit_model2 <- summary(model2)
combined_matrix[2,2] <- fit_model2$coefficients[2,4]
combined_matrix
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 0.753124 0.6263884 NA NA NA NA NA
[2,] 0.000000 0.0000000 NA NA NA NA NA
# Model 3
model3 <- lm(myoutcomenext ~ incumb_candidate + difshare + incumb_candidate * difshare, data = Lee, subset = abs(difshare) <= 0.5)
combined_matrix[1,3] <- model3$coefficients[2]
fit_model3 <- summary(model3)
combined_matrix[2,3] <- fit_model3$coefficients[2,4]
combined_matrix
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 0.753124 0.6263884 0.6231998 NA NA NA NA
[2,] 0.000000 0.0000000 0.0000000 NA NA NA NA
# Model 4
model4 <- lm(myoutcomenext ~ incumb_candidate + difshare + difshare_square, data = Lee, subset = abs(difshare) <= 0.5)
combined_matrix[1,4] <- model4$coefficients[2]
fit_model4 <- summary(model4)
combined_matrix[2,4] <- fit_model4$coefficients[2,4]
combined_matrix
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 0.753124 0.6263884 0.6231998 0.6227309 NA NA NA
[2,] 0.000000 0.0000000 0.0000000 0.0000000 NA NA NA
# Model 5
model5 <- lm(myoutcomenext ~ incumb_candidate + difshare + difshare_square + incumb_candidate * difshare + incumb_candidate * difshare_square, data = Lee, subset = abs(difshare) <= 0.5)
combined_matrix[1,5] <- model5$coefficients[2]
fit_model5 <- summary(model5)
combined_matrix[2,5] <- fit_model5$coefficients[2,4]
combined_matrix
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 0.753124 0.6263884 0.6231998 0.6227309 0.5301643 NA NA
[2,] 0.000000 0.0000000 0.0000000 0.0000000 0.0000000 NA NA
# Model 6
model6 <- lm(myoutcomenext ~ incumb_candidate + difshare + difshare_square + difshare_cube, data = Lee, subset = abs(difshare) <= 0.5)
combined_matrix[1,6] <- model6$coefficients[2]
fit_model6 <- summary(model6)
combined_matrix[2,6] <- fit_model6$coefficients[2,4]
combined_matrix
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 0.753124 0.6263884 0.6231998 0.6227309 0.5301643 0.5301643 NA
[2,] 0.000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 NA
# Model 7
model7 <- lm(myoutcomenext ~ incumb_candidate + difshare + difshare_square + difshare_cube + incumb_candidate * difshare + incumb_candidate * difshare_square + incumb_candidate * difshare_cube, data = Lee, subset = abs(difshare) <= 0.5)
combined_matrix[1,7] <- model7$coefficients[2]
fit_model7 <- summary(model7)
combined_matrix[2,7] <- fit_model7$coefficients[2,4]
combined_matrix
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 0.753124 0.6263884 0.6231998 0.6227309 0.5301643 0.5301643 4.764126e-01
[2,] 0.000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 8.514610e-155
A. For which of the models do you find that the effects of party incumbency over re-election is greater than 0.6?
The results show that four of the models (models 1, 2, 3, and 4) have an estimated effect of party incumbency on re-election greater than 0.6.
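The same answer can be read straight off the first row of combined_matrix:
which(combined_matrix[1, ] > 0.6) # columns (models) whose estimated effect exceeds 0.6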
B. Using the RDestimate command to estimate the effect non-parametrically: what is the point estimate that we obtain from this method?
NP_model <- RDestimate(myoutcomenext ~ difshare, data = Lee, subset = abs(Lee$difshare) <= 0.5)
summary(NP_model)
Call:
RDestimate(formula = myoutcomenext ~ difshare, data = Lee, subset = abs(Lee$difshare) <=
0.5)
Type:
sharp
Estimates:
Bandwidth Observations Estimate Std. Error z value Pr(>|z|)
LATE 0.11982 4695 0.4707 0.02695 17.47 2.578e-68
Half-BW 0.05991 2363 0.4511 0.03934 11.47 1.974e-30
Double-BW 0.23965 9182 0.5119 0.01818 28.15 2.082e-174
LATE ***
Half-BW ***
Double-BW ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
F-statistics:
F Num. DoF Denom. DoF p
LATE 878.0 3 4691 0
Half-BW 334.1 3 2359 0
Double-BW 2493.8 3 9178 0
The results show that the obtained estimate is 0.4707.
C. Plotting the model with the default (optimal) bandwidth, then with three times the optimal bandwidth, and finally with one-third of the optimal bandwidth.
To make our task easy, I create a variable named Bandwidth and assign it the LATE bandwidth reported in the Estimates table above, 0.11982. (If I read the rdd package’s return object correctly, this could also be pulled programmatically, e.g. from NP_model$bw, but hard-coding the printed value is fine here.)
Bandwidth <- 0.11982
# A. Optimal (Default) Bandwidth
Optimal_bandwidth <- RDestimate(myoutcomenext ~ difshare, data = Lee, subset = abs(Lee$difshare) <= 0.5)
plot(Optimal_bandwidth)
# B. Three-times the Optimal Bandwidth
Threetimes_bandwidth <- RDestimate(myoutcomenext ~ difshare, data = Lee, subset = abs(Lee$difshare) <= 0.5, kernel = "rectangular", bw = 3*Bandwidth)
plot(Threetimes_bandwidth)
# C. One-third of the Optimal Bandwidth
OneThird_bandwidth <- RDestimate(myoutcomenext ~ difshare, data = Lee, subset = abs(Lee$difshare) <= 0.5, kernel = "rectangular", bw = Bandwidth/3)
plot(OneThird_bandwidth)
Finally, these standard lm diagnostic plots are not really useful for regression discontinuity models, but just to see what they look like, I am plotting model 7, the most complex model.
plot(model7)
——————Thanks————————————–