# Libraries
library(tidyverse)
## ── Attaching packages ────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.3
## ✓ tibble 3.0.3 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.4.0
## Warning: package 'ggplot2' was built under R version 3.6.2
## Warning: package 'tibble' was built under R version 3.6.2
## Warning: package 'tidyr' was built under R version 3.6.2
## Warning: package 'dplyr' was built under R version 3.6.2
## ── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(texreg)
## Version: 1.36.23
## Date: 2017-03-03
## Author: Philip Leifeld (University of Glasgow)
##
## Please cite the JSS article in your publications -- see citation("texreg").
##
## Attaching package: 'texreg'
## The following object is masked from 'package:tidyr':
##
## extract
library(clubSandwich) # Cluster robust se
## Registered S3 method overwritten by 'clubSandwich':
## method from
## bread.mlm sandwich
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(plm)
##
## Attaching package: 'plm'
## The following objects are masked from 'package:dplyr':
##
## between, lag, lead
library(fixest)
## Warning: package 'fixest' was built under R version 3.6.2
##
## Attaching package: 'fixest'
## The following object is masked from 'package:texreg':
##
## coefplot
(Wooldridge, Chapter 14, C1) Use the data in rental.Rdata for this exercise.
load("rental.RData")
head(data)
## city year pop enroll rent rnthsg tothsg avginc lenroll lpop
## 1 1 80 75211 15303 197 13475 26167 11537 9.635804 11.22805
## 2 1 90 77759 18017 342 15660 29467 19568 9.799071 11.26137
## 3 2 80 106743 22462 323 14580 37277 19841 10.019580 11.57818
## 4 2 90 141865 29769 496 26895 55540 31885 10.301223 11.86263
## 5 3 80 36608 11847 216 7026 13482 11455 9.379830 10.50802
## 6 3 90 42099 10265 351 9557 16894 21202 9.236495 10.64778
## lrent ltothsg lrnthsg lavginc clenroll clpop clrent
## 1 5.283204 10.172255 9.508592 9.353314 NA NA NA
## 2 5.834811 10.291026 9.658865 9.881651 -15293.20 0.03331661 0.5516071
## 3 5.777652 10.526132 9.587406 9.895506 NA NA NA
## 4 6.206576 10.924859 10.199696 10.369891 -22451.70 0.28445148 0.4289236
## 5 5.375278 9.509110 8.857373 9.346182 NA NA NA
## 6 5.860786 9.734714 9.165030 9.961851 -11837.76 0.13975716 0.4855080
## cltothsg clrnthsg clavginc pctstu cpctstu y90
## 1 NA NA NA 20.34676 NA 0
## 2 0.1187716 0.1502733 0.5283365 23.17031 2.82355118 1
## 3 NA NA NA 21.04307 NA 0
## 4 0.3987274 0.6122894 0.4743853 20.98403 -0.05903244 1
## 5 NA NA NA 32.36178 NA 0
## 6 0.2256031 0.3076563 0.6156693 24.38300 -7.97877693 1
The data on rental prices and other variables for college towns are for the years 1980 and 1990. The idea is to see whether a stronger presence of students affects rental rates. The unobserved effects model is:
\[ \begin{eqnarray*} \log(rent_{it}) &=& \beta_0 + \delta_{0}\,\textit{y90}_t + \beta_{1}\log(pop_{it}) + \beta_{2}\log(avginc_{it}) \\ &+& \beta_{3}\,\textit{pctstu}_{it} + u_{it} \end{eqnarray*} \]
where pop is city population, avginc is average income, and pctstu is student population as a percentage of city population (during the school year).
1. Estimate the equation by pooled OLS and report the results in standard form. What do you make of the estimate on the 1990 dummy variable? What do you get for \(\widehat{\beta}_{pctstu}\)?
# defining the equation
the.formula = lrent ~ y90 + lpop + lavginc + pctstu
mod.pols = lm(the.formula, data = data)
mod.ols = plm(formula = the.formula, data = data, model = "pooling")
list(mod.pols, mod.ols) %>% screenreg(digits = 5)
##
## =========================================
## Model 1 Model 2
## -----------------------------------------
## (Intercept) -0.56881 -0.56881
## (0.53488) (0.53488)
## y90 0.26223 *** 0.26223 ***
## (0.03476) (0.03476)
## lpop 0.04069 0.04069
## (0.02252) (0.02252)
## lavginc 0.57145 *** 0.57145 ***
## (0.05310) (0.05310)
## pctstu 0.00504 *** 0.00504 ***
## (0.00102) (0.00102)
## -----------------------------------------
## R^2 0.86128 0.86128
## Adj. R^2 0.85677 0.85677
## Num. obs. 128 128
## RMSE 0.12592
## =========================================
## *** p < 0.001, ** p < 0.01, * p < 0.05
# Pooled OLS is just a regular OLS on panel data
The coefficient on the 1990 dummy says that, ceteris paribus, log rent is about 0.26 higher in 1990 than in 1980, i.e. rents are roughly 26\(\%\) higher on average. The exact figure is:
exp(0.26223)-1
## [1] 0.2998255
i.e. an exact increase of about 30\(\%\).
\(\hat{\beta}_{pctstu}\) indicates that a one percentage point increase in the student population as a share of city population increases rent by approximately 0.5\(\%\), ceteris paribus.
Note that such a pooled estimate is vulnerable to Simpson's paradox: pooling across cities can mask or even reverse the within-city relationship.
2. Are the standard errors you report in part 1. valid? Explain.
No. The usual OLS standard errors assume i.i.d. errors, but there is serial correlation within clusters (in this case cities); panel data are by construction serially correlated. The cluster correlation comes from the non-zero variance of the city effect, \(var(\alpha_i)\).
E.g. it is fair to expect a correlation between the rent in city 1 in 1980 and in 1990. The standard OLS errors do not take this into account and are therefore biased downwards due to the cluster correlation. That is, we will tend to conclude that something is significant when in fact it is not.
For this reason we need to account for correlation within clusters, which is why we should instead cluster the standard errors at the city level.
# Cluster-robust se
cr.vcov.pols = vcovCR(mod.pols, cluster = data$city, type = "CR1") # Estimating cluster-robust vcov
test.pols = coeftest(mod.pols, cr.vcov.pols) # Testing significance with cluster-robust standard errors (cr.vcov)
list(mod.pols, test.pols) %>% screenreg(digits = 6) # Showing results
##
## ==========================================
## Model 1 Model 2
## ------------------------------------------
## (Intercept) -0.568807 -0.568807
## (0.534881) (1.023349)
## y90 0.262227 *** 0.262227 ***
## (0.034763) (0.063228)
## lpop 0.040686 0.040686
## (0.022515) (0.028365)
## lavginc 0.571446 *** 0.571446 ***
## (0.053098) (0.120953)
## pctstu 0.005044 *** 0.005044 **
## (0.001019) (0.001503)
## ------------------------------------------
## R^2 0.861281
## Adj. R^2 0.856770
## Num. obs. 128
## RMSE 0.125915
## ==========================================
## *** p < 0.001, ** p < 0.01, * p < 0.05
Note that \(\hat{\beta}_{pctstu}\) is less significant once the larger cluster-robust standard error is used.
3. Now, difference the equation and estimate by OLS. Compare your estimate of \(\beta_{pctstu}\) with the estimate that you obtained in part 1.. Does the relative size of the student population appear to affect rental prices?
Differencing the equation means subtracting each city's observation in period t (1980) from the corresponding observation in period t+1 (1990). Estimating the differenced equation by OLS gives the first-difference (FD) estimator.
# FD: the differenced dependent variable, computed by hand
# (assumes the rows are sorted by city within each year, as head(data) suggests)
data$lrent[which(data$year==90)]-data$lrent[which(data$year==80)]
## [1] 0.5516071 0.4289236 0.4855080 0.7894783 0.6664791 0.8253188 0.5453234
## [8] 0.4959292 0.8087320 0.4590969 0.6664791 0.6182160 0.5519438 0.6380877
## [15] 0.6518292 0.5108256 0.4759617 0.5705447 0.5148497 0.5697684 0.5372767
## [22] 0.3928337 0.7387824 0.5980387 0.5521994 0.6087804 0.5166907 0.4338241
## [29] 0.4815888 0.4951186 0.5319953 0.7624297 0.7659063 0.6666079 0.6471848
## [36] 0.6706743 0.6304369 0.4818382 0.4427724 0.5300832 0.4443808 0.4805326
## [43] 0.3944759 0.4802251 0.4519849 0.5346365 0.6289048 0.6109095 0.6580558
## [50] 0.4919286 0.5404701 0.4172158 0.4415412 0.5785809 0.5337491 0.5047464
## [57] 0.5913644 0.7074475 0.5596156 0.4216728 0.4784904 0.4475307 0.4470139
## [64] 0.6639175
mod.dpols = plm(
the.formula,
data = data,
index = c("city", "year"), # Defining the panel structure: 1. unit index, 2. time index
model = "fd" # Defining the model type
) # remember: pooled OLS does not account for serial correlation or heterogeneity bias (alpha_i != alpha)
list("Pooled OLS" = mod.pols, "First Difference" = mod.dpols) %>% screenreg(digits = 6)
##
## =============================================
## Pooled OLS First Difference
## ---------------------------------------------
## (Intercept) -0.568807 0.385521 ***
## (0.534881) (0.036824)
## y90 0.262227 ***
## (0.034763)
## lpop 0.040686 0.072246
## (0.022515) (0.088343)
## lavginc 0.571446 *** 0.309961 ***
## (0.053098) (0.066477)
## pctstu 0.005044 *** 0.011203 **
## (0.001019) (0.004132)
## ---------------------------------------------
## R^2 0.861281 0.322262
## Adj. R^2 0.856770 0.288375
## Num. obs. 128 64
## RMSE 0.125915
## =============================================
## *** p < 0.001, ** p < 0.01, * p < 0.05
# Note that y90 is absent in FD: its first difference equals 1 for every observation,
# so it is perfectly collinear with the intercept (the FD intercept is the estimated y90 effect).
# If we wanted a separate y90 estimate, another model would be required, e.g., RE.
Comparing the estimates, the impact of pctstu more than doubles under the FD-corrected estimates (0.0112 vs. 0.0050).
Hence we can conclude that the relative size of the student population does appear to affect rental prices, and that the pooled OLS estimate was biased downwards, since it ignored the time-constant city effect.
FD is the better model here because differencing within each city removes the unobserved city effect \(\alpha_i\).
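As a cross-check, here is a minimal sketch (assuming the rows are sorted by city within each year, as the head(data) output suggests) showing that plain OLS on manually differenced data should reproduce the plm FD estimates:
# Manual first differences; relies on identical city ordering within each year
d90 = data[data$year == 90, ]
d80 = data[data$year == 80, ]
fd.data = data.frame(
  dlrent   = d90$lrent   - d80$lrent,
  dlpop    = d90$lpop    - d80$lpop,
  dlavginc = d90$lavginc - d80$lavginc,
  dpctstu  = d90$pctstu  - d80$pctstu
)
# Slopes should match mod.dpols; the intercept absorbs the y90 effect
lm(dlrent ~ dlpop + dlavginc + dpctstu, data = fd.data)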
4. Estimate the model by fixed effects to verify that you get identical estimates and standard errors to those in part 3..
mod.fe = plm(
the.formula,
data = data,
index = c("city", "year"), # Defining the panel structure
model = "within" # Defining the model type
)
list("Pooled OLS" = mod.pols, "First Difference" = mod.dpols, "Fixed effects" = mod.fe) %>% screenreg(digits = 5)
##
## ===========================================================
## Pooled OLS First Difference Fixed effects
## -----------------------------------------------------------
## (Intercept) -0.56881 0.38552 ***
## (0.53488) (0.03682)
## y90 0.26223 *** 0.38552 ***
## (0.03476) (0.03682)
## lpop 0.04069 0.07225 0.07225
## (0.02252) (0.08834) (0.08834)
## lavginc 0.57145 *** 0.30996 *** 0.30996 ***
## (0.05310) (0.06648) (0.06648)
## pctstu 0.00504 *** 0.01120 ** 0.01120 **
## (0.00102) (0.00413) (0.00413)
## -----------------------------------------------------------
## R^2 0.86128 0.32226 0.97653
## Adj. R^2 0.85677 0.28837 0.95032
## Num. obs. 128 64 128
## RMSE 0.12592
## ===========================================================
## *** p < 0.001, ** p < 0.01, * p < 0.05
# FE subtracts each unit's time mean from every observation, removing the level effect.
# That is why it does not reproduce the pooled OLS estimates.
For T=2, i.e. panels of two periods, first differencing (FD) is mathematically identical to fixed effects (FE): the slope estimates and standard errors coincide, and the FD intercept equals the FE coefficient on y90.
We trust FE because it gives a consistent estimate of the slope, and we avoid the risk of interpreting the relationship wrongly, as we do with the pooled OLS model.
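A quick numerical check of this equivalence (a sketch; both calls should print the same three slopes):
# FD and FE slopes coincide when T = 2
coef(mod.dpols)[c("lpop", "lavginc", "pctstu")]
coef(mod.fe)[c("lpop", "lavginc", "pctstu")]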
load("airfare.RData")
(Wooldridge, Chapter 14, C10) Use the data in airfare.Rdata for this exercise.
We are interested in estimating the model:
\[ \begin{eqnarray*} \log(fare_{it}) &=& \eta_t + \beta_{1}\textit{concen}_{it} + \beta_{2}\log(dist_{i}) + \beta_{3}[\log(dist_{i})]^2 \\ &+& a_i + u_{it}, \quad t = 1, \ldots, 4 \end{eqnarray*} \]
where \(\eta_t\) means that we allow for different year intercepts.
1. Estimate the above equation by pooled OLS, being sure to include year dummies. If \(\Delta\textit{concen} = 0.10\), what is the estimated percentage increase in fare?
# Defining the equation. eta is the intercept of each year
the.formula = lfare ~ concen + ldist + ldistsq + y98 + y99 + y00 # we could also have used factor(year) instead
mod.pols = lm(the.formula, data = data)
cr.vcov.pols = vcovCR(mod.pols, cluster = data$id, type = "CR1") # cluster-robust se, which account for serial correlation within routes
test.pols = coeftest(mod.pols, cr.vcov.pols)
list(POLS = test.pols) %>% screenreg(digits = 6)
##
## ==========================
## POLS
## --------------------------
## (Intercept) 6.209258 ***
## (0.911160)
## concen 0.360120 ***
## (0.058518)
## ldist -0.901600 ***
## (0.271769)
## ldistsq 0.103020 ***
## (0.020147)
## y98 0.021124 ***
## (0.004145)
## y99 0.037850 ***
## (0.005176)
## y00 0.099870 ***
## (0.005643)
## ==========================
## *** p < 0.001, ** p < 0.01, * p < 0.05
# The percentage increase in fare
delta = .10
exp(mod.pols$coefficients["concen"]*delta)
## concen
## 1.036668
In conclusion, when concen increases by 0.10, fare increases by a factor of about 1.037, i.e. roughly 3.7\(\%\). Recall that \(fare\) is the average price of a one-way fare.
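To make the distinction between log points and the exact percentage change explicit (a short sketch reusing the objects above):
# Log-point change vs exact percentage change for delta = 0.10
b1 = mod.pols$coefficients["concen"]
b1 * delta # approximation: ~0.036 log points, i.e. ~3.6%
(exp(b1 * delta) - 1) * 100 # exact: ~3.67%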
2. What is the usual OLS 95% confidence interval for \(\beta_1\)? Why is it probably not reliable? Compute fully robust standard errors and find the fully robust 95% CI for \(\beta_1\). Compare it to the usual CI and comment.
# CI for concen
OLS.ci = confint(mod.pols, "concen")
CR.ci = conf_int(mod.pols, cr.vcov.pols, coefs = "concen")
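Printing the two objects shows the intervals side by side:
OLS.ci # usual (i.i.d.) OLS interval
CR.ci # cluster-robust interval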
We observe that the usual OLS 95% CI is narrower than the CI based on the cluster-robust (CR) standard errors.
The usual confidence interval is computed with OLS standard errors, which assume i.i.d. errors. Because of the serial correlation within clusters these are biased downwards, making the interval over-optimistically narrow and hence invalid.
We should therefore use cluster-robust standard errors to compute the CI. This, however, does not make up for the fact that the POLS estimate itself tends to be biased by the unobserved effects if these are correlated with the other variables (which they tend to be).
3. Describe what is happening with the quadratic in \(\log(dist)\). In particular for what value of dist does the relationship between \(\log(fare)\) and dist become positive? (Hint: First figure out the turning point value for \(\log(dist)\), and then exponentiate.) Is the turning point outside the range of the data?
We can find the turning point of the quadratic by differentiating it with respect to ldist and setting the derivative equal to zero: \[ \beta_2 + 2\beta_3 \, ldist_{min} = 0 \\ \Leftrightarrow ldist_{min} = - \frac{\beta_2}{2\beta_3} \\ \Rightarrow ldist_{min} \approx 4.376 \]
-mod.pols$coefficients["ldist"]/(2*mod.pols$coefficients["ldistsq"])
## ldist
## 4.375868
# Is the turning point outside the range of the data?
data$ldist %>% summary() # The range of data
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.554 6.225 6.758 6.696 7.173 7.910
The turning point lies below the range of the data, so the relationship between lfare and ldist is increasing throughout the sample, since ldist in the data starts at 4.554.
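Per the hint, exponentiating translates the turning point into miles: roughly exp(4.376) ≈ 79, below the shortest route in the sample (exp(4.554) ≈ 95).
# Turning point and sample minimum in levels of dist (miles)
exp(-mod.pols$coefficients["ldist"]/(2*mod.pols$coefficients["ldistsq"]))
exp(min(data$ldist))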
# Graphical presentation
fake.data = data.frame(ldist = 1:100/10) # non-real distance
fake.data$lfare = fake.data$ldist*mod.pols$coefficients["ldist"] + fake.data$ldist^2*mod.pols$coefficients["ldistsq"]
plot(fake.data)
4. Now estimate the equation using random effects. How does the estimate of \(\beta_1\) change?
# RE exploits the serial correlation induced by a_i via GLS, giving more efficient estimates
# than pooled OLS (but still biased if a_i is correlated with the regressors); FE, on the contrary, is unbiased in that case, but less efficient.
mod.re = plm(the.formula,
data = data,
index = c("id", "year"),
model = "random"
)
cr.vcov.re = vcovCR(mod.re, cluster = data$id, type = "CR1")
test.re = coeftest(mod.re, cr.vcov.re)
list(POLS = test.pols, RE = test.re) %>% screenreg(digits = 6)
##
## =========================================
## POLS RE
## -----------------------------------------
## (Intercept) 6.209258 *** 6.222005 ***
## (0.911160) (0.913810)
## concen 0.360120 *** 0.208993 ***
## (0.058518) (0.042218)
## ldist -0.901600 *** -0.852092 **
## (0.271769) (0.271912)
## ldistsq 0.103020 *** 0.097460 ***
## (0.020147) (0.020129)
## y98 0.021124 *** 0.022474 ***
## (0.004145) (0.004143)
## y99 0.037850 *** 0.036690 ***
## (0.005176) (0.005128)
## y00 0.099870 *** 0.098212 ***
## (0.005643) (0.005520)
## =========================================
## *** p < 0.001, ** p < 0.01, * p < 0.05
The estimate on \(concen\) drops substantially, from 0.360 to 0.209.
5. Now estimate the equation using fixed effects. What is the FE estimate of \(\beta_1\)? Why is it fairly similar to the RE estimate? (Hint: What is \(\widehat{\theta}\) for RE estimation?) Name two characteristics of a route (other than distance between stops) that are captured by \(a_i\). Might these be correlated with \(\textit{concen}_{it}\)? Are you convinced that higher concentration on a route increases airfares? What is your best estimate?
the.formula = lfare ~ concen # ldist and the year dummies are absorbed by the two-way fixed effects
mod.fe = plm(the.formula,
data = data,
index = c("id", "year"),
effect = "twoways",
model = "within"
)
mod.fe %>% screenreg(digits = 6)
##
## ==========================
## Model 1
## --------------------------
## concen 0.168859 ***
## (0.029410)
## --------------------------
## R^2 0.009484
## Adj. R^2 -0.321935
## Num. obs. 4596
## ==========================
## *** p < 0.001, ** p < 0.01, * p < 0.05
cr.vcov.fe = vcovCR(mod.fe,
cluster = c(data$id),
type = "CR1")
test.fe = coeftest(mod.fe, cr.vcov.fe)
list(RE = test.re, FE = test.fe) %>% screenreg(digits = 6)
##
## =========================================
## RE FE
## -----------------------------------------
## (Intercept) 6.222005 ***
## (0.913810)
## concen 0.208993 *** 0.168859 ***
## (0.042218) (0.049437)
## ldist -0.852092 **
## (0.271912)
## ldistsq 0.097460 ***
## (0.020129)
## y98 0.022474 ***
## (0.004143)
## y99 0.036690 ***
## (0.005128)
## y00 0.098212 ***
## (0.005520)
## =========================================
## *** p < 0.001, ** p < 0.01, * p < 0.05
# fixest
mod.fixest = feols(fml = lfare ~ concen | id + year, data = data)
summary(mod.fixest, se = "cluster")
## OLS estimation, Dep. Var.: lfare
## Observations: 4,596
## Fixed-effects: id: 1,149, year: 4
## Standard-errors: Clustered (id)
## Estimate Std. Error t value Pr(>|t|)
## concen 0.168859 0.049459 3.4141 0.000662 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 0.092188 Adj. R2: 0.94043
## Within R2: 0.009484
# 0.049459: clustered (by id) standard errors are the feols default
# Matching the small-sample corrections between feols and plm:
se_feols_id_plm = se(mod.fixest, dof = dof(fixef.K = "none", cluster.adj = F))
#.04941565
se_plm_id = sqrt(vcovHC(mod.fe, cluster = "group")["concen", "concen"])
#.04941565
The difference in the \(concen\) estimate between RE and FE is not as large as the difference between POLS and RE. This is because the RE quasi-demeaning parameter \(\widehat{\theta}\) is close to one here, so RE almost fully demeans the data and therefore behaves much like FE.
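One can inspect \(\widehat{\theta}\) directly (a sketch; plm stores the variance decomposition of random-effects models in the ercomp component):
# RE quasi-demeaning parameter theta; values near 1 make RE behave like FE
mod.re$ercomp$theta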
Note that ldist is constant throughout the period: the distance between the two destinations of a route does not change. Thus we cannot say anything about distance with FE, since time-invariant regressors are swept out by the within transformation.
It is an important assumption of RE that the explanatory variables are uncorrelated with the unit-wise heterogeneity, \(\alpha_i\); only if \(\alpha_i\) is effectively random can we use RE. If this is not the case, RE should not be used and FE is strongly preferred, provided it is feasible, which it is only for characteristics that change over time.
Two route characteristics captured by \(a_i\) could be, for example, the overall demand or popularity of a route and whether it connects hub airports; both are plausibly correlated with \(concen_{it}\). Hence, despite the similar point estimates, we should not trust the RE model and should rely on the FE estimate instead: our best estimate of \(\beta_1\) is the FE value of about 0.17.
THINK: more demand for a route → higher concen; higher concen → more suppliers enter → more competition.
(Heij et al. E 7.25 a,c-f) The data file xr725fas.csv contains quarterly fashion sales data of a US retailer with multiple specialty divisions over the period 1986.1–1992.4.
require(foreign) # (not actually needed here: read.csv is base R)
## Loading required package: foreign
data = read.csv("xr725fas.csv", header = T, sep = ",", dec = ".")
This is a panel data set with \(m = 5\) units and \(n = 28\) observations. The first two divisions specialize in high-priced fashion apparel, division 3 in low-priced clothes, and divisions 4 and 5 in specialities like large sizes, undergarments, and so on. The data consist of the quarterly sales \(S_{it}\) of the divisions, the purchasing ability \(A_{t}\), and the consumer confidence \(C_t\). We formulate the model:
\[ \begin{eqnarray*} \log(S_{it}) &=& \alpha_i + \alpha_{i2}D_{2t} + \alpha_{i3}D_{3t} + \alpha_{i4}D_{4t} \\ &+& \beta_i\log(A_t) + \gamma_i\log(C_t) + \epsilon_{it} \end{eqnarray*} \]
# Note that the panel data is in wide form.
# Removing SALES from df
data = data %>% select(-(SALES_1:SALES_5))
# Reshaping
data = data %>% pivot_longer(
LOGSALES_1:LOGSALES_5,
names_to = "Division",
values_to = "LOGS"
)
# Rename: keep only the division number from the column name
data = data %>%
mutate(Division = substr(Division, 10, 10)) # "LOGSALES_1" -> "1", etc.
# Defining dummy for each Division
data = data %>% mutate(
Div1 = ifelse(Division == 1, 1, 0),
Div2 = ifelse(Division == 2, 1, 0),
Div3 = ifelse(Division == 3, 1, 0),
Div4 = ifelse(Division == 4, 1, 0),
Div5 = ifelse(Division == 5, 1, 0)
)
1. Estimate the above model (with thirty regression parameters) by OLS. Check that there exists significant contemporaneous correlation between the residual terms for the five divisions in the same quarter.
# Defining a formula with interaction terms so that every division gets its own parameters
the.formula = LOGS ~
Div2 + Div3 + Div4 + Div5 + # Dummy for each division
D2 + D3 + D4 + # Seasonal dummies D2-D4 for division 1 (D1 is the base quarter)
I(Div2*D2) + I(Div2*D3) + I(Div2*D4) + # Seasonal parameters for the other divisions
I(Div3*D2) + I(Div3*D3) + I(Div3*D4) +
I(Div4*D2) + I(Div4*D3) + I(Div4*D4) +
I(Div5*D2) + I(Div5*D3) + I(Div5*D4) +
LOGA + LOGC + # LOGA and LOGC for division 1
I(Div2*LOGA) + I(Div2*LOGC) + # LOGA and LOGC parameters for the other divisions
I(Div3*LOGA) + I(Div3*LOGC) +
I(Div4*LOGA) + I(Div4*LOGC) +
I(Div5*LOGA) + I(Div5*LOGC)
mod1 = lm(the.formula, data = data)
mod1 %>% screenreg() # Looking at model
##
## ==========================
## Model 1
## --------------------------
## (Intercept) -6.14 *
## (2.41)
## Div2 9.84 **
## (3.41)
## Div3 -2.21
## (3.41)
## Div4 10.84 **
## (3.41)
## Div5 13.91 ***
## (3.41)
## D2 0.19 ***
## (0.04)
## D3 0.31 ***
## (0.04)
## D4 0.62 ***
## (0.04)
## I(Div2 * D2) -0.17 **
## (0.06)
## I(Div2 * D3) -0.15 *
## (0.06)
## I(Div2 * D4) -0.13 *
## (0.06)
## I(Div3 * D2) -0.06
## (0.06)
## I(Div3 * D3) -0.19 **
## (0.06)
## I(Div3 * D4) -0.07
## (0.06)
## I(Div4 * D2) 0.05
## (0.06)
## I(Div4 * D3) -0.21 ***
## (0.06)
## I(Div4 * D4) -0.21 **
## (0.06)
## I(Div5 * D2) -0.14 *
## (0.06)
## I(Div5 * D3) -0.21 ***
## (0.06)
## I(Div5 * D4) -0.06
## (0.06)
## LOGA 1.49 ***
## (0.33)
## LOGC 0.66 **
## (0.20)
## I(Div2 * LOGA) -2.04 ***
## (0.47)
## I(Div2 * LOGC) -0.10
## (0.29)
## I(Div3 * LOGA) 0.47
## (0.47)
## I(Div3 * LOGC) -0.13
## (0.29)
## I(Div4 * LOGA) -1.52 **
## (0.47)
## I(Div4 * LOGC) -0.77 **
## (0.29)
## I(Div5 * LOGA) -1.81 ***
## (0.47)
## I(Div5 * LOGC) -0.97 ***
## (0.29)
## --------------------------
## R^2 0.97
## Adj. R^2 0.97
## Num. obs. 140
## RMSE 0.08
## ==========================
## *** p < 0.001, ** p < 0.01, * p < 0.05
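As an aside, the same thirty-parameter model can be written more compactly with R's factor interactions (a sketch; mod1.alt is a hypothetical name, and * expands to main effects plus all interactions, reproducing the hand-coded dummies above):
# Equivalent thirty-parameter specification via factor interactions
mod1.alt = lm(LOGS ~ factor(Division) * (D2 + D3 + D4 + LOGA + LOGC), data = data)
length(coef(mod1.alt)) # 30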
Contemporaneous correlation means that the error terms of different units (here, divisions) are correlated within the same period.
One way to test for this is to regress the residuals on the quarter indicators: if the quarter effects are jointly significant, the residuals of the five divisions share a common time component, i.e. the units' errors are correlated at the same point in time.
# Extracting the residuals
res = mod1$residuals
data = data %>% mutate(res = res) # Adding residuals to data frame
# Testing for contemporaneous correlation
mod.res = lm(res ~ OBS, data = data)
mod.res %>% summary()
##
## Call:
## lm(formula = res ~ OBS, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.144014 -0.031582 0.003625 0.029482 0.138292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.067234 0.024420 -2.753 0.006886 **
## OBS1986:2 0.021774 0.034535 0.631 0.529648
## OBS1986:3 0.108516 0.034535 3.142 0.002146 **
## OBS1986:4 0.059469 0.034535 1.722 0.087831 .
## OBS1987:1 0.095243 0.034535 2.758 0.006796 **
## OBS1987:2 0.122404 0.034535 3.544 0.000575 ***
## OBS1987:3 0.123999 0.034535 3.591 0.000491 ***
## OBS1987:4 -0.005514 0.034535 -0.160 0.873426
## OBS1988:1 -0.013547 0.034535 -0.392 0.695597
## OBS1988:2 0.019212 0.034535 0.556 0.579114
## OBS1988:3 0.083096 0.034535 2.406 0.017758 *
## OBS1988:4 0.092585 0.034535 2.681 0.008451 **
## OBS1989:1 0.055999 0.034535 1.622 0.107717
## OBS1989:2 0.050434 0.034535 1.460 0.146984
## OBS1989:3 0.073063 0.034535 2.116 0.036591 *
## OBS1989:4 0.095224 0.034535 2.757 0.006806 **
## OBS1990:1 0.182299 0.034535 5.279 6.44e-07 ***
## OBS1990:2 0.119112 0.034535 3.449 0.000794 ***
## OBS1990:3 0.100793 0.034535 2.919 0.004251 **
## OBS1990:4 0.127635 0.034535 3.696 0.000341 ***
## OBS1991:1 0.100717 0.034535 2.916 0.004279 **
## OBS1991:2 0.113252 0.034535 3.279 0.001387 **
## OBS1991:3 0.028113 0.034535 0.814 0.417352
## OBS1991:4 0.074698 0.034535 2.163 0.032670 *
## OBS1992:1 0.049927 0.034535 1.446 0.151054
## OBS1992:2 0.024449 0.034535 0.708 0.480440
## OBS1992:3 -0.046942 0.034535 -1.359 0.176790
## OBS1992:4 0.026542 0.034535 0.769 0.443772
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0546 on 112 degrees of freedom
## Multiple R-squared: 0.5256, Adjusted R-squared: 0.4112
## F-statistic: 4.596 on 27 and 112 DF, p-value: 4.937e-09
# Visualisation of contemporaneous correlation
plot(data$OBS, data$res)
By regressing the residuals against the quarter indicators, we can decide whether the contemporaneous correlation is significant.
Visually, we see a pattern: the residuals are moving together through time. From the summary output we obtain mostly significant quarter effects, and the F-test is significant too, which tells us that the contemporaneous correlation is indeed significant. A more direct check follows below.
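As that more direct check (a sketch reusing the res column added above), one can reshape the residuals to wide form and correlate them across divisions within the same quarter:
# 28 quarters x 5 divisions: correlation of residuals across divisions
res_wide = data %>%
select(OBS, Division, res) %>%
pivot_wider(names_from = Division, values_from = res)
cor(select(res_wide, -OBS))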
2. Estimate a panel model with fixed effects and with different parameters for the seasonal dummies. This means that the parameters \(\beta_i\) and \(\gamma_i\) are constant across the divisions, so that the model contains in total twenty-two parameters. Compare the results with 1..
# Redefining the formula
the_formula = LOGS ~
Div2 + Div3 + Div4 + Div5 + # Dummy for each division (i.e. fixed effects)
D2 + D3 + D4 + # Seasonal dummies D2-D4 for division 1
I(Div2*D2) + I(Div2*D3) + I(Div2*D4) + # Seasonal parameters for the other divisions
I(Div3*D2) + I(Div3*D3) + I(Div3*D4) +
I(Div4*D2) + I(Div4*D3) + I(Div4*D4) +
I(Div5*D2) + I(Div5*D3) + I(Div5*D4) +
LOGA + LOGC
mod2 = lm(the_formula, data = data)
list(mod1, mod2) %>% screenreg(include.fstatistic = T)
##
## ======================================
## Model 1 Model 2
## --------------------------------------
## (Intercept) -6.14 * 0.49
## (2.41) (1.38)
## Div2 9.84 ** 0.02
## (3.41) (0.05)
## Div3 -2.21 -0.76 ***
## (3.41) (0.05)
## Div4 10.84 ** -0.36 ***
## (3.41) (0.05)
## Div5 13.91 *** 0.31 ***
## (3.41) (0.05)
## D2 0.19 *** 0.19 ***
## (0.04) (0.05)
## D3 0.31 *** 0.33 ***
## (0.04) (0.05)
## D4 0.62 *** 0.60 ***
## (0.04) (0.05)
## I(Div2 * D2) -0.17 ** -0.15
## (0.06) (0.08)
## I(Div2 * D3) -0.15 * -0.17 *
## (0.06) (0.08)
## I(Div2 * D4) -0.13 * -0.12
## (0.06) (0.08)
## I(Div3 * D2) -0.06 -0.07
## (0.06) (0.08)
## I(Div3 * D3) -0.19 ** -0.19 *
## (0.06) (0.08)
## I(Div3 * D4) -0.07 -0.06
## (0.06) (0.08)
## I(Div4 * D2) 0.05 0.05
## (0.06) (0.08)
## I(Div4 * D3) -0.21 *** -0.23 **
## (0.06) (0.08)
## I(Div4 * D4) -0.21 ** -0.16 *
## (0.06) (0.08)
## I(Div5 * D2) -0.14 * -0.14
## (0.06) (0.08)
## I(Div5 * D3) -0.21 *** -0.24 **
## (0.06) (0.08)
## I(Div5 * D4) -0.06 -0.01
## (0.06) (0.08)
## LOGA 1.49 *** 0.51 **
## (0.33) (0.19)
## LOGC 0.66 ** 0.27 *
## (0.20) (0.12)
## I(Div2 * LOGA) -2.04 ***
## (0.47)
## I(Div2 * LOGC) -0.10
## (0.29)
## I(Div3 * LOGA) 0.47
## (0.47)
## I(Div3 * LOGC) -0.13
## (0.29)
## I(Div4 * LOGA) -1.52 **
## (0.47)
## I(Div4 * LOGC) -0.77 **
## (0.29)
## I(Div5 * LOGA) -1.81 ***
## (0.47)
## I(Div5 * LOGC) -0.97 ***
## (0.29)
## --------------------------------------
## R^2 0.97 0.95
## Adj. R^2 0.97 0.94
## Num. obs. 140 140
## F statistic 135.68 112.93
## RMSE 0.08 0.10
## ======================================
## *** p < 0.001, ** p < 0.01, * p < 0.05
Model 2 is the more parsimonious model, and the R^2 of the two models is approximately equal, suggesting that model 2 is the superior specification.
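To back this up formally, one could F-test the eight restrictions (the division-specific LOGA and LOGC parameters) that reduce model 1 to model 2:
# F-test of the eight restrictions; model 2 is nested in model 1
anova(mod2, mod1)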
3. Estimate the model of 2. with the restriction of equal seasonal effects \(\alpha_{ij}\), so that the model contains in total ten regression parameters. Compare the estimated seasonal effects with the estimates in 2..
# Redefining the formula
the.formula = LOGS ~
Div2 + Div3 + Div4 + Div5 + # Dummy for each division (i.e. FE)
D2 + D3 + D4 + # Seasonal effects
LOGA + LOGC
mod3 = lm(the.formula, data = data)
list(mod1, mod2, mod3) %>% screenreg(include.fstatistic = T)
##
## ==================================================
## Model 1 Model 2 Model 3
## --------------------------------------------------
## (Intercept) -6.14 * 0.49 0.57
## (2.41) (1.38) (1.51)
## Div2 9.84 ** 0.02 -0.09 **
## (3.41) (0.05) (0.03)
## Div3 -2.21 -0.76 *** -0.84 ***
## (3.41) (0.05) (0.03)
## Div4 10.84 ** -0.36 *** -0.44 ***
## (3.41) (0.05) (0.03)
## Div5 13.91 *** 0.31 *** 0.21 ***
## (3.41) (0.05) (0.03)
## D2 0.19 *** 0.19 *** 0.13 ***
## (0.04) (0.05) (0.03)
## D3 0.31 *** 0.33 *** 0.16 ***
## (0.04) (0.05) (0.03)
## D4 0.62 *** 0.60 *** 0.53 ***
## (0.04) (0.05) (0.03)
## I(Div2 * D2) -0.17 ** -0.15
## (0.06) (0.08)
## I(Div2 * D3) -0.15 * -0.17 *
## (0.06) (0.08)
## I(Div2 * D4) -0.13 * -0.12
## (0.06) (0.08)
## I(Div3 * D2) -0.06 -0.07
## (0.06) (0.08)
## I(Div3 * D3) -0.19 ** -0.19 *
## (0.06) (0.08)
## I(Div3 * D4) -0.07 -0.06
## (0.06) (0.08)
## I(Div4 * D2) 0.05 0.05
## (0.06) (0.08)
## I(Div4 * D3) -0.21 *** -0.23 **
## (0.06) (0.08)
## I(Div4 * D4) -0.21 ** -0.16 *
## (0.06) (0.08)
## I(Div5 * D2) -0.14 * -0.14
## (0.06) (0.08)
## I(Div5 * D3) -0.21 *** -0.24 **
## (0.06) (0.08)
## I(Div5 * D4) -0.06 -0.01
## (0.06) (0.08)
## LOGA 1.49 *** 0.51 ** 0.51 *
## (0.33) (0.19) (0.21)
## LOGC 0.66 ** 0.27 * 0.27 *
## (0.20) (0.12) (0.13)
## I(Div2 * LOGA) -2.04 ***
## (0.47)
## I(Div2 * LOGC) -0.10
## (0.29)
## I(Div3 * LOGA) 0.47
## (0.47)
## I(Div3 * LOGC) -0.13
## (0.29)
## I(Div4 * LOGA) -1.52 **
## (0.47)
## I(Div4 * LOGC) -0.77 **
## (0.29)
## I(Div5 * LOGA) -1.81 ***
## (0.47)
## I(Div5 * LOGC) -0.97 ***
## (0.29)
## --------------------------------------------------
## R^2 0.97 0.95 0.94
## Adj. R^2 0.97 0.94 0.93
## Num. obs. 140 140 140
## RMSE 0.08 0.10 0.11
## ==================================================
## *** p < 0.001, ** p < 0.01, * p < 0.05
4. Perform an LR-test of the twelve parameter restrictions that reduce the model in 2. to the model in 3..
library(lmtest)
# Likelihood ratio (LR) test: compares the fit of the restricted and the unrestricted model
lrtest(mod3, mod2)
## Likelihood ratio test
##
## Model 1: LOGS ~ Div2 + Div3 + Div4 + Div5 + D2 + D3 + D4 + LOGA + LOGC
## Model 2: LOGS ~ Div2 + Div3 + Div4 + Div5 + D2 + D3 + D4 + I(Div2 * D2) +
## I(Div2 * D3) + I(Div2 * D4) + I(Div3 * D2) + I(Div3 * D3) +
## I(Div3 * D4) + I(Div4 * D2) + I(Div4 * D3) + I(Div4 * D4) +
## I(Div5 * D2) + I(Div5 * D3) + I(Div5 * D4) + LOGA + LOGC
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 11 112.92
## 2 23 132.95 12 40.047 7.063e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Model 2 is significantly better than model 3; that is, the likelihood of model 2 is much larger than the likelihood of model 3. In conclusion, the more complex model explains the data better in this case.
Rationale: the two log-likelihoods will be approximately equal if the restrictions are reasonable; otherwise the log-likelihood of the restricted model (mod3) is much smaller than that of the unrestricted model (mod2).
As opposed to the F-test, the LR test is asymptotic, i.e. it does not have an exact finite-sample distribution, so it is only justified in large samples. This is the price for loosening the restrictive least-squares assumptions.
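For reference, the LR statistic can also be computed by hand from the two log-likelihoods (a sketch; it reproduces the Chisq value in the output above):
# Manual LR statistic: 2*(logLik unrestricted - logLik restricted) ~ chi^2(12)
LR = as.numeric(2 * (logLik(mod2) - logLik(mod3)))
LR # ~40.05
pchisq(LR, df = 12, lower.tail = FALSE) # ~7.06e-05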
5. Explain why it would make little sense to estimate a random effects panel model for these data.
Our dependent variable varies strongly over the seasons, and all our regressors vary over time, so the main advantage of RE, namely being able to estimate effects of characteristics that do not change over time, is not needed here.
Further, it is not realistic to assume that the individual effects are randomly drawn: the different levels of the five divisions are not random but can be explained by, e.g., the difference between high-priced fashion apparel and low-priced clothes.