# Libraries
library(tidyverse)
## ── Attaching packages ────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.3
## ✓ tibble 3.0.3 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.4.0
## Warning: package 'ggplot2' was built under R version 3.6.2
## Warning: package 'tibble' was built under R version 3.6.2
## Warning: package 'tidyr' was built under R version 3.6.2
## Warning: package 'dplyr' was built under R version 3.6.2
## ── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(texreg)
## Version: 1.36.23
## Date: 2017-03-03
## Author: Philip Leifeld (University of Glasgow)
##
## Please cite the JSS article in your publications -- see citation("texreg").
##
## Attaching package: 'texreg'
## The following object is masked from 'package:tidyr':
##
## extract
library(clubSandwich) # Cluster robust se
## Registered S3 method overwritten by 'clubSandwich':
## method from
## bread.mlm sandwich
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(plm)
##
## Attaching package: 'plm'
## The following objects are masked from 'package:dplyr':
##
## between, lag, lead
library(fixest)
## Warning: package 'fixest' was built under R version 3.6.2
##
## Attaching package: 'fixest'
## The following object is masked from 'package:texreg':
##
## coefplot
(Wooldridge, Chapter 14, C1) Use the data in rental.Rdata for this exercise.
load("rental.RData")
head(data)
## city year pop enroll rent rnthsg tothsg avginc lenroll lpop
## 1 1 80 75211 15303 197 13475 26167 11537 9.635804 11.22805
## 2 1 90 77759 18017 342 15660 29467 19568 9.799071 11.26137
## 3 2 80 106743 22462 323 14580 37277 19841 10.019580 11.57818
## 4 2 90 141865 29769 496 26895 55540 31885 10.301223 11.86263
## 5 3 80 36608 11847 216 7026 13482 11455 9.379830 10.50802
## 6 3 90 42099 10265 351 9557 16894 21202 9.236495 10.64778
## lrent ltothsg lrnthsg lavginc clenroll clpop clrent
## 1 5.283204 10.172255 9.508592 9.353314 NA NA NA
## 2 5.834811 10.291026 9.658865 9.881651 -15293.20 0.03331661 0.5516071
## 3 5.777652 10.526132 9.587406 9.895506 NA NA NA
## 4 6.206576 10.924859 10.199696 10.369891 -22451.70 0.28445148 0.4289236
## 5 5.375278 9.509110 8.857373 9.346182 NA NA NA
## 6 5.860786 9.734714 9.165030 9.961851 -11837.76 0.13975716 0.4855080
## cltothsg clrnthsg clavginc pctstu cpctstu y90
## 1 NA NA NA 20.34676 NA 0
## 2 0.1187716 0.1502733 0.5283365 23.17031 2.82355118 1
## 3 NA NA NA 21.04307 NA 0
## 4 0.3987274 0.6122894 0.4743853 20.98403 -0.05903244 1
## 5 NA NA NA 32.36178 NA 0
## 6 0.2256031 0.3076563 0.6156693 24.38300 -7.97877693 1
The data on rental prices and other variables for college towns are for the years 1980 and 1990. The idea is to see whether a stronger presence of students affects rental rates. The unobserved effects model is:
\[ \begin{eqnarray*} \log(rent_{it}) &=& \beta_0 + \delta_{0}\,\textit{y90}_t + \beta_{1}\log(pop_{it}) + \beta_{2}\log(avginc_{it}) \\ &+& \beta_{3}\,\textit{pctstu}_{it} + u_{it} \end{eqnarray*} \]
where pop is city population, avginc is average income, and pctstu is student population as a percentage of city population (during the school year).
1. Estimate the equation by pooled OLS and report the results in standard form. What do you make of the estimate on the 1990 dummy variable? What do you get for \(\widehat{\beta}_{pctstu}\)?
# defining the equation
the.formula = lrent ~ y90 + lpop + lavginc + pctstu
mod.pols = lm(the.formula, data = data)
mod.ols = plm(formula = the.formula, data = data, model = "pooling")
list(mod.pols, mod.ols) %>% screenreg(digits = 5)
##
## =========================================
## Model 1 Model 2
## -----------------------------------------
## (Intercept) -0.56881 -0.56881
## (0.53488) (0.53488)
## y90 0.26223 *** 0.26223 ***
## (0.03476) (0.03476)
## lpop 0.04069 0.04069
## (0.02252) (0.02252)
## lavginc 0.57145 *** 0.57145 ***
## (0.05310) (0.05310)
## pctstu 0.00504 *** 0.00504 ***
## (0.00102) (0.00102)
## -----------------------------------------
## R^2 0.86128 0.86128
## Adj. R^2 0.85677 0.85677
## Num. obs. 128 128
## RMSE 0.12592
## =========================================
## *** p < 0.001, ** p < 0.01, * p < 0.05
# Pooled OLS is just a regular OLS on panel data
The coefficient on the 1990 dummy says that, ceteris paribus, log rent is about 0.26 higher in 1990 than in 1980, i.e. rents are roughly 26\(\%\) higher on average. The exact figure is:
exp(0.26223)-1
## [1] 0.2998255
i.e. an exact increase of about 30\(\%\).
\(\hat{\beta}_{pctstu}\) indicates that a one percentage point increase in the student population as a share of city population increases rent by approximately 0.5\(\%\), ceteris paribus.
Note that such a pooled estimate is vulnerable to Simpson's paradox: pooling across cities can mask or even reverse the within-city relationship.
2. Are the standard errors you report in part 1. valid? Explain.
No. The usual OLS standard errors assume i.i.d. errors, but there is serial correlation within clusters (in this case cities); panel data are by construction serially correlated. The cluster correlation comes from the non-zero variance of the city effect, \(var(\alpha_i)\).
E.g. it is fair to expect a correlation between the rent in city 1 in 1980 and in 1990. The standard OLS errors do not take this into account and are therefore biased downwards due to the cluster correlation. That is, we will tend to conclude that something is significant when in fact it is not.
For this reason we need to account for correlation within clusters, which is why we should instead cluster the standard errors at the city level.
# Cluster-robust se
cr.vcov.pols = vcovCR(mod.pols, cluster = data$city, type = "CR1") # Estimating cluster-robust vcov
test.pols = coeftest(mod.pols, cr.vcov.pols) # Testing significance with cluster-robust standard errors (cr.vcov)
list(mod.pols, test.pols) %>% screenreg(digits = 6) # Showing results
##
## ==========================================
## Model 1 Model 2
## ------------------------------------------
## (Intercept) -0.568807 -0.568807
## (0.534881) (1.023349)
## y90 0.262227 *** 0.262227 ***
## (0.034763) (0.063228)
## lpop 0.040686 0.040686
## (0.022515) (0.028365)
## lavginc 0.571446 *** 0.571446 ***
## (0.053098) (0.120953)
## pctstu 0.005044 *** 0.005044 **
## (0.001019) (0.001503)
## ------------------------------------------
## R^2 0.861281
## Adj. R^2 0.856770
## Num. obs. 128
## RMSE 0.125915
## ==========================================
## *** p < 0.001, ** p < 0.01, * p < 0.05
Note that \(\hat{\beta}_{pctstu}\) is less significant once the larger cluster-robust standard error is used.
3. Now, difference the equation and estimate by OLS. Compare your estimate of \(\beta_{pctstu}\) with the estimate that you obtained in part 1.. Does the relative size of the student population appear to affect rental prices?
Differencing the equation means subtracting each city's observation in period t (1980) from the corresponding observation in period t+1 (1990). Estimating the differenced equation by OLS gives the first-difference (FD) estimator.
# FD: the differenced dependent variable, computed by hand
# (assumes the rows are sorted by city within each year, as head(data) suggests)
data$lrent[which(data$year==90)]-data$lrent[which(data$year==80)]
## [1] 0.5516071 0.4289236 0.4855080 0.7894783 0.6664791 0.8253188 0.5453234
## [8] 0.4959292 0.8087320 0.4590969 0.6664791 0.6182160 0.5519438 0.6380877
## [15] 0.6518292 0.5108256 0.4759617 0.5705447 0.5148497 0.5697684 0.5372767
## [22] 0.3928337 0.7387824 0.5980387 0.5521994 0.6087804 0.5166907 0.4338241
## [29] 0.4815888 0.4951186 0.5319953 0.7624297 0.7659063 0.6666079 0.6471848
## [36] 0.6706743 0.6304369 0.4818382 0.4427724 0.5300832 0.4443808 0.4805326
## [43] 0.3944759 0.4802251 0.4519849 0.5346365 0.6289048 0.6109095 0.6580558
## [50] 0.4919286 0.5404701 0.4172158 0.4415412 0.5785809 0.5337491 0.5047464
## [57] 0.5913644 0.7074475 0.5596156 0.4216728 0.4784904 0.4475307 0.4470139
## [64] 0.6639175
mod.dpols = plm(
the.formula,
data = data,
index = c("city", "year"), # Defining the panel structure: 1. unit index, 2. time index
model = "fd" # Defining the model type
) # remember: pooled OLS does not account for serial correlation or heterogeneity bias (alpha_i != alpha)
list("Pooled OLS" = mod.pols, "First Difference" = mod.dpols) %>% screenreg(digits = 6)
##
## =============================================
## Pooled OLS First Difference
## ---------------------------------------------
## (Intercept) -0.568807 0.385521 ***
## (0.534881) (0.036824)
## y90 0.262227 ***
## (0.034763)
## lpop 0.040686 0.072246
## (0.022515) (0.088343)
## lavginc 0.571446 *** 0.309961 ***
## (0.053098) (0.066477)
## pctstu 0.005044 *** 0.011203 **
## (0.001019) (0.004132)
## ---------------------------------------------
## R^2 0.861281 0.322262
## Adj. R^2 0.856770 0.288375
## Num. obs. 128 64
## RMSE 0.125915
## =============================================
## *** p < 0.001, ** p < 0.01, * p < 0.05
# Note that y90 is absent in FD: its first difference equals 1 for every observation,
# so it is perfectly collinear with the intercept (the FD intercept is the estimated y90 effect).
# If we wanted a separate y90 estimate, another model would be required, e.g., RE.
Comparing the estimates, the impact of pctstu more than doubles under the FD-corrected estimates (0.0112 vs. 0.0050).
Hence we can conclude that the relative size of the student population does appear to affect rental prices, and that the pooled OLS estimate was biased downwards, since it ignored the time-constant city effect.
FD is the better model here because differencing within each city removes the unobserved city effect \(\alpha_i\).
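As a cross-check, here is a minimal sketch (assuming the rows are sorted by city within each year, as the head(data) output suggests) showing that plain OLS on manually differenced data should reproduce the plm FD estimates:
# Manual first differences; relies on identical city ordering within each year
d90 = data[data$year == 90, ]
d80 = data[data$year == 80, ]
fd.data = data.frame(
  dlrent   = d90$lrent   - d80$lrent,
  dlpop    = d90$lpop    - d80$lpop,
  dlavginc = d90$lavginc - d80$lavginc,
  dpctstu  = d90$pctstu  - d80$pctstu
)
# Slopes should match mod.dpols; the intercept absorbs the y90 effect
lm(dlrent ~ dlpop + dlavginc + dpctstu, data = fd.data)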
4. Estimate the model by fixed effects to verify that you get identical estimates and standard errors to those in part 3..
mod.fe = plm(
the.formula,
data = data,
index = c("city", "year"), # Defining the panel structure
model = "within" # Defining the model type
)
list("Pooled OLS" = mod.pols, "First Difference" = mod.dpols, "Fixed effects" = mod.fe) %>% screenreg(digits = 5)
##
## ===========================================================
## Pooled OLS First Difference Fixed effects
## -----------------------------------------------------------
## (Intercept) -0.56881 0.38552 ***
## (0.53488) (0.03682)
## y90 0.26223 *** 0.38552 ***
## (0.03476) (0.03682)
## lpop 0.04069 0.07225 0.07225
## (0.02252) (0.08834) (0.08834)
## lavginc 0.57145 *** 0.30996 *** 0.30996 ***
## (0.05310) (0.06648) (0.06648)
## pctstu 0.00504 *** 0.01120 ** 0.01120 **
## (0.00102) (0.00413) (0.00413)
## -----------------------------------------------------------
## R^2 0.86128 0.32226 0.97653
## Adj. R^2 0.85677 0.28837 0.95032
## Num. obs. 128 64 128
## RMSE 0.12592
## ===========================================================
## *** p < 0.001, ** p < 0.01, * p < 0.05
# FE subtracts each unit's time mean from every observation, removing the level effect.
# That is why it does not reproduce the pooled OLS estimates.
For T=2, i.e. panels of two periods, first differencing (FD) is mathematically identical to fixed effects (FE): the slope estimates and standard errors coincide, and the FD intercept equals the FE coefficient on y90.
We trust FE because it gives a consistent estimate of the slope, and we avoid the risk of interpreting the relationship wrongly, as we do with the pooled OLS model.
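A quick numerical check of this equivalence (a sketch; both calls should print the same three slopes):
# FD and FE slopes coincide when T = 2
coef(mod.dpols)[c("lpop", "lavginc", "pctstu")]
coef(mod.fe)[c("lpop", "lavginc", "pctstu")]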
load("airfare.RData")
(Wooldridge, Chapter 14, C10) Use the data in airfare.Rdata for this exercise.
We are interested in estimating the model:
\[ \begin{eqnarray*} \log(fare_{it}) &=& \eta_t + \beta_{1}\textit{concen}_{it} + \beta_{2}\log(dist_{i}) + \beta_{3}[\log(dist_{i})]^2 \\ &+& a_i + u_{it}, \quad t = 1, \ldots, 4 \end{eqnarray*} \]
where \(\eta_t\) means that we allow for different year intercepts.
1. Estimate the above equation by pooled OLS, being sure to include year dummies. If \(\Delta\textit{concen} = 0.10\), what is the estimated percentage increase in fare?
# Defining the equation. eta is the intercept of each year
the.formula = lfare ~ concen + ldist + ldistsq + y98 + y99 + y00 # we could also have used factor(year) instead
mod.pols = lm(the.formula, data = data)
cr.vcov.pols = vcovCR(mod.pols, cluster = data$id, type = "CR1") # cluster-robust se, which account for serial correlation within routes
test.pols = coeftest(mod.pols, cr.vcov.pols)
list(POLS = test.pols) %>% screenreg(digits = 6)
##
## ==========================
## POLS
## --------------------------
## (Intercept) 6.209258 ***
## (0.911160)
## concen 0.360120 ***
## (0.058518)
## ldist -0.901600 ***
## (0.271769)
## ldistsq 0.103020 ***
## (0.020147)
## y98 0.021124 ***
## (0.004145)
## y99 0.037850 ***
## (0.005176)
## y00 0.099870 ***
## (0.005643)
## ==========================
## *** p < 0.001, ** p < 0.01, * p < 0.05
# The percentage increase in fare
delta = .10
exp(mod.pols$coefficients["concen"]*delta)
## concen
## 1.036668
In conclusion, when concen increases by 0.10, fare increases by a factor of about 1.037, i.e. roughly 3.7\(\%\). Recall that \(fare\) is the average price of a one-way fare.
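To make the distinction between log points and the exact percentage change explicit (a short sketch reusing the objects above):
# Log-point change vs exact percentage change for delta = 0.10
b1 = mod.pols$coefficients["concen"]
b1 * delta # approximation: ~0.036 log points, i.e. ~3.6%
(exp(b1 * delta) - 1) * 100 # exact: ~3.67%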
2. What is the usual OLS 95% confidence interval for \(\beta_1\)? Why is it probably not reliable? Compute fully robust standard errors and find the fully robust 95% CI for \(\beta_1\). Compare it to the usual CI and comment.
# CI for concen
OLS.ci = confint(mod.pols, "concen")
CR.ci = conf_int(mod.pols, cr.vcov.pols, coefs = "concen")
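Printing the two objects shows the intervals side by side:
OLS.ci # usual (i.i.d.) OLS interval
CR.ci # cluster-robust interval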
We observe that the usual OLS 95% CI is narrower than the CI based on the cluster-robust (CR) standard errors.
The usual confidence interval is computed with OLS standard errors, which assume i.i.d. errors. Because of the serial correlation within clusters these are biased downwards, making the interval over-optimistically narrow and hence invalid.
We should therefore use cluster-robust standard errors to compute the CI. This, however, does not make up for the fact that the POLS estimate itself tends to be biased by the unobserved effects if these are correlated with the other variables (which they tend to be).
3. Describe what is happening with the quadratic in \(\log(dist)\). In particular for what value of dist does the relationship between \(\log(fare)\) and dist become positive? (Hint: First figure out the turning point value for \(\log(dist)\), and then exponentiate.) Is the turning point outside the range of the data?
We can find the turning point of the quadratic by differentiating it with respect to ldist and setting the derivative equal to zero: \[ \beta_2 + 2\beta_3 \, ldist_{min} = 0 \\ \Leftrightarrow ldist_{min} = - \frac{\beta_2}{2\beta_3} \\ \Rightarrow ldist_{min} \approx 4.376 \]
-mod.pols$coefficients["ldist"]/(2*mod.pols$coefficients["ldistsq"])
## ldist
## 4.375868
# Is the turning point outside the range of the data?
data$ldist %>% summary() # The range of data
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.554 6.225 6.758 6.696 7.173 7.910
The turning point lies below the range of the data, so the relationship between lfare and ldist is increasing throughout the sample, since ldist in the data starts at 4.554.
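Per the hint, exponentiating translates the turning point into miles: roughly exp(4.376) ≈ 79, below the shortest route in the sample (exp(4.554) ≈ 95).
# Turning point and sample minimum in levels of dist (miles)
exp(-mod.pols$coefficients["ldist"]/(2*mod.pols$coefficients["ldistsq"]))
exp(min(data$ldist))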
# Graphical presentation
fake.data = data.frame(ldist = 1:100/10) # non-real distance
fake.data$lfare = fake.data$ldist*mod.pols$coefficients["ldist"] + fake.data$ldist^2*mod.pols$coefficients["ldistsq"]
plot(fake.data)
4. Now estimate the equation using random effects. How does the estimate of \(\beta_1\) change?
# RE exploits the serial correlation induced by a_i via GLS, giving more efficient estimates
# than pooled OLS (but still biased if a_i is correlated with the regressors); FE, on the contrary, is unbiased in that case, but less efficient.
mod.re = plm(the.formula,
data = data,
index = c("id", "year"),
model = "random"
)
cr.vcov.re = vcovCR(mod.re, cluster = data$id, type = "CR1")
test.re = coeftest(mod.re, cr.vcov.re)
list(POLS = test.pols, RE = test.re) %>% screenreg(digits = 6)
##
## =========================================
## POLS RE
## -----------------------------------------
## (Intercept) 6.209258 *** 6.222005 ***
## (0.911160) (0.913810)
## concen 0.360120 *** 0.208993 ***
## (0.058518) (0.042218)
## ldist -0.901600 *** -0.852092 **
## (0.271769) (0.271912)
## ldistsq 0.103020 *** 0.097460 ***
## (0.020147) (0.020129)
## y98 0.021124 *** 0.022474 ***
## (0.004145) (0.004143)
## y99 0.037850 *** 0.036690 ***
## (0.005176) (0.005128)
## y00 0.099870 *** 0.098212 ***
## (0.005643) (0.005520)
## =========================================
## *** p < 0.001, ** p < 0.01, * p < 0.05
The estimate on \(concen\) drops substantially, from 0.360 to 0.209.
5. Now estimate the equation using fixed effects. What is the FE estimate of \(\beta_1\)? Why is it fairly similar to the RE estimate? (Hint: What is \(\widehat{\theta}\) for RE estimation?) Name two characteristics of a route (other than distance between stops) that are captured by \(a_i\). Might these be correlated with \(\textit{concen}_{it}\)? Are you convinced that higher concentration on a route increases airfares? What is your best estimate?
the.formula = lfare ~ concen # ldist and the year dummies are absorbed by the two-way fixed effects
mod.fe = plm(the.formula,
data = data,
index = c("id", "year"),
effect = "twoways",
model = "within"
)
mod.fe %>% screenreg(digits = 6)
##
## ==========================
## Model 1
## --------------------------
## concen 0.168859 ***
## (0.029410)
## --------------------------
## R^2 0.009484
## Adj. R^2 -0.321935
## Num. obs. 4596
## ==========================
## *** p < 0.001, ** p < 0.01, * p < 0.05
cr.vcov.fe = vcovCR(mod.fe,
cluster = c(data$id),
type = "CR1")
test.fe = coeftest(mod.fe, cr.vcov.fe)
list(RE = test.re, FE = test.fe) %>% screenreg(digits = 6)
##
## =========================================
## RE FE
## -----------------------------------------
## (Intercept) 6.222005 ***
## (0.913810)
## concen 0.208993 *** 0.168859 ***
## (0.042218) (0.049437)
## ldist -0.852092 **
## (0.271912)
## ldistsq 0.097460 ***
## (0.020129)
## y98 0.022474 ***
## (0.004143)
## y99 0.036690 ***
## (0.005128)
## y00 0.098212 ***
## (0.005520)
## =========================================
## *** p < 0.001, ** p < 0.01, * p < 0.05
# fixest
mod.fixest = feols(fml = lfare ~ concen | id + year, data = data)
summary(mod.fixest, se = "cluster")
## OLS estimation, Dep. Var.: lfare
## Observations: 4,596
## Fixed-effects: id: 1,149, year: 4
## Standard-errors: Clustered (id)
## Estimate Std. Error t value Pr(>|t|)
## concen 0.168859 0.049459 3.4141 0.000662 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 0.092188 Adj. R2: 0.94043
## Within R2: 0.009484
# 0.049459: clustered (by id) standard errors are the feols default
# Matching the small-sample corrections between feols and plm:
se_feols_id_plm = se(mod.fixest, dof = dof(fixef.K = "none", cluster.adj = F))
#.04941565
se_plm_id = sqrt(vcovHC(mod.fe, cluster = "group")["concen", "concen"])
#.04941565
The difference in the \(concen\) estimate between RE and FE is not as large as the difference between POLS and RE. This is because the RE quasi-demeaning parameter \(\widehat{\theta}\) is close to one here, so RE almost fully demeans the data and therefore behaves much like FE.
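One can inspect \(\widehat{\theta}\) directly (a sketch; plm stores the variance decomposition of random-effects models in the ercomp component):
# RE quasi-demeaning parameter theta; values near 1 make RE behave like FE
mod.re$ercomp$theta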
Note that ldist is constant throughout the period: the distance between the two destinations of a route does not change. Thus we cannot say anything about distance with FE, since time-invariant regressors are swept out by the within transformation.
It is an important assumption of RE that the explanatory variables are uncorrelated with the unit-wise heterogeneity, \(\alpha_i\); only if \(\alpha_i\) is effectively random can we use RE. If this is not the case, RE should not be used and FE is strongly preferred, provided it is feasible, which it is only for characteristics that change over time.
Two route characteristics captured by \(a_i\) could be, for example, the overall demand or popularity of a route and whether it connects hub airports; both are plausibly correlated with \(concen_{it}\). Hence, despite the similar point estimates, we should not trust the RE model and should rely on the FE estimate instead: our best estimate of \(\beta_1\) is the FE value of about 0.17.
THINK: more demand for a route → higher concen; higher concen → more suppliers enter → more competition.
(Heij et al. E 7.25 a,c-f) The data file xr725fas.csv contains quarterly fashion sales data of a US retailer with multiple specialty divisions over the period 1986.1–1992.4.
require(foreign) # (not actually needed here: read.csv is base R)
## Loading required package: foreign
data = read.csv("xr725fas.csv", header = T, sep = ",", dec = ".")
This is a panel data set with \(m = 5\) units and \(n = 28\) observations. The first two divisions specialize in high-priced fashion apparel, division 3 in low-priced clothes, and divisions 4 and 5 in specialities like large sizes, undergarments, and so on. The data consist of the quarterly sales \(S_{it}\) of the divisions, the purchasing ability \(A_{t}\), and the consumer confidence \(C_t\). We formulate the model:
\[ \begin{eqnarray*} \log(S_{it}) &=& \alpha_i + \alpha_{i2}D_{2t} + \alpha_{i3}D_{3t} + \alpha_{i4}D_{4t} \\ &+& \beta_i\log(A_t) + \gamma_i\log(C_t) + \epsilon_{it} \end{eqnarray*} \]
# Note that the panel data is in wide form.
# Removing SALES from df
data = data %>% select(-(SALES_1:SALES_5))
# Reshaping
data = data %>% pivot_longer(
LOGSALES_1:LOGSALES_5,
names_to = "Division",
values_to = "LOGS"
)
# Rename: keep only the division number from the column name
data = data %>%
mutate(Division = substr(Division, 10, 10)) # "LOGSALES_1" -> "1", etc.
# Defining dummy for each Division
data = data %>% mutate(
Div1 = ifelse(Division == 1, 1, 0),
Div2 = ifelse(Division == 2, 1, 0),
Div3 = ifelse(Division == 3, 1, 0),
Div4 = ifelse(Division == 4, 1, 0),
Div5 = ifelse(Division == 5, 1, 0)
)
1. Estimate the above model (with thirty regression parameters) by OLS. Check that there exists significant contemporaneous correlation between the residual terms for the five divisions in the same quarter.
# Defining a formula with interaction terms so that every division gets its own parameters
the.formula = LOGS ~
Div2 + Div3 + Div4 + Div5 + # Dummy for each division
D2 + D3 + D4 + # Seasonal dummies D2-D4 for division 1 (D1 is the base quarter)
I(Div2*D2) + I(Div2*D3) + I(Div2*D4) + # Seasonal parameters for the other divisions
I(Div3*D2) + I(Div3*D3) + I(Div3*D4) +
I(Div4*D2) + I(Div4*D3) + I(Div4*D4) +
I(Div5*D2) + I(Div5*D3) + I(Div5*D4) +
LOGA + LOGC + # LOGA and LOGC for division 1
I(Div2*LOGA) + I(Div2*LOGC) + # LOGA and LOGC parameters for the other divisions
I(Div3*LOGA) + I(Div3*LOGC) +
I(Div4*LOGA) + I(Div4*LOGC) +
I(Div5*LOGA) + I(Div5*LOGC)
mod1 = lm(the.formula, data = data)
mod1 %>% screenreg() # Looking at model
##
## ==========================
## Model 1
## --------------------------
## (Intercept) -6.14 *
## (2.41)
## Div2 9.84 **
## (3.41)
## Div3 -2.21
## (3.41)
## Div4 10.84 **
## (3.41)
## Div5 13.91 ***
## (3.41)
## D2 0.19 ***
## (0.04)
## D3 0.31 ***
## (0.04)
## D4 0.62 ***
## (0.04)
## I(Div2 * D2) -0.17 **
## (0.06)
## I(Div2 * D3) -0.15 *
## (0.06)
## I(Div2 * D4) -0.13 *
## (0.06)
## I(Div3 * D2) -0.06
## (0.06)
## I(Div3 * D3) -0.19 **
## (0.06)
## I(Div3 * D4) -0.07
## (0.06)
## I(Div4 * D2) 0.05
## (0.06)
## I(Div4 * D3) -0.21 ***
## (0.06)
## I(Div4 * D4) -0.21 **
## (0.06)
## I(Div5 * D2) -0.14 *
## (0.06)
## I(Div5 * D3) -0.21 ***
## (0.06)
## I(Div5 * D4) -0.06
## (0.06)
## LOGA 1.49 ***
## (0.33)
## LOGC 0.66 **
## (0.20)
## I(Div2 * LOGA) -2.04 ***
## (0.47)
## I(Div2 * LOGC) -0.10
## (0.29)
## I(Div3 * LOGA) 0.47
## (0.47)
## I(Div3 * LOGC) -0.13
## (0.29)
## I(Div4 * LOGA) -1.52 **
## (0.47)
## I(Div4 * LOGC) -0.77 **
## (0.29)
## I(Div5 * LOGA) -1.81 ***
## (0.47)
## I(Div5 * LOGC) -0.97 ***
## (0.29)
## --------------------------
## R^2 0.97
## Adj. R^2 0.97
## Num. obs. 140
## RMSE 0.08
## ==========================
## *** p < 0.001, ** p < 0.01, * p < 0.05
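As an aside, the same thirty-parameter model can be written more compactly with R's factor interactions (a sketch; mod1.alt is a hypothetical name, and * expands to main effects plus all interactions, reproducing the hand-coded dummies above):
# Equivalent thirty-parameter specification via factor interactions
mod1.alt = lm(LOGS ~ factor(Division) * (D2 + D3 + D4 + LOGA + LOGC), data = data)
length(coef(mod1.alt)) # 30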
Contemporaneous correlation means that the error terms of different units (here, divisions) are correlated within the same period.
One way to test for this is to regress the residuals on the quarter indicators: if the quarter effects are jointly significant, the residuals of the five divisions share a common time component, i.e. the units' errors are correlated at the same point in time.
# Extracting the residuals
res = mod1$residuals
data = data %>% mutate(res = res) # Adding residuals to data frame
# Testing for contemporaneous correlation
mod.res = lm(res ~ OBS, data = data)
mod.res %>% summary()
##
## Call:
## lm(formula = res ~ OBS, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.144014 -0.031582 0.003625 0.029482 0.138292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.067234 0.024420 -2.753 0.006886 **
## OBS1986:2 0.021774 0.034535 0.631 0.529648
## OBS1986:3 0.108516 0.034535 3.142 0.002146 **
## OBS1986:4 0.059469 0.034535 1.722 0.087831 .
## OBS1987:1 0.095243 0.034535 2.758 0.006796 **
## OBS1987:2 0.122404 0.034535 3.544 0.000575 ***
## OBS1987:3 0.123999 0.034535 3.591 0.000491 ***
## OBS1987:4 -0.005514 0.034535 -0.160 0.873426
## OBS1988:1 -0.013547 0.034535 -0.392 0.695597
## OBS1988:2 0.019212 0.034535 0.556 0.579114
## OBS1988:3 0.083096 0.034535 2.406 0.017758 *
## OBS1988:4 0.092585 0.034535 2.681 0.008451 **
## OBS1989:1 0.055999 0.034535 1.622 0.107717
## OBS1989:2 0.050434 0.034535 1.460 0.146984
## OBS1989:3 0.073063 0.034535 2.116 0.036591 *
## OBS1989:4 0.095224 0.034535 2.757 0.006806 **
## OBS1990:1 0.182299 0.034535 5.279 6.44e-07 ***
## OBS1990:2 0.119112 0.034535 3.449 0.000794 ***
## OBS1990:3 0.100793 0.034535 2.919 0.004251 **
## OBS1990:4 0.127635 0.034535 3.696 0.000341 ***
## OBS1991:1 0.100717 0.034535 2.916 0.004279 **
## OBS1991:2 0.113252 0.034535 3.279 0.001387 **
## OBS1991:3 0.028113 0.034535 0.814 0.417352
## OBS1991:4 0.074698 0.034535 2.163 0.032670 *
## OBS1992:1 0.049927 0.034535 1.446 0.151054
## OBS1992:2 0.024449 0.034535 0.708 0.480440
## OBS1992:3 -0.046942 0.034535 -1.359 0.176790
## OBS1992:4 0.026542 0.034535 0.769 0.443772
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0546 on 112 degrees of freedom
## Multiple R-squared: 0.5256, Adjusted R-squared: 0.4112
## F-statistic: 4.596 on 27 and 112 DF, p-value: 4.937e-09
# Visualisation of contemporaneous correlation
plot(data$OBS, data$res)
By regressing the residuals against the quarter indicators, we can decide whether the contemporaneous correlation is significant.
Visually, we see a pattern: the residuals are moving together through time. From the summary output we obtain mostly significant quarter effects, and the F-test is significant too, which tells us that the contemporaneous correlation is indeed significant. A more direct check follows below.
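As that more direct check (a sketch reusing the res column added above), one can reshape the residuals to wide form and correlate them across divisions within the same quarter:
# 28 quarters x 5 divisions: correlation of residuals across divisions
res_wide = data %>%
select(OBS, Division, res) %>%
pivot_wider(names_from = Division, values_from = res)
cor(select(res_wide, -OBS))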
2. Estimate a panel model with fixed effects and with different parameters for the seasonal dummies. This means that the parameters \(\beta_i\) and \(\gamma_i\) are constant across the divisions, so that the model contains in total twenty-two parameters. Compare the results with 1..
# Redefining the formula
the_formula = LOGS ~
Div2 + Div3 + Div4 + Div5 + # Dummy for each division (i.e. fixed effects)
D2 + D3 + D4 + # Seasonal dummies D2-D4 for division 1
I(Div2*D2) + I(Div2*D3) + I(Div2*D4) + # Seasonal parameters for the other divisions
I(Div3*D2) + I(Div3*D3) + I(Div3*D4) +
I(Div4*D2) + I(Div4*D3) + I(Div4*D4) +
I(Div5*D2) + I(Div5*D3) + I(Div5*D4) +
LOGA + LOGC
mod2 = lm(the_formula, data = data)
list(mod1, mod2) %>% screenreg(include.fstatistic = T)
##
## ======================================
## Model 1 Model 2
## --------------------------------------
## (Intercept) -6.14 * 0.49
## (2.41) (1.38)
## Div2 9.84 ** 0.02
## (3.41) (0.05)
## Div3 -2.21 -0.76 ***
## (3.41) (0.05)
## Div4 10.84 ** -0.36 ***
## (3.41) (0.05)
## Div5 13.91 *** 0.31 ***
## (3.41) (0.05)
## D2 0.19 *** 0.19 ***
## (0.04) (0.05)
## D3 0.31 *** 0.33 ***
## (0.04) (0.05)
## D4 0.62 *** 0.60 ***
## (0.04) (0.05)
## I(Div2 * D2) -0.17 ** -0.15
## (0.06) (0.08)
## I(Div2 * D3) -0.15 * -0.17 *
## (0.06) (0.08)
## I(Div2 * D4) -0.13 * -0.12
## (0.06) (0.08)
## I(Div3 * D2) -0.06 -0.07
## (0.06) (0.08)
## I(Div3 * D3) -0.19 ** -0.19 *
## (0.06) (0.08)
## I(Div3 * D4) -0.07 -0.06
## (0.06) (0.08)
## I(Div4 * D2) 0.05 0.05
## (0.06) (0.08)
## I(Div4 * D3) -0.21 *** -0.23 **
## (0.06) (0.08)
## I(Div4 * D4) -0.21 ** -0.16 *
## (0.06) (0.08)
## I(Div5 * D2) -0.14 * -0.14
## (0.06) (0.08)
## I(Div5 * D3) -0.21 *** -0.24 **
## (0.06) (0.08)
## I(Div5 * D4) -0.06 -0.01
## (0.06) (0.08)
## LOGA 1.49 *** 0.51 **
## (0.33) (0.19)
## LOGC 0.66 ** 0.27 *
## (0.20) (0.12)
## I(Div2 * LOGA) -2.04 ***
## (0.47)
## I(Div2 * LOGC) -0.10
## (0.29)
## I(Div3 * LOGA) 0.47
## (0.47)
## I(Div3 * LOGC) -0.13
## (0.29)
## I(Div4 * LOGA) -1.52 **
## (0.47)
## I(Div4 * LOGC) -0.77 **
## (0.29)
## I(Div5 * LOGA) -1.81 ***
## (0.47)
## I(Div5 * LOGC) -0.97 ***
## (0.29)
## --------------------------------------
## R^2 0.97 0.95
## Adj. R^2 0.97 0.94
## Num. obs. 140 140
## F statistic 135.68 112.93
## RMSE 0.08 0.10
## ======================================
## *** p < 0.001, ** p < 0.01, * p < 0.05
Model 2 is the more parsimonious model, and the R^2 of the two models is approximately equal, suggesting that model 2 is the superior specification.
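To back this up formally, one could F-test the eight restrictions (the division-specific LOGA and LOGC parameters) that reduce model 1 to model 2:
# F-test of the eight restrictions; model 2 is nested in model 1
anova(mod2, mod1)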
3. Estimate the model of 2. with the restriction of equal seasonal effects \(\alpha_{ij}\), so that the model contains in total ten regression parameters. Compare the estimated seasonal effects with the estimates in 2..
# Redefining the formula
the.formula = LOGS ~
Div2 + Div3 + Div4 + Div5 + # Dummy for each division (i.e. FE)
D2 + D3 + D4 + # Seasonal effects
LOGA + LOGC
mod3 = lm(the.formula, data = data)
list(mod1, mod2, mod3) %>% screenreg(include.fstatistic = T)
##
## ==================================================
## Model 1 Model 2 Model 3
## --------------------------------------------------
## (Intercept) -6.14 * 0.49 0.57
## (2.41) (1.38) (1.51)
## Div2 9.84 ** 0.02 -0.09 **
## (3.41) (0.05) (0.03)
## Div3 -2.21 -0.76 *** -0.84 ***
## (3.41) (0.05) (0.03)
## Div4 10.84 ** -0.36 *** -0.44 ***
## (3.41) (0.05) (0.03)
## Div5 13.91 *** 0.31 *** 0.21 ***
## (3.41) (0.05) (0.03)
## D2 0.19 *** 0.19 *** 0.13 ***
## (0.04) (0.05) (0.03)
## D3 0.31 *** 0.33 *** 0.16 ***
## (0.04) (0.05) (0.03)
## D4 0.62 *** 0.60 *** 0.53 ***
## (0.04) (0.05) (0.03)
## I(Div2 * D2) -0.17 ** -0.15
## (0.06) (0.08)
## I(Div2 * D3) -0.15 * -0.17 *
## (0.06) (0.08)
## I(Div2 * D4) -0.13 * -0.12
## (0.06) (0.08)
## I(Div3 * D2) -0.06 -0.07
## (0.06) (0.08)
## I(Div3 * D3) -0.19 ** -0.19 *
## (0.06) (0.08)
## I(Div3 * D4) -0.07 -0.06
## (0.06) (0.08)
## I(Div4 * D2) 0.05 0.05
## (0.06) (0.08)
## I(Div4 * D3) -0.21 *** -0.23 **
## (0.06) (0.08)
## I(Div4 * D4) -0.21 ** -0.16 *
## (0.06) (0.08)
## I(Div5 * D2) -0.14 * -0.14
## (0.06) (0.08)
## I(Div5 * D3) -0.21 *** -0.24 **
## (0.06) (0.08)
## I(Div5 * D4) -0.06 -0.01
## (0.06) (0.08)
## LOGA 1.49 *** 0.51 ** 0.51 *
## (0.33) (0.19) (0.21)
## LOGC 0.66 ** 0.27 * 0.27 *
## (0.20) (0.12) (0.13)
## I(Div2 * LOGA) -2.04 ***
## (0.47)
## I(Div2 * LOGC) -0.10
## (0.29)
## I(Div3 * LOGA) 0.47
## (0.47)
## I(Div3 * LOGC) -0.13
## (0.29)
## I(Div4 * LOGA) -1.52 **
## (0.47)
## I(Div4 * LOGC) -0.77 **
## (0.29)
## I(Div5 * LOGA) -1.81 ***
## (0.47)
## I(Div5 * LOGC) -0.97 ***
## (0.29)
## --------------------------------------------------
## R^2 0.97 0.95 0.94
## Adj. R^2 0.97 0.94 0.93
## Num. obs. 140 140 140
## RMSE 0.08 0.10 0.11
## ==================================================
## *** p < 0.001, ** p < 0.01, * p < 0.05
4. Perform an LR-test of the twelve parameter restrictions that reduce the model in 2. to the model in 3..
library(lmtest)
# Likelihood ratio (LR) test: compares the fit of the restricted and the unrestricted model
lrtest(mod3, mod2)
## Likelihood ratio test
##
## Model 1: LOGS ~ Div2 + Div3 + Div4 + Div5 + D2 + D3 + D4 + LOGA + LOGC
## Model 2: LOGS ~ Div2 + Div3 + Div4 + Div5 + D2 + D3 + D4 + I(Div2 * D2) +
## I(Div2 * D3) + I(Div2 * D4) + I(Div3 * D2) + I(Div3 * D3) +
## I(Div3 * D4) + I(Div4 * D2) + I(Div4 * D3) + I(Div4 * D4) +
## I(Div5 * D2) + I(Div5 * D3) + I(Div5 * D4) + LOGA + LOGC
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 11 112.92
## 2 23 132.95 12 40.047 7.063e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Model 2 is significantly better than model 3; that is, the likelihood of model 2 is much larger than the likelihood of model 3. In conclusion, the more complex model explains the data better in this case.
Rationale: the two log-likelihoods will be approximately equal if the restrictions are reasonable; otherwise the log-likelihood of the restricted model (mod3) is much smaller than that of the unrestricted model (mod2).
As opposed to the F-test, the LR test is asymptotic, i.e. it does not have an exact finite-sample distribution, so it is only justified in large samples. This is the price for loosening the restrictive least-squares assumptions.
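For reference, the LR statistic can also be computed by hand from the two log-likelihoods (a sketch; it reproduces the Chisq value in the output above):
# Manual LR statistic: 2*(logLik unrestricted - logLik restricted) ~ chi^2(12)
LR = as.numeric(2 * (logLik(mod2) - logLik(mod3)))
LR # ~40.05
pchisq(LR, df = 12, lower.tail = FALSE) # ~7.06e-05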
5. Explain why it would make little sense to estimate a random effects panel model for these data.
Our dependent variable varies strongly over the seasons, and all our regressors vary over time, so the main advantage of RE, namely being able to estimate effects of characteristics that do not change over time, is not needed here.
Further, it is not realistic to assume that the individual effects are randomly drawn: the different levels of the five divisions are not random but can be explained by, e.g., the difference between high-priced fashion apparel and low-priced clothes.