Introduction

Panel data, or longitudinal data, are multidimensional data containing measurements over time: observations of numerous phenomena collected over several time periods for the same group of entities. Time series and cross-sectional data can be seen as special cases of longitudinal data restricted to a single dimension. Two distinct kinds of information can be derived from panel data: the cross-sectional component reflects differences observed between individual subjects, whilst the time-series component captures changes within a single entity over time. Analysis of panel data can be somewhat complex, but its flexibility is an advantage, as it provides a large number of unique data points. Panel data are characterised by greater variability, lower collinearity between variables, more degrees of freedom and greater efficiency. Nonetheless, like any kind of analysis, panel data analysis has limitations, and it is crucial to fit a proper method to the specification of one's data.

Data and purpose of this study

At the beginning of 2019 I collected data for my Bachelor thesis, the purpose of which was to build a panel model of suburbanisation spanning the years 2007-2018. I used the linearised equation of the extended gravity model of migration, derived from the discipline called "social physics". This equation is based on the assumption that human migration flows follow Newton's law of universal gravitation. In the basic version of this model, the dependent variable is the number of people migrating from Warsaw to one of its suburban boroughs, whilst the explanatory variables are the borough's population and its distance from the city centre (we expect the coefficients of these variables to be about 1 and -2 respectively, the values implied by the law of universal gravitation). Furthermore, following the literature, I included some other explanatory variables and estimated the equation using pooled OLS. Frankly speaking, this wasn't the best approach to this data (individual effects were present in my dataset). Secondly, I failed to inspect for differences between counties. A cross-county regression might not necessarily reflect reality: perhaps, for some reason, some counties are in general more attractive to migrants than others. Checking this property is exactly where clustering is applicable.
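For concreteness, the basic form of that equation can be sketched as follows (my notation; since the origin population of Warsaw is common to all flows, it is absorbed into the constant). With \(M_{j}\) the migration flow from Warsaw to borough \(j\), \(P_{j}\) the borough's population and \(d_{j}\) its distance from the city centre:

\(M_{j}=k\dfrac{P_{j}}{d_{j}^{2}}\quad\Rightarrow\quad\ln M_{j}=\beta_{0}+\beta_{1}\ln P_{j}+\beta_{2}\ln d_{j}+\varepsilon_{j}\)

so the strict gravity analogy corresponds to \(\beta_{1}=1\) and \(\beta_{2}=-2\).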

Cluster-robust variance estimators

When building models, we often encounter situations where the working data are not i.i.d., but units are correlated within groups while being independent across groups. Violation of i.i.d. causes problems for quantities that rely on this assumption, such as standard errors and variances. Let's write the standard linear model including within-cluster dependence (Blackwell 2016). Let \(m\) denote the number of clusters or groups, each containing units which may be related. The observations are labelled \(y_{ig}\), where \(i\) and \(g\) index units and clusters respectively, and \(n_g\) is the number of units in cluster \(g\), so that \(n=\sum_{g = 1}^{m} n_g\):
\(y_{ig}=x'_{ig}\beta+v_{ig}=x'_{ig}\beta+a_g+u_{ig}\)
\(a_g\) and \(u_{ig}\) stand for the cluster error component and the unit error component respectively. We assume them to be independent of each other and that the zero conditional mean assumption holds: \(E[v_{ig}|x_{ig}]=0\) (the condition for the OLS estimator to be unbiased and consistent). If \(X_g\) is the \(n_g \times k\) matrix of covariates for cluster \(g\), then the cluster-level linear model is:
\(y_g=X_{g}\beta+v_g=X_{g}\beta+a_g+u_g\)
with the assumption \(E[v_g|X_g]=0\). The variances of the components are:
\(Var[a_g|X_g]=\rho\sigma^2\), \(Var[u_{ig}|X_g]=(1-\rho)\sigma^2\).
Then:
\(Var[v_{ig}|X_g]=Var[a_g+u_{ig}|X_g]=Var[a_g|X_g]+Var[u_{ig}|X_g]=\rho\sigma^2+(1-\rho)\sigma^2=\sigma^2\)
\(\rho\in(0,1)\) is called the within-cluster correlation. The covariance between two units \(i\) and \(s\) from the same cluster is \(\rho\sigma^2\):
\(Cov[v_{ig},v_{sg}|X_g]=Cov[a_g+u_{ig},a_g+u_{sg}|X_g]=Cov[a_g,a_g|X_g]+Cov[a_g,u_{sg}|X_g]+Cov[u_{ig},a_g|X_g]+Cov[u_{ig},u_{sg}|X_g]=Var[a_g|X_g]=\rho\sigma^2\) \(Corr[v_{ig},v_{sg}|X_g]=\dfrac{Cov[v_{ig},v_{sg}|X_g]}{\sqrt{Var[v_{ig}|X_g]Var[v_{sg}|X_g]}}=\dfrac{\rho\sigma^2}{\sqrt{\sigma^2\sigma^2}}=\rho\)
Then, the covariance of two units \(i\) and \(s\) in different clusters \(g\) and \(k\) is zero:
\(Cov[v_{ig},v_{sk}|X]=Cov[a_g+u_{ig},a_k+u_{sk}|X]=Cov[a_g,a_k|X]+Cov[a_g,u_{sk}|X]+Cov[u_{ig},a_k|X]+Cov[u_{ig},u_{sk}|X]=0\)
In the standard linear model, the error covariance matrix is diagonal, with the variances along the diagonal. That's not the case under clustered dependence, where this matrix is block-diagonal. The cluster-robust covariance matrix estimate can be expressed as:
\(\widehat{V}[\hat{\beta}|X]=(X'X)^{-1}\left(\sum_{g = 1}^{m}X'_g\widehat{v}_g\widehat{v}_g'X_g\right)(X'X)^{-1}\)

where \(\widehat{v}_g\) denotes the vector of within-cluster residuals for cluster \(g\).
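To make the formula concrete, here is a minimal R sketch of the CR0 version of this estimator for a fitted lm object, with cluster a factor identifying the groups (illustrative code, not part of the original analysis):

# cluster-robust sandwich: (X'X)^{-1} (sum_g X_g' v_g v_g' X_g) (X'X)^{-1}
cluster_vcov <- function(fit, cluster) {
  X <- model.matrix(fit)        # stacked covariate matrix
  v <- residuals(fit)           # stacked residuals
  bread <- solve(crossprod(X))  # (X'X)^{-1}
  meat <- Reduce(`+`, lapply(split(seq_along(v), cluster), function(g) {
    Xg <- X[g, , drop = FALSE]
    vg <- v[g]
    crossprod(Xg, vg) %*% crossprod(vg, Xg)  # X_g' v_g v_g' X_g
  }))
  bread %*% meat %*% bread
}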

The goal of this study is to compare panel gravity models of migration (in their basic form) with and without clustered standard errors, and to inspect for within-cluster dependence, the clusters being counties.

Empirical verification

Firstly, I’m loading and browsing the dataset.
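All code below assumes the following packages are installed and loaded (inferred from the functions used in this analysis):

library(readxl)        # read_excel
library(plm)           # pdata.frame, plm, panel tests, vcovHC
library(gplots)        # plotmeans
library(stargazer)     # regression tables
library(lmtest)        # bptest, coeftest
library(tseries)       # adf.test
library(clubSandwich)  # vcovCR
library(factoextra)    # eclust, fviz_cluster, fviz_silhouette
library(cluster)       # silhouette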

gravity_ln <- read_excel("gravity_ln.xlsx")
View(gravity_ln)
summary(gravity_ln)
##    borough               year         county          ln_checkin_waw   
##  Length:770         Min.   :2008   Length:770         Min.   :-0.6931  
##  Class :character   1st Qu.:2010   Class :character   1st Qu.: 2.6391  
##  Mode  :character   Median :2013   Mode  :character   Median : 4.3694  
##                     Mean   :2013                      Mean   : 3.8025  
##                     3rd Qu.:2016                      3rd Qu.: 5.1417  
##                     Max.   :2018                      Max.   : 7.0300  
##  ln_population     ln_distance   
##  Min.   : 7.916   Min.   :2.205  
##  1st Qu.: 8.950   1st Qu.:2.961  
##  Median : 9.517   Median :3.377  
##  Mean   : 9.524   Mean   :3.274  
##  3rd Qu.:10.034   3rd Qu.:3.624  
##  Max.   :11.336   Max.   :4.078

The data is arranged in an order suitable for a panel setting. There are 70 boroughs and 11 years (2008-2018), which gives 770 observations. Each borough is assigned to one of 10 counties. We have three initial variables (natural logarithms): the number of people who moved out of Warsaw and checked in to the given borough (ln_checkin_waw), the population of the borough (ln_population) and its distance from the centre of Warsaw (ln_distance). Before taking logarithms of the initial variables, I substituted 0.5 for check-in counts equal to 0 (a trick well described in the gravity model literature, e.g. Poot et al. 2016).
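For reference, that transformation looks roughly like this (a sketch; checkin_waw stands for the hypothetical name of the raw count variable):

# replace zero flows with 0.5 before taking logs (Poot et al. 2016)
ln_checkin_waw <- log(ifelse(checkin_waw == 0, 0.5, checkin_waw))

Let's plot the dependent variable with respect to counties and years.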

coplot(ln_checkin_waw~year|county, type="l", data=gravity_ln)

plotmeans(ln_checkin_waw ~ year, main="Heterogeneity across years", data=gravity_ln)

plotmeans(ln_checkin_waw ~ county, main="Heterogeneity across counties - check-ins", data=gravity_ln)

plotmeans(ln_population ~ county, main="Heterogeneity across counties - population", data=gravity_ln)

plotmeans(ln_distance ~ county, main="Heterogeneity across counties - distance", data=gravity_ln)

We can see that the level of the dependent variable is fairly similar between years: the mean fluctuates between 3.6 and 4 for each year. At the same time, large discrepancies are present county-wise for both the dependent and independent variables. Let's declare the panel structure of the data and run different types of panel regression.

gravity_ln.p <- pdata.frame(gravity_ln, index = c("borough", "year"))
pooled <- plm(ln_checkin_waw~(ln_population+ln_distance),data=gravity_ln.p, model="pooling")
stargazer(pooled,type='text')
## 
## =========================================
##                   Dependent variable:    
##               ---------------------------
##                     ln_checkin_waw       
## -----------------------------------------
## ln_population          1.118***          
##                         (0.058)          
##                                          
## ln_distance            -1.992***         
##                         (0.094)          
##                                          
## Constant                -0.324           
##                         (0.770)          
##                                          
## -----------------------------------------
## Observations              770            
## R2                       0.719           
## Adjusted R2              0.718           
## F Statistic    979.219*** (df = 2; 767)  
## =========================================
## Note:         *p<0.1; **p<0.05; ***p<0.01
fixedeff <- plm(ln_checkin_waw~(ln_population+ln_distance),data=gravity_ln.p,model="within")
stargazer(fixedeff,type='text')
## 
## =========================================
##                   Dependent variable:    
##               ---------------------------
##                     ln_checkin_waw       
## -----------------------------------------
## ln_population            0.288           
##                         (0.395)          
##                                          
## -----------------------------------------
## Observations              770            
## R2                       0.001           
## Adjusted R2             -0.099           
## F Statistic       0.532 (df = 1; 699)    
## =========================================
## Note:         *p<0.1; **p<0.05; ***p<0.01
randomeff <- plm(ln_checkin_waw~(ln_population+ln_distance),data=gravity_ln.p, model="random")
stargazer(randomeff,type='text')
## 
## =========================================
##                   Dependent variable:    
##               ---------------------------
##                     ln_checkin_waw       
## -----------------------------------------
## ln_population          1.015***          
##                         (0.144)          
##                                          
## ln_distance            -2.090***         
##                         (0.246)          
##                                          
## Constant                 0.981           
##                         (1.938)          
##                                          
## -----------------------------------------
## Observations              770            
## R2                       0.261           
## Adjusted R2              0.259           
## F Statistic           271.427***         
## =========================================
## Note:         *p<0.1; **p<0.05; ***p<0.01

Above I estimated the three most popular types of panel model. Pooled ordinary least squares "pools" all observations into one regression: it is essentially just OLS run on a panel dataset. However, its key assumptions are the absence of individual effects and of universal effects across time; it also assumes no correlation between errors corresponding to the same individual. The two other models allow for the presence of individual effects. The fixed effects model captures individual heterogeneity: fixed effects are intercepts included to control for individual-specific, time-invariant characteristics. The random effects model assumes unique, time-constant attributes of individuals that are uncorrelated with the regressors (a correlation which is allowed under fixed effects). This model uses a generalized least squares estimator, which accounts for the error-component structure. Pooled OLS can be used when time-constant attributes are present, but the random effects model is considered more efficient.
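In equations, using generic notation, the three specifications can be sketched as follows:

\(y_{it}=\alpha+x'_{it}\beta+\varepsilon_{it}\) (pooled OLS)
\(y_{it}-\bar{y}_{i}=(x_{it}-\bar{x}_{i})'\beta+(\varepsilon_{it}-\bar{\varepsilon}_{i})\) (fixed effects, within transformation)
\(y_{it}-\theta\bar{y}_{i}=(1-\theta)\alpha+(x_{it}-\theta\bar{x}_{i})'\beta+(v_{it}-\theta\bar{v}_{i})\) (random effects, quasi-demeaning with \(\theta\in[0,1]\) estimated from the variance components)

Note that the within transformation wipes out any regressor that is constant over time for a given unit, which will matter for ln_distance below. To test which approach is suitable for my dataset, I'm conducting the Breusch-Pagan Lagrange multiplier test, which inspects for individual effects, and the Hausman test, which compares the random and fixed effects models.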

plmtest(pooled, type=c("bp"))
## 
##  Lagrange Multiplier Test - (Breusch-Pagan) for balanced panels
## 
## data:  ln_checkin_waw ~ (ln_population + ln_distance)
## chisq = 1307.9, df = 1, p-value < 2.2e-16
## alternative hypothesis: significant effects
phtest(fixedeff,randomeff)
## 
##  Hausman Test
## 
## data:  ln_checkin_waw ~ (ln_population + ln_distance)
## chisq = 3.9014, df = 1, p-value = 0.04825
## alternative hypothesis: one model is inconsistent

The p-value of the Lagrange multiplier test is almost equal to 0; at a significance level of 5% I should reject the null hypothesis (insignificant individual effects), which means that pooled OLS can't be used. The p-value of the Hausman test is 4.825%, according to which I should reject the null hypothesis and choose the fixed effects model. However, as can be seen in the estimation output, this model drops ln_distance: the variable is time-invariant within each borough, so the within transformation makes it collinear with the fixed effects. On the other hand, the random effects estimation gives a nearly textbook output (coefficients close to 1 for population and -2 for distance, the values predicted by the gravity theory of migration). For these two reasons I decided to choose and further consider the random effects model (the Hausman test favours fixed effects, but that does not invalidate demonstrating clustered standard errors on this dataset).
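A quick check confirms why the within estimator drops ln_distance: it takes a single value within every borough (a sketch on the raw data frame):

# TRUE when ln_distance is constant within each borough
all(tapply(gravity_ln$ln_distance, gravity_ln$borough,
           function(z) length(unique(z)) == 1))

Before clustering, I'm running several diagnostic tests.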

# cross-sectional dependence
pcdtest(randomeff, test=c("lm"))
## 
##  Breusch-Pagan LM test for cross-sectional dependence in panels
## 
## data:  ln_checkin_waw ~ (ln_population + ln_distance)
## chisq = 4071.1, df = 2415, p-value < 2.2e-16
## alternative hypothesis: cross-sectional dependence
pcdtest(randomeff, test=c("cd"))
## 
##  Pesaran CD test for cross-sectional dependence in panels
## 
## data:  ln_checkin_waw ~ (ln_population + ln_distance)
## z = 23.287, p-value < 2.2e-16
## alternative hypothesis: cross-sectional dependence
# stationarity
gravity_ln.set <- pdata.frame(gravity_ln, index=c("borough", "year"))
adf.test(gravity_ln.set$ln_checkin_waw, k=2)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  gravity_ln.set$ln_checkin_waw
## Dickey-Fuller = -17.957, Lag order = 2, p-value = 0.01
## alternative hypothesis: stationary
# no unit roots - stationary
# heteroskedasticity present - robust covariance matrix needed
bptest(ln_checkin_waw~(ln_population+ln_distance) + factor(borough), data=gravity_ln.p, studentize=F )
## 
##  Breusch-Pagan test
## 
## data:  ln_checkin_waw ~ (ln_population + ln_distance) + factor(borough)
## BP = 807.21, df = 70, p-value < 2.2e-16
# serial correlation present
pbgtest(randomeff)
## 
##  Breusch-Godfrey/Wooldridge test for serial correlation in panel models
## 
## data:  ln_checkin_waw ~ (ln_population + ln_distance)
## chisq = 35.916, df = 11, p-value = 0.0001748
## alternative hypothesis: serial correlation in idiosyncratic errors

As it happens, the p-values of all the diagnostic tests are close to 0, which means rejecting every null hypothesis in favour of the alternative. According to these tests, cross-sectional dependence, heteroskedasticity and serial correlation are all present, while the series is stationary. All things considered, it is advisable to use robust and cluster-robust variance estimators. Below I present and compare models with different approaches to clustered standard errors.

# Original standard errors and coefficient correlations
SE1   = sqrt(diag(vcov(randomeff)))
SE1
##   (Intercept) ln_population   ln_distance 
##     1.9384223     0.1438605     0.2460764
cov2cor(vcov(randomeff))
##               (Intercept) ln_population ln_distance
## (Intercept)     1.0000000    -0.9369987  -0.8070596
## ln_population  -0.9369987     1.0000000   0.5537843
## ln_distance    -0.8070596     0.5537843   1.0000000

First, let's take a look at the original output. The correlation between the two regression coefficients is about 55%, which means the variables are not orthogonal. This correlation is not very high, but it suggests that standard errors might be lower with a robust covariance estimator. I'm using the vcovHC and vcovCR functions, which return heteroskedasticity-robust and cluster-robust sandwich estimates of the variance-covariance matrix, respectively.

# Robust variance estimator
t(sapply(c("HC0", "HC1", "HC2", "HC3", "HC4"), function(x) sqrt(diag(vcov(randomeff, type = x)))))
##     (Intercept) ln_population ln_distance
## HC0    1.938422     0.1438605   0.2460764
## HC1    1.938422     0.1438605   0.2460764
## HC2    1.938422     0.1438605   0.2460764
## HC3    1.938422     0.1438605   0.2460764
## HC4    1.938422     0.1438605   0.2460764
SE1   = sqrt(diag(vcov(randomeff, type="HC0")))
SE1
##   (Intercept) ln_population   ln_distance 
##     1.9384223     0.1438605     0.2460764
cov2cor(vcov(randomeff, type = "HC0"))
##               (Intercept) ln_population ln_distance
## (Intercept)     1.0000000    -0.9369987  -0.8070596
## ln_population  -0.9369987     1.0000000   0.5537843
## ln_distance    -0.8070596     0.5537843   1.0000000
coeftest(randomeff,vcov=vcov(randomeff, type="HC0"))
## 
## t test of coefficients:
## 
##               Estimate Std. Error t value  Pr(>|t|)    
## (Intercept)    0.98058    1.93842  0.5059    0.6131    
## ln_population  1.01463    0.14386  7.0529 3.917e-12 ***
## ln_distance   -2.08962    0.24608 -8.4917 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

It can be seen that the White-like robust covariance matrix didn't change anything. The likely reason is more mundane than it looks: the vcov method for plm models simply returns the model's conventional covariance matrix and ignores the type argument, so the calls above never produced robust estimates; vcovHC has to be called explicitly, as below (the random effects GLS does already account for the error-component structure, but that is not what these identical rows show). The HC0-HC4 labels denote different types of heteroskedasticity adjustment, HC0 being the panel analogue of the White matrix.
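A quick way to confirm this behaviour (a sketch, to be run alongside the code above):

# vcov for plm objects returns the conventional covariance matrix and
# silently ignores extra arguments such as `type`:
identical(vcov(randomeff), vcov(randomeff, type = "HC3"))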

# Cluster-robust variance estimator with clustered groups
t(sapply(c("HC0", "HC1", "HC2", "HC3", "HC4"), function(x) sqrt(diag(vcovHC(randomeff, type = x, cluster="group")))))
##     (Intercept) ln_population ln_distance
## HC0    1.551397     0.1155214   0.2093679
## HC1    1.554428     0.1157471   0.2097769
## HC2    1.555110     0.1157934   0.2099181
## HC3    1.558829     0.1160660   0.2104697
## HC4    1.556405     0.1158853   0.2101892
SE2   = sqrt(diag(vcovHC(randomeff, type = "HC0", cluster="group")))
SE2
##   (Intercept) ln_population   ln_distance 
##     1.5513970     0.1155214     0.2093679
cov2cor(vcovHC(randomeff, type = "HC0", cluster="group"))
##               (Intercept) ln_population ln_distance
## (Intercept)     1.0000000    -0.9153665  -0.7423854
## ln_population  -0.9153665     1.0000000   0.4142225
## ln_distance    -0.7423854     0.4142225   1.0000000
## attr(,"cluster")
## [1] "group"

Clustering by group targets serial (within-unit) correlation, and it improved the situation significantly: the correlation between the regression coefficients is now 41.4% and the standard errors are clearly lower. They are lowest for the HC0 type of matrix, but the differences between types are relatively small.
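For completeness, the t-tests under group-clustered standard errors can be obtained the same way as before (a sketch mirroring the earlier coeftest call):

coeftest(randomeff, vcov = vcovHC(randomeff, type = "HC0", cluster = "group"))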

# Cluster-robust variance estimator with SEs clustered across time
t(sapply(c("HC0", "HC1", "HC2", "HC3", "HC4"), function(x) sqrt(diag(vcovHC(randomeff, type = x, cluster="time")))))
##     (Intercept) ln_population ln_distance
## HC0    1.264955    0.09607279   0.1686244
## HC1    1.267426    0.09626049   0.1689538
## HC2    1.268189    0.09633255   0.1689973
## HC3    1.271441    0.09659363   0.1693721
## HC4    1.269922    0.09649217   0.1691173
SE3   = sqrt(diag(vcovHC(randomeff, type = "HC0", cluster="time")))
SE3
##   (Intercept) ln_population   ln_distance 
##    1.26495460    0.09607279    0.16862436
cov2cor(vcovHC(randomeff, type = "HC0", cluster="time"))
##               (Intercept) ln_population ln_distance
## (Intercept)     1.0000000    -0.9369745  -0.7190598
## ln_population  -0.9369745     1.0000000   0.4494998
## ln_distance    -0.7190598     0.4494998   1.0000000
## attr(,"cluster")
## [1] "time"

Clustering by time is used to correct for cross-sectional correlation. Here it returns even lower standard errors, but a greater correlation between the coefficients (almost 45%).

# Cluster-robust variance estimator with clustering by county
t(sapply(c("CR0", "CR1", "CR2", "CR3", "CR1S", "CR1p"), function(x) sqrt(diag(vcovCR(randomeff, gravity_ln$county,type = x)))))
##      (Intercept) ln_population ln_distance
## CR0     25.19396      1.909161    3.404849
## CR1     26.55677      2.012432    3.589026
## CR2     26.91694      2.037917    3.621588
## CR3     28.77427      2.176626    3.857481
## CR1S    26.59137      2.015054    3.593703
## CR1p    30.11255      2.281884    4.069573
SE4   = sqrt(diag(vcovCR(randomeff, gravity_ln$county,type = "CR2")))
SE4
##   (Intercept) ln_population   ln_distance 
##     26.916937      2.037917      3.621588
cov2cor(vcovCR(randomeff, gravity_ln$county,type = "CR2"))
##               (Intercept) ln_population ln_distance
## (Intercept)     1.0000000    -0.9193747  -0.7643494
## ln_population  -0.9193747     1.0000000   0.4490700
## ln_distance    -0.7643494     0.4490700   1.0000000

In this case, I estimated the variance matrix with standard errors clustered by county, which gave interesting results. CR0-CR3 (and the CR1 variants) denote different types of small-sample correction; with only 10 clusters such corrections genuinely matter, and type is in any case a required argument of vcovCR. To my knowledge, there is no other function in R that allows clustering standard errors by an arbitrary variable in a panel model, hence the choice. The standard errors are now much higher than under any other option, including the initial output, which points to strong within-county dependence (and reflects how noisy the cluster-robust estimator is with so few clusters). However, the correlation between the coefficients is lower than in the versions without robustness corrections and with clustering by time.
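Given so few clusters, degrees-of-freedom corrections to the test statistics matter as well; clubSandwich provides Satterthwaite-corrected t-tests, roughly as follows (a sketch; the cluster argument is passed through to vcovCR):

coef_test(randomeff, vcov = "CR2", cluster = gravity_ln$county, test = "Satterthwaite")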

Unsupervised Learning approach

Ordinary least squares regression obviously takes the dependent variable into consideration, hence it belongs to the supervised learning family. Nonetheless, some unsupervised learning methods are applicable to checking certain features of econometric models. For example, when running an OLS regression without a robust variance-covariance matrix, we usually have to inspect for homoscedasticity (equal variance of the residuals). Statistical tests, such as the Breusch-Pagan test, can be used for this purpose, but one could also check the equality of variance with clustering: if I run k-means, PAM or another algorithm on the residuals, the points on the visualisation should not be spread out too much under homoscedasticity, while large discrepancies in location mean that the variances differ across residuals. This check is intuitive rather than formal. In this study I clustered standard errors with respect to the random effects model; however, that approach uses GLS, which already adjusts the variance-covariance matrix, so for residual clustering I will use the results of the pooled OLS estimation instead. The previously run BP test rejected the null hypothesis of homoscedastic residuals. I'm extracting the residuals from that model and running the k-means clustering algorithm with visualisations.

res <- residuals(pooled)
# Euclidean
km1<-eclust(res, "kmeans", hc_metric="euclidean",k=2)
fviz_cluster(km1, main="kmeans / Euclidean")

sil1<-silhouette(km1$cluster, dist(res))
fviz_silhouette(sil1)
##   cluster size ave.sil.width
## 1       1   42          0.51
## 2       2   28          0.24

# Manhattan
km2<-eclust(res, "kmeans", hc_metric="manhattan", k=2)
fviz_cluster(km2, main="kmeans / Manhattan")

sil2<-silhouette(km2$cluster, dist(res))
fviz_silhouette(sil2)
##   cluster size ave.sil.width
## 1       1   42          0.51
## 2       2   28          0.24

Above are the visualisations of k-means clustering using Euclidean and Manhattan distances. After trial and error, I chose the number of clusters equal to 2 (the average silhouette metric is the highest). The plots don't differ between the two distances, and this is in fact expected: the hc_metric argument of eclust only affects hierarchical methods, so both calls run the same Euclidean k-means. Although many observations are tightly concentrated in the middle of each cluster, some of them lie far from the centre, especially in cluster 2 (e.g. Prażmów, Cegłów, Mrozy). Generally it can be said that there are large discrepancies between the locations of the observations, which agrees both with intuition and with the BP test's verdict of heteroscedasticity. A few negative values on the silhouette plots indicate that some observations have been assigned to the wrong cluster.
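The "trial and error" choice of k can also be made systematic with the average silhouette criterion (a sketch; factoextra's fviz_nbclust expects a matrix, hence the coercion):

fviz_nbclust(as.matrix(res), kmeans, method = "silhouette", k.max = 8)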

Conclusion

Clustering standard errors is a useful tool which allows us to deal with correlation in a dataset. It can be implemented in both dimensions of panel data, and it improves the model when the entities are correlated cross-sectionally or across time. It is worth mentioning that one fact limits this study: the complexity of my dataset. Here, observations are correlated in both dimensions, with within-county correlation also present; that is why clustering materially changed the initial output in all those cases. Moreover, some traditional unsupervised learning clustering algorithms can be used to intuitively check certain features of an econometric model, such as homoscedasticity of residuals.