Panel data are a type of longitudinal data: observations on the same units collected at different points in time. Panel data models describe behavior both across individuals and over time, so the data and the models have cross-sectional and time-series dimensions. In this section we fit panel models to a stocks dataset, which provides observations across stocks and over time. The general assumption for this dataset is that observations are correlated over time for a given stock, but each stock is independent of the others. The dataset does not contain any time-invariant regressors.

The variables in the data are as follows:

Variable          Description
date              Trading Date
open              Price of the stock at market open
high              Highest price reached in the trade day
low               Lowest price reached in the trade day
close             Price of the stock at market close
volume            Number of shares traded
unadjustedVolume  Volume for stocks, unadjusted by stock splits
change            Change in closing price from prior trade day close
changePercent     Percentage change in closing price from prior trade day close
vwap              Volume weighted average price (VWAP): the ratio of the value traded to total volume traded
changeOverTime    Percent change of each interval relative to first value. Useful for comparing multiple stocks.
ticker            Abbreviation used to uniquely identify publicly traded shares
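
The definitions of change and changePercent can be checked by hand from closing prices. The sketch below uses a hypothetical three-day series for a made-up ticker (`XYZ`); treating changePercent as a value in percent units (multiplied by 100) is an assumption, since the units are not stated in the description.

```r
library(dplyr)

# Hypothetical toy prices for one made-up ticker (not taken from the dataset)
toy <- tibble(
  ticker = "XYZ",
  date   = as.Date(c("2020-01-02", "2020-01-03", "2020-01-06")),
  close  = c(100, 102, 96.9)
)

toy <- toy %>%
  group_by(ticker) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(
    change        = close - lag(close),        # difference from the prior close
    changePercent = 100 * change / lag(close)  # assumed to be in percent units
  ) %>%
  ungroup()

toy
```

The first row of each ticker has no prior close, so both derived columns are NA there.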

Load Dataset

We load the dataset and declare the cross-sectional and time dimensions, ticker and date respectively.

library(plm)
library(readr)
library(dplyr)
library(broom)
dataset <- readr::read_csv('https://raw.githubusercontent.com/salma71/Data_621/master/Project_Proposal/stocks_combined.csv')
dataset$date <- as.Date(dataset$date, format="%m/%d/%Y")
head(dataset)
## # A tibble: 6 x 13
##   date       ticker  open  high   low close volume unadjustedVolume change
##   <date>     <chr>  <dbl> <dbl> <dbl> <dbl>  <dbl>            <dbl>  <dbl>
## 1 2014-02-21 AAPL    70.0  70.2  68.9  69.0 6.98e7          9965321 -0.775
## 2 2014-02-24 AAPL    68.7  69.6  68.6  69.3 7.24e7         10337850  0.302
## 3 2014-02-25 AAPL    69.5  69.5  68.4  68.6 5.82e7          8321050 -0.721
## 4 2014-02-26 AAPL    68.8  68.9  67.7  67.9 6.91e7          9875898 -0.619
## 5 2014-02-27 AAPL    67.9  69.4  67.8  69.3 7.56e7         10793903  1.36 
## 6 2014-02-28 AAPL    69.5  70.0  68.6  69.1 9.31e7         13296379 -0.188
## # ... with 4 more variables: changePercent <dbl>, vwap <dbl>, label <chr>,
## #   changeOverTime <dbl>
panel_stocks <- pdata.frame(dataset, index = c("ticker", "date"))
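
To see what declaring the index does, the following sketch builds a small hypothetical panel (two made-up tickers, three days each) and inspects its dimensions with `pdim()`. The toy data and column names are illustrative assumptions, not part of the stocks dataset.

```r
library(plm)

# Hypothetical toy panel: 2 tickers observed on 3 days each
toy <- data.frame(
  ticker = rep(c("AAA", "BBB"), each = 3),
  day    = rep(1:3, times = 2),
  close  = c(10, 11, 12, 20, 19, 21)
)

# First index entry is the individual dimension, second is the time dimension
toy_panel <- pdata.frame(toy, index = c("ticker", "day"))

# pdim() reports the number of individuals (n), periods (T) and rows (N)
dims <- pdim(toy_panel)
dims
```

Running `pdim(panel_stocks)` in the same way would report the dimensions of the stocks panel.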

Balance dataset

We check whether every stock is observed in every time period. If the panel is unbalanced, we balance it so that all stocks are observed on the same trade dates. This means some observations are dropped, so that every stock has an observation on each remaining trading day.

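# Drop the label column before balancing
dataset2 <- dataset %>% select(-label)
# Name the index explicitly: for a plain data frame, plm otherwise assumes
# the first two columns (here date, ticker) are individual and time
is.pbalanced(dataset2, index = c("ticker", "date"))
## [1] FALSE
# Keep only the trade dates that are shared by all tickers
dataset2 <- make.pbalanced(dataset2, balance.type = "shared.times", index = c("ticker", "date"))
is.pbalanced(dataset2, index = c("ticker", "date"))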
## [1] TRUE
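
The effect of `balance.type = "shared.times"` is easiest to see on a tiny case. The sketch below uses a hypothetical five-row panel in which one made-up ticker is missing the first day; the data are illustrative only.

```r
library(plm)

# Hypothetical unbalanced toy panel: AAA has days 1-3, BBB is missing day 1
toy <- data.frame(
  ticker = c("AAA", "AAA", "AAA", "BBB", "BBB"),
  day    = c(1, 2, 3, 2, 3),
  close  = c(10, 11, 12, 19, 21)
)

before <- is.pbalanced(toy, index = c("ticker", "day"))

# Keep only the periods observed for every ticker (days 2 and 3)
balanced <- make.pbalanced(toy, balance.type = "shared.times",
                           index = c("ticker", "day"))
after <- is.pbalanced(balanced, index = c("ticker", "day"))

balanced
c(before = before, after = after)
```

Day 1 is dropped for AAA because BBB has no observation on that day, which is the same trade-off made on the stocks data.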

Models

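With the data prepared, we fit panel data models using several estimators. Note that the models below are fitted on panel_stocks, which was built from the original dataset rather than from the balanced dataset2; this is why each summary reports an unbalanced panel.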

Pooled OLS

A pooled model applies Ordinary Least Squares to the stacked panel data. It is the most restrictive panel model because it ignores the cross-sectional structure, imposing a single intercept and slope on all stocks: \(y_{it}=\alpha+\beta x_{it}+u_{it}\)
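
As a quick check of what "pooling" means, the sketch below fits a pooled `plm` model and a plain `lm` on the same simulated data and compares the coefficients. The simulated panel (5 individuals, 10 periods, made-up column names `id`, `t`, `x`, `y`) is an assumption for illustration.

```r
library(plm)

set.seed(1)
# Simulated toy panel (hypothetical): 5 individuals, 10 periods each
toy <- data.frame(
  id = rep(1:5, each = 10),
  t  = rep(1:10, times = 5),
  x  = rnorm(50)
)
toy$y <- 1 + 2 * toy$x + rnorm(50)

pooled_fit <- plm(y ~ x, data = toy, index = c("id", "t"), model = "pooling")
ols_fit    <- lm(y ~ x, data = toy)

# The pooled panel fit and ordinary least squares give the same estimates
same_coefs <- all.equal(unname(coef(pooled_fit)), unname(coef(ols_fit)))
same_coefs
```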

stocks_m3_pooled <- plm(close ~ volume + change,  data = panel_stocks, model = "pooling")
summary(stocks_m3_pooled)
## Pooling Model
## 
## Call:
## plm(formula = close ~ volume + change, data = panel_stocks, model = "pooling")
## 
## Unbalanced Panel: n = 30, T = 368-1258, N = 36850
## 
## Residuals:
##     Min.  1st Qu.   Median  3rd Qu.     Max. 
## -105.283  -33.769  -13.010   20.485  828.904 
## 
## Coefficients:
##                Estimate  Std. Error  t-value  Pr(>|t|)    
## (Intercept)  1.1081e+02  3.4624e-01 320.0473 < 2.2e-16 ***
## volume      -1.4591e-06  2.0657e-08 -70.6331 < 2.2e-16 ***
## change       1.0794e+00  1.7113e-01   6.3072 2.873e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    102250000
## Residual Sum of Squares: 89909000
## R-Squared:      0.12067
## Adj. R-Squared: 0.12062
## F-statistic: 2528.15 on 2 and 36847 DF, p-value: < 2.22e-16
Pooled_Model <- glance(stocks_m3_pooled)

Fixed Effect Models

In fixed effects models we assume there is unobserved heterogeneity across stocks, such as each company’s core competency or anything else unique to a company that is factored into its stock price but not observed. This heterogeneity (\(\alpha_i\)) is unknown and may be correlated with the predictor variables; giving each stock its own intercept lets us estimate the impact of the predictors on the closing price net of it: \(y_{it}=\alpha_i+\beta x_{it}+u_{it}\)
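
One way to read the fixed effects model is as OLS with a separate dummy intercept per individual (the least squares dummy variable form). The sketch below, on simulated data with made-up names, checks that the within estimator and the dummy-variable regression give the same slope; it is an illustration under those assumptions, not the fit used on the stocks data.

```r
library(plm)

set.seed(2)
# Simulated toy panel (hypothetical): 5 individuals, 10 periods each
toy <- data.frame(id = rep(1:5, each = 10), t = rep(1:10, times = 5))
alpha <- rnorm(5, sd = 3)                # unobserved individual effects
toy$x <- alpha[toy$id] + rnorm(50)       # regressor correlated with alpha_i
toy$y <- alpha[toy$id] + 2 * toy$x + rnorm(50)

within_fit <- plm(y ~ x, data = toy, index = c("id", "t"), model = "within")
lsdv_fit   <- lm(y ~ x + factor(id), data = toy)  # one dummy per individual

# Same slope from both approaches
same_slope <- all.equal(unname(coef(within_fit)["x"]),
                        unname(coef(lsdv_fit)["x"]))
same_slope

# Estimated individual intercepts recovered from the within fit
fixef(within_fit)
```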

Fixed Model - Within Estimator
stocks_m3_within <- plm(close ~ change + volume,  data = panel_stocks, model = "within")
summary(stocks_m3_within)
## Oneway (individual) effect Within Model
## 
## Call:
## plm(formula = close ~ change + volume, data = panel_stocks, model = "within")
## 
## Unbalanced Panel: n = 30, T = 368-1258, N = 36850
## 
## Residuals:
##     Min.  1st Qu.   Median  3rd Qu.     Max. 
## -87.0894 -11.5124  -1.9800   8.9088 227.6882 
## 
## Coefficients:
##           Estimate  Std. Error  t-value  Pr(>|t|)    
## change  7.6403e-01  9.7709e-02   7.8194 5.446e-15 ***
## volume -3.2987e-07  1.8457e-08 -17.8718 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    29546000
## Residual Sum of Squares: 29236000
## R-Squared:      0.010514
## Adj. R-Squared: 0.0096808
## F-statistic: 195.607 on 2 and 36818 DF, p-value: < 2.22e-16
Fixed_Model <- glance(stocks_m3_within)

Fixed Model - First Difference Estimator

The first difference estimator uses period-to-period changes for each stock, so the individual-specific effects (unobserved heterogeneity) cancel out: \(y_{it}-y_{i,t-1} = \beta(x_{it}-x_{i,t-1})+(u_{it}-u_{i,t-1})\)
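
The differencing can also be done by hand. The sketch below, on simulated data with made-up names, differences `y` and `x` within each individual and compares an `lm` fit on the differences with `plm(..., model = "fd")`. Note that `plm` keeps an intercept in the differenced equation by default, so the manual regression includes one as well.

```r
library(plm)

set.seed(3)
# Simulated toy panel (hypothetical): 5 individuals, 10 periods each,
# rows sorted by id and then t
toy <- data.frame(id = rep(1:5, each = 10), t = rep(1:10, times = 5))
alpha <- rnorm(5, sd = 3)                # individual effects, removed by differencing
toy$x <- rnorm(50)
toy$y <- alpha[toy$id] + 2 * toy$x + rnorm(50)

fd_fit <- plm(y ~ x, data = toy, index = c("id", "t"), model = "fd")

# Difference within each individual by hand
dy <- unlist(tapply(toy$y, toy$id, diff))
dx <- unlist(tapply(toy$x, toy$id, diff))
manual_fit <- lm(dy ~ dx)

same_coefs <- all.equal(unname(coef(fd_fit)), unname(coef(manual_fit)))
same_coefs
```

Each individual loses its first observation, so 45 of the 50 rows enter the estimation.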

stocks_m3_fd <- plm(close ~ change+volume,  data = panel_stocks, model = "fd")
summary(stocks_m3_fd)
## Oneway (individual) effect First-Difference Model
## 
## Call:
## plm(formula = close ~ change + volume, data = panel_stocks, model = "fd")
## 
## Unbalanced Panel: n = 30, T = 368-1258, N = 36850
## Observations used in estimation: 36820
## 
## Residuals:
##        Min.     1st Qu.      Median     3rd Qu.        Max. 
## -14.2388182  -0.3440688  -0.0013794   0.3629816  30.6575086 
## 
## Coefficients:
##                Estimate  Std. Error  t-value  Pr(>|t|)    
## (Intercept)  4.7363e-02  5.4457e-03   8.6972 < 2.2e-16 ***
## change       4.8687e-01  2.5826e-03 188.5182 < 2.2e-16 ***
## volume      -3.8725e-09  7.0967e-10  -5.4567 4.882e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    79114
## Residual Sum of Squares: 40202
## R-Squared:      0.49185
## Adj. R-Squared: 0.49182
## F-statistic: 17817.8 on 2 and 36817 DF, p-value: < 2.22e-16
Fixed_Model_FD <- glance(stocks_m3_fd)

Random Effect Model

Random effects models assume that the individual-specific effects (unobserved heterogeneity) are uncorrelated with the predictor variables. Under this assumption the individual effect is treated as a random draw and becomes part of a composite error term, rather than a fixed intercept for each stock. Note that the model below uses volume as the only predictor, whereas the pooled and fixed effects models also include change:
\(y_{it}=\beta x_{it}+(\alpha_i+ u_{it})\)

stocks_m3_random <- plm(close ~ volume,  data = panel_stocks, model = "random")
summary(stocks_m3_random)
## Oneway (individual) effect Random Effect Model 
##    (Swamy-Arora's transformation)
## 
## Call:
## plm(formula = close ~ volume, data = panel_stocks, model = "random")
## 
## Unbalanced Panel: n = 30, T = 368-1258, N = 36850
## 
## Effects:
##                   var std.dev share
## idiosyncratic  795.35   28.20 0.329
## individual    1621.56   40.27 0.671
## theta:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9635  0.9803  0.9803  0.9801  0.9803  0.9803 
## 
## Residuals:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -84.835 -11.598  -2.485   0.009   8.679 233.394 
## 
## Coefficients:
##                Estimate  Std. Error z-value  Pr(>|z|)    
## (Intercept)  9.7573e+01  7.3569e+00  13.263 < 2.2e-16 ***
## volume      -3.3618e-07  1.8456e-08 -18.215 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    29575000
## Residual Sum of Squares: 29311000
## R-Squared:      0.0089123
## Adj. R-Squared: 0.0088854
## Chisq: 331.781 on 1 DF, p-value: < 2.22e-16
Random_Model <- glance(stocks_m3_random)
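
The theta values in the summary are the quasi-demeaning weights of the random effects estimator. As a worked check, the sketch below recomputes them from the printed variance components using the standard formula \(\theta_i = 1-\sqrt{\sigma^2_{idios}/(\sigma^2_{idios}+T_i\,\sigma^2_{indiv})}\), for the shortest and longest series reported (T = 368 and T = 1258). The inputs are the rounded values from the output, so the result is approximate.

```r
# Variance components copied from the "Effects" block of the summary
sigma2_idios <- 795.35    # idiosyncratic variance
sigma2_indiv <- 1621.56   # individual variance

# Shortest and longest time series in the unbalanced panel
T_i <- c(shortest = 368, longest = 1258)

# Quasi-demeaning weight: closer to 1 means closer to the within estimator
theta <- 1 - sqrt(sigma2_idios / (sigma2_idios + T_i * sigma2_indiv))
round(theta, 4)
```

These come out at about 0.9635 and 0.9803, in line with the Min. and Max. of theta in the summary. Values this close to 1 mean the random effects estimates are close to the within estimates, which is consistent with the similar volume coefficients in the two models.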

Panel Model Testing & Selection

Below is a summary of the four panel models fitted to the stocks dataset. Choosing among the pooled, fixed effects, and random effects models involves a Lagrange Multiplier test, an F test for individual effects, and a Hausman test. Of the four models, we discard the First Difference model due to its nonsensical fitted values.

Panel_Model_Summary <- data.frame(Pooled_Model)

Panel_Model_Summary <- rbind(Panel_Model_Summary, Fixed_Model)
Panel_Model_Summary <- rbind(Panel_Model_Summary, Fixed_Model_FD)
Panel_Model_Summary <- rbind(Panel_Model_Summary, Random_Model)


rownames(Panel_Model_Summary) <- c("Pooled Model", "Fixed Model", "First Difference Model", "Random Model")
Panel_Model_Summary
##                          r.squared adj.r.squared  statistic      p.value
## Pooled Model           0.120666113   0.120618384  2528.1547 0.000000e+00
## Fixed Model            0.010513880   0.009680753   195.6066 3.142607e-85
## First Difference Model 0.491847059   0.491819455 17817.7983 0.000000e+00
## Random Model           0.008912289   0.008885393   331.7813 3.933962e-74
##                          deviance df.residual  nobs
## Pooled Model           89909043.5       36847 36850
## Fixed Model            29235595.5       36818 36850
## First Difference Model    40201.9       36817 36820
## Random Model           29310960.3       36848 36850

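The Breusch–Pagan Lagrange Multiplier test (LM test) for panel models checks for individual (or time) effects using the residuals of the pooled model. It is distinct from the Breusch–Pagan test for heteroskedasticity in a linear regression. The null hypothesis is that the variance of the individual effects is zero, so that pooled OLS is adequate; the alternative hypothesis is that significant effects exist. Note that plmtest defaults to the Honda variant of the test, as the output header shows; the Breusch–Pagan variant is requested with type = "bp".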

First, we test for random effects against pooled OLS. Since the p-value is near 0, we reject the null hypothesis: the individual-specific effects (unobserved heterogeneity) of the stocks are significant, and therefore we should not use the Pooled OLS model.

plmtest(stocks_m3_pooled, effect = "individual")
## 
##  Lagrange Multiplier Test - (Honda) for unbalanced panels
## 
## data:  close ~ volume + change
## normal = 3081.9, p-value < 2.2e-16
## alternative hypothesis: significant effects
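
The call above uses the default `type = "honda"`. The sketch below, on simulated data with made-up names and deliberately strong individual effects, shows how the Breusch–Pagan variant can be requested through the `type` argument of `plmtest`. It is an illustration only and is not run on the stocks data.

```r
library(plm)

set.seed(4)
# Simulated toy panel (hypothetical): 10 individuals, 20 periods each
toy <- data.frame(id = rep(1:10, each = 20), t = rep(1:20, times = 10))
alpha <- rnorm(10, sd = 5)               # strong individual effects
toy$x <- rnorm(200)
toy$y <- alpha[toy$id] + 2 * toy$x + rnorm(200)

pooled_fit <- plm(y ~ x, data = toy, index = c("id", "t"), model = "pooling")

lm_honda <- plmtest(pooled_fit, effect = "individual")               # default variant
lm_bp    <- plmtest(pooled_fit, effect = "individual", type = "bp")  # Breusch-Pagan variant

# Both variants test the null of no individual effects
c(honda = unname(lm_honda$p.value), bp = unname(lm_bp$p.value))
```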

Next, we test the Fixed Effects model against the Pooled OLS model with an F test for individual effects. We again reject the null hypothesis of no individual effects, so the test supports the Fixed Effects model over Pooled OLS.

pFtest(stocks_m3_within, stocks_m3_pooled)
## 
##  F test for individual effects
## 
## data:  close ~ change + volume
## F = 2634.8, df1 = 29, df2 = 36818, p-value < 2.2e-16
## alternative hypothesis: significant effects
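
The F statistic can be reproduced from the residual sums of squares reported in the model summary table above. The sketch below is a worked check using the deviance and df.residual columns for the pooled and fixed effects models; the variable names are illustrative.

```r
# Values copied from the model summary table
rss_pooled <- 89909043.5   # deviance, Pooled Model
rss_within <- 29235595.5   # deviance, Fixed Model
df_pooled  <- 36847        # df.residual, Pooled Model
df_within  <- 36818        # df.residual, Fixed Model

df1 <- df_pooled - df_within   # number of restrictions (extra intercepts)
df2 <- df_within

# F = (drop in RSS per restriction) / (residual variance of the larger model)
f_stat <- ((rss_pooled - rss_within) / df1) / (rss_within / df2)
f_stat

# Upper-tail p-value from the F distribution
p_value <- pf(f_stat, df1, df2, lower.tail = FALSE)
p_value
```

The result agrees with the F = 2634.8 on 29 and 36818 degrees of freedom reported by pFtest.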

From the two tests above, we can set aside the Pooled OLS model, which ignores the panel structure of the data. Next, we compare the Fixed Effects and Random Effects models using the Hausman test, which evaluates the consistency of estimators, in our case the fixed effect and random effect estimators. If there is no correlation between the independent variables and the individual-specific effects, then both estimators are consistent, but the Fixed Effects estimator is not efficient. If there is such a correlation, the Fixed Effects estimator is consistent and the Random Effects estimator is inconsistent.

From the result we see that the null hypothesis is rejected. The hypothesis that the individual effects of each stock are uncorrelated with the predictors is therefore not supported, and we select the Fixed Effects model. Note that the two models compared here have different formulas (the random effects model omits change), so the test is based only on the shared volume coefficient, as df = 1 in the output indicates.

phtest(stocks_m3_random, stocks_m3_within)
## 
##  Hausman Test
## 
## data:  close ~ volume
## chisq = 926.06, df = 1, p-value < 2.2e-16
## alternative hypothesis: one model is inconsistent
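
For reference, a Hausman test in which both models share the same formula can be sketched as follows. The data are simulated with made-up names, and the regressor is deliberately correlated with the individual effect so that the random effects assumption fails; this illustrates the mechanics and is not a re-analysis of the stocks data.

```r
library(plm)

set.seed(5)
# Simulated toy panel (hypothetical): 30 individuals, 20 periods each
toy <- data.frame(id = rep(1:30, each = 20), t = rep(1:20, times = 30))
alpha <- rnorm(30)                       # individual effects
toy$x <- alpha[toy$id] + rnorm(600)      # regressor correlated with alpha_i
toy$y <- 1 + 2 * toy$x + 3 * alpha[toy$id] + rnorm(600)

fe_fit <- plm(y ~ x, data = toy, index = c("id", "t"), model = "within")
re_fit <- plm(y ~ x, data = toy, index = c("id", "t"), model = "random")

# Same formula in both models, so the test compares the same coefficient
h <- phtest(fe_fit, re_fit)
h
```

A small p-value here points to the fixed effects estimator, since the random effects estimator is inconsistent when the effects are correlated with the regressors.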