Panel data are a type of longitudinal data: observations collected on the same units at different points in time. Panel data models describe individual behavior both across individuals and over time, so the data and the models have cross-sectional and time-series dimensions. In this section we fit panel models to a stocks dataset, which records prices across stocks and over time. The general assumption for this dataset is that observations are correlated over time for a given stock, but each stock is independent of the other stocks. The dataset does not have any time-invariant regressors.
The variables in the data are as follows:
| Variable | Description |
|---|---|
| date | Trading Date |
| open | Price of the stock at market open |
| high | Highest price reached in the trade day |
| low | Lowest price reached in the trade day |
| close | Price of the stock at market close |
| volume | Number of shares traded |
| unadjustedVolume | Volume for stocks, unadjusted by stock splits |
| change | Change in closing price from prior trade day close |
| changePercent | Percentage change in closing price from prior trade day close |
| vwap | Volume weighted average price (VWAP) is the ratio of the value traded to total volume traded |
| changeOverTime | Percent change of each interval relative to first value. Useful for comparing multiple stocks. |
| ticker | Abbreviation used to uniquely identify publicly traded shares |
We load the dataset and declare its cross-sectional and time dimensions, ticker and date respectively:
library(plm)
library(readr)
library(dplyr)
library(broom)
dataset <- readr::read_csv('https://raw.githubusercontent.com/salma71/Data_621/master/Project_Proposal/stocks_combined.csv')
dataset$date <- as.Date(dataset$date, format="%m/%d/%Y")
head(dataset)
## # A tibble: 6 x 13
## date ticker open high low close volume unadjustedVolume change
## <date> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2014-02-21 AAPL 70.0 70.2 68.9 69.0 6.98e7 9965321 -0.775
## 2 2014-02-24 AAPL 68.7 69.6 68.6 69.3 7.24e7 10337850 0.302
## 3 2014-02-25 AAPL 69.5 69.5 68.4 68.6 5.82e7 8321050 -0.721
## 4 2014-02-26 AAPL 68.8 68.9 67.7 67.9 6.91e7 9875898 -0.619
## 5 2014-02-27 AAPL 67.9 69.4 67.8 69.3 7.56e7 10793903 1.36
## 6 2014-02-28 AAPL 69.5 70.0 68.6 69.1 9.31e7 13296379 -0.188
## # ... with 4 more variables: changePercent <dbl>, vwap <dbl>, label <chr>,
## # changeOverTime <dbl>
panel_stocks <- pdata.frame(dataset, index = c("ticker", "date"))
We check whether all stocks are observed in all time periods. If the panel is unbalanced, we balance it so that all stocks are observed on the same trade dates. This means some observations are dropped so that every stock has an observation on each retained trading day.
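# Drop the label column
dataset2 <- dataset %>% select(-label)
# Name the index explicitly: without it, the first two columns (date, ticker) are taken as individual and time
is.pbalanced(dataset2, index = c("ticker", "date"))
## [1] FALSE
dataset2 <- make.pbalanced(dataset2, balance.type = "shared.times", index = c("ticker", "date"))
is.pbalanced(dataset2, index = c("ticker", "date"))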
## [1] TRUE
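To make the effect of balance.type = "shared.times" concrete, the following base-R sketch reproduces the idea on a small made-up panel: only the dates on which every ticker is observed are kept. The data frame `toy` and its values are invented for illustration, and this is not a description of how plm implements the function internally.

```r
# Toy panel (hypothetical values): ticker B has no row for 2020-01-03
toy <- data.frame(
  ticker = c("A", "A", "A", "B", "B"),
  date   = as.Date(c("2020-01-02", "2020-01-03", "2020-01-06",
                     "2020-01-02", "2020-01-06")),
  close  = c(10, 11, 12, 20, 21)
)

# date x ticker matrix: TRUE where that ticker is observed on that date
present <- table(as.character(toy$date), toy$ticker) > 0

# Dates on which every ticker is observed
shared_dates <- rownames(present)[rowSums(present) == ncol(present)]

# Keep only rows on shared dates
balanced <- toy[as.character(toy$date) %in% shared_dates, ]

# In a balanced panel every ticker has the same number of rows
obs_per_ticker <- table(balanced$ticker)
print(balanced)
print(obs_per_ticker)
```

Here the 2020-01-03 row for ticker A is dropped, which is the cost of balancing this way: information is discarded for stocks with longer histories.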
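With the panel defined, we build panel data models with several estimators. Note that the models below are fit on panel_stocks, which the model summaries report as an unbalanced panel, rather than on the balanced dataset2.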
A pooled model applies Ordinary Least Squares to the stacked panel data. It is the most restrictive panel model because it ignores the panel structure: every stock shares the same intercept and slope. \(y_{it}=\alpha+\beta x_{it}+u_{it}\)
stocks_m3_pooled <- plm(close ~ volume + change, data = panel_stocks, model = "pooling")
summary(stocks_m3_pooled)
## Pooling Model
##
## Call:
## plm(formula = close ~ volume + change, data = panel_stocks, model = "pooling")
##
## Unbalanced Panel: n = 30, T = 368-1258, N = 36850
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -105.283 -33.769 -13.010 20.485 828.904
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## (Intercept) 1.1081e+02 3.4624e-01 320.0473 < 2.2e-16 ***
## volume -1.4591e-06 2.0657e-08 -70.6331 < 2.2e-16 ***
## change 1.0794e+00 1.7113e-01 6.3072 2.873e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 102250000
## Residual Sum of Squares: 89909000
## R-Squared: 0.12067
## Adj. R-Squared: 0.12062
## F-statistic: 2528.15 on 2 and 36847 DF, p-value: < 2.22e-16
Pooled_Model <- glance(stocks_m3_pooled)
In fixed effects models we assume there is unobserved heterogeneity across the stocks, such as each company’s core competency or anything else that is unique to a company and is factored into its stock price without being observed. This heterogeneity (\(\alpha_i\)) is not known and is allowed to be correlated with the predictor variables. The within estimator removes it so that we can see the impact of the predictors on the closing price: \(y_{it}=\alpha_i+\beta x_{it}+u_{it}\)
stocks_m3_within <- plm(close ~ change + volume, data = panel_stocks, model = "within")
summary(stocks_m3_within)
## Oneway (individual) effect Within Model
##
## Call:
## plm(formula = close ~ change + volume, data = panel_stocks, model = "within")
##
## Unbalanced Panel: n = 30, T = 368-1258, N = 36850
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -87.0894 -11.5124 -1.9800 8.9088 227.6882
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## change 7.6403e-01 9.7709e-02 7.8194 5.446e-15 ***
## volume -3.2987e-07 1.8457e-08 -17.8718 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 29546000
## Residual Sum of Squares: 29236000
## R-Squared: 0.010514
## Adj. R-Squared: 0.0096808
## F-statistic: 195.607 on 2 and 36818 DF, p-value: < 2.22e-16
Fixed_Model <- glance(stocks_m3_within)
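The fixed effects ("within") idea can be checked by hand on simulated data. The sketch below uses only base R and a made-up three-ticker panel (the names `toy` and `alpha` and the true slope of 2 are assumptions for illustration). It shows that OLS on ticker-demeaned data gives the same slope as OLS with one dummy per ticker, which is why \(\alpha_i\) never has to be estimated directly.

```r
set.seed(1)

# Simulated panel: each ticker has its own intercept alpha_i, common slope 2
toy <- data.frame(
  ticker = rep(c("A", "B", "C"), each = 5),
  x      = rnorm(15)
)
alpha <- c(A = 10, B = 50, C = 100)
toy$y <- unname(alpha[toy$ticker]) + 2 * toy$x + rnorm(15, sd = 0.1)

# Within transformation: subtract each ticker's own mean from y and x
toy$y_dm <- toy$y - ave(toy$y, toy$ticker)
toy$x_dm <- toy$x - ave(toy$x, toy$ticker)

# Slope from the demeaned regression (no intercept after demeaning)
beta_within <- coef(lm(y_dm ~ x_dm - 1, data = toy))[["x_dm"]]

# Slope from the dummy-variable regression (one intercept per ticker)
beta_lsdv <- coef(lm(y ~ x + factor(ticker), data = toy))[["x"]]

print(c(within = beta_within, lsdv = beta_lsdv))
print(all.equal(beta_within, beta_lsdv))
```

The two slopes agree up to floating-point error. Standard errors from the demeaned lm() would not match a proper within estimator, because lm() does not account for the degrees of freedom used by the ticker means.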
The first-difference estimator uses period-to-period changes for each stock, so the individual-specific effects (unobserved heterogeneity) cancel out: \(y_{it}-y_{i,t-1} = \beta(x_{it}-x_{i,t-1})+(u_{it}-u_{i,t-1})\)
stocks_m3_fd <- plm(close ~ change+volume, data = panel_stocks, model = "fd")
summary(stocks_m3_fd)
## Oneway (individual) effect First-Difference Model
##
## Call:
## plm(formula = close ~ change + volume, data = panel_stocks, model = "fd")
##
## Unbalanced Panel: n = 30, T = 368-1258, N = 36850
## Observations used in estimation: 36820
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -14.2388182 -0.3440688 -0.0013794 0.3629816 30.6575086
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## (Intercept) 4.7363e-02 5.4457e-03 8.6972 < 2.2e-16 ***
## change 4.8687e-01 2.5826e-03 188.5182 < 2.2e-16 ***
## volume -3.8725e-09 7.0967e-10 -5.4567 4.882e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 79114
## Residual Sum of Squares: 40202
## R-Squared: 0.49185
## Adj. R-Squared: 0.49182
## F-statistic: 17817.8 on 2 and 36817 DF, p-value: < 2.22e-16
Fixed_Model_FD <- glance(stocks_m3_fd)
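The first-difference transformation can also be sketched in base R on simulated data. The panel below is made up (the names `toy`, `alpha`, `lag_diff` and the true slope of 2 are assumptions for illustration), and rows are assumed to be sorted by time within each ticker. Differencing within ticker removes the ticker-specific intercepts, at the cost of one observation per ticker.

```r
set.seed(2)

# Simulated panel: ticker-specific intercepts, common slope 2
toy <- data.frame(
  ticker = rep(c("A", "B", "C"), each = 6),
  x      = rnorm(18)
)
alpha <- c(A = 10, B = 50, C = 100)
toy$y <- unname(alpha[toy$ticker]) + 2 * toy$x + rnorm(18, sd = 0.1)

# Difference within each ticker; the first row of a ticker has no lag, so NA
lag_diff <- function(v) c(NA, diff(v))
toy$dy <- ave(toy$y, toy$ticker, FUN = lag_diff)
toy$dx <- ave(toy$x, toy$ticker, FUN = lag_diff)

# OLS on the differences; lm() drops the NA rows automatically
fd_fit  <- lm(dy ~ dx - 1, data = toy)
beta_fd <- coef(fd_fit)[["dx"]]
n_used  <- nobs(fd_fit)

print(beta_fd)
print(n_used)  # one observation lost per ticker
```

This mirrors the "Observations used in estimation" line in the plm output above, where the count drops by one per stock.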
Random effects models assume that the individual-specific effects (unobserved heterogeneity) are uncorrelated with the predictor variables. Under this assumption \(\alpha_i\) is treated as a random draw and becomes part of a composite error term, so any remaining variation in the dependent variable (closing price) is random. Note that this model is fit with volume as the only predictor:
\(y_{it}=\beta x_{it}+(\alpha_i+ u_{it})\)
stocks_m3_random <- plm(close ~ volume, data = panel_stocks, model = "random")
summary(stocks_m3_random)
## Oneway (individual) effect Random Effect Model
## (Swamy-Arora's transformation)
##
## Call:
## plm(formula = close ~ volume, data = panel_stocks, model = "random")
##
## Unbalanced Panel: n = 30, T = 368-1258, N = 36850
##
## Effects:
## var std.dev share
## idiosyncratic 795.35 28.20 0.329
## individual 1621.56 40.27 0.671
## theta:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9635 0.9803 0.9803 0.9801 0.9803 0.9803
##
## Residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -84.835 -11.598 -2.485 0.009 8.679 233.394
##
## Coefficients:
## Estimate Std. Error z-value Pr(>|z|)
## (Intercept) 9.7573e+01 7.3569e+00 13.263 < 2.2e-16 ***
## volume -3.3618e-07 1.8456e-08 -18.215 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 29575000
## Residual Sum of Squares: 29311000
## R-Squared: 0.0089123
## Adj. R-Squared: 0.0088854
## Chisq: 331.781 on 1 DF, p-value: < 2.22e-16
Random_Model <- glance(stocks_m3_random)
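The theta values in the random effects summary can be related to the variance components printed just above them. As a sketch, assuming the usual quasi-demeaning weight \(\theta_i = 1-\sqrt{\sigma^2_{u}/(T_i\sigma^2_{\alpha}+\sigma^2_{u})}\) for a stock with \(T_i\) observations, the arithmetic below plugs in the rounded variances and the smallest and largest \(T_i\) from the output. The variable names are chosen here for illustration.

```r
# Variance components copied from the "Effects" table (rounded as printed)
sigma2_idios <- 795.35    # idiosyncratic variance
sigma2_indiv <- 1621.56   # individual (stock-level) variance

# Shortest and longest time series in the unbalanced panel (T = 368-1258)
T_i <- c(min = 368, max = 1258)

# Quasi-demeaning weight per stock: near 1 means close to the within estimator
theta <- 1 - sqrt(sigma2_idios / (T_i * sigma2_indiv + sigma2_idios))
print(round(theta, 4))
```

The results should land close to the 0.9635 minimum and 0.9803 maximum that plm reports. With theta this close to 1, the random effects estimate is nearly the within estimate, which is consistent with the similar volume coefficients in the two summaries.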
Below is a summary of the four panel models fit to the stocks dataset. Choosing between a pooled, fixed effects, or random effects model requires running the Breusch–Pagan Lagrange Multiplier test, an F test, and the Hausman test. Of the four models below, we discard the First Difference model due to its nonsensical fitted values.
Panel_Model_Summary <- data.frame(Pooled_Model)
Panel_Model_Summary <- rbind(Panel_Model_Summary, Fixed_Model)
Panel_Model_Summary <- rbind(Panel_Model_Summary, Fixed_Model_FD)
Panel_Model_Summary <- rbind(Panel_Model_Summary, Random_Model)
rownames(Panel_Model_Summary) <- c("Pooled Model", "Fixed Model", "First Difference Model", "Random Model")
Panel_Model_Summary
## r.squared adj.r.squared statistic p.value
## Pooled Model 0.120666113 0.120618384 2528.1547 0.000000e+00
## Fixed Model 0.010513880 0.009680753 195.6066 3.142607e-85
## First Difference Model 0.491847059 0.491819455 17817.7983 0.000000e+00
## Random Model 0.008912289 0.008885393 331.7813 3.933962e-74
## deviance df.residual nobs
## Pooled Model 89909043.5 36847 36850
## Fixed Model 29235595.5 36818 36850
## First Difference Model 40201.9 36817 36820
## Random Model 29310960.3 36848 36850
The Breusch–Pagan Lagrange Multiplier test (LM test) for panel models checks whether individual effects are present. It should not be confused with the Breusch–Pagan test for heteroskedasticity in a linear regression. The null hypothesis is that there are no significant individual effects, in which case pooled OLS is adequate; the alternative hypothesis is that the effects are significant. By default plmtest reports the Honda variant of the test, as the output below shows.
First, we test random effects against pooled OLS. Since the p-value is near 0, we reject the null hypothesis: the individual-specific effects (unobserved heterogeneity) of each stock are significant, and therefore we should not use the pooled OLS model.
plmtest(stocks_m3_pooled, effect = "individual")
##
## Lagrange Multiplier Test - (Honda) for unbalanced panels
##
## data: close ~ volume + change
## normal = 3081.9, p-value < 2.2e-16
## alternative hypothesis: significant effects
Next we test the fixed effects model against the pooled OLS model with an F test. We again reject the null hypothesis in favor of the alternative, so the test supports the fixed effects model over pooled OLS.
pFtest(stocks_m3_within, stocks_m3_pooled)
##
## F test for individual effects
##
## data: close ~ change + volume
## F = 2634.8, df1 = 29, df2 = 36818, p-value < 2.2e-16
## alternative hypothesis: significant effects
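The F statistic above can be reproduced from the residual sums of squares in the model summary table. The sketch below assumes the standard F test for nested models, with pooled OLS as the restricted model and the within model as the unrestricted one. The numbers are copied from the output and the variable names are chosen here for illustration.

```r
# Residual sums of squares (the "deviance" column of the summary table)
rss_pooled <- 89909043.5   # restricted model: one common intercept
rss_within <- 29235595.5   # unrestricted model: one intercept per stock

# Degrees of freedom: 30 stocks - 1 restrictions, and the within model's residual df
df1 <- 29
df2 <- 36818

# F = (drop in RSS per restriction) / (unrestricted residual variance)
f_stat  <- ((rss_pooled - rss_within) / df1) / (rss_within / df2)
p_value <- pf(f_stat, df1, df2, lower.tail = FALSE)

print(round(f_stat, 1))
print(p_value)
```

The computed statistic should agree with the F = 2634.8 reported by pFtest, up to rounding.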
From the two tests above, we can set aside the pooled OLS model, which ignores the panel structure of the data. Next, we compare the fixed effects and random effects models. We do this using the Hausman test, which compares the fixed effects and random effects estimators. If there is no correlation between the independent variables and the individual-specific effects, then both estimators are consistent, but the fixed effects estimator is inefficient. If there is correlation, the fixed effects estimator is consistent and the random effects estimator is inconsistent.
From the result we see that the null hypothesis is rejected. The hypothesis that the individual effects of each stock are uncorrelated with the predictor variables is therefore not supported, so we accept the alternative hypothesis and select the fixed effects model. Note that the random effects model was fit with volume only, so the test is based on the volume coefficient that the two models share (df = 1).
phtest(stocks_m3_random, stocks_m3_within)
##
## Hausman Test
##
## data: close ~ volume
## chisq = 926.06, df = 1, p-value < 2.2e-16
## alternative hypothesis: one model is inconsistent
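For reference, the textbook form of the Hausman statistic is given below. It is stated here as the standard definition, not as a description of plm's internal computation. It compares the two coefficient vectors, scaled by the difference of their covariance matrices:

\[
H = (\hat\beta_{FE}-\hat\beta_{RE})^{\top}\left[\widehat{\operatorname{Var}}(\hat\beta_{FE})-\widehat{\operatorname{Var}}(\hat\beta_{RE})\right]^{-1}(\hat\beta_{FE}-\hat\beta_{RE}) \;\sim\; \chi^2_k \quad \text{under the null,}
\]

where \(k\) is the number of coefficients compared (here \(k=1\)). A large value of \(H\) means the two estimators disagree by more than sampling error would explain, which points to correlation between \(\alpha_i\) and the regressors.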