Predicting stock market behavior is a difficult task due to the high volatility and complexity of financial markets. Although large amounts of historical data are available, accurately forecasting stock prices remains challenging because of the many economic and behavioral factors that influence market movements. As a result, researchers often focus on predicting price direction or overall trends rather than exact future values.
This project evaluates three statistical approaches for stock prediction: logistic regression, ARIMA time series models, and panel regression models. Logistic regression is used to classify next-day price direction, ARIMA models are used to forecast future prices based on past observations, and panel regression models analyze multiple stocks while accounting for both time and cross-sectional effects.
The results show that logistic regression performs poorly, with most predictors being insignificant and accuracy near 52%, close to random guessing. Panel regression also produces inconsistent results across stocks. In contrast, ARIMA models provide more stable and reliable forecasts by capturing time-dependent patterns and trends in the data, producing reasonable prediction intervals.
Overall, the results suggest that time series–based ARIMA models are more effective for stock price forecasting in this study compared to logistic and panel regression approaches.
Key Words
Time series Logistic regression Auto-regressive models Panel regression
Introduction
Predicting stock prices is a complex task, and most traditional approaches often produce limited or inconsistent results. Although large volumes of stock market data are readily available, identifying meaningful predictors remains a key challenge. When assuming that recent market behavior can help forecast future movements, analysts often rely on technical indicators such as the Relative Strength Index (RSI), moving averages, and intraday price measures such as open, high, low, and close (OHLC) values.
Alternatively, if stock prices are assumed to follow longer-term patterns and trends, time series analysis becomes more appropriate. This approach models stock prices by comparing current values with their lagged or past observations to capture temporal dependencies and autocorrelation. However, these methods often require data transformations, such as differencing, to satisfy assumptions like stationarity. By applying different statistical models and tuning their parameters, it is possible to generate forecasts that estimate future stock prices with a measurable level of uncertainty.
Methodology
The dataset consists of multiple stocks, but for clarity and consistency, the analysis focuses primarily on Apple (AAPL). In addition to standard pricing information, the dataset is enriched with engineered technical indicators designed to improve predictive performance. These include the Relative Strength Index (RSI), Moving Average Convergence Divergence (MACD), and simple moving averages, which are commonly used to capture momentum and trend behavior in financial markets.
For the logistic regression models, a binary target variable is created based on next-day price movement. If the closing price increases on the following day, the target is assigned a value of 1; otherwise, it is assigned 0. These models are implemented using generalized linear models (GLMs) with a binomial family, allowing for classification of upward or downward price movement.
The autoregressive models rely solely on historical price data and model the relationship between a time series and its lagged values. The ARIMA (AutoRegressive Integrated Moving Average) framework is used to capture these dependencies, where the AR component represents autoregression, the MA component captures moving average effects, and the integration term accounts for differencing to achieve stationarity. These models are then used to forecast future stock prices.
Panel regression models incorporate multiple stocks over time and use the same engineered features, including RSI, MACD, and moving averages of price and volume. Since the dataset includes several companies across identical time periods, fixed effects models are used to account for firm-specific characteristics. This allows the model to estimate stock prices while controlling for individual differences between companies.
The tickers available in this dataset are listed below. We will be focusing on the AAPL stock which is described below.
Code
tickers
# A tibble: 30 × 2
Ticker Company
<chr> <chr>
1 AAPL Apple Inc
2 AXP American Express Company
3 BA Boeing Co
4 CAT Caterpillar Inc.
5 CSCO Cisco Systems, Inc.
6 CVX Chevron Corporation
7 DIS Walt Disney Co
8 DWDP DowDuPont
9 GS GlaxoSmithKline plc
10 HD The Home Depot, Inc.
# ℹ 20 more rows
The dataset is enhanced using technical indicators generated through the TA-Lib package, along with a binary target variable for the logistic regression model. Rows containing missing values are removed, as these are introduced by rolling window calculations used in feature construction. The analysis focuses on rates of change rather than raw price levels to ensure comparability across different stocks with varying price scales.
The engineered variables include the following:
TARGET: Binary outcome variable indicating whether the next day’s closing price increases (1) or not (0). MACD: Moving Average Convergence Divergence signal based on standard 12- and 26-day windows. m5: 5-day momentum indicator representing short-term price trend (approximately one trading week). m20: 20-day momentum indicator capturing medium-term trend (approximately one trading month). vol1: One-day rate of change in trading volume. vol5: Five-day rate of change in trading volume, capturing short-term volume trend behavior.
The stock price for AAPL over the years are risen steadily. There are two important periods of decline but the overall trends remains positive. These periods of decline should provide balance to the dataset.
Code
AAPL %>%ggplot(aes(x=date,y=close)) +geom_line() +ggtitle('AAPL Close Price')
The correlation plot reveals that some variable are nearly perfectly correlated. This makes sense given that some of these variables generally move together or are derived from another.
# Initialize a df that will store the metrics of modelsmodels.df <-tibble(id=character(), formula=character(), res.deviance=numeric(), null.deviance=numeric(),aic=numeric(), accuracy=numeric(), sensitivity=numeric(), specificity=numeric(),precision.deviance=numeric(), stringsAsFactors=FALSE)
M1A Full Model
The initial logistic regression model produced weak and inconsistent results. Only a small number of predictors were statistically significant, indicating limited explanatory power. This outcome is largely attributed to multicollinearity among several variables, particularly those derived from stock prices such as open, high, low, and close, which tend to move together.
Because many predictors were highly correlated or redundant, the full model suffered from instability and reduced interpretability. As a result, model refinement was necessary by progressively removing insignificant and highly correlated variables to improve model performance and reliability.
Call:
glm(formula = TARGET ~ ., family = binomial, data = AAPL_model1_data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.671e-01 6.027e-01 0.443 0.6577
open 8.583e-03 8.433e-02 0.102 0.9189
high -1.930e-03 1.517e-01 -0.013 0.9899
low -4.638e-02 1.281e-01 -0.362 0.7172
close -2.640e-01 1.042e-01 -2.533 0.0113 *
volume 2.347e-09 5.330e-09 0.440 0.6597
unadjustedVolume -3.069e-09 5.218e-09 -0.588 0.5563
change 7.907e-02 1.042e-01 0.759 0.4479
changePercent -5.918e-02 1.341e-01 -0.441 0.6590
vwap 3.025e-01 2.193e-01 1.379 0.1678
macd -2.321e-02 6.687e-02 -0.347 0.7286
m5 1.059e-02 2.147e-02 0.493 0.6218
m20 1.028e-02 1.728e-02 0.595 0.5520
rsi -8.145e-04 9.055e-03 -0.090 0.9283
vol1 -3.231e-02 1.985e-01 -0.163 0.8707
vol5 -4.458e-03 1.609e-01 -0.028 0.9779
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1705.0 on 1231 degrees of freedom
Residual deviance: 1694.3 on 1216 degrees of freedom
AIC: 1726.3
Number of Fisher Scoring iterations: 4
M1B Small Model Reducing the model down to only the significant predictors leads to a nonsensical results where the two variables cancel each other out.
Call:
glm(formula = TARGET ~ . - vol5 - high - rsi - open - vol1 -
changePercent - low - macd - m20 - volume - unadjustedVolume -
m5 - change, family = binomial, data = AAPL_model1_data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.19119 0.20539 0.931 0.3519
close -0.16739 0.07510 -2.229 0.0258 *
vwap 0.16665 0.07511 2.219 0.0265 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1705.0 on 1231 degrees of freedom
Residual deviance: 1699.6 on 1229 degrees of freedom
AIC: 1705.6
Number of Fisher Scoring iterations: 4
M1C (Multicollinearity Reduced Model)
To address multicollinearity issues observed in earlier specifications, this model removes several highly correlated predictors, including open, high, low, vwap, change, and MACD. These variables were excluded due to strong interdependence and high Variance Inflation Factor (VIF) values, which negatively impacted model stability and interpretability.
To retain useful information from the removed variables, two engineered features were introduced: Close2Open and Open2Open, which capture short-term price dynamics and changes between consecutive observations. The final set of predictors was selected based on acceptable VIF thresholds to reduce redundancy and improve model robustness.
Despite these adjustments, the model still shows limited statistical significance across predictors, suggesting that even after mitigating multicollinearity, logistic regression has limited effectiveness for accurately modeling short-term stock direction in this dataset.
Code
m1c_data <- AAPL_model1_data# Use dplyr::lag to avoid conflict with plm::lagm1c_data$Close2Open <- m1c_data$open - dplyr::lag(m1c_data$close, 1)m1c_data$Open2Open <- m1c_data$open - dplyr::lag(m1c_data$open, 1)# Remove rows with NA valuesm1c_data_trimmed <- m1c_data %>%drop_na()# Remove highly collinear variablesm1c_data_trimmed <- m1c_data_trimmed %>%select( TARGET, close, volume, rsi, macd, m5, m20, vol1, vol5, Close2Open, Open2Open )# Fit logistic regressionmodel.m1c <-glm( TARGET ~ .,data = m1c_data_trimmed,family = binomial)summary(model.m1c)
Call:
glm(formula = TARGET ~ ., family = binomial, data = m1c_data_trimmed)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.652e-01 5.662e-01 0.468 0.640
close -7.071e-04 1.715e-03 -0.412 0.680
volume 1.467e-09 3.733e-09 0.393 0.694
rsi -2.701e-03 8.574e-03 -0.315 0.753
macd 2.639e-05 6.520e-02 0.000 1.000
m5 7.912e-03 2.127e-02 0.372 0.710
m20 6.765e-03 1.689e-02 0.401 0.689
vol1 -1.507e-02 1.960e-01 -0.077 0.939
vol5 -2.968e-02 1.585e-01 -0.187 0.851
Close2Open 3.919e-02 5.651e-02 0.694 0.488
Open2Open -1.913e-03 3.722e-02 -0.051 0.959
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1703.5 on 1230 degrees of freedom
Residual deviance: 1700.6 on 1220 degrees of freedom
AIC: 1722.6
Number of Fisher Scoring iterations: 3
The results from the logistic regression models are summarized below. Overall, the in-sample classification accuracy is approximately 52%, which is only marginally better than random guessing. None of the models demonstrate a clear advantage in predictive performance, as most variables are not statistically significant.
These results suggest that the current feature set provides limited explanatory power for predicting short-term price direction. Future improvements could include incorporating additional stocks while removing stock-specific variables to improve generalizability. In addition, alternative target definitions, such as predicting price movement over 5-day or 20-day horizons, may yield more stable patterns and improved model performance.
Model 2 (Autoregressive Models)
In this section, all predictor variables are excluded except for the closing price, allowing the analysis to focus solely on the time series structure of the data. This approach assumes that past price behavior contains sufficient information to model and forecast future values.
The AAPL stock series exhibits a clear overall trend with noticeable short-term fluctuations. To assess its suitability for time series modeling, autocorrelation and partial autocorrelation functions (ACF and PACF) are examined. The ACF indicates strong dependence on past values, while the PACF helps identify the number of meaningful lag terms after accounting for intermediate correlations.
Model identification typically relies on interpreting these plots, where significant PACF lags suggest an appropriate autoregressive structure. In this case, the PACF suggests that an AR(1) model may be appropriate, supported by a strong linear relationship between the current and previous time step, with an estimated coefficient close to 1. However, the slowly decaying ACF indicates that the series may be non-stationary, suggesting that differencing may be required to stabilize the mean before fitting a more reliable model.
Call:
lm(formula = t ~ t_1)
Residuals:
Min 1Q Median 3Q Max
-15.6872 -0.8211 -0.0341 0.9425 11.1222
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.340328 0.210876 1.614 0.107
t_1 0.997991 0.001572 634.859 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.129 on 1255 degrees of freedom
Multiple R-squared: 0.9969, Adjusted R-squared: 0.9969
F-statistic: 4.03e+05 on 1 and 1255 DF, p-value: < 2.2e-16
We use the generalized least squares method to obtain more information on the mode. We specify an error correlation structure of order 1. We note that the coefficient estiamte is again 1 and the AIC is 5478. This will serve as a comparison of more complex models.
Generalized least squares fit by REML
Model: AAPL_close ~ time(AAPL_close)
Data: NULL
AIC BIC logLik
5478.647 5499.19 -2735.324
Correlation Structure: AR(1)
Formula: ~1
Parameter estimate(s):
Phi
0.9999998
Coefficients:
Value Std.Error t-value p-value
(Intercept) 68.90127 3174.416 0.0217052 0.9827
time(AAPL_close) 0.08198 0.060 1.3647854 0.1726
Correlation:
(Intr)
time(AAPL_close) -0.012
Standardized residuals:
Min Q1 Med Q3 Max
-0.009293359 -0.001385895 0.002775650 0.006557553 0.020775575
Residual standard error: 3174.415
Degrees of freedom: 1258 total; 1256 residual
We test the justification of the added autoregressive structure using a likelyhood ratio test. With a very small p value, we can reject the null hypothesis that the added term is not necessary.
Code
m2.gls.0<-update(m2.gls, correlation=NULL)anova(m2.gls, m2.gls.0) # AR(1) vs Uncorrelated errors
Model df AIC BIC logLik Test L.Ratio p-value
m2.gls 1 4 5478.647 5499.19 -2735.324
m2.gls.0 2 3 10941.635 10957.04 -5467.817 1 vs 2 5464.988 <.0001
Fitting an AR(1) model using the ar function confirms the coefficient of approxmiately 1 (0.997).
Call:
ar(x = AAPL_close)
Coefficients:
1
0.997
Order selected 1 sigma^2 estimated as 8.862
We can view the fitted of the GLS and AR(1) models side by side.
Code
par(mfrow=c(1,2))m2.gls_fitted <- AAPL_close -residuals(m2.gls)plot(AAPL$close, type ="l", col =4, lty =1, main="GLS")points(m2.gls_fitted, type ="l", col =2, lty =2)plot(AAPL$close, type ="l", col =4, lty =1, main="AR(1)")points(m2.ar1_fitted, type ="l", col =2, lty =2)
We noted earlier that the slowly decaying structure of the ACF suggested that the series is not stationary. We verify this claim using the Dickey Fuller tests below. The null hypothesis that the series is not stationary is not rejected for the original series, but rejected for the once differentiated series. Therefore, the original series is not stationary and should be differentiated to stabilize the mean.
Dickey Fuller Test on time series:
Code
adf.test(diff(AAPL$close, differences=1))
Augmented Dickey-Fuller Test
data: diff(AAPL$close, differences = 1)
Dickey-Fuller = -9.8502, Lag order = 10, p-value = 0.01
alternative hypothesis: stationary
Dickey Fuller Test on differentiated time series:
Code
adf.test(diff(AAPL$close, differences=1))
Augmented Dickey-Fuller Test
data: diff(AAPL$close, differences = 1)
Dickey-Fuller = -9.8502, Lag order = 10, p-value = 0.01
alternative hypothesis: stationary
ARIMA ARIMA models allows to combine the AR structure seen above with differentiation and moving average terms.
Here we look at the first two differentials. The higher order differentials have mean 0 but the variance seems to increase in the last part of the series.
Order 1 differential The order 1 differential shows the highest correlation at position 0 and a significant lag at position 6, while the partial ACF seenms to show periodic behavior with a few significant lags at 7 and 8.
Order 2 differential On the other hand, the order two differential has several meaningful lags at position 1 and 7, and the partial ACF decays and stays significant until lag 6.
Form these observations, we can study the following ARMA (autoregressive moving average) models are possible for the differentiated time series:
First Order ARIMA(7,1,0): AR(7) high order on the autoregressive structure First Order ARIMA(0,1,2): MA(1) moving average structure First Order ARIMA(1,1,1): ARMA(1,1) combination
Second Order ARIMA(1,2,0): AR(1) autoregressive structure Second Order ARIMA(0,2,1): MA(1) moving average structure Second Order ARIMA(6,2,0): AR(6) autoregressive structure Second Order ARIMA(6,2,1): ARMA(6,1) combination
Code
arimadf <-tibble(model=character(), coef=character(), loglik=numeric(), aic=numeric()) arimamodels <-function(p,d,q) {# A specification of the non-seasonal part of the ARIMA model: the three components (p, d, q) are the AR order, the degree of differencing, and the MA order. ARMA(p,q) arima.model <-arima(AAPL$close, order =c(p,d,q)) arima.model_fitted <- AAPL$close -residuals(arima.model) df <-tibble(model=paste0(p,d,q), loglik=arima.model$loglik, aic=arima.model$aic) return.list <-list(model=arima.model,fitted=arima.model_fitted, df=df)return(return.list)}
Model Results The results of the ARIMA models listed above are printed here. We find that the model with the lowest AIC is the first order differential AR(7) model with a value of 5466. However is is also a complex model with 8 parameters. A smaller model with marginally decreased AIC is preffered. We can choose for the any of the remaining 3 parameter models. We select the second order MA(1) model at the best model.
plot(AAPL$close, type ="l", col =4, lty =1)points(m021$fitted, type ="l", col =2, lty =2)
We can now use the model to forecast the price of the stock. In the example below the price is predicted for the next 150 trading days. Confidence intervals (shown in blue) are provided at 80% and 95% levels.
arimaforecast <-function(stock, ticker) {# Subset the training data, fit model and get fitted valyes training_subset <- stock$close[1:round(0.7*length(AAPL$close))] model <-arima(training_subset, order =c(0, 2, 1)) fitted_values <- training_subset -residuals(model) boundary <-round(length(stock$close)*.7) end <-length(stock$close) data <-tibble(date=stock$date, price=stock$close, prediction =NA, fitted=NA, Low95 =NA,Low80 =NA,High95 =NA,High80 =NA) stocksubsetforecast <-forecast(model, h=377) forecast_df <-fortify(as.data.frame(stocksubsetforecast)) %>%as_tibble() forecast_df <- forecast_df %>%rename("Low95"="Lo 95","Low80"="Lo 80","High95"="Hi 95","High80"="Hi 80","Forecast"="Point Forecast") data$fitted <-c(as.vector(fitted_values),rep(NA,(end-boundary))) data$prediction <-c(rep(NA,boundary),forecast_df$Forecast) data$Low80 <-c(rep(NA,boundary),forecast_df$Low80) data$High80 <-c(rep(NA,boundary),forecast_df$High80) data$Low95 <-c(rep(NA,boundary),forecast_df$Low95) data$High95 <-c(rep(NA,boundary),forecast_df$High95)ggplot(data, aes(x = date)) +geom_ribbon(aes(ymin = Low95, ymax = High95, fill ="95%")) +geom_ribbon(aes(ymin = Low80, ymax = High80, fill ="80%")) +geom_point(aes(y = price, colour ="price"), size =1) +geom_line(aes(y = price, group =1, colour ="price"), linetype ="dotted", size =0.75) +geom_line(aes(y = fitted, group =2, colour ="fitted"), size =0.75) +geom_line(aes(y = prediction, group =3, colour ="prediction"), size =0.75) +scale_x_date(breaks = scales::pretty_breaks(), date_labels ="%b %y") +scale_colour_brewer(name ="Legend", type ="qual", palette ="Dark2") +scale_fill_brewer(name ="Intervals") +guides(colour =guide_legend(order =1), fill =guide_legend(order =2)) +theme_bw(base_size =14) +ggtitle(ticker)}
With a well-fitting ARIMA model in place, we evaluate its predictive performance by splitting the data into a training set (first 70% of observations) and a test set (remaining 30%). The model is trained on the historical portion and then used to forecast future stock prices, which are compared against the actual observed values.
For AAPL, the observed prices generally remain within the 80% and 95% confidence intervals generated by the model. While the forecast captures the overall trend reasonably well, the actual prices tend to be slightly lower than the predicted trajectory, indicating a modest upward bias in the fitted trend.
To assess generalizability, the same ARIMA structure is applied to other stocks, including Microsoft (MSFT) and IBM. For Microsoft, actual prices tend to lie near the upper bound of the 95% confidence interval and occasionally exceed it, suggesting that the model underestimates stronger upward movements. In contrast, IBM shows better alignment with the forecasted values, with observed prices staying more consistently within the predicted range. Overall, the model performs more reliably for IBM, while showing greater deviation for Microsoft, highlighting differences in how well the ARIMA structure generalizes across different stocks.
Code
arimaforecast(AAPL,'AAPL')
Code
MSFT <- dataset %>%filter(ticker=='MSFT') IBM <- dataset %>%filter(ticker=='IBM')
Code
arimaforecast(MSFT,'MSFT')
Code
arimaforecast(IBM,'IBM')
Model 3 (Panel Regression)
The third modeling approach uses panel regression to analyze stock behavior across both time and multiple companies. This dataset is well suited for panel methods because it contains observations for several stocks over a common time period, allowing the model to capture both cross-sectional differences between companies and temporal dynamics within each stock. The underlying assumption is that stock prices are correlated over time within each company, while different companies remain independent of one another. Additionally, the dataset does not include time-invariant regressors.
Before model estimation, the dataset is balanced to ensure that all stocks are observed over the same trading dates. This step is necessary because unbalanced panel data can lead to biased or inconsistent estimates. As a result, some observations are removed so that each stock has complete coverage across identical time periods, ensuring comparability across the panel structure.
train_stocks <- train_stocks %>%drop_na()panel_stocks <-pdata.frame(train_stocks, index =c("ticker", "date"))
Model 3A: Pooled OLS A Pooled model on panel data is applies the Ordinary Least Squares technique on the data. It is the most restrictive panel model as cross-sectional dimensions are ignored: yit=α+βxit+uit
Code
stocks_m3_pooled <-plm(close ~ macd + m5 + m20, data = panel_stocks, model ="pooling")summary(stocks_m3_pooled)
Pooling Model
Call:
plm(formula = close ~ macd + m5 + m20, data = panel_stocks, model = "pooling")
Balanced Panel: n = 29, T = 983, N = 28507
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-98.2878 -31.1645 -5.8626 21.9314 280.1262
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
(Intercept) 86.36634 0.25985 332.376 < 2.2e-16 ***
macd -8.03873 0.25224 -31.869 < 2.2e-16 ***
m5 -1.87150 0.12322 -15.188 < 2.2e-16 ***
m20 5.02074 0.10047 49.975 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Total Sum of Squares: 58494000
Residual Sum of Squares: 52403000
R-Squared: 0.10414
Adj. R-Squared: 0.10404
F-statistic: 1104.43 on 3 and 28503 DF, p-value: < 2.22e-16
Code
Pooled_Model <-glance(stocks_m3_pooled)
Fixed Effect Models In Fixed affects models we will assume there is an unobserved heterogeneity across the stocks such as each company’s core competency, or anything else that is unique to each company and thus something unobserved factored into each company’s stock price. This heterogeneity (αi ) is not known but we would want to investigate it’s correlation with the created predictor variables see it’s impact on the closing price: yit=αi+βxit+uit
Model 3B: Fixed Model - Within Estimator
Code
stocks_m3_within <-plm(close ~ macd + m5 + m20 + rsi + vol5, data = panel_stocks, model ="within")summary(stocks_m3_within)
Oneway (individual) effect Within Model
Call:
plm(formula = close ~ macd + m5 + m20 + rsi + vol5, data = panel_stocks,
model = "within")
Balanced Panel: n = 29, T = 983, N = 28507
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-64.76738 -8.82872 -0.75545 6.23842 204.90320
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
macd -3.473877 0.130298 -26.6609 < 2.2e-16 ***
m5 -1.048252 0.065263 -16.0619 < 2.2e-16 ***
m20 2.270449 0.045208 50.2219 < 2.2e-16 ***
rsi 0.185650 0.017343 10.7047 < 2.2e-16 ***
vol5 0.739358 0.237983 3.1068 0.001893 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Total Sum of Squares: 11677000
Residual Sum of Squares: 10039000
R-Squared: 0.14026
Adj. R-Squared: 0.13926
F-statistic: 929.038 on 5 and 28473 DF, p-value: < 2.22e-16
Code
Fixed_Model <-glance(stocks_m3_within)
Model 3C: Fixed Model - First Difference Estimator The First difference estimator uses period changes for each stock, where the individual specific effects (unobserved heterogeneity) is canceled out: yit−yi,t−1=β(xit−xi,t−1)+(eit−ei,t−1)
Looking at the result, we see that while the adjusted R2 is fairly high compared to the fitted values are nonsensical
Code
stocks_m3_fd <-plm(close ~ macd + m5 + m20 + rsi + vol1 + vol5, data = panel_stocks, model ="fd")summary(stocks_m3_fd)
Oneway (individual) effect First-Difference Model
Call:
plm(formula = close ~ macd + m5 + m20 + rsi + vol1 + vol5, data = panel_stocks,
model = "fd")
Balanced Panel: n = 29, T = 983, N = 28507
Observations used in estimation: 28478
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-7.214443 -0.244970 -0.011041 0.233413 10.271243
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
(Intercept) 0.05347588 0.00323057 16.5531 < 2.2e-16 ***
macd 0.23063393 0.01368203 16.8567 < 2.2e-16 ***
m5 0.21175171 0.00241756 87.5890 < 2.2e-16 ***
m20 0.21845909 0.00256222 85.2617 < 2.2e-16 ***
rsi 0.11329595 0.00093788 120.8004 < 2.2e-16 ***
vol1 -0.02273702 0.00690079 -3.2948 0.000986 ***
vol5 0.01974797 0.00788665 2.5040 0.012286 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Total Sum of Squares: 42088
Residual Sum of Squares: 8460.4
R-Squared: 0.79899
Adj. R-Squared: 0.79894
F-statistic: 18860.9 on 6 and 28471 DF, p-value: < 2.22e-16
Model 3D: Random Effect Model In Random affects models assumes no fixed effects.The individual specific effects (unobserved heterogeneity) are not correlated with the predictor variables and are independent of the predictor variables. This would result in the assumption that any residual variation on the dependent variable ( closing price) is random and should randomly distributed with the error term: yit=βxit+(αi+eit)
Code
stocks_m3_random <-plm(close ~ macd + m5 + m20 + rsi + vol5, data = panel_stocks, model ="random")summary(stocks_m3_random)
Oneway (individual) effect Random Effect Model
(Swamy-Arora's transformation)
Call:
plm(formula = close ~ macd + m5 + m20 + rsi + vol5, data = panel_stocks,
model = "random")
Balanced Panel: n = 29, T = 983, N = 28507
Effects:
var std.dev share
idiosyncratic 352.59 18.78 0.386
individual 560.33 23.67 0.614
theta: 0.9747
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-62.2882 -8.8892 -1.3109 5.9865 206.3854
Coefficients:
Estimate Std. Error z-value Pr(>|z|)
(Intercept) 77.366926 4.492012 17.2232 < 2.2e-16 ***
macd -3.477005 0.130416 -26.6609 < 2.2e-16 ***
m5 -1.048818 0.065322 -16.0561 < 2.2e-16 ***
m20 2.272307 0.045249 50.2183 < 2.2e-16 ***
rsi 0.185538 0.017358 10.6886 < 2.2e-16 ***
vol5 0.739583 0.238199 3.1049 0.001903 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Total Sum of Squares: 11707000
Residual Sum of Squares: 10067000
R-Squared: 0.14006
Adj. R-Squared: 0.13991
Chisq: 4642.05 on 5 DF, p-value: < 2.22e-16
Code
Random_Model <-glance(stocks_m3_random)
Panel Model Testing & Selection Below is a summary glance of the 4 Panel models run on the stocks dataset. Choosing between either a pooled, fixed effect, or random effect model requires running Breusch–Pagan Lagrange Multiplier and Hausman tests. Of the 4 models below, we will discard the First Difference model due to its nonsensical fitted values
r.squared adj.r.squared statistic p.value deviance
Pooled Model 0.1041384 0.1040441 1104.4329 0 52402960.63
Fixed Model 0.1402610 0.1392645 929.0379 0 10039213.31
First Difference Model 0.7989854 0.7989431 18860.9173 0 8460.39
Random Model 0.1400610 0.1399101 4642.0489 0 10067303.75
df.residual nobs
Pooled Model 28503 28507
Fixed Model 28473 28507
First Difference Model 28471 28478
Random Model 28501 28507
The Breusch–Pagan Lagrange Multiplier Test (LM Test) is used to test for heteroskedasticity in a linear regression model. The null hypothesis assumes homoskedasticity and in the alternate hypothesis heteroskedasticity is assumed.
First, we test for Random Effects against OLS. From the test we see that since the p-value is near 0, we reject the null hypothesis. The individual specific effects (unobserved heterogeneity) of each stock are significant and therefore we should not use the Pooled OLS model.
Code
plmtest(stocks_m3_pooled, effect ="individual")
Lagrange Multiplier Test - (Honda)
data: close ~ macd + m5 + m20
normal = 2918.2, p-value < 2.2e-16
alternative hypothesis: significant effects
We test the Fixed Effects model against OLS Model, we again see that we reject the null for the alternative hypothesis as the test suggest more support for the Fixed Effect model
Code
pFtest(stocks_m3_within, stocks_m3_pooled)
F test for individual effects
data: close ~ macd + m5 + m20 + rsi + vol5
F = 4005, df1 = 30, df2 = 28473, p-value < 2.2e-16
alternative hypothesis: significant effects
Based on the results from the previous tests, the pooled OLS model can be excluded because it ignores both cross-sectional differences and time-series variation in the data. The next step involves comparing the fixed effects and random effects models using the Hausman test, which evaluates whether the estimators are consistent.
The Hausman test is based on the assumption that if there is no correlation between the independent variables and the individual-specific effects, both fixed effects and random effects models are consistent, although the random effects model is more efficient. However, if such a correlation exists, the random effects estimator becomes inconsistent, while the fixed effects estimator remains valid.
In this case, the null hypothesis is rejected, indicating that the assumption of no correlation does not hold. Therefore, the random effects model is not appropriate, and the fixed effects model is selected as the preferred specification for the panel data analysis.
Code
phtest(stocks_m3_random, stocks_m3_within)
Hausman Test
data: close ~ macd + m5 + m20 + rsi + vol5
chisq = 1.078, df = 5, p-value = 0.956
alternative hypothesis: one model is inconsistent
Discussion and Conclusion
In this paper, we implemented models using logistic regression, panel regression, and auto-regressive models using ARIMA.
The logistic regression models aimed at predicting the direction of next-day stock movements produced weak results overall. Very few predictors were statistically significant, and classification accuracy remained around 52%, which is only slightly better than random guessing. Attempts to address multicollinearity did not meaningfully improve performance, leading to the conclusion that logistic regression was not well suited for this task within the current feature set. Potential improvements could include expanding the dataset to include multiple stocks or redefining the target variable to capture longer-term price movements.
In contrast, the autoregressive time series models proved to be more effective for forecasting stock prices. These models rely solely on historical price data and identify patterns based on past values. Initial linear analysis suggested that an AR(1) structure, where the current price depends on the previous value, could be appropriate. However, the series exhibited non-stationarity, requiring differencing to stabilize the mean and satisfy modeling assumptions. After extending the model to include differencing and moving average components, multiple ARIMA configurations were evaluated. Simpler models performed more consistently, with a second-order differencing model combined with an MA(1) structure selected as the best-performing approach. Validation using training and test splits showed that the ARIMA model generally captured price movements within its confidence intervals, although the intervals remained relatively wide and primarily reflected trend behavior rather than exact price prediction.
For panel regression, three main approaches were compared: pooled OLS, fixed effects, and random effects models. A first-differences estimator was also tested within the fixed effects framework but was ultimately excluded due to unrealistic fitted values. Based on diagnostic testing, including the Breusch–Pagan Lagrange Multiplier test and the Hausman test, the fixed effects model was determined to be the most appropriate, as it accounts for unobserved heterogeneity across companies that may influence stock prices. However, the fixed effects model struggled when applied to the test dataset due to multicollinearity issues in the engineered features, suggesting that alternative feature construction may be required for better out-of-sample performance.
Overall, among the three approaches considered, the ARIMA-based autoregressive models provided the most stable and interpretable results. Each modeling framework serves a different purpose: logistic regression focuses on direction classification, panel regression estimates cross-sectional price behavior, and ARIMA captures temporal structure within individual stock series. Given the inherent difficulty of predicting daily stock prices, ARIMA models are the most suitable in this study for capturing overall trends and providing meaningful forecast ranges.