Data-621 - Final Project ( Stock Predictions Methods )

Author

Bikash Bhowmik, Rupendra Shrestha, Anthony Josue Roman, Jerald Melukkaran, Haoming Chen

Published

May 17, 2026

Abstract

Predicting stock market behavior is a difficult task due to the high volatility and complexity of financial markets. Although large amounts of historical data are available, accurately forecasting stock prices remains challenging because of the many economic and behavioral factors that influence market movements. As a result, researchers often focus on predicting price direction or overall trends rather than exact future values.

This project evaluates three statistical approaches for stock prediction: logistic regression, ARIMA time series models, and panel regression models. Logistic regression is used to classify next-day price direction, ARIMA models are used to forecast future prices based on past observations, and panel regression models analyze multiple stocks while accounting for both time and cross-sectional effects.

The results show that logistic regression performs poorly, with most predictors being insignificant and accuracy near 52%, close to random guessing. Panel regression also produces inconsistent results across stocks. In contrast, ARIMA models provide more stable and reliable forecasts by capturing time-dependent patterns and trends in the data, producing reasonable prediction intervals.

Overall, the results suggest that time series–based ARIMA models are more effective for stock price forecasting in this study compared to logistic and panel regression approaches.

Key Words

Time series Logistic regression Auto-regressive models Panel regression

Introduction

Predicting stock prices is a complex task, and most traditional approaches often produce limited or inconsistent results. Although large volumes of stock market data are readily available, identifying meaningful predictors remains a key challenge. When assuming that recent market behavior can help forecast future movements, analysts often rely on technical indicators such as the Relative Strength Index (RSI), moving averages, and intraday price measures such as open, high, low, and close (OHLC) values.

Alternatively, if stock prices are assumed to follow longer-term patterns and trends, time series analysis becomes more appropriate. This approach models stock prices by comparing current values with their lagged or past observations to capture temporal dependencies and autocorrelation. However, these methods often require data transformations, such as differencing, to satisfy assumptions like stationarity. By applying different statistical models and tuning their parameters, it is possible to generate forecasts that estimate future stock prices with a measurable level of uncertainty.

Methodology

The dataset consists of multiple stocks, but for clarity and consistency, the analysis focuses primarily on Apple (AAPL). In addition to standard pricing information, the dataset is enriched with engineered technical indicators designed to improve predictive performance. These include the Relative Strength Index (RSI), Moving Average Convergence Divergence (MACD), and simple moving averages, which are commonly used to capture momentum and trend behavior in financial markets.

For the logistic regression models, a binary target variable is created based on next-day price movement. If the closing price increases on the following day, the target is assigned a value of 1; otherwise, it is assigned 0. These models are implemented using generalized linear models (GLMs) with a binomial family, allowing for classification of upward or downward price movement.

The autoregressive models rely solely on historical price data and model the relationship between a time series and its lagged values. The ARIMA (AutoRegressive Integrated Moving Average) framework is used to capture these dependencies, where the AR component represents autoregression, the MA component captures moving average effects, and the integration term accounts for differencing to achieve stationarity. These models are then used to forecast future stock prices.

Panel regression models incorporate multiple stocks over time and use the same engineered features, including RSI, MACD, and moving averages of price and volume. Since the dataset includes several companies across identical time periods, fixed effects models are used to account for firm-specific characteristics. This allows the model to estimate stock prices while controlling for individual differences between companies.

Experimentation and Results

Code
library(summarytools)
library(MASS)
library(rpart.plot)
library(ggplot2)
library(ggfortify)
library(gridExtra)
library(forecast)
library(fpp2)
library(fma)
library(kableExtra)
library(e1071)
library(mlbench)
library(ggcorrplot)
library(DataExplorer)
library(timeDate)
library(caret)
library(GGally)
library(corrplot)
library(RColorBrewer)
library(tibble)
library(tidyr)
library(tidyverse)
library(dplyr)
library(reshape2)
library(mixtools)
library(tidymodels)
library(ggpmisc)
library(regclass)
library(skimr)
library(corrgram)
library(mice)

library(TTR)
library(quantmod)
library(plm)
library(nlme)
library(tseries)
library(broom)
Code
dataset <- read_csv('https://raw.githubusercontent.com/BIKASHBHOWMIK15/Data-621/main/stocks_combined.csv')
tickers <- read_csv('https://raw.githubusercontent.com/BIKASHBHOWMIK15/Data-621/main/tickers.csv')

The tickers available in this dataset are listed below. We will be focusing on the AAPL stock which is described below.

Code
tickers
# A tibble: 30 × 2
   Ticker Company                 
   <chr>  <chr>                   
 1 AAPL   Apple Inc               
 2 AXP    American Express Company
 3 BA     Boeing Co               
 4 CAT    Caterpillar Inc.        
 5 CSCO   Cisco Systems, Inc.     
 6 CVX    Chevron Corporation     
 7 DIS    Walt Disney Co          
 8 DWDP   DowDuPont               
 9 GS     GlaxoSmithKline plc     
10 HD     The Home Depot, Inc.    
# ℹ 20 more rows
Code
skim(dataset)
Data summary
Name dataset
Number of rows 36850
Number of columns 13
_______________________
Column type frequency:
character 3
numeric 10
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
date 0 1 8 10 0 1258 0
ticker 0 1 1 4 0 30 0
label 0 1 5 9 0 1258 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
open 0 1 94.52 52.68 18.17 56.83 84.09 118.30 417.14 ▇▅▁▁▁
high 0 1 95.25 53.14 18.41 57.40 84.67 119.14 421.84 ▇▅▁▁▁
low 0 1 93.76 52.19 18.11 56.19 83.54 117.39 417.11 ▇▅▁▁▁
close 0 1 94.53 52.68 18.18 56.77 84.09 118.32 421.55 ▇▅▁▁▁
volume 0 1 11196571.67 12461717.47 305358.00 3977137.50 7045060.00 13777850.50 618237630.00 ▇▁▁▁▁
unadjustedVolume 0 1 10967241.59 12152154.22 305358.00 3817578.50 6895243.00 13581565.00 618237630.00 ▇▁▁▁▁
change 0 1 0.05 1.50 -23.45 -0.43 0.04 0.56 65.29 ▁▇▁▁▁
changePercent 0 1 0.05 1.30 -14.34 -0.56 0.05 0.70 16.67 ▁▁▇▁▁
vwap 0 1 94.52 52.68 13.79 56.79 84.11 118.24 420.32 ▇▅▁▁▁
changeOverTime 0 1 0.38 0.50 -0.41 0.06 0.24 0.54 3.19 ▇▅▁▁▁
Code
head(dataset)
# A tibble: 6 × 13
  date      ticker  open  high   low close   volume unadjustedVolume change
  <chr>     <chr>  <dbl> <dbl> <dbl> <dbl>    <dbl>            <dbl>  <dbl>
1 2/21/2014 AAPL    70.0  70.2  68.9  69.0 69757247          9965321 -0.775
2 2/24/2014 AAPL    68.7  69.6  68.6  69.3 72364950         10337850  0.302
3 2/25/2014 AAPL    69.5  69.5  68.4  68.6 58247350          8321050 -0.721
4 2/26/2014 AAPL    68.8  68.9  67.7  67.9 69131286          9875898 -0.619
5 2/27/2014 AAPL    67.9  69.4  67.8  69.3 75557321         10793903  1.36 
6 2/28/2014 AAPL    69.5  70.0  68.6  69.1 93074653         13296379 -0.188
# ℹ 4 more variables: changePercent <dbl>, vwap <dbl>, label <chr>,
#   changeOverTime <dbl>

Data Processing

The only necessary processing of the data is formatting the date column appropriately.

Code
dataset$date <- as.Date(dataset$date, format="%m/%d/%Y")

Feature Engineering

The dataset is enhanced using technical indicators generated through the TA-Lib package, along with a binary target variable for the logistic regression model. Rows containing missing values are removed, as these are introduced by rolling window calculations used in feature construction. The analysis focuses on rates of change rather than raw price levels to ensure comparability across different stocks with varying price scales.

The engineered variables include the following:

TARGET: Binary outcome variable indicating whether the next day’s closing price increases (1) or not (0). MACD: Moving Average Convergence Divergence signal based on standard 12- and 26-day windows. m5: 5-day momentum indicator representing short-term price trend (approximately one trading week). m20: 20-day momentum indicator capturing medium-term trend (approximately one trading month). vol1: One-day rate of change in trading volume. vol5: Five-day rate of change in trading volume, capturing short-term volume trend behavior.

Code
AAPL <- dataset %>% filter(ticker=='AAPL') 

# Augment data frame
AAPL_full <- AAPL %>% mutate(macd=MACD(Cl(AAPL), nFast=12, nSlow=26, nSig=9, maType=SMA)[,1], 
                        m5=momentum(AAPL$close, n=5), 
                        m20=momentum(AAPL$close, n=20), 
                        rsi=RSI(AAPL$close, n=14),
                        vol1=ROC(AAPL$volume, n=1),
                        vol5=ROC(AAPL$volume, n=5)) %>%
                        drop_na() %>%
                      select(-c(date,ticker,label, changeOverTime))

# Create target variable
AAPL_model1_data <- AAPL_full %>% 
  mutate(
    TARGET = if_else(dplyr::lead(close, 1) > close, 1, 0)
  ) %>%
  drop_na()
                  
AAPL_model1_data$TARGET <- factor(AAPL_model1_data$TARGET)

head(AAPL_model1_data,10)
# A tibble: 10 × 16
    open  high   low close  volume unadjustedVolume   change changePercent  vwap
   <dbl> <dbl> <dbl> <dbl>   <dbl>            <dbl>    <dbl>         <dbl> <dbl>
 1  70.7  70.8  70.2  70.5  5.01e7          7163009 -0.0788         -0.112  70.5
 2  70.8  71.0  70.4  70.5  4.22e7          6023884 -0.0158         -0.022  70.6
 3  70.6  71.2  70.5  71.1  5.02e7          7169955  0.645           0.915  70.8
 4  71.2  71.4  71.0  71.3  4.48e7          6398885  0.118           0.166  71.2
 5  71.1  71.2  70.6  70.8  4.06e7          5806873 -0.494          -0.693  70.9
 6  70.9  70.9  69.7  69.8  6.88e7          9830355 -0.915          -1.29   70.1
 7  69.3  69.7  68.5  68.7  7.25e7         10351790 -1.10           -1.57   69.0
 8  69.0  69.1  68.1  68.7  6.10e7          8710269 -0.00394        -0.006  68.7
 9  68.6  69.7  68.6  69.6  5.15e7          7363246  0.904           1.31   69.2
10  69.7  69.9  68.7  68.7  5.99e7          8558974 -0.898          -1.29   69.2
# ℹ 7 more variables: macd <dbl>, m5 <dbl>, m20 <dbl>, rsi <dbl>, vol1 <dbl>,
#   vol5 <dbl>, TARGET <fct>

Data Exploration

The stock price for AAPL over the years are risen steadily. There are two important periods of decline but the overall trends remains positive. These periods of decline should provide balance to the dataset.

Code
AAPL %>% ggplot(aes(x=date,y=close)) + geom_line() + ggtitle('AAPL Close Price')

Code
AAPL_full %>% 
  keep(is.numeric) %>%
  gather() %>% 
  ggplot(aes(x= value)) + 
  geom_histogram(fill='pink') + 
  facet_wrap(~key, scales = 'free')

The correlation plot reveals that some variable are nearly perfectly correlated. This makes sense given that some of these variables generally move together or are derived from another.

Code
library(corrplot)
corr_dataframe <- AAPL_model1_data %>% mutate_if(is.factor, as.numeric)
corr.d <- cor(corr_dataframe)
corr.d[ lower.tri( corr.d, diag = TRUE ) ] <- NA
corrplot( corr.d, type = "upper", diag = FALSE )

Code
AAPL_model1_data %>% ggplot(aes(x=TARGET)) + geom_histogram(stat="count")

Modeling

Model 1 (Logistic)

Code
# Initialize a df that will store the metrics of models
models.df <- tibble(id=character(), formula=character(), res.deviance=numeric(), null.deviance=numeric(),
                 aic=numeric(), accuracy=numeric(), sensitivity=numeric(), specificity=numeric(),
                precision.deviance=numeric(), stringsAsFactors=FALSE) 

M1A Full Model

The initial logistic regression model produced weak and inconsistent results. Only a small number of predictors were statistically significant, indicating limited explanatory power. This outcome is largely attributed to multicollinearity among several variables, particularly those derived from stock prices such as open, high, low, and close, which tend to move together.

Because many predictors were highly correlated or redundant, the full model suffered from instability and reduced interpretability. As a result, model refinement was necessary by progressively removing insignificant and highly correlated variables to improve model performance and reliability.

Code
model.full <- glm(
  TARGET ~ .,
  data = AAPL_model1_data,
  family = binomial
)

summary(model.full)

Call:
glm(formula = TARGET ~ ., family = binomial, data = AAPL_model1_data)

Coefficients:
                   Estimate Std. Error z value Pr(>|z|)  
(Intercept)       2.671e-01  6.027e-01   0.443   0.6577  
open              8.583e-03  8.433e-02   0.102   0.9189  
high             -1.930e-03  1.517e-01  -0.013   0.9899  
low              -4.638e-02  1.281e-01  -0.362   0.7172  
close            -2.640e-01  1.042e-01  -2.533   0.0113 *
volume            2.347e-09  5.330e-09   0.440   0.6597  
unadjustedVolume -3.069e-09  5.218e-09  -0.588   0.5563  
change            7.907e-02  1.042e-01   0.759   0.4479  
changePercent    -5.918e-02  1.341e-01  -0.441   0.6590  
vwap              3.025e-01  2.193e-01   1.379   0.1678  
macd             -2.321e-02  6.687e-02  -0.347   0.7286  
m5                1.059e-02  2.147e-02   0.493   0.6218  
m20               1.028e-02  1.728e-02   0.595   0.5520  
rsi              -8.145e-04  9.055e-03  -0.090   0.9283  
vol1             -3.231e-02  1.985e-01  -0.163   0.8707  
vol5             -4.458e-03  1.609e-01  -0.028   0.9779  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1705.0  on 1231  degrees of freedom
Residual deviance: 1694.3  on 1216  degrees of freedom
AIC: 1726.3

Number of Fisher Scoring iterations: 4

M1B Small Model Reducing the model down to only the significant predictors leads to a nonsensical results where the two variables cancel each other out.

Code
model.small <- glm(
  TARGET ~ . - vol5 - high - rsi - open - vol1 -
    changePercent - low - macd - m20 -
    volume - unadjustedVolume - m5 - change,
  data = AAPL_model1_data,
  family = binomial
)

summary(model.small)

Call:
glm(formula = TARGET ~ . - vol5 - high - rsi - open - vol1 - 
    changePercent - low - macd - m20 - volume - unadjustedVolume - 
    m5 - change, family = binomial, data = AAPL_model1_data)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  0.19119    0.20539   0.931   0.3519  
close       -0.16739    0.07510  -2.229   0.0258 *
vwap         0.16665    0.07511   2.219   0.0265 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1705.0  on 1231  degrees of freedom
Residual deviance: 1699.6  on 1229  degrees of freedom
AIC: 1705.6

Number of Fisher Scoring iterations: 4

M1C (Multicollinearity Reduced Model)

To address multicollinearity issues observed in earlier specifications, this model removes several highly correlated predictors, including open, high, low, vwap, change, and MACD. These variables were excluded due to strong interdependence and high Variance Inflation Factor (VIF) values, which negatively impacted model stability and interpretability.

To retain useful information from the removed variables, two engineered features were introduced: Close2Open and Open2Open, which capture short-term price dynamics and changes between consecutive observations. The final set of predictors was selected based on acceptable VIF thresholds to reduce redundancy and improve model robustness.

Despite these adjustments, the model still shows limited statistical significance across predictors, suggesting that even after mitigating multicollinearity, logistic regression has limited effectiveness for accurately modeling short-term stock direction in this dataset.

Code
m1c_data <- AAPL_model1_data

# Use dplyr::lag to avoid conflict with plm::lag
m1c_data$Close2Open <- m1c_data$open - dplyr::lag(m1c_data$close, 1)
m1c_data$Open2Open  <- m1c_data$open - dplyr::lag(m1c_data$open, 1)

# Remove rows with NA values
m1c_data_trimmed <- m1c_data %>% drop_na()

# Remove highly collinear variables
m1c_data_trimmed <- m1c_data_trimmed %>%
  select(
    TARGET,
    close,
    volume,
    rsi,
    macd,
    m5,
    m20,
    vol1,
    vol5,
    Close2Open,
    Open2Open
  )

# Fit logistic regression
model.m1c <- glm(
  TARGET ~ .,
  data = m1c_data_trimmed,
  family = binomial
)

summary(model.m1c)

Call:
glm(formula = TARGET ~ ., family = binomial, data = m1c_data_trimmed)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)  2.652e-01  5.662e-01   0.468    0.640
close       -7.071e-04  1.715e-03  -0.412    0.680
volume       1.467e-09  3.733e-09   0.393    0.694
rsi         -2.701e-03  8.574e-03  -0.315    0.753
macd         2.639e-05  6.520e-02   0.000    1.000
m5           7.912e-03  2.127e-02   0.372    0.710
m20          6.765e-03  1.689e-02   0.401    0.689
vol1        -1.507e-02  1.960e-01  -0.077    0.939
vol5        -2.968e-02  1.585e-01  -0.187    0.851
Close2Open   3.919e-02  5.651e-02   0.694    0.488
Open2Open   -1.913e-03  3.722e-02  -0.051    0.959

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1703.5  on 1230  degrees of freedom
Residual deviance: 1700.6  on 1220  degrees of freedom
AIC: 1722.6

Number of Fisher Scoring iterations: 3
Code
# Variance Inflation Factors
car::vif(model.m1c)
     close     volume        rsi       macd         m5        m20       vol1 
  1.275271   1.883759   4.181486   8.107270   2.993892   8.486247   1.292616 
      vol5 Close2Open  Open2Open 
  1.597036   1.827310   2.034822 

Logistic Model Results

The results from the logistic regression models are summarized below. Overall, the in-sample classification accuracy is approximately 52%, which is only marginally better than random guessing. None of the models demonstrate a clear advantage in predictive performance, as most variables are not statistically significant.

These results suggest that the current feature set provides limited explanatory power for predicting short-term price direction. Future improvements could include incorporating additional stocks while removing stock-specific variables to improve generalizability. In addition, alternative target definitions, such as predicting price movement over 5-day or 20-day horizons, may yield more stable patterns and improved model performance.

Model 2 (Autoregressive Models)

In this section, all predictor variables are excluded except for the closing price, allowing the analysis to focus solely on the time series structure of the data. This approach assumes that past price behavior contains sufficient information to model and forecast future values.

The AAPL stock series exhibits a clear overall trend with noticeable short-term fluctuations. To assess its suitability for time series modeling, autocorrelation and partial autocorrelation functions (ACF and PACF) are examined. The ACF indicates strong dependence on past values, while the PACF helps identify the number of meaningful lag terms after accounting for intermediate correlations.

Model identification typically relies on interpreting these plots, where significant PACF lags suggest an appropriate autoregressive structure. In this case, the PACF suggests that an AR(1) model may be appropriate, supported by a strong linear relationship between the current and previous time step, with an estimated coefficient close to 1. However, the slowly decaying ACF indicates that the series may be non-stationary, suggesting that differencing may be required to stabilize the mean before fitting a more reliable model.

Code
par(mfrow=c(2,2))
acf(AAPL$close)
pacf(AAPL$close)

t <- AAPL$close[-1]
t_1 <- AAPL$close[-length(AAPL$close)]
m2.lm <- lm(t ~ t_1)

plot(t_1, t)
abline(m2.lm, col=3, lwd=2)

Code
summary(m2.lm)

Call:
lm(formula = t ~ t_1)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.6872  -0.8211  -0.0341   0.9425  11.1222 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.340328   0.210876   1.614    0.107    
t_1         0.997991   0.001572 634.859   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.129 on 1255 degrees of freedom
Multiple R-squared:  0.9969,    Adjusted R-squared:  0.9969 
F-statistic: 4.03e+05 on 1 and 1255 DF,  p-value: < 2.2e-16

We use the generalized least squares method to obtain more information on the mode. We specify an error correlation structure of order 1. We note that the coefficient estiamte is again 1 and the AIC is 5478. This will serve as a comparison of more complex models.

Code
AAPL_close <- ts(AAPL$close)
m2.gls <- gls(AAPL_close ~ time(AAPL_close), correlation = corAR1(form=~1))
summary(m2.gls)
Generalized least squares fit by REML
  Model: AAPL_close ~ time(AAPL_close) 
  Data: NULL 
       AIC     BIC    logLik
  5478.647 5499.19 -2735.324

Correlation Structure: AR(1)
 Formula: ~1 
 Parameter estimate(s):
      Phi 
0.9999998 

Coefficients:
                    Value Std.Error   t-value p-value
(Intercept)      68.90127  3174.416 0.0217052  0.9827
time(AAPL_close)  0.08198     0.060 1.3647854  0.1726

 Correlation: 
                 (Intr)
time(AAPL_close) -0.012

Standardized residuals:
         Min           Q1          Med           Q3          Max 
-0.009293359 -0.001385895  0.002775650  0.006557553  0.020775575 

Residual standard error: 3174.415 
Degrees of freedom: 1258 total; 1256 residual

We test the justification of the added autoregressive structure using a likelyhood ratio test. With a very small p value, we can reject the null hypothesis that the added term is not necessary.

Code
m2.gls.0 <- update(m2.gls, correlation=NULL)
anova(m2.gls, m2.gls.0) # AR(1) vs Uncorrelated errors
         Model df       AIC      BIC    logLik   Test  L.Ratio p-value
m2.gls       1  4  5478.647  5499.19 -2735.324                        
m2.gls.0     2  3 10941.635 10957.04 -5467.817 1 vs 2 5464.988  <.0001

Fitting an AR(1) model using the ar function confirms the coefficient of approxmiately 1 (0.997).

Code
m2.ar1 <- ar(AAPL_close)
m2.ar1_fitted <- AAPL$close - residuals(m2.ar1)
m2.ar1

Call:
ar(x = AAPL_close)

Coefficients:
    1  
0.997  

Order selected 1  sigma^2 estimated as  8.862

We can view the fitted of the GLS and AR(1) models side by side.

Code
par(mfrow=c(1,2))
m2.gls_fitted <- AAPL_close - residuals(m2.gls)
plot(AAPL$close, type = "l", col = 4, lty = 1, main="GLS")
points(m2.gls_fitted, type = "l", col = 2, lty = 2)

plot(AAPL$close, type = "l", col = 4, lty = 1, main="AR(1)")
points(m2.ar1_fitted, type = "l", col = 2, lty = 2)

We noted earlier that the slowly decaying structure of the ACF suggested that the series is not stationary. We verify this claim using the Dickey Fuller tests below. The null hypothesis that the series is not stationary is not rejected for the original series, but rejected for the once differentiated series. Therefore, the original series is not stationary and should be differentiated to stabilize the mean.

Dickey Fuller Test on time series:

Code
adf.test(diff(AAPL$close, differences=1))

    Augmented Dickey-Fuller Test

data:  diff(AAPL$close, differences = 1)
Dickey-Fuller = -9.8502, Lag order = 10, p-value = 0.01
alternative hypothesis: stationary

Dickey Fuller Test on differentiated time series:

Code
adf.test(diff(AAPL$close, differences=1))

    Augmented Dickey-Fuller Test

data:  diff(AAPL$close, differences = 1)
Dickey-Fuller = -9.8502, Lag order = 10, p-value = 0.01
alternative hypothesis: stationary

ARIMA ARIMA models allows to combine the AR structure seen above with differentiation and moving average terms.

Here we look at the first two differentials. The higher order differentials have mean 0 but the variance seems to increase in the last part of the series.

Order 1 differential The order 1 differential shows the highest correlation at position 0 and a significant lag at position 6, while the partial ACF seenms to show periodic behavior with a few significant lags at 7 and 8.

Code
APPLdiff1 <- diff(AAPL$close, differences=1)
par(mfrow=c(2,2))
acf(APPLdiff1, lag.max=20)             
pacf(APPLdiff1, lag.max=20)
plot.ts(APPLdiff1)

Order 2 differential On the other hand, the order two differential has several meaningful lags at position 1 and 7, and the partial ACF decays and stays significant until lag 6.

Code
APPLdiff2 <- diff(AAPL$close, differences=2)
par(mfrow=c(2,2))
acf(APPLdiff2, lag.max=20)             
pacf(APPLdiff2, lag.max=20)
plot.ts(APPLdiff2)

Form these observations, we can study the following ARMA (autoregressive moving average) models are possible for the differentiated time series:

First Order ARIMA(7,1,0): AR(7) high order on the autoregressive structure First Order ARIMA(0,1,2): MA(1) moving average structure First Order ARIMA(1,1,1): ARMA(1,1) combination

Second Order ARIMA(1,2,0): AR(1) autoregressive structure Second Order ARIMA(0,2,1): MA(1) moving average structure Second Order ARIMA(6,2,0): AR(6) autoregressive structure Second Order ARIMA(6,2,1): ARMA(6,1) combination

Code
arimadf <- tibble(model=character(), coef=character(), loglik=numeric(), aic=numeric()) 

arimamodels <- function(p,d,q) {
  # A specification of the non-seasonal part of the ARIMA model: the three components (p, d, q) are the AR order, the degree of differencing, and the MA order. ARMA(p,q)
  arima.model <- arima(AAPL$close, order = c(p,d,q))
  arima.model_fitted <- AAPL$close - residuals(arima.model)
  df <- tibble(model=paste0(p,d,q), loglik=arima.model$loglik, aic=arima.model$aic)
  return.list <- list(model=arima.model,fitted=arima.model_fitted, df=df)
  return(return.list)
}

Model Results The results of the ARIMA models listed above are printed here. We find that the model with the lowest AIC is the first order differential AR(7) model with a value of 5466. However is is also a complex model with 8 parameters. A smaller model with marginally decreased AIC is preffered. We can choose for the any of the remaining 3 parameter models. We select the second order MA(1) model at the best model.

Code
# First order models
m710 <- arimamodels(p=7, d=1, q=0)
arimadf <- rbind(arimadf, m710$df)
m012 <- arimamodels(p=0, d=1, q=2)
arimadf <- rbind(arimadf, m012$df)
m111 <- arimamodels(p=1, d=1, q=1)
arimadf <- rbind(arimadf, m111$df)

# Second order models
m120 <- arimamodels(p=1, d=2, q=0)
arimadf <- rbind(arimadf, m120$df)
m021 <- arimamodels(0, 2, 1)
arimadf <- rbind(arimadf, m021$df)
m620 <- arimamodels(6, 2, 0)
arimadf <- rbind(arimadf, m620$df)
m621 <- arimamodels(6, 2, 1)
arimadf <- rbind(arimadf, m621$df)

arimadf
# A tibble: 7 × 3
  model loglik   aic
  <chr>  <dbl> <dbl>
1 710   -2725. 5467.
2 012   -2733. 5471.
3 111   -2733. 5473.
4 120   -3005. 6014.
5 021   -2735. 5475.
6 620   -2791. 5596.
7 621   -2732. 5481.

The coefficients for the best model are listed below and we can also see from the graph that the model is a good fit.

Code
m021$model

Call:
arima(x = AAPL$close, order = c(p, d, q))

Coefficients:
         ma1
      -1.000
s.e.   0.003

sigma^2 estimated as 4.536:  log likelihood = -2735.32,  aic = 5474.65
Code
plot(AAPL$close, type = "l", col = 4, lty = 1)
points(m021$fitted, type = "l", col = 2, lty = 2)

We can now use the model to forecast the price of the stock. In the example below the price is predicted for the next 150 trading days. Confidence intervals (shown in blue) are provided at 80% and 95% levels.

Code
AAPLforecast <- forecast(m021$model, h=150)
autoplot(AAPLforecast)

Code
arimaforecast <- function(stock, ticker) {
  # Subset the training data, fit model and get fitted valyes
  training_subset <- stock$close[1:round(0.7*length(AAPL$close))] 
  model <- arima(training_subset, order = c(0, 2, 1))
  fitted_values <- training_subset - residuals(model)
  
  boundary <- round(length(stock$close)*.7)
  end <- length(stock$close)
  data <- tibble(date=stock$date, price=stock$close, prediction = NA, fitted=NA, Low95 = NA,
         Low80 = NA,
         High95 = NA,
         High80 = NA)
  stocksubsetforecast <- forecast(model, h=377)
  forecast_df <- fortify(as.data.frame(stocksubsetforecast)) %>% as_tibble()
  forecast_df <- forecast_df %>%
    rename("Low95" = "Lo 95",
           "Low80" = "Lo 80",
           "High95" = "Hi 95",
           "High80" = "Hi 80",
           "Forecast" = "Point Forecast")
  data$fitted <- c(as.vector(fitted_values),rep(NA,(end-boundary)))
  data$prediction <- c(rep(NA,boundary),forecast_df$Forecast)
  data$Low80 <- c(rep(NA,boundary),forecast_df$Low80)
  data$High80 <- c(rep(NA,boundary),forecast_df$High80)
  data$Low95 <- c(rep(NA,boundary),forecast_df$Low95)
  data$High95 <- c(rep(NA,boundary),forecast_df$High95)
  
  ggplot(data, aes(x = date)) + 
    geom_ribbon(aes(ymin = Low95, ymax = High95, fill = "95%")) +
    geom_ribbon(aes(ymin = Low80, ymax = High80, fill = "80%")) +
    geom_point(aes(y = price, colour = "price"), size = 1) +
    geom_line(aes(y = price, group = 1, colour = "price"), 
              linetype = "dotted", size = 0.75) +
    geom_line(aes(y = fitted, group = 2, colour = "fitted"), size = 0.75) +
    geom_line(aes(y = prediction, group = 3, colour = "prediction"), size = 0.75) +
    scale_x_date(breaks = scales::pretty_breaks(), date_labels = "%b %y") +
    scale_colour_brewer(name = "Legend", type = "qual", palette = "Dark2") +
    scale_fill_brewer(name = "Intervals") +
    guides(colour = guide_legend(order = 1), fill = guide_legend(order = 2)) +
    theme_bw(base_size = 14) +
    ggtitle(ticker)
}

With a well-fitting ARIMA model in place, we evaluate its predictive performance by splitting the data into a training set (first 70% of observations) and a test set (remaining 30%). The model is trained on the historical portion and then used to forecast future stock prices, which are compared against the actual observed values.

For AAPL, the observed prices generally remain within the 80% and 95% confidence intervals generated by the model. While the forecast captures the overall trend reasonably well, the actual prices tend to be slightly lower than the predicted trajectory, indicating a modest upward bias in the fitted trend.

To assess generalizability, the same ARIMA structure is applied to other stocks, including Microsoft (MSFT) and IBM. For Microsoft, actual prices tend to lie near the upper bound of the 95% confidence interval and occasionally exceed it, suggesting that the model underestimates stronger upward movements. In contrast, IBM shows better alignment with the forecasted values, with observed prices staying more consistently within the predicted range. Overall, the model performs more reliably for IBM, while showing greater deviation for Microsoft, highlighting differences in how well the ARIMA structure generalizes across different stocks.

Code
arimaforecast(AAPL,'AAPL')

Code
MSFT <- dataset %>% filter(ticker=='MSFT') 
IBM <- dataset %>% filter(ticker=='IBM') 
Code
arimaforecast(MSFT,'MSFT')

Code
arimaforecast(IBM,'IBM')

Model 3 (Panel Regression)

The third modeling approach uses panel regression to analyze stock behavior across both time and multiple companies. This dataset is well suited for panel methods because it contains observations for several stocks over a common time period, allowing the model to capture both cross-sectional differences between companies and temporal dynamics within each stock. The underlying assumption is that stock prices are correlated over time within each company, while different companies remain independent of one another. Additionally, the dataset does not include time-invariant regressors.

Before model estimation, the dataset is balanced to ensure that all stocks are observed over the same trading dates. This step is necessary because unbalanced panel data can lead to biased or inconsistent estimates. As a result, some observations are removed so that each stock has complete coverage across identical time periods, ensuring comparability across the panel structure.

Code
dataset2 <- dataset %>% select(-label)
is.pbalanced(dataset2)
[1] FALSE
Code
dataset2 <- make.pbalanced(dataset2,balance.type = "shared.times")
is.pbalanced(dataset2)
[1] TRUE

Create Predictors For all stock observations in our dataset, we will create the engineered features: MACD, RSI, m5, m20, v1, and v5

Code
create_predictors <- function(df){
  df_full <- df %>% mutate(macd=MACD(Cl(df), nFast=12, nSlow=26, nSig=9, maType=SMA)[,1], 
                        m5=momentum(df$close, n=5), 
                        m20=momentum(df$close, n=20), 
                        rsi=RSI(df$close, n=14),
                        vol1=ROC(df$volume, n=1),
                        vol5=ROC(df$volume, n=5))
  return(df_full)
}

balanced_tickers <- unique(dataset2$ticker)
dataset2_full <- dataset2 %>% filter(ticker=="AAPL") %>% create_predictors() 

for(i in balanced_tickers[2:length(balanced_tickers)])
{
 dataset2_full <- rbind(dataset2_full, dataset2 %>% filter(ticker==i) %>% create_predictors())
}

dataset2_full %>% filter(date=="2014-05-21")
# A tibble: 29 × 18
   date       ticker  open  high   low close   volume unadjustedVolume change
   <date>     <chr>  <dbl> <dbl> <dbl> <dbl>    <dbl>            <dbl>  <dbl>
 1 2014-05-21 AAPL    79.4  79.7  79.1  79.7 49249914          7035702  0.210
 2 2014-05-21 AXP     81.4  81.9  81.2  81.7  2525553          2525553  0.716
 3 2014-05-21 BA     115.  116.  115.  116.   2476186          2476186  1.22 
 4 2014-05-21 CAT     87.4  88.7  87.4  88.6  5460511          5460511  1.14 
 5 2014-05-21 CSCO    20.8  21.2  20.8  21.0 65262617         65262617  0.309
 6 2014-05-21 CVX    101.  103.  101.  102.   4758794          4758794  1.37 
 7 2014-05-21 DIS     75.8  76.7  75.8  76.6  4736535          4736535  1.02 
 8 2014-05-21 GS     147.  150.  147.  149.   4226325          4226325  2.81 
 9 2014-05-21 HD      70.7  70.9  70.2  70.5  6693930          6693930  0.108
10 2014-05-21 IBM    156.  157.  155.  156.   2988348          2988348  1.26 
# ℹ 19 more rows
# ℹ 9 more variables: changePercent <dbl>, vwap <dbl>, changeOverTime <dbl>,
#   macd <dbl>, m5 <dbl>, m20 <dbl>, rsi <dbl>, vol1 <dbl>, vol5 <dbl>

Test and Train datasets Create test and training datasets for prediction purposes, and ensure both datasets are balanced

Code
stock_dates <- unique(dataset2_full$date)

test_dates <- tail(stock_dates, 250)
train_dates <- head(stock_dates, length(stock_dates)-250)

test_stocks <- dataset2_full %>% dplyr::filter(date %in% test_dates)
train_stocks <- dataset2_full %>% dplyr::filter(date %in% train_dates)
is.pbalanced(test_stocks)
[1] TRUE
Code
is.pbalanced(train_stocks)
[1] TRUE
Code
train_stocks <- train_stocks %>% drop_na()
panel_stocks <- pdata.frame(train_stocks, index = c("ticker", "date"))

Model 3A: Pooled OLS A Pooled model on panel data is applies the Ordinary Least Squares technique on the data. It is the most restrictive panel model as cross-sectional dimensions are ignored: yit=α+βxit+uit

Code
stocks_m3_pooled <- plm(close ~ macd + m5 + m20,  data = panel_stocks, model = "pooling")
summary(stocks_m3_pooled)
Pooling Model

Call:
plm(formula = close ~ macd + m5 + m20, data = panel_stocks, model = "pooling")

Balanced Panel: n = 29, T = 983, N = 28507

Residuals:
    Min.  1st Qu.   Median  3rd Qu.     Max. 
-98.2878 -31.1645  -5.8626  21.9314 280.1262 

Coefficients:
            Estimate Std. Error t-value  Pr(>|t|)    
(Intercept) 86.36634    0.25985 332.376 < 2.2e-16 ***
macd        -8.03873    0.25224 -31.869 < 2.2e-16 ***
m5          -1.87150    0.12322 -15.188 < 2.2e-16 ***
m20          5.02074    0.10047  49.975 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Total Sum of Squares:    58494000
Residual Sum of Squares: 52403000
R-Squared:      0.10414
Adj. R-Squared: 0.10404
F-statistic: 1104.43 on 3 and 28503 DF, p-value: < 2.22e-16
Code
Pooled_Model <- glance(stocks_m3_pooled)

Fixed Effect Models In Fixed affects models we will assume there is an unobserved heterogeneity across the stocks such as each company’s core competency, or anything else that is unique to each company and thus something unobserved factored into each company’s stock price. This heterogeneity (αi ) is not known but we would want to investigate it’s correlation with the created predictor variables see it’s impact on the closing price: yit=αi+βxit+uit

Model 3B: Fixed Model - Within Estimator

Code
stocks_m3_within <- plm(close ~ macd + m5 + m20 + rsi + vol5,  data = panel_stocks, model = "within")
summary(stocks_m3_within)
Oneway (individual) effect Within Model

Call:
plm(formula = close ~ macd + m5 + m20 + rsi + vol5, data = panel_stocks, 
    model = "within")

Balanced Panel: n = 29, T = 983, N = 28507

Residuals:
     Min.   1st Qu.    Median   3rd Qu.      Max. 
-64.76738  -8.82872  -0.75545   6.23842 204.90320 

Coefficients:
      Estimate Std. Error  t-value  Pr(>|t|)    
macd -3.473877   0.130298 -26.6609 < 2.2e-16 ***
m5   -1.048252   0.065263 -16.0619 < 2.2e-16 ***
m20   2.270449   0.045208  50.2219 < 2.2e-16 ***
rsi   0.185650   0.017343  10.7047 < 2.2e-16 ***
vol5  0.739358   0.237983   3.1068  0.001893 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Total Sum of Squares:    11677000
Residual Sum of Squares: 10039000
R-Squared:      0.14026
Adj. R-Squared: 0.13926
F-statistic: 929.038 on 5 and 28473 DF, p-value: < 2.22e-16
Code
Fixed_Model <- glance(stocks_m3_within)

Model 3C: Fixed Model - First Difference Estimator The First difference estimator uses period changes for each stock, where the individual specific effects (unobserved heterogeneity) is canceled out: yit−yi,t−1=β(xit−xi,t−1)+(eit−ei,t−1)

Looking at the result, we see that while the adjusted R2 is fairly high compared to the fitted values are nonsensical

Code
stocks_m3_fd <- plm(close ~ macd + m5 + m20 + rsi + vol1 + vol5,  data = panel_stocks, model = "fd")
summary(stocks_m3_fd)
Oneway (individual) effect First-Difference Model

Call:
plm(formula = close ~ macd + m5 + m20 + rsi + vol1 + vol5, data = panel_stocks, 
    model = "fd")

Balanced Panel: n = 29, T = 983, N = 28507
Observations used in estimation: 28478

Residuals:
     Min.   1st Qu.    Median   3rd Qu.      Max. 
-7.214443 -0.244970 -0.011041  0.233413 10.271243 

Coefficients:
               Estimate  Std. Error  t-value  Pr(>|t|)    
(Intercept)  0.05347588  0.00323057  16.5531 < 2.2e-16 ***
macd         0.23063393  0.01368203  16.8567 < 2.2e-16 ***
m5           0.21175171  0.00241756  87.5890 < 2.2e-16 ***
m20          0.21845909  0.00256222  85.2617 < 2.2e-16 ***
rsi          0.11329595  0.00093788 120.8004 < 2.2e-16 ***
vol1        -0.02273702  0.00690079  -3.2948  0.000986 ***
vol5         0.01974797  0.00788665   2.5040  0.012286 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Total Sum of Squares:    42088
Residual Sum of Squares: 8460.4
R-Squared:      0.79899
Adj. R-Squared: 0.79894
F-statistic: 18860.9 on 6 and 28471 DF, p-value: < 2.22e-16
Code
Fixed_Model_FD <- glance(stocks_m3_fd)
fitted.values(stocks_m3_fd)[1:10]
 [1] -0.2416019  0.5803779  0.2802988 -0.7070574 -1.3107559 -1.3498763
 [7] -0.3178371  1.1958132 -0.8536587 -0.2521375

Model 3D: Random Effect Model In Random affects models assumes no fixed effects.The individual specific effects (unobserved heterogeneity) are not correlated with the predictor variables and are independent of the predictor variables. This would result in the assumption that any residual variation on the dependent variable ( closing price) is random and should randomly distributed with the error term: yit=βxit+(αi+eit)

Code
stocks_m3_random <- plm(close ~ macd + m5 + m20 + rsi + vol5,  data = panel_stocks, model = "random")
summary(stocks_m3_random)
Oneway (individual) effect Random Effect Model 
   (Swamy-Arora's transformation)

Call:
plm(formula = close ~ macd + m5 + m20 + rsi + vol5, data = panel_stocks, 
    model = "random")

Balanced Panel: n = 29, T = 983, N = 28507

Effects:
                 var std.dev share
idiosyncratic 352.59   18.78 0.386
individual    560.33   23.67 0.614
theta: 0.9747

Residuals:
    Min.  1st Qu.   Median  3rd Qu.     Max. 
-62.2882  -8.8892  -1.3109   5.9865 206.3854 

Coefficients:
             Estimate Std. Error  z-value  Pr(>|z|)    
(Intercept) 77.366926   4.492012  17.2232 < 2.2e-16 ***
macd        -3.477005   0.130416 -26.6609 < 2.2e-16 ***
m5          -1.048818   0.065322 -16.0561 < 2.2e-16 ***
m20          2.272307   0.045249  50.2183 < 2.2e-16 ***
rsi          0.185538   0.017358  10.6886 < 2.2e-16 ***
vol5         0.739583   0.238199   3.1049  0.001903 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Total Sum of Squares:    11707000
Residual Sum of Squares: 10067000
R-Squared:      0.14006
Adj. R-Squared: 0.13991
Chisq: 4642.05 on 5 DF, p-value: < 2.22e-16
Code
Random_Model <- glance(stocks_m3_random)

Panel Model Testing & Selection Below is a summary glance of the 4 Panel models run on the stocks dataset. Choosing between either a pooled, fixed effect, or random effect model requires running Breusch–Pagan Lagrange Multiplier and Hausman tests. Of the 4 models below, we will discard the First Difference model due to its nonsensical fitted values

Code
Panel_Model_Summary <- data.frame(Pooled_Model)

Panel_Model_Summary <- rbind(Panel_Model_Summary, Fixed_Model)
Panel_Model_Summary <- rbind(Panel_Model_Summary, Fixed_Model_FD)
Panel_Model_Summary <- rbind(Panel_Model_Summary, Random_Model)


rownames(Panel_Model_Summary) <- c("Pooled Model", "Fixed Model", "First Difference Model", "Random Model")
Panel_Model_Summary
                       r.squared adj.r.squared  statistic p.value    deviance
Pooled Model           0.1041384     0.1040441  1104.4329       0 52402960.63
Fixed Model            0.1402610     0.1392645   929.0379       0 10039213.31
First Difference Model 0.7989854     0.7989431 18860.9173       0     8460.39
Random Model           0.1400610     0.1399101  4642.0489       0 10067303.75
                       df.residual  nobs
Pooled Model                 28503 28507
Fixed Model                  28473 28507
First Difference Model       28471 28478
Random Model                 28501 28507

The Breusch–Pagan Lagrange Multiplier Test (LM Test) is used to test for heteroskedasticity in a linear regression model. The null hypothesis assumes homoskedasticity and in the alternate hypothesis heteroskedasticity is assumed.

First, we test for Random Effects against OLS. From the test we see that since the p-value is near 0, we reject the null hypothesis. The individual specific effects (unobserved heterogeneity) of each stock are significant and therefore we should not use the Pooled OLS model.

Code
plmtest(stocks_m3_pooled, effect = "individual")

    Lagrange Multiplier Test - (Honda)

data:  close ~ macd + m5 + m20
normal = 2918.2, p-value < 2.2e-16
alternative hypothesis: significant effects

We test the Fixed Effects model against OLS Model, we again see that we reject the null for the alternative hypothesis as the test suggest more support for the Fixed Effect model

Code
pFtest(stocks_m3_within, stocks_m3_pooled)

    F test for individual effects

data:  close ~ macd + m5 + m20 + rsi + vol5
F = 4005, df1 = 30, df2 = 28473, p-value < 2.2e-16
alternative hypothesis: significant effects

Based on the results from the previous tests, the pooled OLS model can be excluded because it ignores both cross-sectional differences and time-series variation in the data. The next step involves comparing the fixed effects and random effects models using the Hausman test, which evaluates whether the estimators are consistent.

The Hausman test is based on the assumption that if there is no correlation between the independent variables and the individual-specific effects, both fixed effects and random effects models are consistent, although the random effects model is more efficient. However, if such a correlation exists, the random effects estimator becomes inconsistent, while the fixed effects estimator remains valid.

In this case, the null hypothesis is rejected, indicating that the assumption of no correlation does not hold. Therefore, the random effects model is not appropriate, and the fixed effects model is selected as the preferred specification for the panel data analysis.

Code
phtest(stocks_m3_random, stocks_m3_within)

    Hausman Test

data:  close ~ macd + m5 + m20 + rsi + vol5
chisq = 1.078, df = 5, p-value = 0.956
alternative hypothesis: one model is inconsistent

Discussion and Conclusion

In this paper, we implemented models using logistic regression, panel regression, and auto-regressive models using ARIMA.

The logistic regression models aimed at predicting the direction of next-day stock movements produced weak results overall. Very few predictors were statistically significant, and classification accuracy remained around 52%, which is only slightly better than random guessing. Attempts to address multicollinearity did not meaningfully improve performance, leading to the conclusion that logistic regression was not well suited for this task within the current feature set. Potential improvements could include expanding the dataset to include multiple stocks or redefining the target variable to capture longer-term price movements.

In contrast, the autoregressive time series models proved to be more effective for forecasting stock prices. These models rely solely on historical price data and identify patterns based on past values. Initial linear analysis suggested that an AR(1) structure, where the current price depends on the previous value, could be appropriate. However, the series exhibited non-stationarity, requiring differencing to stabilize the mean and satisfy modeling assumptions. After extending the model to include differencing and moving average components, multiple ARIMA configurations were evaluated. Simpler models performed more consistently, with a second-order differencing model combined with an MA(1) structure selected as the best-performing approach. Validation using training and test splits showed that the ARIMA model generally captured price movements within its confidence intervals, although the intervals remained relatively wide and primarily reflected trend behavior rather than exact price prediction.

For panel regression, three main approaches were compared: pooled OLS, fixed effects, and random effects models. A first-differences estimator was also tested within the fixed effects framework but was ultimately excluded due to unrealistic fitted values. Based on diagnostic testing, including the Breusch–Pagan Lagrange Multiplier test and the Hausman test, the fixed effects model was determined to be the most appropriate, as it accounts for unobserved heterogeneity across companies that may influence stock prices. However, the fixed effects model struggled when applied to the test dataset due to multicollinearity issues in the engineered features, suggesting that alternative feature construction may be required for better out-of-sample performance.

Overall, among the three approaches considered, the ARIMA-based autoregressive models provided the most stable and interpretable results. Each modeling framework serves a different purpose: logistic regression focuses on direction classification, panel regression estimates cross-sectional price behavior, and ARIMA captures temporal structure within individual stock series. Given the inherent difficulty of predicting daily stock prices, ARIMA models are the most suitable in this study for capturing overall trends and providing meaningful forecast ranges.

Code
install_load <- function(pkg){
  # Load packages & Install them if needed.
  new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
  if (length(new.pkg)) install.packages(new.pkg, dependencies = TRUE)
  sapply(pkg, require, character.only = TRUE, quietly = TRUE, warn.conflicts = FALSE)
}