1 Introduction

1.1 Motivation

Context

The stock market refers to the several exchanges where the buying and selling of stocks of publicly held companies. The participants(investors) in the stock market form part of the stock markets and the main reason for participation gravitates around growing their money or preserving the value of the money that they have. The decision to buy or sell a particular stock at a given time is influenced by various factors, among which is the prevailing price of the stock at the time of participation. This report aims to understand the influence of the identified independent variables on the closing price of the stock.

Problem

From the various economic theories, it’s general knowledge that the price of any commodity is influenced by the demand and supply forces within the marketplace. A number of studies have been done to explain the influence of various factors however there is no standard equation that tells of how the stock price will behave in response to the various changes in the marketplace. This study seeks to understand the factors that influence the stock price for National Bank of Canada.
National Bank of Canada provides various financial products and services to retail, commercial, corporate, and institutional clients in Canada and internationally. It operates through four segments: Personal and Commercial, Wealth Management, Financial Markets, and U.S. Specialty Finance and International.

1.2 Objectives

Overview

The main objective of this project was to understand the fundamental and technical factors that influence the stock price for National Bank of Canada based on historical performance data.

Goals & Research Questions

The goal of the data modelling is to unearth the influence of the various technical and fundamental variables with the price of the stock. The research set out to answer the major question: What is the best model that explains the stock price given a set of fundamental and technical factors?

The fundamental factors are economic and financial factors that influence the intrinsic value of an investment. Examples of such factors include the Consumer price index (CPI), Earnings per share (EPS) for the company, and the Profit earnings ratio (P/E). The technical factors explain the trading activity of the stock. Price, volume movements are technical indicators that were investigated in this research.

2 Methodology

2.1 Data

The data were collected in a CSV format from Open Data sources.

The stock price information for National Bank of Canada, the exchange rates and market index was sourced from Yahoo finance (National Bank of Canada (NA.TO) Stock Historical Prices & Data, n.d.). The historical earnings per share data was sourced from Macrotrends (National Bank Of Canada EPS - Earnings per Share 2010-2022 | NTIOF, n.d.). The consumer price index information for Canada was sourced from Statistics Canada (Consumer Price Index, Monthly, Not Seasonally Adjusted, 2023). The monthly data spans a duration of 2012 to 2022 and was consolidated into one dataframe of 93 rows for analysis.

The dataset has the variables Close, Volume, CPI, EPS, P.E, USDCAD price and SP500 price. The response variable for the research is the Close while the predictors are the Volume, CPI, EPS, PE and USD/CAD rate.

Close : this is the closing price of the National Bank of Canada stock for each day on the Toronto stock market. The price is quoted in Canadian Dollars.

Volume: this is the number of National Bank of Canada stocks that are traded on the stock exchange on the given day. There is no unit of measurement for the volume

CPI : the consumer price index is a measure of the average change over time in the prices paid by urban consumers for a market basket of consumer goods and services. It is an accurate representation of the inflation changes over the years.

EPS : the earnings per share (EPS) is an indicator of the financial health of the National Bank of Canada. It is calculated as the company’s profit divided by the outstanding number of its common stock. This is an indicator of the company’s profitability and the results are released on a quarterly basis.

PE : this is the price earnings ratio is calculated by taking the latest closing price and dividing it by the most recent earnings per share (EPS) number. The PE is used as a valuation measure to assess whether a stock is over or undervalued.

USDCAD price : this is the exchange rate quoted for the United States Dollar (USD) Canadian Dollar (CAD) currency pair. The rate tells of how many Canadian dollars are needed to purchase one U.S dollar. The price is quoted on a daily basis and the closing price was considered for this study.

SP500 price: this is the closing price quoted for the S&P 500 index. The S&P 500 is a market capitalisation weighted index of large-cap stocks. It has 500 constituents that represent a diverse set of companies from multiple companies. The price is quoted in USD.

# read the data
bank = read.csv("bankofcanadaCP.csv")
head(bank,10)

##          Date SP500Price USDCADPrice  Close  Volume   CPI  EPS       PE
## 1  12/30/2022    3839.50      1.3549 127.30 3545800 153.1 1.83 69.56284
## 2  12/29/2022    3849.28      1.3546 128.66 3256100 153.1 1.83 70.30601
## 3  12/28/2022    3783.22      1.3608 127.71 3163800 153.1 1.83 69.78688
## 4  12/23/2022    3844.82      1.3598 128.28 2134600 153.1 1.83 70.09836
## 5  12/22/2022    3822.39      1.3646 127.73 2441100 153.1 1.83 69.79782
## 6  12/21/2022    3878.44      1.3609 128.67 2183800 153.1 1.83 70.31147
## 7  12/20/2022    3821.62      1.3609 127.50 3750500 153.1 1.83 69.67213
## 8  12/19/2022    3817.66      1.3642 126.76 2569100 153.1 1.83 69.26776
## 9  12/16/2022    3852.36      1.3700 127.96 6758500 153.1 1.83 69.92350
## 10 12/15/2022    3895.75      1.3658 128.22 1972700 153.1 1.83 70.06557

Converting date from string to data type:

# Convert the Date column to a Date object using mdy() function from lubridate
bank$Date <- mdy(bank$Date)

The SP500Price, USDCADPrice, Close and Volume parameters are collected on a daily basis. However, CPI and EPS values are released on a monthly basis. In order to align the information, the information for the last day of the month is taken:

df_last_day <- bank %>%
  
  filter(day(Date) == days_in_month(Date))
head(df_last_day)

##         Date SP500Price USDCADPrice  Close  Volume   CPI  EPS       PE
## 1 2022-11-30    4080.11      1.3409 133.78 5769800 154.0 1.83 73.10382
## 2 2022-10-31    3871.98      1.3622 126.05 3120000 153.8 1.83 68.87978
## 3 2022-09-30    3585.62      1.3826 124.37 1982500 152.7 1.83 67.96175
## 4 2022-08-31    3955.00      1.3127 122.13 2395800 152.6 1.83 66.73770
## 5 2022-06-30    3785.38      1.2872 124.64 2576600 152.9 2.01 62.00995
## 6 2022-05-31    4132.15      1.2644 132.17 4224100 151.9 2.01 65.75622

2.2 Approach

The approach is to start with a full multiple linear regression model, with all the variables included. A full model test is performed to confirm whether the multiple linear regression model is useful. Based on the outcome of the individual t-tests, assessment of the predictors’ significance is done.

A first order model is then designed by assessing the contribution of a subset of predictors to the model. The choice of model at this stage is based on the adjusted R-squared, which measures how much of the variation in the response variable is explained by the changes in the predictor variables. To assess the best subsets of predictors in the first order model such measures as RMSE, Cp and AIC are computed and compared. The subset with the combination of the highest Adjusted R-squared, the lowest RMSE, Cp and AIC is preferred.

Based on the outcome of selecting the best first order model, the interaction terms are assessed and added to the main effects to improve the model. The judgement of the significance of the interaction terms within the model is based on the p-values of each of the coefficients as well as the values of Adjusted R-squared and RMSE. The interaction terms whose p-value is insignificant are dropped from the model and model is reassessed.

Once the best first order model with interaction terms is selected, an analysis of the residuals is done. This entails testing the assumptions below:

Linearity Assumption: assumes that there is a straight line relationship between the predictors and the response variable.
Equal Variance assumption: the error terms have a constant variance.
Normality assumption: the residuals of the regression should be normally distributed.
Multicollinearity: check for any linearly correlated independent variables.
Outliers: assess the presence of influential data points which can affect the regression results

The results of each of the assumptions is based on to fine tune the model and conclude on the best multiple linear model to predict the National Bank of Canada stock price.

2.3 Workflow

The next steps are going to be done to find the best regression model:

Evaluating overall model utility by using a global F test
Performing Individual Coefficients Test (t-test) to check how significant their influence on response variable
Finding the best first-order model comparing such measures as Adjusted R-squared, RMSE, Cp and - - AIC
Finding the best first-order model with interaction terms, using individual t-test, Adjusted R-squared, and RMSE
Testing the Linearity assumption by plotting the residuals versus predicted values.
Checking the homoscedasticity by creating the Scale-Location plot and doing the Breusch-Pagan test.
Examining the Q-Q plot of residuals and the results of Shapiro Wilk test to evaluate if the errors between observed and fitted values are normally distributed.
Checking if the problem of multicollinearity exists by using VIFs and scatterplots showing correlation between predictors.
Using Cook’s distance and Leverage plots to find influential cases of outliers.
Analysing how to solve the problems arising from results of testing the assumptions if applicable.
Fitting the final model which pretends to the highest prediction accuracy among other models analysed

2.4 Workload Distribution

The project work was equally distributed between three team members:

Predictors, first order model, interaction and higher-order terms selection tasks. Making intermediate conclusions.
Testing for the linearity, the homoscedasticity, and the Normality assumptions. Drawing intermediate conclusions.
Testing for the multicollinearity assumptions and outliers. Making intermediate conclusions.

The common tasks which were done together:

Data gathering
Discussions about the topic chosen, methodology, steps of the project, future work suggestions
Analysing and choosing the final model
Making conclusions

3 Analysis

3.1 Build the first order model

Fit the model containing all six variables

We built a first order linear model with all the independent variables. This formed the reference point for the subsequent model selection procedures:

\(Y_{close} = \beta_{0} + \beta_{1}SP500Price + \beta_{2}USDCADPrice + \beta_{3}Volume + \beta_{4}CPI + \beta_{5}EPS + \beta_{6}PE\)

firstmodel<-lm(Close~SP500Price + USDCADPrice + Volume + CPI  + EPS + PE, data=df_last_day) #Full model
summary(firstmodel)

## 
## Call:
## lm(formula = Close ~ SP500Price + USDCADPrice + Volume + CPI + 
##     EPS + PE, data = df_last_day)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.3089  -3.6343  -0.2154   4.5781  10.3098 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.974e+01  2.128e+01  -1.867 0.065315 .  
## SP500Price   1.385e-02  1.902e-03   7.282 1.46e-10 ***
## USDCADPrice  3.219e+01  9.002e+00   3.576 0.000575 ***
## Volume      -7.634e-07  5.312e-07  -1.437 0.154310    
## CPI          1.435e-01  2.450e-01   0.586 0.559530    
## EPS          2.160e+01  5.640e+00   3.829 0.000244 ***
## PE           1.529e-01  4.816e-02   3.174 0.002084 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.665 on 86 degrees of freedom
## Multiple R-squared:  0.9441, Adjusted R-squared:  0.9402 
## F-statistic: 241.9 on 6 and 86 DF,  p-value: < 2.2e-16

Test the hypothesis for the full model i.e the test of overall significance (at the significance level = 0.05)

In order to asses the overall model utility, a global F test was done with hypothesis:

\(H_{0} : \beta_{1} = \beta_{2} = \beta_{3} = \beta_{4} = \beta_{5}= \beta_{6} = 0\)

\(H_{a}\) : at least one \(\beta_{i}\) is not zero (i = 1,2,3,4,5,6)

nullmodel<-lm(Close~1, data=df_last_day) #with only intersept
anova(nullmodel,firstmodel) #Comparison of null model and full model

## Analysis of Variance Table
## 
## Model 1: Close ~ 1
## Model 2: Close ~ SP500Price + USDCADPrice + Volume + CPI + EPS + PE
##   Res.Df   RSS Df Sum of Sq      F    Pr(>F)    
## 1     92 49353                                  
## 2     86  2760  6     46593 241.93 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can see that F-statistic = 241.93 with df = 2760 (p-value < 2.2e-16 < \(\alpha\) = 0.05). It indicates that we should clearly reject the null hypothesis \(H_{0}\). The large F-test suggests that at least one of the variables (SP500Price, USDCADPrice, Volume, CPI, EPS, PE) must be related to the stock close price. Based on the p-value of, we also have extremely strong evidence that at least one of the variables is associated with the close price.

Use Individual Coefficients Test (t-test) to find the best model

\(H_{0} : \beta_{i} = 0\)

\(H_{a} : \beta_{i}\) is not equal 0 (i=1,2,3,4,5,6)

summary(firstmodel)

## 
## Call:
## lm(formula = Close ~ SP500Price + USDCADPrice + Volume + CPI + 
##     EPS + PE, data = df_last_day)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.3089  -3.6343  -0.2154   4.5781  10.3098 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.974e+01  2.128e+01  -1.867 0.065315 .  
## SP500Price   1.385e-02  1.902e-03   7.282 1.46e-10 ***
## USDCADPrice  3.219e+01  9.002e+00   3.576 0.000575 ***
## Volume      -7.634e-07  5.312e-07  -1.437 0.154310    
## CPI          1.435e-01  2.450e-01   0.586 0.559530    
## EPS          2.160e+01  5.640e+00   3.829 0.000244 ***
## PE           1.529e-01  4.816e-02   3.174 0.002084 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.665 on 86 degrees of freedom
## Multiple R-squared:  0.9441, Adjusted R-squared:  0.9402 
## F-statistic: 241.9 on 6 and 86 DF,  p-value: < 2.2e-16

The individual T-tests of the first-order model returned significant p-values for the independent variables SP500Price, USDCADPrice, PE and EPS (< 2e-16), which provided sufficient evidence against the null hypothesis only for four variables. The volume and CPI on the other hand weren’t significant and were subsequently dropped.

Select the significant predictors for the first-order model

reducedmodel<-lm(Close~SP500Price + USDCADPrice + EPS + PE, data=df_last_day) #reduced model
summary(reducedmodel)

## 
## Call:
## lm(formula = Close ~ SP500Price + USDCADPrice + EPS + PE, data = df_last_day)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.9297  -3.7728   0.2145   4.5945  10.5450 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -28.206378   9.318896  -3.027  0.00324 ** 
## SP500Price    0.014529   0.001727   8.412 6.66e-13 ***
## USDCADPrice  33.991307   7.552184   4.501 2.06e-05 ***
## EPS          22.621144   4.606398   4.911 4.15e-06 ***
## PE            0.152587   0.044962   3.394  0.00104 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.677 on 88 degrees of freedom
## Multiple R-squared:  0.9425, Adjusted R-squared:  0.9399 
## F-statistic: 360.9 on 4 and 88 DF,  p-value: < 2.2e-16

The individual T-tests of the reduced first order model returned significant p-values for the variables. Based on this, we concluded that the SP500price, USDCADprice, PE, and EPS are significant variables in predicting the stock price on their own:

\(Y_{close} = \beta_{0} + \beta_{1}SP500Price + \beta_{2}USDCADPrice + \beta_{3}EPS + \beta_{4}PE + \beta_{5}SP500Price:PE + \beta_{6}EPS:PE\)

Best subset of predictors for the first-order model based on the Adjusted R-squared, cp, AIC and RMSE

To further firm up on the best subset of the predictors to be included in the first order model, the Adjusted R-squared, RMSE, Cp and AIC values were computed. A higher adjusted R-squared and the smaller Cp, AIC, RMSE were preferred.

#Select the subset of predictors that do the best at meeting some well-defined criterion
stock=ols_step_best_subset(reducedmodel, details=TRUE)

# for the output (the Adjusted R-squared, Cp and AIC) interpretation
AdjustedR<-c(stock$adjr)
cp<-c(stock$cp)
AIC<-c(stock$aic)
cbind(AdjustedR,cp,AIC)

##      AdjustedR       cp      AIC
## [1,] 0.9158100 38.54004 622.2573
## [2,] 0.9247213 25.78707 612.8250
## [3,] 0.9328318 14.51725 603.1841
## [4,] 0.9399303  5.00000 593.7456

sigma<-c(firstmodel)
model1<-lm(Close~SP500Price, data=df_last_day)
model2<-lm(Close~SP500Price+ USDCADPrice, data=df_last_day)
model3<-lm(Close~SP500Price+ USDCADPrice+EPS, data=df_last_day)
model4<-lm(Close~SP500Price+ USDCADPrice+EPS+PE, data=df_last_day)

#compare RMSE values
variables<-c(1,2,3,4)
sigma<-c(sigma(model1),sigma(model2),sigma(model3),sigma(model4))
sigma_table <- data.frame(variables,sigma)
sigma_table

##   variables    sigma
## 1         1 6.720368
## 2         2 6.354756
## 3         3 6.002673
## 4         4 5.676629

The analysis showed that model with all four independent variables (SP500Price, USDCADPrice, EPS, PE) has the highest Adjusted R-squared of 0.9399, the least cp of 5, the least AIC of 593.7456, and the least RMSE of 5.6766.

Introduce Interaction terms into the model

\(H_{0} : \beta_{i} = 0\)

\(H_{a} : \beta_{i}\) is not equal 0 (i=1,2,3,4,5,6)

interactmodel <- lm(Close~(SP500Price + USDCADPrice + EPS + PE)^2, data=df_last_day)
summary(interactmodel)

## 
## Call:
## lm(formula = Close ~ (SP500Price + USDCADPrice + EPS + PE)^2, 
##     data = df_last_day)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -2.018e-08 -2.681e-09  1.770e-10  1.908e-09  3.177e-08 
## 
## Coefficients:
##                          Estimate Std. Error    t value Pr(>|t|)    
## (Intercept)            -6.542e-08  1.224e-07 -5.340e-01   0.5946    
## SP500Price             -5.979e-12  5.236e-11 -1.140e-01   0.9094    
## USDCADPrice             1.150e-08  1.056e-07  1.090e-01   0.9136    
## EPS                     5.626e-08  1.010e-07  5.570e-01   0.5790    
## PE                      6.927e-10  1.452e-09  4.770e-01   0.6345    
## SP500Price:USDCADPrice  2.215e-11  4.081e-11  5.430e-01   0.5888    
## SP500Price:EPS         -1.057e-12  4.853e-12 -2.180e-01   0.8281    
## SP500Price:PE          -3.735e-13  1.449e-13 -2.577e+00   0.0118 *  
## USDCADPrice:EPS        -5.054e-08  8.792e-08 -5.750e-01   0.5670    
## USDCADPrice:PE         -5.162e-11  1.141e-09 -4.500e-02   0.9640    
## EPS:PE                  1.000e+00  1.877e-10  5.328e+09   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.886e-09 on 82 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 1.041e+20 on 10 and 82 DF,  p-value: < 2.2e-16

The significant interaction terms are SP500Price:PE and EPS:PE

Excluding the interaction terms that are insignificant

interactmodel2 <- lm(Close ~ SP500Price + USDCADPrice +  EPS + PE + SP500Price:PE + EPS:PE, data=df_last_day)
summary(interactmodel2)

## 
## Call:
## lm(formula = Close ~ SP500Price + USDCADPrice + EPS + PE + SP500Price:PE + 
##     EPS:PE, data = df_last_day)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -1.884e-08 -2.687e-09 -1.120e-10  1.914e-09  3.256e-08 
## 
## Coefficients:
##                 Estimate Std. Error    t value Pr(>|t|)    
## (Intercept)   -3.764e-08  1.294e-08 -2.908e+00 0.004631 ** 
## SP500Price     1.178e-11  6.317e-12  1.865e+00 0.065574 .  
## USDCADPrice    4.824e-09  1.001e-08  4.820e-01 0.630923    
## EPS           -9.674e-10  6.682e-09 -1.450e-01 0.885219    
## PE             4.511e-10  1.324e-10  3.407e+00 0.001001 ** 
## SP500Price:PE -2.633e-13  6.528e-14 -4.033e+00 0.000119 ***
## EPS:PE         1.000e+00  1.279e-10  7.820e+09  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.767e-09 on 86 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 1.796e+20 on 6 and 86 DF,  p-value: < 2.2e-16

Based on the individual P-values of the coefficients estimates, the alternative hypothesis was accepted for interaction terms SP500Price:PE, and EPS:PE. The refined interaction model had a higher Adjusted R-squared of 1 compared to the 0.9399 observed in the first-order model without interaction terms. The RMSE value of 6.767046e-09 was lower as well in the model with interaction terms added. Based on this, we chose to test the model with interaction terms below for the regression model diagnostics:

\(Y_{close} = \beta_{0} + \beta_{1}SP500Price + \beta_{2}USDCADPrice + \beta_{3}EPS + \beta_{4}PE + \beta_{5}SP500Price:PE + \beta_{6}EPS:PE\)

3.2 Regression Model Diagnostics

The following sections provide detailed information about the multiple linear regression assumptions that were tested for the model of best fit. Since the Adjusted R-squared values are quite high for both the first order model without interactions (0.9399) and the first order model with interactions (1), both models were tested for assumptions.

Linearity Assumption in a first-order model without interactions

\(H_{0}:\) Linearity between the closing price and its predictors is present

\(H_{a}:\) Non-linearity is present

The model assumes that there is a linear relationship between the closing price of stock and the independent variables listed. For the linearity assumption to be satisfied, the plot of the residuals vs the fitted values should not show any discernible pattern.

library(ggplot2)

## Warning: пакет 'ggplot2' был собран под R версии 4.2.3

bestfirstmodel<-lm(Close~SP500Price + USDCADPrice + EPS + PE, data=df_last_day) #best first order model

#The residuals versus predicted (or fitted) values plots
ggplot(bestfirstmodel, aes(x=.fitted, y=.resid)) +
geom_point() + geom_smooth()+
geom_hline(yintercept = 0)+
ggtitle('Residuals Vs fitted values for first order model')

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

There appears to be a pattern in the residuals, this suggested that the quadratic term or logarithmic might improve the fit to the data.

Higher order models

We used pairwise combinations of predictors and the response variable to identify potential terms that might be transformed into a higher-order model.

price_predictors <- data.frame(df_last_day$Close, df_last_day$SP500Price, df_last_day$USDCADPrice, df_last_day$EPS, df_last_day$PE)
ggpairs(price_predictors,lower = list(continuous = "smooth_loess", combo =
"facethist", discrete = "facetbar", na = "na"))

From the pairs plot, we concluded that there might be a non-linear relationship between USDCADPrice and Close price, between EPS and Close price, and PE and Close price. We attempted to create higher-order models, based on these observations.

The model with cubic terms for EPS and PE variables showed the highest Adjusted R-squared of 0.9955 among other tested higher-order models:

higherordermodel <- lm(Close ~ SP500Price + USDCADPrice +  EPS + I(EPS^2) + I(EPS^3) + PE + I(PE^2) + I(PE^3), data=df_last_day) #cubic model
summary(higherordermodel)

## 
## Call:
## lm(formula = Close ~ SP500Price + USDCADPrice + EPS + I(EPS^2) + 
##     I(EPS^3) + PE + I(PE^2) + I(PE^3), data = df_last_day)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5635 -1.0012 -0.0812  0.7419  4.2895 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.799e+02  7.185e+00 -25.039  < 2e-16 ***
## SP500Price   2.430e-03  6.202e-04   3.919 0.000181 ***
## USDCADPrice -7.117e+00  2.524e+00  -2.819 0.006003 ** 
## EPS          2.060e+02  1.774e+01  11.613  < 2e-16 ***
## I(EPS^2)    -9.143e+01  1.354e+01  -6.751 1.78e-09 ***
## I(EPS^3)     1.858e+01  3.321e+00   5.596 2.69e-07 ***
## PE           2.433e+00  1.296e-01  18.773  < 2e-16 ***
## I(PE^2)     -1.159e-02  1.110e-03 -10.440  < 2e-16 ***
## I(PE^3)      2.149e-05  2.924e-06   7.349 1.20e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.545 on 84 degrees of freedom
## Multiple R-squared:  0.9959, Adjusted R-squared:  0.9955 
## F-statistic:  2573 on 8 and 84 DF,  p-value: < 2.2e-16

Therefore, in addition to the first-order models without and with interactions, the cubic model was also included to check for linear regression assumptions.

The residuals vs the fitted values were plotted to check linearity assumption in the cubic model:

#The residuals versus predicted (or fitted) values plots
ggplot(higherordermodel, aes(x=.fitted, y=.resid)) +
geom_point() + geom_smooth()+
geom_hline(yintercept = 0)+
ggtitle('Residuals Vs fitted values for cubic model')

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

The analysis of patterns on the plot showed that there is no linearity between the response variable(close) and the predictors in the cubic model.

Linearity Assumption in a first-order model with interactions

The model with interaction terms, on the other hand, has a more random scatter. The improved spread in the second model is due to the interaction terms.

interactmodel2 <- lm(Close ~ SP500Price + USDCADPrice +  EPS + PE + SP500Price:PE + EPS:PE, data=df_last_day) #best interaction model

#The residuals versus predicted (or fitted) values plots
ggplot(interactmodel2, aes(x=.fitted, y=.resid)) +
geom_point() + geom_smooth()+
geom_hline(yintercept = 0)+
ggtitle('Residuals Vs fitted values for interaction model')

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Assumption of Equal Variance - Homoscedasticity

\(H_{0}:\) Heteroscedasticity is not present (homoscedasticity)

\(H_{a}:\) Heteroscedasticity is present

This assumption assumes that the error terms have a constant variance. We used a combination of the spread location plot and the Breusch-Pagan test to visualise the equal variance assumption and test the hypothesis. The scale location plot is a plot between the fitted values and the standardized residuals shows whether the residuals are spread equally along the ranges of predictors.

bestfirstmodel<-lm(Close~SP500Price + USDCADPrice + EPS + PE, data=df_last_day) #best first order model
interactmodel2 <- lm(Close ~ SP500Price + USDCADPrice +  EPS + PE + SP500Price:PE + EPS:PE, data=df_last_day) #best interaction model
higherordermodel <- lm(Close ~ SP500Price + USDCADPrice +  EPS + I(EPS^2) + I(EPS^3) + PE + I(PE^2) + I(PE^3), data=df_last_day) #cubic model

#scale-location plots for three models

ggplot(bestfirstmodel, aes(x=.fitted, y=sqrt(abs(.stdresid)))) + geom_point() + geom_smooth()+geom_hline(yintercept = 0) + ggtitle("Scale-Location plot : Standardized Residual vs Fitted values: First order model")

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

ggplot(interactmodel2, aes(x=.fitted, y=sqrt(abs(.stdresid)))) + geom_point() + geom_smooth()+geom_hline(yintercept = 0) + ggtitle("Scale-Location plot : Standardized Residual vs Fitted values: Interaction model")

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

ggplot(higherordermodel, aes(x=.fitted, y=sqrt(abs(.stdresid)))) + geom_point() + geom_smooth()+geom_hline(yintercept = 0) + ggtitle("Scale-Location plot : Standardized Residual vs Fitted values: Cubic model")

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

#Testing for Homoscedasticity - Breusch-Pagan test for the first-order model
bptest(bestfirstmodel)

## 
##  studentized Breusch-Pagan test
## 
## data:  bestfirstmodel
## BP = 4.2948, df = 4, p-value = 0.3676

For the first-order model, the Scale-location plot is not conclusive at first sight. From the Breusch-Pagan test, the p-value 0.3676 is greater than 0.05 hence we fail to reject the null hypothesis that heteroscedasticity is not present. We therefore accept the null hypothesis and conclude that that the equal variance assumption is met by the first order model.

#Testing for Homoscedasticity Homoscedasticity - Breusch-Pagan test for the interaction model
bptest(interactmodel2)

## 
##  studentized Breusch-Pagan test
## 
## data:  interactmodel2
## BP = 12.953, df = 6, p-value = 0.04379

For the interaction model, the Scale-location does not show the pattern of heteroscedasticity, at first sight. However, the Breusch-Pagan test, the p-value 0.044 is less than 0.05 hence we reject the null hypothesis that heteroscedasticity is not present. We therefore accept the alternative hypothesis that heteroscedasticity is present.

#Testing for Homoscedasticity - Breusch-Pagan test for the cubic model
bptest(higherordermodel)

## 
##  studentized Breusch-Pagan test
## 
## data:  higherordermodel
## BP = 25.959, df = 8, p-value = 0.001067

For the Cubic model, the Scale-location plot for higher-order model shows a narrower spread of residuals along the x-axis, indicating homoscedasticity. From the Breusch-Pagan test, the p-value = 0.001 is less than 0.05 hence we reject the null hypothesis that heteroscedasticity is not present. We therefore conclude that the equal variance assumption is not met.

Normality Assumption with Q-Q plot of Residual and Shapiro Wilk Test

\(H_{0}:\) The sample data are significantly normally distributed

\(H_{a}:\) The sample data are not normally significantly distribute

Multiple linear regression requires that the errors between the observed and predicted values should be normally distributed. This assumption was checked using the Q-Q plot and mathematically using the Shapiro-Wilk test.

bestfirstmodel<-lm(Close~SP500Price + USDCADPrice + EPS + PE, data=df_last_day) #best first order model

# Check the normality assumption with Q-Q plot of residuals for three models

qqnorm(resid(bestfirstmodel))
qqline(resid(bestfirstmodel))

qqnorm(resid(interactmodel2))
qqline(resid(interactmodel2))

qqnorm(resid(higherordermodel))
qqline(resid(higherordermodel))

#Shapiro-Wilk test for the first-order model
shapiro.test(residuals(bestfirstmodel))

## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(bestfirstmodel)
## W = 0.97306, p-value = 0.05123

Based on the QQ Plot analysis for the first-order model, it is observed that a few data points on the upper end deviate slightly from the reference line. Although the Shapiro-Wilk normality test yielded a p-value of 0.051, which is just above the significance level of 0.05, it can still be considered as confirming the normality assumption.

#Shapiro-Wilk test for the interaction model
shapiro.test(residuals(interactmodel2))

## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(interactmodel2)
## W = 0.83535, p-value = 8.876e-09

Based on the analysis of the Interaction model, it can be observed from the QQ Plot that the data points on both ends deviate from the reference line. This indicates a departure from normality in the residuals. Furthermore, the Shapiro-Wilk normality test provides statistical evidence supporting this observation, as the p-value (8.876-09) is less than the significance level of 0.05. Consequently, the normality assumption cannot be confirmed for the Interaction model.

#Shapiro-Wilk test for the cubic model
shapiro.test(residuals(higherordermodel))

## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(higherordermodel)
## W = 0.96673, p-value = 0.01789

Based on the analysis of the cubic model, it can be observed from the QQ Plot that the data points on both ends deviate from the reference line. This indicates that the residuals do not follow a normal distribution. Additionally, the Shapiro-Wilk normality test provides further evidence to support this observation, as the obtained p-value of 0.01789 is less than the significance level of 0.05. Consequently, the normality assumption for the residuals cannot be confirmed.

Multicollinearity

\(H_{0}:\) The problem of multicollinearity does not exist

\(H_{a}:\) The problem of multicollinearity exists

The test for multicollinearity was based on the variance inflation factors (VIF) which identifies correlation between independent variables and the strength of that correlation.

#model with main effects
bestfirstmodel<-lm(Close~SP500Price + USDCADPrice + EPS + PE, data=df_last_day) #best first order model
imcdiag(bestfirstmodel, method="VIF")

## 
## Call:
## imcdiag(mod = bestfirstmodel, method = "VIF")
## 
## 
##  VIF Multicollinearity Diagnostics
## 
##                VIF detection
## SP500Price  7.3406         0
## USDCADPrice 2.5518         0
## EPS         9.0715         0
## PE          4.4370         0
## 
## NOTE:  VIF Method Failed to detect multicollinearity
## 
## 
## 0 --> COLLINEARITY is not detected by the test
## 
## ===================================

We can see that VIF for each variable is < 10, therefore, the collinearity is not detected.

Influential Points and Outliers

The presence of influential points in the data could alter the outcome of the model. We used the residuals vs leverage plot to identify any points beyond Cook’s distance.

cooksd <- cooks.distance(bestfirstmodel) #compute Cook's distance for each observation.
head(cooksd,20)

##            1            2            3            4            5            6 
## 7.589676e-03 1.922220e-05 2.148814e-03 3.552624e-03 5.173857e-04 1.978324e-03 
##            7            8            9           10           11           12 
## 3.112077e-05 1.306805e-02 3.042623e-02 9.319249e-04 2.103213e-02 9.973397e-03 
##           13           14           15           16           17           18 
## 1.047262e-02 1.756789e-03 2.307197e-02 6.320102e-03 1.574768e-02 1.565049e-03 
##           19           20 
## 5.133983e-02 2.300023e-02

From figure below, none of the data points were outside Cook’s distance:

#create a leverage plot
plot(bestfirstmodel, which = 5)

The plot of the Cook’s distance identified the points that were outliers, however the corresponding distance was less than 0.5 hence the outliers aren’t influential.

#Cook's distance for the first-order model
bank[cooks.distance(bestfirstmodel)>0.5,]

## [1] Date        SP500Price  USDCADPrice Close       Volume      CPI        
## [7] EPS         PE         
## <0 строк> (или 'row.names' нулевой длины)

plot(bestfirstmodel,pch=18,col="red",which=c(4))

#Cook's distance for the interaction model
bank[cooks.distance(interactmodel2)>0.5,]

## [1] Date        SP500Price  USDCADPrice Close       Volume      CPI        
## [7] EPS         PE         
## <0 строк> (или 'row.names' нулевой длины)

plot(interactmodel2,pch=18,col="red",which=c(4))

#Cook's distance for the cubic model
bank[cooks.distance(higherordermodel)>0.5,]

## [1] Date        SP500Price  USDCADPrice Close       Volume      CPI        
## [7] EPS         PE         
## <0 строк> (или 'row.names' нулевой длины)

plot(higherordermodel,pch=18,col="red",which=c(4))

The outliers aren’t influential in all three models. Therefore, the outliers were maintained in the data set.

Summary of the regression diagnostics

##                  Model First_order_model Interaction_model Cubic_model
## 1            Linearity                No               Yes          No
## 2       Equal variance               Yes                No          No
## 3            Normality               Yes                No          No
## 4 No Multicollinearity               Yes           Ignored     Ignored
## 5          No Outliers               Yes               Yes         Yes

Box-Cox transformation of the response variable

In order to attempt to solve the problems with unequal variances and non-normality, we use Box-Cox transformations of the interaction model. Box-Cox transformation is a statistical method that assumes transforming the response variable so the data follows a normal distribution. The expression below presents the Box-Cox functions transformations for various values of lambda (the transformation parameter for response variable):

\(Y^\lambda_{i} = (Y^\lambda - 1)/\lambda\), if \(\lambda\) is not 0

\(Y^\lambda_{i}\) = log_e(Y), if \(\lambda\) is 0

bc=boxcox(interactmodel2,lambda=seq(-1.5,1.5))

Based on the plot above and the result of the bestlambda function, the value of the best lambda was gained:

#extract best lambda
bestlambda=bc$x[which(bc$y==max(bc$y))]
bestlambda

## [1] 1.015152

The individual t-tests for the interaction model were done, using lambda = 1.015152 and lambda = 0.

# the output, when we choose λ=0
bcmodel_null=lm(log(Close)~SP500Price + USDCADPrice + EPS + PE + SP500Price:PE + EPS:PE,data=df_last_day)
summary(bcmodel_null)

## 
## Call:
## lm(formula = log(Close) ~ SP500Price + USDCADPrice + EPS + PE + 
##     SP500Price:PE + EPS:PE, data = df_last_day)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.060900 -0.006734  0.003237  0.010794  0.041341 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.398e+00  3.945e-02  86.142  < 2e-16 ***
## SP500Price     1.842e-05  1.925e-05   0.957  0.34136    
## USDCADPrice    1.216e-01  3.049e-02   3.986  0.00014 ***
## EPS           -1.495e-01  2.036e-02  -7.340 1.12e-10 ***
## PE            -2.141e-04  4.035e-04  -0.531  0.59712    
## SP500Price:PE -3.167e-07  1.989e-07  -1.592  0.11513    
## EPS:PE         1.257e-02  3.897e-04  32.260  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02062 on 86 degrees of freedom
## Multiple R-squared:  0.9941, Adjusted R-squared:  0.9937 
## F-statistic:  2408 on 6 and 86 DF,  p-value: < 2.2e-16

# the output, when we choose λ=1.015152
bcmodel=lm((((Close^(1.015152))-1)/(1.015152))~SP500Price + USDCADPrice + EPS + PE + SP500Price:PE + EPS:PE,data=df_last_day)
summary(bcmodel)

## 
## Call:
## lm(formula = (((Close^(1.015152)) - 1)/(1.015152)) ~ SP500Price + 
##     USDCADPrice + EPS + PE + SP500Price:PE + EPS:PE, data = df_last_day)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.056245 -0.018025 -0.006727  0.011261  0.094615 
## 
## Coefficients:
##                 Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)   -2.334e+00  5.715e-02  -40.844  < 2e-16 ***
## SP500Price    -8.739e-06  2.789e-05   -0.313    0.755    
## USDCADPrice   -1.835e-01  4.418e-02   -4.155 7.66e-05 ***
## EPS            2.093e-01  2.950e-02    7.094 3.45e-10 ***
## PE             7.905e-04  5.846e-04    1.352    0.180    
## SP500Price:PE  2.330e-07  2.882e-07    0.808    0.421    
## EPS:PE         1.069e+00  5.646e-04 1893.171  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02988 on 86 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 1.057e+07 on 6 and 86 DF,  p-value: < 2.2e-16

#Shapiro-Wilk test for the interaction model after box-cox transformations
shapiro.test(residuals(bcmodel))

## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(bcmodel)
## W = 0.93722, p-value = 0.0002308

#Testing for Homoscedasticity - Breusch-Pagan test for the interaction model after box-cox transformations
bptest(bcmodel)

## 
##  studentized Breusch-Pagan test
## 
## data:  bcmodel
## BP = 14.738, df = 6, p-value = 0.0224

As a result, we can see that after Box-Cox transformations of the response variable, some predictors and interactions in the model lost their significance. Furthermore, the problems with non-normality and heteroscedasticity still remained after doing Breusch-Pagan and Shapiro-Wilk tests which showed P-value < 0.05. Therefore, we cannot use this model for predictive purposes.

4 Conclusion and discussion

4.1 Final model

Although the interaction model and the cubic model exhibited a higher Adjusted R-squared and lower RMSE, we prioritize the fulfillment of assumptions. Therefore, our final model is the first-order model without interactions:

\(Y_{close} = \beta_{0} + \beta_{1}SP500Price + \beta_{2}USDCADPrice + \beta_{3}EPS + \beta_{4}PE\)

Adjusted R-squared = 0.9399. This can be interpreted that 93.99% of the variation of the stock closing price is explained by the model. RMSE = 5.67 which is the standard deviation of the unexplained variance, in other words, it is the average distance between the real values and the predicted (by the model) values of the stock closing price.

There are a total of 5 coefficients in our model which are estimated:

#first-order model without interactions
bestfirstmodel<-lm(Close~SP500Price + USDCADPrice + EPS + PE, data=df_last_day) #best first order model
bestfirstmodel

## 
## Call:
## lm(formula = Close ~ SP500Price + USDCADPrice + EPS + PE, data = df_last_day)
## 
## Coefficients:
## (Intercept)   SP500Price  USDCADPrice          EPS           PE  
##   -28.20638      0.01453     33.99131     22.62114      0.15259

\(Y_{close} = -28.21 + 0.01SP500Price + 33.99USDCADPrice + 22.62EPS + 0.15PE\)

This model satisfied four of the assumptions of multiple linear regression as highlighted earlier (Normality, Equal variance, Multicollinearity and Outliers assumptions). The linearity is the only assumption that wasn’t met by the first order model.

4.2 Future work

The current project has provided valuable insights into analyzing the stock price of National Bank of Canada using a multiple linear regression model. However, there are several areas that can be explored further to enhance the predictive capabilities of the model and deepen our understanding of stock price dynamics. The following areas represent potential avenues for future work:

Include additional predictors

While the current analysis considered a set of fundamental and technical variables, there are other relevant factors that can influence stock prices. Future work could involve exploring and incorporating additional predictors such as market sentiment, news sentiment, economic indicators, and industry-specific variables. By incorporating these factors, the predictive power of the model can be improved, leading to more accurate stock price forecasts.

Evaluate model robustness

To assess the robustness of the model, it is important to test it on different time periods or apply cross-validation techniques. This would help validate the model’s performance and determine if it generalizes well to unseen data. Additionally, conducting sensitivity analysis could provide insights into the stability of the model’s coefficients and predictions under different scenarios, contributing to a more comprehensive evaluation of the model’s performance.

Incorporate time series analysis

Stock prices are inherently time-dependent, and their movements often exhibit autocorrelation and other time series properties. Future work could involve incorporating time series analysis techniques, such as autoregressive integrated moving average (ARIMA) models or generalized autoregressive conditional heteroskedasticity (GARCH) models, to capture the temporal dependencies and volatility clustering in stock price data. This addition would allow for a more accurate representation of the time series characteristics of stock prices.

Extend analysis to other stocks

This project primarily focused on predicting the stock price of National Bank of Canada. However, future work could expand the analysis to include other stocks from different sectors or countries. By examining stocks from various industries and markets, a broader understanding of the factors influencing stock prices can be gained, leading to more comprehensive and applicable predictive

References

Chen, J., & Murry, C. (n.d.). What Is the Stock Market, What Does It Do, and How Does It Work? Investopedia. Retrieved April 4, 2023, from https://www.investopedia.com/terms/s/stockmarket.asp

Consumer Price Index, monthly, not seasonally adjusted. (2023). Statistique Canada. Retrieved April 4, 2023, from https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=1810000401

National Bank Of Canada EPS - Earnings per Share 2010-2022 | NTIOF. (n.d.). Macrotrends. Retrieved April 4, 2023, from https://www.macrotrends.net/stocks/charts/NTIOF/national-bank-of-canada/eps-earnings-per-share-diluted

National Bank of Canada (NA.TO) Stock Historical Prices & Data. (n.d.). Yahoo Finance. Retrieved April 4, 2023, from https://ca.finance.yahoo.com/quote/NA.TO/history?p=NA.TO S&P 500 (^GSPC) Historical Data. (n.d.). Yahoo Finance. Retrieved April 4, 2023, from https://ca.finance.yahoo.com/quote/%5EGSPC/history?p=%5EGSPC

Multiple Linear Regression for Stock Price: National Bank of Canada

Data Analysis Contributors: Ritah Nabaweesi, Laura Assylgazhina, Olayinka Mogaji

2023-03-26