INTRODUCTION

1.1. Motivation

1.1.1. Context

The stock market refers to the several exchanges where the buying and selling of stocks of publicly held companies. The participants(investors) in the stock market form part of the stock markets and the main reason for participation gravitates around growing their money or preserving the value of the money that they have. The decision to buy or sell a particular stock at a given time is influenced by various factors, among which is the prevailing price of the stock at the time of participation. This report aims to understand the influence of the identified independent variables on the closing price of the stock.

1.1.2. Problem

From the various economic theories, it’s general knowledge that the price of any commodity is influenced by the demand and supply forces within the marketplace. A number of studies have been done to explain the influence of various factors however there is no standard equation that tells of how the stock price will behave in response to the various changes in the marketplace. This study seeks to understand the factors that influence the stock price for National Bank of Canada.
National Bank of Canada provides various financial products and services to retail, commercial, corporate, and institutional clients in Canada and internationally. It operates through four segments: Personal and Commercial, Wealth Management, Financial Markets, and U.S. Specialty Finance and International.

1.2. Objectives

1.2.1. Overview

The main objective of this project was to understand the fundamental and technical factors that influence the stock price for National Bank of Canada based on historical performance data.

1.2.2 Goals & Research Questions

The goal of the data modelling is to unearth the influence of the various technical and fundamental variables with the price of the stock. The research set out to answer the major question; What is the best model that explains the stock price given a set of fundamental and technical factors? The fundamental factors are economic and financial factors that influence the intrinsic value of an investment. Examples of such factors include the Consumer price index (CPI), Earnings per share (EPS) for the company, and the Profit earnings ratio (P/E). The technical factors explain the trading activity of the stock. Price, volume movements are technical indicators that were investigated in this research.

2. METHODOLOGY

2.1 Data

The data were collected in a CSV format from Open Data sources. The stock price information for National Bank of Canada, the exchange rates and market index was sourced from Yahoo finance (National Bank of Canada (NA.TO) Stock Historical Prices & Data, n.d.). The historical earnings per share data was sourced from Macrotrends (National Bank Of Canada EPS - Earnings per Share 2010-2022 | NTIOF, n.d.). The consumer price index information for Canada was sourced from Statistics Canada (Consumer Price Index, Monthly, Not Seasonally Adjusted, 2023). The monthly data spans a duration of 2012 to 2022 and was consolidated into one data frame of 93 rows for analysis. The dataset has the variables Close, Volume, CPI, EPS, P.E, USDCAD price and SP500 price. The response variable for the research is the Close while the predictors are the Volume, CPI, EPS, PE and USD/CAD rate.

Close: this is the closing price of the National Bank of Canada stock for each day on the Toronto stock market. The price is quoted in Canadian Dollars.

Volume: this is the number of National Bank of Canada stocks that are traded on the stock exchange on the given day. There is no unit of measurement for the volume

CPI: the consumer price index is a measure of the average change over time in the prices paid by urban consumers for a market basket of consumer goods and services. It is an accurate representation of the inflation changes over the years.

EPS: the earnings per share (EPS) is an indicator of the financial health of the National Bank of Canada. It is calculated as the company’s profit divided by the outstanding number of its common stock. This is an indicator of the company’s profitability and the results are released on a quarterly basis.

PE: this is the price earnings ratio is calculated by taking the latest closing price and dividing it by the most recent earnings per share (EPS) number. The PE is used as a valuation measure to assess whether a stock is over or undervalued.

USDCAD price: this is the exchange rate quoted for the United States Dollar (USD) Canadian Dollar (CAD) currency pair. The rate tells of how many Canadian dollars are needed to purchase one U.S dollar. The price is quoted on a daily basis and the closing price was considered for this study.

SP500 price: this is the closing price quoted for the S&P 500 index. The S&P 500 is a market capitalization weighted index of large-cap stocks. It has 500 constituents that represent a diverse set of companies from multiple companies. The price is quoted in USD.

#read the data
bank = read.csv("bankofcanadaCP.csv")
tail(bank,10)

##           Date SP500Price USDCADPrice Close  Volume   CPI  EPS       PE
## 2683 2/13/2012    1351.77      0.9994 53.66 3412200 121.2 1.01 53.12871
## 2684 2/10/2012    1342.64      1.0023 53.60 2030500 121.2 1.01 53.06930
## 2685  2/9/2012    1351.95      0.9951 53.87 1614400 121.2 1.01 53.33663
## 2686  2/8/2012    1349.96      0.9961 53.93 2326400 121.2 1.01 53.39604
## 2687  2/7/2012    1347.05      0.9945 53.58 2319600 121.2 1.01 53.04951
## 2688  2/6/2012    1344.33      0.9956 53.47 1751400 121.2 1.01 52.94060
## 2689  2/3/2012    1344.90      0.9931 53.33 2872300 121.2 1.01 52.80198
## 2690  2/2/2012    1325.54      0.9992 52.93 3673400 121.2 1.01 52.40594
## 2691  2/1/2012    1324.09      0.9989 53.28 3337500 121.2 1.01 52.75247
## 2692 1/31/2012    1312.41      1.0029 52.37 2870600 120.7 1.01 51.85148

The SP500Price, USDCADPrice, Close and Volume parameters are collected on a daily basis. However CPI and EPS values are released on a monthly basis. In order to align the information, the information for the last day of the month is taken.

Converting date from string to data type:

# Convert the Date column to a Date object using mdy() function from lubridate
library(dplyr)
library(lubridate)

bank$Date <- mdy(bank$Date)
head(bank)

##         Date SP500Price USDCADPrice  Close  Volume   CPI  EPS       PE
## 1 2022-12-30    3839.50      1.3549 127.30 3545800 153.1 1.83 69.56284
## 2 2022-12-29    3849.28      1.3546 128.66 3256100 153.1 1.83 70.30601
## 3 2022-12-28    3783.22      1.3608 127.71 3163800 153.1 1.83 69.78688
## 4 2022-12-23    3844.82      1.3598 128.28 2134600 153.1 1.83 70.09836
## 5 2022-12-22    3822.39      1.3646 127.73 2441100 153.1 1.83 69.79782
## 6 2022-12-21    3878.44      1.3609 128.67 2183800 153.1 1.83 70.31147

Extract data for the last day of the month:

df_last_day <- bank %>%
  
  filter(day(Date) == days_in_month(Date))
head(df_last_day)

##         Date SP500Price USDCADPrice  Close  Volume   CPI  EPS       PE
## 1 2022-11-30    4080.11      1.3409 133.78 5769800 154.0 1.83 73.10382
## 2 2022-10-31    3871.98      1.3622 126.05 3120000 153.8 1.83 68.87978
## 3 2022-09-30    3585.62      1.3826 124.37 1982500 152.7 1.83 67.96175
## 4 2022-08-31    3955.00      1.3127 122.13 2395800 152.6 1.83 66.73770
## 5 2022-06-30    3785.38      1.2872 124.64 2576600 152.9 2.01 62.00995
## 6 2022-05-31    4132.15      1.2644 132.17 4224100 151.9 2.01 65.75622

2.2 Approach

The approach is to start with a full multiple linear regression model, with all the variables included. A full model test is performed to confirm whether the multiple linear regression model is useful. Based on the outcome of the individual t-tests, assessment of the predictors’ significance is done.

A first order model is then designed by assessing the contribution of a subset of predictors to the model. The choice of model at this stage is based on the adjusted R-squared, which measures how much of the variation in the response variable is explained by the changes in the predictor variables. To assess the best subsets of predictors in the first order model such measures as RMSE, Cp and AIC are computed and compared. The subset with the combination of the highest Adjusted R-squared, the lowest RMSE, Cp and AIC is preferred.

Based on the outcome of selecting the best first order model, the interaction terms are assessed and added to the main effects to improve the model. The judgement of the significance of the interaction terms within the model is based on the p-values of each of the coefficients as well as the values of Adjusted R-squared and RMSE. The interaction terms whose p-value is insignificant are dropped from the model and model is reassessed.

Once the best first order model with interaction terms is selected, an analysis of the residuals is done. This entails testing the assumptions below:

● Linearity Assumption: assumes that there is a straight-line relationship between the predictors and the response variable. ● Equal Variance assumption: the error terms have a constant variance. ● Normality assumption: the residuals of the regression should be normally distributed. ● Multicollinearity: check for any linearly correlated independent variables. ● Outliers: assess the presence of influential data points which can affect the regression results

The results of each of the assumptions is based on to fine tune the model and conclude on the best multiple linear model to predict the National Bank of Canada stock price.

##2.3 Workflow

The next steps are going to be done to find the best regression model: ● Evaluating overall model utility by using a global F test ● Performing Individual Coefficients Test (t-test) to check how significant their influence on response variable ● Finding the best first-order model comparing such measures as Adjusted R-squared, RMSE, Cp and AIC ● Finding the best first-order model with interaction terms, using individual t-test, Adjusted R-squared, and RMSE ● Testing the Linearity assumption by plotting the residuals versus predicted values. ● Checking the homoscedasticity by creating the Scale-Location plot and doing the Breusch-Pagan test. ● Examining the Q-Q plot of residuals and the results of Shapiro Wilk test to evaluate if the errors between observed and fitted values are normally distributed. ● Checking if the problem of multicollinearity exists by using VIFs and scatterplots showing correlation between predictors. ● Using Cook’s distance and Leverage plots to find influential cases of outliers. ● Analysing how to solve the problems arising from results of testing the assumptions if applicable. ● Fitting the final model which pretends to the highest prediction accuracy among other models analysed

##2.4 Workload Distribution

The project work was equally distributed between three team members:

● Predictors, first order model, interaction and higher-order terms selection tasks. Making intermediate conclusions. ● Testing for the linearity, the homoscedasticity, and the Normality assumptions. Drawing intermediate conclusions. ● Testing for the multicollinearity assumptions and outliers. Making intermediate conclusions.

The common tasks which were done together: ● Data gathering ● Discussions about the topic chosen, methodology, steps of the project, future work suggestions ● Analysing and choosing the final model ● Making conclusions

#3. The Analysis

3.1 Build the first order model

3.1.1 Fit the model containing all six variables

firstmodel<-lm(Close~SP500Price + USDCADPrice + Volume + CPI  + EPS + PE, data=df_last_day) #Full model
summary(firstmodel)

## 
## Call:
## lm(formula = Close ~ SP500Price + USDCADPrice + Volume + CPI + 
##     EPS + PE, data = df_last_day)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.3089  -3.6343  -0.2154   4.5781  10.3098 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.974e+01  2.128e+01  -1.867 0.065315 .  
## SP500Price   1.385e-02  1.902e-03   7.282 1.46e-10 ***
## USDCADPrice  3.219e+01  9.002e+00   3.576 0.000575 ***
## Volume      -7.634e-07  5.312e-07  -1.437 0.154310    
## CPI          1.435e-01  2.450e-01   0.586 0.559530    
## EPS          2.160e+01  5.640e+00   3.829 0.000244 ***
## PE           1.529e-01  4.816e-02   3.174 0.002084 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.665 on 86 degrees of freedom
## Multiple R-squared:  0.9441, Adjusted R-squared:  0.9402 
## F-statistic: 241.9 on 6 and 86 DF,  p-value: < 2.2e-16

3.1.2 Test the hypothesis for the full model i.e the test of overall significance (at the significance level = 0.05).

\(H_{0} : \beta_{1} = \beta_{2} = \beta_{3} = \beta_{4} = \beta_{5}= \beta_{6} = 0\)

\(H_{a}\) : at least one \(\beta_{i}\) is not zero (i = 1,2,3,4,5,6)

nullmodel<-lm(Close~1, data=df_last_day) #with only intersept
anova(nullmodel,firstmodel) #Comparison of null model and full model

## Analysis of Variance Table
## 
## Model 1: Close ~ 1
## Model 2: Close ~ SP500Price + USDCADPrice + Volume + CPI + EPS + PE
##   Res.Df   RSS Df Sum of Sq      F    Pr(>F)    
## 1     92 49353                                  
## 2     86  2760  6     46593 241.93 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can see that F-statistic = 241.93 with df = 2760 (p-value < 2.2e-16 < \(\alpha\) = 0.05). It indicates that we should clearly reject the null hypothesis \(H_{0}\). The large F-test suggests that at least one of the variables (SP500Price, USDCADPrice, Volume, CPI, EPS, PE) must be related to the stock close price. Based on the p-value, we also have extremely strong evidence that at least one of the variables is associated with the close price.

3.1.3 Use Individual Coefficients Test (t-test) to find the best model

\(H_{0} : \beta_{i} = 0\)

\(H_{a} : \beta_{i}\) is not equal 0 (i=1,2,3,4,5,6)

summary(firstmodel)

## 
## Call:
## lm(formula = Close ~ SP500Price + USDCADPrice + Volume + CPI + 
##     EPS + PE, data = df_last_day)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.3089  -3.6343  -0.2154   4.5781  10.3098 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.974e+01  2.128e+01  -1.867 0.065315 .  
## SP500Price   1.385e-02  1.902e-03   7.282 1.46e-10 ***
## USDCADPrice  3.219e+01  9.002e+00   3.576 0.000575 ***
## Volume      -7.634e-07  5.312e-07  -1.437 0.154310    
## CPI          1.435e-01  2.450e-01   0.586 0.559530    
## EPS          2.160e+01  5.640e+00   3.829 0.000244 ***
## PE           1.529e-01  4.816e-02   3.174 0.002084 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.665 on 86 degrees of freedom
## Multiple R-squared:  0.9441, Adjusted R-squared:  0.9402 
## F-statistic: 241.9 on 6 and 86 DF,  p-value: < 2.2e-16

Based on the results above, the individual P-values indicate that Volume and CPI do not have a significant influence on the closing price. It means we should clearly reject the null hypothesis for the SP500Price, USDCADPrice, EPS and PE. Therefore, we drop Volume and CPI in the model.

3.1.4 Select the significant predictors for the first-order model

reducedmodel<-lm(Close~SP500Price + USDCADPrice + EPS + PE, data=df_last_day) #reduced model
summary(reducedmodel)

## 
## Call:
## lm(formula = Close ~ SP500Price + USDCADPrice + EPS + PE, data = df_last_day)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.9297  -3.7728   0.2145   4.5945  10.5450 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -28.206378   9.318896  -3.027  0.00324 ** 
## SP500Price    0.014529   0.001727   8.412 6.66e-13 ***
## USDCADPrice  33.991307   7.552184   4.501 2.06e-05 ***
## EPS          22.621144   4.606398   4.911 4.15e-06 ***
## PE            0.152587   0.044962   3.394  0.00104 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.677 on 88 degrees of freedom
## Multiple R-squared:  0.9425, Adjusted R-squared:  0.9399 
## F-statistic: 360.9 on 4 and 88 DF,  p-value: < 2.2e-16

All the variables in the reduced model are significant based on the individual p-values < 0.05

3.1.5 Select significant predictors for the first-order model based on the Adjusted R-squared, cp, AIC and RMSE

A higher adjusted R-squared is preferred. The model with the smaller Cp, AIC and RMSE will be selected.

library(olsrr)

## 
## Attaching package: 'olsrr'

## The following object is masked from 'package:datasets':
## 
##     rivers

#Select the subset of predictors that do the best at meeting some well-defined objective
#criterion, such 
stock=ols_step_best_subset(reducedmodel, details=TRUE)

# for the output interpretation
AdjustedR<-c(stock$adjr)
cp<-c(stock$cp)
AIC<-c(stock$aic)
cbind(AdjustedR,cp,AIC)

##      AdjustedR       cp      AIC
## [1,] 0.9158100 38.54004 622.2573
## [2,] 0.9247213 25.78707 612.8250
## [3,] 0.9328318 14.51725 603.1841
## [4,] 0.9399303  5.00000 593.7456

sigma<-c(firstmodel)
model1<-lm(Close~SP500Price, data=df_last_day)
model2<-lm(Close~SP500Price+ USDCADPrice, data=df_last_day)
model3<-lm(Close~SP500Price+ USDCADPrice+EPS, data=df_last_day)
model4<-lm(Close~SP500Price+ USDCADPrice+EPS+PE, data=df_last_day)

variables<-c(1,2,3,4)
sigma<-c(sigma(model1),sigma(model2),sigma(model3),sigma(model4))
sigma_table <- data.frame(variables,sigma)
sigma_table

##   variables    sigma
## 1         1 6.720368
## 2         2 6.354756
## 3         3 6.002673
## 4         4 5.676629

The model with the 4 independent variables (SP500Price, USDCADPrice, EPS, PE) has the highest Adjusted R-squared of 0.9399, the least cp of 5 , the least AIC of 593.7456, and the least RMSE of 5.677.

3.1.6 Improving the model

3.1.6.1 Introduce Interaction terms into the model

interactmodel <- lm(Close~(SP500Price + USDCADPrice + EPS + PE)^2, data=df_last_day)
summary(interactmodel)

## 
## Call:
## lm(formula = Close ~ (SP500Price + USDCADPrice + EPS + PE)^2, 
##     data = df_last_day)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -2.018e-08 -2.681e-09  1.770e-10  1.908e-09  3.177e-08 
## 
## Coefficients:
##                          Estimate Std. Error    t value Pr(>|t|)    
## (Intercept)            -6.542e-08  1.224e-07 -5.340e-01   0.5946    
## SP500Price             -5.979e-12  5.236e-11 -1.140e-01   0.9094    
## USDCADPrice             1.150e-08  1.056e-07  1.090e-01   0.9136    
## EPS                     5.626e-08  1.010e-07  5.570e-01   0.5790    
## PE                      6.927e-10  1.452e-09  4.770e-01   0.6345    
## SP500Price:USDCADPrice  2.215e-11  4.081e-11  5.430e-01   0.5888    
## SP500Price:EPS         -1.057e-12  4.853e-12 -2.180e-01   0.8281    
## SP500Price:PE          -3.735e-13  1.449e-13 -2.577e+00   0.0118 *  
## USDCADPrice:EPS        -5.054e-08  8.792e-08 -5.750e-01   0.5670    
## USDCADPrice:PE         -5.162e-11  1.141e-09 -4.500e-02   0.9640    
## EPS:PE                  1.000e+00  1.877e-10  5.328e+09   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.886e-09 on 82 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 1.041e+20 on 10 and 82 DF,  p-value: < 2.2e-16

The significant interaction terms are SP500Price:PE and EPS:PE

3.1.6.2 Excluding the interaction terms that are insignificant

interactmodel2 <- lm(Close ~ SP500Price + USDCADPrice +  EPS + PE + SP500Price:PE + EPS:PE, data=df_last_day)
summary(interactmodel2)

## 
## Call:
## lm(formula = Close ~ SP500Price + USDCADPrice + EPS + PE + SP500Price:PE + 
##     EPS:PE, data = df_last_day)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -1.884e-08 -2.687e-09 -1.120e-10  1.914e-09  3.256e-08 
## 
## Coefficients:
##                 Estimate Std. Error    t value Pr(>|t|)    
## (Intercept)   -3.764e-08  1.294e-08 -2.908e+00 0.004631 ** 
## SP500Price     1.178e-11  6.317e-12  1.865e+00 0.065574 .  
## USDCADPrice    4.824e-09  1.001e-08  4.820e-01 0.630923    
## EPS           -9.674e-10  6.682e-09 -1.450e-01 0.885219    
## PE             4.511e-10  1.324e-10  3.407e+00 0.001001 ** 
## SP500Price:PE -2.633e-13  6.528e-14 -4.033e+00 0.000119 ***
## EPS:PE         1.000e+00  1.279e-10  7.820e+09  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.767e-09 on 86 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 1.796e+20 on 6 and 86 DF,  p-value: < 2.2e-16

3.1.6.3 The problem with too high Adjusted R-squared in interaction model

Adjusted R-squared is equal 1, which might indicate a problem. There might be too high correlation between predictors in the model, which we can check by analyzing all pairwise combinations of predictors in a scatterplot and using the VIF function.

3.1.6.4 Checking correlation and regression between predictors

library(GGally)

## Loading required package: ggplot2

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

price_predictors <- data.frame(df_last_day$Close, df_last_day$SP500Price, df_last_day$USDCADPrice, df_last_day$EPS, df_last_day$PE)
ggpairs(price_predictors,lower = list(continuous = "smooth_loess", combo =
"facethist", discrete = "facetbar", na = "na"))

From ggpairs plot, it can bee seen a high correlation = 0.957 (>0.8) between SP500Price and Close, of which the Close is the response variable.

# The variance inflation factor
library(mctest)
imcdiag(reducedmodel, method="VIF")

## 
## Call:
## imcdiag(mod = reducedmodel, method = "VIF")
## 
## 
##  VIF Multicollinearity Diagnostics
## 
##                VIF detection
## SP500Price  7.3406         0
## USDCADPrice 2.5518         0
## EPS         9.0715         0
## PE          4.4370         0
## 
## NOTE:  VIF Method Failed to detect multicollinearity
## 
## 
## 0 --> COLLINEARITY is not detected by the test
## 
## ===================================

The VIF method didn’t detect any multicollinearity. The high Adjusted R-squared in the interaction model is due to the interaction terms.

3.1.6.5 Testing for higher-order models

Testing for higher-order models:

higherordermodel <- lm(Close ~ SP500Price + USDCADPrice +  EPS + I(EPS^2) + I(EPS^3) + PE + I(PE^2) + I(PE^3), data=df_last_day)
summary(higherordermodel)

## 
## Call:
## lm(formula = Close ~ SP500Price + USDCADPrice + EPS + I(EPS^2) + 
##     I(EPS^3) + PE + I(PE^2) + I(PE^3), data = df_last_day)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5635 -1.0012 -0.0812  0.7419  4.2895 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.799e+02  7.185e+00 -25.039  < 2e-16 ***
## SP500Price   2.430e-03  6.202e-04   3.919 0.000181 ***
## USDCADPrice -7.117e+00  2.524e+00  -2.819 0.006003 ** 
## EPS          2.060e+02  1.774e+01  11.613  < 2e-16 ***
## I(EPS^2)    -9.143e+01  1.354e+01  -6.751 1.78e-09 ***
## I(EPS^3)     1.858e+01  3.321e+00   5.596 2.69e-07 ***
## PE           2.433e+00  1.296e-01  18.773  < 2e-16 ***
## I(PE^2)     -1.159e-02  1.110e-03 -10.440  < 2e-16 ***
## I(PE^3)      2.149e-05  2.924e-06   7.349 1.20e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.545 on 84 degrees of freedom
## Multiple R-squared:  0.9959, Adjusted R-squared:  0.9955 
## F-statistic:  2573 on 8 and 84 DF,  p-value: < 2.2e-16

The model with cubic terms for EPS and PE variables has the the highest Adjusted R-squared among other tested higher-order models.

3.2. Model diagnostics for the best fitted model

3.2.1 Linearity assumption

The linear regression model assumes that there is a straight-line relationship between the predictors and the response. If the true relationship is far from linear, then virtually all of the conclusions that we draw from the fit are suspect. In addition, the prediction accuracy of the model can be significantly reduced.

library(ggplot2)

bestfirstmodel<-lm(Close~SP500Price + USDCADPrice + EPS + PE, data=df_last_day) #best first order model
interactmodel2 <- lm(Close ~ SP500Price + USDCADPrice +  EPS + PE + SP500Price:PE + EPS:PE, data=df_last_day) #best interaction model
higherordermodel <- lm(Close ~ SP500Price + USDCADPrice +  EPS + I(EPS^2) + I(EPS^3) + PE + I(PE^2) + I(PE^3), data=df_last_day) #cubic model


#The residuals versus predicted (or fitted) values plots

ggplot(bestfirstmodel, aes(x=.fitted, y=.resid)) +
geom_point() + geom_smooth()+
geom_hline(yintercept = 0)+
ggtitle('Residuals Vs fitted values for first order model')

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

ggplot(interactmodel2, aes(x=.fitted, y=.resid)) +
geom_point() + geom_smooth()+
geom_hline(yintercept = 0)+
ggtitle('Residuals Vs fitted values for interaction model')

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

ggplot(higherordermodel, aes(x=.fitted, y=.resid)) +
geom_point() + geom_smooth()+
geom_hline(yintercept = 0)+
ggtitle('Residuals Vs fitted values for cubic model')

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

The first order model and cubic models have some patterns in the residuals, indicating that there is a likelihood of non-linearity.

The model with interaction terms, on the other hand, has a more random scatter. The improved spread in the second model is due to the interaction terms.

3.2.2 Assumption of Equal Variance - Homoscedasticity

This plot is a diagnostic plot for checking the homoscedasticity assumptions of the linear regression model. It shows the residuals (y-axis) against the fitted values (x-axis), where the fitted values are the predicted values from the model. If the model satisfies the homoscedasticity assumptions, the plot should show a random scatter of points with no obvious pattern or trend.

bestfirstmodel<-lm(Close~SP500Price + USDCADPrice + EPS + PE, data=df_last_day) #best first order model
interactmodel2 <- lm(Close ~ SP500Price + USDCADPrice +  EPS + PE + SP500Price:PE + EPS:PE, data=df_last_day) #best interaction model
higherordermodel <- lm(Close ~ SP500Price + USDCADPrice +  EPS + I(EPS^2) + I(EPS^3) + PE + I(PE^2) + I(PE^3), data=df_last_day) #cubic model

#scale-location plots for three models

ggplot(bestfirstmodel, aes(x=.fitted, y=sqrt(abs(.stdresid)))) + geom_point() + geom_smooth()+geom_hline(yintercept = 0) + ggtitle("Scale-Location plot : Standardized Residual vs Fitted values: First order model")

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

ggplot(interactmodel2, aes(x=.fitted, y=sqrt(abs(.stdresid)))) + geom_point() + geom_smooth()+geom_hline(yintercept = 0) + ggtitle("Scale-Location plot : Standardized Residual vs Fitted values: Interaction model")

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

ggplot(higherordermodel, aes(x=.fitted, y=sqrt(abs(.stdresid)))) + geom_point() + geom_smooth()+geom_hline(yintercept = 0) + ggtitle("Scale-Location plot : Standardized Residual vs Fitted values: Cubic model")

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

#Testing for Homoscedasticity - Breusch-Pagan test for the first-order model
library(lmtest)

## Warning: package 'lmtest' was built under R version 4.2.3

## Loading required package: zoo

## Warning: package 'zoo' was built under R version 4.2.3

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

bptest(bestfirstmodel)

## 
##  studentized Breusch-Pagan test
## 
## data:  bestfirstmodel
## BP = 4.2948, df = 4, p-value = 0.3676

For the first-order model, the Scale-location plot is not conclusive at first sight. From the Breusch-Pagan test, the p-value 0.3676 is greater than 0.05 hence we fail to reject the null hypothesis that heteroscedasticity is not present. We therefore accept the null hypothesis and conclude that that the equal variance assumption is met by the first order model.

#Testing for Homoscedasticity Homoscedasticity - Breusch-Pagan test for the interaction model
library(lmtest)
bptest(interactmodel2)

## 
##  studentized Breusch-Pagan test
## 
## data:  interactmodel2
## BP = 12.953, df = 6, p-value = 0.04379

The Scale-location does not show the pattern of heteroscedasticity, at first sight. However, the Breusch-Pagan test, the p-value 0.044 is less than 0.05 hence we reject the null hypothesis that heteroscedasticity is not present. We therefore accept the alternative hypothesis that heteroscedasticity is present.

#Testing for Homoscedasticity - Breusch-Pagan test for the cubic model
library(lmtest)
bptest(higherordermodel)

## 
##  studentized Breusch-Pagan test
## 
## data:  higherordermodel
## BP = 25.959, df = 8, p-value = 0.001067

The Scale-location plot for higher-order model shows a narrower spread of residuals along the x-axis, indicating homoscedasticity. From the Breusch-Pagan test, the p-value = 0.001 is less than 0.05 hence we reject the null hypothesis that heteroscedasticity is not present. We therefore conclude that the equal variance assumption is not met.

3.2.3 Normality Assumption with Q-Q plot of Residual and Shapiro Wilk Test

Normal Q-Q: This plot shows if the residuals follow a normal distribution. Ideally, we want to see the points fall close to the diagonal line, indicating that the residuals are normally distributed.

bestfirstmodel<-lm(Close~SP500Price + USDCADPrice + EPS + PE, data=df_last_day) #best first order model

# Check the normality assumption with Q-Q plot of residuals

qqnorm(resid(bestfirstmodel))
qqline(resid(bestfirstmodel))

qqnorm(resid(interactmodel2))
qqline(resid(interactmodel2))

qqnorm(resid(higherordermodel))
qqline(resid(higherordermodel))

#Shapiro-Wilk test for the first-order model
shapiro.test(residuals(bestfirstmodel))

## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(bestfirstmodel)
## W = 0.97306, p-value = 0.05123

From the QQ Plot, some data points on upper end slightly deviate from the reference line. The Shapiro-Wilk normality test with a p-value = 0.051 is borderline above the significance level of 0.05. Therefore, the normality assumption can be confirmed.

#Shapiro-Wilk test for the interaction model
shapiro.test(residuals(interactmodel2))

## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(interactmodel2)
## W = 0.83535, p-value = 8.876e-09

From the QQ Plot, data points on both ends still deviate from the reference line. Also, Shapiro-Wilk normality test confirms that the residuals are not normally distributed as the p-value = 8.876-09 (<0.05). Therefore, the normality assumption cannot be confirmed.

#Shapiro-Wilk test for the cubic model
shapiro.test(residuals(higherordermodel))

## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(higherordermodel)
## W = 0.96673, p-value = 0.01789

From the QQ Plot, data points on both ends still deviate from the reference line. Furthermore, Shapiro-Wilk normality test confirms that the residuals are not normally distributed as the p-value = 0.01789 (<0.05). Therefore, the normality assumption cannot be confirmed.

3.2.4 Multicollinearity

library(mctest)
#model with main effects
bestfirstmodel<-lm(Close~SP500Price + USDCADPrice + EPS + PE, data=df_last_day) #best first order model

imcdiag(bestfirstmodel, method="VIF")

## 
## Call:
## imcdiag(mod = bestfirstmodel, method = "VIF")
## 
## 
##  VIF Multicollinearity Diagnostics
## 
##                VIF detection
## SP500Price  7.3406         0
## USDCADPrice 2.5518         0
## EPS         9.0715         0
## PE          4.4370         0
## 
## NOTE:  VIF Method Failed to detect multicollinearity
## 
## 
## 0 --> COLLINEARITY is not detected by the test
## 
## ===================================

We already checked the multicollinearity in section 1.6.4, and we can see that VIF for each variable is < 10, the collinearity is not detected.

3.2.5 Outliers

Cook’s distance is a measure of the influence of each observation on the fitted values of the regression model. High values of Cook’s distance indicate that the observation may have an undue influence on the fitted values and may be an outlier

To compute Cook’s distance for each observation:

cooksd <- cooks.distance(bestfirstmodel) #to compute Cook's distance for each observation.
head(cooksd,20)

##            1            2            3            4            5            6 
## 7.589676e-03 1.922220e-05 2.148814e-03 3.552624e-03 5.173857e-04 1.978324e-03 
##            7            8            9           10           11           12 
## 3.112077e-05 1.306805e-02 3.042623e-02 9.319249e-04 2.103213e-02 9.973397e-03 
##           13           14           15           16           17           18 
## 1.047262e-02 1.756789e-03 2.307197e-02 6.320102e-03 1.574768e-02 1.565049e-03 
##           19           20 
## 5.133983e-02 2.300023e-02

To create a leverage plot

plot(bestfirstmodel, which = 5)

Identify outliers

cutoff <- 4/(nrow(data)-length(bestfirstmodel$coefficients)-1)
outliers <- which(cooksd > cutoff)
head(outliers,20)

## integer(0)

leverage <- hatvalues(bestfirstmodel)
head(leverage,20)

##          1          2          3          4          5          6          7 
## 0.05242082 0.06238936 0.09586285 0.04700286 0.09741990 0.07825557 0.09206058 
##          8          9         10         11         12         13         14 
## 0.09459699 0.09538160 0.11202656 0.08265700 0.05944918 0.07146424 0.06403024 
##         15         16         17         18         19         20 
## 0.05864073 0.04235189 0.13414322 0.09439907 0.05492223 0.05760854

Leverage values are a measure of how extreme the predictor variable values are for a given observation. It is a measure of the potential influence of a given observation on the regression line. A high leverage value indicates that the predictor values for that observation are far from the average predictor values, and therefore the observation has the potential to have a large effect on the regression line. However, high leverage values do not necessarily mean that the observation is an outlier or influential point. It just means that the observation has extreme predictor variable values.

Plot of Cook’s distance

#Cook's distance for the first-order model
bank[cooks.distance(bestfirstmodel)>0.5,]

## [1] Date        SP500Price  USDCADPrice Close       Volume      CPI        
## [7] EPS         PE         
## <0 rows> (or 0-length row.names)

plot(bestfirstmodel,pch=18,col="red",which=c(4))

#Cook's distance for the interaction model
bank[cooks.distance(interactmodel2)>0.5,]

## [1] Date        SP500Price  USDCADPrice Close       Volume      CPI        
## [7] EPS         PE         
## <0 rows> (or 0-length row.names)

plot(interactmodel2,pch=18,col="red",which=c(4))

#Cook's distance for the cubic model
bank[cooks.distance(higherordermodel)>0.5,]

## [1] Date        SP500Price  USDCADPrice Close       Volume      CPI        
## [7] EPS         PE         
## <0 rows> (or 0-length row.names)

plot(higherordermodel,pch=18,col="red",which=c(4))

The outliers aren’t influential in all three models, since their Cook’s distance is less than 0.5. The outliers were maintained in the data set.

3.2.6 Box-Cox transformations

In order to attempt to solve the problems with unequal variances and non-normality, we use Box-Cox transformations of the interaction model. Box-Cox transformation is a statistical method that assumes transforming the response variable so the data follows a normal distribution. The expression below presents the Box-Cox functions transformations for various values of lambda (the transformation parameter for response variable):

\(Y^\lambda_{i} = (Y^\lambda - 1)/\lambda\), if \(\lambda\) is not 0

\(Y^\lambda_{i}\) = log_e(Y), if \(\lambda\) is 0

library(MASS) #for the boxcox()function

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:olsrr':
## 
##     cement

## The following object is masked from 'package:dplyr':
## 
##     select

bc=boxcox(interactmodel2,lambda=seq(-1.5,1.5))

From the output we found that the best lambda is close to one.

#extract best lambda

bestlambda=bc$x[which(bc$y==max(bc$y))]
bestlambda

## [1] 1.015152

# the output, when we choose λ=0
bcmodel_null=lm(log(Close)~SP500Price + USDCADPrice + EPS + PE + SP500Price:PE + EPS:PE,data=df_last_day)
summary(bcmodel_null)

## 
## Call:
## lm(formula = log(Close) ~ SP500Price + USDCADPrice + EPS + PE + 
##     SP500Price:PE + EPS:PE, data = df_last_day)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.060900 -0.006734  0.003237  0.010794  0.041341 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.398e+00  3.945e-02  86.142  < 2e-16 ***
## SP500Price     1.842e-05  1.925e-05   0.957  0.34136    
## USDCADPrice    1.216e-01  3.049e-02   3.986  0.00014 ***
## EPS           -1.495e-01  2.036e-02  -7.340 1.12e-10 ***
## PE            -2.141e-04  4.035e-04  -0.531  0.59712    
## SP500Price:PE -3.167e-07  1.989e-07  -1.592  0.11513    
## EPS:PE         1.257e-02  3.897e-04  32.260  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02062 on 86 degrees of freedom
## Multiple R-squared:  0.9941, Adjusted R-squared:  0.9937 
## F-statistic:  2408 on 6 and 86 DF,  p-value: < 2.2e-16

# the output, when we choose λ=1.015152
bcmodel=lm((((Close^(1.015152))-1)/(1.015152))~SP500Price + USDCADPrice + EPS + PE + SP500Price:PE + EPS:PE,data=df_last_day)
summary(bcmodel)

## 
## Call:
## lm(formula = (((Close^(1.015152)) - 1)/(1.015152)) ~ SP500Price + 
##     USDCADPrice + EPS + PE + SP500Price:PE + EPS:PE, data = df_last_day)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.056245 -0.018025 -0.006727  0.011261  0.094615 
## 
## Coefficients:
##                 Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)   -2.334e+00  5.715e-02  -40.844  < 2e-16 ***
## SP500Price    -8.739e-06  2.789e-05   -0.313    0.755    
## USDCADPrice   -1.835e-01  4.418e-02   -4.155 7.66e-05 ***
## EPS            2.093e-01  2.950e-02    7.094 3.45e-10 ***
## PE             7.905e-04  5.846e-04    1.352    0.180    
## SP500Price:PE  2.330e-07  2.882e-07    0.808    0.421    
## EPS:PE         1.069e+00  5.646e-04 1893.171  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02988 on 86 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 1.057e+07 on 6 and 86 DF,  p-value: < 2.2e-16

#Shapiro-Wilk test for the interaction model after box-cox transformations
shapiro.test(residuals(bcmodel))

## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(bcmodel)
## W = 0.93722, p-value = 0.0002308

#Testing for Homoscedasticity - Breusch-Pagan test for the interaction model after box-cox transformations
library(lmtest)
bptest(bcmodel)

## 
##  studentized Breusch-Pagan test
## 
## data:  bcmodel
## BP = 14.738, df = 6, p-value = 0.0224

As a result, we can see that after box-cox transformation of the response variable, some of predictors and interactions in the model lost their significance. Furthermore, the problems with non-normality and heteroscedasticity still exist. Therefore, we cannot use this model for predictive purposes.

4 Final model

A comparison of the three models revealed that:

The lowest RMSE is presented in the interactive model. Therefore, if we consider the Adjusted R-squared, RMSE measures, then the best model is Interactive model.

However, the Non-Normality and Heteroscedasticity problems exist for both - interactive and cubic models. It means the prediction accuracy of the models can be significantly reduced.

The first order model, on the other hand, satisfied the Normality, equal variance, multicollinearity and outliers assumptions. The linearity assumption is the only assumption that wasn’t met by the first order model.

Given the precedence that the assumptions take over the Adjusted R-squared, CIP and other mathematical values, the best model is the first order model

Best model for the project:

Close ~ SP500Price + USDCADPrice + EPS + PE

#best first order model
coefficients(bestfirstmodel)

##  (Intercept)   SP500Price  USDCADPrice          EPS           PE 
## -28.20637764   0.01452887  33.99130721  22.62114433   0.15258670

6. CONCLUSION AND DISCUSSION

6.1 Approach

The approach which was used in our project enabled us to perform a series of successive steps that led to the creation of the final best-fitted regression model. During the process of variables and model selection such methods as F-test to assess the overall significance of the models, individual t-test for evaluating the predictors significance were applied. These methods in combination with assessing such measures as Adjusted R-squared, RMSE, cp and AIC were useful to build and choose the best first-order model as well as significant interaction terms.

During the Regression Model Diagnostics stage, it was effective to use visualization such as the residuals versus predicted plot, the Scale-Location plot, the Q-Q plot to understand the patterns in data distribution overall while the Breusch-Pagan test and Shapiro Wilk test gave as the evidence to make the final conclusion about the assumptions. The approach was to go deeper step by step into the different aspects of regression model analysis which we found effective. Throughout the process we sometimes went back to the previous steps to retest the models which were created later, for example, at the diagnostic stage after adding to the model higher-order terms. The change to our approach that could have been taken is to conduct the test for multicollinearity at the initial step to exclude some redundant calculations and tests. In general, the approach that we took was efficient and helped us gradually come to the final conclusion about the best-fitted model.

6.2 Future Work

The possible problem which we might not take into account is that the assumption of independent errors is violated. It could happen as our variables are time-series data, in other words, it was observed sequentially over a period of time. Therefore, the time-series analysis of our data should be done for future work.

Another point which might be considered is searching and selecting categorical independent variables for our model such as an overall character of messages posted on the Twitter social media regarding the National Bank of Canada. The possible values for this variable are “positive”, “negative” or “neutral” characters of messages. Exploration of other independent variables that haven’t been captured in this study should also be considered.

7. REFERENCES

Chen, J., & Murry, C. (n.d.). What Is the Stock Market, What Does It Do, and How Does It Work? Investopedia. Retrieved April 4, 2023, from https://www.investopedia.com/terms/s/stockmarket.asp

Consumer Price Index, monthly, not seasonally adjusted. (2023). Statistique Canada. Retrieved April 4, 2023, from https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=1810000401

National Bank Of Canada EPS - Earnings per Share 2010-2022 | NTIOF. (n.d.). Macrotrends. Retrieved April 4, 2023, from https://www.macrotrends.net/stocks/charts/NTIOF/national-bank-of-canada/eps-earnings-per-share-diluted

National Bank of Canada (NA.TO) Stock Historical Prices & Data. (n.d.). Yahoo Finance. Retrieved April 4, 2023, from https://ca.finance.yahoo.com/quote/NA.TO/history?p=NA.TO

S&P 500 (^GSPC) Historical Data. (n.d.). Yahoo Finance. Retrieved April 4, 2023, from https://ca.finance.yahoo.com/quote/%5EGSPC/history?p=%5EGSPC

MULTIPLE LINEAR REGRESSION FOR STOCK PRICE: NATIONAL BANK OF CANADA

Project collaborators: Ritah Nabaweesi,Laura Assylgazhina, Olayinka Mogaji

2023-03-26