The stock market refers to the several exchanges where the buying and selling of stocks of publicly held companies. The participants(investors) in the stock market form part of the stock markets and the main reason for participation gravitates around growing their money or preserving the value of the money that they have. The decision to buy or sell a particular stock at a given time is influenced by various factors, among which is the prevailing price of the stock at the time of participation. This report aims to understand the influence of the identified independent variables on the closing price of the stock.
From the various economic theories, it’s general knowledge that the
price of any commodity is influenced by the demand and supply forces
within the marketplace. A number of studies have been done to explain
the influence of various factors however there is no standard equation
that tells of how the stock price will behave in response to the various
changes in the marketplace. This study seeks to understand the factors
that influence the stock price for National Bank of Canada.
National Bank of Canada provides various financial products and services
to retail, commercial, corporate, and institutional clients in Canada
and internationally. It operates through four segments: Personal and
Commercial, Wealth Management, Financial Markets, and U.S. Specialty
Finance and International.
The main objective of this project was to understand the fundamental and technical factors that influence the stock price for National Bank of Canada based on historical performance data.
The goal of the data modelling is to unearth the influence of the various technical and fundamental variables with the price of the stock. The research set out to answer the major question: What is the best model that explains the stock price given a set of fundamental and technical factors?
The fundamental factors are economic and financial factors that influence the intrinsic value of an investment. Examples of such factors include the Consumer price index (CPI), Earnings per share (EPS) for the company, and the Profit earnings ratio (P/E). The technical factors explain the trading activity of the stock. Price, volume movements are technical indicators that were investigated in this research.
The data were collected in a CSV format from Open Data sources.
The stock price information for National Bank of Canada, the exchange rates and market index was sourced from Yahoo finance (National Bank of Canada (NA.TO) Stock Historical Prices & Data, n.d.). The historical earnings per share data was sourced from Macrotrends (National Bank Of Canada EPS - Earnings per Share 2010-2022 | NTIOF, n.d.). The consumer price index information for Canada was sourced from Statistics Canada (Consumer Price Index, Monthly, Not Seasonally Adjusted, 2023). The monthly data spans a duration of 2012 to 2022 and was consolidated into one dataframe of 93 rows for analysis.
The dataset has the variables Close, Volume, CPI, EPS, P.E, USDCAD price and SP500 price. The response variable for the research is the Close while the predictors are the Volume, CPI, EPS, PE and USD/CAD rate.
Close : this is the closing price of the National Bank of Canada stock for each day on the Toronto stock market. The price is quoted in Canadian Dollars.
Volume: this is the number of National Bank of Canada stocks that are traded on the stock exchange on the given day. There is no unit of measurement for the volume
CPI : the consumer price index is a measure of the average change over time in the prices paid by urban consumers for a market basket of consumer goods and services. It is an accurate representation of the inflation changes over the years.
EPS : the earnings per share (EPS) is an indicator of the financial health of the National Bank of Canada. It is calculated as the company’s profit divided by the outstanding number of its common stock. This is an indicator of the company’s profitability and the results are released on a quarterly basis.
PE : this is the price earnings ratio is calculated by taking the latest closing price and dividing it by the most recent earnings per share (EPS) number. The PE is used as a valuation measure to assess whether a stock is over or undervalued.
USDCAD price : this is the exchange rate quoted for the United States Dollar (USD) Canadian Dollar (CAD) currency pair. The rate tells of how many Canadian dollars are needed to purchase one U.S dollar. The price is quoted on a daily basis and the closing price was considered for this study.
SP500 price: this is the closing price quoted for the S&P 500 index. The S&P 500 is a market capitalisation weighted index of large-cap stocks. It has 500 constituents that represent a diverse set of companies from multiple companies. The price is quoted in USD.
# read the data
bank = read.csv("bankofcanadaCP.csv")
head(bank,10)
## Date SP500Price USDCADPrice Close Volume CPI EPS PE
## 1 12/30/2022 3839.50 1.3549 127.30 3545800 153.1 1.83 69.56284
## 2 12/29/2022 3849.28 1.3546 128.66 3256100 153.1 1.83 70.30601
## 3 12/28/2022 3783.22 1.3608 127.71 3163800 153.1 1.83 69.78688
## 4 12/23/2022 3844.82 1.3598 128.28 2134600 153.1 1.83 70.09836
## 5 12/22/2022 3822.39 1.3646 127.73 2441100 153.1 1.83 69.79782
## 6 12/21/2022 3878.44 1.3609 128.67 2183800 153.1 1.83 70.31147
## 7 12/20/2022 3821.62 1.3609 127.50 3750500 153.1 1.83 69.67213
## 8 12/19/2022 3817.66 1.3642 126.76 2569100 153.1 1.83 69.26776
## 9 12/16/2022 3852.36 1.3700 127.96 6758500 153.1 1.83 69.92350
## 10 12/15/2022 3895.75 1.3658 128.22 1972700 153.1 1.83 70.06557
Converting date from string to data type:
# Convert the Date column to a Date object using mdy() function from lubridate
bank$Date <- mdy(bank$Date)
The SP500Price, USDCADPrice, Close and Volume parameters are collected on a daily basis. However, CPI and EPS values are released on a monthly basis. In order to align the information, the information for the last day of the month is taken:
df_last_day <- bank %>%
filter(day(Date) == days_in_month(Date))
head(df_last_day)
## Date SP500Price USDCADPrice Close Volume CPI EPS PE
## 1 2022-11-30 4080.11 1.3409 133.78 5769800 154.0 1.83 73.10382
## 2 2022-10-31 3871.98 1.3622 126.05 3120000 153.8 1.83 68.87978
## 3 2022-09-30 3585.62 1.3826 124.37 1982500 152.7 1.83 67.96175
## 4 2022-08-31 3955.00 1.3127 122.13 2395800 152.6 1.83 66.73770
## 5 2022-06-30 3785.38 1.2872 124.64 2576600 152.9 2.01 62.00995
## 6 2022-05-31 4132.15 1.2644 132.17 4224100 151.9 2.01 65.75622
The approach is to start with a full multiple linear regression model, with all the variables included. A full model test is performed to confirm whether the multiple linear regression model is useful. Based on the outcome of the individual t-tests, assessment of the predictors’ significance is done.
A first order model is then designed by assessing the contribution of a subset of predictors to the model. The choice of model at this stage is based on the adjusted R-squared, which measures how much of the variation in the response variable is explained by the changes in the predictor variables. To assess the best subsets of predictors in the first order model such measures as RMSE, Cp and AIC are computed and compared. The subset with the combination of the highest Adjusted R-squared, the lowest RMSE, Cp and AIC is preferred.
Based on the outcome of selecting the best first order model, the interaction terms are assessed and added to the main effects to improve the model. The judgement of the significance of the interaction terms within the model is based on the p-values of each of the coefficients as well as the values of Adjusted R-squared and RMSE. The interaction terms whose p-value is insignificant are dropped from the model and model is reassessed.
Once the best first order model with interaction terms is selected, an analysis of the residuals is done. This entails testing the assumptions below:
The results of each of the assumptions is based on to fine tune the model and conclude on the best multiple linear model to predict the National Bank of Canada stock price.
The next steps are going to be done to find the best regression model:
The project work was equally distributed between three team members:
The common tasks which were done together:
We built a first order linear model with all the independent variables. This formed the reference point for the subsequent model selection procedures:
\(Y_{close} = \beta_{0} + \beta_{1}SP500Price + \beta_{2}USDCADPrice + \beta_{3}Volume + \beta_{4}CPI + \beta_{5}EPS + \beta_{6}PE\)
firstmodel<-lm(Close~SP500Price + USDCADPrice + Volume + CPI + EPS + PE, data=df_last_day) #Full model
summary(firstmodel)
##
## Call:
## lm(formula = Close ~ SP500Price + USDCADPrice + Volume + CPI +
## EPS + PE, data = df_last_day)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.3089 -3.6343 -0.2154 4.5781 10.3098
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.974e+01 2.128e+01 -1.867 0.065315 .
## SP500Price 1.385e-02 1.902e-03 7.282 1.46e-10 ***
## USDCADPrice 3.219e+01 9.002e+00 3.576 0.000575 ***
## Volume -7.634e-07 5.312e-07 -1.437 0.154310
## CPI 1.435e-01 2.450e-01 0.586 0.559530
## EPS 2.160e+01 5.640e+00 3.829 0.000244 ***
## PE 1.529e-01 4.816e-02 3.174 0.002084 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.665 on 86 degrees of freedom
## Multiple R-squared: 0.9441, Adjusted R-squared: 0.9402
## F-statistic: 241.9 on 6 and 86 DF, p-value: < 2.2e-16
In order to asses the overall model utility, a global F test was done with hypothesis:
\(H_{0} : \beta_{1} = \beta_{2} = \beta_{3} = \beta_{4} = \beta_{5}= \beta_{6} = 0\)\(H_{a}\) : at least one \(\beta_{i}\) is not zero (i = 1,2,3,4,5,6)
nullmodel<-lm(Close~1, data=df_last_day) #with only intersept
anova(nullmodel,firstmodel) #Comparison of null model and full model
## Analysis of Variance Table
##
## Model 1: Close ~ 1
## Model 2: Close ~ SP500Price + USDCADPrice + Volume + CPI + EPS + PE
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 92 49353
## 2 86 2760 6 46593 241.93 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can see that F-statistic = 241.93 with df = 2760 (p-value < 2.2e-16 < \(\alpha\) = 0.05). It indicates that we should clearly reject the null hypothesis \(H_{0}\). The large F-test suggests that at least one of the variables (SP500Price, USDCADPrice, Volume, CPI, EPS, PE) must be related to the stock close price. Based on the p-value of, we also have extremely strong evidence that at least one of the variables is associated with the close price.
\(H_{a} : \beta_{i}\) is not equal 0 (i=1,2,3,4,5,6)
summary(firstmodel)
##
## Call:
## lm(formula = Close ~ SP500Price + USDCADPrice + Volume + CPI +
## EPS + PE, data = df_last_day)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.3089 -3.6343 -0.2154 4.5781 10.3098
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.974e+01 2.128e+01 -1.867 0.065315 .
## SP500Price 1.385e-02 1.902e-03 7.282 1.46e-10 ***
## USDCADPrice 3.219e+01 9.002e+00 3.576 0.000575 ***
## Volume -7.634e-07 5.312e-07 -1.437 0.154310
## CPI 1.435e-01 2.450e-01 0.586 0.559530
## EPS 2.160e+01 5.640e+00 3.829 0.000244 ***
## PE 1.529e-01 4.816e-02 3.174 0.002084 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.665 on 86 degrees of freedom
## Multiple R-squared: 0.9441, Adjusted R-squared: 0.9402
## F-statistic: 241.9 on 6 and 86 DF, p-value: < 2.2e-16
The individual T-tests of the first-order model returned significant p-values for the independent variables SP500Price, USDCADPrice, PE and EPS (< 2e-16), which provided sufficient evidence against the null hypothesis only for four variables. The volume and CPI on the other hand weren’t significant and were subsequently dropped.
reducedmodel<-lm(Close~SP500Price + USDCADPrice + EPS + PE, data=df_last_day) #reduced model
summary(reducedmodel)
##
## Call:
## lm(formula = Close ~ SP500Price + USDCADPrice + EPS + PE, data = df_last_day)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.9297 -3.7728 0.2145 4.5945 10.5450
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -28.206378 9.318896 -3.027 0.00324 **
## SP500Price 0.014529 0.001727 8.412 6.66e-13 ***
## USDCADPrice 33.991307 7.552184 4.501 2.06e-05 ***
## EPS 22.621144 4.606398 4.911 4.15e-06 ***
## PE 0.152587 0.044962 3.394 0.00104 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.677 on 88 degrees of freedom
## Multiple R-squared: 0.9425, Adjusted R-squared: 0.9399
## F-statistic: 360.9 on 4 and 88 DF, p-value: < 2.2e-16
The individual T-tests of the reduced first order model returned significant p-values for the variables. Based on this, we concluded that the SP500price, USDCADprice, PE, and EPS are significant variables in predicting the stock price on their own:
\(Y_{close} = \beta_{0} + \beta_{1}SP500Price + \beta_{2}USDCADPrice + \beta_{3}EPS + \beta_{4}PE + \beta_{5}SP500Price:PE + \beta_{6}EPS:PE\)
To further firm up on the best subset of the predictors to be included in the first order model, the Adjusted R-squared, RMSE, Cp and AIC values were computed. A higher adjusted R-squared and the smaller Cp, AIC, RMSE were preferred.
#Select the subset of predictors that do the best at meeting some well-defined criterion
stock=ols_step_best_subset(reducedmodel, details=TRUE)
# for the output (the Adjusted R-squared, Cp and AIC) interpretation
AdjustedR<-c(stock$adjr)
cp<-c(stock$cp)
AIC<-c(stock$aic)
cbind(AdjustedR,cp,AIC)
## AdjustedR cp AIC
## [1,] 0.9158100 38.54004 622.2573
## [2,] 0.9247213 25.78707 612.8250
## [3,] 0.9328318 14.51725 603.1841
## [4,] 0.9399303 5.00000 593.7456
sigma<-c(firstmodel)
model1<-lm(Close~SP500Price, data=df_last_day)
model2<-lm(Close~SP500Price+ USDCADPrice, data=df_last_day)
model3<-lm(Close~SP500Price+ USDCADPrice+EPS, data=df_last_day)
model4<-lm(Close~SP500Price+ USDCADPrice+EPS+PE, data=df_last_day)
#compare RMSE values
variables<-c(1,2,3,4)
sigma<-c(sigma(model1),sigma(model2),sigma(model3),sigma(model4))
sigma_table <- data.frame(variables,sigma)
sigma_table
## variables sigma
## 1 1 6.720368
## 2 2 6.354756
## 3 3 6.002673
## 4 4 5.676629
The analysis showed that model with all four independent variables (SP500Price, USDCADPrice, EPS, PE) has the highest Adjusted R-squared of 0.9399, the least cp of 5, the least AIC of 593.7456, and the least RMSE of 5.6766.
\(H_{a} : \beta_{i}\) is not equal 0 (i=1,2,3,4,5,6)
interactmodel <- lm(Close~(SP500Price + USDCADPrice + EPS + PE)^2, data=df_last_day)
summary(interactmodel)
##
## Call:
## lm(formula = Close ~ (SP500Price + USDCADPrice + EPS + PE)^2,
## data = df_last_day)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.018e-08 -2.681e-09 1.770e-10 1.908e-09 3.177e-08
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.542e-08 1.224e-07 -5.340e-01 0.5946
## SP500Price -5.979e-12 5.236e-11 -1.140e-01 0.9094
## USDCADPrice 1.150e-08 1.056e-07 1.090e-01 0.9136
## EPS 5.626e-08 1.010e-07 5.570e-01 0.5790
## PE 6.927e-10 1.452e-09 4.770e-01 0.6345
## SP500Price:USDCADPrice 2.215e-11 4.081e-11 5.430e-01 0.5888
## SP500Price:EPS -1.057e-12 4.853e-12 -2.180e-01 0.8281
## SP500Price:PE -3.735e-13 1.449e-13 -2.577e+00 0.0118 *
## USDCADPrice:EPS -5.054e-08 8.792e-08 -5.750e-01 0.5670
## USDCADPrice:PE -5.162e-11 1.141e-09 -4.500e-02 0.9640
## EPS:PE 1.000e+00 1.877e-10 5.328e+09 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.886e-09 on 82 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.041e+20 on 10 and 82 DF, p-value: < 2.2e-16
The significant interaction terms are SP500Price:PE and EPS:PE
interactmodel2 <- lm(Close ~ SP500Price + USDCADPrice + EPS + PE + SP500Price:PE + EPS:PE, data=df_last_day)
summary(interactmodel2)
##
## Call:
## lm(formula = Close ~ SP500Price + USDCADPrice + EPS + PE + SP500Price:PE +
## EPS:PE, data = df_last_day)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.884e-08 -2.687e-09 -1.120e-10 1.914e-09 3.256e-08
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.764e-08 1.294e-08 -2.908e+00 0.004631 **
## SP500Price 1.178e-11 6.317e-12 1.865e+00 0.065574 .
## USDCADPrice 4.824e-09 1.001e-08 4.820e-01 0.630923
## EPS -9.674e-10 6.682e-09 -1.450e-01 0.885219
## PE 4.511e-10 1.324e-10 3.407e+00 0.001001 **
## SP500Price:PE -2.633e-13 6.528e-14 -4.033e+00 0.000119 ***
## EPS:PE 1.000e+00 1.279e-10 7.820e+09 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.767e-09 on 86 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.796e+20 on 6 and 86 DF, p-value: < 2.2e-16
Based on the individual P-values of the coefficients estimates, the alternative hypothesis was accepted for interaction terms SP500Price:PE, and EPS:PE. The refined interaction model had a higher Adjusted R-squared of 1 compared to the 0.9399 observed in the first-order model without interaction terms. The RMSE value of 6.767046e-09 was lower as well in the model with interaction terms added. Based on this, we chose to test the model with interaction terms below for the regression model diagnostics:
\(Y_{close} = \beta_{0} + \beta_{1}SP500Price + \beta_{2}USDCADPrice + \beta_{3}EPS + \beta_{4}PE + \beta_{5}SP500Price:PE + \beta_{6}EPS:PE\)
The following sections provide detailed information about the multiple linear regression assumptions that were tested for the model of best fit. Since the Adjusted R-squared values are quite high for both the first order model without interactions (0.9399) and the first order model with interactions (1), both models were tested for assumptions.
\(H_{a}:\) Non-linearity is present
The model assumes that there is a linear relationship between the closing price of stock and the independent variables listed. For the linearity assumption to be satisfied, the plot of the residuals vs the fitted values should not show any discernible pattern.
library(ggplot2)
## Warning: пакет 'ggplot2' был собран под R версии 4.2.3
bestfirstmodel<-lm(Close~SP500Price + USDCADPrice + EPS + PE, data=df_last_day) #best first order model
#The residuals versus predicted (or fitted) values plots
ggplot(bestfirstmodel, aes(x=.fitted, y=.resid)) +
geom_point() + geom_smooth()+
geom_hline(yintercept = 0)+
ggtitle('Residuals Vs fitted values for first order model')
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
There appears to be a pattern in the residuals, this suggested that the
quadratic term or logarithmic might improve the fit to the data.
We used pairwise combinations of predictors and the response variable to identify potential terms that might be transformed into a higher-order model.
price_predictors <- data.frame(df_last_day$Close, df_last_day$SP500Price, df_last_day$USDCADPrice, df_last_day$EPS, df_last_day$PE)
ggpairs(price_predictors,lower = list(continuous = "smooth_loess", combo =
"facethist", discrete = "facetbar", na = "na"))
From the pairs plot, we concluded that there might be a non-linear
relationship between USDCADPrice and Close price, between EPS and Close
price, and PE and Close price. We attempted to create higher-order
models, based on these observations.
The model with cubic terms for EPS and PE variables showed the highest Adjusted R-squared of 0.9955 among other tested higher-order models:
higherordermodel <- lm(Close ~ SP500Price + USDCADPrice + EPS + I(EPS^2) + I(EPS^3) + PE + I(PE^2) + I(PE^3), data=df_last_day) #cubic model
summary(higherordermodel)
##
## Call:
## lm(formula = Close ~ SP500Price + USDCADPrice + EPS + I(EPS^2) +
## I(EPS^3) + PE + I(PE^2) + I(PE^3), data = df_last_day)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5635 -1.0012 -0.0812 0.7419 4.2895
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.799e+02 7.185e+00 -25.039 < 2e-16 ***
## SP500Price 2.430e-03 6.202e-04 3.919 0.000181 ***
## USDCADPrice -7.117e+00 2.524e+00 -2.819 0.006003 **
## EPS 2.060e+02 1.774e+01 11.613 < 2e-16 ***
## I(EPS^2) -9.143e+01 1.354e+01 -6.751 1.78e-09 ***
## I(EPS^3) 1.858e+01 3.321e+00 5.596 2.69e-07 ***
## PE 2.433e+00 1.296e-01 18.773 < 2e-16 ***
## I(PE^2) -1.159e-02 1.110e-03 -10.440 < 2e-16 ***
## I(PE^3) 2.149e-05 2.924e-06 7.349 1.20e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.545 on 84 degrees of freedom
## Multiple R-squared: 0.9959, Adjusted R-squared: 0.9955
## F-statistic: 2573 on 8 and 84 DF, p-value: < 2.2e-16
Therefore, in addition to the first-order models without and with interactions, the cubic model was also included to check for linear regression assumptions.
The residuals vs the fitted values were plotted to check linearity assumption in the cubic model:
#The residuals versus predicted (or fitted) values plots
ggplot(higherordermodel, aes(x=.fitted, y=.resid)) +
geom_point() + geom_smooth()+
geom_hline(yintercept = 0)+
ggtitle('Residuals Vs fitted values for cubic model')
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
The analysis of patterns on the plot showed that there is no linearity
between the response variable(close) and the predictors in the cubic
model.
The model with interaction terms, on the other hand, has a more random scatter. The improved spread in the second model is due to the interaction terms.
interactmodel2 <- lm(Close ~ SP500Price + USDCADPrice + EPS + PE + SP500Price:PE + EPS:PE, data=df_last_day) #best interaction model
#The residuals versus predicted (or fitted) values plots
ggplot(interactmodel2, aes(x=.fitted, y=.resid)) +
geom_point() + geom_smooth()+
geom_hline(yintercept = 0)+
ggtitle('Residuals Vs fitted values for interaction model')
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
\(H_{a}:\) Heteroscedasticity is present
This assumption assumes that the error terms have a constant variance. We used a combination of the spread location plot and the Breusch-Pagan test to visualise the equal variance assumption and test the hypothesis. The scale location plot is a plot between the fitted values and the standardized residuals shows whether the residuals are spread equally along the ranges of predictors.
bestfirstmodel<-lm(Close~SP500Price + USDCADPrice + EPS + PE, data=df_last_day) #best first order model
interactmodel2 <- lm(Close ~ SP500Price + USDCADPrice + EPS + PE + SP500Price:PE + EPS:PE, data=df_last_day) #best interaction model
higherordermodel <- lm(Close ~ SP500Price + USDCADPrice + EPS + I(EPS^2) + I(EPS^3) + PE + I(PE^2) + I(PE^3), data=df_last_day) #cubic model
#scale-location plots for three models
ggplot(bestfirstmodel, aes(x=.fitted, y=sqrt(abs(.stdresid)))) + geom_point() + geom_smooth()+geom_hline(yintercept = 0) + ggtitle("Scale-Location plot : Standardized Residual vs Fitted values: First order model")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
ggplot(interactmodel2, aes(x=.fitted, y=sqrt(abs(.stdresid)))) + geom_point() + geom_smooth()+geom_hline(yintercept = 0) + ggtitle("Scale-Location plot : Standardized Residual vs Fitted values: Interaction model")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
ggplot(higherordermodel, aes(x=.fitted, y=sqrt(abs(.stdresid)))) + geom_point() + geom_smooth()+geom_hline(yintercept = 0) + ggtitle("Scale-Location plot : Standardized Residual vs Fitted values: Cubic model")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
#Testing for Homoscedasticity - Breusch-Pagan test for the first-order model
bptest(bestfirstmodel)
##
## studentized Breusch-Pagan test
##
## data: bestfirstmodel
## BP = 4.2948, df = 4, p-value = 0.3676
For the first-order model, the Scale-location plot is not conclusive at first sight. From the Breusch-Pagan test, the p-value 0.3676 is greater than 0.05 hence we fail to reject the null hypothesis that heteroscedasticity is not present. We therefore accept the null hypothesis and conclude that that the equal variance assumption is met by the first order model.
#Testing for Homoscedasticity Homoscedasticity - Breusch-Pagan test for the interaction model
bptest(interactmodel2)
##
## studentized Breusch-Pagan test
##
## data: interactmodel2
## BP = 12.953, df = 6, p-value = 0.04379
For the interaction model, the Scale-location does not show the pattern of heteroscedasticity, at first sight. However, the Breusch-Pagan test, the p-value 0.044 is less than 0.05 hence we reject the null hypothesis that heteroscedasticity is not present. We therefore accept the alternative hypothesis that heteroscedasticity is present.
#Testing for Homoscedasticity - Breusch-Pagan test for the cubic model
bptest(higherordermodel)
##
## studentized Breusch-Pagan test
##
## data: higherordermodel
## BP = 25.959, df = 8, p-value = 0.001067
For the Cubic model, the Scale-location plot for higher-order model shows a narrower spread of residuals along the x-axis, indicating homoscedasticity. From the Breusch-Pagan test, the p-value = 0.001 is less than 0.05 hence we reject the null hypothesis that heteroscedasticity is not present. We therefore conclude that the equal variance assumption is not met.
\(H_{a}:\) The sample data are not normally significantly distribute
Multiple linear regression requires that the errors between the observed and predicted values should be normally distributed. This assumption was checked using the Q-Q plot and mathematically using the Shapiro-Wilk test.
bestfirstmodel<-lm(Close~SP500Price + USDCADPrice + EPS + PE, data=df_last_day) #best first order model
# Check the normality assumption with Q-Q plot of residuals for three models
qqnorm(resid(bestfirstmodel))
qqline(resid(bestfirstmodel))
qqnorm(resid(interactmodel2))
qqline(resid(interactmodel2))
qqnorm(resid(higherordermodel))
qqline(resid(higherordermodel))
#Shapiro-Wilk test for the first-order model
shapiro.test(residuals(bestfirstmodel))
##
## Shapiro-Wilk normality test
##
## data: residuals(bestfirstmodel)
## W = 0.97306, p-value = 0.05123
Based on the QQ Plot analysis for the first-order model, it is observed that a few data points on the upper end deviate slightly from the reference line. Although the Shapiro-Wilk normality test yielded a p-value of 0.051, which is just above the significance level of 0.05, it can still be considered as confirming the normality assumption.
#Shapiro-Wilk test for the interaction model
shapiro.test(residuals(interactmodel2))
##
## Shapiro-Wilk normality test
##
## data: residuals(interactmodel2)
## W = 0.83535, p-value = 8.876e-09
Based on the analysis of the Interaction model, it can be observed from the QQ Plot that the data points on both ends deviate from the reference line. This indicates a departure from normality in the residuals. Furthermore, the Shapiro-Wilk normality test provides statistical evidence supporting this observation, as the p-value (8.876-09) is less than the significance level of 0.05. Consequently, the normality assumption cannot be confirmed for the Interaction model.
#Shapiro-Wilk test for the cubic model
shapiro.test(residuals(higherordermodel))
##
## Shapiro-Wilk normality test
##
## data: residuals(higherordermodel)
## W = 0.96673, p-value = 0.01789
Based on the analysis of the cubic model, it can be observed from the QQ Plot that the data points on both ends deviate from the reference line. This indicates that the residuals do not follow a normal distribution. Additionally, the Shapiro-Wilk normality test provides further evidence to support this observation, as the obtained p-value of 0.01789 is less than the significance level of 0.05. Consequently, the normality assumption for the residuals cannot be confirmed.
\(H_{a}:\) The problem of multicollinearity exists
The test for multicollinearity was based on the variance inflation factors (VIF) which identifies correlation between independent variables and the strength of that correlation.
#model with main effects
bestfirstmodel<-lm(Close~SP500Price + USDCADPrice + EPS + PE, data=df_last_day) #best first order model
imcdiag(bestfirstmodel, method="VIF")
##
## Call:
## imcdiag(mod = bestfirstmodel, method = "VIF")
##
##
## VIF Multicollinearity Diagnostics
##
## VIF detection
## SP500Price 7.3406 0
## USDCADPrice 2.5518 0
## EPS 9.0715 0
## PE 4.4370 0
##
## NOTE: VIF Method Failed to detect multicollinearity
##
##
## 0 --> COLLINEARITY is not detected by the test
##
## ===================================
We can see that VIF for each variable is < 10, therefore, the collinearity is not detected.
The presence of influential points in the data could alter the outcome of the model. We used the residuals vs leverage plot to identify any points beyond Cook’s distance.
cooksd <- cooks.distance(bestfirstmodel) #compute Cook's distance for each observation.
head(cooksd,20)
## 1 2 3 4 5 6
## 7.589676e-03 1.922220e-05 2.148814e-03 3.552624e-03 5.173857e-04 1.978324e-03
## 7 8 9 10 11 12
## 3.112077e-05 1.306805e-02 3.042623e-02 9.319249e-04 2.103213e-02 9.973397e-03
## 13 14 15 16 17 18
## 1.047262e-02 1.756789e-03 2.307197e-02 6.320102e-03 1.574768e-02 1.565049e-03
## 19 20
## 5.133983e-02 2.300023e-02
From figure below, none of the data points were outside Cook’s distance:
#create a leverage plot
plot(bestfirstmodel, which = 5)
The plot of the Cook’s distance identified the points that were outliers, however the corresponding distance was less than 0.5 hence the outliers aren’t influential.
#Cook's distance for the first-order model
bank[cooks.distance(bestfirstmodel)>0.5,]
## [1] Date SP500Price USDCADPrice Close Volume CPI
## [7] EPS PE
## <0 строк> (или 'row.names' нулевой длины)
plot(bestfirstmodel,pch=18,col="red",which=c(4))
#Cook's distance for the interaction model
bank[cooks.distance(interactmodel2)>0.5,]
## [1] Date SP500Price USDCADPrice Close Volume CPI
## [7] EPS PE
## <0 строк> (или 'row.names' нулевой длины)
plot(interactmodel2,pch=18,col="red",which=c(4))
#Cook's distance for the cubic model
bank[cooks.distance(higherordermodel)>0.5,]
## [1] Date SP500Price USDCADPrice Close Volume CPI
## [7] EPS PE
## <0 строк> (или 'row.names' нулевой длины)
plot(higherordermodel,pch=18,col="red",which=c(4))
The outliers aren’t influential in all three models. Therefore, the outliers were maintained in the data set.
## Model First_order_model Interaction_model Cubic_model
## 1 Linearity No Yes No
## 2 Equal variance Yes No No
## 3 Normality Yes No No
## 4 No Multicollinearity Yes Ignored Ignored
## 5 No Outliers Yes Yes Yes
In order to attempt to solve the problems with unequal variances and non-normality, we use Box-Cox transformations of the interaction model. Box-Cox transformation is a statistical method that assumes transforming the response variable so the data follows a normal distribution. The expression below presents the Box-Cox functions transformations for various values of lambda (the transformation parameter for response variable):
\(Y^\lambda_{i} = (Y^\lambda - 1)/\lambda\), if \(\lambda\) is not 0
\(Y^\lambda_{i}\) = loge(Y), if \(\lambda\) is 0
bc=boxcox(interactmodel2,lambda=seq(-1.5,1.5))
Based on the plot above and the result of the bestlambda function, the value of the best lambda was gained:
#extract best lambda
bestlambda=bc$x[which(bc$y==max(bc$y))]
bestlambda
## [1] 1.015152
The individual t-tests for the interaction model were done, using lambda = 1.015152 and lambda = 0.
# the output, when we choose λ=0
bcmodel_null=lm(log(Close)~SP500Price + USDCADPrice + EPS + PE + SP500Price:PE + EPS:PE,data=df_last_day)
summary(bcmodel_null)
##
## Call:
## lm(formula = log(Close) ~ SP500Price + USDCADPrice + EPS + PE +
## SP500Price:PE + EPS:PE, data = df_last_day)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.060900 -0.006734 0.003237 0.010794 0.041341
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.398e+00 3.945e-02 86.142 < 2e-16 ***
## SP500Price 1.842e-05 1.925e-05 0.957 0.34136
## USDCADPrice 1.216e-01 3.049e-02 3.986 0.00014 ***
## EPS -1.495e-01 2.036e-02 -7.340 1.12e-10 ***
## PE -2.141e-04 4.035e-04 -0.531 0.59712
## SP500Price:PE -3.167e-07 1.989e-07 -1.592 0.11513
## EPS:PE 1.257e-02 3.897e-04 32.260 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02062 on 86 degrees of freedom
## Multiple R-squared: 0.9941, Adjusted R-squared: 0.9937
## F-statistic: 2408 on 6 and 86 DF, p-value: < 2.2e-16
# the output, when we choose λ=1.015152
bcmodel=lm((((Close^(1.015152))-1)/(1.015152))~SP500Price + USDCADPrice + EPS + PE + SP500Price:PE + EPS:PE,data=df_last_day)
summary(bcmodel)
##
## Call:
## lm(formula = (((Close^(1.015152)) - 1)/(1.015152)) ~ SP500Price +
## USDCADPrice + EPS + PE + SP500Price:PE + EPS:PE, data = df_last_day)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.056245 -0.018025 -0.006727 0.011261 0.094615
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.334e+00 5.715e-02 -40.844 < 2e-16 ***
## SP500Price -8.739e-06 2.789e-05 -0.313 0.755
## USDCADPrice -1.835e-01 4.418e-02 -4.155 7.66e-05 ***
## EPS 2.093e-01 2.950e-02 7.094 3.45e-10 ***
## PE 7.905e-04 5.846e-04 1.352 0.180
## SP500Price:PE 2.330e-07 2.882e-07 0.808 0.421
## EPS:PE 1.069e+00 5.646e-04 1893.171 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02988 on 86 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.057e+07 on 6 and 86 DF, p-value: < 2.2e-16
#Shapiro-Wilk test for the interaction model after box-cox transformations
shapiro.test(residuals(bcmodel))
##
## Shapiro-Wilk normality test
##
## data: residuals(bcmodel)
## W = 0.93722, p-value = 0.0002308
#Testing for Homoscedasticity - Breusch-Pagan test for the interaction model after box-cox transformations
bptest(bcmodel)
##
## studentized Breusch-Pagan test
##
## data: bcmodel
## BP = 14.738, df = 6, p-value = 0.0224
As a result, we can see that after Box-Cox transformations of the response variable, some predictors and interactions in the model lost their significance. Furthermore, the problems with non-normality and heteroscedasticity still remained after doing Breusch-Pagan and Shapiro-Wilk tests which showed P-value < 0.05. Therefore, we cannot use this model for predictive purposes.
Although the interaction model and the cubic model exhibited a higher Adjusted R-squared and lower RMSE, we prioritize the fulfillment of assumptions. Therefore, our final model is the first-order model without interactions:
\(Y_{close} = \beta_{0} + \beta_{1}SP500Price + \beta_{2}USDCADPrice + \beta_{3}EPS + \beta_{4}PE\)
Adjusted R-squared = 0.9399. This can be interpreted that 93.99% of the variation of the stock closing price is explained by the model. RMSE = 5.67 which is the standard deviation of the unexplained variance, in other words, it is the average distance between the real values and the predicted (by the model) values of the stock closing price.
There are a total of 5 coefficients in our model which are estimated:
#first-order model without interactions
bestfirstmodel<-lm(Close~SP500Price + USDCADPrice + EPS + PE, data=df_last_day) #best first order model
bestfirstmodel
##
## Call:
## lm(formula = Close ~ SP500Price + USDCADPrice + EPS + PE, data = df_last_day)
##
## Coefficients:
## (Intercept) SP500Price USDCADPrice EPS PE
## -28.20638 0.01453 33.99131 22.62114 0.15259
\(Y_{close} = -28.21 + 0.01SP500Price + 33.99USDCADPrice + 22.62EPS + 0.15PE\)
This model satisfied four of the assumptions of multiple linear regression as highlighted earlier (Normality, Equal variance, Multicollinearity and Outliers assumptions). The linearity is the only assumption that wasn’t met by the first order model.
The current project has provided valuable insights into analyzing the stock price of National Bank of Canada using a multiple linear regression model. However, there are several areas that can be explored further to enhance the predictive capabilities of the model and deepen our understanding of stock price dynamics. The following areas represent potential avenues for future work:
While the current analysis considered a set of fundamental and technical variables, there are other relevant factors that can influence stock prices. Future work could involve exploring and incorporating additional predictors such as market sentiment, news sentiment, economic indicators, and industry-specific variables. By incorporating these factors, the predictive power of the model can be improved, leading to more accurate stock price forecasts.
To assess the robustness of the model, it is important to test it on different time periods or apply cross-validation techniques. This would help validate the model’s performance and determine if it generalizes well to unseen data. Additionally, conducting sensitivity analysis could provide insights into the stability of the model’s coefficients and predictions under different scenarios, contributing to a more comprehensive evaluation of the model’s performance.
Stock prices are inherently time-dependent, and their movements often exhibit autocorrelation and other time series properties. Future work could involve incorporating time series analysis techniques, such as autoregressive integrated moving average (ARIMA) models or generalized autoregressive conditional heteroskedasticity (GARCH) models, to capture the temporal dependencies and volatility clustering in stock price data. This addition would allow for a more accurate representation of the time series characteristics of stock prices.
This project primarily focused on predicting the stock price of National Bank of Canada. However, future work could expand the analysis to include other stocks from different sectors or countries. By examining stocks from various industries and markets, a broader understanding of the factors influencing stock prices can be gained, leading to more comprehensive and applicable predictive
Chen, J., & Murry, C. (n.d.). What Is the Stock Market, What Does It Do, and How Does It Work? Investopedia. Retrieved April 4, 2023, from https://www.investopedia.com/terms/s/stockmarket.asp
Consumer Price Index, monthly, not seasonally adjusted. (2023). Statistique Canada. Retrieved April 4, 2023, from https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=1810000401
National Bank Of Canada EPS - Earnings per Share 2010-2022 | NTIOF. (n.d.). Macrotrends. Retrieved April 4, 2023, from https://www.macrotrends.net/stocks/charts/NTIOF/national-bank-of-canada/eps-earnings-per-share-diluted
National Bank of Canada (NA.TO) Stock Historical Prices & Data. (n.d.). Yahoo Finance. Retrieved April 4, 2023, from https://ca.finance.yahoo.com/quote/NA.TO/history?p=NA.TO S&P 500 (^GSPC) Historical Data. (n.d.). Yahoo Finance. Retrieved April 4, 2023, from https://ca.finance.yahoo.com/quote/%5EGSPC/history?p=%5EGSPC