Regression Model: NVIDIA Corporation Stock Market Report
Introduction
About NVIDIA Corporation
Nvidia Corporation is an American multinational corporation and technology company headquartered in Santa Clara, California, and incorporated in Delaware.[5] It is a software and fabless company which designs and supplies graphics processing units (GPUs), application programming interfaces (APIs) for data science and high-performance computing, as well as system on a chip units (SoCs) for the mobile computing and automotive market. Nvidia is also a dominant supplier of artificial intelligence (AI) hardware and software.
NVIDIA pioneered accelerated computing to tackle challenges no one else can solve. NVIDIA's work in AI and digital twins is transforming the world’s largest industries and profoundly impacting society.
Nvidia went public on January 22, 1999. In January 2024, Forbes reported that Nvidia has increased its lobbying presence in Washington, D.C. as American lawmakers consider proposals to regulate artificial intelligence. From 2023 to 2024, the company reportedly hired at least four government affairs professionals with backgrounds at agencies including the United States Department of State and the Department of the Treasury. It was noted that the $350,000 spent by the company on lobbying in 2023 was small compared to a number of major tech companies in the artificial intelligence space.
Business Goal
This section is copied from a Kaggle kernel whose author did the same
analysis using Python instead of R. After we build the linear
regression model, we will evaluate it and single out the
most important indicators for stock price prediction.
This information gives investors a special advantage. From
the result of the multiple linear regression model, investors can see
how the NVIDIA stock moves and when the best time is to buy or sell
their stock.
Data Preparation
Prerequisites
Importing Dataset
This dataset contains NVIDIA stock market data from 06-01-2023 to 13-05-2024, sourced from Yahoo Finance.
stock <- read.csv("NVIDIA Historical Market Data.csv")
library(DT)
datatable(stock, rownames = FALSE, filter = "top", options = list(pageLength = 5, scrollX = TRUE))
Data Column
Date: Date
Open: Prior Day Open
Range: Log Range (Stock Daily High - Low), as Risk Indicator
Volume: Value of Daily Traded Volume
Log_Volume: Log Value of Daily Traded Volume, as Risk Indicator
Return_Percentage: ((Today AdjClose - Prior Day AdjClose) / Prior Day AdjClose) * 100%, as Momentum Indicator
X3_Day_Avg_AdjClose.Delay: 3_Day_Avg_AdjClose (Market Price Level Assessment indicator, Delay = 3), as Profit Indicator
PriorDay_AdjClose: prior day adjusted close price, as Historical Market Price Level Assessment
Adj.Close: adjusted close price
X3_Day_Avg_AdjClose.Delay will be renamed to AdjClose_delay.
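The rename step is not shown in the report. A minimal sketch of one way to do it in base R is given below; the small data frame is a toy stand-in for the imported data, and its values are illustrative only.

```r
# Toy stand-in for the imported data; only the column names matter here.
stock <- data.frame(
  Date = c("06-01-23", "09-01-23"),
  X3_Day_Avg_AdjClose.Delay = c(144, 146),
  Adj.Close = c(149, 156)
)

# Rename the column in place by matching on its current name.
names(stock)[names(stock) == "X3_Day_Avg_AdjClose.Delay"] <- "AdjClose_delay"

names(stock)
```

Matching on the name rather than on the column position keeps the rename valid even if the column order of the CSV changes.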
Data Processing
Missing Values
##              Date              Open             Range            Volume
##                 0                14                14                14
##        Log_Volume Return_Percentage    AdjClose_delay PriorDay_AdjClose
##                14                 0                14                14
##         Adj.Close
##                14
Remove the rows that contain NA values, then recheck the missing values.
##              Date              Open             Range            Volume
##                 0                 0                 0                 0
##        Log_Volume Return_Percentage    AdjClose_delay PriorDay_AdjClose
##                 0                 0                 0                 0
##         Adj.Close
##                 0
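The code behind the two missing-value counts is not shown. The na.action attribute in the later str() output suggests na.omit() was used, so a plausible sketch looks like this. The data frame is a toy example with made-up gaps, not the real dataset.

```r
# Toy data frame with two incomplete rows (hypothetical values).
stock <- data.frame(
  Date   = c("06-01-23", "09-01-23", "10-01-23", "11-01-23"),
  Open   = c(145, 153, NA, NA),
  Volume = c(40504400, 50423100, NA, NA)
)

# Count missing values per column.
na_before <- colSums(is.na(stock))

# Drop every row that contains at least one NA, then recount.
stock_clean <- na.omit(stock)
na_after <- colSums(is.na(stock_clean))

na_before
na_after
nrow(stock_clean)
```

na.omit() records the dropped row indices in an attribute of the result, which is why they still appear when the structure is printed.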
Recheck for duplicate rows after removing the missing values.
## [1] 0
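A count like the one above can be produced with duplicated(). The sketch below assumes that approach and uses a toy data frame with one deliberately repeated row.

```r
# Toy data frame where the third row repeats the second (hypothetical values).
stock <- data.frame(
  Date = c("06-01-23", "09-01-23", "09-01-23"),
  Open = c(145, 153, 153)
)

# duplicated() flags rows that repeat an earlier row; sum() counts them.
n_dup <- sum(duplicated(stock))

# Keep only the first occurrence of each row.
stock_unique <- stock[!duplicated(stock), ]

n_dup
sum(duplicated(stock_unique))
```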
Data Types
Check the structure of stock dataset using
str function.
## 'data.frame': 339 obs. of 9 variables:
## $ Date : chr "06-01-23" "09-01-23" "10-01-23" "11-01-23" ...
## $ Open : num 145 153 155 158 161 ...
## $ Range : num 9.76 9.15 4.9 4.65 11.45 ...
## $ Volume : int 40504400 50423100 38410100 35328500 55140900 44728700 51110200 43962400 45293200 56496700 ...
## $ Log_Volume : num 7.61 7.7 7.58 7.55 7.74 7.65 7.71 7.64 7.66 7.75 ...
## $ Return_Percentage: chr "4.16%" "5.18%" "1.80%" "0.58%" ...
## $ AdjClose_delay : num 144 146 149 155 158 ...
## $ PriorDay_AdjClose: num 143 149 156 159 160 ...
## $ Adj.Close : num 149 156 159 160 165 ...
## - attr(*, "na.action")= 'omit' Named int [1:14] 340 341 342 343 344 345 346 347 348 349 ...
## ..- attr(*, "names")= chr [1:14] "340" "341" "342" "343" ...
As we see in the result above, we need to convert the
Date column from character to the Date type, and remove the
% character from the Return_Percentage column and then
convert it to a numeric data type.
## 'data.frame': 339 obs. of 9 variables:
## $ Date : Date, format: "2023-01-06" "2023-01-09" ...
## $ Open : num 145 153 155 158 161 ...
## $ Range : num 9.76 9.15 4.9 4.65 11.45 ...
## $ Volume : int 40504400 50423100 38410100 35328500 55140900 44728700 51110200 43962400 45293200 56496700 ...
## $ Log_Volume : num 7.61 7.7 7.58 7.55 7.74 7.65 7.71 7.64 7.66 7.75 ...
## $ Return_Percentage: num 4.16 5.18 1.8 0.58 3.19 2.35 4.75 -1.84 -3.52 6.41 ...
## $ AdjClose_delay : num 144 146 149 155 158 ...
## $ PriorDay_AdjClose: num 143 149 156 159 160 ...
## $ Adj.Close : num 149 156 159 160 165 ...
## - attr(*, "na.action")= 'omit' Named int [1:14] 340 341 342 343 344 345 346 347 348 349 ...
## ..- attr(*, "names")= chr [1:14] "340" "341" "342" "343" ...
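The conversion code itself is not shown. A minimal base R sketch of both conversions follows, assuming the dates are in day-month-year order with a two-digit year, which is what the date range of the dataset implies. The three sample rows reuse values from the str() output above.

```r
# First three rows as they appear before conversion.
stock <- data.frame(
  Date = c("06-01-23", "09-01-23", "10-01-23"),
  Return_Percentage = c("4.16%", "5.18%", "1.80%"),
  stringsAsFactors = FALSE
)

# Parse dd-mm-yy strings into Date objects.
stock$Date <- as.Date(stock$Date, format = "%d-%m-%y")

# Strip the literal "%" sign, then convert the remaining text to numeric.
stock$Return_Percentage <- as.numeric(
  sub("%", "", stock$Return_Percentage, fixed = TRUE)
)

str(stock)
```

Using fixed = TRUE treats "%" as a plain character rather than a regular expression, which avoids surprises with other symbols.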
Our data is now clean and ready for the next step of the analysis.
Exploratory Data Analysis
We will check the correlation between the variables using the
ggcorr function from the GGally library. Because
the Date column is not numeric, it will be
ignored.
Correlation of Each Variable
Target variable: Adj.Close
Insight from the graphic above: PriorDay_AdjClose, AdjClose_delay, Range, and Open have a strong positive correlation with our target variable (Adj.Close).
We will therefore use those four variables as predictors to build the
model_stock model.
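The correlation plot comes from ggcorr, but the underlying numbers can also be inspected without plotting. The sketch below uses base cor() on synthetic data (made-up values, not the NVIDIA dataset) to show how non-numeric columns are dropped and how the correlations with the target can be ranked.

```r
set.seed(1)
n <- 50

# Synthetic stand-in data: Adj.Close tracks Open closely, Range is unrelated noise.
open <- 100 + cumsum(rnorm(n, mean = 1))
stock <- data.frame(
  Date      = as.Date("2023-01-06") + seq_len(n),
  Open      = open,
  Range     = abs(rnorm(n, mean = 5)),
  Adj.Close = open + rnorm(n)
)

# Keep only numeric columns (the Date column is excluded), then correlate.
num_cols <- sapply(stock, is.numeric)
cor_mat <- cor(stock[, num_cols])

# Correlation of every numeric variable with the target, strongest first.
sort(cor_mat[, "Adj.Close"], decreasing = TRUE)
```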
Modeling
In this section, we will build two models.
1. All-variable model
2. Four-variable model (based on the correlation graph above)
We will fit each model with the lm() function and then check the result
of the model using the summary() function.
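As a rough illustration of the two kinds of formula, the sketch below fits an all-predictor model (Adj.Close ~ .) and a hand-picked model on synthetic data. The data and the name model_open are assumptions made for the example; only lm() and summary() come from the report.

```r
set.seed(42)
n <- 100

# Synthetic stand-in data with a known linear relationship.
open <- 150 + cumsum(rnorm(n))
stock <- data.frame(
  Open  = open,
  Range = runif(n, 2, 12)
)
stock$Adj.Close <- 2 + 1.0 * stock$Open - 0.3 * stock$Range + rnorm(n)

# "." means "every other column in the data frame" as predictors.
model_all <- lm(Adj.Close ~ ., data = stock)

# Naming predictors explicitly gives a smaller, hand-picked model.
model_open <- lm(Adj.Close ~ Open, data = stock)

summary(model_all)
summary(model_all)$adj.r.squared
summary(model_open)$adj.r.squared
```

The summary() output has the same layout as the tables in this report: one row per coefficient with its estimate, standard error, t value and p-value.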
1. All-variable model
Target variable: Adj.Close
Predictor variables: all variables
##
## Call:
## lm(formula = Adj.Close ~ ., data = stock)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.9746 -2.3645 -0.3048 2.4398 28.2264
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.221e+02 1.675e+02 -3.713 0.00024 ***
## Date 6.611e-03 7.094e-03 0.932 0.35205
## Open 2.579e-01 5.109e-02 5.049 7.37e-07 ***
## Range -1.545e-01 4.727e-02 -3.267 0.00120 **
## Volume -6.185e-07 9.590e-08 -6.449 3.99e-10 ***
## Log_Volume 6.799e+01 1.179e+01 5.767 1.86e-08 ***
## Return_Percentage 4.295e+00 1.431e-01 30.006 < 2e-16 ***
## AdjClose_delay 2.801e-02 3.069e-02 0.912 0.36218
## PriorDay_AdjClose 7.206e-01 6.039e-02 11.931 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.355 on 330 degrees of freedom
## Multiple R-squared: 0.9991, Adjusted R-squared: 0.9991
## F-statistic: 4.755e+04 on 8 and 330 DF, p-value: < 2.2e-16
\[
\begin{aligned}
\mathrm{Adj.Close} = {} & -622.1 + 0.2579\,\mathrm{Open} - 0.1545\,\mathrm{Range} - 6.185 \times 10^{-7}\,\mathrm{Volume} \\
& + 67.99\,\mathrm{Log\_Volume} + 4.295\,\mathrm{Return\_Percentage} + 0.7206\,\mathrm{PriorDay\_AdjClose}
\end{aligned}
\]
As we see in the summary above, Date and AdjClose_delay are not
significant because their p-values (0.352 and 0.362) are greater than 0.05,
which is why both are left out of the equation. The adjusted R-squared is
0.9991.
2. Four-variable model
Target variable: Adj.Close
Predictor variables: PriorDay_AdjClose, AdjClose_delay, Range, Open
model_stock <- lm(Adj.Close ~ PriorDay_AdjClose + AdjClose_delay + Range + Open, data = stock)
summary(model_stock)
##
## Call:
## lm(formula = Adj.Close ~ PriorDay_AdjClose + AdjClose_delay +
## Range + Open, data = stock)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51.252 -6.079 -0.762 5.270 46.801
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.85650 1.68275 1.103 0.2707
## PriorDay_AdjClose -0.14323 0.09469 -1.513 0.1313
## AdjClose_delay 0.12597 0.05899 2.135 0.0335 *
## Range -0.32596 0.07477 -4.360 1.74e-05 ***
## Open 1.02651 0.07372 13.924 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.45 on 334 degrees of freedom
## Multiple R-squared: 0.9966, Adjusted R-squared: 0.9966
## F-statistic: 2.47e+04 on 4 and 334 DF, p-value: < 2.2e-16
\[\mathrm{Adj.Close} = 1.85650 + 0.12597\,\mathrm{AdjClose\_delay} - 0.32596\,\mathrm{Range} + 1.02651\,\mathrm{Open}\]
As we see in the summary above, PriorDay_AdjClose is not
significant because its p-value (0.1313) is greater than 0.05, and the
adjusted R-squared is 0.9966. We therefore need to evaluate and improve the model.
Evaluation
In this section, we will make the comparison of
model_stock and model_all using R-squared,
RMSE and MAPE.
Compare R-squared value
## [1] 0.9991123
## [1] 0.9965908
Compare RMSE and MAPE value:
# compute RMSE (use the pred_ineq and pred_ineq_all objects as y_pred)
RMSE(y_pred = pred_all , y_true = stock$Adj.Close)
## [1] 6.269972
## [1] 12.36153
## [1] 0.01058545
## [1] 0.01793644
RMSE and MAPE are error measures, so a smaller value indicates
a better model. Judging by the R-squared, RMSE, and MAPE
values, model_all is better than model_stock,
but we can still improve the model until we find the best one.
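To make the two error measures concrete, here is a small base R sketch. The helper names rmse and mape are defined here for illustration and are not the functions called in the report; the numbers are a toy example. MAPE is returned as a fraction, matching the scale of the values printed above.

```r
# Root mean squared error: square root of the mean squared difference.
rmse <- function(y_pred, y_true) sqrt(mean((y_true - y_pred)^2))

# Mean absolute percentage error, as a fraction (0.05 means 5%).
mape <- function(y_pred, y_true) mean(abs((y_true - y_pred) / y_true))

# Toy actual and predicted values.
y_true <- c(100, 200, 400)
y_pred <- c(110, 190, 400)

rmse_value <- rmse(y_pred, y_true)  # sqrt((100 + 100 + 0) / 3)
mape_value <- mape(y_pred, y_true)  # (0.10 + 0.05 + 0) / 3

rmse_value
mape_value
```

For predictions made on the training data, RMSE equals sqrt(RSS / n). With the residual sum of squares of 13327 that the step-wise output reports for the full model and n = 339, this gives about 6.27, in line with the first RMSE value above.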
Model Improvement
Step-wise Regression (model_backward)
In this section we improve the model using step-wise regression. Step-wise regression helps us choose good predictors by searching for the combination of predictors that gives the best model according to the AIC value. The Akaike Information Criterion (AIC) estimates the amount of information lost by a model, so a model with a smaller AIC is preferred. We will use backward step-wise regression, which starts from the full model and removes one predictor at a time.
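The call that produced the trace below is not shown. A sketch of backward elimination with step() from the stats package follows, together with a manual check of the AIC value that step() uses for linear models, n * log(RSS / n) + 2 * k. The data are synthetic and the variable names are made up for the example.

```r
set.seed(7)
n <- 120

# Synthetic data: y depends on x1 and x2; "noise" is an irrelevant predictor.
x1 <- rnorm(n)
x2 <- rnorm(n)
noise <- rnorm(n)
d <- data.frame(y = 3 + 2 * x1 - 1.5 * x2 + rnorm(n), x1, x2, noise)

full <- lm(y ~ ., data = d)

# Backward elimination: drop one term at a time while the AIC improves.
backward <- step(full, direction = "backward", trace = 0)

# AIC as computed by step() for lm fits, where k counts all coefficients
# including the intercept.
rss <- sum(residuals(full)^2)
k <- length(coef(full))
aic_manual <- n * log(rss / n) + 2 * k
aic_step <- extractAIC(full)[2]

formula(backward)
c(manual = aic_manual, step = aic_step)
```

Applying the same formula to the starting model in the trace, with n = 339, RSS = 13327 and k = 9, gives 339 * log(13327 / 339) + 18, which is approximately 1262.65 and agrees with the reported starting AIC.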
## Start: AIC=1262.65
## Adj.Close ~ Date + Open + Range + Volume + Log_Volume + Return_Percentage +
## AdjClose_delay + PriorDay_AdjClose
##
## Df Sum of Sq RSS AIC
## - AdjClose_delay 1 34 13361 1261.5
## - Date 1 35 13362 1261.5
## <none> 13327 1262.7
## - Range 1 431 13758 1271.5
## - Open 1 1029 14356 1285.9
## - Log_Volume 1 1343 14670 1293.2
## - Volume 1 1680 15007 1300.9
## - PriorDay_AdjClose 1 5749 19076 1382.2
## - Return_Percentage 1 36360 49687 1706.8
##
## Step: AIC=1261.51
## Adj.Close ~ Date + Open + Range + Volume + Log_Volume + Return_Percentage +
## PriorDay_AdjClose
##
## Df Sum of Sq RSS AIC
## - Date 1 41 13402 1260.6
## <none> 13361 1261.5
## - Range 1 423 13783 1270.1
## - Open 1 1020 14381 1284.5
## - Log_Volume 1 1344 14705 1292.0
## - Volume 1 1705 15066 1300.2
## - PriorDay_AdjClose 1 8533 21894 1426.9
## - Return_Percentage 1 36968 50329 1709.1
##
## Step: AIC=1260.56
## Adj.Close ~ Open + Range + Volume + Log_Volume + Return_Percentage +
## PriorDay_AdjClose
##
## Df Sum of Sq RSS AIC
## <none> 13402 1260.6
## - Range 1 460 13862 1270.0
## - Open 1 1032 14434 1283.7
## - Log_Volume 1 1315 14717 1290.3
## - Volume 1 1688 15090 1298.8
## - PriorDay_AdjClose 1 8646 22048 1427.3
## - Return_Percentage 1 37176 50578 1708.8
##
## Call:
## lm(formula = Adj.Close ~ Open + Range + Volume + Log_Volume +
## Return_Percentage + PriorDay_AdjClose, data = stock)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.0481 -2.2679 -0.2363 2.5179 28.8744
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.873e+02 8.563e+01 -5.691 2.78e-08 ***
## Open 2.581e-01 5.104e-02 5.056 7.08e-07 ***
## Range -1.584e-01 4.691e-02 -3.377 0.00082 ***
## Volume -6.190e-07 9.572e-08 -6.467 3.58e-10 ***
## Log_Volume 6.706e+01 1.175e+01 5.708 2.54e-08 ***
## Return_Percentage 4.315e+00 1.422e-01 30.347 < 2e-16 ***
## PriorDay_AdjClose 7.527e-01 5.143e-02 14.635 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.354 on 332 degrees of freedom
## Multiple R-squared: 0.9991, Adjusted R-squared: 0.9991
## F-statistic: 6.343e+04 on 6 and 332 DF, p-value: < 2.2e-16
Check assumptions of model_backward
In this section, we will check the assumptions for the
model_backward.
1. Linearity
plot(model_backward, which = 1)
abline(h = 10, col = "darkturquoise")
abline(h = -10, col = "darkturquoise")
model_backward passes the linearity
assumption check because the residuals are scattered
around 0.
2. Normality of Residuals
1. Residual histogram visualization using the hist() function
2. Statistics test using the shapiro.test() function
##
## Shapiro-Wilk normality test
##
## data: model_backward$residuals
## W = 0.9101, p-value = 2.461e-13
Since p-value = 2.461e-13 is smaller than alpha (0.05), we reject H0: the errors are not normally distributed, so the model does not pass the normality assumption.
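A short sketch of how this check can be run on a model's residuals is shown below. It uses synthetic data with deliberately skewed errors, so the numbers are illustrative only; the object names are hypothetical.

```r
set.seed(123)
x <- rnorm(200)

# One response with normal errors and one with skewed (exponential) errors.
y_normal <- 1 + 2 * x + rnorm(200)
y_skewed <- 1 + 2 * x + rexp(200)

fit_normal <- lm(y_normal ~ x)
fit_skewed <- lm(y_skewed ~ x)

# H0 of the Shapiro-Wilk test: the residuals are normally distributed.
sw_normal <- shapiro.test(residuals(fit_normal))
sw_skewed <- shapiro.test(residuals(fit_skewed))

alpha <- 0.05
sw_normal$p.value
sw_skewed$p.value < alpha  # a small p-value rejects normality
```

A histogram from hist(residuals(fit_skewed)) would show the same asymmetry visually, which is why the report pairs the plot with the test.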
3. Homoscedasticity of Residuals
1. Scatter plot visualization: fitted.values vs residuals
- fitted.values is the predicted value for the training data
- residuals is the error value
2. Statistics test using bptest() from the lmtest package
Breusch-Pagan hypothesis test:
- H0: error spread is constant (homoscedasticity)
- H1: error spread is NOT constant (heteroscedasticity)
Expected condition: H0
- p-value > alpha -> fail to reject H0
- p-value < alpha -> reject H0 (accept H1)
##
## studentized Breusch-Pagan test
##
## data: model_backward
## BP = 115.54, df = 6, p-value < 2.2e-16
Conclusion: the p-value is below 2.2e-16, which is smaller than alpha (0.05), so we reject H0. The error spread is not constant (heteroscedasticity), and the model does not pass the assumption test.
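To show what the test measures, the sketch below computes the studentized (Koenker) Breusch-Pagan statistic by hand in base R: regress the squared residuals on the predictors and take n times the R-squared of that auxiliary regression. This should correspond to what bptest() reports by default, but treat it as an illustration on synthetic data rather than a replacement for the package function.

```r
set.seed(99)
n <- 300
x <- runif(n, 1, 10)

# Synthetic data whose error spread grows with x (heteroscedastic by design).
y <- 2 + 3 * x + rnorm(n, sd = x)
fit <- lm(y ~ x)

# Auxiliary regression of squared residuals on the same predictor.
u2 <- residuals(fit)^2
aux <- lm(u2 ~ x)

# Studentized BP statistic: n * R^2, chi-squared with one df per predictor.
bp_stat <- n * summary(aux)$r.squared
bp_df <- length(coef(aux)) - 1
bp_p <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

c(BP = bp_stat, df = bp_df, p.value = bp_p)
```

A p-value below alpha rejects H0, meaning the residual variance changes with the predictors.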
4. No Multicollinearity
##              Open             Range            Volume        Log_Volume
##        999.781674          3.159746         19.099636         17.776953
## Return_Percentage PriorDay_AdjClose
##          1.636250       1002.871156
Variables flagged as multicollinear (VIF value > 10): Open, Volume, Log_Volume, PriorDay_AdjClose. We fix this by removing, from each pair of similar variables, the one with the larger VIF value. Volume is similar to Log_Volume, so we keep Log_Volume because its VIF value is smaller. Open is similar to PriorDay_AdjClose, so we keep Open because its VIF value is smaller.
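The VIF of a predictor is 1 / (1 - R^2), where R^2 comes from regressing that predictor on all the others. The sketch below computes it by hand in base R on synthetic data in which two columns are near copies of each other, mimicking the Open and PriorDay_AdjClose situation. The values are made up and the name vif_manual is hypothetical.

```r
set.seed(2024)
n <- 200

# Synthetic predictors: PriorDay_AdjClose is almost a copy of Open.
open <- seq(150, 900, length.out = n) + rnorm(n, sd = 5)
X <- data.frame(
  Open              = open,
  PriorDay_AdjClose = open + rnorm(n, sd = 5),
  Range             = runif(n, 2, 12)
)

# For each predictor, regress it on the others and convert R^2 to a VIF.
vif_manual <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})

round(vif_manual, 2)
```

The two near-duplicate columns get very large VIF values while the unrelated column stays close to 1, which is the pattern seen in the output above.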
Build the model
model_eval1 <- lm(Adj.Close ~ Range + Log_Volume + Return_Percentage + Open, data = stock)
summary(model_eval1)
##
## Call:
## lm(formula = Adj.Close ~ Range + Log_Volume + Return_Percentage +
## Open, data = stock)
##
## Residuals:
## Min 1Q Median 3Q Max
## -62.262 -3.994 -0.138 3.580 32.981
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 110.519133 38.787593 2.849 0.00465 **
## Range -0.211831 0.067208 -3.152 0.00177 **
## Log_Volume -14.535944 5.058832 -2.873 0.00432 **
## Return_Percentage 2.814919 0.165811 16.977 < 2e-16 ***
## Open 1.006929 0.003638 276.759 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.175 on 334 degrees of freedom
## Multiple R-squared: 0.9982, Adjusted R-squared: 0.9981
## F-statistic: 4.558e+04 on 4 and 334 DF, p-value: < 2.2e-16
Check the linearity
plot(model_eval1, which = 1)
abline(h = 10, col = "darkturquoise")
abline(h = -10, col = "darkturquoise")
Check the Normality
##
## Shapiro-Wilk normality test
##
## data: model_eval1$residuals
## W = 0.86911, p-value = 2.443e-16
Check Homoscedasticity of Residuals
##
## studentized Breusch-Pagan test
##
## data: model_eval1
## BP = 121.67, df = 4, p-value < 2.2e-16
Check the multicollinearity
## Range Log_Volume Return_Percentage Open
## 3.110908 1.580359 1.067139 2.436284
Summary of model_eval1:
- The adjusted R-squared value of model_eval1 is 0.9981.
- All of the predictors are significant.
- The residuals lie in an area not far from zero, so they fulfill the
linearity assumption.
- The p-value of the normality test is 2.443e-16, which is
smaller than alpha (0.05), so the normality assumption is not
met.
- The p-value of the homoscedasticity test is below 2.2e-16, which is smaller than
alpha (0.05), so the homoscedasticity assumption is not met.
- All variables have a VIF value smaller than 10, so there is no
multicollinearity.
- Even though model_eval1 does not meet the normality and
homoscedasticity assumptions, the minimum assumptions that must be met for
the model to be usable are linearity and no multicollinearity.
Conclusion
After evaluating and improving the model, we find that
model_eval1 is the best model, with an adjusted R-squared value
of 0.9981. It also meets the linearity and no-multicollinearity
assumptions, the minimum assumptions that must be met. Based on
this model, the next-day adjusted close price depends on
Range (the daily high-low range of the stock price),
Log_Volume (log value of daily traded volume),
Return_Percentage, and the stock Open price.