Regression Model: NVIDIA Corporation Stock Market Report

Introduction

About NVIDIA Corporation

Nvidia Corporation is an American multinational corporation and technology company headquartered in Santa Clara, California, and incorporated in Delaware. It is a software and fabless company which designs and supplies graphics processing units (GPUs), application programming interfaces (APIs) for data science and high-performance computing, as well as system on a chip units (SoCs) for the mobile computing and automotive market. Nvidia is also a dominant supplier of artificial intelligence (AI) hardware and software.

NVIDIA pioneered accelerated computing to tackle challenges no one else can solve. NVIDIA's work in AI and digital twins is transforming the world’s largest industries and profoundly impacting society.

NVIDIA Corporation

Nvidia went public on January 22, 1999. In January 2024, Forbes reported that Nvidia has increased its lobbying presence in Washington, D.C. as American lawmakers consider proposals to regulate artificial intelligence. From 2023 to 2024, the company reportedly hired at least four government affairs professionals with backgrounds at agencies including the United States Department of State and the Department of the Treasury. It was noted that the $350,000 spent by the company on lobbying in 2023 was small compared to a number of major tech companies in the artificial intelligence space.

Business Goal

This section is adapted from a Kaggle kernel whose author performed the same analysis in Python instead of R. After we build the linear regression model, we will evaluate it and identify the most important predictors of the stock price.
This information can give investors an advantage. From the results of the multiple linear regression model, investors can see how NVIDIA stock moves and when the best time to buy or sell their stock may be.

Data Preparation

Prerequisites

Importing Libraries

library(dplyr)
library(MLmetrics)
library(lmtest)
library(car)
library(GGally)

Importing Dataset

This dataset contains NVIDIA stock market data from 06-01-2023 to 13-05-2024, sourced from Yahoo Finance.

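library(DT)

# Load the historical market data from the CSV file
stock <- read.csv("NVIDIA Historical Market Data.csv")
# Interactive, filterable preview of the data, 5 rows per page
datatable(stock, rownames = FALSE, filter = "top", options = list(pageLength = 5, scrollX = TRUE))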

Data Column
Date: Trading date
Open: Prior day open price
Range: Log Range (stock daily High - Low), a risk indicator
Volume: Daily traded volume
Log_Volume: Log value of the daily traded volume, a risk indicator
Return_Percentage: [(Today AdjClose - Prior Day AdjClose) / Prior Day AdjClose] * 100%, a momentum indicator
X3_Day_Avg_AdjClose.Delay: 3-day average adjusted close price (market price level assessment indicator, delay = 3), a profit indicator
PriorDay_AdjClose: Prior day adjusted close price, a historical market price level assessment
Adj.Close: Adjusted close price
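As a quick sanity check on the column definitions above, the sketch below recomputes two derived columns from raw values. It assumes that Log_Volume is a base-10 logarithm rounded to two decimals, which is an inference from the sample values shown in this report rather than something the dataset documents. The helper name `return_pct` is hypothetical.

```r
# First five Volume and Log_Volume values as displayed in this report
volume     <- c(40504400, 50423100, 38410100, 35328500, 55140900)
log_volume <- c(7.61, 7.70, 7.58, 7.55, 7.74)

# Assumption: Log_Volume = log10(Volume), rounded to two decimals
all.equal(round(log10(volume), 2), log_volume)

# Hypothetical helper implementing the Return_Percentage definition
return_pct <- function(adj_close, prior_adj_close) {
  (adj_close - prior_adj_close) / prior_adj_close * 100
}
return_pct(adj_close = 102, prior_adj_close = 100)  # a 2% daily return
```

If the first check returns TRUE, Volume and Log_Volume carry the same information on different scales, which is worth keeping in mind when both are used as predictors.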

X3_Day_Avg_AdjClose.Delay will be renamed to AdjClose_delay.

names(stock) <- c("Date","Open", "Range", "Volume", "Log_Volume", "Return_Percentage", "AdjClose_delay", "PriorDay_AdjClose", "Adj.Close")

Data Processing

Duplicates

sum(duplicated(stock))

## [1] 13

Missing Values

colSums(is.na(x = stock))

##              Date              Open             Range            Volume 
##                 0                14                14                14 
##        Log_Volume Return_Percentage    AdjClose_delay PriorDay_AdjClose 
##                14                 0                14                14 
##         Adj.Close 
##                14

Remove the rows that contain NA values, then recheck for missing values.

stock <- na.omit(stock)
colSums(is.na(x = stock))

##              Date              Open             Range            Volume 
##                 0                 0                 0                 0 
##        Log_Volume Return_Percentage    AdjClose_delay PriorDay_AdjClose 
##                 0                 0                 0                 0 
##         Adj.Close 
##                 0

Recheck for duplicates after removing the missing values.

sum(duplicated(stock))

## [1] 0
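One plausible reading of the counts above is that the 14 rows with missing values are empty rows that are identical to each other, so 13 of them are flagged as duplicates and all of them disappear after na.omit(). The sketch below reproduces that pattern on a small made-up data frame; the values are hypothetical and only illustrate how duplicated() and na.omit() interact.

```r
# Toy data: two valid rows followed by three empty rows (hypothetical values)
toy <- data.frame(
  Date = c("06-01-23", "09-01-23", "", "", ""),
  Open = c(145, 153, NA, NA, NA),
  stringsAsFactors = FALSE
)

sum(duplicated(toy))        # 2: the 2nd and 3rd empty rows repeat the 1st
toy_clean <- na.omit(toy)   # drops every row that contains an NA
nrow(toy_clean)             # 2
sum(duplicated(toy_clean))  # 0
```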

Data Types

Check the structure of the stock dataset using the str() function.

str(stock)

## 'data.frame':    339 obs. of  9 variables:
##  $ Date             : chr  "06-01-23" "09-01-23" "10-01-23" "11-01-23" ...
##  $ Open             : num  145 153 155 158 161 ...
##  $ Range            : num  9.76 9.15 4.9 4.65 11.45 ...
##  $ Volume           : int  40504400 50423100 38410100 35328500 55140900 44728700 51110200 43962400 45293200 56496700 ...
##  $ Log_Volume       : num  7.61 7.7 7.58 7.55 7.74 7.65 7.71 7.64 7.66 7.75 ...
##  $ Return_Percentage: chr  "4.16%" "5.18%" "1.80%" "0.58%" ...
##  $ AdjClose_delay   : num  144 146 149 155 158 ...
##  $ PriorDay_AdjClose: num  143 149 156 159 160 ...
##  $ Adj.Close        : num  149 156 159 160 165 ...
##  - attr(*, "na.action")= 'omit' Named int [1:14] 340 341 342 343 344 345 346 347 348 349 ...
##   ..- attr(*, "names")= chr [1:14] "340" "341" "342" "343" ...

As the result above shows, we need to convert the Date column to the Date type, remove the % character from the Return_Percentage column, and then convert that column to numeric.

stock$Date <-  as.Date(stock$Date, format = "%d-%m-%y")

stock$Return_Percentage <-  gsub('%','',stock$Return_Percentage)

stock$Return_Percentage <-  as.numeric(stock$Return_Percentage)

str(stock)

## 'data.frame':    339 obs. of  9 variables:
##  $ Date             : Date, format: "2023-01-06" "2023-01-09" ...
##  $ Open             : num  145 153 155 158 161 ...
##  $ Range            : num  9.76 9.15 4.9 4.65 11.45 ...
##  $ Volume           : int  40504400 50423100 38410100 35328500 55140900 44728700 51110200 43962400 45293200 56496700 ...
##  $ Log_Volume       : num  7.61 7.7 7.58 7.55 7.74 7.65 7.71 7.64 7.66 7.75 ...
##  $ Return_Percentage: num  4.16 5.18 1.8 0.58 3.19 2.35 4.75 -1.84 -3.52 6.41 ...
##  $ AdjClose_delay   : num  144 146 149 155 158 ...
##  $ PriorDay_AdjClose: num  143 149 156 159 160 ...
##  $ Adj.Close        : num  149 156 159 160 165 ...
##  - attr(*, "na.action")= 'omit' Named int [1:14] 340 341 342 343 344 345 346 347 348 349 ...
##   ..- attr(*, "names")= chr [1:14] "340" "341" "342" "343" ...
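The two conversions can be tried in isolation on a couple of sample strings. This is a minimal sketch with made-up input vectors; it uses sub() with fixed = TRUE as a slightly stricter alternative to the gsub() call in the report, and adds a guard that fails if any value does not parse.

```r
dates   <- c("06-01-23", "09-01-23")
returns <- c("4.16%", "-1.84%")

# "%d-%m-%y" reads day-month-year with a two-digit year
parsed_dates <- as.Date(dates, format = "%d-%m-%y")
parsed_dates

# fixed = TRUE treats "%" as a literal character, not a pattern
parsed_returns <- as.numeric(sub("%", "", returns, fixed = TRUE))
parsed_returns

# Guard: a wrong format string or a stray character would produce NA
stopifnot(!anyNA(parsed_dates), !anyNA(parsed_returns))
```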

Our data is now clean and ready for the next step of the analysis.

Exploratory Data Analysis

We will check the correlation between the variables using the ggcorr function from the GGally library. Because the Date column is not numeric, it will be ignored.

ggcorr(data = stock, label = T)

Correlation of Each Variable

Target variable: Adj.Close

Insight from the graphic above: PriorDay_AdjClose, AdjClose_delay, Range and Open have a strong positive correlation with our target variable (Adj.Close).

We will then use those four variables as predictors to build the model_stock model.
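The correlation plot can be complemented with the underlying numbers. The sketch below shows one way to list each numeric column's correlation with the target using base R. It runs on a synthetic stand-in data frame (the values are made up), so only the pattern, not the output, carries over to the stock data.

```r
set.seed(1)
# Synthetic stand-in for the stock data frame (values are made up)
prior <- cumsum(rnorm(100)) + 100
toy <- data.frame(
  Date = as.Date("2023-01-06") + 0:99,
  PriorDay_AdjClose = prior,
  Adj.Close = prior + rnorm(100),
  Log_Volume = rnorm(100, mean = 7.6, sd = 0.1)
)

# Keep numeric columns only (Date is dropped), then read off the target column
num_cols <- toy[sapply(toy, is.numeric)]
cors <- cor(num_cols)[, "Adj.Close"]
sort(cors, decreasing = TRUE)
```

Applied to the real data, the same two lines would give the exact coefficients behind the labels in the ggcorr plot.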

plot(x= stock$Open, y= stock$Adj.Close, col= "maroon")

plot(x= stock$Range, y= stock$Adj.Close, col= "darkturquoise")

plot(x= stock$PriorDay_AdjClose, y= stock$Adj.Close, col= "maroon")

plot(x= stock$AdjClose_delay, y= stock$Adj.Close, col= "darkturquoise")

Modeling

In this section, we will build two models.
1. All-variable model
2. 4-variable model (based on the correlation graph above)

We will use the lm() function and then check the result of each model using the summary() function.

1. All-variable model
Target variable: Adj.Close
Predictor variables: all other variables

model_all <- lm(Adj.Close ~ ., data = stock)

summary(model_all)

## 
## Call:
## lm(formula = Adj.Close ~ ., data = stock)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -24.9746  -2.3645  -0.3048   2.4398  28.2264 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -6.221e+02  1.675e+02  -3.713  0.00024 ***
## Date               6.611e-03  7.094e-03   0.932  0.35205    
## Open               2.579e-01  5.109e-02   5.049 7.37e-07 ***
## Range             -1.545e-01  4.727e-02  -3.267  0.00120 ** 
## Volume            -6.185e-07  9.590e-08  -6.449 3.99e-10 ***
## Log_Volume         6.799e+01  1.179e+01   5.767 1.86e-08 ***
## Return_Percentage  4.295e+00  1.431e-01  30.006  < 2e-16 ***
## AdjClose_delay     2.801e-02  3.069e-02   0.912  0.36218    
## PriorDay_AdjClose  7.206e-01  6.039e-02  11.931  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.355 on 330 degrees of freedom
## Multiple R-squared:  0.9991, Adjusted R-squared:  0.9991 
## F-statistic: 4.755e+04 on 8 and 330 DF,  p-value: < 2.2e-16

\[\text{Adj.Close} = -622.1 + 0.2579\,\text{Open} - 0.1545\,\text{Range} - 6.185\times10^{-7}\,\text{Volume}\]
\[+\ 67.99\,\text{Log\_Volume} + 4.295\,\text{Return\_Percentage} + 0.7206\,\text{PriorDay\_AdjClose}\]

As the summary above shows, Date and AdjClose_delay are not significant because their p-values are greater than 0.05, so both are left out of the equation. The adjusted R-squared is 0.9991.

2. 4-variable model
Target variable: Adj.Close
Predictor variables: PriorDay_AdjClose, AdjClose_delay, Range, Open

model_stock <- lm(Adj.Close ~ PriorDay_AdjClose + AdjClose_delay + Range + Open, data = stock)
summary(model_stock)

## 
## Call:
## lm(formula = Adj.Close ~ PriorDay_AdjClose + AdjClose_delay + 
##     Range + Open, data = stock)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -51.252  -6.079  -0.762   5.270  46.801 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.85650    1.68275   1.103   0.2707    
## PriorDay_AdjClose -0.14323    0.09469  -1.513   0.1313    
## AdjClose_delay     0.12597    0.05899   2.135   0.0335 *  
## Range             -0.32596    0.07477  -4.360 1.74e-05 ***
## Open               1.02651    0.07372  13.924  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.45 on 334 degrees of freedom
## Multiple R-squared:  0.9966, Adjusted R-squared:  0.9966 
## F-statistic: 2.47e+04 on 4 and 334 DF,  p-value: < 2.2e-16

\[\text{Adj.Close} = 1.85650 + 0.12597\,\text{AdjClose\_delay} - 0.32596\,\text{Range} + 1.02651\,\text{Open}\]

As the summary above shows, PriorDay_AdjClose is not significant because its p-value is greater than 0.05, so it is left out of the equation. The adjusted R-squared is 0.9966. Next, we need to evaluate and improve the model.
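A regression equation like the ones in this section can be rebuilt directly from coef(). The sketch below does this on the built-in mtcars data as a stand-in (it is not the NVIDIA data) and checks that the hand-built equation matches predict(). Note that this only holds when every fitted term is kept: an equation that leaves out non-significant terms while reusing the remaining coefficients is an approximation, and refitting without those terms would give slightly different coefficients.

```r
# Stand-in example on the built-in mtcars data, not the NVIDIA data
fit <- lm(mpg ~ wt + hp, data = mtcars)

b <- coef(fit)  # named vector: (Intercept), wt, hp
manual <- b["(Intercept)"] + b["wt"] * mtcars$wt + b["hp"] * mtcars$hp

# The hand-built equation reproduces predict() when no term is dropped
all.equal(unname(manual), unname(predict(fit, mtcars)))
```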

Evaluation

In this section, we will compare model_stock and model_all using adjusted R-squared, RMSE and MAPE.

Compare adjusted R-squared values

summary(model_all)$adj.r.squared

## [1] 0.9991123

summary(model_stock)$adj.r.squared

## [1] 0.9965908

Compare RMSE and MAPE value:

pred_all <-  predict(model_all , stock)
pred_stock <- predict(model_stock , stock)

# compute RMSE (use the pred_all and pred_stock objects as y_pred)
RMSE(y_pred = pred_all , y_true = stock$Adj.Close)

## [1] 6.269972

RMSE(y_pred = pred_stock , y_true = stock$Adj.Close)

## [1] 12.36153

MAPE(y_pred = pred_all , y_true = stock$Adj.Close)

## [1] 0.01058545

MAPE(y_pred = pred_stock , y_true = stock$Adj.Close)

## [1] 0.01793644

RMSE and MAPE are error measures, so a smaller value indicates a better model. Based on the adjusted R-squared, RMSE and MAPE values, all computed on the same data used for fitting, model_all is better than model_stock, but we can still try to improve the model further.
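For reference, both metrics are simple to write out in base R. The sketch below uses hypothetical helper names (`rmse`, `mape`) and follows the standard definitions; MAPE is returned as a fraction, so the 0.0106 reported for model_all corresponds to an average error of about 1.06%. The last line relates RMSE to the residual standard error printed by summary().

```r
# Base-R versions of the two error metrics (helper names are hypothetical)
rmse <- function(y_true, y_pred) sqrt(mean((y_true - y_pred)^2))
mape <- function(y_true, y_pred) mean(abs((y_true - y_pred) / y_true))

y_true <- c(100, 200)
y_pred <- c(110, 180)
rmse(y_true, y_pred)  # sqrt(250), about 15.81
mape(y_true, y_pred)  # 0.1, i.e. 10%

# RMSE divides by n, the residual standard error by the residual df,
# so model_all's reported 6.355 (df = 330, n = 339) rescales to about 6.27
6.355 * sqrt(330 / 339)
```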

Model Improvement

Step-wise Regression (`model_backward`)

In this section we will improve the model using step-wise regression. Step-wise regression helps us choose good predictors by searching for the combination of predictors that produces the best model based on the AIC value. The Akaike Information Criterion (AIC) estimates the amount of information lost by a model, so a good regression model has a small AIC. We will use backward step-wise regression, which starts from the full model and removes one predictor at a time as long as the AIC decreases.

model_backward <- step(object = model_all,
                      direction = "backward")

## Start:  AIC=1262.65
## Adj.Close ~ Date + Open + Range + Volume + Log_Volume + Return_Percentage + 
##     AdjClose_delay + PriorDay_AdjClose
## 
##                     Df Sum of Sq   RSS    AIC
## - AdjClose_delay     1        34 13361 1261.5
## - Date               1        35 13362 1261.5
## <none>                           13327 1262.7
## - Range              1       431 13758 1271.5
## - Open               1      1029 14356 1285.9
## - Log_Volume         1      1343 14670 1293.2
## - Volume             1      1680 15007 1300.9
## - PriorDay_AdjClose  1      5749 19076 1382.2
## - Return_Percentage  1     36360 49687 1706.8
## 
## Step:  AIC=1261.51
## Adj.Close ~ Date + Open + Range + Volume + Log_Volume + Return_Percentage + 
##     PriorDay_AdjClose
## 
##                     Df Sum of Sq   RSS    AIC
## - Date               1        41 13402 1260.6
## <none>                           13361 1261.5
## - Range              1       423 13783 1270.1
## - Open               1      1020 14381 1284.5
## - Log_Volume         1      1344 14705 1292.0
## - Volume             1      1705 15066 1300.2
## - PriorDay_AdjClose  1      8533 21894 1426.9
## - Return_Percentage  1     36968 50329 1709.1
## 
## Step:  AIC=1260.56
## Adj.Close ~ Open + Range + Volume + Log_Volume + Return_Percentage + 
##     PriorDay_AdjClose
## 
##                     Df Sum of Sq   RSS    AIC
## <none>                           13402 1260.6
## - Range              1       460 13862 1270.0
## - Open               1      1032 14434 1283.7
## - Log_Volume         1      1315 14717 1290.3
## - Volume             1      1688 15090 1298.8
## - PriorDay_AdjClose  1      8646 22048 1427.3
## - Return_Percentage  1     37176 50578 1708.8
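The AIC values in the trace above can be reproduced by hand. For lm fits, step() relies on extractAIC(), which computes n * log(RSS / n) + 2 * k, where k is the number of estimated coefficients. The sketch below plugs in the RSS values printed in the trace; because those are rounded, the results match the trace only approximately. The helper name `step_aic` is hypothetical.

```r
# AIC as computed by extractAIC() for lm fits:
# n * log(RSS / n) + 2 * k, with k = number of estimated coefficients
step_aic <- function(rss, n, k) n * log(rss / n) + 2 * k

# Full model: RSS 13327 (rounded in the trace), 339 rows, 8 predictors + intercept
aic_full <- step_aic(rss = 13327, n = 339, k = 9)
aic_full   # about 1262.65

# Final model: RSS 13402, 6 predictors + intercept
aic_final <- step_aic(rss = 13402, n = 339, k = 7)
aic_final  # about 1260.56
```

Dropping Date and AdjClose_delay raises the RSS slightly, but the penalty term 2 * k falls by 4, which is why the smaller model ends up with the lower AIC.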

summary(model_backward)

## 
## Call:
## lm(formula = Adj.Close ~ Open + Range + Volume + Log_Volume + 
##     Return_Percentage + PriorDay_AdjClose, data = stock)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -24.0481  -2.2679  -0.2363   2.5179  28.8744 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -4.873e+02  8.563e+01  -5.691 2.78e-08 ***
## Open               2.581e-01  5.104e-02   5.056 7.08e-07 ***
## Range             -1.584e-01  4.691e-02  -3.377  0.00082 ***
## Volume            -6.190e-07  9.572e-08  -6.467 3.58e-10 ***
## Log_Volume         6.706e+01  1.175e+01   5.708 2.54e-08 ***
## Return_Percentage  4.315e+00  1.422e-01  30.347  < 2e-16 ***
## PriorDay_AdjClose  7.527e-01  5.143e-02  14.635  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.354 on 332 degrees of freedom
## Multiple R-squared:  0.9991, Adjusted R-squared:  0.9991 
## F-statistic: 6.343e+04 on 6 and 332 DF,  p-value: < 2.2e-16

Check assumptions of `model_backward`

In this section, we will check the assumptions for the model_backward.

1. Linearity

plot(model_backward, which = 1)
abline(h = 10, col = "darkturquoise")
abline(h = -10, col = "darkturquoise")

model_backward passes the linearity assumption check because the residuals are scattered around 0.

2. Normality of Residuals

1. Residual histogram visualization using the hist() function

hist(model_backward$residuals)

2. Statistical test using the shapiro.test() function

shapiro.test(model_backward$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  model_backward$residuals
## W = 0.9101, p-value = 2.461e-13

p-value = 2.461e-13 < alpha (0.05) -> reject H0 (the residuals are not normally distributed, so the model does not pass the normality assumption)

3. Homoscedasticity of Residuals

1. Scatter plot visualization: fitted.values vs residuals

fitted.values are the predicted values for the training data
residuals are the error values

plot(x = model_backward$fitted.values, y = model_backward$residuals)
abline(h = 0, col = "red")

2. Statistics test using bptest() from lmtest package

Breusch-Pagan hypothesis test:

H0: error spread constant or homoscedasticity
H1: error spread NOT constant or heteroscedasticity

Expected condition: H0
p-value > alpha -> fail to reject H0
p-value < alpha -> reject H0 (accept H1)

bptest(model_backward)

## 
##  studentized Breusch-Pagan test
## 
## data:  model_backward
## BP = 115.54, df = 6, p-value < 2.2e-16

Conclusion: p-value < 2.2e-16, which is smaller than alpha (0.05) -> reject H0 (the residual spread is not constant, so the model does not pass the homoscedasticity assumption)
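To make the test less of a black box, the studentized Breusch-Pagan statistic can be computed by hand: regress the squared residuals on the model's predictors and multiply the R-squared of that auxiliary regression by n. The sketch below does this in base R on the built-in mtcars data as a stand-in (not the NVIDIA data); under that construction the result should agree with bptest() on the same fit.

```r
# Stand-in example on mtcars; the same steps apply to any lm fit
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Auxiliary regression: squared residuals on the same predictors
aux_data <- transform(mtcars, e2 = resid(fit)^2)
aux <- lm(e2 ~ wt + hp, data = aux_data)

n  <- nrow(mtcars)
bp <- n * summary(aux)$r.squared   # studentized BP statistic
df <- length(coef(aux)) - 1        # number of predictors
p_value <- pchisq(bp, df = df, lower.tail = FALSE)

c(BP = bp, df = df, p.value = p_value)
```

A small p-value means the squared residuals can be explained by the predictors, which is exactly what non-constant error spread looks like.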

4. No Multicollinearity

vif(model_backward)

##              Open             Range            Volume        Log_Volume 
##        999.781674          3.159746         19.099636         17.776953 
## Return_Percentage PriorDay_AdjClose 
##          1.636250       1002.871156

Variables flagged as multicollinear (VIF > 10): Open, Volume, Log_Volume, PriorDay_AdjClose. We fix this by removing, from each pair of similar variables, the one with the larger VIF. Volume is similar to Log_Volume, so we keep Log_Volume because its VIF is smaller. Open is similar to PriorDay_AdjClose, so we keep Open because its VIF is smaller.
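For a model with only numeric predictors, the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The sketch below implements this in base R on the built-in mtcars data as a stand-in; the helper name `vif_manual` is hypothetical. The last line reads the report's value for Open backwards to show what a VIF near 1000 means.

```r
# Stand-in example on mtcars; helper name vif_manual is hypothetical
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    # Regress predictor p on the remaining predictors
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

vifs <- vif_manual(mtcars, c("wt", "hp", "disp"))
vifs

# A VIF of 999.78 for Open means the other predictors explain
# about 99.9% of its variance
1 - 1 / 999.781674
```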

Build the model

model_eval1 <- lm(Adj.Close ~ Range + Log_Volume + Return_Percentage + Open, data = stock)
summary(model_eval1)

## 
## Call:
## lm(formula = Adj.Close ~ Range + Log_Volume + Return_Percentage + 
##     Open, data = stock)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -62.262  -3.994  -0.138   3.580  32.981 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       110.519133  38.787593   2.849  0.00465 ** 
## Range              -0.211831   0.067208  -3.152  0.00177 ** 
## Log_Volume        -14.535944   5.058832  -2.873  0.00432 ** 
## Return_Percentage   2.814919   0.165811  16.977  < 2e-16 ***
## Open                1.006929   0.003638 276.759  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.175 on 334 degrees of freedom
## Multiple R-squared:  0.9982, Adjusted R-squared:  0.9981 
## F-statistic: 4.558e+04 on 4 and 334 DF,  p-value: < 2.2e-16

Check the linearity

plot(model_eval1, which = 1)
abline(h = 10, col = "darkturquoise")
abline(h = -10, col = "darkturquoise")

Check the Normality

shapiro.test(model_eval1$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  model_eval1$residuals
## W = 0.86911, p-value = 2.443e-16

Check Homoscedasticity of Residuals

bptest(model_eval1)

## 
##  studentized Breusch-Pagan test
## 
## data:  model_eval1
## BP = 121.67, df = 4, p-value < 2.2e-16

Check the multicollinearity

vif(model_eval1)

##             Range        Log_Volume Return_Percentage              Open 
##          3.110908          1.580359          1.067139          2.436284

Summary of model_eval1:
- The adjusted R-squared value of model_eval1 is 0.9981.
- All of the predictors are significant.
- The residuals lie close to zero, so the linearity assumption is fulfilled.
- The p-value of the normality test is 2.443e-16, which is smaller than alpha (0.05), so the normality assumption is not met.
- The p-value of the homoscedasticity test is below 2.2e-16, which is smaller than alpha (0.05), so the homoscedasticity assumption is not met.
- All variables have a VIF value smaller than 10, so there is no multicollinearity.
- Even though model_eval1 does not meet the normality and homoscedasticity assumptions, the minimum assumptions that must be met for the model to be usable are linearity and no multicollinearity.

Conclusion

After evaluating and improving the model, we find that model_eval1 is the best model, with an adjusted R-squared value of 0.9981. It also meets the linearity and no-multicollinearity assumptions, the minimum assumptions that must be met. Based on this model, the next-day adjusted close price depends on the range of the daily high-low stock price, Log_Volume (log value of daily traded volume), Return_Percentage and the stock's Open price.

References

Data set: https://www.kaggle.com/datasets/datazng/nvidia-historical-market-data-2023-2024-for-ml
https://en.wikipedia.org/wiki/Nvidia
https://www.nvidia.com/en-us/about-nvidia/#About%20Us