library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.1
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)

setwd("~/Desktop/Project 3 data101")
df <- read_csv("Amazon.csv")
## Rows: 5852 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl  (6): Open, High, Low, Close, Adj Close, Volume
## date (1): Date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Introduction

What influences the closing price of Amazon stocks based on opening, highest, lowest price, and volume?

The stock market is a trading network where investors buy and sell financial assets, primarily stocks, which are shares or equities of the company. Investors who buy a company’s stock become shareholders because they now own a piece of the company through its shares. Stock exchanges are a vital part of the financial system. For instance, companies like Amazon sell their stocks or shares to increase capital gains to expand their business. While the process of the stock exchange can be rewarding, it also imposes a financial risk. Investors typically seek profit by holding on to shares to gain income from dividends, or buying low-priced stocks that are likely to increase in value, which can then be sold for profit. The decision to buy or sell a stock is influenced by various factors such as the prevailing price of the stock, technical analysis, and other aspects.

The Amazon Stock Data contains information such as opening price of a stock, closing price, and how much of these stocks were sold from 1997 to 2020. This dataset consists of 7 columns and 5,852 rows. The variables in this dataset include, Date, Open (the opening price of the stock and the initial price at the start of the trade), High (the maximum price of the stock for the day), Low (the minimum price of the stock for the day), Close (the closing price of the stock for the day), Adj Close (the adjusted closing price from stock splits and dividend distributions), and lastly Volume (the physical number of shares traded of that stock on a particular day). The variables I will be implementing in my model are Open, High, Low, Volume, and Close.

https://www.kaggle.com/datasets/aayushmishra1512/faang-complete-stock-data

Data Analysis

This data analysis focuses on preparing the data for a regression model by cleaning the dataset. This process consisted of checking for missing values and renaming column titles to ensure clarity and consistency. This analysis also includes summary statistics to better understand the variables, such as the opening and closing price of a stock. To have a better understanding of the model, I have created a box plot to illustrate the change in closing price throughout the years.

head(df)
## # A tibble: 6 × 7
##   Date        Open  High   Low Close `Adj Close`   Volume
##   <date>     <dbl> <dbl> <dbl> <dbl>       <dbl>    <dbl>
## 1 1997-05-15  2.44  2.5   1.93  1.96        1.96 72156000
## 2 1997-05-16  1.97  1.98  1.71  1.73        1.73 14700000
## 3 1997-05-19  1.76  1.77  1.62  1.71        1.71  6106800
## 4 1997-05-20  1.73  1.75  1.64  1.64        1.64  5467200
## 5 1997-05-21  1.64  1.65  1.38  1.43        1.43 18853200
## 6 1997-05-22  1.44  1.45  1.31  1.40        1.40 11776800

Rename column titles

# Clean variable names
names(df) <- gsub("[(). \\-]", "_", names(df)) #Replace ., (), space, with underscore
names(df) <- gsub("_$", "", names(df))  #Remove trailing underscore
names(df) <- tolower(names(df))         #Lowercase

head(df)
## # A tibble: 6 × 7
##   date        open  high   low close adj_close   volume
##   <date>     <dbl> <dbl> <dbl> <dbl>     <dbl>    <dbl>
## 1 1997-05-15  2.44  2.5   1.93  1.96      1.96 72156000
## 2 1997-05-16  1.97  1.98  1.71  1.73      1.73 14700000
## 3 1997-05-19  1.76  1.77  1.62  1.71      1.71  6106800
## 4 1997-05-20  1.73  1.75  1.64  1.64      1.64  5467200
## 5 1997-05-21  1.64  1.65  1.38  1.43      1.43 18853200
## 6 1997-05-22  1.44  1.45  1.31  1.40      1.40 11776800

Handle missing values

#How many NA(s)
colSums(is.na(df))            # NAs per column
##      date      open      high       low     close adj_close    volume 
##         0         0         0         0         0         0         0

Extract the year from the date

df <-df |>
  mutate(year = year(date))

head(df)
## # A tibble: 6 × 8
##   date        open  high   low close adj_close   volume  year
##   <date>     <dbl> <dbl> <dbl> <dbl>     <dbl>    <dbl> <dbl>
## 1 1997-05-15  2.44  2.5   1.93  1.96      1.96 72156000  1997
## 2 1997-05-16  1.97  1.98  1.71  1.73      1.73 14700000  1997
## 3 1997-05-19  1.76  1.77  1.62  1.71      1.71  6106800  1997
## 4 1997-05-20  1.73  1.75  1.64  1.64      1.64  5467200  1997
## 5 1997-05-21  1.64  1.65  1.38  1.43      1.43 18853200  1997
## 6 1997-05-22  1.44  1.45  1.31  1.40      1.40 11776800  1997

Median closing price for Amazon stock in 1997

year_1997 <- df |>
  filter(year == "1997")

median(year_1997$close)
## [1] 3.046875

Median closing price for Amazon stock in 2020

year_2020 <- df |>
  filter(year == "2020")

median(year_2020$close)
## [1] 2356.95

Box plot of closing price throughout the years

The closing price has increased over the years. In addition, the spread of the closing price increased as well. For instance, from 2015 up to 2020, the variability of closing price is higher compared to the prices from 1997 to 2014.

boxplot(close ~ year, data = df, 
         ylab = "Closing Price", xlab = "Year",
        main = "Closing Price Over The Years ")

Summary (Open): Median, min, max

min(df$open)
## [1] 1.40625
max(df$open)
## [1] 3251.06
median(df$open)
## [1] 83.7

Summary (Close): Median, min, max

min(df$close)
## [1] 1.395833
max(df$close)
## [1] 3225
median(df$close)
## [1] 83.6025

Model Assumptions and Diagnostics

I will be using a multiple linear regression model to predict the closing (close) price based on opening price (open), max price (high), min price (low), and the number of stocks traded (volume). The multiple linear regression model, despite its extremely high \(R^2\) of 0.999, exhibits violations of assumptions (homoscedasiticity) that limit its reliability for an accurate prediction of a stocks closing price. The overall model is highly statistically significant (p < 2.2e-16), primarily driven by the relationship between the closing price and the open, high, and low prices. However, the presence of multicollinearity is a significant factor. The strong positive correlations among open, high, and low prices violates the hope for independence. While the coefficient for volume is statistically insignificant (p = 0.858), it is likely due to its low correlation with the other variables. The diagnostic plots reveals linearity but the spread of residuals are not perfectly constant across all fitted values. Change in spread of residuals as fitted values increase indicate heteroscedasticity. Furthermore, the Q-Q plot displays an S-shape with both tails deviating, indicating that the data is not normally distributed due to the presence of outliers.

Multiple Linear Regression

# Fit multiple linear regression: close ~ open + high + low + volume
multiple_model <- lm(close ~ open + high + low + volume, data = df)

# View the model summary
summary(multiple_model)
## 
## Call:
## lm(formula = close ~ open + high + low + volume, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -80.904  -0.587   0.010   0.541  72.224 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.536e-02  1.085e-01  -0.142    0.887    
## open        -6.336e-01  9.857e-03 -64.274   <2e-16 ***
## high         8.358e-01  8.317e-03 100.486   <2e-16 ***
## low          7.983e-01  7.795e-03 102.414   <2e-16 ***
## volume      -1.635e-09  9.165e-09  -0.178    0.858    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.861 on 5847 degrees of freedom
## Multiple R-squared:  0.9999, Adjusted R-squared:  0.9999 
## F-statistic: 2.201e+07 on 4 and 5847 DF,  p-value: < 2.2e-16

The intercept (-1.536e-02) is the predicted value of close when all predictors are zero. This is not not practically meaningful, but mathematically it’s the y-intercept. (p-value = 0.887, not statistically significant).

Equation: Closing Price = -0.01536 − 0.6336(open) + 0.8358(high) + 0.7983(low) - 0.000000001635(volume)

Coefficients (holding the other variables constant):

Open: −0.6336 per dollar (p < 2e-16). Negative slope: As opening price decreases, closing price increases. (Low opening price high closing price)

High: 0.8358 (p < 2e-16). Positive slope

Low: 0.7983 (p < 2e-16). Positive slope

Volume: -0.000000001635 (p = 0.858, not significant).

Adjusted R² = 0.999, so, about 99% of ‘close’ (closing price) variation explained for this dataset.

P-value: (p-value: < 2.2e-16) The model as a whole is highly significant. So the open, high and, low are the big drivers of closing price. Volume isn’t significant once open/high/low are in—likely multicollinearity.

Visual linearity check

Open vs Close

Clear positive linear trend.

plot(df$open, df$close,
     xlab="Open", ylab="Close", main="Opening Price vs Closing Price")
abline(multiple_model, col=1, lwd=2)
## Warning in abline(multiple_model, col = 1, lwd = 2): only using the first two
## of 5 regression coefficients

High vs Close

Clear positive linear trend.

plot(df$high, df$close,
     xlab="High", ylab="Close", main="High value vs Closing Price")
abline(multiple_model, col=1, lwd=2)
## Warning in abline(multiple_model, col = 1, lwd = 2): only using the first two
## of 5 regression coefficients

Low vs Close

Clear positive linear trend.

plot(df$low, df$close,
     xlab="Low", ylab="Close", main="Low value vs Closing Price")
abline(multiple_model, col=1, lwd=2)
## Warning in abline(multiple_model, col = 1, lwd = 2): only using the first two
## of 5 regression coefficients

Volume vs Close

No straight-line pattern. Non-linearity

plot(df$volume, df$close,
     xlab="Volume", ylab="Close", main="Volume vs Closing Price")
abline(multiple_model, col=1, lwd=2)
## Warning in abline(multiple_model, col = 1, lwd = 2): only using the first two
## of 5 regression coefficients

Independence

plot(resid(multiple_model), type="b",
     main="Residuals vs Order", ylab="Residuals"); abline(h=0, lty=2)

Residuals vs Order: Mostly centered around 0. A few bursts in higher indices (5000–6000).

Core diagnostics (covers: linearity, homoscedasticity, normality, influence)

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
plot(multiple_model)

par(mfrow=c(2,2)); plot(multiple_model); par(mfrow=c(1,1))

Residuals vs Fitted: Mostly flat cloud, linearity acceptable. Spread of residuals start off consistent then spreads out a bit, but move towards zero and fans out again. There’s a few small clumps of residuals towards higher fitted values. This suggests heteroscedasticity.

Homoscedasticity (equal variance): The spread of residuals isn’t perfectly constant across all fitted values. Change in spread of residuals as fitted values increase indicate heteroscedasticity.

Scale–Location: Positive slope. Some increase in spread with fitted values, mild heteroscedasticity.

Q–Q: S Shape indicating outliers. Both tails deviate. The sample data is not normally distributed.

Residuals vs Leverage: A few influential points. Residuals clump around low leverage values and residuals between -10 and 10.

Check Multicollinearity

cor(df[, c("open", "high", "low", "volume")])
##              open       high        low     volume
## open    1.0000000  0.9999127  0.9998962 -0.2332929
## high    0.9999127  1.0000000  0.9998535 -0.2316199
## low     0.9998962  0.9998535  1.0000000 -0.2349415
## volume -0.2332929 -0.2316199 -0.2349415  1.0000000

Multicollinearity is present in this model. Every other variable except for volume seem to have a strong positive correlation which violates the hope for independence. The errors are correlated which weakens predictions.

Diagnosing Model Fit (RMSE and Residuals)

# For multiple model

# Calculate residuals
residuals_multiple <- resid(multiple_model)

# Calculate RMSE for multiple model
rmse_multiple <- sqrt(mean(residuals_multiple^2))
rmse_multiple
## [1] 4.858918

Multiple model, RMSE = 4.85 dollars meaning predictions miss by ~4.8 dollars on average.

Conclusion and Future Directions

In conclusion, building a multiple linear regression model to predict the closing price of a stock has served a learning opportunity, particularly in gaining a deeper understanding of the stock market and predictive modeling. The model summary shows that while the model as a whole is highly significant and the exceptionally high \(R^2\) value suggests a good fit, this model may still have limitations in it predictive power. The visual linearity looks similar for high and low prices because they are highly correlated. Furthermore, the model summary shows that volume is insignificant, but it does not affect the multicollinearity because volume is not correlated with the other independent variables. The core diagnostics revealed limitations to the model. The model exhibited heteroscedasticity and a non-normal distribution resulting from outliers. The not normally distributed model suggests the prediction overestimates or underestimates the closing price of the stock.

Overall, the linearity, the \(R^2\) value, and the highly significant p-value indicate an acceptable model. However, taking multicollinearity into account is important as well because it also plays a role in determining the greatness of the model. From a financial stand point, the rise and decline of a company’s stocks are influenced by both external and internal factors. The internal factors include the company’s financial performance, such as their earnings, profits, and revenues. The external influencing factors include market factors, political factors, inflation, etc. In the future, I would like to apply other predictive variables not included in this data, such as implementing prices from other markets, CPI (consumer price index), and SP500 price. Incorporating these variables can possibly make a better prediction model for Amazon’s closing stock price.

References