library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.5.1
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
setwd("~/Desktop/Project 3 data101")
df <- read_csv("Amazon.csv")
## Rows: 5852 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (6): Open, High, Low, Close, Adj Close, Volume
## date (1): Date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The stock market is a trading network where investors buy and sell financial assets, primarily stocks, which are shares or equities of the company. Investors who buy a company’s stock become shareholders because they now own a piece of the company through its shares. Stock exchanges are a vital part of the financial system. For instance, companies like Amazon sell their stocks or shares to increase capital gains to expand their business. While the process of the stock exchange can be rewarding, it also imposes a financial risk. Investors typically seek profit by holding on to shares to gain income from dividends, or buying low-priced stocks that are likely to increase in value, which can then be sold for profit. The decision to buy or sell a stock is influenced by various factors such as the prevailing price of the stock, technical analysis, and other aspects.
The Amazon Stock Data contains information such as opening price of a
stock, closing price, and how much of these stocks were sold from 1997
to 2020. This dataset consists of 7 columns and 5,852 rows. The
variables in this dataset include, Date, Open
(the opening price of the stock and the initial price at the start of
the trade), High (the maximum price of the stock for the
day), Low (the minimum price of the stock for the day),
Close (the closing price of the stock for the day),
Adj Close (the adjusted closing price from stock splits and
dividend distributions), and lastly Volume (the physical
number of shares traded of that stock on a particular day). The
variables I will be implementing in my model are Open,
High, Low, Volume, and
Close.
https://www.kaggle.com/datasets/aayushmishra1512/faang-complete-stock-data
This data analysis focuses on preparing the data for a regression model by cleaning the dataset. This process consisted of checking for missing values and renaming column titles to ensure clarity and consistency. This analysis also includes summary statistics to better understand the variables, such as the opening and closing price of a stock. To have a better understanding of the model, I have created a box plot to illustrate the change in closing price throughout the years.
head(df)
## # A tibble: 6 × 7
## Date Open High Low Close `Adj Close` Volume
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1997-05-15 2.44 2.5 1.93 1.96 1.96 72156000
## 2 1997-05-16 1.97 1.98 1.71 1.73 1.73 14700000
## 3 1997-05-19 1.76 1.77 1.62 1.71 1.71 6106800
## 4 1997-05-20 1.73 1.75 1.64 1.64 1.64 5467200
## 5 1997-05-21 1.64 1.65 1.38 1.43 1.43 18853200
## 6 1997-05-22 1.44 1.45 1.31 1.40 1.40 11776800
Rename column titles
# Clean variable names
names(df) <- gsub("[(). \\-]", "_", names(df)) #Replace ., (), space, with underscore
names(df) <- gsub("_$", "", names(df)) #Remove trailing underscore
names(df) <- tolower(names(df)) #Lowercase
head(df)
## # A tibble: 6 × 7
## date open high low close adj_close volume
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1997-05-15 2.44 2.5 1.93 1.96 1.96 72156000
## 2 1997-05-16 1.97 1.98 1.71 1.73 1.73 14700000
## 3 1997-05-19 1.76 1.77 1.62 1.71 1.71 6106800
## 4 1997-05-20 1.73 1.75 1.64 1.64 1.64 5467200
## 5 1997-05-21 1.64 1.65 1.38 1.43 1.43 18853200
## 6 1997-05-22 1.44 1.45 1.31 1.40 1.40 11776800
Handle missing values
#How many NA(s)
colSums(is.na(df)) # NAs per column
## date open high low close adj_close volume
## 0 0 0 0 0 0 0
Extract the year from the date
df <-df |>
mutate(year = year(date))
head(df)
## # A tibble: 6 × 8
## date open high low close adj_close volume year
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1997-05-15 2.44 2.5 1.93 1.96 1.96 72156000 1997
## 2 1997-05-16 1.97 1.98 1.71 1.73 1.73 14700000 1997
## 3 1997-05-19 1.76 1.77 1.62 1.71 1.71 6106800 1997
## 4 1997-05-20 1.73 1.75 1.64 1.64 1.64 5467200 1997
## 5 1997-05-21 1.64 1.65 1.38 1.43 1.43 18853200 1997
## 6 1997-05-22 1.44 1.45 1.31 1.40 1.40 11776800 1997
Median closing price for Amazon stock in 1997
year_1997 <- df |>
filter(year == "1997")
median(year_1997$close)
## [1] 3.046875
Median closing price for Amazon stock in 2020
year_2020 <- df |>
filter(year == "2020")
median(year_2020$close)
## [1] 2356.95
Box plot of closing price throughout the years
The closing price has increased over the years. In addition, the spread of the closing price increased as well. For instance, from 2015 up to 2020, the variability of closing price is higher compared to the prices from 1997 to 2014.
boxplot(close ~ year, data = df,
ylab = "Closing Price", xlab = "Year",
main = "Closing Price Over The Years ")
Summary (Open): Median, min, max
min(df$open)
## [1] 1.40625
max(df$open)
## [1] 3251.06
median(df$open)
## [1] 83.7
Summary (Close): Median, min, max
min(df$close)
## [1] 1.395833
max(df$close)
## [1] 3225
median(df$close)
## [1] 83.6025
I will be using a multiple linear regression model to predict the
closing (close) price based on opening price
(open), max price (high), min price
(low), and the number of stocks traded
(volume). The multiple linear regression model, despite its
extremely high \(R^2\) of 0.999,
exhibits violations of assumptions (homoscedasiticity) that limit its
reliability for an accurate prediction of a stocks closing price. The
overall model is highly statistically significant (p < 2.2e-16),
primarily driven by the relationship between the closing price and the
open, high, and low prices. However, the presence of multicollinearity
is a significant factor. The strong positive correlations among open,
high, and low prices violates the hope for independence. While the
coefficient for volume is statistically insignificant (p = 0.858), it is
likely due to its low correlation with the other variables. The
diagnostic plots reveals linearity but the spread of residuals are not
perfectly constant across all fitted values. Change in spread of
residuals as fitted values increase indicate heteroscedasticity.
Furthermore, the Q-Q plot displays an S-shape with both tails deviating,
indicating that the data is not normally distributed due to the presence
of outliers.
Multiple Linear Regression
# Fit multiple linear regression: close ~ open + high + low + volume
multiple_model <- lm(close ~ open + high + low + volume, data = df)
# View the model summary
summary(multiple_model)
##
## Call:
## lm(formula = close ~ open + high + low + volume, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -80.904 -0.587 0.010 0.541 72.224
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.536e-02 1.085e-01 -0.142 0.887
## open -6.336e-01 9.857e-03 -64.274 <2e-16 ***
## high 8.358e-01 8.317e-03 100.486 <2e-16 ***
## low 7.983e-01 7.795e-03 102.414 <2e-16 ***
## volume -1.635e-09 9.165e-09 -0.178 0.858
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.861 on 5847 degrees of freedom
## Multiple R-squared: 0.9999, Adjusted R-squared: 0.9999
## F-statistic: 2.201e+07 on 4 and 5847 DF, p-value: < 2.2e-16
The intercept (-1.536e-02) is the predicted value of
close when all predictors are zero. This is not not
practically meaningful, but mathematically it’s the y-intercept.
(p-value = 0.887, not statistically significant).
Equation: Closing Price = -0.01536 − 0.6336(open) + 0.8358(high) + 0.7983(low) - 0.000000001635(volume)
Coefficients (holding the other variables constant):
Open: −0.6336 per dollar (p < 2e-16). Negative slope: As opening price decreases, closing price increases. (Low opening price high closing price)
High: 0.8358 (p < 2e-16). Positive slope
Low: 0.7983 (p < 2e-16). Positive slope
Volume: -0.000000001635 (p = 0.858, not significant).
Adjusted R² = 0.999, so, about 99% of ‘close’ (closing price) variation explained for this dataset.
P-value: (p-value: < 2.2e-16) The model as a whole is highly significant. So the open, high and, low are the big drivers of closing price. Volume isn’t significant once open/high/low are in—likely multicollinearity.
Open vs Close
Clear positive linear trend.
plot(df$open, df$close,
xlab="Open", ylab="Close", main="Opening Price vs Closing Price")
abline(multiple_model, col=1, lwd=2)
## Warning in abline(multiple_model, col = 1, lwd = 2): only using the first two
## of 5 regression coefficients
High vs Close
Clear positive linear trend.
plot(df$high, df$close,
xlab="High", ylab="Close", main="High value vs Closing Price")
abline(multiple_model, col=1, lwd=2)
## Warning in abline(multiple_model, col = 1, lwd = 2): only using the first two
## of 5 regression coefficients
Low vs Close
Clear positive linear trend.
plot(df$low, df$close,
xlab="Low", ylab="Close", main="Low value vs Closing Price")
abline(multiple_model, col=1, lwd=2)
## Warning in abline(multiple_model, col = 1, lwd = 2): only using the first two
## of 5 regression coefficients
Volume vs Close
No straight-line pattern. Non-linearity
plot(df$volume, df$close,
xlab="Volume", ylab="Close", main="Volume vs Closing Price")
abline(multiple_model, col=1, lwd=2)
## Warning in abline(multiple_model, col = 1, lwd = 2): only using the first two
## of 5 regression coefficients
plot(resid(multiple_model), type="b",
main="Residuals vs Order", ylab="Residuals"); abline(h=0, lty=2)
Residuals vs Order: Mostly centered around 0. A few bursts in higher
indices (5000–6000).
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
plot(multiple_model)
par(mfrow=c(2,2)); plot(multiple_model); par(mfrow=c(1,1))
Residuals vs Fitted: Mostly flat cloud, linearity acceptable. Spread of residuals start off consistent then spreads out a bit, but move towards zero and fans out again. There’s a few small clumps of residuals towards higher fitted values. This suggests heteroscedasticity.
Homoscedasticity (equal variance): The spread of residuals isn’t perfectly constant across all fitted values. Change in spread of residuals as fitted values increase indicate heteroscedasticity.
Scale–Location: Positive slope. Some increase in spread with fitted values, mild heteroscedasticity.
Q–Q: S Shape indicating outliers. Both tails deviate. The sample data is not normally distributed.
Residuals vs Leverage: A few influential points. Residuals clump around low leverage values and residuals between -10 and 10.
cor(df[, c("open", "high", "low", "volume")])
## open high low volume
## open 1.0000000 0.9999127 0.9998962 -0.2332929
## high 0.9999127 1.0000000 0.9998535 -0.2316199
## low 0.9998962 0.9998535 1.0000000 -0.2349415
## volume -0.2332929 -0.2316199 -0.2349415 1.0000000
Multicollinearity is present in this model. Every other variable except for volume seem to have a strong positive correlation which violates the hope for independence. The errors are correlated which weakens predictions.
# For multiple model
# Calculate residuals
residuals_multiple <- resid(multiple_model)
# Calculate RMSE for multiple model
rmse_multiple <- sqrt(mean(residuals_multiple^2))
rmse_multiple
## [1] 4.858918
Multiple model, RMSE = 4.85 dollars meaning predictions miss by ~4.8 dollars on average.
In conclusion, building a multiple linear regression model to predict
the closing price of a stock has served a learning opportunity,
particularly in gaining a deeper understanding of the stock market and
predictive modeling. The model summary shows that while the model as a
whole is highly significant and the exceptionally high \(R^2\) value suggests a good fit, this model
may still have limitations in it predictive power. The visual linearity
looks similar for high and low prices because
they are highly correlated. Furthermore, the model summary shows that
volume is insignificant, but it does not affect the
multicollinearity because volume is not correlated with the other
independent variables. The core diagnostics revealed limitations to the
model. The model exhibited heteroscedasticity and a non-normal
distribution resulting from outliers. The not normally distributed model
suggests the prediction overestimates or underestimates the closing
price of the stock.
Overall, the linearity, the \(R^2\) value, and the highly significant p-value indicate an acceptable model. However, taking multicollinearity into account is important as well because it also plays a role in determining the greatness of the model. From a financial stand point, the rise and decline of a company’s stocks are influenced by both external and internal factors. The internal factors include the company’s financial performance, such as their earnings, profits, and revenues. The external influencing factors include market factors, political factors, inflation, etc. In the future, I would like to apply other predictive variables not included in this data, such as implementing prices from other markets, CPI (consumer price index), and SP500 price. Incorporating these variables can possibly make a better prediction model for Amazon’s closing stock price.