Introduction and Objectives

Introduction

In this project, we aimed to predict the closing prices of the S&P 500 index using historical data. The dataset contains roughly 4.2 million rows of daily data spanning 1962 through 2024. Accurate predictions of stock prices are crucial for investors, financial analysts, and portfolio managers to make informed decisions. The S&P 500 is a key indicator of the overall performance of the U.S. stock market, representing 500 of the largest publicly traded companies in the United States. Predicting its closing prices can provide insights into market trends, help manage investment risks, and optimize trading strategies.

Importance of Predicting the S&P 500 Closing Price

Accurate predictions of the S&P 500 closing price are crucial for investors and analysts to make informed decisions about market trends and investment strategies.

Objectives

Project Goals

This project aimed to: (1) clean and prepare the historical S&P 500 price data, (2) explore market trends by decade and the correlations between the price variables, (3) build a Lasso regression model to predict the daily closing price, and (4) evaluate the model's accuracy on a chronologically held-out test set.

Data Cleaning Process and Analysis

#load libraries
library(tidyverse)
library(caret)
library(glmnet)
library(reshape2)
#import data and examine it
sp500_data <- read_csv("sp500_data.csv")
#check out the first 10 rows
head(sp500_data,10)
## # A tibble: 10 × 8
##    Date        Open  High   Low Close `Adj Close` Volume Ticker
##    <date>     <dbl> <dbl> <dbl> <dbl>       <dbl>  <dbl> <chr> 
##  1 1962-01-02     0  3.55  3.45  3.48       0.574 254509 MMM   
##  2 1962-01-03     0  3.50  3.42  3.50       0.578 505190 MMM   
##  3 1962-01-04     0  3.56  3.50  3.50       0.578 254509 MMM   
##  4 1962-01-05     0  3.49  3.40  3.41       0.563 376979 MMM   
##  5 1962-01-08     0  3.42  3.37  3.39       0.560 399942 MMM   
##  6 1962-01-09     0  3.42  3.38  3.39       0.560 376979 MMM   
##  7 1962-01-10     0  3.38  3.34  3.35       0.554 304262 MMM   
##  8 1962-01-11     0  3.37  3.26  3.34       0.551 269818 MMM   
##  9 1962-01-12     0  3.38  3.27  3.27       0.541 692723 MMM   
## 10 1962-01-15     0  3.31  3.27  3.31       0.546 252595 MMM
#check out the last 10 rows
tail(sp500_data,10)
## # A tibble: 10 × 8
##    Date        Open  High   Low Close `Adj Close`  Volume Ticker
##    <date>     <dbl> <dbl> <dbl> <dbl>       <dbl>   <dbl> <chr> 
##  1 2024-07-19  180.  181.  176.  179.        179. 2131400 ZTS   
##  2 2024-07-22  181.  182.  179.  181.        181. 1532900 ZTS   
##  3 2024-07-23  181.  182.  179.  179.        179. 1329400 ZTS   
##  4 2024-07-24  179.  181.  178.  180.        180. 1309300 ZTS   
##  5 2024-07-25  181   186.  180.  181.        181. 2473700 ZTS   
##  6 2024-07-26  182.  184.  179.  180.        180. 2437300 ZTS   
##  7 2024-07-29  181.  183.  179.  182.        182. 1302900 ZTS   
##  8 2024-07-30  182.  185.  180.  182.        182. 2271300 ZTS   
##  9 2024-07-31  182.  183.  180.  180.        180. 1740100 ZTS   
## 10 2024-08-01  181.  184.  181.  182.        182. 1986443 ZTS
#inspect a random sample of 30 rows
sample_n(sp500_data, 30)
## # A tibble: 30 × 8
##    Date         Open   High    Low  Close `Adj Close`    Volume Ticker
##    <date>      <dbl>  <dbl>  <dbl>  <dbl>       <dbl>     <dbl> <chr> 
##  1 2005-06-09  15.7   15.8   15.5   15.8        15.8     752505 HSIC  
##  2 2021-10-11 217.   218.   215.   215.        202.     2312000 UNP   
##  3 2010-03-09  17.0   17.4   16.9   17.1        10.3    3724300 UDR   
##  4 2022-12-19 315.   317.   312.   314.        300.      835300 APD   
##  5 1993-12-31  13.5   13.7   13.5   13.5         6.87   8028678 IBM   
##  6 2012-03-13   8.07   8.5    8.05   8.49        6.88 385984100 BAC   
##  7 2010-10-19  52.5   53.4   51.8   52.1        38.5     164702 HUBB  
##  8 2022-01-21 245.   246.   241.   242         222.      671900 AVB   
##  9 2020-02-05  58.5   58.7   57.4   57.5        57.5    2632000 ANET  
## 10 2009-12-16   4.02   4.09   3.90   4.05        4.05    254200 BLDR  
## # ℹ 20 more rows
#compute summary statistics of the S&P 500 data
summary(sp500_data)
##       Date                 Open               High               Low          
##  Min.   :1962-01-02   Min.   :   0.000   Min.   :   0.005   Min.   :   0.005  
##  1st Qu.:1994-09-14   1st Qu.:   8.312   1st Qu.:   8.867   1st Qu.:   8.617  
##  Median :2006-07-11   Median :  25.640   Median :  26.064   Median :  25.400  
##  Mean   :2004-05-26   Mean   :  54.624   Mean   :  55.562   Mean   :  54.269  
##  3rd Qu.:2016-01-05   3rd Qu.:  57.710   3rd Qu.:  58.360   3rd Qu.:  57.050  
##  Max.   :2024-08-01   Max.   :8700.000   Max.   :8700.000   Max.   :8570.510  
##      Close            Adj Close            Volume             Ticker         
##  Min.   :   0.005   Min.   :   0.002   Min.   :0.000e+00   Length:4225194    
##  1st Qu.:   8.746   1st Qu.:   4.247   1st Qu.:4.813e+05   Class :character  
##  Median :  25.750   Median :  17.080   Median :1.439e+06   Mode  :character  
##  Mean   :  54.931   Mean   :  46.978   Mean   :6.307e+06                     
##  3rd Qu.:  57.725   3rd Qu.:  46.186   3rd Qu.:3.821e+06                     
##  Max.   :8661.980   Max.   :8661.980   Max.   :9.231e+09

Initial Data Inspection

Upon inspecting the dataset, it was observed that the ‘Open’ and ‘Volume’ variables contained some zeros. These zeros were problematic because they could introduce bias into the model and lead to inaccurate predictions. The zeros in the ‘Open’ and ‘Volume’ variables likely represent missing data or erroneous entries, which is why they were removed before proceeding with the analysis.
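
Before filtering, it can help to quantify how widespread the issue is. A minimal check (illustrative, not part of the original pipeline) counts the affected rows:

#count rows with a zero Open or Volume before filtering
sp500_data %>%
  summarise(zero_open = sum(Open == 0),
            zero_volume = sum(Volume == 0))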

#removing the zeros in the "Open" and "Volume" variables
sp500_data <- sp500_data %>% 
  filter(Open !=0) %>%
  filter(Volume !=0)

Removing Zeros from the ‘Open’ And ‘Volume’ Variables

To ensure the integrity of the dataset, all rows where the ‘Open’ or ‘Volume’ variable was zero were filtered out. This step improved the reliability of the model by eliminating potential anomalies in the data.

#rename the Adj Close column for easier manipulation
sp500_data <- sp500_data %>% 
  rename("Adj_Close" = "Adj Close") %>%
  na.omit() %>% #remove rows with missing values
  distinct() #remove duplicate rows

Market Trend

#  market trend analysis
all_decades_trends <- sp500_data %>%
  mutate(Decade = floor(year(Date) / 10) * 10) %>%
  group_by(Decade) %>%
  summarise(
    Mean_Close = mean(Close, na.rm = TRUE),
    N_Stocks = n_distinct(Ticker)
  ) %>%
  mutate(Decade_Label = paste0(Decade, "s"))

all_decades_trends
## # A tibble: 7 × 4
##   Decade Mean_Close N_Stocks Decade_Label
##    <dbl>      <dbl>    <int> <chr>       
## 1   1960       2.78        8 1960s       
## 2   1970       5.08       65 1970s       
## 3   1980       7.86      204 1980s       
## 4   1990      16.9       353 1990s       
## 5   2000      35.5       430 2000s       
## 6   2010      73.4       491 2010s       
## 7   2020     165.        501 2020s
# Plot all decades market trends
ggplot(all_decades_trends, aes(x = Decade_Label, y = Mean_Close)) +
  geom_bar(stat = "identity", fill = "steelblue", alpha = 0.8) +
  labs(title = "S&P 500 Historical Market Trends (1960s-2020s)",
       subtitle = "Full dataset - Showing price evolution across all decades",
       x = "Decade", 
       y = "Mean Closing Price") +
  theme_minimal()

Across the decades covered by the dataset, the mean closing price rises from $2.78 in the 1960s, when only 8 of the index's tickers have data, to $164.68 in the 2020s, when 501 tickers are represented. This reflects both long-run growth in share prices and the much broader coverage of recent decades.

Correlation Heatmap

After cleaning the data, a correlation heatmap is generated to explore the relationships between different variables in the dataset.

#check for correlations

# Select numerical columns for correlation analysis
numerical_vars <- sp500_data %>%
  select("Open", "High", "Low", "Close", "Adj_Close" , "Volume") 

# Compute the correlation matrix
cor_matrix <- cor(numerical_vars, use = "complete.obs")

# Melt the correlation matrix into a long format
melted_cor_matrix <- melt(cor_matrix)

# Plot the heatmap
ggplot(data = melted_cor_matrix, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
                       midpoint = 0, limit = c(-1, 1), name = "Correlation") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Correlation Matrix Heatmap",
       x = "Variables",
       y = "Variables")

Interpretation Of The Heatmap For Correlation

Color Scale:

Red Areas: Indicate a high positive correlation, close to +1. This means that as one variable increases, the other tends to increase as well.

White/Light Areas: Indicate low or no correlation, close to 0. This means there is little to no linear relationship between the variables.

Blue Areas: Would indicate a high negative correlation, close to -1 (but none are present in this map, indicating no negative correlations).

The heatmap revealed high correlations between certain variables, particularly between High, Low, Close, and Adj_Close. High correlations between predictor variables can lead to multicollinearity, which can distort the results of a linear regression model. The heatmap shows that these variables are highly correlated, and multicollinearity makes it difficult to determine the individual effect of each predictor variable on the dependent variable (Close).

Train-Test Data Split

#chronological 80/20 train-test split by date
cutoff_date <- quantile(as.numeric(sp500_data$Date), 0.8) %>% as.Date(origin = "1970-01-01")
train.data <- sp500_data %>% filter(Date <= cutoff_date)
test.data <- sp500_data %>% filter(Date > cutoff_date)
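
As a quick sanity check (illustrative, not part of the original output), the date ranges and sizes of the two sets can be inspected:

#verify the chronological split
range(train.data$Date)
range(test.data$Date)
nrow(train.data)
nrow(test.data)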

Using Lasso Regression And Not Linear Regression

Given the high correlation between variables, a standard linear regression model (lm) would be unsuitable. High multicollinearity can inflate the variance of the coefficient estimates, leading to unreliable and unstable results.
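
One way to quantify this problem (a minimal sketch, assuming the car package is installed; it is not used elsewhere in this report) is to fit an ordinary least squares model and inspect its variance inflation factors:

#check variance inflation factors on an ordinary linear model
library(car)
ols_fit <- lm(Close ~ High + Low + Open + Volume, data = train.data)
vif(ols_fit) #values far above 10 indicate severe multicollinearity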

Lasso Regression

Lasso (Least Absolute Shrinkage and Selection Operator) regression is a regularization technique that adds a penalty to the model based on the absolute size of the coefficients. This penalty helps to shrink the coefficients of less important variables to zero, effectively performing variable selection and reducing multicollinearity.
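
In symbols (shown for reference; this matches glmnet's penalized objective for the Gaussian family with alpha = 1), the Lasso chooses coefficients by minimizing

$$\min_{\beta_0,\ \beta}\;\frac{1}{2n}\sum_{i=1}^{n}\left(y_i - \beta_0 - x_i^\top \beta\right)^2 + \lambda \sum_{j=1}^{p}\left|\beta_j\right|$$

where lambda controls the strength of the penalty; the cross-validation step below selects the value of lambda with the lowest estimated prediction error (lasso$lambda.min).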

# predictor variables (model matrix without the intercept column)
x <- model.matrix(Close ~ High + Low + Open  + Volume, data = train.data)[,-1]

#outcome variable
y <- train.data$Close


#compute lasso regression 

set.seed(123)
lasso <- cv.glmnet(x, y, alpha = 1)

#display best lambda value
lasso$lambda.min
## [1] 1.975655
# fit the model in the training set
model <- glmnet(x , y, alpha = 1, lambda = lasso$lambda.min)

#model coefficients
coef(model)
## 5 x 1 sparse Matrix of class "dgCMatrix"
##                       s0
## (Intercept) 0.9329876112
## High        0.9630268561
## Low         0.0004182356
## Open        .           
## Volume      .

Interpretation of the Lasso Model

Model Coefficients

High (0.9630):

For each additional unit increase in the high price of the day, the predicted closing price increases by approximately 0.9630 units. This suggests that the higher the highest price of the day, the higher the closing price tends to be. It reflects a strong positive relationship between the high price and the closing price.

Low (0.0004):

For each additional unit increase in the low price of the day, the predicted closing price increases by approximately 0.0004 units. This coefficient is very small, indicating that the low price has a minimal effect on the closing price. In practical terms, changes in the low price of the day have a very small impact on the closing price.

Open (.) and Volume (.):

The coefficients for the opening price and volume are shrunk to zero, which means that, according to the Lasso model, these variables do not contribute to predicting the closing price. They are effectively excluded from the model.
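
As a quick illustration of what this means in practice, a fitted value is simply the intercept plus the two surviving terms. For a hypothetical day with High = 181 and Low = 179 (made-up values, not a row from the dataset):

#hand-compute a prediction from the printed coefficients (illustrative only)
0.9329876 + 0.9630269 * 181 + 0.0004182 * 179 #about 175.3; Open and Volume contribute nothing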

Predictions From The Model

# make predictions on the test data
x.test <- model.matrix(Close ~ High + Low + Open + Volume  , test.data)[,-1]

predictions <- model %>% predict(x.test) %>% as.vector()
print(head(predictions,50))
##  [1] 174.8437 176.2855 177.0905 177.5663 179.5157 178.9683 177.5906 178.3796
##  [9] 177.1309 167.2770 163.4523 161.7449 159.7152 160.7055 158.4500 159.1514
## [17] 159.3844 161.9858 162.8493 163.7430 165.8533 166.6186 166.3694 166.9093
## [25] 166.5298 162.8813 163.1710 162.6556 164.4035 163.7428 161.0768 162.1881
## [33] 161.5684 160.8427 161.5438 160.7948 162.3818 163.4050 163.7355 165.3058
## [41] 166.7882 167.4809 167.5133 166.8366 166.6030 167.0941 166.1186 165.7241
## [49] 163.5172 162.0432

The Model Accuracy

The model’s performance is evaluated using the Root Mean Squared Error (RMSE) and R-squared (R2) metrics.

#RMSE of the model
RMSE(predictions, test.data$Close)
## [1] 8.294859
#R2 for the model
R2(predictions, test.data$Close)
## [1] 0.9998344
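
As a cross-check (a minimal sketch, not part of the original output), the same metrics can be computed by hand; note that caret's R2() reports the squared correlation between predictions and observations by default:

#manual computation of the evaluation metrics
sqrt(mean((predictions - test.data$Close)^2)) #RMSE
cor(predictions, test.data$Close)^2 #R-squared as squared correlation
1 - sum((test.data$Close - predictions)^2) /
  sum((test.data$Close - mean(test.data$Close))^2) #R-squared as 1 - SSE/SST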

A lower RMSE indicates better model performance. In this case, an RMSE of 8.29 means that, on average, the model’s predictions are off by about 8.29 units from the actual closing prices, which is small relative to the typical closing prices in the test period, suggesting the model fits the data well.

R² of 0.9998344: The model explains nearly 100% of the variance in the closing price, indicating a very high level of predictive accuracy.

Together, these metrics suggest that the model performs exceptionally well in predicting the closing price of the S&P 500.

Compare The Actual Data Of The Closing Price With The Predicted Data

# Combine actual closing price and predicted closing price values into a data frame
compare_results <- data.frame(
  Actual = test.data$Close,
  Predicted = predictions
)

compare_results$residuals <- compare_results$Actual - compare_results$Predicted

# View the first, last, and a few random rows
head(compare_results, 8)
##     Actual Predicted residuals
## 1 179.2057  174.8437  4.362024
## 2 180.9532  176.2855  4.667671
## 3 180.2007  177.0905  3.110154
## 4 182.0234  177.5663  4.457071
## 5 184.0217  179.5157  4.506061
## 6 183.5452  178.9683  4.576857
## 7 182.9850  177.5906  5.394307
## 8 182.0652  178.3796  3.685568
tail(compare_results, 8)
##        Actual Predicted residuals
## 782664 179.38  176.4522 2.9278230
## 782665 179.66  175.1614 4.4986339
## 782666 180.76  180.5645 0.1954705
## 782667 179.84  178.0605 1.7795425
## 782668 181.83  177.3189 4.5111288
## 782669 181.83  178.9660 2.8639888
## 782670 180.04  177.4925 2.5474980
## 782671 182.42  178.1380 4.2819546
sample_n(compare_results,8)
##   Actual Predicted  residuals
## 1 133.08 129.64791  3.4320950
## 2  76.89  75.15638  1.7336235
## 3  68.64  68.00741  0.6325892
## 4  94.37  92.67113  1.6988681
## 5  73.41  72.13108  1.2789284
## 6  50.38  49.96236  0.4176363
## 7 340.80 330.41953 10.3804553
## 8  27.62  27.94765 -0.3276534

Analyzing The Residuals Of The Model

For the first row of the test set: Actual = 179.2057, Predicted = 174.8437, Residual = 4.362024. This means the model predicted the closing price to be lower than it actually was by about 4.36 units.

For the last row of the random sample: Actual = 27.62, Predicted = 27.94765, Residual = -0.3276534. Here the model predicted the closing price to be higher than it actually was by about 0.33 units.

The Magnitude Of Residuals:

Small Residuals:

The residuals are generally small, indicating that the model’s predictions are very close to the actual values. For example, residuals like 0.1954705 and -0.3276534 are relatively small.

Larger Residuals:

There are a few instances where residuals are larger, such as 5.394307 and 10.3804553. These larger residuals indicate larger discrepancies between the actual and predicted values. However, given the high R² value (0.9998344), such discrepancies are relatively rare.
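
A histogram of the residuals (a minimal sketch using the compare_results data frame built above) makes the spread of these errors easy to see:

#plot the distribution of residuals
ggplot(compare_results, aes(x = residuals)) +
  geom_histogram(bins = 50, fill = "steelblue", alpha = 0.8) +
  labs(title = "Distribution of Prediction Residuals",
       x = "Residual (Actual - Predicted)",
       y = "Count") +
  theme_minimal()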

Compare The Observed Data With The Predicted Data By Decade To See Where The Model Performed Best

To determine in which decade the model performed best, the mean actual and predicted closing prices are calculated and compared by decade for the 20% test set, which covers the 2010s and 2020s.

#Add the decade column
compare_results <- compare_results %>% 
  mutate(Date = test.data$Date) %>%
  mutate(Decade = floor(year(Date) / 10) * 10)  

# Group by decade and calculate metrics
decade_summary <- compare_results %>%
  group_by(Decade) %>%
  summarise(
    Mean_Actual = mean(Actual, na.rm = TRUE),
    Mean_Predicted = mean(Predicted, na.rm = TRUE),
    RMSE = sqrt(mean((Actual - Predicted)^2, na.rm = TRUE)),  
    MAE = mean(abs(Actual - Predicted), na.rm = TRUE)
  )

# Map to readable decade labels
decade_summary <- decade_summary %>%
  mutate(Decade_Label = paste0(Decade, "s"))

head(decade_summary)
## # A tibble: 2 × 6
##   Decade Mean_Actual Mean_Predicted  RMSE   MAE Decade_Label
##    <dbl>       <dbl>          <dbl> <dbl> <dbl> <chr>       
## 1   2010        112.           110.  5.47  2.32 2010s       
## 2   2020        165.           162.  9.12  3.57 2020s

The Lasso regression model demonstrated exceptional predictive accuracy across both test decades, with mean absolute errors of just $2.32 (about 2.1%) in the 2010s and $3.57 (about 2.2%) in the more volatile 2020s. The model exhibited a consistent slight underprediction bias of approximately 1.9% across both periods, while maintaining stable relative error rates despite significantly higher price levels in the 2020s.

The Bar Graph Comparing Actual And Predicted Mean Closing Prices In The 2010s And 2020s

#  plot showing both means and errors
ggplot(decade_summary, aes(x = Decade_Label)) +
  geom_bar(aes(y = Mean_Actual, fill = "Actual"), stat = "identity", alpha = 0.7, width = 0.4) +
  geom_bar(aes(y = Mean_Predicted, fill = "Predicted"), stat = "identity", alpha = 0.7, width = 0.4, 
           position = position_nudge(x = 0.4)) +
  geom_text(aes(y = pmax(Mean_Actual, Mean_Predicted) + 5, 
                label = paste("RMSE:", round(RMSE, 2))), 
            position = position_nudge(x = 0.2)) +
  labs(title = "Model Performance by Decade in Test Period",
       x = "Decade", 
       y = "Mean Closing Price") +
  scale_fill_manual(values = c("Actual" = "blue", "Predicted" = "red")) +
  theme_minimal()

SUMMARY

In this project, I built a smart system to predict where the S&P 500 stock market would close each day. Working with over 60 years of market data, I faced a common challenge: many of the price metrics move together, which can confuse prediction models.

I used a special technique called Lasso regression that automatically identifies which factors truly matter for prediction. The key finding was simple but powerful: the day’s highest price is the most important clue for predicting where the market will close. Other factors like opening price and trading volume turned out to be much less important.

The model was tested on the most recent portion of the data, covering the 2010s and 2020s, and it performed remarkably well, predicting closing prices with mean absolute errors of roughly 2% of the actual price.

CONCLUSION

This project shows that with the right approach, we can predict stock market movements with surprising accuracy. The most valuable insight for traders and investors is this: pay close attention to how high the market reaches during the day, as this strongly signals where it’s likely to close.

While no model can predict the market perfectly, this approach provides a reliable tool for making informed estimates. The success of this project demonstrates that even in the complex world of stock markets, clear patterns exist that can be uncovered with careful analysis and modern data science techniques.

For anyone watching the markets, remember: the path to predicting where we’ll end up starts with understanding how high we’ve already been.