Introduction and Objectives

Introduction

In this project, we aimed to predict the closing prices of the S&P 500 index using historical data. The dataset contains over 4 million rows spanning 1962 through 2024. Accurate predictions of stock prices are crucial for investors, financial analysts, and portfolio managers to make informed decisions. The S&P 500 is a key indicator of the overall performance of the U.S. stock market, representing the 500 largest publicly traded companies in the United States. Predicting its closing prices can provide insights into market trends, help manage investment risks, and optimize trading strategies.

Importance of Predicting the S&P 500 Closing Price

Accurate predictions of the S&P 500 closing price are crucial for investors and analysts to make informed decisions about market trends and investment strategies.

Objectives

The primary objective of this project was to build a predictive model for the closing prices of the S&P 500 index. The model was developed using historical data and various statistical and machine learning techniques. We aimed to:

- Clean and preprocess the dataset.

- Explore the correlations between variables.

- Address issues of multicollinearity.

- Build and evaluate a predictive model using Lasso regression.

- Compare the predicted closing prices with the actual observed values.

- Visualize and interpret the results, focusing on trends across different decades.

Data Cleaning Process

#load libraries
library(tidyverse)
library(caret)
library(glmnet)
library(reshape2)
#import data and examine it
sp500_data <- read_csv("sp500_data.csv")
#check out the first 10 rows
head(sp500_data,10)
## # A tibble: 10 × 8
##    Date        Open  High   Low Close `Adj Close` Volume Ticker
##    <date>     <dbl> <dbl> <dbl> <dbl>       <dbl>  <dbl> <chr> 
##  1 1962-01-02     0  3.55  3.45  3.48       0.574 254509 MMM   
##  2 1962-01-03     0  3.50  3.42  3.50       0.578 505190 MMM   
##  3 1962-01-04     0  3.56  3.50  3.50       0.578 254509 MMM   
##  4 1962-01-05     0  3.49  3.40  3.41       0.563 376979 MMM   
##  5 1962-01-08     0  3.42  3.37  3.39       0.560 399942 MMM   
##  6 1962-01-09     0  3.42  3.38  3.39       0.560 376979 MMM   
##  7 1962-01-10     0  3.38  3.34  3.35       0.554 304262 MMM   
##  8 1962-01-11     0  3.37  3.26  3.34       0.551 269818 MMM   
##  9 1962-01-12     0  3.38  3.27  3.27       0.541 692723 MMM   
## 10 1962-01-15     0  3.31  3.27  3.31       0.546 252595 MMM
#check out the last 10 rows
tail(sp500_data,10)
## # A tibble: 10 × 8
##    Date        Open  High   Low Close `Adj Close`  Volume Ticker
##    <date>     <dbl> <dbl> <dbl> <dbl>       <dbl>   <dbl> <chr> 
##  1 2024-07-19  180.  181.  176.  179.        179. 2131400 ZTS   
##  2 2024-07-22  181.  182.  179.  181.        181. 1532900 ZTS   
##  3 2024-07-23  181.  182.  179.  179.        179. 1329400 ZTS   
##  4 2024-07-24  179.  181.  178.  180.        180. 1309300 ZTS   
##  5 2024-07-25  181   186.  180.  181.        181. 2473700 ZTS   
##  6 2024-07-26  182.  184.  179.  180.        180. 2437300 ZTS   
##  7 2024-07-29  181.  183.  179.  182.        182. 1302900 ZTS   
##  8 2024-07-30  182.  185.  180.  182.        182. 2271300 ZTS   
##  9 2024-07-31  182.  183.  180.  180.        180. 1740100 ZTS   
## 10 2024-08-01  181.  184.  181.  182.        182. 1986443 ZTS
#check the data rows randomly
sample_n(sp500_data, 30)
## # A tibble: 30 × 8
##    Date         Open   High    Low  Close `Adj Close`   Volume Ticker
##    <date>      <dbl>  <dbl>  <dbl>  <dbl>       <dbl>    <dbl> <chr> 
##  1 1986-04-18   0     17.1   16.9   17.1        1.10      2400 WELL  
##  2 2006-06-29  56.1   57.3   56.0   56.8       28.9    3752376 JCI   
##  3 2018-01-16 149.   149.   146.   146.       129.      570300 LHX   
##  4 1999-05-28  27.8   29.8   27.5   29.6       25.2    2680000 VRSN  
##  5 1987-12-14   7.72   8.30   7.72   8.27       3.57   1040787 APD   
##  6 2021-12-15  58.3   58.9   57.0   58.6       57.4   14471600 GM    
##  7 2011-01-13  11.9   12.0   11.8   11.9       11.2     860200 LKQ   
##  8 1989-04-21   3.23   3.24   3.19   3.24       0.951  1189181 PCAR  
##  9 2014-07-01  27.2   27.8   27.0   27.7       25.2    7240100 LUV   
## 10 1997-03-04 183.   184.   175.   175.        93.2     424680 C     
## # ℹ 20 more rows
#compute the summary statistics of the S&P 500 data
summary(sp500_data)
##       Date                 Open               High               Low          
##  Min.   :1962-01-02   Min.   :   0.000   Min.   :   0.005   Min.   :   0.005  
##  1st Qu.:1994-09-14   1st Qu.:   8.312   1st Qu.:   8.867   1st Qu.:   8.617  
##  Median :2006-07-11   Median :  25.640   Median :  26.064   Median :  25.400  
##  Mean   :2004-05-26   Mean   :  54.624   Mean   :  55.562   Mean   :  54.269  
##  3rd Qu.:2016-01-05   3rd Qu.:  57.710   3rd Qu.:  58.360   3rd Qu.:  57.050  
##  Max.   :2024-08-01   Max.   :8700.000   Max.   :8700.000   Max.   :8570.510  
##      Close            Adj Close            Volume             Ticker         
##  Min.   :   0.005   Min.   :   0.002   Min.   :0.000e+00   Length:4225194    
##  1st Qu.:   8.746   1st Qu.:   4.247   1st Qu.:4.813e+05   Class :character  
##  Median :  25.750   Median :  17.080   Median :1.439e+06   Mode  :character  
##  Mean   :  54.931   Mean   :  46.978   Mean   :6.307e+06                     
##  3rd Qu.:  57.725   3rd Qu.:  46.186   3rd Qu.:3.821e+06                     
##  Max.   :8661.980   Max.   :8661.980   Max.   :9.231e+09

Initial Data Inspection

Upon inspecting the dataset, we observed that the ‘Open’ and ‘Volume’ variables contained zeros. These zeros are problematic because they can introduce bias into the model and lead to inaccurate predictions. Zeros in ‘Open’ and ‘Volume’ likely represent missing data or erroneous entries, so they had to be removed before proceeding with the analysis.
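
Before removing these rows, the scale of the problem can be checked by counting how many rows are affected. This is a minimal sketch using the dplyr verbs already loaded via the tidyverse:

#count the rows where "Open" or "Volume" is recorded as zero
sp500_data %>%
  summarise(
    zero_open   = sum(Open == 0),
    zero_volume = sum(Volume == 0)
  )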

#removing the zeros in the "Open" and "Volume" variables
sp500_data <- sp500_data %>% 
  filter(Open != 0) %>%
  filter(Volume != 0)

Removing Zeros From The ‘Open’ And ‘Volume’ Variables

To ensure the integrity of the dataset, all rows where either the ‘Open’ or the ‘Volume’ variable was zero were filtered out. This step improved the reliability of the model by eliminating potential anomalies in the data.

#rename the Adj Close column for better data manipulation
sp500_data <- sp500_data %>% 
  rename("Adj_Close" = "Adj Close")
#examine the data after removing the zeros in the "Open" and "Volume" variables
#check the first 10 rows
head(sp500_data,10)
## # A tibble: 10 × 8
##    Date        Open  High   Low Close Adj_Close Volume Ticker
##    <date>     <dbl> <dbl> <dbl> <dbl>     <dbl>  <dbl> <chr> 
##  1 1970-01-02  5.73  5.76  5.72  5.73      1.06  86112 MMM   
##  2 1970-01-05  5.74  5.77  5.74  5.76      1.07 533894 MMM   
##  3 1970-01-06  5.76  5.82  5.75  5.82      1.08 210496 MMM   
##  4 1970-01-07  5.82  5.87  5.81  5.85      1.09 197101 MMM   
##  5 1970-01-08  5.85  5.94  5.84  5.93      1.10 363584 MMM   
##  6 1970-01-09  5.93  5.95  5.89  5.92      1.10 166483 MMM   
##  7 1970-01-12  5.92  5.92  5.88  5.92      1.10 141606 MMM   
##  8 1970-01-13  5.92  5.96  5.91  5.91      1.10 313830 MMM   
##  9 1970-01-14  5.91  6.00  5.91  5.97      1.11 451610 MMM   
## 10 1970-01-15  5.94  5.94  5.85  5.85      1.09 202842 MMM
#check the last 10 rows
tail(sp500_data,10)
## # A tibble: 10 × 8
##    Date        Open  High   Low Close Adj_Close  Volume Ticker
##    <date>     <dbl> <dbl> <dbl> <dbl>     <dbl>   <dbl> <chr> 
##  1 2024-07-19  180.  181.  176.  179.      179. 2131400 ZTS   
##  2 2024-07-22  181.  182.  179.  181.      181. 1532900 ZTS   
##  3 2024-07-23  181.  182.  179.  179.      179. 1329400 ZTS   
##  4 2024-07-24  179.  181.  178.  180.      180. 1309300 ZTS   
##  5 2024-07-25  181   186.  180.  181.      181. 2473700 ZTS   
##  6 2024-07-26  182.  184.  179.  180.      180. 2437300 ZTS   
##  7 2024-07-29  181.  183.  179.  182.      182. 1302900 ZTS   
##  8 2024-07-30  182.  185.  180.  182.      182. 2271300 ZTS   
##  9 2024-07-31  182.  183.  180.  180.      180. 1740100 ZTS   
## 10 2024-08-01  181.  184.  181.  182.      182. 1986443 ZTS
#check the data randomly
sample_n(sp500_data, 30)
## # A tibble: 30 × 8
##    Date         Open   High    Low  Close Adj_Close    Volume Ticker
##    <date>      <dbl>  <dbl>  <dbl>  <dbl>     <dbl>     <dbl> <chr> 
##  1 2022-06-06 139.   140.   138.   138.      134.     1123400 TT    
##  2 2010-09-16  29.6   29.8   29.4   29.6      25.1    2571000 EL    
##  3 2015-11-12  52.0   52.1   51.3   51.5      49.1    2146053 HLT   
##  4 2014-07-24  73.5   73.9   72.8   73.0      47.5    2281900 KLAC  
##  5 2021-04-21  79.5   79.8   78.3   78.4      69.7    1860800 ED    
##  6 2017-02-17  82.2   82.8   81.9   82.1      64.0     596600 CPT   
##  7 1990-12-18   5.97   5.97   5.88   5.94      3.16    805600 ITW   
##  8 2018-04-13  65.9   66.0   65.2   65.5      59.2    1282700 RSG   
##  9 2011-08-01  14.2   14.3   14.0   14.2      12.0  612836000 AAPL  
## 10 2008-02-15  12.8   13.0   12.7   12.9      12.9    5378800 FI    
## # ℹ 20 more rows
#compute the summary statistics of the S&P 500 data
summary(sp500_data)
##       Date                 Open               High               Low          
##  Min.   :1962-01-02   Min.   :   0.005   Min.   :   0.005   Min.   :   0.005  
##  1st Qu.:1997-08-22   1st Qu.:  11.437   1st Qu.:  11.590   1st Qu.:  11.260  
##  Median :2007-12-28   Median :  28.890   Median :  29.250   Median :  28.505  
##  Mean   :2006-04-15   Mean   :  58.930   Mean   :  59.617   Mean   :  58.228  
##  3rd Qu.:2016-08-24   3rd Qu.:  61.850   3rd Qu.:  62.520   3rd Qu.:  61.150  
##  Max.   :2024-08-01   Max.   :8700.000   Max.   :8700.000   Max.   :8570.510  
##      Close            Adj_Close            Volume             Ticker         
##  Min.   :   0.005   Min.   :   0.003   Min.   :4.700e+01   Length:3914997    
##  1st Qu.:  11.438   1st Qu.:   6.123   1st Qu.:6.036e+05   Class :character  
##  Median :  28.900   Median :  19.732   Median :1.622e+06   Mode  :character  
##  Mean   :  58.940   Mean   :  50.590   Mean   :6.772e+06                     
##  3rd Qu.:  61.869   3rd Qu.:  50.248   3rd Qu.:4.140e+06                     
##  Max.   :8661.980   Max.   :8661.980   Max.   :9.231e+09
#remove rows with missing values
sp500_data <- na.omit(sp500_data)
#remove duplicated rows, assigning the result back so it is kept
sp500_data <- distinct(sp500_data)
sp500_data
## # A tibble: 3,914,997 × 8
##    Date        Open  High   Low Close Adj_Close Volume Ticker
##    <date>     <dbl> <dbl> <dbl> <dbl>     <dbl>  <dbl> <chr> 
##  1 1970-01-02  5.73  5.76  5.72  5.73      1.06  86112 MMM   
##  2 1970-01-05  5.74  5.77  5.74  5.76      1.07 533894 MMM   
##  3 1970-01-06  5.76  5.82  5.75  5.82      1.08 210496 MMM   
##  4 1970-01-07  5.82  5.87  5.81  5.85      1.09 197101 MMM   
##  5 1970-01-08  5.85  5.94  5.84  5.93      1.10 363584 MMM   
##  6 1970-01-09  5.93  5.95  5.89  5.92      1.10 166483 MMM   
##  7 1970-01-12  5.92  5.92  5.88  5.92      1.10 141606 MMM   
##  8 1970-01-13  5.92  5.96  5.91  5.91      1.10 313830 MMM   
##  9 1970-01-14  5.91  6.00  5.91  5.97      1.11 451610 MMM   
## 10 1970-01-15  5.94  5.94  5.85  5.85      1.09 202842 MMM   
## # ℹ 3,914,987 more rows

Correlation Heatmap

After cleaning the data, a correlation heatmap was generated to explore the relationships between the numerical variables in the dataset.

#check for correlations

# Select numerical columns for correlation analysis
numerical_vars <- sp500_data %>%
  select("Open", "High", "Low", "Close", "Adj_Close" , "Volume") 

# Compute the correlation matrix
cor_matrix <- cor(numerical_vars, use = "complete.obs")

# Melt the correlation matrix into a long format
melted_cor_matrix <- melt(cor_matrix)

# Plot the heatmap
ggplot(data = melted_cor_matrix, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
                       midpoint = 0, limit = c(-1, 1), name = "Correlation") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Correlation Matrix Heatmap",
       x = "Variables",
       y = "Variables")

Interpretation Of The Correlation Heatmap

Color Scale:

Red Areas: Indicate a high positive correlation, close to +1. This means that as one variable increases, the other tends to increase as well.

White/Light Areas: Indicate low or no correlation, close to 0. This means there is little to no linear relationship between the variables.

Blue Areas: Would indicate a high negative correlation, close to -1 (but none are present in this map, indicating no negative correlations).

The heatmap revealed high correlations between certain variables, particularly among High, Low, Close, and Adj_Close. High correlations between predictor variables lead to multicollinearity, which can distort the results of a linear regression model and makes it difficult to determine the individual effect of each predictor on the dependent variable (Close).
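
Beyond the visual impression, the degree of multicollinearity can be quantified with variance inflation factors (VIFs), where values above roughly 10 are a common warning sign. A minimal sketch, assuming the car package is installed (it is not loaded elsewhere in this project):

#quantify multicollinearity with variance inflation factors
library(car)
vif_model <- lm(Close ~ High + Low + Open + Adj_Close + Volume, data = sp500_data)
vif(vif_model) #values well above 10 would indicate severe multicollinearity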

Train-Test Data Split

Explanation

The dataset is split into training and testing sets using an 80/20 ratio. The training set, containing 80% of the data, is used to fit the model, while the testing set, containing the remaining 20%, is used to evaluate the model’s performance.

set.seed(145)
training.sample <- sp500_data$Close %>% createDataPartition(p = 0.8, list = FALSE)
train.data <- sp500_data[training.sample, ]
test.data <- sp500_data[-training.sample, ]

Why Lasso Regression Instead Of Linear Regression

Given the high correlation between variables, a standard linear regression model (lm) would be unsuitable. High multicollinearity can inflate the variance of the coefficient estimates, leading to unreliable and unstable results.

Lasso Regression

Lasso (Least Absolute Shrinkage and Selection Operator) regression is a regularization technique that adds a penalty to the model based on the absolute size of the coefficients. This penalty helps to shrink the coefficients of less important variables to zero, effectively performing variable selection and reducing multicollinearity.
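
Formally, for a response y and predictors x with coefficients β, Lasso (as implemented in glmnet with alpha = 1 and a Gaussian response) chooses the coefficients that minimize a penalized least-squares objective:

$$\hat{\beta} = \arg\min_{\beta_0,\,\beta} \; \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - \beta_0 - x_i^{\top}\beta \right)^2 + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert$$

Here λ ≥ 0 controls the strength of the penalty: larger values shrink more coefficients exactly to zero. The cv.glmnet() call below selects λ by cross-validation.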

# predictor variables
x <- model.matrix(Close ~ High + Low + Open + Adj_Close + Volume, data = train.data)[,-1]

#outcome variable
y <- train.data$Close


#compute lasso regression 

set.seed(123)
lasso <- cv.glmnet(x, y, alpha = 1)

#display the best lambda value
lasso$lambda.min
## [1] 4.310525
# fit the model in the training set
model <- glmnet(x , y, alpha = 1, lambda = lasso$lambda.min)

#model coefficients
coef(model)
## 6 x 1 sparse Matrix of class "dgCMatrix"
##                       s0
## (Intercept) 1.7262411388
## High        0.9593727299
## Low         0.0003263719
## Open        .           
## Adj_Close   .           
## Volume      .
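
Note that cv.glmnet() also records lambda.1se, the largest λ whose cross-validated error is within one standard error of the minimum; it usually yields an even sparser model. A quick sketch for inspecting both choices:

#compare the two lambda values reported by cross-validation
lasso$lambda.min #lambda minimizing cross-validated error (used above)
lasso$lambda.1se #largest lambda within one standard error of the minimum
plot(lasso)      #cross-validation error curve with both lambdas marked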

Interpretation of the Lasso Model

Model Coefficients

High (0.9594):

For each additional unit increase in the high price of the day, the predicted closing price increases by approximately 0.9594 units. This suggests that the higher the highest price of the day, the higher the closing price tends to be. It reflects a strong positive relationship between the high price and the closing price.

Low (0.0003):

For each additional unit increase in the low price of the day, the predicted closing price increases by approximately 0.0003 units. This coefficient is very small, indicating that the low price has a minimal effect on the closing price. In practical terms, changes in the low price of the day have a very small impact on the closing price.

Open (.), Adj_Close (.) and Volume (.)

The coefficients for the opening price, adjusted close price, and volume are zero, which means that, according to the Lasso model, these variables do not contribute to predicting the closing price. They are effectively excluded from the model.
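
As a quick illustration of how the fitted equation is applied, the non-zero coefficients above can be plugged in by hand. The High and Low values here are hypothetical, chosen only for illustration:

#apply the fitted Lasso equation manually to hypothetical inputs
high <- 100 #hypothetical daily high price
low  <- 97  #hypothetical daily low price
1.7262411388 + 0.9593727299 * high + 0.0003263719 * low #predicted close, about 97.7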

Predictions From The Model

# make predictions on the test data
x.test <- model.matrix(Close ~ High + Low + Open + Volume + Adj_Close , test.data)[,-1]

predictions <- model %>% predict(x.test) %>% as.vector()
print(head(predictions,50))
##  [1] 7.261702 7.430940 7.412153 7.242881 7.236614 7.180157 6.785259 7.217809
##  [9] 7.142589 7.230342 7.224076 7.061075 7.067358 6.835410 6.716303 6.816565
## [17] 6.164623 6.258662 6.152060 6.039244 5.826107 5.706965 5.518933 5.725798
## [25] 5.807292 5.782235 5.775962 5.794767 5.851170 5.857450 5.813562 6.152090
## [33] 6.001637 6.039240 6.045526 6.127018 6.208512 6.189710 6.233592 6.264880
## [41] 6.032984 6.051792 6.139543 6.252404 6.289987 6.434190 6.590918 6.540772
## [49] 6.691222 6.653603

The Model Accuracy

The model’s performance is evaluated using the Root Mean Squared Error (RMSE) and R-squared (R2) metrics.
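
For reference, RMSE is the square root of the average squared prediction error,

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \hat{y}_i - y_i \right)^2}$$

while caret's R2() function, by default, reports the squared correlation between the predicted and observed values.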

#RMSE of the model
RMSE(predictions, test.data$Close)
## [1] 4.691944
#R2 for the model
R2(predictions, test.data$Close)
## [1] 0.9998433

A lower RMSE indicates better model performance. In this case, an RMSE of 4.691944 means that, on average, the model’s predictions are off by about 4.69 units from the actual closing prices, which is low relative to the range of closing prices in the data.

R² of 0.9998433: the model explains nearly 100% of the variance in the closing price, indicating a very high level of predictive accuracy.

Together, these metrics suggest that the model performs exceptionally well in predicting the closing price of the S&P 500.
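
These metrics can also be recomputed directly from the predictions, which makes their definitions concrete. A minimal sketch using base R:

#recompute the accuracy metrics by hand
sqrt(mean((predictions - test.data$Close)^2)) #RMSE
cor(predictions, test.data$Close)^2           #R-squared as squared correlation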

Compare The Actual And Predicted Closing Prices

# Combine actual closing price and predicted closing price values into a data frame
compare_results <- data.frame(
  Actual = test.data$Close,
  Predicted = predictions
)

compare_results$residuals <- compare_results$Actual - compare_results$Predicted

# View the first, last and random few rows
head(compare_results, 8)
##     Actual Predicted residuals
## 1 5.761392  7.261702 -1.500310
## 2 5.931229  7.430940 -1.499711
## 3 5.924697  7.412153 -1.487456
## 4 5.696070  7.242881 -1.546811
## 5 5.741796  7.236614 -1.494818
## 6 5.500105  7.180157 -1.680052
## 7 5.238817  6.785259 -1.546442
## 8 5.683006  7.217809 -1.534803
tail(compare_results, 8)
##        Actual Predicted residuals
## 782990 172.58  170.1139  2.466079
## 782991 174.81  170.6225  4.187449
## 782992 174.96  170.5943  4.365663
## 782993 175.43  171.7265  3.703443
## 782994 174.24  170.3734  3.866632
## 782995 182.05  179.1156  2.934380
## 782996 179.66  175.2774  4.382580
## 782997 181.83  179.0675  2.762520
sample_n(compare_results,8)
##      Actual Predicted    residuals
## 1 51.020000  51.01612  0.003877226
## 2 31.343857  32.14288 -0.799018613
## 3  9.958333  11.38318 -1.424846338
## 4 65.339996  65.71829 -0.378289294
## 5 26.975000  27.62366 -0.648662859
## 6 60.680000  60.04653  0.633474208
## 7  9.388330  10.78382 -1.395489351
## 8 82.309998  81.49604  0.813959351

Analyzing The Residuals Of The Model

For the first row shown above: Actual = 5.761392, Predicted = 7.261702, Residual = -1.500310. This means the model predicted the closing price to be higher than it actually was by about 1.50 units.

For the last rows, the residuals turn positive; for example, Actual = 174.81, Predicted = 170.6225, Residual = 4.187449. There the model predicted the closing price to be lower than it actually was by about 4.19 units.

The Magnitude Of Residuals:

Small Residuals:

The residuals are generally small, indicating that the model’s predictions are very close to the actual values. In the random sample above, residuals such as 0.003877 and -0.378289 are relatively small.

Larger Residuals:

There are instances where the residuals are larger, such as 4.187449 and 4.382580 in the last rows. These indicate larger discrepancies between the actual and predicted values. However, given the high R² value (0.9998433), such discrepancies are relatively rare.
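
To see how the residuals are distributed across the full test set rather than a handful of displayed rows, their summary statistics and tail quantiles can be computed from the compare_results data frame built above:

#distribution of residuals across the entire test set
summary(compare_results$residuals)
quantile(compare_results$residuals, probs = c(0.01, 0.05, 0.5, 0.95, 0.99))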

Compare The Observed And Predicted Closing Prices By Decade

To see how the model performed across eras, the mean actual and mean predicted closing prices were calculated and compared by decade.

#Add the decade column

compare_results <- compare_results %>% 
  mutate(Date = test.data$Date) %>% #add a date column from the test data
  mutate(Decade = floor(as.numeric(format(Date, "%Y")) / 10) * 10) #extract the decade from the four-digit year

# Group by decade and calculate mean for closing price 
decade_summary <- compare_results %>%
  group_by(Decade) %>%
  summarise(
    Mean_Actual = mean(Actual, na.rm = TRUE),
    Mean_Predicted = mean(Predicted, na.rm = TRUE)
  )


decade_summary <- decade_summary %>%
  mutate(Decade = case_when(
    Decade == 1960 ~ "1960s",
    Decade == 1970 ~ "1970s",
    Decade == 1980 ~ "1980s",
    Decade == 1990 ~ "1990s",
    Decade == 2000 ~ "2000s",
    Decade == 2010 ~ "2010s",
    Decade == 2020 ~ "2020s",
    TRUE ~ as.character(Decade) # Default case
  ))



# Reshape the data for easier plotting
decade_summary_long <- decade_summary %>%
  pivot_longer(cols = c(Mean_Actual, Mean_Predicted),
               names_to = "Type",
               values_to = "Mean_Value")

head(decade_summary,7)
## # A tibble: 7 × 3
##   Decade Mean_Actual Mean_Predicted
##   <chr>        <dbl>          <dbl>
## 1 1960s         2.67           4.31
## 2 1970s         5.00           6.57
## 3 1980s         7.83           9.32
## 4 1990s        17.0           18.2 
## 5 2000s        35.5           36.3 
## 6 2010s        73.6           73.0 
## 7 2020s       164.           162.

Strong Performance:

In the most recent decades (2000s, 2010s, and 2020s), the predicted means closely align with the actual means, differing by only a small fraction of the price level.

Overprediction In Early Decades:

In the 1960s through 1990s, the model consistently overpredicted the mean closing price by roughly 1.2 to 1.6 units. In absolute terms this is small, but at the low nominal price levels of those decades it is a large relative error.

Recent Decades:

In the 2010s and 2020s, the predicted means fall slightly below the actual means, but the gaps are small relative to the much higher price levels.

Bar Graph Of Market Trends By Decade, From The 1960s To The 2020s

# Create the bar graph for the actual mean and predicted mean 
ggplot(decade_summary_long, aes(x = Decade, y = Mean_Value, fill = Type)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Mean Closing Price by Decade (Actual vs. Predicted)",
       x = "Decade",
       y = "Mean Closing Price") +
  scale_fill_manual(name = "Legend", values = c("Mean_Actual" = "blue", "Mean_Predicted" = "red")) +
  theme_minimal()

Decade Analysis:

1960s: Lowest mean closing prices; stocks traded at very low nominal levels.

1970s: Roughly double the 1960s mean, but still low.

1980s: Continued gradual growth in mean closing prices.

1990s: A clear step up, with the mean more than doubling relative to the 1980s.

2000s: Further strong growth, with the mean roughly doubling again.

2010s: A sharp rise, with the mean roughly doubling once more relative to the 2000s.

2020s: Highest mean closing prices of any decade in the data.

Historical Peaks:

The highest mean closing prices were observed in the 2020s, the most recent decade in the data.

Steady Growth:

Mean closing prices rose in every decade from the 1960s through the 2020s, with growth accelerating from the 1990s onward.

Long-Run Trend:

The consistent decade-over-decade rise reflects long-run growth in the nominal prices of S&P 500 constituents; downturns within individual decades are smoothed out at this level of aggregation.

Broader Market Implications:

Economic Cycles:

Decade-level means average over shorter boom and bust periods, so recessions within a decade do not appear as declines at this resolution; the long-run upward trend dominates.

Model Accuracy:

The model’s accuracy varied across decades: it systematically overpredicted in the low-priced early decades, while in recent decades the predicted and actual means differ by only a small fraction of the price level.

SUMMARY

This project aimed to predict the closing price of the S&P 500 index using Lasso regression, addressing the challenge of multicollinearity in the dataset. The data was cleaned to remove zeros in the ‘Open’ and ‘Volume’ variables, a correlation heatmap was analyzed, and the data was split into training and testing sets. Lasso regression was chosen over linear regression due to high multicollinearity, and the model was evaluated using RMSE and R-squared metrics. Finally, the predicted closing prices were compared with the actual values, with a decade-wise bar graph used to visualize market trends over time.

CONCLUSION

For investors and analysts, the model suggests that the High price of the day is a crucial factor in forecasting the closing price. This could be useful for trading strategies or market analysis where the daily high price can be used to estimate the expected closing price.

The Lasso regression model demonstrated strong predictive performance, accurately capturing the trend of the S&P 500 closing prices over time. The high R-squared value and low RMSE indicate that the model is well-suited for predicting the closing prices. The project highlights the importance of addressing multicollinearity in predictive modeling and demonstrates the effectiveness of Lasso regression in financial data analysis.