To get started, let’s first talk about what stocks are and what the S&P 500 is.

In short, a stock is a small piece of ownership in a company. When you buy a stock, you own a share of that company. If the company does well, your stock’s value may go up, and you might earn money. If the company does poorly, the value can go down.

What is the S&P 500? The Standard and Poor’s 500, or simply the S&P 500, is a stock market index tracking the stock performance of 500 leading companies listed on stock exchanges in the United States. It is one of the most commonly followed equity indices and includes approximately 80% of the total market capitalization of U.S. public companies. Put simply, it is like a report card for the U.S. stock market: it shows how large, important companies are doing overall. It tracks about 500 of the largest publicly traded companies in the USA. (Source: Wikipedia 2025)

📉 Major S&P 500 Crashes (Summary)

| Year | Name                  | Drop         | Cause                             |
| ---- | --------------------- | ------------ | --------------------------------- |
| 1929 | Great Depression      | ~86%         | Stock bubble burst                |
| 1987 | Black Monday          | ~20% (1 day) | Panic + computer trading          |
| 2000 | Dot-Com Crash         | ~49%         | Tech bubble burst                 |
| 2008 | Financial Crisis      | ~57%         | Housing + bank collapse           |
| 2020 | COVID Crash           | ~34%         | Pandemic panic                    |
| 2022 | Inflation Bear Market | ~25%         | Inflation + interest rates        |
| 2025 | Tariff Crash          | ~19%         | President Trump’s extreme tariffs |
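
Each figure in the Drop column is a peak-to-trough decline. As a rough sketch of that arithmetic, the function below finds the largest such decline in a price series. The name `max_drawdown` and the prices are made up for illustration; they are not real index values.

```r
# Hypothetical helper: largest peak-to-trough decline in a price series,
# returned as a negative fraction (e.g. -0.25 means a 25% drop).
max_drawdown <- function(prices) {
  running_peak <- cummax(prices)                       # highest price seen so far
  drawdowns <- (prices - running_peak) / running_peak  # decline from that peak
  min(drawdowns)
}

# Invented closing prices, not real S&P 500 data
prices <- c(100, 120, 90, 110, 60, 80)
dd <- max_drawdown(prices)
cat(sprintf("Max drawdown: %.1f%%\n", 100 * dd))
```

Here the peak is 120 and the lowest later price is 60, so the function reports a 50% decline.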

Goal for this project: The objective of this analysis is to explore the relationships between financial indicators (e.g., revenue growth, market capitalization, sector classification) and stock performance (e.g., current price) of companies in the S&P 500. Specifically, we aim to identify which company attributes are most strongly associated with higher stock valuations and to evaluate the predictive power of select features using logistic regression and ROC analysis.

First, we will install the extra package this analysis needs.

# Run once; skip if GGally is already installed
install.packages("GGally")

We will now load the necessary libraries and the dataset.

library(tidyverse)  # includes ggplot2 and dplyr
library(GGally)

data("sp500")
# Optional copy with incomplete rows removed (the code below keeps using sp500)
SP500 <- na.omit(sp500)

Let us take a look at the top 10 companies with the highest current stock price. According to the table and the bar graph, the company “NVR”, founded in 1980 by Dwight Schar, currently has the highest share price in the S&P 500. NVR is considered one of America’s leading homebuilders in the East. It operates in two business segments: homebuilding and mortgage banking. The homebuilding unit builds and sells homes under the Ryan Homes, NVHomes, and Heartland Homes brands.

# Order dataset by Currentprice descending, then take top 10 rows
top10_companies <- sp500[order(-sp500$Currentprice), ][1:10, ]

# View the result
print(top10_companies)
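
To see what the ordering step does, here is a minimal sketch on a toy data frame. The `Currentprice` column mirrors the one used above; the `Symbol` column and all the rows are assumptions made up for illustration.

```r
# Toy data frame with invented rows (not the real sp500 data)
toy <- data.frame(
  Symbol = c("AAA", "BBB", "CCC", "DDD", "EEE"),
  Currentprice = c(120.5, NA, 980.0, 45.2, 310.7)
)

# order() places NA last by default, so missing prices never reach the top
top3 <- head(toy[order(toy$Currentprice, decreasing = TRUE), ], 3)
print(top3)
```

The same pattern with `10` in place of `3` gives a top 10 list, and rows with a missing price stay out of it.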

stocks <- data.frame(
  company = c("NVR", "BKNG", "AZO", "FICO", "TDG", "MTD", "ORLY", "TPL", "GWW", "NOW"),
  price = c(8276.28, 5048.59, 3253.47, 2090.98, 1276.15, 1230.74, 1219.11, 1133.12, 1092.96, 1091.25)
)

top_stocks <- stocks %>%
  arrange(desc(price)) %>%
  head(10)

ggplot(top_stocks, aes(x = reorder(company, price), y = price, fill = company)) +
  geom_bar(stat = "identity") +
  coord_flip() +  # horizontal bars for readability
  labs(title = "Top 10 Stock Prices", x = "Symbol", y = "Current Price") +
  theme_minimal()

This scatter plot illustrates the relationship between revenue growth and the current stock price of companies within the S&P 500 index. Each point represents a company, with its position determined by its revenue growth rate on one axis and its current stock price on the other. This visualization helps identify patterns or correlations between how much a company’s revenue is increasing and how the market is currently valuing its stock.

ggplot(sp500, aes(x = Revenuegrowth, y = Currentprice)) +
  geom_point() +
  labs(title = "Scatterplot of Current Price vs Revenue Growth",
       x = "Revenue growth",
       y = "Current price")
Warning: Removed 3 rows containing missing values or values outside the
scale range (`geom_point()`).
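
A scatter plot shows a relationship visually; a correlation coefficient puts one number on it. The sketch below uses invented values standing in for `Revenuegrowth` and `Currentprice`, including missing values like the ones ggplot2 warned about.

```r
# Invented values standing in for Revenuegrowth and Currentprice
growth <- c(0.02, 0.10, NA, 0.25, -0.05, 0.40)
price  <- c(50, 80, 120, 150, 40, NA)

# use = "complete.obs" keeps only pairs where both values are present
r <- cor(growth, price, use = "complete.obs")
print(round(r, 3))
```

A value near 1 or -1 would indicate a strong linear relationship, and a value near 0 a weak one. With the real data, the two `sp500` columns would take the place of the toy vectors.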

Now, let us visualize revenue growth across the S&P 500 using a boxplot. Revenue growth measures the percentage change in a company’s sales over a period. Taken together, these values reflect the sales performance of large-cap U.S. companies and are one indicator of the overall health of the U.S. economy.

ggplot(sp500)+
  aes(`Revenuegrowth`)+
  geom_boxplot() 
Warning: Removed 3 rows containing non-finite outside the scale range
(`stat_boxplot()`).
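
The box in a boxplot spans the first to third quartile, and points far outside it are drawn individually. As a small sketch, the quartiles and the common 1.5 × IQR outlier rule can be computed by hand. The values below are invented, not taken from `sp500`.

```r
# Invented revenue growth values, including one extreme value and one NA
growth <- c(-0.02, 0.01, 0.03, 0.05, 0.08, 0.12, 0.90, NA)

q <- quantile(growth, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)
iqr <- q[["75%"]] - q[["25%"]]   # interquartile range: height of the box

# Values beyond 1.5 * IQR from the quartiles count as outliers
lower_fence <- q[["25%"]] - 1.5 * iqr
upper_fence <- q[["75%"]] + 1.5 * iqr
outliers <- growth[!is.na(growth) & (growth < lower_fence | growth > upper_fence)]
print(outliers)
```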

Let’s take a look at the exchanges where S&P 500 companies are listed. According to the results, NYQ (Yahoo Finance’s code for the New York Stock Exchange) has the most companies of all the exchanges.

sp500 %>%
  count(Exchange, sort = TRUE) %>%
  head(4)

Let us visualize the sectors in the S&P 500. S&P 500 sectors are groupings of companies in the index, categorized by their primary industry or business activity. These sectors allow investors to analyze and track the performance of specific industries within the broader market and to potentially make more targeted investments.

ggplot(sp500)+
  aes(`Sector`)+
  geom_bar() +
 theme(axis.text.x = element_text(angle = 45, hjust = 1))

Using data from Yahoo Finance, this chart shows the S&P 500 index from November 1996 to today. The index traces its origins to 1923 and took its current 500-company form in 1957. We start in 1996 because that is when the dot-com bubble began to inflate in the late 1990s. (TIP: The data is downloaded from Yahoo Finance each time the code runs, so re-running it in Posit picks up newer prices, up to the end_date set in the code.)

library(quantmod)  # For downloading financial data
library(ggplot2)

start_date <- "1996-11-22"
end_date <- "2025-12-31" 

# Download S&P 500 data (using its common ticker symbol ^GSPC)
# The quantmod package can download data from sources like Yahoo Finance
getSymbols("^GSPC", src = "yahoo", from = start_date, to = end_date, auto.assign = TRUE)
[1] "GSPC"
# You'll now have a time series object named 'GSPC' in your R environment

# Convert the xts object to a data frame for use with ggplot2
GSPC_df <- data.frame(Date = index(GSPC), coredata(GSPC))

# Rename the columns to something more convenient
colnames(GSPC_df) <- c("Date", "Open", "High", "Low", "Close", "Volume", "Adjusted")

# Create a line chart of the S&P 500 closing price
ggplot(GSPC_df, aes(x = Date, y = Close)) +
  geom_line() + # Create the line chart
  labs(
    title = "S&P 500 Closing Price, 1996 to 2025",
    x = "Date",
    y = "Closing Price"
  ) +
  theme_minimal() 

On the chart, you can see the sharp drop in early April 2025 that followed President Trump’s extensive tariffs. The S&P 500 rebounded later in mid-April, but the initial downturn on April 7th and the preceding days was a direct consequence of the tariff policies and the resulting market panic.
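
To put a number on a single-day move like the ones in early April, daily returns can be computed from consecutive closing prices. The sketch below uses invented prices, not the downloaded values; with the real data, the same arithmetic would apply to `GSPC_df$Close`.

```r
# Invented closing prices for six consecutive trading days
close <- c(5600, 5650, 5400, 5080, 5060, 5450)

# Daily return = (today's close - yesterday's close) / yesterday's close
daily_return <- diff(close) / head(close, -1)

# +1 because diff() has no return for the first day
worst_day <- which.min(daily_return) + 1
cat(sprintf("Worst day: day %d, %.2f%%\n", worst_day, 100 * min(daily_return)))
```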

Now, let us see which countries the S&P 500 companies are based in. About 93% of the companies in the index are based in the USA.

ggplot(sp500, aes(Country)) +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 55, hjust = 1)) +
  geom_text(stat = "count", 
            aes(label = after_stat(count)),
            vjust = -0.25
  )
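
The bar labels show counts; a percentage share like the one quoted above comes from dividing each count by the total. Here is a minimal sketch with invented country labels standing in for the `Country` column.

```r
# Invented country labels (the real column is sp500$Country)
country <- c(rep("United States", 14), "Ireland", "Switzerland")

counts <- table(country)
shares <- round(100 * prop.table(counts), 1)   # percentage of companies per country
print(sort(shares, decreasing = TRUE))
```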

This bar graph shows which state each S&P 500 company is based in. I used theme() to rotate the axis labels so they do not overlap.

ggplot(sp500, aes(State)) +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 55, hjust = 1)) 

The ROC curve plots the true positive rate against the false positive rate of the logistic regression model as the classification threshold changes. It shows how well the model separates the two classes.

# Install the following packages for all these lines of code to work
library(rsample)
library(brglm2)
library(caret)
library(pROC)
# Create a Categorical Variable
sp500 <- sp500 %>%
  mutate(Tech = if_else(Sector %in% c("Technology"), "1", "0" ))

# Splitting the Data into Train and Test Data
set.seed(123)
split <- initial_split(sp500, prop = 0.5, strata = Exchange)

train_data <- training(split)
test_data <- testing(split)

# Fitting the Logistic Regression
model <- glm(as.factor(Tech) ~ as.factor(Exchange), data = train_data, family = "binomial", method = "brglmFit")
Warning: brglmFit: algorithm did not converge. Try changing the optimization algorithm defaults, e.g. the defaults for one or more of `maxit`, `epsilon`, `slowit`, and `response_adjustment`; see `?brglm_control` for default values and available options
Warning: brglmFit: fitted probabilities numerically 0 or 1 occurred
# Testing the Logistic Regression Model
test_data$predicted_prob <- predict(model, newdata = test_data, type = "response")
test_data$predicted_class <- ifelse(test_data$predicted_prob > 0.5, 1, 0)

# Give both factors the same levels so confusionMatrix() can compare them directly
test_data$actual <- factor(test_data$Tech, levels = c("0", "1"))
test_data$predicted_class <- factor(test_data$predicted_class, levels = c(0, 1))

# Confusion Matrix (evaluates how well the model classifies the test data)
confusionMatrix(test_data$predicted_class, test_data$actual)
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 216  35
         1   0   0
                                          
               Accuracy : 0.8606          
                 95% CI : (0.8114, 0.9009)
    No Information Rate : 0.8606          
    P-Value [Acc > NIR] : 0.5449          
                                          
                  Kappa : 0               
                                          
 Mcnemar's Test P-Value : 9.081e-09       
                                          
            Sensitivity : 1.0000          
            Specificity : 0.0000          
         Pos Pred Value : 0.8606          
         Neg Pred Value :    NaN          
             Prevalence : 0.8606          
         Detection Rate : 0.8606          
   Detection Prevalence : 1.0000          
      Balanced Accuracy : 0.5000          
                                          
       'Positive' Class : 0               
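
The statistics in this output follow directly from the four cells of the confusion matrix. The sketch below redoes the arithmetic with the counts printed above. Because the model predicted "0" for every test row, accuracy equals the No Information Rate and specificity is 0.

```r
# Counts copied from the confusion matrix above (positive class = "0")
tp <- 216   # predicted 0, actually 0
fn <- 0     # predicted 1, actually 0
fp <- 35    # predicted 0, actually 1
tn <- 0     # predicted 1, actually 1

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
balanced    <- (sensitivity + specificity) / 2

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, balanced = balanced), 4))
```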
                                          
roc_obj <- roc(test_data$actual, test_data$predicted_prob)
Setting levels: control = 0, case = 1
Setting direction: controls < cases
plot(roc_obj, main = "ROC Curve")
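
To show what a ROC curve is built from, the sketch below computes the true and false positive rates at several thresholds by hand, then the area under the curve with the trapezoid rule. The labels, probabilities, and the helper `rates_at` are invented for illustration. It is not how pROC computes its result internally.

```r
# Invented labels (1 = Technology) and predicted probabilities
actual <- c(0, 0, 1, 0, 1, 1, 0, 1)
prob   <- c(0.10, 0.30, 0.35, 0.40, 0.60, 0.70, 0.20, 0.90)

# Hypothetical helper: false and true positive rates at one threshold
rates_at <- function(threshold) {
  predicted <- as.integer(prob >= threshold)
  c(fpr = sum(predicted == 1 & actual == 0) / sum(actual == 0),
    tpr = sum(predicted == 1 & actual == 1) / sum(actual == 1))
}

# Sweep thresholds from "predict nothing" down to "predict everything"
thresholds <- c(1.01, sort(unique(prob), decreasing = TRUE), 0)
roc_points <- t(sapply(thresholds, rates_at))

# Area under the curve by the trapezoid rule
auc <- sum(diff(roc_points[, "fpr"]) *
           (head(roc_points[, "tpr"], -1) + tail(roc_points[, "tpr"], -1)) / 2)
print(auc)
```

An area of 0.5 corresponds to random guessing and 1 to perfect separation.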

Finally, to conclude our analysis of the S&P 500 in 2025, here is a linear regression model. Linear regression finds the best-fitting straight line through a set of data points. For this model, you need two numerical variables; here, I use Revenue Growth and Market Cap. The model describes how one variable changes as the other increases or decreases, which can help predict values of one from the other. The Y-axis shows market cap in scientific notation: for example, 1e+12 means 1 trillion.

# Use a new name so the logistic regression model above is not overwritten
lm_model <- lm(Marketcap ~ Revenuegrowth, data = sp500)
summary(lm_model)

Call:
lm(formula = Marketcap ~ Revenuegrowth, data = sp500)

Residuals:
       Min         1Q     Median         3Q        Max 
-5.740e+11 -8.529e+10 -6.263e+10 -2.039e+10  3.739e+12 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   8.866e+10  1.638e+10   5.413 9.64e-08 ***
Revenuegrowth 3.143e+11  8.469e+10   3.712 0.000229 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.407e+11 on 497 degrees of freedom
  (3 observations deleted due to missingness)
Multiple R-squared:  0.02697,   Adjusted R-squared:  0.02501 
F-statistic: 13.78 on 1 and 497 DF,  p-value: 0.0002293
ggplot(sp500, aes(x = Revenuegrowth, y = Marketcap)) +
  geom_point(alpha = 0.6) +  # no jitter, so points stay at their true values
  # linewidth replaces size for lines in ggplot2 3.4.0 and later
  geom_smooth(method = "lm", se = TRUE, color = "turquoise", linewidth = 1) +
  labs(title = "Linear Regression: Market Cap vs Revenue Growth",
       x = "Revenue growth", y = "Market cap") +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 3 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 3 rows containing missing values or values outside the
scale range (`geom_point()`).
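
The fitted line comes from closed-form least-squares formulas. The sketch below reproduces them on invented data and checks the result against `lm()`. R-squared is the share of variation in the outcome that the line explains; the 0.02697 reported in the summary above means revenue growth alone explains under 3% of the variation in market cap.

```r
# Invented data standing in for Revenuegrowth (x) and Marketcap (y)
x <- c(0.02, 0.05, 0.10, 0.15, 0.30)
y <- c(40, 55, 60, 90, 120)

# Least-squares slope and intercept from the closed-form formulas
slope     <- cov(x, y) / var(x)
intercept <- mean(y) - slope * mean(x)

# R-squared: share of the variation in y explained by the line
fitted_y  <- intercept + slope * x
r_squared <- 1 - sum((y - fitted_y)^2) / sum((y - mean(y))^2)

fit <- lm(y ~ x)   # should agree with the hand calculation
print(all.equal(unname(coef(fit)), c(intercept, slope)))
```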

Today, data science helps investors, analysts, and companies understand, track, and predict the performance of the S&P 500. R is a powerful tool for that work: analyzing trends, forecasting prices, optimizing portfolios, and managing risk.

Thanks for listening! I hope this helps you understand more about the S&P 500 and stocks!

Made by Sid

---
title: "S&P 500 2025 Stock Update"
output: html_notebook
---
