sp500_data <- read.csv("https://raw.githubusercontent.com/Kingtilon1/MachineLearning-BigData/refs/heads/main/StockMarket/sp500_companies.csv")
head(sp500_data)
## Exchange Symbol Shortname Longname
## 1 NMS AAPL Apple Inc. Apple Inc.
## 2 NMS NVDA NVIDIA Corporation NVIDIA Corporation
## 3 NMS MSFT Microsoft Corporation Microsoft Corporation
## 4 NMS AMZN Amazon.com, Inc. Amazon.com, Inc.
## 5 NMS GOOGL Alphabet Inc. Alphabet Inc.
## 6 NMS GOOG Alphabet Inc. Alphabet Inc.
## Sector Industry Currentprice
## 1 Technology Consumer Electronics 247.77
## 2 Technology Semiconductors 135.07
## 3 Technology Software - Infrastructure 443.33
## 4 Consumer Cyclical Internet Retail 225.04
## 5 Communication Services Internet Content & Information 185.17
## 6 Communication Services Internet Content & Information 186.53
## Marketcap Ebitda Revenuegrowth City State Country
## 1 3.745242e+12 1.34661e+11 0.061 Cupertino CA United States
## 2 3.307865e+12 6.11840e+10 1.224 Santa Clara CA United States
## 3 3.296105e+12 1.36552e+11 0.160 Redmond WA United States
## 4 2.366296e+12 1.11583e+11 0.110 Seattle WA United States
## 5 2.276776e+12 1.23470e+11 0.151 Mountain View CA United States
## 6 2.271096e+12 1.23470e+11 0.151 Mountain View CA United States
## Fulltimeemployees
## 1 164000
## 2 29600
## 3 228000
## 4 1551000
## 5 181269
## 6 181269
## Longbusinesssummary
## 1 Apple Inc. designs, manufactures, and markets smartphones, personal computers, tablets, wearables, and accessories worldwide. The company offers iPhone, a line of smartphones; Mac, a line of personal computers; iPad, a line of multi-purpose tablets; and wearables, home, and accessories comprising AirPods, Apple TV, Apple Watch, Beats products, and HomePod. It also provides AppleCare support and cloud services; and operates various platforms, including the App Store that allow customers to discover and download applications and digital content, such as books, music, video, games, and podcasts, as well as advertising services include third-party licensing arrangements and its own advertising platforms. In addition, the company offers various subscription-based services, such as Apple Arcade, a game subscription service; Apple Fitness+, a personalized fitness service; Apple Music, which offers users a curated listening experience with on-demand radio stations; Apple News+, a subscription news and magazine service; Apple TV+, which offers exclusive original content; Apple Card, a co-branded credit card; and Apple Pay, a cashless payment service, as well as licenses its intellectual property. The company serves consumers, and small and mid-sized businesses; and the education, enterprise, and government markets. It distributes third-party applications for its products through the App Store. The company also sells its products through its retail and online stores, and direct sales force; and third-party cellular network carriers, wholesalers, retailers, and resellers. Apple Inc. was founded in 1976 and is headquartered in Cupertino, California.
## 2 NVIDIA Corporation provides graphics and compute and networking solutions in the United States, Taiwan, China, Hong Kong, and internationally. The Graphics segment offers GeForce GPUs for gaming and PCs, the GeForce NOW game streaming service and related infrastructure, and solutions for gaming platforms; Quadro/NVIDIA RTX GPUs for enterprise workstation graphics; virtual GPU or vGPU software for cloud-based visual and virtual computing; automotive platforms for infotainment systems; and Omniverse software for building and operating metaverse and 3D internet applications. The Compute & Networking segment comprises Data Center computing platforms and end-to-end networking platforms, including Quantum for InfiniBand and Spectrum for Ethernet; NVIDIA DRIVE automated-driving platform and automotive development agreements; Jetson robotics and other embedded platforms; NVIDIA AI Enterprise and other software; and DGX Cloud software and services. The company's products are used in gaming, professional visualization, data center, and automotive markets. It sells its products to original equipment manufacturers, original device manufacturers, system integrators and distributors, independent software vendors, cloud service providers, consumer internet companies, add-in board manufacturers, distributors, automotive manufacturers and tier-1 automotive suppliers, and other ecosystem participants. NVIDIA Corporation was incorporated in 1993 and is headquartered in Santa Clara, California.
## 3 Microsoft Corporation develops and supports software, services, devices and solutions worldwide. The Productivity and Business Processes segment offers office, exchange, SharePoint, Microsoft Teams, office 365 Security and Compliance, Microsoft viva, and Microsoft 365 copilot; and office consumer services, such as Microsoft 365 consumer subscriptions, Office licensed on-premises, and other office services. This segment also provides LinkedIn; and dynamics business solutions, including Dynamics 365, a set of intelligent, cloud-based applications across ERP, CRM, power apps, and power automate; and on-premises ERP and CRM applications. The Intelligent Cloud segment offers server products and cloud services, such as azure and other cloud services; SQL and windows server, visual studio, system center, and related client access licenses, as well as nuance and GitHub; and enterprise services including enterprise support services, industry solutions, and nuance professional services. The More Personal Computing segment offers Windows, including windows OEM licensing and other non-volume licensing of the Windows operating system; Windows commercial comprising volume licensing of the Windows operating system, windows cloud services, and other Windows commercial offerings; patent licensing; and windows Internet of Things; and devices, such as surface, HoloLens, and PC accessories. Additionally, this segment provides gaming, which includes Xbox hardware and content, and first- and third-party content; Xbox game pass and other subscriptions, cloud gaming, advertising, third-party disc royalties, and other cloud services; and search and news advertising, which includes Bing, Microsoft News and Edge, and third-party affiliates. The company sells its products through OEMs, distributors, and resellers; and directly through digital marketplaces, online, and retail stores. The company was founded in 1975 and is headquartered in Redmond, Washington.
## 4 Amazon.com, Inc. engages in the retail sale of consumer products, advertising, and subscriptions service through online and physical stores in North America and internationally. The company operates through three segments: North America, International, and Amazon Web Services (AWS). It also manufactures and sells electronic devices, including Kindle, Fire tablets, Fire TVs, Echo, Ring, Blink, and eero; and develops and produces media content. In addition, the company offers programs that enable sellers to sell their products in its stores; and programs that allow authors, independent publishers, musicians, filmmakers, Twitch streamers, skill and app developers, and others to publish and sell content. Further, it provides compute, storage, database, analytics, machine learning, and other services, as well as advertising services through programs, such as sponsored ads, display, and video advertising. Additionally, the company offers Amazon Prime, a membership program. The company's products offered through its stores include merchandise and content purchased for resale and products offered by third-party sellers. It serves consumers, sellers, developers, enterprises, content creators, advertisers, and employees. Amazon.com, Inc. was incorporated in 1994 and is headquartered in Seattle, Washington.
## 5 Alphabet Inc. offers various products and platforms in the United States, Europe, the Middle East, Africa, the Asia-Pacific, Canada, and Latin America. It operates through Google Services, Google Cloud, and Other Bets segments. The Google Services segment provides products and services, including ads, Android, Chrome, devices, Gmail, Google Drive, Google Maps, Google Photos, Google Play, Search, and YouTube. It is also involved in the sale of apps and in-app purchases and digital content in the Google Play and YouTube; and devices, as well as in the provision of YouTube consumer subscription services. The Google Cloud segment offers infrastructure, cybersecurity, databases, analytics, AI, and other services; Google Workspace that include cloud-based communication and collaboration tools for enterprises, such as Gmail, Docs, Drive, Calendar, and Meet; and other services for enterprise customers. The Other Bets segment sells healthcare-related and internet services. The company was incorporated in 1998 and is headquartered in Mountain View, California.
## 6 Alphabet Inc. offers various products and platforms in the United States, Europe, the Middle East, Africa, the Asia-Pacific, Canada, and Latin America. It operates through Google Services, Google Cloud, and Other Bets segments. The Google Services segment provides products and services, including ads, Android, Chrome, devices, Gmail, Google Drive, Google Maps, Google Photos, Google Play, Search, and YouTube. It is also involved in the sale of apps and in-app purchases and digital content in the Google Play and YouTube; and devices, as well as in the provision of YouTube consumer subscription services. The Google Cloud segment offers infrastructure, cybersecurity, databases, analytics, AI, and other services; Google Workspace that include cloud-based communication and collaboration tools for enterprises, such as Gmail, Docs, Drive, Calendar, and Meet; and other services for enterprise customers. The Other Bets segment sells healthcare-related and internet services. The company was incorporated in 1998 and is headquartered in Mountain View, California.
## Weight
## 1 0.06634304
## 2 0.05859536
## 3 0.05838706
## 4 0.04191645
## 5 0.04033071
## 6 0.04023009
The first six rows, which appear in descending order of index weight, show the largest companies in the S&P 500. Technology companies dominate the top positions, with Apple, NVIDIA, and Microsoft leading. Alphabet appears twice because both of its listed share classes (GOOGL and GOOG) are index constituents. NVIDIA's revenue growth of 122.4% is far higher than that of the other five rows.
str(sp500_data)
## 'data.frame': 503 obs. of 16 variables:
## $ Exchange : chr "NMS" "NMS" "NMS" "NMS" ...
## $ Symbol : chr "AAPL" "NVDA" "MSFT" "AMZN" ...
## $ Shortname : chr "Apple Inc." "NVIDIA Corporation" "Microsoft Corporation" "Amazon.com, Inc." ...
## $ Longname : chr "Apple Inc." "NVIDIA Corporation" "Microsoft Corporation" "Amazon.com, Inc." ...
## $ Sector : chr "Technology" "Technology" "Technology" "Consumer Cyclical" ...
## $ Industry : chr "Consumer Electronics" "Semiconductors" "Software - Infrastructure" "Internet Retail" ...
## $ Currentprice : num 248 135 443 225 185 ...
## $ Marketcap : num 3.75e+12 3.31e+12 3.30e+12 2.37e+12 2.28e+12 ...
## $ Ebitda : num 1.35e+11 6.12e+10 1.37e+11 1.12e+11 1.23e+11 ...
## $ Revenuegrowth : num 0.061 1.224 0.16 0.11 0.151 ...
## $ City : chr "Cupertino" "Santa Clara" "Redmond" "Seattle" ...
## $ State : chr "CA" "CA" "WA" "WA" ...
## $ Country : chr "United States" "United States" "United States" "United States" ...
## $ Fulltimeemployees : int 164000 29600 228000 1551000 181269 181269 72404 140473 396500 20000 ...
## $ Longbusinesssummary: chr "Apple Inc. designs, manufactures, and markets smartphones, personal computers, tablets, wearables, and accessor"| __truncated__ "NVIDIA Corporation provides graphics and compute and networking solutions in the United States, Taiwan, China, "| __truncated__ "Microsoft Corporation develops and supports software, services, devices and solutions worldwide. The Productivi"| __truncated__ "Amazon.com, Inc. engages in the retail sale of consumer products, advertising, and subscriptions service throug"| __truncated__ ...
## $ Weight : num 0.0663 0.0586 0.0584 0.0419 0.0403 ...
The dataset contains 503 observations with 16 variables. We have a mix of character and numeric data types:
- Character variables include Exchange, Symbol, Sector, and Industry
- Numeric variables include Currentprice (ranging widely), Marketcap (in high denominations), and Revenuegrowth
- The dataset shows clean formatting with appropriate data types for each variable
summary(sp500_data)
## Exchange Symbol Shortname Longname
## Length:503 Length:503 Length:503 Length:503
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Sector Industry Currentprice Marketcap
## Length:503 Length:503 Min. : 10.13 Min. :5.844e+09
## Class :character Class :character 1st Qu.: 71.47 1st Qu.:2.014e+10
## Mode :character Mode :character Median : 126.61 Median :3.820e+10
## Mean : 227.40 Mean :1.122e+11
## 3rd Qu.: 237.05 3rd Qu.:8.220e+10
## Max. :8857.62 Max. :3.745e+12
##
## Ebitda Revenuegrowth City State
## Min. :-3.991e+09 Min. :-0.60200 Length:503 Length:503
## 1st Qu.: 1.623e+09 1st Qu.: 0.00200 Class :character Class :character
## Median : 2.942e+09 Median : 0.05000 Mode :character Mode :character
## Mean : 7.031e+09 Mean : 0.07048
## 3rd Qu.: 6.017e+09 3rd Qu.: 0.10900
## Max. : 1.495e+11 Max. : 1.63200
## NA's :29 NA's :3
## Country Fulltimeemployees Longbusinesssummary Weight
## Length:503 Min. : 28 Length:503 Min. :0.0001035
## Class :character 1st Qu.: 10200 Class :character 1st Qu.:0.0003567
## Mode :character Median : 21595 Mode :character Median :0.0006766
## Mean : 57745 Mean :0.0019881
## 3rd Qu.: 54762 3rd Qu.:0.0014561
## Max. :2100000 Max. :0.0663430
## NA's :9
The summary statistics reveal interesting insights:
- Current stock prices range from $10.13 to $8,857.62, showing extreme variation
- Market capitalization ranges from $5.84B to $3.75T, indicating the diverse size of companies
- Revenue growth averages 7.05%, but ranges from -60.2% to 163.2%
- Missing values exist in EBITDA (29), Revenue growth (3), and Full-time employees (9)
- The mean number of full-time employees is 57,745, but ranges from 28 to 2,100,000
missing_values <- colSums(is.na(sp500_data))
missing_values[missing_values > 0]
## Ebitda Revenuegrowth Fulltimeemployees
## 29 3 9
The missing value analysis shows three variables requiring attention:
- EBITDA: 29 missing values (5.8% of rows)
- Revenue growth: 3 missing values (0.6% of rows)
- Full-time employees: 9 missing values (1.8% of rows)
This represents a relatively small proportion of our 503 observations but will need to be addressed in our modeling approach.
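To put those counts in proportion, and as one possible way to handle them, the sketch below computes the share of missing values per column and then applies median imputation. The `toy` data frame and the helper `impute_median()` are hypothetical, and median imputation is an illustrative assumption rather than the approach used in this analysis, which drops incomplete rows at the modeling stage.

```r
# Toy data standing in for sp500_data (values are made up for illustration)
toy <- data.frame(
  Ebitda            = c(1.2e9, NA, 3.4e9, 5.0e9, NA),
  Revenuegrowth     = c(0.05, 0.10, NA, -0.02, 0.07),
  Fulltimeemployees = c(1000L, 2500L, 400L, NA, 800L)
)

# Share of missing values per column
missing_share <- colMeans(is.na(toy))
print(round(missing_share, 2))

# Hypothetical helper: replace each NA with the column median
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}
toy_imputed <- as.data.frame(lapply(toy, impute_median))

# No missing values should remain after imputation
print(colSums(is.na(toy_imputed)))
```

Imputation keeps all rows available for modeling, at the cost of understating the variability of the imputed columns.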
library(tidyverse)  # %>%, select(), pivot_longer(), ggplot()
# Names of all numeric columns
numerical_cols <- sp500_data %>%
  select(where(is.numeric)) %>%
  names()
# Reshape to long format and draw one boxplot per variable on a shared y-axis
sp500_data %>%
  pivot_longer(cols = all_of(numerical_cols),
               names_to = "Variable",
               values_to = "Value") %>%
  ggplot(aes(x = Variable, y = Value)) +
  geom_boxplot() +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Distribution of Numerical Variables")
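The boxplot visualization reveals:
- Extreme right-skewed distributions in Marketcap and EBITDA
- Numerous outliers in most variables, particularly in Currentprice
- Weight shows a more compressed distribution
- Revenue growth has several extreme outliers in both directions
Because all variables share a single y-axis, Marketcap (up to 3.75e12) sets the scale, so variables measured in much smaller units, such as Weight and Revenuegrowth, are compressed toward zero and are hard to compare in this plot.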
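When variables differ by many orders of magnitude, giving each one its own panel and axis makes the individual distributions readable. The following is a minimal sketch on made-up values, not the plot used in this report. It assumes strictly positive variables, since a log axis cannot show the negative values that Ebitda and Revenuegrowth contain; those would need a linear axis.

```r
library(ggplot2)
library(tidyr)

# Toy data with variables on very different scales (made-up values)
toy <- data.frame(
  Currentprice = c(12, 85, 130, 240, 8800),
  Marketcap    = c(6e9, 2e10, 4e10, 8e10, 3.7e12),
  Weight       = c(0.0001, 0.0004, 0.0007, 0.0015, 0.066)
)

# Long format: one row per (variable, value) pair
toy_long <- pivot_longer(toy, cols = everything(),
                         names_to = "Variable", values_to = "Value")

# One panel per variable, each with its own log-scaled y-axis
p <- ggplot(toy_long, aes(x = Variable, y = Value)) +
  geom_boxplot() +
  scale_y_log10() +
  facet_wrap(~ Variable, scales = "free") +
  theme_minimal() +
  labs(title = "Distribution of Numerical Variables (log scale, free axes)")
print(p)
```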
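library(corrplot)  # corrplot()
# Pairwise correlations, using only rows with no missing numeric values
cor_matrix <- sp500_data %>%
  select(where(is.numeric)) %>%
  cor(use = "complete.obs")
corrplot(cor_matrix, method = "color", type = "upper",
         tl.col = "black", tl.srt = 45)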
The correlation matrix, computed only on rows with no missing numeric values, reveals several relationships:
- Market cap and EBITDA show strong positive correlation
- Weight appears highly correlated with market cap, as expected if the index is weighted by market capitalization
- Current price shows moderate correlation with market cap
- Revenue growth shows weak linear correlation with the other variables, so it is not redundant with them as a predictor
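The summary figures are consistent with Weight being each company's share of total market cap: the mean Marketcap of 1.122e11 times 503 companies gives a total of about 5.64e13, and Apple's 3.745e12 divided by that total is about 0.066, in line with its reported Weight of 0.0663. A small sketch of that relationship, on made-up market caps and under the assumption that weights are pure market-cap shares:

```r
# Toy market caps (made-up values); in the real data this would be sp500_data$Marketcap
marketcap <- c(3.7e12, 3.3e12, 2.4e12, 5.0e11, 1.0e11)

# Assumption: Weight is each company's share of the total market cap
weight <- marketcap / sum(marketcap)

# A variable that is a positive rescaling of another has correlation 1 with it
print(cor(marketcap, weight))

# Against the real data one could compare reconstructed and reported weights:
# all.equal(sp500_data$Marketcap / sum(sp500_data$Marketcap), sp500_data$Weight)
```

If the assumption holds, Weight and Marketcap carry the same information, so using both as predictors in one model would be redundant.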
categorical_cols <- sp500_data %>%
  select(where(is.character)) %>%
  names()
categorical_summary <- sp500_data %>%
  select(all_of(categorical_cols)) %>%
  summarize(across(everything(), ~n_distinct(.)))
print(categorical_summary)
## Exchange Symbol Shortname Longname Sector Industry City State Country
## 1 4 503 500 500 11 114 236 42 8
## Longbusinesssummary
## 1 500
The categorical variable analysis shows:
- 4 different exchanges represented
- 503 unique symbols (as expected)
- 11 distinct sectors
- 114 different industries
- 236 cities, 42 states, and 8 countries represented
- 500 distinct company names and business summaries across 503 symbols, consistent with a few companies, such as Alphabet, being listed under more than one symbol
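The gap between 503 symbols and 500 distinct names can be inspected by listing every row whose name occurs more than once. A minimal base-R sketch on a made-up subset (the `toy` data frame is hypothetical; its column names mirror `sp500_data`):

```r
# Toy data (made-up subset); column names mirror sp500_data
toy <- data.frame(
  Symbol    = c("AAPL", "GOOGL", "GOOG", "MSFT"),
  Shortname = c("Apple Inc.", "Alphabet Inc.", "Alphabet Inc.",
                "Microsoft Corporation")
)

# Flag every row whose Shortname occurs more than once
# (fromLast = TRUE also catches the first occurrence of each duplicate)
is_dup <- duplicated(toy$Shortname) | duplicated(toy$Shortname, fromLast = TRUE)
dupes  <- toy[is_dup, ]
print(dupes)
```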
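sp500_data <- sp500_data %>%
  mutate(
    # Market-cap buckets; thresholds are in US dollars
    MarketCapCategory = case_when(
      Marketcap < 2e9 ~ "Small",
      Marketcap >= 2e9 & Marketcap < 10e9 ~ "Medium",
      Marketcap >= 10e9 ~ "Large"
    ),
    # Revenuegrowth is stored as a fraction (0.061 = 6.1%),
    # so the 10% and 30% cut points are 0.10 and 0.30
    RevenueGrowthCategory = case_when(
      Revenuegrowth < 0 ~ "Negative",
      Revenuegrowth >= 0 & Revenuegrowth < 0.10 ~ "Low",
      Revenuegrowth >= 0.10 & Revenuegrowth < 0.30 ~ "Moderate",
      Revenuegrowth >= 0.30 ~ "High"
    )
  )
Created two new categorical features:
- MarketCapCategory: classifies companies into Small (below $2B), Medium ($2B to $10B), and Large ($10B and above) by market capitalization. Since the smallest market cap in the data is $5.84B, no company falls into Small.
- RevenueGrowthCategory: categorizes growth into Negative (below 0%), Low (0% to 10%), Moderate (10% to 30%), and High (30% and above). The three companies with missing Revenuegrowth receive NA.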
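Revenuegrowth is stored as a fraction (0.061 means 6.1%), so percentage cut points of 10% and 30% correspond to 0.10 and 0.30 on that scale. As an alternative to `case_when()`, base R's `cut()` states the interval boundaries in one place. This is a minimal sketch on made-up growth values, not the code used in this analysis:

```r
# Made-up growth rates stored as fractions, as in the Revenuegrowth column
growth <- c(-0.60, 0.00, 0.061, 0.10, 0.151, 0.30, 1.224, NA)

# right = FALSE makes intervals closed on the left:
# [-Inf, 0), [0, 0.10), [0.10, 0.30), [0.30, Inf)
growth_cat <- cut(
  growth,
  breaks = c(-Inf, 0, 0.10, 0.30, Inf),
  labels = c("Negative", "Low", "Moderate", "High"),
  right  = FALSE
)

# useNA = "ifany" keeps missing values visible in the tally
print(table(growth_cat, useNA = "ifany"))
```

Tabulating the result right after creating it is a cheap check that every category is actually populated.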
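library(caret)  # createDataPartition(), train(), confusionMatrix()
# Binary target: 1 if the price is above the median across all stocks, else 0
sp500_data$HighPriceStock <- ifelse(sp500_data$Currentprice > median(sp500_data$Currentprice), 1, 0)
set.seed(123)  # arbitrary seed so the random split is reproducible
train_index <- createDataPartition(sp500_data$Currentprice, p = 0.8, list = FALSE)
train_data <- sp500_data[train_index, ]
test_data <- sp500_data[-train_index, ]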
Data preparation steps completed:
- Created binary target variable (HighPriceStock) based on median price
- Split data into 80% training and 20% testing sets
- The split is stratified on Currentprice quantiles, which keeps the share of high- and low-price stocks roughly equal across the two sets
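To illustrate what a stratified split does, the sketch below samples 80% of the rows within each class of a made-up binary target using base R, then checks the class proportions on both sides. It stands in for, and is not identical to, what `createDataPartition()` does: caret samples within classes when given a factor and within percentile groups when given a numeric vector.

```r
set.seed(42)  # arbitrary seed for reproducibility

# Made-up binary target with a 50/50 class balance
y <- rep(c(0, 1), each = 50)

# Sample 80% of the row indices within each class (base-R stratified split)
train_index <- unlist(lapply(split(seq_along(y), y), function(idx) {
  sample(idx, size = round(0.8 * length(idx)))
}))

# Class proportions in the training and test sets
print(prop.table(table(y[train_index])))
print(prop.table(table(y[-train_index])))
```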
logistic_model <- glm(HighPriceStock ~ Marketcap + Revenuegrowth + Ebitda,
                      data = train_data,
                      family = binomial())
summary(logistic_model)
##
## Call:
## glm(formula = HighPriceStock ~ Marketcap + Revenuegrowth + Ebitda,
## family = binomial(), data = train_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.009e-01 1.443e-01 -3.472 0.000516 ***
## Marketcap 1.636e-11 3.338e-12 4.901 9.55e-07 ***
## Revenuegrowth -3.494e-02 6.370e-01 -0.055 0.956258
## Ebitda -1.050e-10 2.520e-11 -4.168 3.07e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 525.4 on 378 degrees of freedom
## Residual deviance: 472.2 on 375 degrees of freedom
## (24 observations deleted due to missingness)
## AIC: 480.2
##
## Number of Fisher Scoring iterations: 7
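The logistic regression results show:
- Market cap is highly significant (p < 0.001) with a positive coefficient
- EBITDA is significant (p < 0.001) but with a negative coefficient; because EBITDA and market cap are strongly correlated, this sign describes the effect of EBITDA at a fixed market cap and should be interpreted with caution
- Revenue growth is not significant (p = 0.956)
- Residual deviance falls from 525.4 (null model) to 472.2 with three predictors; the AIC of 480.2 is useful for comparing this model against alternatives, not as an absolute measure of fit
- 24 observations were excluded due to missing values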
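Two quantities help when reading the summary above: an overall likelihood-ratio test built from the two deviances, and an odds ratio on a more readable scale than "per dollar". The sketch below works only from the numbers printed in the model summary; the choice of a $100 billion step for the odds ratio is an illustrative assumption.

```r
# Values copied from the summary(logistic_model) output above
null_deviance     <- 525.4
residual_deviance <- 472.2
df_null           <- 378
df_residual       <- 375

# Likelihood-ratio statistic: drop in deviance from adding the three predictors
lr_stat <- null_deviance - residual_deviance
lr_df   <- df_null - df_residual
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
print(c(statistic = lr_stat, df = lr_df, p_value = p_value))

# Odds ratio for a $100 billion increase in market cap,
# holding the other predictors fixed (Marketcap coefficient = 1.636e-11)
or_per_100bn <- exp(1.636e-11 * 1e11)
print(or_per_100bn)
```

A deviance drop of 53.2 on 3 degrees of freedom shows that the predictors jointly improve on the intercept-only model, which is a separate question from how well the model classifies.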
train_data_clean <- train_data %>%
  drop_na(HighPriceStock, Marketcap, Revenuegrowth, Ebitda)
sapply(train_data_clean[c("HighPriceStock", "Marketcap", "Revenuegrowth", "Ebitda")],
       function(x) sum(is.na(x)))
## HighPriceStock Marketcap Revenuegrowth Ebitda
## 0 0 0 0
control <- trainControl(method = "boot", number = 100)
bootstrap_model <- train(HighPriceStock ~ Marketcap + Revenuegrowth + Ebitda,
                         data = train_data_clean,
                         method = "glm",
                         family = "binomial",
                         trControl = control)
print(bootstrap_model)
## Generalized Linear Model
##
## 379 samples
## 3 predictor
##
## No pre-processing
## Resampling: Bootstrapped (100 reps)
## Summary of sample sizes: 379, 379, 379, 379, 379, 379, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.4701112 0.1306548 0.4369013
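Bootstrap analysis results:
- Successfully removed all NA values from training data, leaving 379 observations
- 100 bootstrap replicates performed
- Because HighPriceStock is numeric (0/1) rather than a factor, caret treats the problem as regression and reports RMSE, R-squared, and MAE instead of classification metrics such as accuracy
- RMSE of 0.47 is only slightly below the 0.50 that a constant prediction of 0.5 would produce
- R-squared of 0.13 suggests limited explanatory power
- MAE of 0.44 tells the same story: predictions sit, on average, 0.44 away from the observed 0/1 outcome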
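The RMSE, R-squared, and MAE above are regression metrics, which caret reports when the outcome is numeric. Converting the outcome to a factor switches `train()` to classification mode, so the resampled metrics become Accuracy and Kappa. The sketch below shows this on simulated data; the data frame `toy`, its columns, and the level labels "Low" and "High" are all hypothetical, and the numbers it produces say nothing about the S&P 500 model.

```r
library(caret)

set.seed(123)  # arbitrary seed so the resamples are reproducible

# Simulated data: a binary outcome loosely related to one predictor
n   <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- factor(ifelse(toy$x1 + rnorm(n) > 0, "High", "Low"),
                levels = c("Low", "High"))

# With a factor outcome, train() runs in classification mode
control <- trainControl(method = "boot", number = 25)
fit <- train(y ~ x1 + x2,
             data = toy,
             method = "glm",
             family = "binomial",
             trControl = control)

# Resampled Accuracy and Kappa replace RMSE / Rsquared / MAE
print(fit$results)
```

For the model in this report, the equivalent change would be to convert HighPriceStock with `factor()` before calling `train()`.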
predictions <- predict(logistic_model, newdata = test_data, type = "response")
predicted_class <- ifelse(predictions > 0.5, 1, 0)
conf_matrix <- confusionMatrix(factor(predicted_class),
                               factor(test_data$HighPriceStock))
print(conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 40 23
## 1 4 25
##
## Accuracy : 0.7065
## 95% CI : (0.6024, 0.7969)
## No Information Rate : 0.5217
## P-Value [Acc > NIR] : 0.0002351
##
## Kappa : 0.4223
##
## Mcnemar's Test P-Value : 0.0005320
##
## Sensitivity : 0.9091
## Specificity : 0.5208
## Pos Pred Value : 0.6349
## Neg Pred Value : 0.8621
## Prevalence : 0.4783
## Detection Rate : 0.4348
## Detection Prevalence : 0.6848
## Balanced Accuracy : 0.7150
##
## 'Positive' Class : 0
##
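Model evaluation metrics, based on the 92 test observations that received a prediction. Note that caret treats class 0 (the low-price group) as the positive class here:
- Overall accuracy: 70.65% (95% CI: 60.24% - 79.69%), compared with a no-information rate of 52.17%
- High sensitivity (0.9091) but lower specificity (0.5208): the model correctly identifies 91% of low-price stocks but only 52% of high-price stocks
- Kappa of 0.4223 indicates moderate agreement beyond chance
- Balanced accuracy of 0.715 suggests reasonable overall performance
- McNemar's test p-value of 0.00053 indicates the two error types are unbalanced: 23 high-price stocks were predicted as low-price, against 4 errors in the other direction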
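Since the report treats class 0 as the positive class, it can help to recompute the headline metrics directly from the four cell counts and name them by the group they refer to. A minimal base-R sketch using the counts printed above (the variable names are illustrative):

```r
# Confusion matrix copied from the output above:
# rows = prediction, columns = reference (matrix() fills column by column)
cm <- matrix(c(40, 4, 23, 25), nrow = 2,
             dimnames = list(Prediction = c("0", "1"),
                             Reference  = c("0", "1")))

accuracy <- sum(diag(cm)) / sum(cm)

# Recall for class 0: the "Sensitivity" in the report (class 0 is "positive")
recall_low_price <- cm["0", "0"] / sum(cm[, "0"])

# Recall for class 1: the "Specificity" in the report
recall_high_price <- cm["1", "1"] / sum(cm[, "1"])

print(round(c(accuracy          = accuracy,
              recall_low_price  = recall_low_price,
              recall_high_price = recall_high_price), 4))
```

To have caret report sensitivity for the high-price group instead, `confusionMatrix()` accepts a `positive` argument, for example `positive = "1"`.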