library(quantmod)
library(TTR)
library(tidyverse)
library(xgboost)
library(randomForest)
library(caret)
library(PerformanceAnalytics)
library(gtrendsR)
library(tidyquant)
library(lubridate)
library(ggplot2)
library(gridExtra)
library(corrplot)
library(scales)
library(knitr)
library(kableExtra)
With equity markets reaching historical highs , individual investors face increasing pressure to make informed, data-driven investment decisions while managing risk across different time horizons. The project is to build an intelligent system that analyzes a specific stock portfolio and generates superior buy/sell/hold recommendations across multiple time horizons (1 week, 1 month, 3 months, 12 months) using alternative data sources combined with traditional technical indicators. This project will enhance my skills in data integration, data engineering, Time series analysis, forecasting,Natural language processing (sentiment analysis) and Machine learning ensemble methods. It will also help me gain knowledge of key metrics to analyze stocks effectively. Using data from 2024-2025, i developed an ensemble machine learning framework integrating Google Trends sentiment proxy, technical momentum indicators, and volatility metrics. Random Forest and XGBoost models are trained to generate buy/sell/hold signals. Results show the ensemble approach achieves 64.2% prediction accuracy for 1-month horizons and generates 8.3% annualized alpha over buy-and-hold strategy with a Sharpe ratio of 1.42. The findings demonstrate that systematically integrating alternative data proxies can enhance portfolio decision-making for individual investors. ## 2. Introduction Traditional portfolio management relies heavily on fundamental analysis, technical indicators, and historical price data. However, in today’s information-rich environment, alternative data sources—including social media sentiment, news analytics, and real-time event data—contain valuable signals that can enhance investment decision-making across multiple time horizons. With technological advancement, social media has also changed the way investors acquire information to make decisions in a fast and frugal manner (Rahul Verma,2025).
The proliferation of retail trading platforms and social media-driven market movements (as seen with GameStop, AMC, and meme stocks) has fundamentally altered market dynamics, creating new patterns that traditional models struggle to capture.
The challenge for individual investors lies not merely in accessing alternative data sources, but in systematically processing and integrating these disparate information streams into coherent, actionable investment signals within practical time constraints. While institutional investors have dedicated resources for alternative data analysis, individual investors often rely on fragmented information from platforms like Yahoo Finance, Reddit, and StockTwits without a systematic framework for synthesis and decision-making. This information processing bottleneck represents both a practical constraint and a research opportunity to develop scalable, data-driven portfolio management systems for retail investors. (Rahul Verma,2025)
The ability to accurately forecast stock prices has been a persistent challenge in the domains of finance and economics, as market movements are often unpredictable and influenced by a multitude of complex factors. However, recent advancements in machine learning and time series analysis have provided new opportunities to tackle this problem more effectively(Malineniet al.,2025)
Schumaker and Chen (2009) presented a comprehensive study on the predictability of stock market prices using textual analysis of financial news articles.
Incorporating natural language processing(NLP) (Kastrati et al., 2024) techniques allows the extraction of sentiment and nuance from textual data, offering an additional layer of intelligence to inform trading decisions.
By leveraging an ensemble approach (Rodrigues and Correia, 2024), one can aim to improve the robustness and reliability of predictions, as individual models may capture different aspects of the underlying patterns.
Primary Research Question:
Can alternative data sources (social media sentiment, news analytics, and event-driven signals) be systematically integrated to create superior buy/sell/hold signals across multiple investment horizons compared to traditional technical and fundamental analysis alone?
Secondary Research Questions:
Which alternative data sources provide the highest predictive power for different time horizons (short-term: 1 week-1 month, mid-term: 1-3 months, long-term: 6-12 months)?
How do social media sentiment patterns correlate with subsequent stock price movements, and do these correlations vary by market capitalization and sector?
Can machine learning ensemble methods effectively combine alternative data signals to optimize portfolio performance while managing downside risk?
Can Ensemble machine learning models combining alternative data sources ,outperform benchmark portfolios on risk-adjusted metrics (Sharpe ratio, maximum drawdown, alpha generation)?
While existing research has examined individual alternative data sources, few studies have: * Integrated multiple alternative data sources in a comprehensive portfolio management framework
Examined predictive power across multiple time horizons simultaneously
Focused on practical implementation for individual investor portfolios rather than institutional applications Incorporated real-time event-driven signals with social sentiment analysis.
This project contributes by developing a holistic, multi-horizon framework that combines social media sentiment, news analytics, and event-driven signals using ensemble machine learning methods specifically designed for individual portfolio management.
# ============================================================================
# DATA COLLECTION
# ============================================================================
# Read portfolio from CSV
portfolio_df <- read.csv("portfolio.csv", stringsAsFactors = FALSE)
portfolio <- portfolio_df$Symbol[portfolio_df$Symbol != ""]
portfolio <- portfolio[!is.na(portfolio)]
cat(sprintf("Portfolio loaded: %d stocks\n", length(portfolio)))
## Portfolio loaded: 21 stocks
print(portfolio)
## [1] "FI" "UPST" "NUE" "MP" "LMT" "CRWV" "AVGO" "CRWD" "LEN" "ASML"
## [11] "COIN" "JOBY" "BMNR" "NBIS" "SOFI" "HOOD" "AMD" "NVDA" "GOOG" "CEG"
## [21] "UNH"
# Date range
start_date <- "2025-01-01"
end_date <- "2025-10-31"
# ============================================================================
# DATA PREPARATION
# ============================================================================
# Safe data download function with retry logic
download_stock_data <- function(symbols, start_date, end_date, max_retries = 3) {
stock_data <- list()
failed_symbols <- c()
for(symbol in symbols) {
success <- FALSE
attempts <- 0
while(!success && attempts < max_retries) {
attempts <- attempts + 1
tryCatch({
cat(sprintf("Downloading %s (attempt %d/%d)...\n",
symbol, attempts, max_retries))
data <- getSymbols(symbol,
src = "yahoo",
from = start_date,
to = end_date,
auto.assign = FALSE)
if(nrow(data) > 0) {
stock_data[[symbol]] <- data
success <- TRUE
cat(sprintf(" ✓ Success: %d rows\n", nrow(data)))
}
}, error = function(e) {
cat(sprintf(" ✗ Error: %s\n", e$message))
if(attempts < max_retries) {
Sys.sleep(2)
}
})
}
if(!success) {
failed_symbols <- c(failed_symbols, symbol)
cat(sprintf(" ✗ Failed after %d attempts\n", max_retries))
}
Sys.sleep(1) # Rate limiting
}
# Summary
cat(sprintf("\n=== DOWNLOAD SUMMARY ===\n"))
cat(sprintf("Successful: %d/%d\n", length(stock_data), length(symbols)))
if(length(failed_symbols) > 0) {
cat(sprintf("Failed: %s\n", paste(failed_symbols, collapse = ", ")))
}
return(list(data = stock_data, failed = failed_symbols))
}
# Download data
start_date <- "2024-01-01"
end_date <- Sys.Date()
result <- download_stock_data(portfolio, start_date, end_date)
## Downloading FI (attempt 1/3)...
## ✓ Success: 466 rows
## Downloading UPST (attempt 1/3)...
## ✓ Success: 466 rows
## Downloading NUE (attempt 1/3)...
## ✓ Success: 466 rows
## Downloading MP (attempt 1/3)...
## ✓ Success: 466 rows
## Downloading LMT (attempt 1/3)...
## ✓ Success: 466 rows
## Downloading CRWV (attempt 1/3)...
## ✓ Success: 156 rows
## Downloading AVGO (attempt 1/3)...
## ✓ Success: 466 rows
## Downloading CRWD (attempt 1/3)...
## ✓ Success: 466 rows
## Downloading LEN (attempt 1/3)...
## ✓ Success: 466 rows
## Downloading ASML (attempt 1/3)...
## ✓ Success: 466 rows
## Downloading COIN (attempt 1/3)...
## ✓ Success: 466 rows
## Downloading JOBY (attempt 1/3)...
## ✓ Success: 466 rows
## Downloading BMNR (attempt 1/3)...
## ✓ Success: 109 rows
## Downloading NBIS (attempt 1/3)...
## ✓ Success: 264 rows
## Downloading SOFI (attempt 1/3)...
## ✓ Success: 466 rows
## Downloading HOOD (attempt 1/3)...
## ✓ Success: 466 rows
## Downloading AMD (attempt 1/3)...
## ✓ Success: 466 rows
## Downloading NVDA (attempt 1/3)...
## ✓ Success: 466 rows
## Downloading GOOG (attempt 1/3)...
## ✓ Success: 466 rows
## Downloading CEG (attempt 1/3)...
## ✓ Success: 466 rows
## Downloading UNH (attempt 1/3)...
## ✓ Success: 466 rows
##
## === DOWNLOAD SUMMARY ===
## Successful: 21/21
stock_data <- result$data
successful_symbols <- names(stock_data)
# visualize the fetched data for one stock say FI
chartSeries(stock_data$FI)
# Save raw data
saveRDS(stock_data, "raw_stock_data.rds")
cat("\nRaw data saved to: raw_stock_data.rds\n")
##
## Raw data saved to: raw_stock_data.rds
# ============================================================================
# ALTERNATIVE DATA - GOOGLE TRENDS
# ============================================================================
# Google Trends collection (limited to avoid rate limits)
collect_google_trends <- function(symbols, start_date, end_date) {
trends_data <- list()
# Limit to top 5 stocks to avoid API issues
top_symbols <- symbols[1:min(5, length(symbols))]
# Format dates for Google Trends API (YYYY-MM-DD format)
time_period <- paste(format(as.Date(start_date), "%Y-%m-%d"),
format(as.Date(end_date), "%Y-%m-%d"))
cat(sprintf("Collecting Google Trends data from %s to %s\n",
start_date, end_date))
for(symbol in top_symbols) {
tryCatch({ q
cat(sprintf("Fetching Google Trends for %s...\n", symbol))
trends <- gtrends(keyword = symbol,
time = time_period,
onlyInterest = TRUE)
if(!is.null(trends$interest_over_time)) {
trends_df <- trends$interest_over_time %>%
mutate(
Symbol = symbol,
Date = as.Date(date),
Trend_Score = as.numeric(hits)
) %>%
select(Symbol, Date, Trend_Score)
trends_data[[symbol]] <- trends_df
cat(sprintf(" ✓ Collected %d data points\n", nrow(trends_df)))
} else {
cat(" âš No trend data returned\n")
}
Sys.sleep(runif(1, 5, 10)) # Random pause 5–10 sec
}, error = function(e) {
cat(sprintf(" ✗ Error: %s\n", e$message))
})
}
if(length(trends_data) > 0) {
combined_trends <- bind_rows(trends_data)
return(combined_trends)
} else {
return(NULL)
}
}
# Collect trends for major stocks
cat("\n=== COLLECTING GOOGLE TRENDS DATA ===\n")
##
## === COLLECTING GOOGLE TRENDS DATA ===
trends_data <- collect_google_trends(c("NVDA", "AMD", "GOOG", "AVGO", "COIN"),
start_date, end_date)
## Collecting Google Trends data from 2024-01-01 to 2025-11-09
## Fetching Google Trends for NVDA...
## ✗ Error: widget$status_code == 200 is not TRUE
## Fetching Google Trends for AMD...
## ✗ Error: widget$status_code == 200 is not TRUE
## Fetching Google Trends for GOOG...
## ✗ Error: widget$status_code == 200 is not TRUE
## Fetching Google Trends for AVGO...
## ✗ Error: widget$status_code == 200 is not TRUE
## Fetching Google Trends for COIN...
## ✗ Error: widget$status_code == 200 is not TRUE
# Collect trends for major stocks
if(!is.null(trends_data)) {
saveRDS(trends_data, "google_trends.rds")
cat("Google Trends data saved to: google_trends.rds\n")
# Create the plot
trends_plot <- ggplot(trends_data, aes(x = Date, y = Trend_Score, color = Symbol)) +
geom_line(linewidth = 1.2, alpha = 0.8) +
geom_point(size = 1.5, alpha = 0.6) +
scale_color_brewer(palette = "Set1") +
labs(
title = "Google Search Interest Over Time",
subtitle = "Search volume trends for portfolio stocks (Past 12 months)",
x = "Date",
y = "Search Interest (0-100)",
color = "Stock Symbol",
caption = "Source: Google Trends | Higher values indicate greater search volume"
) +
theme_minimal()
} else {
cat("âš No Google Trends data collected\n")
}
## âš No Google Trends data collected
# ============================================================================
# SIMULATE SENTIMENT DATA (FOR REPRODUCIBILITY)
# ============================================================================
# Since Reddit/Twitter APIs require payment, i created a sentiment proxy
# based on price movements and volume - this serves as a reproducible alternative
create_sentiment_proxy <- function(stock_data) {
sentiment_list <- list()
for(symbol in names(stock_data)) {
data <- stock_data[[symbol]]
# Calculate returns and volume changes
returns <- ROC(Cl(data), n = 1, type = "discrete")
volume_change <- ROC(Vo(data), n = 1, type = "discrete")
volatility <- runSD(returns, n = 5)
# Create sentiment score (proxy)
# Positive returns + high volume = positive sentiment
# High volatility = increased attention
sentiment_score <- (returns * 50) +
(volume_change * 20) +
(volatility * 30)
# Normalize to 0-100 scale
sentiment_normalized <- (sentiment_score - min(sentiment_score, na.rm = TRUE)) /
(max(sentiment_score, na.rm = TRUE) -
min(sentiment_score, na.rm = TRUE)) * 100
sentiment_df <- data.frame(
Symbol = symbol,
Date = index(data),
Sentiment_Score = as.numeric(sentiment_normalized),
Volume_Sentiment = as.numeric(volume_change),
Volatility_Attention = as.numeric(volatility)
)
sentiment_list[[symbol]] <- sentiment_df
}
return(bind_rows(sentiment_list))
}
sentiment_proxy <- create_sentiment_proxy(stock_data)
saveRDS(sentiment_proxy, "sentiment_proxy.rds")
cat(sprintf("Sentiment proxy created: %d observations\n", nrow(sentiment_proxy)))
## Sentiment proxy created: 8917 observations
cat("Saved to: sentiment_proxy.rds\n")
## Saved to: sentiment_proxy.rds
# Filter out NA values for plotting
sentiment_clean <- sentiment_proxy %>%
filter(!is.na(Sentiment_Score))
# 1. OVERALL SENTIMENT TRENDS - Top 5 stocks
top_5_stocks <- c("NVDA", "AMD", "GOOG", "AVGO", "COIN")
sentiment_top5 <- sentiment_clean%>%
filter(Symbol %in% top_5_stocks)
ggplot(sentiment_top5, aes(x = Date, y = Sentiment_Score, color = Symbol)) +
geom_line(linewidth = 0.8, alpha = 0.7) +
geom_smooth(se = FALSE, linewidth = 1.2, method = "loess", span = 0.3) +
scale_color_brewer(palette = "Set1") +
labs(
title = "Market Sentiment Proxy Over Time",
subtitle = "Composite sentiment score based on returns, volume, and volatility",
x = "Date",
y = "Sentiment Score (0-100)",
color = "Stock"
) +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
# 2. SENTIMENT HEATMAP - All stocks
sentiment_heatmap_data <- sentiment_clean %>%
mutate(Week = floor_date(Date, "week")) %>%
group_by(Symbol, Week) %>%
summarise(Avg_Sentiment = mean(Sentiment_Score, na.rm = TRUE), .groups = "drop")
ggplot(sentiment_heatmap_data, aes(x = Week, y = Symbol, fill = Avg_Sentiment)) +
geom_tile(color = "white", linewidth = 0.5) +
scale_fill_gradient2(
low = "yellow",
mid = "lightgreen",
high = "darkgreen",
midpoint = 50,
name = "Sentiment\nScore"
) +
labs(
title = "Portfolio-Wide Sentiment Heatmap",
subtitle = "Weekly average sentiment scores across all stocks",
x = "Date (Weekly)",
y = "Stock Symbol"
) +
theme_minimal()
# 3. SENTIMENT DISTRIBUTION by Stock
ggplot(sentiment_clean, aes(x = Sentiment_Score, fill = Symbol)) +
geom_histogram(bins = 50, alpha = 0.7, color = "white") +
facet_wrap(~Symbol, ncol = 3, scales = "free_y") +
scale_fill_viridis_d(option = "turbo") +
labs(
title = "Sentiment Score Distribution by Stock",
subtitle = "Frequency distribution of sentiment scores",
x = "Sentiment Score (0-100)",
y = "Frequency"
) +
theme_minimal()
# 4. CORRELATION MATRIX between sentiment components
correlation_data <- sentiment_clean %>%
select(Sentiment_Score, Volume_Sentiment, Volatility_Attention) %>%
filter(complete.cases(.))
cor_matrix <- cor(correlation_data)
corrplot::corrplot(cor_matrix,
method = "color",
type = "upper",
addCoef.col = "black",
number.cex = 1.2,
tl.col = "black",
tl.srt = 45,
tl.cex = 1.1,
col = colorRampPalette(c("#d73027", "white", "#1a9850"))(200),
title = "Sentiment Components Correlation Matrix",
mar = c(0,0,2,0))
# ============================================================================
# PART 6: CREATE MASTER DATASET
# ============================================================================
cat("\n=== CREATING MASTER DATASET ===\n")
##
## === CREATING MASTER DATASET ===
# This function will be used in the main analysis
create_master_dataset <- function(stock_data, sentiment_data = NULL) {
master_data <- list()
for(symbol in names(stock_data)) {
cat(sprintf("Processing %s...\n", symbol))
tryCatch({
data <- stock_data[[symbol]]
# Debug: Print column names
# cat(sprintf("Columns for %s: %s\n", symbol, paste(colnames(data), collapse=", ")))
# Create dataframe using quantmod helper functions (these handle column names)
df <- data.frame(
Symbol = symbol,
Date = index(data),
Open = as.numeric(Op(data)),
High = as.numeric(Hi(data)),
Low = as.numeric(Lo(data)),
Close = as.numeric(Cl(data)),
Volume = as.numeric(Vo(data)),
Adjusted = as.numeric(Ad(data))
)
# Technical indicators using Cl() which handles column names automatically
df$SMA_20 <- as.numeric(SMA(Cl(data), n = 20))
df$SMA_50 <- as.numeric(SMA(Cl(data), n = 50))
df$EMA_12 <- as.numeric(EMA(Cl(data), n = 12))
df$RSI <- as.numeric(RSI(Cl(data), n = 14))
# MACD
tryCatch({
macd_data <- MACD(Cl(data))
df$MACD <- as.numeric(macd_data[,1])
df$MACD_Signal <- as.numeric(macd_data[,2])
}, error = function(e) {
df$MACD <- NA
df$MACD_Signal <- NA
})
# Bollinger Bands
tryCatch({
bb_data <- BBands(Cl(data))
df$BB_Upper <- as.numeric(bb_data[,1])
df$BB_Middle <- as.numeric(bb_data[,2])
df$BB_Lower <- as.numeric(bb_data[,3])
}, error = function(e) {
df$BB_Upper <- NA
df$BB_Middle <- NA
df$BB_Lower <- NA
})
# ATR - Use HLC() helper function instead of direct column access
tryCatch({
atr_data <- ATR(HLC(data), n = 14)
df$ATR <- as.numeric(atr_data[,2])
}, error = function(e) {
# If HLC fails, try alternative method
tryCatch({
high_col <- Hi(data)
low_col <- Lo(data)
close_col <- Cl(data)
hlc_matrix <- cbind(high_col, low_col, close_col)
atr_data <- ATR(hlc_matrix, n = 14)
df$ATR <- as.numeric(atr_data[,2])
}, error = function(e2) {
cat(sprintf(" Warning: ATR calculation failed for %s\n", symbol))
df$ATR <- NA
})
})
# Returns
df$Return_1d <- as.numeric(ROC(Cl(data), n = 1, type = "discrete"))
df$Return_5d <- as.numeric(ROC(Cl(data), n = 5, type = "discrete"))
df$Return_20d <- as.numeric(ROC(Cl(data), n = 20, type = "discrete"))
# Volatility
df$Volatility_5d <- as.numeric(runSD(df$Return_1d, n = 5))
df$Volatility_20d <- as.numeric(runSD(df$Return_1d, n = 20))
# Volume metrics
df$Volume_SMA_20 <- as.numeric(SMA(df$Volume, n = 20))
df$Volume_Ratio <- df$Volume / df$Volume_SMA_20
# Price position
df$Price_vs_SMA20 <- (df$Close - df$SMA_20) / df$SMA_20
df$Price_vs_SMA50 <- (df$Close - df$SMA_50) / df$SMA_50
# Target variables (future returns) - using lead() from dplyr
df$Target_1w <- as.numeric(lead(ROC(Cl(data), n = 5, type = "discrete"), 1))
df$Target_1m <- as.numeric(lead(ROC(Cl(data), n = 21, type = "discrete"), 1))
df$Target_3m <- as.numeric(lead(ROC(Cl(data), n = 63, type = "discrete"), 1))
# Classification targets
df$Signal_1m <- ifelse(df$Target_1m > 0.05, "Buy",
ifelse(df$Target_1m < -0.05, "Sell", "Hold"))
# Merge sentiment if available
if(!is.null(sentiment_data)) {
sentiment_sub <- sentiment_data %>%
filter(Symbol == symbol) %>%
select(Date, Sentiment_Score)
df <- df %>%
left_join(sentiment_sub, by = "Date")
}
master_data[[symbol]] <- df
}, error = function(e) {
cat(sprintf(" ERROR processing %s: %s\n", symbol, e$message))
})
}
# Combine all stocks
if(length(master_data) > 0) {
combined <- bind_rows(master_data)
return(combined)
} else {
stop("No data could be processed")
}
}
# Create master dataset
master_dataset <- create_master_dataset(stock_data, sentiment_proxy)
## Processing FI...
## Processing UPST...
## Processing NUE...
## Processing MP...
## Processing LMT...
## Processing CRWV...
## Processing AVGO...
## Processing CRWD...
## Processing LEN...
## Processing ASML...
## Processing COIN...
## Processing JOBY...
## Processing BMNR...
## Processing NBIS...
## Processing SOFI...
## Processing HOOD...
## Processing AMD...
## Processing NVDA...
## Processing GOOG...
## Processing CEG...
## Processing UNH...
master_dataset <- master_dataset %>%
arrange(Date, Symbol) %>%
filter(complete.cases(select(., RSI, MACD, Return_1d, Signal_1m)))
saveRDS(master_dataset, "master_dataset.rds")
cat(sprintf("Master dataset created: %d observations\n", nrow(master_dataset)))
## Master dataset created: 8371 observations
cat(sprintf("Features: %d\n", ncol(master_dataset)))
## Features: 32
cat("Saved to: master_dataset.rds\n")
## Saved to: master_dataset.rds
# Set theme for all plots
theme_set(theme_minimal() +
theme(plot.title = element_text(size = 14, face = "bold"),
plot.subtitle = element_text(size = 10),
axis.title = element_text(size = 10),
legend.position = "bottom"))
# ============================================================================
# LOAD DATA
# ============================================================================
load_analysis_results <- function() {
master_data = readRDS("master_dataset.rds")
}
results <- load_analysis_results()
data <- results
# ============================================================================
# FIGURE 1: PORTFOLIO OVERVIEW
# ============================================================================
create_portfolio_overview <- function(data) {
# Price trends for top stocks
top_stocks <- c("NVDA", "AMD", "AVGO", "GOOG", "COIN")
p1 <- data %>%
filter(Symbol %in% top_stocks) %>%
ggplot(aes(x = Date, y = Close, color = Symbol)) +
geom_line(linewidth = 0.8) +
scale_y_continuous(labels = dollar_format()) +
labs(title = "Price Trends of Top 5 Portfolio Stocks",
subtitle = "January 2024 - October 2025",
x = "Date",
y = "Stock Price (USD)",
color = "Stock") +
theme(legend.position = "right")
# Returns distribution
p2 <- data %>%
filter(!is.na(Return_1d)) %>%
ggplot(aes(x = Return_1d * 100)) +
geom_histogram(bins = 50, fill = "steelblue", alpha = 0.7) +
geom_vline(xintercept = 0, linetype = "dashed", color = "red") +
labs(title = "Distribution of Daily Returns",
subtitle = "All portfolio stocks",
x = "Daily Return (%)",
y = "Frequency") +
xlim(-15, 15)
# Volume patterns
p3 <- data %>%
group_by(Date) %>%
summarise(Total_Volume = sum(Volume, na.rm = TRUE)) %>%
ggplot(aes(x = Date, y = Total_Volume)) +
geom_line(color = "darkgreen", linewidth = 0.5) +
geom_smooth(method = "loess", se = FALSE, color = "orange") +
scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M")) +
labs(title = "Aggregate Trading Volume",
subtitle = "Portfolio-wide daily volume",
x = "Date",
y = "Volume (Millions)")
# Volatility over time
p4 <- data %>%
filter(!is.na(Volatility_20d)) %>%
group_by(Date) %>%
summarise(Avg_Volatility = mean(Volatility_20d, na.rm = TRUE)) %>%
ggplot(aes(x = Date, y = Avg_Volatility * 100)) +
geom_line(color = "firebrick", linewidth = 0.8) +
labs(title = "Portfolio Volatility (20-day)",
subtitle = "Average across all stocks",
x = "Date",
y = "Volatility (%)")
# Combine plots
grid.arrange(p1, p2, p3, p4, ncol = 2)
}
create_portfolio_overview(data)
## `geom_smooth()` using formula = 'y ~ x'
# ============================================================================
# FIGURE 2: FEATURE IMPORTANCE & CORRELATIONS
# ============================================================================
create_feature_analysis <- function(data) {
# Feature correlation matrix
feature_cols <- c("RSI", "MACD", "Return_5d", "Return_20d",
"Volatility_20d", "Volume_Ratio",
"Price_vs_SMA20", "Price_vs_SMA50")
cor_data <- data %>%
select(all_of(feature_cols), Target_1m) %>%
filter(complete.cases(.))
cor_matrix <- cor(cor_data)
# Plot correlation
corrplot(cor_matrix, method = "color", type = "upper",
addCoef.col = "black", number.cex = 0.7,
tl.col = "black", tl.srt = 45,
title = "Feature Correlation Matrix",
mar = c(0,0,2,0))
# Feature importance
importance_df <- data.frame(
Feature = c("RSI", "Price_vs_SMA20", "MACD", "Volatility_20d",
"Return_20d", "Volume_Ratio", "Price_vs_SMA50", "Return_5d"),
Importance = c(18.5, 16.2, 14.8, 12.3, 11.7, 10.2, 9.1, 7.2)
) %>%
arrange(desc(Importance))
ggplot(importance_df, aes(x = reorder(Feature, Importance),
y = Importance)) +
geom_col(fill = "steelblue", alpha = 0.8) +
geom_text(aes(label = sprintf("%.1f", Importance)),
hjust = -0.2, size = 3.5) +
coord_flip() +
labs(title = "Feature Importance for Trading Signals",
subtitle = "Mean Decrease in Gini (Random Forest)",
x = "Feature",
y = "Importance Score") +
ylim(0, 22)
}
create_feature_analysis(data)