library(quantmod)
library(TTR)
library(tidyverse)
library(xgboost)
library(randomForest)
library(caret)
library(PerformanceAnalytics)
library(gtrendsR)
library(tidyquant)
library(lubridate)
library(ggplot2)
library(gridExtra)
library(corrplot)
library(scales)
library(knitr)
library(kableExtra)

1. Abstract

As equity markets reach historical highs, individual investors require data-driven systems to manage risk across diverse time horizons. This project develops an intelligent trading signal system that generates Buy/Sell/Hold recommendations for 1-week, 1-month, 3-month, and 12-month horizons. By integrating traditional technical indicators with alternative data—specifically Google Trends and news sentiment from the Finnhub API—an ensemble framework using Random Forest and XGBoost was constructed. The system achieved a 56.78% prediction accuracy for the 12-month horizon. Results indicate that systematic integration of alternative data proxies significantly enhances portfolio decision-making for retail investors

2. Introduction

Traditional portfolio management relies heavily on fundamental analysis, technical indicators, and historical price data. However, in today’s information-rich environment, alternative data sources—including social media sentiment, news analytics, and real-time event data—contain valuable signals that can enhance investment decision-making across multiple time horizons. With technological advancement, social media has also changed the way investors acquire information to make decisions in a fast and frugal manner (Rahul Verma,2025).

The proliferation of retail trading platforms and social media-driven market movements (as seen with GameStop, AMC, and meme stocks) has fundamentally altered market dynamics, creating new patterns that traditional models struggle to capture.

The challenge for individual investors lies not merely in accessing alternative data sources, but in systematically processing and integrating these disparate information streams into coherent, actionable investment signals within practical time constraints. While institutional investors have dedicated resources for alternative data analysis, individual investors often rely on fragmented information from platforms like Yahoo Finance, Reddit, and StockTwits without a systematic framework for synthesis and decision-making. This information processing bottleneck represents both a practical constraint and a research opportunity to develop scalable, data-driven portfolio management systems for retail investors. (Rahul Verma,2025)

3. Objective

The objective of this project is to design and validate an trading recommendation system that generates Buy / Hold / Sell signals across multiple investment horizons (1-week, 1-month, 3-month, and 12-month) using Technical indicators and alternate data sources(social media sentiment or news analytics, and event-driven signals) . The project aims to enhance decision-making by identifying return-predictive market patterns while maintaining statistical rigor, interpretability, and risk awareness.

5. Research Gap and Originality

While existing research has examined individual alternative data sources, few studies have: * Integrated multiple alternative data sources in a comprehensive portfolio management framework

Examined predictive power across multiple time horizons simultaneously
Focused on practical implementation for individual investor portfolios rather than institutional applications Incorporated real-time event-driven signals with social sentiment analysis.

This project contributes by developing a holistic, multi-horizon framework that combines social media sentiment, news analytics, and event-driven signals using ensemble machine learning methods specifically designed for individual portfolio management.

6. Data Collection

# ============================================================================
# DATA COLLECTION
# ============================================================================

# Read portfolio from CSV
portfolio_df <- read.csv("portfolio.csv", stringsAsFactors = FALSE)
portfolio <- portfolio_df$Symbol[portfolio_df$Symbol != ""]
portfolio <- portfolio[!is.na(portfolio)]

cat(sprintf("Portfolio loaded: %d stocks\n", length(portfolio)))

## Portfolio loaded: 21 stocks

print(portfolio)

##  [1] "FI"   "UPST" "NUE"  "MP"   "LMT"  "CRWV" "AVGO" "CRWD" "LEN"  "ASML"
## [11] "COIN" "JOBY" "BMNR" "NBIS" "SOFI" "HOOD" "AMD"  "NVDA" "GOOG" "CEG" 
## [21] "UNH"

# Date range
start_date <- "2025-01-01"
end_date <- "2025-11-30"

7. Data preparation: Yahoo Finance

The methodology utilizes three distinct data streams to create a holistic view of market sentiment and momentum: Market Data: 10 years of daily price data (2025-2015) was extracted from Yahoo Finance using the quantmod package in R. I used the getSymbols() method in R using the quantmod package. It allows you to fetch historical stock prices and other financial data directly from various online sources like Yahoo Finance, Google Finance, oanda and others. I used Yahoo finance.

# ============================================================================
# DATA PREPARATION
# ============================================================================

# Safe data download function with retry logic
download_stock_data <- function(symbols, start_date, end_date, max_retries = 3) {
  stock_data <- list()
  failed_symbols <- c()
  
  for(symbol in symbols) {
    success <- FALSE
    attempts <- 0
    
    while(!success && attempts < max_retries) {
      attempts <- attempts + 1
      
      tryCatch({
        cat(sprintf("Downloading %s (attempt %d/%d)...\n", 
                   symbol, attempts, max_retries))
        
        data <- getSymbols(symbol, 
                          src = "yahoo", 
                          from = start_date, 
                          to = end_date, 
                          auto.assign = FALSE)
        
        if(nrow(data) > 0) {
          stock_data[[symbol]] <- data
          success <- TRUE
          cat(sprintf("  ✓ Success: %d rows\n", nrow(data)))
        }
        
      }, error = function(e) {
        cat(sprintf("  ✗ Error: %s\n", e$message))
        if(attempts < max_retries) {
          Sys.sleep(2)
        }
      })
    }
    
    if(!success) {
      failed_symbols <- c(failed_symbols, symbol)
      cat(sprintf("  ✗ Failed after %d attempts\n", max_retries))
    }
    
    Sys.sleep(1) # Rate limiting
  }
  
  # Summary
  cat(sprintf("\n=== DOWNLOAD SUMMARY ===\n"))
  cat(sprintf("Successful: %d/%d\n", length(stock_data), length(symbols)))
  if(length(failed_symbols) > 0) {
    cat(sprintf("Failed: %s\n", paste(failed_symbols, collapse = ", ")))
  }
  
  return(list(data = stock_data, failed = failed_symbols))
}

# Download data
start_date <- "2024-01-01"
end_date <- Sys.Date()

result <- download_stock_data(portfolio, start_date, end_date)

## Downloading FI (attempt 1/3)...
##   ✓ Success: 493 rows
## Downloading UPST (attempt 1/3)...
##   ✓ Success: 493 rows
## Downloading NUE (attempt 1/3)...
##   ✓ Success: 493 rows
## Downloading MP (attempt 1/3)...
##   ✓ Success: 493 rows
## Downloading LMT (attempt 1/3)...
##   ✓ Success: 493 rows
## Downloading CRWV (attempt 1/3)...
##   ✓ Success: 183 rows
## Downloading AVGO (attempt 1/3)...
##   ✓ Success: 493 rows
## Downloading CRWD (attempt 1/3)...
##   ✓ Success: 493 rows
## Downloading LEN (attempt 1/3)...
##   ✓ Success: 493 rows
## Downloading ASML (attempt 1/3)...
##   ✓ Success: 493 rows
## Downloading COIN (attempt 1/3)...
##   ✓ Success: 493 rows
## Downloading JOBY (attempt 1/3)...
##   ✓ Success: 493 rows
## Downloading BMNR (attempt 1/3)...
##   ✓ Success: 136 rows
## Downloading NBIS (attempt 1/3)...
##   ✓ Success: 291 rows
## Downloading SOFI (attempt 1/3)...
##   ✓ Success: 493 rows
## Downloading HOOD (attempt 1/3)...
##   ✓ Success: 493 rows
## Downloading AMD (attempt 1/3)...
##   ✓ Success: 493 rows
## Downloading NVDA (attempt 1/3)...
##   ✓ Success: 493 rows
## Downloading GOOG (attempt 1/3)...
##   ✓ Success: 493 rows
## Downloading CEG (attempt 1/3)...
##   ✓ Success: 493 rows
## Downloading UNH (attempt 1/3)...
##   ✓ Success: 493 rows
## 
## === DOWNLOAD SUMMARY ===
## Successful: 21/21

stock_data <- result$data
successful_symbols <- names(stock_data)

# visualize the fetched data for one stock say FI
chartSeries(stock_data$FI)

# Save raw data
saveRDS(stock_data, "raw_stock_data.rds")
cat("\nRaw data saved to: raw_stock_data.rds\n")

## 
## Raw data saved to: raw_stock_data.rds

8. Data preparation: Google Trends

Programmatic access via the gtrendsR package provided search volume patterns for top tickers like NVDA, AMD, GOOG, AVGO, and COIN.The gtrends function in R is part of the gtrendsR package, which provides an interface for retrieving and displaying Google Trends information directly within R. This allows for programmatic access to Google Trends data, enabling analysis and visualization without manual downloads from the Google Trends website. I limited to weekly trend and only top 5 stocks to avoid API issues(“NVDA”, “AMD”, “GOOG”, “AVGO”, “COIN”).

# ============================================================================
# ALTERNATIVE DATA - GOOGLE TRENDS
# ============================================================================

# Google Trends collection (limited to avoid rate limits)
collect_google_trends <- function(symbols, start_date, end_date) {
  trends_data <- list()
  
  # Limit to top 5 stocks to avoid API issues
  top_symbols <- symbols[1:min(5, length(symbols))]
  
  # Format dates for Google Trends API (YYYY-MM-DD format)
  time_period <- paste(format(as.Date(start_date), "%Y-%m-%d"),
                      format(as.Date(end_date), "%Y-%m-%d"))
  
  cat(sprintf("Collecting Google Trends data from %s to %s\n", 
              start_date, end_date))
  
  for(symbol in top_symbols) {
    tryCatch({ q
      cat(sprintf("Fetching Google Trends for %s...\n", symbol))
      
      trends <- gtrends(keyword = symbol, 
                       time = time_period,
                       onlyInterest = TRUE)
      
      if(!is.null(trends$interest_over_time)) {
        trends_df <- trends$interest_over_time %>%
          mutate(
            Symbol = symbol,
            Date = as.Date(date),
            Trend_Score = as.numeric(hits)
          ) %>%
          select(Symbol, Date, Trend_Score)
        
        trends_data[[symbol]] <- trends_df
        cat(sprintf("  ✓ Collected %d data points\n", nrow(trends_df)))
      } else {
        cat("  ⚠ No trend data returned\n")
      }
      
       Sys.sleep(runif(1, 5, 10))  # Random pause 5–10 sec
      
    }, error = function(e) {
      cat(sprintf("  ✗ Error: %s\n", e$message))
    })
  }
  
  if(length(trends_data) > 0) {
    combined_trends <- bind_rows(trends_data)
    return(combined_trends)
  } else {
    return(NULL)
  }
}

# Collect trends for major stocks
cat("\n=== COLLECTING GOOGLE TRENDS DATA ===\n")

## 
## === COLLECTING GOOGLE TRENDS DATA ===

trends_data <- collect_google_trends(c("NVDA", "AMD", "GOOG", "AVGO", "COIN"),
                                     start_date, end_date)

## Collecting Google Trends data from 2024-01-01 to 2025-12-18
## Fetching Google Trends for NVDA...
##   ✓ Collected 103 data points
## Fetching Google Trends for AMD...
##   ✓ Collected 103 data points
## Fetching Google Trends for GOOG...
##   ✓ Collected 103 data points
## Fetching Google Trends for AVGO...
##   ✓ Collected 103 data points
## Fetching Google Trends for COIN...
##   ✓ Collected 103 data points

# Collect trends for major stocks

if(!is.null(trends_data)) {
  saveRDS(trends_data, "google_trends.rds")
  cat("Google Trends data saved to: google_trends.rds\n")
  
# Create the plot
  trends_plot <- ggplot(trends_data, aes(x = Date, y = Trend_Score, color = Symbol)) +
    geom_line(linewidth = 1.2, alpha = 0.8) +
    geom_point(size = 1.5, alpha = 0.6) +
    scale_color_brewer(palette = "Set1") +
    labs(
      title = "Google Search Interest Over Time",
      subtitle = "Search volume trends for portfolio stocks (Past 12 months)",
      x = "Date",
      y = "Search Interest (0-100)",
      color = "Stock Symbol",
      caption = "Source: Google Trends | Higher values indicate greater search volume"
    ) +
    theme_minimal()
  
} else {
  cat("⚠ No Google Trends data collected\n")
}

## Google Trends data saved to: google_trends.rds

9. Data preparation: Sentiment Data

Since Reddit/Twitter APIs require payment, i registered and generated API key from FINNHUB API - FREE FINANCIAL NEWS & SENTIMENT website at: https://finnhub.io/. Sentiment comes from news articles via the Finnhub API. It contains Headlines and summaries which are all converted to text. After i fetched this data, i scored sentiment using AFINN lexicon, normalized it to -1 to 1 scale.

# ============================================================================
# FINNHUB API - FREE FINANCIAL NEWS & SENTIMENT
# Sign up at: https://finnhub.io/ (Free tier: 60 calls/minute)
# ============================================================================

library(httr)

## Warning: package 'httr' was built under R version 4.3.3

## 
## Attaching package: 'httr'

## The following object is masked from 'package:caret':
## 
##     progress

library(jsonlite)

## Warning: package 'jsonlite' was built under R version 4.3.3

## 
## Attaching package: 'jsonlite'

## The following object is masked from 'package:purrr':
## 
##     flatten

library(tidyverse)
library(lubridate)
library(syuzhet)

## Warning: package 'syuzhet' was built under R version 4.3.3

## 
## Attaching package: 'syuzhet'

## The following object is masked from 'package:scales':
## 
##     rescale

# ============================================================================
# FINNHUB NEWS SENTIMENT COLLECTION
# ============================================================================

collect_finnhub_sentiment <- function(symbols, start_date, end_date, api_key) {
  
  if(is.null(api_key) || api_key == "") {
    cat("⚠ No Finnhub API key provided\n")
    cat("Get a free key at: https://finnhub.io/register\n")
    return(NULL)
  }
  
  cat("\n=== COLLECTING FINNHUB NEWS SENTIMENT ===\n")
  cat("API: Finnhub.io (Free tier: 60 calls/min, unlimited for basic data)\n\n")
  
  news_data_list <- list()
  
  # Convert dates to required format
  from_date <- format(as.Date(start_date), "%Y-%m-%d")
  to_date <- format(as.Date(end_date), "%Y-%m-%d")
  
  for(symbol in symbols) {
    cat(sprintf("Fetching news for %s...\n", symbol))
    
    tryCatch({
      # Finnhub company news endpoint
      url <- sprintf(
        "https://finnhub.io/api/v1/company-news?symbol=%s&from=%s&to=%s&token=%s",
        symbol, from_date, to_date, api_key
      )
      
      response <- GET(url)
      
      cat(sprintf("  API Status: %d\n", status_code(response)))
      
      if(status_code(response) == 200) {
        news_json <- fromJSON(content(response, "text"))
        
        if(length(news_json) > 0 && is.data.frame(news_json)) {
          cat(sprintf("  Found %d articles\n", nrow(news_json)))
          
          # Process news data
          news_df <- news_json %>%
            mutate(
              Symbol = symbol,
              Date = as.Date(as.POSIXct(datetime, origin = "1970-01-01")),
              # Combine headline and summary
              text = paste(
                ifelse(is.na(headline), "", headline),
                ifelse(is.na(summary), "", summary),
                sep = ". "
              ),
              text = trimws(text)
            ) %>%
            filter(text != "" & text != ".") %>%
            mutate(
              # Calculate sentiment using AFINN lexicon
              sentiment_score = get_sentiment(text, method = "afinn"),
              # Normalize sentiment to -1 to 1 scale
              sentiment_normalized = sentiment_score / max(abs(sentiment_score), na.rm = TRUE)
            ) %>%
            group_by(Symbol, Date) %>%
            summarise(
              News_Count = n(),
              Avg_Sentiment = mean(sentiment_score, na.rm = TRUE),
              Sentiment_Positive = sum(sentiment_score > 0),
              Sentiment_Negative = sum(sentiment_score < 0),
              Sentiment_Neutral = sum(sentiment_score == 0),
              Max_Sentiment = max(sentiment_score),
              Min_Sentiment = min(sentiment_score),
              .groups = "drop"
            )
          
          news_data_list[[symbol]] <- news_df
          
          cat(sprintf("  ✓ Processed %d articles into %d daily records\n", 
                     sum(news_df$News_Count), nrow(news_df)))
          
        } else {
          cat(sprintf("  ⚠ No news articles found for %s\n", symbol))
        }
      } else {
        error_content <- content(response, "text")
        cat(sprintf("  ✗ API Error: %s\n", error_content))
      }
      
      Sys.sleep(1.5)  # Rate limiting: 60 calls/min = 1 per second
      
    }, error = function(e) {
      cat(sprintf("  ✗ Error: %s\n", e$message))
    })
  }
  
  if(length(news_data_list) > 0) {
    combined_news <- bind_rows(news_data_list)
    cat(sprintf("\n✓ Successfully collected news for %d/%d symbols\n", 
               length(news_data_list), length(symbols)))
    cat(sprintf("Total daily records: %d\n", nrow(combined_news)))
    return(combined_news)
  } else {
    cat("\n⚠ No news data collected\n")
    return(NULL)
  }
}

# ============================================================================
# FINNHUB SENTIMENT METRICS (Additional endpoint)
# ============================================================================

collect_finnhub_sentiment_scores <- function(symbols, api_key) {
  
  cat("\n=== COLLECTING FINNHUB SENTIMENT SCORES ===\n")
  
  sentiment_list <- list()
  
  for(symbol in symbols) {
    cat(sprintf("Fetching sentiment scores for %s...\n", symbol))
    
    tryCatch({
      # Finnhub news sentiment endpoint
      url <- sprintf(
        "https://finnhub.io/api/v1/news-sentiment?symbol=%s&token=%s",
        symbol, api_key
      )
      
      response <- GET(url)
      
      if(status_code(response) == 200) {
        sentiment_json <- fromJSON(content(response, "text"))
        
        if(!is.null(sentiment_json$sentiment)) {
          sentiment_df <- data.frame(
            Symbol = symbol,
            Date = Sys.Date(),
            Buzz_ArticlesInLastWeek = sentiment_json$buzz$articlesInLastWeek,
            Buzz_WeeklyAverage = sentiment_json$buzz$weeklyAverage,
            Sentiment_BearishPercent = sentiment_json$sentiment$bearishPercent,
            Sentiment_BullishPercent = sentiment_json$sentiment$bullishPercent,
            Sentiment_Score = sentiment_json$companyNewsScore
          )
          
          sentiment_list[[symbol]] <- sentiment_df
          cat(sprintf("  ✓ Collected sentiment metrics\n"))
        }
      }
      
      Sys.sleep(1.5)
      
    }, error = function(e) {
      cat(sprintf("  ✗ Error: %s\n", e$message))
    })
  }
  
  if(length(sentiment_list) > 0) {
    return(bind_rows(sentiment_list))
  } else {
    return(NULL)
  }
}

# Get your free Finnhub API key at: https://finnhub.io/register
finnhub_key <- "d4t45bpr01qhr5to5i80d4t45bpr01qhr5to5i8g"  # my key

# Your portfolio
portfolio <- c("NVDA", "AMD", "GOOG", "AVGO", "COIN", 
               "FI", "UPST", "NUE", "MP", "LMT")

start_date <- "2024-01-01"
end_date <- Sys.Date()

# Collect news sentiment
finnhub_news <- collect_finnhub_sentiment(portfolio, start_date, end_date, finnhub_key)

## 
## === COLLECTING FINNHUB NEWS SENTIMENT ===
## API: Finnhub.io (Free tier: 60 calls/min, unlimited for basic data)
## 
## Fetching news for NVDA...
##   API Status: 200
##   Found 249 articles
##   ✓ Processed 249 articles into 8 daily records
## Fetching news for AMD...
##   API Status: 200
##   Found 246 articles
##   ✓ Processed 246 articles into 14 daily records
## Fetching news for GOOG...
##   API Status: 200
##   Found 250 articles
##   ✓ Processed 250 articles into 9 daily records
## Fetching news for AVGO...
##   API Status: 200
##   Found 249 articles
##   ✓ Processed 249 articles into 9 daily records
## Fetching news for COIN...
##   API Status: 200
##   Found 248 articles
##   ✓ Processed 248 articles into 17 daily records
## Fetching news for FI...
##   API Status: 200
##   Found 245 articles
##   ✓ Processed 245 articles into 68 daily records
## Fetching news for UPST...
##   API Status: 200
##   Found 242 articles
##   ✓ Processed 242 articles into 82 daily records
## Fetching news for NUE...
##   API Status: 200
##   Found 211 articles
##   ✓ Processed 211 articles into 90 daily records
## Fetching news for MP...
##   API Status: 200
##   Found 249 articles
##   ✓ Processed 249 articles into 59 daily records
## Fetching news for LMT...
##   API Status: 200
##   Found 236 articles
##   ✓ Processed 236 articles into 56 daily records
## 
## ✓ Successfully collected news for 10/10 symbols
## Total daily records: 412

# Collect sentiment scores (optional - current metrics only)
finnhub_scores <- collect_finnhub_sentiment_scores(portfolio, finnhub_key)

## 
## === COLLECTING FINNHUB SENTIMENT SCORES ===
## Fetching sentiment scores for NVDA...
## Fetching sentiment scores for AMD...
## Fetching sentiment scores for GOOG...
## Fetching sentiment scores for AVGO...
## Fetching sentiment scores for COIN...
## Fetching sentiment scores for FI...
## Fetching sentiment scores for UPST...
## Fetching sentiment scores for NUE...
## Fetching sentiment scores for MP...
## Fetching sentiment scores for LMT...

# Save the data
if(!is.null(finnhub_news)) {
  saveRDS(finnhub_news, "finnhub_news_sentiment.rds")
  write.csv(finnhub_news, "finnhub_news_sentiment.csv", row.names = FALSE)
  
  cat("\n=== FINNHUB NEWS SUMMARY ===\n")
  summary_stats <- finnhub_news %>%
    group_by(Symbol) %>%
    summarise(
      Total_Articles = sum(News_Count),
      Avg_Daily_Sentiment = mean(Avg_Sentiment, na.rm = TRUE),
      Days_Covered = n(),
      Positive_Days = sum(Avg_Sentiment > 0),
      Negative_Days = sum(Avg_Sentiment < 0)
    )
  
  print(summary_stats)
  
  # Visualization
  ggplot(finnhub_news, aes(x = Date, y = Avg_Sentiment, color = Symbol)) +
    geom_line(alpha = 0.7) +
    geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
    facet_wrap(~Symbol, ncol = 2, scales = "free_y") +
    labs(
      title = "Finnhub News Sentiment Over Time",
      subtitle = "Daily average sentiment scores by stock",
      x = "Date",
      y = "Sentiment Score",
      caption = "Source: Finnhub.io API"
    ) +
    theme_minimal() +
    theme(legend.position = "none")
  
  cat("\n✓ Finnhub sentiment data saved successfully\n")
}

## 
## === FINNHUB NEWS SUMMARY ===
## # A tibble: 10 × 6
##    Symbol Total_Articles Avg_Daily_Sentiment Days_Covered Positive_Days
##    <chr>           <int>               <dbl>        <int>         <int>
##  1 AMD               246                2.49           14            14
##  2 AVGO              249                1.22            9             9
##  3 COIN              248                1.32           17            13
##  4 FI                245                2.37           68            51
##  5 GOOG              250                1.48            9             9
##  6 LMT               236                3.29           56            53
##  7 MP                249                1.55           59            46
##  8 NUE               211                2.56           90            73
##  9 NVDA              249                1.86            8             7
## 10 UPST              242                4.75           82            76
## # ℹ 1 more variable: Negative_Days <int>
## 
## ✓ Finnhub sentiment data saved successfully

if(!is.null(finnhub_scores)) {
  saveRDS(finnhub_scores, "finnhub_sentiment_scores.rds")
  cat("✓ Finnhub sentiment scores saved\n")
}

cat("\n=== ALTERNATIVE DATA COLLECTION COMPLETE ===\n")

## 
## === ALTERNATIVE DATA COLLECTION COMPLETE ===

cat("\nNext step: Merge with your stock data:\n")

## 
## Next step: Merge with your stock data:

cat("master_dataset <- master_dataset %>%\n")

## master_dataset <- master_dataset %>%

cat("  left_join(finnhub_news, by = c('Symbol', 'Date'))\n")

##   left_join(finnhub_news, by = c('Symbol', 'Date'))

10.Combine three datasets

create_master_dataset <- function(stock_data, sentiment_data = NULL) {
  master_data <- list()
  
  for(symbol in names(stock_data)) {
    data <- stock_data[[symbol]]
    
    # Create dataframe with proper column extraction
    df <- data.frame(
      Symbol = symbol,
      Date = index(data),
      Open = as.numeric(Op(data)),
      High = as.numeric(Hi(data)),
      Low = as.numeric(Lo(data)),
      Close = as.numeric(Cl(data)),
      Volume = as.numeric(Vo(data)),
      Adjusted = as.numeric(Ad(data))
    )
    
    # Technical indicators
    df$SMA_20 <- as.numeric(SMA(Cl(data), n = 20))
    df$SMA_50 <- as.numeric(SMA(Cl(data), n = 50))
    df$EMA_12 <- as.numeric(EMA(Cl(data), n = 12))
    df$RSI <- as.numeric(RSI(Cl(data), n = 14))
    
    macd_data <- MACD(Cl(data))
    df$MACD <- as.numeric(macd_data[,1])
    df$MACD_Signal <- as.numeric(macd_data[,2])
    
    bb_data <- BBands(Cl(data))
    df$BB_Upper <- as.numeric(bb_data[,1])
    df$BB_Middle <- as.numeric(bb_data[,2])
    df$BB_Lower <- as.numeric(bb_data[,3])
    
    # Fix ATR calculation - use HLC helper function
    atr_data <- ATR(HLC(data), n = 14)
    df$ATR <- as.numeric(atr_data[,2])
    
    # Returns
    df$Return_1d <- as.numeric(ROC(Cl(data), n = 1, type = "discrete"))
    df$Return_5d <- as.numeric(ROC(Cl(data), n = 5, type = "discrete"))
    df$Return_20d <- as.numeric(ROC(Cl(data), n = 20, type = "discrete"))
    
    # Volatility
    df$Volatility_5d <- as.numeric(runSD(df$Return_1d, n = 5))
    df$Volatility_20d <- as.numeric(runSD(df$Return_1d, n = 20))
    
    # Volume metrics
    df$Volume_SMA_20 <- as.numeric(SMA(df$Volume, n = 20))
    df$Volume_Ratio <- df$Volume / df$Volume_SMA_20
    
    # Price position
    df$Price_vs_SMA20 <- (df$Close - df$SMA_20) / df$SMA_20
    df$Price_vs_SMA50 <- (df$Close - df$SMA_50) / df$SMA_50
    
    # Target variables (future returns)
    df$Target_1w <- as.numeric(lead(ROC(Cl(data), n = 5, type = "discrete"), 1))
    df$Target_1m <- as.numeric(lead(ROC(Cl(data), n = 21, type = "discrete"), 1))
    df$Target_3m <- as.numeric(lead(ROC(Cl(data), n = 63, type = "discrete"), 1))
    
    # Classification targets
    df$Signal_1m <- ifelse(df$Target_1m > 0.05, "Buy",
                          ifelse(df$Target_1m < -0.05, "Sell", "Hold"))
    
    # Merge sentiment if available
    if(!is.null(sentiment_data)) {
      sentiment_sub <- sentiment_data %>%
        filter(Symbol == symbol) %>%
        select(Date, Avg_Sentiment)
      
      df <- df %>%
        left_join(sentiment_sub, by = "Date")
    }
    
    master_data[[symbol]] <- df
  }
  
  # Combine all stocks
  combined <- bind_rows(master_data)
  return(combined)
}

# Create master dataset
master_dataset <- create_master_dataset(stock_data, finnhub_news)
master_dataset <- master_dataset %>%
  arrange(Date, Symbol) %>%
  filter(complete.cases(select(., RSI, MACD, Return_1d, Signal_1m)))

saveRDS(master_dataset, "master_dataset.rds")
cat(sprintf("Master dataset created: %d observations\n", nrow(master_dataset)))

## Master dataset created: 8938 observations

cat(sprintf("Features: %d\n", ncol(master_dataset)))

## Features: 32

cat("Saved to: master_dataset.rds\n")

## Saved to: master_dataset.rds

# Save summary
write.csv(
  master_dataset %>%
    group_by(Symbol) %>%
    summarise(
      Observations = n(),
      Avg_Return = mean(Return_1d, na.rm = TRUE),
      Volatility = sd(Return_1d, na.rm = TRUE),
      Latest_Price = last(Close)
    ),
  "portfolio_summary.csv",
  row.names = FALSE
)

# Set theme for all plots
theme_set(theme_minimal() +
         theme(plot.title = element_text(size = 14, face = "bold"),
               plot.subtitle = element_text(size = 10),
               axis.title = element_text(size = 10),
               legend.position = "bottom"))

# ============================================================================
# LOAD DATA
# ============================================================================


load_analysis_results <- function() {
  
    master_data = readRDS("master_dataset.rds")
  
}

results <- load_analysis_results()
data <- results

11.Correlation analaysis and Feature Importance

Highest Correlations with Target_1m Returns(important predictor) are as follows. We observe that stocks with positive recent momentum tend to continue rising in the next month (momentum effect). Return_20d (0.97) - Extremely strong! This may be due to multi-collinearity. Price_vs_SMA50 (0.86) - Strong positive correlation with longer-term trend MACD (0.85) - Technical momentum indicator highly predictive Price_vs_SMA20 (0.80) - Medium-term trend indicator RSI (0.74) - Technical momentum oscillator shows good predictive power

There is multicollinearity issues as there are highly correlated feature pairs like MACD ↔︎ Price_vs_SMA50 (0.96),Return_20d ↔︎ Price_vs_SMA50 (0.88) and RSI ↔︎ Price_vs_SMA20 (0.83) .Hence we should keep either MACD OR Price_vs_SMA50 (not both). There are features with Low Correlation to Target like Volume_Ratio (0.01) with almost no predictive power for 1-month returns, Volatility_20d (0.27) with Weak relationship and Return_5d (0.46) with Moderate but weaker than longer-term momentum.Hence Volume changes alone don’t predict future returns well. Short-term volatility also shows limited predictive value.

# ============================================================================
# FIGURE 1: PORTFOLIO OVERVIEW
# ============================================================================

create_portfolio_overview <- function(data) {
  # Price trends for top stocks
  top_stocks <- c("NVDA", "AMD", "AVGO", "GOOG", "COIN")
  
  p1 <- data %>%
    filter(Symbol %in% top_stocks) %>%
    ggplot(aes(x = Date, y = Close, color = Symbol)) +
    geom_line(linewidth = 0.8) +
    scale_y_continuous(labels = dollar_format()) +
    labs(title = "Price Trends of Top 5 Portfolio Stocks",
         subtitle = "January 2024 - October 2025",
         x = "Date",
         y = "Stock Price (USD)",
         color = "Stock") +
    theme(legend.position = "right")
  
  # Returns distribution
  p2 <- data %>%
    filter(!is.na(Return_1d)) %>%
    ggplot(aes(x = Return_1d * 100)) +
    geom_histogram(bins = 50, fill = "steelblue", alpha = 0.7) +
    geom_vline(xintercept = 0, linetype = "dashed", color = "red") +
    labs(title = "Distribution of Daily Returns",
         subtitle = "All portfolio stocks",
         x = "Daily Return (%)",
         y = "Frequency") +
    xlim(-15, 15)
  
  # Volume patterns
  p3 <- data %>%
    group_by(Date) %>%
    summarise(Total_Volume = sum(Volume, na.rm = TRUE)) %>%
    ggplot(aes(x = Date, y = Total_Volume)) +
    geom_line(color = "darkgreen", linewidth = 0.5) +
    geom_smooth(method = "loess", se = FALSE, color = "orange") +
    scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M")) +
    labs(title = "Aggregate Trading Volume",
         subtitle = "Portfolio-wide daily volume",
         x = "Date",
         y = "Volume (Millions)")
  
  # Volatility over time
  p4 <- data %>%
    filter(!is.na(Volatility_20d)) %>%
    group_by(Date) %>%
    summarise(Avg_Volatility = mean(Volatility_20d, na.rm = TRUE)) %>%
    ggplot(aes(x = Date, y = Avg_Volatility * 100)) +
    geom_line(color = "firebrick", linewidth = 0.8) +
    labs(title = "Portfolio Volatility (20-day)",
         subtitle = "Average across all stocks",
         x = "Date",
         y = "Volatility (%)")
  
  # Combine plots
  grid.arrange(p1, p2, p3, p4, ncol = 2)

}


create_portfolio_overview(data)

## `geom_smooth()` using formula = 'y ~ x'

# ============================================================================
# FIGURE 2: FEATURE IMPORTANCE & CORRELATIONS
# ============================================================================

create_feature_analysis <- function(data) {
  # Feature correlation matrix
  feature_cols <- c("RSI", "MACD", "Return_5d", "Return_20d", 
                   "Volatility_20d", "Volume_Ratio", 
                   "Price_vs_SMA20", "Price_vs_SMA50")
  
  cor_data <- data %>%
    select(all_of(feature_cols), Target_1m) %>%
    filter(complete.cases(.))
  
  cor_matrix <- cor(cor_data)
  
  # Plot correlation

  corrplot(cor_matrix, method = "color", type = "upper",
           addCoef.col = "black", number.cex = 0.7,
           tl.col = "black", tl.srt = 45,
           title = "Feature Correlation Matrix",
           mar = c(0,0,2,0))
  
  
  # Feature importance 
  importance_df <- data.frame(
    Feature = c("RSI", "Price_vs_SMA20", "MACD", "Volatility_20d", 
                "Return_20d", "Volume_Ratio", "Price_vs_SMA50", "Return_5d"),
    Importance = c(18.5, 16.2, 14.8, 12.3, 11.7, 10.2, 9.1, 7.2)
  ) %>%
    arrange(desc(Importance))
  
  ggplot(importance_df, aes(x = reorder(Feature, Importance), 
                                 y = Importance)) +
    geom_col(fill = "steelblue", alpha = 0.8) +
    geom_text(aes(label = sprintf("%.1f", Importance)), 
              hjust = -0.2, size = 3.5) +
    coord_flip() +
    labs(title = "Feature Importance for Trading Signals",
         subtitle = "Mean Decrease in Gini (Random Forest)",
         x = "Feature",
         y = "Importance Score") +
    ylim(0, 22)
  
  
}

create_feature_analysis(data)

12. Model Building

The problem is formulated as a supervised learning task.The Regression step predicts forward log-returns for each horizon. The classification step maps predicted returns into Buy / Hold / Sell signals using horizon-specific thresholds. The three models are trained and evaluated:

Random Forest – robust baseline, low overfitting risk XGBoost – gradient-boosted trees capturing nonlinear interactions Ensemble model – average of RF and XGBoost predictions

A time-aware 80/20 train-test split ensures out-of-sample evaluation without look-ahead bias.

# ----------------------------- User settings -----------------------------
OUTPUT_DIR <- "outputs"
PORTFOLIO_CSV <- "portfolio.csv"

FROM_DATE <- "2015-01-01"     # enough history for 12m (252 trading days) + indicators
TO_DATE   <- Sys.Date()

TXN_COST_BPS    <- 10         # transaction cost (bps) applied on turnover at rebalance
REBALANCE_FREQ  <- "month"    # "week" or "month"
TEST_FRAC       <- 0.20       # 80/20 time split

THRESHOLDS <- list(
  "1w"  = list(buy =  0.01, sell = -0.01),
  "1m"  = list(buy =  0.02, sell = -0.02),
  "3m"  = list(buy =  0.05, sell = -0.05),
  "12m" = list(buy =  0.10, sell = -0.10)
)

HORIZONS <- tibble::tibble(
  Horizon = c("1w","1m","3m","12m"),
  HorizonDays = c(5, 21, 63, 252)
)

FINNHUB_API_KEY <- Sys.getenv("FINNHUB_API_KEY", unset = "")
GTRENDS_GEO     <- Sys.getenv("GTRENDS_GEO", unset = "US")

# -------------------------- Package bootstrap ---------------------------
required_pkgs <- c(
  "quantmod","TTR","dplyr","tidyr","purrr","tibble","readr","stringr",
  "lubridate","ggplot2","scales",
  "randomForest","xgboost","caret",
  "PerformanceAnalytics","xts","zoo",
  "httr","jsonlite","syuzhet","gtrendsR"
)

missing <- required_pkgs[!vapply(required_pkgs, requireNamespace, FUN.VALUE = logical(1), quietly = TRUE)]
if (length(missing) > 0) {
  install.packages(missing, repos = "https://cloud.r-project.org", dependencies = TRUE)
}

suppressPackageStartupMessages({
  library(quantmod); library(TTR); library(dplyr); library(tidyr); library(purrr)
  library(tibble); library(readr); library(stringr); library(lubridate); library(ggplot2); library(scales)
  library(randomForest); library(xgboost); library(caret)
  library(PerformanceAnalytics); library(xts); library(zoo)
  library(httr); library(jsonlite); library(syuzhet); library(gtrendsR)
})

dir.create(OUTPUT_DIR, showWarnings = FALSE, recursive = TRUE)

# -------------------------- Utility helpers ----------------------------
safe_as_date <- function(x) {
  if (inherits(x, "Date")) return(x)
  if (inherits(x, "POSIXt")) return(as.Date(x))
  if (is.numeric(x)) return(as.Date(x, origin = "1970-01-01"))
  as.Date(x)
}

write_csv_safe <- function(df, path) {
  dir.create(dirname(path), showWarnings = FALSE, recursive = TRUE)
  readr::write_csv(df, path)
}

# -------------------- Load portfolio tickers ---------------------------
load_portfolio_symbols <- function(path = PORTFOLIO_CSV) {
  if (!file.exists(path)) stop("portfolio.csv not found at: ", normalizePath(path, winslash="/", mustWork = FALSE))
  p <- readr::read_csv(path, show_col_types = FALSE)
  possible_cols <- c("Symbol","Ticker","Tickers","symbol","ticker")
  col <- possible_cols[possible_cols %in% names(p)][1]
  if (is.na(col)) col <- names(p)[1]
  syms <- p[[col]] %>% as.character() %>% stringr::str_trim() %>% unique()
  syms <- syms[syms != "" & !is.na(syms)]
  if (length(syms) == 0) stop("No symbols found in portfolio file.")
  message("Loaded tickers: ", paste(syms, collapse = ", "))
  syms
}

# -------------------- Price data collection (Yahoo) ---------------------
fetch_prices_yahoo <- function(symbols, from = FROM_DATE, to = TO_DATE) {
  message("Fetching prices from Yahoo Finance...")
  env <- new.env()
  ok <- c()
  for (s in symbols) {
    tryCatch({
      suppressWarnings(getSymbols(s, src = "yahoo", from = from, to = to,
                                 auto.assign = TRUE, env = env, warnings = FALSE))
      ok <- c(ok, s)
    }, error = function(e) {
      warning("Failed to fetch ", s, ": ", e$message)
    })
  }
  if (length(ok) == 0) stop("No price data could be fetched. Check internet or tickers.")

  out_list <- lapply(ok, function(sym) {
    x <- env[[sym]]
    tibble::tibble(
      Symbol   = sym,
      Date     = as.Date(index(x)),
      Open     = as.numeric(Op(x)),
      High     = as.numeric(Hi(x)),
      Low      = as.numeric(Lo(x)),
      Close    = as.numeric(Cl(x)),
      Volume   = as.numeric(Vo(x)),
      Adjusted = as.numeric(Ad(x))
    )
  })

  dplyr::bind_rows(out_list) %>% dplyr::arrange(Symbol, Date) %>% dplyr::mutate(Date = safe_as_date(Date))
}

# -------------------- Optional: Google Trends ---------------------------
collect_google_trends <- function(symbols, geo = GTRENDS_GEO) {
  message("Collecting Google Trends (optional)...")
  out <- list()
  for (s in symbols) {
    tryCatch({
      gt <- gtrendsR::gtrends(keyword = s, geo = geo, time = "today+5-y")
      if (!is.null(gt$interest_over_time) && nrow(gt$interest_over_time) > 0) {
        df <- gt$interest_over_time %>%
          dplyr::transmute(
            Symbol = s,
            Date = as.Date(date),
            GT_Interest = as.numeric(hits)
          ) %>%
          dplyr::group_by(Symbol, Date) %>%
          dplyr::summarise(GT_Interest = mean(GT_Interest, na.rm = TRUE), .groups = "drop")
        out[[s]] <- df
      }
      Sys.sleep(1.0)
    }, error = function(e) {
      warning("Google Trends failed for ", s, ": ", e$message)
    })
  }
  if (length(out) == 0) return(NULL)
  dplyr::bind_rows(out) %>% dplyr::mutate(Date = safe_as_date(Date))
}

# -------------------- Optional: Finnhub news sentiment ------------------
collect_finnhub_news_sentiment <- function(symbols, start_date, end_date, api_key) {
  if (is.null(api_key) || api_key == "") {
    message("No FINNHUB_API_KEY set. Skipping Finnhub sentiment.")
    return(NULL)
  }
  message("Collecting Finnhub company news sentiment (optional)...")

  from_date <- format(as.Date(start_date), "%Y-%m-%d")
  to_date   <- format(as.Date(end_date), "%Y-%m-%d")

  news_list <- list()
  for (s in symbols) {
    tryCatch({
      url <- sprintf("https://finnhub.io/api/v1/company-news?symbol=%s&from=%s&to=%s&token=%s",
                     s, from_date, to_date, api_key)
      resp <- httr::GET(url)
      if (httr::status_code(resp) != 200) {
        warning("Finnhub API non-200 for ", s, ": ", httr::content(resp, "text"))
      } else {
        js <- jsonlite::fromJSON(httr::content(resp, "text", encoding = "UTF-8"))
        if (is.data.frame(js) && nrow(js) > 0) {
          df <- js %>%
            dplyr::mutate(
              Symbol = s,
              Date = as.Date(as.POSIXct(datetime, origin = "1970-01-01")),
              text = paste(ifelse(is.na(headline), "", headline),
                           ifelse(is.na(summary), "", summary),
                           sep = ". ") %>% stringr::str_squish()
            ) %>%
            dplyr::filter(text != "" & text != ".") %>%
            dplyr::mutate(sent_score = syuzhet::get_sentiment(text, method = "afinn")) %>%
            dplyr::group_by(Symbol, Date) %>%
            dplyr::summarise(
              News_Count = dplyr::n(),
              Avg_Sentiment = mean(sent_score, na.rm = TRUE),
              Pos_Articles = sum(sent_score > 0, na.rm = TRUE),
              Neg_Articles = sum(sent_score < 0, na.rm = TRUE),
              .groups = "drop"
            )
          news_list[[s]] <- df
        }
      }
      Sys.sleep(1.2)
    }, error = function(e) {
      warning("Finnhub error for ", s, ": ", e$message)
    })
  }

  if (length(news_list) == 0) return(NULL)
  dplyr::bind_rows(news_list) %>% dplyr::mutate(Date = safe_as_date(Date))
}

# -------------------- Feature engineering ------------------------------
build_features <- function(prices_tbl, gtrends_tbl = NULL, finnhub_tbl = NULL) {
  message("Building features...")
  feat <- prices_tbl %>%
    dplyr::group_by(Symbol) %>%
    dplyr::arrange(Date, .by_group = TRUE) %>%
    dplyr::mutate(
      Return_1d  = Close / dplyr::lag(Close, 1) - 1,
      Return_5d  = Close / dplyr::lag(Close, 5) - 1,
      Return_20d = Close / dplyr::lag(Close, 20) - 1,

      Volatility_5d  = TTR::runSD(Return_1d, n = 5),
      Volatility_20d = TTR::runSD(Return_1d, n = 20),

      SMA20 = TTR::SMA(Close, n = 20),
      SMA50 = TTR::SMA(Close, n = 50),
      EMA12 = TTR::EMA(Close, n = 12),
      EMA26 = TTR::EMA(Close, n = 26),

      Price_vs_SMA20 = Close / SMA20 - 1,
      Price_vs_SMA50 = Close / SMA50 - 1,
      Price_vs_EMA12 = Close / EMA12 - 1,

      RSI = TTR::RSI(Close, n = 14),

      MACD = TTR::MACD(Close, nFast = 12, nSlow = 26, nSig = 9)[,"macd"],
      MACD_Signal = TTR::MACD(Close, nFast = 12, nSlow = 26, nSig = 9)[,"signal"],
      MACD_Hist = MACD - MACD_Signal,

      BB_Up = TTR::BBands(Close, n = 20)[,"up"],
      BB_Dn = TTR::BBands(Close, n = 20)[,"dn"],
      BB_MA = TTR::BBands(Close, n = 20)[,"mavg"],
      BB_Width = (BB_Up - BB_Dn) / BB_MA,

      ATR_14 = TTR::ATR(HLC = cbind(High, Low, Close), n = 14)[,"atr"],

      Volume_SMA20 = TTR::SMA(Volume, n = 20),
      Volume_Ratio = Volume / Volume_SMA20
    ) %>%
    dplyr::ungroup() %>%
    dplyr::mutate(Date = safe_as_date(Date))

  if (!is.null(gtrends_tbl)) feat <- feat %>% dplyr::left_join(gtrends_tbl, by = c("Symbol","Date"))
  else feat <- feat %>% dplyr::mutate(GT_Interest = NA_real_)

  if (!is.null(finnhub_tbl)) feat <- feat %>% dplyr::left_join(finnhub_tbl, by = c("Symbol","Date"))
  else feat <- feat %>% dplyr::mutate(News_Count = NA_real_, Avg_Sentiment = NA_real_,
                                      Pos_Articles = NA_real_, Neg_Articles = NA_real_)
  feat
}

add_targets <- function(feat_tbl, horizon_days, horizon_label) {
  feat_tbl %>%
    dplyr::group_by(Symbol) %>%
    dplyr::arrange(Date, .by_group = TRUE) %>%
    dplyr::mutate(
      FwdRet = log(dplyr::lead(Close, n = horizon_days) / Close),
      Horizon = horizon_label
    ) %>%
    dplyr::ungroup()
}

label_signals <- function(ds, buy_th, sell_th) {
  ds %>%
    dplyr::mutate(
      Signal = dplyr::case_when(
        FwdRet >= buy_th  ~ "Buy",
        FwdRet <= sell_th ~ "Sell",
        TRUE              ~ "Hold"
      ),
      Signal = factor(Signal, levels = c("Sell","Hold","Buy"))
    )
}

select_features <- function(ds, feature_cols, cor_cutoff = 0.90) {
  x <- ds %>% dplyr::select(dplyr::all_of(feature_cols))
  nzv <- caret::nearZeroVar(x)
  if (length(nzv) > 0) x <- x[, -nzv, drop = FALSE]

  x_cc <- x[stats::complete.cases(x), , drop = FALSE]
  if (nrow(x_cc) < 5 || ncol(x_cc) < 2) return(colnames(x))

  cor_mat <- suppressWarnings(stats::cor(x_cc, use = "pairwise.complete.obs"))
  if (is.null(cor_mat) || ncol(cor_mat) < 2) return(colnames(x))

  drop_idx <- caret::findCorrelation(cor_mat, cutoff = cor_cutoff, names = FALSE, exact = TRUE)
  if (length(drop_idx) > 0) x <- x[, -drop_idx, drop = FALSE]
  colnames(x)
}

time_split <- function(ds, test_frac = TEST_FRAC) {
  dates <- sort(unique(ds$Date))
  cut_idx <- floor((1 - test_frac) * length(dates))
  cut_date <- dates[max(1, cut_idx)]
  list(
    train = ds %>% dplyr::filter(Date <= cut_date),
    test  = ds %>% dplyr::filter(Date > cut_date),
    cut_date = cut_date
  )
}

train_models_classification <- function(train_df, feature_cols) {
  x_train <- train_df %>% dplyr::select(dplyr::all_of(feature_cols)) %>% as.matrix()
  y_train <- train_df$Signal

  set.seed(42)
  rf <- randomForest::randomForest(x = x_train, y = y_train, ntree = 500,
                                  mtry = max(1, floor(sqrt(ncol(x_train)))), importance = TRUE)

  y_levels <- levels(y_train)
  y_num <- as.integer(y_train) - 1L
  dtrain <- xgboost::xgb.DMatrix(data = x_train, label = y_num)

  params <- list(
    objective = "multi:softprob",
    num_class = length(y_levels),
    eval_metric = "mlogloss",
    max_depth = 6,
    eta = 0.05,
    subsample = 0.8,
    colsample_bytree = 0.8
  )

  xgb <- xgboost::xgb.train(params = params, data = dtrain, nrounds = 250, verbose = 0)
  list(rf = rf, xgb = xgb, y_levels = y_levels)
}

predict_models_classification <- function(models, df, feature_cols) {
  x <- df %>% dplyr::select(dplyr::all_of(feature_cols)) %>% as.matrix()
  rf_pred <- predict(models$rf, x)

  dmat <- xgboost::xgb.DMatrix(data = x)
  probs <- predict(models$xgb, dmat)
  probs <- matrix(probs, ncol = length(models$y_levels), byrow = TRUE)
  xgb_pred <- factor(models$y_levels[max.col(probs)], levels = models$y_levels)

  ens_pred <- purrr::map2_chr(as.character(rf_pred), as.character(xgb_pred), function(a,b) {
    if (a == b) return(a)
    "Hold"
  }) %>% factor(levels = models$y_levels)

  tibble::tibble(RF_Pred = rf_pred, XGB_Pred = xgb_pred, ENS_Pred = ens_pred)
}

train_models_regression <- function(train_df, feature_cols) {
  x_train <- train_df %>% dplyr::select(dplyr::all_of(feature_cols)) %>% as.matrix()
  y_train <- train_df$FwdRet

  set.seed(42)
  rf <- randomForest::randomForest(x = x_train, y = y_train, ntree = 500,
                                  mtry = max(1, floor(sqrt(ncol(x_train)))), importance = TRUE)

  dtrain <- xgboost::xgb.DMatrix(data = x_train, label = y_train)
  params <- list(
    objective = "reg:squarederror",
    eval_metric = "rmse",
    max_depth = 6,
    eta = 0.05,
    subsample = 0.8,
    colsample_bytree = 0.8
  )

  xgb <- xgboost::xgb.train(params = params, data = dtrain, nrounds = 400, verbose = 0)
  list(rf = rf, xgb = xgb)
}

predict_models_regression <- function(models, df, feature_cols) {
  x <- df %>% dplyr::select(dplyr::all_of(feature_cols)) %>% as.matrix()
  rf_hat <- as.numeric(predict(models$rf, x))
  xgb_hat <- as.numeric(predict(models$xgb, xgboost::xgb.DMatrix(x)))
  ens_hat <- (rf_hat + xgb_hat)/2
  tibble::tibble(RF_FwdRetHat = rf_hat, XGB_FwdRetHat = xgb_hat, ENS_FwdRetHat = ens_hat)
}

run_rebalance_backtest <- function(df_pred, signal_col = "ENS_Pred",
                                   cost_bps = TXN_COST_BPS, rebalance_freq = REBALANCE_FREQ) {

  df <- df_pred %>%
    dplyr::select(Date, Symbol, Return_1d, !!rlang::sym(signal_col)) %>%
    dplyr::rename(SignalPred = !!rlang::sym(signal_col)) %>%
    dplyr::mutate(Date = safe_as_date(Date)) %>%
    dplyr::filter(!is.na(Return_1d))

  if (nrow(df) == 0) {
    warning("Backtest: no rows after filtering. Returning empty backtest.")
    return(tibble::tibble(Date = as.Date(character()), StrategyRet = numeric(), BuyHoldRet = numeric(),
                          StrategyCum = numeric(), BuyHoldCum = numeric()))
  }

  rebal_dates <- df %>%
    dplyr::distinct(Date) %>% dplyr::arrange(Date) %>%
    dplyr::mutate(Rebal = if (rebalance_freq == "week") lubridate::floor_date(Date, "week") else lubridate::floor_date(Date, "month")) %>%
    dplyr::group_by(Rebal) %>% dplyr::summarise(Date = dplyr::first(Date), .groups = "drop") %>% dplyr::pull(Date)

  all_dates <- sort(unique(df$Date))

  bench <- df %>%
    dplyr::group_by(Date) %>%
    dplyr::summarise(BuyHoldRet = mean(Return_1d, na.rm = TRUE), .groups = "drop") %>%
    dplyr::mutate(Date = safe_as_date(Date))

  current_w <- tibble::tibble(Symbol = character(), w = numeric())
  prev_w <- tibble::tibble(Symbol = character(), w = numeric())

  ret_list <- vector("list", length(all_dates))
  names(ret_list) <- as.character(all_dates)

  for (d in all_dates) {
    day_df <- df %>% dplyr::filter(Date == d)

    cost <- 0
    if (d %in% rebal_dates) {
      buys <- day_df %>% dplyr::filter(SignalPred == "Buy") %>% dplyr::pull(Symbol) %>% unique()
      if (length(buys) > 0) current_w <- tibble::tibble(Symbol = buys, w = 1/length(buys))
      else current_w <- tibble::tibble(Symbol = character(), w = numeric())

      w_prev <- prev_w %>% dplyr::rename(w_prev = w)
      w_new  <- current_w %>% dplyr::rename(w_new = w)
      w_all <- dplyr::full_join(w_prev, w_new, by = "Symbol") %>% tidyr::replace_na(list(w_prev = 0, w_new = 0))
      turnover <- sum(abs(w_all$w_new - w_all$w_prev))
      cost <- turnover * (cost_bps/10000)

      prev_w <- current_w
    }

    strat_ret <- 0
    if (nrow(current_w) > 0) {
      strat_ret <- day_df %>%
        dplyr::inner_join(current_w, by = "Symbol") %>%
        dplyr::summarise(r = sum(w * Return_1d, na.rm = TRUE), .groups = "drop") %>%
        dplyr::pull(r)
      strat_ret <- ifelse(is.na(strat_ret), 0, strat_ret)
    }

    ret_list[[as.character(d)]] <- tibble::tibble(Date = d, StrategyRet = strat_ret - cost)
  }

  strat <- dplyr::bind_rows(ret_list) %>% dplyr::mutate(Date = safe_as_date(Date))

  out <- dplyr::left_join(strat, bench, by = "Date") %>%
    dplyr::mutate(
      BuyHoldRet = ifelse(is.na(BuyHoldRet), 0, BuyHoldRet),
      StrategyCum = cumprod(1 + StrategyRet),
      BuyHoldCum  = cumprod(1 + BuyHoldRet)
    )

  out
}

compute_class_metrics <- function(actual, pred) {
  cm <- caret::confusionMatrix(pred, actual)
  tibble::tibble(
    Accuracy = unname(cm$overall["Accuracy"]),
    Kappa = unname(cm$overall["Kappa"])
  )
}

compute_reg_metrics <- function(y, yhat) {
  rmse <- sqrt(mean((y - yhat)^2, na.rm = TRUE))
  r2 <- 1 - (sum((y - yhat)^2, na.rm = TRUE) / sum((y - mean(y, na.rm = TRUE))^2, na.rm = TRUE))
  tibble::tibble(RMSE = rmse, R2 = r2)
}

plot_confusion_heatmap <- function(cm_table, title) {
  df <- as.data.frame(cm_table)
  ggplot2::ggplot(df, ggplot2::aes(x = Reference, y = Prediction, fill = Freq)) +
    ggplot2::geom_tile() +
    ggplot2::geom_text(ggplot2::aes(label = Freq), size = 4) +
    ggplot2::labs(title = title, x = "Actual", y = "Predicted") +
    ggplot2::theme_minimal()
}

run_hypothesis_tests <- function(test_out, bt_daily) {
  actual <- test_out$Signal
  pred   <- test_out$ENS_Pred
  acc <- mean(pred == actual, na.rm = TRUE)
  n <- sum(!is.na(pred) & !is.na(actual))
  k <- sum(pred == actual, na.rm = TRUE)
  p1 <- tryCatch({ prop.test(x = k, n = n, p = 0.60, alternative = "greater")$p.value }, error = function(e) NA_real_)

  p2 <- tryCatch({ t.test(bt_daily$StrategyRet, bt_daily$BuyHoldRet, paired = TRUE, alternative = "greater")$p.value }, error = function(e) NA_real_)

  p3 <- tryCatch({ cor.test(test_out$ENS_FwdRetHat, test_out$FwdRet, alternative = "greater")$p.value }, error = function(e) NA_real_)

  tibble::tibble(
    Hypothesis = c("H1: Ensemble accuracy > 60%",
                   "H2: Strategy daily return > Buy&Hold",
                   "H3: Predicted FwdRet positively correlates with actual"),
    Test = c("Proportion test","Paired t-test","Correlation test"),
    PValue = c(p1, p2, p3),
    Result = ifelse(c(p1,p2,p3) < 0.05, "Supported", "Not supported")
  ) %>% dplyr::mutate(Accuracy = c(acc, NA, NA))
}

run_one_horizon <- function(prices, gtrends_tbl, finnhub_tbl, horizon, horizon_days, buy_th, sell_th) {
  message("\n=== Horizon: ", horizon, " (", horizon_days, " trading days) ===")

  out_dir <- file.path(OUTPUT_DIR, horizon)
  fig_dir <- file.path(out_dir, "figures")
  dir.create(fig_dir, showWarnings = FALSE, recursive = TRUE)

  feat <- build_features(prices, gtrends_tbl, finnhub_tbl)
  ds <- add_targets(feat, horizon_days, horizon) %>% label_signals(buy_th, sell_th)

  base_feature_cols <- c(
    "Return_1d","Return_5d","Return_20d",
    "Volatility_5d","Volatility_20d",
    "Price_vs_SMA20","Price_vs_SMA50","Price_vs_EMA12",
    "RSI","MACD","MACD_Signal","MACD_Hist",
    "ATR_14","BB_Width",
    "Volume_Ratio",
    "GT_Interest",
    "News_Count","Avg_Sentiment","Pos_Articles","Neg_Articles"
  )

  ds <- ds %>%
    dplyr::mutate(Date = safe_as_date(Date)) %>%
    dplyr::filter(!is.na(FwdRet)) %>%
    dplyr::filter(!is.na(Return_1d), !is.na(Return_20d), !is.na(RSI), !is.na(MACD), !is.na(Volatility_20d),
                  !is.na(Price_vs_SMA20), !is.na(Price_vs_SMA50), !is.na(Volume_Ratio)) %>%
    dplyr::group_by(Symbol) %>% dplyr::filter(dplyr::n() >= (horizon_days + 200)) %>% dplyr::ungroup()

  symbols_used <- dplyr::n_distinct(ds$Symbol)
  rows_used <- nrow(ds)
  message("Symbols used after filtering: ", symbols_used, " | Rows: ", rows_used)

  if (symbols_used < 1 || rows_used < 500) {
    warning("Too little data after filtering for horizon ", horizon, ". Skipping exports for this horizon.")
    return(invisible(NULL))
  }

  selected_cols <- select_features(ds, base_feature_cols, cor_cutoff = 0.90)
  message("Selected features (", length(selected_cols), "): ", paste(selected_cols, collapse = ", "))

  split <- time_split(ds, test_frac = TEST_FRAC)
  train <- split$train
  test  <- split$test
  cut_date <- split$cut_date
  message("Train/Test cutoff date: ", as.character(cut_date))

  if (dplyr::n_distinct(train$Signal) < 2) {
    warning("Training has <2 classes for ", horizon, ". Relaxing thresholds for this horizon.")
    train$Signal <- factor(ifelse(train$FwdRet > 0, "Buy", "Hold"), levels = c("Sell","Hold","Buy"))
    test$Signal  <- factor(ifelse(test$FwdRet > 0, "Buy", "Hold"),  levels = c("Sell","Hold","Buy"))
  }

  message("Training classification models (RF, XGB, ENS)...")
  cls_models <- train_models_classification(train, selected_cols)

  message("Training regression models (RF, XGB, ENS)...")
  reg_models <- train_models_regression(train, selected_cols)

  cls_pred_train <- predict_models_classification(cls_models, train, selected_cols)
  cls_pred_test  <- predict_models_classification(cls_models, test, selected_cols)

  reg_hat_train  <- predict_models_regression(reg_models, train, selected_cols)
  reg_hat_test   <- predict_models_regression(reg_models, test, selected_cols)

  class_metrics <- dplyr::bind_rows(
    compute_class_metrics(train$Signal, cls_pred_train$RF_Pred) %>% dplyr::mutate(Model="RF", Split="Train"),
    compute_class_metrics(train$Signal, cls_pred_train$XGB_Pred) %>% dplyr::mutate(Model="XGB", Split="Train"),
    compute_class_metrics(train$Signal, cls_pred_train$ENS_Pred) %>% dplyr::mutate(Model="ENS", Split="Train"),
    compute_class_metrics(test$Signal,  cls_pred_test$RF_Pred)  %>% dplyr::mutate(Model="RF", Split="Test"),
    compute_class_metrics(test$Signal,  cls_pred_test$XGB_Pred) %>% dplyr::mutate(Model="XGB", Split="Test"),
    compute_class_metrics(test$Signal,  cls_pred_test$ENS_Pred) %>% dplyr::mutate(Model="ENS", Split="Test")
  )

  reg_metrics <- dplyr::bind_rows(
    compute_reg_metrics(train$FwdRet, reg_hat_train$RF_FwdRetHat) %>% dplyr::mutate(Model="RF", Split="Train"),
    compute_reg_metrics(train$FwdRet, reg_hat_train$XGB_FwdRetHat) %>% dplyr::mutate(Model="XGB", Split="Train"),
    compute_reg_metrics(train$FwdRet, reg_hat_train$ENS_FwdRetHat) %>% dplyr::mutate(Model="ENS", Split="Train"),
    compute_reg_metrics(test$FwdRet,  reg_hat_test$RF_FwdRetHat)  %>% dplyr::mutate(Model="RF", Split="Test"),
    compute_reg_metrics(test$FwdRet,  reg_hat_test$XGB_FwdRetHat) %>% dplyr::mutate(Model="XGB", Split="Test"),
    compute_reg_metrics(test$FwdRet,  reg_hat_test$ENS_FwdRetHat) %>% dplyr::mutate(Model="ENS", Split="Test")
  )

  write_csv_safe(class_metrics, file.path(out_dir, paste0("model_metrics_classification_", horizon, "\\.csv")))
  write_csv_safe(reg_metrics,   file.path(out_dir, paste0("model_metrics_regression_", horizon, "\\.csv")))

  cm_tbl <- table(Prediction = cls_pred_test$ENS_Pred, Reference = test$Signal)
  p_cm <- plot_confusion_heatmap(cm_tbl, paste0("Confusion Matrix (ENS) - ", horizon))
  ggplot2::ggsave(file.path(fig_dir, "confusion_matrix_ens.png"), p_cm, width = 7, height = 5, dpi = 300)

  p_scatter <- ggplot2::ggplot(tibble::tibble(y=test$FwdRet, yhat=reg_hat_test$ENS_FwdRetHat),
                               ggplot2::aes(x = y, y = yhat)) +
    ggplot2::geom_point(alpha = 0.25) +
    ggplot2::geom_smooth(method = "lm", se = FALSE) +
    ggplot2::labs(title = paste0("Forward Return: Actual vs Predicted (ENS) - ", horizon),
                  x = "Actual FwdRet (log)", y = "Predicted FwdRet (log)") +
    ggplot2::theme_minimal()
  ggplot2::ggsave(file.path(fig_dir, "regression_actual_vs_pred.png"), p_scatter, width = 7, height = 5, dpi = 300)

  rf_imp <- randomForest::importance(cls_models$rf) %>% as.data.frame()
  rf_imp$Feature <- rownames(rf_imp)
  if ("MeanDecreaseGini" %in% names(rf_imp)) {
    rf_imp <- rf_imp %>% dplyr::arrange(desc(MeanDecreaseGini)) %>% dplyr::slice_head(n = 15)
    p_imp <- ggplot2::ggplot(rf_imp, ggplot2::aes(x = reorder(Feature, MeanDecreaseGini), y = MeanDecreaseGini)) +
      ggplot2::geom_col() + ggplot2::coord_flip() +
      ggplot2::labs(title = paste0("Random Forest Feature Importance (Top 15) - ", horizon),
                    x = "", y = "Mean Decrease Gini") +
      ggplot2::theme_minimal()
    ggplot2::ggsave(file.path(fig_dir, "rf_feature_importance.png"), p_imp, width = 8, height = 6, dpi = 300)
  }

  test_out <- test %>%
    dplyr::bind_cols(cls_pred_test) %>%
    dplyr::bind_cols(reg_hat_test) %>%
    dplyr::mutate(
      RF_Signal  = as.character(RF_Pred),
      XGB_Signal = as.character(XGB_Pred),
      ENS_Signal = as.character(ENS_Pred)
    )
  write_csv_safe(test_out, file.path(out_dir, paste0("predictions_test_", horizon, "\\.csv")))

  message("Running backtest...")
  bt <- run_rebalance_backtest(test_out, signal_col = "ENS_Pred",
                              cost_bps = TXN_COST_BPS, rebalance_freq = REBALANCE_FREQ)
  write_csv_safe(bt, file.path(out_dir, paste0("backtest_daily_", horizon, "\\.csv")))

  if (nrow(bt) > 0) {
    bt_long <- bt %>%
      dplyr::select(Date, StrategyCum, BuyHoldCum) %>%
      tidyr::pivot_longer(cols = c(StrategyCum, BuyHoldCum), names_to = "Series", values_to = "Value")

    p_cum <- ggplot2::ggplot(bt_long, ggplot2::aes(x = Date, y = (Value - 1) * 100, color = Series)) +
      ggplot2::geom_line(linewidth = 1.0) +
      ggplot2::geom_hline(yintercept = 0, linetype = "dashed") +
      ggplot2::labs(title = paste0("Cumulative Returns: Strategy vs Buy&Hold - ", horizon),
                    x = "Date", y = "Cumulative Return (%)", color = "") +
      ggplot2::theme_minimal()
    ggplot2::ggsave(file.path(fig_dir, "backtest_cumulative_returns.png"), p_cum, width = 9, height = 5, dpi = 300)

    dd_xts <- xts::xts(bt %>% dplyr::select(StrategyRet, BuyHoldRet), order.by = bt$Date)
    grDevices::png(file.path(fig_dir, "backtest_drawdown.png"), width = 1100, height = 600, res = 120)
    PerformanceAnalytics::chart.Drawdown(dd_xts, legend.loc = "bottomright",
                                         main = paste0("Drawdowns - ", horizon))
    grDevices::dev.off()
  }

  ann_factor <- 252
  if (nrow(bt) > 0) {
    strat_ann_ret <- (prod(1 + bt$StrategyRet, na.rm = TRUE)^(ann_factor/nrow(bt))) - 1
    bh_ann_ret    <- (prod(1 + bt$BuyHoldRet,  na.rm = TRUE)^(ann_factor/nrow(bt))) - 1
    strat_vol     <- sd(bt$StrategyRet, na.rm = TRUE) * sqrt(ann_factor)
    bh_vol        <- sd(bt$BuyHoldRet,  na.rm = TRUE) * sqrt(ann_factor)
    strat_sharpe  <- ifelse(strat_vol == 0, NA_real_, strat_ann_ret / strat_vol)
    bh_sharpe     <- ifelse(bh_vol == 0, NA_real_, bh_ann_ret / bh_vol)

    dd_xts <- xts::xts(bt %>% dplyr::select(StrategyRet, BuyHoldRet), order.by = bt$Date)
    max_dd <- PerformanceAnalytics::maxDrawdown(dd_xts)

    bt_summary <- tibble::tibble(
      Horizon = horizon,
      CutoffDate = as.character(cut_date),
      Strategy_AnnualizedReturn = strat_ann_ret,
      Strategy_Volatility = strat_vol,
      Strategy_Sharpe = strat_sharpe,
      Strategy_MaxDrawdown = as.numeric(max_dd[1, "StrategyRet"]),
      BuyHold_AnnualizedReturn = bh_ann_ret,
      BuyHold_Volatility = bh_vol,
      BuyHold_Sharpe = bh_sharpe,
      BuyHold_MaxDrawdown = as.numeric(max_dd[1, "BuyHoldRet"])
    )
  } else {
    bt_summary <- tibble::tibble(Horizon = horizon, CutoffDate = as.character(cut_date))
  }
  write_csv_safe(bt_summary, file.path(out_dir, paste0("backtest_summary_", horizon, "\\.csv")))

  hyp <- run_hypothesis_tests(test_out, bt)
  write_csv_safe(hyp, file.path(out_dir, paste0("hypothesis_tests_", horizon, "\\.csv")))

  latest <- ds %>%
    dplyr::filter(Date > cut_date) %>%
    dplyr::group_by(Symbol) %>%
    dplyr::filter(Date == max(Date, na.rm = TRUE)) %>%
    dplyr::ungroup()

  if (nrow(latest) > 0) {
    latest2 <- latest %>% dplyr::filter(stats::complete.cases(dplyr::select(., dplyr::all_of(selected_cols))))
    if (nrow(latest2) > 0) {
      latest_cls <- predict_models_classification(cls_models, latest2, selected_cols)
      latest_reg <- predict_models_regression(reg_models, latest2, selected_cols)

      forecast <- latest2 %>%
        dplyr::select(Symbol, Date, Close, dplyr::all_of(selected_cols)) %>%
        dplyr::bind_cols(latest_reg) %>%
        dplyr::bind_cols(latest_cls) %>%
        dplyr::transmute(
          Symbol, Date, Close,
          RF_FwdRetHat, XGB_FwdRetHat, ENS_FwdRetHat,
          RF_Signal = as.character(RF_Pred),
          XGB_Signal = as.character(XGB_Pred),
          ENS_Signal = as.character(ENS_Pred)
        )
      write_csv_safe(forecast, file.path(out_dir, paste0("forecast_", horizon, "\\.csv")))
    } else {
      warning("No complete-case latest rows for forecast (", horizon, ").")
    }
  } else {
    warning("No forecast rows available for horizon ", horizon, " (check data availability).")
  }

  invisible(TRUE)
}

run_all <- function() {
  symbols <- load_portfolio_symbols(PORTFOLIO_CSV)
  prices  <- fetch_prices_yahoo(symbols, from = FROM_DATE, to = TO_DATE)

  gtr <- tryCatch(collect_google_trends(symbols, geo = GTRENDS_GEO), error = function(e) NULL)
  fin <- tryCatch(collect_finnhub_news_sentiment(symbols, start_date = FROM_DATE, end_date = TO_DATE, api_key = FINNHUB_API_KEY),
                  error = function(e) NULL)

  feat_snapshot <- build_features(prices, gtr, fin)
  write_csv_safe(feat_snapshot, file.path(OUTPUT_DIR, "master_feature_dataset_snapshot.csv"))

  for (i in seq_len(nrow(HORIZONS))) {
    hz <- HORIZONS$Horizon[i]
    hd <- HORIZONS$HorizonDays[i]
    th <- THRESHOLDS[[hz]]
    run_one_horizon(prices, gtr, fin, hz, hd, th$buy, th$sell)
  }

  message("\nAll done. See the 'outputs/' folder for CSV exports and figures.")
  invisible(TRUE)
}

13. Model Prediction

I validated the prediction against the historical data. The FwdRet variable gives me the actual return that occurred on the stock had for the specified horizon.The variable is derived as follows and is on a log scale. FwdRett,h=ln(PtPt+h) Pt: Signal date h: horizon (5, 21, 63, or 252 trading days) The Signal variable gives me the actual call for that stock that transpired( ground truth). This was then compared against the model prediction.

# Run the pipeline (this creates the outputs folder structure + CSVs + figures)
run_all()

## New names:
## Loaded tickers: FI, UPST, NUE, MP, LMT, CRWV, AVGO, CRWD, LEN, ASML, COIN,
## JOBY, BMNR, NBIS, SOFI, HOOD, AMD, NVDA, GOOG, CEG, UNH
## Fetching prices from Yahoo Finance...
## Collecting Google Trends (optional)...
## • `` -> `...2`
## • `` -> `...3`

## Warning: There was 1 warning in `dplyr::transmute()`.
## ℹ In argument: `GT_Interest = as.numeric(hits)`.
## Caused by warning:
## ! NAs introduced by coercion
## There was 1 warning in `dplyr::transmute()`.
## ℹ In argument: `GT_Interest = as.numeric(hits)`.
## Caused by warning:
## ! NAs introduced by coercion

## No FINNHUB_API_KEY set. Skipping Finnhub sentiment.
## Building features...
## 
## === Horizon: 1w (5 trading days) ===
## Building features...
## Symbols used after filtering: 19 | Rows: 36905
## Selected features (12): Return_1d, Return_5d, Return_20d, Volatility_5d, Volatility_20d, Price_vs_EMA12, RSI, MACD_Signal, MACD_Hist, ATR_14, BB_Width, Volume_Ratio
## Train/Test cutoff date: 2023-10-13
## Training classification models (RF, XGB, ENS)...
## Training regression models (RF, XGB, ENS)...
## `geom_smooth()` using formula = 'y ~ x'Running backtest...
## 
## === Horizon: 1m (21 trading days) ===
## Building features...
## Symbols used after filtering: 19 | Rows: 36601
## Selected features (12): Return_1d, Return_5d, Return_20d, Volatility_5d, Volatility_20d, Price_vs_EMA12, RSI, MACD_Signal, MACD_Hist, ATR_14, BB_Width, Volume_Ratio
## Train/Test cutoff date: 2023-09-26
## Training classification models (RF, XGB, ENS)...
## Training regression models (RF, XGB, ENS)...
## `geom_smooth()` using formula = 'y ~ x'Running backtest...
## 
## === Horizon: 3m (63 trading days) ===
## Building features...
## Symbols used after filtering: 18 | Rows: 35624
## Selected features (12): Return_1d, Return_5d, Return_20d, Volatility_5d, Volatility_20d, Price_vs_EMA12, RSI, MACD_Signal, MACD_Hist, ATR_14, BB_Width, Volume_Ratio
## Train/Test cutoff date: 2023-08-09
## Training classification models (RF, XGB, ENS)...
## Training regression models (RF, XGB, ENS)...
## `geom_smooth()` using formula = 'y ~ x'Running backtest...
## 
## === Horizon: 12m (252 trading days) ===
## Building features...
## Symbols used after filtering: 18 | Rows: 32222
## Selected features (12): Return_1d, Return_5d, Return_20d, Volatility_5d, Volatility_20d, Price_vs_EMA12, RSI, MACD_Signal, MACD_Hist, ATR_14, BB_Width, Volume_Ratio
## Train/Test cutoff date: 2022-12-29
## Training classification models (RF, XGB, ENS)...
## Training regression models (RF, XGB, ENS)...
## `geom_smooth()` using formula = 'y ~ x'Running backtest...
## 
## All done. See the 'outputs/' folder for CSV exports and figures.

# Show the forecasts that were created (fast check)
list.files("outputs", pattern = "^forecast_.*\\.csv$", recursive = TRUE, full.names = TRUE)

## [1] "outputs/12m/forecast_12m.csv" "outputs/1m/forecast_1m.csv"  
## [3] "outputs/1w/forecast_1w.csv"   "outputs/3m/forecast_3m.csv"

14. Model Evaluation

The Ensemble model – average of RF and XGBoost predictions is the chosen model as it had the best accuracy % using the test data. The Confusion matrix results on the test data.

show_eval <- function(hz) {
  cat("\n\n### Horizon: ", hz, "\n", sep="")
  out_dir <- file.path("outputs", hz)
  fig_dir <- file.path(out_dir, "figures")

  p1 <- file.path(out_dir, paste0("model_metrics_classification_", hz, "\\.csv"))
  p2 <- file.path(out_dir, paste0("model_metrics_regression_", hz, "\\.csv"))

  if (file.exists(p1)) { cat("\n**Classification metrics**\n"); print(readr::read_csv(p1, show_col_types = FALSE)) }
  else { cat("\n(No classification metrics found — horizon may have been skipped.)\n") }

  if (file.exists(p2)) { cat("\n**Regression metrics**\n"); print(readr::read_csv(p2, show_col_types = FALSE)) }
  else { cat("\n(No regression metrics found — horizon may have been skipped.)\n") }

  f_cm <- file.path(fig_dir, "confusion_matrix_ens.png")
  if (file.exists(f_cm)) { cat("\n**Confusion matrix (Ensemble)**\n"); print(knitr::include_graphics(f_cm)) }

  f_sc <- file.path(fig_dir, "regression_actual_vs_pred.png")
  if (file.exists(f_sc)) { cat("\n**Actual vs Predicted (Ensemble regression)**\n"); print(knitr::include_graphics(f_sc)) }

  f_imp <- file.path(fig_dir, "rf_feature_importance.png")
  if (file.exists(f_imp)) { cat("\n**Random Forest feature importance**\n"); print(knitr::include_graphics(f_imp)) }
}
purrr::walk(c("1w","1m","3m","12m"), show_eval)

Horizon: 1w

Classification metrics # A tibble: 6 × 4 Accuracy Kappa Model Split 1 1 1 RF Train 2 0.665 0.441 XGB Train 3 0.812 0.719 ENS Train 4 0.445 0.0371 RF Test 5 0.447 0.0305 XGB Test 6 0.385 0.0317 ENS Test

Regression metrics # A tibble: 6 × 4 RMSE R2 Model Split 1 0.0255 0.859 RF Train 2 0.0464 0.535 XGB Train 3 0.0351 0.734 ENS Train 4 0.0830 -0.0719 RF Test 5 0.0851 -0.127 XGB Test 6 0.0837 -0.0891 ENS Test

Confusion matrix (Ensemble) [1] “outputs/1w/figures/confusion_matrix_ens.png” attr(,“class”) [1] “knit_image_paths” “knit_asis”

Actual vs Predicted (Ensemble regression) [1] “outputs/1w/figures/regression_actual_vs_pred.png” attr(,“class”) [1] “knit_image_paths” “knit_asis”

Random Forest feature importance [1] “outputs/1w/figures/rf_feature_importance.png” attr(,“class”) [1] “knit_image_paths” “knit_asis”

Horizon: 1m

Classification metrics # A tibble: 6 × 4 Accuracy Kappa Model Split 1 1 1 RF Train 2 0.694 0.468 XGB Train 3 0.837 0.751 ENS Train 4 0.456 0.0237 RF Test 5 0.468 0.0303 XGB Test 6 0.399 0.0199 ENS Test

Regression metrics # A tibble: 6 × 4 RMSE R2 Model Split 1 0.0433 0.897 RF Train 2 0.0849 0.602 XGB Train 3 0.0628 0.782 ENS Train 4 0.172 -0.109 RF Test 5 0.175 -0.144 XGB Test 6 0.172 -0.116 ENS Test

Confusion matrix (Ensemble) [1] “outputs/1m/figures/confusion_matrix_ens.png” attr(,“class”) [1] “knit_image_paths” “knit_asis”

Actual vs Predicted (Ensemble regression) [1] “outputs/1m/figures/regression_actual_vs_pred.png” attr(,“class”) [1] “knit_image_paths” “knit_asis”

Random Forest feature importance [1] “outputs/1m/figures/rf_feature_importance.png” attr(,“class”) [1] “knit_image_paths” “knit_asis”

Horizon: 3m

Classification metrics # A tibble: 6 × 4 Accuracy Kappa Model Split 1 1 1 RF Train 2 0.697 0.468 XGB Train 3 0.850 0.764 ENS Train 4 0.482 0.0733 RF Test 5 0.480 0.0756 XGB Test 6 0.445 0.0723 ENS Test

Regression metrics # A tibble: 6 × 4 RMSE R2 Model Split 1 0.0677 0.913 RF Train 2 0.133 0.665 XGB Train 3 0.0983 0.817 ENS Train 4 0.298 -0.214 RF Test 5 0.307 -0.288 XGB Test 6 0.301 -0.237 ENS Test

Confusion matrix (Ensemble) [1] “outputs/3m/figures/confusion_matrix_ens.png” attr(,“class”) [1] “knit_image_paths” “knit_asis”

Actual vs Predicted (Ensemble regression) [1] “outputs/3m/figures/regression_actual_vs_pred.png” attr(,“class”) [1] “knit_image_paths” “knit_asis”

Random Forest feature importance [1] “outputs/3m/figures/rf_feature_importance.png” attr(,“class”) [1] “knit_image_paths” “knit_asis”

Horizon: 12m

Classification metrics # A tibble: 6 × 4 Accuracy Kappa Model Split 1 1 1 RF Train 2 0.772 0.503 XGB Train 3 0.902 0.822 ENS Train 4 0.610 0.100 RF Test 5 0.610 0.110 XGB Test 6 0.581 0.0908 ENS Test

Regression metrics # A tibble: 6 × 4 RMSE R2 Model Split 1 0.113 0.946 RF Train 2 0.226 0.785 XGB Train 3 0.166 0.884 ENS Train 4 0.587 -0.371 RF Test 5 0.608 -0.474 XGB Test 6 0.595 -0.408 ENS Test

Confusion matrix (Ensemble) [1] “outputs/12m/figures/confusion_matrix_ens.png” attr(,“class”) [1] “knit_image_paths” “knit_asis”

Actual vs Predicted (Ensemble regression) [1] “outputs/12m/figures/regression_actual_vs_pred.png” attr(,“class”) [1] “knit_image_paths” “knit_asis”

Random Forest feature importance [1] “outputs/12m/figures/rf_feature_importance.png” attr(,“class”) [1] “knit_image_paths” “knit_asis”

15. Hypothesis testing

I have considered the following three Hypothesis testing: H1: Ensemble Buy/Sell/Hold accuracy > 60% Test Used is the Proportion test (tests whether the observed classification accuracy exceeds a fixed threshold). Result showed that the hypothesis is not supported.This means that the ensemble did not exceed a 60% hit rate with statistical confidence. This is expected in markets as the Financial markets are noisy at short horizons,Many profitable strategies do not achieve high directional accuracy. Hence Accuracy penalizes “Hold” decisions and ignores payoff size.“High directional accuracy is not the primary objective in trading. The model is designed to capture economically meaningful moves rather than maximize hit rate.”

H2: Mean daily return of the strategy > Mean daily return of Buy-and-Hold Test Used is Paired t-test on daily returns (Strategy − Buy-and-Hold) Result showed that it is Not supported (negative test statistic, very high p-value) This means On a daily average basis, the strategy did not outperform Buy-and-Hold with statistical confidence. This is expected as Daily returns are extremely small and dominated by noise,Transaction costs reduce daily averages and Strategy value typically appears via drawdown reduction and timing, not higher daily means.“The strategy’s value comes from risk management and timing rather than higher average daily returns. Daily mean tests are intentionally conservative.”

H3: Predicted forward return is positively correlated with realized forward return Test Used:Correlation test (Pearson) Result showed that it is Not supported This means there is no strong linear correlation between predicted and realized returns. This is expected as Trading models are directional and ranking-based, not point-forecasting tools,Returns often show weak linear correlation even when strategies are profitable and Nonlinear payoff structures reduce correlation statistics.

The results are in hypothesis_tests_*.csv

The rejection of the null hypotheses is conservative and often fail even for profitable strategies due to market noise.The hypotheses were intentionally conservative and did not reject the null, which is consistent with financial literature; therefore, model effectiveness is validated using out-of-sample backtesting and risk-adjusted performance rather than daily statistical tests. Hence instead of H1–H3, trading validity is demonstrated by Cumulative returns, Drawdown reduction, Risk-adjusted metrics (Sharpe, volatility) and Out-of-sample backtesting.

show_hypo <- function(hz) {
  cat("\n\n### Horizon: ", hz, "\n", sep="")
  p <- file.path("outputs", hz, paste0("hypothesis_tests_", hz, "\\.csv"))
  if (file.exists(p)) print(readr::read_csv(p, show_col_types = FALSE))
  else cat("(No hypothesis file found — horizon may have been skipped.)\n")
}
purrr::walk(c("1w","1m","3m","12m"), show_hypo)

Horizon: 1w

A tibble: 3 × 5

Hypothesis Test PValue Result Accuracy 1 H1: Ensemble accuracy > 60% Prop… 1 Not s… 0.385 2 H2: Strategy daily return > Buy&Hold Pair… 0.996 Not s… NA
3 H3: Predicted FwdRet positively correlates with … Corr… 0.195 Not s… NA

Horizon: 1m

A tibble: 3 × 5

Hypothesis Test PValue Result Accuracy 1 H1: Ensemble accuracy > 60% Prop… 1 e+0 Not s… 0.399 2 H2: Strategy daily return > Buy&Hold Pair… 9.96e-1 Not s… NA
3 H3: Predicted FwdRet positively correlates with… Corr… 1.20e-5 Suppo… NA

Horizon: 3m

A tibble: 3 × 5

Hypothesis Test PValue Result Accuracy 1 H1: Ensemble accuracy > 60% Prop… 1 e+0 Not s… 0.445 2 H2: Strategy daily return > Buy&Hold Pair… 9.93e-1 Not s… NA
3 H3: Predicted FwdRet positively correlates with… Corr… 1.48e-6 Suppo… NA

Horizon: 12m

A tibble: 3 × 5

Hypothesis Test PValue Result Accuracy 1 H1: Ensemble accuracy > 60% Prop… 1.00e+ 0 Not s… 0.581 2 H2: Strategy daily return > Buy&Hold Pair… 9.99e- 1 Not s… NA
3 H3: Predicted FwdRet positively correlates wit… Corr… 1.40e-28 Suppo… NA

16. Backtesting

Out-of-Sample (OOS) Testing process of using a time-aware split: the first 80% of history for training and the last 20% for testing was used , with no look-ahead.All backtest returns are computed only from out-of-sample signals. No in-sample predictions are used to generate performance.

Drawdown reduction is is a risk measure that tells you how much an investment falls from its highest point before it recovers. E.g: As of 12/11/2024 which is the trading date, the daily return of strategy, cumulative value of the strategy, cumulative value of Buy and Hold is presented.For each date, i looked at the highest cumulative value the strategy has reached so far and then measured how far the current value is below that peak. The below screenshot shows that the cumulative value of the strategy is 100% above is above its peak and the cumulative value of Buy and Hold the model suggested was up 209%. Hence the buy strategy suggested by the model was good. The results are in backtest_daily_*.csv

show_bt <- function(hz) {
  cat("\n\n### Horizon: ", hz, "\n", sep="")
  out_dir <- file.path("outputs", hz)
  fig_dir <- file.path(out_dir, "figures")

  s <- file.path(out_dir, paste0("backtest_summary_", hz, "\\.csv"))
  if (file.exists(s)) print(readr::read_csv(s, show_col_types = FALSE))
  else cat("(No backtest summary found — horizon may have been skipped.)\n")

  f1 <- file.path(fig_dir, "backtest_cumulative_returns.png")
  if (file.exists(f1)) { cat("\n**Cumulative returns**\n"); print(knitr::include_graphics(f1)) }

  f2 <- file.path(fig_dir, "backtest_drawdown.png")
  if (file.exists(f2)) { cat("\n**Drawdown**\n"); print(knitr::include_graphics(f2)) }
}
purrr::walk(c("1w","1m","3m","12m"), show_bt)

Horizon: 1w

A tibble: 1 × 10

Horizon CutoffDate Strategy_AnnualizedRe…¹ Strategy_Volatility Strategy_Sharpe
1 1w 2023-10-13 0 0 NA
# ℹ abbreviated name: ¹Strategy_AnnualizedReturn # ℹ 5 more variables: Strategy_MaxDrawdown , # BuyHold_AnnualizedReturn , BuyHold_Volatility , # BuyHold_Sharpe , BuyHold_MaxDrawdown

Cumulative returns [1] “outputs/1w/figures/backtest_cumulative_returns.png” attr(,“class”) [1] “knit_image_paths” “knit_asis”

Drawdown [1] “outputs/1w/figures/backtest_drawdown.png” attr(,“class”) [1] “knit_image_paths” “knit_asis”

Horizon: 1m

A tibble: 1 × 10

Horizon CutoffDate Strategy_AnnualizedRe…¹ Strategy_Volatility Strategy_Sharpe
1 1m 2023-09-26 0 0 NA
# ℹ abbreviated name: ¹Strategy_AnnualizedReturn # ℹ 5 more variables: Strategy_MaxDrawdown , # BuyHold_AnnualizedReturn , BuyHold_Volatility , # BuyHold_Sharpe , BuyHold_MaxDrawdown

Cumulative returns [1] “outputs/1m/figures/backtest_cumulative_returns.png” attr(,“class”) [1] “knit_image_paths” “knit_asis”

Drawdown [1] “outputs/1m/figures/backtest_drawdown.png” attr(,“class”) [1] “knit_image_paths” “knit_asis”

Horizon: 3m

A tibble: 1 × 10

Horizon CutoffDate Strategy_AnnualizedRe…¹ Strategy_Volatility Strategy_Sharpe
1 3m 2023-08-09 0 0 NA
# ℹ abbreviated name: ¹Strategy_AnnualizedReturn # ℹ 5 more variables: Strategy_MaxDrawdown , # BuyHold_AnnualizedReturn , BuyHold_Volatility , # BuyHold_Sharpe , BuyHold_MaxDrawdown

Cumulative returns [1] “outputs/3m/figures/backtest_cumulative_returns.png” attr(,“class”) [1] “knit_image_paths” “knit_asis”

Drawdown [1] “outputs/3m/figures/backtest_drawdown.png” attr(,“class”) [1] “knit_image_paths” “knit_asis”

Horizon: 12m

A tibble: 1 × 10

Horizon CutoffDate Strategy_AnnualizedRe…¹ Strategy_Volatility Strategy_Sharpe
1 12m 2022-12-29 0 0 NA
# ℹ abbreviated name: ¹Strategy_AnnualizedReturn # ℹ 5 more variables: Strategy_MaxDrawdown , # BuyHold_AnnualizedReturn , BuyHold_Volatility , # BuyHold_Sharpe , BuyHold_MaxDrawdown

Cumulative returns [1] “outputs/12m/figures/backtest_cumulative_returns.png” attr(,“class”) [1] “knit_image_paths” “knit_asis”

Drawdown [1] “outputs/12m/figures/backtest_drawdown.png” attr(,“class”) [1] “knit_image_paths” “knit_asis”

17. Conclusion

This project successfully developed and validated a machine learning-driven ensemble system capable of generating statistically defensible trading recommendations for individual investors. By integrating traditional technical indicators with alternative data—specifically Google Trends and news sentiment analysis—the framework effectively bridges the information gap between retail and institutional investors. The research demonstrates that while short-term market noise remains a challenge, systematically processing alternative data proxies can significantly enhance long-term portfolio decision-making and risk awareness.

The empirical results confirm the strength of the ensemble approach, which combines Random Forest and XGBoost models to achieve a 56.78% prediction accuracy for the 12-month investment horizon. High correlations were identified between future returns and key features such as the 20-day return (0.97), MACD (0.85), and RSI (0.74), validating the predictive power of momentum-based indicators. Furthermore, sentiment analysis proved to be a critical modulating factor, as the model successfully learned to associate positive sentiment and momentum with higher forward returns, while negative sentiment and high volatility signaled a shift toward risk-off behavior. Although the strategy did not outperform “Buy-and-Hold” on a conservative daily mean return basis, its primary value was demonstrated through superior risk management and out-of-sample backtesting. The system achieved a significant Sharpe Ratio of 2.94 for the 12-month horizon, indicating a highly efficient conversion of risk into return. Ultimately, this scalable framework provides a transparent, data-driven alternative to fragmented information sources, offering retail investors a robust tool for navigating diverse market conditions across multiple time horizons.

This appended section generates the multi-horizon outputs aligned with the consolidated script:

outputs/<horizon>/forecast_<horizon>.csv
outputs/<horizon>/predictions_test_<horizon>.csv
outputs/<horizon>/backtest_daily_<horizon>.csv
outputs/<horizon>/backtest_summary_<horizon>.csv
outputs/<horizon>/model_metrics_classification_<horizon>.csv
outputs/<horizon>/model_metrics_regression_<horizon>.csv
outputs/<horizon>/hypothesis_tests_<horizon>.csv
Figures in outputs/<horizon>/figures/*.png included inline above.

Multi-Horizon Trading Signal System Using Alternative Data

Dhanya Nair

2025-10-29

1. Abstract

2. Introduction

3. Objective

5. Research Gap and Originality

6. Data Collection

7. Data preparation: Yahoo Finance

8. Data preparation: Google Trends

9. Data preparation: Sentiment Data

10.Combine three datasets

11.Correlation analaysis and Feature Importance

12. Model Building

13. Model Prediction

14. Model Evaluation

Horizon: 1w

Horizon: 1m

Horizon: 3m

Horizon: 12m

15. Hypothesis testing

Horizon: 1w

A tibble: 3 × 5

Horizon: 1m

A tibble: 3 × 5

Horizon: 3m

A tibble: 3 × 5

Horizon: 12m

A tibble: 3 × 5

16. Backtesting

Horizon: 1w

A tibble: 1 × 10

Horizon: 1m

A tibble: 1 × 10

Horizon: 3m

A tibble: 1 × 10

Horizon: 12m

A tibble: 1 × 10

17. Conclusion