Introduction:

The general reason why I wanted to do this study was to figure out how to best invest my money. With so much going on in the economy as well as news stories that we hear about social security benefits potentially decreasing in 2030’s as well as notably in France where they increase the retirement age. I personally would like to know how to best invest my money.

What I am going to do below, is gather the most popular index funds on the market and compare them to each other to see which one is performing the best. The hope and future of this project is to maybe either trade stocks automatically based on trend data or even create my own portfolio and optimize it to be the best investment.

Libraries Needed:

library(rvest)
library(tidyverse)
library(stats)
library(lubridate)
library(forecast)
library(tseries)
library(zoo)

Data Collecting:

The way I decided to pick the index funds that are being used for comparison is using the website: https://www.marketwatch.com/tools/top-25-mutual-funds. With this I expect to pull each symbol of the 25 index funds and run them through Yahoo Finance to get the past 5 years of market data for each Index fund. It should be noted that during this data transformation I got rid of the Index funds that ended with -XX as those funds are money market funds. General information, Money Market funds are mutual funds that invest in short term debt securities to typically maintain the value of a $1. Long story short, it allows the investor to place their money in something that will hold the value of 1 dollar “no matter what”. So I got rid of it since the ROI will just be 0 as it won’t change.

results_df <- data.frame()

# Loop through each row
for (i in 1:25) {
  
  # Construct the CSS selector
  indicator <- paste0("tr:nth-child(", i, ") > td:nth-child(2) > a")
  
  # Extract the text from the webpage
  data <- read_html("https://www.marketwatch.com/tools/top-25-mutual-funds") %>% 
    html_nodes(indicator) %>% 
    html_text()
  
  # Add the data to the dataframe
  results_df <- rbind(results_df, data)
}

names(results_df) <- c("X", "Y")
Top25 <- (results_df[,-2])

# Find indices of elements ending with "XX"
xx_indices <- grep("XX$", Top25)

# Remove elements ending with "XX"
Top25 <- Top25[-xx_indices]

# Print updated vector
print(Top25)
##  [1] "VSMPX" "VFIAX" "FXAIX" "VTSAX" "VGTSX" "VIIIX" "VFFSX" "VTBNX" "VTBIX"
## [10] "AGTHX" "VINIX" "VBTLX" "ABALX"
# Create a list to hold the data frames
df_list <- list()

# Loop through each CSV file
for (file1 in Top25) {
  # Construct URL for current CSV file
  url <- paste0("https://raw.githubusercontent.com/Jlok17/Data-Science-Projects/main/Portfolio%20Optimization/CSV%20Files/", file1, ".csv")
  
  # Read CSV file into data frame
  df <- read_csv(url)
  
  # Add data frame to the list
  df_list[[file1]] <- df
}

Data Transformation:

The purpose of this data transformation is to convert everything into a monthly period under one data frame with each index fund apart of this. The brief steps are to group by month and then take the min and max price of each month so for the future we can also incorporate volatility. As well as taking the open and close prices of each month so we can see how each index fund performs on a monthly basis if wanted.

names(df_list)
##  [1] "VSMPX" "VFIAX" "FXAIX" "VTSAX" "VGTSX" "VIIIX" "VFFSX" "VTBNX" "VTBIX"
## [10] "AGTHX" "VINIX" "VBTLX" "ABALX"
str(df_list[['ABALX']])
## spc_tbl_ [1,258 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Date     : Date[1:1258], format: "2018-05-09" "2018-05-10" ...
##  $ Open     : num [1:1258] 27 27.2 27.2 27.2 27.1 ...
##  $ High     : num [1:1258] 27 27.2 27.2 27.2 27.1 ...
##  $ Low      : num [1:1258] 27 27.2 27.2 27.2 27.1 ...
##  $ Close    : num [1:1258] 27 27.2 27.2 27.2 27.1 ...
##  $ Adj Close: num [1:1258] 21.9 22 22 22 21.9 ...
##  $ Volume   : num [1:1258] 0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Date = col_date(format = ""),
##   ..   Open = col_double(),
##   ..   High = col_double(),
##   ..   Low = col_double(),
##   ..   Close = col_double(),
##   ..   `Adj Close` = col_double(),
##   ..   Volume = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
summary_df <- data.frame(stock = character(),
                         year = numeric(),
                         month = character(),
                         min_price = numeric(),
                         max_price = numeric(),
                         open_price = numeric(),
                         close_price = numeric())

str(df_list$VSMPX$Date)
##  Date[1:1258], format: "2018-05-09" "2018-05-10" "2018-05-11" "2018-05-14" "2018-05-15" ...
for (i in seq_along(df_list)) {
  stock_name <- names(df_list)[i]
  stock_df <- df_list[[i]]
  monthly_summary <- stock_df %>%
    mutate(date = as.Date(Date)) %>%
    group_by(month = format(date, "%b %Y")) %>%
    summarise(min_price = min(as.numeric(`Adj Close`)),
              max_price = max(as.numeric(`Adj Close`)),
              open_price = as.numeric(Open[1]),
              close_price = as.numeric(tail(Close, n = 1))) %>%
    mutate(stock = stock_name)
  summary_df <- rbind(summary_df, monthly_summary)
 } 

summary_df$month <- as.Date(paste0(summary_df$month, "-01"), format="%b %Y-%d")
summary_df$month <- as.yearmon(summary_df$month, "%b %Y")
str(summary_df)
## tibble [793 × 6] (S3: tbl_df/tbl/data.frame)
##  $ month      : 'yearmon' num [1:793] Apr 2019 Apr 2020 Apr 2021 Apr 2022 ...
##  $ min_price  : num [1:793] 125 107 185 186 183 ...
##  $ max_price  : num [1:793] 129 129 194 207 188 ...
##  $ open_price : num [1:793] 134 112 191 209 187 ...
##  $ close_price: num [1:793] 137 133 198 189 188 ...
##  $ stock      : chr [1:793] "VSMPX" "VSMPX" "VSMPX" "VSMPX" ...
Figure 1:

Here the general graph of each index fund that was gathered.

ggplot(summary_df, aes(x = month, y = close_price, group = stock, color = stock)) +
  geom_line() +
  labs(x = "Month", y = "Close Price", color = "Stock")

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Return on Investment:

The big calculation I did to compare which Index funds did the best is ROI for the period. ROI is Return on Investment as seen below VFFSX and VFIAX had the best performance at around 0.532 as VTBNX had the worst ROI at -0.07446. So quick thing, Return on Investment is just the percentage of how much you got back after you invested so if I invested 1 dollar into the VFFSX, I would have gotten 1.532 dollars back for a roughly 53% profit. Overall, you would like ROI to be as high possible but for most investment you just want them to outpace inflation which let’s assume it was 5% the last 5 years. As calculated below it would be 27.63% which is lower than the ROI of the best Index fund that calculated.

subset_summary_df <- summary_df %>% filter(month == "May 2018"| month == "May 2023")
print(subset_summary_df, n= 26)
## # A tibble: 26 × 6
##    month     min_price max_price open_price close_price stock
##    <yearmon>     <dbl>     <dbl>      <dbl>       <dbl> <chr>
##  1 May 2018     117.      118.       127.        128.   VSMPX
##  2 May 2023     183.      188.       188.        187.   VSMPX
##  3 May 2018     228.      232.       249.        251.   VFIAX
##  4 May 2023     375.      385.       385.        382.   VFIAX
##  5 May 2018      85.8      87.1       94.4        94.8  FXAIX
##  6 May 2023     141.      145.       145.        144.   FXAIX
##  7 May 2018      62.2      63.2       67.7        68.2  VTSAX
##  8 May 2023      97.8     100.       100.         99.7  VTSAX
##  9 May 2018      15.5      16.1       18.3        17.9  VGTSX
## 10 May 2023      17.8      18.1       18.0        18.1  VGTSX
## 11 May 2018     206.      210.       246.        247.   VIIIX
## 12 May 2023     339.      348.       348.        346.   VIIIX
## 13 May 2018     121.      123.       132.        133.   VFFSX
## 14 May 2023     199.      204.       204.        203.   VFFSX
## 15 May 2018       9.03      9.18      10.3        10.4  VTBNX
## 16 May 2023       9.54      9.66       9.54        9.57 VTBNX
## 17 May 2018       9.07      9.22      10.3        10.4  VTBIX
## 18 May 2023       9.54      9.66       9.54        9.57 VTBIX
## 19 May 2018      37.3      37.9       52.9        53.2  AGTHX
## 20 May 2023      54.3      55.4       55.4        55.3  AGTHX
## 21 May 2018     206.      210.       246.        247.   VINIX
## 22 May 2023     339.      348.       348.        346.   VINIX
## 23 May 2018       9.12      9.27      10.4        10.4  VBTLX
## 24 May 2023       9.65      9.78       9.65        9.68 VBTLX
## 25 May 2018      21.9      22.1       27.0        27.1  ABALX
## 26 May 2023      29.4      29.8       29.8        29.6  ABALX
calculate_ROI <- function(df) {
  df %>%
    group_by(stock) %>%
    summarize(ROI = (last(close_price) - first(open_price))/first(open_price)) %>%
    arrange(desc(ROI))
}

calculate_ROI(subset_summary_df)
## # A tibble: 13 × 2
##    stock      ROI
##    <chr>    <dbl>
##  1 VFFSX  0.532  
##  2 VFIAX  0.532  
##  3 FXAIX  0.523  
##  4 VSMPX  0.473  
##  5 VTSAX  0.473  
##  6 VINIX  0.404  
##  7 VIIIX  0.404  
##  8 ABALX  0.0962 
##  9 AGTHX  0.0467 
## 10 VGTSX -0.00874
## 11 VBTLX -0.0665 
## 12 VTBIX -0.0745 
## 13 VTBNX -0.0745
# 5% the last 5 year Inflation calculation
#Total inflation = (1 + inflation rate)^number of years - 1
#Total inflation = (1 + 0.05)^5 - 1
#Total inflation = 1.27628 - 1
#Total inflation = 0.27628 or 27.63%

Prediction of Top/Worst performing Index funds:

As calculated from the ROI, the Top 5 performing index funds were: VFFSX, VFIAX, FXAIX, VSMPX, VTSAV. The Worst preforming one was VTBNX. Now I will be taking the Top one VFFSX and the Lowest one VTBNX and create a time series and hopefully a 200 day unit prediction forecast.

#Top 5 ROI
VFFSX <- df_list[["VFFSX"]]
VFIAX <- df_list[["VFIAX"]]
FXAIX <- df_list[["FXAIX"]]
VSMPX <- df_list[["VSMPX"]]
VTSAV <- df_list[["VTSAV"]]
#Worst ROI
VTBNX <- df_list[["VTBNX"]]


VFFSX$Date <- as.Date(VFFSX$Date)
VFIAX$Date <- as.Date(VFIAX$Date)
FXAIX$Date <- as.Date(FXAIX$Date)
VSMPX$Date <- as.Date(VSMPX$Date)
VTSAV$Date <- as.Date(VTSAV$Date)


VTBNX$Date <- as.Date(VTBNX$Date)


head(VFFSX)
## # A tibble: 6 × 7
##   Date        Open  High   Low Close `Adj Close` Volume
##   <date>     <dbl> <dbl> <dbl> <dbl>       <dbl>  <dbl>
## 1 2018-05-09  132.  132.  132.  132.        121.      0
## 2 2018-05-10  133.  133.  133.  133.        122.      0
## 3 2018-05-11  134.  134.  134.  134.        122.      0
## 4 2018-05-14  134.  134.  134.  134.        123.      0
## 5 2018-05-15  133.  133.  133.  133.        122.      0
## 6 2018-05-16  134.  134.  134.  134.        122.      0
#In case you were wondering why I used 252 instead of 365 for Frequency, the data that is collected here is only on trading days. Therefore, the data will only coincide with how many trading days are in a year which would exclude weekends and bank holidays
mts_VFFSX <- ts(VFFSX$`Adj Close`, start = 0, frequency = 252)


str(mts_VFFSX)
##  Time-Series [1:1258] from 0 to 4.99: 121 122 122 123 122 ...
plot(mts_VFFSX, xlab = "Years",
     ylab = "Price in $",
     main = "VFFSX Price over the last 5 years",
     col.main = "darkgreen")

?ts()

mts_VFFSX %>%
  stl(t.window=13, s.window="periodic", robust=TRUE) %>%
  autoplot()

mts_stl <- mts_VFFSX %>%
  stl(t.window=13, s.window="periodic", robust=TRUE)

mts_stl %>% seasadj() %>% naive() %>%
  autoplot() + ylab("Price") +
  ggtitle("Naive forecasts of seasonally adjusted data")

naive_fcst <- mts_stl %>% seasadj() %>% naive(h=200)
autoplot(naive_fcst) + ylab("Price") +
  ggtitle("Naive forecasts of seasonally adjusted data")

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
mts_VTBNX <- ts(VTBNX$`Adj Close`, start = 0, frequency = 252)

plot(mts_VTBNX, xlab = "Years",
     ylab = "Price in $",
     main = "VTBNX Price over the last 5 years",
     col.main = "darkgreen")

mts_stl_1 <- mts_VTBNX %>%
  stl(t.window=13, s.window="periodic", robust=TRUE) %>%
  seasadj()

autoplot(mts_stl_1) + 
  ylab("Price") +
  ggtitle("Seasonal decomposition of data")

naive_fcst_1 <- mts_stl_1 %>% naive(h=200)
autoplot(naive_fcst_1) + 
  ylab("Price") +
  ggtitle("Naive forecasts of seasonally adjusted data")

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#Alternative Method

# Fit a linear model using lm()
linear_model <- lm(`Adj Close`~Date, data = VFFSX)

# Predict the continuation of the linear model
continuation <- predict(linear_model, interval = 'confidence')

# Create a new vector of Adj Close values with predicted Date values
predicted_data <- data.frame(`Adj Close` = VFFSX$`Adj Close`, Date = continuation[,1])

# Plot the data
plot(VFFSX$Date, VFFSX$`Adj Close`, main = "VFFSX Price over the last 5 years", 
     xlab = "Year", ylab = "Price at Close")

# Add the linear regression line to the plot
abline(linear_model)

# Add the predicted values to the plot
lines(predicted_data$Date, predicted_data$`Adj Close`, col = "red", lwd = 2)

Issues/Future Prospects:

I ran into a lot of issues with webscraping social media platforms and forums for my sentiment analysis portion. Where I wanted to get the most talked about stocks over the past 3 years on twitter and then funnel them into the above methodology of determining their ROI values, Month to Month. Some fixable issues that I ran into in this code was during the data transformation as the column names isn’t just “adj_close and was ‘Adj Close’ which made my for loop fail a bunch of time unable to obtain that data. Other issues was with the MTS plot which made is so the date value was in numeric instead of a date was fixed not with as.Date() but with turning into ‘%M %Y’ format, adding a ‘-01’ day value and reconverting with the as.yearmon() in zoo.

This is only a first version as I plan to continue this for the next couple of years as a side school project. Future prospects will be to have a more refined prediction model/plots. Then from the data collecting portion, I intend to hopefully attach an API or have a better continuous method of gathering the most up to date information. That being said it will require more resources to determine where I would like to determine which outlet would be good to gather trending stocks in either direction. Lastly, if possible I would like to hook this up to an automatically trader API do simulate trades and pull back if there is any profit/loss margins real time.

Conclusion:

We were able to find that VFFSX, VFIAX, FXAIX, VSMPX, VTSAV are all good options for investing money in to beat inflation during these past 5 years. Overall, I am aware this is a basic breakdown of just following the S&P 500 as the index funds that have followed it has general increased about or slightly varied to how much the S&P Grew these past years.