Introduction

This project uses a trading data API to obtain historical stock price data and report on stock metrics and performance. The focus of this analysis is to:

  • Import and tidy the stock trading API data for a stock of choice
  • Visualize trends and historical prices of the selected stock
  • Investigate Google Trends interest over time in the selected stock
  • Perform sentiment analysis on recent news articles about the selected stock
  • Create a time series forecasting model that uses historical stock prices to predict future trends
  • Develop a Shiny application that condenses this information into a single dashboard for any selected stock

Libraries Used

The following libraries were used for data clean-up, interacting with APIs, translating JSON to a datatable, text mining, web scraping, visualization, and forecasting.

#data cleanup
library(dplyr)
library(tidyr)
library(magrittr)
library(stringr)
library(lubridate)
library(data.table)

#API and json
library(httr)
library(jsonlite)
library(config)

#Web Scraping
library(rvest)

#Visualization
library(plotly)
library(ggplot2)
library(DT)

#Data
library(bea.R)
library(devtools)
library(gtrendsR)

#Text Analysis
library(tidytext)
library(wordcloud)
library(RColorBrewer)

#Forecasting
library(quantmod)
library(forecast)
library(tseries)
library(prophet)

Import Historical Stock Data

To obtain historical stock data, the World Trading Data API is used. Any stock can be selected for analysis by changing the ticker saved to the ‘symbol’ variable; for this project, Amazon (AMZN) is used.

symbol <- 'AMZN'

World Trading Data API

The World Trading Data API URL takes as inputs the selected symbol, the requested date range, and a custom API token. To obtain a unique API token, please visit the World Trading Data API Documentation.
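
The API keys used throughout this report are read with the config package, which looks for a config.yml file in the working directory. A minimal sketch of that file, assuming the key names referenced in the code below (the values are placeholders):

default:
  stock_apikey: "YOUR_WORLD_TRADING_DATA_TOKEN"
  news_apikey: "YOUR_NEWS_API_KEY"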

config <- config::get()
date_from <- today() - years(5) #use a period (years) so the result remains a Date

URL <- paste0("https://www.worldtradingdata.com/api/v1/history?symbol=",symbol,
              "&sort=newest&date_from=",date_from,
              "&api_token=",config$stock_apikey)
results <- GET(url = URL)
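
If the request fails (for example, due to an invalid token or an unknown ticker), parsing the body produces confusing errors downstream, so it is worth failing fast on a bad HTTP status:

#halt early if the request was not successful
stop_for_status(results)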

To extract the content from the response, the jsonlite package is used. The resulting data frame has one row and thousands of columns, so pre-processing is required.

content <- content(results, "text")
content %<>%
  fromJSON(flatten = TRUE) %>% #Flatten
  as.data.frame() #Make dataframe

#Number of columns
ncol(content)
## [1] 6291

Stock Data Pre-Processing

To tidy the stock data, all price and volume fields are gathered into a long data table. The date and metric are then extracted from each column name with regular expressions, unneeded columns are dropped, and data types are corrected.
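
A glance at the first few column names shows why this reshaping is needed: apart from name, each column encodes both a date and a metric in its name (output omitted; the exact names depend on the dates returned by the API):

#inspect the wide format: column names encode date and metric
head(names(content))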

#gather
stock <- gather(content, "time","value",2:ncol(content)) 
stock$value <- as.numeric(stock$value)

#extract the date and metric into a new field
stock$date <- as_date(str_extract(string = stock$time, pattern = "\\d{4}.\\d{2}.\\d{2}"))
stock$metric <- str_extract(string = stock$time, pattern = "open|close|high|low|volume")

#drop the unneeded column and spread the metrics back into columns
stock %<>%
  select(c(name, date, metric, value)) %>%
  spread(metric, value)

datatable(stock)
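
Before visualizing, a quick sanity check that the reshape produced one row per trading day, with the expectation that a clean pull contains no missing prices:

#sanity checks on the tidied data
nrow(stock) #one row per trading day
sum(is.na(stock)) #expect 0 for a clean pull
range(stock$date) #should span the requested five years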

Visualization

Now that the stock dataset is tidied, a visualization of the price and volume time series can be created. For greater interactivity, plotly is used for this visualization.

p1 <- stock %>%
  plot_ly(x = ~date,
          type = "candlestick", 
          open = ~open, 
          close = ~close, 
          high = ~high,
          low = ~low,
          name = "price") %>%
  layout(
    xaxis = list(
      rangeselector = list(
        buttons = list(
          list(
            count = 1,
            label = "1 mo",
            step = "week",
            stepmode = "backward"),
          list(
            count = 3,
            label = "3 mo",
            step = "month",
            stepmode = "backward"),
          list(
            count = 6,
            label = "6 mo",
            step = "month",
            stepmode = "backward"),
          list(
            count = 1,
            label = "1 yr",
            step = "year",
            stepmode = "backward"),
          list(
            count = 3,
            label = "3 yr",
            step = "year",
            stepmode = "backward"),
          list(step = "all"))),
      rangeslider = list(visible = FALSE)),
         yaxis = list(title = "Price ($)",
                      showgrid = TRUE,
                      showticklabels = TRUE))
p2 <- stock %>%
  plot_ly(x=~date, y=~volume, type='bar', name = "Volume") %>%
  layout(yaxis = list(title = "Volume"))

p <- subplot(p1, p2, heights = c(0.7,0.3), nrows=2,
             shareX = TRUE, titleY = TRUE) %>%
  layout(title = paste0(symbol))
p
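
Because plotly charts are HTML widgets, the combined figure can also be saved as a standalone page for sharing; a minimal sketch using htmlwidgets (a plotly dependency):

#save the interactive chart as a standalone HTML file
htmlwidgets::saveWidget(p, paste0(symbol, "_prices.html"))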

Recent Stock News

News articles provide excellent insight into the performance of a stock. The next step in this report is to import recent news articles about the selected company and perform sentiment analysis on them.

Google News API

To obtain news article data, the Google News API is used. The request URL takes as inputs the company name (scraped from MarketWatch using the selected symbol), a date range covering the last 30 days of articles, and a custom API key; articles are sorted by relevance and 100 are requested. To obtain a unique API key, please visit the Google News API Documentation. Once the request succeeds, jsonlite is used to transform the JSON response into a data frame.

##get company name using web-scraping
url_overview = paste0("https://www.marketwatch.com/investing/stock/",symbol,"/profile")
var_overview = read_html(url_overview)
company <-  var_overview %>% 
  html_nodes('#instrumentname') %>%
  html_text() %>%
  as.character()

#news API Query
url_news = paste0("https://newsapi.org/v2/everything?q=",
                  str_replace_all(company,pattern = " ", replacement = "%20"),
                  "&from=",today()-ddays(30), #last 30 days
                  "&sortBy=relevance&pageSize=100&language=en&apiKey=",config$news_apikey)

#API json to datatable
results <- GET(url = url_news)
news <- content(results, "text")
news %<>%
  fromJSON(flatten = TRUE) %>% #flatten
  as.data.frame() %>% #make dataframe
  select(c(articles.title, articles.description, articles.content, articles.publishedAt))
datatable(news)

News Sentiment Analysis

Now that 100 recent news articles about the selected stock are available, text mining and sentiment analysis can be performed on them.

Word Cloud

The first step is to unnest each word in the article description, allowing for a ‘bag of words’ sentiment analysis approach. For a quick visualization of the most frequently used words, a word cloud is created.

news_words <- news %>%
  select(c("articles.title","articles.description", "articles.content", "articles.publishedAt")) %>%
  unnest_tokens(word, articles.description) %>%
  filter(!word %in% append(stop_words$word, values = "chars"), str_detect(word, "^[a-z']+$"))
news_words$date = as_date(news_words$articles.publishedAt)

words_only <- news_words %>%
  count(word, sort =TRUE)

set.seed(1)
wordcloud(words = words_only$word, freq = words_only$n, scale=c(5,.5), max.words=50, colors=brewer.pal(8, "Dark2"))
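
As a complement to the word cloud, a bar chart of the most frequent words is sometimes easier to read; a quick sketch using the counts computed above:

#bar chart of the 10 most frequent words (words_only is already sorted by n)
head(words_only, 10) %>%
  ggplot(aes(reorder(word, n), n)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "frequency", title = paste0(symbol, ": Most Frequent Words in Recent News"))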

News Sentiment over Time

To perform basic sentiment analysis, the AFINN sentiment lexicon is used. This lexicon assigns each word a score from -5 (most negative) to 5 (most positive). To view news sentiment about the selected company over the past month, the dataset is grouped by article and date, and the score is summarised by its mean for each group.

afinn <- get_sentiments("afinn")
#newer releases of the lexicon name the score column 'value'; standardize to 'score'
if ("value" %in% names(afinn)) afinn <- rename(afinn, score = value)

sentiment_summary <- news_words %>%
  left_join(afinn) %>%
  filter(!is.na(score)) %>%
  group_by(articles.title, date) %>%
  summarise(score = mean(score)) %>%
  mutate(sentiment = ifelse(score>0, "positive","negative")) 

datatable(sentiment_summary)
ggplot(sentiment_summary, aes(date, score)) + geom_bar(stat = "identity", aes(fill=sentiment))  + ggtitle(paste0(symbol, ": News Sentiment Over Time")) 
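
A one-number summary of the overall tone over the period can also be useful, for example the share of scored articles that are positive:

#share of scored articles with positive sentiment
mean(sentiment_summary$sentiment == "positive")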

Time Series Forecasting

In the previous steps, various factors such as news sentiment and Google Trends interest were analyzed. In this step, the Prophet library is used to forecast future prices for the selected stock.

Using Prophet

Prophet is open-source software created by Facebook for forecasting time series data. For more information, please visit the Prophet API Documentation. In this next step, the data is pre-processed to fit Prophet's requirements and a prediction is created (accounting for the regular gaps in stock data on weekends). The forecast output contains the date, the forecasted close price, and the lower and upper bounds of an 80% confidence interval.

#pre-processing
df <- stock %>%
  select(c("date","close")) %>%
  rename(ds = date, y = close)

#predictions
m <- prophet(df)
future <- make_future_dataframe(m, periods = 365) %>% filter(!wday(ds) %in% c(1,7)) #account for regular gaps on weekends
forecast <- predict(m, future) 
datatable(forecast[c('ds','yhat','yhat_lower','yhat_upper')])
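
Prophet can also decompose its forecast into trend and seasonal components, which helps explain what drives the predicted prices; its built-in component plot:

#decompose the forecast into trend and seasonality components
prophet_plot_components(m, forecast)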

Prophet Forecast Results

The results of the time series forecasting are plotted below. For most stocks, Prophet appears to capture the overall trend in close prices but fails to forecast sharp price changes.

plot(m, forecast, xlabel = "date", ylabel = "stock close price ($)") + ggtitle(paste0(symbol, ": Stock Price Prediction"))

Forecast Evaluation

To further evaluate the forecast results, a residual plot is created. Based on the residual plot, it is evident that this forecast does not capture all of the variability in stock prices over time.

forecast$ds <- as_date(forecast$ds)

residuals <- df %>% 
  left_join(forecast[c('ds','yhat','yhat_lower','yhat_upper')], by = "ds") %>%
  filter(ds < today()) %>%
  mutate(res = (y-yhat))
datatable(residuals)
ggplot(residuals, aes(ds, res)) + geom_point() + geom_hline(yintercept =0, color = "red") + labs(title ="Prophet Forecasting Residuals", x = "date", y = "residual") 
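
For a more rigorous evaluation, Prophet ships with a simulated historical forecasts utility that refits the model at rolling cutoffs and measures out-of-sample error. A sketch (the initial, period, and horizon values are illustrative choices, not tuned):

#simulated historical forecasts at rolling cutoffs
df_cv <- cross_validation(m, initial = 730, period = 180, horizon = 365, units = "days")
performance_metrics(df_cv)
plot_cross_validation_metric(df_cv, metric = "rmse")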

Next steps

  • Refactor this code into a function (inputs: symbol and API keys)
  • Create interactive Shiny application
  • Integrate relational or NoSQL database
  • Multiple regression using various economic indicators (maybe use bea.R package data)
  • More advanced sentiment analysis of news articles
  • Other time series forecasting methods, such as ARIMA (see the sketch below)
  • Tinker with Prophet forecasting (maybe add Test-train-split?)
  • Compare the forecast to the stock index
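
As a starting point for the ARIMA item above, a minimal sketch using the already-loaded forecast package on the tidied close prices (the 30-day horizon is an arbitrary illustrative choice):

#fit an ARIMA model to the close-price series, ordered oldest to newest
close_ts <- ts(stock$close[order(stock$date)])
fit <- auto.arima(close_ts)
#namespaced call, since 'forecast' was reused above as a data frame name
autoplot(forecast::forecast(fit, h = 30)) + ggtitle(paste0(symbol, ": ARIMA Forecast"))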