This project uses a trading data API to obtain historical stock price data and report on stock metrics and performance. The focus of this analysis is to:
- retrieve and tidy five years of historical price and volume data for a selected stock
- visualize the price and volume time series
- examine Google search interest and news sentiment for the selected company
- forecast future close prices using Prophet
The following libraries were used for data clean-up, interacting with APIs, translating JSON to a datatable, text mining, web scraping, visualization, and forecasting.
#data cleanup
library(dplyr)
library(tidyr)
library(magrittr)
library(stringr)
library(lubridate)
library(data.table)
#API and json
library(httr)
library(jsonlite)
library(config)
#Web Scraping
library(rvest)
#Visualization
library(plotly)
library(ggplot2)
library(DT)
#Data
library(bea.R)
library(devtools)
library(gtrendsR)
#Text Analysis
library(tidytext)
library(wordcloud)
library(RColorBrewer)
#Forecasting
library(quantmod)
library(forecast)
library(tseries)
library(prophet)
To obtain historical stock data, the World Trading Data API is used. Any stock can be selected for analysis by changing the ticker saved to the ‘symbol’ variable. For the purpose of this project, Amazon (AMZN) is used.
symbol <- 'AMZN'
The World Trading Data API URL takes the selected symbol, the requested date range, and a custom API token as inputs. To obtain a unique API token, please visit the World Trading Data API Documentation.
config <- config::get()
date_from = today()-dyears(5)
URL <- paste0("https://www.worldtradingdata.com/api/v1/history?symbol=",symbol,
"&sort=newest&date_from=",date_from,
"&api_token=",config$stock_apikey)
results <- GET(url = URL)
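Before parsing the response, it is worth confirming that the request actually succeeded. The following is a minimal, optional check using httr's status helpers (not part of the original workflow):
#optional: stop early if the API request failed
if (http_error(results)) {
  stop("World Trading Data request failed with status ", status_code(results))
}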
To extract the content from the results, the jsonlite package is used. The flattened output is a single row with thousands of columns (one per date and metric), so data pre-processing is required.
content <- content(results, "text")
content %<>%
fromJSON(flatten = TRUE) %>% #Flatten
as.data.frame() #Make dataframe
#Number of columns
ncol(content)
## [1] 6291
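Each flattened column name combines a date with a metric (open, close, high, low, or volume), which is what the regular expressions below rely on. As an optional check, the first few column names can be inspected:
#optional: inspect how the flattened columns are named
head(names(content))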
To tidy the stock data, all price and volume fields are gathered into a long datatable. Regular expressions are then used to extract the date and metric from each column name, unneeded columns are dropped, and data types are converted.
#gather
stock <- gather(content, "time","value",2:ncol(content))
stock$value <- as.numeric(stock$value)
#extract the date and metric into a new field
stock$date <- as_date(str_extract(string = stock$time, pattern = "\\d{4}.\\d{2}.\\d{2}"))
stock$metric <- str_extract(string = stock$time, pattern = "open|close|high|low|volume")
#exclude the unnecessary column and spread metric columns
stock %<>%
select(c(name, date, metric, value)) %>%
spread(metric, value)
datatable(stock)
Now that the stock dataset is tidied, a visualization of the price and volume time series can be created. For greater interactivity, plotly is used for this visualization.
p1 <- stock %>%
plot_ly(x = ~date,
type = "candlestick",
open = ~open,
close = ~close,
high = ~high,
low = ~low,
name = "price") %>%
layout(
xaxis = list(
rangeselector = list(
buttons = list(
list(
count = 1,
label = "1 mo",
step = "week",
stepmode = "backward"),
list(
count = 3,
label = "3 mo",
step = "month",
stepmode = "backward"),
list(
count = 6,
label = "6 mo",
step = "month",
stepmode = "backward"),
list(
count = 1,
label = "1 yr",
step = "year",
stepmode = "backward"),
list(
count = 3,
label = "3 yr",
step = "year",
stepmode = "backward"),
list(step = "all"))),
rangeslider = list(visible = FALSE)),
yaxis = list(title = "Price ($)",
showgrid = TRUE,
showticklabels = TRUE))
p2 <- stock %>%
plot_ly(x=~date, y=~volume, type='bar', name = "Volume") %>%
layout(yaxis = list(title = "Volume"))
p <- subplot(p1, p2, heights = c(0.7,0.3), nrows=2,
shareX = TRUE, titleY = TRUE) %>%
layout(title = paste0(symbol))
p
Economic indicators are excellent predictors of future stock price, but what about Google search interest in the stock? Using the gtrendsR package, the following code queries US search interest for the selected stock over the previous five years and creates a visualization using plotly.
trends <- gtrends(keyword = symbol, geo = "US", onlyInterest = TRUE)
trends <- trends$interest_over_time %>%
as_data_frame() %>%
select(c(date, hits, keyword))
trends$date <- as_date(ceiling_date(trends$date, unit = "weeks", change_on_boundary = NULL,
week_start = getOption("lubridate.week.start", 1)))
trends %>%
plot_ly(x=~date, y=~hits, type = 'scatter', mode = 'lines', name = "Google Search Trends") %>%
layout(title = paste0("Interest over Time: ",symbol), yaxis = list(title = "hits"))
Using the Google Trends dataset, it is now possible to view the relationship between interest over time (‘hits’) and stock performance. To do this, a left join combines the trend and stock data by date. The result of the join is then used to plot the relationship between hits and the stock close price (for Amazon, this relationship is roughly linear).
trends %>%
left_join(stock, by = "date") %>%
select(one_of(c("date", "hits", "close"))) %>%
drop_na() %>%
ggplot(aes(hits, close)) + geom_point(color="blue") + geom_smooth(method = "lm", color = "black") +
labs(title =paste0(symbol,": Relationship between Hits and Close Stock Price"))
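To quantify how linear this relationship is, the same join can be reused to compute a correlation coefficient and fit a simple linear model. This is an optional sketch, not part of the original report:
#optional: quantify the relationship between hits and close price
hits_close <- trends %>%
left_join(stock, by = "date") %>%
select(one_of(c("hits", "close"))) %>%
drop_na()
cor(hits_close$hits, hits_close$close) #Pearson correlation
summary(lm(close ~ hits, data = hits_close)) #simple linear fit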
News articles provide excellent insight on the performance of each stock. The next step in this stock performance report is to import and perform sentiment analysis on recent news articles about the selected company.
To obtain news article data, the News API (newsapi.org) is used. The request URL takes the company name (which is scraped from MarketWatch using the selected symbol), a date range covering the last 30 days, a sort order (relevance), a page size of 100 articles, and a custom API key. To obtain a unique API key, please visit the News API Documentation. Once the request is successful, jsonlite is used to transform the JSON returned by the API into a dataframe.
##get company name using web-scraping
url_overview = paste0("https://www.marketwatch.com/investing/stock/",symbol,"/profile")
var_overview = read_html(url_overview)
company <- var_overview %>%
html_nodes('#instrumentname') %>%
html_text() %>%
as.character()
#news API Query
url_news = paste0("https://newsapi.org/v2/everything?q=",
str_replace_all(company,pattern = " ", replacement = "%20"),
"&from=",today()-ddays(30), #last 30 days
"&sortBy=relevance&pageSize=100&language=en&apiKey=",config$news_apikey)
#API json to datatable
results <- GET(url = url_news)
news <- content(results, "text")
news %<>%
fromJSON(flatten = TRUE) %>% #flatten
as.data.frame() %>% #make dataframe
select(c(articles.title, articles.description, articles.content, articles.publishedAt))
datatable(news)
Now that 100 recent news articles about the selected stock are available, text mining and sentiment analysis can be performed on this dataset.
The first step is to unnest each word in the article description, allowing for a ‘bag of words’ sentiment analysis approach. For a quick visualization of the most frequently used words, a word cloud is created.
news_words <- news %>%
select(c("articles.title","articles.description", "articles.content", "articles.publishedAt")) %>%
unnest_tokens(word, articles.description) %>%
filter(!word %in% append(stop_words$word, values = "chars"), str_detect(word, "^[a-z']+$"))
news_words$date = as_date(news_words$articles.publishedAt)
words_only <- news_words %>%
count(word, sort =TRUE)
set.seed(1)
wordcloud(words = words_only$word, freq = words_only$n, scale=c(5,.5), max.words=50, colors=brewer.pal(8, "Dark2"))
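A word cloud scales words by frequency but can be hard to read precisely. As an optional complement (not part of the original report), the same counts can be shown as a bar chart of the most frequent words:
#optional: bar chart of the 15 most frequent words
words_only %>%
top_n(15, n) %>%
ggplot(aes(reorder(word, n), n)) +
geom_col(fill = "steelblue") +
coord_flip() +
labs(title = paste0(symbol, ": Most Frequent Words in Recent News"), x = "word", y = "count")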
To perform basic sentiment analysis, the AFINN sentiment lexicon is used. This lexicon assigns each word a score on a scale of -5 (most negative) to 5 (most positive). To view news sentiment about the selected company over the past month, the dataset is grouped by article and date, and the score is summarised as the mean of each group.
afinn <- get_sentiments("afinn")
sentiment_summary <- news_words %>%
left_join(afinn) %>%
filter(!is.na(score)) %>%
group_by(articles.title, date) %>%
summarise(score = mean(score)) %>%
mutate(sentiment = ifelse(score>0, "positive","negative"))
datatable(sentiment_summary)
ggplot(sentiment_summary, aes(date, score)) + geom_bar(stat = "identity", aes(fill=sentiment)) + ggtitle(paste0(symbol, ": News Sentiment Over Time"))
In the previous steps, various factors such as news sentiment and Google trends were analyzed. In this step, the Prophet API will be used to forecast future prices for the selected stock.
Prophet is open-source software created by Facebook for forecasting time series data. For more information, please visit the Prophet API Documentation. In this next step, the data is pre-processed to fit Prophet's requirements and a prediction is created (accounting for the regular gaps in stock data on weekends). The output of the forecast is the date, the forecasted close price, and the lower and upper bounds based on an 80% uncertainty interval (Prophet's default).
#pre-processing
df <- stock %>%
select(c("date","close")) %>%
rename(ds = date, y = close)
#predictions
m <- prophet(df)
future <- make_future_dataframe(m, periods = 365) %>% filter(!wday(ds) %in% c(1,7)) #account for regular gaps on weekends
forecast <- predict(m, future)
datatable(forecast[c('ds','yhat','yhat_lower','yhat_upper')])
The results of the time series forecast are plotted below. For most stocks, Prophet captures the overall trend in close prices, but it fails to forecast sharp changes in price.
plot(m, forecast, xlabel = "date", ylabel = "stock close price ($)") + ggtitle(paste0(symbol, ": Stock Price Prediction"))
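Prophet can also decompose the forecast into trend and seasonal components, which helps show what the model has and has not captured. As an optional step, the fitted components can be plotted with the prophet package's component plot:
#optional: plot the forecast's trend and seasonal components
prophet_plot_components(m, forecast)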
To further evaluate the forecast results, a residual plot is created. Based on the residual plot, it is evident that this forecast does not capture all of the variability in stock prices over time.
forecast$ds <- as_date(forecast$ds)
residuals <- df %>%
left_join(forecast[c('ds','yhat','yhat_lower','yhat_upper')], by = "ds") %>%
filter(ds < today()) %>%
mutate(res = (y-yhat))
datatable(residuals)
ggplot(residuals, aes(ds, res)) + geom_point() + geom_hline(yintercept =0, color = "red") + labs(title ="Prophet Forecasting Residuals", x = "date", y = "residual")
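To summarise the residuals numerically, overall error metrics can be computed from the residuals table. This is an optional sketch (not part of the original report) reporting root mean squared error and mean absolute error:
#optional: summarise forecast error over the historical period
residuals %>%
summarise(rmse = sqrt(mean(res^2)),
mae = mean(abs(res)))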