The text analysis code is adapted from Silge, J., & Robinson, D. (2017). Text Mining with R: A Tidy Approach (1st ed.). O'Reilly Media.
The goal is to scrape Advanced Micro Devices' Investor Relations press releases and derive sentiments by year.
Below is a simple sign-in GUI that R Markdown will launch so the user can enter database credentials.
library(shiny)

# Minimal sign-in GUI to collect database credentials at knit time
ui <- shinyUI(fluidPage(
  textInput(inputId = 'username', label = 'Database User Name', value = ""),
  passwordInput(inputId = 'pw', label = 'Database Password', value = ""),  # passwordInput masks the typed password
  actionButton('save_inputs', 'Save inputs')
))

server <- shinyServer(function(input, output, session) {
  observeEvent(input$save_inputs, {
    # Store the credentials in the global environment, then close the app
    credentials <<- list(user_name = input$username, pw = input$pw)
    stopApp(returnValue = invisible())
  })
})

shinyApp(ui = ui, server = server)
##
## Listening on http://127.0.0.1:8541
library(RMySQL)

# Connect to the MySQL server with the credentials collected above
db_con <- dbConnect(MySQL(), user = credentials$user_name, password = credentials$pw)

# Create and select the database
dbSendQuery(db_con, 'CREATE DATABASE IF NOT EXISTS amd;')
dbSendQuery(db_con, 'USE amd;')

# Table to hold the scraped press releases
create_query <- 'CREATE TABLE IF NOT EXISTS Press_Releases(
  Links varchar(300) NOT NULL,
  Title varchar(300) NOT NULL,
  Article longtext,
  Published DATETIME,
  PRIMARY KEY (Links, Title, Published)
);'
dbSendQuery(db_con, create_query)
This section of the code scrapes all press releases from AMD's Investor Relations page. To do this successfully a proxy IP address is needed; otherwise the site will block your IP address and kill the scrape. Most companies that do scraping have their own proxy addresses, so for them this isn't much of an issue, but for an individual it can be a major hurdle to web scraping. Luckily, there are multiple sites that offer free proxy IP addresses, such as ProxyScrape, Free Proxy, ProxyNova, Hide.me, HMA (Hide My Ass), and Open Proxy, to name a few.
Most of these sites update their proxy addresses often, but the longer a proxy address has been in use, the more likely it is that the address has already been banned by the site you are trying to scrape.
To pull off this scrape of just over 1,000 press releases I only needed one proxy address. There are other ways of avoiding having your requests blocked, namely by adding a timer to the code, which this code has: it is currently set to wait 0.75 seconds after the last request before making the next one. If needed, this delay can be increased; a sketch of a more defensive variant with retries follows.
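If simple rate limiting is not enough, a small wrapper that waits longer and retries when the server returns an error status can also help. This helper is not part of the original scraper; the delay values, retry count, and function name are illustrative assumptions.
library(httr)

# Hypothetical helper: fetch a URL through a proxy, waiting between attempts
# and backing off when the server responds with an error status.
polite_get <- function(url, proxy_ip, proxy_port, base_delay = 0.75, max_tries = 3) {
  for (attempt in seq_len(max_tries)) {
    Sys.sleep(base_delay * attempt)              # wait longer on each retry
    resp <- GET(url, use_proxy(proxy_ip, proxy_port))
    if (status_code(resp) < 400) return(resp)    # success or redirect, stop retrying
  }
  resp                                           # return the last response either way
}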
#
# UNCOMMENT TO RUN
# A new proxy ip and matching port will have to be entered in the proxy_ip variable
#
# library(httr)
# library(rvest)
# library(stringr)
#
# proxy_ip <- data.frame('ip' = '96.95.164.41', 'port' = 3128)
#
# amd_link <- 'https://ir.amd.com/news-events/press-releases/detail/'
#
# for(s in 1:1100){
#   # time between requests
#   Sys.sleep(.75)
#   link <- paste0(amd_link, s, '/')
#
#   # request the page through the proxy and keep the response for parsing
#   get_html <- GET(link, use_proxy(proxy_ip$ip[1], proxy_ip$port[1]))
#   code <- status_code(get_html)
#
#   if (as.numeric(code) != 404){
#     # parse the proxied response instead of re-requesting the page without the proxy
#     html_data <- read_html(content(get_html, as = 'text', encoding = 'UTF-8'))
#     data_nodes <- html_nodes(html_data, '.full-news-article')
#     title <- html_nodes(html_data, '.article-heading') %>% html_text()
#     release_date <- html_nodes(data_nodes, css = 'time') %>% html_text()
#     release_date <- as.Date(release_date, format = "%B %d, %Y %H:%M")
#     p_text <- data_nodes %>% html_nodes("p") %>% html_text()
#
#     if (length(release_date) > 0 && !is.na(release_date[1])){
#       # keep article paragraphs until the trailing "Resources" section starts
#       txt <- c()
#       for(i in 1:length(p_text)){
#         if (as.character(p_text[i]) == "Resources") break
#         txt[i] <- gsub("\r?\n|\r", "", p_text[i])
#         txt[i] <- gsub('\\"', '', txt[i])
#       }
#
#       # flatten the paragraphs into one article and write it to the database once per page
#       article <- str_flatten(txt, collapse = ' ')
#
#       df <- data.frame('Links' = link,
#                        'Title' = title,
#                        'Article' = article,
#                        'Published' = release_date)
#       dbWriteTable(db_con, name = 'Press_Releases', value = df,
#                    append = TRUE, row.names = FALSE, overwrite = FALSE)
#     }
#   }
# }
In this section the articles are pulled back out of the database, tagged with their publication year, and tokenized into single words.
library(lubridate)
library(tidytext)
library(dplyr)

# Pull the stored articles and their publication dates out of the database
get_article_query <- 'SELECT Article, Published FROM Press_Releases;'
article_txt <- dbGetQuery(db_con, get_article_query)

# Tag each article with its publication year and tokenize it into single words
art <- article_txt %>%
  mutate(pub_year = floor_date(as.Date(Published), unit = "year"),
         linenumber = row_number()) %>%
  unnest_tokens(word, Article)
Here the Bing sentiment lexicon is used to take a sentiment reading by subtracting the total number of negative words from the total number of positive words. The result is then put into a scatter plot to show any yearly changes.
library(tidyr)
library(ggplot2)

bing_art_sent <- art %>%
  inner_join(get_sentiments("bing")) %>%
  count(pub_year, index = linenumber, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"
ggplot(bing_art_sent, aes(pub_year,sentiment))+
geom_point()
Plot: Bing net sentiment score by publication year.
Here the same sentiment reading is taken with the NRC lexicon: the total number of negative words is subtracted from the total number of positive words, and the result is plotted by year.
nrc_art_sent <- art %>%
  inner_join(get_sentiments("nrc")) %>%
  count(pub_year, index = linenumber, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"
ggplot(nrc_art_sent, aes(pub_year,sentiment))+
geom_point()
Plot: NRC net sentiment score by publication year.
Here the Loughran-McDonald sentiment lexicon is used for the same reading: the total number of negative words is subtracted from the total number of positive words, and the result is plotted by year.
loughran_art_sent <- art %>%
  inner_join(get_sentiments("loughran")) %>%
  count(pub_year, index = linenumber, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"
ggplot(loughran_art_sent, aes(pub_year,sentiment))+
geom_point()
Plot: Loughran-McDonald net sentiment score by publication year.
From the three scatter plots we can see that the lexicons differ in the number of words they match. The NRC lexicon matches the most words, with a peak of about twenty, followed by the Bing lexicon with a peak of roughly twelve words. The Loughran-McDonald lexicon has the smallest peak at roughly eleven words, so Bing and Loughran-McDonald catch roughly the same number of words.
This difference in peaks shows that some lexicons are more sensitive than others. Because the Loughran-McDonald lexicon is specific to finance, it only picks up words that are relevant to that industry.
Even with the differences in the number of words each lexicon picks up, all three charts follow the same pattern, just on slightly different scales. One way to quantify the coverage difference is to count how many tokens each lexicon actually matches, as in the sketch below.
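As a rough check on that coverage claim, the number of matched tokens per lexicon can be counted directly from the art tibble built above. This snippet is illustrative and was not part of the original analysis; note the NRC lexicon can match a word more than once because a word can carry several NRC sentiments.
# How many tokens in the corpus does each lexicon recognize?
lapply(c("bing", "nrc", "loughran"), function(lex) {
  matched <- art %>% inner_join(get_sentiments(lex), by = "word")
  data.frame(lexicon = lex, matched_tokens = nrow(matched))
}) %>% bind_rows()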
To get an idea of the trends in the sentiments of AMD's press releases, I take the sum of each column in the Loughran sentiment variable and plot it as a line graph. We can see from the graph that 2011 is a peak both in the number of articles and in all sentiments.
This corresponds to a rocky time in AMD's history, when AMD was launching a new line of server chips based on the Bulldozer architecture. Unfortunately the chip was a pretty big failure and AMD ended up on the brink of bankruptcy. This led to drastic changes at the company: the CEO was let go and replaced with Dr. Lisa Su.
Much to my surprise, after her hiring and her redirection of the company back to sustainability with a strong new product line, both the sentiment totals and the number of press releases from AMD have been on a downward trend. At the same time, uncertainty has remained fairly constant throughout the years, while negative sentiment has bottomed out near zero since about 2016, the year AMD started to announce the new product roadmap developed under Dr. Su's leadership.
Looking at the graph of the mean sentiments gives a totally different perspective. From 2015 to 2017 we can see a peak in mean sentiment, which corresponds to AMD's new product roadmap released in 2017. We can also see an increase in the negative and uncertainty sentiments during that time frame, perhaps residual sentiment left over from the last product failure.
summary_of_sums <- loughran_art_sent %>%
  group_by(pub_year) %>%
  summarise(across(-index, list(mean = mean, sum = sum)))
ggplot(summary_of_sums,aes(x=pub_year))+
geom_line(aes(y = negative_sum), color = "darkred") +
geom_line(aes(y = positive_sum), color="green")+
geom_line(aes(y = uncertainty_sum), color = 'blue')+
geom_line(aes(y=sentiment_sum), color ='black')+
theme(legend.position = "right")
Plot: summed Loughran-McDonald sentiment counts by year (negative in dark red, positive in green, uncertainty in blue, net sentiment in black).
ggplot(summary_of_sums,aes(x=pub_year))+
geom_line(aes(y = negative_mean), color = "darkred") +
geom_line(aes(y = positive_mean), color="green")+
geom_line(aes(y = uncertainty_mean), color = 'blue')+
geom_line(aes(y=sentiment_mean), color ='black')+
theme(legend.position = "right")
Plot: mean Loughran-McDonald sentiment scores by year (same color scheme).
Bigrams are a tokenization method where consecutive words are paired together. Their purpose is to provide extra context for each word.
# Tokenize the articles into two-word bigrams, keeping the publication year
art_bigrams <- article_txt %>%
  mutate(pub_year = floor_date(as.Date(Published), unit = "year")) %>%
  unnest_tokens(bigram, Article, token = "ngrams", n = 2)
art_bigrams %>%
count(bigram, sort = TRUE)
## # A tibble: 28,210 × 2
## bigram n
## <chr> <int>
## 1 amd today 511
## 2 sunnyvale ca 494
## 3 nyse amd 474
## 4 today announced 474
## 5 amd nyse 452
## 6 of the 399
## 7 ca marketwired 312
## 8 the amd 283
## 9 nasdaq amd 268
## 10 amd nasdaq 265
## # … with 28,200 more rows
art_bigrams_sep <- art_bigrams %>%
separate(bigram, c("word1", "word2"), sep = " ")
art_bigrams_filt <- art_bigrams_sep %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word)
# new bigram counts:
art_bigram_counts <- art_bigrams_filt %>%
count(word1, word2, sort = TRUE)
art_bigrams_united <- art_bigrams_filt %>%
unite(bigram, word1, word2, sep = " ")
TF-IDF stands for term frequency-inverse document frequency. A term's weight is its frequency within a document multiplied by the inverse document frequency, a measure of how rare the term is across the collection of documents. Common words like "the", which appear in every document, are weighted down, while a word such as "strong", which might appear in only a few documents, is weighted up. Treating each publication year as a document, the table below (sorted by ascending tf-idf) shows that bigrams appearing in nearly every year get an idf close to zero, while pairs that show up in fewer years get a larger idf; a worked check of the first row follows the table.
art_bigram_tf_idf <- art_bigrams_united %>%
count(pub_year, bigram) %>%
bind_tf_idf(bigram, pub_year, n) %>%
arrange(tf_idf)
art_bigram_tf_idf
## # A tibble: 17,655 × 6
## pub_year bigram n tf idf tf_idf
## <date> <chr> <int> <dbl> <dbl> <dbl>
## 1 2017-01-01 senior vice 1 0.000501 0.0690 0.0000346
## 2 2017-01-01 vice president 1 0.000501 0.0690 0.0000346
## 3 2013-01-01 chief financial 1 0.000264 0.143 0.0000377
## 4 2013-01-01 financial officer 1 0.000264 0.143 0.0000377
## 5 2016-01-01 chief financial 1 0.000343 0.143 0.0000491
## 6 2016-01-01 financial officer 1 0.000343 0.143 0.0000491
## 7 2014-01-01 fourth quarter 1 0.000248 0.223 0.0000552
## 8 2012-01-01 fourth quarter 1 0.000255 0.223 0.0000568
## 9 2013-01-01 fourth quarter 1 0.000264 0.223 0.0000588
## 10 2013-01-01 technology conference 1 0.000264 0.223 0.0000588
## # … with 17,645 more rows
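As a quick arithmetic check of the first row above (the yearly document count and bigram total are inferred assumptions; bind_tf_idf uses the natural log):
# "senior vice" in 2017: one occurrence out of roughly 1997 bigrams that year
tf  <- 1 / 1997      # ~0.000501, matches the tf column
idf <- log(15 / 14)  # ~0.0690, consistent with the pair appearing in 14 of 15 yearly documents (assumed)
tf * idf             # ~0.0000346, matches the tf_idf column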
I don't like any of these methods. English is a very complex language in which words have multiple meanings and connotations, and the meaning of a word depends on the other words in the sentence.
For example, in these sentiment lexicons the word "cloud" is treated as negative. In some instances that is true, but in the context of information technology and the businesses involved in it, the cloud is a positive word: a company expanding its cloud presence is expanding something positive, not negative. The lack of context in this type of analysis may not lead to accurate predictions. A quick way to check how a single word is scored is shown below.
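For instance, the lexicons can be queried directly to see how they score a particular word. The snippet below is only an illustration; the result depends on the lexicon versions installed.
# Look up how the lexicons score the word "cloud"
get_sentiments("bing") %>% filter(word == "cloud")
get_sentiments("loughran") %>% filter(word == "cloud")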
There are many other methods of text analysis that try to derive context and deeper meaning from the words, such as part-of-speech tagging, which would seem more appropriate from a business analytics point of view; a rough sketch using tidytext's built-in part-of-speech table follows.
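As a rough sketch of that idea, tidytext ships a parts_of_speech reference table (word-to-tag lookups from the Moby Project) that can be joined onto the tokens. This is only a word-level lookup, not a true tagger, and using it here is my own illustration rather than part of the original analysis.
# Attach part-of-speech tags to the tokenized words and count the most common tags
art %>%
  inner_join(parts_of_speech, by = "word") %>%
  count(pos, sort = TRUE)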