Notes

The text analysis code is based on code from Silge, J., & Robinson, D. (2017). Text Mining with R: A Tidy Approach (1st ed.). O’Reilly Media.

Objective

To scrape Advanced Micro Devices Investor Relations press releases and derive sentiments by year.

Shiny app for database sign in.

This is a simple sign-in GUI that R Markdown picks up when the document is run, so a user can enter database credentials.

library(shiny)
ui <- shinyUI(fluidPage(
  
  textInput(inputId = 'username', label = 'Database User Name', value = ""),
  # passwordInput() masks the typed characters
  passwordInput(inputId = 'pw', label = 'Database Password', value = ""),
  actionButton('save_inputs', 'Save inputs')
  
)) 

server <-  shinyServer(function(input, output,session) {
  
  
  observeEvent(input$save_inputs, {
    
    # store the credentials in the global environment so later chunks can use them
    credentials <<- list(user_name = input$username, pw = input$pw)
    
    stopApp(returnValue = invisible())
    
  }) 
  
})

shinyApp(ui = ui, server=server)
## 
## Listening on http://127.0.0.1:8541


Creation of the database and tables for storing the press releases

library(RMySQL)

db_con <- dbConnect(MySQL(), user = credentials$user_name, password = credentials$pw)

# dbExecute() is used for statements that return no rows; it also frees the result set
create_database_statement <- 'CREATE DATABASE IF NOT EXISTS amd;'
dbExecute(db_con, create_database_statement)
dbExecute(db_con, 'USE amd;')

create_query <- 'Create Table IF NOT EXISTS Press_Releases(
                    Links varchar(300) NOT NULL,
                    Title varchar(300) NOT NULL,
                    Article longtext,
                    Published DateTime,
                    primary key (Links,Title,Published)
                  );'

dbExecute(db_con, create_query)

Scraping Advanced Micro Devices Investor Relations page for press releases.

This section of the code scrapes all press releases from AMD’s Investor Relations page. To do this successfully a proxy IP address is needed; otherwise the site will block your IP address and the scrape will fail. Most companies that do scraping have their own proxy addresses, so this isn’t much of an issue for them, but for a regular person it can be a major hurdle to web scraping. Luckily, there are multiple sites that offer free proxy IP addresses, such as ProxyScrape, Free Proxy, ProxyNova, Hide.me, HMA (Hide My Ass) and Open Proxy, to name a few.

Most of these sites update their proxy addresses often, but the longer a proxy address has been in use, the more likely it is that the address has already been banned by the site you are trying to scrape.

To pull off this scrape of just over 1000 press releases I only needed one proxy address. There are other ways of avoiding having your requests blocked, namely adding a timer to the code. This code has one, currently set to wait .75 seconds after each request before making the next. If needed, this delay can be increased.
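
The timer idea can be sketched separately from the scraping details. This is only a minimal illustration, not the code used for this report: `fetch_all` and `fake_fetch` are hypothetical names, and a real run would pass in a function that wraps the actual HTTP request.

```r
# Minimal sketch of a throttled request loop (hypothetical helper, base R only).
# `fetch_page` is any function that takes an id and returns a list with
# `status` (HTTP status code) and `body`; it is supplied by the caller.
fetch_all <- function(ids, fetch_page, delay = 0.75) {
  results <- list()
  for (id in ids) {
    Sys.sleep(delay)                    # pause before every request
    res <- fetch_page(id)
    if (res$status == 404) next         # skip ids with no page behind them
    results[[as.character(id)]] <- res$body
  }
  results
}

# Stand-in fetcher so the sketch runs without network access:
# even ids "exist", odd ids return 404.
fake_fetch <- function(id) {
  if (id %% 2 == 0) {
    list(status = 200, body = paste("release", id))
  } else {
    list(status = 404, body = NA_character_)
  }
}

pages <- fetch_all(1:6, fake_fetch, delay = 0)
names(pages)
```

Keeping the delay as an argument makes it easy to slow the loop down if the site starts rejecting requests.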

# UNCOMMENT TO RUN THE SCRAPE
# A new proxy ip and matching port will have to be entered in the proxy_ip variable
# library(httr)
# library(rvest)
# library(stringr)
#
# proxy_ip <- data.frame('ip' = '96.95.164.41', 'port' = 3128)
#
# amd_link <- 'https://ir.amd.com/news-events/press-releases/detail/'
#
# for (s in 1:1100) {
#   # time between requests
#   Sys.sleep(.75)
#   link <- paste0(amd_link, s, '/')
#   get_html <- GET(link, use_proxy(proxy_ip$ip[1], proxy_ip$port[1]))
#
#   # skip ids that do not resolve to a press release
#   if (status_code(get_html) == 404) next
#
#   # parse the response that came back through the proxy
#   # (read_html(link) would request the page again without the proxy)
#   html_data <- read_html(get_html)
#   data_nodes <- html_nodes(html_data, '.full-news-article')
#   title <- html_nodes(html_data, '.article-heading') %>% html_text()
#   release_date <- html_nodes(data_nodes, css = 'time') %>% html_text()
#   release_date <- as.Date(release_date, format = "%B %d, %Y %H:%M")
#   p_text <- data_nodes %>% html_nodes("p") %>% html_text()
#
#   # skip pages without a parseable release date
#   if (length(release_date) == 0 || is.na(release_date[1])) next
#
#   # keep the paragraphs that come before the "Resources" section
#   stop_at <- which(p_text == "Resources")
#   if (length(stop_at) > 0) p_text <- p_text[seq_len(stop_at[1] - 1)]
#
#   # strip line breaks and double quotes, then join into one string
#   txt <- gsub("\r?\n|\r", "", p_text)
#   txt <- gsub('\\"', '', txt)
#   article <- str_flatten(txt, collapse = ' ')
#
#   # write one row per press release (the original wrote inside the paragraph loop)
#   df <- data.frame('Links' = link,
#                    'Title' = title,
#                    'Article' = article,
#                    'Published' = release_date)
#   dbWriteTable(db_con, name = 'Press_Releases', value = df, append = TRUE, row.names = FALSE, overwrite = FALSE)
# }

Querying and tokenizing the entire article set

This section of code extracts the articles from the database, tags each one with its publication year, and tokenizes the text into words.

library(dplyr)
library(tidyr)
library(ggplot2)
library(lubridate)
library(tidytext)
get_article_query <- 'SELECT Article,Published FROM Press_Releases;'

article_txt <- dbGetQuery(db_con, get_article_query)

# one row per article: add the publication year and an article id, then split into words
art <- article_txt %>%
  mutate(pub_year = floor_date(as.Date(Published), unit = "year"),
         text = Article,
         linenumber = row_number()) %>%
  unnest_tokens("word", text)

Bing sentiments

Here the Bing sentiment lexicon is used to compute a net sentiment score for each article by subtracting the number of negative words from the number of positive words. The scores are then put into a scatter plot against publication year to show any yearly changes.

bing_art_sent<- art %>% 
  inner_join(get_sentiments("bing"))%>%
  count(pub_year, index = linenumber,sentiment)%>%
  pivot_wider(names_from =sentiment, values_from = n,values_fill = 0)%>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"
ggplot(bing_art_sent, aes(pub_year,sentiment))+
  geom_point()

[Figure: scatter plot of net Bing sentiment per article (y-axis: sentiment) by publication year (x-axis: pub_year)]
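
To make the "positive minus negative" arithmetic concrete, here is a toy version in base R. The lexicon and tokens are made up for the example, and it uses `merge()` and `table()` in place of the `inner_join()`, `count()` and `pivot_wider()` pipeline above.

```r
# Toy illustration of net sentiment scoring (base R only, invented data).
lexicon <- data.frame(
  word      = c("strong", "growth", "loss", "decline"),
  sentiment = c("positive", "positive", "negative", "negative"),
  stringsAsFactors = FALSE
)

tokens <- data.frame(
  linenumber = c(1, 1, 1, 2, 2, 2),
  word       = c("strong", "growth", "loss", "decline", "loss", "the"),
  stringsAsFactors = FALSE
)

# Inner join: tokens that are not in the lexicon ("the") drop out.
matched <- merge(tokens, lexicon, by = "word")

# Count positive and negative words per article, then take the difference.
counts <- table(
  matched$linenumber,
  factor(matched$sentiment, levels = c("positive", "negative"))
)
net <- counts[, "positive"] - counts[, "negative"]
net
```

Article 1 has two positive words and one negative word, so its score is 1; article 2 has two negative words and no positive ones, so its score is -2.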

NRC Sentiments

Here the NRC sentiment lexicon is used to compute a net sentiment score for each article by subtracting the number of negative words from the number of positive words. The scores are then put into a scatter plot against publication year to show any yearly changes.

nrc_art_sent<- art %>% 
  inner_join(get_sentiments("nrc"))%>%
  count(pub_year, index = linenumber,sentiment)%>%
  pivot_wider(names_from =sentiment, values_from = n,values_fill = 0)%>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"
ggplot(nrc_art_sent, aes(pub_year,sentiment))+
  geom_point()

[Figure: scatter plot of net NRC sentiment per article (y-axis: sentiment) by publication year (x-axis: pub_year)]

My chosen lexicon: the Loughran-McDonald financial lexicon

Here the Loughran-McDonald sentiment lexicon is used to compute a net sentiment score for each article by subtracting the number of negative words from the number of positive words. The scores are then put into a scatter plot against publication year to show any yearly changes.

loughran_art_sent<- art %>% 
  inner_join(get_sentiments("loughran"))%>%
  count(pub_year, index = linenumber,sentiment)%>%
  pivot_wider(names_from =sentiment, values_from = n,values_fill = 0)%>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"
ggplot(loughran_art_sent, aes(pub_year,sentiment))+
  geom_point()

[Figure: scatter plot of net Loughran-McDonald sentiment per article (y-axis: sentiment) by publication year (x-axis: pub_year)]

From the three scatter plots, one per lexicon, we can see that the lexicons differ in the number of words they count. The NRC lexicon picks up the most, with a peak of about twenty. It is followed by the Bing lexicon with a peak of roughly twelve, and the Loughran-McDonald lexicon has the smallest peak at roughly eleven. Bing and Loughran catch basically the same number of words.

This difference in peaks shows that some lexicons are more sensitive than others. Because the Loughran lexicon is specific to finance, it is only going to pick up words that are relevant to that industry.

Even with the differences in the number of words that each lexicon picks up, all three charts follow the same pattern, just on slightly different scales.

Summary of sums

To get an idea of the trends in the sentiments of AMD’s press releases, I take the sum of each column in the Loughran sentiments variable by year and plot the sums as a line graph. We can see from the graph that 2011 has a peak in the number of articles and in all sentiments.

This correlates to a rocky time in AMD’s history, when AMD was launching a new product line of chips called Bulldozer in late 2011. Unfortunately the chip was a pretty big failure and AMD was on the brink of bankruptcy. This led to drastic changes at the company, where the CEO was let go and replaced with Dr. Lisa Su.

Much to my surprise, after her hiring and her redirecting of the company back to sustainability with a great new product line, the sentiments and the number of press releases by AMD have been on a downward trend. At the same time uncertainty has remained pretty constant throughout the years, while negative sentiments have bottomed out near zero since about 2016, which is the year they started to announce their new product roadmap developed under Dr. Su’s leadership.

Looking at the graph of the mean sentiments gives a totally different perspective. From 2015 to 2017 we can see a peak in mean sentiment, which correlates to AMD’s upcoming new product roadmap that was released in 2017. We can also see an increase in the negative and uncertainty sentiments during that time frame, perhaps residual sentiment left over from their last product failure.

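# summarise_each() is deprecated; across() produces the same <column>_mean / <column>_sum names
summary_of_sums <- loughran_art_sent %>%
  group_by(pub_year) %>% 
  summarise(across(-index, list(mean = mean, sum = sum)))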

ggplot(summary_of_sums,aes(x=pub_year))+
  geom_line(aes(y = negative_sum), color = "darkred") + 
  geom_line(aes(y = positive_sum), color="green")+
  geom_line(aes(y = uncertainty_sum), color = 'blue')+
  geom_line(aes(y=sentiment_sum), color ='black')+
  theme(legend.position = "right")

[Figure: line chart of yearly sums by pub_year: negative (dark red), positive (green), uncertainty (blue), net sentiment (black)]

ggplot(summary_of_sums,aes(x=pub_year))+
  geom_line(aes(y = negative_mean), color = "darkred") + 
  geom_line(aes(y = positive_mean), color="green")+
  geom_line(aes(y = uncertainty_mean), color = 'blue')+
  geom_line(aes(y=sentiment_mean), color ='black')+
  theme(legend.position = "right")

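[Figure: line chart of yearly means by pub_year: negative (dark red), positive (green), uncertainty (blue), net sentiment (black)]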
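
The reason the sum and mean graphs can tell different stories is that a yearly sum grows with the number of articles, while a yearly mean does not. A small base R sketch with invented scores (not the AMD data) shows the effect:

```r
# Toy data: one row per article with a net sentiment score (values invented).
scores <- data.frame(
  pub_year  = c(2011, 2011, 2011, 2011, 2011, 2017, 2017),
  sentiment = c(1, 2, 1, 2, 1, 3, 3)
)

yearly_sum  <- tapply(scores$sentiment, scores$pub_year, sum)
yearly_mean <- tapply(scores$sentiment, scores$pub_year, mean)

yearly_sum    # 2011 has the larger total because it has more articles
yearly_mean   # 2017 has the larger per-article average
```

Here 2011 wins on the sum (7 versus 6) only because it has five articles to 2017's two, while the mean (1.4 versus 3) points the other way.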

Bigrams

Bigram tokenization is a method where each word is paired with the word that follows it. Its purpose is to provide extra context for each word.

art_bigrams <- article_txt %>%
  group_by(pub_year = floor_date(as.Date(`Published`),unit ="year"))%>%
  tibble(text=Article)%>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

art_bigrams %>%
  count(bigram, sort = TRUE)
## # A tibble: 28,210 × 2
##    bigram              n
##    <chr>           <int>
##  1 amd today         511
##  2 sunnyvale ca      494
##  3 nyse amd          474
##  4 today announced   474
##  5 amd nyse          452
##  6 of the            399
##  7 ca marketwired    312
##  8 the amd           283
##  9 nasdaq amd        268
## 10 amd nasdaq        265
## # … with 28,200 more rows
art_bigrams_sep <- art_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ")

art_bigrams_filt <- art_bigrams_sep %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)

# new bigram counts:
art_bigram_counts <- art_bigrams_filt %>% 
  count(word1, word2, sort = TRUE)

art_bigrams_united <- art_bigrams_filt %>%
  unite(bigram, word1, word2, sep = " ")
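
For readers who want to see what the pairing and stop-word filtering do on a single sentence, here is a base R sketch. `make_bigrams` and `stop_list` are hypothetical names written for this illustration; the report itself relies on `unnest_tokens()` and `stop_words`.

```r
# Base R sketch of bigram tokenization on one sentence.
make_bigrams <- function(text) {
  # split on anything that is not a letter, digit or apostrophe
  words <- tolower(unlist(strsplit(text, "[^[:alnum:]']+")))
  words <- words[nzchar(words)]
  if (length(words) < 2) return(character(0))
  # pair each word with the one that follows it
  paste(head(words, -1), tail(words, -1))
}

stop_list <- c("the", "of", "a")   # tiny stand-in for stop_words$word

bigrams <- make_bigrams("AMD today announced the launch of a new processor")
bigrams

# Keep only bigrams where neither word is a stop word
parts <- strsplit(bigrams, " ")
keep  <- vapply(parts, function(p) !any(p %in% stop_list), logical(1))
bigrams[keep]
```

Nine words give eight bigrams, and only "amd today", "today announced" and "new processor" survive the stop-word filter.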

TF-IDF

TF-IDF stands for term frequency-inverse document frequency. It weights how often a term appears in a document by the inverse of how many documents contain that term. So common words like "the", which appear in every document, will not carry as much weight as a word such as "strong", which might appear in only one. Here each publication year is treated as one document. The table below is sorted by ascending tf-idf, so it shows the lowest-weighted word pairs: ones such as "vice president" that turn up in most years and therefore have a small idf.

art_bigram_tf_idf <- art_bigrams_united %>%
  count(pub_year, bigram) %>%
  bind_tf_idf(bigram, pub_year, n) %>%
  arrange(tf_idf)
art_bigram_tf_idf
## # A tibble: 17,655 × 6
##    pub_year   bigram                    n       tf    idf    tf_idf
##    <date>     <chr>                 <int>    <dbl>  <dbl>     <dbl>
##  1 2017-01-01 senior vice               1 0.000501 0.0690 0.0000346
##  2 2017-01-01 vice president            1 0.000501 0.0690 0.0000346
##  3 2013-01-01 chief financial           1 0.000264 0.143  0.0000377
##  4 2013-01-01 financial officer         1 0.000264 0.143  0.0000377
##  5 2016-01-01 chief financial           1 0.000343 0.143  0.0000491
##  6 2016-01-01 financial officer         1 0.000343 0.143  0.0000491
##  7 2014-01-01 fourth quarter            1 0.000248 0.223  0.0000552
##  8 2012-01-01 fourth quarter            1 0.000255 0.223  0.0000568
##  9 2013-01-01 fourth quarter            1 0.000264 0.223  0.0000588
## 10 2013-01-01 technology conference     1 0.000264 0.223  0.0000588
## # … with 17,645 more rows
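
The calculation behind those columns can be reproduced by hand. The sketch below uses made-up counts in base R and assumes the natural-log definition of idf, log(number of documents / number of documents containing the term), which is how I understand `bind_tf_idf()` to compute it. The last line checks the idf values printed above: they match log(15/14), log(15/13) and log(15/12), which would be consistent with 15 yearly documents. That count is my inference from the output, not something stated elsewhere in this report.

```r
# Hand computation of tf-idf on invented counts (base R only).
# Each "document" is a publication year.
counts <- data.frame(
  pub_year = c(2015, 2015, 2016, 2016, 2017),
  bigram   = c("vice president", "product launch",
               "vice president", "fourth quarter",
               "vice president"),
  n        = c(2, 2, 1, 3, 4),
  stringsAsFactors = FALSE
)

n_docs <- length(unique(counts$pub_year))

# tf: share of a year's bigrams taken up by this bigram
counts$tf <- counts$n / ave(counts$n, counts$pub_year, FUN = sum)

# idf: natural log of (number of years / number of years containing the bigram)
docs_with_term <- table(counts$bigram)
counts$idf <- log(n_docs / as.numeric(docs_with_term[counts$bigram]))

counts$tf_idf <- counts$tf * counts$idf
counts[order(counts$tf_idf), ]

# idf values for a bigram found in 14, 13 or 12 of 15 documents
round(log(15 / c(14, 13, 12)), 3)
```

In the toy data "vice president" appears in all three years, so its idf and tf-idf are zero no matter how often it occurs within a year.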

Conclusion

I don’t like any of these methods. English is a very complex language in which words have multiple meanings and connotations, and the meaning of a word depends on the other words in the sentence.

For example, in the sentiment lexicons the word "cloud" is a negative word. In some instances that is very true, but in the context of information technology and the businesses involved in it, "cloud" is a positive word. A company expanding its cloud presence would never be expanding a negative sentiment; it would be expanding a positive one. The lack of context in this type of analysis may not lead to accurate predictions.

There are many other methods of text analysis that try to derive context and a deeper meaning from the words, such as part-of-speech tagging, which would seem more appropriate from a business analytics point of view.
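
One partial workaround, which I did not apply in this analysis, is to drop words whose lexicon label does not fit the domain before scoring. The base R sketch below shows the effect on a single made-up headline; the lexicon rows, the `domain_neutral` list and the `score` helper are all invented for the example.

```r
# Sketch: remove domain-specific words from a lexicon before scoring.
lexicon <- data.frame(
  word      = c("cloud", "failure", "strong"),
  sentiment = c("negative", "negative", "positive"),
  stringsAsFactors = FALSE
)

# words whose lexicon label does not fit IT press releases (assumed list)
domain_neutral <- c("cloud")

adjusted <- lexicon[!lexicon$word %in% domain_neutral, ]

tokens <- c("amd", "expands", "cloud", "presence", "with", "strong", "demand")

# net sentiment: positive matches minus negative matches
score <- function(tokens, lex) {
  hits <- lex$sentiment[match(tokens, lex$word)]
  sum(hits == "positive", na.rm = TRUE) - sum(hits == "negative", na.rm = TRUE)
}

score(tokens, lexicon)    # "cloud" cancels out "strong"
score(tokens, adjusted)   # only "strong" counts
```

This only patches individual words by hand, so it does not solve the underlying lack of context.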