Text Analytics & Sentiment

------- R Markdown: Input Section -------

WELCOME TO MY R-MARKDOWN WEBSITE:

The program shown below analyses a PDF and outputs several types of visualizations and summary statistics for textual analysis. The program is quite simple and easy to reproduce.

Steps:

1. Import all the libraries used; the program will not work as intended without them. Then read the PDF into R with the pdf_text function.

2. Once the file is turned into a data frame, the program tokenizes the text and filters out all stop words. Tokenizing simply means putting each word in its own row to facilitate the analysis.

3. After exploring the data frame, the next step is to filter out any remaining words that should not be part of the analysis.

4. The first quantitative analysis is to count the most used words and plot them on a bar graph, shown below.

5. A word cloud is another visualization technique to use.

6. This program uses the sentimentr package for the sentiment analysis section. Applying the sentiment function to the text gives each sentence a numeric sentiment score. sentimentr has a built-in dictionary that assigns scores to certain words.

7. sentimentr also provides a plot of emotional valence, which shows how the sentiment changes from the beginning to the end of the file.

8. Once the sentiment of each sentence is quantified, a box plot and a histogram of the scores are plotted using plotly. Both plots are interactive, so go ahead and put the cursor over them.

:) :) :) :) :) :) :) :) :) :) :) :) :)

Hope you enjoy this resource!!

library(pdftools)

## Using poppler version 21.04.0

library(tidytext)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
library(wordcloud)

## Loading required package: RColorBrewer

library(esquisse)
library(sentimentr)
library(stringr)

# Read in PDF #

text <- pdf_text("./scotia.pdf")
# pdf_text() returns one string per page, so `line` holds the page number
text_df <- tibble(line = seq_along(text), text = text)

# Read multiple pdf files #

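#file_list <- list.files(pattern = "\\.pdf$")
#all_files <- lapply(file_list, pdf_text)

# Flatten the list of pages into one data frame, one row per page
#all_text <- unlist(all_files)
#text_df <- tibble(line = seq_along(all_text), text = all_text)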


# Take Off Pages [3-9] #
#text_df <- anti_join(text_df,text_df[3:9,])


# Tokenise Words #
x <- text_df %>% 
  unnest_tokens(word,text) %>% 
  anti_join(stop_words)

## Joining, by = "word"
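To see what the tokenising step does, here is a minimal sketch on a made-up two-row data frame. The `toy` object and its sentences are illustrative only and are not taken from the PDF. `unnest_tokens()` lower-cases the text and puts one word per row, and the `anti_join()` then drops every word listed in tidytext's `stop_words` table.

```r
library(dplyr)
library(tidytext)

# Hypothetical stand-in for text_df: one row per page of text
toy <- tibble(line = 1:2,
              text = c("The analyst raised the price target.",
                       "We rate the stock a Buy."))

toy_words <- toy %>%
  unnest_tokens(word, text) %>%        # one lower-case word per row
  anti_join(stop_words, by = "word")   # drop common stop words

toy_words
```

Passing `by = "word"` explicitly is optional; it only silences the "Joining, by" message shown above.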

              ### R-code specifically for Analyst Reports ###


# Make Duplicate List #
x_2 <- x
x_2 <- filter(x_2, word != "usd")
# Find Neutral #
# str_which() returns the row numbers of every token containing the pattern
gg <- str_which(x_2$word, "neutral")
gg_df <- tibble(line_num = gg)
gg_df <- count(gg_df)

# Find Buy #
gg_2 <- str_which(x_2$word, "buy")
gg_df2 <- tibble(line_num = gg_2)
gg_df2 <- count(gg_df2)

# Find Sell #
gg_3 <- str_which(x_2$word, "sell")
gg_df3 <- tibble(line_num = gg_3)
gg_df3 <- count(gg_df3)

# Making Of The New Dataframe To Plot #
gg_dff <- bind_rows(gg_df, gg_df2)
gg_dff <- bind_rows(gg_dff, gg_df3)

vec <- c("Neutral", "Buy", "Sell")
gg_dff$new_col <- vec
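One thing to keep in mind with the rating counts: `str_which()` matches substrings, so a token such as "buying" is counted as a "buy". The sketch below, on a made-up token vector (`words` is hypothetical, not the report's data), compares that loose matching with an exact match anchored by `^` and `$`. It is an alternative to consider, not the method used above.

```r
library(stringr)

# Hypothetical token vector standing in for x_2$word
words <- c("buy", "buying", "neutral", "sell", "seller", "buy")

# Loose: TRUE for every token that contains the pattern,
# so "buying" counts as a "buy" hit, as with str_which() above
loose <- sapply(c(Neutral = "neutral", Buy = "buy", Sell = "sell"),
                function(p) sum(str_detect(words, p)))

# Strict: anchoring with ^ and $ counts exact matches only
strict <- sapply(c(Neutral = "^neutral$", Buy = "^buy$", Sell = "^sell$"),
                 function(p) sum(str_detect(words, p)))

loose    # Neutral 1, Buy 3, Sell 2
strict   # Neutral 1, Buy 2, Sell 1
```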


# Word Counts #
x %>% 
  count(word,sort=TRUE)

## # A tibble: 1,078 x 2
##    word            n
##    <chr>       <int>
##  1 research       65
##  2 scotiabank     63
##  3 scotia         42
##  4 price          38
##  5 analyst        34
##  6 rating         33
##  7 document       31
##  8 target         30
##  9 investment     26
## 10 information    25
## # ... with 1,068 more rows

# sentimentr #
y <- sentiment(text)   # one sentiment score per sentence
sentiment_by(text)     # average sentiment per element (here, per PDF page)

##    element_id word_count         sd ave_sentiment
## 1:          1        627 0.23596668    0.08129990
## 2:          2        150 0.37422466    0.32614058
## 3:          3        744 0.14375983    0.14220026
## 4:          4         85 0.10563323    0.14520638
## 5:          5        176 0.06449405    0.01448992
## 6:          6        613 0.30412397    0.26169229
## 7:          7        985 0.23566459    0.09307446
## 8:          8        974 0.22069255    0.15953934
## 9:          9        361 0.25196784    0.12340167
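The difference between the two sentimentr calls is easier to see on a tiny input. In this sketch the `pages` vector is hypothetical, with each element playing the role of one PDF page. `sentiment()` returns one row per sentence, while `sentiment_by()` averages those scores for each element. Splitting into sentences first with `get_sentences()` is the approach the package recommends, since it avoids repeating that work in each call.

```r
library(sentimentr)

# Hypothetical two-element input; each element stands in for one page
pages <- c("The results were excellent. Costs rose sharply.",
           "We remain neutral on the stock.")

sents <- get_sentences(pages)    # split each element into sentences once

sent    <- sentiment(sents)      # one row per sentence
by_page <- sentiment_by(sents)   # one row per element (page)

sent
by_page
```

Here `sent` has three rows (two sentences on the first page, one on the second) and `by_page` has two, matching the `element_id` column in the output above.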

# Esquisse to have fun #
#esquisser(y)


# Find Price Target #
# Row number of the first token containing "target", plus one:
# ptt is the word that follows it (column 2 of x_2 is `word`)
gg_pT <- str_which(x_2$word, "target")
gg_pT <- tibble(line_num = gg_pT)
gg_pTT <- pull(gg_pT, line_num)
gg_pTT <- first(gg_pTT) + 1
ptt <- x_2[gg_pTT, 2]

# ptt2 is the word after that one
gg_pTT2 <- gg_pTT + 1
ptt2 <- x_2[gg_pTT2, 2]

ptt

## # A tibble: 1 x 1
##   word      
##   <chr>     
## 1 scotiabank
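The lookup above returns the token that follows the first match of "target", which here is "scotiabank" rather than a number. If the goal is the numeric price target, one possible approach is a regular expression on the raw text instead of the token list. The sketch below assumes the report phrases it roughly like the made-up `sample_text`; the real wording may differ, so the pattern would need adjusting.

```r
library(stringr)

# Hypothetical sentence; the real report's wording may differ
sample_text <- "We maintain our 12-month price target of $85.50 per share."

# Capture the first number that follows the words "price target"
m <- str_match(tolower(sample_text),
               "price target[^0-9]*([0-9]+(?:\\.[0-9]+)?)")

price_target <- as.numeric(m[, 2])   # second column holds the captured group
price_target
```

If the phrase is not found, `str_match()` returns `NA`, so `price_target` would be `NA` rather than an error.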

#Statistical Summary Of Sentiment
summary(y$sentiment)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.53033  0.00000  0.08452  0.12308  0.25000  0.95927

------- OUTPUT SECTION -------

ALL OUTPUT PLOTS BELOW:

Ratings Bar Chart

Word Cloud

## Joining, by = "word"

Word Count Bar Chart

Plotly Boxplot

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:sentimentr':
## 
##     highlight

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

Plotly Histogram

Sentiment Change

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.
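Because the plotting chunks are hidden, here is a rough sketch of how similar charts could be built. It is not the code that produced the plots above. The `word_counts` values are copied from the word count table earlier on this page, while `scores` is a made-up vector standing in for `y$sentiment`.

```r
library(dplyr)
library(ggplot2)
library(plotly)

# Three rows taken from the word count table shown earlier
word_counts <- tibble(word = c("research", "price", "rating"),
                      n    = c(65, 38, 33))

# Horizontal bar chart, most frequent word at the top
bar <- ggplot(word_counts, aes(x = reorder(word, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Count")

# Hypothetical sentence scores standing in for y$sentiment
scores <- c(-0.53, 0, 0.08, 0.25, 0.96)

# Interactive box plot and histogram of the scores
boxp  <- plot_ly(y = scores, type = "box", name = "Sentiment")
histo <- plot_ly(x = scores, type = "histogram")

bar
boxp
histo
```

Hovering over the plotly objects in a rendered page shows the underlying values, which is what makes the box plot and histogram interactive.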