WELCOME TO MY R-MARKDOWN WEBSITE:
The program shown below is a PDF analysis program that outputs different types of visualizations and summary statistics for textual analysis. The program is quite simple and easy to reproduce.
Steps:
2. Once the file is turned into a data frame, the program tokenizes each word and filters out all stop words. Tokenizing simply means each word is placed in its own row to facilitate the analysis.
3. After exploring the data frame, the next step is to filter out any remaining words that should not be part of the analysis.
4. The first quantitative analysis counts the most used words and plots them on a bar graph, shown below.
6. This program uses the sentimentr package for the sentiment analysis section. By applying the sentiment() function to the text, the program gives each sentence a numeric sentiment score. sentimentr has a built-in dictionary that assigns scores to certain words.
7. sentimentr also provides a plot of emotional valence, which shows how the sentiment changes from the beginning to the end of the file.
8. Once the sentiment of each sentence is quantified, a box plot and a histogram of the sentiments are plotted using plotly. The box plot and histogram are interactive, so go ahead and put the cursor over the plot…
:) :) :) :) :) :) :) :) :) :) :) :) :)
Hope you enjoy this resource!!
library(pdftools)
## Using poppler version 21.04.0
library(tidytext)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(wordcloud)
## Loading required package: RColorBrewer
library(esquisse)
library(sentimentr)
library(stringr)
# Read in PDF #
text <- pdf_text("./scotia.pdf")
text_df <- data_frame(line=1:1, text=text)
## Warning: `data_frame()` was deprecated in tibble 1.1.0.
## Please use `tibble()` instead.
# Read multiple pdf files #
#file_list <- list.files(pattern="*.pdf")
#all_files <- lapply(file_list, FUN = function(files) {
# pdf_text(files)
#})
#text_df <- all_files
# Take Off Pages [3-9] #
#text_df <- anti_join(text_df,text_df[3:9,])
# Tokenise Words #
x <- text_df %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
## Joining, by = "word"
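To make the tokenizing step easier to picture, here is a small sketch on a made-up sentence instead of the PDF. The `toy` data frame and its text are hypothetical and only there for illustration; `unnest_tokens()`, `anti_join()` and `stop_words` are the same tidytext/dplyr pieces used above.

```r
library(dplyr)
library(tidytext)

# Hypothetical one-row data frame standing in for text_df
toy <- tibble(line = 1L, text = "The analyst raised the price target for the bank")

toy_tokens <- toy %>%
  unnest_tokens(word, text) %>%        # one lowercase word per row
  anti_join(stop_words, by = "word")   # drop common words such as "the"

toy_tokens
```

Passing `by = "word"` explicitly should also silence the "Joining, by" message.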
### R-code specifically for Analyst Reports ###
# Make Duplicate List #
x_2 <- x
x_2 <- filter(x_2, word != "usd")
# Find Neutral #
gg<-str_which(x_2$word,"neutral")
gg_df <- data_frame(line=1:1, line_num=gg)
gg_df <- count(gg_df,wt=NULL,sort=FALSE)
# Find Buy #
gg_2<-str_which(x_2$word,"buy")
gg_df2 <- data_frame(line=1:1, line_num=gg_2)
gg_df2 <- count(gg_df2,wt=NULL,sort=FALSE)
# Find Sell #
gg_3<-str_which(x_2$word,"sell")
gg_df3 <- data_frame(line=1:1, line_num=gg_3)
gg_df3 <- count(gg_df3,wt=NULL,sort=FALSE)
# Making Of The New Dataframe To Plot #
gg_dff<-bind_rows(gg_df,gg_df2)
gg_dff<-bind_rows(gg_dff,gg_df3)
vec<- c("Neutral","Buy","Sell")
gg_dff$new_col<-vec
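The three find-and-count blocks above could probably be folded into one step. The sketch below is one possible way to do it, not the code this page was built with; `toy_words`, `ratings` and `rating_counts` are hypothetical names, and the word list is made up so the example runs on its own.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for x_2 (one token per row)
toy_words <- tibble(word = c("buy", "neutral", "buyback", "sell", "buy", "price"))

# Label -> pattern, in the same order as the plot labels above
ratings <- c(Neutral = "neutral", Buy = "buy", Sell = "sell")

rating_counts <- tibble(
  new_col = names(ratings),
  # Count tokens that contain each pattern (substring match, like str_which)
  n = unname(vapply(
    ratings,
    function(p) sum(str_detect(toy_words$word, p)),
    integer(1)
  ))
)

rating_counts
```

Note that a substring match also counts words like "buyback" as "buy". If only exact matches should count, comparing with `toy_words$word == "buy"` would be stricter.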
# Word Counts #
x %>%
  count(word, sort = TRUE)
## # A tibble: 1,078 x 2
## word n
## <chr> <int>
## 1 research 65
## 2 scotiabank 63
## 3 scotia 42
## 4 price 38
## 5 analyst 34
## 6 rating 33
## 7 document 31
## 8 target 30
## 9 investment 26
## 10 information 25
## # ... with 1,068 more rows
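The chunk that draws the bar graph of the most used words is hidden on this page, so here is a rough sketch of how such a plot could be made with ggplot2. It is an assumption about the approach, not the original plotting code. The five words and counts are copied from the table above; `toy_counts` and `top_words_plot` are hypothetical names.

```r
library(dplyr)
library(ggplot2)

# Top five rows of the word-count table printed above
toy_counts <- tibble(
  word = c("research", "scotiabank", "scotia", "price", "analyst"),
  n    = c(65L, 63L, 42L, 38L, 34L)
)

top_words_plot <- toy_counts %>%
  mutate(word = reorder(word, n)) %>%   # order bars by count
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +                        # horizontal bars are easier to read
  labs(x = NULL, y = "Count", title = "Most used words")

top_words_plot
```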
#Sentiment-R #
y <- sentiment(text)
sentiment_by(text)
## element_id word_count sd ave_sentiment
## 1: 1 627 0.23596668 0.08129990
## 2: 2 150 0.37422466 0.32614058
## 3: 3 744 0.14375983 0.14220026
## 4: 4 85 0.10563323 0.14520638
## 5: 5 176 0.06449405 0.01448992
## 6: 6 613 0.30412397 0.26169229
## 7: 7 985 0.23566459 0.09307446
## 8: 8 974 0.22069255 0.15953934
## 9: 9 361 0.25196784 0.12340167
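The difference between `sentiment()` and `sentiment_by()` is easier to see on a tiny example. `sentiment()` scores every sentence, while `sentiment_by()` averages the scores per element, which for `pdf_text()` output means per page. The two-element `toy_text` vector below is made up for illustration.

```r
library(sentimentr)

# Hypothetical text: two elements standing in for two PDF pages
toy_text <- c(
  "The outlook is great and earnings are strong. Costs are a serious problem.",
  "We maintain our rating."
)

# Split into sentences once, then reuse the result for both calls
toy_sentences <- get_sentences(toy_text)

sentence_scores <- sentiment(toy_sentences)     # one row per sentence
element_scores  <- sentiment_by(toy_sentences)  # one row per element

sentence_scores
element_scores
```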
# Esquisse to have fun #
#esquisser(y)
# Find Price Target #
# Positions of every token containing "target"
gg_pT <- str_which(x_2$word, "target")
gg_pT <- data_frame(line = 1:1, line_num = gg_pT)
#gg_pT <- count(gg_pT,wt=NULL,sort=FALSE)
# Token right after the first "target" match (column 2 is `word`)
gg_pTT <- first(pull(gg_pT, line_num)) + 1
ptt <- x_2[gg_pTT, 2]
# Token two positions after the first "target" match
gg_pTT2 <- gg_pTT + 1
ptt2 <- x_2[gg_pTT2, 2]
ptt
## # A tibble: 1 x 1
## word
## <chr>
## 1 scotiabank
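In the output above, the word right after the first "target" is "scotiabank" rather than a price. One possible refinement, sketched below, is to look a few tokens ahead of each "target" and keep the first token that looks like a number. This is only an idea under assumptions: `find_price_after()` is a hypothetical helper, `toy_words` is made up, and the window of 3 tokens is an arbitrary choice.

```r
library(stringr)

# Hypothetical helper: first numeric-looking token within `window`
# positions after any exact occurrence of `keyword`
find_price_after <- function(words, keyword = "target", window = 3) {
  hits <- which(words == keyword)
  for (i in hits) {
    if (i >= length(words)) next              # nothing follows the last token
    end <- min(i + window, length(words))
    candidates <- words[(i + 1):end]
    is_num <- str_detect(candidates, "^[0-9]+([.,][0-9]+)*$")
    if (any(is_num)) return(candidates[is_num][1])
  }
  NA_character_                               # no price-like token found
}

# Made-up token sequence standing in for x_2$word
toy_words <- c("scotiabank", "price", "target", "usd", "85.00", "rating")
find_price_after(toy_words)
```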
#Statistical Summary Of Sentiment
summary(y$sentiment)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.53033 0.00000 0.08452 0.12308 0.25000 0.95927
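The interactive box plot and histogram come from a chunk whose code is hidden, so the following is only a sketch of how they could be built with plotly, not the original code. `toy_scores` holds made-up values; on the real data the column would be `y$sentiment`.

```r
library(plotly)

# Hypothetical sentence-level scores standing in for y$sentiment
toy_scores <- data.frame(sentiment = c(-0.5, 0, 0.1, 0.1, 0.25, 0.9))

# Box plot of the sentence scores
sentiment_box <- plot_ly(toy_scores, y = ~sentiment, type = "box",
                         name = "Sentiment")

# Histogram of the same scores
sentiment_hist <- plot_ly(toy_scores, x = ~sentiment, type = "histogram")

sentiment_box
sentiment_hist
```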
ALL OUTPUT PLOTS BELOW:
## Joining, by = "word"
##
## Attaching package: 'plotly'
## The following object is masked from 'package:sentimentr':
##
## highlight
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.