1 Introduction

This notebook provides a reproducible guide to text analysis of a corpus using R. The analysis examines term frequencies, sentiment, n-grams and topic models as part of an exploratory data analysis (EDA) of the corpus.

1.1 Objective

The objective of this vignette is to use dplyr, tidyr and tidytext to:

  1. Tidy document-term matrices and corpus objects
  2. Understand how R tokenizes text with unnest_tokens()
  3. Clean and preprocess the corpus
  4. Perform a sentiment analysis
  5. Identify term frequencies
  6. Understand n-grams and analyse word networks
  7. Fit topic models
  8. Explore clustering, classification and prediction.

1.2 Terms

String: Text is generally read into R as a character vector; each element is essentially a string.
Corpus: An object that contains raw strings annotated with metadata and descriptions.
Document-term matrix: A matrix describing a collection of documents, with one row per document and one column per term; the values in the matrix are word counts.
unnest_tokens(): Converts a data frame to one token per row (per document), which makes the text easy to analyse with tidy tools.
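
As a minimal illustration of these terms, the sketch below builds a tiny two-document tibble from made-up sentences and tokenizes it with unnest_tokens() (purely illustrative, not part of the corpus analysed later):

library(dplyr)
library(tidytext)

#strings: a character vector, one element per document
text <- c("Risk management is a core project skill.",
          "Topic models group documents into themes.")

#a tibble with one row per document
text_df <- tibble(document = c("Doc01", "Doc02"), text = text)

#unnest_tokens(): one token (word) per row, lower-cased with punctuation dropped
text_df %>%
  unnest_tokens(word, text)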

1.3 Workflow

fig 1. Text Mining process

2 Load libraries and set global parameters
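
The library() calls were not echoed in the original chunk. The set below is inferred from the functions used throughout the notebook and is therefore a best guess; adjust it to your own installation:

library(tm)           #VCorpus, tm_map, DocumentTermMatrix
library(tidytext)     #unnest_tokens, get_stopwords, bind_tf_idf, tidy() methods
library(dplyr)        #data manipulation verbs
library(tidyr)        #spread, separate
library(stringr)      #str_extract
library(ggplot2)      #plotting
library(topicmodels)  #LDA
library(widyr)        #pairwise_cor
library(igraph)       #graph.adjacency, community detection
library(ggraph)       #graph visualisation
library(cluster)      #clusplot

With the packages loaded, we read the text files in the ./docs directory into a corpus: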

docs <- VCorpus(DirSource("./docs"))

3 Preprocessing

After obtaining the corpus, the next step is usually cleaning and preprocessing of the text. For this we will mostly use functions from the tm and tidytext packages. In bag-of-words text mining, cleaning helps aggregate terms: for example, it may make sense to treat the words “manage”, “management” and “managing” as a single term. The specific preprocessing steps vary by project; the words used in tweets are vastly different from those used in legal documents, so the cleaning process can differ accordingly.
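
stemDocument() uses the SnowballC stemmer under the hood, so the "manage" example above can be checked directly (a quick sketch, assuming SnowballC is installed; it is the stemming backend used by tm):

SnowballC::wordStem(c("manage", "management", "managing"))
#all three words reduce to the same stem, so they are counted as a single term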

For this project we’ll use the preprocessing functions below:

  1. removePunctuation(): Remove all punctuation marks
  2. tolower(): Make all characters lowercase
  3. removeNumbers(): Remove numbers
  4. removeWords(): Remove stopwords
  5. stripWhitespace(): Remove excess whitespace
  6. stemDocument(): Reduce words to their stems, so related forms are counted together.

Let’s apply these functions to our corpus:

#Remove punctuation - replace punctuation marks with " "
docs <- tm_map(docs, removePunctuation)
#Transform to lower case
docs <- tm_map(docs, content_transformer(tolower))
#Strip digits
docs <- tm_map(docs, removeNumbers)
#Remove stopwords from standard stopword list 
docs <- tm_map(docs, removeWords, stopwords("english"))
#Strip whitespace (cosmetic?)
docs <- tm_map(docs, stripWhitespace)
#inspect output
writeLines(as.character(docs[[30]]))
## conventional approaches knowledge management projects focus cognitive thoughtrelated mechanical aspects knowledge creation capture alternate view one considers knowledge created interactions people – interactions – develop mutually acceptable interpretations theories facts ways suit particular needs project knowledge socially constructed true project managers need pay attention environmental social factors influence knowledge construction position taken paul jackson jane klobas paper entitled building knowledge projects practical application social constructivism information systems development presents knowledge creation sharing process model based social constructivist theory article summary review paper social constructivist view knowledge jackson klobas begin observation engineering disciplines founded belief knowledge can expressed propositions correspond reality independent human perception however alternate view knowledge absolute relative – ie depends mental models beliefs used interpret facts objects events relevant example software product viewed business users software developers former group may see application terms utility whereas latter may see instance particular technology perception gaps can also occur within seemingly homogenous groups – teams comprised software developers example can happen variety reasons differences experience cultural backgrounds make group social constructivism looks gaps can bridged authors discussion relies work berger luckmann described gap perceptions different individuals can overcome create socially constructed shared reality phrase socially constructed implies reality pertains project example created via common understanding issues followed mutual agreement players comprises reality view strikes particular chord akin stated aims dialogue mapping technique described several earlier posts see article example relevant projects knowledge information systems development social construct first authors make point information systems development isd projects …intensive exercises constructing social reality process data modeling models informed particular worldview systems designers use particular formal representations isd projects operational reality new explicitly constructed becomes understood accepted negotiated agreement participants two cultures business essentially knowledge emerges interaction discussion project proceeds however methodologies used design typically founded engineering approach takes positivist view rather social one authors suggest perhaps social constructivist paradigm offers insight continuing failure namely happening isd project far complex simple translation description external reality instructions computer emergence articulation multiple indeterminate sometimes unconscious sometimes ineffable realities negotiated achievement consensus new agreed reality explicit form business data model amenable computerization mind authors aim develop model addresses shortcomings traditional positivist view knowledge isd projects representing berger luckmanns theory social constructivism terms knowledge process model identify management principles map processes principles form basis survey used operational version process model operational model assessed experts tested project manager reallife project knowledge creationsharing process model process model jackson klobas describe based berger luckmanns work figure knowledge creationsharing model figure knowledge creationsharing model model describes personal knowledge created – 
personal knowledge individual knows personal knowledge built using mental models world – models frameworks individuals use make sense world according jacksonklobas process model personal knowledge built number process including internalisation absorption knowledge individual knowledge creation construction new knowledge repetitive performance tasks learning skills becoming aware new ideas ways thinking frameworks latter corresponds learning concepts theories even new ways perceiving world correspond change subjective reality individual externalisation representation description knowledge using speech symbols can perceived internalized others think explaining ideas procedures individuals objectivation creation shared constructs represent groups understanding world point knowledge objectified – perceived existence independent individuals legitimation authorization objectified knowledge correct standard reification process objective knowledge assumes status makes difficult change challenge familiar example reified knowledge procedure process hardened system – thats just way things done around common response processes challenged links depicted figure show relationships processes jackson klobas suggest knowledge creation isd projects social process occurs continual communication business sure elements knowledge creation – design prototyping development learning new skills etc – amount nought unless discussed argued agreed communicated social interactions interactions occur wider context organization reasonable claim resulting knowledge takes form mirrors social environment organization clearly model knowledge creation different usual interpretation knowledge independent reality regardless whether known group operational model good theory makes interesting academic discussions practice can model operationalised jackson klobas describe approach creating testing utility rather validity model discuss following sections knowledge sharing heuristics begin surveyed literature knowledge management identify knowledge sharing heuristics ie experiencebased techniques enable knowledge sharing example heuristics associated externalization process standard documentation modelling tools make business requirements easy understand stakeholders staff communicate regularly direct facetoface contact use prototypes authors identified heuristics matched process model according authors matching process simple cases doubt process heuristic attached suggests model provides natural way organize voluminous complex body research knowledge creation sharing important well suggests conceptual model illustrated fig can form basis simple means assess knowledge creation sharing capabilities work environments assurance relevant variables covered validating mapping validity matching checked using twenty historical case studies isd projects worked follows explanations worked well didnt mapped model process areas using heuristics identified prior step aim answer question relationship project failure problems respective knowledge processes conversely project success presence positive indicators one case studies authors use wellknown possibly overanalysed failure automated dispatch system london ambulance service paper succinct summary case study reproduce london ambulance service las largest ambulance service world provides accident emergency patient transport services resident population nearly seven million people isd project intended produce automated system dispatch ambulances emergencies existing manual system poor cumbersome 
inefficient relatively unreliable goal new system provide efficient command control process overcome deficiencies furthermore system seen management opportunity resolve perceived issues poor industrial relations outmoded work practices low resource utilization tender let development system components including computer aided dispatch automatic vehicle location radio interfacing mobile data terminals update status callout tender let company inexperienced large systems delivery whilst project profound implications work practices personnel hardly involved design system upon implementation many errors software infrastructure led critical operational shortcomings failure calls reach ambulances system lasted week necessary revert manual system jackson klobas show conceptual model maps knowledgerelated factors may played role failure project example heading personal knowledge one can identify least two potential factors lack involvement endusers design selection inexperienced vendor disconnect management employees suggests couple factors relating reification mutual negative perceptions outmoded unchallenged work practices validation authors suggest model provides comprehensive framework explains projects failed may overstating case – whats cause whats effect hard tell especially fact nonetheless model seem able capture many knowledgerelated gaps played role failures looking heuristics mapped process one might able suggest ways deficiencies addressed example externalization problem area one might suggest use prototypes encourage face face communication business personnel surveybased tool encouraged authors created survey tool intended evaluate knowledge creationsharing effectiveness project environments tool academic terms used model translated everyday language example term externalization translated knowledge sharing – see fig translated terms tool asked project managers evaluate project environments knowledge creation process capability scale based inputs recommend specific improvement strategies capabilities scored low tool evaluated four project managers used work environment period weeks end period interviewed responses analysed using content analysis match experiences requirements designed intent tool unfortunately paper provide details tool difficult say much paraphrase authors comments based evaluation authors conclude tool provides common framework project managers discuss issues pertaining knowledge creation sharing means identify potential problems might done address field testing one evaluators model tested tool field tester project manager wanted identify knowledge creationsharing deficiencies work environment ways addressed answered questions based evaluation knowledge sharing capabilities environment developed improvement plan based strategies suggested tool along ideas completed survey plan returned researchers use tool revealed following knowledge creationsharing deficiencies project managers environment inadequate personal knowledge ineffective externalization inadequate standardization objectivation strategies suggested tool include internet portal promote knowledge capture sharing included discussion forums areas capture discuss best practices etc role playing workshops reveal processes worked practice ie surface tacit knowledge based authors suggest technology can used promote support knowledge sharing standardization just storage interventions make tacit knowledge explicit can helpful side benefit note survey raised consciousness knowledge creationsharing within team 
reflections conclusions opinion value paper lies model survey tool conceptual framework underpins – namely idea knowledge depends shaped social environment evolves perhaps example might help clarify means consider organisation decides implement project management best practices described fill popular methodologies wrong way implement practices wholesale without regard organizational culture norms preexisting practices approach unlikely lead imposed practices taking root organisation hand approach picks practices useful tailors organizational needs constraints culture likely meet success second approach works attempts bridge gap ideal best practice social reality organisation encourages employees adapt practices ways make sense context organization invariably involves modifying practices sometimes substantially creating new socially constructed knowledge bargain another interesting point authors make several knowledge sharing heuristics think number classified unambiguously one processes model suggests model reasonable view knowledge creationsharing process one accepts conclusion model indeed provide common framework discussing issues relating knowledge creation project environments associated heuristics can help identify processes dont work well im unable judge usefulness surveybased tool developed authors provide much detail paper however isnt really issue field project management many tools techniques anyway key message paper opinion every project unique context techniques used others interpreted applied ways meaningful context particular project paper excellent counterpoint methodologyoriented practice knowledge management projects required reading methodologists project managers believe things need done book regardless social organizational context
#Stem document
docs <- tm_map(docs,stemDocument)

Next we need to convert the corpus into a tidy format: a data frame with one row per document.

t_corpus <- docs %>% tidy()
t_corpus
## # A tibble: 41 x 8
##    author datetimestamp       description heading id    language origin
##    <lgl>  <dttm>              <lgl>       <lgl>   <chr> <chr>    <lgl> 
##  1 NA     2018-10-30 15:04:50 NA          NA      Doc0… en       NA    
##  2 NA     2018-10-30 15:04:50 NA          NA      Doc0… en       NA    
##  3 NA     2018-10-30 15:04:50 NA          NA      Doc0… en       NA    
##  4 NA     2018-10-30 15:04:50 NA          NA      Doc0… en       NA    
##  5 NA     2018-10-30 15:04:50 NA          NA      Doc0… en       NA    
##  6 NA     2018-10-30 15:04:50 NA          NA      Doc0… en       NA    
##  7 NA     2018-10-30 15:04:50 NA          NA      Doc0… en       NA    
##  8 NA     2018-10-30 15:04:50 NA          NA      Doc0… en       NA    
##  9 NA     2018-10-30 15:04:50 NA          NA      Doc0… en       NA    
## 10 NA     2018-10-30 15:04:50 NA          NA      Doc1… en       NA    
## # ... with 31 more rows, and 1 more variable: text <chr>
d_corpus <- t_corpus %>% 
  select(id, text)

In this tidied format we can read the:

  1. Date and time stamp metadata
  2. id: the document name (i.e. "Doc01.txt")
  3. Language
  4. Text: the actual content.

Now that the corpus is in a tibble, we can use unnest_tokens(), which tokenizes the text variable into words and creates one row per token, returning a tidy data frame. It also converts all characters to lower case.

tidy_df <- t_corpus %>%
  unnest_tokens(word, text)

tidy_df %>%
  select(id, word) %>%
  head(15)
## # A tibble: 15 x 2
##    id        word    
##    <chr>     <chr>   
##  1 Doc01.txt practic 
##  2 Doc01.txt risk    
##  3 Doc01.txt manag   
##  4 Doc01.txt ration  
##  5 Doc01.txt meansend
##  6 Doc01.txt base    
##  7 Doc01.txt process 
##  8 Doc01.txt risk    
##  9 Doc01.txt identifi
## 10 Doc01.txt analys  
## 11 Doc01.txt solv    
## 12 Doc01.txt mitig   
## 13 Doc01.txt although
## 14 Doc01.txt step    
## 15 Doc01.txt seem

4 Cleaning with tidytext

The tidy format lets us use the dplyr grammar to preprocess and clean the data. To remove stopwords we use a stop-word collection that ships with the tidytext package: get_stopwords() is a tidytext function that returns a data frame listing stopwords (frequent but largely uninformative words).

get_stopwords()
## # A tibble: 175 x 2
##    word      lexicon 
##    <chr>     <chr>   
##  1 i         snowball
##  2 me        snowball
##  3 my        snowball
##  4 myself    snowball
##  5 we        snowball
##  6 our       snowball
##  7 ours      snowball
##  8 ourselves snowball
##  9 you       snowball
## 10 your      snowball
## # ... with 165 more rows

anti_join() keeps only the words that do not appear in the data frame passed as its argument, here the stopword list. Another advantage of the tidy format is that we can easily filter on characteristics of the tokens. Here we filter out tokens that are purely numeric: the expression is.na(as.numeric(word)) keeps words that cannot be coerced to a numeric value, discarding tokens that consist only of digits (such as a standalone year like "2016").

wosw_tidy <- tidy_df %>%
  anti_join(get_stopwords()) %>%
  filter(is.na(as.numeric(word)))
## Joining, by = "word"
## Warning in evalq(is.na(as.numeric(word)), <environment>): NAs introduced by
## coercion
wosw_tidy %>%
  select(id, word) %>%
  head(10)
## # A tibble: 10 x 2
##    id        word    
##    <chr>     <chr>   
##  1 Doc01.txt practic 
##  2 Doc01.txt risk    
##  3 Doc01.txt manag   
##  4 Doc01.txt ration  
##  5 Doc01.txt meansend
##  6 Doc01.txt base    
##  7 Doc01.txt process 
##  8 Doc01.txt risk    
##  9 Doc01.txt identifi
## 10 Doc01.txt analys

5 Analysis


5.1 Term frequencies & TF-IDF

Using count() we can identify the term frequencies across the entire corpus.

wosw_tidy %>%
  count(word, sort = TRUE) %>%
  head(10)
## # A tibble: 10 x 2
##    word        n
##    <chr>   <int>
##  1 project   580
##  2 risk      541
##  3 manag     507
##  4 can       487
##  5 use       431
##  6 one       376
##  7 figur     335
##  8 time      333
##  9 task      285
## 10 point     265
wosw_tidy %>%
  count(word, sort = TRUE) %>%
  head(10) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()

General term frequencies (even when calculated per document) are often not very informative because they do not differ much across documents. Many applications therefore calculate the tf-idf score (term frequency-inverse document frequency), which highlights words that appear often within one document but rarely in the others; in other words, words that are both frequent and distinctive. tidytext provides bind_tf_idf(), which adds the tf-idf score to a data frame containing term frequencies and document metadata.
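
To make the weighting concrete, here is the calculation written out by hand for a single term. The counts are taken from the "cluster" row of the table further below; the assumption that the term appears in 5 of the 41 documents is made purely for illustration (it is consistent with the idf of roughly 2.10 that bind_tf_idf reports later):

#tf(t, d)  = occurrences of t in d / total terms in d
#idf(t)    = ln(number of documents / number of documents containing t)
#tf_idf    = tf * idf
n_docs   <- 41               #documents in this corpus
tf_term  <- 101 / 2146       #a term appearing 101 times in a 2146-term document
idf_term <- log(n_docs / 5)  #assumed to appear in 5 of the 41 documents
tf_term * idf_term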

Before calculating the tf-idf score, we count how often each term occurs in each document and how many terms each document contains.

On the question of what level idf should be computed at, the following discussion (quoted from Stack Exchange) argues for computing it per document rather than per class:

"I believe that the IDF is better to run on each document. Because IDF depends on the index (2 in the case of the class, or N for the number of reviews), you will lose some granularity to the underlying structure of the data. Because even though the class might be the same, the underlying unit is actually the person writing the review, and it makes sense then that this should be your index unit. This is because the reviews are the ones being drawn from the distribution (somewhat iid, but not exactly), and so for a better sense of the idf, we want it measured at the level of the drawing of our sample points.

If you were curious, you could train your model for each of the classes separately, and here you would be assuming that the reviews for the two classes are drawn from separate distributions and are sufficiently different to have different models associated with them. I suspect however that the loss in the extra data will be not worth the gain from specifically modeling each class." [https://stats.stackexchange.com/questions/364834/tf-idf-for-text-classification-on-what-should-idf-be-calculated]

#count occurrences of each word in each document
perb <- wosw_tidy %>%
  count(id, word, sort = TRUE) %>%
  ungroup()

#calculate the overall words in the documents

peridc <- perb %>%
  group_by(id) %>%
  summarize(total = sum(n))

book_words <- left_join(perb, peridc)
## Joining, by = "id"
book_words
## # A tibble: 19,094 x 4
##    id        word        n total
##    <chr>     <chr>   <int> <int>
##  1 Doc20.txt cluster   101  2146
##  2 Doc40.txt task       90  1604
##  3 Doc35.txt time       86  2370
##  4 Doc29.txt manag      83  1743
##  5 Doc39.txt correl     79  1419
##  6 Doc09.txt manag      78  1174
##  7 Doc21.txt topic      78  1608
##  8 Doc32.txt project    78   984
##  9 Doc08.txt risk       77  1685
## 10 Doc09.txt risk       77  1174
## # ... with 19,084 more rows

The table above shows the most frequent terms per document, along with the total number of words in each document.

frequency <- wosw_tidy %>%
  mutate(word = str_extract(word, "[a-z']+")) %>%
  count(id, word) %>%
  group_by(id) %>%
  mutate(proportion = n / sum(n)) %>%
  select(-n) 

frequency %>%
  arrange(desc(proportion))
## # A tibble: 19,094 x 3
## # Groups:   id [41]
##    id        word    proportion
##    <chr>     <chr>        <dbl>
##  1 Doc07.txt risk        0.0909
##  2 Doc31.txt design      0.0808
##  3 Doc05.txt project     0.0805
##  4 Doc32.txt project     0.0793
##  5 Doc07.txt project     0.0705
##  6 Doc04.txt eleph       0.0678
##  7 Doc04.txt risk        0.0678
##  8 Doc09.txt manag       0.0664
##  9 Doc02.txt risk        0.0664
## 10 Doc09.txt risk        0.0656
## # ... with 19,084 more rows

Zipf's law states that the frequency at which a word appears is inversely proportional to its rank. In other words, frequency is roughly proportional to 1/rank, so plotting rank against frequency on log-log axes should give an approximately straight line with a negative slope. The distributions above are typical of natural-language corpora.

freq_by_rank <- book_words %>% 
  group_by(id) %>%
  mutate(rank = row_number(), 
         `term frequency` = n/total)

freq_by_rank
## # A tibble: 19,094 x 6
## # Groups:   id [41]
##    id        word        n total  rank `term frequency`
##    <chr>     <chr>   <int> <int> <int>            <dbl>
##  1 Doc20.txt cluster   101  2146     1           0.0471
##  2 Doc40.txt task       90  1604     1           0.0561
##  3 Doc35.txt time       86  2370     1           0.0363
##  4 Doc29.txt manag      83  1743     1           0.0476
##  5 Doc39.txt correl     79  1419     1           0.0557
##  6 Doc09.txt manag      78  1174     1           0.0664
##  7 Doc21.txt topic      78  1608     1           0.0485
##  8 Doc32.txt project    78   984     1           0.0793
##  9 Doc08.txt risk       77  1685     1           0.0457
## 10 Doc09.txt risk       77  1174     2           0.0656
## # ... with 19,084 more rows

The rank column ranks each word within its document, ordered by frequency.

freq_by_rank %>%
  ggplot(aes(rank, `term frequency`, colour = id)) +
  geom_line(size = 1.1, alpha = 0.8, show.legend = FALSE) +
  scale_x_log10() +
  scale_y_log10()

Plotting on log-log coordinates shows that the documents, although separated by id, behave similarly to one another, and that the relationship between rank and term frequency has a clear negative slope.

rank_subset <- freq_by_rank %>%
  filter(rank<500,
         rank>10)

lm(log10(`term frequency`) ~log10(rank), data = rank_subset)
## 
## Call:
## lm(formula = log10(`term frequency`) ~ log10(rank), data = rank_subset)
## 
## Coefficients:
## (Intercept)  log10(rank)  
##     -1.1061      -0.7717

Here we fit the middle section of the rank range to estimate the exponent of the power law. The coefficient on log10(rank) gives a slope of -0.7717.

freq_by_rank %>% 
  ggplot(aes(rank, `term frequency`, color = id)) + 
  geom_abline(intercept = -1.1061, slope = -0.7717, color = "gray50", linetype = 2) + #fitted coefficients from the model above
  geom_line(size = 1.1, alpha = 0.8, show.legend = FALSE) + 
  scale_x_log10() +
  scale_y_log10()

The fitted line is close to the classic Zipf relationship. Deviations at high ranks (rare words) are not uncommon in language studies, since corpora often contain fewer rare words than a single power law predicts. Deviations at low ranks (the most common words, roughly ranks 1-10) are more unusual, and here suggest a higher share of the most common words than the power law would predict.

book_words <- book_words %>%
  bind_tf_idf(word, id, n)

book_words
## # A tibble: 19,094 x 7
##    id        word        n total     tf   idf  tf_idf
##    <chr>     <chr>   <int> <int>  <dbl> <dbl>   <dbl>
##  1 Doc20.txt cluster   101  2146 0.0471 2.10  0.0990 
##  2 Doc40.txt task       90  1604 0.0561 0.769 0.0432 
##  3 Doc35.txt time       86  2370 0.0363 0.280 0.0101 
##  4 Doc29.txt manag      83  1743 0.0476 0.187 0.00891
##  5 Doc39.txt correl     79  1419 0.0557 1.77  0.0984 
##  6 Doc09.txt manag      78  1174 0.0664 0.187 0.0124 
##  7 Doc21.txt topic      78  1608 0.0485 1.15  0.0557 
##  8 Doc32.txt project    78   984 0.0793 0.248 0.0196 
##  9 Doc08.txt risk       77  1685 0.0457 0.718 0.0328 
## 10 Doc09.txt risk       77  1174 0.0656 0.718 0.0471 
## # ... with 19,084 more rows

For very common words the idf, and hence the tf-idf, is close to zero. A word that appears in every one of the 41 documents has idf = ln(41/41) = ln(1) = 0, so this weighting scheme reduces the weight of words that are common across the whole corpus.

book_words %>%
  select(-total) %>%
  arrange(desc(tf_idf))
## # A tibble: 19,094 x 6
##    id        word          n     tf   idf tf_idf
##    <chr>     <chr>     <int>  <dbl> <dbl>  <dbl>
##  1 Doc04.txt eleph        16 0.0678  2.61 0.177 
##  2 Doc26.txt stupid       36 0.0316  3.71 0.117 
##  3 Doc02.txt erm          30 0.0273  3.71 0.101 
##  4 Doc31.txt design       48 0.0808  1.23 0.0993
##  5 Doc20.txt cluster     101 0.0471  2.10 0.0990
##  6 Doc39.txt correl       79 0.0557  1.77 0.0984
##  7 Doc10.txt score        33 0.0517  1.77 0.0914
##  8 Doc24.txt scapegoat    14 0.0218  3.71 0.0809
##  9 Doc37.txt dart         15 0.0246  3.02 0.0744
## 10 Doc07.txt strateg      20 0.0455  1.63 0.0743
## # ... with 19,084 more rows

Here we see the distinctive terms of individual documents. None of these words occur across the whole corpus, which is what makes them characteristic of the documents they appear in.
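
These characteristic terms can be visualised per document in the same way as the raw frequencies earlier. A sketch for a handful of documents picked from the table above:

book_words %>%
  filter(id %in% c("Doc04.txt", "Doc20.txt", "Doc26.txt", "Doc39.txt")) %>%
  group_by(id) %>%
  top_n(5, tf_idf) %>%
  ungroup() %>%
  mutate(word = reorder(word, tf_idf)) %>%
  ggplot(aes(word, tf_idf, fill = id)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~id, scales = "free") +
  coord_flip()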

5.2 Sentiment Analysis

Sentiment analysis gives us the opportunity to investigate the emotional content of text, for example whether it is predominantly positive or negative.

To analyse sentiment we treat the text as a combination of its individual words, and the sentiment of the whole text as the sum of the sentiment of those words. This is not the only way to approach sentiment analysis, but it is a commonly used one.

The workflow is as per below:

Flowchart of text analysis

sentiments
## # A tibble: 27,314 x 4
##    word        sentiment lexicon score
##    <chr>       <chr>     <chr>   <int>
##  1 abacus      trust     nrc        NA
##  2 abandon     fear      nrc        NA
##  3 abandon     negative  nrc        NA
##  4 abandon     sadness   nrc        NA
##  5 abandoned   anger     nrc        NA
##  6 abandoned   fear      nrc        NA
##  7 abandoned   negative  nrc        NA
##  8 abandoned   sadness   nrc        NA
##  9 abandonment anger     nrc        NA
## 10 abandonment fear      nrc        NA
## # ... with 27,304 more rows

The tidytext package provides the sentiments dataset, which combines several sentiment lexicons.

The three general-purpose lexicons are:

  1. AFINN from Finn Nielsen,
  2. bing from Bing Liu and Collaborators, and
  3. nrc from Saif Mohammad and Peter Turney.

All of the lexicons above are based on unigrams, i.e. single English words. AFINN assigns each word a score from -5 to 5 for negative to positive sentiment, bing classifies words simply as positive or negative, and nrc flags words with a binary yes/no for categories such as joy, anger, fear, trust, positive and negative.
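
A quick look at the first few rows of each lexicon shows how differently sentiment is encoded (a sketch; depending on your tidytext version, the AFINN and nrc lexicons may need to be downloaded once via the textdata package):

get_sentiments("afinn") %>% head(3)  #numeric score per word, from -5 to 5
get_sentiments("bing") %>% head(3)   #positive/negative label per word
get_sentiments("nrc") %>% head(3)    #emotion category per word (yes/no membership)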

tidy_joy <- get_sentiments("nrc") %>%
  filter(sentiment == "joy")

tidy_df %>% 
  inner_join(tidy_joy) %>%
  count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 68 x 2
##    word        n
##    <chr>   <int>
##  1 organ      91
##  2 good       90
##  3 share      64
##  4 success    59
##  5 present    57
##  6 score      48
##  7 excel      34
##  8 deal       29
##  9 content    25
## 10 found      23
## # ... with 58 more rows

We can see mostly positive words here. Let's apply this across the documents in our corpus.

tidy_df_sent <- tidy_df %>%
  inner_join(get_sentiments("bing")) %>%
  count(id, sentiment) %>%
  spread(sentiment, n, fill = 0)
## Joining, by = "word"

tidy_df_sent <- tidy_df_sent %>%
  mutate(sentiment = positive - negative)

tidy_df_sent
## # A tibble: 41 x 4
##    id        negative positive sentiment
##    <chr>        <dbl>    <dbl>     <dbl>
##  1 Doc01.txt      146       36      -110
##  2 Doc02.txt      106       35       -71
##  3 Doc03.txt       46        9       -37
##  4 Doc04.txt       30        5       -25
##  5 Doc05.txt       47       43        -4
##  6 Doc06.txt      105       30       -75
##  7 Doc07.txt       47       12       -35
##  8 Doc08.txt      150       43      -107
##  9 Doc09.txt      123       32       -91
## 10 Doc10.txt       55       18       -37
## # ... with 31 more rows
ggplot(tidy_df_sent, aes(id, sentiment, fill = id)) +
  geom_col(show.legend = FALSE) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

The table shows that the first ten documents all have a net negative sentiment score; the bar chart lets us compare the net sentiment of all 41 documents at a glance.

bing_word_counts <- tidy_df %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
## Joining, by = "word"
bing_word_counts %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word,n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment",
       x = NULL) +
  coord_flip()
## Selecting by n

As per the plot above, we can see how much words such as "risk" and "problem" contribute to negative sentiment.

Sentiment analysis provides a useful way to understand the emotional attributes associated with a text.

5.3 Relationships between words: n-grams

#tokenize the raw text column into two-word sequences (bigrams)
bigrams <- t_corpus %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

bigrams
## # A tibble: 47,012 x 8
##    author datetimestamp       description heading id    language origin
##    <lgl>  <dttm>              <lgl>       <lgl>   <chr> <chr>    <lgl> 
##  1 NA     2018-10-30 15:04:50 NA          NA      Doc0… en       NA    
##  2 NA     2018-10-30 15:04:50 NA          NA      Doc0… en       NA    
##  3 NA     2018-10-30 15:04:50 NA          NA      Doc0… en       NA    
##  4 NA     2018-10-30 15:04:50 NA          NA      Doc0… en       NA    
##  5 NA     2018-10-30 15:04:50 NA          NA      Doc0… en       NA    
##  6 NA     2018-10-30 15:04:50 NA          NA      Doc0… en       NA    
##  7 NA     2018-10-30 15:04:50 NA          NA      Doc0… en       NA    
##  8 NA     2018-10-30 15:04:50 NA          NA      Doc0… en       NA    
##  9 NA     2018-10-30 15:04:50 NA          NA      Doc0… en       NA    
## 10 NA     2018-10-30 15:04:50 NA          NA      Doc0… en       NA    
## # ... with 47,002 more rows, and 1 more variable: bigram <chr>
bigrams %>%
  count(bigram, sort = TRUE)
## # A tibble: 35,865 x 2
##    bigram                   n
##    <chr>                <int>
##  1 project manag          139
##  2 risk manag             133
##  3 complet time           105
##  4 doc tmmapdoc            64
##  5 figur figur             61
##  6 mont carlo              61
##  7 best practic            51
##  8 shown figur             51
##  9 triangular distribut    47
## 10 one can                 39
## # ... with 35,855 more rows

We can see that the most common bigrams are "project manag", "risk manag" and so on.

bigram_tf_idf <- bigrams %>%
  count(id, bigram) %>%
  bind_tf_idf(bigram, id, n) %>%
  arrange(desc(tf_idf))

bigram_tf_idf
## # A tibble: 41,495 x 6
##    id        bigram                      n     tf   idf tf_idf
##    <chr>     <chr>                   <int>  <dbl> <dbl>  <dbl>
##  1 Doc07.txt strateg risk               12 0.0273  3.71 0.101 
##  2 Doc15.txt inform knowledg            15 0.0196  3.71 0.0727
##  3 Doc26.txt function stupid            20 0.0175  3.71 0.0649
##  4 Doc27.txt side effect                14 0.0163  3.71 0.0606
##  5 Doc36.txt standard deviat            10 0.0190  3.02 0.0574
##  6 Doc24.txt scapegoat approach         10 0.0154  3.71 0.0573
##  7 Doc33.txt best practic               39 0.0358  1.52 0.0543
##  8 Doc17.txt infrastructur technolog    11 0.0137  3.71 0.0509
##  9 Doc28.txt organis cultur              9 0.0189  2.61 0.0495
## 10 Doc09.txt risk manag                 37 0.0315  1.52 0.0477
## # ... with 41,485 more rows

By combining tf-idf with n-grams we can discover the most distinctive two-word phrases in each document of the corpus.

Using bigrams rather than unigrams gives us the opportunity to capture some structure and context that single words miss.

5.3.1 Visualising a network of bigrams

R is able to visualise the relationships among many words simultaneously by arranging them into a network of nodes connected by edges. Such a graph has three variables, as illustrated in the small sketch after this list:

  1. From: the node an edge is coming from
  2. To: the node an edge is going towards
  3. Weight: a numeric value associated with each edge.
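
A toy edge list with these three variables, using weights taken from the bigram counts above (purely illustrative):

toy_edges <- tibble(from   = c("project", "risk"),
                    to     = c("manag", "manag"),
                    weight = c(139, 133))
graph_from_data_frame(toy_edges)
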
bigrams_separated <- bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ")

bigrams_filtered <- bigrams_separated %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)

# new bigram counts:
bigram_counts <- bigrams_filtered %>% 
  count(word1, word2, sort = TRUE)

bigram_graph <- bigram_counts %>%
  filter(n > 20) %>%
  graph_from_data_frame()
ggraph(bigram_graph, layout = "fr") +
  geom_edge_link() +
  geom_node_point() +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1)

a <- grid::arrow(type = "closed", length = unit(.15, "inches"))

ggraph(bigram_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
                 arrow = a, end_cap = circle(.07, 'inches')) +
  geom_node_point(color = "lightblue", size = 5) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void()

The figure above visualises the bigram structure of the text: which words are linked and in which direction. The more common a bigram, the darker its arrow.

5.3.2 Pairwise Correlation

We want to examine the correlation among words, which indicates how often they appear together relative to how often they appear separately.
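
pairwise_cor() computes the phi coefficient, which is simply the Pearson correlation of the two words' presence/absence indicators across documents. A by-hand check for a single pair (a sketch; "mont" and "carlo" are taken from the output below and should give a correlation of 1 because they always co-occur):

presence <- wosw_tidy %>%
  distinct(id, word) %>%
  mutate(present = 1) %>%
  spread(word, present, fill = 0)

#should reproduce the pairwise_cor() value for this pair
cor(presence$mont, presence$carlo)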

word_cors <- wosw_tidy %>%
  group_by(word) %>%
  filter(n() >= 20) %>%
  pairwise_cor(word, id, sort = TRUE)
## Warning: Trying to compute distinct() for variables not found in the data:
## - `row_col`, `column_col`
## This is an error, but only a warning is raised for compatibility reasons.
## The operation will return the input unchanged.
word_cors
## # A tibble: 331,200 x 3
##    item1      item2      correlation
##    <chr>      <chr>            <dbl>
##  1 carlo      mont             1    
##  2 mont       carlo            1    
##  3 compendium ibi              0.995
##  4 ibi        compendium       0.995
##  5 erm        appetit          0.994
##  6 appetit    erm              0.994
##  7 tmax       tmin             0.993
##  8 tmin       tmax             0.993
##  9 tmax       tml              0.991
## 10 tml        tmax             0.991
## # ... with 331,190 more rows

The output above is useful for further exploration. Let's have a look at the term "tmax".

word_cors %>%
  filter(item1 == "tmax")
## # A tibble: 575 x 3
##    item1 item2      correlation
##    <chr> <chr>            <dbl>
##  1 tmax  tmin             0.993
##  2 tmax  tml              0.991
##  3 tmax  pt               0.935
##  4 tmax  t                0.848
##  5 tmax  estim            0.843
##  6 tmax  correspond       0.835
##  7 tmax  triangular       0.828
##  8 tmax  simul            0.827
##  9 tmax  obtain           0.805
## 10 tmax  equat            0.802
## # ... with 565 more rows
word_cors %>%
  filter(correlation > .95) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = correlation), show.legend = FALSE) +
  geom_node_point(color = "lightblue", size = 5) +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_void()

The correlation plot above shows word pairs with a correlation above 0.95. Unlike the bigram network, these relationships are symmetrical rather than directional.

5.4 Topic Modelling

Topic modelling is an unsupervised classification of documents that divides the collection into natural groups so that we can understand each group separately.

For this project we'll apply the Latent Dirichlet Allocation (LDA) method for fitting a topic model. LDA:

  1. Treats each document as a mixture of topics
  2. Treats each topic as a mixture of words

This allows for potential “overlap” in content rather than being classified as discrete groups.

#create Document Term Matrix
dtm <- DocumentTermMatrix(docs)

#create model fit
ap_lda <- LDA(dtm, k = 3, control = list())
ap_lda
## A LDA_VEM topic model with 3 topics.
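
LDA involves random initialisation, so the fitted topics can vary between runs. If exact reproducibility matters, the seed can be fixed via the control list (a sketch of a reproducible variant of the fit above; the seed value 1234 is arbitrary):

ap_lda_seeded <- LDA(dtm, k = 3, control = list(seed = 1234))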

Now that we've fitted an LDA model, we can explore and interpret it.

5.4.1 Word-Topic Probabilities

Using tidy(), we can extract the per-topic-per-word probabilities, known as "beta", from the model.

ap_topics <- tidy(ap_lda, matrix = "beta")
ap_topics
## # A tibble: 13,506 x 3
##    topic term           beta
##    <int> <chr>         <dbl>
##  1     1 –demonstr 7.60e-136
##  2     2 –demonstr 1.93e-139
##  3     3 –demonstr 8.87e-  5
##  4     1 –especi   4.69e-125
##  5     2 –especi   4.26e-  5
##  6     3 –especi   6.36e-130
##  7     1 –howev    5.66e- 43
##  8     2 –howev    2.87e-  9
##  9     3 –howev    8.87e-  5
## 10     1 –techniqu 6.33e-132
## # ... with 13,496 more rows

The returned tibble is in a one-topic-per-term-per-row format. For each combination, the model computes the probability of that term being generated from that topic. For example, the term "–especi" has a probability of:

  1. Topic 1 - 4.69 x 10^-125
  2. Topic 2 - 4.26 x 10^-5
  3. Topic 3 - 6.36 x 10^-130

Let's find the ten terms that are most common within each topic.

ap_top_terms <- ap_topics %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)
ap_top_terms %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()

This visualization lets us understand the three topics that were extracted from the corpus. The most common words in topic 1 include "task", "time", "figure" and "distribution", which suggests it may represent educational texts. The most common words in topic 2 include "document", "cluster" and "topic", suggesting that this topic represents text-classification texts. Those most common in topic 3 include "project", "risk" and "manag", suggesting that this topic represents project-management texts. One important observation is that some words, such as "can" and "use", are common within all three topics.

To compare how strongly a word is associated with one topic rather than another, the next step keeps reasonably common words and computes the log2 ratio of the beta values of topic 2 relative to topic 1.

beta_spread <- ap_topics %>%
  mutate(topic = paste0("topic", topic)) %>%
  spread(topic, beta) %>%
  filter(topic1 > .001 | topic2 > .001 | topic3 > 0.001) %>%
  mutate(log_ratio = log2(topic2 / topic1))

beta_spread
## # A tibble: 467 x 5
##    term             topic1       topic2   topic3 log_ratio
##    <chr>             <dbl>        <dbl>    <dbl>     <dbl>
##  1 accept    0.00000000225 0.00102      1.95e-19    18.8  
##  2 achiev    0.000152      0.00197      1.83e-14     3.69 
##  3 action    0.000585      0.00155      3.06e-24     1.40 
##  4 address   0.000288      0.00159      1.31e- 4     2.46 
##  5 adjust    0.00131       0.00000148   1.20e- 8    -9.79 
##  6 affect    0.000363      0.00103      1.39e- 4     1.51 
##  7 aim       0.000476      0.00147      6.17e- 4     1.63 
##  8 algorithm 0.00599       0.0000000285 1.22e- 3   -17.7  
##  9 also      0.00307       0.00227      1.75e- 3    -0.434
## 10 although  0.000793      0.00181      1.89e- 3     1.19 
## # ... with 457 more rows

5.4.2 Document-topic Probabilities

LDA also models each document as a mixture of topics; these per-document-per-topic probabilities, known as "gamma", can also be extracted with tidy().

ap_documents <- tidy(ap_lda, matrix = "gamma")
ap_documents
## # A tibble: 123 x 3
##    document  topic     gamma
##    <chr>     <int>     <dbl>
##  1 Doc01.txt     1 0.0000348
##  2 Doc02.txt     1 0.287    
##  3 Doc03.txt     1 0.000105 
##  4 Doc04.txt     1 0.000195 
##  5 Doc05.txt     1 0.0000519
##  6 Doc06.txt     1 0.0000453
##  7 Doc07.txt     1 0.000105 
##  8 Doc08.txt     1 0.0000275
##  9 Doc09.txt     1 0.0000398
## 10 Doc10.txt     1 0.0000730
## # ... with 113 more rows

Looking at the first ten documents, document 2 has a gamma of about 0.29 for topic 1, while the others have gamma values close to zero for topic 1, suggesting that most of these documents are drawn largely from topics 2 and 3.
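
To see which topic dominates each document, we can keep the row with the largest gamma per document, reusing the top_n() pattern from earlier (a small sketch):

ap_documents %>%
  group_by(document) %>%
  top_n(1, gamma) %>%
  ungroup() %>%
  arrange(document)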

5.5 Hierarchical Clustering

We can also cluster the documents themselves, grouping together documents whose term vectors are similar and which are therefore characterised by similar vocabulary.

m <- as.matrix(dtm)
#compute distance between document vectors
d <- dist(m)
#run hierarchical clustering using Ward's method (explore other options later)
groups <- hclust(d,method = "ward.D")
#plot, use hang to ensure that labels fall below tree
plot(groups, hang=-1)

hclusters <- cutree(groups,3)
hclusters
## Doc01.txt Doc02.txt Doc03.txt Doc04.txt Doc05.txt Doc06.txt Doc07.txt 
##         1         1         1         1         1         1         1 
## Doc08.txt Doc09.txt Doc10.txt Doc11.txt Doc12.txt Doc13.txt Doc14.txt 
##         1         1         1         1         1         1         1 
## Doc15.txt Doc16.txt Doc17.txt Doc18.txt Doc19.txt Doc20.txt Doc21.txt 
##         1         1         1         1         2         2         2 
## Doc22.txt Doc23.txt Doc24.txt Doc25.txt Doc26.txt Doc27.txt Doc28.txt 
##         2         1         1         1         1         1         1 
## Doc29.txt Doc30.txt Doc31.txt Doc32.txt Doc33.txt Doc34.txt Doc35.txt 
##         1         1         1         1         1         1         3 
## Doc36.txt Doc37.txt Doc38.txt Doc39.txt Doc40.txt Doc41.txt 
##         1         1         3         3         3         3
#lets try another distance measure
cosineSim <- function(x){
  as.dist(x%*%t(x)/(sqrt(rowSums(x^2) %*% t(rowSums(x^2)))))
}
cs <- cosineSim(m)
cd <- 1-cs
#run hierarchical clustering using cosine distance
groups <- hclust(cd,method="ward.D")
#plot, use hang to ensure that labels fall below tree
plot(groups, hang=-1)

hclusters_cosine <- cutree(groups,3)

5.6 K-Means Clustering

#compute distance between document vectors
d <- dist(m)
#kmeans clustering
#kmeans - run with nstart=100 and k=2,3,5 to compare results with hclust
kfit <- kmeans(d, 4, nstart=100)
#plot - 
clusplot(as.matrix(d), kfit$cluster, color=T, shade=T, labels=2, lines=0)

#print contents of kfit
print(kfit)
## K-means clustering with 4 clusters of sizes 19, 6, 2, 14
## 
## Cluster means:
##   Doc01.txt Doc02.txt Doc03.txt Doc04.txt Doc05.txt Doc06.txt Doc07.txt
## 1  108.1902  109.8525  63.73692  62.53554  98.52241  106.8860  74.07278
## 2  164.0325  168.2921 141.07874 144.29141 159.47704  166.1681 149.68420
## 3  202.2290  205.8845 180.65401 185.69205 192.65477  201.0714 189.07201
## 4  109.0816  111.1807  99.62930 103.79043 105.94408  105.2069  99.98784
##   Doc08.txt Doc09.txt Doc10.txt Doc11.txt Doc12.txt Doc13.txt Doc14.txt
## 1  130.8908  136.8752  77.02794  138.4394   72.7574  74.16148  106.0221
## 2  174.6133  187.7614 147.25453  145.8428  137.4458 142.97565  156.0618
## 3  201.6781  216.8520 189.16669  210.8285  184.8022 184.96900  192.4457
## 4  123.3366  121.2925 108.18662  160.1996  113.4550 116.14897  127.5738
##   Doc15.txt Doc16.txt Doc17.txt Doc18.txt Doc19.txt Doc20.txt Doc21.txt
## 1  82.09724  73.29033  79.84728  102.7634  170.8184  166.2086  131.9923
## 2 142.55259 139.39946 144.00266  150.6986  141.2076  141.9019  129.9559
## 3 192.09524 184.37799 185.52908  189.3455  218.2610  218.9969  202.4562
## 4 116.39036 113.30962 120.73450  122.4294  182.9737  183.4248  152.7061
##   Doc22.txt Doc23.txt Doc24.txt Doc25.txt Doc26.txt Doc27.txt Doc28.txt
## 1  126.7113  74.13313  74.19499  76.11665  86.36957  82.06487  68.69331
## 2  124.3788 145.39231 146.53521 146.38600 150.81875 149.34265 145.15200
## 3  199.0643 182.51451 187.82722 187.85944 192.80172 188.78237 187.11769
## 4  151.0251 113.50634 113.40627 107.53746 121.43078 117.98936 106.03430
##   Doc29.txt Doc30.txt Doc31.txt Doc32.txt Doc33.txt Doc34.txt Doc35.txt
## 1  133.7728  116.9127  80.61226  100.1876  102.4817  82.74281 203.88766
## 2  178.3254  164.5905 150.51414  161.2192  156.6190 148.99268 215.46146
## 3  212.8706  204.9333 191.24404  195.2125  198.9165 186.97323  63.50591
## 4  130.4496  124.4860 118.70398  111.7444  120.9206 116.65152 206.06853
##   Doc36.txt Doc37.txt Doc38.txt Doc39.txt Doc40.txt Doc41.txt
## 1   70.6491   73.0388  101.8259  135.3703 167.80122  116.0721
## 2  141.6723  140.0531  152.2319  146.7843 189.43933  155.1067
## 3  175.9253  173.6405  137.7472  165.0955  63.50591  123.1712
## 4  112.6261  112.8993  113.0662  153.3267 176.07607  127.9183
## 
## Clustering vector:
## Doc01.txt Doc02.txt Doc03.txt Doc04.txt Doc05.txt Doc06.txt Doc07.txt 
##         4         4         1         1         4         4         1 
## Doc08.txt Doc09.txt Doc10.txt Doc11.txt Doc12.txt Doc13.txt Doc14.txt 
##         4         4         1         2         1         1         4 
## Doc15.txt Doc16.txt Doc17.txt Doc18.txt Doc19.txt Doc20.txt Doc21.txt 
##         1         1         1         4         2         2         2 
## Doc22.txt Doc23.txt Doc24.txt Doc25.txt Doc26.txt Doc27.txt Doc28.txt 
##         2         1         1         1         1         1         1 
## Doc29.txt Doc30.txt Doc31.txt Doc32.txt Doc33.txt Doc34.txt Doc35.txt 
##         4         4         1         4         4         1         3 
## Doc36.txt Doc37.txt Doc38.txt Doc39.txt Doc40.txt Doc41.txt 
##         1         1         4         2         3         4 
## 
## Within cluster sum of squares by cluster:
## [1] 166966.24 202025.47  37096.15 348849.43
##  (between_SS / total_SS =  62.0 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"
#print cluster sizes
kfit$size
## [1] 19  6  2 14
#print clusters (members)
kfit$cluster
## Doc01.txt Doc02.txt Doc03.txt Doc04.txt Doc05.txt Doc06.txt Doc07.txt 
##         4         4         1         1         4         4         1 
## Doc08.txt Doc09.txt Doc10.txt Doc11.txt Doc12.txt Doc13.txt Doc14.txt 
##         4         4         1         2         1         1         4 
## Doc15.txt Doc16.txt Doc17.txt Doc18.txt Doc19.txt Doc20.txt Doc21.txt 
##         1         1         1         4         2         2         2 
## Doc22.txt Doc23.txt Doc24.txt Doc25.txt Doc26.txt Doc27.txt Doc28.txt 
##         2         1         1         1         1         1         1 
## Doc29.txt Doc30.txt Doc31.txt Doc32.txt Doc33.txt Doc34.txt Doc35.txt 
##         4         4         1         4         4         1         3 
## Doc36.txt Doc37.txt Doc38.txt Doc39.txt Doc40.txt Doc41.txt 
##         1         1         4         2         3         4
#sum of squared distance between cluster centers 
kfit$betweenss
## [1] 1234173
#sum of squared distance within a cluster (these are the quantities that the
#algorithm attempts to minimise)
kfit$withinss
## [1] 166966.24 202025.47  37096.15 348849.43

Let's determine the optimal number of clusters by identifying the elbow in a plot of the summed intra-cluster distances (withinss) as a function of k.

wss <- numeric(length(docs)-1)  #within-groups sum of squares for each k
for (i in 2:(length(docs)-1)) wss[i] <- sum(kmeans(d,centers=i,nstart=25)$withinss)
plot(2:(length(docs)-1), wss[2:(length(docs)-1)], type="b", xlab="Number of Clusters",
     ylab="Within groups sum of squares") 

The elbow is not very pronounced; either 3 or 5 clusters would be a reasonable choice.

cosineSim <- function(x){
  as.dist(x%*%t(x)/(sqrt(rowSums(x^2) %*% t(rowSums(x^2)))))
}
cs <- cosineSim(m)
cd <- 1-cs
kfit <- kmeans(cd, 5, nstart=100)
clusplot(as.matrix(cd), kfit$cluster, color=T, shade=T, labels=2, lines=0)

#print contents of kfit
print(kfit)
## K-means clustering with 5 clusters of sizes 9, 7, 7, 6, 12
## 
## Cluster means:
##   Doc01.txt Doc02.txt Doc03.txt Doc04.txt Doc05.txt Doc06.txt Doc07.txt
## 1 0.3467390 0.3702095 0.3725129 0.4572123 0.5774887 0.3310664 0.3681517
## 2 0.8224432 0.8641520 0.7847439 0.8763468 0.8143455 0.8271382 0.8471637
## 3 0.8417660 0.8762549 0.7288068 0.8695281 0.8433583 0.8398390 0.8513733
## 4 0.7163626 0.7265631 0.7190131 0.7761829 0.3939277 0.5672919 0.5972205
## 5 0.7935452 0.8242923 0.7887771 0.8836909 0.8248001 0.8220622 0.8529045
##   Doc08.txt Doc09.txt Doc10.txt Doc11.txt Doc12.txt Doc13.txt Doc14.txt
## 1 0.3534253 0.3355292 0.4667107 0.8031170 0.7742731 0.8938587 0.8770232
## 2 0.8077241 0.8362993 0.8415093 0.3939766 0.4243705 0.4921867 0.5330926
## 3 0.7331937 0.8449659 0.8222052 0.8014180 0.8055787 0.8384485 0.8204215
## 4 0.6927705 0.5058114 0.7941625 0.7443012 0.7371821 0.9000575 0.8608461
## 5 0.7750194 0.8420439 0.8288928 0.7235806 0.7374335 0.8243347 0.8156602
##   Doc15.txt Doc16.txt Doc17.txt Doc18.txt Doc19.txt Doc20.txt Doc21.txt
## 1 0.8108352 0.8119106 0.8681651 0.8367485 0.7857325 0.8708532 0.8559012
## 2 0.5485686 0.4036151 0.4809752 0.7836799 0.7145495 0.7757845 0.8126291
## 3 0.8673541 0.8080342 0.8136435 0.7935313 0.7364108 0.7725986 0.7965062
## 4 0.6302067 0.7458052 0.8603819 0.8412972 0.7484149 0.8662933 0.8419034
## 5 0.7749949 0.7795732 0.8085912 0.7191706 0.6146604 0.6795011 0.6852190
##   Doc22.txt Doc23.txt Doc24.txt Doc25.txt Doc26.txt Doc27.txt Doc28.txt
## 1 0.8691893 0.8639050 0.8071345 0.6836051 0.7889370 0.7985686 0.7418661
## 2 0.7352509 0.8484055 0.8295581 0.7493001 0.7675647 0.7739771 0.8326160
## 3 0.7765359 0.8446099 0.8727796 0.8518514 0.8637897 0.8449760 0.8715498
## 4 0.8700103 0.7529901 0.7873905 0.4921728 0.7001361 0.7168138 0.4493658
## 5 0.6875826 0.7437957 0.7436182 0.7024206 0.6879769 0.7120552 0.7972733
##   Doc29.txt Doc30.txt Doc31.txt Doc32.txt Doc33.txt Doc34.txt Doc35.txt
## 1 0.6441959 0.7254344 0.8397107 0.6909956 0.7977239 0.7672331 0.8257439
## 2 0.7947143 0.6969687 0.8329971 0.8081530 0.7750562 0.7176933 0.7884683
## 3 0.8602805 0.8542394 0.8998757 0.8650002 0.8701797 0.8303617 0.3605370
## 4 0.3982155 0.4972008 0.7594473 0.3936417 0.7211006 0.6791412 0.8099358
## 5 0.7502048 0.7519181 0.7529356 0.8158524 0.7096813 0.7121688 0.7922491
##   Doc36.txt Doc37.txt Doc38.txt Doc39.txt Doc40.txt Doc41.txt
## 1 0.8941193 0.8542984 0.6150702 0.8684120 0.8757189 0.8283629
## 2 0.8587821 0.8014595 0.8340652 0.8475437 0.8239996 0.8005800
## 3 0.5317026 0.5123296 0.3925266 0.4895318 0.3578506 0.3377660
## 4 0.9016832 0.8637042 0.8729390 0.8396498 0.8762019 0.8398788
## 5 0.8556040 0.8124324 0.8410490 0.8360226 0.8301849 0.8087150
## 
## Clustering vector:
## Doc01.txt Doc02.txt Doc03.txt Doc04.txt Doc05.txt Doc06.txt Doc07.txt 
##         1         1         1         1         4         1         1 
## Doc08.txt Doc09.txt Doc10.txt Doc11.txt Doc12.txt Doc13.txt Doc14.txt 
##         1         1         1         2         2         2         2 
## Doc15.txt Doc16.txt Doc17.txt Doc18.txt Doc19.txt Doc20.txt Doc21.txt 
##         2         2         2         5         5         5         5 
## Doc22.txt Doc23.txt Doc24.txt Doc25.txt Doc26.txt Doc27.txt Doc28.txt 
##         5         5         5         4         5         5         4 
## Doc29.txt Doc30.txt Doc31.txt Doc32.txt Doc33.txt Doc34.txt Doc35.txt 
##         4         4         5         4         5         5         3 
## Doc36.txt Doc37.txt Doc38.txt Doc39.txt Doc40.txt Doc41.txt 
##         3         3         3         3         3         3 
## 
## Within cluster sum of squares by cluster:
## [1] 3.087747 3.133126 2.966720 2.501078 8.619944
##  (between_SS / total_SS =  60.6 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"
#print cluster sizes
kfit$size
## [1]  9  7  7  6 12
#print clusters (members)
kfit$cluster
## Doc01.txt Doc02.txt Doc03.txt Doc04.txt Doc05.txt Doc06.txt Doc07.txt 
##         1         1         1         1         4         1         1 
## Doc08.txt Doc09.txt Doc10.txt Doc11.txt Doc12.txt Doc13.txt Doc14.txt 
##         1         1         1         2         2         2         2 
## Doc15.txt Doc16.txt Doc17.txt Doc18.txt Doc19.txt Doc20.txt Doc21.txt 
##         2         2         2         5         5         5         5 
## Doc22.txt Doc23.txt Doc24.txt Doc25.txt Doc26.txt Doc27.txt Doc28.txt 
##         5         5         5         4         5         5         4 
## Doc29.txt Doc30.txt Doc31.txt Doc32.txt Doc33.txt Doc34.txt Doc35.txt 
##         4         4         5         4         5         5         3 
## Doc36.txt Doc37.txt Doc38.txt Doc39.txt Doc40.txt Doc41.txt 
##         3         3         3         3         3         3
#sum of squared distance between cluster centers 
kfit$betweenss
## [1] 31.26847
#sum of squared distance within a cluster (these are the quantities that the
#algorithm attempts to minimise)
kfit$withinss
## [1] 3.087747 3.133126 2.966720 2.501078 8.619944

5.7 Network Graphs

#these numbers will be used to reference files in the network graph
filekey <- cbind(1:length(docs),rownames(m))
rownames(m) <- 1:length(docs)

#compute cosine similarity between document vectors
#converting to distance matrix sets diagonal elements to 0
cosineSim <- function(x){
  as.dist(x%*%t(x)/(sqrt(rowSums(x^2) %*% t(rowSums(x^2)))))
}
cs <- cosineSim(m)

#adjacency matrix: set entries below a certain threshold to 0.
#Here the cutoff is a fraction of the largest element of the matrix
#(max(cs)/1.2), with an additional absolute floor of 0.5. This is an arbitrary choice.
cs[cs < max(cs)/1.2] <- 0
cs[cs < 0.5] <- 0
cs <- round(cs,3)
# build a graph from the above matrix
#mode is undirected because similarity is a bidirectional relationship
g <- graph.adjacency(as.matrix(cs), weighted=T, mode = "undirected")
#one of many possible layouts, see igraph docs
layout1 <- layout.fruchterman.reingold(g)
#basic plot with no weighting, using the Fruchterman-Reingold layout
plot(g, layout=layout1)

#another layout
plot(g, layout=layout.kamada.kawai)

comm_fg <- fastgreedy.community(g)
comm_fg$membership
##  [1]  1  1  6  7  5  1  1  3  1  3  4  4  8  9 10  4 11 12 13 14 15 16 17
## [24] 18 19 20 21 22 23 24 25  5 26 27  2 28 29  2 30  2  2
V(g)$color <- comm_fg$membership
plot(g, layout=layout.kamada.kawai)

community_mapping <- cbind(as.data.frame(filekey, row.names = F),comm_fg$membership)
community_mapping
##    V1        V2 comm_fg$membership
## 1   1 Doc01.txt                  1
## 2   2 Doc02.txt                  1
## 3   3 Doc03.txt                  6
## 4   4 Doc04.txt                  7
## 5   5 Doc05.txt                  5
## 6   6 Doc06.txt                  1
## 7   7 Doc07.txt                  1
## 8   8 Doc08.txt                  3
## 9   9 Doc09.txt                  1
## 10 10 Doc10.txt                  3
## 11 11 Doc11.txt                  4
## 12 12 Doc12.txt                  4
## 13 13 Doc13.txt                  8
## 14 14 Doc14.txt                  9
## 15 15 Doc15.txt                 10
## 16 16 Doc16.txt                  4
## 17 17 Doc17.txt                 11
## 18 18 Doc18.txt                 12
## 19 19 Doc19.txt                 13
## 20 20 Doc20.txt                 14
## 21 21 Doc21.txt                 15
## 22 22 Doc22.txt                 16
## 23 23 Doc23.txt                 17
## 24 24 Doc24.txt                 18
## 25 25 Doc25.txt                 19
## 26 26 Doc26.txt                 20
## 27 27 Doc27.txt                 21
## 28 28 Doc28.txt                 22
## 29 29 Doc29.txt                 23
## 30 30 Doc30.txt                 24
## 31 31 Doc31.txt                 25
## 32 32 Doc32.txt                  5
## 33 33 Doc33.txt                 26
## 34 34 Doc34.txt                 27
## 35 35 Doc35.txt                  2
## 36 36 Doc36.txt                 28
## 37 37 Doc37.txt                 29
## 38 38 Doc38.txt                  2
## 39 39 Doc39.txt                 30
## 40 40 Doc40.txt                  2
## 41 41 Doc41.txt                  2
comm_lv <- cluster_louvain(g)
comm_lv$membership
##  [1]  3  3  1  2 24  3  3  4  3  4  5  5  6  7  8  5  9 10 11 12 13 14 15
## [24] 16 17 18 19 20 21 22 23 24 25 26 30 27 28 30 29 30 30
V(g)$color <- comm_lv$membership
plot(g, layout=layout.kamada.kawai)

community_mapping <- cbind(community_mapping,comm_lv$membership)
community_mapping
##    V1        V2 comm_fg$membership comm_lv$membership
## 1   1 Doc01.txt                  1                  3
## 2   2 Doc02.txt                  1                  3
## 3   3 Doc03.txt                  6                  1
## 4   4 Doc04.txt                  7                  2
## 5   5 Doc05.txt                  5                 24
## 6   6 Doc06.txt                  1                  3
## 7   7 Doc07.txt                  1                  3
## 8   8 Doc08.txt                  3                  4
## 9   9 Doc09.txt                  1                  3
## 10 10 Doc10.txt                  3                  4
## 11 11 Doc11.txt                  4                  5
## 12 12 Doc12.txt                  4                  5
## 13 13 Doc13.txt                  8                  6
## 14 14 Doc14.txt                  9                  7
## 15 15 Doc15.txt                 10                  8
## 16 16 Doc16.txt                  4                  5
## 17 17 Doc17.txt                 11                  9
## 18 18 Doc18.txt                 12                 10
## 19 19 Doc19.txt                 13                 11
## 20 20 Doc20.txt                 14                 12
## 21 21 Doc21.txt                 15                 13
## 22 22 Doc22.txt                 16                 14
## 23 23 Doc23.txt                 17                 15
## 24 24 Doc24.txt                 18                 16
## 25 25 Doc25.txt                 19                 17
## 26 26 Doc26.txt                 20                 18
## 27 27 Doc27.txt                 21                 19
## 28 28 Doc28.txt                 22                 20
## 29 29 Doc29.txt                 23                 21
## 30 30 Doc30.txt                 24                 22
## 31 31 Doc31.txt                 25                 23
## 32 32 Doc32.txt                  5                 24
## 33 33 Doc33.txt                 26                 25
## 34 34 Doc34.txt                 27                 26
## 35 35 Doc35.txt                  2                 30
## 36 36 Doc36.txt                 28                 27
## 37 37 Doc37.txt                 29                 28
## 38 38 Doc38.txt                  2                 30
## 39 39 Doc39.txt                 30                 29
## 40 40 Doc40.txt                  2                 30
## 41 41 Doc41.txt                  2                 30
#lets weight the nodes and edges
#set label (not really necessary)
#V=vertex, E=edge
V(g)$label <- V(g)$name
#Vertex size proportional to number of connections
V(g)$size <- degree(g)*.6
#Vertex label size proportional to number of connections
V(g)$label.cex <-  degree(g) / max(degree(g))+ .8
#label colour default black
V(g)$label.color <- "black"
#Vertex color orange
V(g)$color <- "orange"
#edge color grey
E(g)$color <- "grey"
#edge width proportional to similarity (weight)
E(g)$width <- E(g)$weight*7

#lets weight the nodes and edges
#set label (not really necessary)
#V=vertex, E=edge
V(g)$label <- V(g)$name
#Vertex size proportional to number of connections
V(g)$size <- degree(g)*.6
#Vertex label size proportional to number of connections
V(g)$label.cex <-  degree(g) / max(degree(g))+ .6
#label colour default black
V(g)$label.color <- "black"
#Vertex color orange
V(g)$color <- "orange"
#edge color grey
E(g)$color <- "grey"
#edge width proportional to similarity (weight)
E(g)$width <- E(g)$weight*5
# plot the graph with an automatically chosen layout
plot(g, layout=layout.auto)

#output is quite ugly. Explore igraph to see how you
#can fix it
plot(g, layout=layout1)

6 Conclusion

From our exploration we can identify three clear topics in this corpus, and along the way we covered:

  1. Tidying document-term matrices and corpus objects
  2. Understanding how R tokenizes text with unnest_tokens()
  3. Cleaning and preprocessing
  4. Performing a sentiment analysis
  5. Identifying term frequencies
  6. Understanding n-grams and analysing word networks
  7. Topic modelling
  8. Clustering, classification and prediction.