This notebook provides a reproducible guide to text analysis of a corpus using R. The objective is to review term frequencies, perform sentiment analysis and topic modelling, and do some exploratory data analysis (EDA).
The vignette uses dplyr, tidyr and tidytext to tidy and tokenise the corpus and then explore term frequencies, tf-idf, sentiment, n-grams and topics. A few terms used throughout:
| Term | Definition |
|---|---|
| String | Text is generally read into R as character vectors, which are essentially strings. |
| Corpus | Objects that contain raw strings annotated with metadata and descriptions. |
| Document-term Matrix | A matrix describing a collection of documents, with a row for each document and a column for each term. The values within the matrix are the word counts. |
| unnest_tokens() | Converts a tibble to one token per document per row, which gives us the opportunity to analyse the text. |
fig 1. Text Mining process
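The code below assumes that the following packages have already been attached; this is a sketch of the setup, inferred from the functions used throughout the notebook rather than shown in the original chunks.
#corpus handling and preprocessing
library(tm)
#tidy text tools: unnest_tokens(), get_stopwords(), get_sentiments(), bind_tf_idf(), tidy()
library(tidytext)
#data manipulation and plotting
library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)
#topic modelling (LDA)
library(topicmodels)
#pairwise word correlations
library(widyr)
#network graphs
library(igraph)
library(ggraph)
#cluster plots (clusplot)
library(cluster)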
docs <- VCorpus(DirSource("./docs"))
Cleaning and preprocessing of the text

After obtaining the corpus, the next step is usually cleaning and preprocessing of the text. For this we mostly use functions from the tm and tidytext packages. In bag-of-words text mining, cleaning helps aggregate terms: for example, it may make sense to treat the words “manage”, “management” and “managing” as one term. The specific preprocessing steps will vary from project to project; the words used in tweets, for instance, are vastly different from those used in legal documents, so the cleaning process can also be quite different.
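As a quick illustration of that aggregation point, the snippet below is a sketch using the SnowballC stemmer, which backs the stemDocument() step applied later:
#stemDocument() in tm uses SnowballC's wordStem() under the hood
library(SnowballC)
wordStem(c("manage", "management", "managing")) #all three reduce to "manag"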
For this project we’ll use the preprocessing functions below:
- removePunctuation(): Remove all punctuation marks
- tolower(): Make all characters lowercase
- removeNumbers(): Remove numbers
- removeWords(): Remove stopwords
- stripWhitespace(): Remove excess whitespace
- stemDocument(): Shorten words to their core meaning

Let’s apply these functions to our corpus:
#Remove punctuation - replace punctuation marks with " "
docs <- tm_map(docs, removePunctuation)
#Transform to lower case
docs <- tm_map(docs, content_transformer(tolower))
#Strip digits
docs <- tm_map(docs, removeNumbers)
#Remove stopwords from standard stopword list
docs <- tm_map(docs, removeWords, stopwords("english"))
#Strip whitespace (cosmetic?)
docs <- tm_map(docs, stripWhitespace)
#inspect output
writeLines(as.character(docs[[30]]))
## conventional approaches knowledge management projects focus cognitive thoughtrelated mechanical aspects knowledge creation capture alternate view one considers knowledge created interactions people – interactions – develop mutually acceptable interpretations theories facts ways suit particular needs project knowledge socially constructed true project managers need pay attention environmental social factors influence knowledge construction position taken paul jackson jane klobas paper entitled building knowledge projects practical application social constructivism information systems development presents knowledge creation sharing process model based social constructivist theory article summary review paper social constructivist view knowledge jackson klobas begin observation engineering disciplines founded belief knowledge can expressed propositions correspond reality independent human perception however alternate view knowledge absolute relative – ie depends mental models beliefs used interpret facts objects events relevant example software product viewed business users software developers former group may see application terms utility whereas latter may see instance particular technology perception gaps can also occur within seemingly homogenous groups – teams comprised software developers example can happen variety reasons differences experience cultural backgrounds make group social constructivism looks gaps can bridged authors discussion relies work berger luckmann described gap perceptions different individuals can overcome create socially constructed shared reality phrase socially constructed implies reality pertains project example created via common understanding issues followed mutual agreement players comprises reality view strikes particular chord akin stated aims dialogue mapping technique described several earlier posts see article example relevant projects knowledge information systems development social construct first authors make point information systems development isd projects …intensive exercises constructing social reality process data modeling models informed particular worldview systems designers use particular formal representations isd projects operational reality new explicitly constructed becomes understood accepted negotiated agreement participants two cultures business essentially knowledge emerges interaction discussion project proceeds however methodologies used design typically founded engineering approach takes positivist view rather social one authors suggest perhaps social constructivist paradigm offers insight continuing failure namely happening isd project far complex simple translation description external reality instructions computer emergence articulation multiple indeterminate sometimes unconscious sometimes ineffable realities negotiated achievement consensus new agreed reality explicit form business data model amenable computerization mind authors aim develop model addresses shortcomings traditional positivist view knowledge isd projects representing berger luckmanns theory social constructivism terms knowledge process model identify management principles map processes principles form basis survey used operational version process model operational model assessed experts tested project manager reallife project knowledge creationsharing process model process model jackson klobas describe based berger luckmanns work figure knowledge creationsharing model figure knowledge creationsharing model model describes personal knowledge created – 
personal knowledge individual knows personal knowledge built using mental models world – models frameworks individuals use make sense world according jacksonklobas process model personal knowledge built number process including internalisation absorption knowledge individual knowledge creation construction new knowledge repetitive performance tasks learning skills becoming aware new ideas ways thinking frameworks latter corresponds learning concepts theories even new ways perceiving world correspond change subjective reality individual externalisation representation description knowledge using speech symbols can perceived internalized others think explaining ideas procedures individuals objectivation creation shared constructs represent groups understanding world point knowledge objectified – perceived existence independent individuals legitimation authorization objectified knowledge correct standard reification process objective knowledge assumes status makes difficult change challenge familiar example reified knowledge procedure process hardened system – thats just way things done around common response processes challenged links depicted figure show relationships processes jackson klobas suggest knowledge creation isd projects social process occurs continual communication business sure elements knowledge creation – design prototyping development learning new skills etc – amount nought unless discussed argued agreed communicated social interactions interactions occur wider context organization reasonable claim resulting knowledge takes form mirrors social environment organization clearly model knowledge creation different usual interpretation knowledge independent reality regardless whether known group operational model good theory makes interesting academic discussions practice can model operationalised jackson klobas describe approach creating testing utility rather validity model discuss following sections knowledge sharing heuristics begin surveyed literature knowledge management identify knowledge sharing heuristics ie experiencebased techniques enable knowledge sharing example heuristics associated externalization process standard documentation modelling tools make business requirements easy understand stakeholders staff communicate regularly direct facetoface contact use prototypes authors identified heuristics matched process model according authors matching process simple cases doubt process heuristic attached suggests model provides natural way organize voluminous complex body research knowledge creation sharing important well suggests conceptual model illustrated fig can form basis simple means assess knowledge creation sharing capabilities work environments assurance relevant variables covered validating mapping validity matching checked using twenty historical case studies isd projects worked follows explanations worked well didnt mapped model process areas using heuristics identified prior step aim answer question relationship project failure problems respective knowledge processes conversely project success presence positive indicators one case studies authors use wellknown possibly overanalysed failure automated dispatch system london ambulance service paper succinct summary case study reproduce london ambulance service las largest ambulance service world provides accident emergency patient transport services resident population nearly seven million people isd project intended produce automated system dispatch ambulances emergencies existing manual system poor cumbersome 
inefficient relatively unreliable goal new system provide efficient command control process overcome deficiencies furthermore system seen management opportunity resolve perceived issues poor industrial relations outmoded work practices low resource utilization tender let development system components including computer aided dispatch automatic vehicle location radio interfacing mobile data terminals update status callout tender let company inexperienced large systems delivery whilst project profound implications work practices personnel hardly involved design system upon implementation many errors software infrastructure led critical operational shortcomings failure calls reach ambulances system lasted week necessary revert manual system jackson klobas show conceptual model maps knowledgerelated factors may played role failure project example heading personal knowledge one can identify least two potential factors lack involvement endusers design selection inexperienced vendor disconnect management employees suggests couple factors relating reification mutual negative perceptions outmoded unchallenged work practices validation authors suggest model provides comprehensive framework explains projects failed may overstating case – whats cause whats effect hard tell especially fact nonetheless model seem able capture many knowledgerelated gaps played role failures looking heuristics mapped process one might able suggest ways deficiencies addressed example externalization problem area one might suggest use prototypes encourage face face communication business personnel surveybased tool encouraged authors created survey tool intended evaluate knowledge creationsharing effectiveness project environments tool academic terms used model translated everyday language example term externalization translated knowledge sharing – see fig translated terms tool asked project managers evaluate project environments knowledge creation process capability scale based inputs recommend specific improvement strategies capabilities scored low tool evaluated four project managers used work environment period weeks end period interviewed responses analysed using content analysis match experiences requirements designed intent tool unfortunately paper provide details tool difficult say much paraphrase authors comments based evaluation authors conclude tool provides common framework project managers discuss issues pertaining knowledge creation sharing means identify potential problems might done address field testing one evaluators model tested tool field tester project manager wanted identify knowledge creationsharing deficiencies work environment ways addressed answered questions based evaluation knowledge sharing capabilities environment developed improvement plan based strategies suggested tool along ideas completed survey plan returned researchers use tool revealed following knowledge creationsharing deficiencies project managers environment inadequate personal knowledge ineffective externalization inadequate standardization objectivation strategies suggested tool include internet portal promote knowledge capture sharing included discussion forums areas capture discuss best practices etc role playing workshops reveal processes worked practice ie surface tacit knowledge based authors suggest technology can used promote support knowledge sharing standardization just storage interventions make tacit knowledge explicit can helpful side benefit note survey raised consciousness knowledge creationsharing within team 
reflections conclusions opinion value paper lies model survey tool conceptual framework underpins – namely idea knowledge depends shaped social environment evolves perhaps example might help clarify means consider organisation decides implement project management best practices described fill popular methodologies wrong way implement practices wholesale without regard organizational culture norms preexisting practices approach unlikely lead imposed practices taking root organisation hand approach picks practices useful tailors organizational needs constraints culture likely meet success second approach works attempts bridge gap ideal best practice social reality organisation encourages employees adapt practices ways make sense context organization invariably involves modifying practices sometimes substantially creating new socially constructed knowledge bargain another interesting point authors make several knowledge sharing heuristics think number classified unambiguously one processes model suggests model reasonable view knowledge creationsharing process one accepts conclusion model indeed provide common framework discussing issues relating knowledge creation project environments associated heuristics can help identify processes dont work well im unable judge usefulness surveybased tool developed authors provide much detail paper however isnt really issue field project management many tools techniques anyway key message paper opinion every project unique context techniques used others interpreted applied ways meaningful context particular project paper excellent counterpoint methodologyoriented practice knowledge management projects required reading methodologists project managers believe things need done book regardless social organizational context
#Stem document
docs <- tm_map(docs,stemDocument)
We need to convert the corpus into a tidy format: a data frame with one row per document.
t_corpus <- docs %>% tidy()
t_corpus
## # A tibble: 41 x 8
## author datetimestamp description heading id language origin
## <lgl> <dttm> <lgl> <lgl> <chr> <chr> <lgl>
## 1 NA 2018-10-30 15:04:50 NA NA Doc0… en NA
## 2 NA 2018-10-30 15:04:50 NA NA Doc0… en NA
## 3 NA 2018-10-30 15:04:50 NA NA Doc0… en NA
## 4 NA 2018-10-30 15:04:50 NA NA Doc0… en NA
## 5 NA 2018-10-30 15:04:50 NA NA Doc0… en NA
## 6 NA 2018-10-30 15:04:50 NA NA Doc0… en NA
## 7 NA 2018-10-30 15:04:50 NA NA Doc0… en NA
## 8 NA 2018-10-30 15:04:50 NA NA Doc0… en NA
## 9 NA 2018-10-30 15:04:50 NA NA Doc0… en NA
## 10 NA 2018-10-30 15:04:50 NA NA Doc1… en NA
## # ... with 31 more rows, and 1 more variable: text <chr>
d_corpus <- t_corpus %>%
select(id, text)
Now that the corpus is in a tibble, we can use unnest_tokens(), which tokenizes the text variable into words and creates one row per token, returning a tidy data frame.
In this format we can easily read the document id and the token for each row.
unnest_tokens() also converts all characters to lower case.
tidy_df <- t_corpus %>%
unnest_tokens(word, text)
tidy_df %>%
select(id, word) %>%
head(15)
## # A tibble: 15 x 2
## id word
## <chr> <chr>
## 1 Doc01.txt practic
## 2 Doc01.txt risk
## 3 Doc01.txt manag
## 4 Doc01.txt ration
## 5 Doc01.txt meansend
## 6 Doc01.txt base
## 7 Doc01.txt process
## 8 Doc01.txt risk
## 9 Doc01.txt identifi
## 10 Doc01.txt analys
## 11 Doc01.txt solv
## 12 Doc01.txt mitig
## 13 Doc01.txt although
## 14 Doc01.txt step
## 15 Doc01.txt seem
The tidy format allows us to make use of the dplyr grammar to preprocess and clean the data. To remove stopwords we make use of the stop word collection that comes with the tidytext package: get_stopwords() is a tidytext function that returns a data frame listing stopwords (frequent but not very meaningful words).
get_stopwords()
## # A tibble: 175 x 2
## word lexicon
## <chr> <chr>
## 1 i snowball
## 2 me snowball
## 3 my snowball
## 4 myself snowball
## 5 we snowball
## 6 our snowball
## 7 ours snowball
## 8 ourselves snowball
## 9 you snowball
## 10 your snowball
## # ... with 165 more rows
anti_join() here keeps only the words that do not appear in the stopword data frame passed as its argument. Another advantage of the tidytext format is that one can easily filter on certain characteristics. Here we show how to filter out tokens that consist only of digits: the expression is.na(as.numeric(word)) keeps words that cannot be converted to numeric values, which drops all tokens that are purely numbers (such as “2016”).
wosw_tidy <- tidy_df %>%
anti_join(get_stopwords()) %>%
filter(is.na(as.numeric(word)))
## Joining, by = "word"
## Warning in evalq(is.na(as.numeric(word)), <environment>): NAs introduced by
## coercion
wosw_tidy %>%
select(id, word) %>%
head(10)
## # A tibble: 10 x 2
## id word
## <chr> <chr>
## 1 Doc01.txt practic
## 2 Doc01.txt risk
## 3 Doc01.txt manag
## 4 Doc01.txt ration
## 5 Doc01.txt meansend
## 6 Doc01.txt base
## 7 Doc01.txt process
## 8 Doc01.txt risk
## 9 Doc01.txt identifi
## 10 Doc01.txt analys
Using count() we can compute term frequencies across the entire corpus.
wosw_tidy %>%
count(word, sort = TRUE) %>%
head(10)
## # A tibble: 10 x 2
## word n
## <chr> <int>
## 1 project 580
## 2 risk 541
## 3 manag 507
## 4 can 487
## 5 use 431
## 6 one 376
## 7 figur 335
## 8 time 333
## 9 task 285
## 10 point 265
wosw_tidy %>%
count(word, sort = TRUE) %>%
head(10) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n)) +
geom_col() +
xlab(NULL) +
coord_flip()
General term frequencies (even when calculated per document) are often not very meaningful, as they do not differ very much across documents. Many applications therefore calculate the tf-idf score (term frequency-inverse document frequency). This detects words that appear often within one document but rarely in other documents; in other words, tf-idf identifies words that are both frequent and distinctive. tidytext has a function bind_tf_idf() that adds the tf-idf score to a data frame containing term frequencies and document metadata.
Before calculating the tf-idf score, we compute the word counts per document and the total number of words in each document.
A note on the level at which IDF should be computed (adapted from a Cross Validated discussion): the IDF is better computed per document. Because the IDF depends on the index unit (2 in the case of classes, or N for the number of reviews), computing it at too coarse a level loses some of the granularity of the underlying structure of the data. Even though the class might be the same, the underlying unit is the person writing the review, so the review is the natural index unit: the reviews are what is drawn from the distribution (roughly, though not exactly, i.i.d.), and for a better sense of the IDF we want it measured at the level at which the sample points are drawn.
If you were curious, you could train a model for each class separately, assuming the reviews for the two classes are drawn from sufficiently different distributions to justify separate models. I suspect, however, that the loss of the extra data would not be worth the gain from modelling each class specifically. [https://stats.stackexchange.com/questions/364834/tf-idf-for-text-classification-on-what-should-idf-be-calculated]
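To make the tf-idf definition concrete, here is a minimal hand calculation; it is a sketch, and the document count for “cluster” is an assumption inferred from the idf value of roughly 2.10 shown further below.
#tf: occurrences of a term in a document divided by the document's total words
#idf: natural log of (total documents / documents containing the term)
n_docs <- 41 #documents in this corpus
docs_with_term <- 5 #assumption: "cluster" appears in about 5 of the 41 documents
tf <- 101 / 2146 #"cluster" occurs 101 times among Doc20.txt's 2146 words
idf <- log(n_docs / docs_with_term)
tf * idf #roughly 0.099, matching the bind_tf_idf() output further below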
#count occurrences of each word in each document
perb <- wosw_tidy %>%
count(id, word, sort = TRUE) %>%
ungroup()
#calculate the overall words in the documents
peridc <- perb %>%
group_by(id) %>%
summarize(total = sum(n))
book_words <- left_join(perb, peridc)
## Joining, by = "id"
book_words
## # A tibble: 19,094 x 4
## id word n total
## <chr> <chr> <int> <int>
## 1 Doc20.txt cluster 101 2146
## 2 Doc40.txt task 90 1604
## 3 Doc35.txt time 86 2370
## 4 Doc29.txt manag 83 1743
## 5 Doc39.txt correl 79 1419
## 6 Doc09.txt manag 78 1174
## 7 Doc21.txt topic 78 1608
## 8 Doc32.txt project 78 984
## 9 Doc08.txt risk 77 1685
## 10 Doc09.txt risk 77 1174
## # ... with 19,084 more rows
The table above shows the most frequent terms per document, along with the total number of words in each document.
frequency <- wosw_tidy %>%
mutate(word = str_extract(word, "[a-z']+")) %>%
count(id, word) %>%
group_by(id) %>%
mutate(proportion = n / sum(n)) %>%
select(-n)
frequency %>%
arrange(desc(proportion))
## # A tibble: 19,094 x 3
## # Groups: id [41]
## id word proportion
## <chr> <chr> <dbl>
## 1 Doc07.txt risk 0.0909
## 2 Doc31.txt design 0.0808
## 3 Doc05.txt project 0.0805
## 4 Doc32.txt project 0.0793
## 5 Doc07.txt project 0.0705
## 6 Doc04.txt eleph 0.0678
## 7 Doc04.txt risk 0.0678
## 8 Doc09.txt manag 0.0664
## 9 Doc02.txt risk 0.0664
## 10 Doc09.txt risk 0.0656
## # ... with 19,084 more rows
Zipf’s law states that the frequency at which a word appears is inversely proportional to its rank.
The distributions above are typical of natural language.
freq_by_rank <- book_words %>%
group_by(id) %>%
mutate(rank = row_number(),
`term frequency` = n/total)
freq_by_rank
## # A tibble: 19,094 x 6
## # Groups: id [41]
## id word n total rank `term frequency`
## <chr> <chr> <int> <int> <int> <dbl>
## 1 Doc20.txt cluster 101 2146 1 0.0471
## 2 Doc40.txt task 90 1604 1 0.0561
## 3 Doc35.txt time 86 2370 1 0.0363
## 4 Doc29.txt manag 83 1743 1 0.0476
## 5 Doc39.txt correl 79 1419 1 0.0557
## 6 Doc09.txt manag 78 1174 1 0.0664
## 7 Doc21.txt topic 78 1608 1 0.0485
## 8 Doc32.txt project 78 984 1 0.0793
## 9 Doc08.txt risk 77 1685 1 0.0457
## 10 Doc09.txt risk 77 1174 2 0.0656
## # ... with 19,084 more rows
The rank column gives each word’s rank within its document, with rank 1 being the most frequent word in that document.
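To make Zipf’s law concrete, the sketch below compares, for a single (arbitrarily chosen) document, the observed term frequency with the simple prediction of the top frequency divided by rank:
#compare observed term frequency with a simple 1/rank prediction for Doc20.txt
freq_by_rank %>%
filter(id == "Doc20.txt") %>%
mutate(zipf_prediction = first(`term frequency`) / rank) %>%
select(word, rank, `term frequency`, zipf_prediction) %>%
head(5)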
freq_by_rank %>%
ggplot(aes(rank, `term frequency`, colour = id)) +
geom_line(size = 1.1, alpha = 0.8, show.legend = FALSE) +
scale_x_log10() +
scale_y_log10()
By plotting on log-log coordinates we can see that the documents in the corpus, even though separated by id, are similar to each other, and that the relationship between rank and frequency has a negative slope.
rank_subset <- freq_by_rank %>%
filter(rank<500,
rank>10)
lm(log10(`term frequency`) ~log10(rank), data = rank_subset)
##
## Call:
## lm(formula = log10(`term frequency`) ~ log10(rank), data = rank_subset)
##
## Coefficients:
## (Intercept) log10(rank)
## -1.1061 -0.7717
Above, we examine the middle section of the rank range to estimate the exponent of the power law. From the log10(rank) coefficient, the slope is -0.7717.
freq_by_rank %>%
ggplot(aes(rank, `term frequency`, color = id)) +
geom_abline(intercept = -0.62, slope = -1.1, color = "gray50", linetype = 2) +
geom_line(size = 1.1, alpha = 0.8, show.legend = FALSE) +
scale_x_log10() +
scale_y_log10()
We see that this is closely related to Zipf’s law. The deviations at high rank (rare words) are not uncommon in language studies; the deviations at low rank (ranks 1-10, the most common words) are more unusual, and here suggest that this corpus uses a higher percentage of its most common words than a single power law would predict.
book_words <- book_words %>%
bind_tf_idf(word, id, n)
book_words
## # A tibble: 19,094 x 7
## id word n total tf idf tf_idf
## <chr> <chr> <int> <int> <dbl> <dbl> <dbl>
## 1 Doc20.txt cluster 101 2146 0.0471 2.10 0.0990
## 2 Doc40.txt task 90 1604 0.0561 0.769 0.0432
## 3 Doc35.txt time 86 2370 0.0363 0.280 0.0101
## 4 Doc29.txt manag 83 1743 0.0476 0.187 0.00891
## 5 Doc39.txt correl 79 1419 0.0557 1.77 0.0984
## 6 Doc09.txt manag 78 1174 0.0664 0.187 0.0124
## 7 Doc21.txt topic 78 1608 0.0485 1.15 0.0557
## 8 Doc32.txt project 78 984 0.0793 0.248 0.0196
## 9 Doc08.txt risk 77 1685 0.0457 0.718 0.0328
## 10 Doc09.txt risk 77 1174 0.0656 0.718 0.0471
## # ... with 19,084 more rows
We can see that for very common words both the idf and the tf-idf are close to zero. These words are common because they appear across (nearly) all of the corpus: the idf is the natural log of the ratio of total documents to documents containing the term, which is ln(1) = 0 for a term that appears in every document. The tf-idf approach therefore decreases the weight of common words.
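As a quick check of this (a sketch, assuming these very common stemmed terms occur in most or all of the 41 documents), we can inspect their idf and tf-idf values directly:
#a term appearing in every document has idf = log(41/41) = 0, and hence tf-idf = 0
book_words %>%
filter(word %in% c("can", "use", "one")) %>%
select(id, word, idf, tf_idf) %>%
arrange(idf) %>%
head(6)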
book_words %>%
select(-total) %>%
arrange(desc(tf_idf))
## # A tibble: 19,094 x 6
## id word n tf idf tf_idf
## <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 Doc04.txt eleph 16 0.0678 2.61 0.177
## 2 Doc26.txt stupid 36 0.0316 3.71 0.117
## 3 Doc02.txt erm 30 0.0273 3.71 0.101
## 4 Doc31.txt design 48 0.0808 1.23 0.0993
## 5 Doc20.txt cluster 101 0.0471 2.10 0.0990
## 6 Doc39.txt correl 79 0.0557 1.77 0.0984
## 7 Doc10.txt score 33 0.0517 1.77 0.0914
## 8 Doc24.txt scapegoat 14 0.0218 3.71 0.0809
## 9 Doc37.txt dart 15 0.0246 3.02 0.0744
## 10 Doc07.txt strateg 20 0.0455 1.63 0.0743
## # ... with 19,084 more rows
Here we can see the terms that are distinctive for individual documents. None of them occurs across the whole corpus, which is what makes them characteristic words for their documents.
Sentiment analysis gives us the opportunity to analyse and investigate the emotional intent of words, i.e. whether a piece of text is positive or negative.
To analyse the sentiment of a text, we generally consider it as a combination of its individual words, with the sentiment of the whole text being the sum of the sentiment of its individual words. This isn’t the only approach, but it is an often-used one.
The workflow is as follows:
Flowchart of text analysis
sentiments
## # A tibble: 27,314 x 4
## word sentiment lexicon score
## <chr> <chr> <chr> <int>
## 1 abacus trust nrc NA
## 2 abandon fear nrc NA
## 3 abandon negative nrc NA
## 4 abandon sadness nrc NA
## 5 abandoned anger nrc NA
## 6 abandoned fear nrc NA
## 7 abandoned negative nrc NA
## 8 abandoned sadness nrc NA
## 9 abandonment anger nrc NA
## 10 abandonment fear nrc NA
## # ... with 27,304 more rows
The tidytext package contains a dictionary of several sentiment lexicons.
The three general-purpose lexicons are:

- AFINN from Finn Årup Nielsen,
- bing from Bing Liu and collaborators, and
- nrc from Saif Mohammad and Peter Turney.

All of the lexicons above are based on unigrams: they assign English words scores or labels for negative or positive sentiment. The nrc lexicon, in addition, gives binary yes/no labels for emotion categories such as joy, anger and sadness.
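A quick way to compare the three lexicons is shown below (a sketch; note that newer tidytext versions fetch the AFINN and NRC lexicons via the textdata package and may prompt for a download the first time):
#each lexicon is returned as a tibble keyed by word
get_sentiments("afinn") %>% head(3) #numeric score per word
get_sentiments("bing") %>% head(3) #positive/negative label
get_sentiments("nrc") %>% head(3) #sentiment or emotion category per word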
tidy_joy <- get_sentiments("nrc") %>%
filter(sentiment == "joy")
tidy_df %>%
inner_join(tidy_joy) %>%
count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 68 x 2
## word n
## <chr> <int>
## 1 organ 91
## 2 good 90
## 3 share 64
## 4 success 59
## 5 present 57
## 6 score 48
## 7 excel 34
## 8 deal 29
## 9 content 25
## 10 found 23
## # ... with 58 more rows
We can see mostly positive words here. Let’s apply this across the documents in our corpus.
tidy_df_sent <- tidy_df %>%
inner_join(get_sentiments("bing")) %>%
count(id, sentiment) %>%
spread(sentiment, n, fill = 0)
## Joining, by = "word"
tidy_df_sent <- tidy_df_sent %>%
mutate(sentiment = positive - negative)
tidy_df_sent
## # A tibble: 41 x 4
## id negative positive sentiment
## <chr> <dbl> <dbl> <dbl>
## 1 Doc01.txt 146 36 -110
## 2 Doc02.txt 106 35 -71
## 3 Doc03.txt 46 9 -37
## 4 Doc04.txt 30 5 -25
## 5 Doc05.txt 47 43 -4
## 6 Doc06.txt 105 30 -75
## 7 Doc07.txt 47 12 -35
## 8 Doc08.txt 150 43 -107
## 9 Doc09.txt 123 32 -91
## 10 Doc10.txt 55 18 -37
## # ... with 31 more rows
ggplot(tidy_df_sent, aes(id, sentiment, fill = id)) +
geom_col(show.legend = FALSE) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
We can see here that documents 1-10 all have a negative net sentiment, while the plot suggests that many of the documents from 11 onwards are more positive.
bing_word_counts <- tidy_df %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
ungroup()
## Joining, by = "word"
bing_word_counts %>%
group_by(sentiment) %>%
top_n(10) %>%
ungroup() %>%
mutate(word = reorder(word,n)) %>%
ggplot(aes(word, n, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(y = "Contribution to sentiment",
x = NULL) +
coord_flip()
## Selecting by n
As per the plot above, we can see how strongly words such as “risk” and “problem” contribute to negative sentiment.
Sentiment analysis provides a useful way to understand the attributes associated with a text.
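If a domain word such as “risk” should not count towards sentiment in this corpus, a custom stop-word list can be anti-joined in before the sentiment join. The snippet below is a sketch only and is not applied above:
#add "risk" to the standard stop word list as a custom entry
custom_stop_words <- bind_rows(tibble(word = "risk", lexicon = "custom"), stop_words)
#tidy_df %>% anti_join(custom_stop_words) would then drop it before scoring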
bigrams <- tidy_df %>%
unnest_tokens(bigram, word, token = "ngrams", n=2)
bigrams
## # A tibble: 47,012 x 8
## author datetimestamp description heading id language origin
## <lgl> <dttm> <lgl> <lgl> <chr> <chr> <lgl>
## 1 NA 2018-10-30 15:04:50 NA NA Doc0… en NA
## 2 NA 2018-10-30 15:04:50 NA NA Doc0… en NA
## 3 NA 2018-10-30 15:04:50 NA NA Doc0… en NA
## 4 NA 2018-10-30 15:04:50 NA NA Doc0… en NA
## 5 NA 2018-10-30 15:04:50 NA NA Doc0… en NA
## 6 NA 2018-10-30 15:04:50 NA NA Doc0… en NA
## 7 NA 2018-10-30 15:04:50 NA NA Doc0… en NA
## 8 NA 2018-10-30 15:04:50 NA NA Doc0… en NA
## 9 NA 2018-10-30 15:04:50 NA NA Doc0… en NA
## 10 NA 2018-10-30 15:04:50 NA NA Doc0… en NA
## # ... with 47,002 more rows, and 1 more variable: bigram <chr>
bigrams %>%
count(bigram, sort = TRUE)
## # A tibble: 35,865 x 2
## bigram n
## <chr> <int>
## 1 project manag 139
## 2 risk manag 133
## 3 complet time 105
## 4 doc tmmapdoc 64
## 5 figur figur 61
## 6 mont carlo 61
## 7 best practic 51
## 8 shown figur 51
## 9 triangular distribut 47
## 10 one can 39
## # ... with 35,855 more rows
We can see that the most common bigrams are “project manag”, “risk manag”, and so on.
bigram_tf_idf <- bigrams %>%
count(id, bigram) %>%
bind_tf_idf(bigram, id, n) %>%
arrange(desc(tf_idf))
bigram_tf_idf
## # A tibble: 41,495 x 6
## id bigram n tf idf tf_idf
## <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 Doc07.txt strateg risk 12 0.0273 3.71 0.101
## 2 Doc15.txt inform knowledg 15 0.0196 3.71 0.0727
## 3 Doc26.txt function stupid 20 0.0175 3.71 0.0649
## 4 Doc27.txt side effect 14 0.0163 3.71 0.0606
## 5 Doc36.txt standard deviat 10 0.0190 3.02 0.0574
## 6 Doc24.txt scapegoat approach 10 0.0154 3.71 0.0573
## 7 Doc33.txt best practic 39 0.0358 1.52 0.0543
## 8 Doc17.txt infrastructur technolog 11 0.0137 3.71 0.0509
## 9 Doc28.txt organis cultur 9 0.0189 2.61 0.0495
## 10 Doc09.txt risk manag 37 0.0315 1.52 0.0477
## # ... with 41,485 more rows
By applying tf-idf and n-gram we can discover the most important elements of each document within the corpus.
Using bigrams rather than unigrams gives us the opportunity to capture structure and understand context.
R is able to visualise the relationships among many words simultaneously by arranging them into a network of connected nodes. The graph has three variables: from (the word an edge comes from), to (the word an edge points to), and weight (a numeric value associated with each edge, here the bigram count).
bigrams_separated <- bigrams %>%
separate(bigram, c("word1", "word2"), sep = " ")
bigrams_filtered <- bigrams_separated %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word)
# new bigram counts:
bigram_counts <- bigrams_filtered %>%
count(word1, word2, sort = TRUE)
bigram_graph <- bigram_counts %>%
filter(n > 20) %>%
graph_from_data_frame()
ggraph(bigram_graph, layout = "fr") +
geom_edge_link() +
geom_node_point() +
geom_node_text(aes(label = name), vjust = 1, hjust = 1)
a <- grid::arrow(type = "closed", length = unit(.15, "inches"))
ggraph(bigram_graph, layout = "fr") +
geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
arrow = a, end_cap = circle(.07, 'inches')) +
geom_node_point(color = "lightblue", size = 5) +
geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
theme_void()
The figure above visualises the structure of the text: which words are linked, and the direction of those links. The more common a bigram is, the darker its arrow.
We now want to examine correlations among words, which indicate how often they appear together relative to how often they appear separately.
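pairwise_cor() comes from the widyr package (assumed to be attached) and computes the phi coefficient between each pair of words based on their co-occurrence across documents. A tiny illustration on made-up data, as a sketch:
#two words that always appear in the same documents have correlation 1
toy <- tibble(doc = c(1, 1, 2, 2, 3), word = c("monte", "carlo", "monte", "carlo", "risk"))
toy %>% pairwise_cor(word, doc, sort = TRUE) #"monte" and "carlo" correlate perfectly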
word_cors <- wosw_tidy %>%
group_by(word) %>%
filter(n() >= 20) %>%
pairwise_cor(word, id, sort = TRUE)
## Warning: Trying to compute distinct() for variables not found in the data:
## - `row_col`, `column_col`
## This is an error, but only a warning is raised for compatibility reasons.
## The operation will return the input unchanged.
word_cors
## # A tibble: 331,200 x 3
## item1 item2 correlation
## <chr> <chr> <dbl>
## 1 carlo mont 1
## 2 mont carlo 1
## 3 compendium ibi 0.995
## 4 ibi compendium 0.995
## 5 erm appetit 0.994
## 6 appetit erm 0.994
## 7 tmax tmin 0.993
## 8 tmin tmax 0.993
## 9 tmax tml 0.991
## 10 tml tmax 0.991
## # ... with 331,190 more rows
The output above is useful for further exploration. Let’s have a look at the term “tmax”.
word_cors %>%
filter(item1 == "tmax")
## # A tibble: 575 x 3
## item1 item2 correlation
## <chr> <chr> <dbl>
## 1 tmax tmin 0.993
## 2 tmax tml 0.991
## 3 tmax pt 0.935
## 4 tmax t 0.848
## 5 tmax estim 0.843
## 6 tmax correspond 0.835
## 7 tmax triangular 0.828
## 8 tmax simul 0.827
## 9 tmax obtain 0.805
## 10 tmax equat 0.802
## # ... with 565 more rows
word_cors %>%
filter(correlation > .95) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = correlation), show.legend = FALSE) +
geom_node_point(color = "lightblue", size = 5) +
geom_node_text(aes(label = name), repel = TRUE) +
theme_void()
The correlation plot above shows word pairs with a correlation greater than 0.95. The relationships are symmetrical.
Topic modelling is an unsupervised classification of documents that divides the collection into natural groups so that we can understand them separately.
For this project we’ll apply Latent Dirichlet Allocation (LDA) to fit a topic model. LDA treats every document as a mixture of topics, and every topic as a mixture of words.
This allows documents to “overlap” in content rather than being separated into discrete groups.
#create Document Term Matrix
dtm <- DocumentTermMatrix(docs)
#create model fit
ap_lda <- LDA(dtm, k = 3, control = list())
ap_lda
## A LDA_VEM topic model with 3 topics.
Now that we’ve fitted an LDA model, we need to explore and interpret it.
Using tidy(), we can extract the per-topic-per-word probabilities, known as “beta”, from the model.
ap_topics <- tidy(ap_lda, matrix = "beta")
ap_topics
## # A tibble: 13,506 x 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 –demonstr 7.60e-136
## 2 2 –demonstr 1.93e-139
## 3 3 –demonstr 8.87e- 5
## 4 1 –especi 4.69e-125
## 5 2 –especi 4.26e- 5
## 6 3 –especi 6.36e-130
## 7 1 –howev 5.66e- 43
## 8 2 –howev 2.87e- 9
## 9 3 –howev 8.87e- 5
## 10 1 –techniqu 6.33e-132
## # ... with 13,496 more rows
The returned tibble is in a one-topic-per-term-per-row format. For each combination, the model computes the probability of that term being generated from that topic. For example, the term “–especi” has a probability of about 4.26e-05 of being generated from topic 2, and essentially zero probability of being generated from topics 1 or 3.
Let’s find the 10 terms that are most common within each topic.
ap_top_terms <- ap_topics %>%
group_by(topic) %>%
top_n(10, beta) %>%
ungroup() %>%
arrange(topic, -beta)
ap_top_terms %>%
mutate(term = reorder(term, beta)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
coord_flip()
This visualization lets us understand the three topics that were extracted from the corpus. The most common words in topic 1 include “task”, “time”, “figure”, and “distribution”, which suggests it may represent educational texts. The most common in topic 2 include “document”, “cluster”, and “topic”, suggesting that it represents texts about text classification. Those most common in topic 3 include “project”, “risk”, and “manag”, suggesting that it represents project management texts. One important observation is that some words, such as “can” and “use”, are common within all three topics.
beta_spread <- ap_topics %>%
mutate(topic = paste0("topic", topic)) %>%
spread(topic, beta) %>%
filter(topic1 > .001 | topic2 > .001 | topic3 > 0.001) %>%
mutate(log_ratio = log2(topic2 / topic1))
beta_spread
## # A tibble: 467 x 5
## term topic1 topic2 topic3 log_ratio
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 accept 0.00000000225 0.00102 1.95e-19 18.8
## 2 achiev 0.000152 0.00197 1.83e-14 3.69
## 3 action 0.000585 0.00155 3.06e-24 1.40
## 4 address 0.000288 0.00159 1.31e- 4 2.46
## 5 adjust 0.00131 0.00000148 1.20e- 8 -9.79
## 6 affect 0.000363 0.00103 1.39e- 4 1.51
## 7 aim 0.000476 0.00147 6.17e- 4 1.63
## 8 algorithm 0.00599 0.0000000285 1.22e- 3 -17.7
## 9 also 0.00307 0.00227 1.75e- 3 -0.434
## 10 although 0.000793 0.00181 1.89e- 3 1.19
## # ... with 457 more rows
LDA also models each document as a mixture of topics. We can extract the per-document-per-topic probabilities, known as “gamma”, with tidy().
ap_documents <- tidy(ap_lda, matrix = "gamma")
ap_documents
## # A tibble: 123 x 3
## document topic gamma
## <chr> <int> <dbl>
## 1 Doc01.txt 1 0.0000348
## 2 Doc02.txt 1 0.287
## 3 Doc03.txt 1 0.000105
## 4 Doc04.txt 1 0.000195
## 5 Doc05.txt 1 0.0000519
## 6 Doc06.txt 1 0.0000453
## 7 Doc07.txt 1 0.000105
## 8 Doc08.txt 1 0.0000275
## 9 Doc09.txt 1 0.0000398
## 10 Doc10.txt 1 0.0000730
## # ... with 113 more rows
From the first ten documents we can see that document 2 draws roughly 29% of its words from topic 1, while the remaining documents have gamma values near zero for topic 1, indicating that they are drawn almost entirely from topics 2 and 3.
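To see which topic dominates each document, we can keep only the highest-gamma row per document (a sketch):
#pick the single most likely topic for every document
ap_documents %>%
group_by(document) %>%
top_n(1, gamma) %>%
ungroup() %>%
arrange(document)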
We can find clusters of words that characterise a set of documents.
m <- as.matrix(dtm)
#compute distance between document vectors
d <- dist(m)
#run hierarchical clustering using Ward's method (explore other options later)
groups <- hclust(d,method = "ward.D")
#plot, use hang to ensure that labels fall below tree
plot(groups, hang=-1)
hclusters <- cutree(groups,3)
hclusters
## Doc01.txt Doc02.txt Doc03.txt Doc04.txt Doc05.txt Doc06.txt Doc07.txt
## 1 1 1 1 1 1 1
## Doc08.txt Doc09.txt Doc10.txt Doc11.txt Doc12.txt Doc13.txt Doc14.txt
## 1 1 1 1 1 1 1
## Doc15.txt Doc16.txt Doc17.txt Doc18.txt Doc19.txt Doc20.txt Doc21.txt
## 1 1 1 1 2 2 2
## Doc22.txt Doc23.txt Doc24.txt Doc25.txt Doc26.txt Doc27.txt Doc28.txt
## 2 1 1 1 1 1 1
## Doc29.txt Doc30.txt Doc31.txt Doc32.txt Doc33.txt Doc34.txt Doc35.txt
## 1 1 1 1 1 1 3
## Doc36.txt Doc37.txt Doc38.txt Doc39.txt Doc40.txt Doc41.txt
## 1 1 3 3 3 3
#lets try another distance measure
cosineSim <- function(x){
as.dist(x%*%t(x)/(sqrt(rowSums(x^2) %*% t(rowSums(x^2)))))
}
cs <- cosineSim(m)
cd <- 1-cs
#run hierarchical clustering using cosine distance
groups <- hclust(cd,method="ward.D")
#plot, use hang to ensure that labels fall below tree
plot(groups, hang=-1)
hclusters_cosine <- cutree(groups,3)
#compute distance between document vectors
d <- dist(m)
#kmeans clustering
#kmeans - run with nstart=100 and k=2,3,5 to compare results with hclust
kfit <- kmeans(d, 4, nstart=100)
#plot -
clusplot(as.matrix(d), kfit$cluster, color=T, shade=T, labels=2, lines=0)
#print contents of kfit
print(kfit)
## K-means clustering with 4 clusters of sizes 19, 6, 2, 14
##
## Cluster means:
## Doc01.txt Doc02.txt Doc03.txt Doc04.txt Doc05.txt Doc06.txt Doc07.txt
## 1 108.1902 109.8525 63.73692 62.53554 98.52241 106.8860 74.07278
## 2 164.0325 168.2921 141.07874 144.29141 159.47704 166.1681 149.68420
## 3 202.2290 205.8845 180.65401 185.69205 192.65477 201.0714 189.07201
## 4 109.0816 111.1807 99.62930 103.79043 105.94408 105.2069 99.98784
## Doc08.txt Doc09.txt Doc10.txt Doc11.txt Doc12.txt Doc13.txt Doc14.txt
## 1 130.8908 136.8752 77.02794 138.4394 72.7574 74.16148 106.0221
## 2 174.6133 187.7614 147.25453 145.8428 137.4458 142.97565 156.0618
## 3 201.6781 216.8520 189.16669 210.8285 184.8022 184.96900 192.4457
## 4 123.3366 121.2925 108.18662 160.1996 113.4550 116.14897 127.5738
## Doc15.txt Doc16.txt Doc17.txt Doc18.txt Doc19.txt Doc20.txt Doc21.txt
## 1 82.09724 73.29033 79.84728 102.7634 170.8184 166.2086 131.9923
## 2 142.55259 139.39946 144.00266 150.6986 141.2076 141.9019 129.9559
## 3 192.09524 184.37799 185.52908 189.3455 218.2610 218.9969 202.4562
## 4 116.39036 113.30962 120.73450 122.4294 182.9737 183.4248 152.7061
## Doc22.txt Doc23.txt Doc24.txt Doc25.txt Doc26.txt Doc27.txt Doc28.txt
## 1 126.7113 74.13313 74.19499 76.11665 86.36957 82.06487 68.69331
## 2 124.3788 145.39231 146.53521 146.38600 150.81875 149.34265 145.15200
## 3 199.0643 182.51451 187.82722 187.85944 192.80172 188.78237 187.11769
## 4 151.0251 113.50634 113.40627 107.53746 121.43078 117.98936 106.03430
## Doc29.txt Doc30.txt Doc31.txt Doc32.txt Doc33.txt Doc34.txt Doc35.txt
## 1 133.7728 116.9127 80.61226 100.1876 102.4817 82.74281 203.88766
## 2 178.3254 164.5905 150.51414 161.2192 156.6190 148.99268 215.46146
## 3 212.8706 204.9333 191.24404 195.2125 198.9165 186.97323 63.50591
## 4 130.4496 124.4860 118.70398 111.7444 120.9206 116.65152 206.06853
## Doc36.txt Doc37.txt Doc38.txt Doc39.txt Doc40.txt Doc41.txt
## 1 70.6491 73.0388 101.8259 135.3703 167.80122 116.0721
## 2 141.6723 140.0531 152.2319 146.7843 189.43933 155.1067
## 3 175.9253 173.6405 137.7472 165.0955 63.50591 123.1712
## 4 112.6261 112.8993 113.0662 153.3267 176.07607 127.9183
##
## Clustering vector:
## Doc01.txt Doc02.txt Doc03.txt Doc04.txt Doc05.txt Doc06.txt Doc07.txt
## 4 4 1 1 4 4 1
## Doc08.txt Doc09.txt Doc10.txt Doc11.txt Doc12.txt Doc13.txt Doc14.txt
## 4 4 1 2 1 1 4
## Doc15.txt Doc16.txt Doc17.txt Doc18.txt Doc19.txt Doc20.txt Doc21.txt
## 1 1 1 4 2 2 2
## Doc22.txt Doc23.txt Doc24.txt Doc25.txt Doc26.txt Doc27.txt Doc28.txt
## 2 1 1 1 1 1 1
## Doc29.txt Doc30.txt Doc31.txt Doc32.txt Doc33.txt Doc34.txt Doc35.txt
## 4 4 1 4 4 1 3
## Doc36.txt Doc37.txt Doc38.txt Doc39.txt Doc40.txt Doc41.txt
## 1 1 4 2 3 4
##
## Within cluster sum of squares by cluster:
## [1] 166966.24 202025.47 37096.15 348849.43
## (between_SS / total_SS = 62.0 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
#print cluster sizes
kfit$size
## [1] 19 6 2 14
#print clusters (members)
kfit$cluster
## Doc01.txt Doc02.txt Doc03.txt Doc04.txt Doc05.txt Doc06.txt Doc07.txt
## 4 4 1 1 4 4 1
## Doc08.txt Doc09.txt Doc10.txt Doc11.txt Doc12.txt Doc13.txt Doc14.txt
## 4 4 1 2 1 1 4
## Doc15.txt Doc16.txt Doc17.txt Doc18.txt Doc19.txt Doc20.txt Doc21.txt
## 1 1 1 4 2 2 2
## Doc22.txt Doc23.txt Doc24.txt Doc25.txt Doc26.txt Doc27.txt Doc28.txt
## 2 1 1 1 1 1 1
## Doc29.txt Doc30.txt Doc31.txt Doc32.txt Doc33.txt Doc34.txt Doc35.txt
## 4 4 1 4 4 1 3
## Doc36.txt Doc37.txt Doc38.txt Doc39.txt Doc40.txt Doc41.txt
## 1 1 4 2 3 4
#sum of squared distance between cluster centers
kfit$betweenss
## [1] 1234173
#sum of squared distance within a cluster (this are the quantities that the algorithm
#attempts to minimise)
kfit$withinss
## [1] 166966.24 202025.47 37096.15 348849.43
Let’s determine the optimal number of clusters by identifying the elbow in a plot of the summed intra-cluster distances (withinss) as a function of k.
wss <- 2:(length(docs)-1)
for (i in 2:(length(docs)-1)) wss[i] <- sum(kmeans(d,centers=i,nstart=25)$withinss)
plot(2:(length(docs)-1), wss[2:(length(docs)-1)], type="b", xlab="Number of Clusters",
ylab="Within groups sum of squares")
The elbow is not very clear; it could be at 3 or at 5 clusters.
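An alternative check on the number of clusters (a sketch, not run above) is the average silhouette width, computed with the cluster package that is already loaded for clusplot():
#higher average silhouette width indicates better separated clusters
for (k in 2:6) {
fit <- kmeans(d, centers = k, nstart = 25)
cat("k =", k, "avg silhouette:", mean(silhouette(fit$cluster, d)[, "sil_width"]), "\n")
}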
cosineSim <- function(x){
as.dist(x%*%t(x)/(sqrt(rowSums(x^2) %*% t(rowSums(x^2)))))
}
cs <- cosineSim(m)
cd <- 1-cs
kfit <- kmeans(cd, 5, nstart=100)
clusplot(as.matrix(cd), kfit$cluster, color=T, shade=T, labels=2, lines=0)
#print contents of kfit
print(kfit)
## K-means clustering with 5 clusters of sizes 9, 7, 7, 6, 12
##
## Cluster means:
## Doc01.txt Doc02.txt Doc03.txt Doc04.txt Doc05.txt Doc06.txt Doc07.txt
## 1 0.3467390 0.3702095 0.3725129 0.4572123 0.5774887 0.3310664 0.3681517
## 2 0.8224432 0.8641520 0.7847439 0.8763468 0.8143455 0.8271382 0.8471637
## 3 0.8417660 0.8762549 0.7288068 0.8695281 0.8433583 0.8398390 0.8513733
## 4 0.7163626 0.7265631 0.7190131 0.7761829 0.3939277 0.5672919 0.5972205
## 5 0.7935452 0.8242923 0.7887771 0.8836909 0.8248001 0.8220622 0.8529045
## Doc08.txt Doc09.txt Doc10.txt Doc11.txt Doc12.txt Doc13.txt Doc14.txt
## 1 0.3534253 0.3355292 0.4667107 0.8031170 0.7742731 0.8938587 0.8770232
## 2 0.8077241 0.8362993 0.8415093 0.3939766 0.4243705 0.4921867 0.5330926
## 3 0.7331937 0.8449659 0.8222052 0.8014180 0.8055787 0.8384485 0.8204215
## 4 0.6927705 0.5058114 0.7941625 0.7443012 0.7371821 0.9000575 0.8608461
## 5 0.7750194 0.8420439 0.8288928 0.7235806 0.7374335 0.8243347 0.8156602
## Doc15.txt Doc16.txt Doc17.txt Doc18.txt Doc19.txt Doc20.txt Doc21.txt
## 1 0.8108352 0.8119106 0.8681651 0.8367485 0.7857325 0.8708532 0.8559012
## 2 0.5485686 0.4036151 0.4809752 0.7836799 0.7145495 0.7757845 0.8126291
## 3 0.8673541 0.8080342 0.8136435 0.7935313 0.7364108 0.7725986 0.7965062
## 4 0.6302067 0.7458052 0.8603819 0.8412972 0.7484149 0.8662933 0.8419034
## 5 0.7749949 0.7795732 0.8085912 0.7191706 0.6146604 0.6795011 0.6852190
## Doc22.txt Doc23.txt Doc24.txt Doc25.txt Doc26.txt Doc27.txt Doc28.txt
## 1 0.8691893 0.8639050 0.8071345 0.6836051 0.7889370 0.7985686 0.7418661
## 2 0.7352509 0.8484055 0.8295581 0.7493001 0.7675647 0.7739771 0.8326160
## 3 0.7765359 0.8446099 0.8727796 0.8518514 0.8637897 0.8449760 0.8715498
## 4 0.8700103 0.7529901 0.7873905 0.4921728 0.7001361 0.7168138 0.4493658
## 5 0.6875826 0.7437957 0.7436182 0.7024206 0.6879769 0.7120552 0.7972733
## Doc29.txt Doc30.txt Doc31.txt Doc32.txt Doc33.txt Doc34.txt Doc35.txt
## 1 0.6441959 0.7254344 0.8397107 0.6909956 0.7977239 0.7672331 0.8257439
## 2 0.7947143 0.6969687 0.8329971 0.8081530 0.7750562 0.7176933 0.7884683
## 3 0.8602805 0.8542394 0.8998757 0.8650002 0.8701797 0.8303617 0.3605370
## 4 0.3982155 0.4972008 0.7594473 0.3936417 0.7211006 0.6791412 0.8099358
## 5 0.7502048 0.7519181 0.7529356 0.8158524 0.7096813 0.7121688 0.7922491
## Doc36.txt Doc37.txt Doc38.txt Doc39.txt Doc40.txt Doc41.txt
## 1 0.8941193 0.8542984 0.6150702 0.8684120 0.8757189 0.8283629
## 2 0.8587821 0.8014595 0.8340652 0.8475437 0.8239996 0.8005800
## 3 0.5317026 0.5123296 0.3925266 0.4895318 0.3578506 0.3377660
## 4 0.9016832 0.8637042 0.8729390 0.8396498 0.8762019 0.8398788
## 5 0.8556040 0.8124324 0.8410490 0.8360226 0.8301849 0.8087150
##
## Clustering vector:
## Doc01.txt Doc02.txt Doc03.txt Doc04.txt Doc05.txt Doc06.txt Doc07.txt
## 1 1 1 1 4 1 1
## Doc08.txt Doc09.txt Doc10.txt Doc11.txt Doc12.txt Doc13.txt Doc14.txt
## 1 1 1 2 2 2 2
## Doc15.txt Doc16.txt Doc17.txt Doc18.txt Doc19.txt Doc20.txt Doc21.txt
## 2 2 2 5 5 5 5
## Doc22.txt Doc23.txt Doc24.txt Doc25.txt Doc26.txt Doc27.txt Doc28.txt
## 5 5 5 4 5 5 4
## Doc29.txt Doc30.txt Doc31.txt Doc32.txt Doc33.txt Doc34.txt Doc35.txt
## 4 4 5 4 5 5 3
## Doc36.txt Doc37.txt Doc38.txt Doc39.txt Doc40.txt Doc41.txt
## 3 3 3 3 3 3
##
## Within cluster sum of squares by cluster:
## [1] 3.087747 3.133126 2.966720 2.501078 8.619944
## (between_SS / total_SS = 60.6 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
#print cluster sizes
kfit$size
## [1] 9 7 7 6 12
#print clusters (members)
kfit$cluster
## Doc01.txt Doc02.txt Doc03.txt Doc04.txt Doc05.txt Doc06.txt Doc07.txt
## 1 1 1 1 4 1 1
## Doc08.txt Doc09.txt Doc10.txt Doc11.txt Doc12.txt Doc13.txt Doc14.txt
## 1 1 1 2 2 2 2
## Doc15.txt Doc16.txt Doc17.txt Doc18.txt Doc19.txt Doc20.txt Doc21.txt
## 2 2 2 5 5 5 5
## Doc22.txt Doc23.txt Doc24.txt Doc25.txt Doc26.txt Doc27.txt Doc28.txt
## 5 5 5 4 5 5 4
## Doc29.txt Doc30.txt Doc31.txt Doc32.txt Doc33.txt Doc34.txt Doc35.txt
## 4 4 5 4 5 5 3
## Doc36.txt Doc37.txt Doc38.txt Doc39.txt Doc40.txt Doc41.txt
## 3 3 3 3 3 3
#sum of squared distance between cluster centers
kfit$betweenss
## [1] 31.26847
#sum of squared distance within a cluster (this are the quantities that the algorithm
#attempts to minimise)
kfit$withinss
## [1] 3.087747 3.133126 2.966720 2.501078 8.619944
#these numbers will be used to reference files in the network graph
filekey <- cbind(1:length(docs),rownames(m))
rownames(m) <- 1:length(docs)
#compute cosine similarity between document vectors
#converting to distance matrix sets diagonal elements to 0
cosineSim <- function(x){
as.dist(x%*%t(x)/(sqrt(rowSums(x^2) %*% t(rowSums(x^2)))))
}
cs <- cosineSim(m)
#adjacency matrix: set entries below a certain threshold to 0.
#We use a cutoff relative to the largest element of the matrix (max(cs)/1.2),
#plus an absolute floor of 0.5. This is an arbitrary choice
cs[cs < max(cs)/1.2] <- 0
cs[cs < 0.5] <- 0
cs <- round(cs,3)
# build a graph from the above matrix
#mode is undirected because similarity is a bidirectional relationship
g <- graph.adjacency(as.matrix(cs), weighted=T, mode = "undirected")
#one of many possible layouts, see igraph docs
layout1 <- layout.fruchterman.reingold(g)
#basic plot with no weighting - fruchtermann reingold weighting
plot(g, layout=layout1)
#another layout
plot(g, layout=layout.kamada.kawai)
comm_fg <- fastgreedy.community(g)
comm_fg$membership
## [1] 1 1 6 7 5 1 1 3 1 3 4 4 8 9 10 4 11 12 13 14 15 16 17
## [24] 18 19 20 21 22 23 24 25 5 26 27 2 28 29 2 30 2 2
V(g)$color <- comm_fg$membership
plot(g, layout=layout.kamada.kawai)
community_mapping <- cbind(as.data.frame(filekey, row.names = F),comm_fg$membership)
community_mapping
## V1 V2 comm_fg$membership
## 1 1 Doc01.txt 1
## 2 2 Doc02.txt 1
## 3 3 Doc03.txt 6
## 4 4 Doc04.txt 7
## 5 5 Doc05.txt 5
## 6 6 Doc06.txt 1
## 7 7 Doc07.txt 1
## 8 8 Doc08.txt 3
## 9 9 Doc09.txt 1
## 10 10 Doc10.txt 3
## 11 11 Doc11.txt 4
## 12 12 Doc12.txt 4
## 13 13 Doc13.txt 8
## 14 14 Doc14.txt 9
## 15 15 Doc15.txt 10
## 16 16 Doc16.txt 4
## 17 17 Doc17.txt 11
## 18 18 Doc18.txt 12
## 19 19 Doc19.txt 13
## 20 20 Doc20.txt 14
## 21 21 Doc21.txt 15
## 22 22 Doc22.txt 16
## 23 23 Doc23.txt 17
## 24 24 Doc24.txt 18
## 25 25 Doc25.txt 19
## 26 26 Doc26.txt 20
## 27 27 Doc27.txt 21
## 28 28 Doc28.txt 22
## 29 29 Doc29.txt 23
## 30 30 Doc30.txt 24
## 31 31 Doc31.txt 25
## 32 32 Doc32.txt 5
## 33 33 Doc33.txt 26
## 34 34 Doc34.txt 27
## 35 35 Doc35.txt 2
## 36 36 Doc36.txt 28
## 37 37 Doc37.txt 29
## 38 38 Doc38.txt 2
## 39 39 Doc39.txt 30
## 40 40 Doc40.txt 2
## 41 41 Doc41.txt 2
comm_lv <- cluster_louvain(g)
comm_lv$membership
## [1] 3 3 1 2 24 3 3 4 3 4 5 5 6 7 8 5 9 10 11 12 13 14 15
## [24] 16 17 18 19 20 21 22 23 24 25 26 30 27 28 30 29 30 30
V(g)$color <- comm_lv$membership
plot(g, layout=layout.kamada.kawai)
community_mapping <- cbind(community_mapping,comm_lv$membership)
community_mapping
## V1 V2 comm_fg$membership comm_lv$membership
## 1 1 Doc01.txt 1 3
## 2 2 Doc02.txt 1 3
## 3 3 Doc03.txt 6 1
## 4 4 Doc04.txt 7 2
## 5 5 Doc05.txt 5 24
## 6 6 Doc06.txt 1 3
## 7 7 Doc07.txt 1 3
## 8 8 Doc08.txt 3 4
## 9 9 Doc09.txt 1 3
## 10 10 Doc10.txt 3 4
## 11 11 Doc11.txt 4 5
## 12 12 Doc12.txt 4 5
## 13 13 Doc13.txt 8 6
## 14 14 Doc14.txt 9 7
## 15 15 Doc15.txt 10 8
## 16 16 Doc16.txt 4 5
## 17 17 Doc17.txt 11 9
## 18 18 Doc18.txt 12 10
## 19 19 Doc19.txt 13 11
## 20 20 Doc20.txt 14 12
## 21 21 Doc21.txt 15 13
## 22 22 Doc22.txt 16 14
## 23 23 Doc23.txt 17 15
## 24 24 Doc24.txt 18 16
## 25 25 Doc25.txt 19 17
## 26 26 Doc26.txt 20 18
## 27 27 Doc27.txt 21 19
## 28 28 Doc28.txt 22 20
## 29 29 Doc29.txt 23 21
## 30 30 Doc30.txt 24 22
## 31 31 Doc31.txt 25 23
## 32 32 Doc32.txt 5 24
## 33 33 Doc33.txt 26 25
## 34 34 Doc34.txt 27 26
## 35 35 Doc35.txt 2 30
## 36 36 Doc36.txt 28 27
## 37 37 Doc37.txt 29 28
## 38 38 Doc38.txt 2 30
## 39 39 Doc39.txt 30 29
## 40 40 Doc40.txt 2 30
## 41 41 Doc41.txt 2 30
#lets weight the nodes and edges
#set label (not really necessary)
#V=vertex, E=edge
V(g)$label <- V(g)$name
#Vertex size proportional to number of connections
V(g)$size <- degree(g)*.6
#Vertex label size proportional to number of connections
V(g)$label.cex <- degree(g) / max(degree(g))+ .8
#label colour default black
V(g)$label.color <- "black"
#Vertex color orange
V(g)$color <- "orange"
#edge color grey
E(g)$color <- "grey"
#edge width proportional to similarity (weight)
E(g)$width <- E(g)$weight*7
#lets weight the nodes and edges
#set label (not really necessary)
#V=vertex, E=edge
V(g)$label <- V(g)$name
#Vertex size proportional to number of connections
V(g)$size <- degree(g)*.6
#Vertex label size proportional to number of connections
V(g)$label.cex <- degree(g) / max(degree(g))+ .6
#label colour default black
V(g)$label.color <- "black"
#Vertex color orange
V(g)$color <- "orange"
#edge color grey
E(g)$color <- "grey"
#edge width proportional to similarity (weight)
E(g)$width <- E(g)$weight*5
# plot the graph in layout1 (fruchtermann reingold)
plot(g, layout=layout.auto)
#output is quite ugly. Explore igraph to see how you
#can fix it
plot(g, layout=layout1)
From our exploration, we can identify three clear topics in this corpus: one around tasks, time and distributions; one around documents, clusters and text classification; and one around project and risk management.