#loading tm - text mining library
library(tm)
## Warning: package 'tm' was built under R version 3.5.2
## Loading required package: NLP
## Warning: package 'NLP' was built under R version 3.5.2
Creating a collection of documents (technically referred to as a Corpus) in the R environment. This basically involves loading the files from the text mining folder into a Corpus object.
#create corpus
docs <- Corpus(DirSource("C:/Users/dave_/Documents/olga_data_science_machine_learning/text_mining_R"))
#quick check
docs
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 30
#inspect document number 15 (commented out: the output is long; the check looked fine)
#writeLines(as.character(docs[[15]]))
Pre-processing
Data cleansing, though tedious, is perhaps the most important step in text analysis. As we will see, dirty data can play havoc with the results. Furthermore, as we will also see, data cleaning is invariably an iterative process as there are always problems that are overlooked the first time around.
The tm package offers a number of transformations that ease the tedium of cleaning data. To see the available transformations type getTransformations() at the R prompt:
getTransformations()
## [1] "removeNumbers" "removePunctuation" "removeWords"
## [4] "stemDocument" "stripWhitespace"
A few edits are needed before applying the standard transformations. In this case, we need a transformation that replaces all instances of a given character with a space. As it turns out, the gsub() function does just that, so we wrap it in content_transformer():
#create the toSpace content transformer
toSpace <- content_transformer(function(x, pattern) {return (gsub(pattern, ' ', x))})
Now apply the toSpace transformer to the corpus, mapping hyphens and colons to spaces:
docs <- tm_map(docs, toSpace, '-')
docs <- tm_map(docs, toSpace, ':')
#check after each transformation
#(commented out: all good)
#writeLines(as.character(docs[[15]]))
Now we can remove punctuation:
#remove punctuation (removePunctuation deletes the characters outright, which is why we mapped hyphens and colons to spaces first)
docs <- tm_map(docs, removePunctuation)
Check again:
#writeLines(as.character(docs[[15]]))
A quick check shows that some punctuation is still left, so remove it:
docs <- tm_map(docs, toSpace, ' -')
The next step is to convert the corpus to lower case and remove all numbers (stripping numbers is not always appropriate, but it is here).
#conversion to lower case
docs <- tm_map(docs,content_transformer(tolower))
#Strip digits (std transformation, so no need for content_transformer)
docs <- tm_map(docs, removeNumbers)
The next step is to remove common words from the text. These include articles (a, an, the), conjunctions (and, or, but etc.), common verbs (is) and qualifiers (yet, however etc.). The tm package includes a standard list of such words, referred to as stop words. We remove stop words using the standard removeWords transformation like so:
#remove stopwords using the standard list in tm
docs <- tm_map(docs, removeWords, stopwords('english'))
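If you're curious about what exactly is being removed, the standard list can be inspected directly; for example:
#peek at the first few entries of tm's standard English stopword list
head(stopwords('english'), 10)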
Finally, we remove all extraneous whitespace using the stripWhitespace transformation:
#Strip whitespace (cosmetic?)
docs <- tm_map(docs, stripWhitespace)
Final check
writeLines(as.character(docs[[15]]))
## rituals information system design development leave comment introduction information system development generally viewed rational process involving steps planning requirements gathering design etc however since often involves many people natural process will social political dimensions well rational elements development process focus matters analysis coding adherence guidelines etc hand socio political aspects things differences opinion conflict organisational turf wars etc interesting thing however elements appear rational sometimes subverted achieve political ends shorn original intent become ritualsthat performed symbolic reasons rather rational ones paper discuss rituals system development drawing paper daniel robey lynne markus entitled rituals information system design background according authors labelling process system design development rational implies process can set explained logical way moreover also implies system designed clear goals can defined upfront implemented system will used manner intended designers hand political perspective emphasise differences various stakeholder groups eg users sponsors developers group uses process ways benefit sometimes detriment others paper authors discuss following two elements system development process consistent views summarised system development lifecycle techniques user involvement ill look turn next two sections emphasising rational features development lifecycle basic steps system development lifecycle common methodologies inception requirements gathering analysis specification design programming testing training rollout waterfall methodologies run whereas iterativeincremental methods loop subset many times needed easy see lifecycle rational basis – specification depends requirements can therefore done requirements gathered analysis programming can proceed design completed sounds logical rational moreover mid size large teams activities carried different individuals – business analysts architectsdesigners programmers testers trainers operations staff advantage following formal development cycle makes easier plan coordinate large development efforts least principle techniques user involvement truism success system depends critically level user interest engagement generates user involvement different phases system development therefore seen key generating maintaining user engagement common techniques solicit user involvement include requirements analysis direct interaction users necessary order get good understanding expectations system another benefit gives project team early opportunity gain user engagement steering committees typically committees composed key stakeholders group affected system although question utility steering committees true committees consist high ranking executives can help driving user engagement prototyping involves creating working model serves demonstrate subset full functionality system great advantage method user involvement gives users opportunity provide feedback early development lifecycle easy see techniques rational basis logic involving users early development process helps become familiar system thus improving chances will willing even enthusiastic adopters system rolled political players politics inevitable social system stakeholder groups differing interests case system development two important stakeholder groups users developers among things two groups differ cognitive style developers tend analyticallogical types users come broad spectrum cognitive types yes generalisation largely true 
## position organisation corporate environment business users generally outrank technical staff affiliations users developers belong different organisational units therefore differing loyalties incentives typically member two groups different goals developers may measured success rollout whereas users may judged proficiency new system resulting gains productivity lead differences ways two groups perceive processes events example developer may see specification blueprint design whereas user might see bureaucratic document locks choices ill equipped make differences perceptions make far obvious different parties can converge common worldview assumed rational perspective indeed situations isnt clear constitutes “common interest” indeed differences lead ritualisation aspects systems development process ritualisation rational processes now look differences perspectives can lead situation processes intended rational end becoming rituals lets begin example occurs inception phase system development project formulation business case stated intent business case make rational argument particular system built ideally created jointly business technology departments practice however frequently happens one two parties given primary responsibility two parties equally represented business case ends becoming political document instead presenting balanced case presents distorted view focuses one partys needs happens business case becomes symbol rather substance – words ritual another example handover process developers users operations matter process intended ensure system indeed function promised scope document sometimes though parties attempt safeguard interests developers may pressure users sign whereas users may delay signing want check system ever thoroughly situations handover process serves forum parties argue positions rather means move project close actual process shorn original intent meaning thus ritualised even steering committees can end ritualised example committee consists senior executives different divisions can happen member will attempt safeguard interests fief committee meetings become forums bicker rather provide direction project words become symbolic events achieve little substance discussion main conclusion argument information system design implementation rational political process consequence many processes associated turn like rituals symbolise rationality actually rational said noted rituals important function serve give whole process systems development veneer rationality whilst allowing political manouevering inevitable large projects authors put rituals systems development function maintain appearance rationality systems development organisational decision making regardless whether actually produces rational outcomes systems development must symbolize rationality signify actions taken arbitrary rather acceptable within organisations ideology rituals help provide meaning actions taken within organisation feel compelled add even actions taken completely irrational arbitrary… summary… speculation experience central message paper rings true systems development design like many organisational processes procedures often hijacked different parties suit ends situations processes reduced rituals maintain facade rationality whilst providing cover politicking rational actions finally interesting note problem ritualisation rather general one many allegedly rational processes organisations symbol substance examples processes prone ritualisation include performance management project management
## planning hints deeper issue one think origins modern managements penchant overly prescriptive formulaic approaches managing organisations initiatives however remains speculation topic another time…
All good!
Stemming
Typically a large corpus will contain many words that have a common root – for example: offer, offered and offering. Stemming is the process of reducing such related words to their common root, which in this case would be the word offer.
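One can see this directly with the wordStem() function from the SnowballC package (which we load shortly anyway); all three variants should reduce to the same root:
#stem the example words - expect 'offer' for all three
library(SnowballC)
wordStem(c('offer', 'offered', 'offering'))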
#writeLines(as.character(docs[[30]]))
Now let’s stem the corpus and reinspect it.
#load library
library(SnowballC)
## Warning: package 'SnowballC' was built under R version 3.5.2
#Stem document
docs <- tm_map(docs,stemDocument)
#writeLines(as.character(docs[[30]]))
On another important note, the output also reveals a problem or two. First, organiz and organis are variants of the same stem, organ. Clearly, they should be merged. Second, the word andgovern should be separated out into and and govern (this is an error in the original text). These (and other errors of their ilk) can and should be fixed before proceeding. This is easily done using gsub() wrapped in content_transformer(). Here is the code to clean up these and a few other issues that I found:
docs <- tm_map(docs, content_transformer(gsub), pattern = 'organiz', replacement = 'organ')
docs <- tm_map(docs, content_transformer(gsub), pattern = 'organis', replacement = 'organ')
docs <- tm_map(docs, content_transformer(gsub), pattern = 'andgovern', replacement = 'govern')
docs <- tm_map(docs, content_transformer(gsub), pattern = 'inenterpris', replacement = 'enterpris')
docs <- tm_map(docs, content_transformer(gsub), pattern = 'team-', replacement = 'team')
The document term matrix
The next step in the process is the creation of the document term matrix (DTM) – a matrix that lists all occurrences of words in the corpus, by document. In the DTM, the documents are represented by rows and the terms (or words) by columns. The matrix entry corresponding to a given row and column is the number of times that word occurs in the document (so a word that occurs twice in a document is recorded as 2); if the word does not occur at all, the entry is 0.
A simple example might serve to explain the structure of the DTM more clearly. Assume we have a simple corpus consisting of two documents, Doc1 and Doc2, with the following content:
Doc1: bananas are yellow
Doc2: bananas are good
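As a quick illustration, here is a minimal sketch that builds this two-document corpus in memory (using VectorSource rather than DirSource; the variable name toy is mine) and inspects the resulting DTM:
#toy corpus to illustrate the DTM structure
toy <- Corpus(VectorSource(c('bananas are yellow', 'bananas are good')))
inspect(DocumentTermMatrix(toy))
The matrix has one row per document and a column for each term (are, bananas, good, yellow): bananas and are score 1 in both rows, while yellow and good each appear in only one.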
Clearly there is nothing special about rows and columns – we could just as easily transpose them. If we did so, we’d get a term document matrix (TDM) in which the terms are rows and documents columns. One can work with either a DTM or TDM. I’ll use the DTM in what follows.
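For reference, tm provides a constructor for the transposed form too, so if you prefer terms as rows you can simply do:
#terms as rows, documents as columns
tdm <- TermDocumentMatrix(docs)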
There are a couple of general points worth making before we proceed. Firstly, DTMs (or TDMs) can be huge – the dimension of the matrix is the number of documents times the number of distinct words in the corpus. Secondly, the large majority of words appear in only a few documents. As a result, a DTM is invariably sparse – that is, a large number of its entries are 0.
The business of creating a DTM (or TDM) in R is as simple as:
dtm <- DocumentTermMatrix(docs)
This creates a document term matrix from the corpus and stores the result in the variable dtm. One can get summary information on the matrix by typing the variable name in the console and hitting return:
dtm
## <<DocumentTermMatrix (documents: 30, terms: 3915)>>
## Non-/sparse entries: 14040/103410
## Sparsity : 88%
## Maximal term length: 48
## Weighting : term frequency (tf)
This is a 30 x 3915 matrix in which 88% of the entries are zero.
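If you want to verify the sparsity figure, it can be computed directly from the dense form. A quick sketch, fine for a corpus of this size, though note that as.matrix() can exhaust memory on very large DTMs:
#fraction of zero entries in the DTM - should be roughly 0.88
m <- as.matrix(dtm)
sum(m == 0) / length(m)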
One can inspect the DTM, and you might want to do so for fun. However, it isn’t particularly illuminating because of the sheer volume of information that will flash up on the console. To limit the information displayed, one can inspect a small section of it like so:
inspect(dtm[1:2,1000:1005])
## <<DocumentTermMatrix (documents: 2, terms: 6)>>
## Non-/sparse entries: 0/12
## Sparsity : 100%
## Maximal term length: 7
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs couch craft crisi critic current cya
## BeyondEntitiesAndRelationships.txt 0 0 0 0 0 0
## bigdata.txt 0 0 0 0 0 0
This command displays terms 1000 through 1005 in the first two rows of the DTM. Note that your results may differ.
Mining the corpus
Notice that in constructing the DTM, we have converted a corpus of text into a mathematical object that can be analysed using the quantitative techniques of matrix algebra. It should be no surprise, therefore, that the DTM (or TDM) is the starting point for quantitative text analysis.
For example, to get the frequency of occurrence of each word in the corpus, we simply sum over all rows to give column sums:
freq <- colSums(as.matrix(dtm))
#freq
Here we have first converted the DTM into a mathematical matrix using the as.matrix() function, and then summed over all rows to give the totals for each column (term). The result is stored in the named numeric vector freq.
Check that the length of freq equals the number of terms:
#length of freq - total number of terms
length(freq)
## [1] 3915
Now we sort freq in descending order of term count:
ord <- order(freq,decreasing = TRUE)
Check the most and least frequently occurring terms:
#inspect most frequent occurring terms
freq[head(ord)]
## one – can manag organ work
## 325 303 244 230 221 209
#inspect least frequent occurring terms
freq[tail(ord)]
## therebi timeorgan uncommit unionist willing workday
## 1 1 1 1 1 1
The least frequent terms can be more interesting than one might think. This is because terms that occur rarely are likely to be more descriptive of specific documents. Indeed, I can recall the posts in which I have referred to Yorkshire, Zeno’s Paradox and Mr. Lou Zulli without having to go back to the corpus, but I’d have a hard time enumerating the posts in which I’ve used the word system.
Words like “can” and “one” give us no information about the subject matter of the documents in which they occur (notice, too, that the en-dash survived our cleaning and shows up as a frequent “term” – data cleansing really is iterative). Such words can be eliminated without loss. Indeed, they ought to have been eliminated by the stopword removal we did earlier. However, since such words occur very frequently – in virtually all documents – we can remove them by enforcing bounds when creating the DTM, like so:
dtmr <-DocumentTermMatrix(docs, control=list(wordLengths=c(4, 20),
bounds = list(global = c(3,27))))
Here we have told R to include only those words that occur in 3 to 27 documents. We have also enforced lower and upper limits on the length of the words included (between 4 and 20 characters).
Inspecting the new DTM:
dtmr
## <<DocumentTermMatrix (documents: 30, terms: 1295)>>
## Non-/sparse entries: 10086/28764
## Sparsity : 74%
## Maximal term length: 15
## Weighting : term frequency (tf)
The dimension is reduced to 30 x 1295.
Let's check the frequencies of words across documents and sort as before:
freqr <- colSums(as.matrix(dtmr))
#length should be total number of terms
length(freqr)
## [1] 1295
#create sort order (desc)
ordr <- order(freqr,decreasing=TRUE)
#inspect most frequently occurring terms
freqr[head(ordr)]
## manag organ work system project problem
## 230 221 209 193 185 173
freqr[tail(ordr)]
## hmmm struck multin lower pseudo gloss
## 3 3 3 3 3 3
The results make sense: the top 6 keywords are pretty good descriptors of what my blog is about – projects, management and systems. However, not all high-frequency words need be significant. What they do is give you an idea of potential classification terms.
That done, let's get a list of terms that occur at least 80 times in the entire corpus. This is easily done using the findFreqTerms() function as follows:
findFreqTerms(dtmr,lowfreq=80)
## [1] "action" "approach" "base" "busi" "data"
## [6] "design" "develop" "differ" "discuss" "enterpris"
## [11] "exampl" "group" "howev" "import" "issu"
## [16] "make" "manag" "mani" "model" "often"
## [21] "organ" "peopl" "point" "practic" "problem"
## [26] "process" "project" "question" "said" "situat"
## [31] "system" "thing" "think" "time" "understand"
## [36] "view" "well" "will" "work" "chang"
## [41] "consult" "decis" "even" "like"
Here I have asked findFreqTerms() to return all terms that occur at least 80 times in the entire corpus. Note, however, that the result is ordered alphabetically, not by frequency.
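If you would rather see the same terms ordered by frequency, one can simply sort the frequency vector directly:
#frequency-ordered view of the terms occurring at least 80 times
sort(freqr[freqr >= 80], decreasing = TRUE)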
Now that we have the most frequently occurring terms in hand, we can check for correlations between some of these and other terms that occur in the corpus. In this context, correlation is a quantitative measure of the co-occurrence of words in multiple documents.
The tm package provides the findAssocs() function to do this. One needs to specify the DTM, the term of interest and the correlation limit. The latter is a number between 0 and 1 that serves as a lower bound for the strength of correlation between the search and result terms. For example, if the correlation limit is 1, findAssocs() will return only those words that always co-occur with the search term. A correlation limit of 0.5 will return terms that have a search term co-occurrence of at least 50% and so on.
Here are the results of running findAssocs() on some of the frequently occurring terms (project, enterpris, system) at a correlation of 60%.
findAssocs(dtmr,'project',0.6)
## $project
## inher manag handl occurr
## 0.82 0.69 0.68 0.67
findAssocs(dtmr,'enterpris',0.6)
## $enterpris
## agil increment realist upfront technolog solv
## 0.81 0.79 0.77 0.76 0.69 0.68
## neither movement happi adapt architect architectur
## 0.68 0.66 0.66 0.65 0.65 0.65
## chanc fine featur
## 0.63 0.63 0.62
findAssocs(dtmr,'system',0.6)
## $system
## design subset adopt user involv specifi function intend
## 0.78 0.78 0.77 0.75 0.71 0.71 0.70 0.67
## step softwar specif intent compos depart phone frequent
## 0.67 0.67 0.66 0.66 0.66 0.65 0.63 0.62
## today pattern author wherea cognit
## 0.62 0.61 0.60 0.60 0.60
An important point to note is that the presence of a term in these lists is not indicative of its frequency. Rather, it is a measure of the frequency with which the two terms (search and result) co-occur (or show up together) in documents across the corpus. Note also that it is not an indicator of nearness or contiguity. Indeed, it cannot be, because the document term matrix does not store any information on the proximity of terms; it is simply a “bag of words.”
As it turned out, the very basic techniques listed above were enough for me to get a handle on the original problem that led me to text mining – the analysis of free-text problem descriptions in my organisation's service management tool. What I did was work my way through the top 50 terms and find their associations. These revealed a number of sets of keywords that occurred in multiple problem descriptions, which was good enough for me to define some useful sub-categories. These are currently being reviewed by the service management team. While they're busy with that, I'm looking into refining these further using techniques such as cluster analysis and tokenization. A simple case of the latter would be to look at two-word combinations in the text (technically referred to as bigrams); see the sketch below. As one might imagine, the dimensionality of the DTM quickly gets out of hand as one considers larger multi-word combinations.
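For the curious, here is a minimal bigram sketch based on the tokenizer pattern in the tm FAQ. Note that custom tokenizers are ignored by the SimpleCorpus we created with DirSource, so the corpus is first converted to a VCorpus (the names bigramTokenizer, vdocs and dtm_bigram are mine):
#bigram tokenizer using ngrams() and words() from the NLP package (loaded with tm)
bigramTokenizer <- function(x)
  unlist(lapply(ngrams(words(x), 2), paste, collapse = ' '), use.names = FALSE)
#convert to a VCorpus so the custom tokenizer is honoured
vdocs <- VCorpus(VectorSource(sapply(docs, as.character)))
dtm_bigram <- DocumentTermMatrix(vdocs, control = list(tokenize = bigramTokenizer))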
Basic graphics
One of the really cool things about R is its graphing capability. I’ll do just a couple of simple examples to give you a flavour of its power and cool factor. There are lots of nice examples on the Web that you can try out for yourself.
Let's first do a simple frequency histogram. I'll use the ggplot2 package, written by Hadley Wickham, to do this. Here's the code:
wf <- data.frame(term=names(freqr), occurrences=freqr)
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
p <- ggplot(subset(wf, occurrences>100), aes(term, occurrences))
p <- p + geom_bar(stat='identity')
p <- p + theme(axis.text.x=element_text(angle=45, hjust=1))
p <- p + ggtitle('Term occurrence histogram (freq > 100)')
p
Finally, let’s create a wordcloud for no other reason than everyone who can seems to be doing it. The code for this is:
#wordcloud
library(wordcloud)
## Warning: package 'wordcloud' was built under R version 3.5.2
## Loading required package: RColorBrewer
#setting the same seed each time ensures consistent look across clouds
set.seed(42)
#limit words by specifying min frequency
wordcloud(names(freqr),freqr, min.freq=70)
## Warning in wordcloud(names(freqr), freqr, min.freq = 70): manag could not
## be fit on page. It will not be plotted.
## Warning in wordcloud(names(freqr), freqr, min.freq = 70): system could not
## be fit on page. It will not be plotted.
## Warning in wordcloud(names(freqr), freqr, min.freq = 70): organ could not
## be fit on page. It will not be plotted.
## Warning in wordcloud(names(freqr), freqr, min.freq = 70): chang could not
## be fit on page. It will not be plotted.
## Warning in wordcloud(names(freqr), freqr, min.freq = 70): develop could not
## be fit on page. It will not be plotted.
## Warning in wordcloud(names(freqr), freqr, min.freq = 70): problem could not
## be fit on page. It will not be plotted.
## Warning in wordcloud(names(freqr), freqr, min.freq = 70): approach could
## not be fit on page. It will not be plotted.
## Warning in wordcloud(names(freqr), freqr, min.freq = 70): practic could not
## be fit on page. It will not be plotted.
Finally, one can make the wordcloud more visually appealing by adding colour as follows:
#…add color
wordcloud(names(freqr),freqr,min.freq=70,colors=brewer.pal(6,'Dark2'))
## Warning in wordcloud(names(freqr), freqr, min.freq = 70, colors =
## brewer.pal(6, : system could not be fit on page. It will not be plotted.
## Warning in wordcloud(names(freqr), freqr, min.freq = 70, colors =
## brewer.pal(6, : exampl could not be fit on page. It will not be plotted.
## Warning in wordcloud(names(freqr), freqr, min.freq = 70, colors =
## brewer.pal(6, : question could not be fit on page. It will not be plotted.
## Warning in wordcloud(names(freqr), freqr, min.freq = 70, colors =
## brewer.pal(6, : approach could not be fit on page. It will not be plotted.
## Warning in wordcloud(names(freqr), freqr, min.freq = 70, colors =
## brewer.pal(6, : practic could not be fit on page. It will not be plotted.
## Warning in wordcloud(names(freqr), freqr, min.freq = 70, colors =
## brewer.pal(6, : enterpris could not be fit on page. It will not be plotted.
## Warning in wordcloud(names(freqr), freqr, min.freq = 70, colors =
## brewer.pal(6, : design could not be fit on page. It will not be plotted.
## Warning in wordcloud(names(freqr), freqr, min.freq = 70, colors =
## brewer.pal(6, : point could not be fit on page. It will not be plotted.
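Incidentally, the “could not be fit on page” warnings just mean that some high-frequency terms were too large to place in the cloud. Shrinking the scale argument (which controls the size range of the plotted words) usually fixes this; for example:
#reduce the word size range so the largest terms fit on the page
wordcloud(names(freqr), freqr, min.freq=70, scale=c(3, 0.3), colors=brewer.pal(6,'Dark2'))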