Loading Files

base <- read.delim("~/Porta Text Mining/base/comments.tab", stringsAsFactors=FALSE)   # read the raw comments table
base <- base$text                                                                     # keep only the text column
write(base, file="~/Porta Text Mining/corpus/base.txt")                               # write the comments to a single plain-text file
base <- read.delim("~/Porta Text Mining/corpus/base.txt", stringsAsFactors=FALSE)     # read the text file back in
#View(base)

Basic Text Mining in R

To start, install the packages you need to mine text. You only need to do this step once.

#Needed <- c("tm", "SnowballC", "RColorBrewer", "ggplot2", "wordcloud", "biclust", "cluster", "igraph", "fpc")   
#install.packages(Needed, dependencies=TRUE)   
#install.packages("Rcampdf", repos = "http://datacube.wu.ac.at/", type = "source")  
#install.packages("twitteR")
#require(twitteR)

If you get the message “Update all/some/none? [a/s/n]:”, enter “a” and press return.

Loading Texts

Start by saving your text files in a folder titled “texts”. This will be the “corpus” (body) of texts you are mining.

Note: The text used in this example is a set of Portuguese-language comments that were exported to a single plain-text file (see the “Loading Files” step above). You can use a variety of media for this, such as PDF and HTML.

Read this next part carefully. You need to do three things here: 1. Create a folder named “texts” where you’ll keep your data. 2. Save the folder to a particular place (Mac: Desktop; PC: C: drive). 3. Copy and paste the appropriate scripts below.

On a Mac, save the folder to your desktop and use the following code chunk (in this example, the corpus is read from a “Porta Text Mining/corpus” folder instead):

cname <- file.path("~", "Porta Text Mining", "corpus")   
cname  
## [1] "~/Porta Text Mining/corpus"
dir(cname)   # Use this to check to see that your texts have loaded.   
## [1] "base.txt"

On a PC, save the folder to your C: drive and use the following code chunk:

#cname <- file.path("C:", "texts")   
#cname   
#dir(cname)

Load the R package for text mining and then load your texts into R.

library(tm)   
## Loading required package: NLP
docs <- Corpus(DirSource(cname))   

If you so desire, you can read your documents in the R terminal using inspect(docs). Or, if you prefer to look at only one of the documents you loaded, then you can specify which one using something like:

inspect(docs[1])
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 1
## 
## [[1]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 619038

In this case, you would call up only the first document you loaded. Be careful. Either of these commands will fill up your screen fast.

Preprocessing

Once you are sure that all documents loaded properly, go on to preprocess your texts. This step allows you to remove numbers, capitalization, common words, and punctuation, and otherwise prepare your texts for analysis. It can be somewhat time-consuming and finicky, but it pays off in the end with higher-quality analyses.

Removing punctuation: Your computer cannot actually read; punctuation and other special characters only look like more words to R. Use the following methods to remove them from your text.

docs <- tm_map(docs, removePunctuation)   
#inspect(docs[1]) # Check to see if it worked. 

If necessary, such as when working with emails, you can also remove other special characters. The list below targets characters commonly found in emails; customize it as you see fit to meet your own needs.

for (j in seq(docs)) {
  docs[[j]] <- gsub("/", " ", docs[[j]])         # forward slashes
  docs[[j]] <- gsub("@", " ", docs[[j]])         # "at" signs
  docs[[j]] <- gsub("\\|", " ", docs[[j]])       # pipes
  docs[[j]] <- gsub("<[^>]*>", " ", docs[[j]])   # anything wrapped in angle brackets, e.g. <br>
}
#inspect(docs[1]) # You can check a document (in this case the first) to see if it worked.   

Removing numbers:

docs <- tm_map(docs, removeNumbers)   
# inspect(docs[1]) # Check to see if it worked.   

Converting to lowercase: As before, we want a word to appear exactly the same every time it appears. We therefore change everything to lowercase.

docs <- tm_map(docs, tolower)   
# inspect(docs[1]) # Check to see if it worked. 

Removing “stopwords” (common words that usually have no analytic value): in every text there are a lot of common, uninteresting words (a, and, also, the, etc.). Such words are frequent by their nature and will confound your analysis if they remain in the text.

# For a list of the stopwords, see:   
length(stopwords("portuguese"))   
## [1] 203
stopwords("portuguese")   
##   [1] "de"           "a"            "o"            "que"         
##   [5] "e"            "do"           "da"           "em"          
##   [9] "um"           "para"         "com"          "não"         
##  [13] "uma"          "os"           "no"           "se"          
##  [17] "na"           "por"          "mais"         "as"          
##  [21] "dos"          "como"         "mas"          "ao"          
##  [25] "ele"          "das"          "à"            "seu"         
##  [29] "sua"          "ou"           "quando"       "muito"       
##  [33] "nos"          "já"           "eu"           "também"      
##  [37] "só"           "pelo"         "pela"         "até"         
##  [41] "isso"         "ela"          "entre"        "depois"      
##  [45] "sem"          "mesmo"        "aos"          "seus"        
##  [49] "quem"         "nas"          "me"           "esse"        
##  [53] "eles"         "você"         "essa"         "num"         
##  [57] "nem"          "suas"         "meu"          "às"          
##  [61] "minha"        "numa"         "pelos"        "elas"        
##  [65] "qual"         "nós"          "lhe"          "deles"       
##  [69] "essas"        "esses"        "pelas"        "este"        
##  [73] "dele"         "tu"           "te"           "vocês"       
##  [77] "vos"          "lhes"         "meus"         "minhas"      
##  [81] "teu"          "tua"          "teus"         "tuas"        
##  [85] "nosso"        "nossa"        "nossos"       "nossas"      
##  [89] "dela"         "delas"        "esta"         "estes"       
##  [93] "estas"        "aquele"       "aquela"       "aqueles"     
##  [97] "aquelas"      "isto"         "aquilo"       "estou"       
## [101] "está"         "estamos"      "estão"        "estive"      
## [105] "esteve"       "estivemos"    "estiveram"    "estava"      
## [109] "estávamos"    "estavam"      "estivera"     "estivéramos" 
## [113] "esteja"       "estejamos"    "estejam"      "estivesse"   
## [117] "estivéssemos" "estivessem"   "estiver"      "estivermos"  
## [121] "estiverem"    "hei"          "há"           "havemos"     
## [125] "hão"          "houve"        "houvemos"     "houveram"    
## [129] "houvera"      "houvéramos"   "haja"         "hajamos"     
## [133] "hajam"        "houvesse"     "houvéssemos"  "houvessem"   
## [137] "houver"       "houvermos"    "houverem"     "houverei"    
## [141] "houverá"      "houveremos"   "houverão"     "houveria"    
## [145] "houveríamos"  "houveriam"    "sou"          "somos"       
## [149] "são"          "era"          "éramos"       "eram"        
## [153] "fui"          "foi"          "fomos"        "foram"       
## [157] "fora"         "fôramos"      "seja"         "sejamos"     
## [161] "sejam"        "fosse"        "fôssemos"     "fossem"      
## [165] "for"          "formos"       "forem"        "serei"       
## [169] "será"         "seremos"      "serão"        "seria"       
## [173] "seríamos"     "seriam"       "tenho"        "tem"         
## [177] "temos"        "tém"          "tinha"        "tínhamos"    
## [181] "tinham"       "tive"         "teve"         "tivemos"     
## [185] "tiveram"      "tivera"       "tivéramos"    "tenha"       
## [189] "tenhamos"     "tenham"       "tivesse"      "tivéssemos"  
## [193] "tivessem"     "tiver"        "tivermos"     "tiverem"     
## [197] "terei"        "terá"         "teremos"      "terão"       
## [201] "teria"        "teríamos"     "teriam"
docs <- tm_map(docs, removeWords, stopwords("portuguese"))
docs <- tm_map(docs, removeWords, stopwords("english"))  
#inspect(docs[1]) # Check to see if it worked.   

Removing particular words: If you find that a particular word or two appear in the output but are not of value to your analysis, you can remove them specifically from the text.

docs <- tm_map(docs, removeWords, c("department", "email", "pra", "que", "por","aqui", "ziyndwvwhzuvyfnzzspartndgr"))   
# Just replace "department" and "email" with words that you would like to remove.

Removing URLs:

removeURL <- function(x) gsub("http[[:alnum:][:punct:]]*", "", x)   # drop "http"/"https" and everything attached to it
docs <- tm_map(docs, removeURL, lazy=TRUE) 

Combining words that should stay together

If you wish to preserve a concept that is only apparent as a collection of two or more words, you can combine the words or reduce them to a meaningful acronym before you begin the analysis. Here, the examples are particular to this corpus: normalizing name spellings (e.g., “Fábio Porchat”) and collapsing laughter strings such as “kkkk” and “haha”.

for (j in seq(docs)) {
  # Normalize names and common spelling variants
  docs[[j]] <- gsub("Fábio Porchat", "Porchat", docs[[j]])
  docs[[j]] <- gsub("Fabio Porchat", "Porchat", docs[[j]])
  docs[[j]] <- gsub("Porchá", "Porchat", docs[[j]])
  docs[[j]] <- gsub("vídeos", "vídeo", docs[[j]])
  docs[[j]] <- gsub("video", "vídeo", docs[[j]])
  docs[[j]] <- gsub("nao", "não", docs[[j]])
  # Collapse laughter strings ("kkkk", "haha", ...) into a single form
  docs[[j]] <- gsub("kkk*", "kkkk", docs[[j]])
  docs[[j]] <- gsub("kkkkkk*", "kkkk", docs[[j]])
  docs[[j]] <- gsub("hah*", "hahah", docs[[j]])
  docs[[j]] <- gsub("hhah*", "hahah", docs[[j]])
  docs[[j]] <- gsub("aha*", "hahah", docs[[j]])
  docs[[j]] <- gsub("aaa*", "aaaaa", docs[[j]])
  # Whole-string laughter patterns
  docs[[j]] <- gsub("^[ha]+$", "hahaha", docs[[j]])
  docs[[j]] <- gsub("^[uashha]+$", "hahaha", docs[[j]])
  docs[[j]] <- gsub("^[aeu]+$", "hahaha", docs[[j]])
  docs[[j]] <- gsub("hah[[:alnum:]]", "hahaha", docs[[j]])
  docs[[j]] <- gsub("hhah[[:alnum:]]", "hahaha", docs[[j]])
}

Removing common word endings (e.g., “ing”, “es”, “s”): this is referred to as “stemming” the documents. We stem the documents so that a word is recognized as the same term regardless of which ending it carries in the original text.

library(SnowballC)   
docs <- tm_map(docs, stemDocument, "portuguese")   # stem with the Portuguese stemmer
docs <- tm_map(docs, stemDocument)                 # then apply the default stemmer as well
#inspect(docs[1]) # Check to see if it worked.   

Stripping unnecessary whitespace from your documents: the preprocessing above leaves a lot of “white space” behind, left over from all the words and characters that were deleted. This white space can, and should, be removed.

docs <- tm_map(docs, stripWhitespace)   
#inspect(docs[1]) # Check to see if it worked.   

To Finish

Be sure to use the following script once you have completed preprocessing. This tells R to treat your preprocessed documents as text documents.

docs <- tm_map(docs, PlainTextDocument)   

This is the end of the preprocessing stage.

Stage the Data

To proceed, create a document term matrix. This is what you will be using from this point on.

dtm <- DocumentTermMatrix(docs, control=list(wordLengths=c(4, 12)))   # keep only terms that are 4 to 12 characters long
#dtm 

To inspect the matrix, you can use inspect(dtm). This will, however, fill up your terminal quickly, so you may prefer to view a subset: inspect(dtm[1:5, 1:20]) shows the first 5 documents and first 20 terms (modify as you like), and dim(dtm) displays the number of documents and terms (in that order).
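
For convenience, here are those commands as a commented code chunk (uncomment to run):

#inspect(dtm[1:5, 1:20])   # view the first 5 documents & first 20 terms - modify as you like
#dim(dtm)                  # number of documents & terms (in that order)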

You’ll also need a transpose of this matrix. Create it using:

tdm <- TermDocumentMatrix(docs)   
#tdm  
#inspect(dtm[1, 1:20])

Explore your data

Organize terms by their frequency:

freq <- colSums(as.matrix(dtm))   
length(freq)   
## [1] 9370
ord <- order(freq)  

If you prefer to export the matrix to a CSV file (which you can open in Excel):

m <- as.matrix(dtm)   
dim(m)   
## [1]    1 9370
write.csv(m, file="dtm.csv")   

Focus!

Er, that is, you can focus on just the interesting stuff…

#  Start by removing sparse terms:   
dtms <- removeSparseTerms(dtm, 0.1) # Keep only terms that appear in at least 90% of the documents (at most 10% sparsity).   
#inspect(dtms)  

Word Frequency

There are a lot of terms, so for now, just check out some of the most and least frequently occurring words.

freq[head(ord)]   
## aaaaaeeeee     aaaaaf   aaaaafdx     abaixa  abaixando     abaixo 
##          1          1          1          1          1          1
freq[tail(ord)] 
##   coca  merda kellen  vídeo   nome   kkkk 
##    494    688    712    787   2327   2792
head(table(freq), 20) 
## freq
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
## 5914 1222  542  379  226  155  125   80   66   53   51   39   31   33   39 
##   16   17   18   19   20 
##   25   29   10   16   17

The resulting output is two rows of numbers: the top row is a word frequency and the bottom row is how many terms appear with that frequency. Here, considering only the 20 lowest word frequencies, we can see that 5914 terms appear only once. There are also a lot of others that appear very infrequently.

tail(table(freq), 20)
## freq
##  170  172  173  179  185  212  241  262  270  280  347  365  386  467  494 
##    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1 
##  688  712  787 2327 2792 
##    1    1    1    1    1

Considering only the 20 greatest frequencies, we can see that there is a huge disparity in how frequently some terms appear.

For a less fine-grained look at term frequency, we can view a table of the terms we kept when we removed sparse terms, above. (Look just under the word “Focus”.)

freq <- colSums(as.matrix(dtms))   
#freq  

The frequency vector above was computed from the sparse-term matrix (dtms) we created earlier. What follows is an alternative that accomplishes essentially the same thing, with the terms sorted in decreasing order of frequency.

freq <- sort(colSums(as.matrix(dtms)), decreasing=TRUE)   
head(freq, 14) 
##    kkkk    nome   vídeo  kellen   merda    coca    puta    kkkk   dolly 
##    2792    2327     787     712     688     494     467     386     365 
##   canal    cara pessoas   achei  melhor 
##     347     280     270     262     241

An alternate view of term frequency: This will identify all terms that appear frequently (in this case, 150 or more times).

findFreqTerms(dtm, lowfreq=150)   # Change "150" to whatever is most appropriate for your text data.
##  [1] "achei"     "acho"      "agora"     "canal"     "cara"     
##  [6] "coca"      "derivação" "desse"     "dolly"     "humor"    
## [11] "kelen"     "kellen"    "kkkk"      "kkkk"      "melhor"   
## [16] "merda"     "nome"      "nomes"     "nunca"     "pessoas"  
## [21] "porta"     "puta"      "ruim"      "vídeo"

Yet another way to do this:

wf <- data.frame(word=names(freq), freq=freq)   
head(wf)  
##          word freq
## kkkk     kkkk 2792
## nome     nome 2327
## vídeo   vídeo  787
## kellen kellen  712
## merda   merda  688
## coca     coca  494

Plot Word Frequencies

Plot words that appear at least 150 times.

library(ggplot2)   
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
p <- ggplot(subset(wf, freq>150), aes(word, freq))    
p <- p + geom_bar(stat="identity")   
p <- p + theme(axis.text.x=element_text(angle=45, hjust=1))   
p  

Relationships Between Terms

Term Correlations

If you have a term in mind that you have found to be particularly meaningful to your analysis, then you may find it helpful to identify the words that most highly correlate with that term.

If words always appear together, then correlation=1.0.

findAssocs(dtms, c("nome"), corlimit=0.98) # specifying a correlation limit of 0.98 
## $nome
## numeric(0)

In this case, no terms were correlated with “nome” at or above the 0.98 threshold, so R returned numeric(0). A high corlimit= keeps the list from becoming overly long; feel free to adjust it to any value you find useful.

findAssocs(dtms, "contrast", corlimit=0.95) # specifying a correlation limit of 0.95  
## $contrast
## numeric(0)

Word Clouds!

Humans are generally good at processing information visually, which is part of the reason word clouds have become so popular. What follows are a variety of alternatives for constructing word clouds with your text.

But first you will need to load the package that makes word clouds in R.

library(wordcloud)
## Loading required package: RColorBrewer

Plot words that occur at least 150 times.

set.seed(142)   
wordcloud(names(freq), freq, min.freq=150) 

Note: The set.seed() function just makes the layout of the clouds consistent each time you plot them. You can omit that part if you are not concerned with preserving a particular layout.

Plot the 50 most frequently used words.

set.seed(142)   
wordcloud(names(freq), freq, max.words=50) 

Add some color and plot words occurring at least 150 times.

set.seed(142)   
wordcloud(names(freq), freq, min.freq=150, scale=c(5, .1), colors=brewer.pal(6, "Dark2")) 

Plot the 100 most frequently occurring words.

set.seed(142)   
dark2 <- brewer.pal(6, "Dark2")   
wordcloud(names(freq), freq, max.words=100, rot.per=0.2, colors=dark2)  

Clustering by Term Similarity

To do this well, you should always first remove a lot of the uninteresting or infrequent words. If you have not done so already, you can remove these with the following code.

dtmss <- removeSparseTerms(dtms, 0.10) # Keep only terms that appear in at least 90% of the documents (at most 10% sparsity).  
#inspect(dtmss)   

Hierarchical Clustering

First calculate the distance between words, then cluster them according to similarity.

library(cluster)   
d <- dist(t(dtmss), method="euclidean")   
fit <- hclust(d=d, method="ward")   
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
fit  
## 
## Call:
## hclust(d = d, method = "ward")
## 
## Cluster method   : ward.D 
## Distance         : euclidean 
## Number of objects: 9370
#plot(fit, hang=-1)   

Some people find dendrograms fairly clear to read. Others simply find them perplexing. Here, anywhere from two to seven, or many more, groups are identifiable in the dendrogram.

Helping to Read a Dendrogram

If you find dendrograms difficult to read, then there is still hope. To get a better idea of where the groups are in the dendrogram, you can also ask R to help identify the clusters. Here, I have arbitrarily chosen to look at five clusters, as indicated by the red boxes. If you would like to highlight a different number of groups, then feel free to change the code accordingly.

#plot.new()
#plot(fit, hang=-1)
#groups <- cutree(fit, k=5)   # "k=" defines the number of clusters you are using   
#rect.hclust(fit, k=5, border="red") # draw the dendrogram with red borders around the 5 clusters  
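
For example, to highlight three clusters instead of five, you might change the k= value in both calls (a sketch reusing the fit object above; groups3 is just an illustrative name):

#plot(fit, hang=-1)
#groups3 <- cutree(fit, k=3)             # assign each term to one of 3 clusters
#rect.hclust(fit, k=3, border="blue")    # draw blue borders around the 3 clusters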

K-means clustering

The k-means clustering method will attempt to cluster words into a specified number of groups (in this case 2), such that the sum of squared distances between individual words and their group center is minimized. You can change the number of groups you seek by changing the number specified within the kmeans() command.

#library(fpc)   
#d <- dist(t(dtmss), method="euclidian")   
#kfit <- kmeans(d, 2)
#kfit
#clusplot(as.matrix(d), kfit$cluster, color=T, shade=T, labels=2, lines=0)   
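
For instance, to look for five groups instead of two, a sketch reusing the distance matrix d from above might look like this (kfit5 is just an illustrative name):

#kfit5 <- kmeans(d, 5)   # five cluster centers instead of two
#kfit5
#clusplot(as.matrix(d), kfit5$cluster, color=T, shade=T, labels=2, lines=0)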

Saving objects

save.image("~/Porta Text Mining/Untitled.RData")   # saves every object in the workspace to a single .RData file
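
To pick the analysis up again later, you can restore the saved workspace with load() (a minimal sketch, assuming the same file path):

#load("~/Porta Text Mining/Untitled.RData")   # restores docs, dtm, freq, fit, etc. into a fresh R session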