# Read the raw comments, keep only the text column, and write it out as a
# plain-text file; that file will serve as the corpus.
base <- read.delim("~/Porta Text Mining/base/comments.tab", stringsAsFactors=FALSE)
base <- base$text
write(base, file="~/Porta Text Mining/corpus/base.txt")
base <- read.delim("~/Porta Text Mining/corpus/base.txt", stringsAsFactors=FALSE)
#View(base)
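As a quick sanity check (optional; this assumes the file paths used above), you can confirm how many comments were written to the corpus file and preview a few lines:
# Sketch: count the exported comments and look at the first few lines.
corpus_file <- "~/Porta Text Mining/corpus/base.txt"
length(readLines(corpus_file))
head(readLines(corpus_file), 3)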
To start, install the packages you need to mine text. You only need to do this step once.
#Needed <- c("tm", "SnowballC", "RColorBrewer", "ggplot2", "wordcloud", "biclust", "cluster", "igraph", "fpc")
#install.packages(Needed, dependencies=TRUE)
#install.packages("Rcampdf", repos = "http://datacube.wu.ac.at/", type = "source")
#install.packages("twitteR")
#require(twitteR)
If you get the message “Update all/some/none? [a/s/n]:”, enter “a” and press Return.
Start by saving your text files in a folder titled “texts”. This will be the “corpus” (body) of texts you are mining.
Note: The texts used in this example are Portuguese-language viewer comments (read from comments.tab above) that were exported to a single plain-text document. You can use a variety of media for this, such as PDF and HTML.
Read this next part carefully. You need to do three things here:
1. Create a folder named “texts” where you’ll keep your data.
2. Save the folder to a particular place (Mac: Desktop; PC: C: drive).
3. Copy and paste the appropriate script below.
On a Mac, save the folder to your desktop and use the following code chunk:
cname <- file.path("~", "Porta Text Mining", "corpus")
cname
## [1] "~/Porta Text Mining/corpus"
dir(cname) # Use this to check to see that your texts have loaded.
## [1] "base.txt"
On a PC, save the folder to your C: drive and use the following code chunk:
#cname <- file.path("C:", "texts")
#cname
#dir(cname)
Load the R package for text mining and then load your texts into R.
library(tm)
## Loading required package: NLP
docs <- Corpus(DirSource(cname))
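If your text files are UTF-8 (typical for Portuguese comments), you can be explicit about the encoding and language when the corpus is built. A sketch of an alternative to the line above, kept commented out:
#docs <- Corpus(DirSource(cname, encoding="UTF-8"), readerControl=list(language="pt"))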
If you so desire, you can read your documents in the R terminal using inspect(docs). Or, if you prefer to look at only one of the documents you loaded, then you can specify which one using something like:
inspect(docs[1])
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 1
##
## [[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 619038
In this case, you would call up only the first document you loaded. Be careful. Either of these commands will fill up your screen fast.
Once you are sure that all documents loaded properly, go on to preprocess your texts. This step allows you to remove numbers, capitalization, common words, and punctuation, and otherwise prepare your texts for analysis. It can be somewhat time-consuming and fiddly, but it pays off in higher-quality analyses.
Removing punctuation: Your computer cannot actually read; punctuation and other special characters just look like more words to R. Use the following methods to remove them from your text.
docs <- tm_map(docs, removePunctuation)
#inspect(docs[1]) # Check to see if it worked.
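Comment text also tends to contain curly quotes and other Unicode punctuation that the default pattern can miss. Recent versions of tm accept a ucp argument to removePunctuation() that switches to Unicode character classes; this is an assumption about your installed version, so check before relying on it:
#docs <- tm_map(docs, removePunctuation, ucp=TRUE) # assumes a tm version whose removePunctuation() has the ucp argument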
If necessary, such as when working with emails, you can also remove special characters. The list below has been customized to remove characters commonly found in emails; change it as you see fit to meet your own needs.
for(j in seq(docs))
{
  # Replace characters commonly found in emails and web text with spaces.
  docs[[j]] <- gsub("/", " ", docs[[j]])
  docs[[j]] <- gsub("@", " ", docs[[j]])
  docs[[j]] <- gsub("\\|", " ", docs[[j]])
  docs[[j]] <- gsub("<>", " ", docs[[j]])
  docs[[j]] <- gsub("<*>", " ", docs[[j]])
}
#inspect(docs[1]) # You can check a document (in this case the first) to see if it worked.
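As an aside, a sketch of a more idiomatic way to make such substitutions is tm’s content_transformer(), which keeps the corpus structure intact (so documents are not downgraded to plain character vectors); the same wrapper works for base functions such as tolower, used below:
# Sketch: equivalent substitutions that preserve the corpus structure.
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
#docs <- tm_map(docs, toSpace, "/")
#docs <- tm_map(docs, toSpace, "@")
#docs <- tm_map(docs, toSpace, "\\|")
#docs <- tm_map(docs, content_transformer(tolower)) # alternative to tm_map(docs, tolower)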
Removing numbers:
docs <- tm_map(docs, removeNumbers)
# inspect(docs[1]) # Check to see if it worked.
Converting to lowercase: As before, we want a word to appear exactly the same every time it appears. We therefore change everything to lowercase.
docs <- tm_map(docs, tolower)
# inspect(docs[1]) # Check to see if it worked.
Removing “stopwords” (common words) that usually have no analytic value: every text contains many common, uninteresting words (a, and, also, the, etc.). Such words are frequent by their nature and will confound your analysis if they remain in the text.
# For a list of the stopwords, see:
length(stopwords("portuguese"))
## [1] 203
stopwords("portuguese")
## [1] "de" "a" "o" "que"
## [5] "e" "do" "da" "em"
## [9] "um" "para" "com" "não"
## [13] "uma" "os" "no" "se"
## [17] "na" "por" "mais" "as"
## [21] "dos" "como" "mas" "ao"
## [25] "ele" "das" "à" "seu"
## [29] "sua" "ou" "quando" "muito"
## [33] "nos" "já" "eu" "também"
## [37] "só" "pelo" "pela" "até"
## [41] "isso" "ela" "entre" "depois"
## [45] "sem" "mesmo" "aos" "seus"
## [49] "quem" "nas" "me" "esse"
## [53] "eles" "você" "essa" "num"
## [57] "nem" "suas" "meu" "às"
## [61] "minha" "numa" "pelos" "elas"
## [65] "qual" "nós" "lhe" "deles"
## [69] "essas" "esses" "pelas" "este"
## [73] "dele" "tu" "te" "vocês"
## [77] "vos" "lhes" "meus" "minhas"
## [81] "teu" "tua" "teus" "tuas"
## [85] "nosso" "nossa" "nossos" "nossas"
## [89] "dela" "delas" "esta" "estes"
## [93] "estas" "aquele" "aquela" "aqueles"
## [97] "aquelas" "isto" "aquilo" "estou"
## [101] "está" "estamos" "estão" "estive"
## [105] "esteve" "estivemos" "estiveram" "estava"
## [109] "estávamos" "estavam" "estivera" "estivéramos"
## [113] "esteja" "estejamos" "estejam" "estivesse"
## [117] "estivéssemos" "estivessem" "estiver" "estivermos"
## [121] "estiverem" "hei" "há" "havemos"
## [125] "hão" "houve" "houvemos" "houveram"
## [129] "houvera" "houvéramos" "haja" "hajamos"
## [133] "hajam" "houvesse" "houvéssemos" "houvessem"
## [137] "houver" "houvermos" "houverem" "houverei"
## [141] "houverá" "houveremos" "houverão" "houveria"
## [145] "houveríamos" "houveriam" "sou" "somos"
## [149] "são" "era" "éramos" "eram"
## [153] "fui" "foi" "fomos" "foram"
## [157] "fora" "fôramos" "seja" "sejamos"
## [161] "sejam" "fosse" "fôssemos" "fossem"
## [165] "for" "formos" "forem" "serei"
## [169] "será" "seremos" "serão" "seria"
## [173] "seríamos" "seriam" "tenho" "tem"
## [177] "temos" "tém" "tinha" "tínhamos"
## [181] "tinham" "tive" "teve" "tivemos"
## [185] "tiveram" "tivera" "tivéramos" "tenha"
## [189] "tenhamos" "tenham" "tivesse" "tivéssemos"
## [193] "tivessem" "tiver" "tivermos" "tiverem"
## [197] "terei" "terá" "teremos" "terão"
## [201] "teria" "teríamos" "teriam"
docs <- tm_map(docs, removeWords, stopwords("portuguese"))
docs <- tm_map(docs, removeWords, stopwords("english"))
#inspect(docs[1]) # Check to see if it worked.
Removing particular words: if a particular word or two appear in the output but are not of value to your analysis, you can remove them, specifically, from the text.
docs <- tm_map(docs, removeWords, c("department", "email", "pra", "que", "por","aqui", "ziyndwvwhzuvyfnzzspartndgr"))
# Just replace "department" and "email" with words that you would like to remove.
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
docs <- tm_map(docs, removeURL, lazy=TRUE)
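Note that this pattern only strips “http” plus the alphanumeric characters that immediately follow it. In this pipeline it still works, because removePunctuation() above has already mashed each URL into a single alphanumeric token beginning with “http”; on raw text it would leave most of the URL behind, as this made-up example shows:
removeURL("veja http://example.com/video agora") # on raw text, only the literal "http" is removed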
If you wish to preserve a concept that is only apparent as a collection of two or more words, you can combine them or reduce them to a meaningful acronym before you begin the analysis. Here, the substitutions are particular to this data set: variant spellings of names, of “vídeo”, and of laughter (“kkkk”, “hahah”) are collapsed to a single form.
for (j in seq(docs))
{
  # Collapse name and spelling variants to a single form.
  docs[[j]] <- gsub("Fábio Porchat", "Porchat", docs[[j]])
  docs[[j]] <- gsub("Fabio Porchat", "Porchat", docs[[j]])
  docs[[j]] <- gsub("Porchá", "Porchat", docs[[j]])
  docs[[j]] <- gsub("vídeos", "vídeo", docs[[j]])
  docs[[j]] <- gsub("video", "vídeo", docs[[j]])
  docs[[j]] <- gsub("nao", "não", docs[[j]])
  # Normalize laughter ("kkkk", "hahah") and other expressive strings.
  # Note: these broad patterns also match substrings inside ordinary words
  # (e.g., the "ha" in other words), so review the results.
  docs[[j]] <- gsub("kkk*", "kkkk", docs[[j]])
  docs[[j]] <- gsub("kkkkkk*", "kkkk", docs[[j]])
  docs[[j]] <- gsub("hah*", "hahah", docs[[j]])
  docs[[j]] <- gsub("hhah*", "hahah", docs[[j]])
  docs[[j]] <- gsub("aha*", "hahah", docs[[j]])
  docs[[j]] <- gsub("aaa*", "aaaaa", docs[[j]])
  docs[[j]] <- gsub("^[ha]+$", "hahaha", docs[[j]])
  docs[[j]] <- gsub("^[uashha]+$", "hahaha", docs[[j]])
  docs[[j]] <- gsub("^[aeu]+$", "hahaha", docs[[j]])
  docs[[j]] <- gsub("hah[[:alnum:]]", "hahaha", docs[[j]])
  docs[[j]] <- gsub("hhah[[:alnum:]]", "hahaha", docs[[j]])
}
Removing common word endings (e.g., “ing”, “es”, “s”): this is referred to as “stemming” the documents. We stem the documents so that different forms of a word are recognized as the same term, regardless of which ending appeared in the original text.
library(SnowballC)
docs <- tm_map(docs, stemDocument, "portuguese")
docs <- tm_map(docs, stemDocument)
#inspect(docs[1]) # Check to see if it worked.
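To get a feel for what the Portuguese stemmer does to individual words, you can call SnowballC’s wordStem() directly; the words below are just illustrative examples:
wordStem(c("comentários", "engraçado", "vídeos"), language="portuguese") # illustration only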
Stripping unnecessary whitespace from your documents: the preprocessing above leaves the documents with a lot of “white space” where words and characters were deleted. This extra whitespace can, and should, be removed.
docs <- tm_map(docs, stripWhitespace)
#inspect(docs[1]) # Check to see if it worked.
Be sure to use the following script once you have completed preprocessing. This tells R to treat your preprocessed documents as text documents.
docs <- tm_map(docs, PlainTextDocument)
This is the end of the preprocessing stage.
To proceed, create a document-term matrix. This is what you will be using from this point on. Here, wordLengths=c(4, 12) keeps only terms between 4 and 12 characters long.
dtm <- DocumentTermMatrix(docs, control=list(wordLengths=c(4, 12)))
#dtm
To inspect, you can use inspect(dtm). This will, however, fill up your terminal quickly, so you may prefer to view a subset, e.g. inspect(dtm[1:5, 1:20]) to view the first 5 documents and first 20 terms (modify as you like). dim(dtm) will display the number of documents and terms, in that order.
You’ll also need a transpose of this matrix. Create it using:
tdm <- TermDocumentMatrix(docs)
#tdm
#inspect(dtm[1, 1:20])
Organize terms by their frequency:
freq <- colSums(as.matrix(dtm))
length(freq)
## [1] 9370
ord <- order(freq)
If you prefer to export the matrix to Excel:
m <- as.matrix(dtm)
dim(m)
## [1] 1 9370
write.csv(m, file="dtm.csv")
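With 9,370 terms the exported matrix is extremely wide. A more practical sketch is to export just a two-column term/frequency table (the file name is arbitrary):
write.csv(data.frame(term=names(freq), freq=freq), file="term_freq.csv", row.names=FALSE) # sketch: term frequencies only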
Alternatively, you can skip the full export and focus on just the interesting terms.
# Start by removing sparse terms:
dtms <- removeSparseTerms(dtm, 0.1) # This makes a matrix that is 10% empty space, maximum.
#inspect(dtms)
There are a lot of terms, so for now, just check out some of the most and least frequently occurring words.
freq[head(ord)]
## aaaaaeeeee aaaaaf aaaaafdx abaixa abaixando abaixo
## 1 1 1 1 1 1
freq[tail(ord)]
## coca merda kellen vídeo nome kkkk
## 494 688 712 787 2327 2792
head(table(freq), 20)
## freq
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 5914 1222 542 379 226 155 125 80 66 53 51 39 31 33 39
## 16 17 18 19 20
## 25 29 10 16 17
The resulting output is two rows of numbers: the top row is the frequency with which words appear and the bottom row is how many words appear with that frequency. Here, considering only the 20 lowest word frequencies, we can see that 5,914 terms appear only once. There are also a lot of others that appear very infrequently.
tail(table(freq), 20)
## freq
## 170 172 173 179 185 212 241 262 270 280 347 365 386 467 494
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 688 712 787 2327 2792
## 1 1 1 1 1
Considering only the 20 greatest frequencies, we can see that there is a huge disparity in how frequently some terms appear.
For a less fine-grained look at term frequency, we can view a table of the terms we kept when we removed sparse terms, above.
freq <- colSums(as.matrix(dtms))
#freq
The frequencies above were computed from the sparse-term-reduced matrix we made earlier. What follows is an alternative that accomplishes essentially the same thing, but sorts the terms by descending frequency.
freq <- sort(colSums(as.matrix(dtms)), decreasing=TRUE)
head(freq, 14)
## kkkk nome vídeo kellen merda coca puta kkkk dolly
## 2792 2327 787 712 688 494 467 386 365
## canal cara pessoas achei melhor
## 347 280 270 262 241
An alternate view of term frequency: this will identify all terms that appear frequently (in this case, 150 or more times).
findFreqTerms(dtm, lowfreq=150) # Change "150" to whatever is most appropriate for your text data.
## [1] "achei" "acho" "agora" "canal" "cara"
## [6] "coca" "derivação" "desse" "dolly" "humor"
## [11] "kelen" "kellen" "kkkk" "kkkk" "melhor"
## [16] "merda" "nome" "nomes" "nunca" "pessoas"
## [21] "porta" "puta" "ruim" "vídeo"
Yet another way to do this:
wf <- data.frame(word=names(freq), freq=freq)
head(wf)
## word freq
## kkkk kkkk 2792
## nome nome 2327
## vídeo vídeo 787
## kellen kellen 712
## merda merda 688
## coca coca 494
Plot word frequencies: plot the words that appear at least 150 times.
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
p <- ggplot(subset(wf, freq>150), aes(word, freq))
p <- p + geom_bar(stat="identity")
p <- p + theme(axis.text.x=element_text(angle=45, hjust=1))
p
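By default the bars are ordered alphabetically. If you prefer them ordered by frequency, a small sketch of a tweak is to reorder the word factor:
p <- ggplot(subset(wf, freq>150), aes(reorder(word, -freq), freq)) # sketch: bars ordered by descending frequency
p <- p + geom_bar(stat="identity")
p <- p + theme(axis.text.x=element_text(angle=45, hjust=1))
p <- p + xlab("word")
p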
Term correlations: if you have a term in mind that you have found to be particularly meaningful to your analysis, then you may find it helpful to identify the words that most highly correlate with that term.
If words always appear together, then correlation=1.0.
findAssocs(dtms, c("nome"), corlimit=0.98) # specifying a correlation limit of 0.98
## $nome
## numeric(0)
In this case, no terms met the 0.98 threshold for “nome”: with only a single document in the corpus, term correlations cannot really be computed, so findAssocs() returns numeric(0). Feel free to adjust corlimit= to any value you feel is necessary.
findAssocs(dtms, "contrast", corlimit=0.95) # specifying a correlation limit of 0.95
## $contrast
## numeric(0)
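Because this corpus was loaded as a single document, there is nothing for findAssocs() to correlate across. One sketch of a workaround, given that base.txt holds roughly one comment per line, is to treat each comment as its own document (this skips the preprocessing above, so treat the result as a rough check):
comment_docs <- VCorpus(VectorSource(readLines("~/Porta Text Mining/corpus/base.txt"))) # one comment per document
comment_dtm <- DocumentTermMatrix(comment_docs)
findAssocs(comment_dtm, "nome", corlimit=0.5) # a lower limit; adjust as needed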
Humans are generally strong at visual analysis, which is part of the reason word clouds have become so popular. What follows are several alternatives for constructing word clouds from your text.
But first you will need to load the package that makes word clouds in R.
library(wordcloud)
## Loading required package: RColorBrewer
Plot words that occur at least 150 times.
set.seed(142)
wordcloud(names(freq), freq, min.freq=150)
Note: The set.seed() function just makes the layout of the clouds consistent each time you plot them. You can omit it if you are not concerned with reproducing a particular layout.
Plot the 50 most frequently used words.
set.seed(142)
wordcloud(names(freq), freq, max.words=50)
Add some color and plot words occurring at least 150 times.
set.seed(142)
wordcloud(names(freq), freq, min.freq=150, scale=c(5, .1), colors=brewer.pal(6, "Dark2"))
Plot the 100 most frequently occurring words.
set.seed(142)
dark2 <- brewer.pal(6, "Dark2")
wordcloud(names(freq), freq, max.words=100, rot.per=0.2, colors=dark2)
Before clustering terms by similarity, you should first remove a lot of the uninteresting or infrequent words. If you have not done so already, you can remove them with the following code.
dtmss <- removeSparseTerms(dtms, 0.10) # This makes a matrix that is only 10% empty space, maximum.
#inspect(dtmss)
First calculate distance between words & then cluster them according to similarity.
library(cluster)
d <- dist(t(dtmss), method="euclidean")
fit <- hclust(d=d, method="ward")
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
fit
##
## Call:
## hclust(d = d, method = "ward")
##
## Cluster method : ward.D
## Distance : euclidean
## Number of objects: 9370
#plot(fit, hang=-1)
Some people find dendrograms fairly clear to read; others find them perplexing. Depending on where you cut this dendrogram, you could identify two, three, four, five, or many more groups.
If you find dendrograms difficult to read, then there is still hope. To get a better idea of where the groups are in the dendrogram, you can also ask R to help identify the clusters. Here, I have arbitrarily chosen to look at five clusters, as indicated by the red boxes. If you would like to highlight a different number of groups, then feel free to change the code accordingly.
#plot.new()
#plot(fit, hang=-1)
#groups <- cutree(fit, k=5) # "k=" defines the number of clusters you are using
#rect.hclust(fit, k=5, border="red") # draw the dendrogram with red borders around the 5 clusters
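With 9,370 terms, the full dendrogram is unreadable, which is likely why the plot calls above are commented out. One sketch of a workaround is to cluster only the frequent terms:
# Sketch: cluster only terms appearing at least 150 times, so the dendrogram has a readable number of leaves.
freq_terms <- findFreqTerms(dtm, lowfreq=150)
d_small <- dist(t(as.matrix(dtm[, freq_terms])), method="euclidean")
fit_small <- hclust(d=d_small, method="ward.D")
plot(fit_small, hang=-1)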
The k-means clustering method will attempt to cluster words into a specified number of groups (in this case 2), such that the sum of squared distances between individual words and their group centers is minimized. You can change the number of groups you seek by changing the number specified within the kmeans() command.
#library(fpc)
#d <- dist(t(dtmss), method="euclidean")
#kfit <- kmeans(d, 2)
#kfit
#clusplot(as.matrix(d), kfit$cluster, color=T, shade=T, labels=2, lines=0)
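If you are unsure how many groups to ask for, one common sketch is to run kmeans() for a range of k and plot the within-groups sum of squares, looking for an “elbow”. It is kept commented out here, like the block above, because the full distance matrix is large:
#wss <- sapply(2:10, function(k) kmeans(as.matrix(d), centers=k)$tot.withinss) # within-groups SS for k = 2..10
#plot(2:10, wss, type="b", xlab="Number of clusters (k)", ylab="Within-groups sum of squares")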
save.image("~/Porta Text Mining/Untitled.RData")