The first step in building a predictive text model is understanding the distribution of, and relationships between, the words, tokens, and phrases in the text. The goal of this report is to explore those basic relationships in the data and prepare to build a first linguistic model.
You will need the following libraries:
library(knitr)
library(NLP)
library(stringi)
library(stringr)
library(tm)
library(RWeka)
library(ggplot2)
library(dplyr)
library(pander)
library(rmarkdown)
library(wordcloud)
library(RColorBrewer)
The data are taken from here.
This project is based on two earlier projects: Project 1 and Project 2.
The data sets consist of text from 3 different sources: 1) News, 2) Blogs and 3) Twitter feeds. The text data are provided in 4 different languages: 1) German, 2) English - United States, 3) Finnish and 4) Russian. In this project, we will only focus on the English - United States data sets.
We will read the data into the R environment in RStudio:
lineblogs <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
linenews <- readLines("final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
linetwitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
Now we prepare the data for a basic statistical summary of each file: file size, number of lines, number of words, and mean number of words per line.
sizeblogs <- file.info("final/en_US/en_US.blogs.txt")$size / 1024 ^ 2
sizenews <- file.info("final/en_US/en_US.news.txt")$size / 1024 ^ 2
sizetwitter <- file.info("final/en_US/en_US.twitter.txt")$size / 1024 ^ 2
wordsblogs <- stri_count_words(lineblogs)
wordsnews <- stri_count_words(linenews)
wordstwitter <- stri_count_words(linetwitter)
Make a data frame and summarize the data.
statintable <- data.frame(Source = c("Blogs","News","Twitter"),
Size.MB = c(sizeblogs, sizenews, sizetwitter),
No.lines = c(length(lineblogs),length(linenews),length(linetwitter)),
No.words = c(sum(wordsblogs),sum(wordsnews),sum(wordstwitter)),
MeanWordsPERline = c(mean(wordsblogs),mean(wordsnews),mean(wordstwitter)))
Use the pander package to display the summary table.
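For example, a single pander call renders the data frame as the table below:
pander(statintable)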
| Source | Size.MB | No.lines | No.words | MeanWordsPERline |
|---|---|---|---|---|
| Blogs | 200.4 | 899288 | 37546246 | 41.75 |
| News | 196.3 | 1010242 | 34762395 | 34.41 |
| Twitter | 159.4 | 2360148 | 30093410 | 12.75 |
We will randomly choose 1% of each data set to demonstrate data preprocessing and exploratory data analysis. The full dataset will be used later in creating the prediction algorithm.
set.seed(1234)  # for reproducibility of the random sample
textsample <- c(sample(lineblogs, round(length(lineblogs) * 0.01)),
                sample(linenews, round(length(linenews) * 0.01)),
                sample(linetwitter, round(length(linetwitter) * 0.01)))
corpus <- VCorpus(VectorSource(textsample))
The basic procedure for data preprocessing consists of the following key steps:
1. Construct a corpus from the files.
2. Tokenization: clean up the corpus by removing special characters, punctuation, numbers, etc. We also remove profanity that we do not want to predict.
3. Build a basic n-gram model.
We will require the following helper functions in order to prepare our corpus.
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
preprocessCorpus <- function(corpus){
# Helper function to preprocess corpus
corpus <- tm_map(corpus, toSpace, "/|@|\\|")
corpus <- tm_map(corpus, toSpace, "(f|ht)tp(s?)://(.*)[.][a-z]+")
corpus <- tm_map(corpus, toSpace, "@[^\\s]+")
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, removeWords, profanities)  # 'profanities' is defined below, before this function is called
corpus <- tm_map(corpus, stripWhitespace)
return(corpus)
}
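The profanity filter above needs a profanities character vector, which is not built into tm. A minimal sketch, assuming a plain-text word list (one term per line) saved locally under the hypothetical name profanities.txt:
# Hypothetical file name; any plain-text profanity list with one word per line will work
profanities <- readLines("profanities.txt", encoding = "UTF-8", skipNul = TRUE)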
corpus <- preprocessCorpus(corpus)
corpusDf <- data.frame(text = unlist(sapply(corpus, `[`, "content")),
                       stringsAsFactors = FALSE)
findNGrams <- function(corp, grams) {
ngram <- NGramTokenizer(corp, Weka_control(min = grams, max = grams,
delimiters = " \\r\\n\\t.,;:\"()?!"))
ngram2 <- data.frame(table(ngram))
# keep only the top 100 n-grams by frequency
ngram3 <- ngram2[order(ngram2$Freq,decreasing = TRUE),][1:100,]
colnames(ngram3) <- c("String","Count")
ngram3
}
TwoGrams <- findNGrams(corpusDf, 2)
ThreeGrams <- findNGrams(corpusDf, 3)
FourGrams <- findNGrams(corpusDf, 4)
par(mfrow = c(1, 3))
palette <- brewer.pal(8,"Dark2")
wordcloud(TwoGrams[,1], TwoGrams[,2], min.freq =1,
random.order = F, ordered.colors = F, colors=palette)
text(x=0.5, y=1, "2-gram cloud")
wordcloud(ThreeGrams[,1], ThreeGrams[,2], min.freq =1,
random.order = F, ordered.colors = F, colors=palette)
text(x=0.5, y=1, "3-gram cloud")
wordcloud(FourGrams[,1], FourGrams[,2], min.freq =1,
random.order = F, ordered.colors = F, colors=palette)
text(x=0.5, y=1, "4-gram cloud")
par(mfrow = c(1, 2))
barplot(TwoGrams[1:15,2], cex.names=0.5, names.arg=TwoGrams[1:15,1], col="white", main="Most common bigrams in text sample", las=2, ylab = "Frequency")
barplot(ThreeGrams[1:15,2], cex.names=0.5, names.arg=ThreeGrams[1:15,1], col="white", main="Most common trigrams in text sample", las=2, ylab = "Frequency")
For the Shiny app, the plan is to create a simple interface where the user can enter a string of text; the prediction model will then suggest a list of likely next words.
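As a rough illustration of the idea, the sketch below looks up the last words of the input phrase in the n-gram tables built above and returns the most frequent continuations. The function name predictNextWord and the simple prefix matching are illustrative assumptions, not the final algorithm, which will use the full data set and a proper backoff model.
predictNextWord <- function(phrase, twoGrams = TwoGrams, threeGrams = ThreeGrams, n = 3) {
  # Split the input into lowercase words (mirroring the corpus cleaning above)
  words <- unlist(strsplit(tolower(phrase), "\\s+"))
  lastTwo <- paste(tail(words, 2), collapse = " ")
  lastOne <- tail(words, 1)
  tri <- as.character(threeGrams$String)
  bi  <- as.character(twoGrams$String)
  # Look for trigrams starting with the last two words, then back off to bigrams;
  # the tables are already sorted by frequency, so the first matches are the most common
  matches <- tri[which(startsWith(tri, paste0(lastTwo, " ")))]
  if (length(matches) == 0) {
    matches <- bi[which(startsWith(bi, paste0(lastOne, " ")))]
  }
  # Return the last word of the top matching n-grams as suggestions
  head(sapply(strsplit(matches, " "), tail, 1), n)
}
predictNextWord("happy mothers")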