1 Executive summary

The first step in building a predictive model for text is understanding the distribution of, and relationships between, the words, tokens, and phrases in the text. The goal of this task is to understand the basic relationships you observe in the data and to prepare for building your first linguistic models. Tasks to accomplish:

  1. Exploratory analysis - perform a thorough exploratory analysis of the data, understanding the distribution of words and the relationships between words in the corpora.
  2. Understand frequencies of words and word pairs - build figures and tables to understand variation in the frequencies of words and word pairs in the data.

You will need the following libraries:

library(knitr)
library(NLP)
library(stringi)
library(stringr)
library(tm)
library(RWeka)
library(ggplot2)
library(dplyr)
library(pander)
library(rmarkdown)
library(wordcloud)
library(RColorBrewer)
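
If any of these packages are missing, a one-off install covers them; note that RWeka also needs a working Java runtime:

install.packages(c("knitr", "NLP", "stringi", "stringr", "tm", "RWeka",
                   "ggplot2", "dplyr", "pander", "rmarkdown", "wordcloud",
                   "RColorBrewer"))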

The data are taken from here.

This project builds on two earlier projects: Project 1 and Project 2.

The data sets consist of text from 3 different sources: 1) News, 2) Blogs and 3) Twitter feeds. The text data are provided in 4 different languages: 1) German, 2) English - United States, 3) Finnish and 4) Russian. In this project, we will only focus on the English - United States data sets.

We read the data into the R environment in RStudio:

lineblogs <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
linenews <- readLines("final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
linetwitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
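
As a quick sanity check on the import, we can count the lines read from each source:

c(blogs = length(lineblogs), news = length(linenews), twitter = length(linetwitter))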

Now we prepare the data for a basic statistical summary: file size, number of lines, number of words, and mean words per line for each file.

sizeblogs <- file.info("final/en_US/en_US.blogs.txt")$size / 1024 ^ 2
sizenews <- file.info("final/en_US/en_US.news.txt")$size / 1024 ^ 2
sizetwitter <- file.info("final/en_US/en_US.twitter.txt")$size / 1024 ^ 2

wordsblogs <- stri_count_words(lineblogs)
wordsnews <- stri_count_words(linenews)
wordstwitter <- stri_count_words(linetwitter)
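
Beyond the totals reported below, the per-line distributions are also worth a quick look; summary() gives the quartiles, mean and maximum words per line:

summary(wordsblogs)
summary(wordsnews)
summary(wordstwitter)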

We combine the counts into a data frame and summarize the data.

statintable <- data.frame(Source = c("Blogs", "News", "Twitter"),
                          Size.MB = c(sizeblogs, sizenews, sizetwitter),
                          No.lines = c(length(lineblogs), length(linenews), length(linetwitter)),
                          No.words = c(sum(wordsblogs), sum(wordsnews), sum(wordstwitter)),
                          MeanWordsPERline = c(mean(wordsblogs), mean(wordsnews), mean(wordstwitter)))

We use the pander package to display the counts.
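
The rendering call itself is not shown in the source; presumably it is simply:

pander(statintable)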

Source    Size (MB)   No. lines   No. words    Mean words/line
Blogs     200.4         899288    37546246     41.75
News      196.3        1010242    34762395     34.41
Twitter   159.4        2360148    30093410     12.75

2 Data processing

We will randomly choose 1% of each data set to demonstrate data preprocessing and exploratory data analysis. The full dataset will be used later in creating the prediction algorithm.

textsample <- c(sample(lineblogs, round(length(lineblogs) * 0.01)),
                sample(linenews, round(length(linenews) * 0.01)),
                sample(linetwitter, round(length(linetwitter) * 0.01)))
corpus <- VCorpus(VectorSource(textsample))

The basic procedure for data preprocessing consists of the following key steps:

  1. Construct a corpus from the files.

  2. Tokenization. Clean up the corpus by removing special characters, punctuation, numbers etc. We also remove profanity that we do not want to predict.

  3. Build basic n-gram model.

We need the following helper functions to prepare our corpus.

toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
preprocessCorpus <- function(corpus){
    # Helper function to preprocess the corpus
    # Replace slashes/pipes, URLs and Twitter handles with spaces
    corpus <- tm_map(corpus, toSpace, "/|@|\\|")
    corpus <- tm_map(corpus, toSpace, "(f|ht)tp(s?)://(.*)[.][a-z]+")
    corpus <- tm_map(corpus, toSpace, "@[^\\s]+")
    # Normalise case, then strip numbers, punctuation, stop words and profanity
    corpus <- tm_map(corpus, content_transformer(tolower))
    corpus <- tm_map(corpus, removeNumbers)
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, removeWords, stopwords("english"))
    corpus <- tm_map(corpus, removeWords, profanities)
    corpus <- tm_map(corpus, stripWhitespace)  # collapse repeated spaces
    return(corpus)
}
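
Note that the profanities object used above is never defined in this document. A minimal sketch, assuming a plain one-word-per-line list is available locally (the file name profanities.txt is an assumption):

# Hypothetical word list; any one-word-per-line profanity list will do
profanities <- readLines("profanities.txt", encoding = "UTF-8", skipNul = TRUE)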

We apply the preprocessing and flatten the cleaned corpus into a data frame:

corpus <- preprocessCorpus(corpus)
corpusDf <- data.frame(text = unlist(sapply(corpus, `[`, "content")),
                       stringsAsFactors = FALSE)

findNGrams <- function(corp, grams) {
  ngram <- NGramTokenizer(corp, Weka_control(min = grams, max = grams,
                      delimiters = " \\r\\n\\t.,;:\"()?!"))
  ngram2 <- data.frame(table(ngram))
  # keep only the top 100 n-grams by frequency
  ngram3 <- ngram2[order(ngram2$Freq,decreasing = TRUE),][1:100,]
  colnames(ngram3) <- c("String","Count")
  ngram3
}

TwoGrams <- findNGrams(corpusDf$text, 2)
ThreeGrams <- findNGrams(corpusDf$text, 3)
FourGrams <- findNGrams(corpusDf$text, 4)
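
A quick look at one of the tables confirms the structure; each holds the 100 most frequent n-grams with columns String and Count:

head(TwoGrams)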

3 Plot word clouds and bar charts

We draw word clouds of the 100 most frequent n-grams of each order, then bar charts of the 15 most common bigrams and trigrams.

par(mfrow = c(1, 3))
palette <- brewer.pal(8,"Dark2")

wordcloud(TwoGrams[,1], TwoGrams[,2], min.freq =1, 
          random.order = F, ordered.colors = F, colors=palette)
text(x=0.5, y=1, "2-gram cloud")

wordcloud(ThreeGrams[,1], ThreeGrams[,2], min.freq =1, 
          random.order = F, ordered.colors = F, colors=palette)
text(x=0.5, y=1, "3-gram cloud")

wordcloud(FourGrams[,1], FourGrams[,2], min.freq =1, 
          random.order = F, ordered.colors = F, colors=palette)
text(x=0.5, y=1, "4-gram cloud")

par(mfrow = c(1, 2))
barplot(TwoGrams[1:15,2], cex.names=0.5, names.arg=TwoGrams[1:15,1], col="white", main="Most common bigrams in text sample", las=2, ylab = "Frequency")
barplot(ThreeGrams[1:15,2], cex.names=0.5, names.arg=ThreeGrams[1:15,1], col="white", main="Most common trigrams in text sample", las=2, ylab = "Frequency")
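
ggplot2 is loaded above but otherwise unused; as a sketch, an equivalent bigram bar chart could also be drawn with it:

ggplot(TwoGrams[1:15, ], aes(x = reorder(String, -Count), y = Count)) +
  geom_col(fill = "white", colour = "black") +
  labs(title = "Most common bigrams in text sample", x = NULL, y = "Frequency") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))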

4 Prediction strategies and plans for Shiny app

For the Shiny app, the plan is to create an app with a simple interface where the user can enter a string of text. Our prediction model will then suggest a short list of likely next words.
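
To illustrate the idea, here is a minimal sketch of frequency-based backoff over the n-gram tables built above, not the final algorithm; the function name predictNext is hypothetical:

predictNext <- function(phrase, tables = list(FourGrams, ThreeGrams, TwoGrams), n = 3) {
  words <- tolower(unlist(strsplit(phrase, "\\s+")))
  for (tbl in tables) {
    strings <- as.character(tbl$String)
    ng <- length(strsplit(strings[1], " ")[[1]])  # n-gram order of this table
    if (length(words) < ng - 1) next              # not enough context for this order
    context <- paste(tail(words, ng - 1), collapse = " ")
    hits <- strings[startsWith(strings, paste0(context, " "))]
    if (length(hits) > 0) {
      # tables are already sorted by frequency, so the first matches are
      # the best guesses; return the final word of each matching n-gram
      return(head(sapply(strsplit(hits, " "), tail, 1), n))
    }
  }
  character(0)  # backoff exhausted: no suggestion
}

predictNext("happy new")

In the app itself, the full data set and properly smoothed probabilities would replace these raw top-100 counts.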