Understanding the problem

The first step in analyzing any new data set is figuring out:

  1. what data do you have?
  2. what are the standard tools and models used for that type of data?

The motivation for this project is to:

  1. Demonstrate that you’ve downloaded the data and have successfully loaded it in.
  2. Create a basic report of summary statistics about the data sets.
  3. Report any interesting findings that you have amassed so far.
  4. Get feedback on your plans for creating a prediction algorithm and Shiny app.

Getting and sampling the data

The goal of this task is to get familiar with the data sets and do the necessary cleaning. After this exercise, you should understand what real data looks like and how much effort cleaning it requires. When you start working with a new language, the first step is to understand the language and its peculiarities with respect to your target. You can learn to read, speak and write the language, or you can study data and learn about the language from existing sources such as literature and the internet. At the very least, you need to understand how the language is written: its script, existing input methods, some phonetic knowledge, and so on.

Tasks to accomplish:

  1. Loading the data in (see the download sketch below).
  2. Sampling the data.
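
For the loading step, a minimal download-and-unzip sketch is shown below; the archive URL and the ./final/en_US/ layout are assumptions based on the standard Coursera-SwiftKey data set.

if (!dir.exists("./final/en_US")) {
  # Assumed location of the Coursera-SwiftKey archive; adjust if needed
  url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
  download.file(url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
  unzip("Coursera-SwiftKey.zip")        # creates the ./final/<locale>/ directories
}
dir.create("./preprocess", showWarnings = FALSE)  # output folder for the samples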

Instead of processing the entire text files, we can draw samples from each of them and compute frequencies on those samples. Since the news text is more relevant than the others for our purpose, we take a larger sample from that file.

resampling <- function(filename, prob){
  # Read the full file from ./final/en_US/
  con <- file(paste('./final/en_US/', filename, '.txt', sep = ""), "r")
  data <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
  close(con)

  # Keep each line independently with probability 'prob' (coin-flip sampling)
  nlines <- length(data)
  set.seed(20)
  lines <- data[rbinom(n = nlines, size = 1, prob = prob) == 1]

  # Write the sampled lines to ./preprocess/<filename>_sample.txt
  con <- file(paste('./preprocess/', filename, "_sample.txt", sep = ""), "w")
  writeLines(lines, con)
  close(con)
}

## 20% from news; 5% from blogs; 1% from twitter

resampling('en_US.blogs', 0.05)
resampling('en_US.twitter', 0.01)
resampling('en_US.news', 0.2)
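
The motivation above also asks for a basic report of summary statistics about the data sets. A small sketch that reports file size, line count and a rough word count for each full file (summarize_file is a hypothetical helper and assumes the directory layout used above):

# Basic per-file summary: size in MB, number of lines, rough word count
summarize_file <- function(filename){
  path  <- paste('./final/en_US/', filename, '.txt', sep = "")
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  data.frame(file = filename,
             size_MB = round(file.size(path) / 1024^2, 1),
             lines = length(lines),
             words = sum(lengths(strsplit(lines, "\\s+"))))
}

do.call(rbind, lapply(c('en_US.blogs', 'en_US.news', 'en_US.twitter'), summarize_file))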

N-Gram Tokenization

The aim here is to understand the frequencies of words and word pairs: build figures and tables that show how these frequencies vary across the data.

  1. Some words are more frequent than others - what are the distributions of word frequencies?
  2. What are the frequencies of 2-grams and 3-grams in the data set?

library(tm)
library(ggplot2)
library(RWeka)

con1 <- file('./preprocess/en_US.blogs_sample.txt','r')
con2 <- file('./preprocess/en_US.news_sample.txt', 'r')
con3 <- file('./preprocess/en_US.twitter_sample.txt', 'r')

blogs <- readLines(con1)
news <- readLines(con2)
twitter <- readLines(con3)

close(con1)
close(con2)
close(con3)

samples <- c(blogs, news, twitter)
rm(con1, con2, con3, blogs, news, twitter)

samples <- removeNumbers(samples)               # remove numbers
samples <- removePunctuation(samples)           # remove punctuation
samples <- stripWhitespace(samples)             # collapse extra whitespace
samples <- tolower(samples)                     # lowercase all content
samples <- iconv(samples, "latin1", "ASCII", sub = "")  # drop non-ASCII characters

cleanData <- data.frame(samples, stringsAsFactors = FALSE)

words <- WordTokenizer(cleanData$samples) # split the cleaned text into individual words
oneGram <- data.frame(table(words))
oneGram <- oneGram[order(oneGram$Freq, decreasing = TRUE), ]
head(oneGram)
##       words   Freq
## 82979   the 131896
## 84246    to  74674
## 2781    and  72091
## 1         a  64105
## 57832    of  58871
## 39153     i  47970
biGram <- NGramTokenizer(samples, Weka_control(min = 2, max = 2)) # 2-grams (word pairs)
biGram <- data.frame(table(biGram))
biGram <- biGram[order(biGram$Freq, decreasing = TRUE), ]
head(biGram)
##         biGram  Freq
## 556987  of the 12776
## 396448  in the 11340
## 831360  to the  5919
## 566300  on the  5274
## 300134 for the  4750
## 825006   to be  4657
triGram <- NGramTokenizer(samples, Weka_control(min = 3, max = 3)) # 3-grams
triGram <- data.frame(table(triGram))
triGram <- triGram[order(triGram$Freq, decreasing = TRUE), ]
head(triGram)
##             triGram Freq
## 1136969  one of the  969
## 20448      a lot of  870
## 1647527     to be a  485
## 1399105 some of the  460
## 840306     it was a  452
## 191071   as well as  450
polyGram <- NGramTokenizer(samples, Weka_control(min = 4, max = 4)) # 4-grams
polyGram <- data.frame(table(polyGram))
polyGram <- polyGram[order(polyGram$Freq, decreasing = TRUE), ]
head(polyGram)
##                   polyGram Freq
## 1784063     the end of the  223
## 1837051    the rest of the  206
## 245704       at the end of  199
## 246875    at the same time  161
## 654858  for the first time  145
## 927563    in the middle of  132

Exploratory Data Analysis

The first step in building a predictive model for text is understanding the distribution and relationship between the words, tokens, and phrases in the text. The goal of this task is to understand the basic relationships you observe in the data and prepare to build your first linguistic models.

# Format axis labels as scientific notation (e.g. 1.25 %*% 10^5) for ggplot
scientific <- function(l) {
  l <- format(l, scientific = TRUE)
  l <- gsub("^(.*)e", "'\\1'e", l)  # quote the mantissa for plotmath
  l <- gsub("e", "%*%10^", l)       # turn the exponent into a power of ten
  parse(text = l)
}

g <- ggplot(subset(oneGram, Freq>20000), aes(x = words, y = Freq)) + geom_bar(stat = "identity", fill = "blue") +
  geom_text(aes(label=Freq),vjust=-0.1) + scale_y_continuous(labels = scientific)
g
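
Because the words column is a factor with alphabetical levels, the bars above are not drawn in frequency order. A small optional sketch that reorders the levels by descending frequency before plotting:

# Reorder the x-axis by descending frequency (optional presentation tweak)
topWords <- subset(oneGram, Freq > 20000)
topWords$words <- reorder(topWords$words, -topWords$Freq)
ggplot(topWords, aes(x = words, y = Freq)) +
  geom_bar(stat = "identity", fill = "blue") +
  geom_text(aes(label = Freq), vjust = -0.1) +
  scale_y_continuous(labels = scientific)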

g <- ggplot(subset(biGram, Freq>2000), aes(x = biGram, y = Freq)) + geom_bar(stat = "identity", fill = "blue") +
  theme(axis.text.x = element_text(hjust = 1, angle = 60))
g

g <- ggplot(subset(triGram, Freq>300), aes(x = triGram, y = Freq)) + geom_bar(stat = "identity", fill = "blue") +
  theme(axis.text.x = element_text(hjust = 1, angle = 60))
g

Frequency-sorted dictionary coverage

Let’s build a plot showing how many words from the frequency-sorted dictionary are needed to cover from 10% to 90% of all word occurrences in the sample.

# Number of top-frequency words needed to cover a fraction x of all word
# occurrences (oneGram is already sorted by decreasing frequency)
dictionary <- function(x, sum_freq = 0){
  totalsum <- sum(oneGram$Freq)
  for(i in 1:length(oneGram$Freq)){
    sum_freq <- sum_freq + oneGram$Freq[i]
    if(sum_freq >= (x * totalsum)){
      break
    }
  }
  return(i)
}

ten_percent <- dictionary(0.1)
quarter <- dictionary(0.25)
half <- dictionary(0.5)
three_quarters <- dictionary(0.75)
ninety <- dictionary(0.9)

percentage <- c(10, 25, 50, 75, 90)
totalwords <- c(ten_percent, quarter, half, three_quarters, ninety)
qplot(percentage, totalwords, geom=c("line","point")) + geom_text(aes(label = totalwords), hjust=0.3, vjust=-0.1)
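
Since oneGram is already sorted by decreasing frequency, the same coverage counts can be computed without an explicit loop; a vectorized equivalent using a cumulative sum:

# Vectorized version: index of the first word whose cumulative share reaches each level
coverage <- cumsum(oneGram$Freq) / sum(oneGram$Freq)
sapply(c(0.1, 0.25, 0.5, 0.75, 0.9), function(x) which(coverage >= x)[1])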

Conclusion

In this report, we have found the most frequently used n-grams in the news, twitter and blogs text files. Next, we will turn to prediction: given a typed word such as “Thank”, the model should suggest the most likely next word, for example “you”, to make typing easier.
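
As a rough illustration of that idea (not the final prediction algorithm), the bigram table built above already supports a naive next-word lookup: keep the bigrams whose first word matches the input and return the most frequent continuations. A minimal, inefficient sketch assuming biGram is still in memory (predict_next is a hypothetical helper):

# Naive bigram lookup: most frequent words seen after 'word' in the sample
predict_next <- function(word, n = 3){
  parts  <- strsplit(as.character(biGram$biGram), " ")
  first  <- sapply(parts, `[`, 1)
  second <- sapply(parts, `[`, 2)
  keep   <- which(first == tolower(word))
  keep   <- keep[order(biGram$Freq[keep], decreasing = TRUE)]
  head(second[keep], n)
}

predict_next("thank")   # "you" should rank among the top suggestions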