The first step in analyzing any new data set is figuring out what it contains and how much cleaning it needs.
The goal of this task is to get familiar with the databases and do the necessary cleaning. After this exercise, you should understand what real data look like and how much effort is needed to clean them. When you start working on a new language, the first thing to do is understand the language and its peculiarities with respect to your target. You can learn to read, speak, and write the language, or you can study data and learn about the language from existing resources such as literature and the internet. At the very least, you need to understand how the language is written: its script, existing input methods, some phonetic knowledge, and so on.
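Before sampling, it helps to take a first look at the three English source files. The sketch below (paths assume the ./final/en_US/ layout used later in this report) prints each file's size and line count:
# Basic statistics of the raw files (paths assume the ./final/en_US/ layout)
files <- c('en_US.blogs.txt', 'en_US.news.txt', 'en_US.twitter.txt')
for (f in files) {
    path <- file.path('./final/en_US', f)
    nlines <- length(readLines(path, encoding = "UTF-8", skipNul = TRUE, warn = FALSE))
    cat(f, ': ', round(file.size(path) / 2^20, 1), ' MB, ', nlines, ' lines\n', sep = "")
}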
Task to accomplish:
Instead of processing the entire text files, we can take samples from each file and compute frequencies from those samples. Since the news text is more relevant to our goal than the other sources, we sample a larger fraction of it.
# Sample a fraction `prob` of the lines from a file in ./final/en_US/
# and write them to ./preprocess/<filename>_sample.txt
resampling <- function(filename, prob){
    con <- file(paste('./final/en_US/', filename, '.txt', sep = ""), "r")
    data <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
    nlines <- length(data)
    set.seed(20)  # reproducible sampling
    # keep each line independently with probability `prob`
    lines <- data[rbinom(n = nlines, size = 1, prob = prob) == 1]
    close(con)
    con <- file(paste('./preprocess/', filename, "_sample.txt", sep = ""), "w")
    writeLines(lines, con)
    close(con)
}
## 20% from news; 5% from blogs; 1% from twitter
resampling('en_US.blogs', 0.05)
resampling('en_US.twitter', 0.01)
resampling('en_US.news', 0.2)
Understand the frequencies of words and word pairs: build figures and tables that show how these frequencies vary across the data.
library(tm)
library(ggplot2)
library(RWeka)
con1 <- file('./preprocess/en_US.blogs_sample.txt','r')
con2 <- file('./preprocess/en_US.news_sample.txt', 'r')
con3 <- file('./preprocess/en_US.twitter_sample.txt', 'r')
blogs <- readLines(con1)
news <- readLines(con2)
twitter <- readLines(con3)
close(con1)
close(con2)
close(con3)
samples <- c(blogs, news, twitter)
rm(con1, con2, con3, blogs, news, twitter)
samples <- removeNumbers(samples)     # remove numbers
samples <- removePunctuation(samples) # remove punctuation
samples <- stripWhitespace(samples)   # collapse repeated whitespace
samples <- tolower(samples)           # lowercase all content
samples <- iconv(samples, "latin1", "ASCII", sub = "") # strip non-ASCII characters
cleanData <- data.frame(samples, stringsAsFactors=F)
words <- WordTokenizer(cleanData)
oneGram <- data.frame(table(words))
oneGram <- oneGram[order(oneGram$Freq, decreasing = TRUE), ]
head(oneGram)
## words Freq
## 82979 the 131896
## 84246 to 74674
## 2781 and 72091
## 1 a 64105
## 57832 of 58871
## 39153 i 47970
biGram <- NGramTokenizer(samples, Weka_control(min = 2, max = 2))
biGram <- data.frame(table(biGram))
biGram <- biGram[order(biGram$Freq, decreasing = TRUE), ]
head(biGram)
## biGram Freq
## 556987 of the 12776
## 396448 in the 11340
## 831360 to the 5919
## 566300 on the 5274
## 300134 for the 4750
## 825006 to be 4657
triGram <- NGramTokenizer(samples, Weka_control(min = 3, max = 3))
triGram <- data.frame(table(triGram))
triGram <- triGram[order(triGram$Freq, decreasing = TRUE), ]
head(triGram)
## triGram Freq
## 1136969 one of the 969
## 20448 a lot of 870
## 1647527 to be a 485
## 1399105 some of the 460
## 840306 it was a 452
## 191071 as well as 450
polyGram <- NGramTokenizer(samples, Weka_control(min = 4, max = 4))
polyGram <- data.frame(table(polyGram))
polyGram <- polyGram[order(polyGram$Freq, decreasing = TRUE), ]
head(polyGram)
## polyGram Freq
## 1784063 the end of the 223
## 1837051 the rest of the 206
## 245704 at the end of 199
## 246875 at the same time 161
## 654858 for the first time 145
## 927563 in the middle of 132
The first step in building a predictive model for text is understanding the distribution and relationship between the words, tokens, and phrases in the text. The goal of this task is to understand the basic relationships you observe in the data and prepare to build your first linguistic models.
# Format axis labels in scientific notation (plotmath, e.g. 1.25 %*% 10^5)
scientific <- function(l) {
    l <- format(l, scientific = TRUE)
    l <- gsub("^(.*)e", "'\\1'e", l)  # quote the mantissa
    l <- gsub("e", "%*%10^", l)       # turn the exponent into plotmath
    parse(text = l)
}
g <- ggplot(subset(oneGram, Freq > 20000), aes(x = words, y = Freq)) +
    geom_bar(stat = "identity", fill = "blue") +
    geom_text(aes(label = Freq), vjust = -0.1) +
    scale_y_continuous(labels = scientific)
g
g <- ggplot(subset(biGram, Freq > 2000), aes(x = biGram, y = Freq)) +
    geom_bar(stat = "identity", fill = "blue") +
    theme(axis.text.x = element_text(hjust = 1, angle = 60))
g
g <- ggplot(subset(triGram, Freq > 300), aes(x = triGram, y = Freq)) +
    geom_bar(stat = "identity", fill = "blue") +
    theme(axis.text.x = element_text(hjust = 1, angle = 60))
g
Let’s build a plot showing how many words from a frequency-sorted dictionary are needed to cover 10% to 90% of all word instances.
# Return how many of the most frequent words are needed to cover a fraction x
# of all word instances (oneGram is already sorted by decreasing frequency)
dictionary <- function(x, sum_freq = 0){
    totalsum <- sum(oneGram$Freq)
    for(i in 1:length(oneGram$Freq)){
        sum_freq <- sum_freq + oneGram$Freq[i]
        if(sum_freq >= (x * totalsum)){
            break
        }
    }
    return(i)
}
ten_percent <- dictionary(0.1)
quarter <- dictionary(0.25)
half <- dictionary(0.5)
three_quarters <- dictionary(0.75)
ninety <- dictionary(0.9)
percentage <- c(10, 25, 50, 75, 90)
totalwords <- c(ten_percent, quarter, half, three_quarters, ninety)
qplot(percentage, totalwords, geom=c("line","point")) + geom_text(aes(label = totalwords), hjust=0.3, vjust=-0.1)
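For reference, the same word counts can be obtained without the explicit loop; the following is a vectorized sketch using the already sorted oneGram table:
# Cumulative coverage of the frequency-sorted unigrams; for each target
# coverage level, report the first index at which it is reached
coverage <- cumsum(oneGram$Freq) / sum(oneGram$Freq)
sapply(c(0.1, 0.25, 0.5, 0.75, 0.9), function(p) which(coverage >= p)[1])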
In this report, we have found the most frequently used n-grams in the news, Twitter, and blogs text files. In future work, we will build a prediction model that suggests the next word as the user types; for example, after typing “Thank”, the model should suggest “you”.
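As a first illustration of that idea, the sketch below ranks candidate continuations by their frequency in the biGram table built above; the helper name predictNext and the simple string matching are illustrative assumptions, not the final model.
# Illustrative next-word lookup based on the biGram frequency table,
# which is already sorted by decreasing frequency
predictNext <- function(word, n = 3) {
    word <- tolower(word)
    # keep bigrams whose first token is the typed word
    matches <- biGram[grepl(paste0("^", word, " "), biGram$biGram), ]
    if (nrow(matches) == 0) return(character(0))
    # return the most frequent continuations, dropping the leading word
    head(sub(paste0("^", word, " "), "", matches$biGram), n)
}
predictNext("thank")  # "you" should appear among the top suggestions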