This project builds a predictive text model similar to those used in SwiftKey, which complete sentences and save the user some typing time, particularly on smartphones.
The input is a large corpus of tweets, news and blog texts, obtained from HC Corpora (www.corpora.heliohost.org). The goal is to use this initial information to learn from the data which words could be used to complete a sentence. The first step is to understand the problem and get to know the data well, then clean and process it, and finally use some prediction methods (supervised and unsupervised) to implement a solution.
This report aims to summarize the main results about the data and to plan a strategy to develop a prediction model.
The data for the project is available from here. My work is based on the English locale. The three data files contain tweets, blogs and news, and the files are large enough to discourage keeping all the information in memory.
rbind(paste("data/en_US",list.files("data/en_US/"),sep="/"),
file.size(paste("data/en_US",list.files("data/en_US/"),sep="/")))
[,1] [,2]
[1,] "data/en_US/en_US.blogs.txt" "data/en_US/en_US.news.txt"
[2,] "210160014" "205811889"
[,3]
[1,] "data/en_US/en_US.twitter.txt"
[2,] "167105338"
Each data file is a collection of rows with units of information obtained from the source that gives the file its name: a tweet, a blog entry or a news paragraph. The blogs file has 899288 records, the news file has 1010242 records and the twitter file has 2360148 records.
Since removing profanity is one of the indicated tasks, we need a list of such words. I obtained one from the following link: http://www.bannedwordlist.com/
As main tools for the analysis, I decided to use the packages tm and RWeka. These packages offer methods and functions to clean and summarize the data. Such functions could be written by hand, but that would take me too much time to program (I am not an expert). The advantage is that the packages give a quick, partial solution; the drawback is that they have some limitations and their structure seems more complex than necessary for the project at hand.
Other tools used are R.utils, stringr and wordcloud.
For the purpose of this project, I will consider a sample of lines from the datasets, only to show the functionality and the main ideas. For this I use a sampler function that selects at random a subset of records from the complete file:
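The record counts can be checked without loading a whole file into memory. The sampler function later in this report uses countLines() from R.utils for this; the block below is only a base R sketch of the same idea. The names count_lines and chunk_size are illustrative and do not come from the original analysis.

```r
# Count the lines of a text file in chunks, so the whole file is never
# held in memory. count_lines() and chunk_size are illustrative names.
count_lines <- function(path, chunk_size = 10000) {
  con <- file(path, open = "r")
  on.exit(close(con))
  total <- 0
  repeat {
    chunk <- readLines(con, n = chunk_size, warn = FALSE)
    if (length(chunk) == 0) break   # end of file reached
    total <- total + length(chunk)
  }
  total
}

# Small self-contained check on a temporary file
tmp <- tempfile(fileext = ".txt")
writeLines(c("first record", "second record", "third record"), tmp)
n_demo <- count_lines(tmp, chunk_size = 2)
```

On the real data the call would be something like count_lines("data/en_US/en_US.blogs.txt"), using the paths listed above.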
#Function to create sample lines from an input file
samplelines <- function(filename,m=50,seed=10){
library(R.utils) #provides countLines()
set.seed(seed) #fix a seed for reproducibility
n <- as.numeric(countLines(filename)) #total number of lines in the file
muestra <- sort(sample(seq_len(n),m,replace = FALSE)) #sampled line numbers, in file order
#read the file once, only up to the last sampled line, and keep the sampled lines
lineas <- readLines(filename,n = max(muestra))[muestra]
lineas }
Now we generate sample files with, say, m=20 records each and create a corpus with those files:
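The code for this step is not shown in the report, so the following is only a sketch of how the sample files could be written. The helper write_samples and the stand-in simple_sampler are hypothetical names. The stand-in reads the whole file and is only meant for the small demo; with the real data, the samplelines function defined above can be passed as the sampler argument instead.

```r
# Stand-in sampler with the same arguments as samplelines(); it reads the
# whole file, so it is only suitable for small inputs such as the demo below.
simple_sampler <- function(filename, m = 20, seed = 10) {
  set.seed(seed)
  lines <- readLines(filename, warn = FALSE)
  lines[sort(sample(seq_along(lines), m, replace = FALSE))]
}

# Hypothetical helper: write one sample file per input file into out_dir.
write_samples <- function(infiles, out_dir, m = 20, seed = 10,
                          sampler = simple_sampler) {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  outfiles <- file.path(out_dir, basename(infiles))
  for (i in seq_along(infiles)) {
    writeLines(sampler(infiles[i], m, seed), outfiles[i])
  }
  invisible(outfiles)
}

# Demo on a temporary file with 100 numbered lines
demo_in <- tempfile(fileext = ".txt")
writeLines(paste("line", 1:100), demo_in)
demo_out <- write_samples(demo_in, file.path(tempdir(), "Sample_demo"), m = 20)
sample_demo <- readLines(demo_out[1])
```

For the real files, a call such as write_samples(paste("data/en_US", list.files("data/en_US/"), sep="/"), "data/Sample", m = 20, sampler = samplelines) would fill the directory that the corpus is later read from. That this matches the original procedure is an assumption.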
The following steps clean the data. It is important to note that the order in which the filters are applied matters and gives different results. I decided first to put everything in lowercase, then remove the numbers, remove all the punctuation symbols, then remove profanities and the stray letters left over after removing punctuation (for example, “aren’t” becomes “aren t”), and finally remove the extra white space.
One more decision was not to remove the stopwords, as those words can be useful for prediction.
library(tm)
Loading required package: NLP
library(RWeka)
library(stringr)
#read the profanities database, and put everything in lowercase, and remove punctuation
#signs
profanities <- readLines(con=file("data/badwords/badwords.txt"))
Warning in readLines(con = file("data/badwords/badwords.txt")): incomplete
final line found on 'data/badwords/badwords.txt'
profanities <-tolower(gsub("[[:punct:]]","",profanities))
#create the corpus of files.
texto <- Corpus(DirSource(directory = "data/Sample/"))
#Data cleansing
texto <- tm_map(texto,content_transformer(tolower)) #change everything to lowercase
texto <- tm_map(texto,removeNumbers) #remove numbers
texto <- tm_map(texto,content_transformer(str_replace_all),pattern = "[[:punct:]]",replacement = " ") #remove punctuation and replace by spaces
texto <- tm_map(texto,content_transformer(removeWords),words=c(profanities,"t","x"," nt ","j","m"))
texto <- tm_map(texto,stripWhitespace) #remove extra spaces
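To see what these steps do to a single line of text, here is a base R sketch that mimics the same sequence without tm. The function clean_line is illustrative, its regular expressions are approximations of the tm transformations rather than the package's own code, and the final trimws() is an addition that stripWhitespace does not perform.

```r
# Base R stand-ins for the tm steps above, applied to one line of text.
# clean_line() is an illustrative helper, not part of the original report.
clean_line <- function(x, drop_words = c("t", "x", "j", "m")) {
  x <- tolower(x)                          # lowercase
  x <- gsub("[[:digit:]]+", "", x)         # remove numbers
  x <- gsub("[[:punct:]]", " ", x)         # punctuation replaced by spaces
  pattern <- sprintf("\\b(%s)\\b", paste(drop_words, collapse = "|"))
  x <- gsub(pattern, "", x, perl = TRUE)   # remove stray letters / listed words
  x <- gsub("[[:space:]]+", " ", x)        # collapse repeated white space
  trimws(x)                                # extra step: trim the ends
}

clean_demo <- clean_line("They aren't HERE, 2 days late!")
# clean_demo is "they aren here days late"
```

The profanity list would simply be appended to drop_words, as is done with removeWords in the tm version.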
Now we can use the “clean” corpus to do some exploratory analysis and try to answer some of the questions formulated for the second task.
With the corpus we create a term-document matrix and analyze the distribution of words. Since the matrix is sparse, I consider only the 100 most frequent words:
tdm <- as.matrix(TermDocumentMatrix(texto))
dim(tdm)
[1] 2538 3
stat1 <- sort(rowSums(tdm),decreasing = T)[1:100]
dotchart(stat1,main="Word frequency in the sample",xlab="frequency",cex = 0.5)
In my sample, there are 2538 different words. The most frequent words in my sample (stop words included) are:
stat1
the and that for you was with
351 235 77 75 63 47 43
are this but have from said out
41 41 40 35 32 32 30
all her not what they will about
29 29 29 27 26 26 25
his had one would can has who
25 23 23 23 22 22 22
more new time she were when just
21 21 21 20 19 19 18
our their your how know get some
18 18 18 17 17 16 16
last after like also back day even
15 14 14 13 13 13 13
first here make need than them then
13 13 13 13 13 13 13
there think want year don home into
12 12 12 12 11 11 11
been before both got next see take
10 10 10 10 10 10 10
two very because each life now people
10 10 9 9 9 9 9
still use which being city did love
9 9 9 8 8 8 8
most off other over second thanks well
8 8 8 8 8 8 8
work yum asparagus case could doing every
8 8 7 7 7 7 7
face lot
7 7
A word cloud can be helpful at this point; it makes the high frequency of the stop words obvious:
library(wordcloud)
Loading required package: RColorBrewer
wordcloud(names(stat1),stat1)
We now consider the distributions of 2-grams and 3-grams, repeating a process similar to the one used for the distribution of words. However, the term-document matrix is much more sparse in these cases:
token_delim <- "\\t\\r\\n.!?,;\"() "
bitoken <- function(x){NGramTokenizer(x, Weka_control(min=2,max=2, delimiters = token_delim))}
tdm2 <- as.matrix(TermDocumentMatrix(texto, control = list(tokenize = bitoken)))
stat3 <- sort(rowSums(tdm2),decreasing = T)[1:100]
dotchart(stat3,main="2-gram frequency in the sample",xlab="frequency",cex = 0.5)
wordcloud(names(stat3),stat3)
token_delim <- "\\t\\r\\n.!?,;\"() "
tritoken <- function(x){NGramTokenizer(x, Weka_control(min=3,max=3, delimiters = token_delim))}
tdm3 <- as.matrix(TermDocumentMatrix(texto, control = list(tokenize = tritoken)))
stat4 <- sort(rowSums(tdm3),decreasing = T)[1:100]
dotchart(stat4,main="3-gram frequency in the sample",xlab="frequency",cex = 0.5)
wordcloud(names(stat4),stat4)
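As a way to check the n-gram counts on a tiny example, the sketch below builds n-grams in base R from lines that are already cleaned. The helper count_ngrams is an illustrative name. Unlike the RWeka tokenizer above, it only splits on white space and never joins words across different lines, so its counts are not guaranteed to match the matrices tdm2 and tdm3.

```r
# Count n-grams in a character vector of cleaned lines (illustrative helper).
count_ngrams <- function(lines, n = 2) {
  grams <- unlist(lapply(strsplit(lines, "[[:space:]]+"), function(w) {
    w <- w[nzchar(w)]                        # drop empty tokens
    if (length(w) < n) return(character(0))  # line too short for an n-gram
    starts <- seq_len(length(w) - n + 1)
    vapply(starts, function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  sort(table(grams), decreasing = TRUE)
}

ngram_lines <- c("the cat sat on the mat", "the cat ran")
bigram_demo <- count_ngrams(ngram_lines, n = 2)
# "the cat" appears twice; the other five bigrams appear once each
```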
The next exploration question asks: How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%? To answer it, we need the index at which the cumulative frequency proportion reaches each threshold:
stat2 <- sort(rowSums(tdm),decreasing = T)
cumfreq <- cumsum(stat2)/sum(stat2)
In this sample, we require 236 words to cover 50% of all word instances, and 1935 words to cover 90%.
I still don’t have a definite solution to identify or evaluate whether a word comes from a foreign language. I think that, as those words are not very frequent, it will be possible to include them in the set of non-stop words.
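The lookup of the index is not shown in the report. One way to do it, sketched here on toy counts, is to take the first position at which the cumulative proportion reaches the target. The function words_for_coverage and the toy frequencies are illustrative, and whether the figures above were obtained exactly this way is an assumption.

```r
# Number of top-ranked words needed to reach a given share of all
# word instances (illustrative helper).
words_for_coverage <- function(freq, target) {
  cumfreq <- cumsum(sort(freq, decreasing = TRUE)) / sum(freq)
  unname(which(cumfreq >= target)[1])
}

# Toy counts: cumulative proportions are 0.40, 0.70, 0.85, 0.95, 1.00
toy_freq <- c(the = 40, and = 30, that = 15, cat = 10, mat = 5)
cover_50 <- words_for_coverage(toy_freq, 0.5)   # 2 words
cover_90 <- words_for_coverage(toy_freq, 0.9)   # 4 words
```

With the objects defined above, the equivalent calls would be which(cumfreq >= 0.5)[1] and which(cumfreq >= 0.9)[1].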
For the rest of the Project:
Thanks for your feedback.