Summary

The objective of this project is to develop and implement a text-prediction algorithm. The idea is to offer the user predictions for the next words to type, based on the last words they have typed.

This report describes my exploratory analysis of the course data set in plain language, with plots and code.

Getting the data

In this step I get the data, which comes in three files: en_US.twitter.txt, en_US.news.txt and en_US.blogs.txt. This data will then be used in the exploratory data analysis.

After reading the files, I analyse them and show some basic figures about each one.

Opening, reading the files and closing the connections

The code below opens the connections, reads the data and closes the connections. The data are then available for manipulation.

#opening the connections
con_twitter <- file("C:/Users/ssmar/Documents/final/en_US/en_US.twitter.txt", "r")
con_news <- file("C:/Users/ssmar/Documents/final/en_US/en_US.news.txt", open = "rb") #binary mode, so reading does not stop at embedded control characters
con_blogs <- file("C:/Users/ssmar/Documents/final/en_US/en_US.blogs.txt", "r")

#reading the text
text_twitter = readLines(con_twitter)
text_news = readLines(con_news, encoding = "UTF-8", skipNul = TRUE)
text_blogs = readLines(con_blogs)

#closing the connections
close(con_twitter)
close(con_news)
close(con_blogs)

Analysing the files

In this step I compute basic summaries of the three files: word counts, line counts and number of characters, together with a plot of the number of lines in each file.
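A base-R sketch of how these summaries could be computed from the vectors read above (the word counts here come from splitting on whitespace, so the totals may differ slightly from the table below):

#line counts per file
line_counts <- c(Twitter = length(text_twitter),
                 News = length(text_news),
                 Blogs = length(text_blogs))

#word counts per file (splitting each line on whitespace)
word_counts <- c(Twitter = sum(lengths(strsplit(text_twitter, "\\s+"))),
                 News = sum(lengths(strsplit(text_news, "\\s+"))),
                 Blogs = sum(lengths(strsplit(text_blogs, "\\s+"))))

#character counts per file
char_counts <- c(Twitter = sum(nchar(text_twitter)),
                 News = sum(nchar(text_news)),
                 Blogs = sum(nchar(text_blogs)))

#plot with the number of lines in each file
barplot(line_counts, main = "Number of lines per file", xlab = "File", ylab = "Lines")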

The table below summarizes the number of documents (lines) in each source (blogs, news and tweets), the total number of words in each collection of documents and the total number of characters in each.

For instance, the blogs collection contains close to 900,000 lines (documents), a total word count of over 37 million and roughly 208 million characters.

         Number of Lines  Number of Words  Number of Characters
Twitter          2360148         30373543             162384825
News             1010242         34372530             203223160
Blogs             899288         37334131             208361438

Exploratory Analysis

In this step I perform a thorough exploratory analysis of the data, looking at the distribution of words and the relationships between them.

The exploratory analysis will be done with a sample of the data because the dataset is fairly large. We know that a representative sample can be used to infer facts about a population.

I join the samples in a new file to facilitate processing the data. After that I clean the data and show some analysis of it.

Creating the sample

The chunk below creates a sample from each file and joins the samples in a new file.

I joined the samples in a new file to avoid memory overflow issues when working with the full dataset.

#set seed for reproducibility
set.seed(1234)

#sampling 1% of the lines from each file
train_twitter = sample(text_twitter, size = round(0.01*length(text_twitter)))
train_blogs = sample(text_blogs, size = round(0.01*length(text_blogs)))
train_news = sample(text_news, size = round(0.01*length(text_news)))

#joining the samples
text = c(train_news, train_blogs, train_twitter)

#creating a new file with all the samples
write.csv(text, file="texts.csv", row.names=FALSE)

Creating the corpus

Now I read the sample file and create a corpus.

Corpus is a function from the tm package, a text-mining framework that provides functions for creating and managing corpora, applying transformations such as case folding and stopword or punctuation removal, and building term-document matrices.

library(tm)

#reading the train file; the first (and only) column holds the sampled text
text = read.csv("C:/Users/ssmar/Documents/Pos-Graduacao/DataScience/Course10-DataCapstone/texts.csv", stringsAsFactors = FALSE)[[1]]

#creating the corpus
text = Corpus(VectorSource(text))

Transforming the data

All these transformations are done with the tm_map function from the tm package, which allows applying several transformations such as converting the text to lower case and removing numbers, extra white space, stopwords, punctuation and profanity.

#converting to lower case
text = tm_map(text, content_transformer(tolower))

#removing extra whitespace
text = tm_map(text, stripWhitespace)

#removing numbers
text = tm_map(text, removeNumbers)

#removing stopwords
text = tm_map(text, removeWords, stopwords("english"))

#removing profanity and other unwanted words
text = tm_map(text, removeWords, c("bitch", "cock", "shit", "fuck"))

#removing punctuation
text = tm_map(text, removePunctuation)

Building a term-document matrix

Using the corpus of documents, we now construct a Term-Document Matrix (TDM). This object is a simple triplet matrix (an efficient structure for storing large sparse matrices) that has each term (or n-gram) as a row and each document as a column.

After that I show the most frequent words.

library(wordcloud)

#constructing the term-document matrix
textTDM = TermDocumentMatrix(text, control = list(stopwords = TRUE))
m <- as.matrix(textTDM)

#total frequency of each term across all documents
v <- sort(rowSums(m), decreasing = TRUE)

#data frame of words and their frequencies
d <- data.frame(word = names(v), freq = v)

#plot with the 10 most frequent words
barplot(head(d, 10)$freq, names.arg = rownames(head(d, 10)), main = "Frequency of the words", xlab = "Words", ylab = "Frequency")

#wordcloud with the frequent words
wordcloud(words = names(v), freq = d$freq, min.freq = 3, random.order = FALSE)

Next steps

Further I will: understand the variation in the frequencies of words and word pairs in the data; build a basic n-gram model for predicting the next word based on the previous 1, 2 or 3 words; and create a Shiny app that takes a phrase (multiple words) as input and, when the user clicks Submit, predicts the next word.
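
As a rough sketch of this n-gram idea (predict_next and the bigram table below are illustrative placeholders, not the final model), bigram counts from the sampled text could already drive a simple next-word lookup:

#tokenising the sampled text into lower-case words
tokens <- unlist(strsplit(tolower(c(train_news, train_blogs, train_twitter)), "[^a-z']+"))
tokens <- tokens[tokens != ""]

#table of adjacent word pairs (bigrams)
bigrams <- data.frame(first = head(tokens, -1),
                      second = tail(tokens, -1),
                      stringsAsFactors = FALSE)

#suggest the n most frequent followers of a given word
predict_next <- function(word, n = 3) {
  candidates <- bigrams$second[bigrams$first == tolower(word)]
  head(names(sort(table(candidates), decreasing = TRUE)), n)
}

predict_next("thank")  #e.g. "you" should rank among the suggestions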