The objective of the project is to develop and implement a text prediction algorithm. The idea is to provide the user with predictions for the next words to be typed in, based on the last words typed by the user.
This report describes, in plain language, plots and code, my exploratory analysis of the course data set.
This activity consists of getting the data from three files: en_US.twitter.txt, en_US.news.txt and en_US.blogs.txt. These data are then used in the exploratory data analysis.
After reading the files, I analyze them and show some information about them.
The code below opens the connections, reads the data and closes the connections. After that, the data are available for manipulation.
#opening the connections (the news file is opened in binary mode to avoid an early end-of-file on special characters)
con_twitter <- file("C:/Users/ssmar/Documents/final/en_US/en_US.twitter.txt", "r")
con_news <- file("C:/Users/ssmar/Documents/final/en_US/en_US.news.txt", open = "rb")
con_blogs <- file("C:/Users/ssmar/Documents/final/en_US/en_US.blogs.txt", "r")
#reading the text, skipping embedded nul characters
text_twitter = readLines(con_twitter, encoding = "UTF-8", skipNul = TRUE)
text_news = readLines(con_news, encoding = "UTF-8", skipNul = TRUE)
text_blogs = readLines(con_blogs, encoding = "UTF-8", skipNul = TRUE)
#closing the connections
close(con_twitter)
close(con_news)
close(con_blogs)
The code below computes basic summaries of the files: line counts, word counts and character counts. It also shows a plot with the number of lines in each file.
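A minimal sketch of how these summaries could be computed is shown below; it assumes the stringi package is available and uses its stri_count_words function for the word counts (an assumption about the tooling, not necessarily how the original counts were produced).
#sketch of the basic summaries, assuming the stringi package is installed
library(stringi)
#line, word and character counts for each collection
summary_stats <- data.frame(
  Source = c("Twitter", "News", "Blogs"),
  Lines = c(length(text_twitter), length(text_news), length(text_blogs)),
  Words = c(sum(stri_count_words(text_twitter)),
            sum(stri_count_words(text_news)),
            sum(stri_count_words(text_blogs))),
  Characters = c(sum(nchar(text_twitter)),
                 sum(nchar(text_news)),
                 sum(nchar(text_blogs)))
)
summary_stats
#plot with the number of lines in each file
barplot(summary_stats$Lines, names.arg = summary_stats$Source,
        main = "Number of lines per file", xlab = "Source", ylab = "Number of lines")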
The table below summarizes the number of documents (lines) in each source (blogs, news and tweets), the total number of words in each collection of documents and the total number of characters.
For instance, the blogs collection contains close to 900,000 lines (documents), a total word count of over 37 million and about 208 million characters.
| Source | Number of Lines | Number of Words | Number of Characters |
|---|---|---|---|
| Twitter | 2360148 | 30373543 | 162384825 |
| News | 1010242 | 34372530 | 203223160 |
| Blogs | 899288 | 37334131 | 208361438 |
This activity performs a thorough exploratory analysis of the data, to understand the distribution of words and the relationships between words.
The exploratory analysis is done on a sample of the data because the dataset is fairly large. We know that a representative sample can be used to infer facts about a population.
I join the samples into a new file to make processing the data easier. After that I clean the data and show some analysis of them.
The chunk below creates a sample from each file and joins the samples in a new file.
I joined the samples in a new file to avoid memory overflow issues.
#set the seed for reproducibility
set.seed(1234)
#sampling 1% of the lines of each file
train_twitter = sample(text_twitter, size = 0.01 * length(text_twitter))
train_blogs = sample(text_blogs, size = 0.01 * length(text_blogs))
train_news = sample(text_news, size = 0.01 * length(text_news))
#joining the samples
text = c(train_news, train_blogs, train_twitter)
#creating a new file with all the samples
write.csv(text, file = "texts.csv", row.names = FALSE)
Now I read the sample file and create a Corpus.
Corpus is a function from the tm package. This package provides a text-mining framework for R, with functions for creating and managing corpora, applying transformations (such as lower casing and removing numbers, punctuation and stopwords) and building term-document matrices.
#loading the tm package
library(tm)
#reading the sampled text file (the text is stored in the single column written by write.csv)
text = read.csv("C:/Users/ssmar/Documents/Pos-Graduacao/DataScience/Course10-DataCapstone/texts.csv", as.is = T, row.names = NULL)[, 1]
#creating the corpus
text = Corpus(VectorSource(text))
All these transformations are done with the tm_map function from the tm package, which allows applying several transformations such as converting the text to lower case and removing numbers, extra white space, stopwords, punctuation and profanity words.
#converting the text to lower case
text = tm_map(text, content_transformer(tolower))
#removing extra whitespaces
text = tm_map(text, stripWhitespace)
#removing numbers
text = tm_map(text, removeNumbers)
#removing stopwords
text = tm_map(text, removeWords,stopwords("english"))
#removing profanity and other unwanted words
text = tm_map(text, removeWords, c("bitch", "cock", "shit", "fuck"))
#removing punctuation
text = tm_map(text, removePunctuation)
Using the corpus of documents, we now construct a Term-Document Matrix (TDM). This object is a simple triplet matrix structure (efficient for storing large sparse matrices) that has each term (or n-gram) as a row and each document as a column.
After that I show the most frequent words.
#constructing a Term-Document Matrix (TDM)
textTDM = TermDocumentMatrix(text, control = list(stopwords = TRUE))
#computing the total frequency of each term
m <- as.matrix(textTDM)
v <- sort(rowSums(m), decreasing = TRUE)
#data frame with the words and their frequencies
d <- data.frame(word = names(v), freq = v)
#plot with the 10 most frequent words
barplot(head(d, 10)$freq, names.arg = head(d, 10)$word, main = "Frequency of the words", xlab = "Words", ylab = "Frequency")
#wordcloud with the frequent words (requires the wordcloud package)
library(wordcloud)
wordcloud(words = d$word, freq = d$freq, min.freq = 3, random.order = FALSE)
Further I will:
- understand the variation in the frequencies of words and word pairs in the data;
- build a basic n-gram model for predicting the next word based on the previous 1, 2 or 3 words;
- create a Shiny app that takes a phrase (multiple words) as input and, when the user clicks submit, predicts the next word.
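As a preview of the n-gram step, the sketch below shows one simple way bigram frequencies could be counted from a character vector of cleaned lines; the count_bigrams helper and the example input are hypothetical illustrations, not the final model.
#hypothetical helper: counts bigram frequencies in a character vector of cleaned lines
count_bigrams <- function(lines) {
  bigrams <- unlist(lapply(lines, function(line) {
    #split each line into lower-case word tokens
    words <- unlist(strsplit(tolower(line), "[^a-z']+"))
    words <- words[words != ""]
    if (length(words) < 2) return(character(0))
    #pair each word with the word that follows it
    paste(head(words, -1), tail(words, -1))
  }))
  sort(table(bigrams), decreasing = TRUE)
}
#example usage on a small vector; in the project this would run on the cleaned sample
head(count_bigrams(c("the quick brown fox", "the quick red fox")), 5)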