In our everyday lives we type a lot of text. Some of us write long texts quickly, while others need more time and effort. Typing on a desktop keyboard is not too difficult, but typing on a phone or tablet is harder and sometimes even frustrating. In this project we attempt to improve the typing experience by predicting the next word in a sentence fragment. This problem belongs to Natural Language Processing, a field that combines statistics, linguistics, machine learning, and computer science.
This will help us type faster and also avoid misspellings. The final product of this project will be a Shiny app that predicts the next word of a given text. This is a project in progress, so here I explain our exploratory data analysis along with the goals for the Shiny app and its algorithm.
We present a probabilistic model to predict the next word of a text. The model will be trained on an English-language dataset and built from the frequencies of single words and of sequences of 2, 3, and 4 consecutive words. These combinations of words are called 1-grams, 2-grams, 3-grams, and 4-grams. We will estimate the distribution of these n-grams in our dataset and use it to predict the next word of an unseen text. First, we need to explore and understand our dataset.
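To make the idea concrete, here is a toy sketch (not the model we will actually build): with 4-gram and 3-gram counts, the probability of a next word given the previous three words is just a ratio of counts. The toy sentences and the helper ngrams() below are made up purely for illustration.
# Toy sketch: estimate P(next word | previous 3 words) from n-gram counts
toy <- c("i want to eat pizza", "i want to eat pasta", "i want to sleep now")
tokens <- strsplit(toy, " ")
ngrams <- function(words, n) {
  if (length(words) < n) return(character(0))
  sapply(seq_len(length(words) - n + 1), function(i) paste(words[i:(i + n - 1)], collapse = " "))
}
count4 <- table(unlist(lapply(tokens, ngrams, n = 4)))
count3 <- table(unlist(lapply(tokens, ngrams, n = 3)))
# P("eat" | "i want to") = count("i want to eat") / count("i want to") = 2/3
count4["i want to eat"] / count3["i want to"]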
The dataset for our project can be downloaded from the following link:
https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
This dataset contains several sources of text (blogs, news, and Twitter) in four different languages. In this report we use only the English files. First, we need to understand these three sources; as a first step we load a random sample from each source and explore it to identify the most frequent words and compare their behavior.
First, we load the packages needed for this project.
options(scipen = 1, digits = 5)
# Text Mining in R
library(tm)
## Loading required package: NLP
library(wordcloud)
## Loading required package: RColorBrewer
set.seed(4543)
Now we explore the original files from the different sources to get some basic statistics: the number of lines and the overall size (number of characters) of each file. At the end of the following section we show these numbers for News, Blogs, and Twitter.
options(scipen = 1, digits = 5)
# Basic statistics of the raw text files (character and line counts via wc)
set.seed(4543)
setwd("~/home/nico/TRABAJO/programming_new/courses/Coursera/Data_science_specialization/Captone_project/project")
# wc -c counts characters (bytes) and wc -l counts lines; we parse the first field of each output
number_chars_blogs <- as.numeric(strsplit(trimws(system("wc -c data/en_US/en_US.blogs.txt", intern = TRUE)), "\\s+")[[1]][1])
number_lines_blogs <- as.numeric(strsplit(trimws(system("wc -l data/en_US/en_US.blogs.txt", intern = TRUE)), "\\s+")[[1]][1])
number_chars_twitter <- as.numeric(strsplit(trimws(system("wc -c data/en_US/en_US.twitter.txt", intern = TRUE)), "\\s+")[[1]][1])
number_lines_twitter <- as.numeric(strsplit(trimws(system("wc -l data/en_US/en_US.twitter.txt", intern = TRUE)), "\\s+")[[1]][1])
number_chars_news <- as.numeric(strsplit(trimws(system("wc -c data/en_US/en_US.news.txt", intern = TRUE)), "\\s+")[[1]][1])
number_lines_news <- as.numeric(strsplit(trimws(system("wc -l data/en_US/en_US.news.txt", intern = TRUE)), "\\s+")[[1]][1])
df <- matrix(c(number_chars_news, number_lines_news,
               number_chars_twitter, number_lines_twitter,
               number_chars_blogs, number_lines_blogs),
             ncol = 2, byrow = TRUE)
colnames(df) <- c("Number of Characters", "Number of Lines")
rownames(df) <- c("News", "Twitter", "Blogs")
df.table <- as.table(df)
print("This table gives the number of words and lines in the corresponding sources, News, Blogs and Twitter")
df.table
From the table we can see that each source has a similar size (about 200,000,000 characters, roughly 200 MB per file), and that News and Blogs contain a similar number of lines (around 1,000,000) compared with Twitter (about 900,000). In general, all the files are of comparable size. The next question is whether they contain similar words with similar frequencies. Before answering it, we need to clean the data.
Data from these different sources can contain a lot of noise, so we need to clean it first. The first step is to remove non-printable characters from the files; otherwise they cause problems when processing the text. We do this with the command-line utility tr, which deletes every character outside the printable ASCII range (keeping tab, newline, and carriage return).
system("tr -cd '\11\12\15\40-\176' < data/en_US/en_US.blogs.txt > data/en_US/en_US.blogs_nonP.txt")
system("tr -cd '\11\12\15\40-\176' < data/en_US/en_US.news.txt > data/en_US/en_US.news_nonP.txt")
system("tr -cd '\11\12\15\40-\176' < data/en_US/en_US.twitter.txt > data/en_US/en_US.twitter_nonP.txt")
Now we load a sample of the data: 5,000 random lines from each source. We do this to avoid memory problems while running our code; for the Shiny app we may need to process the full files, a problem we will come back to later in the project. We draw the 5,000 lines from each file on the command line.
system("shuf -n5000 data/en_US/en_US.blogs_nonP.txt -o data/en_US/en_US.blogs_Sample.txt")
system("shuf -n5000 data/en_US/en_US.news_nonP.txt -o data/en_US/en_US.news_Sample.txt")
system("shuf -n5000 data/en_US/en_US.twitter_nonP.txt -o data/en_US/en_US.twitter_Sample.txt")
If running these commands through system() in R is a problem, they can be run directly in the shell. Alternatively, we can read a larger (but not complete) portion of each file into R and use sample() to pick a random subset of lines. All these procedures give similar results, so it is up to you which one to use.
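For example, a possible in-R version for the blogs file could look like this (a sketch only; the 50,000-line chunk size is an arbitrary choice for illustration):
con <- file("data/en_US/en_US.blogs_nonP.txt", "r")
chunk <- readLines(con, 50000) # read a larger, but not complete, portion of the file
close(con)
writeLines(sample(chunk, 5000), "data/en_US/en_US.blogs_Sample.txt") # keep 5000 random lines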
Here we read the first 1,000 lines of each sample and clean them. In the cleaning process we remove numbers, extra whitespace, punctuation, and profanity, and we convert all the text to lower case.
setwd("~/home/nico/TRABAJO/programming_new/courses/Coursera/Data_science_specialization/Captone_project/project")
blogsData <- readLines("data/en_US/en_US.blogs_Sample.txt", 1000)
newsData <- readLines("data/en_US/en_US.news_Sample.txt", 1000)
twitterData <- readLines("data/en_US/en_US.twitter_Sample.txt", 1000)
print("Example of two text from the blog data")
## [1] "Example of two text from the blog data"
blogsData[1:2]
## [1] "No, I can't say that I'm kept up at night dreaming of plastic surgery or bashing my head on the wall for stupid things said and done. But of course I can recognise where there is room for improvement and of course I have regrets."
## [2] "I am nearing that magical age of forty. Not quite there yet, but it is at the end of the month. Fabled as the middle ages (this assuming that I am living until 80, but with the family genes of women in my family living until the 90s, doesn't quite work for me), I am excited about this decade."
print("Example of two text from the news data")
## [1] "Example of two text from the news data"
newsData[1:2]
## [1] "After the sale of the last 4,000 sets -- 32 heavy, hardbound volumes chock-full of information and illustrations on, well, nearly everything -- will exist weightless on the Internet, where they can be quickly updated."
## [2] "\"I just kind of shifted into my ER mode,\" Gail said. \"I thought, 'OK, I'm going to be supportive, watchful, clinical and maintain calmness for everybody, especially for Alli.' \""
print("Example of two text from the twitter data")
## [1] "Example of two text from the twitter data"
twitterData[1:2]
## [1] "Clyde Stubblefield's drums are the backbone of hip-hop. Clyde Stubblefield is UNSUNG!!!"
## [2] "I love makeup because it enhances my beauty!! RT if you agree"
# Cleaning of Blogs data
corpusblogs <- VCorpus(VectorSource(blogsData))
corpusblogs <- tm_map(corpusblogs, removeNumbers) # remove numbers
corpusblogs <- tm_map(corpusblogs, stripWhitespace) # remove whitespaces
corpusblogs <- tm_map(corpusblogs, removePunctuation) # remove punctuation
corpusblogs <- tm_map(corpusblogs, content_transformer(tolower)) # convert to lower case
databadwords <- readLines("data/en_US/badwordlist.txt") # profanity list as a character vector
corpusblogs <- tm_map(corpusblogs, removeWords, databadwords, lazy=TRUE) # remove profanity
# Cleaning of the News data
corpusnews <- VCorpus(VectorSource(newsData))
corpusnews <- tm_map(corpusnews, removeNumbers)
corpusnews <- tm_map(corpusnews, stripWhitespace) # remove whitespaces
corpusnews <- tm_map(corpusnews, removePunctuation) # remove punctuation
corpusnews <- tm_map(corpusnews, content_transformer(tolower)) # convert to lower case
databadwords <- readLines("data/en_US/badwordlist.txt") # profanity list as a character vector
corpusnews <- tm_map(corpusnews, removeWords, databadwords, lazy=TRUE) # remove profanity
# Cleaning of the Twitter data
corpustwitter <- VCorpus(VectorSource(twitterData))
corpustwitter <- tm_map(corpustwitter, removeNumbers)
corpustwitter <- tm_map(corpustwitter, stripWhitespace) # remove whitespaces
corpustwitter <- tm_map(corpustwitter, removePunctuation) # remove punctuation
corpustwitter <- tm_map(corpustwitter, content_transformer(tolower)) # convert to lower case
databadwords <- readLines("data/en_US/badwordlist.txt") # profanity list as a character vector
corpustwitter <- tm_map(corpustwitter, removeWords, databadwords, lazy=TRUE) # remove profanity
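Note that the three cleaning blocks above are identical except for the input data. A small helper function (a sketch; the name clean_corpus is our own) would avoid the repetition and read the profanity list only once:
databadwords <- readLines("data/en_US/badwordlist.txt") # profanity list, read once
clean_corpus <- function(textData) {
  corpus <- VCorpus(VectorSource(textData))
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, stripWhitespace)
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, content_transformer(tolower))
  tm_map(corpus, removeWords, databadwords)
}
# e.g. corpusblogs <- clean_corpus(blogsData)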
Now we create the term-document matrix for each source and plot its most frequent words.
Blogs dataset:
TDMblogs <- TermDocumentMatrix(corpusblogs)
matrixb <- as.matrix(TDMblogs)
freq_word_blog <- sort(rowSums(matrixb), decreasing=TRUE)
df_blogs <- data.frame(word=names(freq_word_blog), freq=freq_word_blog)
wordcloud(df_blogs$word, df_blogs$freq, random.order=FALSE, colors=brewer.pal(8, "Dark2"),min.freq=15)
head(freq_word_blog, 10)
## the and that for with you was this have but
## 2068 1168 477 390 341 316 295 285 235 220
From the wordcloud and the frequency counts we see that the most frequent word in the blogs sample is the. This is expected, because the is a stop word, i.e. a word that appears very often in the language. We decide to keep stop words because they will be important for next-word prediction.
Now, let's check the news source:
TDMnews <- TermDocumentMatrix(corpusnews)
matrixb <- as.matrix(TDMnews)
freq_word_news <- sort(rowSums(matrixb), decreasing=TRUE)
df_news <- data.frame(word=names(freq_word_news), freq=freq_word_news)
wordcloud(df_news$word, df_news$freq, random.order=FALSE, colors=brewer.pal(8, "Dark2"),min.freq=15)
head(freq_word_news, 10)
## the and that for said was with from are but
## 1915 880 338 316 265 241 239 153 148 147
From this plot we conclude that the most frequent word in the news dataset is also the.
Now, let's study the Twitter dataset:
TDMtwitter <- TermDocumentMatrix(corpustwitter)
matrixb <- as.matrix(TDMtwitter)
freq_word_twitter <- sort(rowSums(matrixb), decreasing=TRUE)
df_twitter <- data.frame(word=names(freq_word_twitter), freq=freq_word_twitter)
wordcloud(df_twitter$word, df_twitter$freq, random.order=FALSE, colors=brewer.pal(8, "Dark2"),min.freq=15)
head(freq_word_twitter, 10)
## the you for and that with have just this your
## 390 214 182 166 116 79 74 70 69 66
Again, the most frequent word in the Twitter dataset is the. The second most frequent word differs between sources (blogs -> and, news -> and, twitter -> you), which indicates that even though the top word is the same, the three datasets do not contain the same mix of words. The frequency of the top word is also different (blogs -> 2068, news -> 1915, twitter -> 390): even though we use the same number of lines from each source, Twitter contains the word the far fewer times than blogs or news. The explanation is that a tweet cannot exceed 140 characters, while a single blog post or news article can be very long.
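We can check this explanation directly on the samples we already loaded (no output shown here; if the explanation holds, the Twitter average should be well below the other two and its maximum should not exceed 140):
mean(nchar(blogsData)) # average characters per line, blogs
mean(nchar(newsData)) # average characters per line, news
mean(nchar(twitterData)) # average characters per line, twitter
max(nchar(twitterData)) # tweets are capped at 140 characters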
Now we plot the most frequent words for a single source (news) after removing stop words, to see how different the wordcloud looks.
TDMnews1 <- TermDocumentMatrix(corpusnews,control = list(stopwords = TRUE))
matrixb1 <- as.matrix(TDMnews1)
freq_word_news1 <- sort(rowSums(matrixb1), decreasing=TRUE)
df_news1 <- data.frame(word=names(freq_word_news1), freq=freq_word_news1)
wordcloud(df_news1$word, df_news1$freq, random.order=FALSE, colors=brewer.pal(8, "Dark2"),min.freq=15)
head(freq_word_news1, 10)
## said will one new two first last year just years
## 265 107 78 71 63 60 60 60 57 56
Now the most frequent word is said, followed by will. As we can see, removing stop words makes a huge difference in the analysis. We consider that stop words should not be removed from our dataset, since the app must be able to predict them as well.
The following bar plot shows the frequencies of the 15 most frequent words in the news dataset after removing stop words.
library(ggplot2)
##
## Attaching package: 'ggplot2'
##
## The following object is masked from 'package:NLP':
##
## annotate
first15 <- df_news1[1:15,]
ggplot(first15, aes(x = word, y = freq)) + geom_bar(stat = "identity", fill = "black") + geom_text(aes(label = freq), vjust = -0.4) + theme(axis.text.x = element_text(angle = 45, hjust = 1))
From this plot we see that the word said is very common in the news dataset; the second most frequent word appears less than half as often. We could combine the three datasets into a single one and generate a wordcloud for it, but we will not do so here because the result would be very similar to the previous figures.
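For completeness, combining the samples would only take a couple of lines (a sketch; we do not run it here):
allData <- c(blogsData, newsData, twitterData) # concatenate the three samples
corpusall <- VCorpus(VectorSource(allData)) # then apply the same cleaning steps as above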
Once the cleaning and the exploratory data analysis are done, we will develop the algorithm to be used in our Shiny app. We plan the following steps:
1. Use a larger sample of the data, ideally all of it (or even additional data if we can obtain some), to generate our term-document matrix.
2. Generate the n-gram tables for our dataset, focusing on 1-grams, 2-grams, 3-grams and 4-grams. If time allows we will include higher-order n-grams, but we consider that 4-grams should be good enough for our Shiny app.
3. Develop an algorithm that takes all these n-grams into account. For an unseen text, we compute how likely its n-gram is according to our n-gram table; if the n-gram does not appear, we back off to the (n-1)-gram and repeat the process until the longest matching k-gram gives us the next word. We still have to decide how to handle the case where the text does not appear in our dataset at all: we could propose a random word starting with the initial letter, or use a more elaborate method that estimates the probability from similar word histories. This part is still in progress. (A rough sketch of steps 2 and 3 is given after this list.)
4. As a final step, check that our model works properly. We need a separate dataset to use as a test set, or we can hold out a fraction of our own dataset. This will provide feedback to improve the algorithm, and we may need to go back and refine the cleaning or other steps.
5. Write the Shiny app around our best model. It will predict the next word of an incomplete sentence, recommending the most probable word but also showing the next four most probable words to the user.
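To make steps 2 and 3 more concrete, here is a rough sketch in base R of how the n-gram tables and a simple back-off prediction could look. This is not the final algorithm: the function names (build_ngrams, predict_next) are our own, the tokenization is deliberately crude, and instead of computing smoothed probabilities the back-off simply returns the most frequent continuation of the longest matching history.
build_ngrams <- function(textLines, n) {
  tokens <- strsplit(tolower(textLines), "[^a-z']+") # crude tokenization on non-letters
  grams <- unlist(lapply(tokens, function(w) {
    w <- w[w != ""]
    if (length(w) < n) return(character(0))
    sapply(seq_len(length(w) - n + 1), function(i) paste(w[i:(i + n - 1)], collapse = " "))
  }))
  sort(table(grams), decreasing = TRUE) # n-gram frequency table, most frequent first
}
predict_next <- function(phrase, ngramTables) {
  words <- strsplit(tolower(phrase), "[^a-z']+")[[1]]
  words <- words[words != ""]
  for (n in 4:2) { # try the 4-gram table first, then back off to shorter histories
    if (length(words) < n - 1) next
    history <- paste(tail(words, n - 1), collapse = " ")
    counts <- ngramTables[[n]]
    hits <- counts[startsWith(names(counts), paste0(history, " "))]
    if (length(hits) > 0) {
      best <- names(hits)[which.max(hits)]
      return(tail(strsplit(best, " ")[[1]], 1)) # last word of the best matching n-gram
    }
  }
  names(ngramTables[[1]])[1] # no history matched: fall back to the most frequent word
}
# Example usage on the blogs sample:
# ngramTables <- lapply(1:4, function(n) build_ngrams(blogsData, n))
# predict_next("thanks for the", ngramTables)
For the app we would precompute these tables once on the full training sample, store them, and only call the prediction function at run time.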