The project consists of building a prediction algorithm that suggests the next word when a user types in one or more words.
This report is the first part of the project. It presents the exploratory analysis of the data provided and the goals for the app and the algorithm.
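The main objectives of this report are to load and sample the data, clean the text, summarize basic statistics of the source files, explore the most frequent words, bi-grams and tri-grams, and outline the plan for the prediction algorithm and the app.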
The dataset was downloaded using the link in the Coursera instructions. Unzipping the file yields three datasets: one for News, one for Twitter and one for Blogs. The data is loaded into R as shown below, and a random sample of 15000 lines is selected from each dataset.
library(tm)
library(NLP)
#Fix the random seed so the samples are reproducible (the seed value is arbitrary)
set.seed(1234)
#Load the Twitter dataset and take a sample of 15000 lines
twitter <- readLines('en_US.twitter.txt', skipNul = TRUE)
twitterSample <- sample(twitter[-1], 15000, replace = FALSE)
#Load the News dataset and take sample of 15000 lines
news <- readLines('en_US.news.txt', skipNul = TRUE)
newsSample <- sample(news[-1], 15000, replace = FALSE)
# Load the Blogs dataset and take sample of 15000 lines
blogs <- readLines('en_US.blogs.txt', skipNul = TRUE)
blogsSample <- sample(blogs[-1], 15000, replace = FALSE)
#Combine all samples into one character vector
fullText <- c(twitterSample, newsSample, blogsSample)
#Cleanup the text
#Create corpus
textCorpus <- VCorpus(VectorSource(fullText))
#Convert all characters to lowercase
textCorpus <- tm_map(textCorpus, content_transformer(tolower))
#Remove punctuation
textCorpus <- tm_map(textCorpus, removePunctuation)
#Remove numbers
textCorpus <- tm_map(textCorpus, removeNumbers)
#Remove white space
textCorpus <- tm_map(textCorpus, stripWhitespace)
Calculate some statistics of the text files
library(stringi)
# Number of words
stri_stats_latex(blogs)
##     CharsWord CharsCmdEnvir    CharsWhite         Words          Cmds
##     163325412             9      43302825      37865888             3
##        Envirs
##             0
stri_stats_latex(twitter)
##     CharsWord CharsCmdEnvir    CharsWhite         Words          Cmds
##     125769474          3033      36047952      30578933           963
##        Envirs
##             0
stri_stats_latex(news)
##     CharsWord CharsCmdEnvir    CharsWhite         Words          Cmds
##      12502954             0       3114374       2665742             0
##        Envirs
##             0
#Number of characters in longest line
df <- as.data.frame(blogs)
df$chars <- nchar(as.character(df$blogs))
max(df$chars)
## [1] 40835
df <- as.data.frame(news)
df$chars <- nchar(as.character(df$news))
max(df$chars)
## [1] 5760
df <- as.data.frame(twitter)
df$chars <- nchar(as.character(df$twitter))
max(df$chars)
## [1] 213
library(tidytext)
library(dplyr)
#Convert the sample text into a dataframe
fullTextDf <- as.data.frame(fullText)
names(fullTextDf) <- "text"
#Tokenize into words
wordsDf <- unnest_tokens(fullTextDf, words, text, token='ngrams', n=1)
#Get count of single words
wordCounts <- count(wordsDf, words, sort=TRUE)
#Plot a bar chart of the 15 most frequent words (the full counts are kept in wordCounts)
topWords <- head(wordCounts, 15)
barplot(topWords$n, names.arg=topWords$words)
#Tokenize into bigrams
bigramsDf <- unnest_tokens(fullTextDf, bigrams, text, token='ngrams', n=2)
#Get count of bigrams
bigramCounts <- count(bigramsDf, bigrams, sort=TRUE)
#Plot a bar chart of the 15 most frequent word pairs (the full counts are kept in bigramCounts)
topBigrams <- head(bigramCounts, 15)
barplot(topBigrams$n, names.arg=topBigrams$bigrams)
#Tokenize into trigrams
trigramsDf <- unnest_tokens(fullTextDf, trigrams, text, token='ngrams', n=3)
#Get count of trigrams
trigramCounts <- count(trigramsDf, trigrams, sort=TRUE)
#Plot a bar chart of the 15 most frequent trigrams (the full counts are kept in trigramCounts)
topTrigrams <- head(trigramCounts, 15)
barplot(topTrigrams$n, names.arg=topTrigrams$trigrams)
‘Next word’ prediction will be done based on the bi-grams and tri-grams created above.
If the user supplies one word, the bi-grams that start with that word will be sorted in descending order of frequency, and the second word of the top three rows will be offered as three choices for the next word.
If the user supplies two words, the tri-grams that start with those two words will be sorted in descending order of frequency, and the third word of the top three rows will be offered as three choices for the next word.
If the user supplies more than two words, only the last two words of the input will be used, and they will be looked up in the tri-grams as described above.
In each of the above cases, if the word or words supplied are not found in the bi-gram or tri-gram list, a message will be displayed saying that there are no suggestions.
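The lookup rules above can be sketched in a few lines of base R. This is only an illustration, not the final algorithm: `predictNext` is a hypothetical helper name, and the two small count tables are made-up stand-ins shaped like `bigramCounts` and `trigramCounts` (an n-gram column plus a count column `n`).

```r
# Made-up count tables with the same column layout as bigramCounts / trigramCounts
bigramCounts <- data.frame(
  bigrams = c("of the", "of a", "of my", "of this", "in the"),
  n = c(50, 20, 10, 5, 40),
  stringsAsFactors = FALSE
)
trigramCounts <- data.frame(
  trigrams = c("one of the", "one of my", "one of those", "one of a", "a lot of"),
  n = c(30, 12, 8, 2, 25),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return up to k next-word suggestions for the typed input
predictNext <- function(input, bigramCounts, trigramCounts, k = 3) {
  # Lowercase the input and split it on anything that is not a letter or apostrophe
  words <- strsplit(tolower(trimws(input)), "[^a-z']+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) == 0) return("No suggestions")

  if (length(words) >= 2) {
    # Two or more words: use only the last two and look them up in the tri-grams
    prefix <- paste(tail(words, 2), collapse = " ")
    grams <- trigramCounts$trigrams
    counts <- trigramCounts$n
  } else {
    # One word: look it up in the bi-grams
    prefix <- words
    grams <- bigramCounts$bigrams
    counts <- bigramCounts$n
  }

  # Keep the n-grams that begin with the context followed by a space
  hit <- startsWith(grams, paste0(prefix, " "))
  if (!any(hit)) return("No suggestions")

  # Sort the matches by descending frequency and return the last word of the top k
  ord <- order(counts[hit], decreasing = TRUE)
  top <- head(grams[hit][ord], k)
  sub(".* ", "", top)
}

predictNext("of", bigramCounts, trigramCounts)           # "the" "a" "my"
predictNext("I am one of", bigramCounts, trigramCounts)  # "the" "my" "those"
predictNext("zebra", bigramCounts, trigramCounts)        # "No suggestions"
```

With the real data, the full count tables would be passed in instead of the toy ones. Filtering with `startsWith` on every call is the simplest option; splitting each n-gram into a context column and a next-word column ahead of time would likely be faster for large tables.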
In addition to the above, a Shiny app will be developed to capture user input and display predictions.
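A minimal sketch of what such an app could look like is given below, assuming the `shiny` package is installed. The layout, the input and output ids (`phrase`, `suggestions`) and the stand-in function `suggestWords` are all illustrative assumptions; in the real app the stand-in would be replaced by the n-gram lookup.

```r
library(shiny)

# Hypothetical stand-in for the real predictor so that the sketch runs on its own;
# it only knows suggestions for the single made-up context word "of"
suggestWords <- function(phrase) {
  lookup <- list("of" = c("the", "a", "my"))
  words <- strsplit(tolower(trimws(phrase)), " +")[[1]]
  if (length(words) == 0) return("No suggestions")
  hits <- lookup[[tail(words, 1)]]
  if (is.null(hits)) "No suggestions" else hits
}

# User interface: one text box for the input and one area for the suggestions
ui <- fluidPage(
  titlePanel("Next word prediction"),
  textInput("phrase", "Type a word or a phrase:"),
  textOutput("suggestions")
)

# Server logic: recompute the suggestions whenever the text box changes
server <- function(input, output, session) {
  output$suggestions <- renderText({
    req(input$phrase)  # show nothing until the user has typed something
    paste(suggestWords(input$phrase), collapse = ", ")
  })
}

# Build the app object; call runApp(app) to launch it in a browser
app <- shinyApp(ui, server)
```

Keeping the prediction logic in a plain function, separate from `ui` and `server`, makes it possible to test the predictor without starting the app.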