This is the first milestone report for the Data Science Capstone project. The goal of the project is to apply data science in the area of natural language processing (NLP). In this milestone, we download the required data and familiarize ourselves with the basics of natural language processing. As part of this milestone, we perform some exploratory analysis of the data and report some statistics about it. (For this milestone report, we have referred to the “Guide to the ngram Package”.)
library(ngram)
library(tm)
## Loading required package: NLP
library(tokenizers)
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
Read the data from the blogs, tweets, and news files:
# skipNul = TRUE avoids warnings from embedded nul characters (notably in the news file)
con_blogs <- file("D:/D/Training/Coursera/Capstone/Data_To_be_Used/Coursera-SwiftKey/final/en_US/en_US.blogs.txt", "r")
blogs <- readLines(con_blogs, skipNul = TRUE)
close(con_blogs)
con_tweets <- file("D:/D/Training/Coursera/Capstone/Data_To_be_Used/Coursera-SwiftKey/final/en_US/en_US.twitter.txt", "r")
tweets <- readLines(con_tweets, skipNul = TRUE)
close(con_tweets)
con_news <- file("D:/D/Training/Coursera/Capstone/Data_To_be_Used/Coursera-SwiftKey/final/en_US/en_US.news.txt", "r")
news <- readLines(con_news, skipNul = TRUE)
close(con_news)
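Before sampling, we can report some basic statistics about each file. The sketch below (basic_stats is a helper introduced here for illustration) counts lines, words, and characters with base R:
# Basic statistics for each file: number of lines, words, and characters
basic_stats <- function(x) {
  c(lines = length(x),
    words = sum(vapply(strsplit(x, "\\s+"), length, integer(1))),
    chars = sum(nchar(x)))
}
rbind(blogs  = basic_stats(blogs),
      news   = basic_stats(news),
      tweets = basic_stats(tweets))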
Since the data set is huge, processing it in full takes a very long time. Hence, for this project we work with a sample of the data: we keep each line of a file with a probability of 0.1.
# Keep each line with the given probability (independent Bernoulli draws)
sampledata <- function(filedata, percentage)
{
  filedata[as.logical(rbinom(length(filedata), 1, percentage))]
}
set.seed(1234)  # fix the seed so the sample is reproducible
percentage <- 0.1
blogs <- sampledata(blogs, percentage)
news <- sampledata(news, percentage)
tweets <- sampledata(tweets, percentage)
Next, we create our corpus by combining the data from blogs, tweets, and news:
corpuslist <- c(blogs, tweets, news)
# wrapping in list() collapses the three sources into a single document
corpusData <- Corpus(VectorSource(list(corpuslist)))
summary(corpusData)
## Length Class Mode
## 1 2 PlainTextDocument list
The corpus has a lot of words and characters that are not relevant for our analysis, so we remove them:
corpusData <- tm_map(corpusData, content_transformer(tolower))       # lower-case everything
corpusData <- tm_map(corpusData, removePunctuation)                  # drop punctuation
corpusData <- tm_map(corpusData, removeNumbers)                      # drop digits
corpusData <- tm_map(corpusData, removeWords, stopwords("english"))  # drop common stop words
corpusData <- tm_map(corpusData, stripWhitespace)                    # collapse the whitespace left behind
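To see what these transformations do, here is a toy example on a made-up sentence (not part of the corpus):
# Toy example showing the effect of the cleaning pipeline
demo <- Corpus(VectorSource("The QUICK brown fox jumped 123 times, didn't it?"))
demo <- tm_map(demo, content_transformer(tolower))
demo <- tm_map(demo, removePunctuation)
demo <- tm_map(demo, removeNumbers)
demo <- tm_map(demo, removeWords, stopwords("english"))
demo <- tm_map(demo, stripWhitespace)
content(demo[[1]])  # roughly "quick brown fox jumped times didnt"
# note: "didn't" survives as "didnt" because punctuation is stripped before stop-word removal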
Extract the text from all documents in the corpus as a single string:
corpus_string <- concatenate(lapply(corpusData, "[", 1))
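As a quick sanity check on the size of the cleaned sample, the ngram package's wordcount() gives the total number of words in the string:
wordcount(corpus_string)  # total number of words in the combined string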
Use ngram to find which single words (1-grams) have the highest frequency, and plot the 10 most commonly used words in the corpus:
ng1 <- ngram(corpus_string, n=1)
phrasetable_ng1 <- get.phrasetable(ng1)
plot_ng1 <- ggplot(data = phrasetable_ng1[1:10, ],
                   aes(x = reorder(ngrams, -freq), y = freq)) +  # keep bars in frequency order
    geom_bar(stat = "identity") +
    xlab("1-grams") +
    ylab("Frequency") +
    ggtitle("Frequency of the 10 most frequent 1-grams")
print(plot_ng1)
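For reference, get.phrasetable() returns a data frame sorted by frequency; its structure can be inspected with:
head(phrasetable_ng1, 3)  # columns: ngrams, freq, prop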
Use ngram to find which pairs of words (2-grams) have the highest frequency, and plot the 10 most commonly used word pairs in the corpus:
ng2 <- ngram(corpus_string, n=2)
phrasetable_ng2 <- get.phrasetable(ng2)
plot_ng2 <- ggplot(data = phrasetable_ng2[1:10, ],
                   aes(x = reorder(ngrams, -freq), y = freq)) +
    geom_bar(stat = "identity") +
    xlab("2-grams") +
    ylab("Frequency") +
    ggtitle("Frequency of the 10 most frequent 2-grams")
print(plot_ng2)
Use ngram to find which sequences of three words (3-grams, order preserved) have the highest frequency, and plot the most commonly used three-word sequences in the corpus:
ng3 <- ngram(corpus_string, n=3)
phrasetable_ng3 <- get.phrasetable(ng3)
plot_ng3 <- ggplot(data = phrasetable_ng3[1:7, ],
                   aes(x = reorder(ngrams, -freq), y = freq)) +
    geom_bar(stat = "identity") +
    xlab("3-grams") +
    ylab("Frequency") +
    ggtitle("Frequency of the 7 most frequent 3-grams")
print(plot_ng3)
The goal of this project is to develop a word-prediction app. To achieve this, we will take a string as user input and predict the next word (or words) based on the probabilities of the n-grams, as sketched below. The prediction model will be incorporated into a Shiny app, which will provide a front end for user input.
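As an illustration of the idea (a minimal sketch, not the final model; predict_next is a hypothetical helper), the next word after a given word could be looked up as the most frequent 2-gram that starts with it:
# Minimal sketch of next-word prediction from the 2-gram table (assumed approach)
predict_next <- function(word, bigram_table) {
  # phrases in the table look like "of the " (words separated by spaces)
  matches <- bigram_table[grepl(paste0("^", word, " "), bigram_table$ngrams), ]
  if (nrow(matches) == 0) return(NA_character_)
  # the table is already sorted by frequency, so the first match is the best guess
  strsplit(trimws(matches$ngrams[1]), " ")[[1]][2]
}
predict_next("one", phrasetable_ng2)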