This is the first milestone report for the Data Science Capstone project. The goal of the project is to apply data science in the area of natural language processing (NLP). In this milestone, we download the required data and familiarize ourselves with the basics of natural language processing. As part of this milestone, we perform some exploratory analysis of the data and report some statistics about it. (For this milestone report, we have referred to the “Guide to the ngram Package”.)
library(ngram)
library(tm)
## Loading required package: NLP
library(tokenizers)
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
Read the data from the blogs, tweets, and news files:
# skipNul = TRUE avoids warnings from embedded nul characters (notably in the news file)
con_blogs <- file("D:/D/Training/Coursera/Capstone/Data_To_be_Used/Coursera-SwiftKey/final/en_US/en_US.blogs.txt", "r")
blogs <- readLines(con_blogs, skipNul = TRUE)
close(con_blogs)
con_tweets <- file("D:/D/Training/Coursera/Capstone/Data_To_be_Used/Coursera-SwiftKey/final/en_US/en_US.twitter.txt", "r")
tweets <- readLines(con_tweets, skipNul = TRUE)
close(con_tweets)
con_news <- file("D:/D/Training/Coursera/Capstone/Data_To_be_Used/Coursera-SwiftKey/final/en_US/en_US.news.txt", "r")
news <- readLines(con_news, skipNul = TRUE)
close(con_news)
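Before sampling, we can report some basic statistics about each file. The sketch below (basic_stats is a helper introduced here for illustration) counts lines, words, and characters with base R:
# Basic statistics for each file: number of lines, words, and characters
basic_stats <- function(x) {
  c(lines = length(x),
    words = sum(vapply(strsplit(x, "\\s+"), length, integer(1))),
    chars = sum(nchar(x)))
}
rbind(blogs  = basic_stats(blogs),
      news   = basic_stats(news),
      tweets = basic_stats(tweets))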
Since the data set is huge, processing it in full takes a very long time. Hence, for this project we work with a sample of the data: we keep each line of a file with a probability of 0.1.
# Keep each line with the given probability (independent Bernoulli draws)
sampledata <- function(filedata, percentage)
{
  filedata[as.logical(rbinom(length(filedata), 1, percentage))]
}
set.seed(1234)  # fix the seed so the sample is reproducible
percentage <- 0.1
blogs <- sampledata(blogs, percentage)
news <- sampledata(news, percentage)
tweets <- sampledata(tweets, percentage)
Next, we create our corpus by combining the data from blogs, tweets, and news:
corpuslist <- c(blogs, tweets, news)
# wrapping in list() collapses the three sources into a single document
corpusData <- Corpus(VectorSource(list(corpuslist)))
summary(corpusData)
## Length Class Mode
## 1 2 PlainTextDocument list
The corpus has a lot of words and characters that are not relevant for our analysis, so we remove them:
corpusData <- tm_map(corpusData, content_transformer(tolower))       # lower-case everything
corpusData <- tm_map(corpusData, removePunctuation)                  # drop punctuation
corpusData <- tm_map(corpusData, removeNumbers)                      # drop digits
corpusData <- tm_map(corpusData, removeWords, stopwords("english"))  # drop common stop words
corpusData <- tm_map(corpusData, stripWhitespace)                    # collapse the whitespace left behind
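To see what these transformations do, here is a toy example on a made-up sentence (not part of the corpus):
# Toy example showing the effect of the cleaning pipeline
demo <- Corpus(VectorSource("The QUICK brown fox jumped 123 times, didn't it?"))
demo <- tm_map(demo, content_transformer(tolower))
demo <- tm_map(demo, removePunctuation)
demo <- tm_map(demo, removeNumbers)
demo <- tm_map(demo, removeWords, stopwords("english"))
demo <- tm_map(demo, stripWhitespace)
content(demo[[1]])  # roughly "quick brown fox jumped times didnt"
# note: "didn't" survives as "didnt" because punctuation is stripped before stop-word removal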
Extract the text from all documents in the corpus as a single string:
corpus_string <- concatenate(lapply(corpusData, "[", 1))
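As a quick sanity check on the size of the cleaned sample, the ngram package's wordcount() gives the total number of words in the string:
wordcount(corpus_string)  # total number of words in the combined string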
Use ngram to find which single words (1-grams) have the highest frequency, and plot the 10 most commonly used words in the corpus:
ng1 <- ngram(corpus_string, n=1)
phrasetable_ng1 <- get.phrasetable(ng1)
plot_ng1 <- ggplot(data = phrasetable_ng1[1:10, ],
                   aes(x = reorder(ngrams, -freq), y = freq)) +  # keep bars in frequency order
    geom_bar(stat = "identity") +
    xlab("1-grams") +
    ylab("Frequency") +
    ggtitle("Frequency of the 10 most frequent 1-grams")
print(plot_ng1)
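For reference, get.phrasetable() returns a data frame sorted by frequency; its structure can be inspected with:
head(phrasetable_ng1, 3)  # columns: ngrams, freq, prop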
Use ngram to find which pairs of words (2-grams) have the highest frequency, and plot the 10 most commonly used word pairs in the corpus:
ng2 <- ngram(corpus_string, n=2)
phrasetable_ng2 <- get.phrasetable(ng2)
plot_ng2 <- ggplot(data = phrasetable_ng2[1:10, ],
                   aes(x = reorder(ngrams, -freq), y = freq)) +
    geom_bar(stat = "identity") +
    xlab("2-grams") +
    ylab("Frequency") +
    ggtitle("Frequency of the 10 most frequent 2-grams")
print(plot_ng2)
Use ngram to find which sequences of three words (3-grams, order preserved) have the highest frequency, and plot the most commonly used three-word sequences in the corpus:
ng3 <- ngram(corpus_string, n=3)
phrasetable_ng3 <- get.phrasetable(ng3)
plot_ng3 <- ggplot(data = phrasetable_ng3[1:7, ],
                   aes(x = reorder(ngrams, -freq), y = freq)) +
    geom_bar(stat = "identity") +
    xlab("3-grams") +
    ylab("Frequency") +
    ggtitle("Frequency of the 7 most frequent 3-grams")
print(plot_ng3)
The goal of this project is to develop a word-prediction app. To achieve this, we will take a string as user input and predict the next word (or words) based on the probabilities of the n-grams, as sketched below. The prediction model will be incorporated into a Shiny app, which will provide a front end for user input.
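As an illustration of the idea (a minimal sketch, not the final model; predict_next is a hypothetical helper), the next word after a given word could be looked up as the most frequent 2-gram that starts with it:
# Minimal sketch of next-word prediction from the 2-gram table (assumed approach)
predict_next <- function(word, bigram_table) {
  # phrases in the table look like "of the " (words separated by spaces)
  matches <- bigram_table[grepl(paste0("^", word, " "), bigram_table$ngrams), ]
  if (nrow(matches) == 0) return(NA_character_)
  # the table is already sorted by frequency, so the first match is the best guess
  strsplit(trimws(matches$ngrams[1]), " ")[[1]][2]
}
predict_next("one", phrasetable_ng2)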