The goal of this project is to build a linguistic model to predict text. This report explores the distribution of and relationships between the words, tokens, and phrases in the text in order to build a final predictive model. First, I downloaded the training dataset; second, I applied preprocessing methods to clean the data; and third, I explored the distribution of and relationships between the words.
I downloaded the dataset from here. The data are collected from publicly available sources such as newspapers, personal blogs and Twitter. The texts are grouped into three main documents named en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt, and the language used is US English.
The en_US.blogs.txt document contains 899,288 lines and 206,824,382 characters.
The en_US.news.txt document contains 77,259 lines and 15,639,408 characters.
The en_US.twitter.txt document contains 2,360,148 lines and 162,096,031 characters.
#set the directory
setwd("C:/Users/hp Probook 4540s/Desktop/Data Science Capsone/final/en_US")
#Load packages
library(tm)
## Loading required package: NLP
library(RWeka)
library(wordcloud)
## Loading required package: RColorBrewer
library(RColorBrewer)
library(ggplot2)
##
## Attaching package: 'ggplot2'
##
## The following object is masked from 'package:NLP':
##
## annotate
library(magrittr)
library(Rgraphviz)
## Loading required package: graph
## Loading required package: grid
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
##
## The following objects are masked from 'package:base':
##
## format.pval, round.POSIXt, trunc.POSIXt, units
library(stringi)
#1.Import each of the 3 documents
twit<- readLines("en_US.twitter.txt",encoding="UTF-8")
blog <- readLines("en_US.blogs.txt", encoding="UTF-8")
news <- readLines("en_US.news.txt", encoding="UTF-8")
#2. Explore each of the 3 documents
stri_stats_general(twit)
stri_stats_general(blog)
stri_stats_general(news)
Due to the large size of the dataset, I selected a random sample of lines from each document and created a corpus object to process the data efficiently. I randomly selected 10% of the blog document, 15% of the news document and 4% of the twitter document.
#Data Sampling from each document
#sample 4% of twitter document
set.seed(1)
twit_sample<-twit[sample(length(twit),size=as.integer(0.04*length(twit)))]
bag <- file("sample/twit_sample.txt", "w")
writeLines(twit_sample, con = bag)
close(bag)
#Remove twit objects
rm(twit,twit_sample)
#sample 10% of blog document
set.seed(1)
blog_sample<-blog[sample(length(blog),size=as.integer(0.1*length(blog)))]
bag <- file("sample/blog_sample.txt", "w")
writeLines(blog_sample, con = bag)
close(bag)
#Remove blog objects
rm(blog,blog_sample)
#sample 15% of news document
set.seed(1)
news_sample<-news[sample(length(news),size=as.integer(0.15*length(news)))]
bag <- file("sample/news_sample.txt", "w")
writeLines(news_sample, con = bag)
close(bag)
#Remove news objects
rm(news,news_sample,bag)
#Create corpus object and clean data
sour<-DirSource("C:/Users/hp Probook 4540s/Desktop/Data Science Capsone/final/en_US/sample")
corpo <- Corpus(sour,readerControl = list(reader=readPlain,language = "en"))
I used the tm_map function from the tm package to clean the data for exploratory analysis. Data cleaning consisted of converting the text to lower case, removing URLs, punctuation, numbers, English stop words and extra white space, and stemming the words.
# Data processing
# content_transformer() wraps plain functions so tm_map keeps the corpus structure
corpo <- tm_map(corpo, content_transformer(tolower))
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
corpo <- tm_map(corpo, content_transformer(removeURL))
corpo <- tm_map(corpo, removePunctuation)
corpo <- tm_map(corpo, removeNumbers)
corpo <- tm_map(corpo, removeWords, stopwords("english"))
corpo <- tm_map(corpo, stemDocument)
corpo <- tm_map(corpo, stripWhitespace)
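As a quick sanity check, one can peek at the start of the first cleaned document. This is only an optional snippet added here, and it assumes the corpus documents are ordered alphabetically by file name, so that corpo[[1]] is the blog sample.
#Optional check: first 300 characters of the cleaned blog sample
substr(paste(content(corpo[[1]]), collapse = " "), 1, 300)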
Given that the distribution of and relationships between words may differ across the blog, news and twitter documents, I applied exploratory analysis to:
Represent the word cloud for each document
Find the distribution of words in each document and plot the 30 most frequent words (1-gram)
Find the distribution of consecutive pairs of words in each document and plot the 30 most frequent pairs of words (2-gram)
Find the distribution of consecutive triples of words in each document and plot the 30 most frequent triples of words (3-gram); a short illustration of the n-gram tokenizer follows this list
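To make the n-gram terminology concrete, here is a small illustration (not part of the analysis pipeline) of how RWeka's NGramTokenizer splits a sentence into consecutive pairs of words; the sentence itself is just an invented example.
#Illustration only: split an example sentence into consecutive word pairs (2-grams)
NGramTokenizer("this is a short example sentence", Weka_control(min = 2, max = 2))
#Expected to return the five consecutive pairs: "this is", "is a", "a short", "short example", "example sentence"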
#3. Term-Document-Matrix
matri<-TermDocumentMatrix(corpo)
matri
#Change TermDocumentMatrix to a matrix
matra=as.matrix(matri)
#Words frequency for each document
blog.freq=matra[,1]
news.freq=matra[,2]
twit.freq=matra[,3]
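As an additional quick check (not required for the plots below), tm's findFreqTerms() lists the terms whose total frequency exceeds a given threshold; the value of 1000 used here is an arbitrary choice.
#Quick look: terms appearing at least 1000 times across the sampled corpus
findFreqTerms(matri, lowfreq = 1000)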
Figures 1.1 to 1.3 show the word cloud for each document.
Figures 2.1 to 2.3 show the distributions of the 30 most frequent words (1-gram) for each document.
Figures 3.1 to 3.3 show the distributions of the 30 most frequent pairs of consecutive words (2-gram) for each document.
Figures 4.1 to 4.3 show the distributions of the 30 most frequent triples of consecutive words (3-gram) for each document.
#Wordcloud
set.seed(1)
blog.freq <- sort(blog.freq, decreasing = T)
wordcloud(words = names(blog.freq), freq = blog.freq, max.words = 60,
random.order = F, scale = c(3,1),rot.per = 0.35,use.r.layout = F, colors = brewer.pal(8,"Dark2"))
title("Figure 1.1 Blog Wordcloud")
set.seed(1)
news.freq <- sort(news.freq, decreasing = T)
wordcloud(words = names(news.freq), freq = news.freq, max.words = 60,
random.order = F, scale = c(3,1),rot.per = 0.35,use.r.layout = F, colors = brewer.pal(8,"Dark2"))
title("Figure 1.2 news Wordcloud")
set.seed(1)
twit.freq <- sort(twit.freq, decreasing = T)
wordcloud(words = names(twit.freq), freq = twit.freq, max.words = 60,
random.order = F, scale = c(3,1),rot.per = 0.35,use.r.layout = F, colors = brewer.pal(8,"Dark2"))
title("Figure 1.3 twitter Wordcloud")
#Select the 30 most frequent words
blog.freq<- sort((blog.freq),decreasing = TRUE)[1:30]
news.freq<- sort((news.freq),decreasing = TRUE)[1:30]
twit.freq<- sort((twit.freq),decreasing = TRUE)[1:30]
#Unique Words frequencies
df <- data.frame(term = names(blog.freq), freq = blog.freq)
library(ggplot2)
ggplot(df, aes(x = term, y = freq)) + geom_bar(stat = "identity",fill="darkmagenta") +
xlab("Terms") + ylab("Count") + coord_flip()+ggtitle("Figure 2.1 Unigram frequency in Blog")
df <- data.frame(term = names(news.freq), freq = news.freq)
ggplot(df, aes(x = term, y = freq)) + geom_bar(stat = "identity",fill="darkmagenta") +
xlab("Terms") + ylab("Count") + coord_flip()+ggtitle("Figure 2.2 Unigram frequency in news")
df <- data.frame(term = names(twit.freq), freq = twit.freq)
ggplot(df, aes(x = term, y = freq)) + geom_bar(stat = "identity",fill="darkmagenta") +
xlab("Terms") + ylab("Count") + coord_flip()+ggtitle("Figure 2.3 Unigram frequency in twitter")
rm(df,matra,blog.freq,news.freq,twit.freq)
# 2-Gram exploratory analysis
BigramTokenizer <- function(x)NGramTokenizer(x,Weka_control(min = 2, max = 2))
matri<-TermDocumentMatrix(corpo, control = list(tokenize = BigramTokenizer))
#Change TermDocumentMatrix to a matrix
matra=as.matrix(matri)
#Words frequency for each document
blog.freq=matra[,1]
news.freq=matra[,2]
twit.freq=matra[,3]
#Select the 30 most frequent words
blog.freq<- sort((blog.freq),decreasing = TRUE)[1:30]
news.freq<- sort((news.freq),decreasing = TRUE)[1:30]
twit.freq<- sort((twit.freq),decreasing = TRUE)[1:30]
#Unique Words frequencies
df <- data.frame(term = names(blog.freq), freq = blog.freq)
library(ggplot2)
ggplot(df, aes(x = term, y = freq)) + geom_bar(stat = "identity",fill="darkmagenta") +
xlab("Terms") + ylab("Count") + coord_flip()+ggtitle("Figure 3.1 2-gram frequency in Blog")
df <- data.frame(term = names(news.freq), freq = news.freq)
ggplot(df, aes(x = term, y = freq)) + geom_bar(stat = "identity",fill="darkmagenta") +
xlab("Terms") + ylab("Count") + coord_flip()+ggtitle("Figure 3.2 2-gram frequency in news")
df <- data.frame(term = names(twit.freq), freq = twit.freq)
ggplot(df, aes(x = term, y = freq)) + geom_bar(stat = "identity",fill="darkmagenta") +
xlab("Terms") + ylab("Count") + coord_flip()+ggtitle(" Figure 3.3 2-gram frequency in twitter")
rm(df,matra,blog.freq,news.freq,twit.freq)
# 3-Gram exploratory analysis
TrigramTokenizer <- function(x)NGramTokenizer(x,Weka_control(min = 3, max = 3))
matri<-TermDocumentMatrix(corpo, control = list(tokenize = TrigramTokenizer))
#Change TermDocumentMatrix to a matrix
matra=as.matrix(matri)
#Words frequency for each document
blog.freq=matra[,1]
news.freq=matra[,2]
twit.freq=matra[,3]
#Select the 30 most frequent words
blog.freq<- sort((blog.freq),decreasing = TRUE)[1:30]
news.freq<- sort((news.freq),decreasing = TRUE)[1:30]
twit.freq<- sort((twit.freq),decreasing = TRUE)[1:30]
#Unique Words frequencies
df <- data.frame(term = names(blog.freq), freq = blog.freq)
library(ggplot2)
ggplot(df, aes(x = term, y = freq)) + geom_bar(stat = "identity",fill="darkmagenta") +
xlab("Terms") + ylab("Count") + coord_flip()+ggtitle("Figure 4.1 3-gram frequency in Blog")
df <- data.frame(term = names(news.freq), freq = news.freq)
ggplot(df, aes(x = term, y = freq)) + geom_bar(stat = "identity",fill="darkmagenta") +
xlab("Terms") + ylab("Count") + coord_flip()+ggtitle(" Figure 4.2 3-gram frequency in news")
df <- data.frame(term = names(twit.freq), freq = twit.freq)
ggplot(df, aes(x = term, y = freq)) + geom_bar(stat = "identity",fill="darkmagenta") +
xlab("Terms") + ylab("Count") + coord_flip()+ggtitle("Figure 4.3 3-gram frequency in twitter")
rm(df,matra,blog.freq,news.freq,twit.freq)
The purpose of this report was to describe the distribution of and relationships between the words in the training dataset in order to build a linguistic model to predict text. Exploratory data analysis suggested that the distributions differ across the three types of documents (blog, news and twitter), so this difference has to be taken into account when building the predictive model.
The goal of this project is to build a predictive model. The next steps are:
Build a predictive model based on the exploratory analysis (a minimal sketch of the idea is shown below)
Deploy the model as a Shiny app
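As a minimal sketch of the next step (a sketch only, not the final model), the n-gram frequencies explored above can feed a simple back-off predictor. The data frames trigram_freq and bigram_freq below are hypothetical; they are assumed to hold a prefix column (the first one or two words of each n-gram), a word column (the last word) and a freq column (the n-gram count).
#Sketch of a back-off next-word predictor (assumes hypothetical trigram_freq
#and bigram_freq data frames with columns prefix, word and freq)
predict_next_word <- function(phrase, trigram_freq, bigram_freq, n = 3) {
  words <- unlist(strsplit(tolower(phrase), "\\s+"))
  #Try the trigram table first: condition on the last two words
  if (length(words) >= 2) {
    prefix2 <- paste(tail(words, 2), collapse = " ")
    hits <- trigram_freq[trigram_freq$prefix == prefix2, ]
    if (nrow(hits) > 0) return(head(hits[order(-hits$freq), "word"], n))
  }
  #Back off to the bigram table: condition on the last word only
  prefix1 <- tail(words, 1)
  hits <- bigram_freq[bigram_freq$prefix == prefix1, ]
  head(hits[order(-hits$freq), "word"], n)
}
A complete model will also need to handle prefixes that were never observed in the sample (for example by backing off further to unigram frequencies) and to be compact enough to run inside a Shiny app.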