The goal of this project is to build a linguistic model to predict text. This report explores the distribution of and relationships between the words, tokens, and phrases in the text in order to build a final predictive model. First, I downloaded the training dataset; second, I applied preprocessing methods to clean the data; and third, I explored the distribution of and relationships between the words.
I downloaded the dataset from here. The data are collected from publicly available sources such as newspapers, personal blogs and Twitter. The texts are grouped into three main documents named en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt, and the language used is US English.
The en_US.blogs.txt document contains 899,288 lines and 206,824,382 characters.
The en_US.news.txt document contains 77,259 lines and 15,639,408 characters.
The en_US.twitter.txt document contains 2,360,148 lines and 162,096,031 characters.
#set the directory
setwd("C:/Users/hp Probook 4540s/Desktop/Data Science Capsone/final/en_US")
#Load packages
library(tm)
## Loading required package: NLP
library(RWeka)
library(wordcloud)
## Loading required package: RColorBrewer
library(RColorBrewer)
library(ggplot2)
##
## Attaching package: 'ggplot2'
##
## The following object is masked from 'package:NLP':
##
## annotate
library(magrittr)
library(Rgraphviz)
## Loading required package: graph
## Loading required package: grid
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
##
## The following objects are masked from 'package:base':
##
## format.pval, round.POSIXt, trunc.POSIXt, units
library(stringi)
#1.Import each of the 3 documents
twit<- readLines("en_US.twitter.txt",encoding="UTF-8")
blog <- readLines("en_US.blogs.txt", encoding="UTF-8")
news <- readLines("en_US.news.txt", encoding="UTF-8")
#2. Explore each of the 3 documents
stri_stats_general(twit)
stri_stats_general(blog)
stri_stats_general(news)
Due to the large size of the dataset, I selected a random sample of lines from each document and created a corpus object to process the data efficiently. I randomly selected 10% of the blog document, 15% of the news document and 4% of the twitter document.
#Data Sampling from each document
#sample 4% of twitter document
set.seed(1)
twit_sample<-twit[sample(length(twit),size=as.integer(0.04*length(twit)))]
bag <- file("sample/twit_sample.txt", "w")
writeLines(twit_sample, con = bag)
close(bag)
#Remove twit objects
rm(twit,twit_sample)
#sample 10% of blog document
set.seed(1)
blog_sample<-blog[sample(length(blog),size=as.integer(0.1*length(blog)))]
bag <- file("sample/blog_sample.txt", "w")
writeLines(blog_sample, con = bag)
close(bag)
#Remove blog objects
rm(blog,blog_sample)
#sample 15% of news document
set.seed(1)
news_sample<-news[sample(length(news),size=as.integer(0.15*length(news)))]
bag <- file("sample/news_sample.txt", "w")
writeLines(news_sample, con = bag)
close(bag)
#Remove news objects
rm(news,news_sample,bag)
#Create corpus object and clean data
sour<-DirSource("C:/Users/hp Probook 4540s/Desktop/Data Science Capsone/final/en_US/sample")
corpo <- Corpus(sour,readerControl = list(reader=readPlain,language = "en"))
I used the tm_map function from the tm package to clean the data for exploratory analysis. Data cleaning consisted of converting the text to lower case, removing URLs, punctuation, numbers, English stop words and extra white space, and stemming the words.
# Data processing
# content_transformer() wraps plain functions so tm_map keeps the corpus structure
corpo <- tm_map(corpo, content_transformer(tolower))
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
corpo <- tm_map(corpo, content_transformer(removeURL))
corpo <- tm_map(corpo, removePunctuation)
corpo <- tm_map(corpo, removeNumbers)
corpo <- tm_map(corpo, removeWords, stopwords("english"))
corpo <- tm_map(corpo, stemDocument)
corpo <- tm_map(corpo, stripWhitespace)
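As a quick sanity check, one can peek at the start of the first cleaned document. This is only an optional snippet added here, and it assumes the corpus documents are ordered alphabetically by file name, so that corpo[[1]] is the blog sample.
#Optional check: first 300 characters of the cleaned blog sample
substr(paste(content(corpo[[1]]), collapse = " "), 1, 300)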
Given that the distribution of and relationships between words may differ across the blog, news and twitter documents, I applied exploratory analysis to:
Represent the word cloud for each document
Find the distribution of words in each document and plot the 30 most frequent words (1-gram)
Find the distribution of consecutive pairs of words in each document and plot the 30 most frequent pairs of words (2-gram)
Find the distribution of consecutive triples of words in each document and plot the 30 most frequent triples of words (3-gram); a short illustration of the n-gram tokenizer follows this list
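To make the n-gram terminology concrete, here is a small illustration (not part of the analysis pipeline) of how RWeka's NGramTokenizer splits a sentence into consecutive pairs of words; the sentence itself is just an invented example.
#Illustration only: split an example sentence into consecutive word pairs (2-grams)
NGramTokenizer("this is a short example sentence", Weka_control(min = 2, max = 2))
#Expected to return the five consecutive pairs: "this is", "is a", "a short", "short example", "example sentence"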
#3. Term-Document-Matrix
matri<-TermDocumentMatrix(corpo)
matri
#Change TermDocumentMatrix to a matrix
matra=as.matrix(matri)
#Words frequency for each document
blog.freq=matra[,1]
news.freq=matra[,2]
twit.freq=matra[,3]
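As an additional quick check (not required for the plots below), tm's findFreqTerms() lists the terms whose total frequency exceeds a given threshold; the value of 1000 used here is an arbitrary choice.
#Quick look: terms appearing at least 1000 times across the sampled corpus
findFreqTerms(matri, lowfreq = 1000)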
Figures 1.1 to 1.3 show the word cloud for each document.
Figures 2.1 to 2.3 show the distributions of the 30 most frequent words (1-gram) for each document.
Figures 3.1 to 3.3 show the distributions of the 30 most frequent pairs of consecutive words (2-gram) for each document.
Figures 4.1 to 4.3 show the distributions of the 30 most frequent triples of consecutive words (3-gram) for each document.
#Wordcloud
set.seed(1)
blog.freq <- sort(blog.freq, decreasing = T)
wordcloud(words = names(blog.freq), freq = blog.freq, max.words = 60,
random.order = F, scale = c(3,1),rot.per = 0.35,use.r.layout = F, colors = brewer.pal(8,"Dark2"))
title("Figure 1.1 Blog Wordcloud")
set.seed(1)
news.freq <- sort(news.freq, decreasing = T)
wordcloud(words = names(news.freq), freq = news.freq, max.words = 60,
random.order = F, scale = c(3,1),rot.per = 0.35,use.r.layout = F, colors = brewer.pal(8,"Dark2"))
title("Figure 1.2 news Wordcloud")
set.seed(1)
twit.freq <- sort(twit.freq, decreasing = T)
wordcloud(words = names(twit.freq), freq = twit.freq, max.words = 60,
random.order = F, scale = c(3,1),rot.per = 0.35,use.r.layout = F, colors = brewer.pal(8,"Dark2"))
title("Figure 1.3 twitter Wordcloud")
#Select the 30 most frequent words
blog.freq<- sort((blog.freq),decreasing = TRUE)[1:30]
news.freq<- sort((news.freq),decreasing = TRUE)[1:30]
twit.freq<- sort((twit.freq),decreasing = TRUE)[1:30]
#Unique Words frequencies
df <- data.frame(term = names(blog.freq), freq = blog.freq)
library(ggplot2)
ggplot(df, aes(x = term, y = freq)) + geom_bar(stat = "identity",fill="darkmagenta") +
xlab("Terms") + ylab("Count") + coord_flip()+ggtitle("Figure 2.1 Unigram frequency in Blog")
df <- data.frame(term = names(news.freq), freq = news.freq)
ggplot(df, aes(x = term, y = freq)) + geom_bar(stat = "identity",fill="darkmagenta") +
xlab("Terms") + ylab("Count") + coord_flip()+ggtitle("Figure 2.2 Unigram frequency in news")
df <- data.frame(term = names(twit.freq), freq = twit.freq)
ggplot(df, aes(x = term, y = freq)) + geom_bar(stat = "identity",fill="darkmagenta") +
xlab("Terms") + ylab("Count") + coord_flip()+ggtitle("Figure 2.3 Unigram frequency in twitter")
rm(df,matra,blog.freq,news.freq,twit.freq)
# 2-Gram exploratory analysis
BigramTokenizer <- function(x)NGramTokenizer(x,Weka_control(min = 2, max = 2))
matri<-TermDocumentMatrix(corpo, control = list(tokenize = BigramTokenizer))
#Change TermDocumentMatrix to a matrix
matra=as.matrix(matri)
#Words frequency for each document
blog.freq=matra[,1]
news.freq=matra[,2]
twit.freq=matra[,3]
#Select the 30 most frequent words
blog.freq<- sort((blog.freq),decreasing = TRUE)[1:30]
news.freq<- sort((news.freq),decreasing = TRUE)[1:30]
twit.freq<- sort((twit.freq),decreasing = TRUE)[1:30]
#Unique Words frequencies
df <- data.frame(term = names(blog.freq), freq = blog.freq)
library(ggplot2)
ggplot(df, aes(x = term, y = freq)) + geom_bar(stat = "identity",fill="darkmagenta") +
xlab("Terms") + ylab("Count") + coord_flip()+ggtitle("Figure 3.1 2-gram frequency in Blog")
df <- data.frame(term = names(news.freq), freq = news.freq)
ggplot(df, aes(x = term, y = freq)) + geom_bar(stat = "identity",fill="darkmagenta") +
xlab("Terms") + ylab("Count") + coord_flip()+ggtitle("Figure 3.2 2-gram frequency in news")
df <- data.frame(term = names(twit.freq), freq = twit.freq)
ggplot(df, aes(x = term, y = freq)) + geom_bar(stat = "identity",fill="darkmagenta") +
xlab("Terms") + ylab("Count") + coord_flip()+ggtitle(" Figure 3.3 2-gram frequency in twitter")
rm(df,matra,blog.freq,news.freq,twit.freq)
# 3-Gram exploratory analysis
TrigramTokenizer <- function(x)NGramTokenizer(x,Weka_control(min = 3, max = 3))
matri<-TermDocumentMatrix(corpo, control = list(tokenize = TrigramTokenizer))
#Change TermDocumentMatrix to a matrix
matra=as.matrix(matri)
#Words frequency for each document
blog.freq=matra[,1]
news.freq=matra[,2]
twit.freq=matra[,3]
#Select the 30 most frequent words
blog.freq<- sort((blog.freq),decreasing = TRUE)[1:30]
news.freq<- sort((news.freq),decreasing = TRUE)[1:30]
twit.freq<- sort((twit.freq),decreasing = TRUE)[1:30]
#Unique Words frequencies
df <- data.frame(term = names(blog.freq), freq = blog.freq)
library(ggplot2)
ggplot(df, aes(x = term, y = freq)) + geom_bar(stat = "identity",fill="darkmagenta") +
xlab("Terms") + ylab("Count") + coord_flip()+ggtitle("Figure 4.1 3-gram frequency in Blog")
df <- data.frame(term = names(news.freq), freq = news.freq)
ggplot(df, aes(x = term, y = freq)) + geom_bar(stat = "identity",fill="darkmagenta") +
xlab("Terms") + ylab("Count") + coord_flip()+ggtitle(" Figure 4.2 3-gram frequency in news")
df <- data.frame(term = names(twit.freq), freq = twit.freq)
ggplot(df, aes(x = term, y = freq)) + geom_bar(stat = "identity",fill="darkmagenta") +
xlab("Terms") + ylab("Count") + coord_flip()+ggtitle("Figure 4.3 3-gram frequency in twitter")
rm(df,matra,blog.freq,news.freq,twit.freq)
The purpose of this report was to describe the distribution of and relationships between the words in the training dataset in order to build a linguistic model to predict text. Exploratory data analysis suggested that the distributions differ across the three types of documents (blog, news and twitter), so this difference has to be taken into account when building the predictive model.
The goal of this project is to build a predictive model. The next steps are:
Build a predictive model based on the exploratory analysis (a minimal sketch of the idea is shown below)
Deploy the model as a Shiny app
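As a minimal sketch of the next step (a sketch only, not the final model), the n-gram frequencies explored above can feed a simple back-off predictor. The data frames trigram_freq and bigram_freq below are hypothetical; they are assumed to hold a prefix column (the first one or two words of each n-gram), a word column (the last word) and a freq column (the n-gram count).
#Sketch of a back-off next-word predictor (assumes hypothetical trigram_freq
#and bigram_freq data frames with columns prefix, word and freq)
predict_next_word <- function(phrase, trigram_freq, bigram_freq, n = 3) {
  words <- unlist(strsplit(tolower(phrase), "\\s+"))
  #Try the trigram table first: condition on the last two words
  if (length(words) >= 2) {
    prefix2 <- paste(tail(words, 2), collapse = " ")
    hits <- trigram_freq[trigram_freq$prefix == prefix2, ]
    if (nrow(hits) > 0) return(head(hits[order(-hits$freq), "word"], n))
  }
  #Back off to the bigram table: condition on the last word only
  prefix1 <- tail(words, 1)
  hits <- bigram_freq[bigram_freq$prefix == prefix1, ]
  head(hits[order(-hits$freq), "word"], n)
}
A complete model will also need to handle prefixes that were never observed in the sample (for example by backing off further to unigram frequencies) and to be compact enough to run inside a Shiny app.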