The goals of this project are to:

1. Demonstrate that the data has been downloaded and successfully loaded.
2. Create a basic report of summary statistics about the data sets.
3. Report any interesting findings amassed so far.
4. Get feedback on the plans for creating a prediction algorithm and Shiny app.
In this project, the following tasks are performed:

1. Load the required packages and the data (a download sketch is included below).
2. Summarize the data.
3. Clean the data and select a sample.
4. Build N-grams and their graphical representations.
5. Report findings and future plans.
library(NLP)
library(tm)
library(corpus)
library(ngram)
library(stringi)
library(RWeka)
library(RColorBrewer)
library(wordcloud)
library(ggplot2)
library(kableExtra)
library(xtable)
setwd("~/Coursera/Data Science Capstone")
blogs<-readLines("Coursera-SwiftKey/final/en_US/en_US.blogs.txt",warn=FALSE,encoding="UTF-8")
news<-readLines("Coursera-SwiftKey/final/en_US/en_US.news.txt",warn=FALSE,encoding="UTF-8")
twitter<-readLines("Coursera-SwiftKey/final/en_US/en_US.twitter.txt",warn=FALSE,encoding="UTF-8")
# Number of lines
nrow_blogs<-length(blogs)
nrow_news<-length(news)
nrow_twitter<-length(twitter)
# Number of characters
nchar_blogs<-sum(nchar(blogs))
nchar_news<-sum(nchar(news))
nchar_twitter<-sum(nchar(twitter))
# Number of words
nw_blogs<-stri_stats_latex(blogs)['Words']
nw_news<-stri_stats_latex(news)['Words']
nw_twitter<-stri_stats_latex(twitter)['Words']
# File size in MB
blogs_size<-file.size("Coursera-SwiftKey/final/en_US/en_US.blogs.txt")/1048576
news_size<-file.size("Coursera-SwiftKey/final/en_US/en_US.news.txt")/1048576
twitter_size<-file.size("Coursera-SwiftKey/final/en_US/en_US.twitter.txt")/1048576
# Summary table
summary_table <- data.frame("file" = c("Blogs", "News", "Twitter"),
                            "Lines" = c(nrow_blogs, nrow_news, nrow_twitter),
                            "Characters" = c(nchar_blogs, nchar_news, nchar_twitter),
                            "Words" = c(nw_blogs, nw_news, nw_twitter),
                            "Size_in_MB" = c(blogs_size, news_size, twitter_size))
# Display the summary table
kable(summary_table, format = "pandoc", digits = 2)
| file | Lines | Characters | Words | Size_in_MB |
|---|---|---|---|---|
| Blogs | 899288 | 206824505 | 37570839 | 200.42 |
| News | 77259 | 15639408 | 2651432 | 196.28 |
| Twitter | 2360148 | 162096031 | 30451128 | 159.36 |
The summary table above shows the number of lines, characters, and words, as well as the file size, for each of the three data files.
Since the data set is large, this project uses a random sample of 2% of each data file for the sake of simplicity, even though such a small sample may not be fully representative of the full files.
# Random sample selection (2% of each file)
set.seed(2022)
sample_fdata <- c(sample(blogs, length(blogs)*0.02),
                  sample(news, length(news)*0.02),
                  sample(twitter, length(twitter)*0.02))
# Drop non-ASCII characters
sample_fdata <- iconv(sample_fdata, "UTF-8", "ASCII", sub = "")
# Data cleaning
corpus <- VCorpus(VectorSource(sample_fdata))
corpus <- tm_map(corpus, removeNumbers)                       # drop digits
corpus <- tm_map(corpus, removePunctuation)                   # drop punctuation
corpus <- tm_map(corpus, stripWhitespace)                     # collapse extra whitespace
corpus <- tm_map(corpus, tolower)                             # convert to lower case
corpus <- tm_map(corpus, PlainTextDocument)                   # restore document class after tolower
corpus <- tm_map(corpus, removeWords, stopwords("english"))   # remove English stop words
This project uses the Text Mining (tm) package to clean the data, including removing numbers, punctuation, extra whitespace, and English stop words, before building the N-grams.
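As a quick, illustrative sanity check (not part of the analysis itself), one cleaned document can be compared with the raw sampled line it came from:

# Illustrative check: compare a raw sampled line with its cleaned version
sample_fdata[1]
content(corpus[[1]])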
# RWeka tokenizers for 1-, 2-, and 3-grams
unigram <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
bigram  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
# Term-document matrices for each N-gram size
unigram_TDM <- TermDocumentMatrix(corpus, control = list(tokenize = unigram))
bigram_TDM  <- TermDocumentMatrix(corpus, control = list(tokenize = bigram))
trigram_TDM <- TermDocumentMatrix(corpus, control = list(tokenize = trigram))
# Keep only terms that appear at least this many times
unigram_FT <- findFreqTerms(unigram_TDM, lowfreq = 500)
bigram_FT  <- findFreqTerms(bigram_TDM, lowfreq = 50)
trigram_FT <- findFreqTerms(trigram_TDM, lowfreq = 10)
# Frequency tables for the frequent terms, sorted in decreasing order
unigram_freq <- rowSums(as.matrix(unigram_TDM[unigram_FT, ]))
unigram_df <- data.frame(Word = names(unigram_freq), frequency = unigram_freq)
unigram_df_order <- unigram_df[order(-unigram_df$frequency), ]
bigram_freq <- rowSums(as.matrix(bigram_TDM[bigram_FT, ]))
bigram_df <- data.frame(Word = names(bigram_freq), frequency = bigram_freq)
bigram_df_order <- bigram_df[order(-bigram_df$frequency), ]
trigram_freq <- rowSums(as.matrix(trigram_TDM[trigram_FT, ]))
trigram_df <- data.frame(Word = names(trigram_freq), frequency = trigram_freq)
trigram_df_order <- trigram_df[order(-trigram_df$frequency), ]
Summary tables of the most frequent one-word, two-word, and three-word terms are presented below.
summary_tab <- cbind(head(unigram_df_order, 10), head(bigram_df_order, 10), head(trigram_df_order, 10))
names(summary_tab) <- c("One word", "Frequency", "Two words", "Frequency", "Three words", "Frequency")
row.names(summary_tab) <- c(1:10)
kable(summary_tab, format = "pandoc")
| One word | Frequency | Two words | Frequency | Three words | Frequency |
|---|---|---|---|---|---|
| just | 4937 | right now | 462 | happy mothers day | 74 |
| like | 4421 | cant wait | 401 | cant wait see | 62 |
| will | 4342 | dont know | 329 | let us know | 49 |
| one | 4269 | last night | 291 | happy new year | 34 |
| get | 3847 | im going | 270 | im pretty sure | 30 |
| can | 3806 | feel like | 247 | im looking forward | 25 |
| time | 3312 | looking forward | 214 | cant wait till | 24 |
| love | 3056 | can get | 187 | looking forward seeing | 23 |
| good | 3043 | happy birthday | 187 | cant wait hear | 21 |
| dont | 3025 | first time | 178 | cant wait get | 20 |
The visual representations of the unigram, bigram, and trigram frequencies are presented below.
# Unigram bar chart
unigram_gg <- ggplot(unigram_df_order[1:10, ], aes(x = reorder(Word, frequency), y = frequency)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(title = "Unigram", x = "Words", y = "Frequency")
unigram_gg + coord_flip()
# Bigram bar chart
bigram_gg <- ggplot(bigram_df_order[1:10, ], aes(x = reorder(Word, frequency), y = frequency)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(title = "Bigram", x = "Words", y = "Frequency")
bigram_gg + coord_flip()
# Trigram bar chart
trigram_gg <- ggplot(trigram_df_order[1:10, ], aes(x = reorder(Word, frequency), y = frequency)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(title = "Trigram", x = "Words", y = "Frequency")
trigram_gg + coord_flip()
The results of this project show the most frequent N-grams in the data set. They also show that the shorter the N-gram, the more frequently it appears. In addition, careful cleaning and sampling are crucial to getting meaningful results from the data.
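Since the wordcloud and RColorBrewer packages are loaded but not otherwise used, the unigram frequencies could also be visualized as a word cloud. This is an optional, illustrative sketch rather than part of the analysis above:

# Optional sketch: word cloud of the most frequent unigrams
set.seed(2022)
wordcloud(words = as.character(unigram_df_order$Word), freq = unigram_df_order$frequency,
          max.words = 100, random.order = FALSE, colors = brewer.pal(8, "Dark2"))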
This project used only 2% of the data; in the future, a larger and more representative sample of the data set should be used. Further exploratory analysis will be conducted, and based on it a suitable predictive model will be built and validated. The resulting findings will be presented in a Shiny app.
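As a rough sketch of how the planned prediction algorithm might use these tables (the final model may differ), a simple back-off lookup over the trigram, bigram, and unigram frequency tables could look like this:

# Illustrative back-off next-word lookup built on the frequency tables above
# (a sketch only; the final prediction algorithm may differ)
predict_next <- function(phrase) {
  words <- unlist(strsplit(tolower(phrase), "\\s+"))
  n <- length(words)
  if (n >= 2) {
    # Try trigrams whose first two words match the last two words of the phrase
    prefix <- paste(words[n - 1], words[n])
    hits <- trigram_df_order[startsWith(as.character(trigram_df_order$Word), paste0(prefix, " ")), ]
    if (nrow(hits) > 0) return(sub(".* ", "", as.character(hits$Word[1])))
  }
  if (n >= 1) {
    # Back off to bigrams whose first word matches the last word of the phrase
    hits <- bigram_df_order[startsWith(as.character(bigram_df_order$Word), paste0(words[n], " ")), ]
    if (nrow(hits) > 0) return(sub(".* ", "", as.character(hits$Word[1])))
  }
  # Fall back to the most frequent unigram
  as.character(unigram_df_order$Word[1])
}
predict_next("cant wait")   # likely returns "see", given the trigram table above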