The goals of this project are to:

1. Demonstrate that the data has been downloaded and successfully loaded.
2. Create a basic report of summary statistics about the data sets.
3. Report any interesting findings amassed so far.
4. Get feedback on the plans for creating a prediction algorithm and Shiny app.
In this project, the following tasks are performed:

1. Load the required packages and the data (a download sketch is included below).
2. Summarize the data.
3. Clean the data and select a sample.
4. Build N-grams and their graphical representations.
5. Report findings and future plans.
library(NLP)
library(tm)
library(corpus)
library(ngram)
library(stringi)
library(RWeka)
library(RColorBrewer)
library(wordcloud)
library(ggplot2)
library(kableExtra)
library(xtable)
setwd("~/Coursera/Data Science Capstone")
blogs<-readLines("Coursera-SwiftKey/final/en_US/en_US.blogs.txt",warn=FALSE,encoding="UTF-8")
news<-readLines("Coursera-SwiftKey/final/en_US/en_US.news.txt",warn=FALSE,encoding="UTF-8")
twitter<-readLines("Coursera-SwiftKey/final/en_US/en_US.twitter.txt",warn=FALSE,encoding="UTF-8")
# Number of lines
nrow_blogs<-length(blogs)
nrow_news<-length(news)
nrow_twitter<-length(twitter)
# Number of characters
nchar_blogs<-sum(nchar(blogs))
nchar_news<-sum(nchar(news))
nchar_twitter<-sum(nchar(twitter))
# Number of words
nw_blogs<-stri_stats_latex(blogs)['Words']
nw_news<-stri_stats_latex(news)['Words']
nw_twitter<-stri_stats_latex(twitter)['Words']
# File size in MB
blogs_size<-file.size("Coursera-SwiftKey/final/en_US/en_US.blogs.txt")/1048576
news_size<-file.size("Coursera-SwiftKey/final/en_US/en_US.news.txt")/1048576
twitter_size<-file.size("Coursera-SwiftKey/final/en_US/en_US.twitter.txt")/1048576
# Summary table
summary_table <- data.frame("file" = c("Blogs", "News", "Twitter"),
                            "Lines" = c(nrow_blogs, nrow_news, nrow_twitter),
                            "Characters" = c(nchar_blogs, nchar_news, nchar_twitter),
                            "Words" = c(nw_blogs, nw_news, nw_twitter),
                            "Size_in_MB" = c(blogs_size, news_size, twitter_size))
# Display the summary table
kable(summary_table, format = "pandoc", digits = 2)
| file | Lines | Characters | Words | Size_in_MB |
|---|---|---|---|---|
| Blogs | 899288 | 206824505 | 37570839 | 200.42 |
| News | 77259 | 15639408 | 2651432 | 196.28 |
| Twitter | 2360148 | 162096031 | 30451128 | 159.36 |
The summary table above shows the number of lines, characters, and words, as well as the file size, for each of the three data files.
Since the data set is large, this project uses a random sample of 2% of each data file for the sake of simplicity, even though such a small sample may not be fully representative of the full files.
# Random sample selection (2% of each file)
set.seed(2022)
sample_fdata <- c(sample(blogs, length(blogs)*0.02),
                  sample(news, length(news)*0.02),
                  sample(twitter, length(twitter)*0.02))
# Drop non-ASCII characters
sample_fdata <- iconv(sample_fdata, "UTF-8", "ASCII", sub = "")
# Data cleaning
corpus <- VCorpus(VectorSource(sample_fdata))
corpus <- tm_map(corpus, removeNumbers)                       # drop digits
corpus <- tm_map(corpus, removePunctuation)                   # drop punctuation
corpus <- tm_map(corpus, stripWhitespace)                     # collapse extra whitespace
corpus <- tm_map(corpus, tolower)                             # convert to lower case
corpus <- tm_map(corpus, PlainTextDocument)                   # restore document class after tolower
corpus <- tm_map(corpus, removeWords, stopwords("english"))   # remove English stop words
This project uses the Text Mining (tm) package to clean the data, including removing numbers, punctuation, extra whitespace, and English stop words, before building the N-grams.
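As a quick, illustrative sanity check (not part of the analysis itself), one cleaned document can be compared with the raw sampled line it came from:

# Illustrative check: compare a raw sampled line with its cleaned version
sample_fdata[1]
content(corpus[[1]])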
# RWeka tokenizers for 1-, 2-, and 3-grams
unigram <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
bigram  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
# Term-document matrices for each N-gram size
unigram_TDM <- TermDocumentMatrix(corpus, control = list(tokenize = unigram))
bigram_TDM  <- TermDocumentMatrix(corpus, control = list(tokenize = bigram))
trigram_TDM <- TermDocumentMatrix(corpus, control = list(tokenize = trigram))
# Keep only terms that appear at least this many times
unigram_FT <- findFreqTerms(unigram_TDM, lowfreq = 500)
bigram_FT  <- findFreqTerms(bigram_TDM, lowfreq = 50)
trigram_FT <- findFreqTerms(trigram_TDM, lowfreq = 10)
# Frequency tables for the frequent terms, sorted in decreasing order
unigram_freq <- rowSums(as.matrix(unigram_TDM[unigram_FT, ]))
unigram_df <- data.frame(Word = names(unigram_freq), frequency = unigram_freq)
unigram_df_order <- unigram_df[order(-unigram_df$frequency), ]
bigram_freq <- rowSums(as.matrix(bigram_TDM[bigram_FT, ]))
bigram_df <- data.frame(Word = names(bigram_freq), frequency = bigram_freq)
bigram_df_order <- bigram_df[order(-bigram_df$frequency), ]
trigram_freq <- rowSums(as.matrix(trigram_TDM[trigram_FT, ]))
trigram_df <- data.frame(Word = names(trigram_freq), frequency = trigram_freq)
trigram_df_order <- trigram_df[order(-trigram_df$frequency), ]
Summary tables of the most frequent one-word, two-word, and three-word terms are presented below.
summary_tab <- cbind(head(unigram_df_order, 10), head(bigram_df_order, 10), head(trigram_df_order, 10))
names(summary_tab) <- c("One word", "Frequency", "Two words", "Frequency", "Three words", "Frequency")
row.names(summary_tab) <- c(1:10)
kable(summary_tab, format = "pandoc")
| One word | Frequency | Two words | Frequency | Three words | Frequency |
|---|---|---|---|---|---|
| just | 4937 | right now | 462 | happy mothers day | 74 |
| like | 4421 | cant wait | 401 | cant wait see | 62 |
| will | 4342 | dont know | 329 | let us know | 49 |
| one | 4269 | last night | 291 | happy new year | 34 |
| get | 3847 | im going | 270 | im pretty sure | 30 |
| can | 3806 | feel like | 247 | im looking forward | 25 |
| time | 3312 | looking forward | 214 | cant wait till | 24 |
| love | 3056 | can get | 187 | looking forward seeing | 23 |
| good | 3043 | happy birthday | 187 | cant wait hear | 21 |
| dont | 3025 | first time | 178 | cant wait get | 20 |
The visual representations of the unigram, bigram, and trigram frequencies are presented below.
# Unigram bar chart
unigram_gg <- ggplot(unigram_df_order[1:10, ], aes(x = reorder(Word, frequency), y = frequency)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(title = "Unigram", x = "Words", y = "Frequency")
unigram_gg + coord_flip()
# Bigram bar chart
bigram_gg <- ggplot(bigram_df_order[1:10, ], aes(x = reorder(Word, frequency), y = frequency)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(title = "Bigram", x = "Words", y = "Frequency")
bigram_gg + coord_flip()
# Trigram bar chart
trigram_gg <- ggplot(trigram_df_order[1:10, ], aes(x = reorder(Word, frequency), y = frequency)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(title = "Trigram", x = "Words", y = "Frequency")
trigram_gg + coord_flip()
The results of this project show the most frequent N-grams in the data set. They also show that the shorter the N-gram, the more frequently it appears. In addition, careful cleaning and sampling are crucial to getting meaningful results from the data.
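Since the wordcloud and RColorBrewer packages are loaded but not otherwise used, the unigram frequencies could also be visualized as a word cloud. This is an optional, illustrative sketch rather than part of the analysis above:

# Optional sketch: word cloud of the most frequent unigrams
set.seed(2022)
wordcloud(words = as.character(unigram_df_order$Word), freq = unigram_df_order$frequency,
          max.words = 100, random.order = FALSE, colors = brewer.pal(8, "Dark2"))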
This project used only 2% of the data; in the future, a larger and more representative sample of the data set should be used. Further exploratory analysis will be conducted, and based on it a suitable predictive model will be built and validated. The resulting findings will be presented in a Shiny app.
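As a rough sketch of how the planned prediction algorithm might use these tables (the final model may differ), a simple back-off lookup over the trigram, bigram, and unigram frequency tables could look like this:

# Illustrative back-off next-word lookup built on the frequency tables above
# (a sketch only; the final prediction algorithm may differ)
predict_next <- function(phrase) {
  words <- unlist(strsplit(tolower(phrase), "\\s+"))
  n <- length(words)
  if (n >= 2) {
    # Try trigrams whose first two words match the last two words of the phrase
    prefix <- paste(words[n - 1], words[n])
    hits <- trigram_df_order[startsWith(as.character(trigram_df_order$Word), paste0(prefix, " ")), ]
    if (nrow(hits) > 0) return(sub(".* ", "", as.character(hits$Word[1])))
  }
  if (n >= 1) {
    # Back off to bigrams whose first word matches the last word of the phrase
    hits <- bigram_df_order[startsWith(as.character(bigram_df_order$Word), paste0(words[n], " ")), ]
    if (nrow(hits) > 0) return(sub(".* ", "", as.character(hits$Word[1])))
  }
  # Fall back to the most frequent unigram
  as.character(unigram_df_order$Word[1])
}
predict_next("cant wait")   # likely returns "see", given the trigram table above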