This task is part of the Coursera - Johns Hopkins Data Science Capstone Project, Task 02. The task is to perform exploratory data analysis on the text data and to prepare for modelling with the training data.
The goal of this project is just to display that you’ve gotten used to working with the data and that you are on track to create your prediction algorithm. Please submit a report on R Pubs ( http://rpubs.com/ ) that explains your exploratory analysis and your goals for the eventual app and algorithm. This document should be concise and explain only the major features of the data you have identified and briefly summarize your plans for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data scientist manager. You should make use of tables and plots to illustrate important summaries of the data set.
The motivation for this project is to:
1. Demonstrate that you’ve downloaded the data and have successfully loaded it in.
2. Create a basic report of summary statistics about the data sets.
3. Report any interesting findings that you amassed so far.
4. Get feedback on your plans for creating a prediction algorithm and Shiny app.
Below are the steps performed for the analysis:
Import the data from the working directory, limiting each training file to its first 1000 lines, which gives 3000 sample lines in total. This avoids heavy processing and keeps memory consumption low.
Below is the word summary of the sample:
number of words: 88780
number of lines: 3000 (a consequence of the 1000-line limit per file)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ readr 2.1.5
## ✔ ggplot2 3.5.1 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tm)
## Warning: package 'tm' was built under R version 4.4.2
## Loading required package: NLP
##
## Attaching package: 'NLP'
##
## The following object is masked from 'package:ggplot2':
##
## annotate
library(ggplot2)
library(RWeka)
## Warning: package 'RWeka' was built under R version 4.4.2
library(SnowballC)
library(wordcloud)
## Warning: package 'wordcloud' was built under R version 4.4.2
## Loading required package: RColorBrewer
library(stringi)
sample_blogs <- readLines("C:/RSTUDIO/COURSERA/R DATA SCIENCE/CAPSTONE PROJECT/DATASET/TRAINING DATA/final/en_US/en_US.blogs.txt",1000, encoding="UTF-8")
sample_twitter <- readLines("C:/RSTUDIO/COURSERA/R DATA SCIENCE/CAPSTONE PROJECT/DATASET/TRAINING DATA/final/en_US/en_US.twitter.txt",1000, encoding="UTF-8")
sample_news <- readLines("C:/RSTUDIO/COURSERA/R DATA SCIENCE/CAPSTONE PROJECT/DATASET/TRAINING DATA/final/en_US/en_US.news.txt",1000, encoding="UTF-8")
# combine all samples data into variable sample_data
sample_data <- c(sample_blogs,sample_twitter,sample_news)
sumwords <- sum(stri_count_words(sample_data))
numlines <- length(sample_data)
sumwords
## [1] 88780
numlines
## [1] 3000
glimpse(sample_data)
## chr [1:3000] "In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”." ...
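For context, each full training file is far larger than the 1000-line sample used here. Below is a minimal sketch of how full-file summary statistics could be gathered; it is not run in this report, to keep processing light, and it assumes the same file paths used in the readLines() calls above.
# NOT run here: summary statistics for a full training file (sketch only)
full_file_stats <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  c(size_MB = round(file.size(path) / 1024^2, 1),   # file size in megabytes
    lines   = length(lines),                        # total number of lines
    words   = sum(stri_count_words(lines)))         # total number of words
}
# e.g. apply full_file_stats() to each of the three file paths used above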
Since we only deal with words, non-word characters are removed or replaced with spaces. The cleaning steps below require the additional package "SnowballC" (for stemming); other unwanted characters can be added whenever required.
Transform all letters to lower case
Remove numbers
Remove punctuation
Remove stopwords (English stopwords such as "the", "a", "an"); this step is left commented out in the code below, since stopwords are useful for next-word prediction
Transform special characters such as "/", "@" and "|" to spaces; other characters can be added when required
Text stemming, which reduces words to their root form (e.g. "walking" and "walked" become "walk")
# convert sample_data to corpus
sample_data_corpus <- VCorpus(VectorSource(sample_data))
# this function replaces a given character or pattern with a space; further unwanted characters can be handled the same way
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
sample_data_corpus <- tm_map(sample_data_corpus, toSpace, "/")
# sample_data_corpus <- tm_map(sample_data_corpus, toSpace, ",")
# sample_data_corpus <- tm_map(sample_data_corpus, toSpace, ".")
sample_data_corpus <- tm_map(sample_data_corpus, toSpace, "@")
# sample_data_corpus <- tm_map(sample_data_corpus, toSpace, "$")
# sample_data_corpus <- tm_map(sample_data_corpus, toSpace, "%")
# sample_data_corpus <- tm_map(sample_data_corpus, toSpace, "\\|")
sample_data_corpus <- tm_map(sample_data_corpus, content_transformer(tolower))
# stopword removal is left commented out: stopwords carry useful signal for next-word prediction
# sample_data_corpus <- tm_map(sample_data_corpus, removeWords, stopwords("english"))
sample_data_corpus <- tm_map(sample_data_corpus, removePunctuation)
sample_data_corpus <- tm_map(sample_data_corpus, removeNumbers)
sample_data_corpus <- tm_map(sample_data_corpus, stripWhitespace)
sample_data_corpus <- tm_map(sample_data_corpus, stemDocument)
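As a quick sanity check on the cleaning pipeline (a small optional sketch, not part of the original processing), the first cleaned document can be compared with its raw counterpart:
# optional check: compare the raw and cleaned versions of the first sample line
sample_data[1]                          # raw text
as.character(sample_data_corpus[[1]])   # lower-cased, cleaned and stemmed text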
Based on the cleaned data, we now generate some insights.
# Build a Term-Document Matrix to count word frequencies and generate word statistics
summary_word <- TermDocumentMatrix(sample_data_corpus)
summary_word_stat <- as.matrix(summary_word)
# head(summary_word_stat)
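Before working with the full matrix, tm's findFreqTerms() offers a quick way to list high-frequency terms directly from the term-document matrix (an optional sketch; the threshold of 300 is an arbitrary choice):
# optional check: terms occurring at least 300 times in the sample (threshold chosen arbitrarily)
findFreqTerms(summary_word, lowfreq = 300)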
Top 20 Most Frequent Words - Unigram
# sum word counts across all documents and sort by decreasing frequency
word_summary <- sort(rowSums(summary_word_stat),decreasing=TRUE)
word_summary_table <- data.frame(word = names(word_summary), freq = word_summary)
head(word_summary_table,15)
## word freq
## the the 4337
## and and 2205
## that that 1019
## for for 926
## with with 664
## you you 609
## was was 576
## have have 477
## but but 445
## this this 434
## are are 366
## not not 363
## from from 325
## said said 304
## his his 288
Most_Frequent <- head(word_summary_table,20)
ggplot(Most_Frequent, aes(x = reorder(word, freq), y = freq)) + geom_bar(stat = "identity", fill = "blue") + coord_flip() + theme_bw() + labs(x = "Word", y = "Frequency", title = "Most Frequent Words")
Generate Word Cloud
wordcloud(words=word_summary_table$word,freq = word_summary_table$freq,min.freq = 10, max.words = 100, random.order = FALSE,rot.per = 0.35,colors = brewer.pal(8,"Dark2"))
Below are the top 10 most frequent bigrams. We use a tokenizer function adapted from the www.geeksforgeeks.org link below, then transform the result into frequency counts. The same procedure applies to trigrams by changing the tokenizer to 3 words.
# Bigram tokenizer function, adapted from the following page: https://www.geeksforgeeks.org/how-to-fix-document-term-matrix-in-r-bigram-tokenizer-not-working/
correctBigramTokenizer <- function(x) {unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)}
bigram_word_summary <- TermDocumentMatrix(sample_data_corpus,control = list(tokenize = correctBigramTokenizer))
inspect(bigram_word_summary)
## <<TermDocumentMatrix (terms: 55258, documents: 3000)>>
## Non-/sparse entries: 81719/165692281
## Sparsity : 100%
## Maximal term length: 72
## Weighting : term frequency (tf)
## Sample :
## Docs
## Terms 180 22 313 376 483 50 558 65 851 951
## and the 0 1 0 0 0 0 0 0 1 0
## at the 0 0 0 0 0 0 0 1 0 0
## for the 0 0 0 0 1 0 1 0 1 1
## in a 0 1 0 5 1 0 0 0 0 0
## in the 0 2 0 0 1 0 1 3 1 0
## of the 2 1 0 0 3 0 0 1 1 0
## on the 2 0 0 0 0 3 1 1 0 0
## to be 0 0 0 1 1 1 2 0 0 0
## to the 0 0 0 1 0 0 1 0 0 0
## with the 1 1 0 0 0 0 1 0 0 0
bigram_matrix <- as.matrix(bigram_word_summary)
bigram_word_summary_stat <- sort(rowSums(bigram_matrix),decreasing=TRUE)
bigram_word_table <- data.frame(word = names(bigram_word_summary_stat), freq = bigram_word_summary_stat)
head(bigram_word_table,10)
## word freq
## of the of the 390
## in the in the 380
## to the to the 196
## for the for the 168
## to be to be 167
## on the on the 166
## and the and the 120
## in a in a 120
## at the at the 119
## with the with the 100
most_frequent_bigram <- head(bigram_word_table,20)
ggplot(most_frequent_bigram, aes(x = reorder(word, freq), y = freq)) + geom_bar(stat = "identity", fill = "blue") + coord_flip() + theme_bw() + labs(x = "Bigram", y = "Frequency", title = "Most Frequent Bigrams")
write.csv(bigram_word_table,file = "bigram.csv",row.names = FALSE)
Generate Bigram Word Cloud
wordcloud(words=bigram_word_table$word,freq = bigram_word_table$freq,min.freq = 7, max.words = 100, random.order = FALSE,rot.per = 0.35,colors = brewer.pal(8,"Dark2"))
Below are the top 10 most frequent trigrams, generated with the same approach as the bigrams but with a 3-word tokenizer.
correctTrigramTokenizer <- function(x) {unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE)}
trigram_word_summary <- TermDocumentMatrix(sample_data_corpus,control = list(tokenize = correctTrigramTokenizer))
inspect(trigram_word_summary)
## <<TermDocumentMatrix (terms: 76287, documents: 3000)>>
## Non-/sparse entries: 80213/228780787
## Sparsity : 100%
## Maximal term length: 90
## Weighting : term frequency (tf)
## Sample :
## Docs
## Terms 180 22 313 376 483 50 558 65 851 951
## a coupl of 0 0 0 0 0 0 0 0 0 0
## a lot of 0 0 0 0 0 0 0 0 0 0
## as much as 0 0 0 0 0 0 0 0 0 0
## be abl to 0 0 0 0 0 0 0 0 0 0
## i had a 0 0 0 0 0 0 0 0 0 0
## i want to 0 0 1 0 0 0 0 0 0 0
## it is a 0 0 0 0 0 0 0 0 0 0
## it was a 0 0 0 0 0 0 0 0 0 0
## one of the 0 0 0 0 0 0 0 1 0 0
## the end of 0 0 0 0 1 0 0 0 0 0
trigram_matrix <- as.matrix(trigram_word_summary)
trigram_word_summary_stat <- sort(rowSums(trigram_matrix),decreasing=TRUE)
trigram_word_table <- data.frame(word = names(trigram_word_summary_stat), freq = trigram_word_summary_stat)
head(trigram_word_table,10)
## word freq
## a lot of a lot of 43
## one of the one of the 31
## i want to i want to 22
## a coupl of a coupl of 15
## as much as as much as 14
## i had a i had a 14
## the end of the end of 14
## be abl to be abl to 13
## it is a it is a 13
## it was a it was a 13
most_frequent_trigram <- head(trigram_word_table,20)
ggplot(most_frequent_trigram, aes(x = reorder(word, freq), y = freq)) + geom_bar(stat = "identity", fill = "blue") + coord_flip() + theme_bw() + labs(x = "Trigram", y = "Frequency", title = "Most Frequent Trigrams")
write.csv(trigram_word_table,file = "trigram.csv",row.names = FALSE)
Generate Trigram Word Cloud
# knitr::opts_chunk$set(warning = FALSE)
wordcloud(words = trigram_word_table$word, freq = trigram_word_table$freq, min.freq = 4, max.words = 100, random.order = FALSE, rot.per = 0.35, colors = brewer.pal(8, "Dark2"))
Some interesting points from this process and the data analysis:
The tm package (through tm_map) makes data cleaning convenient: it can remove punctuation, special characters, numbers, stopwords and other unwanted content.
Unigrams, bigrams and trigrams have different distributions and most frequent terms, so they should complement each other when predicting the next word, as sketched below.
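To illustrate how these frequency tables could feed the eventual prediction algorithm, here is a minimal back-off sketch. It is not the final algorithm and the helper name predict_next_word is hypothetical: it looks up the last two words of a phrase in the trigram table, falls back to the last word in the bigram table, and finally to the most frequent unigram.
# minimal back-off sketch using the frequency tables built above (hypothetical helper)
predict_next_word <- function(phrase) {
  w <- unlist(strsplit(tolower(phrase), "\\s+"))
  n <- length(w)
  if (n >= 2) {
    # try the trigram table first: match on the last two words
    hits <- trigram_word_table[startsWith(trigram_word_table$word, paste(w[n - 1], w[n], "")), ]
    if (nrow(hits) > 0) return(sub(".* ", "", hits$word[1]))
  }
  if (n >= 1) {
    # back off to the bigram table: match on the last word
    hits <- bigram_word_table[startsWith(bigram_word_table$word, paste(w[n], "")), ]
    if (nrow(hits) > 0) return(sub(".* ", "", hits$word[1]))
  }
  word_summary_table$word[1]   # last resort: the most frequent unigram
}
# e.g. predict_next_word("one of")  # would return "the" with the sample tables above
The final algorithm will need to handle unseen n-grams, stemming and smoothing more carefully, and will be wrapped in a Shiny app for interactive next-word prediction.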