This task is part of the Coursera - Johns Hopkins Data Science Capstone Project, Task 02. The task is to perform exploratory data analysis on the text data and to prepare for modelling with the training data.
The goal of this project is just to display that you’ve gotten used to working with the data and that you are on track to create your prediction algorithm. Please submit a report on R Pubs ( http://rpubs.com/ ) that explains your exploratory analysis and your goals for the eventual app and algorithm. This document should be concise and explain only the major features of the data you have identified and briefly summarize your plans for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data scientist manager. You should make use of tables and plots to illustrate important summaries of the data set.
The motivation for this project is to:
1. Demonstrate that you’ve downloaded the data and have successfully loaded it in.
2. Create a basic report of summary statistics about the data sets.
3. Report any interesting findings that you amassed so far.
4. Get feedback on your plans for creating a prediction algorithm and Shiny app.
Below are the steps performed for the analysis:
Import the data from the working directory, limiting each training file to its first 1000 lines, which gives 3000 sample lines in total. This avoids heavy processing and keeps memory consumption low.
Below is the word summary of the sample:
number of words: 88780
number of lines: 3000 (a consequence of the 1000-line limit per file)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ readr 2.1.5
## ✔ ggplot2 3.5.1 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tm)
## Warning: package 'tm' was built under R version 4.4.2
## Loading required package: NLP
##
## Attaching package: 'NLP'
##
## The following object is masked from 'package:ggplot2':
##
## annotate
library(ggplot2)
library(RWeka)
## Warning: package 'RWeka' was built under R version 4.4.2
library(SnowballC)
library(wordcloud)
## Warning: package 'wordcloud' was built under R version 4.4.2
## Loading required package: RColorBrewer
library(stringi)
sample_blogs <- readLines("C:/RSTUDIO/COURSERA/R DATA SCIENCE/CAPSTONE PROJECT/DATASET/TRAINING DATA/final/en_US/en_US.blogs.txt",1000, encoding="UTF-8")
sample_twitter <- readLines("C:/RSTUDIO/COURSERA/R DATA SCIENCE/CAPSTONE PROJECT/DATASET/TRAINING DATA/final/en_US/en_US.twitter.txt",1000, encoding="UTF-8")
sample_news <- readLines("C:/RSTUDIO/COURSERA/R DATA SCIENCE/CAPSTONE PROJECT/DATASET/TRAINING DATA/final/en_US/en_US.news.txt",1000, encoding="UTF-8")
# combine all samples data into variable sample_data
sample_data <- c(sample_blogs,sample_twitter,sample_news)
sumwords <- sum(stri_count_words(sample_data))
numlines <- length(sample_data)
sumwords
## [1] 88780
numlines
## [1] 3000
glimpse(sample_data)
## chr [1:3000] "In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”." ...
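For context, each full training file is far larger than the 1000-line sample used here. Below is a minimal sketch of how full-file summary statistics could be gathered; it is not run in this report, to keep processing light, and it assumes the same file paths used in the readLines() calls above.
# NOT run here: summary statistics for a full training file (sketch only)
full_file_stats <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  c(size_MB = round(file.size(path) / 1024^2, 1),   # file size in megabytes
    lines   = length(lines),                        # total number of lines
    words   = sum(stri_count_words(lines)))         # total number of words
}
# e.g. apply full_file_stats() to each of the three file paths used above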
Since we only deal with words, non-word characters are removed or replaced with spaces. The cleaning steps below require the additional package "SnowballC" (for stemming); other unwanted characters can be added whenever required.
Transform all letters to lower case
Remove numbers
Remove punctuation
Remove stopwords (English stopwords such as "the", "a", "an"); this step is left commented out in the code below, since stopwords are useful for next-word prediction
Transform special characters such as "/", "@" and "|" to spaces; other characters can be added when required
Text stemming, which reduces words to their root form (e.g. "walking" and "walked" become "walk")
# convert sample_data to corpus
sample_data_corpus <- VCorpus(VectorSource(sample_data))
# this function replaces a given character or pattern with a space; further unwanted characters can be handled the same way
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
sample_data_corpus <- tm_map(sample_data_corpus, toSpace, "/")
# sample_data_corpus <- tm_map(sample_data_corpus, toSpace, ",")
# sample_data_corpus <- tm_map(sample_data_corpus, toSpace, ".")
sample_data_corpus <- tm_map(sample_data_corpus, toSpace, "@")
# sample_data_corpus <- tm_map(sample_data_corpus, toSpace, "$")
# sample_data_corpus <- tm_map(sample_data_corpus, toSpace, "%")
# sample_data_corpus <- tm_map(sample_data_corpus, toSpace, "\\|")
sample_data_corpus <- tm_map(sample_data_corpus, content_transformer(tolower))
# stopword removal is left commented out: stopwords carry useful signal for next-word prediction
# sample_data_corpus <- tm_map(sample_data_corpus, removeWords, stopwords("english"))
sample_data_corpus <- tm_map(sample_data_corpus, removePunctuation)
sample_data_corpus <- tm_map(sample_data_corpus, removeNumbers)
sample_data_corpus <- tm_map(sample_data_corpus, stripWhitespace)
sample_data_corpus <- tm_map(sample_data_corpus, stemDocument)
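As a quick sanity check on the cleaning pipeline (a small optional sketch, not part of the original processing), the first cleaned document can be compared with its raw counterpart:
# optional check: compare the raw and cleaned versions of the first sample line
sample_data[1]                          # raw text
as.character(sample_data_corpus[[1]])   # lower-cased, cleaned and stemmed text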
Based on the cleaned data, we now generate some insights.
# Build a Term-Document Matrix to count word frequencies and generate word statistics
summary_word <- TermDocumentMatrix(sample_data_corpus)
summary_word_stat <- as.matrix(summary_word)
# head(summary_word_stat)
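Before working with the full matrix, tm's findFreqTerms() offers a quick way to list high-frequency terms directly from the term-document matrix (an optional sketch; the threshold of 300 is an arbitrary choice):
# optional check: terms occurring at least 300 times in the sample (threshold chosen arbitrarily)
findFreqTerms(summary_word, lowfreq = 300)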
Top 20 Most Frequent Words - Unigram
# sum word counts across all documents and sort by decreasing frequency
word_summary <- sort(rowSums(summary_word_stat),decreasing=TRUE)
word_summary_table <- data.frame(word = names(word_summary), freq = word_summary)
head(word_summary_table,15)
## word freq
## the the 4337
## and and 2205
## that that 1019
## for for 926
## with with 664
## you you 609
## was was 576
## have have 477
## but but 445
## this this 434
## are are 366
## not not 363
## from from 325
## said said 304
## his his 288
Most_Frequent <- head(word_summary_table,20)
ggplot(Most_Frequent, aes(x = reorder(word, freq), y = freq)) + geom_bar(stat = "identity", fill = "blue") + coord_flip() + theme_bw() + labs(x = "Word", y = "Frequency", title = "Most Frequent Words")
Generate Word Cloud
wordcloud(words=word_summary_table$word,freq = word_summary_table$freq,min.freq = 10, max.words = 100, random.order = FALSE,rot.per = 0.35,colors = brewer.pal(8,"Dark2"))
Below are the top 10 most frequent bigrams. We use a tokenizer function adapted from the www.geeksforgeeks.org link below, then transform the result into frequency counts. The same procedure applies to trigrams by changing the tokenizer to 3 words.
# Bigram tokenizer function, adapted from the following page: https://www.geeksforgeeks.org/how-to-fix-document-term-matrix-in-r-bigram-tokenizer-not-working/
correctBigramTokenizer <- function(x) {unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)}
bigram_word_summary <- TermDocumentMatrix(sample_data_corpus,control = list(tokenize = correctBigramTokenizer))
inspect(bigram_word_summary)
## <<TermDocumentMatrix (terms: 55258, documents: 3000)>>
## Non-/sparse entries: 81719/165692281
## Sparsity : 100%
## Maximal term length: 72
## Weighting : term frequency (tf)
## Sample :
## Docs
## Terms 180 22 313 376 483 50 558 65 851 951
## and the 0 1 0 0 0 0 0 0 1 0
## at the 0 0 0 0 0 0 0 1 0 0
## for the 0 0 0 0 1 0 1 0 1 1
## in a 0 1 0 5 1 0 0 0 0 0
## in the 0 2 0 0 1 0 1 3 1 0
## of the 2 1 0 0 3 0 0 1 1 0
## on the 2 0 0 0 0 3 1 1 0 0
## to be 0 0 0 1 1 1 2 0 0 0
## to the 0 0 0 1 0 0 1 0 0 0
## with the 1 1 0 0 0 0 1 0 0 0
bigram_matrix <- as.matrix(bigram_word_summary)
bigram_word_summary_stat <- sort(rowSums(bigram_matrix),decreasing=TRUE)
bigram_word_table <- data.frame(word = names(bigram_word_summary_stat), freq = bigram_word_summary_stat)
head(bigram_word_table,10)
## word freq
## of the of the 390
## in the in the 380
## to the to the 196
## for the for the 168
## to be to be 167
## on the on the 166
## and the and the 120
## in a in a 120
## at the at the 119
## with the with the 100
most_frequent_bigram <- head(bigram_word_table,20)
ggplot(most_frequent_bigram, aes(x = reorder(word, freq), y = freq)) + geom_bar(stat = "identity", fill = "blue") + coord_flip() + theme_bw() + labs(x = "Bigram", y = "Frequency", title = "Most Frequent Bigrams")
write.csv(bigram_word_table,file = "bigram.csv",row.names = FALSE)
Generate Bigram Word Cloud
wordcloud(words=bigram_word_table$word,freq = bigram_word_table$freq,min.freq = 7, max.words = 100, random.order = FALSE,rot.per = 0.35,colors = brewer.pal(8,"Dark2"))
Below are the top 10 most frequent trigrams, generated with the same approach as the bigrams but with a 3-word tokenizer.
correctTrigramTokenizer <- function(x) {unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE)}
trigram_word_summary <- TermDocumentMatrix(sample_data_corpus,control = list(tokenize = correctTrigramTokenizer))
inspect(trigram_word_summary)
## <<TermDocumentMatrix (terms: 76287, documents: 3000)>>
## Non-/sparse entries: 80213/228780787
## Sparsity : 100%
## Maximal term length: 90
## Weighting : term frequency (tf)
## Sample :
## Docs
## Terms 180 22 313 376 483 50 558 65 851 951
## a coupl of 0 0 0 0 0 0 0 0 0 0
## a lot of 0 0 0 0 0 0 0 0 0 0
## as much as 0 0 0 0 0 0 0 0 0 0
## be abl to 0 0 0 0 0 0 0 0 0 0
## i had a 0 0 0 0 0 0 0 0 0 0
## i want to 0 0 1 0 0 0 0 0 0 0
## it is a 0 0 0 0 0 0 0 0 0 0
## it was a 0 0 0 0 0 0 0 0 0 0
## one of the 0 0 0 0 0 0 0 1 0 0
## the end of 0 0 0 0 1 0 0 0 0 0
trigram_matrix <- as.matrix(trigram_word_summary)
trigram_word_summary_stat <- sort(rowSums(trigram_matrix),decreasing=TRUE)
trigram_word_table <- data.frame(word = names(trigram_word_summary_stat), freq = trigram_word_summary_stat)
head(trigram_word_table,10)
## word freq
## a lot of a lot of 43
## one of the one of the 31
## i want to i want to 22
## a coupl of a coupl of 15
## as much as as much as 14
## i had a i had a 14
## the end of the end of 14
## be abl to be abl to 13
## it is a it is a 13
## it was a it was a 13
most_frequent_trigram <- head(trigram_word_table,20)
ggplot(most_frequent_trigram, aes(x = reorder(word, freq), y = freq)) + geom_bar(stat = "identity", fill = "blue") + coord_flip() + theme_bw() + labs(x = "Trigram", y = "Frequency", title = "Most Frequent Trigrams")
write.csv(trigram_word_table,file = "trigram.csv",row.names = FALSE)
Generate Trigram Word Cloud
# knitr::opts_chunk$set(warning = FALSE)
wordcloud(words = trigram_word_table$word, freq = trigram_word_table$freq, min.freq = 4, max.words = 100, random.order = FALSE, rot.per = 0.35, colors = brewer.pal(8, "Dark2"))
Some interesting points from this process and the data analysis:
The tm package (through tm_map) makes data cleaning convenient: it can remove punctuation, special characters, numbers, stopwords and other unwanted content.
Unigrams, bigrams and trigrams have different distributions and most frequent terms, so they should complement each other when predicting the next word, as sketched below.
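To illustrate how these frequency tables could feed the eventual prediction algorithm, here is a minimal back-off sketch. It is not the final algorithm and the helper name predict_next_word is hypothetical: it looks up the last two words of a phrase in the trigram table, falls back to the last word in the bigram table, and finally to the most frequent unigram.
# minimal back-off sketch using the frequency tables built above (hypothetical helper)
predict_next_word <- function(phrase) {
  w <- unlist(strsplit(tolower(phrase), "\\s+"))
  n <- length(w)
  if (n >= 2) {
    # try the trigram table first: match on the last two words
    hits <- trigram_word_table[startsWith(trigram_word_table$word, paste(w[n - 1], w[n], "")), ]
    if (nrow(hits) > 0) return(sub(".* ", "", hits$word[1]))
  }
  if (n >= 1) {
    # back off to the bigram table: match on the last word
    hits <- bigram_word_table[startsWith(bigram_word_table$word, paste(w[n], "")), ]
    if (nrow(hits) > 0) return(sub(".* ", "", hits$word[1]))
  }
  word_summary_table$word[1]   # last resort: the most frequent unigram
}
# e.g. predict_next_word("one of")  # would return "the" with the sample tables above
The final algorithm will need to handle unseen n-grams, stemming and smoothing more carefully, and will be wrapped in a Shiny app for interactive next-word prediction.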