Getting data

Let’s first load all the libraries we need:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(stringi)
library(tm)
## Loading required package: NLP
## 
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
## 
##     annotate
library(RWeka)
library(wordcloud)
## Loading required package: RColorBrewer
library(ngram)
library(R.utils)
## Warning: package 'R.utils' was built under R version 4.0.3
## Loading required package: R.oo
## Warning: package 'R.oo' was built under R version 4.0.3
## Loading required package: R.methodsS3
## Warning: package 'R.methodsS3' was built under R version 4.0.3
## R.methodsS3 v1.8.1 (2020-08-26 16:20:06 UTC) successfully loaded. See ?R.methodsS3 for help.
## R.oo v1.24.0 (2020-08-26 16:11:58 UTC) successfully loaded. See ?R.oo for help.
## 
## Attaching package: 'R.oo'
## The following object is masked from 'package:R.methodsS3':
## 
##     throw
## The following objects are masked from 'package:methods':
## 
##     getClasses, getMethods
## The following objects are masked from 'package:base':
## 
##     attach, detach, load, save
## R.utils v2.10.1 (2020-08-26 22:50:31 UTC) successfully loaded. See ?R.utils for help.
## 
## Attaching package: 'R.utils'
## The following object is masked from 'package:utils':
## 
##     timestamp
## The following objects are masked from 'package:base':
## 
##     cat, commandArgs, getOption, inherits, isOpen, nullfile, parse,
##     warnings

Then, we read the data from the three files:

setwd(dir = "./final/en_US/")
connection <- file("en_US.blogs.txt")
blogs_data <- readLines(connection, encoding="UTF-8", skipNul=TRUE)
close(connection)

connection <- file("en_US.news.txt")
news_data <- readLines(connection, encoding="UTF-8", skipNul=TRUE)
## Warning in readLines(connection, encoding = "UTF-8", skipNul = TRUE):
## incomplete final line found on 'en_US.news.txt'
close(connection)

connection <- file("en_US.twitter.txt")
twitter_data <- readLines(connection, encoding="UTF-8", skipNul=TRUE)
close(connection)

We now have three datasets. We will create a data frame for each file so we can easily inspect them. The code below defines a function that returns a data frame given a dataset and a file name:

createDataFrame <- function(data, fileName) {
  Lines <- length(data)
  Size <- as.numeric(object.size(data))  # in-memory size of the data, in bytes
  wordCount <- wordcount(data, sep = " ", count.function = sum)
  dataFrame <- data.frame(FileName = fileName,
                          FileSize = Size,
                          WordCount = wordCount,
                          Lines = Lines)
  return(dataFrame)
}

We use that function to create the data frames and take a look at the size, word count and line count of each file:

blogs_data_frame <- createDataFrame(blogs_data, "en_US.blogs.txt")
news_data_frame <- createDataFrame(news_data, "en_US.news.txt")
twitter_data_frame <- createDataFrame(twitter_data, "en_US.twitter.txt")

print(blogs_data_frame)
##          FileName  FileSize WordCount  Lines
## 1 en_US.blogs.txt 267758632  37334131 899288
print(news_data_frame)
##         FileName  FileSize WordCount Lines
## 1 en_US.news.txt 267758632   2643969 77259
print(twitter_data_frame)
##            FileName  FileSize WordCount   Lines
## 1 en_US.twitter.txt 267758632  30373583 2360148

Cleaning data

As we saw previously, the files are too big to process in full. We will create a corpus from the three datasets. To do that, we first build a sample containing a small portion of each file:
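The sampling code is not shown in the rendered output. A minimal sketch, where both the sampling rate and the seed are assumptions made for illustration:

set.seed(1234)       # assumed seed, for reproducibility
sample_rate <- 0.01  # assumed sampling rate of 1%
sample_data <- c(sample(blogs_data, length(blogs_data) * sample_rate),
                 sample(news_data, length(news_data) * sample_rate),
                 sample(twitter_data, length(twitter_data) * sample_rate))
head(sample_data)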

## [1] "I've made America's Test Kitchen's triple chocolate mousse cake various ways now - as individual cakes, a full sized cake, and one layered with strawberries - and it always turns out great. In July I made one again, this time for a goodbye lunch for a coworker who was leaving the company, and I simply decorated it with fresh raspberries."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       
## [2] "Going by the trend of NASA discovery that they want to use Indian Astrology Tables for planetary mechanics calculations, planetary motions and equinoxes and if after 15 years if NASA or some other western scientific agency comes up with the statement that what said in the Future Forecasts (Panchangam) about behavior predictability is correct at least in 70% of the cases then will all these progressive and democratic organizations will agitate to include astrology in the regular syllabus of school curriculum ? May be they do if we ask NASA to publish their findings in Telugu rather than in English. HR commission correctly pointed that they cannot register a case in this issue and they may try with police. Putting astrologers in to jails is the worst thing and most oppressive thing that takes us back to Papal Era of banning the science and killing astronomers or even to the era of Kamsa who jailed the astrologers who predicted his death. When do we wake up and appreciate our own vernacular wisdom not as religion but as secular science ?"
## [3] "I won’t be uttering a word more until we are all done"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     
## [4] "Therefore there are, within this report, numerous grounds for concern. However, in terms of specific issues, three matters of special concern need to be highlighted:"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     
## [5] "In most cases, the no-contact provision can only be removed once the person you have been forbidden to contact goes to court and states the reasons why they want the provision lifted."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   
## [6] "14 within"

Then, we create a corpus and apply some cleaning transformations:
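The corpus-building code is not shown either. A sketch using tm's standard transformations (the exact set of transformations applied here is an assumption, chosen to be consistent with the frequency tables below, where stop words are absent but numbers like "100" survive):

corpus <- VCorpus(VectorSource(sample_data))
corpus <- tm_map(corpus, content_transformer(tolower))       # lower-case everything
corpus <- tm_map(corpus, removePunctuation)                  # drop ASCII punctuation
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # drop common stop words
corpus <- tm_map(corpus, stripWhitespace)                    # collapse extra whitespace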

Then, we create unigrams and bigrams using NGramTokenizer. The first parameter is the text to tokenize, and the second is an object of class Weka_control that sets the minimum and maximum n-gram size (both 1 for unigrams, both 2 for bigrams):
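A sketch of the two tokenizers; wrapping them in helper functions for later use with tm is an assumed convention, and the function names are hypothetical:

unigram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
bigram_tokenizer  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))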

Using the unigram tokenizer, we build the term-document matrix and count the frequency of each unigram:
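The matrix-building code is not shown. A sketch, assuming the unigram_tokenizer helper defined above and an assumed frequency cutoff of 20 to keep the table small:

tdm_1 <- TermDocumentMatrix(corpus, control = list(tokenize = unigram_tokenizer))
freq_1 <- rowSums(as.matrix(tdm_1))   # total count of each unigram across documents
freq_1 <- data.frame(words = names(freq_1), frequency = freq_1)
freq_1 <- subset(freq_1, frequency >= 20)  # assumed cutoff
head(freq_1)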

##     words frequency
## ’ll   ’ll        84
## ’re   ’re        77
## ’ve   ’ve       103
## 100   100        47
## 1st   1st        22
## 200   200        20

We do the same with bigrams:
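Analogous code, with freq_2 matching the object name that appears in the word cloud warning further below:

tdm_2 <- TermDocumentMatrix(corpus, control = list(tokenize = bigram_tokenizer))
freq_2 <- rowSums(as.matrix(tdm_2))   # total count of each bigram across documents
freq_2 <- data.frame(words = names(freq_2), frequency = freq_2)
freq_2 <- subset(freq_2, frequency >= 20)  # assumed cutoff, as above
head(freq_2)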

##                     words frequency
## can get           can get        56
## can help         can help        27
## can make         can make        36
## can see           can see        27
## cant believe cant believe        20
## cant wait       cant wait        79

Generating word clouds

We can get an easy view of the most frequent unigrams and bigrams by generating a word cloud for each:
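The word cloud calls are not shown; judging from the warning below, which names freq_2 and max.words = 30, they presumably look like this (the color palette is an assumption):

wordcloud(words = freq_1$words, freq = freq_1$frequency, max.words = 30,
          colors = brewer.pal(8, "Dark2"))  # palette is assumed
wordcloud(words = freq_2$words, freq = freq_2$frequency, max.words = 30,
          colors = brewer.pal(8, "Dark2"))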

## Warning in wordcloud(words = freq_2$words, freq = freq_2$frequency, max.words =
## 30, : cant wait could not be fit on page. It will not be plotted.

From those word clouds, we can see that ‘just’, ‘like’, ‘one’, ‘get’ and ‘will’, for example, are the most frequent words in our corpus, and ‘right now’, ‘cant wait’ and ‘last night’ are the most frequent bigrams.

Plotting N-grams

Another way to get a quick view of the most frequent unigrams and bigrams is to plot them:
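The plotting code is not included; a sketch with ggplot2, assuming we show the 20 most frequent unigrams (the bigram plot is analogous):

top_1 <- head(freq_1[order(-freq_1$frequency), ], 20)  # 20 most frequent unigrams (assumed)
ggplot(top_1, aes(x = reorder(words, frequency), y = frequency)) +
  geom_col() +
  coord_flip() +  # horizontal bars for readability
  labs(x = "Unigram", y = "Frequency")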

We find the same most frequent unigrams and bigrams that we observed in the word clouds, but here we get a more precise idea of their frequencies.