Introduction

This report presents an exploratory analysis of English texts (from blogs, news, and Twitter) as a first step toward building a predictive text model. It provides summaries of the files used and basic histograms.

Data

This report uses three English text files, read into the variables usblogs, usnews, and ustwitter.

setwd("G:/_2020 Coursera Data Science/10 Data Science Capstone/final")
usblogs<-readLines("en_US/en_US.blogs.txt",skipNul=TRUE,encoding="UTF-8")
usnews<-readLines("en_US/en_US.news.txt",skipNul=TRUE,encoding="UTF-8")
ustwitter<-readLines("en_US/en_US.twitter.txt",skipNul=TRUE,encoding="UTF-8")

Files Summary

This section provides a basic summary of the three English text files mentioned above. It includes file size, number of lines, word count, and number of characters.

library(knitr)
library(ngram)
setwd("G:/_2020 Coursera Data Science/10 Data Science Capstone/final")
bfile<-round(file.size("en_US/en_US.blogs.txt")/1024^2,2)
nfile<-round(file.size("en_US/en_US.news.txt")/1024^2,2)
tfile<-round(file.size("en_US/en_US.twitter.txt")/1024^2,2)
blines<-length(usblogs)
nlines<-length(usnews)
tlines<-length(ustwitter)
bwords<-wordcount(usblogs)
nwords<-wordcount(usnews)
twords<-wordcount(ustwitter)
## Character counts; named *chars so they do not mask base::nchar()
bchars<-sum(nchar(usblogs))
nchars<-sum(nchar(usnews))
tchars<-sum(nchar(ustwitter))
t1<- matrix(c(bfile,nfile,tfile,
              blines,nlines,tlines,
              bwords,nwords,twords,
              bchars,nchars,tchars),ncol=3,byrow=TRUE)
colnames(t1) <- c("Blogs","News","Twitter")
rownames(t1) <- c("File Size, MB","Number of Lines",
                  "Word Count","Number of Characters")
kable(format(t1,big.mark=","),caption='Basic Summary')
Basic Summary
                                 Blogs            News         Twitter
File Size, MB                   200.42          196.28          159.36
Number of Lines             899,288.00       77,259.00    2,360,148.00
Word Count               37,334,131.00    2,643,969.00   30,373,583.00
Number of Characters    206,824,505.00   15,639,408.00  162,096,241.00
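
The News figures in the table look low relative to the file size: a 196 MB file would be expected to hold far more than 15.6 million characters, as the Blogs and Twitter rows suggest. One possible cause, which is an assumption and not something verified in this report, is that readLines() in text mode stopped early at an embedded control character, as can happen on Windows. A minimal sketch of reading through a binary connection follows. It uses a small temporary file instead of the real data, and read_all_lines is a hypothetical helper name.

```r
## Hypothetical helper: read every line of a file through a binary connection
read_all_lines <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  readLines(con, skipNul = TRUE, encoding = "UTF-8")
}

## Toy file with an embedded SUB (0x1A) control character on the second line
tmp <- tempfile(fileext = ".txt")
writeBin(charToRaw("first line\nsecond\x1a line\nthird line\n"), tmp)

## All three lines should be returned, including the one after the control character
toy <- read_all_lines(tmp)
length(toy)
```

If the same approach were applied to en_US/en_US.news.txt, the line, word, and character counts for News would need to be recomputed.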

Data Samples

The datasets are fairly large, so the next step is to draw a random 1% sample of lines from each file. Code for only one file (blogs) is displayed below. The samples from all three files are then combined into a single sample.

setwd("G:/_2020 Coursera Data Science/10 Data Science Capstone/final")
library(LaF)
set.seed(111)
sampleblogs<-"en_US/en_US.blogs.txt"
## Sample 1% of the lines; the line count comes from usblogs, read earlier
sblogs<-sample_lines(sampleblogs, round(length(usblogs)/100))
writeLines(sblogs,con="sampleblogs.txt",sep="\n",useBytes=FALSE)
sblogs<-readLines("sampleblogs.txt",skipNul=TRUE,encoding="UTF-8")
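
The sampling and combining steps can be sketched in base R on toy data. This is an illustration under assumptions, not the code used for the report: sample_fraction is a hypothetical helper, and the toy vectors stand in for the lines of the three files.

```r
## Hypothetical helper: draw a random fraction of lines, keeping file order
sample_fraction <- function(lines, fraction) {
  n <- max(1, round(length(lines) * fraction))
  lines[sort(sample.int(length(lines), n))]
}

set.seed(111)
## Toy stand-ins for the blogs, news, and twitter lines
toy_blogs   <- paste("blog line", 1:500)
toy_news    <- paste("news line", 1:300)
toy_twitter <- paste("tweet", 1:1000)

## 1% of each source, combined into a single character vector
toy_sample <- c(sample_fraction(toy_blogs, 0.01),
                sample_fraction(toy_news, 0.01),
                sample_fraction(toy_twitter, 0.01))
length(toy_sample)
```

Sampling each source separately keeps its share of the combined sample proportional to its number of lines.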

Data Cleaning

First, I create a corpus from the combined sample of English blogs, news, and Twitter texts. Then I tokenize it and remove punctuation, symbols, and separators. When constructing the document-feature matrix, I also remove stop words, the most common words in a language, such as I, we, me, our, and her.

The next task is to check whether the sample contains profane words that need to be removed. I use profanity_alvarez, a character vector of profane words by Alejandro U. Alvarez from the lexicon package. It contains 438 elements.

library(ngram)
## Package for managing and analyzing text
library(quanteda)
## Profanity data base
library(lexicon)
corp<-corpus(sample) 
t<-tokens(corp, remove_punct = TRUE,remove_symbols = TRUE,
          remove_separators = TRUE)
## Remove stop words at the token level, then build the document-feature matrix
d<-dfm(tokens_remove(t,stopwords("english")))
fr<-textstat_frequency(d)
## Frequency of profanity words
frbad<-fr[which(fr$feature %in% profanity_alvarez),]
## Total number of profanity words in the sample
sum(frbad$frequency)
## [1] 1403
## Percentage of profanity words in the sample
sum(frbad$frequency)/wordcount(sample)*100
## [1] 0.13653

Although the share of profane words is very small (0.14%), these words should be removed.

corp<-corpus(sample) 
t<-tokens(corp, remove_punct = TRUE)
removepr<-tokens_remove(t,profanity_alvarez)
## Remove stop words at the token level, then build the document-feature matrix
dpr<-dfm(tokens_remove(removepr,stopwords("english")))
frpr<-textstat_frequency(dpr)
frbad<-frpr[which(frpr$feature %in% profanity_alvarez),]
sum(frbad$frequency)
## [1] 0

Now there are no profanity words in the sample that match the Alvarez database.

Unigrams

The next step is the analysis of unigrams. An n-gram is a contiguous sequence of n items from a given sample of text or speech. In this case, a unigram is a single word.
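
To make the definition concrete, here is a minimal base R sketch that builds n-grams from a vector of tokens. It is only an illustration: make_ngrams is a hypothetical helper, not part of quanteda, and the report itself relies on quanteda for this step.

```r
## Hypothetical helper: join every run of n consecutive tokens with "_"
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = "_"),
         character(1))
}

toy_tokens <- c("looking", "forward", "to", "it")
make_ngrams(toy_tokens, 1)  # unigrams: the tokens themselves
make_ngrams(toy_tokens, 2)  # bigrams: pairs of adjacent tokens
make_ngrams(toy_tokens, 3)  # trigrams: triples of adjacent tokens
```

A sequence of k tokens yields k - n + 1 n-grams, so the four tokens here give four unigrams, three bigrams, and two trigrams.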

Below are the frequencies of the top features (words).

head(frpr,10)
##    feature frequency rank docfreq group
## 1     said      3078    1    2799   all
## 2     just      3044    2    2832   all
## 3      one      2998    3    2612   all
## 4     like      2742    4    2464   all
## 5      can      2529    5    2202   all
## 6      get      2233    6    2034   all
## 7     time      2209    7    1998   all
## 8      new      1946    8    1755   all
## 9      now      1834    9    1741   all
## 10    good      1796   10    1667   all

The code below creates a frequency histogram of the most popular words in the sample, excluding stop words and profanity. The most frequently used word is said, followed by just and one.

library(ggplot2)
## Labeling the first 15 words (for the graph)
forlabel<-frpr$frequency[1:15]
dpr %>% 
    textstat_frequency(n = 15) %>% 
    ggplot(aes(x = reorder(feature, -frequency), y = frequency)) +
    geom_bar(stat="identity",color="deepskyblue4",fill="white")+
    geom_text(aes(label=forlabel), vjust=-0.3, size=3)+
    labs(x = "Most Popular Words", y = "Frequency") +
    ggtitle("Frequency Histogram") +
    theme(plot.title = element_text(hjust = 0.5,face="bold"))

A word cloud provides a nice visual display of the 100 most common words.

set.seed(161)
textplot_wordcloud(dpr, max_words = 100)

Bigrams

The next step is the analysis of bigrams, that is, pairs of words. Stop words and profanity are removed here as well. The most frequently used pair is right_now, followed by new_york and last_year.

t<-tokens(corp, remove_punct = TRUE,remove_symbols = TRUE,
          remove_separators = TRUE)
t2<-tokens_remove(t,pattern = stopwords("en"))
t2<-tokens_remove(t2,profanity_alvarez)
t2 <- tokens_ngrams(t2, n = 2)
d2<-dfm(t2)
fr2<-textstat_frequency(d2)
head(fr2,10)
##            feature frequency rank docfreq group
## 1        right_now       229    1     226   all
## 2         new_york       166    2     157   all
## 3        last_year       162    3     160   all
## 4       last_night       156    4     152   all
## 5        years_ago       144    5     141   all
## 6      high_school       141    6     132   all
## 7       first_time       134    7     132   all
## 8        last_week       123    8     123   all
## 9  looking_forward       122    9     122   all
## 10      looks_like       119   10     118   all
forlabel2<-fr2$frequency[1:20]
d2 %>% 
    textstat_frequency(n = 20) %>% 
    ggplot(aes(x = reorder(feature, -frequency), y = frequency)) +
    geom_bar(stat="identity",color="deepskyblue4",fill="white")+
    geom_text(aes(label=forlabel2), vjust=-0.3, size=3)+
    labs(x = "Most Popular Pairs", y = "Frequency") +
    ggtitle("Bigrams") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1,vjust=0.5),
          plot.title = element_text(hjust = 0.5,face="bold"))

The word cloud displays the 35 most popular pairs of words.

Trigrams

The analysis of trigrams is done in the same way as the analysis of bigrams (therefore, the code is not included). The most frequently used trigram is let_us_know, followed by president_barack_obama; new_york_times and new_york_city are tied for third place.

##                   feature frequency rank docfreq group
## 1             let_us_know        27    1      27   all
## 2  president_barack_obama        18    2      18   all
## 3          new_york_times        17    3      17   all
## 4           new_york_city        17    3      16   all
## 5          happy_new_year        15    5      15   all
## 6           two_years_ago        15    5      15   all
## 7       happy_mothers_day        15    5      15   all
## 8         st_louis_county        15    5      15   all
## 9           cinco_de_mayo        14    9      13   all
## 10         love_love_love        13   10      10   all
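
Since the trigram code is not shown, here is a minimal sketch of how it might look, mirroring the bigram steps. It is an assumption, not the code that produced the table above: toy texts replace the sample, the profanity-removal step is omitted for brevity, and topfeatures() is used in place of textstat_frequency() to keep the example to the quanteda package alone.

```r
library(quanteda)

## Toy texts standing in for the sample
toy <- c("Happy New Year to everyone", "a happy new year party")

## Same tokenization and stop-word removal as in the bigram analysis
toks <- tokens(toy, remove_punct = TRUE, remove_symbols = TRUE)
toks <- tokens_remove(toks, pattern = stopwords("en"))

## Sequences of three consecutive remaining tokens
tri <- tokens_ngrams(toks, n = 3)
d3 <- dfm(tri)

## Most frequent trigrams and their counts
topfeatures(d3, 3)
```

Because dfm() lowercases features by default, "Happy New Year" and "happy new year" are counted as the same trigram.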

Frequency Sorted Sample

The frequency-sorted sample is used to determine the number of unique words needed to cover 50% (or 90%) of all word instances in the sample.

The cumulative percentage graph is constructed below.

## Cumulative frequency
a<-cumsum(frpr$frequency)
## % frequency
cumpercent<-round(a/max(a)*100,1)
df<-cbind(frpr,cumpercent)
## Number of words at which 50% and 90% of cumulative frequency is first reached
line50<-which(df$cumpercent>=50)[1]
line90<-which(df$cumpercent>=90)[1]
ggplot(df, aes(x = as.numeric(row.names(df)), y = cumpercent)) + 
    geom_line()+
    labs(x = "Number of words", y = "Cumulative percentage") +
    ggtitle("Frequency sorted sample")+
    theme(plot.title = element_text(hjust = 0.5,face="bold"))+
    geom_hline(yintercept=50,color="blue",linetype=4)+
    geom_vline(xintercept=line50,color="blue",linetype=4)+
    geom_hline(yintercept=90,color="cadetblue4",linetype=5)+
    geom_vline(xintercept=line90,color="cadetblue4",linetype=5)+
    annotate("text", x=line50, y=0, label=line50, vjust=0, hjust=0,
              color="blue")+
    annotate("text", x=line90, y=0, label=line90, vjust=0, hjust=0,
              color="cadetblue4") 

About 1,050 unique words are needed to cover 50% of all word instances in the sample (after removing stop words and profanity), and about 16,840 to cover 90%.
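
The coverage calculation can be illustrated on a small frequency vector. This is a sketch on made-up toy counts, not on the report's data, and words_needed is a hypothetical helper name.

```r
## Toy word frequencies, already sorted in decreasing order
freq <- c(the = 50, cat = 20, sat = 15, mat = 10, hat = 5)

## Share of all word instances covered by the top k words, for k = 1, 2, ...
coverage <- cumsum(freq) / sum(freq)

## Hypothetical helper: smallest number of words whose coverage reaches the target
words_needed <- function(coverage, target) {
  unname(which(coverage >= target)[1])
}

words_needed(coverage, 0.5)
words_needed(coverage, 0.9)
```

Using >= instead of an exact equality test means the helper still returns a result when no cumulative value matches the target exactly.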