This report presents an exploratory analysis of English texts (from blogs, news, and Twitter), with the eventual goal of building a predictive text model. It provides summaries of the files used and basic histograms.
This report uses three English text files, loaded as usblogs, usnews, and ustwitter.
setwd("G:/_2020 Coursera Data Science/10 Data Science Capstone/final")
usblogs<-readLines("en_US/en_US.blogs.txt",skipNul=TRUE,encoding="UTF-8")
usnews<-readLines("en_US/en_US.news.txt",skipNul=TRUE,encoding="UTF-8")
ustwitter<-readLines("en_US/en_US.twitter.txt",skipNul=TRUE,encoding="UTF-8")
This section provides a basic summary of the three English text files mentioned above: file size, number of lines, word count, and number of characters.
library(knitr)
library(ngram)
setwd("G:/_2020 Coursera Data Science/10 Data Science Capstone/final")
bfile<-round(file.size("en_US/en_US.blogs.txt")/1024^2,2)
nfile<-round(file.size("en_US/en_US.news.txt")/1024^2,2)
tfile<-round(file.size("en_US/en_US.twitter.txt")/1024^2,2)
blines<-length(usblogs)
nlines<-length(usnews)
tlines<-length(ustwitter)
bwords<-wordcount(usblogs)
nwords<-wordcount(usnews)
twords<-wordcount(ustwitter)
bchar<-sum(nchar(usblogs))
## nchars, not nchar, so the variable does not share a name with base::nchar()
nchars<-sum(nchar(usnews))
tchar<-sum(nchar(ustwitter))
t1<- matrix(c(bfile,nfile,tfile,
blines,nlines,tlines,
bwords,nwords,twords,
bchar,nchars,tchar),ncol=3,byrow=TRUE)
colnames(t1) <- c("Blogs","News","Twitter")
rownames(t1) <- c("File Size, MB","Number of Lines",
"Word Count","Number of Characters")
kable(format(t1,big.mark=","),caption='Basic Summary')
| | Blogs | News | Twitter |
|---|---|---|---|
| File Size, MB | 200.42 | 196.28 | 159.36 |
| Number of Lines | 899,288.00 | 77,259.00 | 2,360,148.00 |
| Word Count | 37,334,131.00 | 2,643,969.00 | 30,373,583.00 |
| Number of Characters | 206,824,505.00 | 15,639,408.00 | 162,096,241.00 |
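The three count rows of the table can be illustrated on a toy vector with base R alone. This is a minimal sketch, assuming words are separated by whitespace, which may differ in detail from how wordcount from the ngram package splits text. The vector toy_lines is made up for illustration and is not part of the data.

```r
# Toy stand-in for one of the text files (hypothetical data, not the real corpus)
toy_lines <- c("The quick brown fox", "jumps over", "the lazy dog")

# Number of lines: one vector element per line
n_lines <- length(toy_lines)

# Word count: split each line on runs of whitespace and count the pieces
n_words <- sum(lengths(strsplit(trimws(toy_lines), "\\s+")))

# Character count: total characters across all lines (newlines excluded)
n_chars <- sum(nchar(toy_lines))

summary_row <- c(lines = n_lines, words = n_words, chars = n_chars)
print(summary_row)
```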
The datasets are fairly large, so the next step is to draw a random 1% sample from each file. Code is shown for only one file. The three samples are then combined into a single sample.
setwd("G:/_2020 Coursera Data Science/10 Data Science Capstone/final")
library(LaF)
set.seed(111)
sampleblogs<-"en_US/en_US.blogs.txt"
## length(sampleblogs) is 1 (it is a file name), so take 1% of the lines read into usblogs
sblogs<-sample_lines(sampleblogs, round(length(usblogs)/100))
writeLines(sblogs,con="sampleblogs.txt",sep="\n",useBytes=FALSE)
sblogs<-readLines("sampleblogs.txt",skipNul=TRUE,encoding="UTF-8")
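The code that combines the three samples is not shown in this report. A minimal sketch of the idea, assuming the lines are already in memory and using base R sample.int instead of LaF, could look as follows. The helper sample_fraction and the toy vectors are hypothetical names introduced only for illustration.

```r
# Hypothetical in-memory stand-ins for usblogs, usnews, and ustwitter
set.seed(111)
toy_blogs   <- paste("blog line", 1:1000)
toy_news    <- paste("news line", 1:500)
toy_twitter <- paste("tweet", 1:2000)

# Draw a fixed fraction of lines without replacement (at least one line)
sample_fraction <- function(lines, fraction = 0.01) {
  n <- max(1, round(length(lines) * fraction))
  lines[sample.int(length(lines), n)]
}

# Combine the three 1% samples into one character vector
toy_sample <- c(sample_fraction(toy_blogs),
                sample_fraction(toy_news),
                sample_fraction(toy_twitter))
length(toy_sample)
```

Sampling each file separately keeps the proportions of the three sources in the combined sample the same as in the full data.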
First, I create a corpus from the sample, which consists of texts sampled from English blogs, news, and Twitter. Then I tokenize it and remove punctuation, symbols, and separators. I also remove stop words (the most common words in a language, such as I, we, me, our, her) when I construct the document-feature matrix.
The next task is to check whether the sample contains profanity that needs to be removed. I use profanity_alvarez from the lexicon package, a character vector of profane words by Alejandro U. Alvarez. It contains 438 elements.
library(ngram)
## Package for managing and analyzing text
library(quanteda)
## Profanity data base
library(lexicon)
## sample is the combined 1% sample of blogs, news, and twitter lines
corp<-corpus(sample)
t<-tokens(corp, remove_punct = TRUE,remove_symbols = TRUE,
remove_separators = TRUE)
d<-dfm(t,remove=stopwords("english"),remove_punct=TRUE)
fr<-textstat_frequency(d)
## Frequency of profanity words
frbad<-fr[which(fr$feature %in% profanity_alvarez),]
## Total number of profanity words in the sample
sum(frbad$frequency)
## [1] 1403
## Percentage of profanity words in the sample
sum(frbad$frequency)/wordcount(sample)*100
## [1] 0.13653
Although profanity makes up a very small share of the sample (0.14%), these words should still be removed.
corp<-corpus(sample)
t<-tokens(corp, remove_punct = TRUE)
removepr<-tokens_remove(t,profanity_alvarez)
dpr<-dfm(removepr,remove=stopwords("english"),remove_punct=TRUE)
frpr<-textstat_frequency(dpr)
frbad<-frpr[which(frpr$feature %in% profanity_alvarez),]
sum(frbad$frequency)
## [1] 0
Now there are no profanity words in the sample that match the Alvarez database.
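The filtering step amounts to dropping every token that appears in a block list. A base R sketch of that idea is given below. The two-word block_list is a harmless placeholder standing in for profanity_alvarez, and the matching here is exact and lower-cased, which is an assumption and may not mirror the pattern matching done by tokens_remove.

```r
# Hypothetical placeholder block list; the report uses lexicon::profanity_alvarez
block_list <- c("darn", "heck")

toy_tokens <- c("well", "darn", "that", "was", "a", "heck", "of", "a", "game")

# Keep only tokens that do not appear in the block list (case-insensitive)
clean_tokens <- toy_tokens[!tolower(toy_tokens) %in% block_list]

# Check that nothing from the block list survives
sum(clean_tokens %in% block_list)
```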
The next step is the analysis of unigrams. An n-gram is a contiguous sequence of n items from a given sample of text or speech. In this case, a unigram is a single word.
Below are the frequencies of the top features (words).
head(frpr,10)
## feature frequency rank docfreq group
## 1 said 3078 1 2799 all
## 2 just 3044 2 2832 all
## 3 one 2998 3 2612 all
## 4 like 2742 4 2464 all
## 5 can 2529 5 2202 all
## 6 get 2233 6 2034 all
## 7 time 2209 7 1998 all
## 8 new 1946 8 1755 all
## 9 now 1834 9 1741 all
## 10 good 1796 10 1667 all
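The ranking in the output above can be reproduced in miniature with base R. This is only a sketch on two made-up sentences: it splits on whitespace and does not remove stop words, unlike the quanteda pipeline used in this report. Tied features are given the lowest shared rank, which is an assumption about how the rank column is computed.

```r
toy_text <- c("time flies like an arrow", "fruit flies like a banana")

# Tokenize on whitespace; each token is one unigram
unigrams <- unlist(strsplit(tolower(toy_text), "\\s+"))

# Count occurrences and sort from most to least frequent
freq <- sort(table(unigrams), decreasing = TRUE)

# Data frame with feature / frequency / rank columns, as in the output above
top <- data.frame(feature = names(freq),
                  frequency = as.integer(freq),
                  rank = rank(-as.integer(freq), ties.method = "min"))
head(top, 3)
```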
The code below creates a frequency histogram of the most frequent words in the sample, excluding stop words and profanity. The most frequently used word is said, followed by just and one.
library(ggplot2)
## Labeling the first 15 words (for the graph)
forlabel<-frpr$frequency[1:15]
dpr %>%
textstat_frequency(n = 15) %>%
ggplot(aes(x = reorder(feature, -frequency), y = frequency)) +
geom_bar(stat="identity",color="deepskyblue4",fill="white")+
geom_text(aes(label=forlabel), vjust=-0.3, size=3)+
labs(x = "Most Popular Words", y = "Frequency") +
ggtitle("Frequency Histogram") +
theme(plot.title = element_text(hjust = 0.5,face="bold"))
A word cloud provides a visual display of the 100 most common words.
set.seed(161)
textplot_wordcloud(dpr, max_words = 100)
The next step is the analysis of bigrams (pairs of words). Stop words and profanity are removed here as well. The most frequently used pairs are right_now, new_york, and last_year.
t<-tokens(corp, remove_punct = TRUE,remove_symbols = TRUE,
remove_separators = TRUE)
t2<-tokens_remove(t,pattern = stopwords("en"))
t2<-tokens_remove(t2,profanity_alvarez)
t2 <- tokens_ngrams(t2, n = 2)
d2<-dfm(t2)
fr2<-textstat_frequency(d2)
head(fr2,10)
## feature frequency rank docfreq group
## 1 right_now 229 1 226 all
## 2 new_york 166 2 157 all
## 3 last_year 162 3 160 all
## 4 last_night 156 4 152 all
## 5 years_ago 144 5 141 all
## 6 high_school 141 6 132 all
## 7 first_time 134 7 132 all
## 8 last_week 123 8 123 all
## 9 looking_forward 122 9 122 all
## 10 looks_like 119 10 118 all
forlabel2<-fr2$frequency[1:20]
d2 %>%
textstat_frequency(n = 20) %>%
ggplot(aes(x = reorder(feature, -frequency), y = frequency)) +
geom_bar(stat="identity",color="deepskyblue4",fill="white")+
geom_text(aes(label=forlabel2), vjust=-0.3, size=3)+
labs(x = "Most Popular Pairs", y = "Frequency") +
ggtitle("Bigrams") +
theme(axis.text.x = element_text(angle = 90, hjust = 1,vjust=0.5),
plot.title = element_text(hjust = 0.5,face="bold"))
A word cloud displays the 35 most popular pairs of words.
The analysis of trigrams follows the same steps as the analysis of bigrams, so the code is not included. The most frequently used trigram is let_us_know, followed by president_barack_obama, and then new_york_times and new_york_city, which are tied.
## feature frequency rank docfreq group
## 1 let_us_know 27 1 27 all
## 2 president_barack_obama 18 2 18 all
## 3 new_york_times 17 3 17 all
## 4 new_york_city 17 3 16 all
## 5 happy_new_year 15 5 15 all
## 6 two_years_ago 15 5 15 all
## 7 happy_mothers_day 15 5 15 all
## 8 st_louis_county 15 5 15 all
## 9 cinco_de_mayo 14 9 13 all
## 10 love_love_love 13 10 10 all
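Since the trigram code is omitted, here is a base R sketch of how n-grams can be formed from a token vector. It is not the code used for this report, where the step would presumably be tokens_ngrams with n = 3 applied to the tokens after stop-word and profanity removal. The function make_ngrams is a hypothetical helper.

```r
# Build n-grams from a token vector, joined with "_" as in the output above
make_ngrams <- function(tokens, n) {
  # Fewer than n tokens: no n-gram can be formed
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = "_"),
         character(1))
}

toy_tokens <- c("let", "us", "know", "what", "you", "think")
make_ngrams(toy_tokens, 3)
```

A vector of k tokens yields k - n + 1 n-grams, so the six tokens above give four trigrams and five bigrams.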
The frequency-sorted sample is used to determine the number of unique words needed to cover 50% (or 90%) of all word instances in the sample.
The cumulative percentage graph is constructed below.
## Cumulative frequency
a<-cumsum(frpr$frequency)
## % frequency
cumpercent<-round(a/max(a)*100,1)
df<-cbind(frpr,cumpercent)
## Number of words at which 50% and 90% of the cumulative frequency is first reached
line50<-which(df$cumpercent>=50)[1]
line90<-which(df$cumpercent>=90)[1]
ggplot(df, aes(x = as.numeric(row.names(df)), y = cumpercent)) +
geom_line()+
labs(x = "Number of words", y = "Cumulative percentage") +
ggtitle("Frequency sorted sample")+
theme(plot.title = element_text(hjust = 0.5,face="bold"))+
geom_hline(yintercept=50,color="blue",linetype=4)+
geom_vline(xintercept=line50,color="blue",linetype=4)+
geom_hline(yintercept=90,color="cadetblue4",linetype=5)+
geom_vline(xintercept=line90,color="cadetblue4",linetype=5)+
annotate("text", x=line50, y=0, label=line50, vjust=0, hjust=0,
color="blue")+
annotate("text", x=line90, y=0, label=line90, vjust=0, hjust=0,
color="cadetblue4")
About 1050 unique words are needed to cover 50% of all word instances in the sample, and 16840 to cover 90%. These counts are based on frpr, so stop words and profanity are excluded.
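The coverage calculation can be checked on a small example. The sketch below uses made-up frequencies that sum to 100 and finds the smallest number of top-ranked words whose cumulative share reaches a target. The names toy_freq and words_needed are hypothetical.

```r
# Hypothetical frequencies, already sorted from most to least frequent
toy_freq <- c(40, 20, 10, 10, 8, 5, 3, 2, 1, 1)

# Cumulative share of all word instances covered by the top k words
cum_share <- cumsum(toy_freq) / sum(toy_freq)

# Smallest number of words whose cumulative share reaches the target
words_needed <- function(target) which(cum_share >= target)[1]

words_needed(0.5)
words_needed(0.9)
```

Here the top 2 words cover 60% of instances and the top 6 cover 93%, so 2 and 6 words are needed for the 50% and 90% targets. The long tail of rare words is why the 90% threshold requires far more words than the 50% threshold.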