The purpose of this project is to perform an exploratory analysis of three large English text files (blog, news, and Twitter posts) as the first step toward building a predictive text model.
Step one is to load all required libraries and create some initial variables.
library(stringr)
library(stringi)
library(hunspell)
library(DescTools)
library(ggplot2)
library(knitr)
library(kableExtra)
library(gtable)
library(grid)
library(scales)
urltext <- c("en_US.blogs.txt","en_US.news.txt","en_US.twitter.txt")
#urltext <- c("enUS.mini.txt","enUS.mini.txt","enUS.mini.txt")
datatype <- c("Blogs","News","Twitter")
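The commented-out line above points at small local sample files. A minimal sketch for producing such samples is shown below; the 1% sampling rate and the output file names are assumptions for illustration, not part of the assignment.
# Illustrative helper: write a roughly 1% random sample of each file so the
# analysis can be rehearsed quickly on smaller inputs (file names are assumed)
set.seed(1234)
for (f in urltext) {
  full <- readLines(f, skipNul = TRUE)
  mini <- sample(full, ceiling(0.01 * length(full)))
  writeLines(mini, paste0("mini.", f))
}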
The learner has been given three very large text files to analyze, with the end purpose of developing a predictive model. In this phase the intent is to get to know the files and determine whether there are any key differences between them. The files contain blog posts, news articles, and Twitter posts. This phase covers single-word (unigram) analysis only.
In this step each of the three files is read by a common function, and the key elements are returned for further analysis.
# Storage for the summary table; all columns start as character and are filled below
info_stor <- data.frame(Data_Type = "",
                        Lines = "",
                        Characters = "",
                        Words = "",
                        Percentile_50th = "",
                        Percentile_75th = "",
                        Percentile_90th = "",
                        Percentile_95th = "",
                        stringsAsFactors = FALSE)
# Process each file, store its summary statistics, and keep its word table for plotting
for (i in 1:3) {
  data_values <- proc_file(urltext[i])
  info_stor[i, 1] <- datatype[i]
  for (j in 2:8) {
    info_stor[i, j] <- data_values[j]
  }
  if (i == 1) {
    blogswords <- data.frame(data_values[1])
  } else if (i == 2) {
    newswords <- data.frame(data_values[1])
  } else {
    twitterwords <- data.frame(data_values[1])
  }
}
Begin by looking at similarities and differences between the data files. The table below shows the relative sizes in lines, characters, and unique words (after removing non-word character strings and words not recognized by the hunspell dictionary). It also shows how many of the most frequent unique words it takes to cover 50, 75, 90, and 95 percent of the total word count.
kable(info_stor) %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| Data_Type | Lines | Characters | Words | Percentile_50th | Percentile_75th | Percentile_90th | Percentile_95th |
|---|---|---|---|---|---|---|---|
| Blogs | 899288 | 199231704 | 62348 | 83 | 849 | 4153 | 8676 |
| News | 77259 | 14916062 | 36035 | 119 | 1162 | 4478 | 8488 |
| Twitter | 2360148 | 152968969 | 51277 | 88 | 559 | 2597 | 5479 |
From this table we see some differences between the three file types. Most notably, the number of unique words required to reach 50%, 75%, 90%, and 95% of the cumulative word frequency differs significantly, with the Twitter file requiring far fewer words.
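As a quick illustration of how the percentile columns are derived (a toy example with made-up counts, not the project data): sort the word counts in descending order, take the cumulative share of the total, and find the rank at which that share reaches the target.
# Toy example of the coverage calculation; proc_file() uses DescTools::Closest(),
# which picks the nearest cumulative value rather than the first to reach the threshold
freq <- c(the = 50, to = 30, a = 20, dog = 5, cat = 3, run = 2)
cumpct <- cumsum(sort(freq, decreasing = TRUE)) / sum(freq)
which(cumpct >= 0.50)[1]   # 2 words cover at least 50% of this toy corpus
which(cumpct >= 0.90)[1]   # 3 words cover at least 90%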
It is also useful to look at plots of the most frequently used words in each of the three files. The following three graphs show both the cumulative frequency percentage and the word count frequency for each of the three data sources: Blogs, News, and Twitter.
# Produce the dual-axis plot for each data source; mval scales the secondary (count) axis
for (i in 1:3) {
  graphtitle <- paste("Cumulative Percentage and Frequencies of the 40 Most Used Words in ",
                      info_stor[i, 1], " Posts", sep = '')
  if (i == 1) {
    mval <- blogswords[1, 2] / blogswords[1, 3]
    run_plot(blogswords, graphtitle, mval)
  } else if (i == 2) {
    mval <- newswords[1, 2] / newswords[1, 3]
    run_plot(newswords, graphtitle, mval)
  } else {
    mval <- twitterwords[1, 2] / twitterwords[1, 3]
    run_plot(twitterwords, graphtitle, mval)
  }
}
From the plots we see that there are some similarities and some differences in the frequencies of the words. For example, the pronouns “I”, “my”, and “you” appear much higher in the Twitter list than in the other two lists. The word “about” shows up in the news and blogs list, but not in the Twitter list. It will be especially interesting to look at the differences in the bigrams and trigrams in future analysis.
The next steps are to move the analysis forward with additional discoveries related to n-grams in each of the data files. This is the next step toward building the predictive model that will power a Shiny app for predicting the next word of a typed-in sentence.
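As a rough sketch of the n-gram step planned for the next phase (a minimal base-R approach for bigrams, not the final implementation; the count_bigrams name is illustrative):
# Illustrative bigram counter: pair each word with the word that follows it
count_bigrams <- function(sentences) {
  words <- strsplit(tolower(sentences), "\\s+")
  bigrams <- unlist(lapply(words, function(w) {
    if (length(w) < 2) return(character(0))
    paste(head(w, -1), tail(w, -1))
  }))
  sort(table(bigrams), decreasing = TRUE)
}
count_bigrams(c("this is a test", "this is another test"))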
This function is called once per data file; it reads the file and returns the key information used in the report above.
# This function processes one data file: it cleans the text, keeps dictionary words,
# builds a word-frequency table, and returns the summary statistics used above.
proc_file <- function(urlname) {
  # Read the file, suppressing warnings about incomplete final lines
  sentences <- suppressWarnings(readLines(urlname, skipNul = TRUE))
  # Remove periods and commas
  sentences <- gsub("\\.", "", sentences)
  sentences <- gsub(",", "", sentences)
  sentences <- str_trim(sentences)                # trim leading/trailing whitespace
  sentences <- gsub("[^A-Za-z ]", "", sentences)  # drop anything that is not a letter or space
  sentences <- tolower(sentences)                 # set everything to lower case
  # Get the number of lines and characters
  words.lines <- stri_stats_general(sentences)['Lines']
  words.chars <- stri_stats_general(sentences)['Chars']
  # Split the sentences into words and keep only correctly spelled, non-empty words
  words <- as.character(unlist(strsplit(sentences, " ")))
  correct <- hunspell_check(words)
  correctwords <- data.frame(words = words, correct = correct, stringsAsFactors = FALSE)
  correctwords <- correctwords[correctwords$correct & correctwords$words != "", ]
  words <- correctwords$words
  # Calculate word frequencies, ordered from most to least frequent
  words.freq <- table(words)
  words.table <- as.data.frame(words.freq)
  words.table$Freq <- as.integer(words.table$Freq)
  words.table <- words.table[order(-words.table$Freq), ]
  row.names(words.table) <- 1:nrow(words.table)
  # Add per-word and cumulative frequency percentages
  words.table$freqpct <- words.table$Freq / sum(words.table$Freq)
  words.table$cumpct <- cumsum(words.table$Freq) / sum(words.table$Freq)
  words.total <- nrow(words.table)
  # Number of unique words whose cumulative percentage is closest to each threshold
  words.50 <- Closest(words.table$cumpct, 0.50, which = TRUE)
  words.75 <- Closest(words.table$cumpct, 0.75, which = TRUE)
  words.90 <- Closest(words.table$cumpct, 0.90, which = TRUE)
  words.95 <- Closest(words.table$cumpct, 0.95, which = TRUE)
  # Return the ordered table along with the summary information about the file
  return(list("table" = words.table,
              "Lines" = words.lines,
              "Characters" = words.chars,
              "Words" = words.total,
              "Words at 50th pct" = words.50,
              "Words at 75th pct" = words.75,
              "Words at 90th pct" = words.90,
              "Words at 95th pct" = words.95))
}
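For reference, proc_file() can also be called directly for interactive inspection; a minimal example (assuming en_US.blogs.txt is in the working directory):
# Direct call, equivalent to what the processing loop above does for the blogs file
blogs_info <- proc_file("en_US.blogs.txt")
head(blogs_info$table)              # most frequent words with freqpct and cumpct
blogs_info[["Words at 90th pct"]]   # unique words needed to reach ~90% coverage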
run_plot <- function(tn, title, mval) {
  # This function produces the overlaid plots. The word table, the chart title, and the
  # multiplier for the secondary y axis are passed in; everything else is automatic.
  grid.newpage()
  # First layer: cumulative frequency percentage (red), with a secondary count axis
  p1 <- ggplot(data = tn[1:40, ], aes(x = reorder(table.words, table.cumpct))) +
    geom_bar(aes(y = table.cumpct), stat = "identity", fill = "red", alpha = 0.50) +
    scale_y_continuous(sec.axis = sec_axis(trans = ~ . * mval,
                                           name = "Word Count Frequency",
                                           labels = comma)) +
    theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
  p1 <- p1 + labs(x = "Words", y = "Cumulative Frequency", title = title)
  p1 <- p1 + theme(axis.text.x = element_text(angle = 90, hjust = 1))
  p1 <- p1 + annotate(geom = "text", x = 10, y = 0.4, hjust = 0,
                      label = "Word Count is Blue", color = "blue")
  p1 <- p1 + annotate(geom = "text", x = 10, y = 0.37, hjust = 0,
                      label = "Cum Freq Pct is Red", color = "red")
  # Second layer: raw word counts (blue) on a transparent background
  p2 <- ggplot(data = tn[1:40, ], aes(x = reorder(table.words, table.cumpct))) +
    geom_bar(aes(y = table.Freq), stat = "identity", fill = "blue", alpha = 0.50) +
    theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) %+replace%
    theme(panel.background = element_rect(fill = NA))
  # Extract the gtables and overlay the panel of the second plot on the first
  g1 <- ggplot_gtable(ggplot_build(p1))
  g2 <- ggplot_gtable(ggplot_build(p2))
  pp <- c(subset(g1$layout, name == "panel", select = t:r))
  g <- gtable_add_grob(g1, g2$grobs[[which(g2$layout$name == "panel")]],
                       pp$t, pp$l, pp$b, pp$r)
  grid.draw(g)
}