Three English-language text corpora (a set of internet blogs, news collected from the internet, and a set of Twitter messages) are used as input to analyze statistical trends of words. These trends will be used to build a model that predicts the most probable next word in a message. The goal of the project is to build a predictive model that suggests the next word as a person types a sentence, for instance on a smartphone keyboard. A Shiny UI will be implemented, along with a description of the SwiftKey technology that eases text entry on mobile devices.
The analysis is carried out in RStudio (version 0.99.893) and R (version 3.2.4 Revised). The libraries employed are stringi and ggplot2, and the pipe operator from the magrittr library is used to make the code more readable. This report is written in markdown format, rendered to HTML with the knitr library, and published on RPubs.
The required libraries are loaded and a parallel cluster is built to improve performance.
# Required R libraries
library(dplyr)
library(stringi)
library(doParallel)
library(ggplot2)
library(knitr)
# Set up a parallel cluster to improve execution time; register it so
# parallel back-ends can use it, and load stringi on every worker
jobcluster <- makeCluster(detectCores())
registerDoParallel(jobcluster)
invisible(clusterEvalQ(jobcluster, library(stringi)))
| File Name         | File Size (MB) | Line Count | Word Count |
|-------------------|----------------|------------|------------|
| en_US.news.txt    | 196.28         | 1,010,242  | 29,313,526 |
| en_US.twitter.txt | 159.36         | 2,360,148  | 28,632,620 |
| en_US.blogs.txt   | 200.42         | 899,288    | 35,314,678 |
The data are provided for this course under an agreement between SwiftKey and Johns Hopkins University, and are downloaded as a zip archive if not already present in the data folder. The English (US) files are used for model training.
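A minimal sketch of the download step (the URL is the one distributed with the course materials; the destination paths are assumptions that depend on how the archive is unpacked):
# Download and extract the corpus if it is not already present
zipurl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!dir.exists("data/en_US")) {
  download.file(zipurl, destfile = "Coursera-SwiftKey.zip")
  unzip("Coursera-SwiftKey.zip", exdir = "data")  # assumed target layout
}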
The files included in the analysis are:
list.files("data/en_US")
## [1] "en_US.blogs.txt" "en_US.news.txt" "en_US.twitter.txt"
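The raw text is then read into memory; the objects blogs, news and twits used in the code below are assumed to be created along these lines:
# Read each corpus into a character vector, one element per line;
# skipNul avoids warnings from embedded NUL characters in the raw files
blogs <- readLines("data/en_US/en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news  <- readLines("data/en_US/en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twits <- readLines("data/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)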
Basic statistics of the three data files (blogs, news and twitter) are computed: line, character and word counts, plus words-per-line (WPL) summaries. Histograms are presented to display the frequency distributions of these data.
The words per line (WPL) are generally highest in blogs (mean 41.75), followed by news (mean 34.41) and tweets (mean 12.75). This result is expected given the purpose and use of those communication channels.
From the histograms, we also notice that the WPL distributions for all data types are right-skewed (i.e. they have a longer right tail). This may be an indication of a general trend towards short and concise communication.
# Words per line (WPL) for each line of each data type
rawWPL <- lapply(list(blogs, news, twits), stri_count_words)
# Descriptive statistics and summary info for each data type
rawstats <- data.frame(
  File = c("blogs", "news", "twitter"),
  # General line/character statistics plus LaTeX-style word counts
  t(rbind(sapply(list(blogs, news, twits), stri_stats_general),
          TotalWords = sapply(list(blogs, news, twits), stri_stats_latex)[4, ])),
  # Words-per-line summary (min, quartiles, mean, max)
  WPL = rbind(summary(rawWPL[[1]]), summary(rawWPL[[2]]), summary(rawWPL[[3]]))
)
print(rawstats)
## File Lines LinesNEmpty Chars CharsNWhite TotalWords WPL.Min.
## 1 blogs 899288 899288 206824382 170389539 37570839 0
## 2 news 1010242 1010242 203223154 169860866 34494539 1
## 3 twitter 2360148 2360148 162096031 134082634 30451128 1
## WPL.1st.Qu. WPL.Median WPL.Mean WPL.3rd.Qu. WPL.Max.
## 1 9 28 41.75 60 6726
## 2 19 32 34.41 46 1796
## 3 7 12 12.75 18 47
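As a side note, the word counts could also be distributed over the cluster built during setup; a minimal variant using parallel::parLapply (an assumption, since the summary above can equally be computed serially):
# Hypothetical parallel variant; stringi is already loaded on each worker
rawWPL <- parLapply(jobcluster, list(blogs, news, twits), stri_count_words)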
All data types analyzed are right-skewed. The extreme values in the right tail are probably outliers or sampling errors. Blogs show more of these infrequent but extremely high values, followed by news. Tweets have a more compact, closer-to-normal distribution, which is consistent with Twitter's 140-character limit.
# Display a histogram for each data type
qplot(rawWPL[[1]], geom = "histogram", main = "Histogram for US Blogs",
      xlab = "No. of Words", ylab = "Frequency", binwidth = 10)
qplot(rawWPL[[2]], geom = "histogram", main = "Histogram for US News",
      xlab = "No. of Words", ylab = "Frequency", binwidth = 10)
qplot(rawWPL[[3]], geom = "histogram", main = "Histogram for US Tweets",
      xlab = "No. of Words", ylab = "Frequency", binwidth = 1)
The three types of media channels provided (blogs, news, and twitter) show a remarkable difference in the number of words per line. However, in the case of blogs and news, it is necessary to filter out outliers to exclude spurious effects, perhaps in some cases due to sampling errors; a sketch of such filtering follows below. A potential application for the next phase of this analysis is a Shiny application to visualize how the frequency distribution of words per line changes when those outlier cases are excluded. In order to build a predictive model it is necessary to determine the probability of a word given the words that precede it.
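One illustrative way to remove the extreme tail (an assumed approach; the cutoff is arbitrary) is to trim lines above the 99th WPL percentile before plotting:
# Keep only blog lines at or below the 99th percentile of WPL (assumed cutoff)
cutoff <- quantile(rawWPL[[1]], 0.99)
blogsWPL_trimmed <- rawWPL[[1]][rawWPL[[1]] <= cutoff]
qplot(blogsWPL_trimmed, geom = "histogram", main = "Histogram for US Blogs (trimmed)",
      xlab = "No. of Words", ylab = "Frequency", binwidth = 10)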
In order to build the predictive model it is necessary to analyze the frequency distribution of single words (unigrams), word pairs (bigrams) and word triplets (trigrams). These frequencies can be stored in a lookup table, indexed by the preceding word(s), from which the most probable following word is selected.
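A minimal sketch of bigram counting with stringi and base R (illustrative only; it lowercases the text and, as a simplification, pairs words across line boundaries):
# Count bigram frequencies on a small sample of the blogs corpus
words <- unlist(stri_extract_all_words(stri_trans_tolower(blogs[1:1000])))
words <- words[!is.na(words)]  # drop lines with no extractable words
bigrams <- paste(head(words, -1), tail(words, -1))
bigram_freq <- sort(table(bigrams), decreasing = TRUE)
head(bigram_freq, 5)  # the five most frequent word pairs in the sample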