Text Analysis: Milestone Report

1. Introduction

Three English-language text corpora are used as input to analyze statistical trends in word usage: a set of internet blogs, news articles captured from the internet, and a set of Twitter messages. These trends will be used to build a model that predicts the most probable next word as a person types a sentence, for instance on a smartphone keyboard. A Shiny user interface will be implemented, together with a description of how SwiftKey technology eases text entry on mobile devices.

2. Methods

The analysis uses RStudio (version 0.99.893) and R (version 3.2.4 Revised). The libraries employed are stringi and ggplot2. The pipe operator from the magrittr library is used to make the code more readable. This report is written in Markdown, rendered to HTML with the knitr library, and published on RPubs.

2.1. Setup R Environment for data processing and analysis

The required libraries are loaded and a parallel cluster is built to improve performance.

# Required R libraries
library(dplyr)
library(stringi)
library(doParallel)
library(ggplot2)
library(knitr)


# Setup parallel clusters to improve execution time
jobcluster <- makeCluster(detectCores())
invisible(clusterEvalQ(jobcluster, library(stringi)))
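As a minimal sketch of how such a cluster could be put to work, the example below distributes a per-line word count over two workers and then releases them. The toy input, the worker count of two and the base R word splitter (a stand-in for stri_count_words) are assumptions made for illustration; they are not part of the report's pipeline.

```r
library(parallel)

# Two workers keep the sketch light; the report sizes the cluster with detectCores()
cl <- makeCluster(2)

# Toy input (illustrative only): two chunks of text lines
toy_chunks <- list(c("one two three", "four five"), c("six"))

# Count whitespace-separated words per line on the workers
# (base R stand-in for stringi::stri_count_words)
counts <- parLapply(cl, toy_chunks, function(x) lengths(strsplit(x, "\\s+")))

# Always release the workers once the parallel work is finished
stopCluster(cl)

print(counts)
```

Calling stopCluster() when the work is done avoids leaving idle R worker processes running.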

File Name          File Size (MB)  Line Count  Word Count
en_US.news.txt             196.28     1010242    29313526
en_US.twitter.txt          159.36     2360148    28632620
en_US.blogs.txt            200.42      899288    35314678

2.2. Data

The data are provided for this course under an agreement between SwiftKey and Johns Hopkins University, and are distributed as a zip file. The US English (en_US) data are used for model training. The data are downloaded only if they are not already present in the local folders.

The files included in the analysis are:

list.files("data/en_US")

## [1] "en_US.blogs.txt"   "en_US.news.txt"    "en_US.twitter.txt"
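The download-if-missing step described in this section could look roughly like the sketch below. The function name ensure_corpus and all of its arguments are hypothetical; the real archive URL and folder layout are not given in this report and must be supplied by the reader.

```r
# Sketch: fetch and unpack the corpus only when it is not already on disk.
# ensure_corpus() and its arguments are hypothetical names for illustration.
ensure_corpus <- function(corpus_url, zip_path, data_dir, exdir = dirname(zip_path)) {
  # Nothing to do when the target folder already holds text files
  if (dir.exists(data_dir) &&
      length(list.files(data_dir, pattern = "\\.txt$")) > 0) {
    return(invisible(FALSE))
  }
  # Download the archive only if a local copy is missing
  if (!file.exists(zip_path)) {
    download.file(corpus_url, destfile = zip_path, mode = "wb")
  }
  # Unpack; where the files land depends on the folder layout inside the archive
  unzip(zip_path, exdir = exdir)
  invisible(TRUE)
}

# Example call (placeholder values, not run here):
# ensure_corpus(corpus_url, "data/corpus.zip", "data/en_US")
```

The function returns FALSE when the files were already present and TRUE when it had to unpack the archive, which makes the step safe to rerun each time the report is knitted.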

3. Exploratory Data Analysis

Basic statistics are computed for the three data files (blogs, news and Twitter): line, character and word counts, and summaries of words per line (WPL). Histograms display the frequency distribution of WPL.

WPL is highest for blogs (mean 41.75), followed by news (mean 34.41) and tweets (mean 12.75). This is expected given the purpose and use of each communication channel.

The histograms also show that WPL is right-skewed (a longer right tail) for all three data types. This may indicate a general tendency towards short, concise communication.

# WPL on each line for each data type
rawWPL<-lapply(list(blogs,news,twits),function(x) stri_count_words(x))

# Descriptive statistics and summary info for each data type
rawstats<-data.frame(
            File=c("blogs","news","twitter"), 
            t(rbind(sapply(list(blogs,news,twits),stri_stats_general),
                    TotalWords=sapply(list(blogs,news,twits),stri_stats_latex)[4,])),
            # Compute words per line summary
            WPL=rbind(summary(rawWPL[[1]]),summary(rawWPL[[2]]),summary(rawWPL[[3]]))
            )
print(rawstats)

##      File   Lines LinesNEmpty     Chars CharsNWhite TotalWords WPL.Min.
## 1   blogs  899288      899288 206824382   170389539   37570839        0
## 2    news 1010242     1010242 203223154   169860866   34494539        1
## 3 twitter 2360148     2360148 162096031   134082634   30451128        1
##   WPL.1st.Qu. WPL.Median WPL.Mean WPL.3rd.Qu. WPL.Max.
## 1           9         28    41.75          60     6726
## 2          19         32    34.41          46     1796
## 3           7         12    12.75          18       47
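The right skew can also be checked directly from the summary output: in a right-skewed distribution the mean is pulled above the median. The sketch below copies the reported medians, means and maxima into a small data frame. The column names MeanOverMedian and RightSkewed are introduced here for illustration only.

```r
# WPL summaries copied from the statistics table in this section
wpl <- data.frame(
  File   = c("blogs", "news", "twitter"),
  Median = c(28, 32, 12),
  Mean   = c(41.75, 34.41, 12.75),
  Max    = c(6726, 1796, 47)
)

# A mean above the median is a simple indicator of a longer right tail
wpl$MeanOverMedian <- wpl$Mean / wpl$Median
wpl$RightSkewed    <- wpl$Mean > wpl$Median

print(wpl)
```

The ratio is largest for blogs and closest to 1 for Twitter, in line with the ordering of the maxima.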

3.1. Data frequency - Histograms

The WPL distributions of all three data types are right-skewed. The extreme values in the right tail are probably outliers or sampling errors. Blogs show the most of these infrequent but extremely high values, followed by news. Tweets are closer to a normal frequency distribution.

# Display histogram for each data type
qplot(rawWPL[[1]],geom="histogram",main="Histogram for US Blogs",
      xlab="No. of Words",ylab="Frequency",binwidth=10)

qplot(rawWPL[[2]],geom="histogram",main="Histogram for US News",
      xlab="No. of Words",ylab="Frequency",binwidth=10)

qplot(rawWPL[[3]],geom="histogram",main="Histogram for US Twits",
      xlab="No. of Words",ylab="Frequency",binwidth=1)
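One possible way to make the extreme values noted in this section concrete is Tukey's rule, which flags values above Q3 + 1.5 * IQR. This criterion is an assumption chosen for illustration; the report does not state which outlier rule will be used. The quartiles and maxima are copied from the WPL summary, and drop_high_outliers is a hypothetical helper.

```r
# Quartiles and maxima copied from the report's WPL summary
wpl <- data.frame(
  File = c("blogs", "news", "twitter"),
  Q1   = c(9, 19, 7),
  Q3   = c(60, 46, 18),
  Max  = c(6726, 1796, 47)
)

# Tukey's rule (assumed criterion): flag values above Q3 + 1.5 * IQR
wpl$UpperFence   <- wpl$Q3 + 1.5 * (wpl$Q3 - wpl$Q1)
wpl$MaxIsOutlier <- wpl$Max > wpl$UpperFence
print(wpl)

# Hypothetical helper: keep only per-line word counts at or below the upper fence
drop_high_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  x[x <= q[2] + 1.5 * (q[2] - q[1])]
}
```

Applied to the report's data, drop_high_outliers(rawWPL[[1]]) would return the blog counts without the flagged lines. Under this rule even the Twitter maximum exceeds its fence, so a looser threshold may be more suitable for strongly skewed counts.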

4. Conclusions

The three media channels provided (blogs, news and Twitter) differ markedly in the number of words per line. For blogs and news, however, outliers need to be identified and excluded to avoid spurious effects, which in some cases may be due to sampling errors. A potential next step is to develop a Shiny application that visualizes how the frequency distribution of words per line changes when those outliers are excluded. Building a predictive model requires determining the probability of a word given another word.

To build the predictive model, it is necessary to analyze the frequency distribution of single words, pairs of words and triplets. The result could be a sorted lookup table that, for a given input word, selects the most probable following word.
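The lookup-table idea can be sketched for word pairs as follows. The toy sentences, the whitespace tokenizer and the names bigrams and predict_next are assumptions made for illustration; this is not the model the project will ship. The conditional probability is estimated as count(first, second) / count(first).

```r
# Toy corpus (illustrative only, not taken from the project data)
toy_lines <- c("the cat sat on the mat",
               "the cat ate the fish",
               "the dog sat on the rug")

# Lower-case each line and split it into words on whitespace
tokens <- strsplit(tolower(toy_lines), "\\s+")

# Build all adjacent word pairs within each line
pairs <- do.call(rbind, lapply(tokens, function(w) {
  if (length(w) < 2) return(NULL)
  data.frame(first = head(w, -1), second = tail(w, -1),
             stringsAsFactors = FALSE)
}))

# Count each pair, then estimate P(second | first) = count(first, second) / count(first)
bigrams <- aggregate(n ~ first + second, data = transform(pairs, n = 1), FUN = sum)
bigrams$prob <- bigrams$n / ave(bigrams$n, bigrams$first, FUN = sum)
bigrams <- bigrams[order(bigrams$first, -bigrams$prob), ]

# Return the most probable follower of a word (NA if the word was never seen)
predict_next <- function(word, ngram_table = bigrams) {
  hits <- ngram_table[ngram_table$first == word, ]
  if (nrow(hits) == 0) return(NA_character_)
  as.character(hits$second[which.max(hits$prob)])
}

print(predict_next("the"))
```

The same construction extends to triplets by keying the table on the two preceding words instead of one.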
