This report presents an exploratory analysis of the social media data provided by SwiftKey for natural language processing and word prediction. The data can be downloaded from: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
In this report, we are going to look at basic information about the data, do some data cleaning, visually explore word frequencies and make a plan for the next stage.
##Data acquisition
In this step, we downloaded and unzipped the provided files for processing.
options(rpubs.upload.method = "internal")
if (getwd() != "/Users/timsong/Dropbox/Coursera/capstone/")
    setwd("/Users/timsong/Dropbox/Coursera/capstone/")
uri <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
filename <- "Coursera-SwiftKey.zip"
if (!file.exists(filename)) {
download.file(uri, destfile=filename, method="curl")
unzip(filename)
}
From the unzipped files we can see that non-English languages such as German, Russian and Finnish are also available for analysis, but we are focusing on English for this project.
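For reference, the locale folders contained in the unzipped archive can be listed directly (a minimal sketch, assuming the archive was unzipped into the current working directory):
# list the locale sub-folders (e.g. en_US and the non-English locales) in the unzipped archive
list.files("final")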
##Data information
We are going to look at the file names, file sizes, number of lines and the maximum number of characters per line.
names <- list.files("../final/en_US/", pattern="*.txt")
blog.size <- round(file.info(paste0("../final/en_US/",names[1]))$size/1024^2)
news.size <- round(file.info(paste0("../final/en_US/",names[2]))$size/1024^2)
twitter.size <- round(file.info(paste0("../final/en_US/",names[3]))$size/1024^2)
con <- file("../final/en_US/en_US.blogs.txt",open='r')
blog <- readLines(con,skipNul=TRUE,encoding = "UTF-8")
close(con)
con <- file("../final/en_US/en_US.news.txt",open='r')
news <- readLines(con,skipNul=TRUE,encoding = "UTF-8")
close(con)
con <- file("../final/en_US/en_US.twitter.txt",open='r')
twit <- readLines(con,skipNul=TRUE,encoding = "UTF-8")
close(con)
#summary of files
blog.summary <- c(blog.size,length(blog),max(nchar(blog)))
news.summary <- c(news.size,length(news),max(nchar(news)))
twitter.summary <- c(twitter.size,length(twit),max(nchar(twit)))
#create data frame
filesum <- rbind(blog.summary,news.summary,twitter.summary)
filesum.df <- as.data.frame(filesum)
colnames(filesum.df) <- c('File size (MB)','Number of Lines','Maximum Characters')
#suppressPackageStartupMessages(suppressWarnings(suppressMessages((library(gridExtra)))))
#grid.table(filesum.df)
print(filesum.df)
                File size (MB) Number of Lines Maximum Characters
blog.summary               200          899288              40833
news.summary               196         1010243              11384
twitter.summary            159         2360148                140
As shown above, the combined data has over 4 million lines of text. It is not surprising that Twitter has a maximum of 140 characters per line due to its own restriction, and Twitter has more lines than blogs and news. All three data sources are roughly 160-200 MB in size.
##Data cleaning
## [1] "wordcloud" "RColorBrewer" "ggplot2" "tm"
## [5] "NLP" "gridExtra" "knitr" "stats"
## [9] "graphics" "grDevices" "utils" "datasets"
## [13] "methods" "base"
Given the huge size of the corpus, we randomly sampled 10,000 lines from each of the media files. Initial exploration shows there are special characters, such as URLs and hashtags, as well as foreign-language text and profanity in the data, so we do some cleaning before exploring further.
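The sampling step itself is not shown above; a minimal sketch of how the 10,000-line samples could be drawn from the full files and written into the ../samp/ folder read below (the sample file names are illustrative):
# draw reproducible 10,000-line random samples from each source and save them as plain text
set.seed(1234)
samp.size <- 10000
dir.create("../samp", showWarnings = FALSE)
writeLines(sample(blog, samp.size), "../samp/blog_sample.txt")
writeLines(sample(news, samp.size), "../samp/news_sample.txt")
writeLines(sample(twit, samp.size), "../samp/twitter_sample.txt")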
path <- DirSource("../samp/", encoding = "UTF-8")
sample <- Corpus(path, readerControl = list(reader = readPlain))
# corpus cleaning
removeURL <- function(x) gsub("(https?|ftp)://\\S+|www\\.[A-Za-z0-9.-]+\\.[A-Za-z]{2,}", "", x)
removehash <- function(x) gsub("\\#","",x)
rmchar <- function(x) gsub('[])(,;#%$^*\\~{}[&+=@/"`|<>_]+', "", x)
rmnum <- function(x) gsub('[[:digit:]]+', "", x)
pfile <- file('../final/profanity.txt',"r")
profanity <- readLines(pfile,skipNul=TRUE)
close(pfile)
#clean punctuation, numbers, special characters and filter profanity
sample.clean <- tm_map(sample,content_transformer(removePunctuation))
sample.clean <- tm_map(sample.clean,content_transformer(removeNumbers))
sample.clean <- tm_map(sample.clean,content_transformer(stripWhitespace))
sample.clean <- tm_map(sample.clean,content_transformer(removeWords),profanity)
sample.clean <- tm_map(sample.clean,content_transformer(removeURL))
sample.clean <- tm_map(sample.clean,content_transformer(removehash))
sample.clean <- tm_map(sample.clean,content_transformer(rmchar))
sample.clean <- tm_map(sample.clean,content_transformer(rmnum))
sample.clean <- tm_map(sample.clean,content_transformer(removeWords),stopwords('english'))
sample.clean <- tm_map(sample.clean, content_transformer(iconv), from = "latin1", to = "ASCII", sub = "")
sample.clean <- tm_map(sample.clean,content_transformer(tolower),lazy=T)
##Most used words
###Most used single words
After data cleaning, we are ready to review the most frequently used words. Let's first look at the 30 most used words in our cleaned sample after removing the sparse terms.
#Make a matrix
tdm <- TermDocumentMatrix(sample.clean)
# remove sparse
tdm <- removeSparseTerms(tdm,.99)
termfreq <- rowSums(as.matrix(tdm))
#make data frame
term.sub <- data.frame(term=names(termfreq),freq=termfreq)
term.df <- term.sub[order(term.sub$freq,decreasing=T),]
term.dfp <- term.df[1:30,]
#rm(tdm,termfreq,term.sub,term.df)
suppressPackageStartupMessages(library(ggplot2, quietly = TRUE))
## [1] "wordcloud" "RColorBrewer" "ggplot2" "tm"
## [5] "NLP" "gridExtra" "knitr" "stats"
## [9] "graphics" "grDevices" "utils" "datasets"
## [13] "methods" "base"
ggplot(term.dfp,aes(x=term,y=freq))+geom_bar(stat="identity")+xlab("Terms")+ylab("Count")+coord_flip()
We see that the word 'the' is the most used word in the social media sample. Words like 'cant' appear in our top list because the apostrophe was stripped along with other punctuation, so in the next word prediction stage, stemming and more careful punctuation handling may be considered.
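As a sketch of what stemming could look like, the Porter stemmer can be applied to the cleaned corpus through tm's stemDocument transformation (assumes the SnowballC package is installed; not evaluated here):
# stem the cleaned corpus, e.g. "running" and "runs" both become "run"
library(SnowballC)
sample.stem <- tm_map(sample.clean, stemDocument)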
###Most used word combinations
Similarly, we can look at the most used two-word and three-word combinations. Now let's visually look at a word cloud of the 30 most used three-word combinations.
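One way to build the trigram counts and draw such a word cloud is the n-gram tokenizer pattern from the NLP/tm documentation; the sketch below is illustrative and the object names (TrigramTokenizer, tdm3, freq3) are our own:
# trigram tokenizer based on NLP::ngrams (NLP is loaded with tm)
TrigramTokenizer <- function(x)
    unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE)
# term-document matrix of trigrams and their total frequencies across the sample
tdm3 <- TermDocumentMatrix(sample.clean, control = list(tokenize = TrigramTokenizer))
freq3 <- sort(rowSums(as.matrix(tdm3)), decreasing = TRUE)
# word cloud of the 30 most frequent trigrams
wordcloud(names(freq3)[1:30], freq3[1:30], scale = c(3, 0.5),
          colors = brewer.pal(8, "Dark2"), random.order = FALSE)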
## [1] "wordcloud" "RColorBrewer" "ggplot2" "tm"
## [5] "NLP" "gridExtra" "knitr" "stats"
## [9] "graphics" "grDevices" "utils" "datasets"
## [13] "methods" "base"
##Next steps
This report explored the data sources and presented a basic analysis. To construct a prediction model, we plan the following next steps: