This report presents an exploratory analysis of social media data provided by SwiftKey for natural language processing and word prediction. The data can be downloaded from: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
In this report, we will look at basic information about the data, do some data cleaning, visually explore word frequencies, and make a plan for the next stage.
In this step, we downloaded and unzipped the provided files for processing.
if (getwd() != "C:/Users/song/Dropbox/Coursera/capstone/")
setwd("C:/Users/song/Dropbox/Coursera/capstone/")
uri <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
filename <- "Coursera-SwiftKey.zip"
if (!file.exists(filename)) {
download.file(uri, destfile = filename, method = "curl")
unzip(filename)
}
From the unzipped files we can see that non-English languages such as Russian, German and Finnish are also available for analysis, but we are focusing on English for this project.
We are going to look at the file names, file sizes, number of lines, and maximum character count per line.
names <- list.files("../final/en_US/", pattern="*.txt")
blog.size <- round(file.info(paste0("../final/en_US/",names[1]))$size/1024^2)
news.size <- round(file.info(paste0("../final/en_US/",names[2]))$size/1024^2)
twitter.size <- round(file.info(paste0("../final/en_US/",names[3]))$size/1024^2)
con <- file("../final/en_US/en_US.blogs.txt",open='r')
blog <- readLines(con,skipNul=TRUE,encoding = "UTF-8")
close(con)
con <- file("../final/en_US/en_US.news.txt",open='r')
news <- readLines(con,skipNul=TRUE,encoding = "UTF-8")
close(con)
con <- file("../final/en_US/en_US.twitter.txt",open='r')
twit <- readLines(con,skipNul=TRUE,encoding = "UTF-8")
close(con)
#summary of files
blog.summary <- c(blog.size,length(blog),max(nchar(blog)))
news.summary <- c(news.size,length(news),max(nchar(news)))
twitter.summary <- c(twitter.size,length(twit),max(nchar(twit)))
#create data frame
filesum <- rbind(blog.summary,news.summary,twitter.summary)
filesum.df <- as.data.frame(filesum)
colnames(filesum.df) <- c('File Size (MB)','Number of Lines','Max Characters per Line')
suppressMessages(require(gridExtra))
grid.table(filesum.df)
As shown above, the social media data contain over 4 million lines of text. It is not surprising that Twitter has a maximum of 140 characters per line due to its own restriction; Twitter also has more lines than the blogs and news files. All three data sources are around 200 MB in size.
## [1] "tm" "NLP" "gridExtra" "stats" "graphics"
## [6] "grDevices" "utils" "datasets" "methods" "base"
Given the huge size of the corpus, we randomly sampled 10,000 lines from each of the media files. Initial exploration shows there are special characters (such as URLs and hashtags), foreign-language text and profanity in the data, so we do some cleaning before exploring further.
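The sampling step itself is not shown in the code here; a minimal sketch of how it could be done, assuming the 10,000-line samples are written to the ../samp/ directory that is read below (file names and the seed are illustrative):
# Sketch of the assumed sampling step: write 10,000 random lines per source to ../samp/
set.seed(1234) # illustrative seed for reproducibility
if (!dir.exists("../samp/")) dir.create("../samp/")
writeLines(sample(blog, 10000), "../samp/en_US.blogs.samp.txt")
writeLines(sample(news, 10000), "../samp/en_US.news.samp.txt")
writeLines(sample(twit, 10000), "../samp/en_US.twitter.samp.txt")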
path <- DirSource("../samp/", encoding = "UTF-8")
sample <- Corpus(path, readerControl = list(reader = readPlain))
# corpus cleaning
removeURL <- function(x) gsub("((https?|ftp)://)?www\\.[A-Za-z0-9-]+\\.[A-Za-z]{2,}", "", x)
removehash <- function(x) gsub("\\#","",x)
rmchar <- function(x) gsub('[])(,;#%$^*\\~{}[&+=@/"`|<>_]+', "", x)
rmnum <- function(x) gsub('[[:digit:]]+', "", x)
pfile <- file('../final/profanity.txt',"r")
profanity <- readLines(pfile,skipNul=TRUE)
close(pfile)
#clean punctuation, numbers, special characters and filter profanity
sample.clean <- tm_map(sample,content_transformer(removePunctuation))
sample.clean <- tm_map(sample.clean,content_transformer(removeNumbers))
sample.clean <- tm_map(sample.clean,content_transformer(stripWhitespace))
sample.clean <- tm_map(sample.clean,content_transformer(removeWords),profanity)
sample.clean <- tm_map(sample.clean,content_transformer(removeURL))
sample.clean <- tm_map(sample.clean,content_transformer(removehash))
sample.clean <- tm_map(sample.clean,content_transformer(rmchar))
sample.clean <- tm_map(sample.clean,content_transformer(rmnum))
sample.clean <- tm_map(sample.clean,content_transformer(removeWords),stopwords("english"))
sample.clean <- tm_map(sample.clean, content_transformer(iconv), from = "latin1", to = "ASCII", sub = "")
sample.clean <- tm_map(sample.clean,content_transformer(tolower),lazy=T)
After data cleaning, we are ready to review the most frequently used words. Let's first look at the 30 most used words in our cleaned sample after removing sparse terms.
#Make a matrix
tdm <- TermDocumentMatrix(sample.clean)
# remove sparse
tdm <- removeSparseTerms(tdm,.99)
termfreq <- rowSums(as.matrix(tdm))
#make data frame
term.sub <- data.frame(term=names(termfreq),freq=termfreq)
term.df <- term.sub[order(term.sub$freq,decreasing=T),]
term.dfp <- term.df[1:30,]
rm(tdm,termfreq,term.sub,term.df)
suppressPackageStartupMessages(library(ggplot2))
## [1] "ggplot2" "tm" "NLP" "gridExtra" "stats"
## [6] "graphics" "grDevices" "utils" "datasets" "methods"
## [11] "base"
ggplot(term.dfp,aes(x=term,y=freq))+geom_bar(stat="identity")+xlab("Terms")+ylab("Count")+coord_flip()
We see that the word "the" is the most used word in the social media sample. Words like "cant" appear in our top list as a result of removing the apostrophe, so for the next word-prediction stage, stemming may be considered; a possible approach is sketched below.
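If stemming is adopted later, one option (a sketch, assuming the SnowballC package is installed) is tm's stemDocument transformation applied to the cleaned corpus; the object name sample.stem is illustrative:
# Possible stemming step for a later stage (requires SnowballC)
suppressMessages(library(SnowballC))
sample.stem <- tm_map(sample.clean, stemDocument)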
Having seen the most used single words in the sample, let's look at the most frequently used two-word combinations (bigrams) using an N-gram tokenizer; a sketch follows.
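One way to build the bigrams without extra dependencies is NLP::ngrams, since NLP is already attached as a dependency of tm; the sketch below mirrors the unigram workflow above (object names such as tdm2 and bigram.df are illustrative):
# Bigram tokenizer built on NLP::ngrams (loaded with tm)
BigramTokenizer <- function(x) unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
tdm2 <- TermDocumentMatrix(sample.clean, control = list(tokenize = BigramTokenizer))
bifreq <- sort(rowSums(as.matrix(tdm2)), decreasing = TRUE)
bigram.df <- data.frame(term = names(bifreq), freq = bifreq)
# top 30 bigrams
head(bigram.df, 30)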
This report explored the data sources and presented a basic analysis. To construct a prediction model, we plan the following next steps: