This report presents an exploratory analysis of the social media data provided by SwiftKey for natural language processing and word prediction. The data can be downloaded from: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
In this report, we are going to look at basic information about the data, do some data cleaning, visually explore word frequencies and make a plan for the next stage.
##Data acquisition
In this step, we downloaded and unzipped the provided files for processing.
options(rpubs.upload.method = "internal")
if (getwd() != "/Users/timsong/Dropbox/Coursera/capstone/")
    setwd("/Users/timsong/Dropbox/Coursera/capstone/")
uri <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
filename <- "Coursera-SwiftKey.zip"
if (!file.exists(filename)) {
download.file(uri, destfile=filename, method="curl")
unzip(filename)
}
From the unzipped files we can see that non-English languages such as German, Russian and Finnish are also available for analysis, but we are focusing on English for this project.
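For reference, the locale folders contained in the unzipped archive can be listed directly (a minimal sketch, assuming the archive was unzipped into the current working directory):
# list the locale sub-folders (e.g. en_US and the non-English locales) in the unzipped archive
list.files("final")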
##Data information
We are going to look at the file names, file sizes, number of lines and the maximum number of characters per line.
names <- list.files("../final/en_US/", pattern="*.txt")
blog.size <- round(file.info(paste0("../final/en_US/",names[1]))$size/1024^2)
news.size <- round(file.info(paste0("../final/en_US/",names[2]))$size/1024^2)
twitter.size <- round(file.info(paste0("../final/en_US/",names[3]))$size/1024^2)
con <- file("../final/en_US/en_US.blogs.txt",open='r')
blog <- readLines(con,skipNul=TRUE,encoding = "UTF-8")
close(con)
con <- file("../final/en_US/en_US.news.txt",open='r')
news <- readLines(con,skipNul=TRUE,encoding = "UTF-8")
close(con)
con <- file("../final/en_US/en_US.twitter.txt",open='r')
twit <- readLines(con,skipNul=TRUE,encoding = "UTF-8")
close(con)
#summary of files
blog.summary <- c(blog.size,length(blog),max(nchar(blog)))
news.summary <- c(news.size,length(news),max(nchar(news)))
twitter.summary <- c(twitter.size,length(twit),max(nchar(twit)))
#create data frame
filesum <- rbind(blog.summary,news.summary,twitter.summary)
filesum.df <- as.data.frame(filesum)
colnames(filesum.df) <- c('File size (MB)','Number of Lines','Maximum Characters')
#suppressPackageStartupMessages(suppressWarnings(suppressMessages((library(gridExtra)))))
#grid.table(filesum.df)
print(filesum.df)
                File size (MB) Number of Lines Maximum Characters
blog.summary               200          899288              40833
news.summary               196         1010243              11384
twitter.summary            159         2360148                140
As shown above, the combined data has over 4 million lines of text. It is not surprising that Twitter has a maximum of 140 characters per line due to its own restriction, and Twitter has more lines than blogs and news. All three data sources are roughly 160-200 MB in size.
##Data cleaning
## [1] "wordcloud" "RColorBrewer" "ggplot2" "tm"
## [5] "NLP" "gridExtra" "knitr" "stats"
## [9] "graphics" "grDevices" "utils" "datasets"
## [13] "methods" "base"
Given the huge size of the corpus, we randomly sampled 10,000 lines from each of the media files. Initial exploration shows there are special characters, such as URLs and hashtags, as well as foreign-language text and profanity in the data, so we do some cleaning before exploring further.
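The sampling step itself is not shown above; a minimal sketch of how the 10,000-line samples could be drawn from the full files and written into the ../samp/ folder read below (the sample file names are illustrative):
# draw reproducible 10,000-line random samples from each source and save them as plain text
set.seed(1234)
samp.size <- 10000
dir.create("../samp", showWarnings = FALSE)
writeLines(sample(blog, samp.size), "../samp/blog_sample.txt")
writeLines(sample(news, samp.size), "../samp/news_sample.txt")
writeLines(sample(twit, samp.size), "../samp/twitter_sample.txt")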
path <- DirSource("../samp/", encoding = "UTF-8")
sample <- Corpus(path, readerControl = list(reader = readPlain))
# corpus cleaning
removeURL <- function(x) gsub("(https?|ftp)://\\S+|www\\.[A-Za-z0-9.-]+\\.[A-Za-z]{2,}", "", x)
removehash <- function(x) gsub("\\#","",x)
rmchar <- function(x) gsub('[])(,;#%$^*\\~{}[&+=@/"`|<>_]+', "", x)
rmnum <- function(x) gsub('[[:digit:]]+', "", x)
pfile <- file('../final/profanity.txt',"r")
profanity <- readLines(pfile,skipNul=TRUE)
close(pfile)
#clean punctuation, numbers, special characters and filter profanity
sample.clean <- tm_map(sample,content_transformer(removePunctuation))
sample.clean <- tm_map(sample.clean,content_transformer(removeNumbers))
sample.clean <- tm_map(sample.clean,content_transformer(stripWhitespace))
sample.clean <- tm_map(sample.clean,content_transformer(removeWords),profanity)
sample.clean <- tm_map(sample.clean,content_transformer(removeURL))
sample.clean <- tm_map(sample.clean,content_transformer(removehash))
sample.clean <- tm_map(sample.clean,content_transformer(rmchar))
sample.clean <- tm_map(sample.clean,content_transformer(rmnum))
sample.clean <- tm_map(sample.clean,content_transformer(removeWords),stopwords('english'))
sample.clean <- tm_map(sample.clean, content_transformer(iconv), from = "latin1", to = "ASCII", sub = "")
sample.clean <- tm_map(sample.clean,content_transformer(tolower),lazy=T)
##Most used words
###Most used single words
After data cleaning, we are ready to review the most frequently used words. Let's first look at the 30 most used words in our cleaned sample after removing the sparse terms.
#Make a matrix
tdm <- TermDocumentMatrix(sample.clean)
# remove sparse
tdm <- removeSparseTerms(tdm,.99)
termfreq <- rowSums(as.matrix(tdm))
#make data frame
term.sub <- data.frame(term=names(termfreq),freq=termfreq)
term.df <- term.sub[order(term.sub$freq,decreasing=T),]
term.dfp <- term.df[1:30,]
#rm(tdm,termfreq,term.sub,term.df)
suppressPackageStartupMessages(library(ggplot2, quietly = TRUE))
## [1] "wordcloud" "RColorBrewer" "ggplot2" "tm"
## [5] "NLP" "gridExtra" "knitr" "stats"
## [9] "graphics" "grDevices" "utils" "datasets"
## [13] "methods" "base"
ggplot(term.dfp,aes(x=term,y=freq))+geom_bar(stat="identity")+xlab("Terms")+ylab("Count")+coord_flip()
We see that the word 'the' is the most used word in the social media sample. Words like 'cant' appear in our top list because the apostrophe was stripped along with other punctuation, so in the next word prediction stage, stemming and more careful punctuation handling may be considered.
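As a sketch of what stemming could look like, the Porter stemmer can be applied to the cleaned corpus through tm's stemDocument transformation (assumes the SnowballC package is installed; not evaluated here):
# stem the cleaned corpus, e.g. "running" and "runs" both become "run"
library(SnowballC)
sample.stem <- tm_map(sample.clean, stemDocument)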
###Most used word combinations
Similarly, we can look at the most used two-word and three-word combinations. Now let's visually look at a word cloud of the 30 most used three-word combinations.
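One way to build the trigram counts and draw such a word cloud is the n-gram tokenizer pattern from the NLP/tm documentation; the sketch below is illustrative and the object names (TrigramTokenizer, tdm3, freq3) are our own:
# trigram tokenizer based on NLP::ngrams (NLP is loaded with tm)
TrigramTokenizer <- function(x)
    unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE)
# term-document matrix of trigrams and their total frequencies across the sample
tdm3 <- TermDocumentMatrix(sample.clean, control = list(tokenize = TrigramTokenizer))
freq3 <- sort(rowSums(as.matrix(tdm3)), decreasing = TRUE)
# word cloud of the 30 most frequent trigrams
wordcloud(names(freq3)[1:30], freq3[1:30], scale = c(3, 0.5),
          colors = brewer.pal(8, "Dark2"), random.order = FALSE)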
## [1] "wordcloud" "RColorBrewer" "ggplot2" "tm"
## [5] "NLP" "gridExtra" "knitr" "stats"
## [9] "graphics" "grDevices" "utils" "datasets"
## [13] "methods" "base"
##Next steps
This report explored the data sources and presented a basic analysis. To construct a prediction model, we plan the following next steps: