This report presents an exploratory analysis of social media data provided by SwiftKey for natural language processing and word prediction. The data can be downloaded from: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
In this report, we will look at basic information about the data, do some data cleaning, visually explore word frequencies, and make a plan for the next stage.
In this step, we downloaded and unzipped the provided files for processing.
if (getwd() != "C:/Users/song/Dropbox/Coursera/capstone/")
setwd("C:/Users/song/Dropbox/Coursera/capstone/")
uri <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
filename <- "Coursera-SwiftKey.zip"
if (!file.exists(filename)) {
download.file(uri, destfile = filename, method = "curl")
unzip(filename)
}
From the unzipped files we can see that non-English languages such as Russian, German and Finnish are also available for analysis, but we are focusing on English for this project.
We are going to look at the file names, file sizes, number of lines, and maximum character count per line.
names <- list.files("../final/en_US/", pattern="*.txt")
blog.size <- round(file.info(paste0("../final/en_US/",names[1]))$size/1024^2)
news.size <- round(file.info(paste0("../final/en_US/",names[2]))$size/1024^2)
twitter.size <- round(file.info(paste0("../final/en_US/",names[3]))$size/1024^2)
con <- file("../final/en_US/en_US.blogs.txt",open='r')
blog <- readLines(con,skipNul=TRUE,encoding = "UTF-8")
close(con)
con <- file("../final/en_US/en_US.news.txt",open='r')
news <- readLines(con,skipNul=TRUE,encoding = "UTF-8")
close(con)
con <- file("../final/en_US/en_US.twitter.txt",open='r')
twit <- readLines(con,skipNul=TRUE,encoding = "UTF-8")
close(con)
#summary of files
blog.summary <- c(blog.size,length(blog),max(nchar(blog)))
news.summary <- c(news.size,length(news),max(nchar(news)))
twitter.summary <- c(twitter.size,length(twit),max(nchar(twit)))
#create data frame
filesum <- rbind(blog.summary,news.summary,twitter.summary)
filesum.df <- as.data.frame(filesum)
colnames(filesum.df) <- c('File Size (MB)','Number of Lines','Max Characters per Line')
suppressMessages(require(gridExtra))
grid.table(filesum.df)
As shown above, the social media data contain over 4 million lines of text. It is not surprising that Twitter has a maximum of 140 characters per line due to its own restriction; Twitter also has more lines than the blogs and news files. All three data sources are around 200 MB in size.
## [1] "tm" "NLP" "gridExtra" "stats" "graphics"
## [6] "grDevices" "utils" "datasets" "methods" "base"
Given the huge size of the corpus, we randomly sampled 10,000 lines from each of the media files. Initial exploration shows there are special characters (such as URLs and hashtags), foreign-language text and profanity in the data, so we do some cleaning before exploring further.
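The sampling step itself is not shown in the code here; a minimal sketch of how it could be done, assuming the 10,000-line samples are written to the ../samp/ directory that is read below (file names and the seed are illustrative):
# Sketch of the assumed sampling step: write 10,000 random lines per source to ../samp/
set.seed(1234) # illustrative seed for reproducibility
if (!dir.exists("../samp/")) dir.create("../samp/")
writeLines(sample(blog, 10000), "../samp/en_US.blogs.samp.txt")
writeLines(sample(news, 10000), "../samp/en_US.news.samp.txt")
writeLines(sample(twit, 10000), "../samp/en_US.twitter.samp.txt")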
path <- DirSource("../samp/", encoding = "UTF-8")
sample <- Corpus(path, readerControl = list(reader = readPlain))
# corpus cleaning
removeURL <- function(x) gsub("((https?|ftp)://)?www\\.[A-Za-z0-9-]+\\.[A-Za-z]{2,}", "", x)
removehash <- function(x) gsub("\\#","",x)
rmchar <- function(x) gsub('[])(,;#%$^*\\~{}[&+=@/"`|<>_]+', "", x)
rmnum <- function(x) gsub('[[:digit:]]+', "", x)
pfile <- file('../final/profanity.txt',"r")
profanity <- readLines(pfile,skipNul=TRUE)
close(pfile)
#clean punctuation, numbers, special characters and filter profanity
sample.clean <- tm_map(sample,content_transformer(removePunctuation))
sample.clean <- tm_map(sample.clean,content_transformer(removeNumbers))
sample.clean <- tm_map(sample.clean,content_transformer(stripWhitespace))
sample.clean <- tm_map(sample.clean,content_transformer(removeWords),profanity)
sample.clean <- tm_map(sample.clean,content_transformer(removeURL))
sample.clean <- tm_map(sample.clean,content_transformer(removehash))
sample.clean <- tm_map(sample.clean,content_transformer(rmchar))
sample.clean <- tm_map(sample.clean,content_transformer(rmnum))
sample.clean <- tm_map(sample.clean,content_transformer(removeWords),stopwords("english"))
sample.clean <- tm_map(sample.clean, content_transformer(iconv), from = "latin1", to = "ASCII", sub = "")
sample.clean <- tm_map(sample.clean,content_transformer(tolower),lazy=T)
After data cleaning, we are ready to review the most frequently used words. Let's first look at the 30 most used words in our cleaned sample after removing sparse terms.
#Make a matrix
tdm <- TermDocumentMatrix(sample.clean)
# remove sparse
tdm <- removeSparseTerms(tdm,.99)
termfreq <- rowSums(as.matrix(tdm))
#make data frame
term.sub <- data.frame(term=names(termfreq),freq=termfreq)
term.df <- term.sub[order(term.sub$freq,decreasing=T),]
term.dfp <- term.df[1:30,]
rm(tdm,termfreq,term.sub,term.df)
suppressPackageStartupMessages(library(ggplot2))
## [1] "ggplot2" "tm" "NLP" "gridExtra" "stats"
## [6] "graphics" "grDevices" "utils" "datasets" "methods"
## [11] "base"
ggplot(term.dfp,aes(x=term,y=freq))+geom_bar(stat="identity")+xlab("Terms")+ylab("Count")+coord_flip()
We see that the word "the" is the most used word in the social media sample. Words like "cant" appear in our top list as a result of removing the apostrophe, so for the next word-prediction stage, stemming may be considered; a possible approach is sketched below.
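If stemming is adopted later, one option (a sketch, assuming the SnowballC package is installed) is tm's stemDocument transformation applied to the cleaned corpus; the object name sample.stem is illustrative:
# Possible stemming step for a later stage (requires SnowballC)
suppressMessages(library(SnowballC))
sample.stem <- tm_map(sample.clean, stemDocument)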
Having seen the most used single words in the sample, let's look at the most frequently used two-word combinations (bigrams) using an N-gram tokenizer; a sketch follows.
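One way to build the bigrams without extra dependencies is NLP::ngrams, since NLP is already attached as a dependency of tm; the sketch below mirrors the unigram workflow above (object names such as tdm2 and bigram.df are illustrative):
# Bigram tokenizer built on NLP::ngrams (loaded with tm)
BigramTokenizer <- function(x) unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
tdm2 <- TermDocumentMatrix(sample.clean, control = list(tokenize = BigramTokenizer))
bifreq <- sort(rowSums(as.matrix(tdm2)), decreasing = TRUE)
bigram.df <- data.frame(term = names(bifreq), freq = bifreq)
# top 30 bigrams
head(bigram.df, 30)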
This report explored the data sources and presented a basic analysis. To construct a prediction model, we plan the following next steps: