This is a brief summary of the exploratory data analysis (EDA) performed on the SwiftKey dataset. The objective of this EDA is to prepare the data for further statistical analysis (prediction).
The data is provided on the Coursera course site (https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip) and comprises four language directories, each containing text files collected from blogs, news articles, and Twitter. For this report I will use the English-language files (en_US).
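For reproducibility, the archive can be downloaded and unpacked from within R. A minimal sketch (the destination file name is my own choice):

download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
              destfile = "Coursera-SwiftKey.zip", mode = "wb")
unzip("Coursera-SwiftKey.zip")  # should extract into ./final/<locale>/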
The packages used are tm and quanteda, both recommended by Coursera; quanteda will be used later for the text-prediction work.
data <- file.path(".", "final", "en_US")  # path to the English-language files
library(tm)  # tm for this EDA; quanteda comes in later for prediction
## Loading required package: NLP
# Read files into R
blogs <- readLines(file.path(data, "en_US.blogs.txt"), skipNul = TRUE)
news <- readLines(file.path(data, "en_US.news.txt"), skipNul = TRUE)
## Warning in readLines(file.path(data, "en_US.news.txt"), skipNul = TRUE):
## incomplete final line found on './final/en_US/en_US.news.txt'
twitter <- readLines(file.path(data, "en_US.twitter.txt"), skipNul = TRUE)
Basic summary of the data
# Get to know your data
length(blogs); length(news); length(twitter)
## [1] 899288
## [1] 77259
## [1] 2360148
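Line counts alone understate the differences between the sources, since tweets are much shorter than blog posts or news articles. A rough whitespace-based word count per source is a useful complement; a minimal sketch:

# Approximate word counts (whitespace-delimited tokens)
sum(lengths(strsplit(blogs, "\\s+")))
sum(lengths(strsplit(news, "\\s+")))
sum(lengths(strsplit(twitter, "\\s+")))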
The files are large, so for faster processing I will use a subset of the data.
merged <- paste(blogs[1:4000], news[1:4000], twitter[1:4000])  # combine the sources line-wise into 4,000 documents
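Note that taking the first 4,000 lines of each file may bias the subset toward whatever happens to appear first. A random sample of the same size would be more representative; a sketch (merged_rand is a hypothetical alternative to the object above):

# Hypothetical alternative: sample 4,000 random lines from each source
set.seed(111)
merged_rand <- paste(sample(blogs, 4000), sample(news, 4000), sample(twitter, 4000))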
# create a corpus
data_source <- VectorSource(merged)
corpus <- Corpus(data_source)
corpus <- tm_map(corpus, content_transformer(tolower))
## Warning in tm_map.SimpleCorpus(corpus, content_transformer(tolower)):
## transformation drops documents
corpus <- tm_map(corpus, stripWhitespace)
## Warning in tm_map.SimpleCorpus(corpus, stripWhitespace): transformation
## drops documents
corpus <- tm_map(corpus, removeNumbers)
## Warning in tm_map.SimpleCorpus(corpus, removeNumbers): transformation drops
## documents
corpus <- tm_map(corpus, removePunctuation)
## Warning in tm_map.SimpleCorpus(corpus, removePunctuation): transformation
## drops documents
corpus <- tm_map(corpus, removeWords, stopwords())
## Warning in tm_map.SimpleCorpus(corpus, removeWords, stopwords()):
## transformation drops documents
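One caveat on the ordering of these transformations: because removePunctuation runs before removeWords, contractions such as "don't" become "dont" and no longer match entries in the stop list. Removing stopwords while apostrophes are still intact avoids this; a sketch of the alternative ordering:

# Alternative ordering: drop stopwords before stripping punctuation
corpus_alt <- Corpus(VectorSource(merged))
corpus_alt <- tm_map(corpus_alt, content_transformer(tolower))
corpus_alt <- tm_map(corpus_alt, removeWords, stopwords())
corpus_alt <- tm_map(corpus_alt, removePunctuation)
corpus_alt <- tm_map(corpus_alt, removeNumbers)
corpus_alt <- tm_map(corpus_alt, stripWhitespace)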
Create a document-term matrix and compute word frequencies, which will be useful in further analysis of the data.
dtm <- as.matrix(DocumentTermMatrix(corpus))
freq <- sort(colSums(dtm), decreasing = TRUE)
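A note on memory: as.matrix() materialises the full document-term matrix as a dense matrix, which is fine for 4,000 documents but would not scale to the complete dataset. The slam package (a tm dependency) can compute column sums directly on the sparse representation; a minimal sketch:

# Same frequencies without converting to a dense matrix
library(slam)
freq_sparse <- sort(col_sums(DocumentTermMatrix(corpus)), decreasing = TRUE)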
Summary plots: a word cloud and a bar plot of the most frequent words.
library(wordcloud)
## Loading required package: RColorBrewer
set.seed(111)  # make the word-cloud layout reproducible
wordcloud(names(freq), freq, max.words = 100)
## Warning in wordcloud(names(freq), freq, max.words = 100): said could not be
## fit on page. It will not be plotted.
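The "could not be fit on page" warning means the most frequent word was scaled beyond the plotting region. Shrinking the scale argument usually resolves it; a sketch (the exact values are a matter of taste):

wordcloud(names(freq), freq, max.words = 100, scale = c(3, 0.5))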
barplot(freq[1:20], names.arg = names(freq[1:20]), cex.names = 1.0, col = "blue", main = "Word frequency", las = 2)
I will use the quanteda package, alongside tm, when formulating my prediction algorithm. I will further employ several text analysis techniques useful in NLP, including:

1. Parsing
2. Stemming
3. Text segmentation
4. Named entity recognition
5. Sentiment analysis
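As a preview of that work, quanteda makes n-gram counting straightforward. A minimal sketch, assuming quanteda is installed (the choice of bigrams is illustrative):

# Tokenise the subset and count the most frequent bigrams
library(quanteda)
toks <- tokens(merged, remove_punct = TRUE, remove_numbers = TRUE)
toks <- tokens_tolower(toks)
topfeatures(dfm(tokens_ngrams(toks, n = 2)), 10)  # top 10 bigrams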