Aimie Faucett
13 June, 2016
Coursera Data Science Specialization Capstone Project Week 2 Report
The English data sets include corpora from blog, Twitter, and news sources. The data are loaded using the readLines command.
en.blogs <- readLines('en_US/en_US.blogs.txt')
en.news <- readLines('en_US/en_US.news.txt')
en.tweets <- readLines('en_US/en_US.twitter.txt')
Metadata for each data set are as follows:
Data Set   Size (MB)   Line Count   Word Count
Blogs      260.56432      899288     37334131
News        20.11139       77259      2643969
Twitter    316.03734    2360148      30373545
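For reference, the table above can be reproduced along the following lines. This is a minimal sketch, assuming the sizes are on-disk file sizes and that words are whitespace-delimited tokens; the original method may differ.
files <- paste0('en_US/en_US.', c('blogs', 'news', 'twitter'), '.txt')
sets  <- list(en.blogs, en.news, en.tweets)
# file size in MB, number of lines, and a whitespace-based word count per corpus
data.frame(DataSet    = c('Blogs', 'News', 'Twitter'),
           Size_MB    = file.size(files) / 1024^2,
           LinesCount = sapply(sets, length),
           WordCount  = sapply(sets, function(x) sum(lengths(strsplit(x, '\\s+')))))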
Since the data sets are quite large, the first 1000 lines of the Blogs data set are used for data exploration. Before running any analyses, each line is cleaned. Cleaning involves:
- converting all characters to lowercase
- removing punctuation (periods, question marks, exclamation points, and parentheses)
- keeping only letters and spaces
- splitting each line into individual words
- spell-correcting each word with a correct() helper function
Once the data have been cleaned, the next step is to perform the exploratory analysis.
One technique for predicting the next item in a sequence of words in natural language processing is to use an n-gram model to find probabilities of sequences of words. A basic loop was used to find the unigram (n = 1), bigram (n = 2), and trigram (n = 3) counts for the first 1000 lines of the Blogs data set. The code to do this is shown below:
# gram.1.all, gram.2.all, and gram.3.all must be initialized (e.g., to NULL)
# before the loop; correct() is a spell-correction helper defined elsewhere.
for (i in 1:nLines) { # where nLines = 1000
  # lowercase the line, drop punctuation, and keep only letters and spaces
  clean.charVec <- tolower(strsplit(en.blogs[i], '')[[1]])
  clean.charVec <- clean.charVec[grepl('[a-z ]', gsub('[.?!()]', '', clean.charVec))]
  # reassemble the characters, split into words, and spell-correct each word
  clean.chunk <- strsplit(paste(clean.charVec, collapse=''), ' ')[[1]]
  clean.chunk <- sapply(clean.chunk, correct)
  if (length(clean.chunk) < 3) next # skip lines too short for a trigram
  # tabulate the unigrams, bigrams, and trigrams occurring in this line
  gram.1 <- aggregate(Count ~ Combo, data = data.frame(Combo = clean.chunk, Count = 1), FUN = sum)
  gram.2 <- aggregate(Count ~ Combo, data = data.frame(Combo = sapply(2:length(clean.chunk), function(x) paste(clean.chunk[(x-1):x], collapse = ' ')), Count = 1), FUN = sum)
  gram.3 <- aggregate(Count ~ Combo, data = data.frame(Combo = sapply(3:length(clean.chunk), function(x) paste(clean.chunk[(x-2):x], collapse = ' ')), Count = 1), FUN = sum)
  # append this line's counts to the running totals
  gram.1.all <- rbind(gram.1.all, gram.1); gram.2.all <- rbind(gram.2.all, gram.2); gram.3.all <- rbind(gram.3.all, gram.3)
}
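Because rbind appends the per-line counts, the same combination can appear many times in the accumulated data frames. A final aggregation after the loop (a small step not shown above) collapses these into overall frequencies:
# collapse duplicate Combo rows accumulated across lines into total counts
gram.1.all <- aggregate(Count ~ Combo, data = gram.1.all, FUN = sum)
gram.2.all <- aggregate(Count ~ Combo, data = gram.2.all, FUN = sum)
gram.3.all <- aggregate(Count ~ Combo, data = gram.3.all, FUN = sum)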
At this time, the probabilities of each word under the unigram, bigram, and trigram models have not been computed. Future work may involve determining these probabilities; however, the frequencies of each combination have been found and are displayed on the next two slides (note that the y-axis scale is logarithmic).
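As one illustration of that future work, bigram probabilities could be derived as a maximum-likelihood estimate, where each bigram's probability is its count divided by the count of its leading unigram. The sketch below assumes the aggregated counts from the loop above; the Prob column name is an assumption.
# P(w2 | w1) = count(w1 w2) / count(w1)
first.word <- sapply(strsplit(as.character(gram.2.all$Combo), ' '), `[`, 1)
gram.2.all$Prob <- gram.2.all$Count /
  gram.1.all$Count[match(first.word, gram.1.all$Combo)]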
Observations of interest are outlined in bullet points.
Based on the observations on the prior slide, further analysis is done to see what the top 100 combinations (by frequency) are under each of the n-gram models. Results are summarized in Pareto-style charts made using the ggplot2 package. The plot axes can be difficult to read, so the charts are spread over three slides to facilitate reading.
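As a rough sketch, one such chart can be produced as follows, assuming gram.1.all holds the aggregated unigram counts; the exact styling here is illustrative, not necessarily the styling used on the slides.
library(ggplot2)
# bars for the 100 most frequent unigrams, sorted in descending order
top.100 <- head(gram.1.all[order(-gram.1.all$Count), ], 100)
ggplot(top.100, aes(x = reorder(Combo, -Count), y = Count)) +
  geom_bar(stat = 'identity') +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))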
Unigram Results:
Bigram Results:
Trigram Results:
Based on preliminary reading on natural language processing, the analysis of the corpora will need to be enhanced beyond these simple frequency counts.
The Shiny app will be developed once the predictive algorithm is complete. The app should have, at minimum, the following features:
- a text box where the user can enter a word or phrase
- a display of the predicted next word
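For illustration only, a minimal sketch of such an app is shown below; predict.next.word is a hypothetical placeholder for the eventual prediction function, not part of the current code.
library(shiny)
# minimal sketch: one text input for the phrase, one output for the prediction
ui <- fluidPage(
  textInput('phrase', 'Enter a phrase:'),
  textOutput('prediction')
)
server <- function(input, output) {
  # predict.next.word() is a hypothetical wrapper around the n-gram model
  output$prediction <- renderText(predict.next.word(input$phrase))
}
shinyApp(ui = ui, server = server)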