Natural language processing (NLP)
This data science capstone is run in cooperation with SwiftKey, which builds a smart keyboard that predicts words so that people can type more easily on their mobile devices.
In this capstone we will work on understanding and building predictive text models like those used by SwiftKey.
The data set, the Capstone Dataset, is downloaded from the Coursera site. It comes from a corpus called HC Corpora.
In order to predict words through text mining, I will acquire and clean the data sets, and then do exploratory analysis.
I will briefly summarize the major features of the data I have identified.
I will make a plan for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data scientist manager.
1. Basic summaries of the three data sets, including word counts, line counts and basic data tables.
2. Basic plots, such as histograms, to illustrate features of the data.
After reading the Wiki page, the Coursera course on NLP, and some documents about R and text mining, and going through all the amazing discussions in the forum, I plan to use tri-grams, which give both good accuracy and good query speed.
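As a rough illustration of why tri-grams make for fast queries, the sketch below looks up the most frequent third word for a given pair of words in a table of tri-gram counts. It is only a sketch under stated assumptions, not the final prediction algorithm: the function name `predict_next`, the "count w1 w2 w3" file layout, and the sample counts are all made up for illustration.

```shell
#!/bin/sh
# predict_next FILE W1 W2
# FILE is assumed to hold lines of the form "count w1 w2 w3"
# (a hypothetical layout, similar to what `uniq -c` prints).
# Prints the w3 with the highest count that follows "w1 w2",
# or nothing if the pair never occurs.
predict_next() {
  awk -v a="$2" -v b="$3" '
    $2 == a && $3 == b && $1 + 0 > best { best = $1 + 0; word = $4 }
    END { if (word != "") print word }
  ' "$1"
}

# Toy tri-gram counts, invented for this example only
tmp=$(mktemp)
printf '%s\n' '12 thanks for the' '7 thanks for your' '3 one of the' > "$tmp"
predict_next "$tmp" thanks for
rm -f "$tmp"
```

A single pass over the table is enough for one query; a real application would more likely keep the table in an indexed structure so that each lookup does not rescan the file.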
My plan is as follows:
I chose the en_US data sets for training the prediction model. The data come from three sources: blogs, news, and Twitter.
The size of each data set is as follows:
# Linux command-line console (bash code)
echo 'lines words characters size filename' ;\
wc * |awk '{printf"%s\t%s\t%s\t%sM\t%s\t\n", $1, $2, $3, $3/1024/1024, $4}'
#result
lines words characters size filename
899288 37334131 210160014 200.424M en_US.blogs.txt
1010242 34372530 205811889 196.278M en_US.news.txt
2360148 30373583 167105338 159.364M en_US.twitter.txt
4269678 102080244 583077241 556.066M total
A quick look at the data from the Linux command line shows 4,269,678 lines, 102,080,244 words, 583,077,241 characters, and 556M of data in total. The blogs file contains 899,288 lines (200M), the news file 1,010,242 lines (196M), and the Twitter file 2,360,148 lines (159M).
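Then I clean the text (replace non-letter characters with spaces, squeeze repeated spaces, and convert to lower case) and filter out profanity words.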
# Linux command-line console (bash code)
for i in en_US.blogs.txt en_US.news.txt en_US.twitter.txt; do
  ## replace every character that is not a letter, space, or newline with a space
  tr -c " a-zA-Z\n" " " < "$i" > clean1
  ## squeeze runs of spaces into a single space
  tr -s " " < clean1 > clean2
  ## convert to lower case
  tr 'A-Z' 'a-z' < clean2 > clean3
  ## join bad-words.txt into one regex alternation (word1|word2|...), dropping spaces inside entries
  profanityWord=$(paste -d'|' -s bad-words.txt | sed 's/ //g')
  ## remove every match of the profanity pattern (this also matches inside longer words)
  awk -v pat="$profanityWord" '{gsub(pat, "")} 1' clean3 > "clean-$i"
done
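The substitution above removes a match wherever it occurs, including inside a longer, harmless word. As a possible alternative, the sketch below drops only whole words that appear in the list. The function name `filter_words` and the sample word list are hypothetical, and, unlike the pattern above, this version does not handle list entries that contain spaces.

```shell
#!/bin/sh
# filter_words BADFILE
# Reads cleaned, space-separated text on stdin and prints it with every
# whole word listed in BADFILE (one word per line) removed.
filter_words() {
  awk -v badfile="$1" '
    # load the word list into a lookup table
    BEGIN { while ((getline w < badfile) > 0) bad[w] = 1 }
    {
      out = ""
      # keep each field unless it is exactly one of the listed words
      for (k = 1; k <= NF; k++)
        if (!($k in bad)) out = (out == "" ? $k : out " " $k)
      print out
    }
  '
}

# Toy word list and input, invented for this example only
bad=$(mktemp)
printf '%s\n' darn heck > "$bad"
printf 'well darn it\ndarned heck\n' | filter_words "$bad"
rm -f "$bad"
```

Because the comparison is on whole fields, "darned" survives while "darn" is removed.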
# Linux command-line console (bash code)
for i in en_US.blogs.txt en_US.news.txt en_US.twitter.txt; do
  ## create word list (one word per line)
  tr -sc 'A-Za-z' '\n' < "clean-$i" > "oneword-$i"
  ## 1-gram (one word) frequency
  sort "oneword-$i" | uniq -c | sort -n -r | head -n 100 > "freq-one-$i"
  ## 2-gram (two words) frequency: pair each word with the word that follows it
  sed '1d' "oneword-$i" > "rm1word-$i"
  paste "oneword-$i" "rm1word-$i" > "twoword-$i"
  sort "twoword-$i" | uniq -c | sort -n -r | head -n 100 > "freq-two-$i"
  ## 3-gram (three words) frequency
  sed '1d' "rm1word-$i" > "rm2word-$i"
  paste "twoword-$i" "rm2word-$i" > "threeword-$i"
  sort "threeword-$i" | uniq -c | sort -n -r | head -n 100 > "freq-three-$i"
  ## 4-gram (four words) frequency
  sed '1d' "rm2word-$i" > "rm3word-$i"
  paste "threeword-$i" "rm3word-$i" > "fourword-$i"
  sort "fourword-$i" | uniq -c | sort -n -r | head -n 100 > "freq-four-$i"
done
I use the 100 most frequent n-grams of each order to draw the histograms and word clouds.
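The shift-and-paste construction used above can be checked on a toy sentence. The sketch below counts 2-grams the same way: it writes one word per line, drops the first line to get a shifted copy, and pastes the two lists side by side. The function name `count_bigrams`, the temporary files, and the sample sentence are illustrative only.

```shell
#!/bin/sh
# count_bigrams: reads text on stdin and prints "count word1 word2" lines,
# most frequent first.
count_bigrams() {
  words=$(mktemp)
  shifted=$(mktemp)
  # one lower-case word per line, with empty lines removed
  tr 'A-Z' 'a-z' | tr -sc 'a-z' '\n' | sed '/^$/d' > "$words"
  # the same list without its first word, so line k holds word k+1
  sed '1d' "$words" > "$shifted"
  # line k of the pasted output is "word k <TAB> word k+1";
  # the last word has no partner, so the NF == 2 filter drops that line
  paste "$words" "$shifted" | awk 'NF == 2 { print $1, $2 }' |
    sort | uniq -c | awk '{ print $1, $2, $3 }' | sort -k1,1nr -k2
  rm -f "$words" "$shifted"
}

# Toy input, invented for this example only
printf 'the cat sat on the mat and the cat slept\n' | count_bigrams | head -n 3
```

Repeating the shift once more and pasting a third column gives 3-grams, and so on. In this sketch the incomplete pair at the end of the list is filtered out explicitly.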
### plot the histograms
par(mfrow = c(3, 1), mar =c(3,5,2,4), las=2)
histblog<-read.table("freq-one-en_US.blogs.txt", header=F)
histnews<-read.table("freq-one-en_US.news.txt", header=F)
histtw<-read.table("freq-one-en_US.twitter.txt", header=F)
barplot(histblog[1:20,1],names.arg=histblog[1:20,2],col="red", main = "Histogram of 1-gram in en_US.blogs")
barplot(histnews[1:20,1],names.arg=histnews[1:20,2],col="red", main = "Histogram of 1-gram in en_US.news")
barplot(histtw[1:20,1],names.arg=histtw[1:20,2],col="red", main = "Histogram of 1-gram in en_US.twitter")
### plot the word cloud
library(wordcloud)
require(RColorBrewer)
par(mfrow = c(1, 3), mar =c(2,4,0,2))
pal2 <- brewer.pal(8,"Dark2")
wordcloud(histblog[1:100,2], histblog[1:100,1], min.freq =100, random.order = F, ordered.colors = F, colors=pal2)
text(x=0.5, y=0, "1-gram of en_US.blogs")
wordcloud(histnews[1:100,2], histnews[1:100,1], min.freq =100, random.order = F, ordered.colors = F, colors=pal2)
text(x=0.5, y=0, "1-gram of en_US.news")
wordcloud(histtw[1:100,2], histtw[1:100,1], min.freq =100, random.order = F, ordered.colors = F, colors=pal2)
text(x=0.5, y=0, "1-gram of en_US.twitter")
### plot the histograms
par(mfrow = c(3, 1), mar =c(5,5,2,4), las=2)
histblog<-read.table("freq-two-en_US.blogs.txt", header=F)
histblog[,2]<-paste(histblog[,2],histblog[,3])
histnews<-read.table("freq-two-en_US.news.txt", header=F)
histnews[,2]<-paste(histnews[,2],histnews[,3])
histtw<-read.table("freq-two-en_US.twitter.txt", header=F)
histtw[,2]<-paste(histtw[,2],histtw[,3])
barplot(histblog[1:20,1],names.arg=histblog[1:20,2],col="red", main = "Histogram of 2-gram in en_US.blogs")
barplot(histnews[1:20,1],names.arg=histnews[1:20,2],col="red", main = "Histogram of 2-gram in en_US.news")
barplot(histtw[1:20,1],names.arg=histtw[1:20,2],col="red", main = "Histogram of 2-gram in en_US.twitter")
### plot the word cloud
library(wordcloud)
require(RColorBrewer)
par(mfrow = c(1, 3), mar =c(2,4,0,2))
pal2 <- brewer.pal(8,"Dark2")
wordcloud(histblog[1:70,2], histblog[1:70,1], min.freq =100, random.order = F, ordered.colors = F, colors=pal2)
text(x=0.5, y=0, "2-gram of en_US.blogs")
wordcloud(histnews[1:70,2], histnews[1:70,1], min.freq =100, random.order = F, ordered.colors = F, colors=pal2)
text(x=0.5, y=0, "2-gram of en_US.news")
wordcloud(histtw[1:70,2], histtw[1:70,1], min.freq =100, random.order = F, ordered.colors = F, colors=pal2)
text(x=0.5, y=0, "2-gram of en_US.twitter")
### plot the histograms
par(mfrow = c(3, 1), mar =c(8,5,2,4), las=2)
histblog<-read.table("freq-three-en_US.blogs.txt", header=F)
histblog[,2]<-paste(histblog[,2],histblog[,3], histblog[,4])
histnews<-read.table("freq-three-en_US.news.txt", header=F)
histnews[,2]<-paste(histnews[,2],histnews[,3],histnews[,4])
histtw<-read.table("freq-three-en_US.twitter.txt", header=F)
histtw[,2]<-paste(histtw[,2],histtw[,3],histtw[,4])
barplot(histblog[1:20,1],names.arg=histblog[1:20,2],col="red", main = "Histogram of 3-gram in en_US.blogs")
barplot(histnews[1:20,1],names.arg=histnews[1:20,2],col="red", main = "Histogram of 3-gram in en_US.news")
barplot(histtw[1:20,1],names.arg=histtw[1:20,2],col="red", main = "Histogram of 3-gram in en_US.twitter")
### plot the word cloud
library(wordcloud)
require(RColorBrewer)
par(mfrow = c(1, 3), mar =c(2,4,0,2))
pal2 <- brewer.pal(8,"Dark2")
wordcloud(histblog[1:40,2], histblog[1:40,1], min.freq =100, random.order = F, ordered.colors = F, colors=pal2, scale = c(3, 0.2))
text(x=0.5, y=0, "3-gram of en_US.blogs")
wordcloud(histnews[1:40,2], histnews[1:40,1], min.freq =100, random.order = F, ordered.colors = F, colors=pal2, scale = c(3, 0.2))
text(x=0.5, y=0, "3-gram of en_US.news")
wordcloud(histtw[1:40,2], histtw[1:40,1], min.freq =100, random.order = F, ordered.colors = F, colors=pal2, scale = c(3, 0.2))
text(x=0.5, y=0, "3-gram of en_US.twitter")
### plot the histograms
par(mfrow = c(3, 1), mar =c(10,5,1,4), las=2)
histblog<-read.table("freq-four-en_US.blogs.txt", header=F)
histblog[,2]<-paste(histblog[,2],histblog[,3], histblog[,4], histblog[,5])
histnews<-read.table("freq-four-en_US.news.txt", header=F)
histnews[,2]<-paste(histnews[,2],histnews[,3],histnews[,4],histnews[,5])
histtw<-read.table("freq-four-en_US.twitter.txt", header=F)
histtw[,2]<-paste(histtw[,2],histtw[,3],histtw[,4],histtw[,5])
barplot(histblog[1:20,1],names.arg=histblog[1:20,2],col="red", main = "Histogram of 4-gram in en_US.blogs")
barplot(histnews[1:20,1],names.arg=histnews[1:20,2],col="red", main = "Histogram of 4-gram in en_US.news")
barplot(histtw[1:20,1],names.arg=histtw[1:20,2],col="red", main = "Histogram of 4-gram in en_US.twitter")
### plot the word cloud
library(wordcloud)
require(RColorBrewer)
par(mfrow = c(1, 3), mar =c(2,4,0,2))
pal2 <- brewer.pal(8,"Dark2")
wordcloud(histblog[1:30,2], histblog[1:30,1], min.freq =100, random.order = F, ordered.colors = F, colors=pal2, scale = c(2, 0.1))
text(x=0.5, y=0, "4-gram of en_US.blogs")
wordcloud(histnews[1:30,2], histnews[1:30,1], min.freq =100, random.order = F, ordered.colors = F, colors=pal2, scale = c(2, 0.1))
text(x=0.5, y=0, "4-gram of en_US.news")
wordcloud(histtw[1:30,2], histtw[1:30,1], min.freq =100, random.order = F, ordered.colors = F, colors=pal2, scale = c(2, 0.1))
text(x=0.5, y=0, "4-gram of en_US.twitter")