Natural language processing (NLP)

Introduction

This data science capstone is run in cooperation with SwiftKey, a company that builds a smart keyboard that predicts words so that people can type more easily on their mobile devices.

In this capstone we will work on understanding and building predictive text models like those used by SwiftKey.

Data Source

The data set, the Capstone Dataset, was downloaded from the Coursera site. It comes from a corpus called HC Corpora.

Motivation of This Report

To predict words through text mining, I will acquire and clean the data sets and then perform an exploratory analysis.

I will briefly summarize the major features of the data I have identified.

I will make a plan for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data scientist manager.

Exploratory Analysis in This Report

1. Basic summaries of the three data sets, including word counts, line counts and basic data tables.

2. Basic plots, such as histograms, to illustrate features of the data.

My Plan for Creating the Prediction Algorithm and Shiny App

After consulting the Wiki page, the Coursera course on NLP, and some documents about R and text mining, and after going through the discussions in the forum, I plan to use tri-grams, which offer a reasonable balance between accuracy and query speed.

My plan is as follows:

  1. Obtain the data
  2. Clean the data: remove strange symbols, numbers, punctuation and uninterpretable words, and change all words to lowercase.
  3. Filter profanity words. I chose to use the Offensive/Profane Word List from Luis von Ahn at Carnegie Mellon.
  4. Tokenize the sentences (break sentences down into units of meaning).
  5. Perform exploratory analysis of the data to understand the distribution and relationship of 1-, 2- and 3-grams in the dataset.
  6. Build a basic n-gram model.
  7. Build a predictive model.
  8. Evaluate the model for efficiency and accuracy and optimize the model.
  9. Create a Shiny app that accepts an n-gram and predicts the next word.
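
Steps 6 to 9 of this plan can be illustrated with a small sketch. It assumes a hypothetical counts file in the `uniq -c` layout (a count followed by three words) and returns the most frequent third word for a two-word prefix. The function name `predict_next` and the toy counts are invented for illustration; this is not the final model.

```shell
#!/bin/sh
# Hypothetical sketch: predict the next word from trigram counts.
# Assumed input format: "<count> <w1> <w2> <w3>" per line, as written by `uniq -c`.

predict_next() {
    # $1 = counts file, $2 = first word of the prefix, $3 = second word
    awk -v a="$2" -v b="$3" '
        $2 == a && $3 == b && $1 + 0 > best { best = $1 + 0; word = $4 }
        END { if (word != "") print word }
    ' "$1"
}

# Toy counts, invented for illustration only.
counts=$(mktemp)
trap 'rm -f "$counts"' EXIT
cat > "$counts" <<'EOF'
   12 thanks for the
    3 thanks for your
    7 one of the
    2 one of my
EOF

predict_next "$counts" thanks for    # prints: the
predict_next "$counts" one of        # prints: the
```

For a prefix that never occurs in the counts, the sketch prints nothing; a fuller model would probably fall back to 2-gram or 1-gram counts in that case.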

Exploratory Analysis

Summaries of the three data sets

I chose the en_US data sets for training the prediction model. The data come from three sources: blogs, news, and Twitter.

Word counts and line counts of each file

The size of each data set is as follows:

# Linux command-line console (bash code); note that plain wc reports bytes in its third column
echo 'lines   words           characters      size            filename' ;\
wc * |awk '{printf"%s\t%s\t%s\t%sM\t%s\t\n", $1, $2, $3, $3/1024/1024, $4}'
#result
lines   words           characters      size            filename
899288  37334131        210160014       200.424M        en_US.blogs.txt 
1010242 34372530        205811889       196.278M        en_US.news.txt  
2360148 30373583        167105338       159.364M        en_US.twitter.txt       
4269678 102080244       583077241       556.066M        total    

A brief look at the data in the Linux command-line console shows 4,269,678 lines, 102,080,244 words, 583,077,241 characters and 556M of data in total. The blogs file contains 899,288 lines and 200M, the news file has 1,010,242 lines and 196M, and the Twitter file has 2,360,148 lines and 159M.
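
As a quick check on these totals, the average number of words per line can be derived from the `wc` figures. The sketch below embeds the numbers from the table above as input; the function name `words_per_line` is an illustrative assumption.

```shell
#!/bin/sh
# Sketch: average words per line, computed from wc-style output
# (columns: lines, words, characters, filename).

words_per_line() {
    # Skip the "total" row and print "<filename><TAB><words per line>".
    awk '$4 != "total" && $1 > 0 { printf "%s\t%.1f\n", $4, $2 / $1 }'
}

# Figures copied from the wc output in this report.
summary=$(words_per_line <<'EOF'
899288  37334131  210160014  en_US.blogs.txt
1010242 34372530  205811889  en_US.news.txt
2360148 30373583  167105338  en_US.twitter.txt
4269678 102080244 583077241  total
EOF
)
printf '%s\n' "$summary"
```

On these figures, blog lines average about 41.5 words, news lines about 34.0 and tweets about 12.9, so the Twitter file has the most lines but by far the shortest ones.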

Pre-processing

Step 1: Clean the data

  • I remove strange symbols, numbers, punctuation, extra white space and uninterpretable words, and change all words to lowercase.
  • Then I filter profanity words.

    # Linux command-line console (bash code)
    # build one regex alternation from the word list (spaces stripped from the entries)
    profanityWord=$(paste -s -d '|' bad-words.txt | sed 's/ //g')
    for i in en_US.blogs.txt en_US.news.txt en_US.twitter.txt; do
        # replace everything except letters, spaces and newlines with a space
        tr -c ' a-zA-Z\n' ' ' < "$i" > clean1
        # squeeze repeated spaces
        tr -s ' ' < clean1 > clean2
        # convert to lower case
        tr 'A-Z' 'a-z' < clean2 > clean3
        # remove all profanity words (note: the pattern also matches inside longer words)
        awk -v pat="$profanityWord" '{gsub(pat, "")}1' clean3 > "clean-$i"
    done
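
One caveat with the `gsub` filter above is that it also matches inside longer words, so a listed word that happens to be part of a harmless word is cut out of it. A possible alternative, sketched below, compares whole tokens against the list instead. The function name `filter_words` and the placeholder word list are assumptions for illustration; the real list would be `bad-words.txt`.

```shell
#!/bin/sh
# Sketch: drop whole words that appear in a block list (one word per line).
# Unlike a regex gsub, this keeps words that merely contain a listed word.

filter_words() {
    # $1 = block-list file (assumed non-empty); the text is read from stdin
    awk '
        NR == FNR { if ($1 != "") bad[$1] = 1; next }   # first file: load the list
        {
            out = ""
            for (k = 1; k <= NF; k++)
                if (!($k in bad))
                    out = out (out == "" ? "" : " ") $k
            print out
        }
    ' "$1" -
}

# Placeholder list with harmless stand-in words (hypothetical).
list=$(mktemp)
trap 'rm -f "$list"' EXIT
printf 'darn\nheck\n' > "$list"

printf 'what the heck is a darning needle\n' | filter_words "$list"
# prints: what the is a darning needle
```

Only the first token of each list entry is used here, so multi-word entries would need separate handling.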

Step 2: Tokenize and get the n-gram frequency

    # Linux command-line console (bash code)
    for i in en_US.blogs.txt en_US.news.txt en_US.twitter.txt; do
        # create the word list, one word per line
        tr -sc 'A-Za-z' '\n' < "clean-$i" > "oneword-$i"
        # 1-gram (one word) frequency, top 100
        sort "oneword-$i" | uniq -c | sort -n -r | head -n 100 > "freq-one-$i"
        # 2-gram (two words) frequency: pair each word with the word that follows it
        sed '1d' "oneword-$i" > "rm1word-$i"
        paste "oneword-$i" "rm1word-$i" > "twoword-$i"
        sort "twoword-$i" | uniq -c | sort -n -r | head -n 100 > "freq-two-$i"
        # 3-gram (three words) frequency
        sed '1d' "rm1word-$i" > "rm2word-$i"
        paste "twoword-$i" "rm2word-$i" > "threeword-$i"
        sort "threeword-$i" | uniq -c | sort -n -r | head -n 100 > "freq-three-$i"
        # 4-gram (four words) frequency
        sed '1d' "rm2word-$i" > "rm3word-$i"
        paste "threeword-$i" "rm3word-$i" > "fourword-$i"
        sort "fourword-$i" | uniq -c | sort -n -r | head -n 100 > "freq-four-$i"
    done
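
The shift-and-paste idea used above is easier to see on a tiny input: pairing the word list with a copy of itself shifted by one line yields every adjacent word pair. This is a minimal sketch; the function name `bigram_counts` and the sample sentence are illustrative, and `tail -n +2` stands in for `sed '1d'`.

```shell
#!/bin/sh
# Sketch: count bigrams by pasting a word list next to itself shifted by one line.

bigram_counts() {
    # Reads text on stdin; prints "<count> <w1> <w2>", highest count first.
    words=$(mktemp)
    tr -sc 'A-Za-z' '\n' | tr 'A-Z' 'a-z' | sed '/^$/d' > "$words"
    tail -n +2 "$words" | paste -d ' ' "$words" - |
        awk 'NF == 2' |                    # drop the unpaired last word
        sort | uniq -c | sort -k1,1nr -k2 |
        awk '{ print $1, $2, $3 }'         # normalise the uniq -c padding
    rm -f "$words"
}

printf 'the cat sat on the mat and the cat slept\n' | bigram_counts
# first line printed: 2 the cat
```

Because the whole input is flattened into a single word list, pairs are also formed across line breaks; the same holds for the pipeline above.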

Histograms and word cloud of the n-gram frequency

I use the 100 most frequent n-grams to draw the histograms and word clouds.

Histograms and word cloud of 1-gram

### plot the histograms
# column 1 = count, column 2 = word
par(mfrow = c(3, 1), mar = c(3, 5, 2, 4), las = 2)
histblog <- read.table("freq-one-en_US.blogs.txt", header = FALSE)
histnews <- read.table("freq-one-en_US.news.txt", header = FALSE)
histtw <- read.table("freq-one-en_US.twitter.txt", header = FALSE)
barplot(histblog[1:20, 1], names.arg = histblog[1:20, 2], col = "red", main = "Histogram of 1-gram in en_US.blogs")
barplot(histnews[1:20, 1], names.arg = histnews[1:20, 2], col = "red", main = "Histogram of 1-gram in en_US.news")
barplot(histtw[1:20, 1], names.arg = histtw[1:20, 2], col = "red", main = "Histogram of 1-gram in en_US.twitter")

### plot the word cloud
library(wordcloud)
library(RColorBrewer)
par(mfrow = c(1, 3), mar = c(2, 4, 0, 2))
pal2 <- brewer.pal(8, "Dark2")
wordcloud(histblog[1:100, 2], histblog[1:100, 1], min.freq = 100, random.order = FALSE, ordered.colors = FALSE, colors = pal2)
text(x = 0.5, y = 0, "1-gram of en_US.blogs")
wordcloud(histnews[1:100, 2], histnews[1:100, 1], min.freq = 100, random.order = FALSE, ordered.colors = FALSE, colors = pal2)
text(x = 0.5, y = 0, "1-gram of en_US.news")
wordcloud(histtw[1:100, 2], histtw[1:100, 1], min.freq = 100, random.order = FALSE, ordered.colors = FALSE, colors = pal2)
text(x = 0.5, y = 0, "1-gram of en_US.twitter")

Histograms and word cloud of 2-gram

### plot the histograms
# column 1 = count, columns 2-3 = the two words, joined into one label
par(mfrow = c(3, 1), mar = c(5, 5, 2, 4), las = 2)
histblog <- read.table("freq-two-en_US.blogs.txt", header = FALSE)
histblog[, 2] <- paste(histblog[, 2], histblog[, 3])
histnews <- read.table("freq-two-en_US.news.txt", header = FALSE)
histnews[, 2] <- paste(histnews[, 2], histnews[, 3])
histtw <- read.table("freq-two-en_US.twitter.txt", header = FALSE)
histtw[, 2] <- paste(histtw[, 2], histtw[, 3])
barplot(histblog[1:20, 1], names.arg = histblog[1:20, 2], col = "red", main = "Histogram of 2-gram in en_US.blogs")
barplot(histnews[1:20, 1], names.arg = histnews[1:20, 2], col = "red", main = "Histogram of 2-gram in en_US.news")
barplot(histtw[1:20, 1], names.arg = histtw[1:20, 2], col = "red", main = "Histogram of 2-gram in en_US.twitter")

### plot the word cloud
library(wordcloud)
library(RColorBrewer)
par(mfrow = c(1, 3), mar = c(2, 4, 0, 2))
pal2 <- brewer.pal(8, "Dark2")
wordcloud(histblog[1:70, 2], histblog[1:70, 1], min.freq = 100, random.order = FALSE, ordered.colors = FALSE, colors = pal2)
text(x = 0.5, y = 0, "2-gram of en_US.blogs")
wordcloud(histnews[1:70, 2], histnews[1:70, 1], min.freq = 100, random.order = FALSE, ordered.colors = FALSE, colors = pal2)
text(x = 0.5, y = 0, "2-gram of en_US.news")
wordcloud(histtw[1:70, 2], histtw[1:70, 1], min.freq = 100, random.order = FALSE, ordered.colors = FALSE, colors = pal2)
text(x = 0.5, y = 0, "2-gram of en_US.twitter")

Histograms and word cloud of 3-gram

### plot the histograms
# column 1 = count, columns 2-4 = the three words, joined into one label
par(mfrow = c(3, 1), mar = c(8, 5, 2, 4), las = 2)
histblog <- read.table("freq-three-en_US.blogs.txt", header = FALSE)
histblog[, 2] <- paste(histblog[, 2], histblog[, 3], histblog[, 4])
histnews <- read.table("freq-three-en_US.news.txt", header = FALSE)
histnews[, 2] <- paste(histnews[, 2], histnews[, 3], histnews[, 4])
histtw <- read.table("freq-three-en_US.twitter.txt", header = FALSE)
histtw[, 2] <- paste(histtw[, 2], histtw[, 3], histtw[, 4])
barplot(histblog[1:20, 1], names.arg = histblog[1:20, 2], col = "red", main = "Histogram of 3-gram in en_US.blogs")
barplot(histnews[1:20, 1], names.arg = histnews[1:20, 2], col = "red", main = "Histogram of 3-gram in en_US.news")
barplot(histtw[1:20, 1], names.arg = histtw[1:20, 2], col = "red", main = "Histogram of 3-gram in en_US.twitter")

### plot the word cloud
library(wordcloud)
library(RColorBrewer)
par(mfrow = c(1, 3), mar = c(2, 4, 0, 2))
pal2 <- brewer.pal(8, "Dark2")
wordcloud(histblog[1:40, 2], histblog[1:40, 1], min.freq = 100, random.order = FALSE, ordered.colors = FALSE, colors = pal2, scale = c(3, 0.2))
text(x = 0.5, y = 0, "3-gram of en_US.blogs")
wordcloud(histnews[1:40, 2], histnews[1:40, 1], min.freq = 100, random.order = FALSE, ordered.colors = FALSE, colors = pal2, scale = c(3, 0.2))
text(x = 0.5, y = 0, "3-gram of en_US.news")
wordcloud(histtw[1:40, 2], histtw[1:40, 1], min.freq = 100, random.order = FALSE, ordered.colors = FALSE, colors = pal2, scale = c(3, 0.2))
text(x = 0.5, y = 0, "3-gram of en_US.twitter")

Histograms and word cloud of 4-gram

### plot the histograms
# column 1 = count, columns 2-5 = the four words, joined into one label
par(mfrow = c(3, 1), mar = c(10, 5, 1, 4), las = 2)
histblog <- read.table("freq-four-en_US.blogs.txt", header = FALSE)
histblog[, 2] <- paste(histblog[, 2], histblog[, 3], histblog[, 4], histblog[, 5])
histnews <- read.table("freq-four-en_US.news.txt", header = FALSE)
histnews[, 2] <- paste(histnews[, 2], histnews[, 3], histnews[, 4], histnews[, 5])
histtw <- read.table("freq-four-en_US.twitter.txt", header = FALSE)
histtw[, 2] <- paste(histtw[, 2], histtw[, 3], histtw[, 4], histtw[, 5])
barplot(histblog[1:20, 1], names.arg = histblog[1:20, 2], col = "red", main = "Histogram of 4-gram in en_US.blogs")
barplot(histnews[1:20, 1], names.arg = histnews[1:20, 2], col = "red", main = "Histogram of 4-gram in en_US.news")
barplot(histtw[1:20, 1], names.arg = histtw[1:20, 2], col = "red", main = "Histogram of 4-gram in en_US.twitter")

### plot the word cloud
library(wordcloud)
library(RColorBrewer)
par(mfrow = c(1, 3), mar = c(2, 4, 0, 2))
pal2 <- brewer.pal(8, "Dark2")
wordcloud(histblog[1:30, 2], histblog[1:30, 1], min.freq = 100, random.order = FALSE, ordered.colors = FALSE, colors = pal2, scale = c(2, 0.1))
text(x = 0.5, y = 0, "4-gram of en_US.blogs")
wordcloud(histnews[1:30, 2], histnews[1:30, 1], min.freq = 100, random.order = FALSE, ordered.colors = FALSE, colors = pal2, scale = c(2, 0.1))
text(x = 0.5, y = 0, "4-gram of en_US.news")
wordcloud(histtw[1:30, 2], histtw[1:30, 1], min.freq = 100, random.order = FALSE, ordered.colors = FALSE, colors = pal2, scale = c(2, 0.1))
text(x = 0.5, y = 0, "4-gram of en_US.twitter")

Summary

  1. I have shown my plan for creating the prediction algorithm and Shiny app.
  2. I have summarized the major features of the data I have identified.
  3. I have done an exploratory analysis, including word counts, line counts and basic data tables.
  4. I have made basic plots, such as histograms and word clouds of 1-, 2-, 3- and 4-grams.
  5. Although the report shows a lot of code, I hope it reads concisely. Thanks!