Instructions

The goal of this project is just to display that you’ve gotten used to working with the data and that you are on track to create your prediction algorithm. Please submit a report on R Pubs (http://rpubs.com/) that explains your exploratory analysis and your goals for the eventual app and algorithm. This document should be concise and explain only the major features of the data you have identified and briefly summarize your plans for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data scientist manager.

You should make use of tables and plots to illustrate important summaries of the data set. The motivation for this project is to:
1. Demonstrate that you’ve downloaded the data and have successfully loaded it in.
2. Create a basic report of summary statistics about the data sets.
3. Report any interesting findings that you amassed so far.
4. Get feedback on your plans for creating a prediction algorithm and Shiny app.

Initial Data Processing

In this section, we download the data and do some initial processing to clean it and make it ready for analysis. First, load the necessary packages:

library(NLP)
library(tm)
library(stringi)
library(wordcloud)
## Loading required package: RColorBrewer
library(ggplot2)
## Warning: replacing previous import 'vctrs::data_frame' by 'tibble::data_frame'
## when loading 'dplyr'
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(RWeka)
library(textmineR)
## Loading required package: Matrix
## 
## Attaching package: 'textmineR'
## The following object is masked from 'package:Matrix':
## 
##     update
## The following object is masked from 'package:stats':
## 
##     update

Downloading and reading the data

Download the ZIP file and then unzip it. Check if the files exist before processing.

filename <- "MySwiftKey.zip"
fileURL <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists(filename)){
        download.file(fileURL, filename, method = "curl")
}
foldername <- "final"
if (!file.exists(foldername)){
        unzip(filename)
}
## Warning in unzip(filename): error 1 in extracting from zip file

Let's start with the English-language data and set the working directory to the en_US folder.

setwd("/Users/dongyingwang/Documents/R-studio/DataScienceCapstone_project_MySwiftKey/final/en_US")

Read the three files (blogs, news and Twitter) into R:

fileName = "en_US.blogs.txt"
con = file(fileName, open = "r")
lineBlogs = readLines(con, encoding='UTF-8') 
close(con)
fileName = "en_US.news.txt"
con = file(fileName, open = "r")
lineNews = readLines(con, encoding='UTF-8') 
close(con)
fileName = "en_US.twitter.txt"
con = file(fileName, open = "r")
lineTwitters = readLines(con, encoding='UTF-8') 
## Warning in readLines(con, encoding = "UTF-8"): line 167155 appears to contain an
## embedded nul
## Warning in readLines(con, encoding = "UTF-8"): line 268547 appears to contain an
## embedded nul
## Warning in readLines(con, encoding = "UTF-8"): line 1274086 appears to contain
## an embedded nul
## Warning in readLines(con, encoding = "UTF-8"): line 1759032 appears to contain
## an embedded nul
close(con)
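
Before sampling, a quick look at the size of each file puts the rest of the analysis in context. The sketch below uses the character vectors loaded above and the stringi package to count lines and words; the exact numbers will depend on the downloaded version of the data set.

# Summary statistics for the three US English files (sketch; numbers depend
# on the downloaded data). stri_count_words() comes from the stringi package
# loaded earlier; line lengths are counted in bytes to avoid encoding issues.
fileSummary <- data.frame(
        file = c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"),
        lines = c(length(lineBlogs), length(lineNews), length(lineTwitters)),
        words = c(sum(stri_count_words(lineBlogs)),
                  sum(stri_count_words(lineNews)),
                  sum(stri_count_words(lineTwitters))),
        longestLine = c(max(nchar(lineBlogs, type = "bytes")),
                        max(nchar(lineNews, type = "bytes")),
                        max(nchar(lineTwitters, type = "bytes"))))
fileSummary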

Sampling and cleaning the text

In this section, I'll use the Twitter text as an example to show the pre-processing done before analyzing the data. First, sample 0.5% of the data set:

set.seed(404) # For reproducibility 
percentage = 0.005
sampleTwitters <- sample(lineTwitters, percentage*length(lineTwitters))
length(sampleTwitters)
## [1] 11800
head(sampleTwitters)
## [1] "No, thank you... You have some good input."                                                                                    
## [2] "swarley is bathing himself on my final project #whatever #nobigdeal and now he just slapped madden in the face. whats going on"
## [3] "Yeah, I guess I'm the only M. Bennett fan too. I waffled on what TE to nab, but like his upside and youth."                    
## [4] "My momma didn't rise no fool :)"                                                                                               
## [5] "Word through the grapevine is that you're a fellow Trojan... Fight on."                                                        
## [6] "Oklahoma St. Has never heard of a QB spyq"

Then replace non-ASCII characters with safe byte codes and turn the sample into a corpus for text mining.

sampleTwitters <- iconv(sampleTwitters, 'UTF-8', 'ASCII', "byte") ## replace non-ASCII characters with byte codes
sourceVec <- VectorSource(sampleTwitters) # turn character into source object
corpVec <- VCorpus(sourceVec) # turn source object into corpus
corpVec <- tm_map(corpVec, content_transformer(tolower)) # convert to lower case, keeping the document structure
corpVec <- tm_map(corpVec, removeWords, stopwords("english")) # remove 'a', 'an', etc.
corpVec <- tm_map(corpVec, removePunctuation)
corpVec <- tm_map(corpVec, removeNumbers)
corpVec <- tm_map(corpVec, stripWhitespace)

Another important step is to remove offensive language. I used the word list from https://www.cs.cmu.edu/~biglou/resources/.

filename <- "offensive_words.txt"
fileURL <- "https://www.cs.cmu.edu/~biglou/resources/bad-words.txt"
if (!file.exists(filename)){
        download.file(fileURL, filename, method = "curl")
}
offensive_words <- readLines(filename, encoding='UTF-8')
offensive_words <- offensive_words[offensive_words != ""]
corpVec <- tm_map(corpVec, removeWords, offensive_words)

Let's make a word cloud to see the most frequently used words in the US Twitter sample:

wordcloud(corpVec, max.words = 100, random.order = FALSE,
          rot.per=0.35, use.r.layout=FALSE,colors=brewer.pal(8, "Dark2"))

Exploratory data analysis

Let's build a term-document matrix from the corpus, convert it into a data frame, and look at a bar plot of the most frequently used words in the US Twitter sample to see whether it is consistent with the word cloud made previously.

UnigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
tdm_1 <- TermDocumentMatrix(corpVec, control = list(tokenize = UnigramTokenizer))
term_matrix_1 <- as.matrix(tdm_1)   ## convert our term-document-matrix into a normal matrix
freq_words <- rowSums(term_matrix_1)
freq_words <- as.data.frame(sort(freq_words, decreasing=TRUE))
freq_words$term <- rownames(freq_words)
colnames(freq_words) <- c("Frequency","words")
g <- ggplot(data = freq_words[1:20,], 
            aes(x = reorder(words, Frequency), y = Frequency, fill = Frequency)) 
g + geom_bar(stat="identity") + xlab('Words') + ylab("Frequency") + 
        theme(axis.text.x = element_text(angle = 90, hjust = 1)) + coord_flip()

From the plot above, the five most frequently used words in the US Twitter sample are: just, like, get, love and good.

BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tdm_2 <- TermDocumentMatrix(corpVec, control = list(tokenize = BigramTokenizer))
term_matrix_2 <- as.matrix(tdm_2)   ## convert our term-document-matrix into a normal matrix
freq_2words <- rowSums(term_matrix_2)
freq_2words <- as.data.frame(sort(freq_2words, decreasing=TRUE))
freq_2words$term <- rownames(freq_2words)
colnames(freq_2words) <- c("Frequency","term")
g <- ggplot(data = freq_2words[1:20,], 
            aes(x = reorder(term, Frequency), y = Frequency, fill = Frequency)) 
g + geom_bar(stat="identity") + xlab('2-Words Terms') + ylab("Frequency") + 
        theme(axis.text.x = element_text(angle = 90, hjust = 1)) + coord_flip()

From the plot above, the five most frequently used two-word terms in the US Twitter sample are: right now, last night, happy birthday, looking forward and just got.

TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
tdm_3 <- TermDocumentMatrix(corpVec, control = list(tokenize = TrigramTokenizer))
term_matrix_3 <- as.matrix(tdm_3)   ## convert our term-document-matrix into a normal matrix
freq_3words <- rowSums(term_matrix_3)
freq_3words <- as.data.frame(sort(freq_3words, decreasing=TRUE))
freq_3words$term <- rownames(freq_3words)
colnames(freq_3words) <- c("Frequency","term")
g <- ggplot(data = freq_3words[1:20,], 
            aes(x = reorder(term, Frequency), y = Frequency, fill = Frequency)) 
g + geom_bar(stat="identity") + xlab('3-Words Terms') + ylab("Frequency") + 
        theme(axis.text.x = element_text(angle = 90, hjust = 1)) + coord_flip()

From the plot above, the five most frequently used three-word terms in the US Twitter sample are: like like like, happy mothers day, foul foul foul, let us know and cinco de mayo.
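
As a preview of how these n-gram tables could feed the eventual prediction algorithm, here is a minimal sketch that looks up the most frequent continuations of a two-word prefix in the trigram table built above. The helper name predictFromTrigrams is my own placeholder, not part of any package; a real model would add lower-order backoff and smoothing.

# Minimal sketch: given a two-word prefix, return the most frequent third
# words among the observed trigrams. predictFromTrigrams is a placeholder
# name; freq_3words is already sorted by decreasing frequency.
predictFromTrigrams <- function(prefix, trigram_freq = freq_3words, n = 3) {
        parts <- strsplit(trigram_freq$term, " ")
        prefixes <- sapply(parts, function(p) paste(p[1:2], collapse = " "))
        nextWords <- sapply(parts, function(p) p[3])
        hits <- which(prefixes == tolower(prefix))
        if (length(hits) == 0) return(character(0))
        head(nextWords[hits], n)
}
predictFromTrigrams("happy mothers")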

Plan for future steps

So far, I have shown an n-gram analysis of the US Twitter data set. The main problem I am facing is the size of the full data set; scaling the processing beyond a small sample will be the main thing to address next.
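
One direction I plan to explore for the size problem (a sketch, not final code) is to keep the term-document matrices in their sparse form instead of converting them with as.matrix(), and to drop very rare terms before counting. The tm and slam packages support this directly.

# Sketch: slam::row_sums() works on the sparse representation that tm uses,
# so the dense as.matrix() conversion above can be avoided entirely, and
# removeSparseTerms() drops extremely rare terms first.
library(slam)
tdm_2_small <- removeSparseTerms(tdm_2, 0.9999)
freq_2words_sparse <- sort(row_sums(tdm_2_small), decreasing = TRUE)
head(freq_2words_sparse, 10)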

Also, the blogs and news data reflect different writing habits, and switching to another language would introduce further differences. The final model should take all of these into account, for example by drawing the training sample from all three sources, as sketched below.
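
A simple way to capture those differences in the English data (sketched below; the 0.5% fraction is an arbitrary choice) is to mix all three sources into one sample and then run the same cleaning pipeline on it. The same steps would be repeated for the other language folders.

# Sketch: combine blogs, news and Twitter into one sample so the model sees
# all three writing styles. The sampling fraction is an arbitrary choice.
set.seed(404)
percentage <- 0.005
sampleAll <- c(sample(lineBlogs, percentage * length(lineBlogs)),
               sample(lineNews, percentage * length(lineNews)),
               sample(lineTwitters, percentage * length(lineTwitters)))
length(sampleAll)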

Finally, after further analysis, text modeling, and text prediction, I'll implement the prediction model as a Shiny app.
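
As a rough outline of what that app could look like (the predictFromTrigrams helper is the placeholder sketched earlier, not a finished model; the real app would load a precomputed n-gram model instead of rebuilding it):

# Rough sketch of the planned Shiny app: a text box for the user's phrase and
# the predicted next word(s) underneath. predictFromTrigrams is the placeholder
# helper sketched above.
library(shiny)
ui <- fluidPage(
        titlePanel("Next Word Prediction (sketch)"),
        textInput("phrase", "Type a phrase:", value = ""),
        verbatimTextOutput("prediction")
)
server <- function(input, output) {
        output$prediction <- renderPrint({
                words <- tail(strsplit(tolower(input$phrase), "\\s+")[[1]], 2)
                if (length(words) < 2) return("Type at least two words...")
                predictFromTrigrams(paste(words, collapse = " "))
        })
}
# shinyApp(ui = ui, server = server)   # run interactively, not when knitting the report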