1. Introduction

This is the milestone report for the Data Science Capstone project. It presents summary statistics for the data, outlines the algorithms used, and walks through some exploratory data analysis.

2. Data

This assessment makes use of raw text data from three sources: news headlines, blog entries, and user tweets. Datasets were made available in German, Russian, and English; however, only the English datasets were used in this project.

3. Loading Data

The first step is to load the English datasets into R.

# Read each English file line by line, then combine all lines into one vector
raw_twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8")
raw_news    <- readLines("final/en_US/en_US.news.txt", encoding = "UTF-8")
raw_blogs   <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8")
raw_data    <- c(raw_twitter, raw_news, raw_blogs)
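
readLines may warn about an incomplete final line, and copies of en_US.news.txt are often reported to contain embedded nul characters. If that happens, a slightly more defensive read is possible; this is a minimal sketch (the read_corpus helper name is my own, not part of the original analysis):

# Suppress the incomplete-final-line warning and skip embedded nuls;
# otherwise equivalent to the readLines calls above.
read_corpus <- function(path) {
  readLines(path, encoding = "UTF-8", warn = FALSE, skipNul = TRUE)
}
# raw_news <- read_corpus("final/en_US/en_US.news.txt")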

4. Exploratory Analysis

Perform some basic analysis of the datasets, guided by the first quiz.

The size of each file, in megabytes:

file.info("final/en_US/en_US.twitter.txt")$size/1024^2
## [1] 159.3641
file.info("final/en_US/en_US.news.txt")$size/1024^2
## [1] 196.2775
file.info("final/en_US/en_US.blogs.txt")$size/1024^2
## [1] 200.4242

The number of lines in each file:

length(raw_twitter)
## [1] 2360148
length(raw_news)
## [1] 77259
length(raw_blogs)
## [1] 899288

The length (in characters) of the longest line in each of the three en_US data sets:

max(nchar(raw_twitter))
## [1] 140
max(nchar(raw_news))
## [1] 5760
max(nchar(raw_blogs))
## [1] 40833

The results above are summarised in the table below:

File                 Size (MB)   Lines      Longest line (chars)
en_US.twitter.txt    159.36      2360148    140
en_US.news.txt       196.28      77259      5760
en_US.blogs.txt      200.42      899288     40833
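
The same summary can also be assembled programmatically from the objects already in memory; this is a sketch only, and summarise_file is a hypothetical helper name:

summarise_file <- function(path, lines) {
  data.frame(File               = basename(path),
             Size_MB            = round(file.info(path)$size / 1024^2, 2),
             Lines              = length(lines),
             Longest_line_chars = max(nchar(lines)))
}
rbind(summarise_file("final/en_US/en_US.twitter.txt", raw_twitter),
      summarise_file("final/en_US/en_US.news.txt",    raw_news),
      summarise_file("final/en_US/en_US.blogs.txt",   raw_blogs))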

In the en_US twitter data set, if you divide the number of lines where the word “love” (all lowercase) occurs by the number of lines the word “hate” (all lowercase) occurs, about what do you get?

val_love <- sum(grepl(pattern = "love", x = raw_twitter))
val_hate <- sum(grepl(pattern = "hate", x = raw_twitter))
val_love / val_hate
## [1] 4.108592
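
Note that grepl matches substrings, so lines containing words such as "lovely" or "hated" are also counted. If only whole-word matches were wanted, a word-boundary pattern could be used instead; this is a minor variation, and the counts (and therefore the ratio) may differ slightly:

# Whole-word variant using regex word boundaries
word_love <- sum(grepl("\\blove\\b", raw_twitter))
word_hate <- sum(grepl("\\bhate\\b", raw_twitter))
word_love / word_hate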

The one tweet in the en_US twitter data set that matches the word “biostats” says what?

raw_twitter[grep(pattern = "biostat", x = raw_twitter)]
## [1] "i know how you feel.. i have biostats on tuesday and i have yet to study =/"

How many tweets have the exact characters “A computer once beat me at chess, but it was no match for me at kickboxing”?

sum(grepl(pattern = "A computer once beat me at chess, but it was no match for me at kickboxing", x = raw_twitter))
## [1] 3
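
Because the pattern here is a literal sentence rather than a regular expression, fixed = TRUE could also be passed to grepl; for this particular string the result is the same, but it avoids any regex interpretation of the pattern:

# Literal (non-regex) match of the same sentence
sum(grepl("A computer once beat me at chess, but it was no match for me at kickboxing",
          raw_twitter, fixed = TRUE))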

Now we calculate the frequency of the words appearing in each file and construct a word cloud for each.

Word cloud for Twitter

library(tm)
## Loading required package: NLP
library(wordcloud)
## Loading required package: RColorBrewer
# Sample roughly 10% of the lines using a binomial draw
index <- as.logical(rbinom(n = length(raw_twitter), size = 1, prob = 0.10))
twitterdata <- raw_twitter[index]
# Delete lines containing characters that cannot be converted to ASCII
dat <- grep("NotKnown", iconv(twitterdata, "latin1", "ASCII", sub = "NotKnown"))
if (length(dat) > 0) twitterdata <- twitterdata[-dat]
# Pre-process data: build a corpus, then strip numbers, punctuation,
# case and English stop words
twitterscloud <- Corpus(VectorSource(twitterdata))
twitterscloud <- tm_map(twitterscloud, removeNumbers)
twitterscloud <- tm_map(twitterscloud, removePunctuation)
twitterscloud <- tm_map(twitterscloud, content_transformer(tolower))
finaltwitter  <- tm_map(twitterscloud, removeWords, stopwords("english"))
# Draw the word cloud
wordcld <- wordcloud(finaltwitter,
           scale = c(5, 0.5),
           max.words = 200,
           random.order = FALSE,
           rot.per = 0.35,
           use.r.layout = FALSE,
           colors = brewer.pal(8, "Dark2"))

Word cloud for blogs

# Sample roughly 10% of the lines using a binomial draw
index <- as.logical(rbinom(n = length(raw_blogs), size = 1, prob = 0.10))
blogdata <- raw_blogs[index]
# Delete lines containing characters that cannot be converted to ASCII
dat <- grep("NotKnown", iconv(blogdata, "latin1", "ASCII", sub = "NotKnown"))
if (length(dat) > 0) blogdata <- blogdata[-dat]
# Pre-process data: build a corpus, then strip numbers, punctuation,
# case and English stop words
blogscloud <- Corpus(VectorSource(blogdata))
blogscloud <- tm_map(blogscloud, removeNumbers)
blogscloud <- tm_map(blogscloud, removePunctuation)
blogscloud <- tm_map(blogscloud, content_transformer(tolower))
finalblog  <- tm_map(blogscloud, removeWords, stopwords("english"))
# Draw the word cloud
wordcld <- wordcloud(finalblog,
           scale = c(5, 0.5),
           max.words = 200,
           random.order = FALSE,
           rot.per = 0.35,
           use.r.layout = FALSE,
           colors = brewer.pal(8, "Dark2"))

Word cloud for news

# Sample roughly 10% of the lines using a binomial draw
index <- as.logical(rbinom(n = length(raw_news), size = 1, prob = 0.10))
newsdata <- raw_news[index]
# Delete lines containing characters that cannot be converted to ASCII
dat <- grep("NotKnown", iconv(newsdata, "latin1", "ASCII", sub = "NotKnown"))
if (length(dat) > 0) newsdata <- newsdata[-dat]
# Pre-process data: build a corpus, then strip numbers, punctuation,
# case and English stop words
newscloud <- Corpus(VectorSource(newsdata))
newscloud <- tm_map(newscloud, removeNumbers)
newscloud <- tm_map(newscloud, removePunctuation)
newscloud <- tm_map(newscloud, content_transformer(tolower))
finalnews <- tm_map(newscloud, removeWords, stopwords("english"))
# Draw the word cloud
wordcld <- wordcloud(finalnews,
           scale = c(5, 0.5),
           max.words = 200,
           random.order = FALSE,
           rot.per = 0.35,
           use.r.layout = FALSE,
           colors = brewer.pal(8, "Dark2"))
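
The three blocks above repeat the same sampling, cleaning, and plotting steps. A small helper could factor this out and also make the word frequencies explicit via a TermDocumentMatrix. This is a sketch only: the make_wordcloud name, the set.seed call, and the frequency step are my additions, not part of the original analysis.

library(slam)   # ships as a tm dependency; row_sums avoids building a dense matrix

make_wordcloud <- function(raw_lines, sample_prob = 0.10) {
  set.seed(1234)                                   # assumed seed, for reproducibility
  idx   <- as.logical(rbinom(length(raw_lines), size = 1, prob = sample_prob))
  lines <- raw_lines[idx]
  bad   <- grep("NotKnown", iconv(lines, "latin1", "ASCII", sub = "NotKnown"))
  if (length(bad) > 0) lines <- lines[-bad]
  corp <- Corpus(VectorSource(lines))
  corp <- tm_map(corp, removeNumbers)
  corp <- tm_map(corp, removePunctuation)
  corp <- tm_map(corp, content_transformer(tolower))
  corp <- tm_map(corp, removeWords, stopwords("english"))
  # Explicit word frequencies, then the same word cloud as above
  tdm  <- TermDocumentMatrix(corp)
  freq <- sort(row_sums(tdm), decreasing = TRUE)
  wordcloud(names(freq), freq,
            scale = c(5, 0.5), max.words = 200, random.order = FALSE,
            rot.per = 0.35, colors = brewer.pal(8, "Dark2"))
  invisible(freq)
}

# Usage with the three corpora loaded earlier:
# twitter_freq <- make_wordcloud(raw_twitter)
# blog_freq    <- make_wordcloud(raw_blogs)
# news_freq    <- make_wordcloud(raw_news)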

5. Next steps