Data Science Capstone Week 2 Slides

Jian Liang
2017-11-15

Summary

This is the first step toward completing the Data Science capstone project offered by JHU on Coursera. The agenda for my exploratory analysis of the English (en) training data set is as follows:

Load the data
Clean the data
Single-word frequency summary
N-gram inspection (n = 2 is shown)

Load Data

Data Source: Capstone Dataset

Package used: “ngram”

head(En_US_Twitter,5)

                                                                                                              txt
1   How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long.
2 When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason.
3                                                                        they've decided its more fun if I don't.
4                            So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;)
5                                                 Words from a complete stranger! Made my birthday even better :)
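The slides do not show how En_US_Twitter was built. One way to obtain a one-column data frame like the one above is sketched here; the helper name load_lines and the file paths are assumptions, not the author's code.

```r
# Hypothetical helper: read a text file into a one-column data frame named txt.
load_lines <- function(path) {
  con <- file(path, open = "r", encoding = "UTF-8")
  on.exit(close(con))
  # skipNul drops embedded NUL bytes, which raw text dumps sometimes contain
  lines <- readLines(con, warn = FALSE, skipNul = TRUE)
  data.frame(txt = lines, stringsAsFactors = FALSE)
}

# Demo on a temporary file so the sketch runs without the capstone data.
tmp <- tempfile(fileext = ".txt")
writeLines(c("How are you?", "Made my birthday even better :)"), tmp)
demo <- load_lines(tmp)
head(demo, 5)
```

For the real data, the call would look like En_US_Twitter <- load_lines("en_US.twitter.txt"), where the path is an assumed location of the Twitter file in the capstone dataset.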

Clean Data

Remove punctuation (apostrophes are kept)
Exclude NA and empty rows (keep nchar > 0)
Convert characters to lower case

# Strip all punctuation except apostrophes; gsub() is vectorised, so sapply() is not needed
En_US_Twitter$txt <- gsub("(?!')[[:punct:]]", "", En_US_Twitter$txt, perl = TRUE)
# Convert to lower case
En_US_Twitter$txt <- tolower(En_US_Twitter$txt)
# Drop NA rows, then rows left empty after cleaning
En_US_Twitter <- na.omit(En_US_Twitter)
En_US_Twitter <- data.frame(txt = En_US_Twitter$txt[nchar(En_US_Twitter$txt) > 0], stringsAsFactors = FALSE)
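To see what the cleaning steps do, they can be tried on a few sample strings. This is a minimal sketch: the helper name clean_text and the sample vector are assumptions made for illustration.

```r
# Hypothetical helper mirroring the three cleaning steps on a character vector.
clean_text <- function(x) {
  # (?!') is a negative lookahead: a punctuation character is removed
  # only if it is not an apostrophe, so contractions survive.
  x <- gsub("(?!')[[:punct:]]", "", x, perl = TRUE)
  x <- tolower(x)
  x <- x[!is.na(x)]   # drop missing rows
  x[nchar(x) > 0]     # drop rows that are empty after cleaning
}

sample_txt <- c("they've decided its more fun if I don't.",
                "...",
                NA,
                "Made my birthday even better :)")
cleaned <- clean_text(sample_txt)
print(cleaned)
```

The punctuation-only row and the NA row are dropped, and the apostrophes in "they've" and "don't" are kept. Whitespace is not trimmed by these steps, so a row can end with a trailing space once its final punctuation is removed.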

Single-word frequency table (top 10)

library(ngram)
ng_twitter <- ngram(En_US_Twitter$txt, n = 1)
twitter_table <- get.phrasetable(ng_twitter)
head_twitter_n1_table <- head(twitter_table, 10)
Plot_1 <- barplot(head_twitter_n1_table$freq,
                  names.arg = head_twitter_n1_table$ngrams,
                  axisnames = TRUE, axes = TRUE,
                  xlab = "word", ylab = "Frequency",
                  col = rgb(0.3, 0.9, 0.4, 0.6))
text(Plot_1, y = head_twitter_n1_table$freq, labels = head_twitter_n1_table$freq, cex = 0.8, col = "black")

[Figure: bar plot of the 10 most frequent single words]
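The single-word counts can be cross-checked without the ngram package. The sketch below uses only base R on a toy vector; the input strings are assumed stand-ins for the cleaned tweets, not capstone data.

```r
# Toy input standing in for the cleaned tweets (assumed data).
txt <- c("love to see you", "you will know", "made my birthday even better")

# Split every line on whitespace and pool the words.
words <- unlist(strsplit(txt, "\\s+"))
words <- words[nchar(words) > 0]

# Count occurrences and sort from most to least frequent.
freq <- sort(table(words), decreasing = TRUE)
top <- head(freq, 10)
print(top)
```

On the full data set this approach builds one long word vector in memory, so it is best suited to a sample of the tweets.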

2-gram plot

Bar plot of the top 10 two-word "phrases" and word cloud of the 100 most frequent two-word "phrases".

[Figure: bar plot of the top 10 2-grams and word cloud of the top 100 2-grams]
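The code for the 2-gram step is not shown on the slide. As an illustration of what a 2-gram count involves, here is a base R sketch that pairs adjacent words line by line; the helper name bigrams_in_line and the toy input are assumptions, and this is a stand-in for the ngram package rather than the author's method.

```r
# Hypothetical helper: return all adjacent word pairs in one line of text.
bigrams_in_line <- function(line) {
  w <- strsplit(line, "\\s+")[[1]]
  w <- w[nchar(w) > 0]
  if (length(w) < 2) return(character(0))  # a one-word line has no 2-gram
  paste(w[-length(w)], w[-1])
}

# Toy input standing in for the cleaned tweets (assumed data).
txt <- c("thanks for the rt", "thanks for the follow", "hello")

# Collect the pairs from every line, then count and sort them.
bigrams <- unlist(lapply(txt, bigrams_in_line))
bigram_freq <- sort(table(bigrams), decreasing = TRUE)
print(head(bigram_freq, 10))
```

Pairs are formed within each line only, so no 2-gram spans two tweets. The word cloud on the slide would need a separate plotting package, which is not covered here.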