Jian Liang
2017-11-15
This is the first step toward completing the Data Science capstone project offered by JHU on Coursera. The agenda for my exploratory analysis of the English (en) training data set is as follows:
Data Source: Capstone Dataset
Packages used: “ngram”
head(En_US_Twitter,5)
txt
1 How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long.
2 When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason.
3 they've decided its more fun if I don't.
4 So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;)
5 Words from a complete stranger! Made my birthday even better :)
# Remove all punctuation except apostrophes, then lowercase (gsub and tolower are vectorized)
En_US_Twitter$txt <- gsub("(?!')[[:punct:]]", "", En_US_Twitter$txt, perl = TRUE)
En_US_Twitter$txt <- tolower(En_US_Twitter$txt)
# Drop missing rows and any lines left empty by the cleaning
En_US_Twitter <- na.omit(En_US_Twitter)
En_US_Twitter <- En_US_Twitter[nchar(En_US_Twitter$txt) > 0, , drop = FALSE]
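To make the effect of the cleaning step concrete, here is a minimal sketch on made-up input. The `sample_txt` vector and the `clean_text` helper are hypothetical and not part of the original analysis. The lookahead `(?!')` excludes the apostrophe from the punctuation class, so contractions such as "they've" survive.

```r
# Hypothetical sample input; not taken from the capstone data set.
sample_txt <- c("They've decided it's MORE fun!", "...", NA)

# Illustrative helper mirroring the cleaning steps above.
clean_text <- function(x) {
  x <- gsub("(?!')[[:punct:]]", "", x, perl = TRUE)  # drop punctuation except apostrophes
  x <- tolower(x)                                     # normalise case
  x[!is.na(x) & nchar(x) > 0]                         # drop missing and empty lines
}

cleaned <- clean_text(sample_txt)
print(cleaned)
```

The line made only of punctuation becomes an empty string and is dropped together with the missing value, so only the first sentence remains.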
library(ngram)
# Count single words (1-grams) and take the first 10 rows of the phrase table
ng_twitter <- ngram(En_US_Twitter$txt, n = 1)
twitter_table <- get.phrasetable(ng_twitter)
head_twitter_n1_table <- head(twitter_table, 10)
# Bar plot of those 10 words, with each count printed at the top of its bar
Plot_1 <- barplot(head_twitter_n1_table$freq, names.arg = head_twitter_n1_table$ngrams, xlab = "Word", ylab = "Frequency", col = rgb(0.3, 0.9, 0.4, 0.6))
text(Plot_1, y = head_twitter_n1_table$freq, labels = head_twitter_n1_table$freq, cex = 0.8, col = "black")
Bar plot of the 10 most frequent 2-word phrases, and word cloud of the 100 most frequent 2-word phrases.
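The 2-word step follows the same pattern as the single-word one, most likely by calling `ngram()` with `n = 2`. As a rough illustration of what that counting involves, the sketch below uses base R only, as a stand-in for the `ngram` package rather than the code behind this report. The `txt` vector and the `bigrams_of` helper are hypothetical.

```r
# Hypothetical cleaned input; stands in for En_US_Twitter$txt.
txt <- c("thanks for the rt", "thanks for the follow", "love to see you")

# Build all adjacent word pairs within one line (no pairs across lines).
bigrams_of <- function(line) {
  words <- strsplit(trimws(line), "[[:space:]]+")[[1]]
  if (length(words) < 2) return(character(0))  # a single word has no pairs
  paste(head(words, -1), tail(words, -1))
}

# Pool the pairs from every line and count them, most frequent first.
all_bigrams <- unlist(lapply(txt, bigrams_of))
bigram_freq <- sort(table(all_bigrams), decreasing = TRUE)
top_bigrams <- head(bigram_freq, 10)
print(top_bigrams)
```

Passing `top_bigrams` to `barplot()` would give the same kind of chart as for single words. A word cloud needs an additional plotting package, which this report does not name.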