This is the milestone report for the R Capstone project. The dataset contains blog, news, and Twitter text in four locales: en_US, de_DE, fi_FI, and ru_RU. Here I analyze the English (en_US) files using the corpus package.
library(corpus)
##Load data
blogsUS <- readLines('final/en_US/en_US.blogs.txt',encoding='UTF-8', warn=FALSE)
twitterUS <-readLines('final/en_US/en_US.twitter.txt',encoding='UTF-8', warn=FALSE)
newsUS <-readLines('final/en_US/en_US.news.txt',encoding='UTF-8', warn=FALSE)
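Note: on some platforms readLines() stops early on en_US.news.txt because the file contains an embedded SUB (0x1A) control character. If the news counts below look suspiciously low, reading through a binary-mode connection is a common workaround (a sketch, using the same path as above):
con <- file('final/en_US/en_US.news.txt', open='rb')
newsUS <- readLines(con, encoding='UTF-8', warn=FALSE)
close(con)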
##Split each source into sentences and remove punctuation
txtsblogsUS <- text_split(blogsUS)
txtsnewsUS <- text_split(newsUS)
txtstwitterUS <- text_split(twitterUS)
txtsblogsUS$text <- gsub("[[:punct:]]", "", txtsblogsUS$text, perl=TRUE)
txtsnewsUS$text <- gsub("[[:punct:]]", "", txtsnewsUS$text, perl=TRUE)
txtstwitterUS$text <- gsub("[[:punct:]]", "", txtstwitterUS$text, perl=TRUE)
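As an aside, corpus can also drop punctuation at tokenization time instead of editing the text, via a text filter passed to functions such as term_stats() (a sketch; the regex-stripped text is kept here so later steps work on the cleaned strings):
f <- text_filter(drop_punct = TRUE)
##e.g. term_stats(blogsUS, f) would then ignore punctuation tokens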
##Take a look at basic summaries: word counts, line counts, and table sizes after splitting
tsblogsUS <- text_stats(txtsblogsUS)
tsnewsUS <- text_stats(txtsnewsUS)
tstwitterUS <- text_stats(txtstwitterUS)
The word counts for blogs, news, and twitter are 37,596,462, 2,644,384, and 29,859,846 respectively. The line counts for blogs, news, and twitter are 2,348,065, 148,313, and 3,729,343 respectively. After sentence splitting, the resulting tables contain 2,349,952, 148,405, and 3,748,986 rows respectively. The comparatively low news figures are consistent with the truncated read noted above.
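These figures can be reproduced from the objects above; for example, for blogs (text_stats() reports per-row token, type, and sentence counts, which can be summed):
length(blogsUS)        ##line count
nrow(txtsblogsUS)      ##rows after sentence splitting
sum(tsblogsUS$tokens)  ##word (token) count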
##Take a look at ngrams
ngram2blogsUS <- term_stats(txtsblogsUS$text, ngrams = 2, types = TRUE, subset = !type1 %in% stopwords_en)
ngram2newsUS <- term_stats(txtsnewsUS$text, ngrams = 2, types = TRUE, subset = !type1 %in% stopwords_en)
ngram2twitterUS <- term_stats(txtstwitterUS$text, ngrams = 2, types = TRUE, subset = !type1 %in% stopwords_en)
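term_stats() returns terms sorted by decreasing count, so the most frequent 2-grams sit at the head of each table:
head(ngram2blogsUS, 10)
head(ngram2newsUS, 10)
head(ngram2twitterUS, 10)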
As shown in the 2-gram summaries, the most frequent 2-grams across the blogs, news, and twitter corpora include ‘one of’ and ‘going to’. We therefore count occurrences of the first word, ‘going’, per original line in each of the three sources and plot histograms of those counts; the words that follow ‘going’ are examined afterwards.
txtsblogsUS$goingcount <- text_count(txtsblogsUS$text, 'going')
histblogs <- aggregate(txtsblogsUS$goingcount, by=list(parent=txtsblogsUS$parent), FUN=sum)
txtsnewsUS$goingcount <- text_count(txtsnewsUS$text, 'going')
histnews <- aggregate(txtsnewsUS$goingcount, by=list(parent=txtsnewsUS$parent), FUN=sum)
txtstwitterUS$goingcount <- text_count(txtstwitterUS$text, 'going')
histtwitter <- aggregate(txtstwitterUS$goingcount, by=list(parent=txtstwitterUS$parent), FUN=sum)
hist(histblogs$x, main="'going' counts per line: blogs", xlab="count")
hist(histnews$x, main="'going' counts per line: news", xlab="count")
hist(histtwitter$x, main="'going' counts per line: twitter", xlab="count")
In most lines across all three sources, ‘going’ appears once or not at all. The blogs corpus contains the most occurrences of ‘going’. Next we will examine the most likely words following ‘going’.
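A sketch of that next step, reusing the term_stats() interface from above but keeping only 2-grams whose first type is ‘going’ (shown for blogs; ngramGoingBlogsUS is a hypothetical name for illustration):
ngramGoingBlogsUS <- term_stats(txtsblogsUS$text, ngrams = 2, types = TRUE, subset = type1 %in% 'going')
head(ngramGoingBlogsUS, 10)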