Load libraries:
library(dplyr)
library(tidytext)
library(rbenchmark)
library(ggplot2)
library(tm)
Read the files and store them in the objects t (Twitter), b (blogs) and n (news).
t0 <- "data/en_US/en_US.twitter.txt"
b0 <- "data/en_US/en_US.blogs.txt"
n0 <- "data/en_US/en_US.news.txt"
con <- file(t0, "r"); t <- readLines(con, encoding = "UTF-8", skipNul = TRUE); close(con)
con <- file(b0, "r"); b <- readLines(con, encoding = "UTF-8", skipNul = TRUE); close(con)
con <- file(n0, "r"); n <- readLines(con, encoding = "UTF-8", skipNul = TRUE); close(con)
First of all, let's look at a few sample lines from each file to see what kind of text we are working with.
set.seed(12)
df <- data.frame(cbind(sample(t,3),sample(b,3),sample(n,3)))
names(df) <- c("twitter", "blogs", "news")
df
## twitter
## 1 Hawk Cards, Get Your Hawk Cards Here
## 2 I agree. And we need to get that misogynist Ed Schultz off the air, too. Keep up the good fight
## 3 Who tryna smoke a few bowls of strong wit me
## blogs
## 1 Well its one for the money
## 2 Jake was a 15 year old consuming mass quantities of food. (his friends dressed as the same) Max had multiple costumes, one was a greaser with his new black leather jacket. Because Dad was sick and Mom is lame, Max, dressed in army garb, volunteered to cruise the neighborhood with Luke. Here they are with Jack O' Lantern faces.
## 3 and why for girls only?
## news
## 1 Up to 89 percent of Capistranos 2,200 teachers participated in the strike at its height; attendance at the 52,000-student district dipped as low as 30 percent. Hundreds of substitutes filled in for the picketing teachers, with schedules altered, afterschool sports canceled and other programs curtailed.
## 2 ST. PETERSBURG Regardless of how low the economy goes in the next two years, the new $35-million Salvador Dali Museum will be built and open in early 2011, museum director Hank Hine said Thursday.
## 3 That celebratory Kool & the Gang vibe translates across genres. The band, which last released an album in 2007, has opened for rock acts before, including Meat Loaf and Def Leppard, and last year wound up on the bill at Glastonbury, one of England's biggest rock festivals.
As we can see, all three sources contain special characters (#, ', …), misspelled words and strings that are not actually words (HAHAHA, etc.). So some cleaning must be done on the data, but first let's look at the numbers:
Get number of lines of each file:
length(t) #twitter
## [1] 2360148
length(b) #blogs
## [1] 899288
length(n) #news
## [1] 77259
A benchmark was run to find the fastest method for counting the words in each line. It uses a sample of 1,000 elements from the t object and is replicated 10 times.
countwordsbenchmark <- benchmark(
  "Count words with strsplit" = {lengths(strsplit(sample(t, 1000), "\\W+"))},
  "Count words with gregexpr and sapply" = {sapply(gregexpr("[[:alpha:]]+", sample(t, 1000)), function(x) sum(x > 0))},
  replications = 10,
  columns = c("test", "replications", "elapsed")
)
countwordsbenchmark
## test replications elapsed
## 2 Count words with gregexpr and sapply 10 0.20
## 1 Count words with strsplit 10 0.41
So, now that we have found the fastest method (of these two, gregexpr + sapply wins), let's calculate the mean number of words per line:
tw <- sapply(gregexpr("[[:alpha:]]+", t), function(x) sum(x > 0)) #words per line in twitter
bw <- sapply(gregexpr("[[:alpha:]]+", b), function(x) sum(x > 0)) #words per line in blogs
nw <- sapply(gregexpr("[[:alpha:]]+", n), function(x) sum(x > 0)) #words per line in news
mean(tw) #mean of words per line in twitter
## [1] 12.94648
mean(bw) #mean of words per line in blogs
## [1] 42.11538
mean(nw) #mean of words per line in news
## [1] 34.45644
Before getting into word counting, "cleaning" and n-grams, let's take a random sample of these files. The sample size is 25% of the number of lines in the file with the fewest lines.
nsamples <- round(min(length(t), length(b), length(n))*0.25)
t2 <- sample(t,nsamples) #create sample of text to improve performance
b2 <- sample(b,nsamples) #create sample of text to improve performance
n2 <- sample(n,nsamples) #create sample of text to improve performance
t2w <- sapply(gregexpr("[[:alpha:]]+", t2), function(x) sum(x > 0)) #words per line in twitter
b2w <- sapply(gregexpr("[[:alpha:]]+", b2), function(x) sum(x > 0)) #words per line in blogs
n2w <- sapply(gregexpr("[[:alpha:]]+", n2), function(x) sum(x > 0)) #words per line in news
mean(t2w) #mean of words per line in twitter
## [1] 12.97722
mean(b2w) #mean of words per line in blogs
## [1] 42.10381
mean(n2w) #mean of words per line in news
## [1] 34.56831
As you can see, the mean values are almost identical because the files are large enough, so we will continue with the sampled data to improve performance.
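As an optional, hedged check (these calls are not part of the original analysis), object.size() shows roughly how much memory the sampling saves:
format(object.size(t), units = "MB")  #full Twitter object in memory
format(object.size(t2), units = "MB") #sampled subset of nsamples lines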
################
Note: language detection could be added here (for example, using the frequency of stopwords…)
################
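A minimal sketch of that idea, assuming the share of English stopwords per line is a usable proxy for language (the en_stopword_share helper and the 0.1 threshold are illustrative assumptions, not a tuned method):
#share of English stopwords per line; very low values may indicate non-English text
en_stopword_share <- function(x) {
  words <- strsplit(tolower(x), "\\W+")
  sapply(words, function(w) mean(w %in% stopwords("english")))
}
#head(t2[en_stopword_share(t2) < 0.1]) #lines to inspect as possibly non-English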
Clean the data using some functions from the "tm" package:
#build function for text cleaning
textcleaning <- function(x){
  textclean <- x %>%
    removeNumbers() %>%
    removeWords(stopwords("english")) %>%
    removePunctuation(preserve_intra_word_contractions = TRUE, preserve_intra_word_dashes = TRUE) %>%
    stripWhitespace()
  return(textclean)
}
tc <- textcleaning(t2)
bc <- textcleaning(b2)
nc <- textcleaning(n2)
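As a quick, hedged illustration of what this cleaning does, it can be run on a made-up sentence (not taken from the corpus); numbers, English stopwords and punctuation should be stripped, although the exact output depends on capitalization:
textcleaning("I have 3 dogs, and they are the best!")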
Let's create the tibbles (similar to a data.frame), where each row contains one record (piece of text):
t3 <- tibble(line = 1:length(tc), words = tc)
b3 <- tibble(line = 1:length(bc), words = bc)
n3 <- tibble(line = 1:length(nc), words = nc)
#join all three tibbles in one object
tbn <- bind_rows("twitter" = t3, "blogs" = b3, "news" = n3, .id = "group")
Tokenize the words and create 1-, 2- and 3-grams:
data(stop_words)
tbn1 <- tbn %>% unnest_tokens(word, words, token = "ngrams", n = 1)
tbn1 <- tbn1 %>% anti_join(stop_words) #remove most common english words in 1 n-gram
tbn2 <- tbn %>% unnest_tokens(word, words, token = "ngrams", n = 2)
tbn3 <- tbn %>% unnest_tokens(word, words, token = "ngrams", n = 3)
sample_n(tbn1, 5)
## # A tibble: 5 x 3
## group line word
## <chr> <int> <chr>
## 1 blogs 15952 bring
## 2 news 14599 performed
## 3 blogs 18950 spreading
## 4 news 11430 planned
## 5 blogs 9529 fatal
sample_n(tbn2, 5)
## # A tibble: 5 x 3
## group line word
## <chr> <int> <chr>
## 1 news 4345 marigolds scarlet
## 2 news 7954 three areas
## 3 news 13788 steve buckstein
## 4 blogs 4023 everything including
## 5 news 7957 even hitting
sample_n(tbn3, 5)
## # A tibble: 5 x 3
## group line word
## <chr> <int> <chr>
## 1 twitter 2152 coolest swolest dude
## 2 blogs 12119 choose stay single
## 3 blogs 7312 time canadian advertisers
## 4 twitter 11822 talents south beach
## 5 blogs 14097 different aspects persuasive
Calculate the count (n), the group total and the term frequency (tf) for each n-gram:
totaltf <- function (x){
  ntf <- x %>%
    count(group, word, sort = TRUE)
  total <- ntf %>%
    group_by(group) %>%
    summarize(total = sum(n))
  ntf <- left_join(ntf, total)
  ntf <- mutate(ntf, tf = n/total)
  return(ntf)
}
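Before applying it to the corpora, a hedged toy illustration (made-up data) shows what totaltf computes:
toy <- tibble(group = c("a", "a", "a", "b"), word = c("x", "x", "y", "z"))
totaltf(toy)
#for group "a" the total is 3, so tf of "x" is 2/3 and tf of "y" is 1/3; for group "b", tf of "z" is 1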
TBN1 <- totaltf(tbn1)
TBN2 <- totaltf(tbn2)
TBN3 <- totaltf(tbn3)
#join all three tibbles in one object
TBN <- bind_rows("ngram1" = TBN1, "ngram2" = TBN2, "ngram3" = TBN3, .id = "ngram")
TBN <- TBN[!is.na(TBN$word), ] #remove NA values produced when a line has fewer words than the n-gram size
sample_n(TBN, 10)
## # A tibble: 10 x 6
## ngram group word n total tf
## <chr> <chr> <chr> <int> <int> <dbl>
## 1 ngram2 blogs came large 2 441852 0.00000453
## 2 ngram2 blogs when bad 1 441852 0.00000226
## 3 ngram2 news trip disaster 1 386426 0.00000259
## 4 ngram2 news drop average 1 386426 0.00000259
## 5 ngram2 blogs said previous 1 441852 0.00000226
## 6 ngram2 blogs folding plate 1 441852 0.00000226
## 7 ngram3 news analysts predict unemployment 1 367751 0.00000272
## 8 ngram2 twitter become really 2 136048 0.0000147
## 9 ngram3 blogs avengers even enemies 1 424162 0.00000236
## 10 ngram2 twitter out soccer 1 136048 0.00000735
Select just the 10 highest tf values within each "ngram" and "group":
TBNx <- TBN %>% group_by(group, ngram) %>% top_n(10, tf)
TBNx[TBNx$ngram == "ngram3",]
## # A tibble: 30 x 6
## # Groups: group, ngram [3]
## ngram group word n total tf
## <chr> <chr> <chr> <int> <int> <dbl>
## 1 ngram3 blogs i know i 74 424162 0.000174
## 2 ngram3 blogs i think i 64 424162 0.000151
## 3 ngram3 blogs i dont know 45 424162 0.000106
## 4 ngram3 blogs i feel like 42 424162 0.0000990
## 5 ngram3 blogs i thought i 37 424162 0.0000872
## 6 ngram3 twitter i think i 35 118403 0.000296
## 7 ngram3 blogs i dont think 31 424162 0.0000731
## 8 ngram3 news the year old 28 367751 0.0000761
## 9 ngram3 news new york city 27 367751 0.0000734
## 10 ngram3 news president barack obama 27 367751 0.0000734
## # ... with 20 more rows
Plot the most common words in each n-gram and group:
g <- ggplot(TBNx, aes(x = word, y = tf)) +
  geom_bar(stat = "identity", fill = "red", width = 0.8, alpha = 0.95) +
  labs(title = "1, 2 and 3 n-grams by source", x = "Words", y = "Term frequency") +
  facet_wrap(~ ngram + group, ncol = 3, scales = "free") +
  theme(text = element_text(size = 8),
        axis.text.x = element_text(angle = -90, hjust = 0),
        axis.text.y = element_text(angle = -90, hjust = 0))
g
Create the word cloud plots:
library(wordcloud)
par(mfrow = c(1, 3))
wordcloud(TBN1$word, TBN1$tf, scale = c(3, 1), max.words = 50, random.order = FALSE,
          rot.per = 0.3, fixed.asp = TRUE, use.r.layout = FALSE, colors = brewer.pal(8, "Blues"))
wordcloud(TBN2$word, TBN2$tf, scale = c(3, 1), max.words = 50, random.order = FALSE,
          rot.per = 0.3, fixed.asp = TRUE, use.r.layout = FALSE, colors = brewer.pal(5, "Reds"))
wordcloud(TBN3$word, TBN3$tf, scale = c(3, 1), max.words = 50, random.order = FALSE,
          rot.per = 0.3, fixed.asp = TRUE, use.r.layout = FALSE, colors = brewer.pal(3, "Dark2"))
Word clouds for the 1-, 2- and 3-grams