Load libraries:
library(dplyr)
library(tidytext)
library(rbenchmark)
library(ggplot2)
library(tm)
Read the files and store them in the objects t (Twitter), b (blogs) and n (news).
t0 <- "data/en_US/en_US.twitter.txt"
b0 <- "data/en_US/en_US.blogs.txt"
n0 <- "data/en_US/en_US.news.txt"
con <- file(t0, "r"); t <- readLines(con, encoding = "UTF-8", skipNul = TRUE); close(con)
con <- file(b0, "r"); b <- readLines(con, encoding = "UTF-8", skipNul = TRUE); close(con)
con <- file(n0, "r"); n <- readLines(con, encoding = "UTF-8", skipNul = TRUE); close(con)
First of all, let's look at a few sample lines from each file to see what kind of text we are working with.
set.seed(12)
df <- data.frame(cbind(sample(t,3),sample(b,3),sample(n,3)))
names(df) <- c("twitter", "blogs", "news")
df
## twitter
## 1 Hawk Cards, Get Your Hawk Cards Here
## 2 I agree. And we need to get that misogynist Ed Schultz off the air, too. Keep up the good fight
## 3 Who tryna smoke a few bowls of strong wit me
## blogs
## 1 Well its one for the money
## 2 Jake was a 15 year old consuming mass quantities of food. (his friends dressed as the same) Max had multiple costumes, one was a greaser with his new black leather jacket. Because Dad was sick and Mom is lame, Max, dressed in army garb, volunteered to cruise the neighborhood with Luke. Here they are with Jack O' Lantern faces.
## 3 and why for girls only?
## news
## 1 Up to 89 percent of Capistranos 2,200 teachers participated in the strike at its height; attendance at the 52,000-student district dipped as low as 30 percent. Hundreds of substitutes filled in for the picketing teachers, with schedules altered, afterschool sports canceled and other programs curtailed.
## 2 ST. PETERSBURG Regardless of how low the economy goes in the next two years, the new $35-million Salvador Dali Museum will be built and open in early 2011, museum director Hank Hine said Thursday.
## 3 That celebratory Kool & the Gang vibe translates across genres. The band, which last released an album in 2007, has opened for rock acts before, including Meat Loaf and Def Leppard, and last year wound up on the bill at Glastonbury, one of England's biggest rock festivals.
As we can see, all three sources contain special characters (#, ', …), misspelled words and strings that are not actually words (HAHAHA, etc.). So some cleaning must be done on the data, but first let's look at the numbers:
Get number of lines of each file:
length(t) #twitter
## [1] 2360148
length(b) #blogs
## [1] 899288
length(n) #news
## [1] 77259
A benchmark was run to find the fastest method for counting the words in each line. It uses a sample of 1,000 elements from the t object and is replicated 10 times.
countwordsbenchmark <- benchmark(
  "Count words with strsplit" = {lengths(strsplit(sample(t, 1000), "\\W+"))},
  "Count words with gregexpr and sapply" = {sapply(gregexpr("[[:alpha:]]+", sample(t, 1000)), function(x) sum(x > 0))},
  replications = 10,
  columns = c("test", "replications", "elapsed")
)
countwordsbenchmark
## test replications elapsed
## 2 Count words with gregexpr and sapply 10 0.20
## 1 Count words with strsplit 10 0.41
So, now that we have found the fastest method (of these two, gregexpr + sapply wins), let's calculate the mean number of words per line:
tw <- sapply(gregexpr("[[:alpha:]]+", t), function(x) sum(x > 0)) #words per line in twitter
bw <- sapply(gregexpr("[[:alpha:]]+", b), function(x) sum(x > 0)) #words per line in blogs
nw <- sapply(gregexpr("[[:alpha:]]+", n), function(x) sum(x > 0)) #words per line in news
mean(tw) #mean of words per line in twitter
## [1] 12.94648
mean(bw) #mean of words per line in blogs
## [1] 42.11538
mean(nw) #mean of words per line in news
## [1] 34.45644
Before getting into word counting, "cleaning" and n-grams, let's take a random sample of these files. The sample size is 25% of the number of lines in the file with the fewest lines.
nsamples <- round(min(length(t), length(b), length(n))*0.25)
t2 <- sample(t,nsamples) #create sample of text to improve performance
b2 <- sample(b,nsamples) #create sample of text to improve performance
n2 <- sample(n,nsamples) #create sample of text to improve performance
t2w <- sapply(gregexpr("[[:alpha:]]+", t2), function(x) sum(x > 0)) #words per line in twitter
b2w <- sapply(gregexpr("[[:alpha:]]+", b2), function(x) sum(x > 0)) #words per line in blogs
n2w <- sapply(gregexpr("[[:alpha:]]+", n2), function(x) sum(x > 0)) #words per line in news
mean(t2w) #mean of words per line in twitter
## [1] 12.97722
mean(b2w) #mean of words per line in blogs
## [1] 42.10381
mean(n2w) #mean of words per line in news
## [1] 34.56831
As you can see, the mean values are almost identical because the files are large enough, so we will continue with the sampled data to improve performance.
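As an optional, hedged check (these calls are not part of the original analysis), object.size() shows roughly how much memory the sampling saves:
format(object.size(t), units = "MB")  #full Twitter object in memory
format(object.size(t2), units = "MB") #sampled subset of nsamples lines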
################
Note: language detection could be added here (for example, using the frequency of stopwords…)
################
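A minimal sketch of that idea, assuming the share of English stopwords per line is a usable proxy for language (the en_stopword_share helper and the 0.1 threshold are illustrative assumptions, not a tuned method):
#share of English stopwords per line; very low values may indicate non-English text
en_stopword_share <- function(x) {
  words <- strsplit(tolower(x), "\\W+")
  sapply(words, function(w) mean(w %in% stopwords("english")))
}
#head(t2[en_stopword_share(t2) < 0.1]) #lines to inspect as possibly non-English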
Clean the data using some functions from the "tm" package:
#build function for text cleaning
textcleaning <- function(x){
  textclean <- x %>%
    removeNumbers() %>%
    removeWords(stopwords("english")) %>%
    removePunctuation(preserve_intra_word_contractions = TRUE, preserve_intra_word_dashes = TRUE) %>%
    stripWhitespace()
  return(textclean)
}
tc <- textcleaning(t2)
bc <- textcleaning(b2)
nc <- textcleaning(n2)
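As a quick, hedged illustration of what this cleaning does, it can be run on a made-up sentence (not taken from the corpus); numbers, English stopwords and punctuation should be stripped, although the exact output depends on capitalization:
textcleaning("I have 3 dogs, and they are the best!")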
Let's create the tibbles (similar to a data.frame), where each row contains one record (piece of text):
t3 <- tibble(line = 1:length(tc), words = tc)
b3 <- tibble(line = 1:length(bc), words = bc)
n3 <- tibble(line = 1:length(nc), words = nc)
#join all three tibbles in one object
tbn <- bind_rows("twitter" = t3, "blogs" = b3, "news" = n3, .id = "group")
Tokenize the words and create 1-, 2- and 3-grams:
data(stop_words)
tbn1 <- tbn %>% unnest_tokens(word, words, token = "ngrams", n = 1)
tbn1 <- tbn1 %>% anti_join(stop_words) #remove most common english words in 1 n-gram
tbn2 <- tbn %>% unnest_tokens(word, words, token = "ngrams", n = 2)
tbn3 <- tbn %>% unnest_tokens(word, words, token = "ngrams", n = 3)
sample_n(tbn1, 5)
## # A tibble: 5 x 3
## group line word
## <chr> <int> <chr>
## 1 blogs 15952 bring
## 2 news 14599 performed
## 3 blogs 18950 spreading
## 4 news 11430 planned
## 5 blogs 9529 fatal
sample_n(tbn2, 5)
## # A tibble: 5 x 3
## group line word
## <chr> <int> <chr>
## 1 news 4345 marigolds scarlet
## 2 news 7954 three areas
## 3 news 13788 steve buckstein
## 4 blogs 4023 everything including
## 5 news 7957 even hitting
sample_n(tbn3, 5)
## # A tibble: 5 x 3
## group line word
## <chr> <int> <chr>
## 1 twitter 2152 coolest swolest dude
## 2 blogs 12119 choose stay single
## 3 blogs 7312 time canadian advertisers
## 4 twitter 11822 talents south beach
## 5 blogs 14097 different aspects persuasive
Calculate the count (n), the group total and the term frequency (tf) for each n-gram:
totaltf <- function (x){
  ntf <- x %>%
    count(group, word, sort = TRUE)
  total <- ntf %>%
    group_by(group) %>%
    summarize(total = sum(n))
  ntf <- left_join(ntf, total)
  ntf <- mutate(ntf, tf = n/total)
  return(ntf)
}
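Before applying it to the corpora, a hedged toy illustration (made-up data) shows what totaltf computes:
toy <- tibble(group = c("a", "a", "a", "b"), word = c("x", "x", "y", "z"))
totaltf(toy)
#for group "a" the total is 3, so tf of "x" is 2/3 and tf of "y" is 1/3; for group "b", tf of "z" is 1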
TBN1 <- totaltf(tbn1)
TBN2 <- totaltf(tbn2)
TBN3 <- totaltf(tbn3)
#join all three tibbles in one object
TBN <- bind_rows("ngram1" = TBN1, "ngram2" = TBN2, "ngram3" = TBN3, .id = "ngram")
TBN <- TBN[!is.na(TBN$word), ] #remove NA values produced when a line has fewer words than the n-gram size
sample_n(TBN, 10)
## # A tibble: 10 x 6
## ngram group word n total tf
## <chr> <chr> <chr> <int> <int> <dbl>
## 1 ngram2 blogs came large 2 441852 0.00000453
## 2 ngram2 blogs when bad 1 441852 0.00000226
## 3 ngram2 news trip disaster 1 386426 0.00000259
## 4 ngram2 news drop average 1 386426 0.00000259
## 5 ngram2 blogs said previous 1 441852 0.00000226
## 6 ngram2 blogs folding plate 1 441852 0.00000226
## 7 ngram3 news analysts predict unemployment 1 367751 0.00000272
## 8 ngram2 twitter become really 2 136048 0.0000147
## 9 ngram3 blogs avengers even enemies 1 424162 0.00000236
## 10 ngram2 twitter out soccer 1 136048 0.00000735
Select just the 10 highest tf values within each "ngram" and "group":
TBNx <- TBN %>% group_by(group, ngram) %>% top_n(10, tf)
TBNx[TBNx$ngram == "ngram3",]
## # A tibble: 30 x 6
## # Groups: group, ngram [3]
## ngram group word n total tf
## <chr> <chr> <chr> <int> <int> <dbl>
## 1 ngram3 blogs i know i 74 424162 0.000174
## 2 ngram3 blogs i think i 64 424162 0.000151
## 3 ngram3 blogs i dont know 45 424162 0.000106
## 4 ngram3 blogs i feel like 42 424162 0.0000990
## 5 ngram3 blogs i thought i 37 424162 0.0000872
## 6 ngram3 twitter i think i 35 118403 0.000296
## 7 ngram3 blogs i dont think 31 424162 0.0000731
## 8 ngram3 news the year old 28 367751 0.0000761
## 9 ngram3 news new york city 27 367751 0.0000734
## 10 ngram3 news president barack obama 27 367751 0.0000734
## # ... with 20 more rows
Plot the most common words in each n-gram and group:
g <- ggplot(TBNx, aes(x = word, y = tf)) +
  geom_bar(stat = "identity", fill = "red", width = 0.8, alpha = 0.95) +
  labs(title = "1, 2 and 3 n-grams by source", x = "Words", y = "Term frequency") +
  facet_wrap(~ ngram + group, ncol = 3, scales = "free") +
  theme(text = element_text(size = 8),
        axis.text.x = element_text(angle = -90, hjust = 0),
        axis.text.y = element_text(angle = -90, hjust = 0))
g
Create the word cloud plots:
library(wordcloud)
par(mfrow = c(1, 3))
wordcloud(TBN1$word, TBN1$tf, scale = c(3, 1), max.words = 50, random.order = FALSE,
          rot.per = 0.3, fixed.asp = TRUE, use.r.layout = FALSE, colors = brewer.pal(8, "Blues"))
wordcloud(TBN2$word, TBN2$tf, scale = c(3, 1), max.words = 50, random.order = FALSE,
          rot.per = 0.3, fixed.asp = TRUE, use.r.layout = FALSE, colors = brewer.pal(5, "Reds"))
wordcloud(TBN3$word, TBN3$tf, scale = c(3, 1), max.words = 50, random.order = FALSE,
          rot.per = 0.3, fixed.asp = TRUE, use.r.layout = FALSE, colors = brewer.pal(3, "Dark2"))
Word clouds for the 1-, 2- and 3-grams