Summary

We analyze three texts from different sources: Twitter, blogs, and news. We will see the large difference in size between the Twitter, blog, and news files, and that the wide range of possible sources coming from the internet is relevant compared with newspapers. We finish with a quick analysis of the different texts to see which words are used most often, and conclude with how we plan to proceed for the app.

Analysis

We count the number of lines in each text, the total number of characters, and the mean number of characters per line. We see that Twitter and blogs contain many more lines than the news file, and that Twitter enforces a limit on the number of characters allowed per message. Blogs have the most characters per line, perhaps because bloggers are not professional writers or reporters: they have no specific training in writing, so they tend to put more into a single line instead of being concise.

file1<-  file("C:\\PARTAGE\\aurelien\\en_US\\en_US.twitter.txt","rb")
file2<-  file("C:\\PARTAGE\\aurelien\\en_US\\en_US.blogs.txt","rb")
file3<-  file("C:\\PARTAGE\\aurelien\\en_US\\en_US.news.txt","rb")

twit<-readLines(file1)
## Warning in readLines(file1): line 167155 contains a null character
## Warning in readLines(file1): line 268547 contains a null character
## Warning in readLines(file1): line 1274086 contains a null character
## Warning in readLines(file1): line 1759032 contains a null character
blog<-readLines(file2)
news<-readLines(file3)

close(file1)
close(file2)
close(file3)

summry <- data.frame(
  Lines = c(length(twit), length(blog), length(news)),
  Characters = c(sum(nchar(twit)), sum(nchar(blog)), sum(nchar(news))),
  Mean_Char_Per_Line = c(sum(nchar(twit)) / length(twit),
                         sum(nchar(blog)) / length(blog),
                         sum(nchar(news)) / length(news)))
rownames(summry) <- c("Twit", "Blog", "News")

summry
##        Lines Characters Mean_Char_Per_Line
## Twit 2360148  162384825           68.80281
## Blog  899288  208361438          231.69601
## News 1010242  203791405          201.72533
library(ggplot2)
graph <- ggplot(summry, aes(x = rownames(summry), y = Lines))
graph + geom_col() + ggtitle("Number of lines by source") + xlab("Text source") + ylab("Count")

Exploration of texts

We use a few packages to process the documents. First, a cleaning function prepares the text: it converts everything to lower case so that all words are treated identically, expands contractions and abbreviations to their full forms, and removes punctuation, numbers, and single-letter words.

library(quanteda);library(RColorBrewer)
## quanteda version 0.99.22
## Using 7 of 8 threads for parallel computing
## 
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
## 
##     View
library(wordcloud)
library(stringr)


Clean_String <- function(string){

  # Put words to lower case
  temp <- tolower(string)
  # Remove some determiners (named vector: pattern = replacement)
  deter <- c(" the " = " ", " of " = " ", " a " = " ", " an " = " ")
  temp <- stringr::str_replace_all(temp, deter)
  # Expand contractions and short forms to their full forms
  verb <- c("'ll " = " will ", "don't" = "do not", "isn't" = "is not",
            "aren't" = "are not", "wasn't" = "was not", "weren't" = "were not",
            "doesn't" = "does not", "won't" = "will not",
            "'s " = " is ", "'re " = " are ", " u " = " you ")
  temp <- stringr::str_replace_all(temp, verb)
  # Remove everything that is not a letter or whitespace
  temp <- stringr::str_replace_all(temp, "[^a-zA-Z\\s]", " ")
  temp <- stringr::str_replace_all(temp, "[[:punct:] ]+", " ")
  # Shrink runs of whitespace down to a single space
  temp <- stringr::str_replace_all(temp, "[\\s]+", " ")
  # Remove single letters left over after the substitutions
  temp <- stringr::str_replace_all(temp, "\\b[a-z]\\b", " ")
  temp <- stringr::str_replace_all(temp, "[\\s]+", " ")
  # Split into individual words
  temp <- stringr::str_split(temp, " ")[[1]]
  # Get rid of empty strings if necessary
  indexes <- which(temp == "")
  if(length(indexes) > 0){
    temp <- temp[-indexes]
  }
  return(temp)
}
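
As a quick sanity check of the cleaning function, here is what it returns for a small made-up sentence (the expected output in the comment is an assumption based on the rules above, not output from the full data):

Clean_String("The cats aren't on a roof, they're in the garden!")
# Roughly: "the" "cats" "are" "not" "on" "roof" "they" "are" "in" "garden"
# (the leading "the" is kept because the determiner patterns require surrounding spaces)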

We divide each document into two groups because we are limited by the computer's processing power. We use quanteda to build the document-term matrix, which applies a second round of processing (removing common English stop words and any remaining punctuation), and from it we obtain the relevant terms.
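
The splitting step itself does not appear in the code below; a minimal sketch of how one source (here the Twitter lines, still a character vector at this point) could be cut into two halves, with twit_part1 and twit_part2 as hypothetical names:

# Hypothetical split of the Twitter lines into two halves to limit memory use
half <- ceiling(length(twit) / 2)
twit_part1 <- twit[1:half]
twit_part2 <- twit[(half + 1):length(twit)]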

#We paste the lines together to obtain one full text per source
twit<-paste(twit,collapse = " ")
blog<-paste(blog,collapse = " ")
news<-paste(news,collapse = " ")


#We clean the texts with the function above
twit<-Clean_String(twit)
blog<-Clean_String(blog)
news<-Clean_String(news)

#We paste the words back into one full text again, because the cleaning function returns a vector of words
twit<-paste(twit,collapse = " ")
blog<-paste(blog,collapse = " ")
news<-paste(news,collapse = " ")

We use the dfm function from the quanteda package directly (with its own built-in processing) to build these matrices. Processing the text twice in this way saves computation and makes sure we are not blocked by the computer's limited power when building the matrices. Afterwards, we merge the partial dfms to get a single dfm per document. The table below shows the 20 most frequent terms in each document.
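
The merging step is not shown in the code below; a minimal sketch of how two partial dfms of one source could be recombined, assuming dtm_twit1 and dtm_twit2 are hypothetical dfms built from the two halves with the same dfm() call as below:

# Hypothetical recombination of two partial document-term matrices
dtm_twit_full <- rbind(dtm_twit1, dtm_twit2)  # stacks the two halves as two documents
topfeatures(dtm_twit_full, 20)                # sums the counts over all documents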

#we build the document term matrix
dtm_twit<-dfm(twit,remove=stopwords("english"),stem=TRUE,remove_punct=TRUE)
dtm_blog<-dfm(blog,remove=stopwords("english"),stem=TRUE,remove_punct=TRUE)
dtm_news<-dfm(news,remove=stopwords("english"),stem=TRUE,remove_punct=TRUE)


relevant_term<-data.frame(twitter=names(topfeatures(dtm_twit,20)),
                          blog=names(topfeatures(dtm_blog,20)),
                          news=names(topfeatures(dtm_news,20)))
relevant_term
##    twitter blog news
## 1       nd   nd   nd
## 2      tht  tht  tht
## 3        t    s  sid
## 4      hve   ws    s
## 5       re    t   ws
## 6     just     â    t
## 7      get  hve   re
## 8       go   re     â
## 9     like   ll  hve
## 10     wht  one   hs
## 11    love bout    n
## 12      ws like  yer
## 13      ll  wht bout
## 14    good    n  one
## 15    bout   hd   hd
## 16      rt time  new
## 17   thnks just   go
## 18      cn   cn time
## 19      dy  get   ll
## 20       s   go like

Document-Term Matrix and conclusion

The word clouds show the 100 most used words for each source. We can clearly see that the vocabulary, i.e. the lexical field, differs depending on the source. For the n-gram models, I think treating prediction as an autoregressive problem (predicting the next word from the previous ones) could be a good way to build a solid model; a first sketch is given after the word clouds.

set.seed(100)
textplot_wordcloud(dtm_twit, min.freq = 6,max.words=100, random.order = FALSE,
                   rot.per = .25, 
                   colors = RColorBrewer::brewer.pal(8,"Dark2"))

textplot_wordcloud(dtm_blog, min.freq = 6,max.words=100, random.order = FALSE,
                   rot.per = .25, 
                   colors = RColorBrewer::brewer.pal(8,"Dark2"))

textplot_wordcloud(dtm_news, min.freq = 6,max.words=100, random.order = FALSE,
                   rot.per = .25, 
                   colors = RColorBrewer::brewer.pal(8,"Dark2"))
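
As a first, rough sketch of the n-gram idea (not the final model), bigram counts from the cleaned text could already be used to rank candidate next words; the object names toks and dtm_bigram below are illustrative:

# Sketch: bigram counts from the cleaned Twitter text as a basis for next-word prediction
toks <- tokens(twit)
dtm_bigram <- dfm(tokens_ngrams(toks, n = 2, concatenator = " "))
topfeatures(dtm_bigram, 10)  # most frequent word pairs: for a given first word,
                             # the most frequent continuation is the predicted next word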