Data description

The dataset consists of three files: “en_US.blogs.txt”, “en_US.news.txt”, and “en_US.twitter.txt”. The data comes from HC Corpora and can be downloaded from: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip. The English-language portion of the corpus was used.
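
The download step is not shown in the report; a minimal sketch, assuming the zip extracts to a final/en_US/ folder (the destination file name is an assumption), could look like this:

# Sketch: fetching and unpacking the data (destination name is an assumption)
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("Coursera-SwiftKey.zip")) {
    download.file(url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
}
unzip("Coursera-SwiftKey.zip")  # the English files end up under final/en_US/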

File descriptions

  • The news file has 77,259 lines and is 19.2 MB.
  • The blogs file has 899,288 lines and is 248.5 MB.
  • The Twitter file has 2,360,148 lines and is 301.4 MB.
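
These figures might have been computed along the following lines (a sketch; the file path is an assumption based on the zip layout):

# Sketch: line count and size for one file; repeat for the other two
path <- "final/en_US/en_US.news.txt"
file.size(path) / 1024^2                 # file size in MB
length(readLines(path, skipNul = TRUE))  # number of lines (skipNul avoids embedded NULs)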

Loading Data

library(tm)  # text-mining framework used throughout
# 'dir' is the folder containing the three en_US text files
corpus <- VCorpus(DirSource(dir, encoding = "UTF-8"), readerControl = list(language = "en"))

A 10% sample of each file was taken.

porciento <- 0.1  # sampling fraction: 10% of the dataset
set.seed(30)      # for reproducibility
# corpus[[1]] = blogs, corpus[[2]] = news, corpus[[3]] = twitter (DirSource reads files alphabetically)
corpus[[1]]$content <- sample(corpus[[1]]$content, round(length(corpus[[1]]$content) * porciento))
corpus[[2]]$content <- sample(corpus[[2]]$content, round(length(corpus[[2]]$content) * porciento))
corpus[[3]]$content <- sample(corpus[[3]]$content, round(length(corpus[[3]]$content) * porciento))

Preprocessing the data

A data-cleaning pass was applied before mining the files:
* Extra whitespace was removed.
* URLs were removed.
* Special characters such as @, ´, ’, and ` were removed.
* Non-graphical characters such as emoticons were removed.
* Punctuation was removed.
* English stopwords were removed.
* Numbers were removed.
* All text was converted to lowercase.

# Removing extra whitespace
corpus <- tm_map(corpus, content_transformer(stripWhitespace))
# Helper: replace every match of a pattern with a space
changeSpecialSen <- content_transformer(function(x, pat) gsub(pat, " ", x))
# Removing URLs
corpus <- tm_map(corpus, changeSpecialSen, " ?(f|ht)(tp)(s?)(://)(.*)[.|/](.*)")
# Removing @
corpus <- tm_map(corpus, changeSpecialSen, "@")
# Removing apostrophes and accent marks ("'", "´", "`")
corpus <- tm_map(corpus, changeSpecialSen, "'|´|`")
# Replacing non-graphical characters (emoticons, control characters) with spaces
corpus <- tm_map(corpus, changeSpecialSen, "[^[:graph:]]")
# Removing punctuation
corpus <- tm_map(corpus, content_transformer(removePunctuation))
# Removing English stopwords
corpus <- tm_map(corpus, removeWords, stopwords("english"))
# Removing numbers
corpus <- tm_map(corpus, content_transformer(removeNumbers))
# Converting all text to lowercase
corpus <- tm_map(corpus, content_transformer(tolower))

After applying the transformations described above, a term-frequency analysis of each file was made. Note that stopwords are removed before the text is lowercased, so capitalized stopwords such as “The” pass through the filter and only become “the” afterwards; this is why “the” can still top the frequency counts below.

# Creating a document-term matrix for each file
dtm_news    <- DocumentTermMatrix(VCorpus(VectorSource(corpus[[2]]$content)))
dtm_blogs   <- DocumentTermMatrix(VCorpus(VectorSource(corpus[[1]]$content)))
dtm_twitter <- DocumentTermMatrix(VCorpus(VectorSource(corpus[[3]]$content)))
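
The plots below use frequency tables named news_df, blogs_df, and twitter_df, but the report does not show how they are built. A minimal sketch, assuming the column names words and freq that the plotting code expects, is:

# Sketch: turning each document-term matrix into a sorted frequency table
term_freq <- function(dtm) {
    freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
    data.frame(words = names(freq), freq = freq,
               row.names = NULL, stringsAsFactors = FALSE)
}
news_df    <- term_freq(dtm_news)
blogs_df   <- term_freq(dtm_blogs)
twitter_df <- term_freq(dtm_twitter)
head(news_df, 1)  # top row: the most frequent word and its count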

News.

The most frequent word is “the”, which appears 1,953 times.

A histogram of the fifty most frequent words and a word cloud were made for the news file:

Histogram

library(ggplot2)
# Plotting the fifty most frequent words
ggplot(head(news_df, 50), aes(reorder(words, -freq), freq)) +
    geom_bar(stat = "identity") +
    labs(title = "Fifty most frequent words in the news file") +
    xlab("Words") +
    ylab("Frequency") +
    theme(axis.text.x = element_text(angle = 90))

Word Cloud

library(wordcloud)
wordcloud(news_df$words, news_df$freq, min.freq = 50, random.color = FALSE, colors = 1:10, max.words = 130)

Blogs.

The most frequent word is “the”, which appears 19,197 times.

A histogram of the fifty most frequent words and a word cloud were made for the blogs file:

Histogram

# Plotting the fifty most frequent words
ggplot(head(blogs_df, 50), aes(reorder(words, -freq), freq)) +
    geom_bar(stat = "identity") +
    labs(title = "Fifty most frequent words in the blogs file") +
    xlab("Words") +
    ylab("Frequency") +
    theme(axis.text.x = element_text(angle = 90))

Word Cloud

wordcloud(blogs_df$words, blogs_df$freq, min.freq = 50, random.color = FALSE, colors = 1:10, max.words = 130)

Twitter.

The most frequent word is “just”, which appears 14,883 times.

A histogram of the fifty most frequent words and a word cloud were made for the Twitter file:

Histogram

# Plotting the fifty most frequent words
ggplot(head(twitter_df, 50), aes(reorder(words, -freq), freq)) +
    geom_bar(stat = "identity") +
    labs(title = "Fifty most frequent words in the Twitter file") +
    xlab("Words") +
    ylab("Frequency") +
    theme(axis.text.x = element_text(angle = 90))

Word Cloud

wordcloud(twitter_df$words, twitter_df$freq, min.freq = 50, random.color = FALSE, colors = 1:10)

As can be seen, the word “the” is the most frequent in the blogs and news texts, while other words, such as “one”, appear in all three files.