Data description

The dataset consists of three files: “en_US.blogs.txt”, “en_US.news.txt”, and “en_US.twitter.txt”. The data comes from HC Corpora and can be downloaded from: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip. The English-language portion of the corpus was used.
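
The download step is not shown in the report; a minimal sketch, assuming the zip extracts to a final/en_US/ folder (the destination file name is an assumption), could look like this:

# Sketch: fetching and unpacking the data (destination name is an assumption)
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("Coursera-SwiftKey.zip")) {
    download.file(url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
}
unzip("Coursera-SwiftKey.zip")  # the English files end up under final/en_US/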

File descriptions

  • The news file has 77,259 lines and is 19.2 MB.
  • The blogs file has 899,288 lines and is 248.5 MB.
  • The Twitter file has 2,360,148 lines and is 301.4 MB.
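
These figures might have been computed along the following lines (a sketch; the file path is an assumption based on the zip layout):

# Sketch: line count and size for one file; repeat for the other two
path <- "final/en_US/en_US.news.txt"
file.size(path) / 1024^2                 # file size in MB
length(readLines(path, skipNul = TRUE))  # number of lines (skipNul avoids embedded NULs)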

Loading Data

library(tm)  # text-mining framework used throughout
# 'dir' is the folder containing the three en_US text files
corpus <- VCorpus(DirSource(dir, encoding = "UTF-8"), readerControl = list(language = "en"))

A 10% sample of each file was taken.

porciento <- 0.1  # sampling fraction: 10% of the dataset
set.seed(30)      # for reproducibility
# corpus[[1]] = blogs, corpus[[2]] = news, corpus[[3]] = twitter (DirSource reads files alphabetically)
corpus[[1]]$content <- sample(corpus[[1]]$content, round(length(corpus[[1]]$content) * porciento))
corpus[[2]]$content <- sample(corpus[[2]]$content, round(length(corpus[[2]]$content) * porciento))
corpus[[3]]$content <- sample(corpus[[3]]$content, round(length(corpus[[3]]$content) * porciento))

Preprocessing the data

A data-cleaning pass was applied before mining the files:
* Extra whitespace was removed.
* URLs were removed.
* Special characters such as @, ´, ’, and ` were removed.
* Non-graphical characters such as emoticons were removed.
* Punctuation was removed.
* English stopwords were removed.
* Numbers were removed.
* All text was converted to lowercase.

# Removing extra whitespace
corpus <- tm_map(corpus, content_transformer(stripWhitespace))
# Helper: replace every match of a pattern with a space
changeSpecialSen <- content_transformer(function(x, pat) gsub(pat, " ", x))
# Removing URLs
corpus <- tm_map(corpus, changeSpecialSen, " ?(f|ht)(tp)(s?)(://)(.*)[.|/](.*)")
# Removing @
corpus <- tm_map(corpus, changeSpecialSen, "@")
# Removing apostrophes and accent marks ("'", "´", "`")
corpus <- tm_map(corpus, changeSpecialSen, "'|´|`")
# Replacing non-graphical characters (emoticons, control characters) with spaces
corpus <- tm_map(corpus, changeSpecialSen, "[^[:graph:]]")
# Removing punctuation
corpus <- tm_map(corpus, content_transformer(removePunctuation))
# Removing English stopwords
corpus <- tm_map(corpus, removeWords, stopwords("english"))
# Removing numbers
corpus <- tm_map(corpus, content_transformer(removeNumbers))
# Converting all text to lowercase
corpus <- tm_map(corpus, content_transformer(tolower))

After applying the transformations described above, a term-frequency analysis of each file was made. Note that stopwords are removed before the text is lowercased, so capitalized stopwords such as “The” pass through the filter and only become “the” afterwards; this is why “the” can still top the frequency counts below.

# Creating a document-term matrix for each file
dtm_news    <- DocumentTermMatrix(VCorpus(VectorSource(corpus[[2]]$content)))
dtm_blogs   <- DocumentTermMatrix(VCorpus(VectorSource(corpus[[1]]$content)))
dtm_twitter <- DocumentTermMatrix(VCorpus(VectorSource(corpus[[3]]$content)))
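
The plots below use frequency tables named news_df, blogs_df, and twitter_df, but the report does not show how they are built. A minimal sketch, assuming the column names words and freq that the plotting code expects, is:

# Sketch: turning each document-term matrix into a sorted frequency table
term_freq <- function(dtm) {
    freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
    data.frame(words = names(freq), freq = freq,
               row.names = NULL, stringsAsFactors = FALSE)
}
news_df    <- term_freq(dtm_news)
blogs_df   <- term_freq(dtm_blogs)
twitter_df <- term_freq(dtm_twitter)
head(news_df, 1)  # top row: the most frequent word and its count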

News.

The most frequent word is “the”, which appears 1,953 times.

A histogram of the fifty most frequent words and a word cloud were made for the news file:

Histogram

library(ggplot2)
# Plotting the fifty most frequent words
ggplot(head(news_df, 50), aes(reorder(words, -freq), freq)) +
    geom_bar(stat = "identity") +
    labs(title = "Fifty most frequent words in the news file") +
    xlab("Words") +
    ylab("Frequency") +
    theme(axis.text.x = element_text(angle = 90))

Word Cloud

library(wordcloud)
wordcloud(news_df$words, news_df$freq, min.freq = 50, random.color = FALSE, colors = 1:10, max.words = 130)

Blogs.

The most frequent word is “the”, which appears 19,197 times.

A histogram of the fifty most frequent words and a word cloud were made for the blogs file:

Histogram

# Plotting the fifty most frequent words
ggplot(head(blogs_df, 50), aes(reorder(words, -freq), freq)) +
    geom_bar(stat = "identity") +
    labs(title = "Fifty most frequent words in the blogs file") +
    xlab("Words") +
    ylab("Frequency") +
    theme(axis.text.x = element_text(angle = 90))

Word Cloud

wordcloud(blogs_df$words, blogs_df$freq, min.freq = 50, random.color = FALSE, colors = 1:10, max.words = 130)

Twitter.

The most frequent word is “just”, which appears 14,883 times.

A histogram of the fifty most frequent words and a word cloud were made for the Twitter file:

Histogram

# Plotting the fifty most frequent words
ggplot(head(twitter_df, 50), aes(reorder(words, -freq), freq)) +
    geom_bar(stat = "identity") +
    labs(title = "Fifty most frequent words in the Twitter file") +
    xlab("Words") +
    ylab("Frequency") +
    theme(axis.text.x = element_text(angle = 90))

Word Cloud

wordcloud(twitter_df$words, twitter_df$freq, min.freq = 50, random.color = FALSE, colors = 1:10)

As can be seen, the word “the” is the most frequent in the blogs and news texts, while other words, such as “one”, appear in all three files.