Introduction

In this first peer-graded assignment we do some exploratory analysis and build a sample of the n-grams. Since the data is composed of several languages, for the purpose of this report we explore only the English files and, for the n-gram analysis, sample 15,000 lines from each file in order to have a manageable set.

library(tm);library(stringi);library(RWeka);library(SnowballC);library(dplyr);library(ggplot2);library(gridExtra)

Importing the data set

The following code checks whether you have the dataset in a given directory. Feel free to change the directory to wherever you downloaded the file. If the file is not within the directory, it will be downloaded.

dir <- "C:\\Users\\pedro moises\\Documents\\academic\\cursos online\\capstone project\\Capstone Project Data Science Specialization"
setwd(dir)
file_url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (file.exists("final")) {
        print("you are ready to roll")
} else if (file.exists("Coursera-SwiftKey.zip")) {
        unzip("./Coursera-SwiftKey.zip")
} else {
        download.file(file_url, destfile = "./Coursera-SwiftKey.zip")
        unzip("Coursera-SwiftKey.zip")
}
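As a quick optional check (a sketch, assuming the zip unpacked into ./final), we can list the language folders and the size of each English file to confirm that several languages are present and that the files are fairly large:

#list the language folders inside "final" and the size of the English files in MB
list.files("final")
en_files <- list.files("final/en_US", full.names = TRUE)
data.frame(file = basename(en_files),
           size_MB = round(file.size(en_files) / 1024^2, 1))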

Reading the data

The R function readLines() reads text lines in a simpler and quicker manner than other functions. In the case of the news dataset we have to open the file as a binary connection first because it contains special characters. See this post for more information.

twitter <- readLines(".\\final\\en_US\\en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
blogs <- readLines(".\\final\\en_US\\en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
#reading the news file as binary
b.news <- file(".\\final\\en_US\\en_US.news.txt", open="rb")
news <- readLines(b.news, encoding="UTF-8")
close(b.news)
rm(b.news)
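Because of the note above about special characters, a rough count of lines containing non-ASCII characters in each file can be useful (a quick sketch using stringi, which is already loaded):

#count lines that contain at least one non-ASCII character
sapply(list(blogs = blogs, news = news, twitter = twitter),
       function(x) sum(stri_detect_regex(x, "[^\\x20-\\x7E]")))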

Number of observations, characters and words per entry

#number of observations
blog_l <- length(blogs)
news_l <- length(news)
twitter_l <- length(twitter)

#number character per entry
blogs_c <- nchar(blogs)
news_c <- nchar(news)
twitter_c <- nchar(twitter)

#number of words per entry
blogs_w <- stri_count_words(blogs)
news_w <- stri_count_words(news)
twitter_w <- stri_count_words(twitter)

summary <- data.frame( "Observations"  = c(blog_l, news_l, twitter_l) %>% prettyNum(big.mark = ","),
                       "Characters"    = c(sum(blogs_c), sum(news_c), sum(twitter_c)) %>% prettyNum(big.mark = ","),
                       "Mean.Chars"    = c(mean(blogs_c), mean(news_c), mean(twitter_c)) %>% ceiling(),
                       "Median.Chars"  = c(median(blogs_c), median(news_c), median(twitter_c)) %>% ceiling(),
                       "Words"         = c(sum(blogs_w), sum(news_w), sum(twitter_w)) %>% prettyNum(big.mark = ","),
                       "Mean.Words"    = c(mean(blogs_w), mean(news_w), mean(twitter_w)) %>% ceiling(),
                       "Median.Words"  = c(median(blogs_w), median(news_w), median(twitter_w)) %>% ceiling(),
                       row.names = c("Blogs", "News", "Twitter"))
knitr::kable(summary)
         Observations   Characters    Mean.Chars  Median.Chars  Words        Mean.Words  Median.Words
Blogs    899,288        206,824,505   230         156           37,546,246   42          28
News     1,010,242      203,223,159   202         185           34,762,395   35          32
Twitter  2,360,148      162,096,241   69          64            30,093,410   13          13

As the table shows, the number of observations varies substantially among the files, with the Twitter file having the most entries. This is reversed, however, once we sum the number of characters per entry: the blogs and news files have more characters per entry (the same applies to words). The mean and median columns show that the blogs and news files contain some entries with very high character counts (outliers) that skew the means slightly upwards. The Twitter file has more regular counts, which is to be expected since the number of characters allowed on that platform was limited to 140.
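One simple way to see this skew (a quick sketch using the character counts computed above) is to compare a few upper quantiles of characters per entry; the long right tail in blogs and news pulls the mean above the median:

#upper quantiles of characters per entry for each source
sapply(list(blogs = blogs_c, news = news_c, twitter = twitter_c),
       quantile, probs = c(0.5, 0.9, 0.99, 1))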

Having roughly the same total word count in each file means we would have roughly the same amount of material from each source to train our model. This matters because the language used in each source is likely to differ.

Data Distribution

Now let us look at the distribution of characters and words per entry for each source.

df <- data.frame(characters = c(blogs_c, news_c, twitter_c), words = c(blogs_w, 
    news_w, twitter_w), source = c(rep("Blogs", blog_l), rep("News", news_l), 
    rep("Twitter", twitter_l)))

char <- ggplot(data = df, aes(x = source, y = characters, color = source)) + 
    geom_boxplot() + theme(legend.position = "none")
word <- ggplot(data = df, aes(x = source, y = words, color = source)) + geom_boxplot() + 
    theme(legend.position = "none")
grid.arrange(char, word, nrow = 1, ncol = 2)

As the plots above show, the numbers of characters and words per entry for the blogs and news sources have drastic outliers. This is important because the way we write differs depending on the length of the piece, so in this case we cannot simply remove the outliers to smooth the sample: doing so would leave behind potentially different writing styles.
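To put a number on those outliers without removing them, we could count, per source, how many entries exceed the conventional 1.5 x IQR threshold (a hedged sketch using dplyr on the df data frame built above):

#share of entries above the 1.5*IQR outlier threshold, per source
df %>%
    group_by(source) %>%
    summarise(threshold = quantile(characters, 0.75) + 1.5 * IQR(characters),
              outliers  = sum(characters > threshold),
              share     = round(mean(characters > threshold), 3))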

Sampling the data for N-gram analysis

Since the full files are too large to process comfortably on an everyday computer, we sample 15,000 lines from each file.

set.seed(200)
t <- sample(twitter,15000)
b <- sample(blogs,15000)
n <- sample(news,15000)
vector <- c(t,b,n)
writeLines(vector, "vector.txt")
rm(twitter, blogs,news,t,b,n)

Creating the corpus and data cleaning

In this step we use the tm package to load the sampled text and to clean it.

c <- Corpus(DirSource(pattern = "\\.txt$"))
rm(vector)

#data cleaning
c <- tm_map(c, content_transformer(tolower)) #transform to lowercase
c <- tm_map(c, removePunctuation)            #remove punctuation
c <- tm_map(c, removeNumbers)                #remove numbers
c <- tm_map(c, stemDocument)                 #stem words (remove common word endings)
c <- tm_map(c, stripWhitespace)              #strip extra white space
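Before building the term matrix it is worth a quick glance at the cleaned text (a small sketch; content() is the tm accessor for a document's text):

#peek at the first few characters of the first line of the cleaned document
substr(content(c[[1]])[1], 1, 80)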

Calculating the unigram

Now we are going to calculate the word frequencies. This is by no means a final analysis, but it gives us an idea of what to expect from the final data.

UnigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=1, max=1))
unigram <- DocumentTermMatrix(c, control=list(tokenize=UnigramTokenizer))
uni_table <-unigram %>%
  removeSparseTerms(0.2) %>%
  as.matrix %>%
  colSums %>%
  sort(decreasing=TRUE)  
uni_df <- data.frame(word = names(uni_table), freq = uni_table)
knitr::kable(head(uni_df))
       word     freq
the    the    303010
and    and    154155
that   that    70937
for    for     68548
you    you     56127
with   with    45174
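Since ggplot2 is already loaded, a quick bar chart of the most frequent terms (a sketch based on the uni_df data frame above) shows how heavily the counts are dominated by common stop words:

#bar chart of the 20 most frequent unigrams
uni_df %>%
    head(20) %>%
    ggplot(aes(x = reorder(word, freq), y = freq)) +
    geom_col() +
    coord_flip() +
    labs(x = "Word", y = "Frequency", title = "Top 20 unigrams in the sample")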

Further Analysis

For later analysis we plan to extend these unigram counts to higher-order n-grams (bigrams and trigrams) and use the resulting frequency tables to train the model; a sketch of the bigram step is shown below.
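As an illustration of that next step (a sketch only, following the same RWeka pattern used for the unigram above), a bigram frequency table could be built like this:

#bigram tokenizer and document-term matrix, analogous to the unigram step
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=2, max=2))
bigram <- DocumentTermMatrix(c, control=list(tokenize=BigramTokenizer))
bi_table <- bigram %>%
  removeSparseTerms(0.2) %>%
  as.matrix %>%
  colSums %>%
  sort(decreasing=TRUE)
head(bi_table)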

Thanks for reading!