Introduction

In this first peer-graded assignment we do some exploratory analysis and build a sample of the n-grams. Since the data is composed of several languages, for the purpose of this report we explore only the English files and, for the n-gram analysis, sample 15,000 lines from each file in order to have a manageable set.

library(tm);library(stringi);library(RWeka);library(SnowballC);library(dplyr);library(ggplot2);library(gridExtra)

Importing the data set

The following code checks whether you have the dataset in a given directory. Feel free to change the directory to wherever you downloaded the file. If the file is not within the directory, it will be downloaded.

dir <- "C:\\Users\\pedro moises\\Documents\\academic\\cursos online\\capstone project\\Capstone Project Data Science Specialization"
setwd(dir)
file_url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (file.exists("final")) {
        print("you are ready to roll")
} else if (file.exists("Coursera-SwiftKey.zip")) {
        unzip("./Coursera-SwiftKey.zip")
} else {
        download.file(file_url, destfile = "./Coursera-SwiftKey.zip")
        unzip("Coursera-SwiftKey.zip")
}
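As a quick optional check (a sketch, assuming the zip unpacked into ./final), we can list the language folders and the size of each English file to confirm that several languages are present and that the files are fairly large:

#list the language folders inside "final" and the size of the English files in MB
list.files("final")
en_files <- list.files("final/en_US", full.names = TRUE)
data.frame(file = basename(en_files),
           size_MB = round(file.size(en_files) / 1024^2, 1))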

Reading the data

The R function readLines() reads text lines in a simpler and quicker manner than other functions. In the case of the news dataset we have to open the file as a binary connection first because it contains special characters. See this post for more information.

twitter <- readLines(".\\final\\en_US\\en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
blogs <- readLines(".\\final\\en_US\\en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
#reading the news file as binary
b.news <- file(".\\final\\en_US\\en_US.news.txt", open="rb")
news <- readLines(b.news, encoding="UTF-8")
close(b.news)
rm(b.news)
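Because of the note above about special characters, a rough count of lines containing non-ASCII characters in each file can be useful (a quick sketch using stringi, which is already loaded):

#count lines that contain at least one non-ASCII character
sapply(list(blogs = blogs, news = news, twitter = twitter),
       function(x) sum(stri_detect_regex(x, "[^\\x20-\\x7E]")))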

Number of observations, characters and words per entry

#number of observations
blog_l <- length(blogs)
news_l <- length(news)
twitter_l <- length(twitter)

#number character per entry
blogs_c <- nchar(blogs)
news_c <- nchar(news)
twitter_c <- nchar(twitter)

#number of words per entry
blogs_w <- stri_count_words(blogs)
news_w <- stri_count_words(news)
twitter_w <- stri_count_words(twitter)

summary <- data.frame( "Observations"  = c(blog_l, news_l, twitter_l) %>% prettyNum(big.mark = ","),
                       "Characters"    = c(sum(blogs_c), sum(news_c), sum(twitter_c)) %>% prettyNum(big.mark = ","),
                       "Mean.Chars"    = c(mean(blogs_c), mean(news_c), mean(twitter_c)) %>% ceiling(),
                       "Median.Chars"  = c(median(blogs_c), median(news_c), median(twitter_c)) %>% ceiling(),
                       "Words"         = c(sum(blogs_w), sum(news_w), sum(twitter_w)) %>% prettyNum(big.mark = ","),
                       "Mean.Words"    = c(mean(blogs_w), mean(news_w), mean(twitter_w)) %>% ceiling(),
                       "Median.Words"  = c(median(blogs_w), median(news_w), median(twitter_w)) %>% ceiling(),
                       row.names = c("Blogs", "News", "Twitter"))
knitr::kable(summary)
         Observations   Characters    Mean.Chars  Median.Chars  Words        Mean.Words  Median.Words
Blogs    899,288        206,824,505   230         156           37,546,246   42          28
News     1,010,242      203,223,159   202         185           34,762,395   35          32
Twitter  2,360,148      162,096,241   69          64            30,093,410   13          13

As the table shows, the number of observations varies substantially among the files, with the Twitter file having the most entries. This is reversed, however, once we sum the number of characters per entry: the blogs and news files have more characters per entry (the same applies to words). The mean and median columns show that the blogs and news files contain some entries with very high character counts (outliers) that skew the means slightly upwards. The Twitter file has more regular counts, which is to be expected since the number of characters allowed on that platform was limited to 140.
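One simple way to see this skew (a quick sketch using the character counts computed above) is to compare a few upper quantiles of characters per entry; the long right tail in blogs and news pulls the mean above the median:

#upper quantiles of characters per entry for each source
sapply(list(blogs = blogs_c, news = news_c, twitter = twitter_c),
       quantile, probs = c(0.5, 0.9, 0.99, 1))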

Having roughly the same total word count in each file means we would have roughly the same amount of material from each source to train our model. This matters because the language used in each source is likely to differ.

Data Distribution

Now let us look at the distribution of characters and words per entry for each source.

df <- data.frame(characters = c(blogs_c, news_c, twitter_c), words = c(blogs_w, 
    news_w, twitter_w), source = c(rep("Blogs", blog_l), rep("News", news_l), 
    rep("Twitter", twitter_l)))

char <- ggplot(data = df, aes(x = source, y = characters, color = source)) + 
    geom_boxplot() + theme(legend.position = "none")
word <- ggplot(data = df, aes(x = source, y = words, color = source)) + geom_boxplot() + 
    theme(legend.position = "none")
grid.arrange(char, word, nrow = 1, ncol = 2)

As the plots above show, the numbers of characters and words per entry for the blogs and news sources have drastic outliers. This is important because the way we write differs depending on the length of the piece, so in this case we cannot simply remove the outliers to smooth the sample: doing so would leave behind potentially different writing styles.
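To put a number on those outliers without removing them, we could count, per source, how many entries exceed the conventional 1.5 x IQR threshold (a hedged sketch using dplyr on the df data frame built above):

#share of entries above the 1.5*IQR outlier threshold, per source
df %>%
    group_by(source) %>%
    summarise(threshold = quantile(characters, 0.75) + 1.5 * IQR(characters),
              outliers  = sum(characters > threshold),
              share     = round(mean(characters > threshold), 3))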

Sampling the data for N-gram analysis

Since the full files are too large to process comfortably on an everyday computer, we sample 15,000 lines from each file.

set.seed(200)
t <- sample(twitter,15000)
b <- sample(blogs,15000)
n <- sample(news,15000)
vector <- c(t,b,n)
writeLines(vector, "vector.txt")
rm(twitter, blogs,news,t,b,n)

Creating the corpus and data cleaning

In this step we use the tm package to load the sampled text and to clean it.

c <- Corpus(DirSource(pattern = "\\.txt$"))
rm(vector)

#data cleaning
c <- tm_map(c, content_transformer(tolower)) #transform to lowercase
c <- tm_map(c, removePunctuation)            #remove punctuation
c <- tm_map(c, removeNumbers)                #remove numbers
c <- tm_map(c, stemDocument)                 #stem words (remove common word endings)
c <- tm_map(c, stripWhitespace)              #strip extra white space
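Before building the term matrix it is worth a quick glance at the cleaned text (a small sketch; content() is the tm accessor for a document's text):

#peek at the first few characters of the first line of the cleaned document
substr(content(c[[1]])[1], 1, 80)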

Calculating the unigram

Now we are going to calculate the word frequencies. This is by no means a final analysis, but it gives us an idea of what to expect from the final data.

UnigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=1, max=1))
unigram <- DocumentTermMatrix(c, control=list(tokenize=UnigramTokenizer))
uni_table <-unigram %>%
  removeSparseTerms(0.2) %>%
  as.matrix %>%
  colSums %>%
  sort(decreasing=TRUE)  
uni_df <- data.frame(word = names(uni_table), freq = uni_table)
knitr::kable(head(uni_df))
       word     freq
the    the    303010
and    and    154155
that   that    70937
for    for     68548
you    you     56127
with   with    45174
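Since ggplot2 is already loaded, a quick bar chart of the most frequent terms (a sketch based on the uni_df data frame above) shows how heavily the counts are dominated by common stop words:

#bar chart of the 20 most frequent unigrams
uni_df %>%
    head(20) %>%
    ggplot(aes(x = reorder(word, freq), y = freq)) +
    geom_col() +
    coord_flip() +
    labs(x = "Word", y = "Frequency", title = "Top 20 unigrams in the sample")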

Further Analysis

For later analysis we plan to extend these unigram counts to higher-order n-grams (bigrams and trigrams) and use the resulting frequency tables to train the model; a sketch of the bigram step is shown below.
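As an illustration of that next step (a sketch only, following the same RWeka pattern used for the unigram above), a bigram frequency table could be built like this:

#bigram tokenizer and document-term matrix, analogous to the unigram step
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=2, max=2))
bigram <- DocumentTermMatrix(c, control=list(tokenize=BigramTokenizer))
bi_table <- bigram %>%
  removeSparseTerms(0.2) %>%
  as.matrix %>%
  colSums %>%
  sort(decreasing=TRUE)
head(bi_table)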

Thanks for reading!