Introduction Summary

The goal of this report is to show familiarity with the data and that the project is on track to create the prediction algorithm. It summarises the exploratory analysis and the goals for the eventual app and algorithm, and is kept deliberately concise: it explains only the major features of the data identified so far and briefly outlines the plan for the prediction algorithm and Shiny app in a way that is understandable to a non-data-scientist manager, using tables and plots to illustrate important summaries of the data set. The motivation for this project is to:

  1. Demonstrate that the data has been downloaded and successfully loaded.
  2. Create a basic report of summary statistics about the data sets.
  3. Report any interesting findings amassed so far.
  4. Get feedback on the plans for creating a prediction algorithm and Shiny app.

Loading required libraries

library(tm)
## Loading required package: NLP
library(slam)
library(xtable)
library(rJava)
library(RWeka)
library(NLP)
library(ngram)
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
library(wordcloud2)
library(knitr)
library(RColorBrewer)
library(stringi)
library(LaF)

1. Downloading data and reading it

The SwiftKey dataset was downloaded and unzipped manually from the link below: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
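Alternatively, the download and extraction can be scripted so the analysis is reproducible. A minimal sketch (the local file name and the extraction folder are assumptions):

# Sketch only: scripted download and extraction of the SwiftKey data set
zip_url  <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
zip_file <- "Coursera-SwiftKey.zip"     # assumed local file name
if (!file.exists(zip_file)) {
  download.file(zip_url, destfile = zip_file, mode = "wb")
}
if (!dir.exists("final")) {             # the archive is expected to extract into "final/"
  unzip(zip_file)
}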

Information about the three en_US files downloaded (blogs, news, twitter):

file_information <- function(file_path) {
  # File size in megabytes
  file_size <- file.info(file_path)$size / 1048576
  
  # Read the whole file into memory
  conn <- file(file_path, "r")
  full_text <- readLines(conn)
  close(conn)
  
  # Number of lines, length of the longest line (in characters) and total word count
  n_lines <- length(full_text)
  max_line <- max(nchar(full_text))
  n_words <- sum(stri_count_words(full_text))
  
  list(file_size = file_size, n_lines = n_lines, max_line = max_line, n_words = n_words)
}

data_dir <- "/Users/nilsgimpl/Desktop/Coding/R_data/datasciencecoursera/Capstone Project/NLP_capstone_project/en_US/"

info_blog <- file_information(paste0(data_dir,"en_US.blogs.txt"))
info_news <- file_information(paste0(data_dir,"en_US.news.txt"))
info_twitter <- file_information(paste0(data_dir,"en_US.twitter.txt"))
## Warning in readLines(conn): line 167155 appears to contain an embedded nul
## Warning in readLines(conn): line 268547 appears to contain an embedded nul
## Warning in readLines(conn): line 1274086 appears to contain an embedded nul
## Warning in readLines(conn): line 1759032 appears to contain an embedded nul
matrix(c(info_blog[1],info_blog[2],info_blog[3],info_blog[4], 
         info_news[1],info_news[2],info_news[3],info_news[4],
         info_twitter[1],info_twitter[2],info_twitter[3],info_twitter[4]), 
      nrow = 3, ncol = 4, byrow = TRUE,
      dimnames = list(c("Info Blogs:", "Info News:", "Info Twitter:"),
                 c("File Size in MB", "No. of Lines", "Longest Line (No. characters)", "No. of Words")))
##               File Size in MB No. of Lines Longest Line (No. characters)
## Info Blogs:   200.4242        899288       40833                        
## Info News:    196.2775        1010242      11384                        
## Info Twitter: 159.3641        2360148      140                          
##               No. of Words
## Info Blogs:   37546239    
## Info News:    34762395    
## Info Twitter: 30093372

Sampling Data

Only a portion of the data will be used for the initial analysis, so a sample is taken from each of the three en_US files (blogs, news, twitter); here, the first 2,000 lines of each file are read. A corpus (a collection of documents) is then created from the three samples.

blogs_con   <- file(paste0(data_dir, "en_US.blogs.txt"), "r")
news_con    <- file(paste0(data_dir, "en_US.news.txt"), "r")
twitter_con <- file(paste0(data_dir, "en_US.twitter.txt"), "r")

# Read the first 2,000 lines of each file as the working sample
blogs_data   <- readLines(blogs_con, 2000)
news_data    <- readLines(news_con, 2000)
twitter_data <- readLines(twitter_con, 2000)

# Combine the three samples into a single English-language corpus
corp <- VCorpus(VectorSource(c(blogs_data, news_data, twitter_data)),
                readerControl = list(reader = readPlain, language = "en"))
close(blogs_con)
close(news_con)
close(twitter_con)
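
Note that taking the first 2,000 lines is not a random sample. For a more representative sample later on, the LaF package loaded above provides sample_lines(); a minimal sketch (the sample size of 2,000 is kept from above for illustration):

# Sketch only: draw a random sample of lines instead of the first 2,000
blogs_sample   <- LaF::sample_lines(paste0(data_dir, "en_US.blogs.txt"), n = 2000)
news_sample    <- LaF::sample_lines(paste0(data_dir, "en_US.news.txt"), n = 2000)
twitter_sample <- LaF::sample_lines(paste0(data_dir, "en_US.twitter.txt"), n = 2000)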

2. Exploration and Data Cleaning

This section uses the text mining library ‘tm’ (loaded previously) to perform the data cleaning tasks that matter for predictive text analytics. The main cleaning steps are:

  1. Converting the document to lowercase
  2. Removing punctuation marks
  3. Removing numbers
  4. Removing stopwords (e.g. "and", "or", "not", "is")
  5. Removing undesired terms
  6. Removing extra whitespace generated in the previous steps

The above can be achieved with some of the tm package functions; let's take a look at each cleaning task individually:

  1. Converting the document to lowercase

corp_low <- tm_map(corp, content_transformer(tolower))
  2. Removing punctuation marks
corp_low_punct <- tm_map(corp_low, removePunctuation)
  3. Removing numbers, because predicting a number is particularly challenging; numbers are therefore removed at this step.
corp_low_punct_no <- tm_map(corp_low_punct, removeNumbers)
  4. Removing stopwords (e.g. "and", "or", "not", "is"). Stopwords appear so often in the text that they add little value to a prediction algorithm. A good exercise before removing them is to check how common they actually are and only then decide whether to treat them as stopwords (a sketch of such a check follows the removal step below). The tm package already includes stopword lists for several different languages.
corp_low_punct_no_stop <- tm_map(corp_low_punct_no, removeWords, stopwords("english"))
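
As suggested above, the frequency of the English stopwords can be inspected before they are dropped. A minimal sketch of such a check, using the slam package loaded earlier on the corpus from just before stopword removal (illustrative only, not part of the cleaning pipeline):

# Sketch only: how frequent are the English stopwords in the sample before removal?
tdm_pre       <- TermDocumentMatrix(corp_low_punct_no)
term_freq     <- sort(slam::row_sums(tdm_pre), decreasing = TRUE)
stopword_freq <- term_freq[names(term_freq) %in% stopwords("english")]
head(stopword_freq, 10)  # the ten most common stopwords in the sample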
  5. Removing undesired terms. A first exploration of the datasets showed that they contain many profanity words, which would normally be candidates for removal; however, since they may carry some weight in the prediction results, this step can be deferred to a later stage depending on needs (a sketch of how it could be done follows the whitespace step below).

  6. Removing extra whitespace generated in the previous steps

corp_low_punct_no_stop_white <- tm_map(corp_low_punct_no_stop, stripWhitespace)
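
If profanity filtering is needed later, it could reuse the removeWords transformation applied to the stopwords above. A minimal sketch, assuming a plain-text word list is available locally (the file name profanity_words.txt is an assumption):

# Sketch only: remove profanity using a user-supplied word list (file name is assumed)
profanity_words <- readLines("profanity_words.txt", warn = FALSE)
corp_clean      <- tm_map(corp_low_punct_no_stop_white, removeWords, profanity_words)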

3. Analysis of the cleaned data

The cleaned data is now ready to be analysed. The next steps examine:

  1. Whether some words are more frequent than others (unigrams)
  2. The frequency of 2-grams (bigrams) in the sample
  3. The frequency of 3-grams (trigrams) in the sample
uni_gram <- as.data.frame(as.matrix(TermDocumentMatrix(corp_low_punct_no_stop_white)))
uni_gram_sorted <- sort(rowSums(uni_gram), decreasing = TRUE)
uni_gram_data_frame <- data.frame(word = names(uni_gram_sorted), freq = uni_gram_sorted)
uni_gram_data_frame[1:10,]
##      word freq
## said said  600
## one   one  499
## will will  499
## like like  478
## just just  464
## can   can  402
## time time  351
## new   new  344
## get   get  326
## now   now  294
bigram <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
bi_gram <- as.data.frame(as.matrix(TermDocumentMatrix(corp_low_punct_no_stop_white,
                                                      control = list(tokenize = bigram))))
bi_gram_sorted <- sort(rowSums(bi_gram), decreasing = TRUE)
bi_gram_data_frame <- data.frame(word = names(bi_gram_sorted), freq = bi_gram_sorted)
bi_gram_data_frame[1:10,]
##                    word freq
## new york       new york   44
## last year     last year   34
## dont know     dont know   32
## high school high school   32
## right now     right now   31
## u u                 u u   26
## last night   last night   24
## feel like     feel like   22
## new jersey   new jersey   21
## years ago     years ago   21
trigram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
tri_gram <- as.data.frame(as.matrix(TermDocumentMatrix(corp_low_punct_no_stop_white,
                                                       control = list(tokenize = trigram))))
tri_gram_sorted <- sort(rowSums(tri_gram), decreasing = TRUE)
tri_gram_data_frame <- data.frame(word = names(tri_gram_sorted), freq = tri_gram_sorted)
tri_gram_data_frame[1:10,]
##                                          word freq
## u u u                                   u u u   17
## pates fountain parks     pates fountain parks   11
## classic pates fountain classic pates fountain    8
## cinco de mayo                   cinco de mayo    7
## new york city                   new york city    6
## new york times                 new york times    6
## world war ii                     world war ii    5
## cricket world cup           cricket world cup    4
## four years ago                 four years ago    4
## osama bin laden               osama bin laden    4
uni_gram_plot <- ggplot(uni_gram_data_frame[1:20,], aes(x = reorder(word, freq), y = freq)) + 
                  geom_bar(stat = "identity", width = 0.7, fill = "steelblue") + 
                  labs(title = "20 Most Common Unigrams") +
                  xlab("Unigrams") + ylab("Frequency") + 
                  theme(axis.text.x = element_text(angle = 90, vjust = 0.3))
uni_gram_plot

wordcloud_bi_gram <- wordcloud2(bi_gram_data_frame[1:400,], size = 1.0, shape = 'circle')
wordcloud_bi_gram
wordcloud_tri_gram <- wordcloud2(tri_gram_data_frame[1:200,], size = 1.0, shape = 'circle')
wordcloud_tri_gram
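
Looking ahead to the prediction algorithm mentioned in the introduction, frequency tables like the ones built above could back a simple next-word lookup. The following is only an illustrative sketch (the predict_next_word helper is hypothetical, not the final algorithm), reusing the tri_gram_data_frame and bi_gram_data_frame objects created earlier:

# Sketch only: back-off style next-word lookup based on the n-gram tables above
predict_next_word <- function(phrase) {
  tokens <- tail(unlist(strsplit(tolower(phrase), "\\s+")), 2)
  # first try trigrams whose first two words match the last two input words
  if (length(tokens) == 2) {
    prefix <- paste(tokens, collapse = " ")
    hits <- tri_gram_data_frame[startsWith(as.character(tri_gram_data_frame$word),
                                           paste0(prefix, " ")), ]
    if (nrow(hits) > 0) return(tail(strsplit(as.character(hits$word[1]), " ")[[1]], 1))
  }
  # back off to bigrams keyed on the last input word
  hits <- bi_gram_data_frame[startsWith(as.character(bi_gram_data_frame$word),
                                        paste0(tail(tokens, 1), " ")), ]
  if (nrow(hits) > 0) return(tail(strsplit(as.character(hits$word[1]), " ")[[1]], 1))
  NA_character_
}

predict_next_word("new york")  # expected to suggest "city" or "times" given the counts above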