This submission is for the week 2 assignment of the Data Science Capstone course on Coursera. The goal of the Capstone project is to create a word prediction algorithm to be used as an app.
The goal of this first assignment is to get familiar with and explore the data so that we can get a better idea about how to proceed with making the word prediction app.
The first thing to do is to download the data and load it into the R environment. After that is completed, we can create a data frame that will show us how many characters, lines, and words are contained in the data.
There are three main data files to work with, each from a different source: blogs, news articles, and tweets from Twitter.
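The full code appears in the appendix at the end of this report; a condensed sketch of the loading and summary step is shown below, assuming the Coursera-SwiftKey.zip archive has already been unzipped into the working directory.
library(readr)   # read_lines()
library(ngram)   # wordcount()
# Load the three English-language source files (paths as used in the appendix)
en_blogs   <- read_lines("final/en_US/en_US.blogs.txt",   skip_empty_rows = TRUE)
en_news    <- read_lines("final/en_US/en_US.news.txt",    skip_empty_rows = TRUE)
en_twitter <- read_lines("final/en_US/en_US.twitter.txt", skip_empty_rows = TRUE)
# Character, line, and word counts for each source
exTable <- data.frame(
  Source     = c("Blogs", "News", "Twitter"),
  characters = c(sum(nchar(en_blogs)), sum(nchar(en_news)), sum(nchar(en_twitter))),
  lines      = c(length(en_blogs), length(en_news), length(en_twitter)),
  words      = c(wordcount(en_blogs), wordcount(en_news), wordcount(en_twitter)))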
exTable
## Source characters lines words
## 1 Blogs 206824505 899288 37334131
## 2 News 203223159 1010242 34372530
## 3 Twitter 162096031 2360148 30373543
We can now see that the datasets are quite large. It will be necessary to take a sample from each dataset to cut down on the computing resources needed to process the data.
A few things need to be done before we can start applying natural language processing functions to the data. The first is to create a sample of each of the three datasets; a 10% sample will be sufficient for our purposes.
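A condensed sketch of the sampling step, mirroring the appendix code, which fixes the random seed so the sample is reproducible:
set.seed(12321)          # fixed seed so the 10% sample is reproducible
sample_size <- 0.10
en_blogs   <- en_blogs[sample(seq_along(en_blogs), floor(length(en_blogs) * sample_size))]
en_news    <- en_news[sample(seq_along(en_news), floor(length(en_news) * sample_size))]
en_twitter <- en_twitter[sample(seq_along(en_twitter), floor(length(en_twitter) * sample_size))]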
Next, we will perform additional preprocessing on the Twitter data, removing hashtags and retweet markers since they will not be useful for a general-purpose word prediction tool.
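This cleanup amounts to two regular-expression substitutions, shown here as in the appendix code:
# Remove hashtags (e.g. "#capstone") and leftover retweet markers
en_twitter <- gsub("#\\w+", "", en_twitter)
en_twitter <- gsub("RT : ", "", en_twitter)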
Now we can get on with generating a single corpus containing all three sample datasets.
Next, we need to do some additional housekeeping on the corpus. We will convert the text to lowercase and remove numbers, then strip out punctuation and extra whitespace. After that, we will convert the documents back to plain text documents so the tm functions that follow can work with them.
The final step for this part of the report is to remove stop words and profanity from the corpus.
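A condensed sketch of the corpus construction and cleaning pipeline follows. It differs slightly from the appendix code in that it wraps tolower in content_transformer(), which keeps the documents as PlainTextDocument objects and so avoids the separate conversion step; the profanity list is the same swearWords.txt file used in the appendix.
library(tm)
library(readr)
# Build a single corpus from the three sampled sources
corpusData <- VCorpus(VectorSource(c(en_blogs, en_news, en_twitter)))
# Cleaning: lowercase, drop numbers and punctuation, collapse extra whitespace
corpusData <- tm_map(corpusData, content_transformer(tolower))
corpusData <- tm_map(corpusData, removeNumbers)
corpusData <- tm_map(corpusData, removePunctuation)
corpusData <- tm_map(corpusData, stripWhitespace)
# Remove English stop words and the profanity list
profanity  <- read_lines("swearWords.txt")
corpusData <- tm_map(corpusData, removeWords, stopwords("en"))
corpusData <- tm_map(corpusData, removeWords, profanity)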
The next tasks are more technical. We will tokenize the corpus; essentially, this means that we will separate the data into smaller units called tokens. In this project, tokens will be individual words, although some tokens will not turn out to be real words because of typos, misspellings, or intentional gibberish in the data. It will be our job to clean the data further before the final steps of creating the app are undertaken.
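Tokenization is handled with RWeka's NGramTokenizer; one tokenizer function is defined per n-gram size, as in the appendix code. For example:
library(RWeka)
# One tokenizer per n-gram size
uniToken <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
biToken  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
triToken <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
# A single sentence tokenized into bigrams:
biToken("the quick brown fox")   # "the quick" "quick brown" "brown fox"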
Now we are ready to look at the n-grams generated from the sampled data. The three tables that follow list the most frequent unigrams (single words), bigrams, and trigrams in the corpus.
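Each table is produced by building a term-document matrix for that n-gram size, dropping very sparse terms, and summing the counts of the remaining terms. A condensed sketch for the unigram case, with the same cut-offs as the appendix code:
# Unigram term-document matrix, with sparse and rare terms removed
uniGram   <- TermDocumentMatrix(corpusData, control = list(tokenize = uniToken))
uniNew    <- removeSparseTerms(uniGram, 0.99)
uniFreq   <- findFreqTerms(uniNew, lowfreq = 25)
# Total frequency per term, sorted in decreasing order
uniCounts <- rowSums(as.matrix(uniNew[uniFreq, ]))
uniDF     <- data.frame(word = names(uniCounts), frequency = uniCounts)
uniDF     <- uniDF[order(-uniDF$frequency), ]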
## word frequency
## 1 will 31572
## 2 just 30384
## 3 said 30189
## 4 one 28519
## 5 like 26840
## 6 can 24425
## 7 get 22503
## 8 time 21209
## 9 new 19392
## 10 good 17692
## 11 now 17422
## 12 day 16801
## 13 know 16122
## 14 people 15718
## 15 love 15700
## 16 dont 14103
## 17 back 13978
## 18 see 13664
## 19 first 13336
## 20 make 13247
## 21 also 12849
## 22 going 12579
## 23 think 12511
## 24 great 12395
## 25 last 12331
## 26 much 12009
## 27 well 11846
## 28 year 11539
## 29 two 11516
## 30 really 11325
## 31 way 10944
## 32 today 10920
## 33 even 10877
## 34 got 10806
## 35 want 10612
## 36 still 10354
## 37 work 10334
## 38 years 10001
## 39 right 9990
## 40 thanks 9765
## 41 need 9737
## 42 many 8899
## 43 life 8630
## 44 say 8499
## 45 take 8490
## 46 made 8331
## 47 little 8262
## 48 come 8248
## 49 never 8110
## 50 home 7960
## 51 best 7895
## 52 may 7850
## 53 next 7593
## 54 week 7574
## 55 night 7370
## 56 cant 7346
## 57 thats 7317
## 58 things 7221
## 59 school 7119
## 60 something 7110
## 61 lol 6936
## 62 game 6902
## 63 always 6659
## 64 better 6647
## 65 happy 6605
## 66 every 6589
## 67 another 6539
## 68 around 6531
## 69 state 6430
## 70 look 6365
## 71 show 6336
## 72 world 6308
## 73 long 6158
## 74 big 6134
## 75 since 5870
## 76 man 5835
## 77 feel 5805
## 78 sure 5757
## 79 city 5688
## 80 help 5670
## 81 use 5609
## 82 three 5577
## 83 hope 5570
## 84 follow 5518
## 85 thing 5503
## 86 youre 5447
## 87 days 5433
## 88 find 5402
## 89 getting 5333
## 90 lot 5311
## 91 didnt 5296
## 92 keep 5294
## 93 says 5189
## 94 ever 5173
## 95 house 5161
## 96 part 5092
## 97 put 5089
## 98 place 5041
## 99 ive 4993
## 100 team 4982
## 101 family 4943
## 102 give 4941
## 103 let 4918
## 104 tonight 4876
## 105 looking 4843
## 106 though 4843
## 107 thank 4786
## 108 old 4757
## 109 play 4722
## 110 end 4692
## 111 morning 4593
## 112 away 4592
## 113 ill 4547
## word frequency
## 1 right now 2379
## 2 new york 1974
## 3 last year 1865
## 4 cant wait 1814
## 5 dont know 1552
## 6 last night 1503
## 7 high school 1389
## 8 years ago 1350
## 9 feel like 1279
## 10 last week 1252
## 11 first time 1181
## 12 im going 1153
## 13 looking forward 1122
## 14 can get 1110
## 15 make sure 1105
## 16 st louis 955
## 17 looks like 943
## 18 even though 934
## 19 happy birthday 891
## 20 good morning 872
## 21 just got 868
## 22 im sure 813
## 23 let know 803
## 24 new jersey 799
## 25 dont think 786
## 26 dont want 780
## 27 united states 779
## 28 one day 746
## 29 every day 732
## 30 look like 732
## 31 good luck 719
## 32 next week 715
## 33 two years 703
## 34 thanks follow 692
## 35 can see 690
## 36 just like 686
## 37 said “ 670
## 38 next year 665
## 39 mothers day 654
## 40 social media 644
## 41 can make 639
## 42 los angeles 637
## 43 little bit 598
## 44 many people 595
## 45 long time 591
## 46 will get 591
## 47 san francisco 588
## 48 sounds like 583
## 49 come back 581
## 50 one thing 574
## 51 follow back 567
## 52 get back 566
## 53 every time 563
## 54 im just 541
## 55 go back 533
## 56 san diego 522
## 57 dont like 507
## 58 im gonna 507
## 59 last month 498
## 60 will make 498
## 61 let us 491
## 62 will take 487
## 63 can help 484
## 64 great day 477
## 65 dont get 474
## 66 will never 468
## 67 next time 465
## 68 three years 462
## 69 pretty much 448
## 70 lets go 445
## word frequency
## 1 happy mothers day 338
## 2 cant wait see 336
## 3 let us know 289
## 4 new york city 244
## 5 happy new year 167
## 6 two years ago 158
## 7 new york times 153
## 8 im pretty sure 146
## 9 president barack obama 141
## 10 dont even know 125
## 11 cinco de mayo 115
## 12 feel like im 108
## 13 world war ii 105
## 14 gov chris christie 101
## 15 st louis county 101
## 16 looking forward seeing 99
## 17 cant wait get 96
## 18 will take place 93
## 19 first time since 92
## 20 im looking forward 92
## 21 two weeks ago 86
## 22 three years ago 84
## 23 cant wait till 79
## 24 ive ever seen 79
## 25 st patricks day 77
## 26 new years eve 75
## 27 just got back 70
## 28 five years ago 69
## 29 cant wait hear 68
## 30 dont feel like 67
## 31 rock n roll 67
## 32 couple weeks ago 64
## 33 four years ago 64
## 34 martin luther king 64
## 35 right now im 64
## 36 long time ago 63
## 37 love love love 61
## 38 wall street journal 61
## 39 high school students 60
## 40 past two years 60
## 41 dont get wrong 56
## 42 world trade center 56
## 43 george w bush 55
## 44 ill let know 55
## 45 happy valentines day 54
## 46 ive never seen 54
## 47 just got home 54
## 48 couple years ago 52
## 49 didnt even know 52
## 50 look forward seeing 52
## 51 superior court judge 52
## 52 g protein g 51
## 53 think im going 51
## 54 senior vice president 50
## 55 im sure will 49
## 56 just make sure 49
## 57 really looking forward 48
## 58 told associated press 48
## 59 every single day 47
## 60 follow back please 47
## 61 keep good work 47
## 62 national weather service 47
## 63 osama bin laden 47
## 64 thanks following us 47
## 65 come see us 46
## 66 new york ny 46
## 67 please follow back 46
## 68 come join us 45
## 69 executive vice president 45
## 70 good morning everyone 45
## 71 last two years 45
## 72 several years ago 45
## 73 g carbohydrate g 44
## 74 county sheriffs office 43
## 75 fat g saturated 43
## 76 g fat g 43
## 77 just let know 43
## 78 makes feel like 43
## 79 will let know 43
There will be further analysis of the dataset and modifications to how the n-grams are generated before we are ready to move on to the next stages of creating the text prediction app.
knitr::opts_chunk$set(echo = TRUE)
## Load Libraries
library(NLP)
library(readr)
library(knitr)
library(ggplot2)
library(slam)
library(ngram)
library(stringr)
library(RWeka)
library(tidytext)
library(tm)
## Unzip the data archive in the working directory
unzip("Coursera-SwiftKey.zip")
# load files into R environment
en_blogs <- read_lines("final/en_US/en_US.blogs.txt", skip_empty_rows = TRUE)
en_news <- read_lines("final/en_US/en_US.news.txt", skip_empty_rows = TRUE)
en_twitter <- read_lines("final/en_US/en_US.twitter.txt", skip_empty_rows = TRUE)
# Summary stats for files
linesTotal <- c(length(en_blogs), length(en_news), length(en_twitter))
wordsTotal <- c(wordcount(en_blogs), wordcount(en_news), wordcount(en_twitter))
charsTotal <- c(sum(nchar(en_blogs)), sum(nchar(en_news)), sum(nchar(en_twitter)))
exTable <- data.frame(Source = c("Blogs", "News", "Twitter"),
characters = charsTotal,
lines = linesTotal,
words = wordsTotal)
exTable
rm(exTable, charsTotal, linesTotal, wordsTotal)
# Create a sample of each dataset
set.seed(12321)
sample_size <- 0.10
blogsIndex <- sample(seq_len(length(en_blogs)),length(en_blogs)*sample_size)
twitterIndex <- sample(seq_len(length(en_twitter)),length(en_twitter)*sample_size)
newsIndex <- sample(seq_len(length(en_news)),length(en_news)*sample_size)
en_blogs <- en_blogs[blogsIndex]
en_twitter <- en_twitter[twitterIndex]
en_news <- en_news[newsIndex]
rm(sample_size, blogsIndex, twitterIndex, newsIndex)
# Remove hashtags and RTs
en_twitter <- gsub("#\\w+", "", en_twitter)
en_twitter <- gsub("RT : ", "", en_twitter)
# Concatenate 3 source vectors
en_data <- c(en_blogs, en_news, en_twitter)
rm(en_blogs, en_news, en_twitter)
# Construct the corpus
corpusData <- VCorpus(VectorSource(en_data))
rm(en_data)
# Processing of the corpus: lowercase, then drop numbers, punctuation, and extra whitespace
corpusData <- tm_map(corpusData, tolower)
corpusData <- tm_map(corpusData, removeNumbers)
corpusData <- tm_map(corpusData, removePunctuation)
corpusData <- tm_map(corpusData, stripWhitespace)
# tolower returns plain character vectors, so restore the PlainTextDocument class
# before building term-document matrices
corpusData <- tm_map(corpusData, PlainTextDocument)
# Remove English stop words and the profanity list loaded from swearWords.txt
corpusData <- tm_map(corpusData, removeWords, stopwords(kind = "en"))
profanity <- read_lines("swearWords.txt")
corpusData <- tm_map(corpusData, removeWords, profanity)
# Tokenize the corpus
uniToken <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
biToken <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
triToken <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
uniGram <- TermDocumentMatrix(corpusData, control = list(tokenize = uniToken))
biGram <- TermDocumentMatrix(corpusData, control = list(tokenize = biToken))
triGram <- TermDocumentMatrix(corpusData, control = list(tokenize = triToken))
# Remove Sparse Terms
uniNew <- removeSparseTerms(uniGram, 0.99)
biNew <- removeSparseTerms(biGram, 0.999)
triNew <- removeSparseTerms(triGram, 0.9999)
# Keep only terms that appear at least this many times
uniFreq <- findFreqTerms(uniNew, lowfreq = 25)
biFreq <- findFreqTerms(biNew, lowfreq = 20)
triFreq <- findFreqTerms(triNew, lowfreq = 20)
# Convert to Dataframes
uniDF <- rowSums(as.matrix(uniNew[uniFreq,]))
uniDF <- data.frame(word = names(uniDF), frequency = uniDF)
biDF <- rowSums(as.matrix(biNew[biFreq,]))
biDF <- data.frame(word = names(biDF), frequency = biDF)
triDF <- rowSums(as.matrix(triNew[triFreq,]))
triDF <- data.frame(word = names(triDF), frequency = triDF)
rm(uniFreq, uniNew, biFreq, biNew, triFreq, triNew)
uniDF <- uniDF[order(-uniDF$frequency),]
biDF <- biDF[order(-biDF$frequency),]
triDF <- triDF[order(-triDF$frequency),]
row.names(uniDF) <- 1:length(row.names(uniDF))
row.names(biDF) <- 1:length(row.names(biDF))
row.names(triDF) <- 1:length(row.names(triDF))
uniDF
biDF
triDF
barplot(uniDF[1:10,]$frequency, las = 2, names.arg = uniDF[1:10,]$word, col = "blue",
main = "Most Frequent 1-grams", xlab = "Frequency", horiz = TRUE)
barplot(biDF[1:10,]$frequency, las = 2, names.arg = biDF[1:10,]$word, col = "red",
main = "Most Frequent 2-grams", xlab = "Frequency", horiz = TRUE)
barplot(triDF[1:10,]$frequency, las = 2, names.arg = triDF[1:10,]$word, col = "green",
main = "Most Frequent 3-grams", xlab = "Frequency", horiz = TRUE)