This submission is for the week 2 assignment of the Data Science Capstone course on Coursera. The goal of the Capstone project is to create a word prediction algorithm to be used as an app.
The goal of this first assignment is to get familiar with and explore the data so that we can get a better idea about how to proceed with making the word prediction app.
The first thing to do is to download the data and load it into the R environment. After that is completed, we can create a data frame that will show us how many characters, lines, and words are contained in the data.
There are three main data files to work with, each from a different source: blogs, news articles, and tweets from Twitter.
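The full code appears in the appendix at the end of this report; a condensed sketch of the loading and summary step is shown below, assuming the Coursera-SwiftKey.zip archive has already been unzipped into the working directory.
library(readr)   # read_lines()
library(ngram)   # wordcount()
# Load the three English-language source files (paths as used in the appendix)
en_blogs   <- read_lines("final/en_US/en_US.blogs.txt",   skip_empty_rows = TRUE)
en_news    <- read_lines("final/en_US/en_US.news.txt",    skip_empty_rows = TRUE)
en_twitter <- read_lines("final/en_US/en_US.twitter.txt", skip_empty_rows = TRUE)
# Character, line, and word counts for each source
exTable <- data.frame(
  Source     = c("Blogs", "News", "Twitter"),
  characters = c(sum(nchar(en_blogs)), sum(nchar(en_news)), sum(nchar(en_twitter))),
  lines      = c(length(en_blogs), length(en_news), length(en_twitter)),
  words      = c(wordcount(en_blogs), wordcount(en_news), wordcount(en_twitter)))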
exTable
## Source characters lines words
## 1 Blogs 206824505 899288 37334131
## 2 News 203223159 1010242 34372530
## 3 Twitter 162096031 2360148 30373543
We can now see that the datasets are quite large. It will be necessary to take a sample from each dataset to cut down on the computing resources needed to process the data.
A few things need to be done before we can start applying natural language processing functions to the data. The first is to create a sample of each of the three datasets; a 10% sample will be sufficient for our purposes.
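A condensed sketch of the sampling step, mirroring the appendix code, which fixes the random seed so the sample is reproducible:
set.seed(12321)          # fixed seed so the 10% sample is reproducible
sample_size <- 0.10
en_blogs   <- en_blogs[sample(seq_along(en_blogs), floor(length(en_blogs) * sample_size))]
en_news    <- en_news[sample(seq_along(en_news), floor(length(en_news) * sample_size))]
en_twitter <- en_twitter[sample(seq_along(en_twitter), floor(length(en_twitter) * sample_size))]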
Next, we will perform additional preprocessing on the Twitter data, removing hashtags and retweet markers since they will not be useful for a general-purpose word prediction tool.
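This cleanup amounts to two regular-expression substitutions, shown here as in the appendix code:
# Remove hashtags (e.g. "#capstone") and leftover retweet markers
en_twitter <- gsub("#\\w+", "", en_twitter)
en_twitter <- gsub("RT : ", "", en_twitter)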
Now we can get on with generating a single corpus containing all three sample datasets.
Next, we need to do some additional housekeeping on the corpus. We will convert the text to lowercase and remove numbers, then strip out punctuation and extra whitespace. After that, we will convert the documents back to plain text documents so the tm functions that follow can work with them.
The final step for this part of the report is to remove stop words and profanity from the corpus.
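A condensed sketch of the corpus construction and cleaning pipeline follows. It differs slightly from the appendix code in that it wraps tolower in content_transformer(), which keeps the documents as PlainTextDocument objects and so avoids the separate conversion step; the profanity list is the same swearWords.txt file used in the appendix.
library(tm)
library(readr)
# Build a single corpus from the three sampled sources
corpusData <- VCorpus(VectorSource(c(en_blogs, en_news, en_twitter)))
# Cleaning: lowercase, drop numbers and punctuation, collapse extra whitespace
corpusData <- tm_map(corpusData, content_transformer(tolower))
corpusData <- tm_map(corpusData, removeNumbers)
corpusData <- tm_map(corpusData, removePunctuation)
corpusData <- tm_map(corpusData, stripWhitespace)
# Remove English stop words and the profanity list
profanity  <- read_lines("swearWords.txt")
corpusData <- tm_map(corpusData, removeWords, stopwords("en"))
corpusData <- tm_map(corpusData, removeWords, profanity)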
The next tasks are more technical. We will tokenize the corpus; essentially, this means that we will separate the data into smaller units called tokens. In this project, tokens will be individual words, although some tokens will not turn out to be real words because of typos, misspellings, or intentional gibberish in the data. It will be our job to clean the data further before the final steps of creating the app are undertaken.
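Tokenization is handled with RWeka's NGramTokenizer; one tokenizer function is defined per n-gram size, as in the appendix code. For example:
library(RWeka)
# One tokenizer per n-gram size
uniToken <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
biToken  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
triToken <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
# A single sentence tokenized into bigrams:
biToken("the quick brown fox")   # "the quick" "quick brown" "brown fox"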
Now we are ready to look at the n-grams generated from the sampled data. The three tables that follow list the most frequent unigrams (single words), bigrams, and trigrams in the corpus.
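Each table is produced by building a term-document matrix for that n-gram size, dropping very sparse terms, and summing the counts of the remaining terms. A condensed sketch for the unigram case, with the same cut-offs as the appendix code:
# Unigram term-document matrix, with sparse and rare terms removed
uniGram   <- TermDocumentMatrix(corpusData, control = list(tokenize = uniToken))
uniNew    <- removeSparseTerms(uniGram, 0.99)
uniFreq   <- findFreqTerms(uniNew, lowfreq = 25)
# Total frequency per term, sorted in decreasing order
uniCounts <- rowSums(as.matrix(uniNew[uniFreq, ]))
uniDF     <- data.frame(word = names(uniCounts), frequency = uniCounts)
uniDF     <- uniDF[order(-uniDF$frequency), ]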
## word frequency
## 1 will 31572
## 2 just 30384
## 3 said 30189
## 4 one 28519
## 5 like 26840
## 6 can 24425
## 7 get 22503
## 8 time 21209
## 9 new 19392
## 10 good 17692
## 11 now 17422
## 12 day 16801
## 13 know 16122
## 14 people 15718
## 15 love 15700
## 16 dont 14103
## 17 back 13978
## 18 see 13664
## 19 first 13336
## 20 make 13247
## 21 also 12849
## 22 going 12579
## 23 think 12511
## 24 great 12395
## 25 last 12331
## 26 much 12009
## 27 well 11846
## 28 year 11539
## 29 two 11516
## 30 really 11325
## 31 way 10944
## 32 today 10920
## 33 even 10877
## 34 got 10806
## 35 want 10612
## 36 still 10354
## 37 work 10334
## 38 years 10001
## 39 right 9990
## 40 thanks 9765
## 41 need 9737
## 42 many 8899
## 43 life 8630
## 44 say 8499
## 45 take 8490
## 46 made 8331
## 47 little 8262
## 48 come 8248
## 49 never 8110
## 50 home 7960
## 51 best 7895
## 52 may 7850
## 53 next 7593
## 54 week 7574
## 55 night 7370
## 56 cant 7346
## 57 thats 7317
## 58 things 7221
## 59 school 7119
## 60 something 7110
## 61 lol 6936
## 62 game 6902
## 63 always 6659
## 64 better 6647
## 65 happy 6605
## 66 every 6589
## 67 another 6539
## 68 around 6531
## 69 state 6430
## 70 look 6365
## 71 show 6336
## 72 world 6308
## 73 long 6158
## 74 big 6134
## 75 since 5870
## 76 man 5835
## 77 feel 5805
## 78 sure 5757
## 79 city 5688
## 80 help 5670
## 81 use 5609
## 82 three 5577
## 83 hope 5570
## 84 follow 5518
## 85 thing 5503
## 86 youre 5447
## 87 days 5433
## 88 find 5402
## 89 getting 5333
## 90 lot 5311
## 91 didnt 5296
## 92 keep 5294
## 93 says 5189
## 94 ever 5173
## 95 house 5161
## 96 part 5092
## 97 put 5089
## 98 place 5041
## 99 ive 4993
## 100 team 4982
## 101 family 4943
## 102 give 4941
## 103 let 4918
## 104 tonight 4876
## 105 looking 4843
## 106 though 4843
## 107 thank 4786
## 108 old 4757
## 109 play 4722
## 110 end 4692
## 111 morning 4593
## 112 away 4592
## 113 ill 4547
## word frequency
## 1 right now 2379
## 2 new york 1974
## 3 last year 1865
## 4 cant wait 1814
## 5 dont know 1552
## 6 last night 1503
## 7 high school 1389
## 8 years ago 1350
## 9 feel like 1279
## 10 last week 1252
## 11 first time 1181
## 12 im going 1153
## 13 looking forward 1122
## 14 can get 1110
## 15 make sure 1105
## 16 st louis 955
## 17 looks like 943
## 18 even though 934
## 19 happy birthday 891
## 20 good morning 872
## 21 just got 868
## 22 im sure 813
## 23 let know 803
## 24 new jersey 799
## 25 dont think 786
## 26 dont want 780
## 27 united states 779
## 28 one day 746
## 29 every day 732
## 30 look like 732
## 31 good luck 719
## 32 next week 715
## 33 two years 703
## 34 thanks follow 692
## 35 can see 690
## 36 just like 686
## 37 said “ 670
## 38 next year 665
## 39 mothers day 654
## 40 social media 644
## 41 can make 639
## 42 los angeles 637
## 43 little bit 598
## 44 many people 595
## 45 long time 591
## 46 will get 591
## 47 san francisco 588
## 48 sounds like 583
## 49 come back 581
## 50 one thing 574
## 51 follow back 567
## 52 get back 566
## 53 every time 563
## 54 im just 541
## 55 go back 533
## 56 san diego 522
## 57 dont like 507
## 58 im gonna 507
## 59 last month 498
## 60 will make 498
## 61 let us 491
## 62 will take 487
## 63 can help 484
## 64 great day 477
## 65 dont get 474
## 66 will never 468
## 67 next time 465
## 68 three years 462
## 69 pretty much 448
## 70 lets go 445
## word frequency
## 1 happy mothers day 338
## 2 cant wait see 336
## 3 let us know 289
## 4 new york city 244
## 5 happy new year 167
## 6 two years ago 158
## 7 new york times 153
## 8 im pretty sure 146
## 9 president barack obama 141
## 10 dont even know 125
## 11 cinco de mayo 115
## 12 feel like im 108
## 13 world war ii 105
## 14 gov chris christie 101
## 15 st louis county 101
## 16 looking forward seeing 99
## 17 cant wait get 96
## 18 will take place 93
## 19 first time since 92
## 20 im looking forward 92
## 21 two weeks ago 86
## 22 three years ago 84
## 23 cant wait till 79
## 24 ive ever seen 79
## 25 st patricks day 77
## 26 new years eve 75
## 27 just got back 70
## 28 five years ago 69
## 29 cant wait hear 68
## 30 dont feel like 67
## 31 rock n roll 67
## 32 couple weeks ago 64
## 33 four years ago 64
## 34 martin luther king 64
## 35 right now im 64
## 36 long time ago 63
## 37 love love love 61
## 38 wall street journal 61
## 39 high school students 60
## 40 past two years 60
## 41 dont get wrong 56
## 42 world trade center 56
## 43 george w bush 55
## 44 ill let know 55
## 45 happy valentines day 54
## 46 ive never seen 54
## 47 just got home 54
## 48 couple years ago 52
## 49 didnt even know 52
## 50 look forward seeing 52
## 51 superior court judge 52
## 52 g protein g 51
## 53 think im going 51
## 54 senior vice president 50
## 55 im sure will 49
## 56 just make sure 49
## 57 really looking forward 48
## 58 told associated press 48
## 59 every single day 47
## 60 follow back please 47
## 61 keep good work 47
## 62 national weather service 47
## 63 osama bin laden 47
## 64 thanks following us 47
## 65 come see us 46
## 66 new york ny 46
## 67 please follow back 46
## 68 come join us 45
## 69 executive vice president 45
## 70 good morning everyone 45
## 71 last two years 45
## 72 several years ago 45
## 73 g carbohydrate g 44
## 74 county sheriffs office 43
## 75 fat g saturated 43
## 76 g fat g 43
## 77 just let know 43
## 78 makes feel like 43
## 79 will let know 43
There will be further analysis of the dataset and modifications to how the n-grams are generated before we are ready to move on to the next stages of creating the text prediction app.
knitr::opts_chunk$set(echo = TRUE)
## Load Libraries
library(NLP)
library(readr)
library(knitr)
library(ggplot2)
library(slam)
library(ngram)
library(stringr)
library(RWeka)
library(tidytext)
library(tm)
## Unzip the data archive in the working directory
unzip("Coursera-SwiftKey.zip")
# load files into R environment
en_blogs <- read_lines("final/en_US/en_US.blogs.txt", skip_empty_rows = TRUE)
en_news <- read_lines("final/en_US/en_US.news.txt", skip_empty_rows = TRUE)
en_twitter <- read_lines("final/en_US/en_US.twitter.txt", skip_empty_rows = TRUE)
# Summary stats for files
linesTotal <- c(length(en_blogs), length(en_news), length(en_twitter))
wordsTotal <- c(wordcount(en_blogs), wordcount(en_news), wordcount(en_twitter))
charsTotal <- c(sum(nchar(en_blogs)), sum(nchar(en_news)), sum(nchar(en_twitter)))
exTable <- data.frame(Source = c("Blogs", "News", "Twitter"),
characters = charsTotal,
lines = linesTotal,
words = wordsTotal)
exTable
rm(exTable, charsTotal, linesTotal, wordsTotal)
# Create a sample of each dataset
set.seed(12321)
sample_size <- 0.10
blogsIndex <- sample(seq_len(length(en_blogs)),length(en_blogs)*sample_size)
twitterIndex <- sample(seq_len(length(en_twitter)),length(en_twitter)*sample_size)
newsIndex <- sample(seq_len(length(en_news)),length(en_news)*sample_size)
en_blogs <- en_blogs[blogsIndex]
en_twitter <- en_twitter[twitterIndex]
en_news <- en_news[newsIndex]
rm(sample_size, blogsIndex, twitterIndex, newsIndex)
# Remove hashtags and RTs
en_twitter <- gsub("#\\w+", "", en_twitter)
en_twitter <- gsub("RT : ", "", en_twitter)
# Concatenate 3 source vectors
en_data <- c(en_blogs, en_news, en_twitter)
rm(en_blogs, en_news, en_twitter)
# Construct the corpus
corpusData <- VCorpus(VectorSource(en_data))
rm(en_data)
# Processing of the corpus: lowercase, then drop numbers, punctuation, and extra whitespace
corpusData <- tm_map(corpusData, tolower)
corpusData <- tm_map(corpusData, removeNumbers)
corpusData <- tm_map(corpusData, removePunctuation)
corpusData <- tm_map(corpusData, stripWhitespace)
# tolower returns plain character vectors, so restore the PlainTextDocument class
# before building term-document matrices
corpusData <- tm_map(corpusData, PlainTextDocument)
# Remove English stop words and the profanity list loaded from swearWords.txt
corpusData <- tm_map(corpusData, removeWords, stopwords(kind = "en"))
profanity <- read_lines("swearWords.txt")
corpusData <- tm_map(corpusData, removeWords, profanity)
# Tokenize the corpus
uniToken <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
biToken <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
triToken <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
uniGram <- TermDocumentMatrix(corpusData, control = list(tokenize = uniToken))
biGram <- TermDocumentMatrix(corpusData, control = list(tokenize = biToken))
triGram <- TermDocumentMatrix(corpusData, control = list(tokenize = triToken))
# Remove Sparse Terms
uniNew <- removeSparseTerms(uniGram, 0.99)
biNew <- removeSparseTerms(biGram, 0.999)
triNew <- removeSparseTerms(triGram, 0.9999)
# Keep only terms that appear at least this many times
uniFreq <- findFreqTerms(uniNew, lowfreq = 25)
biFreq <- findFreqTerms(biNew, lowfreq = 20)
triFreq <- findFreqTerms(triNew, lowfreq = 20)
# Convert to Dataframes
uniDF <- rowSums(as.matrix(uniNew[uniFreq,]))
uniDF <- data.frame(word = names(uniDF), frequency = uniDF)
biDF <- rowSums(as.matrix(biNew[biFreq,]))
biDF <- data.frame(word = names(biDF), frequency = biDF)
triDF <- rowSums(as.matrix(triNew[triFreq,]))
triDF <- data.frame(word = names(triDF), frequency = triDF)
rm(uniFreq, uniNew, biFreq, biNew, triFreq, triNew)
uniDF <- uniDF[order(-uniDF$frequency),]
biDF <- biDF[order(-biDF$frequency),]
triDF <- triDF[order(-triDF$frequency),]
row.names(uniDF) <- 1:length(row.names(uniDF))
row.names(biDF) <- 1:length(row.names(biDF))
row.names(triDF) <- 1:length(row.names(triDF))
uniDF
biDF
triDF
barplot(uniDF[1:10,]$frequency, las = 2, names.arg = uniDF[1:10,]$word, col = "blue",
main = "Most Frequent 1-grams", xlab = "Frequency", horiz = TRUE)
barplot(biDF[1:10,]$frequency, las = 2, names.arg = biDF[1:10,]$word, col = "red",
main = "Most Frequent 2-grams", xlab = "Frequency", horiz = TRUE)
barplot(triDF[1:10,]$frequency, las = 2, names.arg = triDF[1:10,]$word, col = "green",
main = "Most Frequent 3-grams", xlab = "Frequency", horiz = TRUE)