This project explores three large English text datasets (news, blogs, and twitter) using basic natural language processing.

# Download and load the data

The data was downloaded from the course website and saved in the directory "C:/Users/angel/Documents/stats/Coursera/Course10-CAPSTONE/final/en_US/".

setwd("C:/Users/angel/Documents/stats/Coursera/Course10-CAPSTONE/final/en_US/")

# load the news data; the connection is opened in binary mode so that any
# embedded nul/EOF byte in the file does not truncate the read on some systems
con <- file("en_US.news.txt", "rb")
newsdat <- readLines(con)
close(con)  # close the connection

#load the blogs data
con <- file("en_US.blogs.txt")
blogsdat <- readLines(con) 
close(con)

# load the twitter data
con <- file("en_US.twitter.txt") 
twitdat <- readLines(con) 
close(con)

## Basic summaries

To count the number of words, we can remove extra spaces, numbers, and punctuation, then tokenize the data; a sketch of this step and the resulting counts are shown below.

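One way to reproduce these counts (a sketch; the chunk that produced the table is not shown, and the helper count.words is illustrative) is to count lines with length() and words by splitting each line on whitespace. The exact figures may differ slightly, since the reported counts were taken after stripping numbers and punctuation.

# sketch: count lines, words, and words per line (wpl) for each dataset
count.words <- function(x) sum(lengths(strsplit(trimws(x), "\\s+")))

data.series <- c("twitter", "news", "blogs")
lines <- c(length(twitdat), length(newsdat), length(blogsdat))
words <- c(count.words(twitdat), count.words(newsdat), count.words(blogsdat))
data.frame(data.series, lines, words, wpl = words / lines)
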
| data.series | lines   | words    | wpl (words per line) |
|-------------|---------|----------|----------------------|
| twitter     | 2360148 | 29409703 | 12.46                |
| news        | 1010243 | 33534491 | 33.19                |
| blogs       |  899288 | 36885367 | 41.02                |

# Exploratory data analysis

Since the datasets are so large, we can take subsets to work from. Using rbinom() as a coin flip for each line, we can keep roughly 10% of each dataset.

set.seed(1234)  # for reproducibility
# keep each line with probability 0.1, i.e. roughly a 10% sample of each dataset
twitsample <- twitdat[as.logical(rbinom(length(twitdat), 1, 0.1))]
newssample <- newsdat[as.logical(rbinom(length(newsdat), 1, 0.1))]
blogssample <- blogsdat[as.logical(rbinom(length(blogsdat), 1, 0.1))]

# once we have the samples, we can remove the full datasets to free memory
rm(twitdat, newsdat, blogsdat)

Before we get into exploratory data analysis, we should clean the samples. We can convert the text to lower case, expand contractions, and remove stop words, punctuation, extra whitespace, and numbers.

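The pipeline below relies on the tm package and on magrittr for the %>% pipe. It also calls fix.contractions(), which is not a tm or base R function; a minimal sketch of such a helper (assumed here, not necessarily the one actually used) is:

library(tm)        # removeWords, stopwords, removePunctuation, stripWhitespace, removeNumbers, termFreq
library(magrittr)  # the %>% pipe

# assumed helper: expand common English contractions before punctuation is stripped
fix.contractions <- function(x) {
  x <- gsub("won't", "will not", x)
  x <- gsub("can't", "cannot", x)
  x <- gsub("n't", " not", x)
  x <- gsub("'ll", " will", x)
  x <- gsub("'re", " are", x)
  x <- gsub("'ve", " have", x)
  x <- gsub("'m", " am", x)
  x <- gsub("'d", " would", x)
  gsub("'s", "", x)  # drop possessive / "is" contractions
}
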
twitclean <- tolower(twitsample) %>% fix.contractions() %>% tm::removeWords(stopwords("en")) %>% removePunctuation() %>% stripWhitespace() %>% removeNumbers()
twitclean <- twitclean[trimws(twitclean) != ""]  # drop lines left empty by the cleaning

blogsclean <- tolower(blogssample) %>% fix.contractions() %>% tm::removeWords(stopwords("en")) %>% removePunctuation() %>% stripWhitespace() %>% removeNumbers()
blogsclean <- blogsclean[trimws(blogsclean) != ""]

newsclean <- tolower(newssample) %>% fix.contractions() %>% tm::removeWords(stopwords("en")) %>% removePunctuation() %>% stripWhitespace() %>% removeNumbers()
newsclean <- newsclean[trimws(newsclean) != ""]

## Most frequent words

We can use the tm functions termFreq() and findMostFreqTerms() to find the most frequent words in each sample.

twitfreq <- termFreq(twitclean)
twitmostfreq <- findMostFreqTerms(twitfreq,10)
newsfreq <- termFreq(newsclean)
newsmostfreq <- findMostFreqTerms(newsfreq,10)
blogsfreq <- termFreq(blogsclean)
blogsmostfreq <- findMostFreqTerms(blogsfreq,10)

We can then plot the ten most frequent words in each sample.
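The plotting code is omitted here; a minimal sketch using base graphics and the term frequency vector directly could look like this:

# sketch: bar plot of the ten most frequent words in the twitter sample
top10 <- head(sort(twitfreq, decreasing = TRUE), 10)
barplot(as.numeric(top10), names.arg = names(top10), las = 2, col = "steelblue",
        main = "Top 10 words - twitter sample", ylab = "frequency")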

## 2-gram and 3-gram

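The chunk that produced the tables below is not shown; one way to build such frequency tables in base R (a sketch that reproduces the ngrams/freq/prop columns, though not necessarily the exact tokenization used) is:

# sketch: build an n-gram frequency table from a vector of cleaned lines
make.ngrams <- function(lines, n) {
  words <- strsplit(trimws(lines), "\\s+")
  grams <- unlist(lapply(words, function(w) {
    if (length(w) < n) return(character(0))
    sapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "))
  }))
  tab <- sort(table(grams), decreasing = TRUE)
  data.frame(ngrams = names(tab), freq = as.integer(tab),
             prop = as.integer(tab) / sum(tab),
             row.names = NULL, stringsAsFactors = FALSE)
}

# e.g. the ten most frequent twitter 2-grams
head(make.ngrams(twitclean, 2), 10)
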
## [1] "twitter 2-gram"
##              ngrams freq         prop
## 1       last night  1530 0.0009003548
## 2        right now  1401 0.0008244426
## 3  looking forward  1355 0.0007973731
## 4         can wait  1219 0.0007173415
## 5        next week  1069 0.0006290715
## 6    thanks follow  1017 0.0005984712
## 7   happy birthday   916 0.0005390360
## 8        feel like   894 0.0005260897
## 9         will get   805 0.0004737161
## 10      looks like   785 0.0004619468
## [1] "twitter 3-gram"
##                     ngrams freq         prop
## 1             let us know   309 1.818365e-04
## 2          go night night   232 1.365245e-04
## 3          will come hang   227 1.335821e-04
## 4             u know clap   220 1.294629e-04
## 5                hi hi hi   212 1.247551e-04
## 6        happy mother day   191 1.123973e-04
## 7  store called regarding   186 1.094550e-04
## 8         tim burton dark   180 1.059242e-04
## 9        please make sure   171 1.006280e-04
## 10          know can help   167 9.827408e-05
## [1] "news 2-gram"
##          ngrams freq         prop
## 1    last year  1321 0.0006796970
## 2     new york  1139 0.0005860522
## 3     st louis   898 0.0004620499
## 4    last week   655 0.0003370186
## 5  high school   649 0.0003339314
## 6    said will   606 0.0003118065
## 7    will play   606 0.0003118065
## 8         m pm   531 0.0002732166
## 9   new jersey   529 0.0002721875
## 10    said “   519 0.0002670422
## [1] "news 3-gram"
##                              ngrams freq         prop
## 1                    new york city   271 1.394383e-04
## 2            service years service   206 1.059937e-04
## 3              years service years   206 1.059937e-04
## 4  superintendent special services   174 8.952865e-05
## 5                  st louis county   164 8.438332e-05
## 6                      game pm buy   156 8.026706e-05
## 7                     pm side game   156 8.026706e-05
## 8             registration pm side   156 8.026706e-05
## 9                     side game pm   156 8.026706e-05
## 10                   now years old   152 7.820893e-05
## [1] "blogs 2-gram"
##            ngrams freq         prop
## 1  mister rogers   952 0.0004894096
## 2     little boy   798 0.0004102404
## 3      years ago   700 0.0003598600
## 4      big sword   612 0.0003146204
## 5       new york   606 0.0003115359
## 6        one day   591 0.0003038246
## 7      make sure   584 0.0003002260
## 8      last year   567 0.0002914866
## 9        can get   537 0.0002760640
## 10    last night   528 0.0002714372
## [1] "blogs 3-gram"
##                      ngrams freq         prop
## 1           little boy big   408 2.097471e-04
## 2            boy big sword   408 2.097471e-04
## 3           love toast mom   236 1.213243e-04
## 4                pu bef th   213 1.095003e-04
## 5            new york city   192 9.870450e-05
## 6             “ love ”   189 9.716225e-05
## 7  work incredibly pleased   176 9.047913e-05
## 8  advertising people good   164 8.431010e-05
## 9    scrapping bug designs   162 8.328192e-05
## 10 creative kuts scrapping   162 8.328192e-05

# Milestone report summary and next steps

This initial exploration uses basic natural language processing functions and practices. The hardest part was waiting for the markdown file to knit. The datasets can be cleaned further for the next step of the project. For example, stray single letters that carry no meaning can be removed (like the “m pm” entry in the news 2-gram table).

Next, we can build a prediction algorithm using the 2-gram and 3-gram tables. It is not yet clear how to handle words, or sequences of words, that do not appear in the corpora. We will also combine the twitter, news, and blogs datasets in the next step to build one comprehensive corpus. Finally, the 2-grams and 3-grams built for this milestone are not very accurate yet, since all the lines of text were combined, so n-grams can span line and sentence boundaries.
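
As a first cut at that algorithm, a simple lookup that tries the 3-gram table and backs off to the 2-gram table could be sketched as follows; the names trigrams and bigrams are placeholders for frequency tables with the ngrams and freq columns shown above.

# sketch: predict the next word from n-gram frequency tables
# `trigrams` and `bigrams` are assumed data frames with columns `ngrams` and `freq`
predict.next <- function(phrase, trigrams, bigrams) {
  w <- unlist(strsplit(tolower(trimws(phrase)), "\\s+"))
  # try the last two words against the 3-gram table first
  if (length(w) >= 2) {
    key  <- paste(tail(w, 2), collapse = " ")
    hits <- trigrams[startsWith(trigrams$ngrams, paste0(key, " ")), ]
    if (nrow(hits) > 0) return(sub(".* ", "", hits$ngrams[which.max(hits$freq)]))
  }
  # back off to the 2-gram table using only the last word
  hits <- bigrams[startsWith(bigrams$ngrams, paste0(tail(w, 1), " ")), ]
  if (nrow(hits) > 0) return(sub(".* ", "", hits$ngrams[which.max(hits$freq)]))
  NA_character_  # unseen context: will need a smarter backoff/smoothing strategy
}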