This milestone report explores three large English text datasets (news, blogs, and Twitter) using basic natural language processing.
# Download and load the data

The data was downloaded from the course website and saved in the directory "C:/Users/angel/Documents/stats/Coursera/Course10-CAPSTONE/final/en_US/".
setwd("C:/Users/angel/Documents/stats/Coursera/Course10-CAPSTONE/final/en_US/")
#load the news data
con <- file("en_US.news.txt", "rb") #open in binary mode so readLines does not stop early at embedded special characters
newsdat <- readLines(con)
close(con) #close the connection
#load the blogs data
con <- file("en_US.blogs.txt")
blogsdat <- readLines(con)
close(con)
#load the twitter data
con <- file("en_US.twitter.txt")
twitdat <- readLines(con)
close(con)
## Basic summaries

To count the number of words, we can remove the extra spaces, numbers, and punctuation, and then tokenize the data.
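A minimal sketch of that counting step, assuming the tm and magrittr packages are available (count_words is a hypothetical helper, not necessarily the exact code used for the table below; wpl is the mean number of words per line):

library(tm) #removeNumbers, removePunctuation, stripWhitespace
library(magrittr) #the %>% pipe
#hypothetical helper: words per line after light cleaning
count_words <- function(lines) {
  cleaned <- lines %>% removeNumbers() %>% removePunctuation() %>% stripWhitespace()
  lengths(strsplit(trimws(cleaned), "\\s+")) #tokenize on whitespace and count tokens per line
}
twitwpl <- count_words(twitdat)
c(lines = length(twitdat), words = sum(twitwpl), wpl = mean(twitwpl)) #one row of the summary table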
| data.series | lines | words | wpl (mean words per line) |
|---|---|---|---|
| twitter | 2360148 | 29409703 | 12.46 |
| news | 1010243 | 33534491 | 33.19 |
| blogs | 899288 | 36885367 | 41.02 |
# Exploratory data analysis

Since the datasets are so large, we can take random subsets to work from. Using sample(), we keep roughly 10% of the lines from each dataset.

set.seed(1234) #hypothetical seed so the subsets are reproducible
twitsample <- twitdat[sample(length(twitdat), floor(length(twitdat) * 0.1))]
newssample <- newsdat[sample(length(newsdat), floor(length(newsdat) * 0.1))]
blogssample <- blogsdat[sample(length(blogsdat), floor(length(blogsdat) * 0.1))]
#once we get the samples, we can remove the original datasets from memory
rm(twitdat)
rm(newsdat)
rm(blogsdat)
Before we get into exploratory data analysis, we should clean the datasets. We can remove extra spaces, stop words, punctuation, and numbers, and expand contractions.

library(tm)
library(magrittr)
#fix.contractions() is a custom helper (defined elsewhere in the project, not a tm function) that handles contractions such as "can't"
twitclean <- tolower(twitsample) %>% fix.contractions() %>% tm::removeWords(stopwords("en")) %>% removePunctuation() %>% stripWhitespace() %>% removeNumbers()
twitclean <- twitclean[!twitclean %in% c("", " ")] #drop lines left empty after cleaning; safer than -which(), which empties the vector when nothing matches
blogsclean <- tolower(blogssample) %>% fix.contractions() %>% tm::removeWords(stopwords("en")) %>% removePunctuation() %>% stripWhitespace() %>% removeNumbers()
blogsclean <- blogsclean[!blogsclean %in% c("", " ")]
newsclean <- tolower(newssample) %>% fix.contractions() %>% tm::removeWords(stopwords("en")) %>% removePunctuation() %>% stripWhitespace() %>% removeNumbers()
newsclean <- newsclean[!newsclean %in% c("", " ")]
## Most frequent words

We can use the functions termFreq and findMostFreqTerms from the tm package to find the most frequent words in each dataset.
twitfreq <- termFreq(twitclean)
twitmostfreq <- findMostFreqTerms(twitfreq,10)
newsfreq <- termFreq(newsclean)
newsmostfreq <- findMostFreqTerms(newsfreq,10)
blogsfreq <- termFreq(blogsclean)
blogsmostfreq <- findMostFreqTerms(blogsfreq,10)
We can then plot the ten most frequent words in each dataset.
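A minimal sketch of such a plot for the Twitter sample using base R (the news and blogs samples would follow the same pattern), assuming twitmostfreq is the named frequency vector returned above:

#bar chart of the ten most frequent Twitter terms; names are the terms, heights are their counts
barplot(twitmostfreq, las = 2, col = "steelblue", main = "Top 10 Twitter terms", ylab = "frequency")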
## 2-gram and 3-gram
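The tables below list each n-gram with its count (freq) and its share of all n-grams in that sample (prop). A minimal base-R sketch of how such a table could be built from the cleaned samples (ngram_table is a hypothetical helper, not necessarily the exact code that produced the output below):

#hypothetical helper: build an n-gram frequency table (ngrams, freq, prop) from a cleaned character vector
ngram_table <- function(lines, n) {
  words <- strsplit(trimws(lines), "\\s+") #split each line into word tokens
  grams <- unlist(lapply(words, function(w) {
    if (length(w) < n) return(character(0))
    sapply(seq_len(length(w) - n + 1), function(i) paste(w[i:(i + n - 1)], collapse = " "))
  }))
  tab <- sort(table(grams), decreasing = TRUE)
  data.frame(ngrams = names(tab), freq = as.integer(tab), prop = as.integer(tab) / sum(tab), row.names = NULL, stringsAsFactors = FALSE)
}
head(ngram_table(twitclean, 2), 10) #top ten twitter 2-grams
head(ngram_table(twitclean, 3), 10) #top ten twitter 3-grams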
## [1] "twitter 2-gram"
## ngrams freq prop
## 1 last night 1530 0.0009003548
## 2 right now 1401 0.0008244426
## 3 looking forward 1355 0.0007973731
## 4 can wait 1219 0.0007173415
## 5 next week 1069 0.0006290715
## 6 thanks follow 1017 0.0005984712
## 7 happy birthday 916 0.0005390360
## 8 feel like 894 0.0005260897
## 9 will get 805 0.0004737161
## 10 looks like 785 0.0004619468
## [1] "twitter 3-gram"
## ngrams freq prop
## 1 let us know 309 1.818365e-04
## 2 go night night 232 1.365245e-04
## 3 will come hang 227 1.335821e-04
## 4 u know clap 220 1.294629e-04
## 5 hi hi hi 212 1.247551e-04
## 6 happy mother day 191 1.123973e-04
## 7 store called regarding 186 1.094550e-04
## 8 tim burton dark 180 1.059242e-04
## 9 please make sure 171 1.006280e-04
## 10 know can help 167 9.827408e-05
## [1] "news 2-gram"
## ngrams freq prop
## 1 last year 1321 0.0006796970
## 2 new york 1139 0.0005860522
## 3 st louis 898 0.0004620499
## 4 last week 655 0.0003370186
## 5 high school 649 0.0003339314
## 6 said will 606 0.0003118065
## 7 will play 606 0.0003118065
## 8 m pm 531 0.0002732166
## 9 new jersey 529 0.0002721875
## 10 said “ 519 0.0002670422
## [1] "news 3-gram"
## ngrams freq prop
## 1 new york city 271 1.394383e-04
## 2 service years service 206 1.059937e-04
## 3 years service years 206 1.059937e-04
## 4 superintendent special services 174 8.952865e-05
## 5 st louis county 164 8.438332e-05
## 6 game pm buy 156 8.026706e-05
## 7 pm side game 156 8.026706e-05
## 8 registration pm side 156 8.026706e-05
## 9 side game pm 156 8.026706e-05
## 10 now years old 152 7.820893e-05
## [1] "blogs 2-gram"
## ngrams freq prop
## 1 mister rogers 952 0.0004894096
## 2 little boy 798 0.0004102404
## 3 years ago 700 0.0003598600
## 4 big sword 612 0.0003146204
## 5 new york 606 0.0003115359
## 6 one day 591 0.0003038246
## 7 make sure 584 0.0003002260
## 8 last year 567 0.0002914866
## 9 can get 537 0.0002760640
## 10 last night 528 0.0002714372
## [1] "blogs 3-gram"
## ngrams freq prop
## 1 little boy big 408 2.097471e-04
## 2 boy big sword 408 2.097471e-04
## 3 love toast mom 236 1.213243e-04
## 4 pu bef th 213 1.095003e-04
## 5 new york city 192 9.870450e-05
## 6 “ love †189 9.716225e-05
## 7 work incredibly pleased 176 9.047913e-05
## 8 advertising people good 164 8.431010e-05
## 9 scrapping bug designs 162 8.328192e-05
## 10 creative kuts scrapping 162 8.328192e-05
# Milestone report summary and next steps

This initial exploration uses basic natural language processing functions and practices. The hardest part was waiting for the markdown file to knit. The datasets can be cleaned further for the next step of the project. For example, single letters that don't carry meaning can be removed (like the "m pm" entry in the news 2-gram).
Next, we can build a prediction algorithm using the 2-grams and 3-grams. I am not yet sure how to handle words or sets of words that are not found in the corpora. The next step will also combine the twitter, news, and blogs datasets into one comprehensive corpus. I also think the 2-grams and 3-grams built in this milestone are not very accurate yet, since I combined all the lines of text.
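As a rough illustration of the planned prediction step, the sketch below looks up the most frequent continuation in a 3-gram table and backs off to the 2-gram table when the prefix is unseen. It assumes frequency tables in the ngrams/freq format shown above; the object names twit3gram and twit2gram are hypothetical.

#hypothetical next-word lookup with simple backoff from 3-grams to 2-grams
predict_next <- function(phrase, tri, bi) {
  w <- strsplit(trimws(tolower(phrase)), "\\s+")[[1]]
  if (length(w) >= 2) {
    prefix <- paste(tail(w, 2), collapse = " ")
    hits <- tri[startsWith(tri$ngrams, paste0(prefix, " ")), ]
    if (nrow(hits) > 0) return(sub(".* ", "", hits$ngrams[which.max(hits$freq)]))
  }
  hits <- bi[startsWith(bi$ngrams, paste0(tail(w, 1), " ")), ] #back off to 2-grams keyed on the last word
  if (nrow(hits) > 0) return(sub(".* ", "", hits$ngrams[which.max(hits$freq)]))
  NA_character_ #unseen prefix: to be handled in the next phase
}
predict_next("let us", twit3gram, twit2gram) #should return "know" given the twitter 3-gram table above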