This markdown file acquires the English text data for the capstone project. It has the following sections:
General housekeeping and sampling function.
setwd("~/Desktop/Online-Classes/Johns Hopkins Data Science/DSCapstone")
require(R.utils)
require(tm)
require(tau)
require(textcat)
dir1<-"./final/en_US/" # path to input and sample files
dir2<-"./final/" # path to saved RData files
fileb<-"en_US.blogs.txt"; filen<-"en_US.news.txt"; filet<-"en_US.twitter.txt"
# define a function that uses the shell perl command to sample a text file
# example usage: samp.001(dir,file)
# dir - is the path to the file location
# file - is the original file name
# side effect - creates and saves a text file about 1/1000 the size of the original
# value - the name of the saved file including path, e.g. "./final/en_US/sample.en_US.blogs.txt"
# reference: http://stackoverflow.com/questions/22261082/load-a-small-random-sample-from-a-large-csv-file-into-r-data-frame
samp.001 <- function(dir, file) {
    # build the shell command: keep each input line with probability .001
    perlsample <- "perl -ne 'print if (rand() < .001)' "
    sampleb <- paste(perlsample, dir, file, " > ", dir, "sample.", file, sep = "")
    system(sampleb)                          # run the sampling command in the shell
    paste(dir, "sample.", file, sep = "")    # return the sample file name with path
}
saveFile<-paste(dir2,"always1.RData",sep="")
save(dir1,fileb,filen,filet,samp.001,file=saveFile);
Create three random sample files from the three original data files by using the “samp.001” function. Each file is about 1/1000 the size of the original and is used to develop the data algorithms without reading in the entire corpus. Only run once.
## only run this code one time, or it will overwrite the sample files with new random data
sampb<-samp.001(dir1,fileb) #create sample.en_US.blogs.txt
sampn<-samp.001(dir1,filen) #create sample.en_US.news.txt
sampt<-samp.001(dir1,filet) #create sample.en_US.twitter.txt
saveFile<-paste(dir2,"runonce1.RData",sep="")
save(sampb,sampn,sampt,file=saveFile);
Read the data: Use readLines with binary connections to input the text files, putting each text line into a large list. Calculate the number of lines per file, the mean and maximum line length, and the total number of characters. Save the results. Only run once.
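The chunk that does this is not echoed above; a minimal sketch of the approach, assuming the paths defined earlier (dir1, fileb, filen, filet) and hypothetical names for the helper (readAll), the line list (textList), the statistics data frame (stats), and the saved file (runonce2.RData), might look like this:

# open each file as a binary connection so embedded nulls and control
# characters do not truncate the read, then pull every line into memory
readAll <- function(path) {
    con <- file(path, open = "rb")
    txt <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
    close(con)
    txt
}
files <- c(paste0(dir1, c(filet, fileb, filen)),
           paste0(dir1, "sample.", c(filet, fileb, filen)))
textList <- lapply(files, readAll)
# per-file statistics: line count, mean and max line length, total characters
stats <- data.frame(file       = files,
                    lineCount  = sapply(textList, length),
                    meanLength = sapply(textList, function(x) mean(nchar(x))),
                    maxLength  = sapply(textList, function(x) max(nchar(x))),
                    totalNchar = sapply(textList, function(x) sum(nchar(x))))
saveFile <- paste(dir2, "runonce2.RData", sep = "")
save(textList, stats, file = saveFile)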
Print text file statistics. Use the US tweet list to explore questions in Quiz 1.
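The chunk behind the output below is likewise not echoed; continuing the hypothetical names from the sketch above, the statistics print and the Quiz 1 look-ups could be done roughly as follows:

print(stats)                  # the per-file statistics table below
twit <- textList[[1]]         # full US Twitter lines
# Quiz 1: ratio of lines containing "love" to lines containing "hate"
print(paste("love/hate=", sum(grepl("love", twit)) / sum(grepl("hate", twit))))
# Quiz 1: the tweet that mentions "biostats"
print(twit[grepl("biostats", twit)])
# Quiz 1: tweets matching the chess/kickboxing sentence exactly
print(twit[twit == "A computer once beat me at chess, but it was no match for me at kickboxing"])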
## file lineCount meanLength maxLength
## 1 ./final/en_US/en_US.twitter.txt 2360148 68.68045 140
## 2 ./final/en_US/en_US.blogs.txt 899288 229.987 40833
## 3 ./final/en_US/en_US.news.txt 1010242 201.1628 11384
## 4 ./final/en_US/sample.en_US.twitter.txt 2423 68.21461 140
## 5 ./final/en_US/sample.en_US.blogs.txt 895 227.6615 1972
## 6 ./final/en_US/sample.en_US.news.txt 1050 199.2533 884
## totalNchar
## 1 162096031
## 2 206824505
## 3 203223159
## 4 165284
## 5 203757
## 6 209216
## [1] "love/hate= 4.61684895472795"
## [1] "i know how you feel.. i have biostats on tuesday and i have yet to study =/"
## [1] "A computer once beat me at chess, but it was no match for me at kickboxing"
## [2] "A computer once beat me at chess, but it was no match for me at kickboxing"
## [3] "A computer once beat me at chess, but it was no match for me at kickboxing"
Learn how to use the tm package.
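The corpus and matrix output below comes from a chunk that is not echoed; a minimal sketch of the kind of tm calls involved, assuming the sampled Twitter lines are already in a character vector sampTweets (a hypothetical name; the matrix indices are arbitrary), might be:

# build a volatile corpus in which every sampled line is one plain-text document
corp <- VCorpus(VectorSource(sampTweets))
inspect(corp[1:3])    # first three documents
inspect(corp[1:6])    # first six documents
# term-document matrix with raw term frequencies
tdm <- TermDocumentMatrix(corp)
inspect(tdm[100:102, 1:3])
# document-term matrix with tf-idf weighting
dtm <- DocumentTermMatrix(corp, control = list(weighting = weightTfIdf))
inspect(dtm[1:3, 200:203])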
## <<VCorpus (documents: 3, metadata (corpus/indexed): 0/0)>>
##
## [[1]]
## <<PlainTextDocument (metadata: 7)>>
## Being able to skype with family members.
##
## [[2]]
## <<PlainTextDocument (metadata: 7)>>
## thanks Bill!! Feeling better????
##
## [[3]]
## <<PlainTextDocument (metadata: 7)>>
## In politics stupidity is not a handicap. Napoleon Bonaparte
## <<VCorpus (documents: 6, metadata (corpus/indexed): 0/0)>>
##
## [[1]]
## <<PlainTextDocument (metadata: 7)>>
## Being able to skype with family members.
##
## [[2]]
## <<PlainTextDocument (metadata: 7)>>
## thanks Bill!! Feeling better????
##
## [[3]]
## <<PlainTextDocument (metadata: 7)>>
## In politics stupidity is not a handicap. Napoleon Bonaparte
##
## [[4]]
## <<PlainTextDocument (metadata: 7)>>
## Sigh. Because I obviously don't have enough to do this week with HULLS2 and everything else.
##
## [[5]]
## <<PlainTextDocument (metadata: 7)>>
## Jamaican jerk chicken is on; ride on over and get some!
##
## [[6]]
## <<PlainTextDocument (metadata: 7)>>
## I ♥ my daughter; I ♥ my boyfriend; I ♥ my family...some ppl just make me laugh...=)
## <<TermDocumentMatrix (terms: 3, documents: 3)>>
## Non-/sparse entries: 0/9
## Sparsity : 100%
## Maximal term length: 13
## Weighting : term frequency (tf)
##
## Docs
## Terms 1 2 3
## 3wordsforyous 0 0 0
## 400 0 0 0
## 40000 0 0 0
## <<DocumentTermMatrix (documents: 3, terms: 4)>>
## Non-/sparse entries: 0/12
## Sparsity : 100%
## Maximal term length: 11
## Weighting : term frequency - inverse document frequency (tf-idf)
##
## Terms
## Docs "hatin'"... "have "hello" "hey
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
Examples from the help pages of the tm package, using the “crude” data set. For reference only.
data("crude")
MC_tokenizer(crude[[1]])
scan_tokenizer(crude[[1]])
strsplit_space_tokenizer <- function(x)
unlist(strsplit(as.character(x), "[[:space:]]+"))
strsplit_space_tokenizer(crude[[1]])
data("crude")
tdm <- TermDocumentMatrix(crude,
control = list(removePunctuation = TRUE,
stopwords = TRUE))
dtm <- DocumentTermMatrix(crude,
control = list(weighting =
function(x)
weightTfIdf(x, normalize =
FALSE),
stopwords = TRUE))
inspect(tdm[202:205, 1:3])
inspect(tdm[c("price", "texas"), c("127", "144", "191", "194")])
inspect(dtm[1:5, 273:276])
Read and save a list of words for profanity tagging.
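The chunk itself is not echoed; a sketch of the approach, assuming a one-word-per-line profanity list has already been downloaded into dir1 as badwords.txt (hypothetical file and object names), might be:

# read one profanity term per line, using a binary connection as before
con <- file(paste0(dir1, "badwords.txt"), open = "rb")
profanity <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
saveFile <- paste(dir2, "profanity.RData", sep = "")
save(profanity, file = saveFile)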
Enter text for questions on Quizzes 2 and 3.
qb1<-"The guy in front of me just bought a pound of bacon, a bouquet, and a case of"
qb2<-"You're the reason why I smile everyday. Can you follow me please? It would mean the"
qb3<-"Hey sunshine, can you follow me and make me the"
qb4<-"Very early observations on the Bills game: Offense still struggling but the"
qb5<-"Go on a romantic date at the"
qb6<-"Well I'm pretty sure my granny has some old bagpipes in her garage I'll dust them off and be on my"
qb7<-"Ohhhhh #PointBreak is on tomorrow. Love that film and haven't seen it in quite some"
qb8<-"After the ice bucket challenge Louis will push his long wet hair out of his eyes with his little"
qb9<-"Be grateful for the good times and keep the faith during the"
qb10<-"If this isn't the cutest thing you've ever seen, then you must be"
qc1<-"When you breathe, I want to be the air for you. I'll be there for you, I'd live and I'd"
qc2<-"Guy at my table's wife got up to go to the bathroom and I asked about dessert and he started telling me about his"
qc3<-"I'd give anything to see arctic monkeys this"
qc4<-"Talking to your mom has the same effect as a hug and helps reduce your"
qc5<-"When you were in Holland you were like 1 inch away from me but you hadn't time to take a"
qc6<-"I'd just like all of these questions answered, a presentation of evidence, and a jury to settle the"
qc7<-"I can't deal with unsymetrical things. I can't even hold an uneven number of bags of groceries in each"
qc8<-"Every inch of you is perfect from the bottom to the"
qc9<-"I’m thankful my childhood was filled with imagination and bruises from playing"
qc10<-"I like how the same people are in almost all of Adam Sandler's"
qblist<-list(qb1,qb2,qb3,qb4,qb5,qb6,qb7,qb8,qb9,qb10)
qclist<-list(qc1,qc2,qc3,qc4,qc5,qc6,qc7,qc8,qc9,qc10)
saveFile<-paste(dir2,"quiz23.RData",sep="")
save(qblist,qclist,file=saveFile)
Inspect the blogs list for character-grams that match questions on Quiz 2.
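The inspection chunks are not echoed; the idea is to take the trailing ten characters of each quiz sentence as a character-gram, count the lines containing it, and tabulate the word that follows it. A minimal sketch, assuming the blog lines are in a character vector blogLines and using a hypothetical helper cgramStats, is shown here; the same code is applied to the news and Twitter lines in the next two sections.

cgramStats <- function(lines, question, width = 10) {
    # trailing character-gram of the quiz sentence, e.g. "ng but the"
    cgram <- substr(question, nchar(question) - width + 1, nchar(question))
    hits  <- grep(cgram, lines, fixed = TRUE, value = TRUE)
    # text that follows the character-gram in each matching line
    after <- substring(hits, regexpr(cgram, hits, fixed = TRUE) + nchar(cgram))
    # first word of that text, tabulated as candidate next words
    nextWord <- sapply(strsplit(trimws(after), "[[:space:]]+"), `[`, 1)
    list(cgramText = cgram, cgramCount = length(hits),
         nextWords = sort(table(nextWord), decreasing = TRUE))
}
cgramStats(blogLines, qb1)   # e.g. counts of the words that follow "a case of"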
## cgramText cgramCount nw1Text nw1Count nw2Text nw2Count nw3Text
## 1 a case of 251 beer 8 making 3 much
## 2 d mean the 31 difference 6 world 5 death
## 3 ake me the 25 ages 1 agi 1 best
## 4 ng but the 290 truth 8 best 7 kitchen
## 5 ate at the 141 end 15 same 9 time
## 6 d be on my 10 blog 1 family 1 feet
## 7 quite some 305 time 247 since 12 thing
## 8 his little 1488 girl 52 guy 44 blog
## 9 during the 5668 day 286 week 191 summer
## 10 ou must be 234 able 19 follower 8 wondering
## nw3Count nw4Text nw4Count
## 1 3 the 3
## 2 2 would 2
## 3 1 borin 1
## 4 6 was 5
## 5 8 moment 6
## 6 1 flash 1
## 7 12 after 3
## 8 28 gem 26
## 9 122 first 110
## 10 8 willing 6
Inspect the news list for character-grams that match questions on Quiz 2.
## cgramText cgramCount nw1Text nw1Count nw2Text nw2Count nw3Text
## 1 a case of 142 mistaken 9 being 4 beer
## 2 d mean the 68 difference 13 end 9 loss
## 3 ake me the 11 happiest 2 but 1 coolest
## 4 ng but the 116 best 20 kitchen 3 not
## 5 ate at the 178 time 18 end 10 University
## 6 d be on my 3 guard 1 own 1 short
## 7 quite some 79 time 75 about 1 becau
## 8 his little 157 brother 14 girl 10 sister
## 9 during the 8826 first 353 regular 223 season
## 10 ou must be 48 olde 3 the 3 pretty
## nw3Count nw4Text nw4Count
## 1 3 first 3
## 2 3 would 2
## 3 1 Cruz 1
## 4 2 utmost 2
## 5 8 City 4
## 6 1 <NA> NA
## 7 1 since 1
## 8 6 known 5
## 9 171 recession 162
## 10 2 able 1
Inspect the Twitter list for character-grams that match questions on Quiz 2.
## cgramText cgramCount nw1Text nw1Count nw2Text nw2Count nw3Text
## 1 a case of 160 Monday 22 beer 8 monday
## 2 d mean the 196 world 171 WORLD 6 absolute
## 3 ake me the 65 happiest 24 re 8 perso
## 4 ng but the 239 best 39 truth 8 make
## 5 ate at the 78 same 10 moment 5 end
## 6 d be on my 13 way 3 dolla 1 doorstep
## 7 quite some 48 time 38 thing 5 company
## 8 his little 228 girl 18 thing 7 brother
## 9 during the 2037 day 155 week 92 game
## 10 ou must be 462 proud 24 change 13 doing
## nw3Count nw4Text nw4Count
## 1 4 the 4
## 2 2 difference 2
## 3 5 happi 2
## 4 7 don 3
## 5 3 airport 2
## 6 1 linked 1
## 7 2 freedom 1
## 8 5 kid 5
## 9 78 summer 73
## 10 10 able 9