This markdown file acquires the English text data for the capstone project. It has the following sections:

General housekeeping and sampling function.

setwd("~/Desktop/Online-Classes/Johns Hopkins Data Science/DSCapstone")
require(R.utils)
require(tm)
require(tau)
require(textcat)
dir1<-"./final/en_US/"   # path to input and sample files 
dir2<-"./final/"         # path to saved RData files
fileb<-"en_US.blogs.txt"; filen<-"en_US.news.txt"; filet<-"en_US.twitter.txt"

# define a function that uses the shell perl command to sample a text file
#  example usage:  samp.001(dir,file)
#    dir  - is the path to the file location
#    file - is the original file name
#    side effect - creates and saves a text file about 1/1000 the size of the original
#    value - the name of the saved file including path: "dir.sample.file"
# reference:  http://stackoverflow.com/questions/22261082/load-a-small-random-sample-from-a-large-csv-file-into-r-data-frame

samp.001<- function(dir,file) {  
        perlsample<-"perl -ne 'print if (rand() < .001)' "
        sampleb<-paste(perlsample,dir,file," > ",dir,"sample.",file,sep="")
        system(sampleb)                       # run the shell command
        paste(dir,"sample.",file,sep="")      # return the sample file name
}
saveFile<-paste(dir2,"always1.RData",sep="")
save(dir1,fileb,filen,filet,samp.001,file=saveFile);

Create three random sample files from the three original data files using the “samp.001” function. Each file is about 1/1000 the size of the original and is used to develop the data algorithms without reading in the entire corpus. Run this only once.

## only run this code one time, or it will overwrite the sample files with new random data
sampb<-samp.001(dir1,fileb)  #create sample.en_US.blogs.txt
sampn<-samp.001(dir1,filen)  #create sample.en_US.news.txt
sampt<-samp.001(dir1,filet)  #create sample.en_US.twitter.txt
saveFile<-paste(dir2,"runonce1.RData",sep="")
save(sampb,sampn,sampt,file=saveFile);

Read the data: use readLines with binary connections to input the text files, putting the text lines of each file into a large list. Calculate the number of lines per file, the length of each line, and the maximum line length. Save the results. Only run once. A sketch of this step follows.
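The original chunk is not reproduced here. The sketch below shows one way to do it; the helper name readStats, the object names, and the “runonce2.RData” file name are assumptions, not the original code.

readStats <- function(dir,file) {
        con <- file(paste(dir,file,sep=""), open="rb")     # binary connection
        lines <- readLines(con, encoding="UTF-8", skipNul=TRUE)
        close(con)
        lens <- nchar(lines)
        list(lines=lines,
             stats=data.frame(file=paste(dir,file,sep=""),
                              lineCount=length(lines), meanLength=mean(lens),
                              maxLength=max(lens), totalNchar=sum(lens)))
}
twitt <- readStats(dir1,filet)    # repeat for the other files and the samples
saveFile <- paste(dir2,"runonce2.RData",sep="")   # assumed file name
save(twitt, file=saveFile)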

Print text file statistics. Use the US tweet list to explore questions in Quiz 1.
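The exploration chunk is not shown; a minimal sketch, assuming the readStats helper above was run on all six files (the object names blogs, news, stwitt, sblogs, and snews are assumptions):

print(rbind(twitt$stats, blogs$stats, news$stats,
            stwitt$stats, sblogs$stats, snews$stats))
love <- sum(grepl("love", twitt$lines))          # Quiz 1: love/hate ratio
hate <- sum(grepl("hate", twitt$lines))
print(paste("love/hate=  ", love/hate))
twitt$lines[grepl("biostats", twitt$lines)]      # Quiz 1: the biostats tweet
twitt$lines[twitt$lines ==                       # Quiz 1: exact-match lookup
    "A computer once beat me at chess, but it was no match for me at kickboxing"]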

##                                     file lineCount meanLength maxLength totalNchar
## 1        ./final/en_US/en_US.twitter.txt   2360148   68.68045       140  162096031
## 2          ./final/en_US/en_US.blogs.txt    899288    229.987     40833  206824505
## 3           ./final/en_US/en_US.news.txt   1010242   201.1628     11384  203223159
## 4 ./final/en_US/sample.en_US.twitter.txt      2423   68.21461       140     165284
## 5   ./final/en_US/sample.en_US.blogs.txt       895   227.6615      1972     203757
## 6    ./final/en_US/sample.en_US.news.txt      1050   199.2533       884     209216
## [1] "love/hate=   4.61684895472795"
## [1] "i know how you feel.. i have biostats on tuesday and i have yet to study =/"
## [1] "A computer once beat me at chess, but it was no match for me at kickboxing"
## [2] "A computer once beat me at chess, but it was no match for me at kickboxing"
## [3] "A computer once beat me at chess, but it was no match for me at kickboxing"

Learn how to use the tm package.
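The experimentation chunk is not shown. A minimal sketch of tm calls that produce output like the listing below, assuming the documents come from the Twitter sample list above (the index ranges are illustrative):

stwitt.corp <- VCorpus(VectorSource(stwitt$lines))   # corpus of sample tweets
inspect(stwitt.corp[1:3])                            # three-document view
inspect(stwitt.corp[1:6])                            # six-document view
tdm <- TermDocumentMatrix(stwitt.corp)
inspect(tdm[202:204, 1:3])
dtm <- DocumentTermMatrix(stwitt.corp,
                          control=list(weighting=function(x)
                                       weightTfIdf(x, normalize=FALSE)))
inspect(dtm[1:3, 310:313])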

## <<VCorpus (documents: 3, metadata (corpus/indexed): 0/0)>>
## 
## [[1]]
## <<PlainTextDocument (metadata: 7)>>
## Being able to skype with family members.
## 
## [[2]]
## <<PlainTextDocument (metadata: 7)>>
## thanks Bill!! Feeling better????
## 
## [[3]]
## <<PlainTextDocument (metadata: 7)>>
## In politics stupidity is not a handicap. Napoleon Bonaparte
## <<VCorpus (documents: 6, metadata (corpus/indexed): 0/0)>>
## 
## [[1]]
## <<PlainTextDocument (metadata: 7)>>
## Being able to skype with family members.
## 
## [[2]]
## <<PlainTextDocument (metadata: 7)>>
## thanks Bill!! Feeling better????
## 
## [[3]]
## <<PlainTextDocument (metadata: 7)>>
## In politics stupidity is not a handicap. Napoleon Bonaparte
## 
## [[4]]
## <<PlainTextDocument (metadata: 7)>>
## Sigh. Because I obviously don't have enough to do this week with HULLS2 and everything else.
## 
## [[5]]
## <<PlainTextDocument (metadata: 7)>>
## Jamaican jerk chicken is on; ride on over and get some!
## 
## [[6]]
## <<PlainTextDocument (metadata: 7)>>
## I ♥ my daughter; I ♥ my boyfriend; I ♥ my family...some ppl just make me laugh...=)
## <<TermDocumentMatrix (terms: 3, documents: 3)>>
## Non-/sparse entries: 0/9
## Sparsity           : 100%
## Maximal term length: 13
## Weighting          : term frequency (tf)
## 
##                Docs
## Terms           1 2 3
##   3wordsforyous 0 0 0
##   400           0 0 0
##   40000         0 0 0
## <<DocumentTermMatrix (documents: 3, terms: 4)>>
## Non-/sparse entries: 0/12
## Sparsity           : 100%
## Maximal term length: 11
## Weighting          : term frequency - inverse document frequency (tf-idf)
## 
##     Terms
## Docs "hatin'"... "have "hello" "hey
##    1           0     0       0    0
##    2           0     0       0    0
##    3           0     0       0    0

Examples from the help pages of the tm package, using the “crude” dataset. For reference only.

data("crude")
MC_tokenizer(crude[[1]])
scan_tokenizer(crude[[1]])
strsplit_space_tokenizer <- function(x)
    unlist(strsplit(as.character(x), "[[:space:]]+"))
strsplit_space_tokenizer(crude[[1]])
data("crude")
tdm <- TermDocumentMatrix(crude,
                          control = list(removePunctuation = TRUE,
                                         stopwords = TRUE))
dtm <- DocumentTermMatrix(crude,
                          control = list(weighting =
                                         function(x)
                                         weightTfIdf(x, normalize =
                                                     FALSE),
                                         stopwords = TRUE))
inspect(tdm[202:205, 1:3])
inspect(tdm[c("price", "texas"), c("127", "144", "191", "194")])
inspect(dtm[1:5, 273:276])

Read and save a list of words for profanity tagging.
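The word-list file itself is not included here; the file name “profanity.txt”, the object name, and the “profanity1.RData” file name are assumptions.

profanity <- readLines(paste(dir2,"profanity.txt",sep=""), skipNul=TRUE)
profanity <- unique(tolower(profanity))      # normalize the word list
saveFile <- paste(dir2,"profanity1.RData",sep="")
save(profanity, file=saveFile)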

Enter text for the questions on Quizzes 2 and 3.

qb1<-"The guy in front of me just bought a pound of bacon, a bouquet, and a case of"
qb2<-"You're the reason why I smile everyday. Can you follow me please? It would mean the"
qb3<-"Hey sunshine, can you follow me and make me the"
qb4<-"Very early observations on the Bills game: Offense still struggling but the"
qb5<-"Go on a romantic date at the"
qb6<-"Well I'm pretty sure my granny has some old bagpipes in her garage I'll dust them off and be on my"
qb7<-"Ohhhhh #PointBreak is on tomorrow. Love that film and haven't seen it in quite some"
qb8<-"After the ice bucket challenge Louis will push his long wet hair out of his eyes with his little"
qb9<-"Be grateful for the good times and keep the faith during the"
qb10<-"If this isn't the cutest thing you've ever seen, then you must be"

qc1<-"When you breathe, I want to be the air for you. I'll be there for you, I'd live and I'd"
qc2<-"Guy at my table's wife got up to go to the bathroom and I asked about dessert and he started telling me about his"
qc3<-"I'd give anything to see arctic monkeys this"
qc4<-"Talking to your mom has the same effect as a hug and helps reduce your"
qc5<-"When you were in Holland you were like 1 inch away from me but you hadn't time to take a"
qc6<-"I'd just like all of these questions answered, a presentation of evidence, and a jury to settle the"
qc7<-"I can't deal with unsymetrical things. I can't even hold an uneven number of bags of groceries in each"
qc8<-"Every inch of you is perfect from the bottom to the"
qc9<-"I’m thankful my childhood was filled with imagination and bruises from playing"
qc10<-"I like how the same people are in almost all of Adam Sandler's"
qblist<-list(qb1,qb2,qb3,qb4,qb5,qb6,qb7,qb8,qb9,qb10)
qclist<-list(qc1,qc2,qc3,qc4,qc5,qc6,qc7,qc8,qc9,qc10)

saveFile<-paste(dir2,"quiz23.RData",sep="")
save(qblist,qclist,file=saveFile)

Inspect the blogs list for character-grams that match the questions on Quiz 2. A sketch of the matching code follows; the same approach is applied to the news and Twitter lists below.
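The matching chunk is not shown. A minimal sketch of the idea: take the last ten characters of each quiz sentence as a character-gram, count the lines that contain it, and tabulate the word that follows it. The helper name cgramNextWords and the object blogs$lines are assumptions.

cgramNextWords <- function(lines, question, n=10) {
        cgram <- substring(question, nchar(question)-n+1)   # last n characters
        hits  <- grep(cgram, lines, fixed=TRUE, value=TRUE)
        after <- sapply(strsplit(hits, cgram, fixed=TRUE), `[`, 2)
        # keep the first word after the character-gram
        nextw <- sub("^[[:space:][:punct:]]*([[:alnum:]']+).*$", "\\1",
                     after[!is.na(after)])
        nextw <- nextw[nextw != ""]
        list(cgramText=cgram, cgramCount=length(hits),
             nextWords=sort(table(nextw), decreasing=TRUE))
}
cgramNextWords(blogs$lines, qb5)   # "ate at the" -> end, same, time, ...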

##     cgramText cgramCount    nw1Text nw1Count  nw2Text nw2Count   nw3Text nw3Count nw4Text nw4Count
## 1   a case of        251       beer        8   making        3      much        3     the        3
## 2  d mean the         31 difference        6    world        5     death        2   would        2
## 3  ake me the         25       ages        1      agi        1      best        1   borin        1
## 4  ng but the        290      truth        8     best        7   kitchen        6     was        5
## 5  ate at the        141        end       15     same        9      time        8  moment        6
## 6  d be on my         10       blog        1   family        1      feet        1   flash        1
## 7  quite some        305       time      247    since       12     thing       12   after        3
## 8  his little       1488       girl       52      guy       44      blog       28     gem       26
## 9  during the       5668        day      286     week      191    summer      122   first      110
## 10 ou must be        234       able       19 follower        8 wondering        8 willing        6

Inspect the news list for character-grams that match the questions on Quiz 2.

##     cgramText cgramCount    nw1Text nw1Count nw2Text nw2Count    nw3Text nw3Count   nw4Text nw4Count
## 1   a case of        142   mistaken        9   being        4       beer        3     first        3
## 2  d mean the         68 difference       13     end        9       loss        3     would        2
## 3  ake me the         11   happiest        2     but        1    coolest        1      Cruz        1
## 4  ng but the        116       best       20 kitchen        3        not        2    utmost        2
## 5  ate at the        178       time       18     end       10 University        8      City        4
## 6  d be on my          3      guard        1     own        1      short        1      <NA>       NA
## 7  quite some         79       time       75   about        1      becau        1     since        1
## 8  his little        157    brother       14    girl       10     sister        6     known        5
## 9  during the       8826      first      353 regular      223     season      171 recession      162
## 10 ou must be         48       olde        3     the        3     pretty        2      able        1

Inspect the Twitter list for character-grams that match the questions on Quiz 2.

##     cgramText cgramCount  nw1Text nw1Count nw2Text nw2Count  nw3Text nw3Count    nw4Text nw4Count
## 1   a case of        160   Monday       22    beer        8   monday        4        the        4
## 2  d mean the        196    world      171   WORLD        6 absolute        2 difference        2
## 3  ake me the         65 happiest       24      re        8    perso        5      happi        2
## 4  ng but the        239     best       39   truth        8     make        7        don        3
## 5  ate at the         78     same       10  moment        5      end        3    airport        2
## 6  d be on my         13      way        3   dolla        1 doorstep        1     linked        1
## 7  quite some         48     time       38   thing        5  company        2    freedom        1
## 8  his little        228     girl       18   thing        7  brother        5        kid        5
## 9  during the       2037      day      155    week       92     game       78     summer       73
## 10 ou must be        462    proud       24  change       13    doing       10       able        9