Albert Y. Kim
Monday 2015/04/13
Working with text data can be a real PITA, as there are many different character encodings, i.e. ways of representing characters on a computer.
Converting between them can be a real nuisance, as some characters don't translate well, such as accented letters, spaces, and punctuation.
We'll use UTF-8.
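As a rough illustration of why conversion matters, the sketch below builds the Latin-1 bytes for the word "café" by hand and re-encodes them to UTF-8 with base R's iconv(). The example word and the variable names are assumptions made for illustration only.

```r
# Latin-1 bytes for "cafe" with an acute accent on the e:
# in Latin-1 the accented e is the single byte 0xE9
latin1_text <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9)))

# Read as UTF-8, those bytes are not a valid sequence
validUTF8(latin1_text)

# Re-encode to UTF-8, where the accented e takes two bytes (0xC3 0xA9)
utf8_text <- iconv(latin1_text, from = "latin1", to = "UTF-8")
validUTF8(utf8_text)
charToRaw(utf8_text)
```

The same three ASCII letters keep their bytes in both encodings; only the accented letter changes, which is why problems tend to show up on accented or otherwise special characters.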
Natural language processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages.
The tm (text mining) package puts many such tools at our disposal, including a built-in list of English stop words:
stopwords("english")
[1] "i" "me" "my"
[4] "myself" "we" "our"
[7] "ours" "ourselves" "you"
[10] "your" "yours" "yourself"
[13] "yourselves" "he" "him"
[16] "his" "himself" "she"
[19] "her" "hers" "herself"
[22] "it" "its" "itself"
[25] "they" "them" "their"
[28] "theirs" "themselves" "what"
[31] "which" "who" "whom"
[34] "this" "that" "these"
[37] "those" "am" "is"
[40] "are" "was" "were"
[43] "be" "been" "being"
[46] "have" "has" "had"
[49] "having" "do" "does"
[52] "did" "doing" "would"
[55] "should" "could" "ought"
[58] "i'm" "you're" "he's"
[61] "she's" "it's" "we're"
[64] "they're" "i've" "you've"
[67] "we've" "they've" "i'd"
[70] "you'd" "he'd" "she'd"
[73] "we'd" "they'd" "i'll"
[76] "you'll" "he'll" "she'll"
[79] "we'll" "they'll" "isn't"
[82] "aren't" "wasn't" "weren't"
[85] "hasn't" "haven't" "hadn't"
[88] "doesn't" "don't" "didn't"
[91] "won't" "wouldn't" "shan't"
[94] "shouldn't" "can't" "cannot"
[97] "couldn't" "mustn't" "let's"
[100] "that's" "who's" "what's"
[103] "here's" "there's" "when's"
[106] "where's" "why's" "how's"
[109] "a" "an" "the"
[112] "and" "but" "if"
[115] "or" "because" "as"
[118] "until" "while" "of"
[121] "at" "by" "for"
[124] "with" "about" "against"
[127] "between" "into" "through"
[130] "during" "before" "after"
[133] "above" "below" "to"
[136] "from" "up" "down"
[139] "in" "out" "on"
[142] "off" "over" "under"
[145] "again" "further" "then"
[148] "once" "here" "there"
[151] "when" "where" "why"
[154] "how" "all" "any"
[157] "both" "each" "few"
[160] "more" "most" "other"
[163] "some" "such" "no"
[166] "nor" "not" "only"
[169] "own" "same" "so"
[172] "than" "too" "very"
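To see what such a list is for, here is a minimal base-R sketch of stop word removal. It uses only a handful of the words printed above rather than the full list, and the essay snippet and variable names are made up for illustration.

```r
# A handful of stop words copied from the list above (not the full set)
stop_words <- c("i", "me", "my", "the", "a", "and", "is", "to")

# Hypothetical essay snippet, made up for illustration
essay <- "I love the mountains and my dog is happy to hike"

# Lowercase, then split on anything that is not a letter or an apostrophe
tokens <- strsplit(tolower(essay), "[^a-z']+")[[1]]

# Keep only the tokens that are not stop words
content_words <- tokens[!tokens %in% stop_words]
content_words
```

With tm loaded, removeWords(essay, stopwords("english")) should do a similar job directly on the string, using the full list.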
To see all the color palettes available in the RColorBrewer package, run the following code:
library(RColorBrewer)
par(mar = c(0, 4, 0, 0))
display.brewer.all()
par(mar = c(0, 0, 0, 0))
For the word clouds we create today, a sequential palette is appropriate: we want to convey highs and lows, but there is no divergence point.
The following commands return the hex codes of the colors in a given palette:
brewer.pal(8, "Purples")
brewer.pal(8, "Set3")
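One way a sequential palette could be tied to word frequencies is sketched below: the counts are binned into as many equal-width intervals as there are colors, so rarer words get lighter shades and more frequent words get darker ones. The word counts are invented for illustration, and the base-R fallback palette is only a stand-in for when RColorBrewer is not installed.

```r
# Hypothetical word counts, made up for illustration
word_freq <- c(love = 40, music = 25, travel = 12, food = 7, books = 3)

# Eight sequential shades, light (low) to dark (high)
pal <- if (requireNamespace("RColorBrewer", quietly = TRUE)) {
  RColorBrewer::brewer.pal(8, "Purples")
} else {
  # Base-R stand-in, not the actual Brewer colors
  colorRampPalette(c("white", "purple4"))(8)
}

# Bin the counts into as many equal-width intervals as there are colors
bins <- cut(word_freq, breaks = length(pal), labels = FALSE)

# Look up one color per word
word_colors <- setNames(pal[bins], names(word_freq))
word_colors
```

Equal-width bins are just one choice; with heavily skewed counts, binning on ranks or on log counts may spread the shades more evenly.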
We're revisiting OkCupid essay data. Using:
stringr and tm packages
We're going to evaluate both