Milestone Report for week 2 of the Coursera Data Science Capstone project
This report explores the statistical properties of the data set that will inform the prediction model behind the final data product, a Shiny application.
The model will be trained on a unified document corpus compiled from three English (en_US) sources of text data: Twitter posts, blogs, and news articles.
Load Necessary Packages
library(tm)
## Loading required package: NLP
library(SnowballC)
library(slam)
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
Data
fileURL1 <- "http://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(fileURL1, destfile="capstone_data.zip")
unzip("capstone_data.zip")
Reading Data
en.US.twitter <- readLines("final/en_US/en_US.twitter.txt")
## Warning in readLines("final/en_US/en_US.twitter.txt"): line 167155 appears to
## contain an embedded nul
## Warning in readLines("final/en_US/en_US.twitter.txt"): line 268547 appears to
## contain an embedded nul
## Warning in readLines("final/en_US/en_US.twitter.txt"): line 1274086 appears to
## contain an embedded nul
## Warning in readLines("final/en_US/en_US.twitter.txt"): line 1759032 appears to
## contain an embedded nul
en.US.blogs <- readLines("final/en_US/en_US.blogs.txt")
en.US.news <- readLines("final/en_US/en_US.news.txt")
head(en.US.twitter, 5)
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
## [3] "they've decided its more fun if I don't."
## [4] "So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;)"
## [5] "Words from a complete stranger! Made my birthday even better :)"
tail(en.US.twitter, 5)
## [1] "what's good. I see the success you got poppin in yo area."
## [2] "RT : Consumers are visual. They want data at their finger tips. Mobile is the only way to deliver this, 24/7."
## [3] "u welcome"
## [4] "It is #RHONJ time!!"
## [5] "The key to keeping your woman happy= attention, affection, treat her like a queen and sex her like a pornstar!"
head(en.US.blogs, 5)
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”."
## [2] "We love you Mr. Brown."
## [3] "Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him."
## [4] "so anyways, i am going to share some home decor inspiration that i have been storing in my folder on the puter. i have all these amazing images stored away ready to come to life when we get our home."
## [5] "With graduation season right around the corner, Nancy has whipped up a fun set to help you out with not only your graduation cards and gifts, but any occasion that brings on a change in one's life. I stamped the images in Memento Tuxedo Black and cut them out with circle Nestabilities. I embossed the kraft and red cardstock with TE's new Stars Impressions Plate, which is double sided and gives you 2 fantastic patterns. You can see how to use the Impressions Plates in this tutorial Taylor created. Just one pass through your die cut machine using the Embossing Pad Kit is all you need to do - super easy!"
tail(en.US.blogs, 5)
## [1] "The hulking mass of unfinished brick and concrete at 20-13 35th St. is so unsightly, it became a poster child for zoning reform."
## [2] "The 2004 IIFA award ceremony witnessed a contingent of over 450 stars, celebrities, cricketers, industrialists and government leaders over the festive weekend."
## [3] "Plus, I have also been allowing myself not to get ‘stressed’ over things that have not been done! If the ironing is not done right now, it’s not the end of the world! If that phone call is made tomorrow rather than today, then that’s OK too! Living in the moment and allowing myself the time to get ‘back to feeling great’!"
## [4] "(5) What's the barrier to entry and why is the business sustainable?"
## [5] "In response to an over-whelming number of comments we sat down and created a list of do (s) and don’t (s) – these recommendations are easy to follow and except for - adding some herbs to your rinse . So let’s get begin…"
head(en.US.news, 5)
## [1] "He wasn't home alone, apparently."
## [2] "The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s."
## [3] "WSU's plans quickly became a hot topic on local online sites. Though most people applauded plans for the new biomedical center, many deplored the potential loss of the building."
## [4] "The Alaimo Group of Mount Holly was up for a contract last fall to evaluate and suggest improvements to Trenton Water Works. But campaign finance records released this week show the two employees donated a total of $4,500 to the political action committee (PAC) Partners for Progress in early June. Partners for Progress reported it gave more than $10,000 in both direct and in-kind contributions to Mayor Tony Mack in the two weeks leading up to his victory in the mayoral runoff election June 15."
## [5] "And when it's often difficult to predict a law's impact, legislators should think twice before carrying any bill. Is it absolutely necessary? Is it an issue serious enough to merit their attention? Will it definitely not make the situation worse?"
tail(en.US.news, 5)
## [1] "Serve a taste of spring: Chop fresh vegetables, olives, cheeses and some grilled chicken. Set out a selection of salad dressings, and let guests assemble their own chopped salads. These numbered jars hold the dressings in style, $5 each at Anthropologie. See anthropologie.com for stores."
## [2] "The complaint alleges that Kuvan Adil Piromari, 42, of U.S. Driving School in El Cajon, served as a go-between for the applicants and state employees. No one answered the phone at his office."
## [3] "But I'm in the mood. After six or more months of chill and ice crystals in Northeast Ohio, the ground is soft and fragrant. Seemingly overnight, things are growing as if we were in the tropics. We are again producing fruit of the earth: sweet corn, mightily fragrant herbs, deep green and tender broccoli."
## [4] "That starts this Sunday at Chivas. The Goats aren't a great team, but they just beat one (a 1-0 win over Salt Lake at Rio Tinto). They also have the one player who can rival Roger Espinoza as \"The Best Guy in MLS That No One Talks About Because He Doesn't Play in New York, LA or the Pacific Northwest\" in goalkeeper Dan Kennedy. These will be tough points."
## [5] "The only outwardly religious adornment was a billboard-sized banner with an image of Our Lady of Charity, patron saint of Cuba, hanging on the side of the National Library."
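The embedded-nul warnings raised while reading the Twitter file can usually be avoided with the skipNul argument of readLines. The following is a minimal sketch, assuming a small temporary file stands in for the real data; the file contents and the name demo.lines are made up for illustration.

```r
# Build a tiny stand-in file whose second line contains an embedded nul byte.
tmp <- tempfile(fileext = ".txt")
writeBin(c(charToRaw("first line\nsec"), as.raw(0), charToRaw("ond line\n")), tmp)

# skipNul = TRUE drops the nul bytes instead of warning about them.
demo.lines <- readLines(tmp, encoding = "UTF-8", skipNul = TRUE)
print(demo.lines)
```

The same argument could be added to the three readLines calls above if the warnings are unwanted.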
File Size
# File sizes in MiB; the paths match those used with readLines() above
twitter.filesize <- file.info("final/en_US/en_US.twitter.txt")$size / (1024^2)
blogs.filesize <- file.info("final/en_US/en_US.blogs.txt")$size / (1024^2)
news.filesize <- file.info("final/en_US/en_US.news.txt")$size / (1024^2)
No. of lines
twitter.length <- length(en.US.twitter)
blogs.length <- length(en.US.blogs)
news.length <- length(en.US.news)
No. of characters
twitter.char <- sum(nchar(en.US.twitter))
blogs.char <- sum(nchar(en.US.blogs))
news.char <- sum(nchar(en.US.news))
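The line and character counts are computed but never printed. One way to report them side by side is a small summary table. This is only a sketch: the three short vectors below are hypothetical stand-ins for en.US.twitter, en.US.blogs and en.US.news, and summary.table is an illustrative name.

```r
# Hypothetical stand-ins for the three character vectors read from disk.
texts <- list(
  twitter = c("u welcome", "It is #RHONJ time!!"),
  blogs   = c("We love you Mr. Brown."),
  news    = c("He wasn't home alone, apparently.")
)

# One row per source: number of lines and total number of characters.
summary.table <- data.frame(
  source = names(texts),
  lines  = vapply(texts, length, integer(1)),
  chars  = vapply(texts, function(x) sum(nchar(x)), integer(1)),
  row.names = NULL
)
print(summary.table)
```

With the real data, the list would hold the full vectors, and the file sizes could be added as a further column.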
Corpus
# Draw a uniform 1% sample of lines without replacement.
# (Binomial indices would cluster around the middle of each file and repeat lines.)
twitter.sample <- sample(en.US.twitter, floor(length(en.US.twitter) * 0.01))
blogs.sample <- sample(en.US.blogs, floor(length(en.US.blogs) * 0.01))
news.sample <- sample(en.US.news, floor(length(en.US.news) * 0.01))
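Because the draw is random, every run of the report produces a different sample and therefore slightly different frequencies. A sketch of reproducible sampling follows, assuming a fixed seed is acceptable; sample.lines, its default seed and the toy vector are hypothetical and not part of the original analysis.

```r
# sample.lines() is a hypothetical helper for a repeatable uniform sample.
sample.lines <- function(lines, fraction = 0.01, seed = 1234) {
  set.seed(seed)                                   # same seed, same sample
  n <- max(1, floor(length(lines) * fraction))     # keep at least one line
  lines[sample.int(length(lines), size = n)]       # uniform, without replacement
}

toy <- paste("line", 1:1000)
toy.sample <- sample.lines(toy)
length(toy.sample)
```

Calling sample.lines twice on the same input returns the same lines, which keeps the later plots stable between runs.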
Storing subsets of 3 datasets
dir.create("subset", showWarnings = FALSE)
write(twitter.sample, file = "subset/twitter_sample.data")
write(blogs.sample, file = "subset/blogs_sample.data")
write(news.sample, file = "subset/news_sample.data")
Forming a Corpus
corpus <- Corpus(DirSource("subset/"), readerControl = list(language="en_US"))
summary(corpus)
## Length Class Mode
## blogs_sample.data 2 PlainTextDocument list
## news_sample.data 2 PlainTextDocument list
## twitter_sample.data 2 PlainTextDocument list
Determining Profanities
profanity <- read.csv("http://www.cs.cmu.edu/~biglou/resources/bad-words.txt")
profanity[,1]
## [1] "abo" "abortion" "abuse"
## [4] "addict" "addicts" "adult"
## [7] "africa" "african" "alla"
## [10] "allah" "alligatorbait" "amateur"
## [13] "american" "anal" "analannie"
## [16] "analsex" "angie" "angry"
## [19] "anus" "arab" "arabs"
## [22] "areola" "argie" "aroused"
## [25] "arse" "arsehole" "asian"
## [28] "ass" "assassin" "assassinate"
## [31] "assassination" "assault" "assbagger"
## [34] "assblaster" "assclown" "asscowboy"
## [37] "asses" "assfuck" "assfucker"
## [40] "asshat" "asshole" "assholes"
## [43] "asshore" "assjockey" "asskiss"
## [46] "asskisser" "assklown" "asslick"
## [49] "asslicker" "asslover" "assman"
## [52] "assmonkey" "assmunch" "assmuncher"
## [55] "asspacker" "asspirate" "asspuppies"
## [58] "assranger" "asswhore" "asswipe"
## [61] "athletesfoot" "attack" "australian"
## [64] "babe" "babies" "backdoor"
## [67] "backdoorman" "backseat" "badfuck"
## [70] "balllicker" "balls" "ballsack"
## [73] "banging" "baptist" "barelylegal"
## [76] "barf" "barface" "barfface"
## [79] "bast" "bastard " "bazongas"
## [82] "bazooms" "beaner" "beast"
## [85] "beastality" "beastial" "beastiality"
## [88] "beatoff" "beat-off" "beatyourmeat"
## [91] "beaver" "bestial" "bestiality"
## [94] "bi" "biatch" "bible"
## [97] "bicurious" "bigass" "bigbastard"
## [100] "bigbutt" "bigger" "bisexual"
## [103] "bi-sexual" "bitch" "bitcher"
## [106] "bitches" "bitchez" "bitchin"
## [109] "bitching" "bitchslap" "bitchy"
## [112] "biteme" "black" "blackman"
## [115] "blackout" "blacks" "blind"
## [118] "blow" "blowjob" "boang"
## [121] "bogan" "bohunk" "bollick"
## [124] "bollock" "bomb" "bombers"
## [127] "bombing" "bombs" "bomd"
## [130] "bondage" "boner" "bong"
## [133] "boob" "boobies" "boobs"
## [136] "booby" "boody" "boom"
## [139] "boong" "boonga" "boonie"
## [142] "booty" "bootycall" "bountybar"
## [145] "bra" "brea5t" "breast"
## [148] "breastjob" "breastlover" "breastman"
## [151] "brothel" "bugger" "buggered"
## [154] "buggery" "bullcrap" "bulldike"
## [157] "bulldyke" "bullshit" "bumblefuck"
## [160] "bumfuck" "bunga" "bunghole"
## [163] "buried" "burn" "butchbabes"
## [166] "butchdike" "butchdyke" "butt"
## [169] "buttbang" "butt-bang" "buttface"
## [172] "buttfuck" "butt-fuck" "buttfucker"
## [175] "butt-fucker" "buttfuckers" "butt-fuckers"
## [178] "butthead" "buttman" "buttmunch"
## [181] "buttmuncher" "buttpirate" "buttplug"
## [184] "buttstain" "byatch" "cacker"
## [187] "cameljockey" "cameltoe" "canadian"
## [190] "cancer" "carpetmuncher" "carruth"
## [193] "catholic" "catholics" "cemetery"
## [196] "chav" "cherrypopper" "chickslick"
## [199] "children's" "chin" "chinaman"
## [202] "chinamen" "chinese" "chink"
## [205] "chinky" "choad" "chode"
## [208] "christ" "christian" "church"
## [211] "cigarette" "cigs" "clamdigger"
## [214] "clamdiver" "clit" "clitoris"
## [217] "clogwog" "cocaine" "cock"
## [220] "cockblock" "cockblocker" "cockcowboy"
## [223] "cockfight" "cockhead" "cockknob"
## [226] "cocklicker" "cocklover" "cocknob"
## [229] "cockqueen" "cockrider" "cocksman"
## [232] "cocksmith" "cocksmoker" "cocksucer"
## [235] "cocksuck " "cocksucked " "cocksucker"
## [238] "cocksucking" "cocktail" "cocktease"
## [241] "cocky" "cohee" "coitus"
## [244] "color" "colored" "coloured"
## [247] "commie" "communist" "condom"
## [250] "conservative" "conspiracy" "coolie"
## [253] "cooly" "coon" "coondog"
## [256] "copulate" "cornhole" "corruption"
## [259] "cra5h" "crabs" "crack"
## [262] "crackpipe" "crackwhore" "crack-whore"
## [265] "crap" "crapola" "crapper"
## [268] "crappy" "crash" "creamy"
## [271] "crime" "crimes" "criminal"
## [274] "criminals" "crotch" "crotchjockey"
## [277] "crotchmonkey" "crotchrot" "cum"
## [280] "cumbubble" "cumfest" "cumjockey"
## [283] "cumm" "cummer" "cumming"
## [286] "cumquat" "cumqueen" "cumshot"
## [289] "cunilingus" "cunillingus" "cunn"
## [292] "cunnilingus" "cunntt" "cunt"
## [295] "cunteyed" "cuntfuck" "cuntfucker"
## [298] "cuntlick " "cuntlicker " "cuntlicking "
## [301] "cuntsucker" "cybersex" "cyberslimer"
## [304] "dago" "dahmer" "dammit"
## [307] "damn" "damnation" "damnit"
## [310] "darkie" "darky" "datnigga"
## [313] "dead" "deapthroat" "death"
## [316] "deepthroat" "defecate" "dego"
## [319] "demon" "deposit" "desire"
## [322] "destroy" "deth" "devil"
## [325] "devilworshipper" "dick" "dickbrain"
## [328] "dickforbrains" "dickhead" "dickless"
## [331] "dicklick" "dicklicker" "dickman"
## [334] "dickwad" "dickweed" "diddle"
## [337] "die" "died" "dies"
## [340] "dike" "dildo" "dingleberry"
## [343] "dink" "dipshit" "dipstick"
## [346] "dirty" "disease" "diseases"
## [349] "disturbed" "dive" "dix"
## [352] "dixiedike" "dixiedyke" "doggiestyle"
## [355] "doggystyle" "dong" "doodoo"
## [358] "doo-doo" "doom" "dope"
## [361] "dragqueen" "dragqween" "dripdick"
## [364] "drug" "drunk" "drunken"
## [367] "dumb" "dumbass" "dumbbitch"
## [370] "dumbfuck" "dyefly" "dyke"
## [373] "easyslut" "eatballs" "eatme"
## [376] "eatpussy" "ecstacy" "ejaculate"
## [379] "ejaculated" "ejaculating " "ejaculation"
## [382] "enema" "enemy" "erect"
## [385] "erection" "ero" "escort"
## [388] "ethiopian" "ethnic" "european"
## [391] "evl" "excrement" "execute"
## [394] "executed" "execution" "executioner"
## [397] "explosion" "facefucker" "faeces"
## [400] "fag" "fagging" "faggot"
## [403] "fagot" "failed" "failure"
## [406] "fairies" "fairy" "faith"
## [409] "fannyfucker" "fart" "farted "
## [412] "farting " "farty " "fastfuck"
## [415] "fat" "fatah" "fatass"
## [418] "fatfuck" "fatfucker" "fatso"
## [421] "fckcum" "fear" "feces"
## [424] "felatio " "felch" "felcher"
## [427] "felching" "fellatio" "feltch"
## [430] "feltcher" "feltching" "fetish"
## [433] "fight" "filipina" "filipino"
## [436] "fingerfood" "fingerfuck " "fingerfucked "
## [439] "fingerfucker " "fingerfuckers" "fingerfucking "
## [442] "fire" "firing" "fister"
## [445] "fistfuck" "fistfucked " "fistfucker "
## [448] "fistfucking " "fisting" "flange"
## [451] "flasher" "flatulence" "floo"
## [454] "flydie" "flydye" "fok"
## [457] "fondle" "footaction" "footfuck"
## [460] "footfucker" "footlicker" "footstar"
## [463] "fore" "foreskin" "forni"
## [466] "fornicate" "foursome" "fourtwenty"
## [469] "fraud" "freakfuck" "freakyfucker"
## [472] "freefuck" "fu" "fubar"
## [475] "fuc" "fucck" "fuck"
## [478] "fucka" "fuckable" "fuckbag"
## [481] "fuckbuddy" "fucked" "fuckedup"
## [484] "fucker" "fuckers" "fuckface"
## [487] "fuckfest" "fuckfreak" "fuckfriend"
## [490] "fuckhead" "fuckher" "fuckin"
## [493] "fuckina" "fucking" "fuckingbitch"
## [496] "fuckinnuts" "fuckinright" "fuckit"
## [499] "fuckknob" "fuckme " "fuckmehard"
## [502] "fuckmonkey" "fuckoff" "fuckpig"
## [505] "fucks" "fucktard" "fuckwhore"
## [508] "fuckyou" "fudgepacker" "fugly"
## [511] "fuk" "fuks" "funeral"
## [514] "funfuck" "fungus" "fuuck"
## [517] "gangbang" "gangbanged " "gangbanger"
## [520] "gangsta" "gatorbait" "gay"
## [523] "gaymuthafuckinwhore" "gaysex " "geez"
## [526] "geezer" "geni" "genital"
## [529] "german" "getiton" "gin"
## [532] "ginzo" "gipp" "girls"
## [535] "givehead" "glazeddonut" "gob"
## [538] "god" "godammit" "goddamit"
## [541] "goddammit" "goddamn" "goddamned"
## [544] "goddamnes" "goddamnit" "goddamnmuthafucker"
## [547] "goldenshower" "gonorrehea" "gonzagas"
## [550] "gook" "gotohell" "goy"
## [553] "goyim" "greaseball" "gringo"
## [556] "groe" "gross" "grostulation"
## [559] "gubba" "gummer" "gun"
## [562] "gyp" "gypo" "gypp"
## [565] "gyppie" "gyppo" "gyppy"
## [568] "hamas" "handjob" "hapa"
## [571] "harder" "hardon" "harem"
## [574] "headfuck" "headlights" "hebe"
## [577] "heeb" "hell" "henhouse"
## [580] "heroin" "herpes" "heterosexual"
## [583] "hijack" "hijacker" "hijacking"
## [586] "hillbillies" "hindoo" "hiscock"
## [589] "hitler" "hitlerism" "hitlerist"
## [592] "hiv" "ho" "hobo"
## [595] "hodgie" "hoes" "hole"
## [598] "holestuffer" "homicide" "homo"
## [601] "homobangers" "homosexual" "honger"
## [604] "honk" "honkers" "honkey"
## [607] "honky" "hook" "hooker"
## [610] "hookers" "hooters" "hore"
## [613] "hork" "horn" "horney"
## [616] "horniest" "horny" "horseshit"
## [619] "hosejob" "hoser" "hostage"
## [622] "hotdamn" "hotpussy" "hottotrot"
## [625] "hummer" "husky" "hussy"
## [628] "hustler" "hymen" "hymie"
## [631] "iblowu" "idiot" "ikey"
## [634] "illegal" "incest" "insest"
## [637] "intercourse" "interracial" "intheass"
## [640] "inthebuff" "israel" "israeli"
## [643] "israel's" "italiano" "itch"
## [646] "jackass" "jackoff" "jackshit"
## [649] "jacktheripper" "jade" "jap"
## [652] "japanese" "japcrap" "jebus"
## [655] "jeez" "jerkoff" "jesus"
## [658] "jesuschrist" "jew" "jewish"
## [661] "jiga" "jigaboo" "jigg"
## [664] "jigga" "jiggabo" "jigger "
## [667] "jiggy" "jihad" "jijjiboo"
## [670] "jimfish" "jism" "jiz "
## [673] "jizim" "jizjuice" "jizm "
## [676] "jizz" "jizzim" "jizzum"
## [679] "joint" "juggalo" "jugs"
## [682] "junglebunny" "kaffer" "kaffir"
## [685] "kaffre" "kafir" "kanake"
## [688] "kid" "kigger" "kike"
## [691] "kill" "killed" "killer"
## [694] "killing" "kills" "kink"
## [697] "kinky" "kissass" "kkk"
## [700] "knife" "knockers" "kock"
## [703] "kondum" "koon" "kotex"
## [706] "krap" "krappy" "kraut"
## [709] "kum" "kumbubble" "kumbullbe"
## [712] "kummer" "kumming" "kumquat"
## [715] "kums" "kunilingus" "kunnilingus"
## [718] "kunt" "ky" "kyke"
## [721] "lactate" "laid" "lapdance"
## [724] "latin" "lesbain" "lesbayn"
## [727] "lesbian" "lesbin" "lesbo"
## [730] "lez" "lezbe" "lezbefriends"
## [733] "lezbo" "lezz" "lezzo"
## [736] "liberal" "libido" "licker"
## [739] "lickme" "lies" "limey"
## [742] "limpdick" "limy" "lingerie"
## [745] "liquor" "livesex" "loadedgun"
## [748] "lolita" "looser" "loser"
## [751] "lotion" "lovebone" "lovegoo"
## [754] "lovegun" "lovejuice" "lovemuscle"
## [757] "lovepistol" "loverocket" "lowlife"
## [760] "lsd" "lubejob" "lucifer"
## [763] "luckycammeltoe" "lugan" "lynch"
## [766] "macaca" "mad" "mafia"
## [769] "magicwand" "mams" "manhater"
## [772] "manpaste" "marijuana" "mastabate"
## [775] "mastabater" "masterbate" "masterblaster"
## [778] "mastrabator" "masturbate" "masturbating"
## [781] "mattressprincess" "meatbeatter" "meatrack"
## [784] "meth" "mexican" "mgger"
## [787] "mggor" "mickeyfinn" "mideast"
## [790] "milf" "minority" "mockey"
## [793] "mockie" "mocky" "mofo"
## [796] "moky" "moles" "molest"
## [799] "molestation" "molester" "molestor"
## [802] "moneyshot" "mooncricket" "mormon"
## [805] "moron" "moslem" "mosshead"
## [808] "mothafuck" "mothafucka" "mothafuckaz"
## [811] "mothafucked " "mothafucker" "mothafuckin"
## [814] "mothafucking " "mothafuckings" "motherfuck"
## [817] "motherfucked" "motherfucker" "motherfuckin"
## [820] "motherfucking" "motherfuckings" "motherlovebone"
## [823] "muff" "muffdive" "muffdiver"
## [826] "muffindiver" "mufflikcer" "mulatto"
## [829] "muncher" "munt" "murder"
## [832] "murderer" "muslim" "naked"
## [835] "narcotic" "nasty" "nastybitch"
## [838] "nastyho" "nastyslut" "nastywhore"
## [841] "nazi" "necro" "negro"
## [844] "negroes" "negroid" "negro's"
## [847] "nig" "niger" "nigerian"
## [850] "nigerians" "nigg" "nigga"
## [853] "niggah" "niggaracci" "niggard"
## [856] "niggarded" "niggarding" "niggardliness"
## [859] "niggardliness's" "niggardly" "niggards"
## [862] "niggard's" "niggaz" "nigger"
## [865] "niggerhead" "niggerhole" "niggers"
## [868] "nigger's" "niggle" "niggled"
## [871] "niggles" "niggling" "nigglings"
## [874] "niggor" "niggur" "niglet"
## [877] "nignog" "nigr" "nigra"
## [880] "nigre" "nip" "nipple"
## [883] "nipplering" "nittit" "nlgger"
## [886] "nlggor" "nofuckingway" "nook"
## [889] "nookey" "nookie" "noonan"
## [892] "nooner" "nude" "nudger"
## [895] "nuke" "nutfucker" "nymph"
## [898] "ontherag" "oral" "orga"
## [901] "orgasim " "orgasm" "orgies"
## [904] "orgy" "osama" "paki"
## [907] "palesimian" "palestinian" "pansies"
## [910] "pansy" "panti" "panties"
## [913] "payo" "pearlnecklace" "peck"
## [916] "pecker" "peckerwood" "pee"
## [919] "peehole" "pee-pee" "peepshow"
## [922] "peepshpw" "pendy" "penetration"
## [925] "peni5" "penile" "penis"
## [928] "penises" "penthouse" "period"
## [931] "perv" "phonesex" "phuk"
## [934] "phuked" "phuking" "phukked"
## [937] "phukking" "phungky" "phuq"
## [940] "pi55" "picaninny" "piccaninny"
## [943] "pickaninny" "piker" "pikey"
## [946] "piky" "pimp" "pimped"
## [949] "pimper" "pimpjuic" "pimpjuice"
## [952] "pimpsimp" "pindick" "piss"
## [955] "pissed" "pisser" "pisses "
## [958] "pisshead" "pissin " "pissing"
## [961] "pissoff " "pistol" "pixie"
## [964] "pixy" "playboy" "playgirl"
## [967] "pocha" "pocho" "pocketpool"
## [970] "pohm" "polack" "pom"
## [973] "pommie" "pommy" "poo"
## [976] "poon" "poontang" "poop"
## [979] "pooper" "pooperscooper" "pooping"
## [982] "poorwhitetrash" "popimp" "porchmonkey"
## [985] "porn" "pornflick" "pornking"
## [988] "porno" "pornography" "pornprincess"
## [991] "pot" "poverty" "premature"
## [994] "pric" "prick" "prickhead"
## [997] "primetime" "propaganda" "pros"
## [1000] "prostitute" "protestant" "pu55i"
## [1003] "pu55y" "pube" "pubic"
## [1006] "pubiclice" "pud" "pudboy"
## [1009] "pudd" "puddboy" "puke"
## [1012] "puntang" "purinapricness" "puss"
## [1015] "pussie" "pussies" "pussy"
## [1018] "pussycat" "pussyeater" "pussyfucker"
## [1021] "pussylicker" "pussylips" "pussylover"
## [1024] "pussypounder" "pusy" "quashie"
## [1027] "queef" "queer" "quickie"
## [1030] "quim" "ra8s" "rabbi"
## [1033] "racial" "racist" "radical"
## [1036] "radicals" "raghead" "randy"
## [1039] "rape" "raped" "raper"
## [1042] "rapist" "rearend" "rearentry"
## [1045] "rectum" "redlight" "redneck"
## [1048] "reefer" "reestie" "refugee"
## [1051] "reject" "remains" "rentafuck"
## [1054] "republican" "rere" "retard"
## [1057] "retarded" "ribbed" "rigger"
## [1060] "rimjob" "rimming" "roach"
## [1063] "robber" "roundeye" "rump"
## [1066] "russki" "russkie" "sadis"
## [1069] "sadom" "samckdaddy" "sandm"
## [1072] "sandnigger" "satan" "scag"
## [1075] "scallywag" "scat" "schlong"
## [1078] "screw" "screwyou" "scrotum"
## [1081] "scum" "semen" "seppo"
## [1084] "servant" "sex" "sexed"
## [1087] "sexfarm" "sexhound" "sexhouse"
## [1090] "sexing" "sexkitten" "sexpot"
## [1093] "sexslave" "sextogo" "sextoy"
## [1096] "sextoys" "sexual" "sexually"
## [1099] "sexwhore" "sexy" "sexymoma"
## [1102] "sexy-slim" "shag" "shaggin"
## [1105] "shagging" "shat" "shav"
## [1108] "shawtypimp" "sheeney" "shhit"
## [1111] "shinola" "shit" "shitcan"
## [1114] "shitdick" "shite" "shiteater"
## [1117] "shited" "shitface" "shitfaced"
## [1120] "shitfit" "shitforbrains" "shitfuck"
## [1123] "shitfucker" "shitfull" "shithapens"
## [1126] "shithappens" "shithead" "shithouse"
## [1129] "shiting" "shitlist" "shitola"
## [1132] "shitoutofluck" "shits" "shitstain"
## [1135] "shitted" "shitter" "shitting"
## [1138] "shitty " "shoot" "shooting"
## [1141] "shortfuck" "showtime" "sick"
## [1144] "sissy" "sixsixsix" "sixtynine"
## [1147] "sixtyniner" "skank" "skankbitch"
## [1150] "skankfuck" "skankwhore" "skanky"
## [1153] "skankybitch" "skankywhore" "skinflute"
## [1156] "skum" "skumbag" "slant"
## [1159] "slanteye" "slapper" "slaughter"
## [1162] "slav" "slave" "slavedriver"
## [1165] "sleezebag" "sleezeball" "slideitin"
## [1168] "slime" "slimeball" "slimebucket"
## [1171] "slopehead" "slopey" "slopy"
## [1174] "slut" "sluts" "slutt"
## [1177] "slutting" "slutty" "slutwear"
## [1180] "slutwhore" "smack" "smackthemonkey"
## [1183] "smut" "snatch" "snatchpatch"
## [1186] "snigger" "sniggered" "sniggering"
## [1189] "sniggers" "snigger's" "sniper"
## [1192] "snot" "snowback" "snownigger"
## [1195] "sob" "sodom" "sodomise"
## [1198] "sodomite" "sodomize" "sodomy"
## [1201] "sonofabitch" "sonofbitch" "sooty"
## [1204] "sos" "soviet" "spaghettibender"
## [1207] "spaghettinigger" "spank" "spankthemonkey"
## [1210] "sperm" "spermacide" "spermbag"
## [1213] "spermhearder" "spermherder" "spic"
## [1216] "spick" "spig" "spigotty"
## [1219] "spik" "spit" "spitter"
## [1222] "splittail" "spooge" "spreadeagle"
## [1225] "spunk" "spunky" "squaw"
## [1228] "stagg" "stiffy" "strapon"
## [1231] "stringer" "stripclub" "stroke"
## [1234] "stroking" "stupid" "stupidfuck"
## [1237] "stupidfucker" "suck" "suckdick"
## [1240] "sucker" "suckme" "suckmyass"
## [1243] "suckmydick" "suckmytit" "suckoff"
## [1246] "suicide" "swallow" "swallower"
## [1249] "swalow" "swastika" "sweetness"
## [1252] "syphilis" "taboo" "taff"
## [1255] "tampon" "tang" "tantra"
## [1258] "tarbaby" "tard" "teat"
## [1261] "terror" "terrorist" "teste"
## [1264] "testicle" "testicles" "thicklips"
## [1267] "thirdeye" "thirdleg" "threesome"
## [1270] "threeway" "timbernigger" "tinkle"
## [1273] "tit" "titbitnipply" "titfuck"
## [1276] "titfucker" "titfuckin" "titjob"
## [1279] "titlicker" "titlover" "tits"
## [1282] "tittie" "titties" "titty"
## [1285] "tnt" "toilet" "tongethruster"
## [1288] "tongue" "tonguethrust" "tonguetramp"
## [1291] "tortur" "torture" "tosser"
## [1294] "towelhead" "trailertrash" "tramp"
## [1297] "trannie" "tranny" "transexual"
## [1300] "transsexual" "transvestite" "triplex"
## [1303] "trisexual" "trojan" "trots"
## [1306] "tuckahoe" "tunneloflove" "turd"
## [1309] "turnon" "twat" "twink"
## [1312] "twinkie" "twobitwhore" "uck"
## [1315] "uk" "unfuckable" "upskirt"
## [1318] "uptheass" "upthebutt" "urinary"
## [1321] "urinate" "urine" "usama"
## [1324] "uterus" "vagina" "vaginal"
## [1327] "vatican" "vibr" "vibrater"
## [1330] "vibrator" "vietcong" "violence"
## [1333] "virgin" "virginbreaker" "vomit"
## [1336] "vulva" "wab" "wank"
## [1339] "wanker" "wanking" "waysted"
## [1342] "weapon" "weenie" "weewee"
## [1345] "welcher" "welfare" "wetb"
## [1348] "wetback" "wetspot" "whacker"
## [1351] "whash" "whigger" "whiskey"
## [1354] "whiskeydick" "whiskydick" "whit"
## [1357] "whitenigger" "whites" "whitetrash"
## [1360] "whitey" "whiz" "whop"
## [1363] "whore" "whorefucker" "whorehouse"
## [1366] "wigger" "willie" "williewanker"
## [1369] "willy" "wn" "wog"
## [1372] "women's" "wop" "wtf"
## [1375] "wuss" "wuzzie" "xtc"
## [1378] "xxx" "yankee" "yellowman"
## [1381] "zigabo" "zipperhead"
Tidying the Corpus:
corpus <- tm_map(corpus, content_transformer(tolower))  # lower-case all text
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
# trimws() drops the trailing spaces that some entries of the list carry
corpus <- tm_map(corpus, removeWords, trimws(profanity[, 1]))
corpus <- tm_map(corpus, stripWhitespace)               # collapse runs of spaces
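To make the effect of each cleaning step concrete, here is a base-R approximation applied to a single sentence. It is a sketch only: the regular expressions imitate, but are not, tm's implementations; clean.text and its three-word drop list are hypothetical; and the final trimws call is an extra step that stripWhitespace does not perform.

```r
# clean.text() is a hypothetical base-R imitation of the tm pipeline above.
clean.text <- function(x, drop.words = c("the", "a", "is")) {
  x <- tolower(x)                                   # lower-case
  x <- gsub("[[:punct:]]+", "", x)                  # remove punctuation
  x <- gsub("[[:digit:]]+", "", x)                  # remove numbers
  pattern <- sprintf("\\b(%s)\\b", paste(drop.words, collapse = "|"))
  x <- gsub(pattern, "", x, perl = TRUE)            # remove whole listed words
  x <- gsub("\\s+", " ", x)                         # collapse whitespace
  trimws(x)                                         # extra: trim the ends
}

clean.text("The 2 Goats aren't a great team!")
# [1] "goats arent great team"
```

Note that removing punctuation first turns "aren't" into "arent", so contractions no longer match entries of a stop-word list that keeps the apostrophe.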
N-grams are contiguous sequences of n words; counting them shows how often individual words and word patterns occur. We examined four n-gram sizes:
1-gram (unigram): the frequency of single words
2-gram (bigram): the frequency of two-word patterns
3-gram (trigram): the frequency of three-word patterns
4-gram (quadgram): the frequency of four-word patterns
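The four definitions can be illustrated on a single sentence. The sketch below uses only base R; make.ngrams is a hypothetical helper, and splitting on whitespace is a simplifying assumption rather than the tokenizer used in the report.

```r
# make.ngrams() is a hypothetical helper: all n-word windows of one sentence.
make.ngrams <- function(text, n) {
  words <- strsplit(tolower(trimws(text)), "\\s+")[[1]]
  if (length(words) < n) return(character(0))       # sentence too short
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

make.ngrams("we love you mr brown", 2)
# [1] "we love"  "love you" "you mr"   "mr brown"
```

A sentence of m words yields m - n + 1 n-grams, so longer patterns become rarer as n grows.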
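# Plot the 25 most frequent n-grams of size n in the corpus.
# NGramTokenizer() and Weka_control() come from the RWeka package.
plotNGram <- function(n) {
  options(mc.cores = 1)  # run single-threaded
  tk <- function(x) RWeka::NGramTokenizer(x, RWeka::Weka_control(min = n, max = n))
  tdm <- TermDocumentMatrix(corpus, control = list(tokenize = tk))
  # Sum each term's counts across the three documents
  ngram <- as.matrix(rollup(tdm, 2, na.rm = TRUE, FUN = sum))
  ngram <- data.frame(word = rownames(ngram), freq = ngram[, 1])
  ngram <- head(ngram[order(-ngram$freq), ], 25)
  # Fix the factor levels so the bars stay sorted by frequency
  ngram$word <- factor(ngram$word, levels = as.character(ngram$word))
  ggplot(ngram, aes(x = word, y = freq)) + ggtitle("Frequency of Words") +
    geom_bar(stat = "identity", fill = "#ED9626", color = "blue") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    xlab("Word(s)") + ylab("Frequency")
}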
Plot Frequency of Most Common n-gram
plotNGram(1)
## Warning in TermDocumentMatrix.SimpleCorpus(corpus, control = list(tokenize =
## tk)): custom functions are ignored
## Warning in TermDocumentMatrix.SimpleCorpus(corpus, control = list(tokenize =
## tk)): custom tokenizer is ignored
plotNGram(2)
## Warning in TermDocumentMatrix.SimpleCorpus(corpus, control = list(tokenize =
## tk)): custom functions are ignored
## Warning in TermDocumentMatrix.SimpleCorpus(corpus, control = list(tokenize =
## tk)): custom tokenizer is ignored
plotNGram(3)
## Warning in TermDocumentMatrix.SimpleCorpus(corpus, control = list(tokenize =
## tk)): custom functions are ignored
## Warning in TermDocumentMatrix.SimpleCorpus(corpus, control = list(tokenize =
## tk)): custom tokenizer is ignored
plotNGram(4)
## Warning in TermDocumentMatrix.SimpleCorpus(corpus, control = list(tokenize =
## tk)): custom functions are ignored
## Warning in TermDocumentMatrix.SimpleCorpus(corpus, control = list(tokenize =
## tk)): custom tokenizer is ignored
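The warnings above say that TermDocumentMatrix ignored the custom tokenizer for this SimpleCorpus, so all four plots may in fact be counting single words. Building the corpus with VCorpus() instead of Corpus() is one option worth testing, since tm documents custom tokenizers for that class; this has not been verified here. As a cross-check that does not depend on tm or RWeka, n-gram frequencies can also be counted directly in base R. The helper top.ngrams and the toy input below are hypothetical.

```r
# top.ngrams() is a hypothetical helper: the k most frequent n-grams in `lines`.
top.ngrams <- function(lines, n, k = 25) {
  grams <- unlist(lapply(strsplit(lines, "\\s+"), function(w) {
    w <- w[nzchar(w)]                               # drop empty tokens
    if (length(w) < n) return(character(0))
    vapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  if (length(grams) == 0) {
    return(data.frame(word = character(0), freq = integer(0)))
  }
  freq <- sort(table(grams), decreasing = TRUE)     # most frequent first
  head(data.frame(word = names(freq), freq = as.integer(freq),
                  stringsAsFactors = FALSE), k)
}

toy <- c("we love you", "we love pizza", "you love pizza")
top.ngrams(toy, 2, k = 2)
```

Applied to the cleaned sample text, the resulting table has the same word and freq columns that plotNGram passes to ggplot, so the two results can be compared directly.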
Next, we will build a predictive algorithm that will be deployed as a Shiny app for the user interface. The Shiny app should take as input a phrase (multiple words) in a text box input and output a prediction of the next word. The predictive algorithm will be developed using an n-gram model with a word frequency lookup similar to that performed in this report.
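As a rough illustration of the planned frequency lookup, the sketch below predicts the next word from bigram counts alone. It is not the final algorithm: build.bigrams, guess.next and the training lines are hypothetical, it assumes at least one line contains two or more words, and a real model would also need longer n-grams and a fallback for unseen words.

```r
# Hypothetical training lines standing in for the cleaned corpus.
train <- c("we love you", "we love pizza", "we love you too", "you love me")

# Count how often each word follows each other word.
build.bigrams <- function(lines) {
  pairs <- lapply(strsplit(tolower(lines), "\\s+"), function(w) {
    w <- w[nzchar(w)]
    if (length(w) < 2) return(NULL)
    data.frame(prev = head(w, -1), nxt = tail(w, -1), stringsAsFactors = FALSE)
  })
  pairs <- do.call(rbind, pairs)
  pairs$n <- 1L
  counts <- aggregate(n ~ prev + nxt, data = pairs, FUN = sum)
  counts[order(counts$prev, -counts$n), ]           # most frequent follower first
}

# Look up the most frequent follower of the last word in the phrase.
guess.next <- function(model, phrase) {
  w <- strsplit(tolower(trimws(phrase)), "\\s+")[[1]]
  hits <- model[model$prev == tail(w, 1), ]
  if (nrow(hits) == 0) return(NA_character_)        # unseen word: no prediction
  as.character(hits$nxt[1])
}

model <- build.bigrams(train)
guess.next(model, "we love")
# [1] "you"
```

In the Shiny app, the phrase would come from the text box input and the returned word would be shown as the prediction.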