The work in this report may seem a little basic, but I hope it is at least on the right track toward developing an algorithm…
Getting the data
Sampling the data
Cleaning the data: removing punctuation and numbers
Extracting words by tokenization
Researching n-gram models and frequency-based algorithms
Using text mining functions from the tm and NLP libraries
Extracting frequent words
Understanding the use of stopwords
Separating the data into n-grams
Using a frequency matrix
The following two lines were run only once, to download the data from the URL provided.
direccion <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(direccion, "D://dataset.zip")
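Since the download only needs to happen once, one way to make the script safe to re-run is to skip the download when the file is already on disk. This is a minimal sketch, assuming a hypothetical helper named `download_once` (it is not part of the original script); the `downloader` argument exists only so the function can be tried without network access.

```r
# Hypothetical helper (not in the original script): download a file only
# if nothing exists yet at `dest`.
download_once <- function(url, dest, downloader = utils::download.file) {
  if (file.exists(dest)) {
    return(invisible(FALSE))  # already downloaded, nothing to do
  }
  # mode = "wb" keeps binary files such as .zip intact on Windows
  downloader(url, dest, mode = "wb")
  invisible(TRUE)
}

# Possible usage with the URL and path from the report:
# download_once(direccion, "D://dataset.zip")
```

The function returns `TRUE` invisibly when it downloads and `FALSE` when the file was already there, so a calling script can tell the two cases apart.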
setwd("D://capstone//final/en_US//")
dataundstd <- file("en_US.twitter.txt", open = "r")
twitazos <- readLines(dataundstd, skipNul = TRUE)
close(con = dataundstd)
head(twitazos)
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
## [3] "they've decided its more fun if I don't."
## [4] "So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;)"
## [5] "Words from a complete stranger! Made my birthday even better :)"
## [6] "First Cubs game ever! Wrigley field is gorgeous. This is perfect. Go Cubs Go!"
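The later steps work on an object called `twitsample`, whose creation is not shown in this report. One plausible way to obtain such a sample is a seeded random draw of lines. The sketch below assumes that approach; the toy vector, the seed, and the sample size are all placeholders chosen for illustration.

```r
# Toy stand-in for the full vector of tweets (assumption for illustration).
tweets <- paste("tweet number", 1:100)

set.seed(1234)                            # arbitrary seed, for reproducibility
n_keep <- 10                              # arbitrary sample size
sampled <- sample(tweets, size = n_keep)  # random draw without replacement

length(sampled)
```

Fixing the seed means the same lines are drawn on every run, which keeps the counts in later steps stable.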
## [1] 36 64 94 96 109 116 130 136 139 160 184 191 209 220 221 223 232
## [18] 248 277 291 309 317 324 342 353 391 404 412 448 456 460 465 494 512
## [35] 513 531 535 537 539 565 600 658 672 679 686 744 750 757 767 768 794
## [52] 805 820 842 854 872 940 944 959 963 968 992
## [1] "Fun video. Could use another edit but the students had a good time making it."
## [2] "My bro & the rest of #HVTbaseball clinched the Flight B championship. Y'all said it'd be done & it happened.. #GoodJob I'm proud ⚾"
## [3] "Sleeping good tonight for sure >>"
## [4] "I think I would start with the Les Paul, but then again the SG and Explorer would be a good lead up to the Les Paul."
## [5] "GOOD MORNING!!!!"
## [6] "thanks bro! its good to know people actually watch those lol"
## [7] "thats great keep up the good work champ! :)"
## [8] "That talk so barely scratches the surface of such a large concept. Good talk."
## [9] "that should be a good trip to S.D....."
## [10] "Darn you for bringing mama's peanut butter cookies to HQs today. Good thing we're doing so much walking!"
## [11] "Thanks! So far so good, looks like all on time and to Providence by noon."
## [12] "lol k goodnight."
## [13] "Doubt it. RT Let's see if these DA live Dalai Lama tweets are as good as my in-game tweets."
## [14] "Keep up the good work Perez hard work and perseverance always pays off. You can do it! Tweet! back."
## [15] "Gotta call from my bro good to have u back in Pac kidd"
## [16] "Shes A Good Person..."
## [17] "Good Morning #SteelerNation. Just a few too many shots last night. But time to Move on. #Breakfast!!"
## [18] "Just read about the passing of Davy Jones of The Monkees. Brings back some good childhood memories."
## [19] "It was so good tho omg"
## [20] "Hope everyone have a great exam today...good luck !"
## [21] "how many of the good works that were sent out before us do we miss because we were focused too much on ourselves? ephesians 2:10"
## [22] "You know a guy's good when it's news that he gets out MT : #Orioles have retired Josh Hamilton. Wei-Yin Chen strikes him out."
## [23] "Good morning all! It's finally Friday!"
## [24] "bro started adding bananas to diet along with water to fight cramps. So far so good. Lega today."
## [25] "Perfect day to walk to work. Good morning, DC! Good morning, World!"
## [26] "good luck :) and morning"
## [27] "yea I'm 100 bra wats good wit u"
## [28] "Hanging with and having a really good laugh!"
## [29] "hey I've been good. Still grinding ...yaself ?"
## [30] "Whats good! love how you rock your"
## [31] "&& GOOD NEWS ! i downloaded a music editor to my computer but we just need to learn how to use it now =) lol"
## [32] "good now?"
## [33] "hey!!! I misssssseddddd u!!!!! <3 i been good !"
## [34] "hey Hope Everything is goin good for u taylor"
## [35] "oh, we'll that's good :-)"
## [36] "Santorum ranted in Oct. that birth control isn't good. Lead to ? In debate about if legalized bc is a states right & he said yes."
## [37] "Of course. Good luck!"
## [38] "oh. good question. heard any good rumors about the lineup?"
## [39] "& i spent a super long time internet stalking people we know/sort of know tonight... & we're really damn good at it."
## [40] "Saw Page One tonight. Thought-provoking & informative. Keep up the good work, #NYT! (Where would Gawker be without you!?)"
## [41] "Dammit! And I make such good cornbread, too."
## [42] "good points... the talk.alliedmedia site does indeed send plaintext, but so do many other non-critical web services. SSL==$$$"
## [43] "that's good"
## [44] "Good Morning Fans! We open at Noon for Twelve Buck Thursday...see ya soon!"
## [45] "Once again, a full body slam to concrete is a good thing. Of all the slams destined in your future it's just one less you gotta deal with."
## [46] "#Shoutout Good morning."
## [47] "Free movie with , then a delicious dinner in a lovely locale with my very funny, friends who shall remain nameless. Life is good!"
## [48] "\": Enjoy the game! I'll be there Sunday! Go Giants!\" <-- thanks!! You'll have better weather than us.. but so far so good!"
## [49] "I just need to make sure I don't get pulled over by a cop between now and said hypothetical future good hair day! ;)"
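The listing above appears to be the result of searching the tweets for the word "good" regardless of case; the command itself is not shown in the report. The sketch below shows how such a search could be written with base R's `grep`, assuming a case-insensitive match and using a small made-up vector.

```r
tweets <- c("GOOD MORNING!!!!", "Sleeping well tonight",
            "lol k goodnight.", "Life is good!")

# Positions of the elements that contain "good", ignoring case
hits <- grep("good", tweets, ignore.case = TRUE)

# The matching elements themselves
matches <- grep("good", tweets, ignore.case = TRUE, value = TRUE)

# Matching whole words only leaves out "goodnight"
whole_word <- grep("\\bgood\\b", tweets, ignore.case = TRUE,
                   perl = TRUE, value = TRUE)
```

A plain substring search also matches words like "goodnight", so the number of matching lines is not the same thing as the number of times the word "good" occurs.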
twitsample <- gsub("[[:punct:]]","",twitsample)
twitsample <- gsub("[[:digit:]]","",twitsample)
head(twitsample)
## [1] "Gus Johnson is another that no one can possibly hate"
## [2] "hi Amy just wanted to say my wife and I love u My wife use to work for melrose coop bank She took care of grampa Carl"
## [3] "Secret IP treaties evil legislation copyright terms stretching off beyond the horizon"
## [4] " runs twice in games"
## [5] "Thats hysterical"
## [6] "Sound track to my broken heart Mario"
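The two `gsub` calls can be wrapped into a single function. The sketch below is one possible version, with a hypothetical name `clean_text`. It also collapses the extra whitespace that removing characters leaves behind (visible in element [4] above), which is an addition the original code does not make.

```r
# Hypothetical helper: strip punctuation and digits, then tidy whitespace.
clean_text <- function(x) {
  x <- gsub("[[:punct:]]", "", x)    # drop punctuation
  x <- gsub("[[:digit:]]", "", x)    # drop digits
  x <- gsub("[[:space:]]+", " ", x)  # collapse runs of whitespace
  trimws(x)                          # strip leading and trailing whitespace
}

clean_text(c("That's hysterical!", "3 runs, twice in 2 games"))
# returns "Thats hysterical" "runs twice in games"
```

Note that removing apostrophes turns "That's" into "Thats"; whether that is acceptable depends on how contractions should be treated later on.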
library(NLP)
library(tm)
pdt <- PlainTextDocument(twitsample, heading = "Plain Text Dcoument", id =basename(tempfile()), language = "en")
meta(pdt)
## author : character(0)
## datetimestamp: 2016-09-04 20:31:33
## description : character(0)
## heading : Plain Text Dcoument
## id : file2328609a703e
## language : en
## origin : character(0)
tm_term_score(pdt, terms = "Good")
## [1] 10
The number of occurrences of the term "good" should be 42, but to get that count I first had to convert all the text to lowercase using the tolower function.
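The effect of letter case on the count can be seen with a few lines of base R. This is only an illustration on made-up text; it does not use `tm_term_score`, and the counts apply to the toy vector, not to the Twitter sample.

```r
txt <- c("Good morning", "so far so good", "GOOD NEWS", "good luck")

# Split on whitespace into individual tokens
tokens <- unlist(strsplit(txt, "[[:space:]]+"))

sum(tokens == "Good")           # exact, case-sensitive match: 1
sum(tolower(tokens) == "good")  # after lowercasing: 4
```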
tokens <- scan_tokenizer(pdt[[1]])
head(tokens, n =20)
## [1] "Gus" "Johnson" "is" "another" "that" "no"
## [7] "one" "can" "possibly" "hate" "hi" "Amy"
## [13] "just" "wanted" "to" "say" "my" "wife"
## [19] "and" "I"
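Once the text is tokenized, n-grams can be built by pasting each token to the ones that follow it. The sketch below uses only base R and a hypothetical helper `make_ngrams`; it is one simple way to do this, not necessarily the approach the final algorithm will use.

```r
# Hypothetical helper: build all n-grams from a vector of tokens.
make_ngrams <- function(tokens, n = 2) {
  if (length(tokens) < n) return(character(0))  # too short for any n-gram
  starts <- seq_len(length(tokens) - n + 1)     # start index of each n-gram
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

tokens <- c("my", "wife", "and", "I", "love", "u")
bigrams <- make_ngrams(tokens, n = 2)
trigrams <- make_ngrams(tokens, n = 3)

# Frequency table of bigrams, most frequent first
sort(table(bigrams), decreasing = TRUE)
```

Applying the helper to each tweet separately, rather than to one pooled token vector, would avoid n-grams that span two unrelated tweets.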
freqterms <- termFreq(pdt, control = list())
ordenados <- sort(freqterms, decreasing = TRUE)
keywords <- (ordenados[1:20])
barplot(keywords)
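The most frequent terms in English text are usually function words such as "the" and "to", which is where stopwords come in. The sketch below shows the idea in base R with a small hand-picked stopword list (an assumption for illustration; tm provides a fuller list through `stopwords("en")`).

```r
tokens <- c("the", "cubs", "game", "is", "the", "best",
            "game", "to", "watch", "in", "the", "city")

# Hand-picked stopwords, for illustration only
stop_words <- c("the", "is", "to", "in", "a", "and")

# Frequencies with every token included
freq_all <- sort(table(tokens), decreasing = TRUE)

# Frequencies after dropping the stopwords
kept <- tokens[!tokens %in% stop_words]
freq_kept <- sort(table(kept), decreasing = TRUE)

names(freq_all)[1]   # "the"
names(freq_kept)[1]  # "game"
```

For a next-word prediction model the stopwords may still be worth keeping, since they are often exactly the words a user is about to type; removing them is mainly useful for seeing which content words dominate.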