Tokenization - identifying appropriate tokens such as words, punctuation, and numbers. Writing a function that takes a file as input and returns a tokenized version of it (a sketch of both tasks follows this list).
Profanity filtering - removing profanity and other words you do not want to predict.
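A minimal sketch of both tasks might look like the following; the function names, the regular expression, and the badwords vector are illustrative assumptions, not a required implementation:

# Hypothetical tokenizer: read a file and split it into lowercase word tokens
tokenize_file <- function(path) {
  text <- tolower(readLines(path, warn = FALSE, skipNul = TRUE))
  # Split on any run of characters that is not a letter, digit, or apostrophe
  tokens <- unlist(strsplit(text, "[^a-z0-9']+"))
  tokens[tokens != ""] # drop empty strings left by leading separators
}

# Hypothetical profanity filter: drop any token that appears in a ban list
filter_profanity <- function(tokens, badwords) {
  tokens[!tokens %in% badwords]
}

# Usage: tokens <- filter_profanity(tokenize_file("en_US/en_US.blogs.txt"), badwords)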
Loading the data in.
con <- file("en_US.twitter.txt", "r") ## Open a read connection first (file name illustrative)
readLines(con, 1) ## Read the first line of text
readLines(con, 1) ## Read the next line of text
readLines(con, 5) ## Read in the next 5 lines of text
close(con) ## It’s important to close the connection when you are done.
See the connections help page (?connections) for more information.
Sampling.
To reiterate, to build models you don’t need to load in and use all of the data. Often relatively few randomly selected rows or chunks need to be included to get an accurate approximation to results that would be obtained using all the data. Remember your inference class and how a representative sample can be used to infer facts about a population.
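One common approach is to flip a biased coin for each line with rbinom(), or to draw a fixed-size sample with sample(). The file name, the 5% inclusion rate, and the seed below are illustrative assumptions:

set.seed(1234) # fix the seed so the sample is reproducible
lines <- readLines("en_US/en_US.blogs.txt", warn = FALSE, skipNul = TRUE)
# Keep each line independently with probability 0.05
keep <- rbinom(length(lines), size = 1, prob = 0.05) == 1
sample_lines <- lines[keep]

# Alternatively, a fixed-size simple random sample of lines:
# sample_lines <- sample(lines, size = 10000)

Writing sample_lines out once (e.g. with writeLines()) means the smaller file can be reused later without re-reading the full data set.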
1. The en_US.blogs.txt file is how many megabytes?
# file.info()$size is reported in bytes; divide by 1024 twice to get megabytes
givendata <- file.info("C:/Users/nirma/Documents/Coursera/Data Science/Data Science Capstone/final/en_US/en_US.blogs.txt")$size / 1024 / 1024
sprintf("The en_US.blogs.txt file is: %s Megabytes", givendata)
## [1] "The en_US.blogs.txt file is: 200.424207687378 Megabytes"
2. The en_US.twitter.txt file has how many lines of text?
twitterdata <- file("C:/Users/nirma/Documents/Coursera/Data Science/Data Science Capstone/final/en_US/en_US.twitter.txt", open = "rb")
total_lines <- 0L
# Read the file in 64 KB raw chunks and count newline bytes (0x0A),
# which avoids loading the whole file into memory at once
while (length(chunk <- readBin(twitterdata, "raw", 65536)) > 0) {
  total_lines <- total_lines + sum(chunk == as.raw(10L))
}
close(twitterdata)
sprintf("The en_US.twitter.txt file has: %s Lines", total_lines)
## [1] "The en_US.twitter.txt file has: 2360148 Lines"
3. What is the length of the longest line seen in any of the three en_US data sets?
# Open binary-mode connections to all three en_US files
blogsdata <- file("C:/Users/nirma/Documents/Coursera/Data Science/Data Science Capstone/final/en_US/en_US.blogs.txt", open = "rb")
twitterdata <- file("C:/Users/nirma/Documents/Coursera/Data Science/Data Science Capstone/final/en_US/en_US.twitter.txt", open = "rb")
newsdata <- file("C:/Users/nirma/Documents/Coursera/Data Science/Data Science Capstone/final/en_US/en_US.news.txt", open = "rb")
# Reading lines; element 6 of summary(nchar(...)) is "Max.", the longest line length
blogsdata_lines <- readLines(blogsdata)
close(blogsdata)
blogsdata_L <- summary(nchar(blogsdata_lines))[6]
blogsdata_L
## Max.
## 40835
twitterdata_lines <- readLines(twitterdata)
## Warning in readLines(twitterdata): line 167155 appears to contain an embedded
## nul
## Warning in readLines(twitterdata): line 268547 appears to contain an embedded
## nul
## Warning in readLines(twitterdata): line 1274086 appears to contain an embedded
## nul
## Warning in readLines(twitterdata): line 1759032 appears to contain an embedded
## nul
close(twitterdata)
twitterdata_L <- summary(nchar(twitterdata_lines))[6]
twitterdata_L
## Max.
## 213
newsdata_lines <- readLines(newsdata)
close(newsdata)
newsdata_L <- summary(nchar(newsdata_lines))[6]
newsdata_L
## Max.
## 11384
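The embedded-nul warnings above come from stray NUL bytes in the Twitter file; readLines() truncates those lines at the first nul unless skipNul = TRUE is set. As an aside, the per-file computation could be wrapped in a small helper like the sketch below; the function name and the use of skipNul = TRUE are my additions, not part of the quiz code:

# Hypothetical helper: length in characters of the longest line in a file
max_line_length <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con)) # close the connection even if readLines() fails
  max(nchar(readLines(con, skipNul = TRUE)))
}

# Usage: max_line_length("final/en_US/en_US.blogs.txt")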
4. In the en_US twitter data set, if you divide the number of lines where the word “love” (all lowercase) occurs by the number of lines where the word “hate” (all lowercase) occurs, about what do you get?
lovehate <- length(grep("love", twitterdata_lines)) / length(grep("hate", twitterdata_lines))
sprintf("We get around: %s", lovehate)
## [1] "We get around: 4.10859156202006"
5. The one tweet in the en_US twitter data set that matches the word “biostats” says what?
# value = TRUE returns the matching line itself rather than its index
biostat_tweet <- grep("biostats", twitterdata_lines, value = TRUE)
biostat_tweet
## [1] "i know how you feel.. i have biostats on tuesday and i have yet to study =/"
6. How many tweets have the exact characters “A computer once beat me at chess, but it was no match for me at kickboxing”? (I.e., the line matches those characters exactly.)
# fixed = TRUE treats the pattern as a literal string rather than a regular expression
tweet_match <- grep("A computer once beat me at chess, but it was no match for me at kickboxing", twitterdata_lines, fixed = TRUE)
tweet_match
## [1] 519059 835824 2283423
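grep() returns the indices of the matching lines, so three tweets (lines 519059, 835824, and 2283423) contain that exact sentence; length(tweet_match) would return the count, 3, directly.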