Question - 1

The en_US.blogs.txt file is how many megabytes?

Answer

size<-file.info("./final/en_US/en_US.blogs.txt")
kb<-size$size/1024
mb<-kb/1024
mb
## [1] 200.4242

200
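
The same conversion can be written in one step with file.size(), which returns the size in bytes. A minimal sketch, assuming 1 MB means 1024^2 bytes as in the answer above; the helper name size_in_mb and the temporary stand-in file are hypothetical, not part of the quiz data:

```r
# Hypothetical helper: file size in megabytes (1 MB taken as 1024^2 bytes,
# matching the two divisions by 1024 in the answer).
size_in_mb <- function(path) {
  file.size(path) / 1024^2
}

# Stand-in file so the sketch runs without the corpus: 2 MiB of zero bytes.
tmp <- tempfile(fileext = ".txt")
writeBin(raw(2 * 1024^2), tmp)
demo_mb <- size_in_mb(tmp)
print(demo_mb)
unlink(tmp)
```

For the real file, the call would be size_in_mb("./final/en_US/en_US.blogs.txt").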

Question - 2

The en_US.twitter.txt has how many lines of text?

Answer

# Open the connection explicitly so it can be closed after reading
con <- file("./final/en_US/en_US.twitter.txt", "r")
twitter <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
length(twitter)
## [1] 2360148

Over 2 million
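
If only the line count is needed, the file does not have to be held in memory all at once. A rough sketch that reads in chunks from an open connection; the helper name count_lines, the chunk size, and the temporary stand-in file are assumptions made for illustration:

```r
# Hypothetical helper: count lines by reading fixed-size chunks, so the
# whole file never has to sit in memory at once.
count_lines <- function(path, chunk_size = 10000L) {
  con <- file(path, "r")
  on.exit(close(con))
  total <- 0
  repeat {
    # Each call continues from where the previous one stopped.
    chunk <- readLines(con, n = chunk_size, skipNul = TRUE)
    if (length(chunk) == 0) break
    total <- total + length(chunk)
  }
  total
}

# Stand-in file so the sketch runs without the corpus.
tmp <- tempfile(fileext = ".txt")
writeLines(c("first tweet", "second tweet", "third tweet"), tmp)
n_lines <- count_lines(tmp, chunk_size = 2L)
print(n_lines)
unlink(tmp)
```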

Question - 3

What is the length of the longest line seen in any of the three en_US data sets?

Answer

# Blogs file
blogs<-file("./final/en_US/en_US.blogs.txt","r")
blogs_lines<-readLines(blogs)
close(blogs)
summary(nchar(blogs_lines))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1      47     156     230     329   40830
# News file
news<-file("./final/en_US/en_US.news.txt","r")
news_lines<-readLines(news)
close(news)
summary(nchar(news_lines))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0   110.0   185.0   201.2   268.0 11380.0
# Twitter file
twitter<-file("./final/en_US/en_US.twitter.txt","r")
twitter_lines<-readLines(twitter)
## Warning in readLines(twitter): line 167155 appears to contain an embedded
## nul
## Warning in readLines(twitter): line 268547 appears to contain an embedded
## nul
## Warning in readLines(twitter): line 1274086 appears to contain an embedded
## nul
## Warning in readLines(twitter): line 1759032 appears to contain an embedded
## nul
close(twitter)
summary(nchar(twitter_lines))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   37.00   64.00   68.68  100.00  140.00

Over 40 thousand in the blogs data set
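
summary() can print rounded values, so taking max(nchar(...)) per file is a more direct way to compare the data sets. A minimal sketch on small stand-in files; the helper name longest_line and the demo file names are hypothetical:

```r
# Hypothetical helper: exact length of the longest line in one file.
longest_line <- function(path) {
  con <- file(path, "r")
  on.exit(close(con))
  max(nchar(readLines(con, skipNul = TRUE)))
}

# Stand-in files so the sketch runs without the corpus.
demo_dir <- tempfile("demo_corpus")
dir.create(demo_dir)
writeLines(c("short", "a somewhat longer line"), file.path(demo_dir, "blogs.txt"))
writeLines(c("mid-sized line", "tiny"), file.path(demo_dir, "news.txt"))
writeLines(c("hi", "hello"), file.path(demo_dir, "twitter.txt"))

# One maximum per file, then pick the file with the largest one.
paths <- list.files(demo_dir, full.names = TRUE)
longest <- vapply(paths, longest_line, integer(1))
names(longest) <- basename(paths)
print(longest)
print(names(which.max(longest)))
unlink(demo_dir, recursive = TRUE)
```

Passing skipNul = TRUE to readLines() also avoids the embedded nul warnings on files that contain them.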

Question - 4

In the en_US twitter data set, if you divide the number of lines where the word "love" (all lowercase) occurs by the number of lines the word "hate" (all lowercase) occurs, about what do you get?

Answer

love<-length(grep("love", twitter_lines))
hate<-length(grep("hate", twitter_lines))
love/hate
## [1] 4.108592

4
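
Note that grep("love", ...) counts substring matches, so lines containing "lovely" or "whatever" (which contains "hate") are included. A small sketch on made-up tweets, contrasting the substring count with a whole-word count using \\b word-boundary anchors; the toy vector is an assumption, not data from the corpus:

```r
# Made-up lines for illustration only.
tweets <- c("i love mondays", "lovely weather", "i hate rain",
            "whatever", "love it")

# Substring matches, as in the answer above.
substring_love <- sum(grepl("love", tweets))
substring_hate <- sum(grepl("hate", tweets))

# Whole-word matches using word-boundary anchors.
word_love <- sum(grepl("\\blove\\b", tweets))
word_hate <- sum(grepl("\\bhate\\b", tweets))

print(c(substring = substring_love / substring_hate,
        whole_word = word_love / word_hate))
```

On the real data the two readings may give somewhat different ratios; the quiz answer above uses the substring count.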

Question - 5

The one tweet in the en_US twitter data set that matches the word "biostats" says what?

Answer

grep("biostats", twitter_lines, value = TRUE)
## [1] "i know how you feel.. i have biostats on tuesday and i have yet to study =/"

They haven't studied for their biostats exam

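Question - 6

How many tweets in the en_US twitter data set contain the exact string "A computer once beat me at chess, but it was no match for me at kickboxing"?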

Answer

grep("A computer once beat me at chess, but it was no match for me at kickboxing", twitter_lines)
## [1]  519059  835824 2283423

3
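
The count is the number of indices grep() returns. When searching for a literal sentence, fixed = TRUE makes grep treat the pattern as plain text instead of a regular expression, and == checks whether the whole line equals the sentence. A minimal sketch on made-up lines (the toy vector is an assumption):

```r
target <- "A computer once beat me at chess, but it was no match for me at kickboxing"

# Made-up lines for illustration only.
tweets <- c(
  target,
  paste("lol:", target),
  "something else entirely"
)

# Lines that contain the sentence anywhere, matched literally.
containing <- sum(grepl(target, tweets, fixed = TRUE))

# Stricter reading: the whole line is exactly the sentence.
identical_lines <- sum(tweets == target)

print(c(containing = containing, identical = identical_lines))
```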