Question 1

The en_US.blogs.txt file is how many megabytes?

file.info("final/en_US/en_US.blogs.txt")
##                                  size isdir mode               mtime
## final/en_US/en_US.blogs.txt 210160014 FALSE  644 2014-07-22 05:13:05
##                                           ctime               atime uid gid
## final/en_US/en_US.blogs.txt 2020-10-05 17:17:43 2014-07-22 05:15:28 501  20
##                                uname grname
## final/en_US/en_US.blogs.txt woodzsan  staff

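Answer: about 210 MB in decimal units (210,160,014 bytes / 10^6), which is about 200 MiB in binary units (bytes / 1024^2)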
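The size column reported by file.info() is in bytes, so the megabyte figure depends on which divisor is used. A minimal sketch of the conversion, using the byte count printed above (the variable names are illustrative and not part of the original analysis):

```r
# Byte count copied from the file.info() output above.
size_bytes <- 210160014

# Decimal megabytes (1 MB = 10^6 bytes).
size_mb <- size_bytes / 1e6

# Binary mebibytes (1 MiB = 1024^2 bytes).
size_mib <- size_bytes / 1024^2

print(round(size_mb, 1))
print(round(size_mib, 1))
```

On a live file, file.size(path) should return the same byte count without the other file.info() columns.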

Question 2

The en_US.twitter.txt has how many lines of text?

library(R.utils)
sapply("final/en_US/en_US.twitter.txt",countLines)
## final/en_US/en_US.twitter.txt 
##                       2360148

Answer: 2,360,148 lines
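If R.utils is not available, the line count can also be obtained with base R alone. The sketch below reads the file in chunks so the whole text is never held in memory at once. The helper name count_lines and the chunk size are assumptions made for illustration, and the demonstration uses a small temporary file rather than the real data set:

```r
# Hypothetical helper: count lines by reading a text file in fixed-size chunks.
count_lines <- function(path, chunk_size = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  n <- 0L
  repeat {
    # skipNul = TRUE mirrors the readLines() calls used elsewhere in this report.
    chunk <- readLines(con, n = chunk_size, warn = FALSE, skipNul = TRUE)
    if (length(chunk) == 0L) break
    n <- n + length(chunk)
  }
  n
}

# Small demonstration on a temporary three-line file.
tmp <- tempfile(fileext = ".txt")
writeLines(c("first tweet", "second tweet", "third tweet"), tmp)
print(count_lines(tmp))
```

For the real file, count_lines("final/en_US/en_US.twitter.txt") would be the equivalent call; its result may differ from countLines() in edge cases such as unusual line endings.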

Question 3

What is the length of the longest line seen in any of the three en_US data sets?

blogs <- "final/en_US/en_US.blogs.txt"
news <- "final/en_US/en_US.news.txt"
twitter <- "final/en_US/en_US.twitter.txt"

blog.line <- readLines(blogs, encoding = "UTF-8", skipNul = TRUE)
news.line <- readLines(news, encoding = "UTF-8", skipNul = TRUE)
twitter.line <- readLines(twitter, encoding = "UTF-8", skipNul = TRUE)

blog.char.count <- nchar(blog.line)
news.char.count <- nchar(news.line)
twitter.char.count <- nchar(twitter.line)

print(paste("length of longest 'blog' line: ",max(blog.char.count)," characters"))
## [1] "length of longest 'blog' line:  40833  characters"
print(paste("length of longest 'news' line: ",max(news.char.count)," characters"))
## [1] "length of longest 'news' line:  11384  characters"
print(paste("length of longest 'twitter' line: ",max(twitter.char.count)," characters"))
## [1] "length of longest 'twitter' line:  140  characters"

Answer: 40,833 characters, in the blogs data set
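The three maxima can also be compared in one step so that the data set holding the longest line is reported directly. This is a sketch on toy character vectors, not the real files. The names toy, longest_line, and lengths_by_set are illustrative assumptions:

```r
# Hypothetical helper: length in characters of the longest element.
longest_line <- function(lines) max(nchar(lines, type = "chars"))

# Toy stand-ins for blog.line, news.line, and twitter.line.
toy <- list(
  blogs   = c("a short line", "a somewhat longer blog line"),
  news    = c("news"),
  twitter = c("tweet one", "tweet")
)

# One maximum per data set, returned as a named vector.
lengths_by_set <- sapply(toy, longest_line)
print(lengths_by_set)

# Name of the data set with the overall longest line.
print(names(which.max(lengths_by_set)))
```

With the real data, the list would hold blog.line, news.line, and twitter.line instead of the toy vectors.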

Question 4

In the en_US twitter data set, if you divide the number of lines where the word “love” (all lowercase) occurs by the number of lines the word “hate” (all lowercase) occurs, about what do you get?

library(stringr)
length(str_subset(twitter.line,"love"))/
  length(str_subset(twitter.line,"hate"))
## [1] 4.108592

Answer: about 4
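Note that str_subset() matches the pattern anywhere in the line, so "love" also counts lines containing words such as "lovely". The sketch below shows the same ratio in base R on a toy vector, first as a plain substring match and then restricted to whole words. The toy tweets and the ratio helper are assumptions for illustration, and the whole-word variant is an alternative reading of the question, not the method used above:

```r
# Toy tweets; not taken from the real data set.
tweets <- c("i love this", "lovely day", "i hate mondays",
            "love and hate", "Love wins")

# Hypothetical helper: ratio of lines matching each of two patterns.
ratio <- function(x, a, b, ...) sum(grepl(a, x, ...)) / sum(grepl(b, x, ...))

# Substring match, case sensitive (same behaviour as the str_subset() call).
substring_ratio <- ratio(tweets, "love", "hate", fixed = TRUE)

# Whole-word match using regular-expression word boundaries.
word_ratio <- ratio(tweets, "\\blove\\b", "\\bhate\\b", perl = TRUE)

print(substring_ratio)
print(word_ratio)
```

On the toy data the two variants disagree because "lovely day" counts only under substring matching; the capitalised "Love wins" is excluded by both.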

Question 5

The one tweet in the en_US twitter data set that matches the word “biostats” says what?

str_subset(twitter.line,"biostats")
## [1] "i know how you feel.. i have biostats on tuesday and i have yet to study =/"

Question 6

How many tweets have the exact characters “A computer once beat me at chess, but it was no match for me at kickboxing”. (I.e. the line matches those characters exactly.)

length(str_subset(twitter.line,"A computer once beat me at chess, but it was no match for me at kickboxing"))
## [1] 3
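The str_subset() call above counts tweets that contain the phrase, with the phrase interpreted as a regular expression. Two stricter readings of "matches exactly" can be sketched in base R: whole-line equality and literal substring matching. The toy vector below is an assumption for illustration; in stringr, wrapping the pattern in fixed() is the literal-matching equivalent:

```r
target <- "A computer once beat me at chess, but it was no match for me at kickboxing"

# Toy tweets: an exact copy, a copy with trailing punctuation, and an unrelated line.
tweets <- c(target, paste0(target, "!"), "something else")

# Whole-line equality: the tweet must be exactly the target string.
n_equal <- sum(tweets == target)

# Literal substring match: the target may appear anywhere, no regex interpretation.
n_contains <- sum(grepl(target, tweets, fixed = TRUE))

print(n_equal)
print(n_contains)
```

Running sum(twitter.line == target) on the real data would show whether the matching tweets are whole-line matches or only contain the phrase.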