library(dplyr) # %>% and mutate(), used in questions 2 and 3
library(data.table) # %like%, used in questions 5 and 6

Question 1

The en_US.blogs.txt file is how many megabytes?

file.info('./final/en_US/en_US.blogs.txt')[1]
##                                    size
## ./final/en_US/en_US.blogs.txt 210160014

The file is 210,160,014 bytes, which is about 200 MB.
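The conversion from the byte count to megabytes can be sketched as below. The variable names are illustrative, and 1 MB is assumed to mean 1024^2 bytes; with 10^6 bytes per MB the same file comes out at about 210 MB.

```r
# Size in bytes, copied from the file.info() output above
size_bytes <- 210160014

# Convert to megabytes, assuming 1 MB = 1024^2 bytes
size_mb <- size_bytes / 1024^2
round(size_mb, 1)
## [1] 200.4
```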

Question 2

The en_US.twitter.txt has how many lines of text?

# Read the file into a one-column data frame and record each line's length
twitter <- data.frame(Text = readLines('./final/en_US/en_US.twitter.txt'),
                      stringsAsFactors = FALSE)
twitter <- twitter %>% mutate(length = nchar(Text))
dim(twitter)
## [1] 2360148       2

The file has 2,360,148 lines, about 2.36 million.
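If only the line count is needed, the data frame is optional. A minimal sketch, using a small temporary file as a stand-in for the real twitter file (the file contents and variable names here are made up for illustration):

```r
# Hypothetical stand-in for en_US.twitter.txt: a small temporary file
tmp <- tempfile(fileext = ".txt")
writeLines(c("first tweet", "second tweet", "third tweet"), tmp)

# readLines() returns one string per line, so length() is the line count.
# skipNul = TRUE skips embedded nul characters instead of warning about them.
txt <- readLines(tmp, skipNul = TRUE)
n_lines <- length(txt)
n_lines
## [1] 3

unlink(tmp)
```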

Question 3

What is the length of the longest line seen in any of the three en_US data sets?

# Same layout as the twitter data frame: one row per line, plus its length
blogs <- data.frame(Text = readLines('./final/en_US/en_US.blogs.txt'),
                    stringsAsFactors = FALSE)
blogs <- blogs %>% mutate(length = nchar(Text))
news <- data.frame(Text = readLines('./final/en_US/en_US.news.txt'),
                   stringsAsFactors = FALSE)
news <- news %>% mutate(length = nchar(Text))
max(blogs$length)
## [1] 40835
max(news$length)
## [1] 5760

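Of the blogs and news data sets, blogs has the longest line, at 40,835 characters (news peaks at 5,760). The twitter maximum, available as max(twitter$length) from Question 2, is not shown here.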
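A comparison across all three files can be sketched with one helper applied to each path. The helper name `longest_line` and the temporary files below are hypothetical stand-ins, not the real corpus:

```r
# Hypothetical helper: length of the longest line in one text file
longest_line <- function(path) {
  max(nchar(readLines(path, skipNul = TRUE)))
}

# Small temporary files standing in for the blogs, news and twitter files
corpus_dir <- tempfile("corpus")
dir.create(corpus_dir)
paths <- file.path(corpus_dir, c("blogs.txt", "news.txt", "twitter.txt"))
writeLines(c("a fairly long blog line", "short"), paths[1])
writeLines(c("news item"), paths[2])
writeLines(c("tweet", "another tweet"), paths[3])

# One maximum per file, named by file name
maxima <- vapply(paths, longest_line, integer(1))
names(maxima) <- basename(paths)
maxima

# Name of the file that holds the longest line
names(which.max(maxima))
## [1] "blogs.txt"

unlink(corpus_dir, recursive = TRUE)
```

Computing all maxima in one named vector keeps the comparison in a single place, so no data set is left out by accident.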

Question 4

In the en_US twitter data set, if you divide the number of lines where the word “love” (all lowercase) occurs by the number of lines the word “hate” (all lowercase) occurs, about what do you get?

blogs$Text <- tolower(blogs$Text)
sum(grepl('love', blogs$Text)) / sum(grepl('hate', blogs$Text))
## [1] 4.729716

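About 4. Note that the code above runs on the blogs data rather than the twitter data the question asks about, and that tolower makes the match case-insensitive although the question specifies lowercase only. On blogs the ratio was 4.430258 before tolower. The twitter ratio would come from the same expression applied to twitter$Text without tolower.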
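The effect of tolower on such a ratio can be seen on a toy example. This is a sketch only: `line_ratio` is a hypothetical helper and the five lines are invented, so the numbers say nothing about the real data.

```r
# Hypothetical helper: lines containing `a` divided by lines containing `b`.
# fixed = TRUE treats both patterns as literal, case-sensitive substrings.
line_ratio <- function(text, a, b) {
  sum(grepl(a, text, fixed = TRUE)) / sum(grepl(b, text, fixed = TRUE))
}

# Toy lines standing in for twitter$Text
tweets <- c("i love this", "Love wins", "i hate mondays",
            "love and hate together", "so much love")

# Case-sensitive: "Love wins" is not counted
line_ratio(tweets, "love", "hate")
## [1] 1.5

# After tolower() the capitalized line is counted as well
line_ratio(tolower(tweets), "love", "hate")
## [1] 2
```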

Question 5

The one tweet in the en_US twitter data set that matches the word “biostats” says what?

twitter[twitter$Text %like% 'biostats', ]
##                                                                               Text
## 556872 i know how you feel.. i have biostats on tuesday and i have yet to study =/
##        length
## 556872     75

It looks like somebody has an exam coming up and has not studied for it yet.

Question 6

How many tweets have the exact characters “A computer once beat me at chess, but it was no match for me at kickboxing”. (I.e. the line matches those characters exactly.)

twitter[twitter$Text %like% 
          'A computer once beat me at chess, but it was no match for me at kickboxing', ]
##                                                                               Text
## 519059  A computer once beat me at chess, but it was no match for me at kickboxing
## 835824  A computer once beat me at chess, but it was no match for me at kickboxing
## 2283423 A computer once beat me at chess, but it was no match for me at kickboxing
##         length
## 519059      74
## 835824      74
## 2283423     74

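3 tweets. %like% matches substrings, but each matching line is 74 characters long, the same length as the phrase itself, so all three are exact matches.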
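The difference between a substring match and an exact match can be sketched on a few invented lines. The vector `tweets` below is a made-up stand-in for twitter$Text; only the phrase itself comes from the question.

```r
phrase <- "A computer once beat me at chess, but it was no match for me at kickboxing"

# Toy lines standing in for twitter$Text
tweets <- c(phrase, paste0("RT ", phrase), phrase, "unrelated tweet")

# Substring match: counts every line that contains the phrase
sum(grepl(phrase, tweets, fixed = TRUE))
## [1] 3

# Exact match: counts only lines equal to the phrase
sum(tweets == phrase)
## [1] 2

# Length of the phrase, for comparison with the length column above
nchar(phrase)
## [1] 74
```

Using `==` answers the "matches those characters exactly" wording directly, without relying on the length column as a check.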