Quiz 1

Question 1

The en_US.blogs.txt file is how many megabytes?

setwd("~/R.Studio/Data_Science_Capstone/final/en_US")
size <- file.info("en_US.blogs.txt")$size
MB <- size/1024/1024
MB

## [1] 200.4242

The en_US.blogs.txt file is about 200 MB (200.4 MB, counting 1 MB as 1024 * 1024 bytes).
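The same calculation can be wrapped in a small helper so it applies to any file. This is a minimal sketch, not part of the original answer: the helper name `file_size_mb` is hypothetical, and a temporary file of known size stands in for the real data.

```r
# Hypothetical helper: file size in megabytes, with 1 MB = 1024^2 bytes
# (the same convention as size/1024/1024).
file_size_mb <- function(path) {
  file.info(path)$size / 1024^2
}

# Stand-in for the real data: a temporary file of exactly 2 * 1024^2 zero bytes.
tmp <- tempfile(fileext = ".txt")
writeBin(raw(2 * 1024^2), tmp)

demo_mb <- file_size_mb(tmp)
demo_mb

unlink(tmp)  # clean up the temporary file
```

On the real data the call would be `file_size_mb("en_US.blogs.txt")`, assuming the working directory is set as in the code for this question.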

Question 2

The en_US.twitter.txt has how many lines of text?

setwd("~/R.Studio/Data_Science_Capstone/final/en_US")
twitter <- readLines("en_US.twitter.txt")
length(twitter)

## [1] 2360148

The file has 2,360,148 lines, i.e. over 2 million lines of text.
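Reading the whole file with readLines() holds every line in memory just to count them. A possible alternative, sketched here under assumptions, is to read from a connection in chunks and keep only a running total. The helper name `count_lines` and the chunk size are hypothetical, and `skipNul = TRUE` is included only in case the file contains embedded NUL characters.

```r
# Hypothetical helper: count lines by reading fixed-size chunks from a connection,
# so only one chunk is held in memory at a time.
count_lines <- function(path, chunk_size = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  total <- 0L
  repeat {
    chunk <- readLines(con, n = chunk_size, skipNul = TRUE)
    if (length(chunk) == 0L) break  # nothing left to read
    total <- total + length(chunk)
  }
  total
}

# Stand-in for the real data: a three-line temporary file, read two lines at a time.
tmp <- tempfile(fileext = ".txt")
writeLines(c("first tweet", "second tweet", "third tweet"), tmp)

n_lines <- count_lines(tmp, chunk_size = 2L)
n_lines

unlink(tmp)  # clean up the temporary file
```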

Question 3

What is the length of the longest line seen in any of the three en_US data sets?

setwd("~/R.Studio/Data_Science_Capstone/final/en_US")
twitter <- readLines("en_US.twitter.txt")
news <- readLines("en_US.news.txt")
blogs <- readLines("en_US.blogs.txt")
summary(nchar(twitter))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     2.0    37.0    64.0    68.8   100.0   213.0

summary(nchar(news))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       2     111     186     203     270    5760

summary(nchar(blogs))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0    47.0   157.0   231.7   331.0 40835.0

The longest line is in the blogs file: 40,835 characters, i.e. over 40 thousand.
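Since only the maximum matters for this question, the three summaries can be condensed into one call that returns the longest line length per data set. The sketch below uses toy character vectors as stand-ins; the names `corpora`, `longest` and `longest_name` are illustrative, not from the original code.

```r
# Toy stand-ins for the three character vectors returned by readLines().
corpora <- list(
  twitter = c("short", "a bit longer"),
  news    = c("medium length line"),
  blogs   = c("x", "the longest line in this toy example")
)

# Longest line (in characters) within each data set.
longest <- sapply(corpora, function(lines) max(nchar(lines)))
longest

# Name of the data set that holds the overall longest line.
longest_name <- names(longest)[which.max(longest)]
longest_name
```

With the real data, the list would be built as `list(twitter = twitter, news = news, blogs = blogs)`.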

Question 4

In the en_US twitter data set, if you divide the number of lines where the word “love” (all lowercase) occurs by the number of lines where the word “hate” (all lowercase) occurs, about what do you get?

love <- length(grep("love", twitter))
hate <- length(grep("hate", twitter))
love/hate

## [1] 4.108592

About 4 (4.108592).
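Note that grep("love", ...) matches the substring anywhere in a line, so "lovely" counts as a "love" line and "whatever" counts as a "hate" line. Whether that matters for the quiz is a judgment call; the sketch below only illustrates the difference on a toy vector, using `\\b` word boundaries with `perl = TRUE` as one possible whole-word variant.

```r
# Toy stand-in for the twitter vector.
tweets <- c("i love this", "lovely day", "i hate mondays",
            "whatever", "love and hate")

# Substring match: "lovely" and "whatever" are counted too.
love_sub <- sum(grepl("love", tweets))
hate_sub <- sum(grepl("hate", tweets))

# Whole-word match: \\b marks a word boundary on each side.
love_word <- sum(grepl("\\blove\\b", tweets, perl = TRUE))
hate_word <- sum(grepl("\\bhate\\b", tweets, perl = TRUE))

c(love_sub = love_sub, hate_sub = hate_sub,
  love_word = love_word, hate_word = hate_word)
```

`sum(grepl(...))` gives the same count as `length(grep(...))`; it is used here only because the logical vector is convenient to inspect.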

Question 5

The one tweet in the en_US twitter data set that matches the word “biostats” says what?

biostats <- grep("biostats", twitter)
twitter[biostats]

## [1] "i know how you feel.. i have biostats on tuesday and i have yet to study =/"

The author has a biostats exam on Tuesday and hasn’t studied for it yet.

Question 6

How many tweets have the exact characters “A computer once beat me at chess, but it was no match for me at kickboxing”? (I.e. the line matches those characters exactly.)

length(grep("A computer once beat me at chess, but it was no match for me at kickboxing", twitter))

## [1] 3

Three tweets contain this text.
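grep() counts lines that contain the sentence, possibly with other text around it. If "matches those characters exactly" is read as the whole line being equal to the sentence, a direct comparison is stricter. The sketch below contrasts the two on a toy vector; the variable names are illustrative, and `fixed = TRUE` is used so the pattern is treated as a literal string rather than a regular expression.

```r
target <- "A computer once beat me at chess, but it was no match for me at kickboxing"

# Toy stand-in for the twitter vector: two exact copies, one copy with extra text.
tweets <- c(target, paste(target, "lol"), "something else", target)

# Lines that contain the sentence anywhere (literal match, no regex).
contains_n <- length(grep(target, tweets, fixed = TRUE))

# Lines that are exactly the sentence and nothing else.
exact_n <- sum(tweets == target)

c(contains = contains_n, exact = exact_n)
```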