R Markdown

This R Markdown document is by Rongbin Ye, who is pursuing the data science concentration in the JHU Coursera course.

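Question 1: The en_US.blogs.txt file is how many megabytes? About 200 MB (the 200.4 computed below uses 1 MB = 1024 × 1024 bytes; counted in units of 1,000,000 bytes, the same file is 210.2 MB).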

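# file.info() returns a data frame; its size column is the file size in bytes
size <- file.info("~/Downloads/final/en_US/en_US.blogs.txt")
kb <- size$size / 1024   # bytes to kilobytes
mb <- kb / 1024          # kilobytes to megabytes
mb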
## [1] 200.4242
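
The gap between 200.4 and 210.2 comes only from the unit convention. A minimal sketch of both conversions follows; `size_in_mb` is a hypothetical helper that is not part of the original analysis, and the temporary file stands in for the course data so the example runs anywhere.

```r
# Hypothetical helper (an assumption, not from the original analysis):
# report a file's size in decimal megabytes (MB) and binary mebibytes (MiB).
size_in_mb <- function(path) {
  bytes <- file.size(path)  # size in bytes; NA if the file does not exist
  c(MB = bytes / 1e6, MiB = bytes / 1024^2)
}

# Demonstrate on a temporary file of exactly 2 * 1024^2 bytes.
tmp <- tempfile()
writeBin(raw(2 * 1024^2), tmp)
sizes <- size_in_mb(tmp)
print(sizes)
unlink(tmp)
```

Applied to en_US.blogs.txt, the `MiB` entry should correspond to the 200.4242 printed above and the `MB` entry to the 210.2 figure.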

Question 2: How many lines does the en_US.twitter.txt file have? Over 2 million.

library(readr)  # loaded here, but the code below only uses base R's readLines()

# Open a read connection to each file
contwitters <- file("~/Downloads/final/en_US/en_US.twitter.txt", "r")
conblogs <- file("~/Downloads/final/en_US/en_US.blogs.txt", "r")
connews <- file("~/Downloads/final/en_US/en_US.news.txt", "r")

# Read every line; skipNul = TRUE skips embedded NUL characters
twitter <- readLines(con = contwitters, encoding = "UTF-8", skipNul = TRUE)
blogs <- readLines(con = conblogs, encoding = "UTF-8", skipNul = TRUE)
news <- readLines(con = connews, encoding = "UTF-8", skipNul = TRUE)

# Close the connections once the data are in memory
close(contwitters)
close(conblogs)
close(connews)
length(twitter)
## [1] 2360148
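
Reading all three files at once needs a fair amount of memory. If only the line count is wanted, the file can be read in chunks instead. The sketch below assumes a hypothetical helper `count_lines` and a chunk size chosen arbitrarily; it runs on a small temporary file rather than the course data.

```r
# Hypothetical helper (an assumption, not from the original analysis):
# count lines by reading fixed-size chunks so the whole file is never
# held in memory at once.
count_lines <- function(path, chunk_size = 10000L) {
  con <- file(path, "r")
  on.exit(close(con))  # release the connection even if an error occurs
  total <- 0L
  repeat {
    chunk <- readLines(con, n = chunk_size, skipNul = TRUE)
    if (length(chunk) == 0L) break  # nothing left to read
    total <- total + length(chunk)
  }
  total
}

# Demonstrate on a temporary file with 25 lines.
tmp <- tempfile()
writeLines(paste("tweet", 1:25), tmp)
n_lines <- count_lines(tmp, chunk_size = 10L)
print(n_lines)
unlink(tmp)
```

Called on en_US.twitter.txt, this should agree with `length(twitter)` above.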

Question 3: How long are the lines (in characters) in each of the three files? The longest line is in the blogs file, at 40,833 characters.

summary(nchar(twitter))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   37.00   64.00   68.68  100.00  140.00
summary(nchar(blogs))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1      47     156     230     329   40833
summary(nchar(news))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0   110.0   185.0   201.2   268.0 11384.0
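
The summaries above give the maximum for each file separately. A compact way to compare them is to take the longest line per data set in one pass. This is a sketch on made-up toy vectors (the `corpora` list is an assumption for illustration); with the real data, the list would hold `twitter`, `blogs`, and `news`.

```r
# Toy stand-ins for the three data sets (illustrative data only).
corpora <- list(
  twitter = c("short tweet", "a slightly longer tweet"),
  blogs   = c("a blog post that runs on for quite a while", "brief"),
  news    = c("headline", "a news sentence")
)

# Longest line, in characters, for each data set.
longest <- sapply(corpora, function(lines) max(nchar(lines)))
print(longest)

# Name of the data set that contains the longest line overall.
winner <- names(which.max(longest))
print(winner)
```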

Question 4: In the en_US twitter data set, if you divide the number of lines where the word “love” (all lowercase) occurs by the number of lines the word “hate” (all lowercase) occurs, about what do you get?

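# grep() returns the indices of matching lines; the match is on substrings,
# so a line containing "lovely" also counts as a "love" line
love <- length(grep("love", twitter))
hate <- length(grep("hate", twitter))
love / hate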
## [1] 4.108592
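
Because `grep("love", ...)` matches substrings, words such as "gloves" or "whatever" are counted too. The sketch below contrasts substring matching with whole-word matching on made-up tweets; the `tweets` vector and the variable names are assumptions for illustration, and the result on the real data is not shown here.

```r
# Toy tweets (illustrative data only).
tweets <- c("i love this", "gloves are warm", "i hate mondays",
            "love love love", "whatever")

# Substring matching, equivalent to the grep() calls above:
# "gloves" contains "love" and "whatever" contains "hate".
love_sub <- sum(grepl("love", tweets, fixed = TRUE))
hate_sub <- sum(grepl("hate", tweets, fixed = TRUE))

# Whole-word matching: \\b marks a word boundary in R regular expressions.
love_word <- sum(grepl("\\blove\\b", tweets))
hate_word <- sum(grepl("\\bhate\\b", tweets))

print(c(substring = love_sub / hate_sub, whole_word = love_word / hate_word))
```

Which definition is appropriate depends on how the question is read; the ratio reported above uses substring matching.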

Question 5: The one tweet in the en_US twitter data set that matches the word “biostats” says what?

grep("biostats", twitter, value = TRUE)
## [1] "i know how you feel.. i have biostats on tuesday and i have yet to study =/"

Question 6: How many tweets have the exact characters “A computer once beat me at chess, but it was no match for me at kickboxing”? (I.e., the line matches those characters exactly.)

grep("A computer once beat me at chess, but it was no match for me at kickboxing", twitter)
## [1]  519059  835824 2283423
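
The call above returns the indices of the three tweets that contain the sentence, so the count is the length of that vector. The sketch below, on made-up data, shows how to get the count directly and how "contains the sentence" differs from "is exactly the sentence"; the `tweets` vector here is an assumption for illustration.

```r
target <- "A computer once beat me at chess, but it was no match for me at kickboxing"

# Toy tweets (illustrative data only): two exact copies, one with a prefix.
tweets <- c(target, paste0("RT ", target), "unrelated tweet", target)

# Lines that contain the sentence; fixed = TRUE treats it as a literal string.
n_contains <- sum(grepl(target, tweets, fixed = TRUE))

# Lines that consist of exactly the sentence and nothing else.
n_exact <- sum(tweets == target)

print(c(contains = n_contains, exact = n_exact))
```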