C10 Quiz 1: Getting Started

By Sandy Sng
8 June 2018

Reading pieces of the file at a time will require the use of a file connection in R.
For example, the following code reads the first few lines of the English Twitter dataset:

setwd("~/Desktop/R Files/final/en_US")
con <- file("en_US.twitter.txt", "r") 
readLines(con, 1) ## Read the first line of text 
readLines(con, 1) ## Read the next line of text 
readLines(con, 5) ## Read in the next 5 lines of text 
close(con) ## It's important to close the connection when you are done
  1. The en_US.blogs.txt file is how many megabytes? About 200.4 MB.
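The size can also be checked from within R instead of the file browser. The sketch below is one way to do it, assuming "megabytes" means bytes divided by 1024^2; `file_size_mb` is a hypothetical helper name, not part of the original quiz code.

```r
# Hypothetical helper: file size in MB, taking 1 MB = 1024^2 bytes (an assumption).
file_size_mb <- function(path) {
  file.size(path) / 1024^2
}

# Self-contained check on a temporary file of exactly 2 * 1024^2 bytes.
tmp <- tempfile()
writeBin(raw(2 * 1024^2), tmp)
size_mb <- file_size_mb(tmp)
print(size_mb)
unlink(tmp)
```

With the working directory set as in the other snippets, `file_size_mb("en_US.blogs.txt")` can be compared against the figure given in the answer.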

  2. The en_US.twitter.txt has how many lines of text?

setwd("~/Desktop/R Files/final/en_US")
con <- file("en_US.twitter.txt", "r")
EnTwitter <- readLines(con)
close(con)
length(EnTwitter)
## [1] 2360148
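Since the point of a file connection is to read a file in pieces, the line count can also be obtained without holding the whole file in memory. The following is a minimal sketch of that idea, not the approach used above; `count_lines_chunked` and its `chunk_size` argument are hypothetical names chosen for illustration.

```r
# Hypothetical helper: count lines by reading chunk_size lines at a time,
# so the whole file never has to be held in memory at once.
count_lines_chunked <- function(path, chunk_size = 10000) {
  con <- file(path, "r")
  on.exit(close(con))               # close the connection even if an error occurs
  total <- 0
  repeat {
    chunk <- readLines(con, n = chunk_size, skipNul = TRUE)
    if (length(chunk) == 0) break   # nothing left to read
    total <- total + length(chunk)
  }
  total
}

# Self-contained demo on a temporary 25-line file.
tmp <- tempfile(fileext = ".txt")
writeLines(paste("line", 1:25), tmp)
n_lines <- count_lines_chunked(tmp, chunk_size = 10)
print(n_lines)
unlink(tmp)
```

Each call to `readLines` on the open connection continues where the previous call stopped, which is the same behaviour the introductory example relies on.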
  3. What is the length of the longest line seen in any of the three en_US data sets?
library(stringi)
setwd("~/Desktop/R Files/final/en_US")

con <- file("en_US.blogs.txt", "r")
EnBlogs <- readLines(con)
close(con) ## Close the connection as soon as the file has been read
longEnBlogs <- stri_length(EnBlogs) ## Number of characters in each line
max(longEnBlogs)
## [1] 40833

con <- file("en_US.news.txt", "r")
EnNews <- readLines(con)
close(con)
longEnNews <- stri_length(EnNews)
max(longEnNews)
## [1] 11384

con <- file("en_US.twitter.txt", "r")
EnTwitter <- readLines(con)
close(con)
longEnTwitter <- stri_length(EnTwitter)
max(longEnTwitter)
## [1] 140
# The longest line overall is in the blogs file: 40833 characters
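The same read-then-measure steps are repeated for each file, so they could be wrapped in a small function. The sketch below is one possible way to do that; `longest_line` is a hypothetical helper, and it uses base `nchar()` in place of `stri_length()` so that it runs without extra packages (the two may not agree on every line of real data).

```r
# Hypothetical helper: length of the longest line in one file.
# Base nchar() is swapped in here for stringi::stri_length().
longest_line <- function(path) {
  con <- file(path, "r")
  on.exit(close(con))
  lines <- readLines(con, skipNul = TRUE)
  max(nchar(lines, type = "chars"))
}

# Self-contained demo: two small temporary files stand in for the data sets.
f1 <- tempfile(); writeLines(c("short", "a much longer line"), f1)
f2 <- tempfile(); writeLines(c("mid length", "tiny"), f2)
longest <- sapply(c(first = f1, second = f2), longest_line)
print(longest)
unlink(c(f1, f2))
```

Applied to a named vector of the three en_US file names, `sapply` would return one maximum per file, and `max()` of that result gives the overall answer.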
  4. In the en_US twitter data set, if you divide the number of lines where the word “love” (all lowercase) occurs by the number of lines the word “hate” (all lowercase) occurs, about what do you get?
loveTwitter <- grep("love", EnTwitter)
length(loveTwitter)
## [1] 90956
hateTwitter <- grep("hate", EnTwitter)
length(hateTwitter)
## [1] 22138
print(length(loveTwitter)/length(hateTwitter)) # 90956/22138 = 4.1086
## [1] 4.108592
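It may help to see what `grep` does and does not count here. The sketch below uses a made-up vector in place of `EnTwitter` (the lines are invented for illustration) to show that matching is case-sensitive by default and that a substring such as "lovely" also counts as a match.

```r
# Toy stand-in for EnTwitter (made-up lines, not taken from the data set).
tweets <- c("i love R", "Love is all", "i hate bugs", "love and hate", "lovely day")

# grep() is case-sensitive by default; fixed = TRUE treats the pattern as a literal string.
love_n <- length(grep("love", tweets, fixed = TRUE))  # "Love is all" is not counted
hate_n <- length(grep("hate", tweets, fixed = TRUE))
ratio <- love_n / hate_n
print(c(love = love_n, hate = hate_n, ratio = ratio))
```

If only the standalone word should count, a pattern with word boundaries such as `"\\blove\\b"` (without `fixed = TRUE`) is one option, though the quiz answer above is based on plain substring matching.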
  5. The one tweet in the en_US twitter data set that matches the word “biostats” says what?
biostatsTwitter <- grep("biostats", EnTwitter)
EnTwitter[biostatsTwitter]
## [1] "i know how you feel.. i have biostats on tuesday and i have yet to study =/"
  6. How many tweets have the exact characters “A computer once beat me at chess, but it was no match for me at kickboxing”. (I.e. the line matches those characters exactly.)
sentenceTwitter <- grep("A computer once beat me at chess, but it was no match for me at kickboxing", EnTwitter, fixed = TRUE) ## fixed = TRUE matches the text literally
length(sentenceTwitter)
## [1] 3
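Note that `grep` counts lines that contain the sentence, which is not quite the same as lines that consist of nothing else. The sketch below contrasts the two on a made-up vector (an assumption for illustration, not the real data); which reading the quiz intends is left open.

```r
sentence <- "A computer once beat me at chess, but it was no match for me at kickboxing"

# Toy stand-in for EnTwitter (made-up lines).
tweets <- c(sentence,
            paste("RT:", sentence),
            "something unrelated",
            sentence)

contains_n <- length(grep(sentence, tweets, fixed = TRUE))  # lines containing the sentence
exact_n    <- sum(tweets == sentence)                       # lines equal to the sentence
print(c(contains = contains_n, exact = exact_n))
```

On the real data, `sum(EnTwitter == sentence)` would give the strict whole-line count for comparison with the `grep` result.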