By Sandy Sng
8 June 2018
Reading a file a piece at a time requires a file connection in R.
For example, the following code reads the first few lines of the English Twitter dataset:
setwd("~/Desktop/R Files/final/en_US")
con <- file("en_US.twitter.txt", "r")
readLines(con, 1) ## Read the first line of text
readLines(con, 1) ## Read the next line of text
readLines(con, 5) ## Read in the next 5 lines of text
close(con) ## It's important to close the connection when you are done
The en_US.blogs.txt file is how many megabytes? 200.4 MB
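The size can also be checked from within R instead of the file browser. The following is a minimal sketch, not necessarily how the figure above was obtained: `file.size()` returns bytes, and one megabyte is assumed here to mean 1024^2 bytes. The helper name `size_in_mb` is hypothetical, and the demonstration uses a small temporary file so that it runs without the dataset.

```r
# Hypothetical helper: file size in megabytes (1 MB taken as 1024^2 bytes)
size_in_mb <- function(path) {
  file.size(path) / 1024^2
}

# Demonstration on a small temporary file so the sketch runs anywhere
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line"), tmp)
demo_size <- size_in_mb(tmp)

# For the dataset itself (assumes the file is in the working directory):
# size_in_mb("en_US.blogs.txt")
```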
The en_US.twitter.txt has how many lines of text?
setwd("~/Desktop/R Files/final/en_US")
con <- file("en_US.twitter.txt", "r")
EnTwitter <- readLines(con) ## Read every line of the file into a character vector
close(con)
length(EnTwitter)
## [1] 2360148
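Loading the whole file with `readLines()` works here, but the same count can be reached by reading through a connection one chunk at a time, which keeps memory use low for larger files. This is a minimal sketch under stated assumptions: the helper name `count_lines` and the chunk size are hypothetical, `skipNul = TRUE` is added to skip embedded nul characters, and a temporary file stands in for the dataset.

```r
# Hypothetical helper: count lines by reading a fixed number at a time
count_lines <- function(path, chunk_size = 10000L) {
  con <- file(path, "r")
  on.exit(close(con))               # close the connection even on error
  total <- 0L
  repeat {
    chunk <- readLines(con, n = chunk_size, skipNul = TRUE)
    if (length(chunk) == 0L) break  # nothing left to read
    total <- total + length(chunk)
  }
  total
}

# Demonstration on a temporary file of 25 lines, read 10 lines at a time
tmp <- tempfile(fileext = ".txt")
writeLines(as.character(1:25), tmp)
n_lines <- count_lines(tmp, chunk_size = 10L)
n_lines  # 25

# For the dataset itself (assumes the file is in the working directory):
# count_lines("en_US.twitter.txt")
```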
library(stringi)
setwd("~/Desktop/R Files/final/en_US")
## Length in characters of the longest line in each of the three en_US files
con <- file("en_US.blogs.txt", "r")
EnBlogs <- readLines(con)
close(con)
longEnBlogs <- stri_length(EnBlogs)
max(longEnBlogs)
## [1] 40833
con <- file("en_US.news.txt", "r")
EnNews <- readLines(con)
close(con)
longEnNews <- stri_length(EnNews)
max(longEnNews)
## [1] 11384
con <- file("en_US.twitter.txt", "r")
EnTwitter <- readLines(con)
close(con)
longEnTwitter <- stri_length(EnTwitter)
max(longEnTwitter)
## [1] 140
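loveTwitter <- grep("love", EnTwitter) ## Indices of lines containing "love" (case-sensitive)
length(loveTwitter)
## [1] 90956
hateTwitter <- grep("hate", EnTwitter) ## Indices of lines containing "hate" (case-sensitive)
length(hateTwitter)
## [1] 22138
length(loveTwitter) / length(hateTwitter) ## Ratio of "love" lines to "hate" lines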
## [1] 4.108592
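Note what this ratio measures: `grep()` returns one index per matching line, so a line is counted once however many times the word appears, the match is case-sensitive, and any substring counts. The sketch below illustrates this on a made-up vector (the `tweets` values are invented for the example and are not from the dataset), using `grepl()` with `sum()` as an equivalent way to count.

```r
# Made-up example lines, not taken from the Twitter dataset
tweets <- c(
  "I love R",
  "I hate bugs",
  "love, love, love",  # counted once: lines are counted, not occurrences
  "LOVE this",         # not counted: the match is case-sensitive
  "whatever"           # counted as "hate": w-HATE-ver contains the substring
)

love_lines <- sum(grepl("love", tweets, fixed = TRUE))
hate_lines <- sum(grepl("hate", tweets, fixed = TRUE))
ratio <- love_lines / hate_lines

love_lines  # 2
hate_lines  # 2
ratio       # 1
```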
biostatsTwitter <- grep("biostats", EnTwitter)
EnTwitter[biostatsTwitter] ## Show the tweet that mentions "biostats"
## [1] "i know how you feel.. i have biostats on tuesday and i have yet to study =/"
sentenceTwitter <- grep("A computer once beat me at chess, but it was no match for me at kickboxing", EnTwitter, fixed = TRUE) ## Match the sentence literally, not as a regular expression
length(sentenceTwitter)
## [1] 3
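The count above includes any tweet that contains the sentence, even with extra text around it. If the goal is tweets consisting of exactly that sentence, comparing with `==` is stricter. A minimal sketch on made-up data (the `tweets` vector is invented for illustration):

```r
target <- "A computer once beat me at chess, but it was no match for me at kickboxing"

# Made-up example lines, not taken from the Twitter dataset
tweets <- c(
  target,                            # exactly the sentence
  paste("RT", target),               # the sentence with extra text in front
  "a computer once beat me at chess" # different case, and only part of it
)

# Lines that contain the sentence anywhere (literal, case-sensitive match)
substring_hits <- sum(grepl(target, tweets, fixed = TRUE))
# Lines that are exactly the sentence and nothing else
exact_hits <- sum(tweets == target)

substring_hits  # 2
exact_hits      # 1
```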