The en_US.blogs.txt file is how many megabytes?
size<-file.info("./final/en_US/en_US.blogs.txt")
kb<-size$size/1024
mb<-kb/1024
mb
## [1] 200.4242
200
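The byte-to-megabyte conversion above can be wrapped in a small helper so it can be reused for the other files. This is only a minimal sketch: `file_size_mb` is a hypothetical name that does not appear in the original, and it assumes 1 MB = 1024^2 bytes, as the calculation above does.

```r
# Hypothetical helper: size of a file in megabytes (1 MB = 1024^2 bytes).
file_size_mb <- function(path) {
  stopifnot(file.exists(path))
  file.info(path)$size / 1024^2
}

# Self-contained demo on a temporary file holding exactly 2 MB of zero bytes.
tmp <- tempfile(fileext = ".bin")
writeBin(raw(2 * 1024^2), tmp)
demo_mb <- file_size_mb(tmp)
unlink(tmp)
```

Called as `file_size_mb("./final/en_US/en_US.blogs.txt")`, it performs the same arithmetic as the two divisions by 1024 shown above.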
The en_US.twitter.txt has how many lines of text?
twitter <- readLines("./final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
length(twitter)
## [1] 2360148
Over 2 million
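Reading the whole file just to count its lines keeps every tweet in memory. As a sketch of an alternative, the lines can be counted in chunks from an open connection. `count_lines` and its `chunk_size` argument are hypothetical names introduced here for illustration, not part of the original.

```r
# Hypothetical helper: count lines by reading fixed-size chunks from an
# open connection, so at most `chunk_size` lines are held in memory at once.
count_lines <- function(path, chunk_size = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))  # close the connection even if reading fails
  total <- 0L
  repeat {
    chunk <- readLines(con, n = chunk_size, skipNul = TRUE, warn = FALSE)
    if (length(chunk) == 0L) break  # end of file reached
    total <- total + length(chunk)
  }
  total
}

# Self-contained demo on a temporary file with 25 lines.
tmp <- tempfile(fileext = ".txt")
writeLines(sprintf("line %d", 1:25), tmp)
demo_count <- count_lines(tmp, chunk_size = 10L)
unlink(tmp)
```

The trade-off is speed against memory: for a file of this size `length(readLines(...))` is simpler, while the chunked version is the safer choice when the file may not fit in memory.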
What is the length of the longest line seen in any of the three en_US data sets?
# Blogs file
blogs<-file("./final/en_US/en_US.blogs.txt","r")
blogs_lines<-readLines(blogs)
close(blogs)
summary(nchar(blogs_lines))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##       1      47     156     230     329   40830
# News file
news<-file("./final/en_US/en_US.news.txt","r")
news_lines<-readLines(news)
close(news)
summary(nchar(news_lines))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     1.0   110.0   185.0   201.2   268.0 11380.0
# Twitter file
twitter<-file("./final/en_US/en_US.twitter.txt","r")
twitter_lines<-readLines(twitter)
## Warning in readLines(twitter): line 167155 appears to contain an embedded
## nul
## Warning in readLines(twitter): line 268547 appears to contain an embedded
## nul
## Warning in readLines(twitter): line 1274086 appears to contain an embedded
## nul
## Warning in readLines(twitter): line 1759032 appears to contain an embedded
## nul
close(twitter)
summary(nchar(twitter_lines))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    2.00   37.00   64.00   68.68  100.00  140.00
Over 40 thousand in the blogs data set
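Putting the three maxima side by side makes the comparison easier to read than three separate summaries. The following is a minimal sketch under stated assumptions: `max_line_length` is a hypothetical helper, and the demo uses small temporary files rather than the real corpus.

```r
# Hypothetical helper: length in characters of the longest line in a file.
max_line_length <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
  # allowNA = TRUE returns NA instead of an error for invalid multibyte text
  max(nchar(lines, allowNA = TRUE), na.rm = TRUE)
}

# Self-contained demo: three small files with known longest lines.
dir <- tempfile("corpus")
dir.create(dir)
paths <- file.path(dir, c("blogs.txt", "news.txt", "twitter.txt"))
writeLines(c("short", strrep("b", 40)), paths[1])
writeLines(c("medium line", strrep("n", 11)), paths[2])
writeLines(c("tweet", strrep("t", 14)), paths[3])

longest <- vapply(paths, max_line_length, numeric(1))
names(longest) <- basename(paths)
winner <- names(longest)[which.max(longest)]  # file with the longest line
unlink(dir, recursive = TRUE)
```

With the real paths under `./final/en_US/`, `longest` would hold one maximum per file and `winner` would name the data set that contains the longest line.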
In the en_US twitter data set, if you divide the number of lines where the word "love" (all lowercase) occurs by the number of lines the word "hate" (all lowercase) occurs, about what do you get?
love<-length(grep("love", twitter_lines))
hate<-length(grep("hate", twitter_lines))
love/hate
## [1] 4.108592
4
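The pattern `"love"` is a substring match, so it also counts lines containing words such as "lovely" or "gloves". Whether that matters depends on how strictly "the word" is read. As a sketch of the stricter reading, whole-word matches can be counted with regex word boundaries. `count_word_lines` is a hypothetical helper, and it assumes `word` contains no regex metacharacters.

```r
# Hypothetical helper: number of lines containing `word` as a whole word.
# Matching is case-sensitive; "\\b" marks a word boundary.
count_word_lines <- function(word, lines) {
  sum(grepl(paste0("\\b", word, "\\b"), lines, perl = TRUE))
}

demo_lines <- c("i love R", "lovely day", "gloves are warm",
                "i hate bugs", "love and hate", "Love is capitalised")

substring_love <- sum(grepl("love", demo_lines))        # substring match
word_love      <- count_word_lines("love", demo_lines)  # whole-word match
word_hate      <- count_word_lines("hate", demo_lines)
demo_ratio     <- word_love / word_hate
```

On the real data the two readings may give somewhat different ratios; the substring version is the one used in the answer above.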
The one tweet in the en_US twitter data set that matches the word "biostats" says what?
grep("biostats", twitter_lines, value = TRUE)
## [1] "i know how you feel.. i have biostats on tuesday and i have yet to study =/"
They haven't studied for their biostats exam
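How many tweets have the exact characters "A computer once beat me at chess, but it was no match for me at kickboxing"?
grep("A computer once beat me at chess, but it was no match for me at kickboxing", twitter_lines)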
## [1] 519059 835824 2283423
3
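By default `grep` treats its pattern as a regular expression. That happens to be harmless for this sentence, but a phrase containing characters such as `.` or `?` would match more than intended. A minimal sketch of a literal match using `fixed = TRUE` follows; `count_literal` is a hypothetical helper and the demo lines are made up for illustration.

```r
# Hypothetical helper: count lines that contain `phrase` as a literal
# string, with no regex interpretation of its characters.
count_literal <- function(phrase, lines) {
  sum(grepl(phrase, lines, fixed = TRUE))
}

demo_lines <- c("a.b is here", "axb is here", "nothing")

regex_hits   <- sum(grepl("a.b", demo_lines))     # "." matches any character
literal_hits <- count_literal("a.b", demo_lines)  # "." matches only "."
```

If "exact characters" is read as the whole line being equal to the phrase, `sum(twitter_lines == phrase)` would be the corresponding check, with `phrase` holding the sentence.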