This report presents a brief exploration of the three input files and the creation of a smaller sample file.

Exploring the data

The code below counts the number of lines and words in each of the three input files, using the shell utilities wc and awk, and plots the results.

# get the number of lines and words in each input file with wc,
# keeping only the count field with awk
files=c("en_US.blogs.txt","en_US.news.txt","en_US.twitter.txt")
n=w=c(0,0,0)
for (i in 1:3) {
    n[i]=as.integer(system2("wc",args=c("-l",files[i],
                            " | awk '{print $1}'"), stdout=T))
    w[i]=as.integer(system2("wc",args=c("-w",files[i],
                            " | awk '{print $1}'"), stdout=T))
}

# plot these stats
lab=c("blogs","news","twitter")
nlines=data.frame(lab,nl=as.integer(n/1000))
nlines
##       lab   nl
## 1   blogs  899
## 2    news 1010
## 3 twitter 2360
ggplot()+
    geom_col(data=nlines,aes(lab,nl),fill="lightblue")+
    xlab("file")+
    ylab("lines (in thousands)")+
    ggtitle("number of lines in each file")

nwords=data.frame(lab,nw=as.integer(w/1000))
nwords
##       lab    nw
## 1   blogs 37334
## 2    news 34365
## 3 twitter 30373
ggplot()+
    geom_col(data=nwords,aes(lab,nw),fill="lightgreen")+
    xlab("file")+
    ylab("words (in thousands)")+
    ggtitle("number of words in each file")

As can be seen, the twitter file has by far the most lines, about 2,360 thousand. Interestingly, it is not the one with the most words: the blogs file, despite having the fewest lines, has the largest word count.

Creating a sample file

Let us now create a smaller file to work with later. The code in this section reads a sample of 50,000 lines from the “en_US.blogs.txt” file as 5,000 randomly chosen chunks of 10 consecutive lines each (5,000 × 10 = 50,000). The result is written to a new file named “sample.txt”.
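
If memory were not a concern, a sample of the same size could be drawn much more simply by reading the whole file at once. Below is a sketch of that alternative; note that it samples individual lines rather than 10-line chunks, and must hold the entire file in memory.

# simpler but memory-hungry alternative: load everything, sample 50,000 lines
x=readLines("en_US.blogs.txt")
set.seed(3287)
writeLines(sample(x,50000),"sample.txt")

The chunked approach below keeps memory use much lower, since only a part of the file is held at any one time.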

inpfilename<-"en_US.blogs.txt"
outfilename<-"sample.txt"
chunksize=10                       # lines per chunk
samplesize=5000                    # number of chunks (5000 x 10 = 50000 lines)

# choose "samplesize" chunk starting points at random (multiples of 100);
# sampling without replacement guarantees there are no duplicates
set.seed(3287)
d=sort(sample(1:(n[1] %/% 100),samplesize,replace=F) * 100)
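# sanity check: starting points are distinct multiples of 100, so consecutive
# values differ by at least 100 and chunks of 10 lines can never overlap
stopifnot(all(diff(d) >= 100))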

# read the chunks
inf=file(inpfilename,"r")
file.create(outfilename)
## [1] TRUE
j=0                                # number of lines consumed so far
# open the output file once, in append mode
outf=file(outfilename,"a")
for (m in d) {
    # skip forward so that the next chunk ends exactly at line m
    readLines(inf,m - chunksize - j)
    # read the next chunk of size "chunksize"
    x=readLines(inf,chunksize)
    # append it to the output file
    writeLines(x,outf)
    j=m
}
close(outf)

# the end
close(inf)

Exploring the sample file

Now that the sample file is created, we may start exploring its features.

# line and word counts for the sample file
# (note: this overwrites the n and w vectors computed above)
n=as.integer(system2("wc",args=c("-l","sample.txt",
                     " | awk '{print $1}'"), stdout=T))
w=as.integer(system2("wc",args=c("-w","sample.txt",
                     " | awk '{print $1}'"), stdout=T))

As expected, the number of lines in the sample file is exactly 50,000 (5,000 chunks of 10 lines each). It has “only” 2077 thousand words, far fewer than any of the three full files.
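
As a quick representativeness check using the counts above, the average number of words per line can be compared between the sample and the full blogs file (the figures for the full file are taken, in thousands, from the tables in the first section):

# average words per line: sample vs. full blogs file
w/n          # sample: about 41.5 words per line
37334/899    # full blogs file: also about 41.5 words per line

The two ratios agree closely, which suggests the sample preserves the typical line length of the source file.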