Overview

Using an English corpus gathered online, we perform a brief exploratory data analysis (EDA) of the text data.

The data

The dataset consists of folders for multiple languages, from which we extract the English folder. It has three text files (blogs, news, twitter), each containing text gathered from that source type.

Twitter.txt:

  • File size
set.seed(69)
file.info("data\\en_US.twitter.txt")$size/(2^20)
## [1] 159.3641
  • Number of lines
con <- file("data\\en_US.twitter.txt", "r")
twitter<-readLines(con)
close(con)

length(twitter)
## [1] 2360148
  • Number of words
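## note: the added length(twitter) contributes one extra count per line,
## so the figure below is space-separated tokens plus the number of lines
sum(sapply(strsplit(twitter, " "), length))+length(twitter)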
## [1] 32733691

News.txt:

  • File size
## [1] 196.2775
  • Number of lines
## [1] 77259
  • Number of words
## [1] 2721228

Blogs.txt:

  • File size
## [1] 200.4242
  • Number of lines
## [1] 899288
  • Number of words
## [1] 38233419
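
The three statistics reported for each file can be gathered by one small function. The following is a minimal sketch rather than the code used for this report: the helper name `corpus_stats` and the temporary demonstration file are assumptions, and the word count is only an approximation based on splitting at single spaces. Unlike the expression shown for the Twitter file, it counts tokens only and does not add the number of lines.

```r
# Hypothetical helper (not from the original report): basic statistics for one text file.
corpus_stats <- function(path) {
  size_mb <- file.info(path)$size / 2^20     # bytes -> MiB
  con <- file(path, "r")
  on.exit(close(con))                        # close the connection even on error
  lines <- readLines(con, warn = FALSE)
  # Approximate word count: tokens separated by single spaces
  n_words <- sum(lengths(strsplit(lines, " ")))
  list(size_mb = size_mb, n_lines = length(lines), n_words = n_words)
}

# Small self-contained demonstration on a temporary file
tmp <- tempfile(fileext = ".txt")
writeLines(c("hello world", "one two three"), tmp)
stats <- corpus_stats(tmp)
print(stats)
unlink(tmp)
```

With the real data, the same call would take a path such as "data\\en_US.twitter.txt".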

Sampling

The files are large, which makes statistical operations on them (computing frequencies, building N-grams, sorting, ...) slow and memory-intensive.
We therefore draw a random sample from each file and save it to a new file.

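## Draws a random sample of n lines from var, writes it to
## data\<file>_sample.txt, and returns the sampled lines.
sav<-function(var,file,n)
{
    dir<-paste0("data\\",file,"_sample.txt")
    con<-file(dir,"w")
    var<-sample(var,n)   ## sample n lines without replacement
    writeLines(var,con)
    close(con)
    var
}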

twitter<-sav(twitter,"twitter",50000)
blogs<-sav(blogs,"blogs",50000)
news<-sav(news,"news",50000)

Now that we have reduced our data files to just 50,000 lines each, it will be easier to compute their characteristics.

blogs[14]
## [1] "1120 N. Ashland (just get off Division & head dirty south)"

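In the blogs file, we noticed that the ’ character (right single quotation mark) is displayed as a garbled byte sequence instead of an apostrophe, most likely because the UTF-8 text is being interpreted in a single-byte encoding, so we try to correct that.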

blogs<-gsub("’","'",blogs)
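
One way to make this kind of replacement independent of how the character is displayed is to refer to the quotation marks by Unicode code point. The following is a minimal sketch, assuming the text has been read as UTF-8 (for example with `readLines(con, encoding = "UTF-8")`); the helper name `normalize_quotes` and the sample strings are hypothetical.

```r
# Hypothetical helper: map curly single quotes to a plain ASCII apostrophe.
# Assumes `x` is a character vector whose strings are valid UTF-8.
normalize_quotes <- function(x) {
  # \u2018 is the left single quotation mark, \u2019 the right one
  gsub("[\u2018\u2019]", "'", x)
}

sample_lines <- c("it\u2019s fine", "\u2018strangely addicting\u2019")
cleaned <- normalize_quotes(sample_lines)
print(cleaned)
```

Handling both the left and the right mark matters here because only one of them is covered by a single-character replacement.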

We take a very brief look at one line from each sample:

twitter[1]
## [1] "Woo-hoo! Now the real fun starts! ;) RT : Hooray! My new passport arrived!!"
news[1]
## [1] "Saturday's forecast calls for partly sunny skies with a high near 55. At night, plan for clouds and a low around 40."
blogs[5]
## [1] "On Friday the National Portrait Gallery sent a letter of claim to a certain Derrick Coetzee, an American software developer at Microsoft. The NPG complains that Coetzee has uploaded more than 3,300 of their photos to Wikipedia. Coetzee does not deny that since joining Wikipedia in 2003 he has found it â\200\230strangely addicting'."

Interpretation

On the one hand, we notice that the Twitter text has typos, abbreviations, and other word forms that do not belong to the English dictionary, which can mislead our English prediction model.
On the other hand, the news data sample is oriented towards politics, economics, and other popular news subjects with a specialized vocabulary, which is not representative of the common English we want to train our model on.

Decision

The blogs data sample seems the best fit as a training set for our English NLP model.

Word frequency and N-grams

The following code takes a text dataset and creates the frequency and N-gram tables:

dat<-blogs[1:50000] ## chooses dataset and size

oneGram<-character(0)
twoGram<-character(0)
threeGram<-character(0)
fourGram<-character(0)

idx1<-1
idx2<-1
idx3<-1
idx4<-1
len<-length(dat)
for (i in seq_len(len)) ## start at 1 so the first line is not skipped
{
    ## split the line into tokens on single spaces
    a<-unlist(strsplit(dat[i]," "))
    n<-length(a)
    for(j in seq_along(a)){ ## seq_along() handles empty lines safely

        oneGram[idx1]<-a[j] ; idx1<-idx1+1

        ## braces keep each index increment inside its if(),
        ## so the n-gram vectors are not left with NA gaps
        if(j<n){
            twoGram[idx2]<-paste(a[j],a[j+1]); idx2<-idx2+1
        }
        if(j<n-1){
            threeGram[idx3]<-paste(a[j],a[j+1],a[j+2]); idx3<-idx3+1
        }
        if(j<n-2){
            fourGram[idx4]<-paste(a[j],a[j+1],a[j+2],a[j+3]); idx4<-idx4+1
        }
    }
}

freq1<-table(oneGram)
freq2<-table(twoGram)
freq3<-table(threeGram)
freq4<-table(fourGram)
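
The same n-gram vectors can also be produced by a small function applied to each line's tokens, which avoids the manual index bookkeeping. This is a minimal sketch and not the code used for this report; the helper name `make_ngrams` and the toy input lines are assumptions made for illustration.

```r
# Hypothetical helper: all n-grams of one token vector, as space-joined strings.
make_ngrams <- function(tokens, n) {
  len <- length(tokens)
  if (len < n) return(character(0))          # line too short for this n
  starts <- seq_len(len - n + 1)             # start position of each n-gram
  vapply(starts,
         function(s) paste(tokens[s:(s + n - 1)], collapse = " "),
         character(1))
}

# Toy input standing in for the sampled blog lines
toy_lines <- c("the cat sat on the mat", "the cat ran")
tokens_by_line <- strsplit(toy_lines, " ")

# N-grams are built per line, so they never span a line boundary
bigrams <- unlist(lapply(tokens_by_line, make_ngrams, n = 2))
bigram_counts <- table(bigrams)
print(bigram_counts)
```

Building the n-grams per line mirrors the loop above, where an n-gram never crosses from one line into the next.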

We save our tables in data frames.

Finally, we remove n-grams that occur fewer than 2 times, because we consider them irrelevant. We then order the n-grams by frequency.
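
The conversion, filtering, and ordering steps can be sketched as follows. The code for this step is not shown in the report, so everything here is an assumption: the helper name `to_freq_df`, the `count` column, and the choice to compute `percentage` relative to all n-grams before filtering. Only the column names `words` and `percentage` are taken from the plotting code.

```r
# Hypothetical helper: turn a frequency table into an ordered data frame.
to_freq_df <- function(tab, label = "words", min_count = 2) {
  df <- as.data.frame(tab, stringsAsFactors = FALSE)
  names(df) <- c(label, "count")
  # Share of all n-grams, computed before rare n-grams are dropped (assumption)
  df$percentage <- 100 * df$count / sum(df$count)
  df <- df[df$count >= min_count, ]          # remove n-grams seen fewer than 2 times
  df <- df[order(-df$count), ]               # most frequent first
  rownames(df) <- NULL
  df
}

# Toy input standing in for oneGram
toy_words <- c("the", "cat", "the", "mat", "the", "cat")
toy_df <- to_freq_df(table(toy_words))
print(toy_df)
```

For the 2-gram table, the same helper could be called with `label = "twoGram"` so that the column name matches the one used in the corresponding plot.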

Plots

First, we plot the most frequent words we encountered.

library(ggplot2)
g1<-ggplot(freq1[1:20,],aes(x=reorder(words,-percentage),y=percentage,fill=percentage)) + 
    geom_bar(stat = "identity")+
    ggtitle(paste("Word frequency in",length(oneGram),"blog words"))+
    xlab("Words")+
    ylab("Frequency (%)")+
    labs(fill="Frequency")+
    theme(axis.text.x = element_text(angle = 90))
g1

Similarly, we plot the frequencies of the remaining N-grams.

#----------TWO GRAM PLOT---------
g2<-ggplot(freq2[1:20,],aes(x=reorder(twoGram,-percentage),y=percentage,fill=percentage)) + 
    geom_bar(stat = "identity")+
    ggtitle(paste("2-Gram frequency in",length(twoGram),"blog 2-Grams"))+
    xlab("2-Gram")+
    ylab("Frequency (%)")+
    labs(fill="Frequency")+
    theme(axis.text.x = element_text(angle = 90))
g2

#----------THREE GRAM PLOT---------
g3<-ggplot(freq3[1:20,],aes(x=reorder(threeGram,-percentage),y=percentage,fill=percentage)) + 
    geom_bar(stat = "identity")+
    ggtitle(paste("3-Gram frequency in",length(threeGram),"blog 3-Grams"))+
    xlab("3-Gram")+
    ylab("Frequency (%)")+
    labs(fill="Frequency")+
    theme(axis.text.x = element_text(angle = 90))
g3

#----------FOUR GRAM PLOT---------
g4<-ggplot(freq4[1:20,],aes(x=reorder(fourGram,-percentage),y=percentage,fill=percentage)) + 
    geom_bar(stat = "identity")+
    ggtitle(paste("4-Gram frequency in",length(fourGram),"blog 4-Grams"))+
    xlab("4-Gram")+
    ylab("Frequency (%)")+
    labs(fill="Frequency")+
    theme(axis.text.x = element_text(angle = 90))
g4