Executive summary

This document presents an exploratory analysis of the data provided in week 1. The data consists of three txt files containing text from different sources: blogs, news and Twitter. First, each file is read, its size is examined and a sample of it is taken in order to study its main characteristics. Then, in order to build n-gram models, the distribution of words is studied for several cases: unigrams, bigrams, trigrams and tetragrams.
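
The code in this report relies on several R packages. The setup chunk is not shown, so the following is a minimal sketch of the libraries that the chunks below appear to require,

#LOAD REQUIRED PACKAGES (ASSUMED SETUP CHUNK, NOT SHOWN IN THE ORIGINAL REPORT)
library(tm)        #VCorpus, tm_map AND THE CLEANING TRANSFORMATIONS
library(tidytext)  #unnest_tokens, stop_words AND tidy() FOR A CORPUS
library(dplyr)     #%>%, filter AND count
library(tidyr)     #separate AND unite
library(DT)        #INTERACTIVE TABLES (datatable)
library(plotly)    #INTERACTIVE BAR CHARTS (plot_ly)
library(ggplot2)   #COVERAGE PLOT
#THE SnowballC PACKAGE MUST ALSO BE INSTALLED FOR stemDocument TO WORK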

Exploratory analysis

In this section the data is read in its entirety in order to obtain its main characteristics: the size of each file, the number of lines it contains and the length of its longest line.

Blogs data

In order to see the structure of the blogs data, the first 5 lines will be displayed,

con <- file("en_US/en_US.blogs.txt", "r")

#READ LINES OF THE FILE
a1 <- readLines(con, 5)

close(con)

#SHOW 5 LINES
a1
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan â\200œgodsâ\200\235."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
## [2] "We love you Mr. Brown."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
## [3] "Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him."
## [4] "so anyways, i am going to share some home decor inspiration that i have been storing in my folder on the puter. i have all these amazing images stored away ready to come to life when we get our home."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
## [5] "With graduation season right around the corner, Nancy has whipped up a fun set to help you out with not only your graduation cards and gifts, but any occasion that brings on a change in one's life. I stamped the images in Memento Tuxedo Black and cut them out with circle Nestabilities. I embossed the kraft and red cardstock with TE's new Stars Impressions Plate, which is double sided and gives you 2 fantastic patterns. You can see how to use the Impressions Plates in this tutorial Taylor created. Just one pass through your die cut machine using the Embossing Pad Kit is all you need to do - super easy!"

News data

In order to see the structure of the news data, the first 5 lines will be displayed,

con <- file("en_US/en_US.news.txt", "r")

#READ LINES OF THE FILE
a2 <- readLines(con, 5)

close(con)

#SHOW 5 LINES
a2
## [1] "He wasn't home alone, apparently."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 
## [2] "The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s."                                                                                                                                                                                                                                                                                                                                                         
## [3] "WSU's plans quickly became a hot topic on local online sites. Though most people applauded plans for the new biomedical center, many deplored the potential loss of the building."                                                                                                                                                                                                                                                                                                                                 
## [4] "The Alaimo Group of Mount Holly was up for a contract last fall to evaluate and suggest improvements to Trenton Water Works. But campaign finance records released this week show the two employees donated a total of $4,500 to the political action committee (PAC) Partners for Progress in early June. Partners for Progress reported it gave more than $10,000 in both direct and in-kind contributions to Mayor Tony Mack in the two weeks leading up to his victory in the mayoral runoff election June 15."
## [5] "And when it's often difficult to predict a law's impact, legislators should think twice before carrying any bill. Is it absolutely necessary? Is it an issue serious enough to merit their attention? Will it definitely not make the situation worse?"

Twitter data

In order to see the structure of the twitter data, the first 5 lines will be displayed,

con <- file("en_US/en_US.twitter.txt", "r")

#READ LINES OF THE FILE
a3 <- readLines(con, 5)

close(con)

#SHOW 5 LINES
a3
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."  
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
## [3] "they've decided its more fun if I don't."                                                                       
## [4] "So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;)"                           
## [5] "Words from a complete stranger! Made my birthday even better :)"

The main characteristics of each file are summarized in the following table,

#CREATE A DATA FRAME WITH IMPORTANT INFO
D <- as.data.frame(matrix(0,nrow = 3,ncol = 4))
names(D) <- c("Data","Size (MB)","Lines number","Longest line")

#DATA SIZE IN MB
blogs_info <- file.info("en_US/en_US.blogs.txt")
news_info <- file.info("en_US/en_US.news.txt")
twitter_info <- file.info("en_US/en_US.twitter.txt")

size <- c(blogs_info$size/(1024)**2,news_info$size/(1024)**2,twitter_info$size/(1024)**2)

#NUMBER OF LINES IN EACH FILE
len <- c(NROW(a_blogs),NROW(a_news),NROW(a_twitter))

#LENGTH OF THE LONGEST LINE IN EACH FILE (IN CHARACTERS)
long <- c(max(nchar(a_blogs)),
          max(nchar(a_news)),
          max(nchar(a_twitter)))

#FILL D
D$Data <- c("Blogs","News","Twitter")
D$`Size (MB)` <- format(size,digits = 2,nsmall = 2,big.mark = ",", 
                      decimal.mark = ".")
D$`Lines number` <- format(len,big.mark = ",", 
                      decimal.mark = ".")
D$`Longest line` <- format(long,big.mark = ",", 
                      decimal.mark = ".")

#SHOW DATA
DT::datatable(D,rownames = FALSE,options = list(
            columnDefs = list(list(className = 'dt-center', targets = "_all"))
            ))

In order to explore the important features of the data I will work with a sample of it, since the full data set is relatively large. I will take 0.2% of each file,

#CALCULATE 0.2% OF EACH FILE
D1 <- D[,c(1,3)]
D1$Sample_size <- len*0.002
D1$Sample_size<- format(D1$Sample_size,digits = 2,nsmall = 2,big.mark = ",", 
                      decimal.mark = ".")

#SHOW DATA
DT::datatable(D1,rownames = FALSE,options = list(
            columnDefs = list(list(className = 'dt-center', targets = "_all"))
            ))

Once the sample size is known, this number of lines is extracted from each file and the three samples are combined into a single vector,

#SET SEED
set.seed(1234)
factor <- 0.002

#SAMPLE DATA
a_blogs_sample <- sample(a_blogs,round(factor*length(a_blogs)))
a_news_sample <- sample(a_news,round(factor*length(a_news)))
a_twitter_sample <- sample(a_twitter,round(factor*length(a_twitter)))

#REMOVE NON-ASCII CHARACTERS
a_blogs_sample <- iconv(a_blogs_sample,"latin1","ASCII",sub = "")
a_news_sample <- iconv(a_news_sample,"latin1","ASCII",sub = "")
a_twitter_sample <- iconv(a_twitter_sample,"latin1","ASCII",sub = "")

#JOIN DATA
a <- c(a_blogs_sample,a_news_sample,a_twitter_sample)

#CREATE A CORPUS
corpus <- VCorpus(VectorSource(a))

#SHOW DOCUMENT 61 OF THE CORPUS
content(corpus[["61"]])
## [1] "Looking at this set, I've realised something. That while Scottish brewers were renowned for their Strong Ales, they didn't make Stouts of any great strength. The strongest I've found so far was 1078. That's barely a Double Stout by London standards. There were plenty of London Stouts that were 1090 and above."

The extracted data was converted into a structure called a corpus, which makes text data easy to handle. Punctuation marks, numbers and extra whitespace were removed from this structure, and the text was converted to lower case and stemmed, in order to make it cleaner and more manageable. Finally, the tidy function was used to transform the corpus into a data frame, which makes the resulting data easier to work with. From this data frame, a series of steps yields the desired frequency tables for unigrams, bigrams, trigrams and tetragrams.

#PRE-PROCESSING OF THE DATA
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, stemDocument, language = "english")

#TRANSFORM THE CORPUS INTO A TIDY DATA FRAME
s <- tidy(corpus)

#SHOW DATA
DT::datatable(s,rownames = FALSE,options = list(
            columnDefs = list(list(className = 'dt-center', targets = "_all"))
            ))
## Warning in instance$preRenderHook(instance): It seems your data is too
## big for client-side DataTables. You may consider server-side processing:
## https://rstudio.github.io/DT/server.html

N-grams

The process for creating the n-grams, for n = 1, 2, 3, 4, is very similar in every case; it is described below,

  1. The text is split into tokens with the unnest_tokens function, which requires specifying the number of words in each token; for n = 1 the text is split word by word, which yields the unigram data.

  2. Once the tokens are obtained, stop words and null values are removed.

  3. Finally, the frequency of each resulting n-gram is counted. A sketch that combines these three steps is shown after this list.
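
These three steps are repeated for n = 1, 2, 3 and 4 in the sections that follow. As a reference, they could be wrapped into a single helper; count_ngrams below is a hypothetical sketch that reproduces the same filtering and counting for any n,

#HYPOTHETICAL HELPER THAT APPLIES THE THREE STEPS ABOVE FOR A GIVEN n
count_ngrams <- function(s, n) {
  s %>%
    #1. SPLIT THE TEXT INTO n-WORD TOKENS
    unnest_tokens(ngram, text, token = "ngrams", n = n) %>%
    #2. DROP NULL VALUES AND n-GRAMS THAT CONTAIN A STOP WORD
    filter(!is.na(ngram)) %>%
    separate(ngram, paste0("word", 1:n), sep = " ") %>%
    filter(if_all(starts_with("word"), ~ !.x %in% stop_words$word)) %>%
    #3. COUNT THE FREQUENCY OF EACH REMAINING n-GRAM
    unite(ngram, starts_with("word"), sep = " ") %>%
    count(ngram, sort = TRUE)
}

#FOR EXAMPLE, count_ngrams(s, 2) SHOULD GIVE THE SAME COUNTS AS THE BIGRAM TABLE BELOW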

Unigram

A table showing the frequency of the most repeated words is presented below,

#UNIGRAM
s4 <- s %>%
  unnest_tokens(unigram, text, token = "ngrams", n = 1) %>%
  filter(!unigram %in% stop_words$word) %>%  filter(!is.na(unigram)) %>%
  count(unigram ,sort = TRUE)

unigrama <- s4

DT::datatable(s4,rownames = FALSE,options = list(
            columnDefs = list(list(className = 'dt-center', targets = "_all"))
            ))

An interactive chart with the above information is presented below,

#UNIGRAM
fig1 <- plot_ly(s4[1:30,], x = ~unigram, y = ~n, type = 'bar',
               marker = list(color = 'blue',
                             line = list(color = 'black',
                                         width = 2))) %>%
              layout(barmode = 'stack',
               title= c("Unigram"),
               xaxis = list(title ="Word",
                            categoryorder = "array",
                            categoryarray = ~n),
               yaxis = list(title ="Frequency"))

fig1

Bigram

A table with the frequency of the most common word pairs is presented below,

#BIGRAM
s2 <- s %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>%  filter(!is.na(word1)) %>%
  count(word1, word2, sort = TRUE)

bigram <- s2 %>%
  unite(bigram, word1, word2, sep = " ")

DT::datatable(bigram,rownames = FALSE,options = list(
            columnDefs = list(list(className = 'dt-center', targets = "_all"))
            ))
## Warning in instance$preRenderHook(instance): It seems your data is too
## big for client-side DataTables. You may consider server-side processing:
## https://rstudio.github.io/DT/server.html
An interactive chart with the above information is presented below,

#BIGRAM
fig2 <- plot_ly(bigram[1:30,], x = ~bigram, y = ~n, type = 'bar',
                marker = list(color = 'green',
                              line = list(color = 'black',
                                          width = 2))) %>%
  layout(barmode = 'stack',
         title= c("Bigram"),
         xaxis = list(title ="Word",
                      categoryorder = "array",
                      categoryarray = ~n),
         yaxis = list(title ="Frequency"))

fig2

Trigram

A table with the frequency of the most common word triples is presented below,

#TRIGRAM
s3 <- s %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
  separate(trigram, c("word1", "word2", "word3"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word,
         !word3 %in% stop_words$word) %>%  filter(!is.na(word1)) %>%
  count(word1, word2, word3, sort = TRUE)

trigrama <- s3 %>%
  unite(trigram, word1, word2, word3,sep = " ")

DT::datatable(trigrama,rownames = FALSE,options = list(
            columnDefs = list(list(className = 'dt-center', targets = "_all"))
            ))

An interactive chart with the above information is presented below,

#TRIGRAM
fig3 <- plot_ly(trigrama[1:30,], x = ~trigram, y = ~n, type = 'bar',
                marker = list(color = 'navy',
                              line = list(color = 'black',
                                          width = 2))) %>%
  layout(barmode = 'stack',
         title= c("Trigram"),
         xaxis = list(title ="Word",
                      categoryorder = "array",
                      categoryarray = ~n),
         yaxis = list(title ="Frequency"))

fig3

Tetragram

A table with the frequency of the most common word quartets is presented below,

#TETRAGRAM
s5 <- s %>%
  unnest_tokens(fourgram, text, token = "ngrams", n = 4) %>%
  separate(fourgram, c("word1", "word2", "word3","word4"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word,
         !word3 %in% stop_words$word,
         !word4 %in% stop_words$word) %>%  filter(!is.na(word1)) %>%
  count(word1, word2, word3,word4, sort = TRUE)

tetragrama <- s5 %>%
  unite(tetragram, word1, word2, word3, word4, sep = " ")

DT::datatable(tetragrama,rownames = FALSE,options = list(
            columnDefs = list(list(className = 'dt-center', targets = "_all"))
            ))

An interactive chart with the above information is presented below,

#TETRAGRAM
fig4 <- plot_ly(tetragrama[1:30,], x = ~tetragram, y = ~n, type = 'bar',
                marker = list(color = 'navy',
                              line = list(color = 'black',
                                          width = 2))) %>%
  layout(barmode = 'stack',
         title= c("Tetragram"),
         xaxis = list(title ="Word",
                      categoryorder = "array",
                      categoryarray = ~n),
         yaxis = list(title ="Frequency"))

fig4

Coverage analysis

In order to know how many words are needed to cover a certain percentage of the data, a coverage analysis was carried out. This analysis was performed on the unigram data: a cumulative distribution was computed from the word frequencies, which tells us how many words are necessary to cover, for example, 50% or 90% of all the data.

#WORD COVERAGE
unigrama$coverage <- cumsum(unigrama$n)/sum(unigrama$n)*100
unigrama$words <- 1:nrow(unigrama)

#PLOT
ggplot(unigrama, aes(x = words, y = coverage)) +
  geom_area(colour = "black", fill = "navy", size = 1, alpha = 0.3) +
  ggtitle("Word Coverage") +
  xlab("Unigrams Added") +
  ylab("% of Coverage") +
  theme(plot.title = element_text(size = 16, face = "bold",
                                  hjust = 0.5, margin = margin(b = 30, unit = "pt"))) +
  theme(axis.title.x = element_text(size = 12, face="bold")) +
  theme(axis.title.y = element_text(size = 12, face="bold")) +
  theme(panel.background = element_blank(), axis.line = element_line(colour = "black")) +
  theme(panel.border = element_rect(colour = "black", fill = NA, size = 0.5)) +
  theme(strip.background = element_rect(fill = alpha("navy", 0.3), color = "black", size = 0.5))

#WORDS NEEDED TO COVER 50% OF THE DATA
min(unigrama[unigrama$coverage > 50, ]$words)
## [1] 681
#WORDS NEEDED TO COVER 90% OF THE DATA
min(unigrama[unigrama$coverage > 90, ]$words)
## [1] 6451

According to these results, 681 words are needed to cover 50% of the unigram data, while 6,451 words are needed to cover 90% of it.
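
The same cumulative-coverage calculation can be written as a small reusable helper; words_for_coverage below is a hypothetical sketch that returns the number of top-frequency unigrams needed to reach any target coverage percentage,

#HYPOTHETICAL HELPER: NUMBER OF UNIGRAMS NEEDED TO REACH A TARGET COVERAGE (%)
words_for_coverage <- function(unigrams, target) {
  coverage <- cumsum(unigrams$n)/sum(unigrams$n)*100
  which(coverage >= target)[1]
}

#SHOULD REPRODUCE THE 50% AND 90% RESULTS REPORTED ABOVE
words_for_coverage(unigrama, 50)
words_for_coverage(unigrama, 90)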

Next steps