Corpus investigation

Loading and preprocessing

As a first step, we load the libraries needed to manipulate the data and define the paths to the downloaded files.

We will load three English-language files:

Twitter data
News
Blogs

library(stringr)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ purrr     1.0.2
## ✔ forcats   1.0.0     ✔ readr     2.1.5
## ✔ ggplot2   3.5.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

file1 <- "final/en_US/en_US.twitter.txt"
file2 <- "final/en_US/en_US.news.txt"
file3 <- "final/en_US/en_US.blogs.txt"

Next we use a function that summarizes the data for us. Its advantage is that it needs little memory, because it reads the file one line at a time.

corpus_stats <- function(file_path) {
  # Read the file one line at a time so that only a single line is in memory
  con <- file(file_path, "r")
  line <- readLines(con, n = 1)
  maks_line <- 0      # length of the longest line, in characters
  word_count <- 0
  line_count <- 0
  numbers_count <- 0  # runs of consecutive digits
  while (length(line) > 0) {
    line_count <- line_count + 1

    word_count <- word_count + str_count(line, "\\w+")
    numbers_count <- numbers_count + str_count(line, "\\d+")
    if (maks_line < str_length(line)) {
      maks_line <- str_length(line)
    }

    line <- readLines(con, n = 1)
  }
  close(con)
  cat("Lines:", line_count)
  cat("\n")
  cat("Maks line:", maks_line)
  cat("\n")
  cat("Words:", word_count)
  cat("\n")
  cat("Numbers:", numbers_count)
  cat("\n")
}
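Reading one line per call keeps memory use low, but it makes one call to readLines for every line of the file. A minimal sketch of a chunked variant follows. The name corpus_stats_chunked, the helper count_matches, and the chunk_size default are assumptions made for illustration and are not part of the original analysis. The sketch uses only base R and returns the counts as a list instead of printing them.

```r
# Count all matches of a regular expression across a character vector
count_matches <- function(pattern, x) {
  sum(lengths(regmatches(x, gregexpr(pattern, x, perl = TRUE))))
}

# Hypothetical chunked variant of corpus_stats(): reads `chunk_size` lines
# per call, so only one chunk is held in memory at a time.
corpus_stats_chunked <- function(file_path, chunk_size = 10000) {
  con <- file(file_path, "r")
  on.exit(close(con))  # close the connection even if an error occurs

  stats <- list(lines = 0, max_line = 0, words = 0, numbers = 0)

  repeat {
    chunk <- readLines(con, n = chunk_size, warn = FALSE)
    if (length(chunk) == 0) break

    stats$lines    <- stats$lines + length(chunk)
    stats$max_line <- max(stats$max_line, nchar(chunk))
    stats$words    <- stats$words + count_matches("\\w+", chunk)
    stats$numbers  <- stats$numbers + count_matches("\\d+", chunk)
  }

  stats
}

# Small self-contained check on a temporary file with made-up text
tmp <- tempfile(fileext = ".txt")
writeLines(c("I have 2 cats", "and 10 dogs at home"), tmp)
demo_stats <- corpus_stats_chunked(tmp, chunk_size = 1)
str(demo_stats)
```

A larger chunk_size trades some memory for fewer read calls; a suitable value depends on the machine and is not determined here.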

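Then we summarize each file. Note that the Numbers value in the output below equals the line count: this run printed line_count rather than numbers_count, so it should not be read as a count of digit sequences.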

print("Twitter:")

## [1] "Twitter:"

corpus_stats( file1 )

## Lines: 2360148
## Maks line: 140
## Words: 31003501
## Numbers: 2360148

print("News:")

## [1] "News:"

corpus_stats( file2 )

## Lines: 77259
## Maks line: 5760
## Words: 2741594
## Numbers: 77259

print("Blog:")

## [1] "Blog:"

corpus_stats( file3 )

## Lines: 899288
## Maks line: 40833
## Words: 38309620
## Numbers: 899288

Plots

Next we define a function that runs much faster. Its drawback is that it must load the whole text into memory in order to preprocess it. The function returns the count of each word, sorted from most to least frequent.

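corpus_investigator <- function(file_path) {
  # Load the whole file into memory at once
  file_content <- readLines(file_path)

  # Replace every non-alphanumeric character with a space
  special_chars <- "[^[:alnum:]]"
  file_content <- str_replace_all(file_content, special_chars, " ")

  # Split on whitespace, lowercase, and drop the empty tokens that
  # strsplit() produces for lines starting with a separator
  words <- tolower(unlist(strsplit(file_content, split = "\\s+")))
  words <- words[words != ""]

  # Count each word and sort from most to least frequent
  data.frame(words) %>% count(words) %>% arrange(desc(n))
}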
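To make the cleaning steps concrete, here is a small sketch that applies them to three made-up lines instead of a corpus file. The input text and the variable names are assumptions for illustration, and the sketch uses base R (gsub and table) in place of the stringr and dplyr calls. It also shows that a line starting with a separator yields an empty token, which should be dropped before counting.

```r
# Toy input standing in for the lines of a corpus file (made-up text)
toy_lines <- c("The cat sat.", "The dog sat, too!", "...the end")

# Replace non-alphanumeric characters, split on whitespace, lowercase
cleaned <- gsub("[^[:alnum:]]", " ", toy_lines)
words   <- tolower(unlist(strsplit(cleaned, "\\s+")))

# The third line starts with separators, so strsplit() returns "" first
words <- words[words != ""]

# Count each word and sort from most to least frequent
toy_counts <- sort(table(words), decreasing = TRUE)
print(toy_counts)
```

Because every non-alphanumeric character becomes a space, contractions such as "don't" are split into two tokens; whether that is acceptable depends on the later prediction step.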

Next we execute the function on every file.

wc1 <- corpus_investigator(file1)
wc2 <- corpus_investigator(file2)
wc3 <- corpus_investigator(file3)

Results

Twitter data

ggplot(head(wc1, 10), aes(x = reorder(words, -n), y = n)) +
  geom_col() +  # geom_col() plots the raw counts as bar heights
  labs(title = "Top 10 Most Frequent Words",
       x = "Word",
       y = "Frequency") +
  theme_minimal()

News data

ggplot(head(wc2, 10), aes(x = reorder(words, -n), y = n)) +
  geom_col() +  # geom_col() plots the raw counts as bar heights
  labs(title = "Top 10 Most Frequent Words",
       x = "Word",
       y = "Frequency") +
  theme_minimal()

Blogs data

ggplot(head(wc3, 10), aes(x = reorder(words, -n), y = n)) +
  geom_col() +  # geom_col() plots the raw counts as bar heights
  labs(title = "Top 10 Most Frequent Words",
       x = "Word",
       y = "Frequency") +
  theme_minimal()

Conclusion

As we can see, the same words occur most often in all three sources when we look at the top 5, but their counts differ.
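Raw counts are hard to compare across files of different sizes. One possible way to compare them, sketched here under assumptions, is to convert each count into a share of all tokens. The two tables below are made-up stand-ins with the same columns as wc1, wc2 and wc3 (words and n), and the helper add_share is hypothetical.

```r
# Made-up count tables with the same columns as wc1, wc2, wc3 (words, n)
toy_twitter <- data.frame(words = c("the", "to", "i"),    n = c(50, 30, 20))
toy_news    <- data.frame(words = c("the", "to", "said"), n = c(600, 250, 150))

# Convert raw counts to shares of all tokens in the table
add_share <- function(counts) {
  counts$share <- counts$n / sum(counts$n)
  counts
}

# Keep only the words present in both tables and put the shares side by side
comparison <- merge(add_share(toy_twitter), add_share(toy_news),
                    by = "words", suffixes = c("_twitter", "_news"))
print(comparison[, c("words", "share_twitter", "share_news")])
```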

Idea

The idea for word prediction is as follows. We take an N-gram of 1, 2, or 3 words and predict the word that follows it. For the 1-gram case we build a list of lists: the main list has one entry per word, and each entry holds the words that follow that key word together with the number of times each one followed it. To predict, we then randomly choose the next word from the entry of the current word.
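The 1-gram case described above can be sketched in a few lines of base R. This is only an illustration under assumptions: the function names tokenize, build_followers and predict_next are hypothetical, the input is made-up text, and weighting the random choice by the follower counts is one possible reading of "randomly choose".

```r
# Split one line into lowercase alphanumeric tokens
tokenize <- function(line) {
  cleaned <- gsub("[^[:alnum:]]", " ", tolower(line))
  tokens  <- strsplit(cleaned, "\\s+")[[1]]
  tokens[tokens != ""]
}

# Build the "list of lists": one entry per word, holding a table of the
# words that follow it and how often each one does
build_followers <- function(lines) {
  first  <- character(0)
  second <- character(0)
  for (line in lines) {
    tokens <- tokenize(line)
    if (length(tokens) >= 2) {
      first  <- c(first,  head(tokens, -1))  # every word except the last
      second <- c(second, tail(tokens, -1))  # the word that follows it
    }
  }
  # split() groups the followers by preceding word; table() counts them
  lapply(split(second, first), table)
}

# Pick a follower at random, weighted by its count (an assumption);
# returns NA when the word was never seen with a follower
predict_next <- function(followers, word) {
  counts <- followers[[word]]
  if (is.null(counts)) return(NA_character_)
  sample(names(counts), size = 1, prob = as.numeric(counts))
}

# Made-up training text
followers <- build_followers(c("the cat sat", "the cat ran", "the dog sat"))
print(followers[["the"]])
predict_next(followers, "dog")
```

Growing the vectors with c() inside the loop is fine for a toy input but would be slow on the full corpus; extending the key from one word to two or three words follows the same pattern.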