Introduction

Welcome to the Milestone Report for the Johns Hopkins Data Science Specialization. We appreciate you taking the time to review our work and provide feedback on our progress, as well as your thoughts on fitting a prediction model. Any insight you would like to share on the final plan for our Shiny application is also welcome.

Initial Data Load

We started with three original text files, provided by SwiftKey, drawn from blogs, news articles, and Twitter posts on the Internet. Initially, we had planned to analyze the entire dataset; however, that plan was abandoned when our first attempt at generating counts of each word took over sixteen hours to complete. Naturally, this was a non-starter.

We then turned our full attention to the total size of the original source files to get a handle on what we should do with them:

##                             File Line_Count Word_Count
## 1   SwiftKeyData/en_US.blogs.txt     899288   37334131
## 2    SwiftKeyData/en_US.news.txt      77259    2643969
## 3 SwiftKeyData/en_US.twitter.txt    2360148   30373543

As you can see, the files are quite large and contain a vast amount of data. Their size, however, precludes reasonable performance when running our analysis against them, so we needed to reduce the amount of data being analyzed. To do this, we opted to take a random sample of fifty thousand lines from each of the original files (blogs, news, and Twitter).
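A simplified version of that sampling step is sketched below; the line-by-line function we actually used (sample_lines2) appears in the appendix. For brevity, this sketch reads the whole file into memory, an assumption that may not suit the largest files.

# Simplified sketch: draw n random lines from a file (full version in the appendix)
sample_lines_sketch <- function(path, n) {
  all_lines <- readLines(path, warn = FALSE)            # read every line once
  sample(all_lines, size = min(n, length(all_lines)))   # sample lines without replacement
}

# blogs_sample <- sample_lines_sketch("SwiftKeyData/en_US.blogs.txt", 50000)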

Using Stopwords

At this point we were ready to do some analysis, but we first had to make a critical decision: should we include stopwords? Traditionally, stopwords such as "an", "the", and "or" are removed before analysis. However, our goal is to create a model that predicts the next word(s) in a sequence. Stopwords are an integral part of sentence structure and, therefore, should probably be left in. For the time being, we decided to run our preliminary analysis both with and without stopwords for comparison.
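For reference, removing stopwords with quanteda is a one-line operation. The snippet below is a tiny illustration, not part of our pipeline; the actual calls we used are in the appendix.

library(quanteda)

toks <- tokens("the cat sat on the mat and looked at the dog")
tokens_remove(toks, pattern = stopwords("english"))   # drops "the", "on", "and", "at"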

Data Sample Files Validation

To validate our sample files we used bootstrapping, a statistical resampling method, to understand the distribution of unique words in our three types of samples: blog posts, news articles, and Twitter messages. Bootstrapping lets us draw repeated samples, with replacement, from our original data. By doing this 5,000 times for each sample file, we can estimate the average (mean) and variability (standard deviation) of the number of unique words per sample.
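The sketch below shows the idea on a toy example, using base R string splitting in place of our full quanteda tokenization (which appears in the appendix); the counts it produces are purely illustrative.

set.seed(1337)

# Toy bootstrap of unique-word counts (our real code resamples whole documents)
words <- unlist(strsplit(c("the cat sat on the mat", "the dog sat on the rug"), "\\s+"))
n_boot <- 100                                              # the report uses 5000 per file
unique_counts <- replicate(n_boot, {
  boot <- sample(words, size = length(words), replace = TRUE)  # resample with replacement
  length(unique(boot))                                     # unique words in this resample
})
c(Mean = mean(unique_counts), `Standard Deviation` = sd(unique_counts))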

## Summary Statistics without Stopwords
##            Mean Standard Deviation
## Blogs   19.5262          19.374715
## News    18.4270          11.611582
## Twitter  6.9256           3.789266
## Summary Statistics with Stopwords
##            Mean Standard Deviation
## Blogs   34.3806          30.476757
## News    31.7164          17.326350
## Twitter 13.9506           6.850485

These results suggest that blog and news data tend to have a higher average number of unique words and more variability compared to Twitter, which shows greater consistency, possibly due to its character limit. Remember, the validity of these bootstrap samples hinges on the original samples being representative of all blog posts, news articles, and tweets. Bias in the original sample will also be reflected in the bootstrap samples.

The graph below, a set of histograms, gives a snapshot of the variety of unique words in our sample files from blogs, news, and Twitter. It helps us visualize and understand the patterns and variability in our data. Each bar in a histogram represents the frequency of a specific range of unique word counts from our bootstrap samples. By looking at these bars, we can see the most common outcomes (where the bars are highest), how spread out the outcomes are, and whether the spread is even or skews to one side.

The shape of these histograms can tell us a lot. For example, the histograms for blogs and news look different from the one for Twitter, which suggests different patterns of word use: blogs and news follow similar patterns, while Twitter does not. This could be because of the nature of the content, or even Twitter's character limit. These histograms not only help validate our samples but also offer insight into the differences in language use across these platforms.

Combining and Analyzing the Sample Files

Frequency Histograms

Having validated our samples individually, we combined them into one file to simplify fitting our model. Our exploration begins with understanding our data, focusing on the frequency of words. Two histograms give us a bird's-eye view of this frequency distribution.

The first visual filters out stopwords, allowing the content-specific words to take the spotlight. This histogram then represents the frequency of these words, unveiling the most common themes or topics in our data.

The second visual dives into the data in its raw form, including stopwords. By doing this, we’re keeping a record of the rhythm of language usage, but it may not give us significant insights into the specific content.

This approach to studying word frequency, with and without stopwords, is a cornerstone of text analysis and allows us to build a deeper understanding of the data.
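The underlying frequency step is straightforward with quanteda.textstats; the sketch below assumes the tokens_dfm_nostop object built in the appendix and only shows the top few rows (the full plotting code is also in the appendix).

library(quanteda.textstats)

# Per-word frequencies, most frequent first (assumes tokens_dfm_nostop from the appendix)
freq_nostop <- textstat_frequency(tokens_dfm_nostop)
head(freq_nostop, 5)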

Multi-Word Analysis

Feature Co-Occurrence Matrix

Next, we created a feature co-occurrence matrix (FCM), which is a fundamental concept in text analysis. The FCM records how often different words (or “features”) appear together within a specified context window. In this case, each row and column represents a unique word in the data, and the number in each cell indicates how many times the corresponding words co-occur.
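As a toy illustration of how such a matrix is built (our actual matrix, shown further below, comes from the full combined sample), quanteda's fcm() can count co-occurrences either within whole documents or within a sliding window:

library(quanteda)

toy <- tokens(c("people like you", "people like things like you"))
fcm(toy, context = "window", window = 2)   # co-occurrences within a 2-word window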

For the FCM shown below, we used only the raw data (including stopwords) to get a clearer picture of the relationships. By including all words, we can better discern where the key relationships exist, regardless of word type.

For example, take a look at the word “People” on the top left of the matrix. The numbers in the same row tell you how many times “People” appears with each of the other words. Similarly, if we go down the column for “People”, we can see how many times each word appears with “People”.

Unfortunately, an FCM can be particularly large if the dataset contains many unique words, which ours does. Our matrix is 125,314 rows by 125,314 columns, so only a portion of it is shown here.

## Feature co-occurrence matrix of: 125,314 by 125,314 features.
##               features
## features       People like   you   are unknowing transformers    of things
##   People           13   85   164   235         1            1   533     31
##   like              0 1472  5238  3263         1            1 13412    851
##   you               0    0 17088 11424         1            1 26948   1740
##   are               0    0     0  6396         1            1 31329   1587
##   unknowing         0    0     0     0         0            1     1      1
##   transformers      0    0     0     0         0            0     1      1
##   of                0    0     0     0         0            0 85289   5031
##   things            0    0     0     0         0            0     0    380
##   protected         0    0     0     0         0            0     0      0
##   by                0    0     0     0         0            0     0      0
##               features
## features       protected    by
##   People               1    79
##   like                10  1571
##   you                  4  3417
##   are                 36  4393
##   unknowing            1     2
##   transformers         1     2
##   of                 113 23308
##   things               4   531
##   protected            0    37
##   by                   0  2899
## [ reached max_feat ... 125,304 more features, reached max_nfeat ... 125,304 more features ]

N-Gram Versus Collocation Analysis

N-gram analysis and collocation analysis are two ways we look at patterns in text, but they each do it a bit differently:

In N-gram analysis, we look at word clusters or sequences. Think of an N-gram as a chain of words of length N. For example, in “I love dogs”, “I love” and “love dogs” are pairs, or bi-grams. Here, we’re mostly interested in how often these word chains show up in the text. We aren’t really looking at the meaning between the words.
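A quick illustration of that same example in quanteda (tokens_ngrams joins the words of each n-gram with an underscore):

library(quanteda)

tokens_ngrams(tokens("I love dogs"), n = 2)   # bi-grams: "I_love", "love_dogs"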

Collocation analysis, however, is all about words that are found together more often than you’d expect by random chance. It’s not just about counting word sequences, but about the meaningful relationship between them. For example, in English, we tend to say “make a decision” not “do a decision”. Even though “make a” and “do a” could both be bi-grams, “make a decision” is a collocation because “make” and “decision” go together.
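The snippet below is a small, self-contained illustration of scoring collocations with quanteda.textstats; the tiny invented corpus exists only so that "make a decision" repeats often enough to be scored.

library(quanteda)
library(quanteda.textstats)

toy <- tokens(c("we need to make a decision",
                "they will make a decision soon",
                "please make a decision today"))
textstat_collocations(toy, size = 2)   # scores pairs such as "make a" by lambda and z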

So, while N-gram analysis is like counting word clusters, collocation analysis goes a step further, looking at the meaning and context of words that often come together. Next, we will look at both these analyses on our combined data sample.

N-Gram Analysis

N-gram analysis serves as a powerful method in the field of text analytics, revealing patterns in data that single-word analysis might miss. It investigates combinations of words, from pairs (bi-grams) to triplets (tri-grams) and quadruplets (quad-grams). This approach helps to identify frequently occurring phrases, which can be insightful for understanding prevalent themes or topics in the data. The focus is not only on isolated words but their interplay, providing a more context-rich analysis.

The procedure begins by finding frequently occurring n-grams in the text. This step helps to distill a vast amount of textual information into manageable and insightful data. The identified n-grams are then organized into a structured format, such as data frames, allowing for a more seamless analysis. This structured data can facilitate the comparison of different n-gram frequencies, providing a quantitative perspective to the qualitative data. Visualizing this data further aids in highlighting the significant n-grams in the entire text. Overall, n-gram analysis is a potent tool in text analytics, contributing to a more nuanced understanding of the data.

Collocation Analysis

As with n-gram analysis, collocation analysis looks at the frequency of word pairs, triplets, and quartets in a text. For example, "in the" or "will be" are pairs of words, "more and more" is a triplet, and "the way to the" is a quartet. The numbers in the output ('count') indicate how often these word groups appear in the text.

The values ‘lambda’ and ‘z’ provide additional information about these word groups. ‘Lambda’ measures the strength of the word groupings (higher is better), comparing how often the words come together against their individual appearances. The ‘z’ score helps identify word combinations that occur more often than expected by chance (higher is better). These measures can aid in predicting what word might follow a given series of words, crucial for language modeling.
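For example, one possible way to use these scores when building the model would be to keep only pairs that are both frequent and strongly associated. The sketch below assumes the collocations2 object created in the appendix, and the thresholds are illustrative rather than tuned values.

# Keep only pairs that are both frequent and strongly associated (illustrative thresholds)
strong_pairs <- subset(collocations2, count >= 100 & z > 50)
head(strong_pairs[order(-strong_pairs$z), c("collocation", "count", "lambda", "z")], 10)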

Understanding these patterns in language use is essential when building a model to predict the next word in a sequence. These patterns become the basis for making predictions. The more accurately these patterns are quantified, the better the model can anticipate the next word. This results in improved predictive models that enhance performance in tasks such as text completion, text generation, and machine translation.

## Collocations without Stopwords
##        collocation count count_nested length   lambda         z
## 1      high school   675            0      2 6.191549 119.62264
## 2        last year   902            0      2 4.735729 118.79034
## 3        right now   835            0      2 4.860303 116.07002
## 4        years ago   644            0      2 6.186051 112.27121
## 5        last week   600            0      2 4.728781  98.98112
## 6    united states   375            0      2 9.018301  92.18599
## 7  looking forward   309            0      2 6.604044  89.15876
## 8        make sure   453            0      2 4.908054  88.52227
## 9      even though   441            0      2 4.916550  88.10266
## 10      last night   452            0      2 4.554220  84.84930
##           collocation count count_nested length   lambda        z
## 1      happy new year    41            0      3 6.125976 9.060882
## 2  first things first     8            0      3 6.031843 8.590919
## 3      keep good work    12            0      3 4.777235 8.126033
## 4     next thing know    15            0      3 4.688300 7.935269
## 5         let us know    72            0      3 2.401390 7.899223
## 6         buy one get     8            0      3 5.404141 7.873783
## 7      take good care     8            0      3 6.353893 7.528182
## 8      think like man     9            0      3 5.122461 7.426551
## 9     one way another    20            0      3 4.471770 7.367000
## 10    meet new people     9            0      3 4.948141 6.868824
##                         collocation count count_nested length   lambda        z
## 1                           ã â ã â     3            0      4 14.64843 3.198725
## 2  CLEAR OPERATION WEMBLEY EXERCISE     2            0      4 16.76869 3.146589
## 3      harry potter deathly hallows     3            0      4 13.44965 3.050286
## 4                           â ã â ã     2            0      4 13.54981 3.039981
## 5         san francisco los angeles     4            0      4 11.78254 3.023562
## 6         los angeles san francisco     2            0      4 12.25015 3.000338
## 7           COMIC BOOKS COMIC BOOKS     2            0      4 14.64843 2.998438
## 8           G1 CERTIFIED WET TSHIRT     2            0      4 14.57147 2.872469
## 9      CERTIFIED WET TSHIRT CONTEST     2            0      4 14.57147 2.872469
## 10           char kway teow hokkien     2            0      4 13.54981 2.854438
## Collocations with Stopwords
##    collocation count count_nested length   lambda        z
## 1       in the 18980            0      2 1.943751 222.3703
## 2       of the 20768            0      2 1.744803 212.7096
## 3      will be  3495            0      2 4.270809 204.8281
## 4        to be  7079            0      2 2.766260 189.0665
## 5    more than  2061            0      2 5.421813 188.1630
## 6     has been  2124            0      2 4.949615 182.8428
## 7       it was  4537            0      2 3.118714 182.6987
## 8       if you  2815            0      2 3.849003 171.1756
## 9    have been  2310            0      2 4.345093 170.5909
## 10      on the  8944            0      2 1.929450 154.0584
##      collocation count count_nested length   lambda        z
## 1     in and out    45            0      3 7.136671 27.65144
## 2      on and on    40            0      3 5.920498 26.48791
## 3  more and more   114            0      3 5.807672 25.26005
## 4      said in a   211            0      3 2.856422 23.78873
## 5      is one of   354            0      3 2.478020 23.64970
## 6    some of the   636            0      3 2.558304 21.24608
## 7  would like to   237            0      3 3.010374 21.16693
## 8   in which the    81            0      3 3.188085 21.07749
## 9   with all the   119            0      3 2.670752 20.80011
## 10   this is the   409            0      3 1.439062 20.57134
##         collocation count count_nested length   lambda        z
## 1    the way to the    31            0      4 3.179375 9.997350
## 2     in the way of    36            0      4 5.867918 8.082124
## 3      when i was a    53            0      4 2.299382 7.553593
## 4      if i have to    16            0      4 2.644105 6.361960
## 5  the one with the    11            0      4 4.863837 6.323802
## 6   the time of the    38            0      4 2.232456 6.286803
## 7      so much as a     8            0      4 4.702546 6.218929
## 8    the most of it     7            0      4 4.375749 6.078675
## 9   and not have to     7            0      4 3.631543 5.158207
## 10   in the form of    67            0      4 9.155133 5.129135

Conclusion: Prediction and Packaging

Given the analysis we have already done, it's likely that we will fit a prediction model using a machine learning approach, such as a neural network or a transformer-based model. These types of models are well suited to language processing because they can learn complex patterns, which helps them predict the next word in a sequence. Our model will learn from the relationships between the n-grams and collocations we've observed; by capturing these patterns, it can predict the next word or words in a given context, ultimately improving its performance in text generation and autocompletion.
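While the final model will likely be more sophisticated, the quad-gram counts we already have can be turned into a simple frequency-lookup baseline, sketched below. Here predict_next_word is a hypothetical helper, dfm4 is the quad-gram document-feature matrix from the appendix, and a real model would add smoothing and backoff to shorter n-grams.

# Illustrative baseline only, not the final model: look up the most frequent
# quad-grams that start with the last three words typed (assumes dfm4 from the appendix)
predict_next_word <- function(last_words, ngram_dfm, top_n = 3) {
  prefix <- paste(tolower(last_words), collapse = "_")
  counts <- colSums(ngram_dfm)                               # total frequency of each n-gram
  matches <- counts[startsWith(names(counts), paste0(prefix, "_"))]
  if (length(matches) == 0) return(character(0))             # no match: a real model backs off
  ranked <- sort(matches, decreasing = TRUE)
  completions <- sub(paste0("^", prefix, "_"), "", names(ranked))
  head(completions, top_n)
}

# predict_next_word(c("thanks", "for", "the"), dfm4)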

To create a user-friendly experience, we plan to develop a Shiny app that allows users to enter a sequence of words and receive predictions. Shiny is a powerful web framework for R that enables building interactive web applications, making it an ideal choice for this project. The app would integrate our prediction model, so users can type a word sequence and receive suggestions for what comes next. This could be useful in various settings, such as helping users compose text more efficiently or offering insight into common language patterns.
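A minimal sketch of what that app might look like is shown below; it assumes the hypothetical predict_next_word helper and dfm4 object described above, and the widget names and layout are placeholders rather than the final design.

library(shiny)

# Minimal sketch of the planned app (placeholder UI; predict_next_word is hypothetical)
ui <- fluidPage(
  titlePanel("Next-Word Prediction"),
  textInput("phrase", "Type a phrase:", value = ""),
  verbatimTextOutput("suggestions")
)

server <- function(input, output) {
  output$suggestions <- renderPrint({
    words <- unlist(strsplit(trimws(input$phrase), "\\s+"))
    if (length(words) < 3) return("Please enter at least three words.")
    predict_next_word(tail(words, 3), dfm4)
  })
}

# shinyApp(ui, server)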

Appendix: All Source Code

# run our setup code to obtain all the objects we will need to produce our data and graphs

##
# Created by Zain Naboulsi for the Johns Hopkins Data Science Specialization Capstone Course 
##

# set a seed in case we use anything that uses random numbers
set.seed(1337)


# Set the names of the packages and libraries you want to install
# Most notably load up all the quanteda packages we will need
required_libraries <- c("quanteda","quanteda.textmodels","quanteda.textstats",
                        "quanteda.textplots", "readtext", "text", "sqldf", 
                        "digest", "ggplot2", "dplyr", "gridExtra", "broom","tidyverse")

for (lib in required_libraries) {
  if (!requireNamespace(lib, quietly = TRUE)) {
    install.packages(lib)
  }
  library(lib, character.only = TRUE)
}


# Get our initial file metrics on the original source files
# see if we have an RDS file for the object we need
if (file.exists("Objects/initialfilemetrics.rds")) {
  
  # if we do, read our object into memory from disk
  initialfilemetrics <- readRDS("Objects/initialfilemetrics.rds")
  
} else {
  # Define our file paths
  file_paths <- c("SwiftKeyData/en_US.blogs.txt", "SwiftKeyData/en_US.news.txt", "SwiftKeyData/en_US.twitter.txt")
  
  # Initialize a data frame to store the results
  initialfilemetrics <- data.frame(
    File = character(),
    Line_Count = numeric(),
    Word_Count = numeric(),
    stringsAsFactors = FALSE
  )
  
  # Process each file in our loop
  for (file_path in file_paths) {
    
    # Initialize counters
    line_count <- 0
    word_count <- 0
    
    # Open a connection to the file
    con <- file(file_path, "r")
    
    # Read the file line by line
    while (TRUE) {
      lines <- readLines(con, n = 1)  # read one line at a time
      
      if (length(lines) == 0) {  # end of file
        break
      }
      
      # Update line and word counts
      line_count <- line_count + 1
      word_count <- word_count + length(strsplit(lines, "\\s")[[1]])
    }
    
    # Close the connection
    close(con)
    
    # Add the results to the data frame for each file
    initialfilemetrics <- rbind(initialfilemetrics, data.frame(
      File = file_path,
      Line_Count = line_count,
      Word_Count = word_count,
      stringsAsFactors = FALSE
    ))
  }
  
  # save our object to disk
  saveRDS(initialfilemetrics, "Objects/initialfilemetrics.rds")
}

##
# Created by Zain Naboulsi for the Johns Hopkins Data Science Specialization Capstone Course 
##

# Check if the digest files are missing
if (!file.exists("Digests/blogs_sample_digest.txt") | 
    !file.exists("Digests/news_sample_digest.txt") | 
    !file.exists("Digests/twitter_sample_digest.txt")) {
  
  # if any digest file is missing then create them all to keep track of 
  # when the data changes
  blogs_sample_digest <- digest(file.info("SampleData/blogs_sample.txt")$mtime)
  write.table(blogs_sample_digest, file = "Digests/blogs_sample_digest.txt", 
              row.names = FALSE, col.names = FALSE, quote = FALSE, append = FALSE)
  
  news_sample_digest <- digest(file.info("SampleData/news_sample.txt")$mtime)
  write.table(news_sample_digest, file = "Digests/news_sample_digest.txt", 
              row.names = FALSE, col.names = FALSE, quote = FALSE, append = FALSE)
  
  twitter_sample_digest <- digest(file.info("SampleData/twitter_sample.txt")$mtime)
  write.table(twitter_sample_digest, file = "Digests/twitter_sample_digest.txt", 
              row.names = FALSE, col.names = FALSE, quote = FALSE, append = FALSE)
  
} else {
  
  # if no files are missing read in the digest information for all three files
  blogs_sample_digest <- readLines("Digests/blogs_sample_digest.txt")
  news_sample_digest <- readLines("Digests/news_sample_digest.txt")
  twitter_sample_digest <- readLines("Digests/twitter_sample_digest.txt")
}


# Define the function for sampling lines to use if we have changes to the data
sample_lines2 <- function(file, n) {
  total_lines <- sum(readLines(file) != "")
  p <- n / total_lines
  lines_sample <- c()
  con <- file(file, "r")
  while(TRUE) {
    line <- readLines(con, n = 1)
    if(length(line) == 0) {
      break
    }
    if(runif(1) < p) {
      lines_sample <- c(lines_sample, line)
    }
  }
  close(con)
  return(lines_sample)
}


# Check if the sample files have changed by looking at the modification times
if (digest(file.info("SampleData/blogs_sample.txt")$mtime) != blogs_sample_digest |
    digest(file.info("SampleData/news_sample.txt")$mtime) != news_sample_digest |
    digest(file.info("SampleData/twitter_sample.txt")$mtime) != twitter_sample_digest) {
  
  # redo the seed to make sure the results are reproducible
  set.seed(1337)
  
  # if there have been changes to the files then we need to sample the 
  # lines from the new files again
  sample_size <- 50000
  blogs_sample <- sample_lines2("SwiftKeyData/en_US.blogs.txt", sample_size)
  news_sample <- sample_lines2("SwiftKeyData/en_US.news.txt", sample_size)
  twitter_sample <- sample_lines2("SwiftKeyData/en_US.twitter.txt", sample_size)
  
  # now let's save our sample files to disk
  write.table(blogs_sample, file = "SampleData/blogs_sample.txt", 
              row.names = FALSE, col.names = FALSE, quote = FALSE, append = FALSE)
  write.table(news_sample, file = "SampleData/news_sample.txt", 
              row.names = FALSE, col.names = FALSE, quote = FALSE, append = FALSE)
  write.table(twitter_sample, file = "SampleData/twitter_sample.txt", 
              row.names = FALSE, col.names = FALSE, quote = FALSE, append = FALSE)
  
} else {
  
  # if none of the sample files have changed just use the already generated
  # sample files
  blogs_sample <- readLines("SampleData/blogs_sample.txt")
  news_sample <- readLines("SampleData/news_sample.txt")
  twitter_sample <- readLines("SampleData/twitter_sample.txt")
}

# after everything is done, combine our three sample files into one
full_data_sample <- c(blogs_sample, news_sample, twitter_sample)

# Check if all of our saved objects exist
if (file.exists("Objects/corp.rds") &
    file.exists("Objects/tokens.rds") &
    file.exists("Objects/tokens_dfm.rds") &
    file.exists("Objects/ngram2.rds") &
    file.exists("Objects/ngram3.rds") &
    file.exists("Objects/ngram4.rds") & 
    file.exists("Objects/collocations2.rds") &
    file.exists("Objects/collocations3.rds") &
    file.exists("Objects/collocations4.rds") &
    file.exists("Objects/tokens_nostop.rds") &
    file.exists("Objects/tokens_dfm_nostop.rds") &
    file.exists("Objects/ngram2_nostop.rds") &
    file.exists("Objects/ngram3_nostop.rds") &
    file.exists("Objects/ngram4_nostop.rds") & 
    file.exists("Objects/collocations2_nostop.rds") &
    file.exists("Objects/collocations3_nostop.rds") &
    file.exists("Objects/collocations4_nostop.rds") 
    
    
    ) {
  
  # if all of our saved objects are there then load them into memory
  corp <- readRDS("Objects/corp.rds")
  tokens <- readRDS("Objects/tokens.rds")
  tokens_dfm <- readRDS("Objects/tokens_dfm.rds")
  ngram2 <- readRDS("Objects/ngram2.rds")
  ngram3 <- readRDS("Objects/ngram3.rds")
  ngram4 <- readRDS("Objects/ngram4.rds")
  collocations2 <- readRDS("Objects/collocations2.rds")
  collocations3 <- readRDS("Objects/collocations3.rds")
  collocations4 <- readRDS("Objects/collocations4.rds")
  tokens_nostop <- readRDS("Objects/tokens_nostop.rds")
  tokens_dfm_nostop <- readRDS("Objects/tokens_dfm_nostop.rds")
  ngram2_nostop <- readRDS("Objects/ngram2_nostop.rds")
  ngram3_nostop <- readRDS("Objects/ngram3_nostop.rds")
  ngram4_nostop <- readRDS("Objects/ngram4_nostop.rds")
  collocations2_nostop <- readRDS("Objects/collocations2_nostop.rds")
  collocations3_nostop <- readRDS("Objects/collocations3_nostop.rds")
  collocations4_nostop <- readRDS("Objects/collocations4_nostop.rds")
  
} else {
  
  # if any saved objects are missing, then redo all the objects just to make
  # sure we don't miss anything
  
  # Create the corpus and tokens objects
  corp <- corpus(full_data_sample)
  tokens <- tokens(corp)
  
  # drop tokens containing non-alphanumeric characters (punctuation, symbols, etc.)
  tokens <- tokens_remove(tokens, pattern = "[^[:alnum:]]", valuetype = "regex") 
  tokens_dfm <- dfm(tokens)
  
  # generate our frequency distributions for bi-, tri-, and quad-grams
  # with stopwords
  ngram2 <- tokens_ngrams(tokens, n = 2)
  ngram3 <- tokens_ngrams(tokens, n = 3)
  ngram4 <- tokens_ngrams(tokens, n = 4)
  
  
  # create our bi-, tri-, and quad-gram collocation objects for the sample with stopwords
  collocations2 <- textstat_collocations(tokens, size = 2)  # for bi-grams
  collocations3 <- textstat_collocations(tokens, size = 3)  # for tri-grams
  collocations4 <- textstat_collocations(tokens, size = 4)  # for quad-grams
  
  # create tokens and dfm with stopwords removed
  tokens_nostop <- tokens_remove(tokens, pattern = stopwords("english"))
  tokens_dfm_nostop <- dfm(tokens_nostop)
  
  # generate our frequency distributions for bi-, tri-, and quad-grams
  # without stopwords
  ngram2_nostop <- tokens_ngrams(tokens_nostop, n = 2)
  ngram3_nostop <- tokens_ngrams(tokens_nostop, n = 3)
  ngram4_nostop <- tokens_ngrams(tokens_nostop, n = 4)
  
  # create our bi-, tri-, and quad-gram collocation objects for the sample without stopwords
  collocations2_nostop <- textstat_collocations(tokens_nostop, size = 2)  # for bi-grams
  collocations3_nostop <- textstat_collocations(tokens_nostop, size = 3)  # for tri-grams
  collocations4_nostop <- textstat_collocations(tokens_nostop, size = 4)  # for quad-grams
  
  # Save the corpus, tokens, ngrams, and collocations to RDS files
  saveRDS(corp, "Objects/corp.rds")
  saveRDS(tokens, "Objects/tokens.rds")
  saveRDS(tokens_dfm, "Objects/tokens_dfm.rds")
  saveRDS(ngram2, "Objects/ngram2.rds")
  saveRDS(ngram3, "Objects/ngram3.rds")
  saveRDS(ngram4, "Objects/ngram4.rds")
  saveRDS(collocations2, "Objects/collocations2.rds")
  saveRDS(collocations3, "Objects/collocations3.rds")
  saveRDS(collocations4, "Objects/collocations4.rds")
  saveRDS(tokens_nostop, "Objects/tokens_nostop.rds")
  saveRDS(tokens_dfm_nostop, "Objects/tokens_dfm_nostop.rds")
  saveRDS(ngram2_nostop, "Objects/ngram2_nostop.rds")
  saveRDS(ngram3_nostop, "Objects/ngram3_nostop.rds")
  saveRDS(ngram4_nostop, "Objects/ngram4_nostop.rds")
  saveRDS(collocations2_nostop, "Objects/collocations2_nostop.rds")
  saveRDS(collocations3_nostop, "Objects/collocations3_nostop.rds")
  saveRDS(collocations4_nostop, "Objects/collocations4_nostop.rds")
}

# Update the digest files with the new modification times
blogs_sample_digest <- digest(file.info("SampleData/blogs_sample.txt")$mtime)
news_sample_digest <- digest(file.info("SampleData/news_sample.txt")$mtime)
twitter_sample_digest <- digest(file.info("SampleData/twitter_sample.txt")$mtime)
full_data_sample_digest <- digest(file.info("SampleData/full_data_sample.txt")$mtime)

writeLines(blogs_sample_digest, con = "Digests/blogs_sample_digest.txt")
writeLines(news_sample_digest, con = "Digests/news_sample_digest.txt")
writeLines(twitter_sample_digest, con = "Digests/twitter_sample_digest.txt")
writeLines(full_data_sample_digest, con = "Digests/full_data_sample_digest.txt")


##
# Created by Zain Naboulsi for the Johns Hopkins Data Science Specialization Capstone Course 
##



print(initialfilemetrics)

# check and see if our bootstrap object is already saved
if (file.exists("Objects/bootstrap_results_list.rds")) {
  
  # if it is then read our bootstrap object into memory from disk
  bootstrap_results_list <- readRDS("Objects/bootstrap_results_list.rds")
  
} else {
  # Function to process a file and generate bootstrap samples
  bootstrap_unique_words <- function(file_path, n_bootstrap) {
    # Read the file
    text <- readLines(file_path, warn = FALSE)
    
    # Tokenize the text
    tokens <- tokens(text, what = "word", remove_punct = TRUE, remove_symbols = TRUE, remove_numbers = TRUE) %>%
      tokens_tolower() %>%
      tokens_remove(stopwords("english"))
    
    # Generate bootstrap samples and compute the number of unique tokens
    n_unique <- numeric(n_bootstrap)
    for (i in 1:n_bootstrap) {
      bootstrap_sample <- sample(tokens, size = length(tokens), replace = TRUE)
      n_unique[i] <- ntype(bootstrap_sample)
    }
    
    return(n_unique)
  }
  
  
  # Define our file paths
  file_paths <- c("SampleData/blogs_sample.txt", "SampleData/news_sample.txt", "SampleData/twitter_sample.txt")
  
  # Number of bootstrap samples
  n_bootstrap <- 5000
  
  # Perform bootstrap sampling for each file
  bootstrap_results_list <- lapply(file_paths, bootstrap_unique_words, n_bootstrap = n_bootstrap)
  
  # save the bootstrap_results_list object to disk
  saveRDS(bootstrap_results_list, "Objects/bootstrap_results_list.rds")
  
}



# Combine the results into a data frame for plotting
bootstrap_results_df <- data.frame(
  Blogs = bootstrap_results_list[[1]],
  News = bootstrap_results_list[[2]],
  Twitter = bootstrap_results_list[[3]]
)

# Melt the data frame to long format
bootstrap_results_long <- reshape2::melt(bootstrap_results_df)

# Compute the mean and standard deviation of the number of unique words for each source file
summary_stats <- sapply(bootstrap_results_list, function(x) c(mean = mean(x), sd = sd(x)))

# Convert to a data frame for easier viewing
summary_stats_df <- data.frame(t(summary_stats))
names(summary_stats_df) <- c("Mean", "Standard Deviation")
rownames(summary_stats_df) <- c("Blogs", "News", "Twitter")

cat("Summary Statistics without Stopwords", "\n")
print(summary_stats_df)


# check and see if our bootstrap object is already saved
if (file.exists("Objects/bootstrap_results_list_allwords.rds")) {
  
  # if it is then read our bootstrap object into memory from disk
  bootstrap_results_list_allwords <- readRDS("Objects/bootstrap_results_list_allwords.rds")
  
} else {
  # Function to process a file and generate bootstrap samples
  bootstrap_unique_words_allwords <- function(file_path, n_bootstrap) {
    # Read the file
    text <- readLines(file_path, warn = FALSE)
    
    # Tokenize the text
    tokens <- tokens(text, what = "word") %>% tokens_tolower()
    
    # Generate bootstrap samples and compute the number of unique tokens
    n_unique <- numeric(n_bootstrap)
    for (i in 1:n_bootstrap) {
      bootstrap_sample <- sample(tokens, size = length(tokens), replace = TRUE)
      n_unique[i] <- ntype(bootstrap_sample)
    }
    
    return(n_unique)
  }
  
  
  # Define our file paths
  file_paths <- c("SampleData/blogs_sample.txt", "SampleData/news_sample.txt", "SampleData/twitter_sample.txt")
  
  # Number of bootstrap samples
  n_bootstrap <- 5000
  
  # Perform bootstrap sampling for each file
  bootstrap_results_list_allwords <- lapply(file_paths, bootstrap_unique_words_allwords, n_bootstrap = n_bootstrap)
  
  # save the bootstrap_results_list object to disk
  saveRDS(bootstrap_results_list_allwords, "Objects/bootstrap_results_list_allwords.rds")
  
}



# Combine the results into a data frame for plotting
bootstrap_results_df_allwords <- data.frame(
  Blogs = bootstrap_results_list_allwords[[1]],
  News = bootstrap_results_list_allwords[[2]],
  Twitter = bootstrap_results_list_allwords[[3]]
)

# Melt the data frame to long format
bootstrap_results_long_allwords <- reshape2::melt(bootstrap_results_df_allwords)


# Compute the mean and standard deviation of the number of unique words for each source file
summary_stats_allwords <- sapply(bootstrap_results_list_allwords, function(x) c(mean = mean(x), sd = sd(x)))

# Convert to a data frame for easier viewing
summary_stats_df_allwords <- data.frame(t(summary_stats_allwords))
names(summary_stats_df_allwords) <- c("Mean", "Standard Deviation")
rownames(summary_stats_df_allwords) <- c("Blogs", "News", "Twitter")

cat("Summary Statistics with Stopwords", "\n")
print(summary_stats_df_allwords)

ggplot(bootstrap_results_long, aes(x = value, fill = variable)) + 
  geom_histogram(bins = 30, color = "black", alpha = 0.7) +  
  scale_fill_manual(values = c("#F8766D", "#00BA38", "#619CFF")) +  
  facet_wrap(~ variable, scales = "free_x") +
  theme_minimal() +
  theme(
    text = element_text(size = 14),  
    legend.position = "bottom",  
    plot.title = element_text(hjust = 0.5)  
  ) +
  labs(
    title = "Bootstrap Sample Unique Word Counts without Stopwords", 
    y = "Frequency", 
    x = "Number of Unique Words", 
    fill = "Text Source"  
  )


ggplot(bootstrap_results_long_allwords, aes(x = value, fill = variable)) + 
  geom_histogram(bins = 30, color = "black", alpha = 0.7) +  
  scale_fill_manual(values = c("#F8766D", "#00BA38", "#619CFF")) +  
  facet_wrap(~ variable, scales = "free_x") +
  theme_minimal() +
  theme(
    text = element_text(size = 14),  
    legend.position = "bottom",  
    plot.title = element_text(hjust = 0.5)  
  ) +
  labs(
    title = "Bootstrap Sample Unique Word Counts with Stopwords", 
    y = "Frequency", 
    x = "Number of Unique Words", 
    fill = "Text Source"  
  )


# Generate frequency histogram of data without stopwords

# Extract word frequencies
freq_nostop <- textstat_frequency(tokens_dfm_nostop)

# Sort in descending order and keep only the top 30 words
top_words_nostop <- head(freq_nostop, 30)


# Show our histogram
ggplot(top_words_nostop, aes(x = reorder(feature, -frequency), y = frequency)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  labs(title = "Top 30 Word Frequencies without Stopwords",
       x = "Words",
       y = "Frequency") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 20), 
        axis.text.y = element_text(size = 9), 
        axis.text.x = element_text(size = 12),
        axis.title = element_text(size = 14))

# Generate frequency histogram of data with stopwords

# Extract word frequencies
freq <- textstat_frequency(tokens_dfm)

# Sort in descending order and keep only the top 30 words
top_words <- head(freq, 30)

# Show our histogram
ggplot(top_words, aes(x = reorder(feature, -frequency), y = frequency)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  labs(title = "Top 30 Word Frequencies with Stopwords",
       x = "Words",
       y = "Frequency") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 20), 
        axis.text.y = element_text(size = 9), 
        axis.text.x = element_text(size = 12),
        axis.title = element_text(size = 14))

# Compute the feature co-occurrence matrix
fcm_matrix <- fcm(tokens)

# Print the feature co-occurrence matrix
print(fcm_matrix)

# Convert tokens to dfm
dfm2 <- dfm(ngram2)
dfm3 <- dfm(ngram3)
dfm4 <- dfm(ngram4)

top_ngrams2 <- topfeatures(dfm2, n = 10)
top_ngrams3 <- topfeatures(dfm3, n = 10)
top_ngrams4 <- topfeatures(dfm4, n = 10)


# Convert to data frames
df2 <- data.frame(ngram = names(top_ngrams2), freq = unname(top_ngrams2))
df3 <- data.frame(ngram = names(top_ngrams3), freq = unname(top_ngrams3))
df4 <- data.frame(ngram = names(top_ngrams4), freq = unname(top_ngrams4))

# Add a column to identify the n-gram order
df2$order <- "2-gram"
df3$order <- "3-gram"
df4$order <- "4-gram"

# Combine the data frames
df_combined <- bind_rows(df2, df3, df4)

# Plot our n-gram data
ggplot(df_combined, aes(x = reorder(ngram, -freq), y = freq, fill = order)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  facet_wrap(~order, scales = "free") +
  labs(x = "N-gram", y = "Frequency", fill = "Order",
       title = "Top 10 N-grams in Text Data",
       subtitle = "Comparing 2-grams, 3-grams, and 4-grams")


# print the top 10 collocations objects for sample without stopwords
cat("Collocations without Stopwords")
head(collocations2_nostop, 10)
head(collocations3_nostop, 10)
head(collocations4_nostop, 10)

# print the top 10 collocations objects for sample with stopwords
cat("Collocations with Stopwords")
head(collocations2, 10)
head(collocations3, 10)
head(collocations4, 10)