Basic Statistics from the Datasets

The code below builds a summary table by auditing the raw text files. After creating an empty data frame to hold the results, it iterates over each file in the data folder with a for-loop. Within the loop, it uses the file.info function to determine the physical file size in megabytes, opens a file connection, and reads the file's contents with readLines. Finally, it counts the lines that were read, combines that count with the file name and size, and appends the result to the summary table.

# Replace the path below with your specific folder path if different
data_path <- "E:/DESKTOP/data science capstone/Coursera-SwiftKey/final/en_US/"
files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")

# Calculate stats
summary_results <- data.frame(File = character(), Lines = numeric(), Size_MB = numeric())

for (f in files) {
  full_path <- paste0(data_path, f)
  f_size <- file.info(full_path)$size / (1024^2)
  con <- file(full_path, "r")
  # Read all lines so the count is exact; skipNul avoids embedded-null errors
  data_lines <- readLines(con, skipNul = TRUE, warn = FALSE)
  close(con)
  
  summary_results <- rbind(summary_results, data.frame(
    File = f, 
    Lines = length(data_lines), 
    Size_MB = round(f_size, 2)
  ))
}

kable(summary_results, col.names = c("Source File", "Line Count", "Size (MB)"), 
      caption = "Table 1: Overview of Training Data")
Table 1: Overview of Training Data

Source File          Line Count   Size (MB)
en_US.blogs.txt          899288      200.42
en_US.news.txt          1010206      196.28
en_US.twitter.txt       2360148      159.36
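
Line counts alone understate how much the sources differ in verbosity; a rough per-file word count can be obtained with stringi's stri_count_words(). The snippet below is a sketch that re-reads each file in full and assumes the stringi package is available.

# Approximate word counts per file (sketch; re-reads each file in full)
library(stringi)
word_counts <- sapply(files, function(f) {
  con <- file(paste0(data_path, f), "r")
  lines <- readLines(con, skipNul = TRUE, warn = FALSE)
  close(con)
  sum(stri_count_words(lines))
})
word_counts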

Exploratory Analysis

The code below turns unstructured text into a structured view of word frequencies. To keep the analysis representative yet computationally efficient, a 10% random sample of the loaded lines is taken first. The text is then tokenized: punctuation and digits are removed, and every word is lowercased for consistency. The script builds a Document-Feature Matrix (DFM), counts the frequency of each unique word, and extracts the 15 most frequent terms. Finally, ggplot2 renders a horizontal bar chart that clearly illustrates the core vocabulary of the dataset, which is essential for identifying the main linguistic patterns the prediction model must prioritize.

1. Word Frequency Visualization

The following chart highlights the most common words found in our sampled dataset. This visualization helps identify which words will have the highest predictive weight in our model.

set.seed(123)
# Take a 10% subset of the loaded lines (data_lines holds the last file read above, en_US.twitter.txt)
sample_text <- sample(data_lines, floor(length(data_lines) * 0.1))

# Tokenization: Cleaning the text and converting to lowercase
tokens_obj <- tokens(corpus(sample_text), remove_punct = TRUE, remove_numbers = TRUE) %>%
              tokens_tolower()

# Create a Document-Feature Matrix to count occurrences
dfm_obj <- dfm(tokens_obj)
# textstat_frequency() lives in the quanteda.textstats package in quanteda >= 3
word_freq <- textstat_frequency(dfm_obj, n = 15)

# Generating the Bar Chart
ggplot(word_freq, aes(x = reorder(feature, frequency), y = frequency)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  labs(title = "Top 15 Most Common Words", 
       x = "Words", 
       y = "Frequency") +
  theme_minimal()
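
Because high-frequency function words such as "the" and "and" tend to dominate this chart, a variant with English stopwords removed can also be informative. The sketch below assumes quanteda's built-in stopwords("en") list.

# Optional variant: drop English stopwords before counting (sketch)
tokens_nostop <- tokens_remove(tokens_obj, pattern = stopwords("en"))
dfm_nostop <- dfm(tokens_nostop)
textstat_frequency(dfm_nostop, n = 15)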

2. Common Word Pairs (Bigrams)

By analyzing word pairs, we can identify common phrases such as "of the" or "in a", which form the foundation of our prediction engine.

# Create 2-word combinations (Bigrams)
tokens_2 <- tokens_ngrams(tokens_obj, n = 2)
dfm_2 <- dfm(tokens_2)
bigram_freq <- textstat_frequency(dfm_2, n = 15)

ggplot(bigram_freq, aes(x = reorder(feature, frequency), y = frequency)) +
  geom_bar(stat = "identity", fill = "darkred") +
  coord_flip() +
  labs(title = "Top 15 Most Common Word Pairs", x = "Bigrams", y = "Frequency") +
  theme_minimal()
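
The same workflow could be extended to trigrams, giving the prediction model a longer context to draw on. The sketch below simply repeats the bigram steps above with n = 3.

# Create 3-word combinations (Trigrams) -- same steps as the bigrams above
tokens_3 <- tokens_ngrams(tokens_obj, n = 3)
dfm_3 <- dfm(tokens_3)
trigram_freq <- textstat_frequency(dfm_3, n = 15)

ggplot(trigram_freq, aes(x = reorder(feature, frequency), y = frequency)) +
  geom_bar(stat = "identity", fill = "darkgreen") +
  coord_flip() +
  labs(title = "Top 15 Most Common Word Triplets", x = "Trigrams", y = "Frequency") +
  theme_minimal()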

3. Data Processing & Tokenization

To prepare the data for analysis, I created a custom function to clean the text. This involves removing punctuation, numbers, URLs, and a custom list of profanity words to ensure the suggestions remain professional.

# 1. SETUP AND LIBRARIES
library(quanteda)
library(quanteda.textstats)  # textstat_frequency()
library(ggplot2)             # bar charts above
library(knitr)               # kable() table above
library(stringi)
library(wordcloud)
library(RColorBrewer)

tokenize_file <- function(file_path, n_lines = 10000) {
  # 1. Read the file
  con <- file(file_path, "r", encoding = "UTF-8")
  raw_data <- readLines(con, n_lines, skipNul = TRUE, warn = FALSE)
  close(con)
  
  # 2. Create a quanteda corpus
  q_corp <- corpus(raw_data)
  
  # 3. Tokenization & Cleaning
  tokens_obj <- tokens(q_corp, 
                       remove_punct = TRUE, 
                       remove_numbers = TRUE, 
                       remove_symbols = TRUE,
                       remove_url = TRUE)
  
  # 4. Normalization (Lowercasing)
  tokens_obj <- tokens_tolower(tokens_obj)
  
  # 5. Profanity Filtering (placeholder list; a fuller profanity lexicon would be used in practice)
  bad_words <- c("damn", "hell", "crap") 
  tokens_obj <- tokens_remove(tokens_obj, pattern = bad_words)
  
  return(tokens_obj)
}
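
As a usage sketch, the same helper can be applied to the other two source files; the paths below assume the data_path folder defined earlier.

# Hypothetical usage on the remaining sources (same folder as above)
news_tokens    <- tokenize_file(paste0(data_path, "en_US.news.txt"), n_lines = 10000)
twitter_tokens <- tokenize_file(paste0(data_path, "en_US.twitter.txt"), n_lines = 10000)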

4. Word Cloud

A word cloud (also known as a tag cloud) is a visual representation of text data in which the importance of each word is indicated by its size and color. It is one of the most popular exploratory tools for quickly identifying the overall tone and primary themes of a large dataset.

# Process the blogs file using the custom function defined above
blog_tokens <- tokenize_file(paste0(data_path, "en_US.blogs.txt"), n_lines = 10000)


# 1. Create a Document-Feature Matrix (quanteda's version of TDM)
dfm_blog <- dfm(blog_tokens)

# 2. Get word frequencies
word_freqs <- textstat_frequency(dfm_blog)
df <- data.frame(word = word_freqs$feature, freq = word_freqs$frequency)


set.seed(1234) 
wordcloud(words = df$word, freq = df$freq, min.freq = 5,
          max.words = 100, random.order = FALSE, rot.per = 0.35, 
          colors = brewer.pal(8, "Dark2"))