The code below creates a summary table by auditing the raw text files. After setting up an empty data frame to hold the findings, it iterates over each file in the data folder with a for-loop. Within the loop, it opens a file connection, uses the file.info() function to determine the file size in megabytes, and reads the file's lines (skipping embedded nulls and suppressing warnings). Finally, it counts the lines, combines that count with the file name and size, and appends the result to the summary table.
# Libraries used throughout this report
library(knitr)               # kable()
library(quanteda)            # corpus(), tokens(), dfm()
library(quanteda.textstats)  # textstat_frequency() (separate package since quanteda 3)
library(ggplot2)             # plotting
library(magrittr)            # %>% pipe

# Replace the path below with your specific folder path if different
data_path <- "E:/DESKTOP/data science capstone/Coursera-SwiftKey/final/en_US/"
files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")

# Calculate per-file statistics
summary_results <- data.frame(File = character(), Lines = numeric(), Size_MB = numeric())
for (f in files) {
  full_path <- paste0(data_path, f)
  f_size <- file.info(full_path)$size / (1024^2)

  con <- file(full_path, "r")
  # Read every line so the count in Table 1 is exact (skip nulls, suppress warnings)
  data_lines <- readLines(con, skipNul = TRUE, warn = FALSE)
  close(con)

  summary_results <- rbind(summary_results, data.frame(
    File = f,
    Lines = length(data_lines),
    Size_MB = round(f_size, 2)
  ))
}
kable(summary_results, col.names = c("Source File", "Line Count", "Size (MB)"),
      caption = "Table 1: Overview of Training Data")
| Source File | Line Count | Size (MB) |
|---|---|---|
| en_US.blogs.txt | 899288 | 200.42 |
| en_US.news.txt | 1010206 | 196.28 |
| en_US.twitter.txt | 2360148 | 159.36 |
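The line counts above come from reading each file in full, which works but briefly holds hundreds of megabytes in memory. As a lighter alternative (a minimal sketch, not the approach used for Table 1; the helper name and the 50,000-line chunk size are illustrative choices), the count can be built up chunk by chunk so a full file never sits in memory at once:
# Sketch: count lines in fixed-size chunks to limit memory use
count_lines_chunked <- function(path, chunk_size = 50000) {
  con <- file(path, "r")
  on.exit(close(con))
  total <- 0
  repeat {
    chunk <- readLines(con, n = chunk_size, skipNul = TRUE, warn = FALSE)
    if (length(chunk) == 0) break
    total <- total + length(chunk)
  }
  total
}
# Example: count_lines_chunked(paste0(data_path, "en_US.twitter.txt"))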
The code segment below explores word frequencies by turning the unstructured text into a bar chart. To keep the analysis representative yet computationally manageable, a 10% random sample of the loaded lines is taken first. The text is then tokenized: punctuation and digits are removed and every word is converted to lowercase for consistency. The script builds a Document-Feature Matrix (DFM) to count the frequency of each unique word and extracts the 15 most frequent terms. Finally, ggplot2 is used to draw a horizontal bar chart that clearly illustrates the core vocabulary of the dataset, which is essential for identifying the linguistic patterns the prediction model must prioritize.
The following chart highlights the most common words found in our sampled dataset. This visualization helps identify which words will have the highest predictive weight in our model.
set.seed(123)
# Take a 10% random subset of the loaded lines (data_lines holds the last file read)
sample_text <- sample(data_lines, floor(length(data_lines) * 0.1))

# Tokenization: clean the text and convert to lowercase
tokens_obj <- tokens(corpus(sample_text), remove_punct = TRUE, remove_numbers = TRUE) %>%
  tokens_tolower()
# Create a Document-Feature Matrix to count occurrences
dfm_obj <- dfm(tokens_obj)
word_freq <- textstat_frequency(dfm_obj, n = 15)
# Generating the Bar Chart
ggplot(word_freq, aes(x = reorder(feature, frequency), y = frequency)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  labs(title = "Top 15 Most Common Words",
       x = "Words",
       y = "Frequency") +
  theme_minimal()
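As a quick follow-up (an illustrative addition, not part of the original charts), it is easy to quantify how dominant these frequent words are by computing the share of all token occurrences in the sample that the top 15 account for:
# What share of all tokens in the sample do the top 15 words cover?
all_freq <- textstat_frequency(dfm_obj)
top15_share <- sum(head(all_freq$frequency, 15)) / sum(all_freq$frequency)
round(100 * top15_share, 1)  # percentage of token occurrences covered by the top 15 words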
By analyzing word pairs, we can identify common phrases such as “of the” or “in a,” which form the foundation of our prediction engine.
# Create 2-word combinations (Bigrams)
tokens_2 <- tokens_ngrams(tokens_obj, n = 2)
dfm_2 <- dfm(tokens_2)
bigram_freq <- textstat_frequency(dfm_2, n = 15)
ggplot(bigram_freq, aes(x = reorder(feature, frequency), y = frequency)) +
  geom_bar(stat = "identity", fill = "darkred") +
  coord_flip() +
  labs(title = "Top 15 Most Common Word Pairs", x = "Bigrams", y = "Frequency") +
  theme_minimal()
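The same tokens_ngrams() call extends directly to longer sequences. The sketch below (not part of the original figures) counts three-word combinations, which the prediction model can draw on in the same way as bigrams:
# Sketch: 3-word combinations (trigrams) using the same pipeline
tokens_3 <- tokens_ngrams(tokens_obj, n = 3)
dfm_3 <- dfm(tokens_3)
trigram_freq <- textstat_frequency(dfm_3, n = 15)
head(trigram_freq[, c("feature", "frequency")])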
To prepare the data for analysis, I created a custom function to clean the text. It removes punctuation, numbers, symbols, URLs, and words from a custom profanity list so the suggestions remain professional.
# 1. SETUP AND LIBRARIES
library(quanteda)
library(stringi)
library(wordcloud)
library(RColorBrewer)
tokenize_file <- function(file_path, n_lines = 10000) {
  # 1. Read the file
  con <- file(file_path, "r", encoding = "UTF-8")
  raw_data <- readLines(con, n_lines, skipNul = TRUE, warn = FALSE)
  close(con)

  # 2. Create a quanteda corpus
  q_corp <- corpus(raw_data)

  # 3. Tokenization & Cleaning
  tokens_obj <- tokens(q_corp,
                       remove_punct = TRUE,
                       remove_numbers = TRUE,
                       remove_symbols = TRUE,
                       remove_url = TRUE)

  # 4. Normalization (Lowercasing)
  tokens_obj <- tokens_tolower(tokens_obj)

  # 5. Profanity Filtering
  bad_words <- c("damn", "hell", "crap")
  tokens_obj <- tokens_remove(tokens_obj, pattern = bad_words)

  return(tokens_obj)
}
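As a quick sanity check (illustrative only; it writes two made-up lines to a temporary file rather than touching the capstone data), the function can be run on a tiny input to confirm that punctuation, numbers, URLs, and the listed words are stripped and everything is lowercased:
# Illustrative check of tokenize_file() on a small temporary file
demo_file <- tempfile(fileext = ".txt")
writeLines(c("Damn, that took 3 hours!",
             "Check http://example.com for the details."),
           demo_file)
tokenize_file(demo_file, n_lines = 2)
# Expected output: lowercase tokens only, with the punctuation, the number,
# the URL, and "damn" removed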
A word cloud (also known as a tag cloud) is a visual representation of text data where the importance of each word is shown by its size and color. It is one of the most popular tools in exploratory data analysis for quickly identifying the “vibe” or primary themes of a large dataset.
# Process the blogs file using the custom tokenize_file() function
blog_tokens <- tokenize_file("E:/DESKTOP/data science capstone/Coursera-SwiftKey/final/en_US/en_US.blogs.txt", n_lines = 10000)
# 1. Create a Document-Feature Matrix (quanteda's version of TDM)
dfm_blog <- dfm(blog_tokens)
# 2. Get word frequencies
word_freqs <- textstat_frequency(dfm_blog)
df <- data.frame(word = word_freqs$feature, freq = word_freqs$frequency)
set.seed(1234)
wordcloud(words = df$word, freq = df$freq, min.freq = 5,
          max.words = 100, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))