This document outlines the data acquisition, preprocessing, exploratory analysis, and initial steps towards developing a predictive model and a Shiny application for text data. The goal is to demonstrate proficiency in handling data and developing predictive algorithms.
# Load the packages used throughout the analysis
library(purrr)        # map(), map_dfr(), map_dbl()
library(dplyr)        # data-manipulation verbs and the pipe
library(knitr)        # kable() for formatted tables
library(tm)           # corpus construction and text cleaning
library(RWeka)        # NGramTokenizer for n-gram tokenization
library(tidytext)     # tidy() method for TermDocumentMatrix objects
library(ggplot2)      # bar plots
library(wordcloud)    # word clouds
library(RColorBrewer) # color palettes
# Check and create the "../data/" directory if it does not exist
if (!dir.exists("../data/")) {
  dir.create("../data")
}
# Download the Coursera-SwiftKey dataset zip file if it does not exist
if (!file.exists("../data/Coursera-SwiftKey.zip")) {
  download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
                "../data/Coursera-SwiftKey.zip")
}
# Unzip the dataset if the "../data/final/" directory does not exist
if (!dir.exists("../data/final/")) {
  unzip("../data/Coursera-SwiftKey.zip", exdir = "../data/", overwrite = FALSE)
}
# Define paths to the individual data files
data_path <- list(
  US_Blog    = "../data/final/en_US/en_US.blogs.txt",
  US_Twitter = "../data/final/en_US/en_US.twitter.txt",
  US_News    = "../data/final/en_US/en_US.news.txt"
)
# Read the data from each path
data <- map(data_path, readLines)
# Compute basic statistics for each source and combine them into one table
basic_summary <- map_dfr(data, function(x) {
  # Number of lines, total character count, and length of the longest line
  list(lines = length(x),
       chars = sum(nchar(x)),
       chars_longest_line = max(nchar(x)))
})
# Calculate the file size in megabytes for each data source
size <- map_dbl(data_path, function(x) {
  file.info(x)$size / (1024 * 1024) # Convert size from bytes to megabytes
})
# Add file size in MB to the summary
basic_summary$size_mb <- size
# Convert the tibble to a plain data frame so row names can be set
basic_summary <- as.data.frame(basic_summary)
# Use the data source names as row names
row.names(basic_summary) <- names(data_path)
# Display the basic summary table using knitr's kable function for better formatting
basic_summary %>% kable()
| source | lines | chars | chars_longest_line | size_mb |
|---|---|---|---|---|
| US_Blog | 899288 | 206824509 | 40833 | 200.4242 |
| US_Twitter | 2360148 | 162122651 | 144 | 159.3641 |
| US_News | 77259 | 15639408 | 5760 | 196.2775 |
# Set a seed for reproducibility
set.seed(1234)
# Sample 3% of the lines from each source and combine into one character vector
sample_data <- map(data, ~ sample(.x, round(length(.x) * 0.03))) %>%
  unlist(use.names = FALSE)
# Create a corpus from the combined sampled data
corpus <- VCorpus(VectorSource(sample_data))
# Load a list of profanity words from an external source
profanity_url <- "https://www.cs.cmu.edu/~biglou/resources/bad-words.txt"
profanity <- readLines(profanity_url)
# Define a content transformer for regex-based substitutions
sub_transformer <- content_transformer(function(x, pattern, replacement = "") {
  gsub(pattern, replacement, x)
})
# Transform the corpus with several cleaning steps
corpus <- corpus %>%
  # Convert text to ASCII, dropping characters that cannot be represented
  tm_map(content_transformer(function(x) iconv(x, "latin1", "ASCII", sub = ""))) %>%
  # Convert all text to lowercase so the later word matching is case-insensitive
  tm_map(content_transformer(tolower)) %>%
  # Remove URLs
  tm_map(sub_transformer, pattern = "http[[:alnum:][:punct:]]*") %>%
  # Remove profanity words, using the list loaded above
  tm_map(removeWords, profanity) %>%
  # Remove all punctuation
  tm_map(sub_transformer, pattern = "[[:punct:]]+") %>%
  # Remove all digits
  tm_map(sub_transformer, pattern = "[[:digit:]]+") %>%
  # Collapse repeated whitespace into single spaces
  tm_map(sub_transformer, pattern = "\\s+", replacement = " ")
# Function to build an n-gram TermDocumentMatrix and tidy it into a frequency table
create_ngram_dtm <- function(n) {
  # Tokenize with RWeka's NGramTokenizer restricted to n-grams of length n
  control_list <- list(tokenize = function(words) NGramTokenizer(words, Weka_control(min = n, max = n)))
  # Build a TermDocumentMatrix for the corpus with the n-gram tokenizer
  tdm <- TermDocumentMatrix(corpus, control = control_list)
  # Convert to tidy format, sum term frequencies across documents, and sort
  tidy(tdm) %>%
    group_by(term) %>%
    summarize(freq = sum(count)) %>%
    arrange(desc(freq))
}
# Generate and store DTMs for unigram, bigram, trigram, and fourgram
ngram_types <- list(unigram = 1, bigram = 2, trigram = 3, fourgram = 4)
tdm <- map(ngram_types, create_ngram_dtm)
# Bar plot of the top 20 unigrams by frequency
unigram_top20 <- tdm$unigram %>%
  top_n(20, freq) %>%
  arrange(desc(freq))
ggplot(unigram_top20, aes(x = reorder(term, freq), y = freq)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() + # Flip coordinates for better readability of terms
  labs(title = "Top 20 1-Grams by Frequency", x = "1-Gram", y = "Frequency") +
  theme_minimal() + # Use a minimal theme for a cleaner look
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotate the frequency-axis labels
# Generate the wordcloud for unigrams
wordcloud(
  words = unigram_top20$term,
  freq = unigram_top20$freq,
  min.freq = 1,
  max.words = 200,
  random.order = FALSE,
  rot.per = 0.35,
  colors = brewer.pal(8, "Dark2")
)
# Extract the top 20 bigrams based on frequency
top_bigrams <- tdm$bigram %>%
  top_n(20, freq) %>%
  arrange(desc(freq))
# Generate the bar plot for 2-Grams
ggplot(top_bigrams, aes(x = reorder(term, freq), y = freq)) +
  geom_bar(stat = "identity", fill = "coral") +
  coord_flip() + # Horizontal bars for better readability
  labs(title = "Top 20 2-Grams by Frequency", x = "2-Gram", y = "Frequency") +
  theme_minimal() + # Use a minimal theme for a cleaner look
  theme(axis.text.y = element_text(angle = 0, hjust = 1)) # Keep the term labels horizontal
# Generate the wordcloud for bigrams
wordcloud(
  words = top_bigrams$term,
  freq = top_bigrams$freq,
  min.freq = 1,
  max.words = 200,
  random.order = FALSE,
  rot.per = 0.35,
  colors = brewer.pal(8, "Dark2"),
  scale = c(4, 0.5) # Adjust the scale for better visual distinction
)
# Prepare the data: extract the top 20 trigrams based on frequency
top_trigrams <- tdm$trigram %>%
  arrange(desc(freq)) %>%
  slice(1:20)
# Create the bar plot with ggplot2
ggplot(top_trigrams, aes(x = reorder(term, freq), y = freq)) +
  geom_bar(stat = "identity", fill = "cornflowerblue") +
  coord_flip() + # Flip the coordinates for better label readability
  labs(title = "Top 20 3-Grams", x = "3-Gram", y = "Frequency") +
  theme_minimal() + # Use a minimal theme for a cleaner look
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotate the frequency-axis labels
# Generate the wordcloud for trigrams
wordcloud(
  words = top_trigrams$term,
  freq = top_trigrams$freq,
  min.freq = 1,           # Minimum frequency threshold for words to be included
  max.words = 300,        # Maximum number of words to be displayed
  random.order = FALSE,   # Display higher-frequency words more centrally
  rot.per = 0.30,         # Proportion of words rotated 90 degrees
  colors = brewer.pal(8, "Dark2"), # Color palette
  scale = c(4, 0.8)       # Scale for word sizes
)
# Prepare the data: select the top 20 4-grams based on frequency
top_fourgrams <- tdm$fourgram %>%
  top_n(20, freq) %>%
  arrange(desc(freq))
# Generate the bar plot
ggplot(top_fourgrams, aes(x = reorder(term, freq), y = freq)) +
  geom_bar(stat = "identity", fill = "turquoise3") +
  coord_flip() + # Flip the plot for horizontal bars and easier term readability
  labs(title = "Top 20 4-Grams by Frequency", x = "4-Gram", y = "Frequency") +
  theme_light() + # Use a light theme for a clean look
  theme(axis.title.y = element_blank(),        # Remove the term-axis title for a cleaner look
        axis.text.y = element_text(size = 12)) # Adjust text size for readability
# Define parameters for the wordcloud to enhance readability
color_palette <- brewer.pal(8, "Dark2")
rotation_proportion <- 0.30   # Proportion of words rotated 90 degrees
max_words_displayed <- 300
# Create the wordcloud for 4-grams
wordcloud(
  words = top_fourgrams$term,
  freq = top_fourgrams$freq,
  min.freq = 1,
  max.words = max_words_displayed,
  random.order = FALSE,
  rot.per = rotation_proportion,
  colors = color_palette,
  scale = c(3, 0.5) # Adjust scale for better visual distinction among words
)
The most frequent 1-grams were dominated by stop words. These were intentionally retained, since a next-word predictor must be able to suggest such common words and removing them would distort the n-gram frequencies used later.
The sequence ‘of the’ stood out as the most recurring 2-gram, suggesting its common usage in the dataset’s context.
‘Thanks for the’ emerged as the leading 3-gram, indicating a pattern of gratitude expressions within the dataset.
The 4-gram ‘thanks for the follow’ was most prominent, further underscoring themes of acknowledgment and social interaction. One limitation: some frequent 4-grams could not be displayed in the word cloud because the longer phrases did not fit within the plotting area.
With insights from the n-gram analysis, we plan to construct a predictive model that draws on the full dataset rather than only the 3% sample explored here. Building the model will involve batch processing to manage the corpus's substantial size.
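As a rough sketch of the batch-processing idea (not the final implementation), the helper below counts bigrams chunk by chunk and then merges the partial counts, so the full corpus never has to be tokenized in a single pass. The function names, the chunk size, and the simple whitespace tokenizer are placeholder assumptions.
library(purrr)
library(dplyr)
# Count bigrams in one chunk of cleaned text lines (simple whitespace tokenizer)
count_bigrams_chunk <- function(lines) {
  tokens <- strsplit(tolower(lines), "\\s+")
  map_dfr(tokens, function(w) {
    if (length(w) < 2) return(NULL)
    tibble(term = paste(head(w, -1), tail(w, -1)))
  }) %>%
    count(term, name = "freq")
}
# Process the text in batches and merge the partial counts
count_bigrams_batched <- function(lines, chunk_size = 50000) {
  chunks <- split(lines, ceiling(seq_along(lines) / chunk_size))
  map_dfr(chunks, count_bigrams_chunk) %>%
    group_by(term) %>%
    summarize(freq = sum(freq), .groups = "drop") %>%
    arrange(desc(freq))
}
# Example usage on the sampled data: bigram_counts <- count_bigrams_batched(sample_data)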
We aim to explore and assess various modeling techniques to refine the predictive model’s performance, ensuring it delivers reliable and actionable predictions.
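One candidate technique we expect to evaluate is a simple back-off scheme over the n-gram frequency tables built above. The sketch below is only an illustration under assumptions, not the final model: it assumes the tidy tables in tdm with space-separated term strings, and the helper names split_ngrams and predict_next_word are hypothetical.
library(dplyr)
library(stringr)
# Split an n-gram frequency table into (prefix, next word) pairs
split_ngrams <- function(ngram_tbl) {
  ngram_tbl %>%
    mutate(prefix    = str_replace(term, "\\s+\\S+$", ""),
           next_word = str_extract(term, "\\S+$"))
}
bigram_tbl  <- split_ngrams(tdm$bigram)
trigram_tbl <- split_ngrams(tdm$trigram)
# Simple back-off: try the trigram table first, then fall back to bigrams,
# then to the most frequent unigrams overall
predict_next_word <- function(input, n_suggestions = 3) {
  words <- str_split(str_to_lower(str_trim(input)), "\\s+")[[1]]
  lookup <- function(tbl, p) {
    tbl %>% filter(prefix == p) %>% arrange(desc(freq)) %>% pull(next_word)
  }
  candidates <- character(0)
  if (length(words) >= 2)
    candidates <- lookup(trigram_tbl, paste(tail(words, 2), collapse = " "))
  if (length(candidates) < n_suggestions && length(words) >= 1)
    candidates <- c(candidates, lookup(bigram_tbl, tail(words, 1)))
  candidates <- c(candidates, head(tdm$unigram$term, n_suggestions))
  head(unique(candidates), n_suggestions)
}
# Example: predict_next_word("thanks for the")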
The final step involves developing a Shiny application that leverages the predictive model to offer word predictions based on user input. This application aims to demonstrate the practical application of our findings, making predictive text analysis accessible and interactive.
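A minimal sketch of what the planned Shiny interface might look like, assuming a predict_next_word() function such as the hypothetical one sketched above is available; the layout and labels are placeholders.
library(shiny)
# Basic UI: a text box for the phrase and a table of suggested next words
ui <- fluidPage(
  titlePanel("Next-Word Prediction"),
  textInput("phrase", "Type a phrase:", value = ""),
  tableOutput("suggestions")
)
# Server: recompute suggestions whenever the input phrase changes
server <- function(input, output, session) {
  output$suggestions <- renderTable({
    req(nzchar(input$phrase))
    data.frame(`Suggested next word` = predict_next_word(input$phrase),
               check.names = FALSE)
  })
}
# Launch the app with: shinyApp(ui, server)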