This document outlines the data acquisition, preprocessing, exploratory analysis, and initial steps towards developing a predictive model and a Shiny application for text data. The goal is to demonstrate proficiency in handling data and developing predictive algorithms.
# Load the packages used throughout the analysis
library(purrr)        # map(), map_dfr(), map_dbl()
library(dplyr)        # data-manipulation verbs and the pipe
library(knitr)        # kable() for formatted tables
library(tm)           # corpus construction and text cleaning
library(RWeka)        # NGramTokenizer for n-gram tokenization
library(tidytext)     # tidy() method for TermDocumentMatrix objects
library(ggplot2)      # bar plots
library(wordcloud)    # word clouds
library(RColorBrewer) # color palettes
# Check and create the "../data/" directory if it does not exist
if (!dir.exists("../data/")) {
  dir.create("../data")
}
# Download the Coursera-SwiftKey dataset zip file if it does not exist
if (!file.exists("../data/Coursera-SwiftKey.zip")) {
  download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
                "../data/Coursera-SwiftKey.zip")
}
# Unzip the dataset if the "../data/final/" directory does not exist
if (!dir.exists("../data/final/")) {
  unzip("../data/Coursera-SwiftKey.zip", exdir = "../data/", overwrite = FALSE)
}
# Define paths to the individual data files
data_path <- list(
  US_Blog    = "../data/final/en_US/en_US.blogs.txt",
  US_Twitter = "../data/final/en_US/en_US.twitter.txt",
  US_News    = "../data/final/en_US/en_US.news.txt"
)
# Read the data from each path
data <- map(data_path, readLines)
# Compute basic statistics for each source and combine them into one table
basic_summary <- map_dfr(data, function(x) {
  # Number of lines, total character count, and length of the longest line
  list(lines = length(x),
       chars = sum(nchar(x)),
       chars_longest_line = max(nchar(x)))
})
# Calculate the file size in megabytes for each data source
size <- map_dbl(data_path, function(x) {
  file.info(x)$size / (1024 * 1024) # Convert size from bytes to megabytes
})
# Add file size in MB to the summary
basic_summary$size_mb <- size
# Convert the tibble to a plain data frame so row names can be set
basic_summary <- as.data.frame(basic_summary)
# Use the data source names as row names
row.names(basic_summary) <- names(data_path)
# Display the basic summary table using knitr's kable function for better formatting
basic_summary %>% kable()
| source | lines | chars | chars_longest_line | size_mb |
|---|---|---|---|---|
| US_Blog | 899288 | 206824509 | 40833 | 200.4242 |
| US_Twitter | 2360148 | 162122651 | 144 | 159.3641 |
| US_News | 77259 | 15639408 | 5760 | 196.2775 |
# Set a seed for reproducibility
set.seed(1234)
# Sample 3% of the lines from each source and combine into one character vector
sample_data <- map(data, ~ sample(.x, round(length(.x) * 0.03))) %>%
  unlist(use.names = FALSE)
# Create a corpus from the combined sampled data
corpus <- VCorpus(VectorSource(sample_data))
# Load a list of profanity words from an external source
profanity_url <- "https://www.cs.cmu.edu/~biglou/resources/bad-words.txt"
profanity <- readLines(profanity_url)
# Define a content transformer for regex-based substitutions
sub_transformer <- content_transformer(function(x, pattern, replacement = "") {
  gsub(pattern, replacement, x)
})
# Transform the corpus with several cleaning steps
corpus <- corpus %>%
  # Convert text to ASCII, dropping characters that cannot be represented
  tm_map(content_transformer(function(x) iconv(x, "latin1", "ASCII", sub = ""))) %>%
  # Convert all text to lowercase so the later word matching is case-insensitive
  tm_map(content_transformer(tolower)) %>%
  # Remove URLs
  tm_map(sub_transformer, pattern = "http[[:alnum:][:punct:]]*") %>%
  # Remove profanity words, using the list loaded above
  tm_map(removeWords, profanity) %>%
  # Remove all punctuation
  tm_map(sub_transformer, pattern = "[[:punct:]]+") %>%
  # Remove all digits
  tm_map(sub_transformer, pattern = "[[:digit:]]+") %>%
  # Collapse repeated whitespace into single spaces
  tm_map(sub_transformer, pattern = "\\s+", replacement = " ")
# Function to build an n-gram TermDocumentMatrix and tidy it into a frequency table
create_ngram_dtm <- function(n) {
  # Tokenize with RWeka's NGramTokenizer restricted to n-grams of length n
  control_list <- list(tokenize = function(words) NGramTokenizer(words, Weka_control(min = n, max = n)))
  # Build a TermDocumentMatrix for the corpus with the n-gram tokenizer
  tdm <- TermDocumentMatrix(corpus, control = control_list)
  # Convert to tidy format, sum term frequencies across documents, and sort
  tidy(tdm) %>%
    group_by(term) %>%
    summarize(freq = sum(count)) %>%
    arrange(desc(freq))
}
# Generate and store DTMs for unigram, bigram, trigram, and fourgram
ngram_types <- list(unigram = 1, bigram = 2, trigram = 3, fourgram = 4)
tdm <- map(ngram_types, create_ngram_dtm)
# Bar plot of the top 20 unigrams by frequency
unigram_top20 <- tdm$unigram %>%
  top_n(20, freq) %>%
  arrange(desc(freq))
ggplot(unigram_top20, aes(x = reorder(term, freq), y = freq)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() + # Flip coordinates for better readability of terms
  labs(title = "Top 20 1-Grams by Frequency", x = "1-Gram", y = "Frequency") +
  theme_minimal() + # Use a minimal theme for a cleaner look
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotate the frequency-axis labels
# Generate the wordcloud for unigrams
wordcloud(
  words = unigram_top20$term,
  freq = unigram_top20$freq,
  min.freq = 1,
  max.words = 200,
  random.order = FALSE,
  rot.per = 0.35,
  colors = brewer.pal(8, "Dark2")
)
# Extract the top 20 bigrams based on frequency
top_bigrams <- tdm$bigram %>%
  top_n(20, freq) %>%
  arrange(desc(freq))
# Generate the bar plot for 2-Grams
ggplot(top_bigrams, aes(x = reorder(term, freq), y = freq)) +
  geom_bar(stat = "identity", fill = "coral") +
  coord_flip() + # Horizontal bars for better readability
  labs(title = "Top 20 2-Grams by Frequency", x = "2-Gram", y = "Frequency") +
  theme_minimal() + # Use a minimal theme for a cleaner look
  theme(axis.text.y = element_text(angle = 0, hjust = 1)) # Keep the term labels horizontal
# Generate the wordcloud for bigrams
wordcloud(
  words = top_bigrams$term,
  freq = top_bigrams$freq,
  min.freq = 1,
  max.words = 200,
  random.order = FALSE,
  rot.per = 0.35,
  colors = brewer.pal(8, "Dark2"),
  scale = c(4, 0.5) # Adjust the scale for better visual distinction
)
# Prepare the data: extract the top 20 trigrams based on frequency
top_trigrams <- tdm$trigram %>%
  arrange(desc(freq)) %>%
  slice(1:20)
# Create the bar plot with ggplot2
ggplot(top_trigrams, aes(x = reorder(term, freq), y = freq)) +
  geom_bar(stat = "identity", fill = "cornflowerblue") +
  coord_flip() + # Flip the coordinates for better label readability
  labs(title = "Top 20 3-Grams", x = "3-Gram", y = "Frequency") +
  theme_minimal() + # Use a minimal theme for a cleaner look
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotate the frequency-axis labels
# Generate the wordcloud for trigrams
wordcloud(
  words = top_trigrams$term,
  freq = top_trigrams$freq,
  min.freq = 1,           # Minimum frequency threshold for words to be included
  max.words = 300,        # Maximum number of words to be displayed
  random.order = FALSE,   # Display higher-frequency words more centrally
  rot.per = 0.30,         # Proportion of words rotated 90 degrees
  colors = brewer.pal(8, "Dark2"), # Color palette
  scale = c(4, 0.8)       # Scale for word sizes
)
# Prepare the data: select the top 20 4-grams based on frequency
top_fourgrams <- tdm$fourgram %>%
  top_n(20, freq) %>%
  arrange(desc(freq))
# Generate the bar plot
ggplot(top_fourgrams, aes(x = reorder(term, freq), y = freq)) +
  geom_bar(stat = "identity", fill = "turquoise3") +
  coord_flip() + # Flip the plot for horizontal bars and easier term readability
  labs(title = "Top 20 4-Grams by Frequency", x = "4-Gram", y = "Frequency") +
  theme_light() + # Use a light theme for a clean look
  theme(axis.title.y = element_blank(),        # Remove the term-axis title for a cleaner look
        axis.text.y = element_text(size = 12)) # Adjust text size for readability
# Define parameters for the wordcloud to enhance readability
color_palette <- brewer.pal(8, "Dark2")
rotation_proportion <- 0.30   # Proportion of words rotated 90 degrees
max_words_displayed <- 300
# Create the wordcloud for 4-grams
wordcloud(
  words = top_fourgrams$term,
  freq = top_fourgrams$freq,
  min.freq = 1,
  max.words = max_words_displayed,
  random.order = FALSE,
  rot.per = rotation_proportion,
  colors = color_palette,
  scale = c(3, 0.5) # Adjust scale for better visual distinction among words
)
The most frequent 1-grams were dominated by stop words. These were intentionally retained, since a next-word predictor must be able to suggest such common words and removing them would distort the n-gram frequencies used later.
The sequence ‘of the’ stood out as the most recurring 2-gram, suggesting its common usage in the dataset’s context.
‘Thanks for the’ emerged as the leading 3-gram, indicating a pattern of gratitude expressions within the dataset.
The 4-gram ‘thanks for the follow’ was most prominent, further underscoring themes of acknowledgment and social interaction. One limitation: some frequent 4-grams could not be displayed in the word cloud because the longer phrases did not fit within the plotting area.
With insights from the n-gram analysis, we plan to construct a predictive model that draws on the full dataset rather than only the 3% sample explored here. Building the model will involve batch processing to manage the corpus's substantial size.
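As a rough sketch of the batch-processing idea (not the final implementation), the helper below counts bigrams chunk by chunk and then merges the partial counts, so the full corpus never has to be tokenized in a single pass. The function names, the chunk size, and the simple whitespace tokenizer are placeholder assumptions.
library(purrr)
library(dplyr)
# Count bigrams in one chunk of cleaned text lines (simple whitespace tokenizer)
count_bigrams_chunk <- function(lines) {
  tokens <- strsplit(tolower(lines), "\\s+")
  map_dfr(tokens, function(w) {
    if (length(w) < 2) return(NULL)
    tibble(term = paste(head(w, -1), tail(w, -1)))
  }) %>%
    count(term, name = "freq")
}
# Process the text in batches and merge the partial counts
count_bigrams_batched <- function(lines, chunk_size = 50000) {
  chunks <- split(lines, ceiling(seq_along(lines) / chunk_size))
  map_dfr(chunks, count_bigrams_chunk) %>%
    group_by(term) %>%
    summarize(freq = sum(freq), .groups = "drop") %>%
    arrange(desc(freq))
}
# Example usage on the sampled data: bigram_counts <- count_bigrams_batched(sample_data)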
We aim to explore and assess various modeling techniques to refine the predictive model’s performance, ensuring it delivers reliable and actionable predictions.
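One candidate technique we expect to evaluate is a simple back-off scheme over the n-gram frequency tables built above. The sketch below is only an illustration under assumptions, not the final model: it assumes the tidy tables in tdm with space-separated term strings, and the helper names split_ngrams and predict_next_word are hypothetical.
library(dplyr)
library(stringr)
# Split an n-gram frequency table into (prefix, next word) pairs
split_ngrams <- function(ngram_tbl) {
  ngram_tbl %>%
    mutate(prefix    = str_replace(term, "\\s+\\S+$", ""),
           next_word = str_extract(term, "\\S+$"))
}
bigram_tbl  <- split_ngrams(tdm$bigram)
trigram_tbl <- split_ngrams(tdm$trigram)
# Simple back-off: try the trigram table first, then fall back to bigrams,
# then to the most frequent unigrams overall
predict_next_word <- function(input, n_suggestions = 3) {
  words <- str_split(str_to_lower(str_trim(input)), "\\s+")[[1]]
  lookup <- function(tbl, p) {
    tbl %>% filter(prefix == p) %>% arrange(desc(freq)) %>% pull(next_word)
  }
  candidates <- character(0)
  if (length(words) >= 2)
    candidates <- lookup(trigram_tbl, paste(tail(words, 2), collapse = " "))
  if (length(candidates) < n_suggestions && length(words) >= 1)
    candidates <- c(candidates, lookup(bigram_tbl, tail(words, 1)))
  candidates <- c(candidates, head(tdm$unigram$term, n_suggestions))
  head(unique(candidates), n_suggestions)
}
# Example: predict_next_word("thanks for the")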
The final step involves developing a Shiny application that leverages the predictive model to offer word predictions based on user input. This application aims to demonstrate the practical application of our findings, making predictive text analysis accessible and interactive.
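A minimal sketch of what the planned Shiny interface might look like, assuming a predict_next_word() function such as the hypothetical one sketched above is available; the layout and labels are placeholders.
library(shiny)
# Basic UI: a text box for the phrase and a table of suggested next words
ui <- fluidPage(
  titlePanel("Next-Word Prediction"),
  textInput("phrase", "Type a phrase:", value = ""),
  tableOutput("suggestions")
)
# Server: recompute suggestions whenever the input phrase changes
server <- function(input, output, session) {
  output$suggestions <- renderTable({
    req(nzchar(input$phrase))
    data.frame(`Suggested next word` = predict_next_word(input$phrase),
               check.names = FALSE)
  })
}
# Launch the app with: shinyApp(ui, server)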