This document serves as the interim report submission for the Coursera Data Science Capstone course. The capstone project involves creating a natural language processing (NLP) application to predict the words that follow user-entered words. For the interim report, the student must demonstrate that the data files have been downloaded and that some exploratory data analysis has been performed.
I used the following code to read the three English-language data sources (blogs, news, and Twitter) from the downloaded capstone dataset. The final application will draw on all three sources, so for the exploratory data analysis I also combined them into a single object and then removed the individual files to free up memory.
# Packages used throughout this analysis
library(here)      # file paths
library(quanteda)  # corpus, tokens, and document-feature matrices
library(ggplot2)   # plots
library(lexicon)   # grady_augmented word list
library(knitr)     # kable tables
blogs <- readLines(here("en_US", "en_US.blogs.txt"))
news <- readLines(here("en_US", "en_US.news.txt"))
twitter <- readLines(here("en_US", "en_US.twitter.txt"))
combotext <- c(blogs, news, twitter)
# Remove the individual objects and reclaim memory
rm(blogs, news, twitter)
invisible(gc())
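As a quick sanity check on the combined object (a minimal sketch, not part of the required analysis), its memory footprint and line count can be reported as follows:
# Optional check: memory footprint and number of lines in the combined text
format(object.size(combotext), units = "MB")
length(combotext)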
In the next step, I created a corpus from the data and then created tokens of individual words. For the latter step I removed numbers, punctuation, URLs, and symbols, and I split hyphenated terms. I also converted all tokens to lowercase. Finally, I created a document-feature matrix from the tokens. These steps use functions from the quanteda package.
corpus <- corpus(combotext)
tokens <- tokens(corpus, what = "word", remove_numbers = TRUE, remove_punct = TRUE,
remove_url = TRUE, remove_symbols = TRUE, split_hyphens = TRUE)
tokens <- tokens_tolower(tokens)
dfm <- dfm(tokens)
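To confirm that tokenization behaved as expected, the objects created above can be inspected with quanteda's helper functions; this is a minimal sketch rather than part of the original analysis.
# Optional checks on the objects created above
ndoc(corpus)          # number of documents (lines) in the corpus
sum(ntoken(tokens))   # total number of tokens after cleaning
nfeat(dfm)            # number of unique words (features) in the dfm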
This section describes several exploratory data analyses that were performed with the data.
The tokens were evaluated to identify the twenty most common words, which are displayed in Figure 1.
top_words <- topfeatures(dfm, n = 20)
top_words_df <- data.frame(word = names(top_words), frequency = unname(top_words))
ggplot(top_words_df, aes(x = reorder(word, -frequency), y = frequency)) +
geom_bar(stat = "identity") +
xlab("Words") +
ylab("Frequency") +
ggtitle("Top Words in Corpus") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Figure 1 - Top Twenty Words
The following code was used to determine the number of unique words that represent 50% and 90% of all word instances in the corpus:
# Calculate the frequency of each term
term_frequencies <- colSums(dfm)
# Sort the terms by frequency in descending order
sorted_term_frequencies <- sort(term_frequencies, decreasing = TRUE)
# Calculate the cumulative frequency
cumulative_frequencies <- cumsum(sorted_term_frequencies)
# Total number of word instances in the corpus
total_word_instances <- sum(sorted_term_frequencies)
# Determine the number of unique words needed to reach 50% of total word instances
threshold_50 <- total_word_instances * 0.5
num_unique_words_50 <- which(cumulative_frequencies >= threshold_50)[1]
# Determine the number of unique words needed to reach 90% of total word instances
threshold_90 <- total_word_instances * 0.9
num_unique_words_90 <- which(cumulative_frequencies >= threshold_90)[1]
The results indicate that there are 69,398,254 word instances (1-grams) in the corpus, but only 126 unique words account for 50% of those instances, and 6,784 account for 90%.
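For reference, the figures cited above can be printed directly from the objects computed in the previous code block; this is a small sketch rather than output from the original analysis.
# Print the coverage results computed above
format(total_word_instances, big.mark = ",")   # total word instances
num_unique_words_50                            # unique words covering 50%
num_unique_words_90                            # unique words covering 90%
# Share of the full vocabulary needed for 90% coverage
round(100 * num_unique_words_90 / length(sorted_term_frequencies), 2)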
The following code was used to identify the words that occur in the corpus but are not in a list of common English words (the grady_augmented dataset from the lexicon package). Table 1 provides examples of non-dictionary words in the corpus.
# Get the unique words from the corpus
unique_words <- featnames(dfm)
# Count the number of unique words
num_unique_words <- length(unique_words)
# Load the English dictionary from the lexicon package
data("grady_augmented")
# Convert the English dictionary to lowercase for comparison
dict_words <- tolower(grady_augmented)
# Determine which words are not in the English dictionary
non_dict_words <- setdiff(unique_words, dict_words)
# Remove terms with apostrophes
non_dict_words <- non_dict_words[!grepl("['’]", non_dict_words)]
# Count the number of non-dictionary words
num_non_dict_words <- length(non_dict_words)
# Calculate the percentage of non-dictionary words
perc_non_dict <- 100* num_non_dict_words / num_unique_words
# Subset the DFM to include only the non-dictionary words
dfm_subset <- dfm_select(dfm, pattern = non_dict_words)
# Calculate the total count of non-dictionary word occurrences
total_non_dict_count <- sum(colSums(dfm_subset))
# Calculate the total count of all words in the corpus
total_word_count <- sum(colSums(dfm))
# Calculate the percentage of total word occurrences that are non-dictionary words
percentage_non_dict <- (total_non_dict_count / total_word_count) * 100
# Create a table of 20 random words from the non-dictionary list
selected_words <- sample(non_dict_words, 20)
word_df <- data.frame(Word = selected_words)
kable(word_df, col.names = c("Random Words"), align = "c", caption = "Table 1: Examples of Non-Dictionary Words in the Corpus")
| Random Words |
|---|
| #ithurts |
| congeneality |
| was.amazing |
| pantall |
| misconfigured |
| optique |
| dsny |
| #gktour |
| bhajante |
| gullability |
| acwi |
| 228all |
| aristochien |
| #swingvision |
| deal.just |
| coleby |
| flâneurs |
| tsumiki |
| caretaking |
| frikikikikiki |
Based on this analysis, 79% of the unique words in the corpus are not common English words. Table 1 shows that although some are words from foreign languages, most fall into other categories such as proper names, misspellings, and less common English words or English word derivatives. Although they make up a high proportion of the unique word list, in total these words account for only 4.8% of the word occurrences in the corpus. A decision is therefore needed on whether to retain them in future steps. They are unlikely to appear in common n-grams, so including them would probably not affect the app's results, but excluding them might speed up the application.
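If the decision is to exclude these terms, one possible approach (a sketch only, assuming the tokens and non_dict_words objects created above) is to drop them from the tokens before building the n-gram matrices:
# Possible approach: drop non-dictionary terms before building n-gram matrices
tokens_clean <- tokens_remove(tokens, pattern = non_dict_words, valuetype = "fixed")
dfm_clean <- dfm(tokens_clean)
nfeat(dfm_clean)   # vocabulary size after removal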
The following code was developed to make a bar chart of the most common bigrams (i.e., 2-grams) in the corpus. Results are displayed in Figure 2.
# Tokenize by bigrams
tokens2 <- tokens_ngrams(tokens, n = 2)
dfm2 <- dfm(tokens2)
top_bigrams <- topfeatures(dfm2, n = 20)
top_bigrams_df <- data.frame(word = names(top_bigrams), frequency = unname(top_bigrams))
ggplot(top_bigrams_df, aes(x = reorder(word, -frequency), y = frequency)) +
geom_bar(stat = "identity") +
xlab("Bigrams") +
ylab("Frequency") +
ggtitle("Top Bigrams in Corpus") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Figure 2 - Top Bigrams
The following code was developed to make a bar chart of the most common trigrams (i.e., 3-grams) in the corpus. Results are displayed in Figure 3.
# Tokenize by trigrams
tokens3 <- tokens_ngrams(tokens, n = 3)
dfm3 <- dfm(tokens3)
top_trigrams <- topfeatures(dfm3, n = 20)
top_trigrams_df <- data.frame(word = names(top_trigrams), frequency = unname(top_trigrams))
ggplot(top_trigrams_df, aes(x = reorder(word, -frequency), y = frequency)) +
geom_bar(stat = "identity") +
xlab("Trigrams") +
ylab("Frequency") +
ggtitle("Top Trigrams in Corpus") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Figure 3 - Top Trigrams
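Although prediction itself is beyond the scope of this interim report, the trigram counts above already suggest the basic lookup the final application will need: given the last two words a user types, return the most frequent trigrams that begin with them. The sketch below illustrates the idea using the dfm3 object created above; the helper name predict_next_word is hypothetical and not part of the submitted analysis.
# Hypothetical sketch: look up likely next words from trigram counts
trigram_counts <- colSums(dfm3)
predict_next_word <- function(w1, w2, counts, n = 3) {
  # Trigram features are stored as "word1_word2_word3"
  prefix <- paste(w1, w2, "", sep = "_")
  matches <- counts[startsWith(names(counts), prefix)]
  if (length(matches) == 0) return(character(0))
  top <- sort(matches, decreasing = TRUE)[seq_len(min(n, length(matches)))]
  sub("^.*_", "", names(top))   # return the third word of each matching trigram
}
predict_next_word("one", "of", trigram_counts)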