This is a comprehensive project that covers all the areas of the Data Science Specialization offered by Johns Hopkins University. It involves working with natural language processing (NLP) and building a predictive text model that can be deployed as a useful data application. This is not an easy task, and each step will be outlined clearly as we proceed.
library(tidyverse)
-- Attaching packages --------------------------------------- tidyverse 1.3.1 --
v ggplot2 3.3.5     v purrr   0.3.4
v tibble  3.1.8     v dplyr   1.0.7
v tidyr   1.2.0     v stringr 1.4.0
v readr   2.1.3     v forcats 0.5.1
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
# Reading in the three English corpora.
us_twitter = readLines('./data/Coursera-SwiftKey/final/en_US/en_US.twitter.txt')
us_blogs = readLines('./data/Coursera-SwiftKey/final/en_US/en_US.blogs.txt')
us_news = readLines('./data/Coursera-SwiftKey/final/en_US/en_US.news.txt')
# Making the Twitter data a data frame
us_twitter_data =
  data.frame(
    text = us_twitter
  ) %>%
  # Adding a character-count variable
  mutate(no_characters = nchar(text))
list.files('./data/Coursera-SwiftKey/final/en_US')
lens = c(length(us_twitter), length(us_blogs), length(us_news))
barplot(lens, names.arg = c('us_twitter', 'us_blogs', 'us_news'), col = 'orange', main = 'Lines by Corpus',
ylab = 'Length', xlab = 'Corpus')
str(us_twitter_data)
'data.frame': 2360148 obs. of 2 variables:
$ text : chr "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long." "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason." "they've decided its more fun if I don't." "So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;)" ...
$ no_characters: int 109 111 40 84 63 77 101 51 54 30 ...
summary(us_twitter_data)
text no_characters
Length:2360148 Min. : 2.0
Class :character 1st Qu.: 37.0
Mode :character Median : 64.0
Mean : 68.8
3rd Qu.:100.0
Max. :213.0
Our primary tasks are deceptively simple:
- How often do certain words occur?
- How often do certain pairs of words occur together?
This will require the use of regular expressions because we do not know exactly what words we are looking for.
Let us look at the distribution of words by their length.
# For each word length from 1 to 10, count how many of the first
# 1,000 tweets contain a word of exactly that length.
counts = numeric(0)
for (number in 1:10) {
  regex = paste('\\s[a-z]{', number, '}\\s', sep = "")
  count = str_detect(us_twitter_data$text[1:1000], pattern = regex) %>% sum()
  counts = c(counts, count)
}
counts
barplot(counts, names.arg = 1:length(counts), col = 'steelblue', xlab = "Word Length", ylab = "Frequency")
This might not be very useful on its own, as it only shows the relative distribution of words of a given length, and a similar distribution would likely hold for other text datasets, not just ours.
There are also overlapping matches: a single tweet can contain words of several different lengths, which is why the counts sum to more than the total number of tweets sampled. But the pattern holds throughout the whole dataset.
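An alternative, shown here only as a sketch on the same 1,000-tweet sample, is to extract every word and tabulate its length directly, which avoids counting a tweet once per length. The variable name word_lengths is introduced here for illustration.

# Sketch: length of every individual word in the sample.
word_lengths = us_twitter_data$text[1:1000] %>%
  tolower() %>%
  str_extract_all("[a-z]+") %>%   # pull out runs of letters
  unlist() %>%
  nchar()                         # length of each extracted word

barplot(table(word_lengths)[1:10], col = 'steelblue',
        xlab = 'Word Length', ylab = 'Frequency')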
# A function that returns the ten most frequent matches of a regex
# (single words, pairs, triples, ...) in a character vector.
uniquify = function (string, regex) {
  # Extract matches using the regular expression
  words = str_extract_all(string, regex)
  # Convert to lowercase
  words = tolower(unlist(words))
  # Count the frequency of each match
  word_freq = table(words)
  # Sort by frequency
  sorted_freq = sort(word_freq, decreasing = TRUE)
  # Return the 10 most frequent matches
  head(sorted_freq, 10)
}
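To make the return value of uniquify concrete, here is a quick call on a couple of made-up strings; the toy input is purely illustrative.

# Toy input, just to illustrate the shape of the result.
toy = c("the cat sat on the mat", "the dog sat on the cat")
uniquify(toy, "\\w+")
# returns a named table of counts sorted by frequency,
# here: the = 4, cat/on/sat = 2, dog/mat = 1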
if (!file.exists('./data/Coursera-SwiftKey/final/en_US/word_freq.csv')) {
  word_freq = as.data.frame(
    uniquify(us_twitter_data$text, "\\w+")
  )
  # Writing this to file so it does not have to be recomputed.
  write.csv(word_freq, './data/Coursera-SwiftKey/final/en_US/word_freq.csv')
} else {
  # Reading the cached counts back in.
  word_freq = read.csv('./data/Coursera-SwiftKey/final/en_US/word_freq.csv')
}
start = 1
end = 10
barplot(word_freq$Freq[start:end], names.arg = word_freq$words[start:end], col = 'purple',
xlab = 'Word Rank', ylab = 'Word Frequency')
A lot of the most common words are not at all surprising; they are the usual ones used every day:
- articles: the, a and an
- filler words: like
- other commonly used words such as when and today.
We have to take this into account.
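One option, not applied above and shown only as a sketch, is to drop a small stop-word list before ranking; the hand-picked vector below is illustrative rather than a standard lexicon, and it assumes word_freq has the words and Freq columns created above.

# A minimal, hand-picked stop-word list (illustrative only).
stop_words = c("the", "a", "an", "and", "to", "of", "in", "is", "it",
               "i", "you", "for", "on", "my", "that", "like")

word_freq_filtered = word_freq %>%
  filter(!words %in% stop_words) %>%
  arrange(desc(Freq))

barplot(word_freq_filtered$Freq[1:10],
        names.arg = word_freq_filtered$words[1:10],
        col = 'purple', xlab = 'Word Rank', ylab = 'Word Frequency')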
# Regex matching a pair of adjacent words.
regex = "\\b\\w+\\s\\w+\\b"
str_view_all(us_twitter_data$text[1:10], regex, match = TRUE)
if (!file.exists('./data/Coursera-SwiftKey/final/en_US/pairs_freq.csv')) {
  pairs_freq =
    as.data.frame(
      uniquify(us_twitter_data$text, "\\b\\w+\\s\\w+\\b")
    )
  # Writing this to file so it does not have to be recomputed.
  write.csv(pairs_freq, './data/Coursera-SwiftKey/final/en_US/pairs_freq.csv')
} else {
  # Reading the cached counts back in.
  pairs_freq = read.csv('./data/Coursera-SwiftKey/final/en_US/pairs_freq.csv')
}
head(pairs_freq)
barplot(pairs_freq$Freq, names.arg = pairs_freq$words, col = "lightblue", las = 2, ylab = "Word Frequency")
Now that we have seen which words appear most often and next to which other words, we can generalize our pattern matching to find longer runs of words that occur together.
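The two- and three-word patterns only differ in how many \w+ chunks they chain together, so a small helper could generate the pattern for any n. The ngram_regex function below is a sketch, not part of the original code.

# Build the regex for an n-word sequence, e.g. ngram_regex(3)
# gives "\\b\\w+\\s\\w+\\s\\w+\\b".
ngram_regex = function(n) {
  paste0("\\b", paste(rep("\\w+", n), collapse = "\\s"), "\\b")
}

# For example, the trigram counts below could be reproduced with:
# uniquify(us_twitter_data$text, ngram_regex(3))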
if (!file.exists('./data/Coursera-SwiftKey/final/en_US/triple_freq.csv')) {
  triple_freq =
    as.data.frame(
      uniquify(us_twitter_data$text, "\\b\\w+\\s\\w+\\s\\w+\\b")
    )
  # Writing this to file so it does not have to be recomputed.
  write.csv(triple_freq, './data/Coursera-SwiftKey/final/en_US/triple_freq.csv')
} else {
  # Reading the cached counts back in.
  triple_freq = read.csv('./data/Coursera-SwiftKey/final/en_US/triple_freq.csv')
}
barplot(triple_freq$Freq, names.arg = triple_freq$words, col = "pink", las = 2, ylab = "Word Frequency")
head(triple_freq)
|   | words              | Freq  |
|---|--------------------|-------|
| 1 | thanks for the     | 22284 |
| 2 | thank you for      |  7725 |
| 3 | t wait to          |  7454 |
| 4 | looking forward to |  6621 |
| 5 | i love you         |  6344 |
| 6 | i want to          |  5138 |
That concludes our exploratory data analysis.
Our remaining tasks are to:
- Build an n-gram model based on Markov chains.
- Predict the next word from the previous one, two, or three words (a minimal sketch of this idea follows below).
- Figure out a way to evaluate model performance.
- Make the model run in a reasonable amount of time.
- Deal with edge cases, such as uncommon or unseen words.
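As a starting point for the prediction task, here is a minimal sketch of a back-off lookup over the trigram and bigram counts computed above. The function name predict_next and the back-off scheme are illustrative, not the final model; it assumes triple_freq and pairs_freq exist as created earlier.

# Predict the next word: try the trigram table first, then back off to bigrams.
predict_next = function(phrase, triple_freq, pairs_freq) {
  phrase = tolower(str_trim(phrase))
  n_words = str_count(phrase, "\\S+")

  if (n_words >= 2) {
    # Keep the last two words and look for trigrams that start with them.
    last_two = word(phrase, n_words - 1, n_words)
    hits = triple_freq[startsWith(as.character(triple_freq$words),
                                  paste0(last_two, " ")), ]
    if (nrow(hits) > 0) {
      best = as.character(hits$words[which.max(hits$Freq)])
      return(word(best, 3))
    }
  }

  # Back off to bigrams keyed on the last word.
  last_one = word(phrase, n_words)
  hits = pairs_freq[startsWith(as.character(pairs_freq$words),
                               paste0(last_one, " ")), ]
  if (nrow(hits) > 0) {
    best = as.character(hits$words[which.max(hits$Freq)])
    return(word(best, 2))
  }
  NA_character_
}

# e.g. predict_next("thanks for", triple_freq, pairs_freq) should return "the",
# given the trigram counts shown above.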