Having already downloaded the zip file from the course website, I next unzip it and load in the data.
if(!file.exists("./Coursera-SwiftKey")){
unzip("C:/Users/james/Documents/R/JHU DSS Data Science Capstone DATA/Coursera-SwiftKey.zip", exdir = "./Coursera-SwiftKey")}
Here I will display the three file sizes in megabytes.
blogsize <- file.size("./Coursera-SwiftKey/final/en_US/en_US.blogs.txt")
twittersize <- file.size("./Coursera-SwiftKey/final/en_US/en_US.twitter.txt")
newssize <- file.size("./Coursera-SwiftKey/final/en_US/en_US.news.txt")
data.frame(Data = c("Blogs", "Twitter", "News"),
File_Size = c(blogsize/(1024^2), twittersize/(1024^2), newssize/(1024^2)))
## Data File_Size
## 1 Blogs 200.4242
## 2 Twitter 159.3641
## 3 News 196.2775
Next I will provide the number of lines in each dataset along with the length, in characters, of its longest line.
twitter <- readLines("./Coursera-SwiftKey/final/en_US/en_US.twitter.txt")
blogs <- readLines("./Coursera-SwiftKey/final/en_US/en_US.blogs.txt")
news <- readLines("./Coursera-SwiftKey/final/en_US/en_US.news.txt")
data.frame(Data = c("Blogs", "Twitter", "News"),
Number_of_lines_of_text = c(length(blogs), length(twitter), length(news)),
Longest_line_length = c(summary(nchar(blogs))[[6]], summary(nchar(twitter))[[6]], summary(nchar(news))[[6]]))
## Data Number_of_lines_of_text Longest_line_length
## 1 Blogs 899288 40835
## 2 Twitter 2360148 213
## 3 News 77259 5760
To keep the analysis from slowing down, I will work with a 5% sample of each dataset. I will then compile the samples into one dataset and write it to a new text file to speed up later processing.
set.seed(210914)
blogs_sample <- sample(blogs, length(blogs)*0.05)
twitter_sample <- sample(twitter, length(twitter)*0.05)
news_sample <- sample(news, length(news)*0.05)
all_sample <- c(blogs_sample, twitter_sample, news_sample)
writeLines(all_sample, "./en_US.all_sample.txt")
head(all_sample, n = 3)
## [1] "G Rated gathers up the most brilliant ideas and inspiring product and present them all to you. Follow their blog for awesome DIY ideas, simple recipes, fashion, crafts, and decor."
## [2] "The Qualifications to qualify for Il Pay day loans is often a lowest typical earnings of Buck1,000, citizenship of america, being a resident of Il for 5years and up, lowest 18 yrs . old."
## [3] "fabric used this week: -8 yards"
Next I will create a function that cleans the data by removing hyperlinks, stripping numbers, punctuation and any other non-alphabetic characters, and converting all letters to lowercase.
cleaningfn1 <- function(data){
    # remove hyperlinks
    data <- gsub("http[^[:space:]]*", "", data)
    # keep only letters and whitespace
    data <- gsub("[^a-zA-Z[:space:]]", "", data)
    # convert to lowercase and return the cleaned data
    tolower(data)
}
all_sample <- cleaningfn1(all_sample)
I want to remove profanity, so I will create a profanity variable from an appropriate word list found online.
profanityURL <- "https://raw.githubusercontent.com/RobertJGabriel/Google-profanity-words/master/list.txt"
if(!file.exists("./profanity.txt")){download.file(profanityURL, "./profanity.txt")}
profanity <- read.table("./profanity.txt")
profanity <- as.character(profanity[,1])
head(profanity)
## [1] "4r5e" "5h1t" "5hit" "a55" "anal" "anus"
The following for-loop illustrates the process of removing profanity from the all_sample dataset. At this point the dataset is fully cleaned and ready for use. However, in the processing later on it will be faster and more convenient to filter out profanity with anti_join rather than this for-loop. I leave it in to show the process I first went through to get a cleaned dataset.
for(x in 1:length(profanity)){
    # remove the profane word when it appears mid-line, at the end of a line, or at the start
    all_sample <- gsub(paste0(" ", profanity[x], " "), " ", all_sample)
    all_sample <- gsub(paste0(" ", profanity[x], "$"), "", all_sample)
    all_sample <- gsub(paste0("^", profanity[x], " "), "", all_sample)
}
library(tidyverse)
library(tidytext)
library(stopwords)
Having loaded the necessary libraries, I next convert the profanity list to a tibble and create a stopwords variable. At this stage of the analysis it is illustrative to remove words that are too common to add meaning to sentences; with them removed we get a more interesting view of the most common single-word occurrences.
profanity <- tibble(profanity) %>% rename(word = 1)
stopwords <- stopwords("en") %>% tibble() %>% rename(word = 1)
# strip apostrophes so the stopwords match the cleaned text (e.g. "don't" becomes "dont")
stopwords$word <- gsub("[^a-zA-Z0-9[:space:]]", "", stopwords$word)
Next I will use the cleaning function built earlier and the unnest_tokens function to create datasets that are essentially lists of the single words that occur. The main purpose of doing this separately for each dataset (rather than once on the previously cleaned all_sample dataset) is that I want to tag each word occurrence with a source variable so that I can compare the different sources.
blogs_sample <- cleaningfn1(blogs_sample)
blogs_sample_df <- tibble(blogs_sample)
blogs_sample_unnest <- blogs_sample_df %>%
unnest_tokens(word, blogs_sample) %>%
mutate(source="blogs") %>%
anti_join(profanity) %>%
anti_join(stopwords)
blogs_sample_unnest
## # A tibble: 932,979 x 2
## word source
## <chr> <chr>
## 1 g blogs
## 2 rated blogs
## 3 gathers blogs
## 4 brilliant blogs
## 5 ideas blogs
## 6 inspiring blogs
## 7 product blogs
## 8 present blogs
## 9 follow blogs
## 10 blog blogs
## # ... with 932,969 more rows
twitter_sample <- cleaningfn1(twitter_sample)
twitter_sample_df <- tibble(twitter_sample)
twitter_sample_unnest <- twitter_sample_df %>%
unnest_tokens(word, twitter_sample) %>%
mutate(source="twitter") %>%
anti_join(profanity) %>%
anti_join(stopwords)
twitter_sample_unnest
## # A tibble: 816,022 x 2
## word source
## <chr> <chr>
## 1 one twitter
## 2 year twitter
## 3 deal twitter
## 4 nutsshould twitter
## 5 locked twitter
## 6 least twitter
## 7 dumbnetworkstheirspendingrules twitter
## 8 constantly twitter
## 9 running twitter
## 10 time twitter
## # ... with 816,012 more rows
news_sample <- cleaningfn1(news_sample)
news_sample_df <- tibble(news_sample)
news_sample_unnest <- news_sample_df %>%
unnest_tokens(word, news_sample) %>%
mutate(source="news") %>%
anti_join(profanity) %>%
anti_join(stopwords)
news_sample_unnest
## # A tibble: 72,183 x 2
## word source
## <chr> <chr>
## 1 army news
## 2 corps news
## 3 engineers news
## 4 blown news
## 5 holes news
## 6 birds news
## 7 point news
## 8 levee news
## 9 monday news
## 10 night news
## # ... with 72,173 more rows
Here is the compiled dataset, ordered by frequency of occurrence, followed by a bar chart of the 20 most common words.
all_sample_unnest <- rbind(blogs_sample_unnest, twitter_sample_unnest, news_sample_unnest) %>%
count(word) %>%
arrange(desc(n))
all_sample_unnest
## # A tibble: 113,843 x 2
## word n
## <chr> <int>
## 1 just 12766
## 2 like 11340
## 3 one 10500
## 4 can 9576
## 5 get 9391
## 6 time 8365
## 7 love 7554
## 8 good 7473
## 9 now 7285
## 10 day 7102
## # ... with 113,833 more rows
all_sample_unnest %>% top_n(20) %>%
ggplot(aes(x=reorder(word, -n), y = n)) +
geom_col() +
xlab("Word") +
ylab("Count") +
scale_x_discrete(guide = guide_axis(n.dodge = 2))
Here is a comparison of the most common words across the three sources.
rbind(blogs_sample_unnest, twitter_sample_unnest, news_sample_unnest) %>%
group_by(source) %>%
count(word) %>%
arrange(desc(n)) %>%
top_n(20) %>%
ggplot(aes(x=reorder(word, n), y = n)) +
geom_col() +
xlab("Word") +
ylab("Count") +
coord_flip() +
facet_grid(~source, scales="free")
Next I will create a bigrams dataset, save it as an object for later use, and display a bar chart. Here I have decided to leave in the stopwords, since without them the two-, three- and four-word groups would be mostly grammatically incorrect.
blogs_sample_bigram <- blogs_sample_df %>%
unnest_tokens(bigram, blogs_sample, token = "ngrams", n=2)
twitter_sample_bigram <- twitter_sample_df %>%
unnest_tokens(bigram, twitter_sample, token = "ngrams", n=2)
news_sample_bigram <- news_sample_df %>%
unnest_tokens(bigram, news_sample, token = "ngrams", n=2)
all_sample_bigram <- rbind(blogs_sample_bigram, twitter_sample_bigram, news_sample_bigram) %>%
count(bigram) %>%
arrange(desc(n))
if(!file.exists("./bigram_data.rds")){saveRDS(all_sample_bigram, "./bigram_data.rds")}
all_sample_bigram %>% top_n(20) %>%
ggplot(aes(x=reorder(bigram, -n), y = n)) +
geom_col() +
xlab("Bigram") +
ylab("Count") +
scale_x_discrete(guide = guide_axis(n.dodge = 2))
Next, the same for trigrams.
blogs_sample_trigram <- blogs_sample_df %>%
unnest_tokens(trigram, blogs_sample, token = "ngrams", n=3) %>%
na.omit()
twitter_sample_trigram <- twitter_sample_df %>%
unnest_tokens(trigram, twitter_sample, token = "ngrams", n=3) %>%
na.omit()
news_sample_trigram <- news_sample_df %>%
unnest_tokens(trigram, news_sample, token = "ngrams", n=3) %>%
na.omit()
all_sample_trigram <- rbind(blogs_sample_trigram, twitter_sample_trigram, news_sample_trigram) %>%
count(trigram) %>%
arrange(desc(n))
if(!file.exists("./trigram_data.rds")){saveRDS(all_sample_trigram, "./trigram_data.rds")}
all_sample_trigram %>% top_n(20) %>%
ggplot(aes(x=reorder(trigram, -n), y = n)) +
geom_col() +
xlab("Trigram") +
ylab("Count") +
scale_x_discrete(guide = guide_axis(angle=90))
Finally, quadgrams.
blogs_sample_quadgram <- blogs_sample_df %>%
unnest_tokens(quadgram, blogs_sample, token = "ngrams", n=4) %>%
na.omit()
twitter_sample_quadgram <- twitter_sample_df %>%
unnest_tokens(quadgram, twitter_sample, token = "ngrams", n=4) %>%
na.omit()
news_sample_quadgram <- news_sample_df %>%
unnest_tokens(quadgram, news_sample, token = "ngrams", n=4) %>%
na.omit()
all_sample_quadgram <- rbind(blogs_sample_quadgram, twitter_sample_quadgram, news_sample_quadgram) %>%
count(quadgram) %>%
arrange(desc(n))
if(!file.exists("./quadgram_data.rds")){saveRDS(all_sample_quadgram, "./quadgram_data.rds")}
all_sample_quadgram %>% top_n(20) %>%
ggplot(aes(x=reorder(quadgram, -n), y = n)) +
geom_col() +
xlab("Quadgram") +
ylab("Count") +
scale_x_discrete(guide = guide_axis(angle=90))
Let’s see what proportion of the unique words accounts for 50 and 90 percent of all word occurrences.
total_n <- sum(all_sample_unnest$n)
total_word <- nrow(all_sample_unnest)
all_sample_cumsum <- all_sample_unnest %>% mutate(proportion = n/total_n, cumsum = cumsum(proportion))
cover_0.5 <- which(all_sample_cumsum$cumsum > 0.5)[1]
cover_0.9 <- which(all_sample_cumsum$cumsum > 0.9)[1]
cover_0.5/total_word
## [1] 0.007440071
cover_0.9/total_word
## [1] 0.1390072
To build a first prediction model, I start by loading the n-gram data from the previously saved objects.
bigram_data <- readRDS("./bigram_data.rds")
trigram_data <- readRDS("./trigram_data.rds")
quadgram_data <- readRDS("./quadgram_data.rds")
Next, I build a function that will predict the next word when one, two or three words are given.
next_word_predictor <- function(x, firstword=NULL, secondword=NULL, thirdword=NULL){
    if(x==2){
        # most frequent bigram beginning with the given word (the n-gram data are sorted by count)
        twoword_id <- grep(paste0("^", firstword, " "), bigram_data$bigram)[1]
        first_comb <- bigram_data[twoword_id,][1]
        strsplit(first_comb$bigram, " ")[[1]][2]
    }
    else if (x==3){
        # most frequent trigram beginning with the two given words
        threeword_id <- grep(paste0("^", firstword, " ", secondword, " "), trigram_data$trigram)[1]
        first_comb <- trigram_data[threeword_id,][1]
        strsplit(first_comb$trigram, " ")[[1]][3]
    }
    else if (x==4){
        # most frequent quadgram beginning with the three given words
        fourword_id <- grep(paste0("^", firstword, " ", secondword, " ", thirdword, " "), quadgram_data$quadgram)[1]
        first_comb <- quadgram_data[fourword_id,][1]
        strsplit(first_comb$quadgram, " ")[[1]][4]
    }
    else{
        print("")
    }
}
Let’s test the function a few times. Note that the first argument is the n in n-gram, in other words the position (within the word group) of the word to be predicted. The second, third and fourth arguments are the words of the group that precede the word to be predicted.
next_word_predictor(6, "when")
## [1] ""
next_word_predictor(2, "speaking")
## [1] "of"
next_word_predictor(3, "people", "always")
## [1] "assume"
next_word_predictor(4, "why", "do", "they")
## [1] "have"
next_word_predictor(2, "terrible")
## [1] "and"
next_word_predictor(3, "elephants", "are")
## [1] "able"
next_word_predictor(4, "if", "only", "i")
## [1] "could"
Here I will list some ideas and methods that I will explore to further this project and improve the model.
In order to provide a reasonable experience for the user, the algorithm runtime must be considered. A function like Rprof() could be used in this diagnostic process.
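As a rough sketch of that diagnostic step (the output file name predictor_profile.out and the choice of test calls are mine), the predictor built above could be timed and profiled along these lines:
# time a single prediction
system.time(next_word_predictor(3, "people", "always"))
# profile repeated predictions to see where time is spent
Rprof("predictor_profile.out")
invisible(replicate(20, next_word_predictor(4, "why", "do", "they")))
Rprof(NULL)
summaryRprof("predictor_profile.out")$by.self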
As this is a text prediction algorithm, it is intended for use on smartphones. Smartphones have limited memory and processing power, so it will be necessary to consider how much RAM is required and how much space the objects take up. Functions like gc() and object.size() will help.
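For example, the memory footprint of the three n-gram tables could be checked with a quick sketch like this, using only base R functions:
# report the size of each n-gram object in megabytes
sapply(list(bigrams = bigram_data, trigrams = trigram_data, quadgrams = quadgram_data),
       function(obj) format(object.size(obj), units = "Mb"))
# trigger garbage collection and report current memory usage
gc()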
The data only account for a limited range of possible word sequences. It would be useful to add a process that handles word combinations that haven’t appeared in the dataset; ‘backoff models’ will be worth exploring.
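As a rough illustration of the backoff idea only (not a proper backoff model, which would also weight the counts), a hypothetical wrapper could try the longest n-gram the input allows and fall back to shorter n-grams whenever next_word_predictor finds no match and therefore returns NA:
# sketch of a simple backoff wrapper around next_word_predictor;
# 'words' is a character vector of up to three preceding words
backoff_predictor <- function(words){
    words <- tail(words, 3)
    prediction <- NA
    while(length(words) > 0 && is.na(prediction)){
        prediction <- do.call(next_word_predictor, c(list(length(words) + 1), as.list(words)))
        words <- words[-1]   # back off: drop the earliest word and try a shorter n-gram
    }
    prediction
}
backoff_predictor(c("elephants", "are"))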
My first model draft could be simplified. I should consider how to condense the code and find quicker ways to achieve the same results. I would also like to simplify the argument input: for instance, instead of the currently required 3, "people", "always", I would like "people always" to be sufficient.
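A hypothetical wrapper along these lines (predict_from_phrase is a name I have made up) could translate a single string into the arguments that the current function expects:
# sketch: accept a phrase such as "people always" and pass its last
# (up to three) words to next_word_predictor
predict_from_phrase <- function(phrase){
    words <- tail(strsplit(tolower(phrase), "\\s+")[[1]], 3)
    if(length(words) == 1) next_word_predictor(2, words[1])
    else if(length(words) == 2) next_word_predictor(3, words[1], words[2])
    else next_word_predictor(4, words[1], words[2], words[3])
}
predict_from_phrase("people always")
This should behave like next_word_predictor(3, "people", "always") above.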