First, we will load the data into R. Our data consists of three text files, so we will begin by checking how many lines each file contains.
# Open read connections to the three files and count the lines in each
conn <- file("/Users/ovunc/Desktop/Code_Related/R/NLP Project/final/en_US/en_US.twitter.txt","r")
conn1 <- file("/Users/ovunc/Desktop/Code_Related/R/NLP Project/final/en_US/en_US.news.txt","r")
conn2 <- file("/Users/ovunc/Desktop/Code_Related/R/NLP Project/final/en_US/en_US.blogs.txt","r")
n <- length(readLines(conn))    # lines in the Twitter file
close(conn)
n1 <- length(readLines(conn1))  # lines in the news file
close(conn1)
n2 <- length(readLines(conn2))  # lines in the blogs file
close(conn2)
That means there are 2,360,148 lines in the Twitter file, 1,010,242 lines in the news file, and 899,288 lines in the blogs file.
Since there are 4,269,678 lines in total, we will only use a subset of them in our analysis. We decided to keep approximately 100k lines from each file, sampling each line independently with probability 0.05 from Twitter, 0.1 from news, and 1/9 from blogs.
# Reopen the files and keep a random Bernoulli sample of lines from each
conn <- file("/Users/ovunc/Desktop/Code_Related/R/NLP Project/final/en_US/en_US.twitter.txt","r")
conn1 <- file("/Users/ovunc/Desktop/Code_Related/R/NLP Project/final/en_US/en_US.news.txt","r")
conn2 <- file("/Users/ovunc/Desktop/Code_Related/R/NLP Project/final/en_US/en_US.blogs.txt","r")
set.seed(31)
to_keep <- rbinom(n, 1, 0.05)    # keep each Twitter line with probability 0.05
to_keep1 <- rbinom(n1, 1, 0.1)   # keep each news line with probability 0.1
to_keep2 <- rbinom(n2, 1, 1/9)   # keep each blogs line with probability 1/9
lines <- readLines(conn)[which(to_keep == 1)]
lines1 <- readLines(conn1)[which(to_keep1 == 1)]
lines2 <- readLines(conn2)[which(to_keep2 == 1)]
close(conn)
close(conn1)
close(conn2)
our_lines <- c(lines, lines1, lines2)  # combined sample
Now we will tokenize our lines. In other words, we will extract every word, number, and punctuation mark from the lines and collect them in one large list.
n <- length(our_lines)
tokens <- vector("list", n)
# For each line, extract runs of letters/digits and single punctuation marks
for (i in seq_along(our_lines)){
  tokens[[i]] <- unlist(
    regmatches(our_lines[i], gregexpr("[A-Za-z0-9]+|[[:punct:]]", our_lines[i])))
}
tokens <- unlist(tokens)
tokens <- tolower(tokens)   # lowercase everything
We will now filter out profane words, such as curses, racial slurs, and slang. However, we will not simply remove them from our tokens, because that would shift the indices of the remaining tokens. Instead, we will replace each of them with DISREGARD, as sketched below.
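The exact profanity word list is not reproduced in this report; the following is a minimal sketch of the replacement step, assuming the list has been read from a hypothetical file profanity_list.txt:
# Sketch of the profanity replacement step. The file name below is a
# placeholder for whatever word list is used; it is not part of the report.
profanity <- tolower(readLines("profanity_list.txt"))
is_profane <- tokens %in% profanity
tokens[is_profane] <- "DISREGARD"   # keep indices intact, just mask the word
sum(is_profane)                     # number of tokens that were replaced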
It turns out that 4,923 profane tokens were filtered this way.
Here, we will look at the most frequent words in our lines. Since we want to see the most frequent words, we will drop punctuation and numbers, and we will collapse all low-frequency tokens into a single "Other" row.
table_tokens <- as.data.frame(table(tokens))
# Flag tokens to collapse: low-frequency, pure punctuation, and pure numbers
low_freq <- table_tokens$Freq <= 25000
punct_rows <- grepl("^[[:punct:]]+$", table_tokens$tokens)
nums <- grepl("^[0-9]+$", table_tokens$tokens)
to_drop <- low_freq | punct_rows | nums
other_count <- sum(table_tokens$Freq[to_drop])
table_tokens <- table_tokens[!to_drop, ]
table_tokens <- rbind(table_tokens, data.frame(
  tokens = "Other", Freq = other_count))
library(ggplot2)
# Plot log frequencies of the frequent tokens, excluding the "Other" row
ggplot(table_tokens[table_tokens$tokens != "Other", ], aes(x = tokens, y = log(Freq))) +
  xlab("Tokens") + ylab("Log Frequencies") + geom_point() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
Next, we will analyze the most frequent 2-grams.
library(tokenizers)
bigram <- unlist(tokenize_ngrams(our_lines,n=2))
bigram <- data.frame(table(bigram))
to_keep <- bigram$Freq > 5000
bigram_vis <- bigram[to_keep,]
library(ggplot2)
ggplot(bigram_vis, aes(x = bigram, y = log(Freq))) +
geom_point() + xlab("2-grams") +ylab("Log Frequencies")+
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
Next, we will analyze the most frequent 3-grams.
trigram <- unlist(tokenize_ngrams(our_lines,n=3))
trigram <- data.frame(table(trigram))
to_keep <- trigram$Freq > 600
trigram_vis <- trigram[to_keep,]
ggplot(trigram_vis, aes(x = trigram, y = log(Freq))) +
geom_point() + xlab("3-grams") +ylab("Log Frequencies")+
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
Next, we will analyze the most frequent 4-grams.
library(tokenizers)
quadgram <- unlist(tokenize_ngrams(our_lines,n=4))
quadgram <- data.frame(table(quadgram))
to_keep <- quadgram$Freq > 140
quadgram_vis <- quadgram[to_keep,]
library(ggplot2)
ggplot(quadgram_vis, aes(x = quadgram, y = log(Freq))) +
geom_point() + xlab("4-grams") +ylab("Log Frequencies")+
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
Here, we will call the set of unique elements of our tokens list our dictionary. The question is: "How many words from our dictionary does it take to cover 50% and 90% of all the tokens in our lines?"
every_token_table <- as.data.frame(table(tokens))
# Sort the dictionary by frequency, most frequent first
every_token_table_sorted <- every_token_table[
  order(every_token_table$Freq, decreasing = TRUE), ]
# Accumulate frequencies down the sorted list until we cover 50% of all tokens
running_total <- 0
token_count <- 0
for (i in seq_along(every_token_table_sorted$tokens)[-1]){
  running_total <- running_total + every_token_table_sorted$Freq[i]
  token_count <- token_count + 1
  if (running_total >= length(tokens)/2){
    break
  }
}
# Repeat, this time accumulating until we cover 90% of all tokens
running_total <- 0
token_count1 <- 0
for (i in seq_along(every_token_table_sorted$tokens)[-1]){
  running_total <- running_total + every_token_table_sorted$Freq[i]
  token_count1 <- token_count1 + 1
  if (running_total >= length(tokens)*0.9){
    break
  }
}
Apparently it takes 121 words to cover 50% of all the tokens in our lines, whereas it takes about 14,421 (1.4421 × 10^4) words to cover 90%.
Now we want to estimate how many foreign words there are in our lines. Even though we cannot answer this question with 100% accuracy, we will try to approximate it. To do so, we will first consider the tokens that appear fewer than 4 times in our tokens list. Then we will remove numbers and punctuation from this list, and throw away tokens with heavy character repetition, such as "hahaha", "aaaaa", or "abcabcabc". Lastly, we will load an English dictionary and discard the tokens that appear in it. What remains are foreign words, misspelled words, or simply nonsense words; a sketch of this heuristic follows.
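In the sketch below, the English word list path and the repetition pattern are assumptions for illustration, not the exact code used:
# Heuristic for foreign-looking words. The dictionary path and the
# repetition regex below are illustrative assumptions.
rare <- names(which(table(tokens) < 4))                     # tokens seen fewer than 4 times
rare <- rare[!grepl("^[0-9]+$|^[[:punct:]]+$", rare)]       # drop numbers and punctuation
rare <- rare[!grepl("(.{1,3})\\1{2,}", rare, perl = TRUE)]  # drop "hahaha"-style repetition
english <- tolower(readLines("/usr/share/dict/words"))      # assumed English word list
foreign_like <- rare[!(rare %in% english)]
100 * sum(tokens %in% foreign_like) / length(tokens)        # share of all tokens, in percent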
We see that these words cover 0.6652585 percent of all the tokens.
Using all of this, we aim to build a predictive keyboard that guesses which word you want to type next after you have typed 1, 2, or 3 words. The plan for the model is as follows: suppose you typed "A B C" (where each letter stands for a word). The model first checks whether "A B C" appears in the trigram table; if it does, we look up the quadgrams starting with "A B C" and suggest the most frequent continuation. If not, we back off and run the same check for the bigram "B C". A rough sketch of this back-off lookup is given below.
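The function below is a minimal sketch of that back-off lookup, built on the bigram, trigram, and quadgram tables from above; the function name and the exact matching strategy are illustrative assumptions rather than the final model.
# Sketch of the planned back-off prediction. `bigram`, `trigram`, and
# `quadgram` are the frequency tables created earlier; everything else
# here (the helper name, the matching strategy) is an assumption.
predict_next_word <- function(phrase) {
  phrase <- tolower(trimws(phrase))
  n_words <- length(strsplit(phrase, "\\s+")[[1]])
  tables <- list(NULL, bigram, trigram, quadgram)   # table of (n_words + 1)-grams
  while (n_words >= 1 && n_words <= 3) {
    tab <- tables[[n_words + 1]]
    hits <- tab[startsWith(as.character(tab[[1]]), paste0(phrase, " ")), ]
    if (nrow(hits) > 0) {
      best <- as.character(hits[[1]])[which.max(hits$Freq)]
      return(sub(".*\\s", "", best))   # return only the predicted word
    }
    # back off: drop the first word and try the next shorter table
    phrase <- sub("^\\S+\\s*", "", phrase)
    n_words <- n_words - 1
  }
  NA_character_   # nothing found in any table
}
predict_next_word("thanks for the")   # example call; the result depends on the sampled data
If even the bigram lookup fails, the model could fall back to suggesting the most frequent single words.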