Due to limited time (I have 30 minutes before 12) I will be as brief as possible. I also want to say that I have decided to use only base functions and to write the functions I need myself, because the available packages do not match what I want and I prefer to solve the problems in my own way.
First, load the (whole) data:
text1 <- readLines("C:/Users/Alberto2/Desktop/Coursera Capstone Project/final/en_US/en_US.blogs.txt")
text2 <- readLines("C:/Users/Alberto2/Desktop/Coursera Capstone Project/final/en_US/en_US.twitter.txt")
text3 <- readLines("C:/Users/Alberto2/Desktop/Coursera Capstone Project/final/en_US/en_US.news.txt")
cbind(summary(text1), summary(text2), summary(text3))
## [,1] [,2] [,3]
## Length "899288" "2360148" "77259"
## Class "character" "character" "character"
## Mode "character" "character" "character"
The corpora contain a very large number of lines, mostly short phrases and small texts.
The data needs some cleaning: I will remove all the upper case letters and the punctuation, but leave the apostrophes.
text1 <- tolower(text1)
text2 <- tolower(text2)
text3 <- tolower(text3)
text1 <- gsub("[^[:alnum:][:space:]']", "", text1)
text2 <- gsub("[^[:alnum:][:space:]']", "", text2)
text3 <- gsub("[^[:alnum:][:space:]']", "", text3)
Now I will split the phrases into words: search for the spaces between words and tell R to split on every space.
text1 <- strsplit(text1, " ")
text2 <- strsplit(text2, " ")
text3 <- strsplit(text3, " ")
Basically, what is left is all the phrases, but with the words split into character vectors inside a list.
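As a quick check of my own (not part of the original script), the structure of the first couple of elements can be inspected with:
str(head(text1, 2))  # each list element is a character vector with the words of one line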
To get the word counts (and the frequency of each word) the elements need to be unlisted and then merged into one vector.
list1 <- unlist(text1)
list2 <- unlist(text2)
list3 <- unlist(text3)
list <- c(list1, list2, list3)
Now, with the table function, you can see the frequency of every unique word, and then order the elements in decreasing order.
table1 <- table(list)
table1 <- table1[order(-table1)]
table1[1:40]
## list
## the to and a i of in you is
## 2932049 1920318 1586224 1567821 1485845 1292563 1020163 842141 809956
## for that it on my with this was
## 773282 717032 704350 569876 562850 536897 478533 428072 412395
## be have at are me but so we as
## 406388 397183 372980 361748 338194 337636 330448 323310 307710
## not your all just from out up like what
## 303721 272258 267928 253112 243139 227487 223730 222773 222206
## or if he they
## 217669 216972 216258 215772
What you can see there is the array of the 40 most used words in the corpora. A plot of the distribution follows in the next chunk.
plot(table1[1:40])
dev.off()
## null device
## 1
As you can see, after approximately the first 20 words the curve falls dramatically toward the x axis.
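To put a rough number on that claim, here is a small check of my own (not part of the original analysis) that computes what share of all word occurrences is covered by the top N words:
coverage <- cumsum(as.numeric(table1)) / sum(as.numeric(table1))
coverage[c(20, 40, 100)]  # share of the corpus covered by the 20, 40 and 100 most used words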
Now I mix all the clean, separated texts and create a null object called test5.
textmix <- c(text1, text2, text3)
test5 <- NULL
Now for the second part: finding which combinations of 2, 3, ..., n words are the most frequent. I designed a function that returns the words pasted two by two, so afterwards it is easy to just table and plot the frequencies. It is also important to say that this function can be adapted to any n-gram length needed, simply by changing the window from 2 to n (the desired chunk length); a generalized sketch follows the function definition below.
pairs_r <- function(test) {
  for (i in 1:length(test)) {
    if (length(test[[i]]) < 2) next  # skip phrases with fewer than two words
    # paste each word together with the word that follows it
    n <- paste(test[[i]][seq(1, length(test[[i]]) - 1)],
               test[[i]][seq(2, length(test[[i]]))])
    test5 <<- c(test5, n)  # accumulate the pairs in the global test5 object
  }
}
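As mentioned above, the same pasting trick generalizes to chunks of n words. Here is a minimal sketch of that generalization (a hypothetical ngrams_r of my own, which returns its result instead of writing to a global object; it is not part of the original code):
ngrams_r <- function(test, n = 2) {
  out <- NULL
  for (i in 1:length(test)) {
    words <- test[[i]]
    if (length(words) < n) next  # skip phrases shorter than n words
    # paste every window of n consecutive words into one string
    grams <- sapply(seq(1, length(words) - n + 1), function(start) {
      paste(words[start:(start + n - 1)], collapse = " ")
    })
    out <- c(out, grams)
  }
  out
}
For example, table(ngrams_r(textmix_sample, 3)) would give the frequencies of the triplets mentioned in the plans further down.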
The pairs_r function basically grabs the character vector of each element of the list (each phrase) and pastes its words together in chunks of two, using R sequencing and the length of each phrase. All the results are stored in the test5 object created before. Processing all the information with this function would take about 21 hours, so for reasons of time I will take a random sample of 1,000 phrases and run the function on that.
model <- which(lengths(textmix) == 0)  # find vectors of length 0
textmix[model] <- NULL                 # remove vectors of length 0
m <- sample(1:length(textmix), 1000, replace = TRUE)
textmix_sample <- textmix[m]
pairs_r(textmix_sample)
Now let's see the 50 most used pairs of words.
pairstable <- table(test5)
pairstable <- pairstable[order(-pairstable)]
pairstable[1:50]
## test5
## of the in the to the on the to be for the
## 82 79 47 44 42 40
## at the if you and the it was to get from the
## 35 25 23 23 21 20
## i am in a it is with the have a for a
## 20 20 20 20 19 18
## i have is a with a and i i was of my
## 18 18 18 17 17 16
## rt thanks for want to you are one of the same
## 16 16 16 16 15 15
## was a going to i think is the to see trying to
## 15 14 14 14 14 14
## will be all the to you i just i love that the
## 14 13 13 12 12 12
## to a you can you know a good have been i know
## 12 12 12 11 11 11
## i want out of
## 11 11
A simple plot next.
plot(pairstable[1:50])
It can also be noted that after the first 20 combinations the curve tends to fall toward the x axis.
Plans for creating a prediction algorithm:

1. Try to use base functions when possible.
2. Use the tables of the most used pairs, triplets and n-word chunks (where n is determined by statistical analysis) and make the algorithm decide which three words have the highest probability and present them (a rough sketch of this lookup appears after this list).
3. For unseen combinations, take into account the previous word and the most used words overall, and decide based on the highest probabilities between those two.
4. This should be applicable to any language.
5. Thinking deeper, it could be best to separate suffixes from prefixes, let the algorithm choose from a simpler corpus, and then also decide which form of the word is the most likely correct one.
6. Neither method seems difficult, so both could be compared to see which one is better.
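As a rough illustration of points 2 and 3 (my own sketch, assuming the bigram table pairstable and the word table table1 built above are the inputs; this is not the final algorithm), a lookup with a simple fallback could look like this:
predict_next <- function(last_word, bigrams = pairstable, unigrams = table1, k = 3) {
  prefix <- paste0(last_word, " ")
  # keep only the bigrams that start with the given word
  candidates <- bigrams[startsWith(names(bigrams), prefix)]
  if (length(candidates) > 0) {
    top <- names(candidates)[order(-candidates)]
    head(substring(top, nchar(prefix) + 1), k)  # keep only the predicted second word
  } else {
    head(names(unigrams), k)  # unseen combination: fall back to the most used words overall
  }
}
For example, predict_next("of") would return the three words most often seen after "of" in the sample.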
Limitations

There is still some analysis needed and a lot of cleaning of the corpora. I need to get more information about the other languages in order to clean those words. I had little to no time this week for the course, but I will get back on track. Thanks for your time reading.