Due to limited time (I have 30 minutes before 12) I will be as brief as possible. I also want to say that I have decided to use only base functions and to write the functions I need myself, because the available packages do not match what I want and I prefer to solve the problems in my own way.
First, load the (whole) data:
text1 <- readLines("C:/Users/Alberto2/Desktop/Coursera Capstone Project/final/en_US/en_US.blogs.txt")
text2 <- readLines("C:/Users/Alberto2/Desktop/Coursera Capstone Project/final/en_US/en_US.twitter.txt")
text3 <- readLines("C:/Users/Alberto2/Desktop/Coursera Capstone Project/final/en_US/en_US.news.txt")
cbind(summary(text1), summary(text2), summary(text3))
## [,1] [,2] [,3]
## Length "899288" "2360148" "77259"
## Class "character" "character" "character"
## Mode "character" "character" "character"
The corpora contain a very large number of lines, mostly short phrases and small texts.
The data needs some cleaning: I will remove all the upper case letters and the punctuation, but leave the apostrophes.
text1 <- tolower(text1)
text2 <- tolower(text2)
text3 <- tolower(text3)
text1 <- gsub("[^[:alnum:][:space:]']", "", text1)
text2 <- gsub("[^[:alnum:][:space:]']", "", text2)
text3 <- gsub("[^[:alnum:][:space:]']", "", text3)
Now I will split the phrases into words: search for the spaces between words and tell R to split on every space.
text1 <- strsplit(text1, " ")
text2 <- strsplit(text2, " ")
text3 <- strsplit(text3, " ")
Basically, what is left is all the phrases, but with the words split into character vectors inside a list.
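As a quick check of my own (not part of the original script), the structure of the first couple of elements can be inspected with:
str(head(text1, 2))  # each list element is a character vector with the words of one line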
To get the word counts (and the frequency of each word) the elements need to be unlisted and then merged into one vector.
list1 <- unlist(text1)
list2 <- unlist(text2)
list3 <- unlist(text3)
list <- c(list1, list2, list3)
Now, with the table function, you can see the frequency of every unique word, and then order the elements in decreasing order.
table1 <- table(list)
table1 <- table1[order(-table1)]
table1[1:40]
## list
## the to and a i of in you is
## 2932049 1920318 1586224 1567821 1485845 1292563 1020163 842141 809956
## for that it on my with this was
## 773282 717032 704350 569876 562850 536897 478533 428072 412395
## be have at are me but so we as
## 406388 397183 372980 361748 338194 337636 330448 323310 307710
## not your all just from out up like what
## 303721 272258 267928 253112 243139 227487 223730 222773 222206
## or if he they
## 217669 216972 216258 215772
What you can see there is the array of the 40 most used words in the corpora. A plot of the distribution follows in the next chunk.
plot(table1[1:40])
dev.off()
## null device
## 1
As you can see, after approximately the first 20 words the curve falls dramatically toward the x axis.
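To put a rough number on that claim, here is a small check of my own (not part of the original analysis) that computes what share of all word occurrences is covered by the top N words:
coverage <- cumsum(as.numeric(table1)) / sum(as.numeric(table1))
coverage[c(20, 40, 100)]  # share of the corpus covered by the 20, 40 and 100 most used words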
Now I mix all the clean, separated texts and create a null object called test5.
textmix <- c(text1, text2, text3)
test5 <- NULL
Now for the second part: finding which combinations of 2, 3, ..., n words are the most frequent. I designed a function that returns the words pasted two by two, so afterwards it is easy to just table and plot the frequencies. It is also important to say that this function can be adapted to any n-gram length needed, simply by changing the window from 2 to n (the desired chunk length); a generalized sketch follows the function definition below.
pairs_r <- function(test) {
  for (i in 1:length(test)) {
    if (length(test[[i]]) < 2) next  # skip phrases with fewer than two words
    # paste each word together with the word that follows it
    n <- paste(test[[i]][seq(1, length(test[[i]]) - 1)],
               test[[i]][seq(2, length(test[[i]]))])
    test5 <<- c(test5, n)  # accumulate the pairs in the global test5 object
  }
}
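As mentioned above, the same pasting trick generalizes to chunks of n words. Here is a minimal sketch of that generalization (a hypothetical ngrams_r of my own, which returns its result instead of writing to a global object; it is not part of the original code):
ngrams_r <- function(test, n = 2) {
  out <- NULL
  for (i in 1:length(test)) {
    words <- test[[i]]
    if (length(words) < n) next  # skip phrases shorter than n words
    # paste every window of n consecutive words into one string
    grams <- sapply(seq(1, length(words) - n + 1), function(start) {
      paste(words[start:(start + n - 1)], collapse = " ")
    })
    out <- c(out, grams)
  }
  out
}
For example, table(ngrams_r(textmix_sample, 3)) would give the frequencies of the triplets mentioned in the plans further down.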
The pairs_r function basically grabs the character vector of each element of the list (each phrase) and pastes its words together in chunks of two, using R sequencing and the length of each phrase. All the results are stored in the test5 object created before. Processing all the information with this function would take about 21 hours, so for reasons of time I will take a random sample of 1,000 phrases and run the function on that.
model <- which(lengths(textmix) == 0)  # find vectors of length 0
textmix[model] <- NULL                 # remove vectors of length 0
m <- sample(1:length(textmix), 1000, replace = TRUE)
textmix_sample <- textmix[m]
pairs_r(textmix_sample)
Now let's see the 50 most used pairs of words.
pairstable <- table(test5)
pairstable <- pairstable[order(-pairstable)]
pairstable[1:50]
## test5
## of the in the to the on the to be for the
## 82 79 47 44 42 40
## at the if you and the it was to get from the
## 35 25 23 23 21 20
## i am in a it is with the have a for a
## 20 20 20 20 19 18
## i have is a with a and i i was of my
## 18 18 18 17 17 16
## rt thanks for want to you are one of the same
## 16 16 16 16 15 15
## was a going to i think is the to see trying to
## 15 14 14 14 14 14
## will be all the to you i just i love that the
## 14 13 13 12 12 12
## to a you can you know a good have been i know
## 12 12 12 11 11 11
## i want out of
## 11 11
A simple plot next.
plot(pairstable[1:50])
It can also be noted that after the first 20 combinations the curve tends to fall toward the x axis.
Plans for creating a prediction algorithm:

1. Try to use base functions when possible.
2. Use the tables of the most used pairs, triplets and n-word chunks (where n is determined by statistical analysis) and make the algorithm decide which three words have the highest probability and present them (a rough sketch of this lookup appears after this list).
3. For unseen combinations, take into account the previous word and the most used words overall, and decide based on the highest probabilities between those two.
4. This should be applicable to any language.
5. Thinking deeper, it could be best to separate suffixes from prefixes, let the algorithm choose from a simpler corpus, and then also decide which form of the word is the most likely correct one.
6. Neither method seems difficult, so both could be compared to see which one is better.
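As a rough illustration of points 2 and 3 (my own sketch, assuming the bigram table pairstable and the word table table1 built above are the inputs; this is not the final algorithm), a lookup with a simple fallback could look like this:
predict_next <- function(last_word, bigrams = pairstable, unigrams = table1, k = 3) {
  prefix <- paste0(last_word, " ")
  # keep only the bigrams that start with the given word
  candidates <- bigrams[startsWith(names(bigrams), prefix)]
  if (length(candidates) > 0) {
    top <- names(candidates)[order(-candidates)]
    head(substring(top, nchar(prefix) + 1), k)  # keep only the predicted second word
  } else {
    head(names(unigrams), k)  # unseen combination: fall back to the most used words overall
  }
}
For example, predict_next("of") would return the three words most often seen after "of" in the sample.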
Limitations

There is still some analysis needed and a lot of cleaning of the corpora. I need to get more information about the other languages in order to clean those words. I had little to no time this week for the course, but I will get back on track. Thanks for your time reading.