Provide exploratory analysis of text collected from random blogs, tweets, and news articles, then build a linguistic model for text prediction, similar to the text prediction used when texting on a smartphone.
Intended audience: non-data scientists.
The coursework data is based on this file.
The first step is to load all the libraries and R functions needed to run the program.
The libraries and functions are hidden here but are shown at the end of the document.
The data consists of 3 files:
1. Blogs
2. Tweets
3. News
Number of foreign characters found in the blog file:
## word freq
## 1 â 743337
## 2 Ã 10319
## 3 _ 7838
## 4 Â 7046
## 5 ã 6248
## 6 Ð 1908
## 7 á 1545
## 8 Î 1120
## 9 ï 1028
## 10 å 999
## 11 Ä 967
## 12 æ 863
## 13 Ñ 806
## 14 Ï 579
## 15 ç 516
## 16 ä 477
## 17 Å 421
## 18 è 418
## 19 é 291
## 20 ì 184
Number of foreign characters found in the tweet file:
## word freq
## 1 â 92298
## 2 ð 26620
## 3 _ 25232
## 4 î 6033
## 5 Â 3442
## 6 Ã 2785
## 7 ã 869
## 8 ï 796
## 9 å 187
## 10 Î 172
## 11 Ì 164
## 12 æ 156
## 13 Ï 121
## 14 è 105
## 15 Ð 100
## 16 Ä 94
## 17 ç 93
## 18 ä 92
## 19 Å 60
## 20 é 59
Number of foreign characters found in the news file:
## word freq
## 1 â 20393
## 2 Â 2324
## 3 Ã 1055
## 4 _ 408
## 5 ï 94
## 6 Å 2
## 7 Ä 1
## 8 Ï 1
ANALYSIS: The blog file contains more foreign characters than the others. Most probably, the blog text was collected from a mix of English and non-English sources.
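These counts come from the foreignchar function listed at the end of the document. The idea, in a minimal sketch (assuming the raw file has already been read into a character vector with readLines), is to strip out everything we recognize and tabulate what is left:
library(dplyr) # provides the %>% pipe
# strip English letters, digits and punctuation, then count what remains
count.foreign <- function(filelines) # filelines is assumed to come from readLines()
{
  txt <- paste(filelines, collapse = " ")
  txt <- gsub("[a-zA-Z]", "", txt) # delete all English letters
  txt <- gsub("\\d", "", txt)      # delete digits
  txt <- gsub("\\W", "", txt)      # delete punctuation and spaces
  result <- txt %>% strsplit(split = "") %>% table() %>% sort(decreasing = TRUE) %>% data.frame()
  colnames(result) <- c("word", "freq")
  return(result)
}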
Plot of all the collected files
ANALYSIS: The blog file has more words per line than the others because of the character limitations on tweets (the 140-character limit) and news articles (publishing stylesheets).
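For reference, the ratio behind this plot can be computed with the stringi package (one of the libraries listed at the end). A minimal sketch, assuming each file is held as a character vector of lines:
library(stringi)
# average number of words per line for one file
words.per.line <- function(filelines)
{
  totalwords <- sum(stri_count_words(filelines)) # words in every line
  return(totalwords / length(filelines))         # divided by the line count
}
# e.g. words.per.line(blogfile), where blogfile is an assumed variable
# holding the blog lines from readLines()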
Plot showing the 3 files with a “not so normal” distribution of word frequency
(The red line denotes the mean value for each file.)
ANALYSIS: The range of words used by blog and tweet writers is far wider than that of news writers. This is because non-traditional writers are less formal, less restricted in their choice of words, and tend not to follow any particular writing rules, as opposed to professional news writers, who follow formal and restrictive publishing rules and guidelines.
Standard deviation and variance for the 3 data files:
## [1] "Blog: standard deviation 617 , variance 381624"
## [1] "Tweet: standard deviation 685 , variance 469338"
## [1] "News: standard deviation 88 , variance 7834"
ANALYSIS: Since the mean word frequency for all three files is very low (2, 1, and 2 respectively), the variance tends to be large relative to it. The wider the range of words used, the higher the variance.
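These figures are just the standard deviation and variance of the per-word frequency tables built earlier with word.frequency (see the end of the document). For example, for the blog file (blog_wf is an assumed name for its word-frequency data frame, with columns word and freq):
print(paste("Blog: standard deviation", round(sd(blog_wf$freq)),
            ", variance", round(var(blog_wf$freq))))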
Ideal sample size based on the counted lines (99% confidence level with a +/- 1% margin of error):
## [1] "Blog file: 16339"
## [1] "Tweet file: 16524"
## [1] "News file: 13692"
Since the files contain far more lines than needed, we can set our sample size higher than the recommended sample size. Using the code below, I set the sample size to 20% of the counted file lines.
resampledata <- function(bigdf)
{
  smalldf <- bigdf[sample(length(bigdf), floor(length(bigdf) * .2))] # 20% of counted file lines
  return(smalldf)
}
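Applied to each file (the variable names below are assumed), this gives the sampled line counts printed next:
blogsample  <- resampledata(blogfile)   # blogfile, tweetfile and newsfile are assumed
tweetsample <- resampledata(tweetfile)  # to hold the raw lines of each file
newssample  <- resampledata(newsfile)
print(paste("Blog:", length(blogsample), ", Tweet:", length(tweetsample),
            ", News:", length(newssample)))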
## [1] "Blog: 179857 , Tweet: 472029 , News: 15451"
Note: The difference between the news file's ideal sample size (13,692) and its 20% sample (15,451) is less than 2,000, so we are on track using 20% as the sample size across the board.
Since we already have the most frequently used words from the 3 files, let's get the most frequent 2-word pairings and 3-word phrases from them, then combine and sort them by how frequently they are used.
Words shown have a minimum frequency of 150. This is the final model for text prediction.
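One way such a combined table (referred to as topwords in the prediction code below) could be built is with RWeka's NGramTokenizer. This is only a sketch of the idea, assuming the cleaned, resampled text sits in a single character string called cleanedsample:
library(RWeka)
library(dplyr)
# frequency table of n-word phrases in the cleaned text
ngram.frequency <- function(data, n)
{
  tokens <- NGramTokenizer(data, Weka_control(min = n, max = n))
  dfresult <- tokens %>% table() %>% sort(decreasing = TRUE) %>% data.frame()
  colnames(dfresult) <- c("word", "freq")
  return(dfresult)
}
# combine single words, 2-word pairings and 3-word phrases,
# keeping only entries seen at least 150 times
#topwords <- rbind(word.frequency(cleanedsample),
#                  ngram.frequency(cleanedsample, 2),
#                  ngram.frequency(cleanedsample, 3)) %>%
#            filter(freq >= 150) %>% arrange(desc(freq))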
The proposed prediction code is below:
## Proposed Algorithm
processword1 <- function(word1)
{
  # look up the input word (and phrases containing it) in the top-words table
  result <- topwords[grep(pattern = word1, topwords$word, ignore.case = TRUE), 1]
  #localresult <- local[grep(pattern = word1, local$word, ignore.case = TRUE), 1] # also search the local word table
  #then add localresult to result
  if (length(result) > 0)
  {
    for (i in 1:length(result))
    {
      print(as.character(result[i]))
    }
  }
  else
  {
    paste(word1, "not found, adding", word1, "to local word table")
  }
}
Let’s test this code with the user input of goo.
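The call mirrors the xyz test shown further down:
user_input1 <- 'goo'
processword1(user_input1)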
## [1] "good"
## [1] "good thing "
## [1] "pretty good "
## [1] "good time "
## [1] "it good "
## [1] "good news "
## [1] "good idea "
## [1] "good luck "
## [1] "feel good "
## [1] "good things "
## [1] "good bad "
## [1] "good job "
## [1] "good friend "
## [1] "goog spellcheck word "
## [1] "yellow class goog "
## [1] "class goog spellcheck "
## [1] "makes feel good "
Let's test this code with the user input of xyz.
If the word is not found in the word table, it will be stored in the local word table.
user_input1 <- 'xyz'
processword1(user_input1)
## [1] "xyz not found, adding xyz to local word table"
## start load library
#library(ngram)
#library(tm)
#library(stringi)
#library(RWeka)
#library(dplyr)
#library(caret)
#library(stringr)
#library(ggplot2)
### end load library
# start load needed data for functions
#set.seed(333)
#enlist <- stopwords("en")
#SMARTList <- stopwords("SMART")
#cursewords <- readLines("swearWords.txt") ## from http://www.bannedwordlist.com/
# end load needed data for functions
# start load functions
#foreignchar <- function(textfile)
#{
# rawfile1 <- concatenate(textfile)
# rm(textfile)
# gc()
# rawfile1 <- gsub(pattern = "[a-zA-Z]", replace= "", rawfile1) #delete all english letters
# rawfile1 <- gsub(pattern = "\\d", replace= "", rawfile1)# delete digits
# rawfile1 <- gsub(pattern = "\\W", replace= "", rawfile1)# delete all punctuations
# rawfile1 <- stripWhitespace(rawfile1) # remove spaces
# result <- rawfile1 %>% strsplit(split = "") %>% table()%>% sort(decreasing = TRUE)%>% data.frame()
# colnames(result) <- c("word", "freq")
# return(result)
#}
#cleanfile <- function(textfile)
#{
# f1 <- concatenate(textfile)
# rm(textfile)
# f2 <- preprocess(f1,case= "lower", remove.numbers = TRUE) #lower case and remove numbers in one shot
# f3 <- removeWords(f2,stopwords(kind = "en")) # remove words based from vector
# f4 <- removeWords(f3,cursewords) # remove curse words
# rm(f2); rm(f3); gc() #remove useless variables to save memory
# f5 <- removeWords(f4,SMARTList) # remove words based from vector
# f6 <- gsub(pattern = "\\W", replace= " ", f5) # replace all punctuation with spaces
# f7 <- gsub("[^[:alpha:]///' ]", "", f6) #remove non-alphabet characters
# rm(f4); rm(f5); rm(f6); gc()
# f8 <- gsub("â|ã|ð|ÿ|î|ñ|á|ï|à", "", f7) # remove one-off foreign characters
# f9 <- gsub(pattern= "\\b[A-z]\\b{1}", replace= " ",f8) # remove orphaned single letters
# f10 <- gsub(pattern= " ve | ll ", replace= " ",f9) # remove orphaned fragments (ve, ll)
# f11 <- stripWhitespace(f10) # remove spaces
# rm(f7); rm(f8); rm(f9); gc()
# cleanedfile <- f11
# rm(f10); gc()
# return(cleanedfile)
#}
#word.frequency <- function(data)
#{
# dfresult <- data %>% strsplit(split = " ") %>% table()%>% sort(decreasing = TRUE)%>% data.frame()
# colnames(dfresult) <- c("word", "freq")
# return(dfresult)
#}
#resampledata <- function(bigdf)# use 20% resampling size of the population (filelines counted)
#{
# smalldf <- bigdf[sample(length(bigdf), floor(length(bigdf)*.2))]
# return(smalldf)
#}
#mergewords <- function(df1,df2)
#{
# overlap <- intersect(df1$filename,df2$filename)
# df1[df1$filename %in% overlap,2] <- df1[df1$filename %in% overlap,2]+ df2[df2$filename %in% overlap,2]
# mergeDF <- rbind(df1,df2[-(which(df2$filename %in% overlap)),])
# return(mergeDF)
#}
#get.sample.size <- function(N) # 99% confidence level with +/- 1% margin of error
#{
# Z <- 2.58 # 99% Z table
# margin <- .01
# p <- .5 #planned proportion estimate of 50%
# s1 <- (Z^2 * p *(1-p))/ (margin^2)
# s2 <- 1 + ((Z^2 * p *(1-p))/ (margin^2* N) )
# return(round(s1/s2, digits = 0))
#}
#
#
### end load functions