Provide exploratory analysis of text collected from random blogs, tweets, and news articles, then build a linguistic model for text prediction, similar to the text prediction used when texting on a smartphone.
Intended audience: non-data scientists.
The coursework data is based on this file.
The first step is to load all the libraries and R functions needed to run the program.
The libraries and functions are hidden here but are shown at the end of the document.
The data consists of 3 files:
1. Blogs
2. Tweets
3. News
Number of foreign characters found in the blog file:
## word freq
## 1 â 743337
## 2 Ã 10319
## 3 _ 7838
## 4 Â 7046
## 5 ã 6248
## 6 Ð 1908
## 7 á 1545
## 8 Î 1120
## 9 ï 1028
## 10 å 999
## 11 Ä 967
## 12 æ 863
## 13 Ñ 806
## 14 Ï 579
## 15 ç 516
## 16 ä 477
## 17 Å 421
## 18 è 418
## 19 é 291
## 20 ì 184
Number of foreign characters found in the tweet file:
## word freq
## 1 â 92298
## 2 ð 26620
## 3 _ 25232
## 4 î 6033
## 5 Â 3442
## 6 Ã 2785
## 7 ã 869
## 8 ï 796
## 9 å 187
## 10 Î 172
## 11 Ì 164
## 12 æ 156
## 13 Ï 121
## 14 è 105
## 15 Ð 100
## 16 Ä 94
## 17 ç 93
## 18 ä 92
## 19 Å 60
## 20 é 59
Number of foreign characters found in the news file:
## word freq
## 1 â 20393
## 2 Â 2324
## 3 Ã 1055
## 4 _ 408
## 5 ï 94
## 6 Å 2
## 7 Ä 1
## 8 Ï 1
ANALYSIS: The blog file contains more foreign characters than the others. Most probably, the blog text was collected from a mix of English and non-English sources.
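These counts come from the foreignchar function listed at the end of the document. The idea, in a minimal sketch (assuming the raw file has already been read into a character vector with readLines), is to strip out everything we recognize and tabulate what is left:
library(dplyr) # provides the %>% pipe
# strip English letters, digits and punctuation, then count what remains
count.foreign <- function(filelines) # filelines is assumed to come from readLines()
{
  txt <- paste(filelines, collapse = " ")
  txt <- gsub("[a-zA-Z]", "", txt) # delete all English letters
  txt <- gsub("\\d", "", txt)      # delete digits
  txt <- gsub("\\W", "", txt)      # delete punctuation and spaces
  result <- txt %>% strsplit(split = "") %>% table() %>% sort(decreasing = TRUE) %>% data.frame()
  colnames(result) <- c("word", "freq")
  return(result)
}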
Plot of all the collected files
ANALYSIS: The blog file has more words per line than the others because of the character limitations on tweets (the 140-character limit) and news articles (publishing stylesheets).
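For reference, the ratio behind this plot can be computed with the stringi package (one of the libraries listed at the end). A minimal sketch, assuming each file is held as a character vector of lines:
library(stringi)
# average number of words per line for one file
words.per.line <- function(filelines)
{
  totalwords <- sum(stri_count_words(filelines)) # words in every line
  return(totalwords / length(filelines))         # divided by the line count
}
# e.g. words.per.line(blogfile), where blogfile is an assumed variable
# holding the blog lines from readLines()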
Plot showing the 3 files with a “not so normal” distribution of word frequency
(The red line denotes the mean value for each file.)
ANALYSIS: The range of words used by blog and tweet writers is far wider than that of news writers. This is because non-traditional writers are less formal, less restricted in their choice of words, and tend not to follow any particular writing rules, as opposed to professional news writers, who follow formal and restrictive publishing rules and guidelines.
Standard deviation and variance for the 3 data files:
## [1] "Blog: standard deviation 617 , variance 381624"
## [1] "Tweet: standard deviation 685 , variance 469338"
## [1] "News: standard deviation 88 , variance 7834"
ANALYSIS: Since the mean word frequency for all three files is very low (2, 1, and 2 respectively), the variance tends to be large relative to it. The wider the range of words used, the higher the variance.
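These figures are just the standard deviation and variance of the per-word frequency tables built earlier with word.frequency (see the end of the document). For example, for the blog file (blog_wf is an assumed name for its word-frequency data frame, with columns word and freq):
print(paste("Blog: standard deviation", round(sd(blog_wf$freq)),
            ", variance", round(var(blog_wf$freq))))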
Ideal sample size based on the counted lines (99% confidence level with a +/- 1% margin of error):
## [1] "Blog file: 16339"
## [1] "Tweet file: 16524"
## [1] "News file: 13692"
Since the files contain far more lines than needed, we can set our sample size higher than the recommended sample size. Using the code below, I set the sample size to 20% of the counted file lines.
resampledata <- function(bigdf)
{
  smalldf <- bigdf[sample(length(bigdf), floor(length(bigdf) * .2))] # 20% of counted file lines
  return(smalldf)
}
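Applied to each file (the variable names below are assumed), this gives the sampled line counts printed next:
blogsample  <- resampledata(blogfile)   # blogfile, tweetfile and newsfile are assumed
tweetsample <- resampledata(tweetfile)  # to hold the raw lines of each file
newssample  <- resampledata(newsfile)
print(paste("Blog:", length(blogsample), ", Tweet:", length(tweetsample),
            ", News:", length(newssample)))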
## [1] "Blog: 179857 , Tweet: 472029 , News: 15451"
Note: The difference between the news file's ideal sample size (13,692) and its 20% sample (15,451) is less than 2,000, so we are on track using 20% as the sample size across the board.
Since we already have the most frequently used words from the 3 files, let's get the most frequent 2-word pairings and 3-word phrases from them, then combine and sort them by how frequently they are used.
Words shown have a minimum frequency of 150. This is the final model for text prediction.
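One way such a combined table (referred to as topwords in the prediction code below) could be built is with RWeka's NGramTokenizer. This is only a sketch of the idea, assuming the cleaned, resampled text sits in a single character string called cleanedsample:
library(RWeka)
library(dplyr)
# frequency table of n-word phrases in the cleaned text
ngram.frequency <- function(data, n)
{
  tokens <- NGramTokenizer(data, Weka_control(min = n, max = n))
  dfresult <- tokens %>% table() %>% sort(decreasing = TRUE) %>% data.frame()
  colnames(dfresult) <- c("word", "freq")
  return(dfresult)
}
# combine single words, 2-word pairings and 3-word phrases,
# keeping only entries seen at least 150 times
#topwords <- rbind(word.frequency(cleanedsample),
#                  ngram.frequency(cleanedsample, 2),
#                  ngram.frequency(cleanedsample, 3)) %>%
#            filter(freq >= 150) %>% arrange(desc(freq))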
The proposed prediction code is below:
## Proposed Algorithm
processword1 <- function(word1)
{
  # look up the input word (and phrases containing it) in the top-words table
  result <- topwords[grep(pattern = word1, topwords$word, ignore.case = TRUE), 1]
  #localresult <- local[grep(pattern = word1, local$word, ignore.case = TRUE), 1] # also search the local word table
  #then add localresult to result
  if (length(result) > 0)
  {
    for (i in 1:length(result))
    {
      print(as.character(result[i]))
    }
  }
  else
  {
    paste(word1, "not found, adding", word1, "to local word table")
  }
}
Let’s test this code with the user input of goo.
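The call mirrors the xyz test shown further down:
user_input1 <- 'goo'
processword1(user_input1)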
## [1] "good"
## [1] "good thing "
## [1] "pretty good "
## [1] "good time "
## [1] "it good "
## [1] "good news "
## [1] "good idea "
## [1] "good luck "
## [1] "feel good "
## [1] "good things "
## [1] "good bad "
## [1] "good job "
## [1] "good friend "
## [1] "goog spellcheck word "
## [1] "yellow class goog "
## [1] "class goog spellcheck "
## [1] "makes feel good "
Let's test this code with the user input of xyz.
If the word is not found in the word table, it will be stored in the local word table.
user_input1 <- 'xyz'
processword1(user_input1)
## [1] "xyz not found, adding xyz to local word table"
## start load library
#library(ngram)
#library(tm)
#library(stringi)
#library(RWeka)
#library(dplyr)
#library(caret)
#library(stringr)
#library(ggplot2)
### end load library
# start load needed data for functions
#set.seed(333)
#enlist <- stopwords("en")
#SMARTList <- stopwords("SMART")
#cursewords <- readLines("swearWords.txt") ## from http://www.bannedwordlist.com/
# end load needed data for functions
# start load functions
#foreignchar <- function(textfile)
#{
# rawfile1 <- concatenate(textfile)
# rm(textfile)
# gc()
# rawfile1 <- gsub(pattern = "[a-zA-Z]", replace= "", rawfile1) #delete all english letters
# rawfile1 <- gsub(pattern = "\\d", replace= "", rawfile1)# delete digits
# rawfile1 <- gsub(pattern = "\\W", replace= "", rawfile1)# delete all punctuations
# rawfile1 <- stripWhitespace(rawfile1) # remove spaces
# result <- rawfile1 %>% strsplit(split = "") %>% table()%>% sort(decreasing = TRUE)%>% data.frame()
# colnames(result) <- c("word", "freq")
# return(result)
#}
#cleanfile <- function(textfile)
#{
# f1 <- concatenate(textfile)
# rm(textfile)
# f2 <- preprocess(f1,case= "lower", remove.numbers = TRUE) #lower case and remove numbers in one shot
# f3 <- removeWords(f2,stopwords(kind = "en")) # remove words based from vector
# f4 <- removeWords(f3,cursewords) # remove curse words
# rm(f2); rm(f3); gc() #remove useless variables to save memory
# f5 <- removeWords(f4,SMARTList) # remove words based from vector
# f6 <- gsub(pattern = "\\W", replace= " ", f5) # replace all punctuation with spaces
# f7 <- gsub("[^[:alpha:]///' ]", "", f6) #remove non-alphabet characters
# rm(f4); rm(f5); rm(f6); gc()
# f8 <- gsub("â|ã|ð|ÿ|î|ñ|á|ï|à", "", f7) # remove one-off foreign characters
# f9 <- gsub(pattern= "\\b[A-z]\\b{1}", replace= " ",f8) # remove orphaned single letters
# f10 <- gsub(pattern= " ve | ll ", replace= " ",f9) # remove orphaned fragments (ve, ll)
# f11 <- stripWhitespace(f10) # remove spaces
# rm(f7); rm(f8); rm(f9); gc()
# cleanedfile <- f11
# rm(f10); gc()
# return(cleanedfile)
#}
#word.frequency <- function(data)
#{
# dfresult <- data %>% strsplit(split = " ") %>% table()%>% sort(decreasing = TRUE)%>% data.frame()
# colnames(dfresult) <- c("word", "freq")
# return(dfresult)
#}
#resampledata <- function(bigdf)# use 20% resampling size of the population (filelines counted)
#{
# smalldf <- bigdf[sample(length(bigdf), floor(length(bigdf)*.2))]
# return(smalldf)
#}
#mergewords <- function(df1,df2)
#{
# overlap <- intersect(df1$filename,df2$filename)
# df1[df1$filename %in% overlap,2] <- df1[df1$filename %in% overlap,2]+ df2[df2$filename %in% overlap,2]
# mergeDF <- rbind(df1,df2[-(which(df2$filename %in% overlap)),])
# return(mergeDF)
#}
#get.sample.size <- function(N) # 99% confidence level with +/- 1% margin of error
#{
# Z <- 2.58 # 99% Z table
# margin <- .01
# p <- .5 #planned proportion estimate of 50%
# s1 <- (Z^2 * p *(1-p))/ (margin^2)
# s2 <- 1 + ((Z^2 * p *(1-p))/ (margin^2* N) )
# return(round(s1/s2, digits = 0))
#}
#
#
### end load functions