In this report we describe the exploratory analysis performed on the data. The motivation for this project is to clean the data and lay the groundwork for the prediction stage, which will be based on n-grams and a backoff method.
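As a purely illustrative sketch of what such a backoff lookup might look like (not the final model), the function below assumes n-gram frequency tables like the ones built later in this report: data frames with columns Words and Frequency, sorted by decreasing frequency.
# illustrative sketch only: a "back off from trigrams to bigrams to unigrams" lookup
# assumes a non-empty input phrase and frequency tables sorted by decreasing Frequency
predict_next_word <- function(phrase, trigrams, bigrams, unigrams) {
  tokens <- tail(unlist(strsplit(tolower(phrase), "\\s+")), 2)   # last two words
  if (length(tokens) == 2) {                                     # try trigrams first
    hits <- trigrams[grepl(paste0("^", paste(tokens, collapse = " "), " "), trigrams$Words), ]
    if (nrow(hits) > 0) return(sub(".* ", "", hits$Words[1]))    # last word of best trigram
  }
  hits <- bigrams[grepl(paste0("^", tail(tokens, 1), " "), bigrams$Words), ]  # back off to bigrams
  if (nrow(hits) > 0) return(sub(".* ", "", hits$Words[1]))
  as.character(unigrams$Words[1])                                # final fallback: most frequent word
}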
The first step in analyzing any new data set is figuring out: (1) what data we have and (2) what standard tools and models are used for that type of data. The data should first be downloaded from the link provided. This exercise uses the files named LOCALE.blogs.txt, where LOCALE is each of the four locales en_US, de_DE, ru_RU and fi_FI. The data comes from a corpus called HC Corpora. The files have been language filtered but may still contain some foreign text. In this capstone we will be applying data science in the area of natural language processing. The following definitions are taken from Wikipedia:
Natural language processing (NLP) is a field of computer science, artificial intelligence and computational linguistics concerned with the interactions between computers and human languages, and in particular with programming computers to fruitfully process large natural language corpora.
Natural language is any language that has evolved naturally in humans through use and repetition without conscious planning or premeditation.
Corpus linguistics is the study of language as expressed in corpora (samples) of “real world” text. The text-corpus method is a digestive approach for deriving, from a body of text, a set of abstract rules that govern a natural language and describe how that language relates to other languages; originally derived manually, corpora are now automatically derived from the source texts.
#data_url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
#f <- file.path(getwd(), "Swiftkey.zip")
#download.file(data_url, f)
#unzip("C:/users/user/Desktop/Coursera/JHU Data Science/Capstone/SwiftKey.zip", exdir ="./")
#Downloaddate <- date()
#Downloaddate
list.files("./", recursive = TRUE) # list all files in the folder and its subfolders
## [1] "Capstone1.html"
## [2] "Capstone1.Rmd"
## [3] "Capstone2.Rmd"
## [4] "final/de_DE/de_DE.blogs.txt"
## [5] "final/de_DE/de_DE.news.txt"
## [6] "final/de_DE/de_DE.twitter.txt"
## [7] "final/en_US/bigram.RDS"
## [8] "final/en_US/en_US.blogs.txt"
## [9] "final/en_US/en_US.news.txt"
## [10] "final/en_US/en_US.twitter.txt"
## [11] "final/en_US/fivegram.RDS"
## [12] "final/en_US/quadragram.RDS"
## [13] "final/en_US/trigram.RDS"
## [14] "final/en_US/unigram.RDS"
## [15] "final/fi_FI/fi_FI.blogs.txt"
## [16] "final/fi_FI/fi_FI.news.txt"
## [17] "final/fi_FI/fi_FI.twitter.txt"
## [18] "final/ru_RU/ru_RU.blogs.txt"
## [19] "final/ru_RU/ru_RU.news.txt"
## [20] "final/ru_RU/ru_RU.twitter.txt"
## [21] "Google_badwords.csv"
## [22] "rsconnect/documents/Capstone1.Rmd/rpubs.com/rpubs/Document.dcf"
## [23] "test ngram/bigram.RDS"
## [24] "test ngram/fivegram.RDS"
## [25] "test ngram/quadrigram.RDS"
## [26] "test ngram/trigram.RDS"
## [27] "test ngram/unigram.RDS"
library(ggplot2)
#install.packages('tm', dependencies=TRUE, repos = "http://cran.us.r-project.org")
#install.packages('RWeka', dependencies=TRUE, repos = "http://cran.us.r-project.org")
orig_dir<- getwd()
orig_dir
## [1] "C:/Users/user/Desktop/Coursera/JHU Data Science/10 - Capstone"
setwd("./final/en_US")
en_blogs <- readLines("./en_US.blogs.txt", 3000) # read only the first 3000 lines to keep the runtime manageable
head(en_blogs, 3)
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan â<U+0080><U+009C>godsâ<U+0080>."
## [2] "We love you Mr. Brown."
## [3] "Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him."
en_news <- readLines("./en_US.news.txt", 3000)
head(en_news, 3)
## [1] "He wasn't home alone, apparently."
## [2] "The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s."
## [3] "WSU's plans quickly became a hot topic on local online sites. Though most people applauded plans for the new biomedical center, many deplored the potential loss of the building."
en_twitter <- readLines("./en_US.twitter.txt", 3000)
head(en_twitter, 3)
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
## [3] "they've decided its more fun if I don't."
library(stringi) # this package helps with character analysis in the text
stri_stats_general(en_blogs) # analyze lines and characters
## Lines LinesNEmpty Chars CharsNWhite
## 3000 3000 686473 566487
stri_stats_general(en_news)
## Lines LinesNEmpty Chars CharsNWhite
## 3000 3000 617080 516096
stri_stats_general(en_twitter)
## Lines LinesNEmpty Chars CharsNWhite
## 3000 3000 206310 170789
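For easier comparison, the same basic statistics can be collected into a single table; a small sketch using the three samples already loaded:
# sketch: summarize the three sampled sources in one comparison table
sample_stats <- data.frame(
  source = c("blogs", "news", "twitter"),
  lines  = sapply(list(en_blogs, en_news, en_twitter), length),
  words  = sapply(list(en_blogs, en_news, en_twitter),
                  function(x) sum(stri_count_words(x))),
  chars  = sapply(list(en_blogs, en_news, en_twitter),
                  function(x) unname(stri_stats_general(x)["Chars"]))
)
sample_stats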
words_blogs <- stri_count_words(en_blogs)
summ_blogs <- summary(words_blogs)
qplot(words_blogs)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# qplot(words_blogs, binwidth = diff(range(words_blogs)) / 30) # set an explicit bin width based on the range of word counts
words_news <- stri_count_words(en_news)
summary(words_news)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 20.00 33.00 35.35 47.00 242.00
qplot(words_news)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
words_twitter <- stri_count_words(en_twitter)
summary(words_twitter)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 12.00 12.78 18.00 34.00
qplot(words_twitter)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Tasks to accomplish for cleaning the data:
Tokenization - identifying appropriate tokens such as words, punctuation, and numbers, and writing a function that takes a file as input and returns a tokenized version of it (a sketch of such a function is given after this list).
Profanity filtering - removing profanity and other words you do not want to predict.
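As an illustration of the tokenization task, a file-to-tokens helper could look roughly like the sketch below; the actual cleaning in this report is done with tm transformations rather than this function.
# sketch: read a text file and return a vector of lower-case word tokens
tokenize_file <- function(path, n_lines = -1L) {
  txt <- readLines(path, n = n_lines, encoding = "UTF-8", skipNul = TRUE)
  txt <- tolower(txt)
  txt <- gsub("[^a-z' ]", " ", txt)        # keep only letters, apostrophes and spaces
  tokens <- unlist(strsplit(txt, "\\s+"))
  tokens[tokens != ""]                     # drop empty tokens
}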
library(tm) #text mining library
## Loading required package: NLP
##
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
##
## annotate
comb_data <- c(en_blogs, en_news, en_twitter) # combined data
length(en_blogs)
## [1] 3000
length(comb_data)
## [1] 9000
comb_data <- iconv(comb_data,"latin1","ASCII",sub="'") # replace non-ASCII characters with an apostrophe
Corp <- VCorpus(VectorSource(list(comb_data))) # create a volatile corpus from a VectorSource; wrapping in list() keeps all lines in a single document
library(tibble)
library(qdap)
## Loading required package: qdapDictionaries
## Loading required package: qdapRegex
##
## Attaching package: 'qdapRegex'
## The following object is masked from 'package:ggplot2':
##
## %+%
## Loading required package: qdapTools
## Loading required package: RColorBrewer
##
## Attaching package: 'qdap'
## The following objects are masked from 'package:tm':
##
## as.DocumentTermMatrix, as.TermDocumentMatrix
## The following object is masked from 'package:NLP':
##
## ngrams
## The following object is masked from 'package:base':
##
## Filter
Corp <- tm_map(Corp, content_transformer(tolower) ) # lower case
Corp <- tm_map(Corp, removeNumbers) # remove numbers from the text
Corp <- tm_map(Corp, removePunctuation) # remove punctuation
#Corp <- tm_map(Corp,content_transformer(replace_contraction)) # replace contraction with full form
#Corp <- tm_map(Corp,content_transformer(replace_abbreviation)) # replace abbreviation with full form
#Corp <- tm_map(Corp,content_transformer(bracketX)) # Remove text which is within brackets
Corp <- tm_map(Corp, removeWords, stopwords("english") ) # remove English stop words such as "a", "and", "the"
Corp <- tm_map(Corp, stripWhitespace) # remove unnecessary white spaces
library(SnowballC)
Corp <- tm_map(Corp, stemDocument) # stem words by removing common endings such as -ing and -s
# removing bad words:
profanityList <- read.csv(file.path(orig_dir, "Google_badwords.csv"),
header = FALSE, stringsAsFactors = FALSE) # file.path: concatenate directories
#profanewordsvector <- VectorSource(profanityList)
profanityList <- stripWhitespace(tolower(unlist(profanityList))) # flatten to a lower-case character vector of words
Corp <- tm_map(Corp, removeWords, profanityList)
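Note that removeWords() joins the whole word list into a single regular expression, which can fail or become very slow when the profanity list is long. If that happens, a possible workaround (sketch; the batch size is an arbitrary choice) is to remove the words in smaller batches:
# sketch: apply removeWords in batches to keep the generated regular expression short
chunk_size <- 1000
for (i in seq(1, length(profanityList), by = chunk_size)) {
  batch <- profanityList[i:min(i + chunk_size - 1, length(profanityList))]
  Corp  <- tm_map(Corp, removeWords, batch)
}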
Exploratory analysis - perform a thorough exploratory analysis of the data, understanding the distribution of words and relationship between the words in the corpora.
Understand frequencies of words and word pairs - build figures and tables to understand variation in the frequencies of words and word pairs in the data.
Some words are more frequent than others - what are the distributions of word frequencies?
What are the frequencies of 2-grams and 3-grams in the dataset?
How many unique words do you need in a frequency-sorted dictionary to cover 50% of all word instances in the language? 90%? (A small sketch for computing this is given after these questions.)
How do you evaluate how many of the words come from foreign languages?
Can you think of a way to increase the coverage – identifying words that may not be in the corpora or using a smaller number of words in the dictionary to cover the same number of phrases?
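The dictionary-coverage question can be answered directly from a frequency-sorted unigram table; the sketch below assumes the monoGram table built in the next section (columns Words and Frequency, sorted by decreasing frequency).
# sketch: how many unique words are needed to cover a given share of all word instances
coverage_count <- function(freq, target = 0.5) {
  cum_share <- cumsum(sort(freq, decreasing = TRUE)) / sum(freq)
  which(cum_share >= target)[1]
}
# coverage_count(monoGram$Frequency, 0.5)   # unique words needed for 50% coverage
# coverage_count(monoGram$Frequency, 0.9)   # unique words needed for 90% coverage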
library(RWeka)
CorpDF <- data.frame(text=unlist(sapply(Corp, `[`, "content")),
stringsAsFactors=FALSE) # converting corpus to data frame
# function to extract Ngrams:
Ngram_finder<- function (df, numgrams){
ngram <- NGramTokenizer(df, Weka_control(min = numgrams, max = numgrams))
ngram <- data.frame(table(ngram))
colnames(ngram) <- c("Words","Frequency")
ngram <- ngram[order(ngram$Frequency, decreasing = TRUE),]
ngram
}
monoGram<- Ngram_finder(CorpDF, 1)
biGram<- Ngram_finder(CorpDF, 2)
triGram<- Ngram_finder(CorpDF, 3)
#tetraGram<- Ngram_finder(CorpDF, 4)
#pentaGram<- Ngram_finder(CorpDF, 5)
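The directory listing near the top of this report shows unigram.RDS, bigram.RDS and trigram.RDS files under final/en_US; whether they were produced exactly this way is an assumption, but a sketch of caching the tables with saveRDS() would be:
# sketch: cache the frequency tables so a later prediction step can reload them quickly
saveRDS(monoGram, file.path(orig_dir, "final/en_US/unigram.RDS"))
saveRDS(biGram,   file.path(orig_dir, "final/en_US/bigram.RDS"))
saveRDS(triGram,  file.path(orig_dir, "final/en_US/trigram.RDS"))
# reload later with, e.g., monoGram <- readRDS(file.path(orig_dir, "final/en_US/unigram.RDS"))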
#show up to the top 30 n-grams in the bar charts below (value could be changed)
library(wordcloud)
# note: if the scale chosen is too large, the biggest words may not fit in the plot and will be dropped
#set.seed(10)
wordcloud(monoGram$Words[1:150], monoGram$Frequency[1:150]) # max.words = 300 could be used instead of subsetting
# reorder() inside aes() orders the bars by frequency; otherwise they are ordered alphabetically by the x-axis labels
ggplot(monoGram[1:30, ], aes(x = reorder(Words, -Frequency), y = Frequency)) + geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 45))
ggplot(biGram[1:30, ], aes(x = reorder(Words, -Frequency), y = Frequency)) + geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 45))
ggplot(triGram[1:30, ], aes(x = reorder(Words, -Frequency), y = Frequency)) + geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 45))
Report any interesting findings that you amassed so far.
The most frequent words can be seen in the histograms of the mono-, bi- and tri-grams created above.
Many of the most frequent words are stop words, so removing them is an important step in this assignment.
The last steps of this assignment take around 10 minutes to run on an Intel Core i7 CPU at 2.5 GHz. It is worth investigating whether a GPU or parallel processing could speed this up.
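A GPU is unlikely to help directly with the tm and RWeka steps used here, but the n-gram extraction could in principle be spread over CPU cores. A rough sketch using the base parallel package (a PSOCK cluster, which also works on Windows) is shown below; this is not what was run for the results above.
# sketch: build the n-gram tables in parallel, one n-gram order per worker
library(parallel)
cl <- makeCluster(max(1, detectCores() - 1))
clusterEvalQ(cl, library(RWeka))                    # load RWeka on each worker
clusterExport(cl, c("CorpDF", "Ngram_finder"))      # ship the data and the function
ngram_list <- parLapply(cl, 1:3, function(n) Ngram_finder(CorpDF, n))
stopCluster(cl)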