Exploratory Data Analysis
Load the necessary libraries up front to set up an efficient R execution environment.
library(dplyr)
library(ngram)
library(tidyverse)
library(tokenizers)
library(RColorBrewer)
library(ggplot2)
library(stringr)
Download the training data from the Coursera website. There are data files for 4 different locales, and for each locale there are 3 data files from different sources (blogs, news, and Twitter).
#if (!file.exists("Coursera-SwiftKey.zip")) {
#  download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
#                destfile = "Coursera-SwiftKey.zip")
#}
# Check for the data file and unzip if necessary
#if (!file.exists("final/en_US/en_US.blogs.txt")) {
#  unzip("Coursera-SwiftKey.zip")
#}
blogs <- readLines("C:/Users/Venkata/Documents/Coursera_Capstone_Project/Coursera-SwiftKey/final/en_US/en_US.blogs.txt")
twitter <- readLines("C:/Users/Venkata/Documents/Coursera_Capstone_Project/Coursera-SwiftKey/final/en_US/en_US.twitter.txt")
news <- readLines("C:/Users/Venkata/Documents/Coursera_Capstone_Project/Coursera-SwiftKey/final/en_US/en_US.news.txt")
a <- max(nchar(blogs))
b <- max(nchar(twitter))
c <- max(nchar(news))
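To keep the basic file statistics in one place, the line counts and longest-line lengths can be collected into a small summary data frame. This is only a convenience step; the values come from the objects read above, so no numbers are shown here, and the name file_summary is just an illustrative choice.

file_summary <- data.frame(
  source         = c("blogs", "twitter", "news"),
  line_count     = c(length(blogs), length(twitter), length(news)),
  max_line_chars = c(a, b, c)
)
file_summary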
The raw data has to be cleaned to deal with punctuation, quotation marks, numbers, and other metacharacters. A function, Clean_String, is written to handle this.
Another function, Clean_Text_Block, is written to handle extra whitespace between words (more than one space) and blank lines between the lines of text.
Clean_String <- function(string){
  # Lowercase
  temp <- tolower(string)
  # Remove everything that is not a letter or whitespace
  # (you may want to keep more characters in your actual analyses)
  temp <- stringr::str_replace_all(temp, "[^a-zA-Z\\s]", " ")
  # Shrink runs of whitespace down to just one space
  temp <- stringr::str_replace_all(temp, "[\\s]+", " ")
  # Split into individual tokens
  temp <- stringr::str_split(temp, " ")[[1]]
  # Get rid of trailing "" if necessary
  indexes <- which(temp == "")
  if(length(indexes) > 0){
    temp <- temp[-indexes]
  }
  return(temp)
}

Clean_Text_Block <- function(text){
  if(length(text) <= 1){
    # Check to see if there is any text at all
    if(length(text) == 0){
      cat("There was no text in this document! \n")
      to_return <- list(num_tokens = 0, unique_tokens = 0, text = "")
    }else{
      # If there is one and only one line of text, tokenize it
      clean_text <- Clean_String(text)
      num_tok <- length(clean_text)
      num_uniq <- length(unique(clean_text))
      to_return <- list(num_tokens = num_tok, unique_tokens = num_uniq, text = clean_text)
    }
  }else{
    # Get rid of blank lines
    indexes <- which(text == "")
    if(length(indexes) > 0){
      text <- text[-indexes]
    }
    # Loop through the lines in the text and use append() to
    # add them to a single token vector
    clean_text <- Clean_String(text[1])
    for(i in 2:length(text)){
      clean_text <- append(clean_text, Clean_String(text[i]))
    }
    # Calculate the number of tokens and unique tokens and return them in a
    # named list object
    num_tok <- length(clean_text)
    num_uniq <- length(unique(clean_text))
    to_return <- list(num_tokens = num_tok, unique_tokens = num_uniq, text = clean_text)
  }
  return(to_return)
}
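As a quick sanity check, Clean_String can be applied to a made-up sentence (the input below is only an illustration): punctuation and numbers are dropped, the text is lowercased, and the result is a vector of tokens.

Clean_String("The quick, brown fox -- jumped over 2 lazy dogs!")
# expected result: "the" "quick" "brown" "fox" "jumped" "over" "lazy" "dogs"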
The data files are plain .txt files and are large, so the datasets contain a huge amount of text. When I tried to work on all of the available data, performance issues arose. A possible solution is parallel computing (I have yet to do it). For now, I am reading the first 10,000 lines from the US blogs data for our analysis.
data_sample <- readLines("C:/Users/Venkata/Documents/Coursera_Capstone_Project/Coursera-SwiftKey/final/en_US/en_US.blogs.txt", n = 10000)
The 10,000 sampled lines are first cleaned and then analyzed.
clean_data <- Clean_Text_Block(data_sample)
str(clean_data)
## List of 3
##  $ num_tokens   : int 416864
##  $ unique_tokens: int 28782
##  $ text         : chr [1:416864] "in" "the" "years" "thereafter" ...
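The parallel-computing idea mentioned above is not implemented yet. A minimal sketch with the base parallel package might look like the following; it is not run here, and the core count and the per-line use of Clean_String are assumptions rather than the final approach.

# not run: sketch of cleaning the lines in parallel with the base 'parallel' package
library(parallel)
cl <- makeCluster(detectCores() - 1)        # leave one core free (assumption)
clusterExport(cl, "Clean_String")           # make the cleaning function available on each worker
clusterEvalQ(cl, library(stringr))          # make sure stringr is available on each worker
tokens_by_line <- parLapply(cl, data_sample, Clean_String)
stopCluster(cl)
parallel_tokens <- unlist(tokens_by_line)   # one token vector, like the one Clean_Text_Block returns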
tab <- table(clean_data[[3]])
tab <- data.frame(words = names(tab), count = as.numeric(tab))
desc_word_count <- arrange(tab, desc(count))
# find the percentage of occurrences in the dataset for each word
desc_word_count$percentage <- (desc_word_count$count/sum(desc_word_count$count))
head(desc_word_count)
##   words count percentage
## 1   the 20422 0.04898960
## 2   and 12012 0.02881515
## 3    to 11640 0.02792278
## 4     i 10073 0.02416376
## 5     a  9946 0.02385910
## 6    of  9538 0.02288036
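The same table can also be built with the dplyr verbs that are already loaded. This is shown only as a cross-check of the base-R version above, assuming dplyr >= 0.8 for the name argument of count().

word_freq <- data.frame(words = clean_data[[3]]) %>%
  count(words, name = "count", sort = TRUE) %>%
  mutate(percentage = count / sum(count))
head(word_freq)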
# understanding the data a little bit more
max(desc_word_count$percentage)
## [1] 0.0489896
min(desc_word_count$percentage)
## [1] 2.398864e-06
head(desc_word_count, 10)
##    words count percentage
## 1    the 20422 0.04898960
## 2    and 12012 0.02881515
## 3     to 11640 0.02792278
## 4      i 10073 0.02416376
## 5      a  9946 0.02385910
## 6     of  9538 0.02288036
## 7     in  6487 0.01556143
## 8     it  5372 0.01288670
## 9   that  5331 0.01278834
## 10    is  4766 0.01143299
Looking at the words by frequency, stop words such as 'the', 'and', and 'that' occur the most. Keeping these words makes the analysis much more difficult. Hence, only words whose frequency (as a fraction of all tokens) falls between 0.001 and 0.01 are considered. This range was chosen by trial and error; more analysis has to be done to determine it properly.
desc_word_count <- desc_word_count[desc_word_count$percentage < 0.01, ]
head(desc_word_count)
##    words count  percentage
## 11   for  3870 0.009283603
## 12     s  3745 0.008983745
## 13   you  3516 0.008434405
## 14  with  3196 0.007666769
## 15   was  3081 0.007390900
## 16    my  3072 0.007369310
# keep only words whose percentage is greater than 0.001
desc_word_count <- desc_word_count[desc_word_count$percentage > 0.001, ]
hist(desc_word_count$count)
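Since ggplot2 and RColorBrewer are already loaded, a bar chart of the most frequent remaining words could complement the histogram. The top-20 cutoff below is an arbitrary choice for illustration, and geom_col() assumes ggplot2 >= 2.2.

top_words <- head(desc_word_count, 20)
ggplot(top_words, aes(x = reorder(words, count), y = count)) +
  geom_col(fill = brewer.pal(3, "Set2")[1]) +
  coord_flip() +
  labs(x = "word", y = "count", title = "Most frequent words after filtering")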