Exploratory Data Analysis

Published By: Venkata Mallepudi


Objective:

1. Demonstrate competency in working with the training data.
2. Perform exploratory analysis on the training data and identify its major features.
3. Outline future plans for creating a predictive algorithm and a Shiny app.


Approach:

1. Setting up the Environment.
2. Downloading and loading the data-files.
3. Analysing the Raw Data.
4. Cleaning the Raw Data.
5. Performing some basic analysis on the cleansed data.
6. Exploratory Analysis
7. Issues Faced so far...
8. Future Plans

1. Setting up the Environment.

Pre-load the necessary libraries to prepare an efficient R execution environment.

library(dplyr)
library(ngram)
library(tidyverse)
library(tokenizers)
library(RColorBrewer)
library(ggplot2)
library(stringr)

2. Downloading and loading the data-files.

Download the training data from the Coursera website. There are data files for four different locales, and for each locale there are data files from three different sources (blogs, news and Twitter).

# Download the zipped training data if it is not already present
if (!file.exists("Coursera-SwiftKey.zip")) {
  download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip", 
                destfile = "Coursera-SwiftKey.zip")
}
# Unzip the archive if the English data files have not been extracted yet
if (!file.exists("final/en_US/en_US.blogs.txt")) {
  unzip("Coursera-SwiftKey.zip")
}

3. Analysing the Raw Data.

# Read the three English-language source files; skipNul = TRUE avoids
# warnings caused by embedded nul characters in the raw files
blogs <- readLines("C:/Users/Venkata/Documents/Coursera_Capstone_Project/Coursera-SwiftKey/final/en_US/en_US.blogs.txt", skipNul = TRUE)
twitter <- readLines("C:/Users/Venkata/Documents/Coursera_Capstone_Project/Coursera-SwiftKey/final/en_US/en_US.twitter.txt", skipNul = TRUE)
news <- readLines("C:/Users/Venkata/Documents/Coursera_Capstone_Project/Coursera-SwiftKey/final/en_US/en_US.news.txt", skipNul = TRUE)
# Length (in characters) of the longest line in each source
a <- max(nchar(blogs))
b <- max(nchar(twitter))
c <- max(nchar(news))
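
As a quick sanity check, these vectors can be summarised into a small table of line counts and longest-line lengths (a minimal sketch; raw_summary is an illustrative name and is not used in the later analysis):

# Illustrative summary of the raw English-language files
raw_summary <- data.frame(source       = c("blogs", "twitter", "news"),
                          lines        = c(length(blogs), length(twitter), length(news)),
                          longest_line = c(a, b, c))
raw_summary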

4. Cleaning the Raw Data.

The raw data has to be cleaned to take care of punctuation, quotation marks, numbers and other metacharacters. A function, Clean_String, is written to handle this.

Another function, Clean_Text_Block, is written to handle stray whitespace between words (more than one space) and blank lines between the lines of text.

Clean_String <- function(string){
  # Lowercase
  temp <- tolower(string)
  # Remove everything that is not a letter or whitespace (numbers and
  # punctuation are dropped; keep more in your actual analyses if needed)
  temp <- stringr::str_replace_all(temp,"[^a-zA-Z\\s]", " ")
  # Shrink down to just one white space
  temp <- stringr::str_replace_all(temp,"[\\s]+", " ")
  # Split it
  temp <- stringr::str_split(temp, " ")[[1]]
  # Get rid of any empty strings left over from the split
  indexes <- which(temp == "")
  if(length(indexes) > 0){
    temp <- temp[-indexes]
  }
  return(temp)
}
Clean_Text_Block <- function(text){
  if(length(text) <= 1){
    # Check to see if there is any text at all with another conditional
    if(length(text) == 0){
      cat("There was no text in this document! \n")
      to_return <- list(num_tokens = 0, unique_tokens = 0, text = "")
    }else{
      # If there is text, and only one line of it, then tokenize that line
      clean_text <- Clean_String(text)
      num_tok <- length(clean_text)
      num_uniq <- length(unique(clean_text))
      to_return <- list(num_tokens = num_tok, unique_tokens = num_uniq, text = clean_text)
    }
  }else{
    # Get rid of blank lines
    indexes <- which(text == "")
    if(length(indexes) > 0){
      text <- text[-indexes]
    }
    # Clean the first line, then loop through the remaining lines and
    # append their tokens to the growing vector with append()
    clean_text <- Clean_String(text[1])
    for(i in 2:length(text)){
      clean_text <- append(clean_text,Clean_String(text[i]))
    }
    # Calculate the number of tokens and unique tokens and return them in a 
    # named list object.
    num_tok <- length(clean_text)
    num_uniq <- length(unique(clean_text))
    to_return <- list(num_tokens = num_tok, unique_tokens = num_uniq, text = clean_text)
  }
  return(to_return)
}
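
As a quick illustration of what these functions return (a small made-up example, not drawn from the project data), cleaning a three-line block that contains one blank line should yield six tokens, four of them unique:

# Illustrative call on a tiny, made-up text block
example_block <- Clean_Text_Block(c("Hello, World!", "", "The world says hello."))
# expected: example_block$num_tokens is 6 and example_block$unique_tokens is 4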

5. Performing some basic analysis on the cleansed data.

The data files are plain .txt files and they are large, which means the datasets contain a huge amount of text. When I tried to work on all of the available data, I ran into performance issues. A possible solution is parallel computing, which I have yet to implement. For now, I am reading the first 10,000 lines of the US blogs data for this analysis.

data_sample <- readLines("C:/Users/Venkata/Documents/Coursera_Capstone_Project/Coursera-SwiftKey/final/en_US/en_US.blogs.txt",
                         n = 10000)

The 10,000 sampled lines are first cleaned with Clean_Text_Block and then analysed.

clean_data <- Clean_Text_Block(data_sample)
str(clean_data)
## List of 3
##  $ num_tokens   : int 416864
##  $ unique_tokens: int 28782
##  $ text         : chr [1:416864] "in" "the" "years" "thereafter" ...
To understand the data better, the frequency of every word in the 10,000 sampled lines is calculated. Along with the frequency, the coverage of each word in the sampled text is calculated as a proportion; this helps determine which range of words should be kept for the analysis.

tab <- table(clean_data[[3]])
tab <- data.frame(words = names(tab), count = as.numeric(tab))
desc_word_count <- arrange(tab, desc(count))
# Find the proportion of occurrences in the dataset for each word
desc_word_count$percentage <- (desc_word_count$count/sum(desc_word_count$count))
head(desc_word_count)
##   words count percentage
## 1   the 20422 0.04898960
## 2   and 12012 0.02881515
## 3    to 11640 0.02792278
## 4     i 10073 0.02416376
## 5     a  9946 0.02385910
## 6    of  9538 0.02288036
#understanding the data a little bit more.
max(desc_word_count$percentage)
## [1] 0.0489896
min(desc_word_count$percentage)
## [1] 2.398864e-06
head(desc_word_count, 10)
##    words count percentage
## 1    the 20422 0.04898960
## 2    and 12012 0.02881515
## 3     to 11640 0.02792278
## 4      i 10073 0.02416376
## 5      a  9946 0.02385910
## 6     of  9538 0.02288036
## 7     in  6487 0.01556143
## 8     it  5372 0.01288670
## 9   that  5331 0.01278834
## 10    is  4766 0.01143299
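
These proportions also make it possible to ask how many of the most frequent words are needed to cover a given share of all tokens, which can guide how large a vocabulary to keep. A small sketch, assuming desc_word_count is still sorted by descending count (words_for_coverage is an illustrative helper, not part of the original analysis):

# Illustrative helper: how many top words are needed to reach a coverage target
words_for_coverage <- function(word_counts, target = 0.5){
  cum_cov <- cumsum(word_counts$percentage)  # cumulative coverage over the sorted words
  which(cum_cov >= target)[1]
}
# e.g. words_for_coverage(desc_word_count, 0.5) and words_for_coverage(desc_word_count, 0.9)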

6. Exploratory Analysis

Looking at the words by frequency, common words such as 'and', 'the' and 'that' occur most often. Keeping these words makes the analysis much more difficult, so only words whose proportion falls between 0.001 and 0.01 are retained (a range chosen by trial and error; more analysis is needed to settle on it).

desc_word_count <- desc_word_count[desc_word_count$percentage < 0.01, ]
head(desc_word_count)
##    words count  percentage
## 11   for  3870 0.009283603
## 12     s  3745 0.008983745
## 13   you  3516 0.008434405
## 14  with  3196 0.007666769
## 15   was  3081 0.007390900
## 16    my  3072 0.007369310
# Keep only words whose proportion is greater than 0.001
desc_word_count <- desc_word_count[desc_word_count$percentage > 0.001, ]

A histogram shows the distribution of counts for the retained words.

hist(desc_word_count$count)
(Figure: histogram of word counts for the retained words.)
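
Since ggplot2 is already loaded, the most frequent words in the retained range can also be shown as a bar chart, which is easier to read than the raw histogram (an illustrative sketch, not part of the original output):

# Illustrative bar chart of the 20 most frequent retained words
top_words <- head(desc_word_count, 20)
ggplot(top_words, aes(x = reorder(words, count), y = count)) +
  geom_col() +
  coord_flip() +
  labs(x = "Word", y = "Count", title = "Most frequent words in the retained range")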

7. Issues Faced so far...

1. Handling the large data files takes a lot of time; I still have to implement parallel computing to overcome this.

8. Future Plans

1. Implement n-grams to generate one-word, two-word and longer tokens (see the sketch below).
2. Summarize the frequency of tokens and find associations between tokens.
3. Build predictive models using these tokens.
4. Develop a data product that makes word recommendations based on user input.
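
As a first step towards item 1, here is a minimal sketch of bigram generation with the tokenizers package loaded above; the object names are illustrative and the call has not been tuned for the full dataset:

# Illustrative bigram tokenization of the cleaned sample text
bigrams <- tokenizers::tokenize_ngrams(paste(clean_data$text, collapse = " "), n = 2)[[1]]
head(sort(table(bigrams), decreasing = TRUE), 10)  # ten most frequent bigrams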