Exploratory Data Analysis
Load the necessary libraries up front to set up an efficient R execution environment.
library(dplyr)
library(ngram)
library(tidyverse)
library(tokenizers)
library(RColorBrewer)
library(ggplot2)
library(stringr)
Download the training data from the Coursera website. There are data files for 4 different locales, and for each locale there are 3 data files from different sources (blogs, news, and Twitter).
#if (!file.exists("Coursera-SwiftKey.zip")) {
#  download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
#                destfile = "Coursera-SwiftKey.zip")
#}
# Check for the data file and unzip if necessary
#if (!file.exists("final/en_US/en_US.blogs.txt")) {
#  unzip("Coursera-SwiftKey.zip")
#}
blogs <- readLines("C:/Users/Venkata/Documents/Coursera_Capstone_Project/Coursera-SwiftKey/final/en_US/en_US.blogs.txt")
twitter <- readLines("C:/Users/Venkata/Documents/Coursera_Capstone_Project/Coursera-SwiftKey/final/en_US/en_US.twitter.txt")
news <- readLines("C:/Users/Venkata/Documents/Coursera_Capstone_Project/Coursera-SwiftKey/final/en_US/en_US.news.txt")
a <- max(nchar(blogs))
b <- max(nchar(twitter))
c <- max(nchar(news))
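To keep the basic file statistics in one place, the line counts and longest-line lengths can be collected into a small summary data frame. This is only a convenience step; the values come from the objects read above, so no numbers are shown here, and the name file_summary is just an illustrative choice.

file_summary <- data.frame(
  source         = c("blogs", "twitter", "news"),
  line_count     = c(length(blogs), length(twitter), length(news)),
  max_line_chars = c(a, b, c)
)
file_summary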
The raw data has to be cleaned to deal with punctuation, quotation marks, numbers, and other metacharacters. A function, Clean_String, is written to handle this.
Another function, Clean_Text_Block, is written to handle extra whitespace between words (more than one space) and blank lines between the lines of text.
Clean_String <- function(string){
  # Lowercase
  temp <- tolower(string)
  # Remove everything that is not a letter or whitespace
  # (you may want to keep more characters in your actual analyses)
  temp <- stringr::str_replace_all(temp, "[^a-zA-Z\\s]", " ")
  # Shrink runs of whitespace down to just one space
  temp <- stringr::str_replace_all(temp, "[\\s]+", " ")
  # Split into individual tokens
  temp <- stringr::str_split(temp, " ")[[1]]
  # Get rid of trailing "" if necessary
  indexes <- which(temp == "")
  if(length(indexes) > 0){
    temp <- temp[-indexes]
  }
  return(temp)
}

Clean_Text_Block <- function(text){
  if(length(text) <= 1){
    # Check to see if there is any text at all
    if(length(text) == 0){
      cat("There was no text in this document! \n")
      to_return <- list(num_tokens = 0, unique_tokens = 0, text = "")
    }else{
      # If there is one and only one line of text, tokenize it
      clean_text <- Clean_String(text)
      num_tok <- length(clean_text)
      num_uniq <- length(unique(clean_text))
      to_return <- list(num_tokens = num_tok, unique_tokens = num_uniq, text = clean_text)
    }
  }else{
    # Get rid of blank lines
    indexes <- which(text == "")
    if(length(indexes) > 0){
      text <- text[-indexes]
    }
    # Loop through the lines in the text and use append() to
    # add them to a single token vector
    clean_text <- Clean_String(text[1])
    for(i in 2:length(text)){
      clean_text <- append(clean_text, Clean_String(text[i]))
    }
    # Calculate the number of tokens and unique tokens and return them in a
    # named list object
    num_tok <- length(clean_text)
    num_uniq <- length(unique(clean_text))
    to_return <- list(num_tokens = num_tok, unique_tokens = num_uniq, text = clean_text)
  }
  return(to_return)
}
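As a quick sanity check, Clean_String can be applied to a made-up sentence (the input below is only an illustration): punctuation and numbers are dropped, the text is lowercased, and the result is a vector of tokens.

Clean_String("The quick, brown fox -- jumped over 2 lazy dogs!")
# expected result: "the" "quick" "brown" "fox" "jumped" "over" "lazy" "dogs"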
The data files are plain .txt files and are large, so the datasets contain a huge amount of text. When I tried to work on all of the available data, performance issues arose. A possible solution is parallel computing (I have yet to do it). For now, I am reading the first 10,000 lines from the US blogs data for our analysis.
data_sample <- readLines("C:/Users/Venkata/Documents/Coursera_Capstone_Project/Coursera-SwiftKey/final/en_US/en_US.blogs.txt", n = 10000)
The 10,000 sampled lines are first cleaned and then analyzed.
clean_data <- Clean_Text_Block(data_sample)
str(clean_data)
## List of 3
##  $ num_tokens   : int 416864
##  $ unique_tokens: int 28782
##  $ text         : chr [1:416864] "in" "the" "years" "thereafter" ...
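The parallel-computing idea mentioned above is not implemented yet. A minimal sketch with the base parallel package might look like the following; it is not run here, and the core count and the per-line use of Clean_String are assumptions rather than the final approach.

# not run: sketch of cleaning the lines in parallel with the base 'parallel' package
library(parallel)
cl <- makeCluster(detectCores() - 1)        # leave one core free (assumption)
clusterExport(cl, "Clean_String")           # make the cleaning function available on each worker
clusterEvalQ(cl, library(stringr))          # make sure stringr is available on each worker
tokens_by_line <- parLapply(cl, data_sample, Clean_String)
stopCluster(cl)
parallel_tokens <- unlist(tokens_by_line)   # one token vector, like the one Clean_Text_Block returns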
tab <- table(clean_data[[3]])
tab <- data.frame(words = names(tab), count = as.numeric(tab))
desc_word_count <- arrange(tab, desc(count))
# find the percentage of occurrences in the dataset for each word
desc_word_count$percentage <- (desc_word_count$count/sum(desc_word_count$count))
head(desc_word_count)
##   words count percentage
## 1   the 20422 0.04898960
## 2   and 12012 0.02881515
## 3    to 11640 0.02792278
## 4     i 10073 0.02416376
## 5     a  9946 0.02385910
## 6    of  9538 0.02288036
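The same table can also be built with the dplyr verbs that are already loaded. This is shown only as a cross-check of the base-R version above, assuming dplyr >= 0.8 for the name argument of count().

word_freq <- data.frame(words = clean_data[[3]]) %>%
  count(words, name = "count", sort = TRUE) %>%
  mutate(percentage = count / sum(count))
head(word_freq)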
# understanding the data a little bit more
max(desc_word_count$percentage)
## [1] 0.0489896
min(desc_word_count$percentage)
## [1] 2.398864e-06
head(desc_word_count, 10)
##    words count percentage
## 1    the 20422 0.04898960
## 2    and 12012 0.02881515
## 3     to 11640 0.02792278
## 4      i 10073 0.02416376
## 5      a  9946 0.02385910
## 6     of  9538 0.02288036
## 7     in  6487 0.01556143
## 8     it  5372 0.01288670
## 9   that  5331 0.01278834
## 10    is  4766 0.01143299
Looking at the words by frequency, stop words such as 'the', 'and', and 'that' occur the most. Keeping these words makes the analysis much more difficult. Hence, only words whose frequency (as a fraction of all tokens) falls between 0.001 and 0.01 are considered. This range was chosen by trial and error; more analysis has to be done to determine it properly.
desc_word_count <- desc_word_count[desc_word_count$percentage < 0.01, ]
head(desc_word_count)
##    words count  percentage
## 11   for  3870 0.009283603
## 12     s  3745 0.008983745
## 13   you  3516 0.008434405
## 14  with  3196 0.007666769
## 15   was  3081 0.007390900
## 16    my  3072 0.007369310
# keep only words whose percentage is greater than 0.001
desc_word_count <- desc_word_count[desc_word_count$percentage > 0.001, ]
hist(desc_word_count$count)
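Since ggplot2 and RColorBrewer are already loaded, a bar chart of the most frequent remaining words could complement the histogram. The top-20 cutoff below is an arbitrary choice for illustration, and geom_col() assumes ggplot2 >= 2.2.

top_words <- head(desc_word_count, 20)
ggplot(top_words, aes(x = reorder(words, count), y = count)) +
  geom_col(fill = brewer.pal(3, "Set2")[1]) +
  coord_flip() +
  labs(x = "word", y = "count", title = "Most frequent words after filtering")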