Week 1 Quiz Overview

This is the week 1 capstone report of the Data Science Specialization, in which we apply data science techniques to a natural language processing problem.

The main goal of the capstone is to build a text-prediction application with the R Shiny package.

However, this report covers only the exploratory data analysis (EDA) of the Capstone dataset.

Tasks to accomplish

  1. Tokenization - identifying appropriate tokens such as words, punctuation, and numbers, and writing a function that takes a file as input and returns a tokenized version of it (see the sketch after this list).
  2. Profanity filtering - removing profanity and other words you do not want to predict.
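
A minimal sketch of both tasks is shown below; the function names tokenize_file and remove_profanity and the placeholder bad-word list are illustrative assumptions, not part of the course material.

# Sketch: tokenize a file and filter out unwanted words (illustrative only)
tokenize_file <- function(path, n = -1L) {
  lines <- readLines(path, n = n, warn = FALSE, encoding = "UTF-8")
  lines <- tolower(lines)
  # Split on anything that is not a letter, digit, or apostrophe
  tokens <- unlist(strsplit(lines, "[^a-z0-9']+"))
  tokens[tokens != ""]
}

remove_profanity <- function(tokens, bad_words) {
  tokens[!tokens %in% bad_words]
}

# Example usage with a tiny placeholder list of words to exclude
bad_words <- c("badword1", "badword2")
tokens <- tokenize_file("final/en_US/en_US.twitter.txt", n = 1000)
clean_tokens <- remove_profanity(tokens, bad_words)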

Tips, tricks, and hints

Loading the data

This dataset is fairly large. You don't necessarily need to load the entire dataset to build your algorithms (see the Sampling section below). At least initially, you might want to use a smaller subset of the data. Reading in chunks or lines with R's readLines or scan functions can be useful. You can also loop over each line of text by embedding readLines within a for/while loop, but this may be slower than reading in large chunks at a time. Reading pieces of the file at a time requires a file connection in R. For example, the following code reads the first few lines of the English Twitter dataset:

con <- file("en_US.twitter.txt", "r")
readLines(con, 1) ## Read the first line of text
readLines(con, 1) ## Read the next line of text
readLines(con, 5) ## Read in the next 5 lines of text
close(con) ## It's important to close the connection when you are done.
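
As noted above, readLines can also be embedded in a loop to process the file in fixed-size chunks; the chunk size of 10,000 lines below is an arbitrary choice for illustration.

# Read the file in chunks of 10,000 lines until readLines returns nothing
con <- file("en_US.twitter.txt", "r")
repeat {
  chunk <- readLines(con, 10000, warn = FALSE)
  if (length(chunk) == 0) break
  ## ... process the chunk here (e.g. tokenize it, update counts) ...
}
close(con)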

Sampling

To reiterate, to build models you don't need to load in and use all of the data. Often relatively few randomly selected rows or chunks need to be included to get an accurate approximation to results that would be obtained using all the data. Remember your inference class and how a representative sample can be used to infer facts about a population. You might want to create a separate sub-sample dataset by reading in a random subset of the original data and writing it out to a separate file. That way, you can store the sample and not have to recreate it every time. You can use the rbinom function to "flip a biased coin" to determine whether you sample a line of text or not.
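
A sketch of that coin-flipping approach, assuming a 1% sampling rate and an output file name chosen here purely for illustration:

# Keep each line with probability 0.01 (a "biased coin" via rbinom)
set.seed(1234)
lines <- readLines("en_US.twitter.txt", warn = FALSE, encoding = "UTF-8")
keep <- rbinom(length(lines), size = 1, prob = 0.01) == 1

# Write the sub-sample out once so it does not have to be recreated every run
writeLines(lines[keep], "en_US.twitter.sample.txt")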

Loading libraries

library(stringi)

Loading data

# Read the blogs and twitter files using readLines
blog_data <- readLines("final/en_US/en_US.blogs.txt", warn = FALSE, encoding = "UTF-8")
twitter_data <- readLines("final/en_US/en_US.twitter.txt", warn = FALSE, encoding = "UTF-8")

# Read the news file through a binary-mode connection, as special characters in the text stop a plain text-mode read
con <- file("final/en_US/en_US.news.txt", open="rb")
news_data <- readLines(con, encoding = "UTF-8")
close(con)
rm(con)

Calculating file sizes in megabytes

## size of the data
blog_dim <- file.info("final/en_US/en_US.blogs.txt")$size / 1024 ^ 2
sprintf("The en_US.blogs.txt file is: %s Megabytes", blog_dim)
## [1] "The en_US.blogs.txt file is: 200.424207687378 Megabytes"
news_dim <- file.info("final/en_US/en_US.news.txt")$size / 1024 ^ 2
sprintf("The en_US.news.txt file is: %s Megabytes", news_dim)
## [1] "The en_US.news.txt file is: 196.277512550354 Megabytes"
twitter_dim <- file.info("final/en_US/en_US.twitter.txt")$size / 1024 ^ 2
sprintf("The en_US.twitter.txt file is: %s Megabytes", twitter_dim)
## [1] "The en_US.twitter.txt file is: 159.364068984985 Megabytes"

Data Summary

Data_info <- data.frame('File' = c("Blogs","News","Twitter"),
                      "FileSizeinMB" = c(blog_dim, news_dim, twitter_dim),
                      'NumberofLines' = sapply(list(blog_data, news_data, twitter_data), function(x){length(x)}),
                      'TotalCharacters' = sapply(list(blog_data, news_data, twitter_data), function(x){sum(nchar(x))}),
                      TotalWords = sapply(list(blog_data,news_data,twitter_data),stri_stats_latex)[4,],
                      'MaxCharacters' = sapply(list(blog_data, news_data, twitter_data), function(x){max(nchar(x))})
                      )

Data_info
##      File FileSizeinMB NumberofLines TotalCharacters TotalWords MaxCharacters
## 1   Blogs     200.4242        899288       206824505   37570839         40833
## 2    News     196.2775       1010242       203223159   34494539         11384
## 3 Twitter     159.3641       2360148       162096031   30451128           140

As we can see, each file is roughly 200 MB or less and contains more than 30 million words.
- Twitter has by far the most lines, but the shortest ones (at most 140 characters, as expected from the tweet limit).
- Blogs has the longest single line (40,833 characters) and the most total words.
- News sits in between, with line lengths closer to blogs than to Twitter.

Counting words

In the en_US twitter data set, if you divide the number of lines where the word "love" (all lowercase) occurs by the number of lines the word "hate" (all lowercase) occurs, about what do you get?

love_hate <- length(grep("love", twitter_data)) / length(grep("hate", twitter_data))
sprintf("We get around: %s", love_hate)
## [1] "We get around: 4.10859156202006"

The one tweet in the en_US twitter data set that matches the word "biostats" says what?

tweet <- grep("biostats", twitter_data, value = TRUE)
tweet
## [1] "i know how you feel.. i have biostats on tuesday and i have yet to study =/"

How many tweets have the exact characters "A computer once beat me at chess, but it was no match for me at kickboxing". (I.e. the line matches those characters exactly.)

tweet_match <- grep("A computer once beat me at chess, but it was no match for me at kickboxing", twitter_data)
tweet_match
## [1]  519059  835824 2283423
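
Since the question asks for a count rather than the line positions, length() gives it directly:

length(tweet_match)
## [1] 3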