Milestone report for the Coursera “Data Science Specialization” Capstone project

In this project we will eventually build an R Shiny app that predicts the next word a user will type, based on the previous words. To build this app, large sets of natural text are used for training. In this milestone report, I do some exploratory data analysis on these data sets, clean the data and create some first bigrams and trigrams. This analysis is primarily done for English, but I also have a look at the German data sets. For this first analysis, I do not use specific R packages for text mining or natural language processing; I’ll switch to such packages later when building the prediction model. I use these R packages:

library(dplyr); library(ggplot2); library(wordcloud)

The data can be downloaded using this URL: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

Unzip the data, and have a look at the files:

URL <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
destfile <- "Coursera-SwiftKey.zip"
if (!file.exists(destfile)) {
      download.file(URL, destfile = destfile)
}
cs <- unzip("Coursera-SwiftKey.zip")
cs
##  [1] "./final/de_DE/de_DE.twitter.txt" "./final/de_DE/de_DE.blogs.txt"  
##  [3] "./final/de_DE/de_DE.news.txt"    "./final/ru_RU/ru_RU.blogs.txt"  
##  [5] "./final/ru_RU/ru_RU.news.txt"    "./final/ru_RU/ru_RU.twitter.txt"
##  [7] "./final/en_US/en_US.twitter.txt" "./final/en_US/en_US.news.txt"   
##  [9] "./final/en_US/en_US.blogs.txt"   "./final/fi_FI/fi_FI.news.txt"   
## [11] "./final/fi_FI/fi_FI.blogs.txt"   "./final/fi_FI/fi_FI.twitter.txt"

We have blogs, Twitter messages and news texts for several languages. I select the English files first:

blog <- cs[9]   # en_US.blogs.txt
twit <- cs[7]   # en_US.twitter.txt
news <- cs[8]   # en_US.news.txt

blogLines <- readLines(blog)
twitLines <- readLines(twit)
newsLines <- readLines(news)
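
Depending on the file and platform, readLines() may warn about embedded nul characters or an incomplete final line in these files. If that happens, the standard skipNul and encoding arguments can be passed, for example:

# Re-read with embedded nuls skipped and UTF-8 encoding declared
blogLines <- readLines(blog, encoding = "UTF-8", skipNul = TRUE)
twitLines <- readLines(twit, encoding = "UTF-8", skipNul = TRUE)
newsLines <- readLines(news, encoding = "UTF-8", skipNul = TRUE)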

To get a better idea of the data we will be working with, I check the number of lines and the number of single words per data set. As the data has not been cleaned yet, the word counts will be an overestimate.

numbBlogLines <- length(blogLines)
numbTwitLines <- length(twitLines)
numbNewsLines <- length(newsLines)
numberOfLines <- data.frame(c(numbBlogLines, numbTwitLines, numbNewsLines))
rownames(numberOfLines) <- c("Blogs", "Twitter", "News"); colnames(numberOfLines) <- "Number of Lines"
print(numberOfLines)
##         Number of Lines
## Blogs            899288
## Twitter         2360148
## News            1010242
singleWordsBlog <- strsplit(blogLines, split = " ") %>%
      unlist()
numbBlogWords <- length(singleWordsBlog)
singleWordsTwit <- strsplit(twitLines, split = " ") %>%
      unlist()
numbTwitWords <- length(singleWordsTwit)
singleWordsNews <- strsplit(newsLines, split = " ") %>%
      unlist()
numbNewsWords <- length(singleWordsNews)
numberOfWords <- data.frame(c(numbBlogWords, numbTwitWords, numbNewsWords))
rownames(numberOfWords) <- c("Blogs", "Twitter", "News"); colnames(numberOfWords) <- "Number of Words"
print(numberOfWords)
##         Number of Words
## Blogs          37334131
## Twitter        30373543
## News           34372530

Now we’ll clean the data set. I only use a random sample of 5% of the lines to prevent the analysis from taking too much time.

words <- c(blogLines, twitLines, newsLines)
words <- sample(words, 0.05*length(words))
wordsClean <- gsub(words, pattern = "#", replacement = "") %>%
      gsub(pattern = ",", replacement = "") %>%
      gsub(pattern = "/", replacement = "") %>%
      gsub(pattern = "\"", replacement = "") %>%
      gsub(pattern = ":", replacement = "") %>%
      gsub(pattern = ";", replacement = "") %>%
      gsub(pattern = "\\^", replacement = "") %>%
      gsub(pattern = "-", replacement = "") %>%
      gsub(pattern = "\\*", replacement = "") %>%
      gsub(pattern = "\\+", replacement = "") %>%
      gsub(pattern = "\\?", replacement = "") %>%
      gsub(pattern = "=", replacement = "") %>%
      gsub(pattern = "<", replacement = "") %>%
      gsub(pattern = ">", replacement = "") %>%
      gsub(pattern = "\\(", replacement = "") %>%
      gsub(pattern = "\\)", replacement = "") %>%
      gsub(pattern = "!", replacement = "") %>%
      gsub(pattern = "\\.", replacement = "")
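
The same cleaning can also be written as a single call with a character class instead of the chain of gsub() calls above; a minimal sketch that should be equivalent:

# One character class covering all symbols removed above
# ("-" is placed last so it is read literally inside [...])
wordsClean <- gsub("[#,/\":;^*+?=<>()!.-]", "", words)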

We then split the cleaned data up into single words:

singleWords <- strsplit(wordsClean, split = " ") %>%
      unlist() %>%
      tolower()
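
Before plotting, it can be useful to inspect the most frequent tokens directly; a minimal base-R sketch (empty strings left over from repeated spaces are dropped first):

wordFreq <- sort(table(singleWords[singleWords != ""]), decreasing = TRUE)
head(wordFreq, 20)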

And we create a word cloud to visualize the frequency of words in the data, with stop words removed.

wordcloud(singleWords, max.words = 200)
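
As a first step towards the bigrams and trigrams mentioned in the introduction, n-grams can be built from the same cleaned lines without any text-mining packages. The ngrams() helper below is my own illustration, not a function from the packages loaded above:

# Build n-grams per line so word pairs never cross line boundaries
ngrams <- function(lines, n = 2) {
      tokens <- strsplit(tolower(lines), split = " ")
      unlist(lapply(tokens, function(w) {
            w <- w[w != ""]                       # drop empty tokens
            if (length(w) < n) return(character(0))
            sapply(seq_len(length(w) - n + 1), function(i)
                  paste(w[i:(i + n - 1)], collapse = " "))
      }))
}

bigrams  <- ngrams(wordsClean, 2)
trigrams <- ngrams(wordsClean, 3)
head(sort(table(bigrams), decreasing = TRUE), 10)    # most frequent bigrams
head(sort(table(trigrams), decreasing = TRUE), 10)   # most frequent trigrams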