This document presents a simple exploratory data analysis of the news, blog, and Twitter data.
rm(list = ls())
library(data.table)
library(tidyverse)
library(tidytext)
library(dplyr)
library(ggplot2)
The data was read and saved by a separate script, shown in the Appendix below. First, we check a summary of the data.
load("./dta/textDta.RData")
summary(textDta)
##                       text              source
##  Thanks for the RT!     :    571   Length:3336695
##  Thank you!             :    547   Class :character
##  thank you!             :    382   Mode  :character
##  Thanks for the follow! :    326
##  Thanks for the mention!:    188
##  thanks for the RT!     :    185
##  (Other)                :3334496
summary(as.factor(textDta$source))
## blog news twitter
## 899288 77259 2360148
Since the data is too large to process in full, we will use only 5% of the observations, selected at random. Below is the distribution of the sample per source.
textDta <- textDta[sample(1:nrow(textDta), 0.05*nrow(textDta)), ] ## keep a random 5% of the rows
summary(as.factor(textDta$source))
## blog news twitter
## 45309 3847 117678
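Note that no seed was set before sampling, so the exact counts above will vary between runs. A reproducible version could fix the seed first (the seed value below is arbitrary, not the one used here):
set.seed(1234) ## arbitrary seed, purely for reproducibility
textDta <- textDta[sample(1:nrow(textDta), 0.05*nrow(textDta)), ]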
Depending on the type of analysis, stop words are used to exclude words that are very common. In this project I used the pre-existing stop-word data set from tidytext, called "stop_words". Its structure is shown below.
## Classes 'tbl_df', 'tbl' and 'data.frame': 1149 obs. of 2 variables:
## $ word : chr "a" "a's" "able" "about" ...
## $ lexicon: chr "SMART" "SMART" "SMART" "SMART" ...
There are 1,149 words in the stop-word list, which we will use in the data exploration that follows.
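Removing these words from a tokenized data frame is typically done with an anti-join against stop_words; a minimal sketch (the word column here stands in for the tokenized output produced in the next step):
tibble(word = c("thanks", "for", "the", "coffee")) %>%
  anti_join(stop_words, by = "word") ## only the non-stop words are kept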
Tokenization is the process of breaking a sequence of text into words or phrases (tokens). A contiguous sequence of n words is called an "n-gram", where "n" is the number of words in the sequence. Here we will explore three sequence lengths: 1-grams, 2-grams, and 3-grams.
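As a small illustration (the sentence below is made up, not taken from the corpus), unnest_tokens() from tidytext produces the different n-grams like this:
toy <- tibble(text = "thanks so much for the follow")
toy %>% unnest_tokens(word, text)                             ## 1-grams: thanks, so, much, for, the, follow
toy %>% unnest_tokens(bigram, text, token = "ngrams", n = 2)  ## 2-grams: "thanks so", "so much", ...
toy %>% unnest_tokens(trigram, text, token = "ngrams", n = 3) ## 3-grams: "thanks so much", ...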
The following are the top 20 1-grams, 2-grams, and 3-grams. It is important to note that these come from a 5% sample, so a margin of error should be considered when analysing them.
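The code behind those top-20 figures is not reproduced here; a minimal sketch of how the counts could be obtained with tidytext might look like this (note that the text column is stored as a factor, and stop words are removed from the 1-grams only):
sampleDta <- textDta %>% mutate(text = as.character(text)) ## unnest_tokens() needs character text
unigrams <- sampleDta %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE) %>%
  slice_head(n = 20)
bigrams <- sampleDta %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  count(bigram, sort = TRUE) %>%
  slice_head(n = 20)
trigrams <- sampleDta %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
  count(trigram, sort = TRUE) %>%
  slice_head(n = 20)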
The project requires predicting the next word as a user types on a keyboard. The planned approach is to build n-gram frequency tables and then apply a prediction algorithm on top of them. I will most likely use the Stupid Backoff algorithm, as it is well suited to this type of data.
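As a rough sketch of the idea (the tables uni, bi and tri below are hypothetical n-gram count data frames, not objects built in this report), a Stupid Backoff score for a candidate word w3 following the context w1 w2 could look like this:
## minimal sketch of Stupid Backoff scoring; uni (word, n), bi (w1, w2, n)
## and tri (w1, w2, w3, n) are hypothetical n-gram count tables
sb_score <- function(w1, w2, w3, uni, bi, tri, lambda = 0.4) {
  tri_n <- tri$n[tri$w1 == w1 & tri$w2 == w2 & tri$w3 == w3]
  ctx_n <- bi$n[bi$w1 == w1 & bi$w2 == w2]
  if (length(tri_n) > 0) {
    return(tri_n[1] / ctx_n[1])          ## trigram seen: use its relative frequency
  }
  bi_n  <- bi$n[bi$w1 == w2 & bi$w2 == w3]
  uni_n <- uni$n[uni$word == w2]
  if (length(bi_n) > 0) {
    return(lambda * bi_n[1] / uni_n[1])  ## back off to the bigram, penalised by lambda
  }
  if (w3 %in% uni$word) {
    return(lambda^2 * uni$n[uni$word == w3][1] / sum(uni$n)) ## back off to the unigram
  }
  0                                      ## word never seen in the sample
}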
Below is the R code I used to download, read, and save the data.
rm(list = ls())
library(tidyverse)
library(tidytext)
library(ggplot2)
## download the data ##
if(!file.exists("./dta/raw")){
  dir.create("./dta/raw", recursive = TRUE) ## create the data folder and any missing parent folders
}
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(url, destfile = "./dta/raw/Coursera-SwiftKey.zip", mode = "wb")
unzip(zipfile = "./dta/raw/Coursera-SwiftKey.zip", exdir = "./dta/raw") ## unzip to open files
path <- file.path("./dta/raw", "final", "en_US") ## the zip extracts into final/en_US
files <- list.files(path, recursive = TRUE)      ## check that the three en_US files are present
## open twitter data ##
con <- file("./dta/raw/final/en_US/en_US.twitter.txt", "r")
twitterDta <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
twitterDta <- as.data.frame(twitterDta)
names(twitterDta)[1] <- "text"
twitterDta$source <- "twitter" ## add an identifier for the data source
close(con)
save(twitterDta, file = "./dta/twitterDta.RData")
## open news data ##
con <- file("./dta/raw/final/en_US/en_US.news.txt", "r")
newsDta <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
newsDta <- as.data.frame(newsDta)
names(newsDta)[1] <- "text"
newsDta$source <- "news" ## add an identifier for the data source
close(con)
save(newsDta, file = "./dta/newsDta.RData")
## open blog data ##
con <- file("./dta/raw/final/en_US/en_US.blogs.txt", "r")
blogDta <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
blogDta <- as.data.frame(blogDta)
names(blogDta)[1] <- "text"
blogDta$source <- "blog" ## add an identifier for the data source
close(con)
save(blogDta, file = "./dta/blogDta.RData")
## bind data to make 1 data frame ##
textDta <- rbind(blogDta, newsDta, twitterDta)
save(textDta, file = "./dta/textDta.RData")