Intro

This is a brief report on the capstone project of the Data Science Specialization at Coursera. It presents an initial exploration of the dataset (blogs, news and Twitter) and ideas for building the word prediction model.

Loading the libraries

setwd("~/Documents/git/DataScienceCapstone")
library(tm)           # text mining
library(wordcloud)    # word cloud plot
library(RColorBrewer) # colour palettes for plotting
library(ggplot2)      # standard plotting package
library(scales)       # comma formatting for axis labels
library(magrittr)     # %>% operator
library(stringr)      # string operations
library(dplyr)        # data manipulation
library(reshape2)     # melt(), for reshaping data

Analysing and Loading the files

In this step we look at the size of each file and the number of lines it contains.

# Size of the files in MB
blogsSize <- file.size("final/en_US/en_US.blogs.txt") / 1024^2
newsSize <- file.size("final/en_US/en_US.news.txt") / 1024^2
twitterSize <- file.size("final/en_US/en_US.twitter.txt") / 1024^2

# Read the files (skipNul avoids warnings from embedded nul characters)
blogs <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

# Calculate number of lines
blogsLines <- length(blogs)
newsLines <- length(news)
twitterLines <- length(twitter)

fileSummary <- data.frame(
  Source = c("Blogs", "News", "Twitter"),
  Size = c(blogsSize, newsSize, twitterSize),
  Lines = c(blogsLines, newsLines, twitterLines)
)

This plot shows the relationship between file size and number of lines. Twitter has the most lines, but since each tweet is capped at 140 characters, its file is smaller than the news and blogs files.

ggplot(fileSummary, aes(x = Size, y = Lines, colour = Source)) +
  geom_point(size = 4) +
  xlab("Size [mb]") +
  ylab("Number of lines") +
  labs(title = "Sources") +
  scale_y_continuous(labels = comma)
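
As a quick check on the 140-character point above (assuming the full vectors read in earlier are still in memory), the longest line of each source can be computed directly:

# Longest line in each source (Twitter is expected to stay within the 140-character cap)
max(nchar(blogs))
max(nchar(news))
max(nchar(twitter))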

Sample Exploration

# Read the first 1,000 lines of each file
blogs <- readLines("final/en_US/en_US.blogs.txt", n = 1000)
news <- readLines("final/en_US/en_US.news.txt", n = 1000)
twitter <- readLines("final/en_US/en_US.twitter.txt", n = 1000)

data <- data.frame(blogs, news, twitter, stringsAsFactors = FALSE)
data <- data %>%
  mutate(blogsLength = str_length(blogs),
         newsLength = str_length(news),
         twitterLength = str_length(twitter)) %>%
  select(blogsLength, newsLength, twitterLength) %>%
  melt(variable.name = "Source", value.name = "Length")

This violin plot combines features of a box plot and a density plot, showing the distribution of line lengths for each source.

ggplot(data, aes(x = Source, y = Length)) +
  geom_violin(fill = "blue", alpha = 0.5) +
  ylab("Number of character")

Load the data as a corpus

corpus <- c(blogs, news, twitter)

cp <- Corpus(VectorSource(corpus))

cp <- cp %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removeNumbers) %>%
  tm_map(removePunctuation) %>%
  tm_map(stripWhitespace) %>%
  tm_map(removeWords, stopwords("english"))

cpDtm <- TermDocumentMatrix(cp)

m <- as.matrix(cpDtm)
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)
head(d, 10)
##      word freq
## said said  304
## will will  260
## one   one  255
## just just  250
## like like  248
## can   can  192
## time time  192
## new   new  186
## get   get  171
## dont dont  146
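
Converting the term-document matrix to a dense matrix is fine for this small sample, but may not scale to larger ones. As a sketch of a more memory-friendly alternative (assuming the slam package, which provides the sparse matrix format used by tm, is installed), the term frequencies can be summed directly on the sparse representation:

# Sum term frequencies without converting to a dense matrix
library(slam) # sparse matrix backend used by tm
vSparse <- sort(slam::row_sums(cpDtm), decreasing = TRUE)
head(vSparse, 10)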

This next plot shows the top 200 words found in these samples (3,000 lines in total).

set.seed(1)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
          max.words = 200, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8,"Dark2"))

Looking at the top 10 words (after removing the stop words) gives us some insight into the most commonly used words in the English language.

top10 <- d[1:10, ]
ggplot(top10, aes(x = reorder(word, freq), y = freq)) +
  geom_col(fill = "darkgreen") +
  coord_flip() +
  xlab("Top 10 words") +
  ylab("Frequency")

Next Steps

For the next steps in this capstone, two- and three-word n-grams will be counted, and prediction models will be tested to find a good balance between accuracy, memory usage and processing time.
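
Below is a minimal sketch of how the two- and three-word n-gram counting could look in base R, using the sample lines combined earlier in the corpus vector as input; the function name and the simple tokenisation rule are illustrative, not the final implementation.

# Count n-grams of n consecutive words in a character vector of lines
countNgrams <- function(lines, n = 2) {
  tokens <- strsplit(tolower(lines), "[^a-z']+")
  ngrams <- unlist(lapply(tokens, function(w) {
    w <- w[w != ""]                     # drop empty tokens left by the split
    if (length(w) < n) return(character(0))
    sapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "))
  }))
  sort(table(ngrams), decreasing = TRUE)
}

head(countNgrams(corpus, n = 2), 10) # most frequent bigrams
head(countNgrams(corpus, n = 3), 10) # most frequent trigrams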