This is a brief report for the capstone project of the Data Science Specialization at Coursera. It presents an exploratory analysis of the dataset (blogs, news and Twitter) and outlines ideas for building the word prediction model.
setwd("~/Documents/git/DataScienceCapstone")
library(tm) # text mining
library(wordcloud) # wordcloud plot
library(RColorBrewer) # nice colour for plotting
library(ggplot2) # standard plot package
library(scales) # features for our plots
library(magrittr) # %>% operator
library(stringr) # for string operations
library(dplyr) # data manipulation
library(reshape2) # melt, for data manipulation
In this step we look at the size of each file and the number of lines it contains.
# File sizes in MB
blogsSize <- file.size("final/en_US/en_US.blogs.txt") / 1024^2
newsSize <- file.size("final/en_US/en_US.news.txt") / 1024^2
twitterSize <- file.size("final/en_US/en_US.twitter.txt") / 1024^2
# Read the files (skipNul = TRUE guards against any embedded nul characters in the raw data)
blogs <- readLines("final/en_US/en_US.blogs.txt", skipNul = TRUE)
news <- readLines("final/en_US/en_US.news.txt", skipNul = TRUE)
twitter <- readLines("final/en_US/en_US.twitter.txt", skipNul = TRUE)
# Calculate number of lines
blogsLines <- length(blogs)
newsLines <- length(news)
twitterLines <- length(twitter)
fileSummary <- data.frame(
  Source = c("Blogs", "News", "Twitter"),
  Size = c(blogsSize, newsSize, twitterSize),
  Lines = c(blogsLines, newsLines, twitterLines)
)
This plot shows the relationship between file size and number of lines. Twitter has the most lines, but because tweets are limited to 140 characters, its file size is smaller than that of the news and blogs files.
ggplot(fileSummary, aes(x = Size, y = Lines, colour = Source)) +
geom_point(size = 4) +
xlab("Size [mb]") +
ylab("Number of lines") +
labs(title = "Sources") +
scale_y_continuous(labels = comma)
# Read the first 1,000 lines of each file
blogs <- readLines("final/en_US/en_US.blogs.txt", n = 1000)
news <- readLines("final/en_US/en_US.news.txt", n = 1000)
twitter <- readLines("final/en_US/en_US.twitter.txt", n = 1000)
# Combine the three 1,000-line samples into one data frame of character columns
data <- data.frame(blogs, news, twitter, stringsAsFactors = FALSE)
data <- data %>%
mutate(blogsLength = str_length(blogs),
newsLength = str_length(news),
twitterLength = str_length(twitter)) %>%
select(blogsLength, newsLength, twitterLength) %>%
melt(variable.name = "Source", value.name = "Length")
This graph is a violin plot, a mix of a box plot and a density plot, showing the distribution of line lengths (in characters) for each source.
ggplot(data, aes(x = Source, y = Length)) +
geom_violin(fill = "blue", alpha = 0.5) +
ylab("Number of character")
# Build a corpus from the three samples and clean it
corpus <- c(blogs, news, twitter)
cp <- Corpus(VectorSource(corpus))
cp <- cp %>%
tm_map(content_transformer(tolower)) %>% # convert to lower case
tm_map(removeNumbers) %>% # remove numbers
tm_map(removePunctuation) %>% # remove punctuation
tm_map(stripWhitespace) %>% # collapse extra whitespace
tm_map(removeWords, stopwords("english")) # remove English stop words
# Build a term-document matrix and compute the overall frequency of each word
cpTdm <- TermDocumentMatrix(cp)
m <- as.matrix(cpTdm)
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)
head(d, 10)
## word freq
## said said 304
## will will 260
## one one 255
## just just 250
## like like 248
## can can 192
## time time 192
## new new 186
## get get 171
## dont dont 146
The next plot shows the top 200 words found in this sample (3,000 lines).
set.seed(1)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
max.words = 200, random.order = FALSE, rot.per = 0.35,
colors = brewer.pal(8,"Dark2"))
Looking only at the top 10 words (after removing stop words) gives us an insight into the most frequently used words in the English language.
sample <- d[1:10,]
ggplot(sample, aes(x = reorder(word, freq), y = freq)) +
geom_col(fill = "darkgreen") +
coord_flip() +
xlab("Top 10 words") +
ylab("Frequency")
For the next steps in this capstone, n-grams of two and three words (bigrams and trigrams) will be counted, and prediction models will be tested to find a good balance between accuracy, memory usage and processing time.
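As a starting point for that work, the sketch below counts bigrams and trigrams with base R and stringr, reusing the cleaned corpus cp built above, and then does a naive next-word lookup. This is only an illustration of the idea, not the final implementation; cleanText, countNgrams and predictNext are hypothetical names introduced here.
# A minimal sketch of n-gram counting (illustrative only)
# Assumption: the cleaned sample is extracted from cp as a plain character vector
cleanText <- sapply(seq_along(cp), function(i) as.character(cp[[i]]))

countNgrams <- function(lines, n = 2) {
  tokens <- str_split(lines, "\\s+") # split every line into words
  ngrams <- unlist(lapply(tokens, function(w) {
    w <- w[w != ""]
    if (length(w) < n) return(character(0))
    # slide a window of n words over the line
    sapply(seq_len(length(w) - n + 1), function(i) paste(w[i:(i + n - 1)], collapse = " "))
  }))
  sort(table(ngrams), decreasing = TRUE) # n-gram counts, most frequent first
}

bigrams <- countNgrams(cleanText, 2)
trigrams <- countNgrams(cleanText, 3)
head(bigrams, 10)
head(trigrams, 10)

# Naive prediction: the most frequent bigram starting with the given word
# supplies the next word (NA if the word was never seen as a bigram start)
predictNext <- function(word, bigramCounts) {
  candidates <- bigramCounts[str_detect(names(bigramCounts), paste0("^", word, " "))]
  if (length(candidates) == 0) return(NA_character_)
  str_split(names(candidates)[1], " ")[[1]][2]
}
predictNext("new", bigrams)
In the actual model, these counts would be stored more compactly and combined with a strategy such as back-off to shorter n-grams, which is where the trade-off between accuracy, memory usage and processing time comes in.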