The Coursera Data Science Specialization intends to teach the basic skills involved in being a data scientist. The goal of this Capstone project is to give the experience of what it truly means to be a data scientist: it is common in any data science project to receive messy data, a vague question, and little instruction on how to analyze the data. The project consists of developing a predictive model of text using a dataset provided by the SwiftKey company. The main steps of this assignment are downloading the dataset, cleaning the data, and doing some basic analysis. The major features of the datasets are shown in tables and plots, and plans for creating a prediction algorithm are discussed.
library(ggplot2)
library(ngram)
library(NLP)
##
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
##
## annotate
library(tm)
## Warning: package 'tm' was built under R version 3.6.2
library(magrittr)
library(stringi)
The large data file is downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip and then unzipped. The unzipped archive contains text data in English, German, Finnish, and Russian. We will only look at the English data.
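A minimal sketch of that download step is shown below, assuming the zip file is saved into the working directory (the destination file name here is illustrative):
# Download and unzip the SwiftKey dataset (sketch; destination path is illustrative)
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("Coursera-SwiftKey.zip")) {
  download.file(url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
  unzip("Coursera-SwiftKey.zip")
}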
setwd("~/Yale Papers/Data Analysis/Data Science Capstone/final/en_US")
blogs <- "en_US.blogs.txt"
news <- "en_US.news.txt"
twitter <- "en_US.twitter.txt"
blog.line <- readLines(blogs, encoding = "UTF-8", skipNul = TRUE)
news.line <- readLines(news, encoding = "UTF-8", skipNul = TRUE)
## Warning in readLines(news, encoding = "UTF-8", skipNul = TRUE): incomplete
## final line found on 'en_US.news.txt'
twitter.line <- readLines(twitter, encoding = "UTF-8", skipNul = TRUE)
blog.word.count <- stri_count_words(blog.line)
news.word.count <- stri_count_words(news.line)
twitter.word.count <- stri_count_words(twitter.line)
length(blog.word.count)
## [1] 899288
length(twitter.word.count)
## [1] 2360148
length(news.word.count)
## [1] 77259
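For a compact overview, the line and word counts can be combined into one summary table (a small sketch that only reuses the objects computed above):
# Summarize line and word counts per file (sketch; built from the counts above)
data.frame(file  = c("blogs", "news", "twitter"),
           lines = c(length(blog.word.count), length(news.word.count), length(twitter.word.count)),
           words = c(sum(blog.word.count), sum(news.word.count), sum(twitter.word.count)))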
As these files are large and computationally intensive to work with, we will read the first 1000 lines of each original file. This subset should be large enough for exploratory analysis of word frequencies while keeping processing time manageable.
setwd("~/Yale Papers/Data Analysis/Data Science Capstone/final/en_US")
sBlog <- readLines("en_US.blogs.txt", 1000)
sNews <- readLines("en_US.news.txt", 1000)
sTwitter <- readLines("en_US.twitter.txt", 1000)
sData <-c(sBlog, sNews, sTwitter)
dcorpus <-Corpus(VectorSource(sData))
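Note that readLines with n = 1000 takes the first 1000 lines rather than a random sample; if a more representative subset is wanted later, lines could be drawn at random instead (a sketch, assuming the full files were already read into blog.line, news.line, and twitter.line above):
# Draw a random 1000-line sample from each full file (alternative to taking the first lines)
set.seed(1234)
sBlog    <- sample(blog.line, 1000)
sNews    <- sample(news.line, 1000)
sTwitter <- sample(twitter.line, 1000)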
In order to clean the sampled data, I removed punctuation, numbers, extra whitespace, and English stopwords. Then I transformed the text to all lowercase for simplicity.
dcorpus <-tm_map(dcorpus, removePunctuation)
## Warning in tm_map.SimpleCorpus(dcorpus, removePunctuation): transformation
## drops documents
dcorpus <- tm_map(dcorpus, removeNumbers)
## Warning in tm_map.SimpleCorpus(dcorpus, removeNumbers): transformation
## drops documents
dcorpus <- tm_map(dcorpus, stripWhitespace)
## Warning in tm_map.SimpleCorpus(dcorpus, stripWhitespace): transformation
## drops documents
dcorpus <- tm_map(dcorpus, removeWords, stopwords("english"))
## Warning in tm_map.SimpleCorpus(dcorpus, removeWords, stopwords("english")):
## transformation drops documents
dcorpus <- tm_map(dcorpus, content_transformer(tolower))
## Warning in tm_map.SimpleCorpus(dcorpus, content_transformer(tolower)):
## transformation drops documents
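To spot-check that these transformations behaved as expected, the first few cleaned documents can be inspected directly (a small sketch):
# Inspect a few cleaned documents to confirm punctuation, numbers, and stopwords were removed
inspect(dcorpus[1:3])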
I created a Term Document Matrix in order to rank commonly appearing words and provide tabular output of the top words.
gram1 <- as.data.frame(as.matrix(TermDocumentMatrix(dcorpus)))
gram1v <- sort(rowSums(gram1), decreasing = TRUE)
gram1d <- data.frame(word= names(gram1v), freq=gram1v)
gram1d[1:10,]
## word freq
## the the 464
## said said 304
## will will 259
## one one 254
## just just 249
## like like 248
## can can 191
## time time 191
## new new 186
## get get 171
Then, I created a bar chart of word frequencies using ggplot2 to visually show the findings.
ggplot(gram1d[1:30,], aes(x=reorder(word,freq), y=freq)) +
geom_bar(stat="identity", width=0.5, fill= "blue") +
labs(title= "Top Most Common Words") +
xlab("Top Words") +ylab("Frequency") +
theme(axis.text.x = element_text(angle=65, vjust = 0.6))
This is the start of a number of steps that need to be taken in the predictive/Shiny part of this project. One thing I need to address later is the set of warnings I received while conducting this exploratory analysis; I need to understand them better in order to ascertain the potential effect they could have on my prediction models down the road. In addition, I will need to make extensive use of n-grams and tokenization when I create the predictive models, which will be an eye-opening experience. At the end of this capstone project, the Shiny application will allow a non-data scientist to interact with the program and attempt to predict the next word given a string of previous words.
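As a preview of that n-gram step, bigrams can already be tokenized from the cleaned sample with the ngram package loaded above (a minimal sketch reusing dcorpus; the final prediction model will likely use a more complete pipeline):
# Collapse the cleaned corpus into one string and tokenize bigrams (sketch only)
clean.text <- concatenate(sapply(seq_along(dcorpus), function(i) as.character(dcorpus[[i]])))
clean.text <- preprocess(clean.text, case = "lower", fix.spacing = TRUE)  # collapse repeated spaces
bigrams <- ngram(clean.text, n = 2)
head(get.phrasetable(bigrams))  # most frequent bigrams with counts and proportions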