In this project, we take raw text data from three different sources (blogs, Twitter, and news) and use it to build a predictive model that suggests the word a user most likely wants to type next, based on the preceding word.
This report is the milestone for the first and second tasks of the capstone project in the Data Science Specialization on Coursera.
The analysis focuses on extracting basic information from the raw data, cleaning it, and doing some exploratory work.
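Throughout this report I rely on a handful of packages; assuming they are already installed, loading them up front keeps the later chunks self-contained.
library(tm)        # text cleaning: removeWords(), removePunctuation(), stopwords(), ...
library(quanteda)  # tokenization: tokens(), tokens_ngrams()
library(dplyr)     # the %>% pipe
library(ggplot2)   # plotting
library(plotly)    # interactive plots via ggplotly()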
First, we need to download the data and load it into R.
I recommend keeping your working directory as tidy as possible to avoid confusion. I have prepared a Data folder to hold everything needed for this project.
if(!file.exists("Data")){
dir.create("Data", path = "./")
}
So let us begin by downloading the data and saving it into the Data folder in your working directory.
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if(!file.exists("./Data/final")){
  download.file(url, destfile = "./Data/Capstone.zip")
  unzip("./Data/Capstone.zip", exdir = "./Data")
}
Now that we have the data in our working directory, we can take a look at what is inside it. In this project we only use the en_US data inside the final folder, so let's list the file paths in R to make things easier.
fileList <- list.files("./Data/final/en_US/", full.names = TRUE)
fileList
## [1] "./Data/final/en_US/en_US.blogs.txt"
## [2] "./Data/final/en_US/en_US.news.txt"
## [3] "./Data/final/en_US/en_US.twitter.txt"
Let's load it into R!
blogs <- readLines(fileList[1], encoding = "UTF-8", skipNul = TRUE)
news <- readLines(fileList[2], encoding = "UTF-8", skipNul = TRUE)
## Warning in readLines(fileList[2], encoding = "UTF-8", skipNul = TRUE):
## incomplete final line found on './Data/final/en_US/en_US.news.txt'
twitter <- readLines(fileList[3], encoding = "UTF-8", skipNul = TRUE)
Great! Now, let's gather some basic information about the three files we are going to use.
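A simple way to do this is to count the lines and check the size of each file. The line counts are stored in a vector len because the sampling step later relies on them; the sizeMB name is just an illustrative helper.
len <- c(length(blogs), length(news), length(twitter))    # lines per file
sizeMB <- round(file.size(fileList) / 1024^2, 1)           # file size in megabytes
data.frame(File = basename(fileList), SizeMB = sizeMB, Lines = len)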
As you can see, the data is massive, so we need a strategy for processing it.
An important step before we even consider analyzing the data is to clean it. The goal of this segment is to end up with processed data that is free of profanity and ready to be analyzed.
First things first, let us decide what and how much data we are going to use. I am going to combine samples from all three sources, in different proportions.
The reasoning behind this decision is that blogs and news offer more structured and formal text to work with, and they contain fewer abbreviations, which is beneficial for our task.
Next, I sample from each source and put the result into a new object called combData.
set.seed(1)
# sample 5% of the blog lines, 10% of the news lines, and 1% of the tweets
combData <- c(sample(blogs, 0.05*len[1]), sample(news, 0.1*len[2]), sample(twitter, 0.01*len[3]))
Next we want to remove lines that contain profanity. To do this, I used a collection of profanity words from this link.
Load it into R and drop every line that contains any of those words using the following lines.
profan <- readLines("./Data/PROFANITY.txt")
# drop every line that contains one of the profanity words (whole-word, case-insensitive match)
profanPattern <- paste0("\\b(", paste(profan, collapse = "|"), ")\\b")
combData <- combData[!grepl(profanPattern, combData, ignore.case = TRUE)]
Last but not least, remove everything (stop words, punctuation, numbers, stray whitespace, and non-ASCII characters) that may hinder our exploration and analysis.
combData <- tolower(combData)                                        # lower-case everything
combData <- removeWords(combData, stopwords("en"))                   # drop English stop words
combData <- removePunctuation(combData, preserve_intra_word_contractions = TRUE)
combData <- removeNumbers(combData)
combData <- stripWhitespace(combData)                                # collapse repeated spaces
combData <- iconv(combData, "UTF-8", "ASCII", sub = "")              # drop non-ASCII characters
After cleaning the data, we can save it to a file so we don't have to repeat all of this every time we restart R.
if(!file.exists("./Data/en_CleanData.txt")){
writeLines(combData, "./Data/en_CleanData.txt")
}
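In a fresh session the cleaned sample can then be reloaded straight from that file, for example:
combData <- readLines("./Data/en_CleanData.txt")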
As covered in the Statistical Inference class, we can take a sample out of the population; the idea is that this sample can represent the whole data set.
First things first, I wanted to take a look at which words show up the most. We can do this with the tokens() function from the quanteda package.
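The chunk below is a minimal sketch of that simulation, assuming the packages loaded earlier: it tokenizes random samples of growing size (10 up to 10^4 lines) and tabulates the word frequencies for each size. The names sizes and freqBySize are just illustrative.
set.seed(123)                                    # illustrative seed
sizes <- 10^(1:4)                                # sample sizes: 10, 100, 1000, 10000 lines
freqBySize <- lapply(sizes, function(n) {
  tok <- tokens(sample(combData, n), what = "word")
  sort(table(unlist(tok)), decreasing = TRUE)
})
head(freqBySize[[length(sizes)]], 10)            # most frequent words at the largest size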
Based on this simulation with samples of 10 up to 10^4 lines from the combined data, the words "one" and "will" come up as the most frequent. But we shall make another graph to make things easier to see.
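A sketch of that graph, tracking the counts of a few frequent words across the sample sizes computed above (the word selection here is only illustrative):
trackWords <- c("one", "will", "said", "just", "like")   # illustrative selection
growth <- do.call(rbind, lapply(seq_along(sizes), function(i) {
  counts <- freqBySize[[i]][trackWords]
  data.frame(Size = sizes[i], Word = trackWords,
             Freq = as.integer(ifelse(is.na(counts), 0, counts)))
}))
gGrowth <- ggplot(growth, aes(x = Size, y = Freq, colour = Word)) + geom_line() + geom_point() +
  scale_x_log10() + labs(x = "Sample size (lines)", y = "Count", title = "Word counts across sample sizes")
ggplotly(gGrowth)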
The most important thing we can conclude from this graph is that the rate at which each word appears grows steadily with the sample size; there is no erratic pattern, so we can assume that if we keep increasing the sample size, the word ranking will probably stay the same, or change only slightly.
This time we will use 2*10^4 sampled lines to plot a graph of the words that appear the most.
set.seed(12)
most <- sample(combData, 2*10^4) %>%
  tokens(what = "word") %>%
  unlist() %>%
  table() %>%
  sort(decreasing = TRUE)
Now that we have the frequency table, we can use ggplot2 to plot the data and plotly to make the plot interactive.
most <- data.frame(most[1:15])
names(most)[1] <- "Word"
g <- ggplot(most, aes(x = Word, y = Freq)) + geom_col() +
  labs(x = "", y = "Freq", title = "Most frequent words in the sample") +
  theme(axis.text.x = element_text(angle = 45))
ggplotly(g)
Good, this is pretty consistent with our findings in the first exploratory graph above.
The final objective for this assignment is to make a word pairs graph. This way we can get an idea of what the next word will probably be given the first word.
That being said, let's create bigram and trigram frequency tables using the same approach as before.
bigram <- sample(combData, 2*10^4) %>% tokens(what = "word") %>% tokens_ngrams(n = 2, concatenator = " ") %>% unlist() %>% table() %>% sort(decreasing = TRUE)
trigram <- sample(combData, 4*10^4) %>% tokens(what = "word") %>% tokens_ngrams(n = 3, concatenator = " ") %>% unlist() %>% table() %>% sort(decreasing = TRUE)
Great! Now let's see what this looks like; we shall make another plot to make things easier to read.
dfBi <- data.frame(bigram[1:15])
dfTri <- data.frame(trigram[1:15])
names(dfBi)[1] <- "Word"
names(dfTri)[1] <- "Word"
gBi <- ggplot(dfBi, aes(x = Word, y = Freq)) + geom_col() +
  labs(x = "", y = "Freq", title = "Most frequent word pairs (bigrams)") +
  theme(axis.text.x = element_text(angle = 45))
gTri <- ggplot(dfTri, aes(x = Word, y = Freq)) + geom_col() +
  labs(x = "", y = "Freq", title = "Most frequent word triples (trigrams)") +
  theme(axis.text.x = element_text(angle = 45))
ggplotly(gBi)
Now let's see what the trigram looks like.
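ggplotly(gTri)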