I present the work I’ve done for each task. First of all, I’m working in a directory on my computer.
setwd("C:/Users/Bruno Gonzalez/Google Drive/JHU_Data_Science/10 Data Science Capstone/")
Task 0 was only to download the file. The code for doing this is the following.
download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip", destfile = "./Coursera-SwiftKey.zip")
For the examples presented in this work I’m only using the “en_US.news” data, to save some memory. I start by reading the file into a character vector, one element per line.
nws <- readLines("./final/en_US/en_US.news.txt")
## Warning in readLines("./final/en_US/en_US.news.txt"): incomplete final line
## found on './final/en_US/en_US.news.txt'
This function gets the length, in characters, of the longest line.
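# Return the number of characters in the longest element of dat.
# nchar() is vectorized, so no explicit loop is needed; this also
# works when dat has a single line, where 2:n would count backwards.
fun <- function(dat){
  max(nchar(dat))
}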
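As a quick check of the idea, here is a minimal sketch on a toy vector. The helper name `longest_line_length` and the sample lines are made up for illustration and are not part of the original work.

```r
# Hypothetical helper: length (in characters) of the longest element
# of a character vector.
longest_line_length <- function(dat) {
  max(nchar(dat))
}

# Made-up sample lines, not taken from the corpus.
toy <- c("short", "a somewhat longer line", "mid length")

longest_line_length(toy)     # number of characters in the longest line
toy[which.max(nchar(toy))]   # the longest line itself
```

On the real data the same call would be applied to `nws`.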
After this I remove all punctuation marks and all digits.
tok_nws <- gsub("[[:punct:]]", "", nws)
tok_nws <- gsub("[[:digit:]]+", "", tok_nws)
Then I split each line into words and convert the resulting character vector to a factor. This makes the data easier to analyze.
tok_nws <- strsplit(tok_nws, "\\s+")
tok2_nws <- unlist(tok_nws)
tok2_nws <- as.factor(tok2_nws)
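The cleaning and splitting steps can be tried on a couple of made-up lines, as in the sketch below. Dropping empty tokens with `nzchar()` is an assumption added here and is not a step from the original code; it is included because a line that begins with whitespace after cleaning produces an empty first token.

```r
# Made-up sample lines, not taken from the corpus.
toy <- c("The cat sat, the dog ran.", "2012 was a good year!")

clean <- gsub("[[:punct:]]", "", toy)       # drop punctuation
clean <- gsub("[[:digit:]]+", "", clean)    # drop digits
tokens <- unlist(strsplit(clean, "\\s+"))   # split on whitespace, flatten

# Assumption (not in the original code): remove empty strings, which
# appear when a cleaned line starts with whitespace.
tokens <- tokens[nzchar(tokens)]
tokens <- as.factor(tokens)
levels(tokens)
```

Note that “The” and “the” remain separate levels, because nothing here changes the case of the words.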
I start by looking at the words that are repeated most often. The 30 most frequent are shown in the table.
sum_nws <- summary(tok2_nws, maxsum = 31)
sum_nws
##     the      to     and       a      of      in     for    that      is
##  132178   68718   65667   63707   58822   47980   25819   24950   21727
##      on    said    with     The     was      at      as      it      he
##   19622   19112   18966   18891   17576   15432   13432   13024   12858
##      be       I    from     his    have     are      by     has      an
##   11489   11480   11359   11336   10915   10600    9820    9304    9043
##    will     who     not (Other)
##    8300    8189    8175 1802814
The bar chart shows that “the” is the most frequent word in the text, followed by “to” and “and”.
barplot(sum_nws[1:30])
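The table above lists “the” and “The” as separate words, since the counts are case sensitive. The sketch below, on made-up tokens, shows one possible way to count them together by lowercasing first. The lowercasing step and the use of `table()` with `sort()` are assumptions for illustration, not the method used in this report.

```r
# Made-up tokens, not taken from the corpus.
tokens <- c("The", "cat", "saw", "the", "dog", "and", "the", "bird")

# Case-sensitive counts: "The" and "the" are separate entries.
case_sensitive <- sort(table(tokens), decreasing = TRUE)
case_sensitive

# Assumption: lowercase first, so both spellings are counted together.
freq <- sort(table(tolower(tokens)), decreasing = TRUE)
freq
```

The result can be passed to `barplot()` in the same way as `sum_nws`, for example `barplot(freq[1:3])`.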
Next, I analyze which words most frequently follow “the”.
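# Anchor the pattern so only the whole tokens "the" / "The" match;
# without ^ and $ it also matches "there", "other", "they", ...
x <- grep("^[Tt]he$", tok2_nws)
# Index of the following token; drop any index past the end of the vector
x <- x + 1
x <- x[x <= length(tok2_nws)]
sig_the <- tok2_nws[x]
sum_the <- summary(sig_the)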
Plotting this information shows that “first” is the most frequent word after “the”.
barplot(sum_the[1:23])
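The next-word lookup can be illustrated on a small made-up token vector. This is a minimal sketch that assumes whole-word matching is what is intended; the variable names `idx`, `nxt` and `after_the` are hypothetical.

```r
# Made-up tokens, not taken from the corpus.
tokens <- c("the", "first", "day", "there", "was", "The", "first",
            "game", "of", "the", "season", "the")

# Anchored pattern: matches only the whole tokens "the" or "The".
idx <- grep("^[Tt]he$", tokens)

# For comparison, the unanchored pattern also matches "there".
idx_loose <- grep("[Tt]he", tokens)

nxt <- idx + 1
nxt <- nxt[nxt <= length(tokens)]   # the last token has no successor

after_the <- sort(table(tokens[nxt]), decreasing = TRUE)
after_the
```

Because the tokens of all lines are concatenated into one vector, the word following a line-final “the” is the first word of the next line; the sketch does not try to handle that case.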
For the modeling, I wrote a function that creates a matrix of frequencies for pairs of consecutive words.
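gram_mc <- function(s){
  x <- levels(s)
  l <- length(x)
  n <- length(s)
  # Rows: first word; columns: second word
  mat <- matrix(0, nrow = l, ncol = l, dimnames = list(x, x))
  for (i in seq_len(l)){
    # Exact matches of word i (grep would also match words containing it)
    a <- which(s == x[i])
    # Words that follow word i, taken from s itself rather than the
    # global tok2_nws; the last position has no following word
    b <- s[a[a < n] + 1]
    for (j in seq_len(l)){
      mat[i,j] <- sum(b == x[j])
    }
  }
  mat
}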
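For comparison, the same kind of matrix can be sketched on toy data with `table()`, by pairing every token with its successor. This is an alternative shown for illustration under made-up input, not the function used in this report.

```r
# Made-up tokens, not taken from the corpus.
s <- factor(c("the", "cat", "saw", "the", "dog", "saw", "the", "cat"))

n <- length(s)
# Pair every token with its successor:
# rows = first word, columns = second word.
mat <- table(first = s[-n], second = s[-1])
mat
mat["the", "cat"]   # how many times "cat" directly follows "the"
```

Either way the result has one row and one column per distinct word, so for a full vocabulary the matrix becomes very large.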
The rows correspond to the first word and the columns to the second. This matrix can be seen as a Markov chain: once each row is divided by its row total, it becomes a matrix of transition probabilities. The probabilities for the third word given the first can then be calculated by multiplying this matrix by itself.
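The row normalization and the matrix product can be sketched on a small matrix of made-up counts. The words `a`, `b`, `c` and the numbers are invented for illustration; the two-step result relies on the first-order Markov assumption, that is, it ignores which word actually appeared in the middle.

```r
# Made-up bigram counts (rows = first word, columns = second word).
words <- c("a", "b", "c")
counts <- matrix(c(0, 2, 2,
                   1, 0, 3,
                   4, 0, 0),
                 nrow = 3, byrow = TRUE, dimnames = list(words, words))

# Divide each row by its total to get one-step transition probabilities.
P <- counts / rowSums(counts)

# Two-step probabilities: P2[i, j] estimates the probability that the
# third word is j given that the first word is i.
P2 <- P %*% P
P2
```

A word that is never followed by anything has a row total of zero, and its row would become `NaN` after the division, so such rows would need separate handling.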
This project has been a real challenge for me, but I think I’ve made the right progress toward the goal.