Konrad Zdeb
20th August 2015
This presentation fulfils the assessment requirements for the Data Science Specialisation online course delivered via Coursera. It focuses on the predictive Shiny application (https://konrad.shinyapps.io/CapstonePredApp).
The project uses the data provided during the course. Due to its substantial size, the data was sampled and only portions of the English files were used. In practice one would intend to use the full data set, but this was not possible due to technical limitations. The data was sampled using the code below.
# After reading the files for the first paper:
# keep a random 5% of the lines from each file
lst_fls_sbs <- lapply(lst_fls, function(x) x[sample(seq_along(x), size = floor(length(x) * 0.05))])
# When developing the app: keep a random 2% of the English text
txt_eng <- sample(txt_eng, size = floor(length(txt_eng) / 50))
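The snippets that follow assume a tm corpus built from the sampled text. A minimal sketch of that step is given below; the specific cleaning transformations are an assumption and not necessarily the app's exact pre-processing.
library(tm)
# Build a volatile corpus from the sampled English text
corpus <- VCorpus(VectorSource(txt_eng))
# Basic cleaning (assumed): lower-case, then strip punctuation, numbers and extra whitespace
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)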
The app uses the tm and RWeka libraries to build term-document matrices that are used to provide the end user with predictions. This is illustrated in the code snippet below.
library(RWeka)
# For the 4-gram model; mc.cores = 1 is commonly set to avoid issues
# between RWeka tokenizers and tm's parallel processing
options(mc.cores = 1)
fourNgramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 4, max = 4))
four_tdm <- TermDocumentMatrix(corpus, control = list(tokenize = fourNgramTokenizer))
# Frequency table of 4-grams, sorted from most to least common
four_ngram <- data.frame(sort(rowSums(as.matrix(four_tdm)), decreasing = TRUE))
The resulting frequency tables are sorted by how likely each n-gram is to occur and are searched according to the user input.
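To give a concrete picture of the lookup, the sketch below matches the last three words typed by the user against the sorted 4-gram table; the function predict_next_word and its matching logic are illustrative assumptions rather than the app's actual code.
# Hypothetical lookup: match the user's last three words against the
# sorted 4-gram table and return the most likely completion
predict_next_word <- function(input, ngram_df) {
  words <- tail(unlist(strsplit(tolower(input), "\\s+")), 3)
  prefix <- paste(words, collapse = " ")
  # Row names hold the 4-grams; rows are already sorted by frequency
  hits <- rownames(ngram_df)[startsWith(rownames(ngram_df), paste0(prefix, " "))]
  if (length(hits) == 0) return(NA_character_)
  # The last word of the most frequent matching 4-gram is the prediction
  tail(unlist(strsplit(hits[1], " ")), 1)
}
# Example call: predict_next_word("thanks for the", four_ngram)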
The screenshot below illustrates the application at work: