Konrad Zdeb
20th August 2015
This presentation fulfils the assessment requirements for the Data Science Specialisation online course delivered via Coursera. It focuses on the predictive Shiny application (https://konrad.shinyapps.io/CapstonePredApp).
The project uses the data provided during the course. Due to its substantial size, the data was sampled and only portions of the English files were used. In practice one would intend to use the full data set, but this was not possible due to technical limitations. The data was sampled using the code below.
# After reading the files for the first paper:
# keep a random 5% of the lines from each file
lst_fls_sbs <- lapply(lst_fls, function(x) x[sample(seq_along(x), size = floor(length(x) * 0.05))])
# When developing the app: keep a random 2% of the English text
txt_eng <- sample(txt_eng, size = floor(length(txt_eng) / 50))
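The snippets that follow assume a tm corpus built from the sampled text. A minimal sketch of that step is given below; the specific cleaning transformations are an assumption and not necessarily the app's exact pre-processing.
library(tm)
# Build a volatile corpus from the sampled English text
corpus <- VCorpus(VectorSource(txt_eng))
# Basic cleaning (assumed): lower-case, then strip punctuation, numbers and extra whitespace
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)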
The app uses the tm and RWeka libraries to build term-document matrices that are used to provide the end user with predictions. This is illustrated in the code snippet below.
library(RWeka)
# For the 4-gram model; mc.cores = 1 is commonly set to avoid issues
# between RWeka tokenizers and tm's parallel processing
options(mc.cores = 1)
fourNgramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 4, max = 4))
four_tdm <- TermDocumentMatrix(corpus, control = list(tokenize = fourNgramTokenizer))
# Frequency table of 4-grams, sorted from most to least common
four_ngram <- data.frame(sort(rowSums(as.matrix(four_tdm)), decreasing = TRUE))
The resulting frequency tables are sorted by how likely each n-gram is to occur and are searched according to the user input.
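To give a concrete picture of the lookup, the sketch below matches the last three words typed by the user against the sorted 4-gram table; the function predict_next_word and its matching logic are illustrative assumptions rather than the app's actual code.
# Hypothetical lookup: match the user's last three words against the
# sorted 4-gram table and return the most likely completion
predict_next_word <- function(input, ngram_df) {
  words <- tail(unlist(strsplit(tolower(input), "\\s+")), 3)
  prefix <- paste(words, collapse = " ")
  # Row names hold the 4-grams; rows are already sorted by frequency
  hits <- rownames(ngram_df)[startsWith(rownames(ngram_df), paste0(prefix, " "))]
  if (length(hits) == 0) return(NA_character_)
  # The last word of the most frequent matching 4-gram is the prediction
  tail(unlist(strsplit(hits[1], " ")), 1)
}
# Example call: predict_next_word("thanks for the", four_ngram)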
The screenshot below illustrates the application at work: