Capstone Project: Creating a Word Prediction Model

5/2/2021

Coursera Capstone Project Description

The capstone of the Coursera R Programming course walked students through the development of prediction models based on data provided by the instructors.

While writing code to clean and process the provided data, I noticed its sources: blogs, news, and Twitter. From my own experience with these sources, I wanted to give the user a tailored experience, since someone writing for a news site might use different words than an individual writing a personal blog or an influencer on Twitter.

-To do this, in the final R Shiny application, I developed a tabbed tool that includes the data summary and plots from the milestone project to give the user an idea of each source’s characteristics.

-For the prediction model, I included an option for the user to choose whether predictions are based on the first, second, or third source, or on the complete sample of all the sources.

-While the initial data was provided by the Coursera staff, this model was developed so that a user can supply any three data sets they have at their disposal. As RPubs restricts the amount of data a user can upload, code for extracting sample data is included in the first tab. I recommend using the sample sources provided by the course instructors for testing the tool.

Developing Table of Related Words

This tool uses several function definitions, since every sample source needs to be run through the same code to produce the objects the prediction model requires.

-The first function, ngrm_tbl, produces a table of related words. These are words that appear next to each other as pairs or triplets, and the table can later be called to help predict the next word.

The second function, prd_nxt_wrd, calls the tables created by the first function to find which word is most likely to follow the input, based on the provided samples. If the input word or words do not show up in the tables, it will not reply with a prediction, so it is best to check the plot results for potential pairs or triplets.

# Split each n-gram in w$word into its leading words (mnsgrm) and its last word (lstwrd).
ngrm_tbl <- function(w){
    s <- strsplit(as.character(w$word), " ")
    w$mnsgrm <- NA
    w$lstwrd <- NA
    for (l in seq_len(nrow(w))){
        # head(x, -1) keeps everything but the last word; a single word yields an empty prefix
        w$mnsgrm[l] <- paste(head(s[[l]], -1), collapse = " ")
        w$lstwrd[l] <- tail(s[[l]], 1)
    }
    return(as.data.frame(w, row.names = NULL, stringsAsFactors = FALSE))
}
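The report does not show how the input table for ngrm_tbl is built, so the following is only a minimal sketch of one way to get there in base R. It assumes the table has a `word` column of space-separated n-grams sorted by frequency; the helper name `count_ngrams`, the `freq` column, and the sample lines are hypothetical and not taken from the app. The last two lines show a vectorised alternative to the loop above for splitting each n-gram into its leading words and last word.

```r
# Hypothetical helper: count the n-grams of size n in a character vector of lines.
count_ngrams <- function(lines, n) {
  grams <- as.character(unlist(lapply(lines, function(ln) {
    wrds <- strsplit(tolower(ln), "[^a-z']+")[[1]]  # split on anything that is not a letter or apostrophe
    wrds <- wrds[wrds != ""]
    if (length(wrds) < n) return(character(0))      # line too short to hold an n-gram
    starts <- seq_len(length(wrds) - n + 1)
    vapply(starts, function(i) paste(wrds[i:(i + n - 1)], collapse = " "), character(1))
  })))
  counts <- sort(table(grams), decreasing = TRUE)   # most frequent n-gram first
  data.frame(word = as.character(names(counts)),
             freq = as.integer(counts),
             stringsAsFactors = FALSE)
}

# Made-up sample lines, for illustration only.
smpl <- c("Thanks for the follow", "thanks for the help", "Thanks for coming")
tri_tbl <- count_ngrams(smpl, 3)

# Vectorised split into leading words and last word (n-grams of two or more words).
tri_tbl$mnsgrm <- sub(" [^ ]+$", "", tri_tbl$word)  # drop the last word
tri_tbl$lstwrd <- sub("^.* ", "", tri_tbl$word)     # keep only the last word
```

Here the most frequent triplet is "thanks for the", which splits into the prefix "thanks for" and the last word "the".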

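library(dplyr)    # %>% and filter()
library(stringi)  # stri_count_words()

# x: input text; y: table of triplets; z: table of pairs (both built by ngrm_tbl)
prd_nxt_wrd <- function(x, y, z){
    t <- tolower(x)
    q <- paste(tail(unlist(strsplit(t, ' ')), 2), collapse = " ")  # last two words
    r <- paste(tail(unlist(strsplit(t, ' ')), 1), collapse = " ")  # last word
    if (stri_count_words(x) == 2){
        if (q %in% y$mnsgrm){
            # the two-word input matches the start of a triplet
            w <- y %>% filter(mnsgrm == q) %>% .$lstwrd
            return(w[1])
        } else if (r %in% z$mnsgrm){
            # otherwise fall back to pairs that start with the last word
            w <- z %>% filter(mnsgrm == r) %>% .$lstwrd
            return(w[1])
        } else {return('no prediction from input')}
    } else if (stri_count_words(x) == 1){
        if (r %in% z$mnsgrm){
            w <- z %>% filter(mnsgrm == r) %>% .$lstwrd
            return(w[1])
        } else {return('no prediction from input')}
    } else {return('no prediction from input')}  # inputs of any other length
}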
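To make the three possible outcomes concrete, here is a small worked example. It is a sketch under assumptions: the two toy tables are made up, and `demo_nxt_wrd` is a hypothetical base R restatement of the lookup order (triplets first, then pairs) so the snippet runs without dplyr or stringi. It is not the function used in the app.

```r
# Toy tables in the shape ngrm_tbl() returns (values made up for illustration).
tri <- data.frame(mnsgrm = c("thanks for", "one of"),
                  lstwrd = c("the", "the"),
                  stringsAsFactors = FALSE)
bi  <- data.frame(mnsgrm = c("for", "happy"),
                  lstwrd = c("the", "birthday"),
                  stringsAsFactors = FALSE)

# Hypothetical restatement of the lookup: try triplets, then fall back to pairs.
demo_nxt_wrd <- function(x, tri, bi) {
  wrds <- strsplit(tolower(trimws(x)), " +")[[1]]
  q <- paste(tail(wrds, 2), collapse = " ")  # last two words
  r <- paste(tail(wrds, 1), collapse = " ")  # last word
  if (length(wrds) == 2 && q %in% tri$mnsgrm) {
    return(tri$lstwrd[tri$mnsgrm == q][1])
  }
  if (length(wrds) %in% 1:2 && r %in% bi$mnsgrm) {
    return(bi$lstwrd[bi$mnsgrm == r][1])
  }
  "no prediction from input"
}

res_tri  <- demo_nxt_wrd("Thanks for", tri, bi)  # matches a triplet prefix
res_bi   <- demo_nxt_wrd("so happy", tri, bi)    # no triplet, falls back to the pair "happy"
res_none <- demo_nxt_wrd("purple", tri, bi)      # in neither table
```

With these toy tables, `res_tri` is "the", `res_bi` is "birthday", and `res_none` is the "no prediction from input" message.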

Predicting the Next Word

The following code lets the user choose which tables the prediction function uses when calling the next word.

-The user selects the sources to predict from based on the characteristics of the data sets or their own particular interests. One input selects the table of triplets and a second selects the table of pairs.

library(shiny)  # provides reactive(); this snippet belongs inside the Shiny server function

# Triplet table for the source chosen in the "table" input
srcinput <- reactive({
    switch(input$table,
           "Text 1" = txt_1_3wrd_tbl(),
           "Text 2" = txt_2_3wrd_tbl(),
           "Text 3" = txt_3_3wrd_tbl(),
           "Complete Data" = txt_dt_3wrd_tbl()
           )
    })

# Pair table for the source chosen in the "table2" input (server side, requires library(shiny))
srctbl2 <- reactive({
    switch(input$table2,
           "Text 1" = txt_1_2wrd_tbl(),
           "Text 2" = txt_2_2wrd_tbl(),
           "Text 3" = txt_3_2wrd_tbl(),
           "Complete Data" = txt_dt_2wrd_tbl()
           )
    })

# UI side (requires library(shiny)): one selector per table type
sidebarLayout(
    sidebarPanel(
        selectInput("table", "Choose Source Text for Prediction based on Triples:",
                    choices = c("Complete Data", "Text 1", "Text 2", "Text 3")),
        selectInput("table2", "Choose Source Text for Prediction based on Pairs:",
                    choices = c("Complete Data", "Text 1", "Text 2", "Text 3"))
    ))
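The snippets above only work inside a running Shiny app, where `input` and `reactive()` are available. As a rough illustration of how the selector and the reactive fit together, here is a minimal self-contained app. It is a sketch, not the deployed application: the `pair_tbls` list, its made-up rows, the `phrase` text input, and the `nxt` output are all hypothetical, and only the pair lookup is wired up.

```r
library(shiny)

# Hypothetical stand-ins for the pair tables built from each source.
pair_tbls <- list(
  "Text 1" = data.frame(mnsgrm = "good", lstwrd = "morning", stringsAsFactors = FALSE),
  "Text 2" = data.frame(mnsgrm = "good", lstwrd = "news", stringsAsFactors = FALSE)
)

ui <- fluidPage(
  sidebarLayout(
    sidebarPanel(
      selectInput("table2", "Choose Source Text for Prediction based on Pairs:",
                  choices = names(pair_tbls)),
      textInput("phrase", "Type a word:", value = "good")
    ),
    mainPanel(textOutput("nxt"))
  )
)

server <- function(input, output) {
  # Re-evaluated whenever the user changes the selected source.
  srctbl2 <- reactive(pair_tbls[[input$table2]])

  output$nxt <- renderText({
    tbl <- srctbl2()
    hit <- tbl$lstwrd[tbl$mnsgrm == tolower(input$phrase)]
    if (length(hit) > 0) hit[1] else "no prediction from input"
  })
}

app <- shinyApp(ui, server)
# runApp(app)  # uncomment to launch in an interactive session
```

Switching the selector between "Text 1" and "Text 2" changes which table the reactive returns, so the same typed word can lead to a different prediction.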

R Shiny Web Application

R Shiny provides a convenient platform for hosting reports and applications.

There are various improvements that could be made to the code:

-One would be to alleviate the memory restrictions imposed by RPubs. This could mean hosting the tool on an RPubs account that is funded for greater memory, or hosting the tool on a different server that is not under those restrictions. This would allow the user to sample larger files for more thorough analysis.

-Right now, the predictions are based only on pairs and triplets. This could easily be expanded to include strings of four or even five words by adjusting the functions.

-Allow the user to see not only the next likely word, but alternatives as well.
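The last two improvements could be sketched together as follows. This is only one possible approach under stated assumptions, not a planned implementation: `predict_top_k` is a hypothetical function, the tables are assumed to carry a `freq` column alongside `mnsgrm` and `lstwrd`, and the toy values are made up. Unlike the current function, it uses the last words of longer inputs and steps down from the longest table to the shortest until a match is found.

```r
# tbls: list of tables ordered from the longest n-gram to the shortest,
# each with columns mnsgrm (leading words), lstwrd (last word) and freq.
predict_top_k <- function(x, tbls, k = 3) {
  wrds <- strsplit(tolower(trimws(x)), " +")[[1]]
  for (tbl in tbls) {
    if (nrow(tbl) == 0) next
    n_ctx <- length(strsplit(tbl$mnsgrm[1], " ")[[1]])  # leading words this table expects
    if (length(wrds) < n_ctx) next                      # input too short for this table
    ctx <- paste(tail(wrds, n_ctx), collapse = " ")
    hits <- tbl[tbl$mnsgrm == ctx, ]
    if (nrow(hits) > 0) {
      hits <- hits[order(-hits$freq), ]                 # most frequent continuation first
      return(head(hits$lstwrd, k))
    }
  }
  character(0)                                          # nothing matched in any table
}

# Toy tables (made-up values) for four-word, three-word and two-word strings.
four <- data.frame(mnsgrm = "thanks for the", lstwrd = c("follow", "support"),
                   freq = c(5, 3), stringsAsFactors = FALSE)
tri  <- data.frame(mnsgrm = "for the", lstwrd = c("first", "record"),
                   freq = c(9, 4), stringsAsFactors = FALSE)
bi   <- data.frame(mnsgrm = "the", lstwrd = c("same", "first", "best"),
                   freq = c(7, 6, 2), stringsAsFactors = FALSE)
tbls <- list(four, tri, bi)

p_four <- predict_top_k("Thanks for the", tbls)   # matched in the four-word table
p_tri  <- predict_top_k("waiting for the", tbls)  # falls back to the three-word table
p_bi   <- predict_top_k("the", tbls, k = 2)       # only the two-word table applies
```

Returning a vector instead of a single word lets the app display the alternatives directly, and adding a five-word table only requires placing it at the front of the list.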

The final R Shiny application can be found at this link: https://jmastapeter.shinyapps.io/Capstone_Project_Prediction_Tool/

The server and UI code, along with the sample data used to test it, can be found in this GitHub repository: https://github.com/jmastapeter/Coursera_Capstone_Final