Creative Writing for Children

Parents and teachers all over the world use stories to spend quality time with children. Stories also help children learn and develop their creativity, imagination, and language skills. One way to enrich this activity is to give children the opportunity to develop their own stories, which builds their creative writing skills while keeping the learning fun.

R, together with packages that implement text generation algorithms, makes it possible to build an automatic text generator. Such a generator can help teachers and caregivers create sets of writing tasks for children. In this article, we will walk step by step through developing a text generator for creative writing activities using the Markov Chain algorithm in R. Further exploration of this project may lead to the development of an educational app to improve children’s creative writing.

Markov Chain Algorithm

A Markov Chain is a mathematical model of a stochastic process in which the probability of the next state (e.g. will it rain tomorrow?) depends only on the current state. Using this principle, a Markov Chain can predict the next word based on the last word typed. It models the transition probabilities between states, where in NLP each state is represented by a term or word. Markov Chains are a simple yet effective method for building a text generation model¹.
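To make the idea of transition probabilities concrete, here is a minimal base R sketch. It is an illustration only: the word sequence is made up for this example and does not come from the corpus used later in the article.

```r
# Toy word sequence (made up for illustration, not from the corpus)
words <- c("the", "rose", "is", "red", "and", "the", "fox", "is", "wise",
           "and", "the", "rose", "is", "unique")

# Pair every word with the word that follows it
current_word <- head(words, -1)
next_word    <- tail(words, -1)

# Count transitions, then divide each row by its total to get probabilities
transition_counts <- table(current_word, next_word)
transition_probs  <- prop.table(transition_counts, margin = 1)

# Probability of each possible word after "the"
round(transition_probs["the", ], 2)
```

In this toy sequence, "the" is followed by "rose" twice and by "fox" once, so the row for "the" gives about 0.67 to "rose" and 0.33 to "fox". A text generator then samples the next word according to these row probabilities.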

Libraries

We will use the following packages to build a text generator with the Markov Chain algorithm:

# data wrangling
library(tidyverse)

# text processing
library(tidytext)
library(textclean)
library(tokenizers)

# markov chain
library(markovchain)

Dataset

A Markov Chain model generates text that is only as good as its input text. Therefore, we should use a decent piece of literature to build a decent text generator. We will be using The Little Prince Corpus dataset from The AMR Bank. This corpus is an annotated version of the novel The Little Prince by Antoine de Saint-Exupéry, published in 1943.

# read data
tlp <- read.delim("the_little_prince.txt", 
                  col.names = "text")

head(tlp,10)

The above format is a simplified version of The Little Prince Corpus without metadata. We will use only the rows that contain sentences (each labeled with an index before the sentence).

Data Pre-processing

As in most NLP tasks, data cleaning or preprocessing takes up a major part of the process. Below, we perform several data cleaning steps to obtain a single sequence of words from The Little Prince story.

tlp_clean <- tlp %>% 
  slice(-1) %>% # remove first line (version info)
  filter(!str_detect(text, "[/:]"), # remove lines with certain characters
         !str_detect(text, "Chapter")) # remove lines with certain string
head(tlp_clean)

tlp_clean  <- tlp_clean %>% 
  mutate(text = tolower(text) %>% # tolower sentences
           replace_contraction() %>%  # expand contraction
           replace_white() %>%  # replace double white space into single space
           str_remove_all(pattern = "lpp_1943.") %>% # remove pattern
           str_remove_all(pattern = "[0-9]") %>% # remove numbers
           str_remove_all(pattern = "[()]") %>% # remove specific punctuation
           str_remove_all(pattern = "--") %>%
           str_replace_all(pattern = " - ", replacement = "-") %>%  # replace pattern
           str_replace_all(pattern = "n't", replacement = "not") %>% 
           str_remove(pattern = "[.]") %>% # remove first matched pattern
           str_remove(pattern = " "))
            

# glimpse data; first 10 sentences
head(tlp_clean, 10)

# split words from sentences
text_tlp <- tlp_clean %>% 
   pull(text) %>% 
   strsplit(" ") %>% 
   unlist() 

text_tlp %>% head(27)

#>  [1] "once"        "when"        "i"           "was"         "six"        
#>  [6] "years"       "old"         "i"           "saw"         "a"          
#> [11] "magnificent" "picture"     "in"          "a"           "book"       
#> [16] ","           "called"      "true"        "stories"     "from"       
#> [21] "nature"      ","           "about"       "the"         "primeval"   
#> [26] "forest"      "."

Once we have the cleaned data, we can continue to build the text generation model.

Model Fitting

We will use markovchainFit() from the markovchain package. The function takes our data as a sequence of words and learns the probability of occurrence of each following word given the previous word. By default, it conditions on only the one previous word (the 1-gram setting). We could also condition on the previous 2 or 3 words for higher predictive performance (for example, for predictive text), but at a higher computational cost. Because we only aim to build a text generator that produces random initial words, the 1-gram setting is sufficient.

fit_markov <- markovchainFit(text_tlp)
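The trade-off of using more than one previous word can be sketched in base R. The example below is only an illustration under stated assumptions: the word sequence is made up, and the approach (treating each pair of consecutive words as one state) is a generic way to condition on two words, not the markovchain package's own API.

```r
# Toy word sequence (made up for illustration, not from the corpus)
words <- c("i", "saw", "a", "flower", "and", "i", "saw", "a", "fox",
           "and", "i", "drew", "a", "sheep")

n_words <- length(words)

# Each state is a pair of consecutive words; the target is the word after it
state     <- paste(words[1:(n_words - 2)], words[2:(n_words - 1)])
next_word <- words[3:n_words]

# Transition probabilities from two-word states
pair_probs <- prop.table(table(state, next_word), margin = 1)

# Possible continuations after the state "saw a"
pair_probs["saw a", pair_probs["saw a", ] > 0]
```

Because every distinct pair of words becomes its own state, the transition table grows much faster than in the one-word case, which is where the extra computation comes from.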

Create Text Generator

The function below generates n random words that can be used to start a sentence. Its parameters are described below:

num = number of random sentences to generate
first_word = first word of each sentence, taken from the vocabulary (data)
n = number of random words per sentence

create_me <- function(num = 5, first_word = "i", n = 2) {

  for (i in seq_len(num)) {

    set.seed(i + 5) # a different fixed seed per sentence, for reproducibility

    markovchainSequence(n = n, # generate n additional random words
                        markovchain = fit_markov$estimate,
                        t0 = tolower(first_word), include.t0 = TRUE) %>% 
      paste(collapse = " ") %>% # join generated words with a space
      # create proper sentence form
      str_replace_all(pattern = " ,", replacement = ",") %>% 
      str_replace_all(pattern = " [.]", replacement = ".") %>% 
      str_replace_all(pattern = " [!]", replacement = "!") %>% 
      str_to_sentence() %>% # capitalize the start of every sentence
      print()

  }

}

With guidance from teachers and caregivers, students may be given the task of finishing such sentences or writing a paragraph on a specific topic. This allows students to develop creativity in storytelling and writing.

create_me(num = 5, first_word = "i", n = 3)

#> [1] "I saw the flower"
#> [1] "I were some use"
#> [1] "I do not show"
#> [1] "I was a flower"
#> [1] "I could fly to"
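To show what the sequence generation step does conceptually, here is a base R sketch that walks a chain one word at a time. It is a simplified illustration, not the internals of markovchainSequence(); the helper name generate_words() and the toy word sequence are hypothetical and made up for this example.

```r
# Hypothetical helper: walk a first-order chain for up to n steps
generate_words <- function(words, first_word, n = 3) {
  current_word <- head(words, -1)
  next_word    <- tail(words, -1)
  output <- first_word
  for (step in seq_len(n)) {
    # All words observed right after the latest word; repeated entries make
    # frequent successors proportionally more likely to be drawn
    candidates <- next_word[current_word == output[length(output)]]
    if (length(candidates) == 0) break # dead end: no observed successor
    output <- c(output, candidates[sample.int(length(candidates), 1)])
  }
  paste(output, collapse = " ")
}

# Toy word sequence (made up for illustration, not from the corpus)
toy_words <- c("i", "saw", "a", "flower", ".", "i", "was", "a", "pilot", ".",
               "i", "saw", "the", "fox", ".")

set.seed(42)
generate_words(toy_words, first_word = "i", n = 3)
```

Sampling from the list of observed successors is equivalent to sampling from the transition probabilities of the current word, which is why a short sequence such as "i saw the ..." can only contain word pairs that actually occur in the input text.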

For a more guided challenge, you may ask students to combine those incomplete sentences with a set of random words. The function below generates n random words from the vocabulary. You can make the selection reproducible by passing an integer to the seed parameter.

random_vocab <- function(n = 10, seed = NULL) {
  
  set.seed(seed)
  unique_vocab <- tlp_clean %>% 
    mutate(text = text %>% 
           str_remove_all("[:punct:]")) %>% 
    pull(text) %>% 
    strsplit(" ") %>% 
    unlist() %>% 
    unique()
    
  unique_vocab[sample(length(unique_vocab), n)]

}

random_vocab(n = 10, seed = 123)

#>  [1] "telescope"   "ways"        "man"         "resemblance" "mechanic"   
#>  [6] "clad"        "perplexed"   "twentyeight" "force"       "stretch"
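As one possible way to turn these two outputs into a writing task, the sketch below pairs a sentence starter with a few words to use. The helper make_task() and its wording are hypothetical, and the inputs are copied by hand from the outputs shown earlier rather than produced by the fitted model.

```r
# Hypothetical helper: combine one sentence starter with a few words to use
make_task <- function(starter, vocab, n_words = 3, seed = NULL) {
  set.seed(seed)
  chosen <- sample(vocab, n_words)
  paste0("Finish the sentence \"", starter, " ...\" using the words: ",
         paste(chosen, collapse = ", "))
}

# Example inputs copied from the outputs shown earlier in this article
starters <- c("I saw the flower", "I could fly to")
vocab    <- c("telescope", "ways", "man", "resemblance", "mechanic")

# Use a different seed per starter so each task gets its own words
for (i in seq_along(starters)) {
  print(make_task(starters[i], vocab, n_words = 3, seed = i))
}
```

In practice, the starters and vocabulary would come from create_me() and random_vocab() instead of hand-copied vectors.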

Students may fill in, or even improve on, the missing words using their imagination. This allows students to practice their writing skills and develop their curiosity about vocabulary and grammar.

This model is still far from perfect at generating sentences, but with better and much more data we may improve its performance. For further discussion of text generation and text prediction using a Markov Chain model, see the article “Text Generation With Markov Chains”¹, which discusses the logic behind the Markov Chain model and demonstrates a more sophisticated use of Markov Chains in business practice.

Happy learning and exploring!

¹ Adyatama, A. 2020. “Algotech: Text Generation With Markov Chains”. Published online April 2, 2020.

Text Generation for Creative Writing using R

Nabiilah Ardini Fauziyyah

28/4/2020