Before we start…

Please note that you will work with R Markdown documents. An R Markdown document consists of three parts: 1) content; 2) code; 3) output (results). First, the content parts describe what you are learning about and what you are asked to work on. Second, the code parts appear in grey boxes and are what you can enter in the source window of RStudio. Text after a single pound sign # is my annotation, which helps you get a sense of what each line of code does. Third, the output parts, on lines beginning with ##, show the results of running the R code. Check that your code produces the same results. The bracketed number at the start of each output line, beginning with [1], is the index of the first element printed on that line.

Working Directory

The working directory is the location that R reads files from and saves files to by default.

We use projects in RStudio to set the working directory to the folder we are working in.

getwd() # Find the current working directory (where inputs are found and outputs are sent).
## [1] "C:/Users/CAU/Dropbox/2021_Class/ATA/R_practice"
#setwd("~/...")
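
As a small illustration of how the working directory is used, the sketch below builds the full path that a relative file name would resolve to. The file name my_data.csv is hypothetical and is not part of the practice folder.

```r
# Relative file names are resolved against the working directory.
wd <- getwd()

# "my_data.csv" is a hypothetical file name used only for illustration.
full_path <- file.path(wd, "my_data.csv")
full_path

# basename() returns the file-name part of the path.
basename(full_path)
```

Because an RStudio project sets the working directory for you, a short relative name is usually all you need to type when reading or saving a file.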

Creating vectors

A vector is a sequence of numbers (including integers), characters, or Boolean values.

c(1,3,5) # Join elements into a vector 
## [1] 1 3 5
1:5 # An integer sequence
## [1] 1 2 3 4 5
seq(1, 5, by=2) # A sequence of numbers from 1 to 5, increasing by 2
## [1] 1 3 5
rep(1:5, times=3) # Repeat an integer sequence 1:5 three times
##  [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
rep(1:5, each=3) # Repeat each element of an integer sequence 1:5 three times
##  [1] 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5
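
The functions above can produce the same values while storing them differently. The sketch below, which uses only base R, compares the storage type of a colon sequence with that of a seq() call and shows the length.out argument as an alternative to by. The object names x, y, and z are my own.

```r
x <- 1:5                        # colon operator
y <- seq(1, 5, by = 2)          # step size given with by
z <- seq(1, 5, length.out = 3)  # number of values given with length.out

typeof(x)  # "integer"
typeof(y)  # "double"

# y and z hold the same three values: 1, 3, 5
all(y == z)

# rep() with times = 3 repeats all five values three times
length(rep(x, times = 3))  # 15
```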

Vector functions

# R is an object-oriented programming language; any value can be assigned to an object
fruits <- c("Apple","Grape","Pear","Apple","Mango","Orange","Mango","Strawberry","Grape","Apple")
sort(fruits) # Return fruits sorted; character vectors are sorted in alphabetical order
##  [1] "Apple"      "Apple"      "Apple"      "Grape"      "Grape"     
##  [6] "Mango"      "Mango"      "Orange"     "Pear"       "Strawberry"
table(fruits) # See counts of values (elements)
## fruits
##      Apple      Grape      Mango     Orange       Pear Strawberry 
##          3          2          2          1          1          1
unique(fruits) # See unique values (elements)
## [1] "Apple"      "Grape"      "Pear"       "Mango"      "Orange"    
## [6] "Strawberry"
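
These functions can also be combined. As a minimal sketch, the code below sorts the counts from table( ) to find the most frequent fruit and counts the distinct values. The names fruit_counts and sorted_counts are introduced here for illustration.

```r
fruits <- c("Apple","Grape","Pear","Apple","Mango","Orange","Mango","Strawberry","Grape","Apple")

fruit_counts <- table(fruits)                           # counts per value
sorted_counts <- sort(fruit_counts, decreasing = TRUE)  # largest count first

names(sorted_counts)[1]  # the most frequent value: "Apple"
length(unique(fruits))   # number of distinct values: 6
```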

Note 1: R is case-sensitive: uppercase and lowercase letters are treated as different.

word1 <- "TEXT"
word1
## [1] "TEXT"
word2 <- "Text"
word2
## [1] "Text"
word1 == word2
## [1] FALSE
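
If the case difference should be ignored, one common approach is to convert both strings to the same case before comparing them. The sketch below uses the base function tolower( ) for this.

```r
word1 <- "TEXT"
word2 <- "Text"

word1 == word2                    # FALSE: the comparison is case-sensitive
tolower(word1) == tolower(word2)  # TRUE: both become "text" first
```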

Note 2: Parentheses ( ) are for calling functions, whereas square brackets [ ] are for selecting elements of a vector.

toupper(fruits[4]) 
## [1] "APPLE"
tolower(fruits[4])
## [1] "apple"
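
Square brackets accept more than a single position. As a short sketch, the code below selects a range of positions, selects by a logical condition, and passes a bracketed selection to a function.

```r
fruits <- c("Apple","Grape","Pear","Apple","Mango","Orange","Mango","Strawberry","Grape","Apple")

fruits[2:3]                # positions 2 and 3: "Grape" "Pear"
fruits[fruits == "Mango"]  # elements equal to "Mango": "Mango" "Mango"
toupper(fruits[c(1, 5)])   # brackets select, parentheses call: "APPLE" "MANGO"
```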

list( )

What is a vector object? An ordered collection of elements of the same type.

Ex) a vector of numbers; a vector of characters; a logical vector

What if we want to create one vector of each of these three types?

Remember what the function c( ) does.

# assign a sequence of numbers to a numeric vector object, vector_numeric
vector_numeric <- c(1:6) # a sequence of numbers from 1 to 6
vector_numeric
## [1] 1 2 3 4 5 6
# assign a sequence of characters to a character vector object, vector_character
vector_character <- c("v","e","c","t","o","r") # character strings must be enclosed in quotation marks
vector_character
## [1] "v" "e" "c" "t" "o" "r"
class(vector_character)
## [1] "character"
# assign a sequence of Boolean values, TRUE or FALSE, to a logical vector object, vector_logical
vector_logical <- c(TRUE, FALSE, FALSE, TRUE, T, F) # TRUE/FALSE must be in uppercase letters; T and F are shorthand for them
vector_logical
## [1]  TRUE FALSE FALSE  TRUE  TRUE FALSE
class(vector_logical)
## [1] "logical"
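
Because a vector holds elements of one type only, c( ) converts mixed inputs to a common type. The sketch below shows this coercion; the object name mixed is my own.

```r
# Mixing a number, a character string, and a Boolean in one vector
mixed <- c(1, "a", TRUE)
mixed         # "1" "a" "TRUE": everything has become a character string
class(mixed)  # "character"

# Mixing a number and a Boolean gives a numeric vector (TRUE becomes 1)
class(c(1, TRUE))  # "numeric"
```

This is the limitation that a list removes, since a list can keep each element in its own type.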

What is a list object?

A list is a vector whose elements can be heterogeneous. That is, a list is a collection of elements that can be of different types.

The elements of a list can include numeric vectors, character vectors, and logical vectors at once.

Let's say we want to create a list object that contains all three of the vector objects above.

We can use the function list( ) here.

List1 <- list(numbers=vector_numeric, characters=vector_character, Booleans=vector_logical) # Generate a list object, List1, to contain all the elements that are named "numbers", "characters", and "Booleans"
length(List1) # The function, length, calculates how many elements are in any R object (vectors, lists, & factors).
## [1] 3
class(List1) # The function class tells us what type an object is: numeric, character, logical, list, and so on.
## [1] "list"
names(List1) # To get or set the names of an object; Here, elements' names are "numbers", "characters", and "Booleans"
## [1] "numbers"    "characters" "Booleans"
List1[1:2] # returns a new list object that contains the first and the second elements. What is length(List1[1:2])?
## $numbers
## [1] 1 2 3 4 5 6
## 
## $characters
## [1] "v" "e" "c" "t" "o" "r"
List1[2] # returns a new list object that contains the second element only. What is length(List1[2])?
## $characters
## [1] "v" "e" "c" "t" "o" "r"
List1[[2]] # returns a vector object that contains six elements in the second element of List1, the characters "v", "e", "c", "t", "o", and "r". What is length(List1[[2]])? 
## [1] "v" "e" "c" "t" "o" "r"
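
Since the elements of List1 are named, they can also be selected by name instead of by position. The sketch below rebuilds List1 so that it runs on its own and then uses the $ operator and [[ ]] with a name.

```r
vector_numeric <- c(1:6)
vector_character <- c("v","e","c","t","o","r")
vector_logical <- c(TRUE, FALSE, FALSE, TRUE, T, F)
List1 <- list(numbers=vector_numeric, characters=vector_character, Booleans=vector_logical)

List1$characters         # select the element named "characters"
List1[["characters"]]    # the same selection with double brackets

# Both give the same vector as selecting by position
identical(List1$characters, List1[[2]])  # TRUE
```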

Why is list( ) important in text mining?

Text data are analyzed at multiple levels. The list( ) function is useful for dealing with hierarchical data.

Ex) Document > Paragraphs > Sentences > Words

Let's say we have two sentences about BTS.

Ex) BTS <- “BTS is known as the Bangtan Boys. And BTS is a seven-member South Korean boy band.”

The character object BTS can be represented as a list of two sentences, in which each sentence is a vector containing a sequence of words.

# Example
sentences <- "BTS is known as the Bangtan Boys. And BTS is a seven-member South Korean boy band."
# Conceptually, the text can be stored as a list with one vector of words per sentence:
#sentences[[1]] <- c("BTS", "is", "known", "as", "the", "Bangtan Boys")
#sentences[[2]] <- c("And", "BTS", "is", "a", "seven-member", "South Korean", "boy band")
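
One way to see this hierarchy in practice is the base function strsplit( ), which returns a list. The following is only a rough sketch under simple assumptions: sentences are split at a period followed by a space, and words are split at single spaces, so multi-word names such as "Bangtan Boys" end up as two words, unlike the hand-made vectors in this section. The names bts_text, sentence_vec, and words_by_sentence are my own.

```r
bts_text <- "BTS is known as the Bangtan Boys. And BTS is a seven-member South Korean boy band."

# Step 1: split the text into sentences at ". " (this drops that period)
sentence_vec <- strsplit(bts_text, ". ", fixed = TRUE)[[1]]
length(sentence_vec)  # 2

# Step 2: split every sentence into words; the result is a list
words_by_sentence <- strsplit(sentence_vec, " ", fixed = TRUE)
class(words_by_sentence)  # "list"

words_by_sentence[[1]]     # the words of the first sentence
words_by_sentence[[1]][1]  # "BTS"
```

The result has one list element per sentence, and each element is a character vector of that sentence's words.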

Let’s practice with list( )

# A list is formed with the following objects
obj1 <- c("BTS","is","known","as","the","Bangtan Boys")
obj2 <- c("And", "BTS","is","a","seven-member","South Korea","boy band")
obj3 <- list(obj1, obj2)
class(obj1) # vector or list?
## [1] "character"
class(obj2) # vector or list?
## [1] "character"
class(obj3) # vector or list?
## [1] "list"

Indexing: selecting certain elements

How do we select certain elements in a list? Single square brackets [ ] select elements of a vector; applied to a list, they return a smaller list. Double square brackets [[ ]] extract the contents of a single list element, for example the character vector stored in it.

obj3
## [[1]]
## [1] "BTS"          "is"           "known"        "as"           "the"         
## [6] "Bangtan Boys"
## 
## [[2]]
## [1] "And"          "BTS"          "is"           "a"            "seven-member"
## [6] "South Korea"  "boy band"
obj3[1]
## [[1]]
## [1] "BTS"          "is"           "known"        "as"           "the"         
## [6] "Bangtan Boys"
class(obj3[1])
## [1] "list"
length(obj3[1])
## [1] 1
obj3[2]
## [[1]]
## [1] "And"          "BTS"          "is"           "a"            "seven-member"
## [6] "South Korea"  "boy band"
obj3[[1]]
## [1] "BTS"          "is"           "known"        "as"           "the"         
## [6] "Bangtan Boys"
class(obj3[[1]])
## [1] "character"
obj3[[2]]
## [1] "And"          "BTS"          "is"           "a"            "seven-member"
## [6] "South Korea"  "boy band"
# obj3[1] returns a list that contains the first element; obj3[[1]] returns the first element itself (a character vector)

Once you are familiar with list( ), you will be able to figure out what obj3[[1]][5] is.

obj3[[1]][5]
## [1] "the"

How do we select the word “South Korea” in the second element of the list object obj3?

obj3[[2]][6]
## [1] "South Korea"
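
Indexing picks out one element at a time. To work on every element of a list at once, base R offers functions such as lengths( ), lapply( ), and sapply( ). The sketch below applies them to obj3; treat it as an illustration rather than a required step of this practice.

```r
obj1 <- c("BTS","is","known","as","the","Bangtan Boys")
obj2 <- c("And", "BTS","is","a","seven-member","South Korea","boy band")
obj3 <- list(obj1, obj2)

lengths(obj3)  # number of words in each element: 6 7

# lapply() applies a function to each element and returns a list
upper_words <- lapply(obj3, toupper)
upper_words[[1]][1]  # "BTS"

# sapply() simplifies the result to a vector when it can
first_words <- sapply(obj3, function(words) words[1])
first_words  # "BTS" "And"
```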

Questions

  1. Return the first element as a vector object with six words from the list obj3

  2. Return the first element, the word “BTS”, from the above vector object

How to turn a list into a vector

# unlist( ) is a function that turns a list object into a vector object
# Be cautious with unlist( ): it removes the list structure and converts all elements to a single type
obj3
## [[1]]
## [1] "BTS"          "is"           "known"        "as"           "the"         
## [6] "Bangtan Boys"
## 
## [[2]]
## [1] "And"          "BTS"          "is"           "a"            "seven-member"
## [6] "South Korea"  "boy band"
unlist(obj3)
##  [1] "BTS"          "is"           "known"        "as"           "the"         
##  [6] "Bangtan Boys" "And"          "BTS"          "is"           "a"           
## [11] "seven-member" "South Korea"  "boy band"
List1[2] # a new list object with one element, the character vector object vector_character
## $characters
## [1] "v" "e" "c" "t" "o" "r"
class(List1[2])
## [1] "list"
unlist(List1[2]) # turn a list object into a vector object; the element name is carried over as characters1, characters2, ...
## characters1 characters2 characters3 characters4 characters5 characters6 
##         "v"         "e"         "c"         "t"         "o"         "r"
vector_character # a vector object with six elements
## [1] "v" "e" "c" "t" "o" "r"
class(vector_character)
## [1] "character"
unlist(List1[2]) == vector_character # is unlist(List1[2]) equal to vector_character?
## characters1 characters2 characters3 characters4 characters5 characters6 
##        TRUE        TRUE        TRUE        TRUE        TRUE        TRUE
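
Two consequences of unlist( ) are worth checking by hand: the names it attaches to the result, and the type conversion that happens when the list holds different types. The sketch below rebuilds List1 so that it runs on its own; use.names is an argument of unlist( ), and the object names plain_vector and flattened are my own.

```r
vector_numeric <- c(1:6)
vector_character <- c("v","e","c","t","o","r")
vector_logical <- c(TRUE, FALSE, FALSE, TRUE, T, F)
List1 <- list(numbers=vector_numeric, characters=vector_character, Booleans=vector_logical)

# use.names = FALSE drops the names, so the result matches the original vector exactly
plain_vector <- unlist(List1[2], use.names = FALSE)
identical(plain_vector, vector_character)  # TRUE

# Unlisting the whole list mixes three types, so everything becomes character
flattened <- unlist(List1)
class(flattened)   # "character"
length(flattened)  # 18
```

After the conversion the numbers and Booleans are stored as text such as "1" and "TRUE", which is one reason to be careful with unlist( ).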

When do we use unlist( ) in text mining?

A paragraph stored as a list of sentences can be combined into a single vector object of words.

# Organize a list of character vectors into a vector object
list_sentences <- list(obj3[[1]],obj3[[2]]) # generate a list of the two vectors
list_sentences
## [[1]]
## [1] "BTS"          "is"           "known"        "as"           "the"         
## [6] "Bangtan Boys"
## 
## [[2]]
## [1] "And"          "BTS"          "is"           "a"            "seven-member"
## [6] "South Korea"  "boy band"
class(list_sentences)
## [1] "list"
list_sentences[[1]] # returns the first element of the list list_sentences as a vector
## [1] "BTS"          "is"           "known"        "as"           "the"         
## [6] "Bangtan Boys"
class(unlist(list_sentences[1]))
## [1] "character"
list_sentences[[2]] # returns the second element of the list list_sentences as a vector
## [1] "And"          "BTS"          "is"           "a"            "seven-member"
## [6] "South Korea"  "boy band"
unlist(list_sentences) # flatten the list into a vector object
##  [1] "BTS"          "is"           "known"        "as"           "the"         
##  [6] "Bangtan Boys" "And"          "BTS"          "is"           "a"           
## [11] "seven-member" "South Korea"  "boy band"
list_words <- unlist(list_sentences)
list_words
##  [1] "BTS"          "is"           "known"        "as"           "the"         
##  [6] "Bangtan Boys" "And"          "BTS"          "is"           "a"           
## [11] "seven-member" "South Korea"  "boy band"
class(list_words)
## [1] "character"
sort(list_words)
##  [1] "a"            "And"          "as"           "Bangtan Boys" "boy band"    
##  [6] "BTS"          "BTS"          "is"           "is"           "known"       
## [11] "seven-member" "South Korea"  "the"
table(list_words)
## list_words
##            a          And           as Bangtan Boys     boy band          BTS 
##            1            1            1            1            1            2 
##           is        known seven-member  South Korea          the 
##            2            1            1            1            1

unlist( ) allows us to combine a list object’s elements into a single vector object.

Question

How can we turn the character vector object, unlist(list_sentences), into a single sentence in which the words are separated by blanks?