CIS 4730 Unstructured Data Management

Flow control

if-else
for
while
function

if-else

An if statement consists of a logical condition (an expression that evaluates to TRUE or FALSE) followed by one or more statements that run when the condition is TRUE.

# Template in words
if(a logic condition) {
  Get inside the curly brackets and 
  run this block when the condition is true
}

# Example
x = 1
if(x == 1) {
  print("x equals 1")
}

## [1] "x equals 1"

An if statement can be followed by an optional else statement, which executes when the preceding logical expression is FALSE.

name="Tim Cook" ; role = "CEO" 
if(role == "CEO") {
  print(paste(name, "is a CEO.", sep=" "))
} else {
  print(paste(name, "is not a CEO.", sep=" "))
}

## [1] "Tim Cook is a CEO."

name = "Jeff Williams" ; role = "COO" 
if(role == "CEO") {
  print(paste(name, "is a CEO.", sep=" "))
} else {
  print(paste(name, "is not a CEO.", sep=" "))
}

## [1] "Jeff Williams is not a CEO."

The logical expression in the previous example is role == "CEO".

Alternatively, you can use a logical variable (TRUE or FALSE) directly as the condition:

name="Tim Cook"
is_ceo = TRUE
if(is_ceo) {
  print(paste(name, "is a CEO.", sep=" "))
} else {
  print(paste(name, "is not a CEO.", sep=" "))
}

## [1] "Tim Cook is a CEO."

You can have more than two decision points:

age = 15
if(age < 13) {
  print("Kid")
} else if (age < 20) {
  # Go into this block if the previous logical expression is FALSE
  # and the current logical expression is TRUE
  print("Teenager")
} else { 
  # Go into this block if all previous logical expressions are FALSE
  print("Adult")
}

## [1] "Teenager"

Your turn

Create a variable grade = "A" and use if-else statements to print its corresponding value on a 4.0 scale, based on the following conversion table. Try different letter grades to check that your if-else statements are correct.

(Use print(paste("point:", point)) to print the variable together with a string.)

Grade Letter	Point
A	4
B	3
C	2
D	1
All others	0
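
The printing hint above can be sketched as follows. The value assigned to point here is made up for illustration; in the exercise it would come from your own if-else statements.

```r
# Hypothetical value; in the exercise, point is set by your if-else statements
point = 4

# paste() joins the label and the value with a space in between
print(paste("point:", point))
## [1] "point: 4"
```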

for loops

You often encounter situations where you need to run the same statements multiple times, potentially over a set of data objects.

Motivating example:

print( paste("The year was", 2001) )
print( paste("The year was", 2002) )
print( paste("The year was", 2003) )
print( paste("The year was", 2004) )

A for loop iterates through every element in a vector.

# Template in words
for (element in vector){
  Use the element to do something
}

# Example
for (year in 2001:2004){
  print( paste("The year was", year) )
}

## [1] "The year was 2001"
## [1] "The year was 2002"
## [1] "The year was 2003"
## [1] "The year was 2004"

In words: for each year in the sequence 2001:2004, execute the code chunk print( paste("The year was", year) ).

# you could define the vector outside the for-loop
x = c(1, 3, 5)  
for(i in x) {
  print(i)
}

## [1] 1
## [1] 3
## [1] 5

# it works regardless of the data type of the vector
for(i in c("A", "B", "C", "D")) { 
  print(i)
}

## [1] "A"
## [1] "B"
## [1] "C"
## [1] "D"

If you want to skip some items in a for-loop, use the keyword next.

for (i in 1:5) {
  if (i == 2){
    next
  }
  print(i)
}

## [1] 1
## [1] 3
## [1] 4
## [1] 5

Your turn

Continue with our previous exercise on converting letter grades to grade points.

Put the letter grades (A, A, C, B, B) in a vector and use a for loop to print out their respective points.

while loops

The while loop continually executes a block of statements while a particular condition is true.

# Template in words
while(condition to check) {
  statements to run when the condition is true
}

# Example
x = 1
while(x <= 3) {
  print(x)
  x = x + 1 # what would happen if you skip this line?
}

## [1] 1
## [1] 2
## [1] 3

for() vs. while()

for() is better when you want to iterate over a set of elements that you know in advance.

while() is better when it is easier to state a condition for when to keep running and when to stop.

Note: Every for() can be rewritten as a while().
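
As a sketch of that note, the for loop over c(1, 3, 5) from earlier can be rewritten as a while loop that tracks its own position in the vector. The variable name pos is only an illustrative choice.

```r
x = c(1, 3, 5)

# for version: R hands you each element in turn
for (i in x) {
  print(i)
}

# while version: keep track of the position yourself
pos = 1
while (pos <= length(x)) {
  print(x[pos])
  pos = pos + 1   # move on to the next position
}
```

Both loops print 1, 3 and 5. The while version needs an extra counter variable, which is why for() is usually the more convenient choice when the elements are known in advance.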

Your turn

1. Use a for loop to get the sum of all numbers from 1 to 100.
2. Use a while loop to get the sum of all numbers from 1 to 100.

Hint 1: Create a variable total outside the for/while loop, and update the value of total as you loop through all numbers from 1 to 100.
Hint 2: You can verify your answer against R’s built-in sum() function: sum(1:100)

Functions

A function is a procedure or routine which takes optional inputs and produces an optional output.

So far we have already seen many built-in functions:

Vector
- seq(), rep(), mean(), length(), …
Data frame
- colnames(), rownames()
Character string
- paste()

Why functions?

Data structures tie related values into one object
Functions tie related commands into one object
In both cases: easier to understand, easier to work with, easier to build into larger things

What should be a function?

Things you’re going to re-run, especially if it will be re-run with changes
Chunks of code you keep highlighting and hitting return on
Chunks of code which are small parts of bigger analyses
Chunks which are very similar to other chunks

Creating functions

In R, you can create your own functions using the following syntax:

my_function <- function(input1, input2, ...) {
  # Use the input to do something 
  return(output) # return a result
}

Here is a working example:

hello_world <- function() { 
  # this particular function requires no inputs
  print("Hello world!")
  # No explicit return(): R returns the value of the last expression,
  # here the printed string (returned invisibly).
}

hello_world()  # call the function; don't forget the parentheses

## [1] "Hello world!"
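
A short sketch of what happens when return() is left out: R uses the value of the last evaluated expression as the function's output. The helper square below is a hypothetical example, not part of the lab.

```r
hello_world <- function() {
  print("Hello world!")   # print() hands back its argument invisibly
}

result = hello_world()    # prints the greeting and stores the returned value
## [1] "Hello world!"
print(result)
## [1] "Hello world!"

# Hypothetical helper: the last expression is the output, no return() needed
square <- function(num) {
  num * num
}
square(4)
## [1] 16
```

Writing return() explicitly, as in the template above, is still a reasonable habit because it makes the output easy to spot.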

Another example:

add_one <- function(num) {
  num = num+1  # be sure to match the input variable name 
  return(num)  # return() says what the output is
}
a = add_one(10)
print(a)

## [1] 11

More than one input

greeting <- function(your_name, course_name) {
  print(paste0("Hello, ", your_name, ". This is ", course_name))
}

greeting(your_name="Smith", course_name="CIS 4730")

## [1] "Hello, Smith. This is CIS 4730"

greeting("Alice", "CIS 4950")

## [1] "Hello, Alice. This is CIS 4950"
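
The two calls above differ in how the inputs are matched: by name or by position. A small sketch of the difference, reusing the same greeting function:

```r
greeting <- function(your_name, course_name) {
  print(paste0("Hello, ", your_name, ". This is ", course_name))
}

# Named arguments can be given in any order
greeting(course_name = "CIS 4730", your_name = "Smith")
## [1] "Hello, Smith. This is CIS 4730"

# Unnamed arguments are matched by position, so swapping them changes the message
greeting("CIS 4730", "Smith")
## [1] "Hello, CIS 4730. This is Smith"
```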

Your turn

Write a function that takes two numerical values and returns their product.

Create a vector stop_words to store the following stop words: a, an, and, the, that. Then write a function detect_stop_word that takes a word as input and detects whether the word is a stop word.
- Hint: recall the %in% operator from lab-02

detect_stop_word("atlanta")

## [1] FALSE

detect_stop_word("that")

## [1] TRUE

Tokenization example

shakespeare<-c("Et tu, Brute? — Then fall, Caesar!",
"Romans, countrymen, and lovers, hear me for my cause, 
and be silent, that you may hear.",
"Believe me for mine honor, and have respect to mine honor, 
that you may believe.",
"Censure me in your wisdom, and awake your senses, 
that you may the better judge.",
"If there be any in this assembly, any dear friend of Caesar's, 
to him I say that Brutus' love to Caesar was no less than his.",
"If then that friend demand why Brutus rose against Caesar, 
this is my answer:",
"not that I loved Caesar less, but that I loved Rome more.")

library(dplyr)  # provides tibble(), %>%, count(), arrange() and anti_join()

#text_df <- data.frame(text = shakespeare, stringsAsFactors=FALSE)
text_df <- tibble(line=1:7, text = shakespeare)
text_df

## # A tibble: 7 x 2
##    line text                                                                    
##   <int> <chr>                                                                   
## 1     1 "Et tu, Brute? — Then fall, Caesar!"                                    
## 2     2 "Romans, countrymen, and lovers, hear me for my cause, \nand be silent,~
## 3     3 "Believe me for mine honor, and have respect to mine honor, \nthat you ~
## 4     4 "Censure me in your wisdom, and awake your senses, \nthat you may the b~
## 5     5 "If there be any in this assembly, any dear friend of Caesar's, \nto hi~
## 6     6 "If then that friend demand why Brutus rose against Caesar, \nthis is m~
## 7     7 "not that I loved Caesar less, but that I loved Rome more."

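unnest_tokens(df, word, text), from the tidytext package, splits a column (text, in this case) into tokens and flattens the table into one token per row (using word as the name of the new column). By default it also lowercases the tokens and strips punctuation, as the output below shows.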

library(tidytext)

tokens <- text_df %>% 
  unnest_tokens(word, text)
#tokens <- unnest_tokens(text_df, word, text)

tokens %>%
  # word count
  count(word) %>%
  # Arrange the counts in descending order
  arrange(desc(n))

## # A tibble: 65 x 2
##    word       n
##    <chr>  <int>
##  1 that       7
##  2 and        4
##  3 caesar     4
##  4 i          3
##  5 may        3
##  6 me         3
##  7 to         3
##  8 you        3
##  9 any        2
## 10 be         2
## # ... with 55 more rows
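
To make the idea of one token per row concrete, here is a rough base-R approximation applied to a single line. It is only a sketch: it assumes plain ASCII letters, drops the dash from the first line for simplicity, and would split a word such as caesar's at the apostrophe, so it is not how tidytext tokenizes internally. The names line_text and rough_tokens are illustrative.

```r
line_text <- "Et tu, Brute? Then fall, Caesar!"

# Lowercase, then split wherever there is a run of non-letter characters
rough_tokens <- strsplit(tolower(line_text), "[^a-z]+")[[1]]
rough_tokens <- rough_tokens[rough_tokens != ""]   # drop empty strings, if any
rough_tokens
## [1] "et"     "tu"     "brute"  "then"   "fall"   "caesar"

# One token per row, similar in shape to the tokens table above
data.frame(line = 1, word = rough_tokens)
```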

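Stop words can be removed using anti_join(stop_words), where stop_words is a data frame of common English stop words that ships with tidytext. anti_join(x, y) takes two data frames and returns all rows from x that have no match in y. (Notice the change in the number of rows.)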

sw_tokens <- text_df %>%
  # Tokenize the lines data
  unnest_tokens(word, text) %>%
  anti_join(stop_words)

## Joining, by = "word"

sw_tokens %>%
  # word count
  count(word) %>%
  # Arrange the counts in descending order
  arrange(desc(n))

## # A tibble: 28 x 2
##    word         n
##    <chr>    <int>
##  1 caesar       4
##  2 brutus       2
##  3 friend       2
##  4 hear         2
##  5 honor        2
##  6 loved        2
##  7 mine         2
##  8 answer       1
##  9 assembly     1
## 10 awake        1
## # ... with 18 more rows
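
The behavior of anti_join() is easier to see on a tiny example. The two tibbles below are made up for illustration; y plays the role of a miniature stop-word list.

```r
library(dplyr)

x <- tibble(word = c("the", "friend", "and", "caesar"))
y <- tibble(word = c("the", "and", "a"))   # hypothetical mini stop-word list

# Keep only the rows of x whose word has no match in y
kept <- anti_join(x, y, by = "word")
kept$word   # "friend" and "caesar" remain
```

Passing by = "word" explicitly names the matching column, which also avoids the "Joining, by = ..." message seen above.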

Further cleaning using `library(tm)`

install.packages("tm")  # only needed once
library(tm) #text mining package that utilizes NLP package

shakespeare_lower <- tolower(shakespeare)
shakespeare_lower

## [1] "et tu, brute? — then fall, caesar!"                                                                                             
## [2] "romans, countrymen, and lovers, hear me for my cause, \nand be silent, that you may hear."                                      
## [3] "believe me for mine honor, and have respect to mine honor, \nthat you may believe."                                             
## [4] "censure me in your wisdom, and awake your senses, \nthat you may the better judge."                                             
## [5] "if there be any in this assembly, any dear friend of caesar's, \nto him i say that brutus' love to caesar was no less than his."
## [6] "if then that friend demand why brutus rose against caesar, \nthis is my answer:"                                                
## [7] "not that i loved caesar less, but that i loved rome more."

# Remove punctuation
shakespeare_lower_np <- gsub('[[:punct:]]', '', shakespeare_lower) 
shakespeare_lower_np

## [1] "et tu brute — then fall caesar"                                                                                            
## [2] "romans countrymen and lovers hear me for my cause \nand be silent that you may hear"                                       
## [3] "believe me for mine honor and have respect to mine honor \nthat you may believe"                                           
## [4] "censure me in your wisdom and awake your senses \nthat you may the better judge"                                           
## [5] "if there be any in this assembly any dear friend of caesars \nto him i say that brutus love to caesar was no less than his"
## [6] "if then that friend demand why brutus rose against caesar \nthis is my answer"                                             
## [7] "not that i loved caesar less but that i loved rome more"

# Remove stopwords
shakespeare_lower_np_sw <- removeWords(shakespeare_lower_np, stopwords())
shakespeare_lower_np_sw

## [1] "et tu brute —  fall caesar"                                                 
## [2] "romans countrymen  lovers hear    cause \n  silent   may hear"              
## [3] "believe   mine honor   respect  mine honor \n  may believe"                 
## [4] "censure    wisdom  awake  senses \n  may  better judge"                     
## [5] "      assembly  dear friend  caesars \n   say  brutus love  caesar   less  "
## [6] "   friend demand  brutus rose  caesar \n   answer"                          
## [7] "   loved caesar less    loved rome "

# Collapse repeated spaces and trim leading/trailing whitespace
library(stringr)  # provides str_trim()
shakespeare_lower_np_sw_nw <- gsub(' +',' ', shakespeare_lower_np_sw) %>% 
  str_trim(side="both")
shakespeare_lower_np_sw_nw

## [1] "et tu brute — fall caesar"                                  
## [2] "romans countrymen lovers hear cause \n silent may hear"     
## [3] "believe mine honor respect mine honor \n may believe"       
## [4] "censure wisdom awake senses \n may better judge"            
## [5] "assembly dear friend caesars \n say brutus love caesar less"
## [6] "friend demand brutus rose caesar \n answer"                 
## [7] "loved caesar less loved rome"
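
Note that the line breaks (\n) survive in the output above, because the pattern ' +' only matches spaces. If you also want to collapse newlines, one possible variation (an alternative to the step above, not part of the original lab code) is to match any whitespace with [[:space:]]+ and trim with base R's trimws():

```r
x <- "romans countrymen lovers hear cause \n silent may hear"

# [[:space:]]+ matches any run of whitespace: spaces, tabs and newlines
x_clean <- trimws(gsub("[[:space:]]+", " ", x))
x_clean
## [1] "romans countrymen lovers hear cause silent may hear"
```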

Stemming & Lemmatization

library(textstem)  # provides stem_words(), lemmatize_words() and the *_strings() variants
dw <- c('driver', 'drive', 'drove', 'driven', 'drives', 'driving')
stem_words(dw)

## [1] "driver" "drive"  "drove"  "driven" "drive"  "drive"

lemmatize_words(dw)

## [1] "driver" "drive"  "drive"  "drive"  "drive"  "drive"

bw <- c('are', 'am', 'being', 'been', 'be')
stem_words(bw)

## [1] "ar"   "am"   "be"   "been" "be"

lemmatize_words(bw)

## [1] "be" "be" "be" "be" "be"

reference: https://cran.r-project.org/web/packages/textstem/README.html

Stemming & Lemmatization example

st_shakespeare <- tibble(line=1:7, 
                         text=stem_strings(shakespeare))
lm_shakespeare <- tibble(line=1:7, 
                         text=lemmatize_strings(shakespeare))

st_shakespeare

## # A tibble: 7 x 2
##    line text                                                                    
##   <int> <chr>                                                                   
## 1     1 Et tu, Brute? — Then fall, Caesar!                                      
## 2     2 Roman, countrymen, and lover, hear me for my caus, and be silent, that ~
## 3     3 Believ me for mine honor, and have respect to mine honor, that you mai ~
## 4     4 Censur me in your wisdom, and awak your sens, that you mai the better j~
## 5     5 If there be ani in thi assembli, ani dear friend of Caesar', to him I s~
## 6     6 If then that friend demand why Brutu rose against Caesar, thi i my answ~
## 7     7 not that I love Caesar less, but that I love Rome more.

lm_shakespeare

## # A tibble: 7 x 2
##    line text                                                                    
##   <int> <chr>                                                                   
## 1     1 Et tu, Brute? — Then fall, Caesar!                                      
## 2     2 roman, countryman, and lover, hear me for my cause, and be silent, that~
## 3     3 Believe me for mine honor, and have respect to mine honor, that you may~
## 4     4 Censure me in your wisdom, and awake your sense, that you may the good ~
## 5     5 If there be any in this assembly, any dear friend of Caesar's, to him I~
## 6     6 If then that friend demand why Brutus rise against Caesar, this be my a~
## 7     7 not that I love Caesar little, but that I love Rome much.

stemmed_tokens <- st_shakespeare %>% 
  unnest_tokens(word, text) %>%
  anti_join(stop_words) %>%
  count(word) %>%
  arrange(desc(n))

## Joining, by = "word"

stemmed_tokens

## # A tibble: 34 x 2
##    word       n
##    <chr>  <int>
##  1 caesar     5
##  2 love       3
##  3 mai        3
##  4 ani        2
##  5 believ     2
##  6 friend     2
##  7 hear       2
##  8 honor      2
##  9 mine       2
## 10 thi        2
## # ... with 24 more rows

lemmatized_tokens <- lm_shakespeare %>% 
  unnest_tokens(word, text) %>%
  anti_join(stop_words) %>%
  count(word) %>%
  arrange(desc(n))

## Joining, by = "word"

lemmatized_tokens

## # A tibble: 27 x 2
##    word         n
##    <chr>    <int>
##  1 caesar       4
##  2 love         3
##  3 brutus       2
##  4 friend       2
##  5 hear         2
##  6 honor        2
##  7 mine         2
##  8 answer       1
##  9 assembly     1
## 10 awake        1
## # ... with 17 more rows
