Data wrangling wrap up

Data wrangling is the process of cleaning, structuring and enriching data into a desired format (Trifacta, 2018).

In this assignment I chose two topics to work with to improve my data wrangling skills. In the first one I completed several exercises related to strings, and in the second one I worked with a dataset to extract and analyze geographical information.

Topic 1: Character data

For this task I completed exercises from the Strings chapter of R for Data Science.

I mostly worked with the stringr::words and stringr::sentences data sets.

14.2 String basics

14.2.5 Exercises

  1. Use str_length() and str_sub() to extract the middle character from a string. What will you do if the string has an even number of characters?
library(tidyverse)
library(stringr)

string1 <- "abc"
string2 <- "abcd"

str_sub(string1, floor((str_length(string1)+1)/2), ceiling((str_length(string1)+1)/2))
## [1] "b"
str_sub(string2, floor((str_length(string2)+1)/2), ceiling((str_length(string2)+1)/2)) #returns the two middle characters as the string has an even number. 
## [1] "bc"

For a string with an even number of characters I chose to extract both middle characters, but I could also pick just one of them:

str_sub(string2, ceiling(str_length(string2)/2), ceiling(str_length(string2)/2))
## [1] "b"

This way, I extract only the left one of the two middle characters.
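
The two choices above can be folded into one helper. The following is only a minimal sketch: the function name middle_chars() and its both argument are hypothetical and not part of stringr.

```r
library(stringr)

# Hypothetical helper: extract the middle character(s) of each string.
# both = TRUE  returns the two middle characters of an even-length string;
# both = FALSE returns only the left one.
middle_chars <- function(x, both = TRUE) {
  n <- str_length(x)
  start <- ceiling(n / 2)
  end <- if (both) floor(n / 2) + 1 else start
  str_sub(x, start, end)
}

middle_chars(c("abc", "abcd"))        # "b"  "bc"
middle_chars("abcd", both = FALSE)    # "b"
```

For odd lengths both settings give the same single character, so the argument only matters for even-length strings.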

  1. Write a function that turns (e.g.) a vector c("a", "b", "c") into the string a, b, and c. Think carefully about what it should do if given a vector of length 0, 1, or 2.
st_comma <- function(x, delim = ",") {
  num <- length(x)
  if (num == 0) {
    stop("vector length = 0") # error message when trying with a length 0 vector
  } else if (num == 1) {
    x
  } else if (num == 2) {
    str_c(x[[1]], "and", x[[2]], sep = " ")
  } else {
    str_1 <- str_c(x[seq_len(num - 1)], delim) # all but the last, each followed by the delimiter
    str_2 <- str_c("and", x[[num]], sep = " ") # "and" plus the last element
    str_c(c(str_1, str_2), collapse = " ")
  }
}

#st_comma(c()) # as vector is length 0, the function throws an error message "vector length = 0"
st_comma("a") 
## [1] "a"
st_comma(c("a", "b"))
## [1] "a and b"
st_comma(c("a", "b", "c"))
## [1] "a, b, and c"

14.3 Matching patterns with regular expressions

14.3.1.1 Basic matches

  1. How would you match the sequence "'\?
str_view("\"'\\", "\"'\\\\")
  1. What patterns will the regular expression \..\..\.. match? How would you represent it as a string?
str_view("w.x.y.z", "\\..\\..\\..")

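It matches a literal dot followed by any single character, with that two-character unit occurring three times in a row (for example ".x.y.z"). As an R string it is written "\\..\\..\\..".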
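
To make the difference between the regular expression and its string form concrete, here is a small sketch. The test strings are made up for illustration and do not come from the book.

```r
library(stringr)

# The regex \..\..\.. written as an R string: every backslash is doubled
pattern <- "\\..\\..\\.."
writeLines(pattern)   # prints the regex as the regex engine receives it

# Hypothetical test strings
x <- c("w.x.y.z", "a.b", "......")
str_extract(x, pattern)   # ".x.y.z" NA "......"
```

The last string matches because the unescaped dots in the pattern accept any character, including a literal dot.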

14.3.2.1 Anchors

  1. How would you match the literal string "$^$" ?
str_view("$^$", "^\\$\\^\\$$") # anchored at both ends so that only the exact string matches
  1. Given the corpus of common words in stringr::words, create regular expressions that find all words that:
  • Start with “y”.
  • End with “x”
  • Are exactly three letters long. (Don’t cheat by using str_length()!)
  • Have seven letters or more.

Since the word list is long, I set match = TRUE in str_view() to show only the matching words. When the output is still too long, I use str_subset() instead of str_view() for a more compact output.

str_view_match <- function(words, pattern) {
    str_view(words, pattern, match=TRUE)
} #function to only show matches

# start with y
str_view_match(words, "^y") 
# end with x
str_view_match(words, "x$")
# have exactly three letters
str_subset(words, "^...$")
##   [1] "act" "add" "age" "ago" "air" "all" "and" "any" "arm" "art" "ask"
##  [12] "bad" "bag" "bar" "bed" "bet" "big" "bit" "box" "boy" "bus" "but"
##  [23] "buy" "can" "car" "cat" "cup" "cut" "dad" "day" "die" "dog" "dry"
##  [34] "due" "eat" "egg" "end" "eye" "far" "few" "fit" "fly" "for" "fun"
##  [45] "gas" "get" "god" "guy" "hit" "hot" "how" "job" "key" "kid" "lad"
##  [56] "law" "lay" "leg" "let" "lie" "lot" "low" "man" "may" "mrs" "new"
##  [67] "non" "not" "now" "odd" "off" "old" "one" "out" "own" "pay" "per"
##  [78] "put" "red" "rid" "run" "say" "see" "set" "sex" "she" "sir" "sit"
##  [89] "six" "son" "sun" "tax" "tea" "ten" "the" "tie" "too" "top" "try"
## [100] "two" "use" "war" "way" "wee" "who" "why" "win" "yes" "yet" "you"
# ≥ 7 letters
str_subset(words, ".......")
##   [1] "absolute"    "account"     "achieve"     "address"     "advertise"  
##   [6] "afternoon"   "against"     "already"     "alright"     "although"   
##  [11] "america"     "another"     "apparent"    "appoint"     "approach"   
##  [16] "appropriate" "arrange"     "associate"   "authority"   "available"  
##  [21] "balance"     "because"     "believe"     "benefit"     "between"    
##  [26] "brilliant"   "britain"     "brother"     "business"    "certain"    
##  [31] "chairman"    "character"   "Christmas"   "colleague"   "collect"    
##  [36] "college"     "comment"     "committee"   "community"   "company"    
##  [41] "compare"     "complete"    "compute"     "concern"     "condition"  
##  [46] "consider"    "consult"     "contact"     "continue"    "contract"   
##  [51] "control"     "converse"    "correct"     "council"     "country"    
##  [56] "current"     "decision"    "definite"    "department"  "describe"   
##  [61] "develop"     "difference"  "difficult"   "discuss"     "district"   
##  [66] "document"    "economy"     "educate"     "electric"    "encourage"  
##  [71] "english"     "environment" "especial"    "evening"     "evidence"   
##  [76] "example"     "exercise"    "expense"     "experience"  "explain"    
##  [81] "express"     "finance"     "fortune"     "forward"     "function"   
##  [86] "further"     "general"     "germany"     "goodbye"     "history"    
##  [91] "holiday"     "hospital"    "however"     "hundred"     "husband"    
##  [96] "identify"    "imagine"     "important"   "improve"     "include"    
## [101] "increase"    "individual"  "industry"    "instead"     "interest"   
## [106] "introduce"   "involve"     "kitchen"     "language"    "machine"    
## [111] "meaning"     "measure"     "mention"     "million"     "minister"   
## [116] "morning"     "necessary"   "obvious"     "occasion"    "operate"    
## [121] "opportunity" "organize"    "original"    "otherwise"   "paragraph"  
## [126] "particular"  "pension"     "percent"     "perfect"     "perhaps"    
## [131] "photograph"  "picture"     "politic"     "position"    "positive"   
## [136] "possible"    "practise"    "prepare"     "present"     "pressure"   
## [141] "presume"     "previous"    "private"     "probable"    "problem"    
## [146] "proceed"     "process"     "produce"     "product"     "programme"  
## [151] "project"     "propose"     "protect"     "provide"     "purpose"    
## [156] "quality"     "quarter"     "question"    "realise"     "receive"    
## [161] "recognize"   "recommend"   "relation"    "remember"    "represent"  
## [166] "require"     "research"    "resource"    "respect"     "responsible"
## [171] "saturday"    "science"     "scotland"    "secretary"   "section"    
## [176] "separate"    "serious"     "service"     "similar"     "situate"    
## [181] "society"     "special"     "specific"    "standard"    "station"    
## [186] "straight"    "strategy"    "structure"   "student"     "subject"    
## [191] "succeed"     "suggest"     "support"     "suppose"     "surprise"   
## [196] "telephone"   "television"  "terrible"    "therefore"   "thirteen"   
## [201] "thousand"    "through"     "thursday"    "together"    "tomorrow"   
## [206] "tonight"     "traffic"     "transport"   "trouble"     "tuesday"    
## [211] "understand"  "university"  "various"     "village"     "wednesday"  
## [216] "welcome"     "whether"     "without"     "yesterday"

14.3.3.1 Character classes and alternatives

  1. Create regular expressions to find all words that:
  • Start with a vowel.
  • That only contain consonants. (Hint: thinking about matching “not”-vowels.)
  • End with ed, but not with eed.
  • End with ing or ise.
# start with a vowel
str_subset(words, "^[aeiou]")
##   [1] "a"           "able"        "about"       "absolute"    "accept"     
##   [6] "account"     "achieve"     "across"      "act"         "active"     
##  [11] "actual"      "add"         "address"     "admit"       "advertise"  
##  [16] "affect"      "afford"      "after"       "afternoon"   "again"      
##  [21] "against"     "age"         "agent"       "ago"         "agree"      
##  [26] "air"         "all"         "allow"       "almost"      "along"      
##  [31] "already"     "alright"     "also"        "although"    "always"     
##  [36] "america"     "amount"      "and"         "another"     "answer"     
##  [41] "any"         "apart"       "apparent"    "appear"      "apply"      
##  [46] "appoint"     "approach"    "appropriate" "area"        "argue"      
##  [51] "arm"         "around"      "arrange"     "art"         "as"         
##  [56] "ask"         "associate"   "assume"      "at"          "attend"     
##  [61] "authority"   "available"   "aware"       "away"        "awful"      
##  [66] "each"        "early"       "east"        "easy"        "eat"        
##  [71] "economy"     "educate"     "effect"      "egg"         "eight"      
##  [76] "either"      "elect"       "electric"    "eleven"      "else"       
##  [81] "employ"      "encourage"   "end"         "engine"      "english"    
##  [86] "enjoy"       "enough"      "enter"       "environment" "equal"      
##  [91] "especial"    "europe"      "even"        "evening"     "ever"       
##  [96] "every"       "evidence"    "exact"       "example"     "except"     
## [101] "excuse"      "exercise"    "exist"       "expect"      "expense"    
## [106] "experience"  "explain"     "express"     "extra"       "eye"        
## [111] "idea"        "identify"    "if"          "imagine"     "important"  
## [116] "improve"     "in"          "include"     "income"      "increase"   
## [121] "indeed"      "individual"  "industry"    "inform"      "inside"     
## [126] "instead"     "insure"      "interest"    "into"        "introduce"  
## [131] "invest"      "involve"     "issue"       "it"          "item"       
## [136] "obvious"     "occasion"    "odd"         "of"          "off"        
## [141] "offer"       "office"      "often"       "okay"        "old"        
## [146] "on"          "once"        "one"         "only"        "open"       
## [151] "operate"     "opportunity" "oppose"      "or"          "order"      
## [156] "organize"    "original"    "other"       "otherwise"   "ought"      
## [161] "out"         "over"        "own"         "under"       "understand" 
## [166] "union"       "unit"        "unite"       "university"  "unless"     
## [171] "until"       "up"          "upon"        "use"         "usual"
# only consonants
str_view_match(words, "^[^aeiou]+$")
# end with ed, but not with eed
str_view_match(words, "^ed$|[^e]ed$")
# end with ing or ise
str_view_match(words, "ing$|ise$")
  1. Empirically verify the rule “i before e except after c”.
str_view_match(words, "([^c]ie|cei)") #rule
str_view_match(words, "(cie)") #exceptions?

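From the second output we can see that there are some exceptions to this rule, such as the words science and society. Note that the second pattern only looks for "ie" after "c"; words with "ei" that does not follow "c" would also be exceptions.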
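
A fuller check would count both kinds of exception. The sketch below is one possible way to do it, assuming that "(^|[^c])" is an acceptable way to say "not preceded by c"; the variable names follows and breaks are my own.

```r
library(stringr)

# Words that follow the rule: "ie" not after c, or "ei" after c
follows <- str_subset(words, "(^|[^c])ie|cei")

# Words that break the rule: "ie" after c, or "ei" not after c
breaks <- str_subset(words, "cie|(^|[^c])ei")

length(follows)
length(breaks)
breaks
```

Comparing the two lengths gives a rough idea of how well the rule holds in stringr::words.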

  1. Is “q” always followed by a “u”?
str_view_match(words, "q[^u]")

No words matched, so within stringr::words “q” is always followed by a “u” whenever another character follows it (this pattern does not test for a word ending in “q”).

  1. Write a regular expression that matches a word if it’s probably written in British English, not American English.
str_view_match(words, "ise$|our") #ise instead of ize and our instead of or
  1. Create a regular expression that will match telephone numbers as commonly written in your country
phones <- c("(55)32498722", "(778)952-5873")
str_view(phones, "\\(\\d{2}\\)\\d{8}", match = TRUE)

The pattern matches only the first number, which is written in the format commonly used in Mexico: a two-digit area code in parentheses followed by eight digits.
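
Without anchors the pattern also matches a valid number embedded in a longer string. A minimal sketch of the anchored version follows; the third phone string is a made-up example added to show the difference.

```r
library(stringr)

# The third entry is hypothetical: extra characters around a valid number
phones <- c("(55)32498722", "(778)952-5873", "x(55)324987221")

# Unanchored: finds the format anywhere inside the string
str_detect(phones, "\\(\\d{2}\\)\\d{8}")      # TRUE FALSE TRUE

# Anchored: the whole string must have the format
str_detect(phones, "^\\(\\d{2}\\)\\d{8}$")    # TRUE FALSE FALSE
```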

14.3.4.1 Repetition

  1. Create regular expressions to find all words that:
  • Start with three consonants.
  • Have three or more vowels in a row.
  • Have two or more vowel-consonant pairs in a row.
# start with 3 consonants
str_view_match(words, "^[^aeiou]{3}")
# ≥ 3 vowels in a row
str_view_match(words, "[aeiou]{3,}")
# ≥ 2 vowel-consonant pairs in a row
str_subset(words, "([aeiou][^aeiou]){2,}")
##   [1] "absolute"    "agent"       "along"       "america"     "another"    
##   [6] "apart"       "apparent"    "authority"   "available"   "aware"      
##  [11] "away"        "balance"     "basis"       "become"      "before"     
##  [16] "begin"       "behind"      "benefit"     "business"    "character"  
##  [21] "closes"      "community"   "consider"    "cover"       "debate"     
##  [26] "decide"      "decision"    "definite"    "department"  "depend"     
##  [31] "design"      "develop"     "difference"  "difficult"   "direct"     
##  [36] "divide"      "document"    "during"      "economy"     "educate"    
##  [41] "elect"       "electric"    "eleven"      "encourage"   "environment"
##  [46] "europe"      "even"        "evening"     "ever"        "every"      
##  [51] "evidence"    "exact"       "example"     "exercise"    "exist"      
##  [56] "family"      "figure"      "final"       "finance"     "finish"     
##  [61] "friday"      "future"      "general"     "govern"      "holiday"    
##  [66] "honest"      "hospital"    "however"     "identify"    "imagine"    
##  [71] "individual"  "interest"    "introduce"   "item"        "jesus"      
##  [76] "level"       "likely"      "limit"       "local"       "major"      
##  [81] "manage"      "meaning"     "measure"     "minister"    "minus"      
##  [86] "minute"      "moment"      "money"       "music"       "nature"     
##  [91] "necessary"   "never"       "notice"      "okay"        "open"       
##  [96] "operate"     "opportunity" "organize"    "original"    "over"       
## [101] "paper"       "paragraph"   "parent"      "particular"  "photograph" 
## [106] "police"      "policy"      "politic"     "position"    "positive"   
## [111] "power"       "prepare"     "present"     "presume"     "private"    
## [116] "probable"    "process"     "produce"     "product"     "project"    
## [121] "proper"      "propose"     "protect"     "provide"     "quality"    
## [126] "realise"     "reason"      "recent"      "recognize"   "recommend"  
## [131] "record"      "reduce"      "refer"       "regard"      "relation"   
## [136] "remember"    "report"      "represent"   "result"      "return"     
## [141] "saturday"    "second"      "secretary"   "secure"      "separate"   
## [146] "seven"       "similar"     "specific"    "strategy"    "student"    
## [151] "stupid"      "telephone"   "television"  "therefore"   "thousand"   
## [156] "today"       "together"    "tomorrow"    "tonight"     "total"      
## [161] "toward"      "travel"      "unit"        "unite"       "university" 
## [166] "upon"        "visit"       "water"       "woman"

14.3.5.1 Grouping and backreferences

  1. Construct regular expressions to match words that:
  • Start and end with the same character.
  • Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)
  • Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)
#start + end same character
str_view_match(words, "^(.).*\\1$")
#repeated pair of letters
str_view_match(words, "(..).*\\1")
#repeated letters
str_view_match(words, "(.).*\\1.*\\1")
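
To see what each backreference pattern does, it helps to run them on a tiny vector. The sample below is hypothetical and was chosen to hit each case; the second pattern is a variant I added that also accepts one-letter words, which the first one skips.

```r
library(stringr)

# Hypothetical sample words
x <- c("a", "area", "church", "eleven", "dog")

str_subset(x, "^(.).*\\1$")       # "area": needs at least two characters
str_subset(x, "^(.)(.*\\1)?$")    # "a" "area": one-letter words count too
str_subset(x, "(..).*\\1")        # "church": the pair "ch" appears twice
str_subset(x, "(.).*\\1.*\\1")    # "eleven": "e" appears three times
```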

14.4 Tools

14.4.2 Detect matches

  1. For each of the following challenges, try solving by using both a single regular expression, and a combination of multiple str_detect() calls.
  • Find all words that start or end with x.
  • Find all words that start with a vowel and end with a consonant.
  • Are there any words that contain at least one of each different vowel?
# start or end with x 
str_view_match(words, "^x|x$") #single regex
# multiple str_detect() calls
start_x <- str_detect(words, "^x")
end_x <- str_detect(words, "x$")
words[start_x | end_x]
## [1] "box" "sex" "six" "tax"
#start with vowel end with consonant
str_subset(words, "^[aeiou].*[^aeiou]$") #single regex
##   [1] "about"       "accept"      "account"     "across"      "act"        
##   [6] "actual"      "add"         "address"     "admit"       "affect"     
##  [11] "afford"      "after"       "afternoon"   "again"       "against"    
##  [16] "agent"       "air"         "all"         "allow"       "almost"     
##  [21] "along"       "already"     "alright"     "although"    "always"     
##  [26] "amount"      "and"         "another"     "answer"      "any"        
##  [31] "apart"       "apparent"    "appear"      "apply"       "appoint"    
##  [36] "approach"    "arm"         "around"      "art"         "as"         
##  [41] "ask"         "at"          "attend"      "authority"   "away"       
##  [46] "awful"       "each"        "early"       "east"        "easy"       
##  [51] "eat"         "economy"     "effect"      "egg"         "eight"      
##  [56] "either"      "elect"       "electric"    "eleven"      "employ"     
##  [61] "end"         "english"     "enjoy"       "enough"      "enter"      
##  [66] "environment" "equal"       "especial"    "even"        "evening"    
##  [71] "ever"        "every"       "exact"       "except"      "exist"      
##  [76] "expect"      "explain"     "express"     "identify"    "if"         
##  [81] "important"   "in"          "indeed"      "individual"  "industry"   
##  [86] "inform"      "instead"     "interest"    "invest"      "it"         
##  [91] "item"        "obvious"     "occasion"    "odd"         "of"         
##  [96] "off"         "offer"       "often"       "okay"        "old"        
## [101] "on"          "only"        "open"        "opportunity" "or"         
## [106] "order"       "original"    "other"       "ought"       "out"        
## [111] "over"        "own"         "under"       "understand"  "union"      
## [116] "unit"        "university"  "unless"      "until"       "up"         
## [121] "upon"        "usual"
#multiple str_detect()
start_vowel <- str_detect(words, "^[aeiou]")
end_cons <- str_detect(words, "[^aeiou]$")
words[start_vowel & end_cons]
##   [1] "about"       "accept"      "account"     "across"      "act"        
##   [6] "actual"      "add"         "address"     "admit"       "affect"     
##  [11] "afford"      "after"       "afternoon"   "again"       "against"    
##  [16] "agent"       "air"         "all"         "allow"       "almost"     
##  [21] "along"       "already"     "alright"     "although"    "always"     
##  [26] "amount"      "and"         "another"     "answer"      "any"        
##  [31] "apart"       "apparent"    "appear"      "apply"       "appoint"    
##  [36] "approach"    "arm"         "around"      "art"         "as"         
##  [41] "ask"         "at"          "attend"      "authority"   "away"       
##  [46] "awful"       "each"        "early"       "east"        "easy"       
##  [51] "eat"         "economy"     "effect"      "egg"         "eight"      
##  [56] "either"      "elect"       "electric"    "eleven"      "employ"     
##  [61] "end"         "english"     "enjoy"       "enough"      "enter"      
##  [66] "environment" "equal"       "especial"    "even"        "evening"    
##  [71] "ever"        "every"       "exact"       "except"      "exist"      
##  [76] "expect"      "explain"     "express"     "identify"    "if"         
##  [81] "important"   "in"          "indeed"      "individual"  "industry"   
##  [86] "inform"      "instead"     "interest"    "invest"      "it"         
##  [91] "item"        "obvious"     "occasion"    "odd"         "of"         
##  [96] "off"         "offer"       "often"       "okay"        "old"        
## [101] "on"          "only"        "open"        "opportunity" "or"         
## [106] "order"       "original"    "other"       "ought"       "out"        
## [111] "over"        "own"         "under"       "understand"  "union"      
## [116] "unit"        "university"  "unless"      "until"       "up"         
## [121] "upon"        "usual"
#one of each vowel (this single regex only finds the vowels when they appear in the order a, e, i, o, u)
allv <- c("aeioux", "aei") #to check it works
str_subset(allv, "a.*e.*i.*o.*u")
## [1] "aeioux"
str_subset(words, "a.*e.*i.*o.*u")
## character(0)
# multiple str_detect()
words[str_detect(words, "a") & str_detect(words, "e") &
        str_detect(words, "i") & str_detect(words, "o") &
        str_detect(words, "u")]
## character(0)

There are no words in stringr::words that contain all the vowels.
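
The single regular expression used above requires the vowels to appear in alphabetical order, while the str_detect() combination does not. One order-independent alternative is a chain of lookaheads. This is only a sketch; the test vector is made up.

```r
library(stringr)

# Hypothetical test strings: "sequoia" has all five vowels, but not in order
x <- c("aeioux", "sequoia", "aei")

# Ordered pattern: misses "sequoia"
str_detect(x, "a.*e.*i.*o.*u")    # TRUE FALSE FALSE

# One lookahead per vowel: order does not matter
all_vowels <- "^(?=.*a)(?=.*e)(?=.*i)(?=.*o)(?=.*u)"
str_detect(x, all_vowels)         # TRUE TRUE FALSE

# Applied to the word list
str_subset(words, all_vowels)
```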

  1. What word has the highest number of vowels? What word has the highest proportion of vowels?
#highest number of vowels
num_v <- str_count(words, "[aeiou]")
max_v <- max(num_v)
words[num_v == max_v]
## [1] "appropriate" "associate"   "available"   "colleague"   "encourage"  
## [6] "experience"  "individual"  "television"
#highest proportion of vowels
prop_v <- str_count(words, "[aeiou]") / str_length(words)
max_p <- max(prop_v)
words[prop_v == max_p]
## [1] "a"

Eight words have 5 vowels, which is the maximum number of vowels among these words.

The word a has the highest proportion of vowels (1), since its only letter is a vowel: length = 1 and num_v = 1.

14.4.3.1 Extract matches

  1. From the Harvard sentences data, extract:
  • The first word from each sentence.
  • All words ending in ing.
  • All plurals.
#first word
str_extract(sentences, "[^ ]+") %>% head()
## [1] "The"   "Glue"  "It's"  "These" "Rice"  "The"
sentences %>% head() #to check it worked
## [1] "The birch canoe slid on the smooth planks." 
## [2] "Glue the sheet to the dark blue background."
## [3] "It's easy to tell the depth of a well."     
## [4] "These days a chicken leg is a rare dish."   
## [5] "Rice is often served in round bowls."       
## [6] "The juice of lemons makes fine punch."
#words ending in ing
pat_ing <- "[A-Za-z]+ing" #define pattern (no word boundary, so fragments of longer words such as "hing" also match)
ing <- str_detect(sentences, pat_ing)
str_extract_all(sentences[ing], pat_ing) %>%
  unlist() %>%
  unique() #don't show repeated words
##  [1] "stocking"  "spring"    "evening"   "morning"   "winding"  
##  [6] "living"    "king"      "Adding"    "making"    "raging"   
## [11] "playing"   "sleeping"  "ring"      "glaring"   "sinking"  
## [16] "thing"     "dying"     "Bring"     "lodging"   "filing"   
## [21] "wearing"   "wading"    "swing"     "nothing"   "Whiting"  
## [26] "sing"      "bring"     "painting"  "walking"   "ling"     
## [31] "shipping"  "hing"      "puzzling"  "landing"   "waiting"  
## [36] "whistling" "timing"    "ting"      "changing"  "drenching"
## [41] "moving"    "working"
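#plurals (approximated as at least three letters followed by s; this also catches fragments such as "Thes" and verbs such as "makes")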
str_extract_all(sentences, "[A-Za-z]{3,}s") %>%
  unlist() %>%
  unique() %>%
  head()
## [1] "planks" "Thes"   "days"   "bowls"  "lemons" "makes"
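
Fragments such as "hing" and "Thes" appear in the outputs above because the patterns are not tied to word boundaries. A minimal sketch of the same patterns with \b added, run on a made-up sentence rather than on the Harvard sentences:

```r
library(stringr)

# Hypothetical sentence
s <- "The king was bringing these hinges and two planks."

# Without boundaries: "hing" is cut out of "hinges"
str_extract_all(s, "[A-Za-z]+ing")[[1]]          # "king" "bringing" "hing"
# With boundaries: only whole words ending in ing
str_extract_all(s, "\\b[A-Za-z]+ing\\b")[[1]]    # "king" "bringing"

# Same idea for the plural heuristic
str_extract_all(s, "[A-Za-z]{3,}s")[[1]]         # "thes" "hinges" "planks"
str_extract_all(s, "\\b[A-Za-z]{3,}s\\b")[[1]]   # "hinges" "planks"
```

The plural pattern remains a heuristic: it still cannot tell a plural noun from a verb that ends in s.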

14.4.4.1 Grouped matches

  1. Find all words that come after a “number” like “one”, “two”, “three” etc. Pull out both the number and the word.
pat_num <- "(one|two|three|four|five|six|seven|eight|nine|ten) ([^ ]+)"
sen_num <- sentences %>% str_subset(pat_num)
sen_num %>% str_match(pat_num)
##       [,1]            [,2]    [,3]      
##  [1,] "ten served"    "ten"   "served"  
##  [2,] "one over"      "one"   "over"    
##  [3,] "seven books"   "seven" "books"   
##  [4,] "two met"       "two"   "met"     
##  [5,] "two factors"   "two"   "factors" 
##  [6,] "one and"       "one"   "and"     
##  [7,] "three lists"   "three" "lists"   
##  [8,] "seven is"      "seven" "is"      
##  [9,] "two when"      "two"   "when"    
## [10,] "one floor."    "one"   "floor."  
## [11,] "ten inches."   "ten"   "inches." 
## [12,] "one with"      "one"   "with"    
## [13,] "one war"       "one"   "war"     
## [14,] "one button"    "one"   "button"  
## [15,] "six minutes."  "six"   "minutes."
## [16,] "ten years"     "ten"   "years"   
## [17,] "one in"        "one"   "in"      
## [18,] "ten chased"    "ten"   "chased"  
## [19,] "one like"      "one"   "like"    
## [20,] "two shares"    "two"   "shares"  
## [21,] "two distinct"  "two"   "distinct"
## [22,] "one costs"     "one"   "costs"   
## [23,] "ten two"       "ten"   "two"     
## [24,] "five robins."  "five"  "robins." 
## [25,] "four kinds"    "four"  "kinds"   
## [26,] "one rang"      "one"   "rang"    
## [27,] "ten him."      "ten"   "him."    
## [28,] "three story"   "three" "story"   
## [29,] "ten by"        "ten"   "by"      
## [30,] "one wall."     "one"   "wall."   
## [31,] "three inches"  "three" "inches"  
## [32,] "ten your"      "ten"   "your"    
## [33,] "six comes"     "six"   "comes"   
## [34,] "one before"    "one"   "before"  
## [35,] "three batches" "three" "batches" 
## [36,] "two leaves."   "two"   "leaves."
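
Several rows above, such as "ten served" and "one over", look like number words found inside longer words. A word boundary in front of the number avoids this. The sketch below uses two made-up sentences, and the name pat_num_b is my own.

```r
library(stringr)

# Hypothetical sentences: "often" contains "ten", "stone" contains "one"
x <- c("Dinner is often served at seven sharp.",
       "The stone over the door is old.")

pat_num   <- "(one|two|three|four|five|six|seven|eight|nine|ten) ([^ ]+)"
pat_num_b <- "\\b(one|two|three|four|five|six|seven|eight|nine|ten) +(\\w+)"

str_match(x, pat_num)[, 1]     # "ten served" "one over"
str_match(x, pat_num_b)[, 1]   # "seven sharp" NA
```

Using \w+ for the following word also drops trailing punctuation such as the period in "floor.".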
  1. Find all contractions. Separate out the pieces before and after the apostrophe.
cont <- "([A-Za-z]+)'([A-Za-z]+)"
sen_cont <- sentences %>% str_subset(cont)
sen_cont %>% str_match(cont)
##       [,1]         [,2]       [,3]
##  [1,] "It's"       "It"       "s" 
##  [2,] "man's"      "man"      "s" 
##  [3,] "don't"      "don"      "t" 
##  [4,] "store's"    "store"    "s" 
##  [5,] "workmen's"  "workmen"  "s" 
##  [6,] "Let's"      "Let"      "s" 
##  [7,] "sun's"      "sun"      "s" 
##  [8,] "child's"    "child"    "s" 
##  [9,] "king's"     "king"     "s" 
## [10,] "It's"       "It"       "s" 
## [11,] "don't"      "don"      "t" 
## [12,] "queen's"    "queen"    "s" 
## [13,] "don't"      "don"      "t" 
## [14,] "pirate's"   "pirate"   "s" 
## [15,] "neighbor's" "neighbor" "s"

14.4.5.1 Replacing matches

  1. Replace all forward slashes in a string with backslashes.
for_slash <- ("one/two/three")
str_replace_all(for_slash, "/", "\\\\") %>% writeLines()
## one\two\three
  1. Implement a simple version of str_to_lower() using replace_all().
caps <- ("ABCDE")
str_replace_all(caps, "([A-Z])", tolower)
## [1] "abcde"

14.4.6.1 Splitting

  1. Split up a string like “apples, pears, and bananas” into individual components.
fruity <- "apples, pears, and bananas"
str_split(fruity, ", and |, ") #include the space after the comma so the pieces have no leading space
## [[1]]
## [1] "apples"  "pears"   "bananas"
  1. Why is it better to split up by boundary("word") than " "?
fruity2 <- ("fruit: apples, pears, (bananas), and oranges")
str_split(fruity2, " ")
## [[1]]
## [1] "fruit:"     "apples,"    "pears,"     "(bananas)," "and"       
## [6] "oranges"
str_split(fruity2, boundary("word"))
## [[1]]
## [1] "fruit"   "apples"  "pears"   "bananas" "and"     "oranges"

Splitting with boundary("word") is better because it removes punctuation automatically, so I don’t have to list every punctuation character to strip, such as the colon, the commas or the parentheses.

14.5 Other types of pattern

  1. How would you find all strings containing \ with regex() vs. with fixed()?
strings <- c("ab", "0\\1", "x\\y")
#regex()
str_subset(strings, regex("\\\\")) 
## [1] "0\\1" "x\\y"
#fixed()
str_subset(strings, fixed("\\"))
## [1] "0\\1" "x\\y"
  1. What are the five most common words in sentences?
(words_sen <- str_split(sentences, boundary("word")) %>%
  unlist() %>%
  str_to_lower() %>% #avoid repeated words in caps and lower
  as_tibble() %>% #as.tibble() is deprecated in favour of as_tibble()
  set_names("word") %>%
  group_by(word) %>%
  count(sort = TRUE) %>% #order by number
  head(5)) #only top 5
## # A tibble: 5 x 2
## # Groups:   word [5]
##   word      n
##   <chr> <int>
## 1 the     751
## 2 a       202
## 3 of      132
## 4 to      123
## 5 and     118

Topic 4: Work with the singer data

4.1 Use purrr to map latitude and longitude into human readable information on the band’s origin places.

Notice that revgeocode(... , output = "more") outputs a data frame, while revgeocode(... , output = "address") returns a string: you have the option of dealing with nested data frames. You will need to pay attention to two things:

  • Not all of the tracks have a latitude and longitude: what can we do with the missing information? (filtering, …)
  • Not every query through revgeocode() returns a result. What can we do to keep those errors from biting us? (look at possibly() in purrr…)
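
The second point can be sketched without calling the Google API. In the example below, fake_revgeocode() is a hypothetical stand-in for revgeocode() that fails on missing coordinates; only possibly() and map2_chr() are real purrr functions.

```r
library(purrr)

# Hypothetical stand-in for a reverse-geocoding call such as revgeocode();
# it throws an error when a coordinate is missing.
fake_revgeocode <- function(lon, lat) {
  if (is.na(lon) || is.na(lat)) stop("missing coordinates")
  paste0("place at (", lon, ", ", lat, ")")
}

# possibly() wraps the function so an error returns `otherwise` instead
safe_revgeocode <- possibly(fake_revgeocode, otherwise = NA_character_)

lon <- c(-87.63241, NA, -74.00712)
lat <- c(41.88415, NA, 40.71455)

# One result per coordinate pair; the failing pair becomes NA
places <- map2_chr(lon, lat, safe_revgeocode)
places
```

With the wrapped function, one failed lookup no longer stops the whole map, and the failures can be filtered out afterwards with is.na().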

First, I need to load the necessary packages and register my Google API key to use ggmap.

library(tidyverse)
library(devtools)
install_github("dkahle/ggmap")
library(ggplot2)
library(ggmap)
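register_google(key = Sys.getenv("GOOGLE_MAPS_API_KEY")) #key read from an environment variable (the variable name is an assumption) instead of being hard-coded in the report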

The data set singer_locations contains information about songs and associated artists in the Million Song Dataset.

Let’s look at this data frame:

library(singer)
str(singer_locations)
## Classes 'tbl_df', 'tbl' and 'data.frame':    10100 obs. of  14 variables:
##  $ track_id          : chr  "TRWICRA128F42368DB" "TRXJANY128F42246FC" "TRIKPCA128F424A553" "TRYEATD128F92F87C9" ...
##  $ title             : chr  "The Conversation (Cd)" "Lonely Island" "Here's That Rainy Day" "Rego Park Blues" ...
##  $ song_id           : chr  "SOSURTI12A81C22FB8" "SODESQP12A6D4F98EF" "SOQUYQD12A8C131619" "SOEZGRC12AB017F1AC" ...
##  $ release           : chr  "Even If It Kills Me" "The Duke Of Earl" "Imprompture" "Still River" ...
##  $ artist_id         : chr  "ARACDPV1187FB58DF4" "ARYBUAO1187FB3F4EB" "AR4111G1187B9B58AB" "ARQDZP31187B98D623" ...
##  $ artist_name       : chr  "Motion City Soundtrack" "Gene Chandler" "Paul Horn" "Ronnie Earl & the Broadcasters" ...
##  $ year              : int  2007 2004 1998 1995 1968 2006 2003 2007 1966 2006 ...
##  $ duration          : num  170 107 528 695 237 ...
##  $ artist_hotttnesss : num  0.641 0.394 0.431 0.362 0.411 ...
##  $ artist_familiarity: num  0.823 0.57 0.504 0.477 0.53 ...
##  $ latitude          : num  NA 41.9 40.7 NA 42.3 ...
##  $ longitude         : num  NA -87.6 -74 NA -83 ...
##  $ name              : chr  NA "Gene Chandler" "Paul Horn" NA ...
##  $ city              : chr  NA "Chicago, IL" "New York, NY" NA ...
##  - attr(*, "spec")=List of 2
##   ..$ cols   :List of 14
##   .. ..$ track_id          : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ title             : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ song_id           : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ release           : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ artist_id         : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ artist_name       : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ year              : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ duration          : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ artist_hotttnesss : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ artist_familiarity: list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ latitude          : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ longitude         : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ name              : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ city              : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   ..$ default: list()
##   .. ..- attr(*, "class")= chr  "collector_guess" "collector"
##   ..- attr(*, "class")= chr "col_spec"
library(kableExtra)
singer_locations %>% head() %>%
  kable() %>%
  kable_styling(full_width = F, position = "center")
track_id | title | song_id | release | artist_id | artist_name | year | duration | artist_hotttnesss | artist_familiarity | latitude | longitude | name | city
TRWICRA128F42368DB | The Conversation (Cd) | SOSURTI12A81C22FB8 | Even If It Kills Me | ARACDPV1187FB58DF4 | Motion City Soundtrack | 2007 | 170.4485 | 0.6410183 | 0.8230522 | NA | NA | NA | NA
TRXJANY128F42246FC | Lonely Island | SODESQP12A6D4F98EF | The Duke Of Earl | ARYBUAO1187FB3F4EB | Gene Chandler | 2004 | 106.5530 | 0.3937627 | 0.5700167 | 41.88415 | -87.63241 | Gene Chandler | Chicago, IL
TRIKPCA128F424A553 | Here’s That Rainy Day | SOQUYQD12A8C131619 | Imprompture | AR4111G1187B9B58AB | Paul Horn | 1998 | 527.5947 | 0.4306226 | 0.5039940 | 40.71455 | -74.00712 | Paul Horn | New York, NY
TRYEATD128F92F87C9 | Rego Park Blues | SOEZGRC12AB017F1AC | Still River | ARQDZP31187B98D623 | Ronnie Earl & the Broadcasters | 1995 | 695.1179 | 0.3622792 | 0.4773099 | NA | NA | NA | NA
TRBYYXH128F4264585 | Games | SOPIOCP12A8C13A322 | Afro-Harping | AR75GYU1187B9AE47A | Dorothy Ashby | 1968 | 237.3220 | 0.4107520 | 0.5303468 | 42.33168 | -83.04792 | Dorothy Ashby | Detroit, MI
TRKFFKR128F9303AE3 | More Pipes | SOHQSPY12AB0181325 | Six Yanks | ARCENE01187B9AF929 | Barleyjuice | 2006 | 192.9400 | 0.3762635 | 0.5412950 | 40.99471 | -77.60454 | Barleyjuice | Pennsylvania

The singer_locations data frame contains geographical information associated with the artist location stored in two different formats: 1. as a (dirty!) variable named city; 2. as a latitude / longitude pair (stored in latitude, longitude respectively).

The first rows show that some tracks lack this geographical information (their values are NA), so I will keep only the tracks that have it.

singer_geo <- singer_locations %>%
  filter(!is.na(city)) %>%
  select(title, artist_name, year, latitude, longitude, city) #to make table smaller
singer_geo %>%
  head() %>%
  kable() %>%
  kable_styling(full_width = F, position = "center")
title artist_name year latitude longitude city
Lonely Island Gene Chandler 2004 41.88415 -87.63241 Chicago, IL
Here’s That Rainy Day Paul Horn 1998 40.71455 -74.00712 New York, NY
Games Dorothy Ashby 1968 42.33168 -83.04792 Detroit, MI
More Pipes Barleyjuice 2006 40.99471 -77.60454 Pennsylvania
Indian Deli Madlib 2007 34.20034 -119.18044 Oxnard, CA
Miss Gorgeous Seeed’s Pharaoh Riddim Feat. General Degree 2003 50.73230 7.10169 Bonn
nrow(singer_locations)
## [1] 10100
nrow(singer_geo)
## [1] 4129
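
The filter above only checks city. Since the later steps rely on the coordinates, it may be worth also requiring latitude and longitude to be present. The snippet below is a minimal sketch of that idea on a toy data frame; the rows in toy are made up for illustration and are not taken from singer_locations.

```r
library(dplyr)

# Toy data frame (made-up rows) standing in for singer_locations
toy <- data.frame(
  title     = c("a", "b", "c"),
  city      = c("Chicago, IL", NA, "Bonn"),
  latitude  = c(41.88415, NA, NA),
  longitude = c(-87.63241, NA, 7.10169)
)

# Keep only rows where city AND both coordinates are present
toy_geo <- toy %>%
  filter(!is.na(city), !is.na(latitude), !is.na(longitude))

nrow(toy_geo)
## [1] 1
```

Only the first toy row survives, because the third one has a city but no latitude.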

After filtering, the new data frame singer_geo has 4129 observations, compared to the 10100 in the original data frame. As that is still a lot of observations, I will only work with the first 25 songs. I have also kept only a few of the variables, to have an easier data set to look at.

singer_geo <- singer_geo[1:25,]
singer_geo %>%
  kable() %>%
  kable_styling(full_width = F, position = "center")
title artist_name year latitude longitude city
Lonely Island Gene Chandler 2004 41.88415 -87.63241 Chicago, IL
Here’s That Rainy Day Paul Horn 1998 40.71455 -74.00712 New York, NY
Games Dorothy Ashby 1968 42.33168 -83.04792 Detroit, MI
More Pipes Barleyjuice 2006 40.99471 -77.60454 Pennsylvania
Indian Deli Madlib 2007 34.20034 -119.18044 Oxnard, CA
Miss Gorgeous Seeed’s Pharaoh Riddim Feat. General Degree 2003 50.73230 7.10169 Bonn
Lahainaluna Keali’i Reichel 2003 19.59009 -155.43414 Hawaii
The Ingenue (LP Version) Little Feat 1989 34.05349 -118.24532 Los Angeles, CA
The Unquiet Grave (Child No. 78) Joan Baez 1964 40.57250 -74.15400 Staten Island, NY
The Breaks 31Knots 2008 45.51179 -122.67563 Portland, OR
The Operator Bleep 1989 51.50632 -0.12714 UK - England - London
Con Il Nastro Rosa Lucio Battisti 1980 42.50172 12.88512 Poggio Bustone, Rieti, Italy
SOS Ray Brown Trio / Ralph Moore 1991 40.43831 -79.99745 Pittsburgh, PA
At The End iio 2002 40.71455 -74.00712 New York, NY
The Hunting Song Tom Lehrer 1953 37.77916 -122.42005 New York, NY
Mob Job (LP Version) John Zorn 1989 40.71455 -74.00712 New York, NY
Nothing’s the Same The Meeting Places 2006 34.05349 -118.24532 Los Angeles, CA
Bohemian Ballet Deep Forest 1995 37.27188 -119.27023 California
Do You Mean To Imply Billy Cobham 1999 8.41770 -80.11278 Panama
Pollen And Salt Daphne Loves Derby 2005 47.38028 -122.23742 KENT, WASHINGTON
Surrounded SOiL 2009 41.88415 -87.63241 Chicago
Headless Run Level Zero 2003 62.19845 17.55142 SWEDEN
Na Laethe Bhí Clannad 1993 53.41961 -8.24055 Ireland
Haiku (Album Version) Tally Hall 2005 42.32807 -83.73360 Ann Arbor, MI
Bedlam Boys Old Blind Dogs 2007 57.15382 -2.10679 Aberdeen, Scotland
# revgeocode() comes from the ggmap package and expects c(longitude, latitude)
singer_address <- mapply(FUN = function(longitude, latitude) {
  revgeocode(c(longitude, latitude), output = "address")},
  singer_geo$longitude, singer_geo$latitude)
## Information from URL : https://maps.googleapis.com/maps/api/geocode/json?latlng=41.88415,-87.63241&key=AIzaSyCZDZzfVa_lzlESRafTcuwH1BzethDNdcQ
## Information from URL : https://maps.googleapis.com/maps/api/geocode/json?latlng=40.71455,-74.00712&key=AIzaSyCZDZzfVa_lzlESRafTcuwH1BzethDNdcQ
## Information from URL : https://maps.googleapis.com/maps/api/geocode/json?latlng=42.33168,-83.04792&key=AIzaSyCZDZzfVa_lzlESRafTcuwH1BzethDNdcQ
## Information from URL : https://maps.googleapis.com/maps/api/geocode/json?latlng=40.99471,-77.60454&key=AIzaSyCZDZzfVa_lzlESRafTcuwH1BzethDNdcQ
## Information from URL : https://maps.googleapis.com/maps/api/geocode/json?latlng=34.20034,-119.18044&key=AIzaSyCZDZzfVa_lzlESRafTcuwH1BzethDNdcQ
## Information from URL : https://maps.googleapis.com/maps/api/geocode/json?latlng=50.7323,7.10169&key=AIzaSyCZDZzfVa_lzlESRafTcuwH1BzethDNdcQ
## Information from URL : https://maps.googleapis.com/maps/api/geocode/json?latlng=19.59009,-155.43414&key=AIzaSyCZDZzfVa_lzlESRafTcuwH1BzethDNdcQ
## Information from URL : https://maps.googleapis.com/maps/api/geocode/json?latlng=34.05349,-118.24532&key=AIzaSyCZDZzfVa_lzlESRafTcuwH1BzethDNdcQ
## Information from URL : https://maps.googleapis.com/maps/api/geocode/json?latlng=40.5725,-74.154&key=AIzaSyCZDZzfVa_lzlESRafTcuwH1BzethDNdcQ
## Information from URL : https://maps.googleapis.com/maps/api/geocode/json?latlng=45.51179,-122.67563&key=AIzaSyCZDZzfVa_lzlESRafTcuwH1BzethDNdcQ
## Information from URL : https://maps.googleapis.com/maps/api/geocode/json?latlng=51.50632,-0.12714&key=AIzaSyCZDZzfVa_lzlESRafTcuwH1BzethDNdcQ
## Information from URL : https://maps.googleapis.com/maps/api/geocode/json?latlng=42.50172,12.88512&key=AIzaSyCZDZzfVa_lzlESRafTcuwH1BzethDNdcQ
## Information from URL : https://maps.googleapis.com/maps/api/geocode/json?latlng=40.43831,-79.99745&key=AIzaSyCZDZzfVa_lzlESRafTcuwH1BzethDNdcQ
## Information from URL : https://maps.googleapis.com/maps/api/geocode/json?latlng=40.71455,-74.00712&key=AIzaSyCZDZzfVa_lzlESRafTcuwH1BzethDNdcQ
## Information from URL : https://maps.googleapis.com/maps/api/geocode/json?latlng=37.77916,-122.42005&key=AIzaSyCZDZzfVa_lzlESRafTcuwH1BzethDNdcQ
## Information from URL : https://maps.googleapis.com/maps/api/geocode/json?latlng=40.71455,-74.00712&key=AIzaSyCZDZzfVa_lzlESRafTcuwH1BzethDNdcQ
## Information from URL : https://maps.googleapis.com/maps/api/geocode/json?latlng=34.05349,-118.24532&key=AIzaSyCZDZzfVa_lzlESRafTcuwH1BzethDNdcQ
## Information from URL : https://maps.googleapis.com/maps/api/geocode/json?latlng=37.27188,-119.27023&key=AIzaSyCZDZzfVa_lzlESRafTcuwH1BzethDNdcQ
## Information from URL : https://maps.googleapis.com/maps/api/geocode/json?latlng=8.4177,-80.11278&key=AIzaSyCZDZzfVa_lzlESRafTcuwH1BzethDNdcQ
## Information from URL : https://maps.googleapis.com/maps/api/geocode/json?latlng=47.38028,-122.23742&key=AIzaSyCZDZzfVa_lzlESRafTcuwH1BzethDNdcQ
## Information from URL : https://maps.googleapis.com/maps/api/geocode/json?latlng=41.88415,-87.63241&key=AIzaSyCZDZzfVa_lzlESRafTcuwH1BzethDNdcQ
## Information from URL : https://maps.googleapis.com/maps/api/geocode/json?latlng=62.19845,17.55142&key=AIzaSyCZDZzfVa_lzlESRafTcuwH1BzethDNdcQ
## Information from URL : https://maps.googleapis.com/maps/api/geocode/json?latlng=53.41961,-8.24055&key=AIzaSyCZDZzfVa_lzlESRafTcuwH1BzethDNdcQ
## Information from URL : https://maps.googleapis.com/maps/api/geocode/json?latlng=42.32807,-83.7336&key=AIzaSyCZDZzfVa_lzlESRafTcuwH1BzethDNdcQ
## Information from URL : https://maps.googleapis.com/maps/api/geocode/json?latlng=57.15382,-2.10679&key=AIzaSyCZDZzfVa_lzlESRafTcuwH1BzethDNdcQ
singer_address
##  [1] "134 N LaSalle St suite 1720, Chicago, IL 60602, USA"                        
##  [2] "80 Chambers St, New York, NY 10007, USA"                                    
##  [3] "1001 Woodward Ave, Detroit, MI 48226, USA"                                  
##  [4] "Z. H. Confair Memorial Hwy, Howard, PA 16841, USA"                          
##  [5] "300 W 3rd St, Oxnard, CA 93030, USA"                                        
##  [6] "Regina-Pacis-Weg 1, 53113 Bonn, Germany"                                    
##  [7] "Unnamed Road, Hawaii, USA"                                                  
##  [8] "1420 S Oakhurst Dr, Los Angeles, CA 90035, USA"                             
##  [9] "215 Arthur Kill Rd, Staten Island, NY 10306, USA"                           
## [10] "1500 SW 1st Ave, Portland, OR 97201, USA"                                   
## [11] "39 Whitehall, Westminster, London SW1A 2BY, UK"                             
## [12] "Localita' Pescatore, Poggio Bustone, RI 02018, Italy"                       
## [13] "410 Grant St, Pittsburgh, PA 15219, USA"                                    
## [14] "80 Chambers St, New York, NY 10007, USA"                                    
## [15] "1 Dr Carlton B Goodlett Pl, San Francisco, CA 94102, USA"                   
## [16] "80 Chambers St, New York, NY 10007, USA"                                    
## [17] "1420 S Oakhurst Dr, Los Angeles, CA 90035, USA"                             
## [18] "Shaver Lake, CA 93634, USA"                                                 
## [19] "Calle Aviacion, Río Hato, Panama"                                           
## [20] "220 4th Ave S, Kent, WA 98032, USA"                                         
## [21] "134 N LaSalle St suite 1720, Chicago, IL 60602, USA"                        
## [22] "Unnamed Road, 862 96 Njurunda, Sweden"                                      
## [23] "ICastle view, Borris in ossory, Laois, Borris in ossory, Co. Laois, Ireland"
## [24] "3788 Pontiac Trail, Ann Arbor, MI 48105, USA"                               
## [25] "91 Hutcheon St, Aberdeen AB25 1EW, UK"

Now singer_address contains the corresponding addresses from the given coordinates. Let’s see if these addresses match with the variable city.

4.2 Try to check whether the place in city corresponds to the information you retrieved.

sing_add_city <- data.frame(address = singer_address, city = singer_geo$city)
sing_add_city %>% 
  kable() %>%
  kable_styling(full_width = F)
address city
134 N LaSalle St suite 1720, Chicago, IL 60602, USA Chicago, IL
80 Chambers St, New York, NY 10007, USA New York, NY
1001 Woodward Ave, Detroit, MI 48226, USA Detroit, MI
Z. H. Confair Memorial Hwy, Howard, PA 16841, USA Pennsylvania
300 W 3rd St, Oxnard, CA 93030, USA Oxnard, CA
Regina-Pacis-Weg 1, 53113 Bonn, Germany Bonn
Unnamed Road, Hawaii, USA Hawaii
1420 S Oakhurst Dr, Los Angeles, CA 90035, USA Los Angeles, CA
215 Arthur Kill Rd, Staten Island, NY 10306, USA Staten Island, NY
1500 SW 1st Ave, Portland, OR 97201, USA Portland, OR
39 Whitehall, Westminster, London SW1A 2BY, UK UK - England - London
Localita’ Pescatore, Poggio Bustone, RI 02018, Italy Poggio Bustone, Rieti, Italy
410 Grant St, Pittsburgh, PA 15219, USA Pittsburgh, PA
80 Chambers St, New York, NY 10007, USA New York, NY
1 Dr Carlton B Goodlett Pl, San Francisco, CA 94102, USA New York, NY
80 Chambers St, New York, NY 10007, USA New York, NY
1420 S Oakhurst Dr, Los Angeles, CA 90035, USA Los Angeles, CA
Shaver Lake, CA 93634, USA California
Calle Aviacion, Río Hato, Panama Panama
220 4th Ave S, Kent, WA 98032, USA KENT, WASHINGTON
134 N LaSalle St suite 1720, Chicago, IL 60602, USA Chicago
Unnamed Road, 862 96 Njurunda, Sweden SWEDEN
ICastle view, Borris in ossory, Laois, Borris in ossory, Co. Laois, Ireland Ireland
3788 Pontiac Trail, Ann Arbor, MI 48105, USA Ann Arbor, MI
91 Hutcheon St, Aberdeen AB25 1EW, UK Aberdeen, Scotland

From the table we can visually compare the cities, but let’s try with some code:

library(stringi)
stri_detect_fixed(sing_add_city$address, sing_add_city$city)
##  [1]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
## [12] FALSE  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE FALSE  TRUE FALSE
## [23]  TRUE  TRUE FALSE

The output shows that 8 of the 25 observations don’t match. However, some of these mismatches arise only because the column city writes some cities or countries in capitals (e.g. KENT, WASHINGTON or SWEDEN) or in a different format than the address.
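
The count does not have to be done by eye. As a small sketch, the logical vector below is copied by hand from the stri_detect_fixed() output (the name matches is only used here for illustration); sum() and which() then give the number and the positions of the mismatches.

```r
# Logical vector copied from the stri_detect_fixed() output
matches <- c(TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE,
             FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE,
             TRUE, TRUE, FALSE)

sum(!matches)    # number of mismatches
## [1] 8
which(!matches)  # row numbers of the mismatches
## [1]  4 11 12 15 18 20 22 25
```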

I will try this again, converting everything to lowercase and comparing the two columns word by word.

low_address <- str_to_lower(sing_add_city$address)
low_city <- str_to_lower(sing_add_city$city)
words_address <- str_split(low_address, boundary("word"))
words_city <- str_split(low_city, boundary("word"))

# TRUE when the address and the city share at least one word
has_match <- function(match_length) {
  match_length > 0
}

mapply(intersect, words_address, words_city) %>%
  lapply(length) %>%
  map(has_match)
## [[1]]
## [1] TRUE
## 
## [[2]]
## [1] TRUE
## 
## [[3]]
## [1] TRUE
## 
## [[4]]
## [1] FALSE
## 
## [[5]]
## [1] TRUE
## 
## [[6]]
## [1] TRUE
## 
## [[7]]
## [1] TRUE
## 
## [[8]]
## [1] TRUE
## 
## [[9]]
## [1] TRUE
## 
## [[10]]
## [1] TRUE
## 
## [[11]]
## [1] TRUE
## 
## [[12]]
## [1] TRUE
## 
## [[13]]
## [1] TRUE
## 
## [[14]]
## [1] TRUE
## 
## [[15]]
## [1] FALSE
## 
## [[16]]
## [1] TRUE
## 
## [[17]]
## [1] TRUE
## 
## [[18]]
## [1] FALSE
## 
## [[19]]
## [1] TRUE
## 
## [[20]]
## [1] TRUE
## 
## [[21]]
## [1] TRUE
## 
## [[22]]
## [1] TRUE
## 
## [[23]]
## [1] TRUE
## 
## [[24]]
## [1] TRUE
## 
## [[25]]
## [1] TRUE

After applying the previous code, the mismatches were reduced from 8 to 3. Observation number 15 is a true mismatch, since the address is in San Francisco and city says New York. In the other two cases (observations 4 and 18), the column city contains only the full state name (Pennsylvania, California), while the address abbreviates the state (PA, CA), so no word matches.

These remaining cases could potentially be resolved with different methods (e.g. having the states abbreviated in both columns), but this may compromise the accuracy of the output.
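
One way this could look is sketched below, using only base R. It assumes that the state names can be translated with the built-in state.name and state.abb vectors; the helpers abbreviate_state and split_words are hypothetical names introduced for illustration, and the three sample rows are copied from the comparison table (observations 4, 18 and 15).

```r
# Sample rows copied from the comparison table (observations 4, 18 and 15)
address <- c("Z. H. Confair Memorial Hwy, Howard, PA 16841, USA",
             "Shaver Lake, CA 93634, USA",
             "1 Dr Carlton B Goodlett Pl, San Francisco, CA 94102, USA")
city <- c("Pennsylvania", "California", "New York, NY")

# Hypothetical helper: replace a full US state name with its abbreviation,
# leaving any other value unchanged
abbreviate_state <- function(x) {
  idx <- match(tolower(x), tolower(state.name))
  ifelse(is.na(idx), x, state.abb[idx])
}

city_abbr <- abbreviate_state(city)

# Hypothetical helper: lowercase a string and split it into words
split_words <- function(x) strsplit(tolower(x), "[^[:alnum:]]+")

# TRUE when an address and its (abbreviated) city share at least one word
shares_word <- mapply(function(a, b) length(intersect(a, b)) > 0,
                      split_words(address), split_words(city_abbr))
unname(shares_word)
## [1]  TRUE  TRUE FALSE
```

In this sample the two state-only rows now match and the San Francisco / New York row is still flagged. The accuracy concern remains, though: once lowercased, abbreviations such as "in" or "or" are also ordinary words and could match by accident, for example against an address like "Borris in ossory".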

4.3 Go visual

library(leaflet)

singer_geo %>%
  leaflet() %>%
  addTiles() %>%
  addCircles(lng = ~longitude,
             lat = ~latitude,
             popup = ~artist_name,
             color = "deeppink")

Clicking a pink circle on the map opens a popup with the name of the artist associated with that location.

References