Data wrangling is the process of cleaning, structuring and enriching data into a desired format (Trifacta, 2018).
In this assignment I chose two topics to work with to improve my data wrangling skills. In the first one I completed several exercises related to strings, and in the second one I worked with a dataset to extract and analyze geographical information.
For this task I completed exercises from the Strings chapter of R for Data Science.
I mostly worked with stringr::words and stringr::sentences.
Use str_length() and str_sub() to extract the middle character from a string. What will you do if the string has an even number of characters?
library(tidyverse)
library(stringr)
string1 <- "abc"
string2 <- "abcd"
str_sub(string1, floor((str_length(string1)+1)/2), ceiling((str_length(string1)+1)/2))
## [1] "b"
str_sub(string2, floor((str_length(string2)+1)/2), ceiling((str_length(string2)+1)/2)) #returns the two middle characters as the string has an even number.
## [1] "bc"
I chose to extract both middle characters when the string has an even number of characters, but I could also arbitrarily extract just one:
str_sub(string2, ceiling(str_length(string2)/2), ceiling(str_length(string2)/2))
## [1] "b"
This way, I extract only one of the middle characters.
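To reuse this, both behaviours can be wrapped in a small helper (a sketch; str_middle and its both argument are my own names, not part of the exercise):
str_middle <- function(x, both = TRUE) {
n <- str_length(x)
if (both) {
str_sub(x, floor((n + 1) / 2), ceiling((n + 1) / 2)) # both middle characters when even
} else {
str_sub(x, ceiling(n / 2), ceiling(n / 2)) # a single, arbitrarily chosen middle character
}
}
str_middle("abcd") # "bc"
str_middle("abcd", both = FALSE) # "b"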
The next exercise asks for a function that turns a vector like c("a", "b", "c") into the string "a, b, and c", thinking carefully about what happens for vectors of length 0, 1, and 2.
st_comma <- function(x, delim = ",") {
num <- length(x)
if(num == 0) {
stop("vector length = 0") #error message when trying with a length 0 vector
} else if(num == 1) {
x
} else if(num == 2) {
str_c(x[[1]], "and", x[[2]], sep = " ")
} else {
str_1 <- str_c(x[seq_len(num - 1)], delim) #all but the last
str_2 <- str_c("and", x[[num]], sep = " ")
str_c(c(str_1, str_2), collapse = " ")
}
}
# st_comma(c()) throws the error "vector length = 0", since the vector has length 0
st_comma("a")
## [1] "a"
st_comma(c("a", "b"))
## [1] "a and b"
st_comma(c("a", "b", "c"))
## [1] "a, b, and c"
"'\
?str_view("\"'\\", "\"'\\\\")
\..\..\..
match? How would you represent it as a string?str_view("w.x.y.z", "\\..\\..\\..")
It matches patterns with a dot followed by a character that repeats three times.
"$^$"
?str_view("$^$", "^\\$\\^\\$")
Given the corpus of common words in stringr::words, create regular expressions that find all words that start with "y", end with "x", are exactly three letters long (without cheating by using str_length()!), or have seven letters or more. Since the list is long, I used the match = TRUE argument to show only the matching words. When the output is too long, I use str_subset() instead of str_view() for a more compact output.
str_view_match <- function(words, pattern) {
str_view(words, pattern, match=TRUE)
} #function to only show matches
# start with y
str_view_match(words, "^y")
# end with x
str_view_match(words, "x$")
# have exactly three letters
str_subset(words, "^...$")
## [1] "act" "add" "age" "ago" "air" "all" "and" "any" "arm" "art" "ask"
## [12] "bad" "bag" "bar" "bed" "bet" "big" "bit" "box" "boy" "bus" "but"
## [23] "buy" "can" "car" "cat" "cup" "cut" "dad" "day" "die" "dog" "dry"
## [34] "due" "eat" "egg" "end" "eye" "far" "few" "fit" "fly" "for" "fun"
## [45] "gas" "get" "god" "guy" "hit" "hot" "how" "job" "key" "kid" "lad"
## [56] "law" "lay" "leg" "let" "lie" "lot" "low" "man" "may" "mrs" "new"
## [67] "non" "not" "now" "odd" "off" "old" "one" "out" "own" "pay" "per"
## [78] "put" "red" "rid" "run" "say" "see" "set" "sex" "she" "sir" "sit"
## [89] "six" "son" "sun" "tax" "tea" "ten" "the" "tie" "too" "top" "try"
## [100] "two" "use" "war" "way" "wee" "who" "why" "win" "yes" "yet" "you"
# ≥ 7 letters
str_subset(words, ".......")
## [1] "absolute" "account" "achieve" "address" "advertise"
## [6] "afternoon" "against" "already" "alright" "although"
## [11] "america" "another" "apparent" "appoint" "approach"
## [16] "appropriate" "arrange" "associate" "authority" "available"
## [21] "balance" "because" "believe" "benefit" "between"
## [26] "brilliant" "britain" "brother" "business" "certain"
## [31] "chairman" "character" "Christmas" "colleague" "collect"
## [36] "college" "comment" "committee" "community" "company"
## [41] "compare" "complete" "compute" "concern" "condition"
## [46] "consider" "consult" "contact" "continue" "contract"
## [51] "control" "converse" "correct" "council" "country"
## [56] "current" "decision" "definite" "department" "describe"
## [61] "develop" "difference" "difficult" "discuss" "district"
## [66] "document" "economy" "educate" "electric" "encourage"
## [71] "english" "environment" "especial" "evening" "evidence"
## [76] "example" "exercise" "expense" "experience" "explain"
## [81] "express" "finance" "fortune" "forward" "function"
## [86] "further" "general" "germany" "goodbye" "history"
## [91] "holiday" "hospital" "however" "hundred" "husband"
## [96] "identify" "imagine" "important" "improve" "include"
## [101] "increase" "individual" "industry" "instead" "interest"
## [106] "introduce" "involve" "kitchen" "language" "machine"
## [111] "meaning" "measure" "mention" "million" "minister"
## [116] "morning" "necessary" "obvious" "occasion" "operate"
## [121] "opportunity" "organize" "original" "otherwise" "paragraph"
## [126] "particular" "pension" "percent" "perfect" "perhaps"
## [131] "photograph" "picture" "politic" "position" "positive"
## [136] "possible" "practise" "prepare" "present" "pressure"
## [141] "presume" "previous" "private" "probable" "problem"
## [146] "proceed" "process" "produce" "product" "programme"
## [151] "project" "propose" "protect" "provide" "purpose"
## [156] "quality" "quarter" "question" "realise" "receive"
## [161] "recognize" "recommend" "relation" "remember" "represent"
## [166] "require" "research" "resource" "respect" "responsible"
## [171] "saturday" "science" "scotland" "secretary" "section"
## [176] "separate" "serious" "service" "similar" "situate"
## [181] "society" "special" "specific" "standard" "station"
## [186] "straight" "strategy" "structure" "student" "subject"
## [191] "succeed" "suggest" "support" "suppose" "surprise"
## [196] "telephone" "television" "terrible" "therefore" "thirteen"
## [201] "thousand" "through" "thursday" "together" "tomorrow"
## [206] "tonight" "traffic" "transport" "trouble" "tuesday"
## [211] "understand" "university" "various" "village" "wednesday"
## [216] "welcome" "whether" "without" "yesterday"
# start with a vowel
str_subset(words, "^[aeiou]")
## [1] "a" "able" "about" "absolute" "accept"
## [6] "account" "achieve" "across" "act" "active"
## [11] "actual" "add" "address" "admit" "advertise"
## [16] "affect" "afford" "after" "afternoon" "again"
## [21] "against" "age" "agent" "ago" "agree"
## [26] "air" "all" "allow" "almost" "along"
## [31] "already" "alright" "also" "although" "always"
## [36] "america" "amount" "and" "another" "answer"
## [41] "any" "apart" "apparent" "appear" "apply"
## [46] "appoint" "approach" "appropriate" "area" "argue"
## [51] "arm" "around" "arrange" "art" "as"
## [56] "ask" "associate" "assume" "at" "attend"
## [61] "authority" "available" "aware" "away" "awful"
## [66] "each" "early" "east" "easy" "eat"
## [71] "economy" "educate" "effect" "egg" "eight"
## [76] "either" "elect" "electric" "eleven" "else"
## [81] "employ" "encourage" "end" "engine" "english"
## [86] "enjoy" "enough" "enter" "environment" "equal"
## [91] "especial" "europe" "even" "evening" "ever"
## [96] "every" "evidence" "exact" "example" "except"
## [101] "excuse" "exercise" "exist" "expect" "expense"
## [106] "experience" "explain" "express" "extra" "eye"
## [111] "idea" "identify" "if" "imagine" "important"
## [116] "improve" "in" "include" "income" "increase"
## [121] "indeed" "individual" "industry" "inform" "inside"
## [126] "instead" "insure" "interest" "into" "introduce"
## [131] "invest" "involve" "issue" "it" "item"
## [136] "obvious" "occasion" "odd" "of" "off"
## [141] "offer" "office" "often" "okay" "old"
## [146] "on" "once" "one" "only" "open"
## [151] "operate" "opportunity" "oppose" "or" "order"
## [156] "organize" "original" "other" "otherwise" "ought"
## [161] "out" "over" "own" "under" "understand"
## [166] "union" "unit" "unite" "university" "unless"
## [171] "until" "up" "upon" "use" "usual"
# only consonants
str_view_match(words, "^[^aeiou]+$")
# end with ed, but not with eed
str_view_match(words, "^ed$|[^e]ed$")
# end with ing or ise
str_view_match(words, "ing$|ise$")
str_view_match(words, "([^c]ie|cei)") #rule
str_view_match(words, "(cie)") #exceptions?
From the second output we can see there are some exceptions for this rule, such as the words science
and society
.
str_view_match(words, "q[^u]")
There were no words in the output, so “q” is always followed by a “u”.
str_view_match(words, "ise$|our") #ise instead of ize and our instead of or
phones <- c("(55)32498722", "(778)952-5873")
str_view(phones, "\\(\\d{2}\\)\\d{8}", match = TRUE)
The pattern matches the first number, which follows the format commonly used in Mexico: a two-digit area code in parentheses followed by eight digits.
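The second number in phones has a three-digit area code and a dash, so a slightly more permissive pattern (a sketch covering only these two formats) would match both:
str_view(phones, "\\(\\d{2,3}\\)\\d{3,8}(-\\d{4})?", match = TRUE)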
# start with 3 consonants
str_view_match(words, "^[^aeiou]{3}")
# ≥ 3 vowels in a row
str_view_match(words, "[aeiou]{3,}")
# ≥ 2 vowel-consonant pairs in a row
str_subset(words, "([aeiou][^aeiou]){2,}")
## [1] "absolute" "agent" "along" "america" "another"
## [6] "apart" "apparent" "authority" "available" "aware"
## [11] "away" "balance" "basis" "become" "before"
## [16] "begin" "behind" "benefit" "business" "character"
## [21] "closes" "community" "consider" "cover" "debate"
## [26] "decide" "decision" "definite" "department" "depend"
## [31] "design" "develop" "difference" "difficult" "direct"
## [36] "divide" "document" "during" "economy" "educate"
## [41] "elect" "electric" "eleven" "encourage" "environment"
## [46] "europe" "even" "evening" "ever" "every"
## [51] "evidence" "exact" "example" "exercise" "exist"
## [56] "family" "figure" "final" "finance" "finish"
## [61] "friday" "future" "general" "govern" "holiday"
## [66] "honest" "hospital" "however" "identify" "imagine"
## [71] "individual" "interest" "introduce" "item" "jesus"
## [76] "level" "likely" "limit" "local" "major"
## [81] "manage" "meaning" "measure" "minister" "minus"
## [86] "minute" "moment" "money" "music" "nature"
## [91] "necessary" "never" "notice" "okay" "open"
## [96] "operate" "opportunity" "organize" "original" "over"
## [101] "paper" "paragraph" "parent" "particular" "photograph"
## [106] "police" "policy" "politic" "position" "positive"
## [111] "power" "prepare" "present" "presume" "private"
## [116] "probable" "process" "produce" "product" "project"
## [121] "proper" "propose" "protect" "provide" "quality"
## [126] "realise" "reason" "recent" "recognize" "recommend"
## [131] "record" "reduce" "refer" "regard" "relation"
## [136] "remember" "report" "represent" "result" "return"
## [141] "saturday" "second" "secretary" "secure" "separate"
## [146] "seven" "similar" "specific" "strategy" "student"
## [151] "stupid" "telephone" "television" "therefore" "thousand"
## [156] "today" "together" "tomorrow" "tonight" "total"
## [161] "toward" "travel" "unit" "unite" "university"
## [166] "upon" "visit" "water" "woman"
# start and end with the same character
str_view_match(words, "^(.).*\\1$")
# contain a repeated pair of letters (e.g. "church" contains "ch" twice)
str_view_match(words, "(..).*\\1")
# contain one letter repeated in at least three places (e.g. "eleven" has three "e"s)
str_view_match(words, "(.).*\\1.*\\1")
For the next challenges, I solved each one using both a single regular expression and a combination of multiple str_detect() calls.
# start or end with x
str_view_match(words, "^x|x$") #single regex
# multiple str_detect() calls
start_x <- str_detect(words, "^x")
end_x <- str_detect(words, "x$")
words[start_x | end_x]
## [1] "box" "sex" "six" "tax"
# start with a vowel and end with a consonant
str_subset(words, "^[aeiou].*[^aeiou]$") #single regex
## [1] "about" "accept" "account" "across" "act"
## [6] "actual" "add" "address" "admit" "affect"
## [11] "afford" "after" "afternoon" "again" "against"
## [16] "agent" "air" "all" "allow" "almost"
## [21] "along" "already" "alright" "although" "always"
## [26] "amount" "and" "another" "answer" "any"
## [31] "apart" "apparent" "appear" "apply" "appoint"
## [36] "approach" "arm" "around" "art" "as"
## [41] "ask" "at" "attend" "authority" "away"
## [46] "awful" "each" "early" "east" "easy"
## [51] "eat" "economy" "effect" "egg" "eight"
## [56] "either" "elect" "electric" "eleven" "employ"
## [61] "end" "english" "enjoy" "enough" "enter"
## [66] "environment" "equal" "especial" "even" "evening"
## [71] "ever" "every" "exact" "except" "exist"
## [76] "expect" "explain" "express" "identify" "if"
## [81] "important" "in" "indeed" "individual" "industry"
## [86] "inform" "instead" "interest" "invest" "it"
## [91] "item" "obvious" "occasion" "odd" "of"
## [96] "off" "offer" "often" "okay" "old"
## [101] "on" "only" "open" "opportunity" "or"
## [106] "order" "original" "other" "ought" "out"
## [111] "over" "own" "under" "understand" "union"
## [116] "unit" "university" "unless" "until" "up"
## [121] "upon" "usual"
#multiple str_detect()
start_vowel <- str_detect(words, "^[aeiou]")
end_cons <- str_detect(words, "[^aeiou]$")
words[start_vowel & end_cons]
## [1] "about" "accept" "account" "across" "act"
## [6] "actual" "add" "address" "admit" "affect"
## [11] "afford" "after" "afternoon" "again" "against"
## [16] "agent" "air" "all" "allow" "almost"
## [21] "along" "already" "alright" "although" "always"
## [26] "amount" "and" "another" "answer" "any"
## [31] "apart" "apparent" "appear" "apply" "appoint"
## [36] "approach" "arm" "around" "art" "as"
## [41] "ask" "at" "attend" "authority" "away"
## [46] "awful" "each" "early" "east" "easy"
## [51] "eat" "economy" "effect" "egg" "eight"
## [56] "either" "elect" "electric" "eleven" "employ"
## [61] "end" "english" "enjoy" "enough" "enter"
## [66] "environment" "equal" "especial" "even" "evening"
## [71] "ever" "every" "exact" "except" "exist"
## [76] "expect" "explain" "express" "identify" "if"
## [81] "important" "in" "indeed" "individual" "industry"
## [86] "inform" "instead" "interest" "invest" "it"
## [91] "item" "obvious" "occasion" "odd" "of"
## [96] "off" "offer" "often" "okay" "old"
## [101] "on" "only" "open" "opportunity" "or"
## [106] "order" "original" "other" "ought" "out"
## [111] "over" "own" "under" "understand" "union"
## [116] "unit" "university" "unless" "until" "up"
## [121] "upon" "usual"
# contain one of each vowel
allv <- c("aeioux", "aei") # test strings to check the pattern works
str_subset(allv, "a.*e.*i.*o.*u")
## [1] "aeioux"
str_subset(words, "a.*e.*i.*o.*u")
## character(0)
# multiple str_detect()
words[str_detect(words, "a") & str_detect(words, "e") &
str_detect(words, "i") & str_detect(words, "o") &
str_detect(words, "u")]
## character(0)
There are no words in stringr::words that contain all five vowels. (The single regex only catches vowels in alphabetical order, but the str_detect() version checks every ordering and confirms the result.)
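The five str_detect() calls can also be collapsed with purrr (an equivalent sketch):
vowels <- c("a", "e", "i", "o", "u")
words[map(vowels, ~ str_detect(words, .x)) %>% reduce(`&`)]
## character(0)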
#highest number of vowels
num_v <- str_count(words, "[aeiou]")
max_v <- max(num_v)
words[num_v == max_v]
## [1] "appropriate" "associate" "available" "colleague" "encourage"
## [6] "experience" "individual" "television"
#highest proportion of vowels
prop_v <- str_count(words, "[aeiou]") / str_length(words)
max_p <- max(prop_v)
words[prop_v == max_p]
## [1] "a"
Eight words have five vowels, the maximum number of vowels among these words. The word "a" has the highest proportion of vowels, since its length is 1 and that one character is a vowel.
# extract the first word from each sentence
str_extract(sentences, "[^ ]+") %>% head()
## [1] "The" "Glue" "It's" "These" "Rice" "The"
sentences %>% head() #to check it worked
## [1] "The birch canoe slid on the smooth planks."
## [2] "Glue the sheet to the dark blue background."
## [3] "It's easy to tell the depth of a well."
## [4] "These days a chicken leg is a rare dish."
## [5] "Rice is often served in round bowls."
## [6] "The juice of lemons makes fine punch."
#words ending in ing
pat_ing <- "[A-Za-z]+ing" #define pattern
ing <- str_detect(sentences, pat_ing)
str_extract_all(sentences[ing], pat_ing) %>%
unlist() %>%
unique() #don't show repeated words
## [1] "stocking" "spring" "evening" "morning" "winding"
## [6] "living" "king" "Adding" "making" "raging"
## [11] "playing" "sleeping" "ring" "glaring" "sinking"
## [16] "thing" "dying" "Bring" "lodging" "filing"
## [21] "wearing" "wading" "swing" "nothing" "Whiting"
## [26] "sing" "bring" "painting" "walking" "ling"
## [31] "shipping" "hing" "puzzling" "landing" "waiting"
## [36] "whistling" "timing" "ting" "changing" "drenching"
## [41] "moving" "working"
#plurals (naive: matches any run of three or more letters ending in s)
str_extract_all(sentences, "[A-Za-z]{3,}s") %>%
unlist() %>%
unique() %>%
head()
## [1] "planks" "Thes" "days" "bowls" "lemons" "makes"
# find all words that come after a "number" word (one to ten) and pull out both the number and the word
pat_num <- "(one|two|three|four|five|six|seven|eight|nine|ten) ([^ ]+)"
sen_num <- sentences %>% str_subset(pat_num)
sen_num %>% str_match(pat_num)
## [,1] [,2] [,3]
## [1,] "ten served" "ten" "served"
## [2,] "one over" "one" "over"
## [3,] "seven books" "seven" "books"
## [4,] "two met" "two" "met"
## [5,] "two factors" "two" "factors"
## [6,] "one and" "one" "and"
## [7,] "three lists" "three" "lists"
## [8,] "seven is" "seven" "is"
## [9,] "two when" "two" "when"
## [10,] "one floor." "one" "floor."
## [11,] "ten inches." "ten" "inches."
## [12,] "one with" "one" "with"
## [13,] "one war" "one" "war"
## [14,] "one button" "one" "button"
## [15,] "six minutes." "six" "minutes."
## [16,] "ten years" "ten" "years"
## [17,] "one in" "one" "in"
## [18,] "ten chased" "ten" "chased"
## [19,] "one like" "one" "like"
## [20,] "two shares" "two" "shares"
## [21,] "two distinct" "two" "distinct"
## [22,] "one costs" "one" "costs"
## [23,] "ten two" "ten" "two"
## [24,] "five robins." "five" "robins."
## [25,] "four kinds" "four" "kinds"
## [26,] "one rang" "one" "rang"
## [27,] "ten him." "ten" "him."
## [28,] "three story" "three" "story"
## [29,] "ten by" "ten" "by"
## [30,] "one wall." "one" "wall."
## [31,] "three inches" "three" "inches"
## [32,] "ten your" "ten" "your"
## [33,] "six comes" "six" "comes"
## [34,] "one before" "one" "before"
## [35,] "three batches" "three" "batches"
## [36,] "two leaves." "two" "leaves."
# find all contractions and separate the pieces before and after the apostrophe
cont <- "([A-Za-z]+)'([A-Za-z]+)"
sen_cont <- sentences %>% str_subset(cont)
sen_cont %>% str_match(cont)
## [,1] [,2] [,3]
## [1,] "It's" "It" "s"
## [2,] "man's" "man" "s"
## [3,] "don't" "don" "t"
## [4,] "store's" "store" "s"
## [5,] "workmen's" "workmen" "s"
## [6,] "Let's" "Let" "s"
## [7,] "sun's" "sun" "s"
## [8,] "child's" "child" "s"
## [9,] "king's" "king" "s"
## [10,] "It's" "It" "s"
## [11,] "don't" "don" "t"
## [12,] "queen's" "queen" "s"
## [13,] "don't" "don" "t"
## [14,] "pirate's" "pirate" "s"
## [15,] "neighbor's" "neighbor" "s"
# replace all forward slashes in a string with backslashes
for_slash <- "one/two/three"
str_replace_all(for_slash, "/", "\\\\") %>% writeLines()
## one\two\three
Implement a simple version of str_to_lower() using replace_all().
caps <- "ABCDE"
str_replace_all(caps, "([A-Z])", tolower)
## [1] "abcde"
Split up a string like "apples, pears, and bananas" into individual components.
fruity <- "apples, pears, and bananas"
str_split(fruity, ", and |,")
## [[1]]
## [1] "apples" " pears" "bananas"
boundary("word")
than " "?fruity2 <- ("fruit: apples, pears, (bananas), and oranges")
str_split(fruity2, " ")
## [[1]]
## [1] "fruit:" "apples," "pears," "(bananas)," "and"
## [6] "oranges"
str_split(fruity2, boundary("word"))
## [[1]]
## [1] "fruit" "apples" "pears" "bananas" "and" "oranges"
Splitting with boundary("word") is better because I don't have to specify every punctuation character to leave out, such as ":", "," or "()".
How would you find all strings containing \ with regex() vs. with fixed()?
strings <- c("ab", "0\\1", "x\\y")
#regex()
str_subset(strings, regex("\\\\"))
## [1] "0\\1" "x\\y"
#fixed()
str_subset(strings, fixed("\\"))
## [1] "0\\1" "x\\y"
Finally, what are the five most common words in sentences?
(words_sen <- str_split(sentences, boundary("word")) %>%
unlist() %>%
str_to_lower() %>% # avoid counting capitalised and lower-case forms separately
as_tibble() %>%
set_names("word") %>%
group_by(word) %>%
count(sort = TRUE) %>% # order by frequency
head(5)) # only top 5
## # A tibble: 5 x 2
## # Groups: word [5]
## word n
## <chr> <int>
## 1 the 751
## 2 a 202
## 3 of 132
## 4 to 123
## 5 and 118
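The group_by() step is not strictly needed; count() on a one-column tibble gives the same top five a little more directly (an equivalent sketch):
tibble(word = sentences %>% str_split(boundary("word")) %>% unlist() %>% str_to_lower()) %>%
count(word, sort = TRUE) %>%
head(5)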
For the second task, I use purrr to map latitude and longitude into human-readable information on the bands' origin places. Notice that revgeocode(..., output = "more") outputs a data frame, while revgeocode(..., output = "address") returns a string, so there is the option of dealing with nested data frames. Two things need attention: not every track has latitude and longitude information, and revgeocode() does not always return a result for a given pair of coordinates. What can we do to stop those errors from biting us? (Look at possibly() in purrr…)
First, I need to load the necessary packages and register my Google API key to use ggmap.
library(tidyverse)
library(devtools)
install_github("dkahle/ggmap") # GitHub version, needed for register_google()
library(ggplot2)
library(ggmap)
register_google("YOUR_API_KEY") # actual API key redacted
The data set singer_locations
contains information about songs and associated artists in the Million Song Dataset.
Let’s look at this data frame:
library(singer)
str(singer_locations)
## Classes 'tbl_df', 'tbl' and 'data.frame': 10100 obs. of 14 variables:
## $ track_id : chr "TRWICRA128F42368DB" "TRXJANY128F42246FC" "TRIKPCA128F424A553" "TRYEATD128F92F87C9" ...
## $ title : chr "The Conversation (Cd)" "Lonely Island" "Here's That Rainy Day" "Rego Park Blues" ...
## $ song_id : chr "SOSURTI12A81C22FB8" "SODESQP12A6D4F98EF" "SOQUYQD12A8C131619" "SOEZGRC12AB017F1AC" ...
## $ release : chr "Even If It Kills Me" "The Duke Of Earl" "Imprompture" "Still River" ...
## $ artist_id : chr "ARACDPV1187FB58DF4" "ARYBUAO1187FB3F4EB" "AR4111G1187B9B58AB" "ARQDZP31187B98D623" ...
## $ artist_name : chr "Motion City Soundtrack" "Gene Chandler" "Paul Horn" "Ronnie Earl & the Broadcasters" ...
## $ year : int 2007 2004 1998 1995 1968 2006 2003 2007 1966 2006 ...
## $ duration : num 170 107 528 695 237 ...
## $ artist_hotttnesss : num 0.641 0.394 0.431 0.362 0.411 ...
## $ artist_familiarity: num 0.823 0.57 0.504 0.477 0.53 ...
## $ latitude : num NA 41.9 40.7 NA 42.3 ...
## $ longitude : num NA -87.6 -74 NA -83 ...
## $ name : chr NA "Gene Chandler" "Paul Horn" NA ...
## $ city : chr NA "Chicago, IL" "New York, NY" NA ...
## - attr(*, "spec")=List of 2
## ..$ cols :List of 14
## .. ..$ track_id : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ title : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ song_id : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ release : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ artist_id : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ artist_name : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ year : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ duration : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ artist_hotttnesss : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ artist_familiarity: list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ latitude : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ longitude : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ name : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ city : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## ..$ default: list()
## .. ..- attr(*, "class")= chr "collector_guess" "collector"
## ..- attr(*, "class")= chr "col_spec"
library(kableExtra)
singer_locations %>% head() %>%
kable() %>%
kable_styling(full_width = F, position = "center")
track_id | title | song_id | release | artist_id | artist_name | year | duration | artist_hotttnesss | artist_familiarity | latitude | longitude | name | city |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
TRWICRA128F42368DB | The Conversation (Cd) | SOSURTI12A81C22FB8 | Even If It Kills Me | ARACDPV1187FB58DF4 | Motion City Soundtrack | 2007 | 170.4485 | 0.6410183 | 0.8230522 | NA | NA | NA | NA |
TRXJANY128F42246FC | Lonely Island | SODESQP12A6D4F98EF | The Duke Of Earl | ARYBUAO1187FB3F4EB | Gene Chandler | 2004 | 106.5530 | 0.3937627 | 0.5700167 | 41.88415 | -87.63241 | Gene Chandler | Chicago, IL |
TRIKPCA128F424A553 | Here’s That Rainy Day | SOQUYQD12A8C131619 | Imprompture | AR4111G1187B9B58AB | Paul Horn | 1998 | 527.5947 | 0.4306226 | 0.5039940 | 40.71455 | -74.00712 | Paul Horn | New York, NY |
TRYEATD128F92F87C9 | Rego Park Blues | SOEZGRC12AB017F1AC | Still River | ARQDZP31187B98D623 | Ronnie Earl & the Broadcasters | 1995 | 695.1179 | 0.3622792 | 0.4773099 | NA | NA | NA | NA |
TRBYYXH128F4264585 | Games | SOPIOCP12A8C13A322 | Afro-Harping | AR75GYU1187B9AE47A | Dorothy Ashby | 1968 | 237.3220 | 0.4107520 | 0.5303468 | 42.33168 | -83.04792 | Dorothy Ashby | Detroit, MI |
TRKFFKR128F9303AE3 | More Pipes | SOHQSPY12AB0181325 | Six Yanks | ARCENE01187B9AF929 | Barleyjuice | 2006 | 192.9400 | 0.3762635 | 0.5412950 | 40.99471 | -77.60454 | Barleyjuice | Pennsylvania |
The singer_locations data frame contains geographical information associated with the artist location, stored in two different formats: 1. as a (dirty!) variable named city; 2. as a latitude / longitude pair (stored in latitude and longitude, respectively).
From the first few rows we can see that some tracks lack this geographical information, so I will filter the data to keep only the tracks that have it.
singer_geo <- singer_locations %>%
filter(!is.na(city)) %>%
select(title, artist_name, year, latitude, longitude, city) #to make table smaller
singer_geo %>%
head() %>%
kable() %>%
kable_styling(full_width = F, position = "center")
title | artist_name | year | latitude | longitude | city |
---|---|---|---|---|---|
Lonely Island | Gene Chandler | 2004 | 41.88415 | -87.63241 | Chicago, IL |
Here’s That Rainy Day | Paul Horn | 1998 | 40.71455 | -74.00712 | New York, NY |
Games | Dorothy Ashby | 1968 | 42.33168 | -83.04792 | Detroit, MI |
More Pipes | Barleyjuice | 2006 | 40.99471 | -77.60454 | Pennsylvania |
Indian Deli | Madlib | 2007 | 34.20034 | -119.18044 | Oxnard, CA |
Miss Gorgeous | Seeed’s Pharaoh Riddim Feat. General Degree | 2003 | 50.73230 | 7.10169 | Bonn |
nrow(singer_locations)
## [1] 10100
nrow(singer_geo)
## [1] 4129
After filtering, the new data frame singer_geo has 4,129 observations, compared with the 10,100 in the original. Since that is still a lot of observations, I will work with only the first 25 songs; I also kept just a few of the variables above to make the tables easier to read.
singer_geo <- singer_geo[1:25,]
singer_geo %>%
kable() %>%
kable_styling(full_width = F, position = "center")
title | artist_name | year | latitude | longitude | city |
---|---|---|---|---|---|
Lonely Island | Gene Chandler | 2004 | 41.88415 | -87.63241 | Chicago, IL |
Here’s That Rainy Day | Paul Horn | 1998 | 40.71455 | -74.00712 | New York, NY |
Games | Dorothy Ashby | 1968 | 42.33168 | -83.04792 | Detroit, MI |
More Pipes | Barleyjuice | 2006 | 40.99471 | -77.60454 | Pennsylvania |
Indian Deli | Madlib | 2007 | 34.20034 | -119.18044 | Oxnard, CA |
Miss Gorgeous | Seeed’s Pharaoh Riddim Feat. General Degree | 2003 | 50.73230 | 7.10169 | Bonn |
Lahainaluna | Keali’i Reichel | 2003 | 19.59009 | -155.43414 | Hawaii |
The Ingenue (LP Version) | Little Feat | 1989 | 34.05349 | -118.24532 | Los Angeles, CA |
The Unquiet Grave (Child No. 78) | Joan Baez | 1964 | 40.57250 | -74.15400 | Staten Island, NY |
The Breaks | 31Knots | 2008 | 45.51179 | -122.67563 | Portland, OR |
The Operator | Bleep | 1989 | 51.50632 | -0.12714 | UK - England - London |
Con Il Nastro Rosa | Lucio Battisti | 1980 | 42.50172 | 12.88512 | Poggio Bustone, Rieti, Italy |
SOS | Ray Brown Trio / Ralph Moore | 1991 | 40.43831 | -79.99745 | Pittsburgh, PA |
At The End | iio | 2002 | 40.71455 | -74.00712 | New York, NY |
The Hunting Song | Tom Lehrer | 1953 | 37.77916 | -122.42005 | New York, NY |
Mob Job (LP Version) | John Zorn | 1989 | 40.71455 | -74.00712 | New York, NY |
Nothing’s the Same | The Meeting Places | 2006 | 34.05349 | -118.24532 | Los Angeles, CA |
Bohemian Ballet | Deep Forest | 1995 | 37.27188 | -119.27023 | California |
Do You Mean To Imply | Billy Cobham | 1999 | 8.41770 | -80.11278 | Panama |
Pollen And Salt | Daphne Loves Derby | 2005 | 47.38028 | -122.23742 | KENT, WASHINGTON |
Surrounded | SOiL | 2009 | 41.88415 | -87.63241 | Chicago |
Headless | Run Level Zero | 2003 | 62.19845 | 17.55142 | SWEDEN |
Na Laethe Bhí | Clannad | 1993 | 53.41961 | -8.24055 | Ireland |
Haiku (Album Version) | Tally Hall | 2005 | 42.32807 | -83.73360 | Ann Arbor, MI |
Bedlam Boys | Old Blind Dogs | 2007 | 57.15382 | -2.10679 | Aberdeen, Scotland |
singer_address <- mapply(FUN = function(longitude, latitude) {
revgeocode(c(longitude, latitude), output = "address")},
singer_geo$longitude, singer_geo$latitude)
## Information from URL : https://maps.googleapis.com/maps/api/geocode/json?latlng=41.88415,-87.63241&key=[redacted]
## (one such line is printed for each of the 25 coordinate pairs; API key redacted)
singer_address
## [1] "134 N LaSalle St suite 1720, Chicago, IL 60602, USA"
## [2] "80 Chambers St, New York, NY 10007, USA"
## [3] "1001 Woodward Ave, Detroit, MI 48226, USA"
## [4] "Z. H. Confair Memorial Hwy, Howard, PA 16841, USA"
## [5] "300 W 3rd St, Oxnard, CA 93030, USA"
## [6] "Regina-Pacis-Weg 1, 53113 Bonn, Germany"
## [7] "Unnamed Road, Hawaii, USA"
## [8] "1420 S Oakhurst Dr, Los Angeles, CA 90035, USA"
## [9] "215 Arthur Kill Rd, Staten Island, NY 10306, USA"
## [10] "1500 SW 1st Ave, Portland, OR 97201, USA"
## [11] "39 Whitehall, Westminster, London SW1A 2BY, UK"
## [12] "Localita' Pescatore, Poggio Bustone, RI 02018, Italy"
## [13] "410 Grant St, Pittsburgh, PA 15219, USA"
## [14] "80 Chambers St, New York, NY 10007, USA"
## [15] "1 Dr Carlton B Goodlett Pl, San Francisco, CA 94102, USA"
## [16] "80 Chambers St, New York, NY 10007, USA"
## [17] "1420 S Oakhurst Dr, Los Angeles, CA 90035, USA"
## [18] "Shaver Lake, CA 93634, USA"
## [19] "Calle Aviacion, Río Hato, Panama"
## [20] "220 4th Ave S, Kent, WA 98032, USA"
## [21] "134 N LaSalle St suite 1720, Chicago, IL 60602, USA"
## [22] "Unnamed Road, 862 96 Njurunda, Sweden"
## [23] "ICastle view, Borris in ossory, Laois, Borris in ossory, Co. Laois, Ireland"
## [24] "3788 Pontiac Trail, Ann Arbor, MI 48105, USA"
## [25] "91 Hutcheon St, Aberdeen AB25 1EW, UK"
Now singer_address contains the address corresponding to each pair of coordinates. Let's see whether these addresses match the city variable.
sing_add_city <- data.frame(address = singer_address, city = singer_geo$city)
sing_add_city %>%
kable() %>%
kable_styling(full_width = F)
address | city |
---|---|
134 N LaSalle St suite 1720, Chicago, IL 60602, USA | Chicago, IL |
80 Chambers St, New York, NY 10007, USA | New York, NY |
1001 Woodward Ave, Detroit, MI 48226, USA | Detroit, MI |
Z. H. Confair Memorial Hwy, Howard, PA 16841, USA | Pennsylvania |
300 W 3rd St, Oxnard, CA 93030, USA | Oxnard, CA |
Regina-Pacis-Weg 1, 53113 Bonn, Germany | Bonn |
Unnamed Road, Hawaii, USA | Hawaii |
1420 S Oakhurst Dr, Los Angeles, CA 90035, USA | Los Angeles, CA |
215 Arthur Kill Rd, Staten Island, NY 10306, USA | Staten Island, NY |
1500 SW 1st Ave, Portland, OR 97201, USA | Portland, OR |
39 Whitehall, Westminster, London SW1A 2BY, UK | UK - England - London |
Localita’ Pescatore, Poggio Bustone, RI 02018, Italy | Poggio Bustone, Rieti, Italy |
410 Grant St, Pittsburgh, PA 15219, USA | Pittsburgh, PA |
80 Chambers St, New York, NY 10007, USA | New York, NY |
1 Dr Carlton B Goodlett Pl, San Francisco, CA 94102, USA | New York, NY |
80 Chambers St, New York, NY 10007, USA | New York, NY |
1420 S Oakhurst Dr, Los Angeles, CA 90035, USA | Los Angeles, CA |
Shaver Lake, CA 93634, USA | California |
Calle Aviacion, Río Hato, Panama | Panama |
220 4th Ave S, Kent, WA 98032, USA | KENT, WASHINGTON |
134 N LaSalle St suite 1720, Chicago, IL 60602, USA | Chicago |
Unnamed Road, 862 96 Njurunda, Sweden | SWEDEN |
ICastle view, Borris in ossory, Laois, Borris in ossory, Co. Laois, Ireland | Ireland |
3788 Pontiac Trail, Ann Arbor, MI 48105, USA | Ann Arbor, MI |
91 Hutcheon St, Aberdeen AB25 1EW, UK | Aberdeen, Scotland |
From the table we can visually compare the cities, but let’s try with some code:
library(stringi)
stri_detect_fixed(sing_add_city$address, sing_add_city$city)
## [1] TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
## [12] FALSE TRUE TRUE FALSE TRUE TRUE FALSE TRUE FALSE TRUE FALSE
## [23] TRUE TRUE FALSE
From the output, we can see that 8 of the 25 observations don't match. But some of these mismatches arise only because some cities/countries are written in capitals in the city column. I will try again with everything in lower case.
low_address <- str_to_lower(sing_add_city$address)
low_city <- str_to_lower(sing_add_city$city)
words_address <- str_split(low_address, boundary("word"))
words_city <- str_split(low_city, boundary("word"))
has_match <- function(match_length) {
match_length > 0 # TRUE if the address and the city share at least one word
}
mapply(intersect, words_address, words_city) %>%
lapply(length) %>%
map(has_match)
## [[1]]
## [1] TRUE
##
## [[2]]
## [1] TRUE
##
## [[3]]
## [1] TRUE
##
## [[4]]
## [1] FALSE
##
## [[5]]
## [1] TRUE
##
## [[6]]
## [1] TRUE
##
## [[7]]
## [1] TRUE
##
## [[8]]
## [1] TRUE
##
## [[9]]
## [1] TRUE
##
## [[10]]
## [1] TRUE
##
## [[11]]
## [1] TRUE
##
## [[12]]
## [1] TRUE
##
## [[13]]
## [1] TRUE
##
## [[14]]
## [1] TRUE
##
## [[15]]
## [1] FALSE
##
## [[16]]
## [1] TRUE
##
## [[17]]
## [1] TRUE
##
## [[18]]
## [1] FALSE
##
## [[19]]
## [1] TRUE
##
## [[20]]
## [1] TRUE
##
## [[21]]
## [1] TRUE
##
## [[22]]
## [1] TRUE
##
## [[23]]
## [1] TRUE
##
## [[24]]
## [1] TRUE
##
## [[25]]
## [1] TRUE
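The mapply()/lapply()/map() chain above can be collapsed into a single logical vector with purrr::map2_lgl(), which is easier to scan than a 25-element list (an equivalent sketch):
map2_lgl(words_address, words_city, ~ length(intersect(.x, .y)) > 0)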
After applying the previous code, the mismatches were reduced from 8 to 3. Observation number 15 is a true mismatch, since the address is in San Francisco while the city is New York. The other two cases arise because the city column contains only the state name, which is abbreviated in the address column, so the word sets never intersect.
These remaining mismatches could potentially be resolved with other methods (e.g. expanding state abbreviations in both columns), but that may compromise the accuracy of the output.
library(leaflet)
singer_geo %>%
leaflet() %>%
addTiles() %>%
addCircles(lng = singer_geo$longitude,
lat = singer_geo$latitude,
popup = singer_geo$artist_name,
color = "deeppink")
Clicking a pink circle on the map pops up the name of the artist associated with that location.
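With leaflet's formula interface the popup can combine several fields, for example artist and city (a small variation on the map above):
singer_geo %>%
leaflet() %>%
addTiles() %>%
addCircles(lng = ~longitude, lat = ~latitude,
popup = ~paste0(artist_name, " - ", city),
color = "deeppink")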