Introduction

The objective of this document is to provide a sample of my coding as I perform a variety of data handling tasks.

Extracting Place Data from Lyrics

The first step in the analysis is to extract location names from the unstructured text of the song lyrics.

To do this systematically and efficiently, I will rely on three common-sense assumptions about the lyrics:

  1. Although other aspects of the song are repetitive, lines with place names are not repeated.

  2. Unlike most other words in the lyrics, place names are capitalized.

  3. Place names will not include generic stop words.
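The first assumption is stricter than ordinary de-duplication: a line that repeats anywhere in the song is dropped entirely, rather than kept once. A minimal sketch of that difference, using hypothetical lines (`toy_lines` is made up for illustration and is not read from the lyrics file):

```r
# Hypothetical lines, for illustration only
toy_lines <- c("I've been everywhere, man",
               "Reno, Chicago, Fargo",
               "I've been everywhere, man",
               "")

# Drop blank lines
non_blank <- toy_lines[toy_lines != ""]

# Drop every copy of any line that occurs more than once
kept <- non_blank[!non_blank %in% non_blank[duplicated(non_blank)]]
kept

# For comparison, unique() would keep one copy of the repeated line
unique(non_blank)
```

Here `kept` holds only the line of place names, while `unique()` would still retain one copy of the chorus line.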

#I read in the lyrics from a text file as a character vector with a string for each line
  setwd("~/Documents/Job Search/Code Sample")
  lyrics <- readLines("./I've Been Everywhere")

#Let's extract ngrams from the song
  #Eliminating blank or duplicated lines (first assumption)
    lines <- lyrics[! lyrics == ""] #We can eliminate blank lines
    lines <- lines[! lines %in% lines[duplicated(lines)]] #eliminate duplicate lines 
    
  #Pulling out capitalized phrases (second assumption)
    library(stringr)
    phrases <- str_extract_all(lines, pattern = "([A-Z][a-z'\\-]+ ?)+") 
    phrases <- unlist(phrases)
    phrases <- trimws(phrases)

    
  #Eliminating place names that include stop words (third assumption)
    library(stopwords)
    regex_stopwords <- paste(paste0(" ",paste(stopwords(), collapse = " | ")," "),
                             paste0("^",paste(stopwords(), collapse = " |^")," "),
                             paste0(" ",paste(stopwords(), collapse = "$| "),"$"),
                             paste0("^",paste(stopwords(), collapse = "$|^"),"$"),
                             sep = "|")
                                #^ these pastes make a long regex that will capture stopwords
                             
    places <- phrases[!grepl(regex_stopwords, tolower(phrases))] #removes phrases with stopwords (grepl is safe even when nothing matches)
    
  #Let's look at our list
    places <- unique(places) #drop repeated place names; -duplicated() would negate a logical vector and index incorrectly
    places
##  [1] "Winnemucca"    "Mack"          "Listen"        "Reno"         
##  [5] "Chicago"       "Fargo"         "Minnesota"     "Buffalo"      
##  [9] "Toronto"       "Winslow"       "Sarasota"      "Wichita"      
## [13] "Tulsa"         "Ottawa"        "Oklahoma"      "Tampa"        
## [17] "Panama"        "Mattawa"       "La Paloma"     "Bangor"       
## [21] "Baltimore"     "Salvador"      "Amarillo"      "Tocopilla"    
## [25] "Barranquilla"  "Padilla"       "Boston"        "Charleston"   
## [29] "Dayton"        "Louisiana"     "Washington"    "Houston"      
## [33] "Kingston"      "Texarkana"     "Monterey"      "Faraday"      
## [37] "Santa Fe"      "Tallapoosa"    "Glen Rock"     "Black Rock"   
## [41] "Little Rock"   "Oskaloosa"     "Tennessee"     "Hennessey"    
## [45] "Chicopee"      "Spirit Lake"   "Grand Lake"    "Devil's Lake" 
## [49] "Crater Lake"   "Pete's"        "Louisville"    "Nashville"    
## [53] "Knoxville"     "Ombabika"      "Schefferville" "Jacksonville" 
## [57] "Waterville"    "Costa Rica"    "Pittsfield"    "Springfield"  
## [61] "Bakersfield"   "Shreveport"    "Hackensack"    "Cadillac"     
## [65] "Fond"          "Lac"           "Davenport"     "Idaho"        
## [69] "Jellico"       "Argentina"     "Diamantina"    "Pasadena"     
## [73] "Catalina"      "Pittsburgh"    "Parkersburg"   "Gravelbourg"  
## [77] "Colorado"      "Ellensburg"    "Rexburg"       "Vicksburg"    
## [81] "El Dorado"     "Larimore"      "Admore"        "Haverstraw"   
## [85] "Chatanika"     "Chaska"        "Nebraska"      "Alaska"       
## [89] "Opelika"       "Baraboo"       "Waterloo"      "Kalamazoo"    
## [93] "Kansas City"   "Sioux City"    "Cedar City"    "Dodge City"

It looks like a systematic handling of the song lyrics, aided by the stated assumptions, did a reasonable job of pulling place names out of the lyrics. However, there were some errors: two false positives and one false negative. Such is the messy reality of text data.

“Listen” and “Pete’s” were retained: they fit my assumptions even though they are not place names. “Fond du Lac” was not included: it failed to meet my assumptions even though it is a place name. I manually correct these errors below.

#Removing false positives
  places <- places[! places %in% c("Listen", "Pete's")]

#Adding the place name Fond du Lac
  places <- places[! places %in% c("Fond","Lac")]
  places <- c(places, "Fond du Lac")
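The “Fond du Lac” miss and the “Pete’s” hit can be reproduced on a single hypothetical line. The sketch below applies the same pattern with base R's `regmatches()` and `perl = TRUE` rather than stringr, so that it runs without extra packages; `toy_line` is made up for illustration.

```r
# Hypothetical input, for illustration only (not read from the lyrics file)
toy_line <- "Fond du Lac, Davenport, Pete's"

# Same pattern as in the extraction step: runs of capitalized words
# separated by single spaces
pattern <- "([A-Z][a-z'\\-]+ ?)+"

# Extract every full match, then trim trailing spaces
toy_phrases <- regmatches(toy_line, gregexpr(pattern, toy_line, perl = TRUE))[[1]]
toy_phrases <- trimws(toy_phrases)
toy_phrases
```

The lowercase “du” breaks the capitalized run, so the name comes out as two separate phrases, “Fond” and “Lac”. “Pete's” passes because the apostrophe is allowed inside a word.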

Geocode Function Writing

Place names are not very helpful on their own. A geocoding API lets us pull location data using the place names, much as if we were searching Google Maps for each location.

I will use the Google Maps API to get information about the locations we have extracted from the song lyrics, and I will write my own geocoding function to do so. A preexisting package, ggmap, has functions to interact with the Google Maps API; however, the function it includes is inflexible and its method of authentication is outdated.

library(jsonlite)
vector_fromJSON <- Vectorize(fromJSON, SIMPLIFY = FALSE)
  
#Now let's write a geocoding function that returns a dataframe
  geocode <- function(locations, 
                      regioncode = 'us', 
                      APIkey){
                             
      json_locations <- gsub(' ', '+', locations)
                                    
      response <- vector_fromJSON(paste0('https://maps.googleapis.com/maps/api/geocode/json?',
                                         'address=', 
                                         json_locations,
                                         '&region=', regioncode, #the API's region-biasing parameter is named "region"
                                         '&key=', APIkey))
                              
      response <- unlist(response, recursive = FALSE)
      status <- response[grep('\\.status$',names(response))] #escape the dot so it matches literally
      status <- unlist(status)
      results <- response[grep('\\.results$',names(response))]
                              
      lat       <- as.numeric(unlist(lapply(results, function(x){x$geometry$location$lat[1]})))
      lng       <- as.numeric(unlist(lapply(results, function(x){x$geometry$location$lng[1]})))
                              
      address <- lapply(results, 
                        function(x){address = data.frame(x$address_components[[1]]$long_name, 
                                                         unlist(lapply(x$address_components[[1]]$types, 
                                                                       paste, 
                                                                       collapse = ", ")),
                                                         stringsAsFactors = FALSE)})
      address <- lapply(address,
                        function(x){names(x)<-c("comps", "types");x})
                              
      country   <- unlist(lapply(address, 
                                 function(x){ifelse(length(x$comps[grep('country', 
                                                                        x$types)]) == 1, 
                                                    x$comps[grep('country', x$types)],
                                                    NA)}))
      
      state   <- unlist(lapply(address, 
                               function(x){ifelse(length(x$comps[grep('administrative_area_level_1', 
                                                                      x$types
                                                                      )]) == 1, 
                                           x$comps[grep('administrative_area_level_1', 
                                                        x$types)],
                                            NA)}))
      
      county   <- unlist(lapply(address, 
                                function(x){ifelse(length(x$comps[grep('administrative_area_level_2', 
                                                                       x$types)]) == 1, 
                                            x$comps[grep('administrative_area_level_2', x$types)],
                                            NA)}))
      
      locality <- unlist(lapply(address, 
                                function(x){ifelse(length(x$comps[grep('locality', 
                                                                       x$types)]) == 1, 
                                            x$comps[grep('locality', x$types)],
                                            NA)}))

      df <- data.frame(locations,
                       lat, 
                       lng, 
                       locality, 
                       county, 
                       state, 
                       country, 
                       status, 
                       stringsAsFactors = F)
      
      names(df) <- c("locations", 
                     "lat",
                     "lng",
                     "locality",
                     "county", 
                     "state",
                     "country", 
                     "geocode_status")
      
      rownames(df) <- 1:nrow(df)
                              
      return(df)
                                      }
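The function above prepares each address by replacing spaces with `+`. A slightly more defensive variant, sketched here as an assumption rather than as the method used above, percent-encodes the whole address so that characters such as apostrophes are also escaped. `build_geocode_url` is a hypothetical helper and `"YOUR_KEY"` is a placeholder; `region` is, to the best of my knowledge, the name of the region-biasing parameter in Google's Geocoding API documentation.

```r
# Hypothetical helper, not part of the function above: builds one request URL.
# "YOUR_KEY" is a placeholder, not a real API key.
build_geocode_url <- function(location, region = "us", key = "YOUR_KEY") {
  paste0("https://maps.googleapis.com/maps/api/geocode/json?",
         "address=", utils::URLencode(location, reserved = TRUE),
         "&region=", region,
         "&key=", key)
}

# Spaces and the apostrophe are percent-encoded in the address
url <- build_geocode_url("Devil's Lake")
url
```

No request is sent here; the sketch only shows how the query string is assembled.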

Now I can run my function on all the places we extracted from the song.

places <- geocode(locations = places, APIkey = Noahs_APIkey)

View(places)
locations | lat | lng | locality | county | state | country | geocode_status
Winnemucca | 40.972958 | -117.73568 | Winnemucca | Humboldt County | Nevada | United States | OK
Mack | 36.077470 | -96.05270 | Tulsa | Tulsa County | Oklahoma | United States | OK
Reno | 39.529633 | -119.81380 | Reno | Washoe County | Nevada | United States | OK
Chicago | 41.878114 | -87.62980 | Chicago | Cook County | Illinois | United States | OK
Fargo | 46.877186 | -96.78980 | Fargo | Cass County | North Dakota | United States | OK
Minnesota | 46.729553 | -94.68590 | NA | NA | Minnesota | United States | OK
Buffalo | 42.886447 | -78.87837 | Buffalo | Erie County | New York | United States | OK
Toronto | 43.653226 | -79.38318 | Toronto | Toronto Division | Ontario | Canada | OK
Winslow | 35.024187 | -110.69736 | Winslow | Navajo County | Arizona | United States | OK
Sarasota | 27.336435 | -82.53065 | Sarasota | Sarasota County | Florida | United States | OK
Wichita | 37.687176 | -97.33005 | Wichita | Sedgwick County | Kansas | United States | OK
Tulsa | 36.153982 | -95.99277 | Tulsa | Tulsa County | Oklahoma | United States | OK
Ottawa | 45.421530 | -75.69719 | Ottawa | Ottawa Division | Ontario | Canada | OK
Oklahoma | 35.467560 | -97.51643 | Oklahoma City | Oklahoma County | Oklahoma | United States | OK
Tampa | 27.950575 | -82.45718 | Tampa | Hillsborough County | Florida | United States | OK
Panama | 8.537981 | -80.78213 | NA | NA | NA | Panama | OK
Mattawa | 46.737910 | -119.90282 | Mattawa | Grant County | Washington | United States | OK
La Paloma | 32.314651 | -110.91615 | Tucson | Pima County | Arizona | United States | OK
Bangor | 44.801613 | -68.77123 | Bangor | Penobscot County | Maine | United States | OK
Baltimore | 39.290385 | -76.61219 | Baltimore | NA | Maryland | United States | OK
Salvador | -12.977749 | -38.50163 | NA | Salvador | State of Bahia | Brazil | OK
Amarillo | 35.221997 | -101.83130 | Amarillo | Potter County | Texas | United States | OK
Tocopilla | -22.088678 | -70.19605 | Tocopilla | Tocopilla Province | Antofagasta | Chile | OK
Barranquilla | 11.004107 | -74.80698 | Barranquilla | Barranquilla | Atlantico | Colombia | OK
Padilla | 26.544614 | -81.85391 | Fort Myers | Lee County | Florida | United States | OK
Boston | 42.360082 | -71.05888 | Boston | Suffolk County | Massachusetts | United States | OK
Charleston | 32.776475 | -79.93105 | Charleston | Charleston County | South Carolina | United States | OK
Dayton | 39.758948 | -84.19161 | Dayton | Montgomery County | Ohio | United States | OK
Louisiana | 30.984298 | -91.96233 | NA | NA | Louisiana | United States | OK
Washington | 47.751074 | -120.74014 | NA | NA | Washington | United States | OK
Houston | 29.760427 | -95.36980 | Houston | Harris County | Texas | United States | OK
Kingston | 41.927037 | -73.99736 | Kingston | Ulster County | New York | United States | OK
Texarkana | 33.425125 | -94.04769 | Texarkana | Bowie County | Texas | United States | OK
Monterey | 36.600238 | -121.89468 | Monterey | Monterey County | California | United States | OK
Faraday | 37.394787 | -121.92830 | San Jose | Santa Clara County | California | United States | OK
Santa Fe | 35.686975 | -105.93780 | Santa Fe | Santa Fe County | New Mexico | United States | OK
Tallapoosa | 33.744550 | -85.28801 | Tallapoosa | Haralson County | Georgia | United States | OK
Glen Rock | 40.962876 | -74.13292 | Glen Rock | Bergen County | New Jersey | United States | OK
Black Rock | 28.054965 | -82.50466 | Tampa | Hillsborough County | Florida | United States | OK
Little Rock | 34.746481 | -92.28959 | Little Rock | Pulaski County | Arkansas | United States | OK
Oskaloosa | 41.291673 | -92.64936 | Oskaloosa | Mahaska County | Iowa | United States | OK
Tennessee | 35.517491 | -86.58045 | NA | NA | Tennessee | United States | OK
Hennessey | 36.109205 | -97.89867 | Hennessey | Kingfisher County | Oklahoma | United States | OK
Chicopee | 42.148704 | -72.60787 | Chicopee | Hampden County | Massachusetts | United States | OK
Spirit Lake | 46.274303 | -122.13371 | NA | Skamania County | Washington | United States | OK
Grand Lake | 40.252207 | -105.82307 | Grand Lake | Grand County | Colorado | United States | OK
Devil’s Lake | 43.418397 | -89.73095 | NA | Sauk County | Wisconsin | United States | OK
Crater Lake | 42.944587 | -122.10900 | NA | Klamath County | Oregon | United States | OK
Louisville | 38.252665 | -85.75846 | Louisville | Jefferson County | Kentucky | United States | OK
Nashville | 36.162664 | -86.78160 | Nashville | Davidson County | Tennessee | United States | OK
Knoxville | 35.960638 | -83.92074 | Knoxville | Knox County | Tennessee | United States | OK
Ombabika | 50.233333 | -87.90000 | Ombabika | Thunder Bay District | Ontario | Canada | OK
Schefferville | 54.824559 | -66.81748 | Schefferville | Sept-Rivières—Caniapiscau | Quebec | Canada | OK
Jacksonville | 30.332184 | -81.65565 | Jacksonville | Duval County | Florida | United States | OK
Waterville | 43.965177 | -71.52791 | Waterville Valley | Grafton County | New Hampshire | United States | OK
Costa Rica | 9.748917 | -83.75343 | NA | NA | NA | Costa Rica | OK
Pittsfield | 42.450085 | -73.24538 | Pittsfield | Berkshire County | Massachusetts | United States | OK
Springfield | 37.208957 | -93.29230 | Springfield | Greene County | Missouri | United States | OK
Bakersfield | 35.373292 | -119.01871 | Bakersfield | Kern County | California | United States | OK
Shreveport | 32.525152 | -93.75018 | Shreveport | Caddo Parish | Louisiana | United States | OK
Hackensack | 40.885933 | -74.04347 | Hackensack | Bergen County | New Jersey | United States | OK
Cadillac | 37.053795 | -94.45353 | Joplin | Newton County | Missouri | United States | OK
Davenport | 41.523644 | -90.57764 | Davenport | Scott County | Iowa | United States | OK
Idaho | 44.068202 | -114.74204 | NA | NA | Idaho | United States | OK
Jellico | 36.587859 | -84.12687 | Jellico | Campbell County | Tennessee | United States | OK
Argentina | -38.416097 | -63.61667 | NA | NA | NA | Argentina | OK
Diamantina | -18.247454 | -43.60121 | Diamantina | Diamantina | Minas Gerais | Brazil | OK
Pasadena | 34.147785 | -118.14452 | Pasadena | Los Angeles County | California | United States | OK
Catalina | 33.387886 | -118.41631 | NA | Los Angeles County | California | United States | OK
Pittsburgh | 40.440625 | -79.99589 | Pittsburgh | Allegheny County | Pennsylvania | United States | OK
Parkersburg | 39.266742 | -81.56151 | Parkersburg | Wood County | West Virginia | United States | OK
Gravelbourg | 49.875676 | -106.55732 | Gravelbourg | Division No. 3 | Saskatchewan | Canada | OK
Colorado | 39.550051 | -105.78207 | NA | NA | Colorado | United States | OK
Ellensburg | 46.996514 | -120.54785 | Ellensburg | Kittitas County | Washington | United States | OK
Rexburg | 43.823110 | -111.79242 | Rexburg | Madison County | Idaho | United States | OK
Vicksburg | 32.352646 | -90.87788 | Vicksburg | Warren County | Mississippi | United States | OK
El Dorado | 38.809573 | -94.45718 | Raymore | Cass County | Missouri | United States | OK
Larimore | 47.906657 | -97.62675 | Larimore | Grand Forks County | North Dakota | United States | OK
Admore | 42.643147 | -82.85890 | Macomb | Macomb County | Michigan | United States | OK
Haverstraw | 41.197595 | -73.96458 | Haverstraw | Rockland County | New York | United States | OK
Chatanika | 65.111221 | -147.46539 | Chatanika | Fairbanks North Star | Alaska | United States | OK
Chaska | 44.789345 | -93.60184 | Chaska | Carver County | Minnesota | United States | OK
Nebraska | 41.492537 | -99.90181 | NA | NA | Nebraska | United States | OK
Alaska | 64.200841 | -149.49367 | NA | NA | Alaska | United States | OK
Opelika | 32.645412 | -85.37828 | Opelika | Lee County | Alabama | United States | OK
Baraboo | 43.471094 | -89.74429 | Baraboo | Sauk County | Wisconsin | United States | OK
Waterloo | 42.492786 | -92.34258 | Waterloo | Black Hawk County | Iowa | United States | OK
Kalamazoo | 42.291707 | -85.58723 | Kalamazoo | Kalamazoo County | Michigan | United States | OK
Kansas City | 39.099727 | -94.57857 | Kansas City | Jackson County | Missouri | United States | OK
Sioux City | 42.496342 | -96.40494 | Sioux City | Woodbury County | Iowa | United States | OK
Cedar City | 37.677477 | -113.06189 | Cedar City | Iron County | Utah | United States | OK
Dodge City | 37.752798 | -100.01708 | Dodge City | Ford County | Kansas | United States | OK
Fond du Lac | 43.773045 | -88.44705 | Fond du Lac | Fond du Lac County | Wisconsin | United States | OK

Mapping “Everywhere”

R has some marvelous packages to create maps. Let’s map where Cash has been.

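#Now let's use leaflet to plot
#(localities, country_shapes and state_shapes are assumed to have been built in an earlier step not shown here)
    library(leaflet)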
    map <- leaflet(width = '100%', 
                   options = leafletOptions())
    map <- addProviderTiles(map,
                            providers$CartoDB.Positron)
    map <- setView(map, lng = -102, lat = 30, zoom = 2)
    map <- addMarkers(map, 
                      data = places[localities,],
                      lng = ~lng, 
                      lat = ~lat,
                      label = ~locations
                    )
 map <- addPolygons(map, 
                      data = country_shapes,
                      fillColor = "Orange",
                      weight = 2,
                      opacity = 1,
                      color = "Orange",
                      dashArray = "3",
                      fillOpacity = 0.7,
                      label = ~CNTRY_NAME,
                      highlight = highlightOptions(
                                  weight = 2,
                                  color = "white",
                                  dashArray = "3",
                                  fillOpacity = 0.75,
                                  bringToFront = TRUE)
                    )    
 map <- addPolygons(map, 
                      data = state_shapes,
                      fillColor = "Green",
                      weight = 2,
                      opacity = 1,
                      color = "Green",
                      dashArray = "3",
                      label = ~NAME,
                      fillOpacity = 0.7,
                      highlight = highlightOptions(
                                  weight = 2,
                                  color = "white",
                                  dashArray = "3",
                                  fillOpacity = 0.75,
                                  bringToFront = TRUE)
                    )    
 
 map 

Fancy mapping packages are good fun. However, it is often more sensible to make a well-designed static image. Below, I make a similar map, relying only on trusty ggplot2.

library(ggplot2)
library(broom)

#Convert to plain dfs, as ggplot expects
state_df <- tidy(state_map, region = "STUSPS")
country_df <- tidy(country_map, region = "CNTRY_NAME")

#Plot
ggplot()+
  geom_polygon(data = country_df, aes(long,lat,group=group), fill = "white", col = "grey")+
  geom_polygon(data = state_df, aes(long,lat,group=group), fill = "white", col = "grey")+
  geom_polygon(data = state_shapes, aes(long,lat,group=group), fill="Dark Green", col = NA)+
  geom_polygon(data = country_shapes, aes(long,lat,group=group),fill="Orange", col = NA)+
  geom_jitter(data = places[localities,], aes(lng, lat), color = "Dark Blue")+
  theme_void()+
  theme(plot.background = element_rect(fill = "grey", color = NA))+
  coord_quickmap(xlim = c(-180,10), ylim =c(-60,80), clip = "on", expand = FALSE)
## Regions defined for each Polygons
## Regions defined for each Polygons

Why has Cash been where he has been?

As one can see from the map above, Johnny Cash has been plenty of places. He has not, however, been everywhere. This raises the question: why did Cash go where he went? Below, this question is tackled with a logistic regression.

The level of analysis is US counties. Excellent county-by-county data is available for the US. Additionally, the bulk of places named in the song can be pinpointed to US counties. However, this means that the model has to take into account locations from the song that fall outside the US, or locations that include several counties, i.e., states.

The model uses three variables:

  1. Population. It seems likely that Cash will go where there are people for whom to perform.

  2. Percent of Population Employed in Cattle Ranching. Cash maintained an image as a cowboy. It seems likely he would want to be seen among real cowboys.

  3. Percent of Population Incarcerated. Cash famously performed at prisons around the country. It seems likely that he would have been where the prisons were.

County-by-county population data is taken from the ACS. The number of ranchers is taken from the EEO survey. The number of incarcerated people is taken from the census itself. Because the first two measures are estimates, they are not available for sparsely populated counties. We will only be considering counties for which we have data for all three variables (essentially, the largest third of US counties).
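Restricting the analysis to counties with data for all three variables falls out of how `merge()` works: by default it keeps only rows whose key appears in both inputs. A minimal sketch with hypothetical toy tables (the county names and numbers are made up, and the column layout is assumed to mirror the CSV files read in below):

```r
# Hypothetical toy tables, for illustration only
toy_pop    <- data.frame(Geography = c("A County, X", "B County, X", "C County, Y"),
                         pop       = c(1000, 2000, 500))
toy_ranch  <- data.frame(Geography = c("A County, X", "B County, X"),
                         ranch     = c(10, 5))
toy_prison <- data.frame(Geography = c("A County, X", "C County, Y"),
                         prison    = c(50, 20))

# Nested default (inner) merges keep only counties present in all three tables
toy_counties <- merge(toy_prison,
                      merge(toy_pop, toy_ranch, by = "Geography"),
                      by = "Geography")
toy_counties
```

Only "A County, X" survives, because it is the only county present in all three toy tables. The resulting column order (key, prison, population, ranchers) is the order the renaming step relies on.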

library(DescTools)

Population <-read.csv("./Population better.csv")
Ranchers <-read.csv("./Ranchers.csv")
Prisoners <-read.csv("./Prison Population.csv")

counties <- merge(Prisoners, merge(Population, Ranchers, by = 'Geography'), by = 'Geography')
names(counties)<-c('names','prison','pop','ranch')
counties$prison <- counties$prison/counties$pop
counties$ranch  <- counties$ranch/counties$pop

counties$visited <- ifelse(counties$names %in% paste0(places$county,", ",places$state), 1, 0)

model <- glm(visited ~ pop + prison + ranch, 
             data = counties, 
             family = binomial(link = "logit")
)
summary(model)
## 
## Call:
## glm(formula = visited ~ pop + prison + ranch, family = binomial(link = "logit"), 
##     data = counties)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3508  -0.3224  -0.3050  -0.2945   2.5861  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.233e+00  2.402e-01 -13.460  < 2e-16 ***
## pop          9.405e-07  2.188e-07   4.299 1.71e-05 ***
## prison      -3.397e+00  1.102e+01  -0.308    0.758    
## ranch        3.735e+01  5.677e+01   0.658    0.511    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 416.53  on 966  degrees of freedom
## Residual deviance: 392.84  on 963  degrees of freedom
## AIC: 400.84
## 
## Number of Fisher Scoring iterations: 6
DescTools::PseudoR2(model)
##   McFadden 
## 0.05686871

Looking at the distribution of residuals and the pseudo R-squared, it is clear this model has very little explanatory power. The McFadden pseudo R-squared is particularly damning: only about six percent of the variation in the likelihood that Johnny Cash has been to a county is explained by the model’s chosen predictors. Additionally, only one of the predictors had a statistically significant effect: population. Note that while the effect of population is significant, it is also tiny. I can say with confidence that this model does not increase our understanding of where Johnny Cash has been.

It makes sense that this model would have little explanatory power. Johnny Cash did not actually write the song. It’s a cover.
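Both numbers discussed above can be checked by hand from the printed summary. The sketch below copies the deviances and the population coefficient from the output. It relies on the fact that, for a logistic regression on individual 0/1 outcomes, the deviance equals minus twice the log-likelihood, so McFadden's measure reduces to one minus the ratio of residual to null deviance. The per-100,000 scaling of the odds ratio is my own choice for readability.

```r
# Values copied from the summary output above
null_dev  <- 416.53
resid_dev <- 392.84
beta_pop  <- 9.405e-07

# McFadden pseudo R-squared for a 0/1 response: 1 - residual/null deviance
mcfadden <- 1 - resid_dev / null_dev
round(mcfadden, 3)

# Odds ratio for an increase of 100,000 people in county population
or_100k <- exp(beta_pop * 1e5)
round(or_100k, 2)
```

The first value reproduces the reported pseudo R-squared of roughly 0.057. The second, about 1.10, says that an extra 100,000 residents raises the odds of a visit by only around ten percent.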