I’ve Been Everywhere Analysis

Introduction

The objective of this document is to provide a sample of my coding as I perform a variety of data handling tasks.

Extracting Place Data from Lyrics

The first step in the analysis is to extract location names from the unstructured text of the song lyrics.

To do this systematically and efficiently, I will rely on three common-sense assumptions about the lyrics:

1. Although other aspects of the song are repetitive, lines with place names are not repeated.
2. Unlike most other words in the lyrics, place names are capitalized.
3. Place names will not include generic stop words.

#I read in the lyrics from a text file as a character vector with a string for each line
  setwd("~/Documents/Job Search/Code Sample")
  lyrics <- readLines("./I've Been Everywhere")

#Let's extract ngrams from the song
  #Eliminating blank or duplicated lines (first assumption)
    lines <- lyrics[! lyrics == ""] #We can eliminate blank lines
    lines <- lines[! lines %in% lines[duplicated(lines)]] #eliminate duplicate lines 
    
  #Pulling out capitalized phrases (second assumption)
    library(stringr)
    phrases <- str_extract_all(lines, pattern = "([A-Z][a-z'\\-]+ ?)+") 
    phrases <- unlist(phrases)
    phrases <- trimws(phrases)

    
  #Eliminating place names that include stop words (third assumption)
    library(stopwords)
    regex_stopwords <- paste(paste0(" ",paste(stopwords(), collapse = " | ")," "),
                             paste0("^",paste(stopwords(), collapse = " |^")," "),
                             paste0(" ",paste(stopwords(), collapse = "$| "),"$"),
                             paste0("^",paste(stopwords(), collapse = "$|^"),"$"),
                             sep = "|")
                                #^ these pastes make a long regex that will capture stopwords
                             
    places <- phrases[!grepl(regex_stopwords, tolower(phrases))] #removes phrases with stopwords (a logical mask is safe even when nothing matches)
    
  #Let's look at our list
    places <- unique(places) #drop repeated place names
    places

##  [1] "Winnemucca"    "Mack"          "Listen"        "Reno"         
##  [5] "Chicago"       "Fargo"         "Minnesota"     "Buffalo"      
##  [9] "Toronto"       "Winslow"       "Sarasota"      "Wichita"      
## [13] "Tulsa"         "Ottawa"        "Oklahoma"      "Tampa"        
## [17] "Panama"        "Mattawa"       "La Paloma"     "Bangor"       
## [21] "Baltimore"     "Salvador"      "Amarillo"      "Tocopilla"    
## [25] "Barranquilla"  "Padilla"       "Boston"        "Charleston"   
## [29] "Dayton"        "Louisiana"     "Washington"    "Houston"      
## [33] "Kingston"      "Texarkana"     "Monterey"      "Faraday"      
## [37] "Santa Fe"      "Tallapoosa"    "Glen Rock"     "Black Rock"   
## [41] "Little Rock"   "Oskaloosa"     "Tennessee"     "Hennessey"    
## [45] "Chicopee"      "Spirit Lake"   "Grand Lake"    "Devil's Lake" 
## [49] "Crater Lake"   "Pete's"        "Louisville"    "Nashville"    
## [53] "Knoxville"     "Ombabika"      "Schefferville" "Jacksonville" 
## [57] "Waterville"    "Costa Rica"    "Pittsfield"    "Springfield"  
## [61] "Bakersfield"   "Shreveport"    "Hackensack"    "Cadillac"     
## [65] "Fond"          "Lac"           "Davenport"     "Idaho"        
## [69] "Jellico"       "Argentina"     "Diamantina"    "Pasadena"     
## [73] "Catalina"      "Pittsburgh"    "Parkersburg"   "Gravelbourg"  
## [77] "Colorado"      "Ellensburg"    "Rexburg"       "Vicksburg"    
## [81] "El Dorado"     "Larimore"      "Admore"        "Haverstraw"   
## [85] "Chatanika"     "Chaska"        "Nebraska"      "Alaska"       
## [89] "Opelika"       "Baraboo"       "Waterloo"      "Kalamazoo"    
## [93] "Kansas City"   "Sioux City"    "Cedar City"    "Dodge City"

It looks like a systematic handling of the song lyrics, aided by the stated assumptions, did a decent job of pulling place names out of the song. However, there were some errors: two false positives and one false negative. Such is the messy reality of text data.

“Listen” and “Pete’s” were retained. They fit my assumptions even though they are not place names. “Fond du Lac” was not included. It failed to meet my assumptions even though it is a place name. I manually correct these errors below.
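To see why “Fond du Lac” slips through, here is a minimal sketch that applies the same capitalized-phrase pattern to a single illustrative line. It uses base R's regmatches() and gregexpr() rather than stringr so that it runs without extra packages; extract_capitalized is a hypothetical helper name and is not part of the analysis itself.

```r
# Hypothetical helper: pull runs of capitalized words out of one line.
# Same idea as the str_extract_all() call above, written in base R.
extract_capitalized <- function(line) {
  pattern <- "([A-Z][a-z'-]+ ?)+"
  matches <- regmatches(line, gregexpr(pattern, line, perl = TRUE))[[1]]
  trimws(matches)
}

# Illustrative input line
extract_capitalized("Hackensack, Cadillac, Fond du Lac, Davenport")
```

The lowercase "du" breaks the run of capitalized words, so the phrase comes back as two fragments, "Fond" and "Lac".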

#Removing false positives
  places <- places[! places %in% c("Listen", "Pete's")]

#Adding the place name Fond du Lac
  places <- places[! places %in% c("Fond","Lac")]
  places <- c(places, "Fond du Lac")

Geocode Function Writing

Place names are not very helpful on their own; I want data about these places. A geocoding API lets us pull location data using the place names, as if we were searching Google Maps for each location. I will use the Google Maps API to get information about the locations extracted from the song lyrics. The ggmap package already provides functions for interacting with this API; however, the geocoding function it includes is inflexible and its method of authentication is outdated, so I will write my own geocoding function.
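Before the full function, it helps to see what a single request looks like. The sketch below only builds the URL and sends nothing; build_geocode_url and the "YOUR_KEY" placeholder are hypothetical. It percent-encodes the place name with utils::URLencode(), which is one way to handle spaces and apostrophes in names such as "Devil's Lake".

```r
# Hypothetical helper: build (but do not send) a Geocoding API request URL.
build_geocode_url <- function(location, api_key) {
  paste0("https://maps.googleapis.com/maps/api/geocode/json?",
         "address=", utils::URLencode(location, reserved = TRUE),
         "&key=", api_key)
}

# "YOUR_KEY" is a placeholder, not a real credential
build_geocode_url("Devil's Lake", api_key = "YOUR_KEY")
```

Passing a URL like this to fromJSON() would return the parsed response as a list with status and results elements, which is what the function below unpacks.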

library(jsonlite)
vector_fromJSON <- Vectorize(fromJSON, SIMPLIFY = FALSE)
  
#Now let's write a geocoding function that returns a dataframe
  geocode <- function(locations, 
                      regioncode = 'us', 
                      APIkey){
                             
      json_locations <- gsub(' ', '+', locations)
                                    
      response <- vector_fromJSON(paste0('https://maps.googleapis.com/maps/api/geocode/json?',
                                         'address=', 
                                         json_locations,
                                         '&region=', regioncode, #the API's region-biasing parameter is named 'region'
                                         '&key=', APIkey))
                              
      response <- unlist(response, recursive = F)
      status <- response[grep('.status$',names(response))]
      status <- unlist(status)
      results <- response[grep('.results$',names(response))]
                              
      lat       <- as.numeric(unlist(lapply(results, function(x){x$geometry$location$lat[1]})))
      lng       <- as.numeric(unlist(lapply(results, function(x){x$geometry$location$lng[1]})))
                              
      address <- lapply(results, 
                        function(x){address = data.frame(x$address_components[[1]]$long_name, 
                                                         unlist(lapply(x$address_components[[1]]$types, 
                                                                       paste, 
                                                                       collapse = ", ")),
                                                         stringsAsFactors = FALSE)})
      address <- lapply(address,
                        function(x){names(x)<-c("comps", "types");x})
                              
      country   <- unlist(lapply(address, 
                                 function(x){ifelse(length(x$comps[grep('country', 
                                                                        x$types)]) == 1, 
                                                    x$comps[grep('country', x$types)],
                                                    NA)}))
      
      state   <- unlist(lapply(address, 
                               function(x){ifelse(length(x$comps[grep('administrative_area_level_1', 
                                                                      x$types
                                                                      )]) == 1, 
                                           x$comps[grep('administrative_area_level_1', 
                                                        x$types)],
                                            NA)}))
      
      county   <- unlist(lapply(address, 
                                function(x){ifelse(length(x$comps[grep('administrative_area_level_2', 
                                                                       x$types)]) == 1, 
                                            x$comps[grep('administrative_area_level_2', x$types)],
                                            NA)}))
      
      locality <- unlist(lapply(address, 
                                function(x){ifelse(length(x$comps[grep('locality', 
                                                                       x$types)]) == 1, 
                                            x$comps[grep('locality', x$types)],
                                            NA)}))

      df <- data.frame(locations,
                       lat, 
                       lng, 
                       locality, 
                       county, 
                       state, 
                       country, 
                       status, 
                       stringsAsFactors = F)
      
      names(df) <- c("locations", 
                     "lat",
                     "lng",
                     "locality",
                     "county", 
                     "state",
                     "country", 
                     "geocode_status")
      
      rownames(df) <- 1:nrow(df)
                              
      return(df)
                                      }
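The four lookups for country, state, county and locality all follow one pattern, so they could be folded into a single helper. The following is a sketch of that design, not the function as written above; get_component is a hypothetical name, and the toy data frame mimics the comps/types layout built inside geocode().

```r
# Hypothetical helper: return the component whose type matches `type`,
# or NA when there is not exactly one match (same rule as in geocode()).
get_component <- function(address, type) {
  hit <- address$comps[grepl(type, address$types, fixed = TRUE)]
  if (length(hit) == 1) hit else NA_character_
}

# Toy address in the comps/types layout
address <- data.frame(
  comps = c("Reno", "Washoe County", "Nevada", "United States"),
  types = c("locality, political",
            "administrative_area_level_2, political",
            "administrative_area_level_1, political",
            "country, political"),
  stringsAsFactors = FALSE
)

get_component(address, "administrative_area_level_1")
get_component(address, "postal_code")
```

Applied with lapply() over the list of addresses, one call per component type would stand in for each of the four repeated blocks.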

Now I can run my function on all the places we extracted from the song.

places <- geocode(locations = places, APIkey = Noahs_APIkey)

View(places)

locations	lat	lng	locality	county	state	country	geocode_status
Winnemucca	40.972958	-117.73568	Winnemucca	Humboldt County	Nevada	United States	OK
Mack	36.077470	-96.05270	Tulsa	Tulsa County	Oklahoma	United States	OK
Reno	39.529633	-119.81380	Reno	Washoe County	Nevada	United States	OK
Chicago	41.878114	-87.62980	Chicago	Cook County	Illinois	United States	OK
Fargo	46.877186	-96.78980	Fargo	Cass County	North Dakota	United States	OK
Minnesota	46.729553	-94.68590	NA	NA	Minnesota	United States	OK
Buffalo	42.886447	-78.87837	Buffalo	Erie County	New York	United States	OK
Toronto	43.653226	-79.38318	Toronto	Toronto Division	Ontario	Canada	OK
Winslow	35.024187	-110.69736	Winslow	Navajo County	Arizona	United States	OK
Sarasota	27.336435	-82.53065	Sarasota	Sarasota County	Florida	United States	OK
Wichita	37.687176	-97.33005	Wichita	Sedgwick County	Kansas	United States	OK
Tulsa	36.153982	-95.99277	Tulsa	Tulsa County	Oklahoma	United States	OK
Ottawa	45.421530	-75.69719	Ottawa	Ottawa Division	Ontario	Canada	OK
Oklahoma	35.467560	-97.51643	Oklahoma City	Oklahoma County	Oklahoma	United States	OK
Tampa	27.950575	-82.45718	Tampa	Hillsborough County	Florida	United States	OK
Panama	8.537981	-80.78213	NA	NA	NA	Panama	OK
Mattawa	46.737910	-119.90282	Mattawa	Grant County	Washington	United States	OK
La Paloma	32.314651	-110.91615	Tucson	Pima County	Arizona	United States	OK
Bangor	44.801613	-68.77123	Bangor	Penobscot County	Maine	United States	OK
Baltimore	39.290385	-76.61219	Baltimore	NA	Maryland	United States	OK
Salvador	-12.977749	-38.50163	NA	Salvador	State of Bahia	Brazil	OK
Amarillo	35.221997	-101.83130	Amarillo	Potter County	Texas	United States	OK
Tocopilla	-22.088678	-70.19605	Tocopilla	Tocopilla Province	Antofagasta	Chile	OK
Barranquilla	11.004107	-74.80698	Barranquilla	Barranquilla	Atlantico	Colombia	OK
Padilla	26.544614	-81.85391	Fort Myers	Lee County	Florida	United States	OK
Boston	42.360082	-71.05888	Boston	Suffolk County	Massachusetts	United States	OK
Charleston	32.776475	-79.93105	Charleston	Charleston County	South Carolina	United States	OK
Dayton	39.758948	-84.19161	Dayton	Montgomery County	Ohio	United States	OK
Louisiana	30.984298	-91.96233	NA	NA	Louisiana	United States	OK
Washington	47.751074	-120.74014	NA	NA	Washington	United States	OK
Houston	29.760427	-95.36980	Houston	Harris County	Texas	United States	OK
Kingston	41.927037	-73.99736	Kingston	Ulster County	New York	United States	OK
Texarkana	33.425125	-94.04769	Texarkana	Bowie County	Texas	United States	OK
Monterey	36.600238	-121.89468	Monterey	Monterey County	California	United States	OK
Faraday	37.394787	-121.92830	San Jose	Santa Clara County	California	United States	OK
Santa Fe	35.686975	-105.93780	Santa Fe	Santa Fe County	New Mexico	United States	OK
Tallapoosa	33.744550	-85.28801	Tallapoosa	Haralson County	Georgia	United States	OK
Glen Rock	40.962876	-74.13292	Glen Rock	Bergen County	New Jersey	United States	OK
Black Rock	28.054965	-82.50466	Tampa	Hillsborough County	Florida	United States	OK
Little Rock	34.746481	-92.28959	Little Rock	Pulaski County	Arkansas	United States	OK
Oskaloosa	41.291673	-92.64936	Oskaloosa	Mahaska County	Iowa	United States	OK
Tennessee	35.517491	-86.58045	NA	NA	Tennessee	United States	OK
Hennessey	36.109205	-97.89867	Hennessey	Kingfisher County	Oklahoma	United States	OK
Chicopee	42.148704	-72.60787	Chicopee	Hampden County	Massachusetts	United States	OK
Spirit Lake	46.274303	-122.13371	NA	Skamania County	Washington	United States	OK
Grand Lake	40.252207	-105.82307	Grand Lake	Grand County	Colorado	United States	OK
Devil’s Lake	43.418397	-89.73095	NA	Sauk County	Wisconsin	United States	OK
Crater Lake	42.944587	-122.10900	NA	Klamath County	Oregon	United States	OK
Louisville	38.252665	-85.75846	Louisville	Jefferson County	Kentucky	United States	OK
Nashville	36.162664	-86.78160	Nashville	Davidson County	Tennessee	United States	OK
Knoxville	35.960638	-83.92074	Knoxville	Knox County	Tennessee	United States	OK
Ombabika	50.233333	-87.90000	Ombabika	Thunder Bay District	Ontario	Canada	OK
Schefferville	54.824559	-66.81748	Schefferville	Sept-Rivières—Caniapiscau	Quebec	Canada	OK
Jacksonville	30.332184	-81.65565	Jacksonville	Duval County	Florida	United States	OK
Waterville	43.965177	-71.52791	Waterville Valley	Grafton County	New Hampshire	United States	OK
Costa Rica	9.748917	-83.75343	NA	NA	NA	Costa Rica	OK
Pittsfield	42.450085	-73.24538	Pittsfield	Berkshire County	Massachusetts	United States	OK
Springfield	37.208957	-93.29230	Springfield	Greene County	Missouri	United States	OK
Bakersfield	35.373292	-119.01871	Bakersfield	Kern County	California	United States	OK
Shreveport	32.525152	-93.75018	Shreveport	Caddo Parish	Louisiana	United States	OK
Hackensack	40.885933	-74.04347	Hackensack	Bergen County	New Jersey	United States	OK
Cadillac	37.053795	-94.45353	Joplin	Newton County	Missouri	United States	OK
Davenport	41.523644	-90.57764	Davenport	Scott County	Iowa	United States	OK
Idaho	44.068202	-114.74204	NA	NA	Idaho	United States	OK
Jellico	36.587859	-84.12687	Jellico	Campbell County	Tennessee	United States	OK
Argentina	-38.416097	-63.61667	NA	NA	NA	Argentina	OK
Diamantina	-18.247454	-43.60121	Diamantina	Diamantina	Minas Gerais	Brazil	OK
Pasadena	34.147785	-118.14452	Pasadena	Los Angeles County	California	United States	OK
Catalina	33.387886	-118.41631	NA	Los Angeles County	California	United States	OK
Pittsburgh	40.440625	-79.99589	Pittsburgh	Allegheny County	Pennsylvania	United States	OK
Parkersburg	39.266742	-81.56151	Parkersburg	Wood County	West Virginia	United States	OK
Gravelbourg	49.875676	-106.55732	Gravelbourg	Division No. 3	Saskatchewan	Canada	OK
Colorado	39.550051	-105.78207	NA	NA	Colorado	United States	OK
Ellensburg	46.996514	-120.54785	Ellensburg	Kittitas County	Washington	United States	OK
Rexburg	43.823110	-111.79242	Rexburg	Madison County	Idaho	United States	OK
Vicksburg	32.352646	-90.87788	Vicksburg	Warren County	Mississippi	United States	OK
El Dorado	38.809573	-94.45718	Raymore	Cass County	Missouri	United States	OK
Larimore	47.906657	-97.62675	Larimore	Grand Forks County	North Dakota	United States	OK
Admore	42.643147	-82.85890	Macomb	Macomb County	Michigan	United States	OK
Haverstraw	41.197595	-73.96458	Haverstraw	Rockland County	New York	United States	OK
Chatanika	65.111221	-147.46539	Chatanika	Fairbanks North Star	Alaska	United States	OK
Chaska	44.789345	-93.60184	Chaska	Carver County	Minnesota	United States	OK
Nebraska	41.492537	-99.90181	NA	NA	Nebraska	United States	OK
Alaska	64.200841	-149.49367	NA	NA	Alaska	United States	OK
Opelika	32.645412	-85.37828	Opelika	Lee County	Alabama	United States	OK
Baraboo	43.471094	-89.74429	Baraboo	Sauk County	Wisconsin	United States	OK
Waterloo	42.492786	-92.34258	Waterloo	Black Hawk County	Iowa	United States	OK
Kalamazoo	42.291707	-85.58723	Kalamazoo	Kalamazoo County	Michigan	United States	OK
Kansas City	39.099727	-94.57857	Kansas City	Jackson County	Missouri	United States	OK
Sioux City	42.496342	-96.40494	Sioux City	Woodbury County	Iowa	United States	OK
Cedar City	37.677477	-113.06189	Cedar City	Iron County	Utah	United States	OK
Dodge City	37.752798	-100.01708	Dodge City	Ford County	Kansas	United States	OK
Fond du Lac	43.773045	-88.44705	Fond du Lac	Fond du Lac County	Wisconsin	United States	OK

Mapping “Everywhere”

R has some marvelous packages to create maps. Let’s map where Cash has been.
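The geocoded results mix points (cities, lakes), whole states and whole countries, and each kind is drawn differently on the maps. The objects that make this split are not shown here, so the following is only a sketch of one way it could be done, using a hypothetical level column and three rows copied from the table above.

```r
# Three rows copied from the geocoding results above
toy <- data.frame(
  locations = c("Reno", "Minnesota", "Panama"),
  locality  = c("Reno", NA, NA),
  county    = c("Washoe County", NA, NA),
  state     = c("Nevada", "Minnesota", NA),
  stringsAsFactors = FALSE
)

# Assumption: a row with a locality or county is a point;
# otherwise it is a state if a state is present, else a country.
toy$level <- ifelse(!is.na(toy$locality) | !is.na(toy$county), "point",
             ifelse(!is.na(toy$state), "state", "country"))

toy[, c("locations", "level")]
```

Rows at the "point" level would then be drawn as markers, while the other two levels would select which state and country shapes to fill.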

#Now let's use leaflet to plot
    library(leaflet)
    map <- leaflet(width = '100%', 
                   options = leafletOptions())
    map <- addProviderTiles(map,
                            providers$CartoDB.Positron)
    map <- setView(map, lng = -102, lat = 30, zoom = 2)
    map <- addMarkers(map, 
                      data = places[localities,],
                      lng = ~lng, 
                      lat = ~lat,
                      label = ~locations
                    )
    map <- addPolygons(map, 
                      data = country_shapes,
                      fillColor = "Orange",
                      weight = 2,
                      opacity = 1,
                      color = "Orange",
                      dashArray = "3",
                      fillOpacity = 0.7,
                      label = ~CNTRY_NAME,
                      highlight = highlightOptions(
                                  weight = 2,
                                  color = "white",
                                  dashArray = "3",
                                  fillOpacity = 0.75,
                                  bringToFront = TRUE)
                    )    
    map <- addPolygons(map, 
                      data = state_shapes,
                      fillColor = "Green",
                      weight = 2,
                      opacity = 1,
                      color = "Green",
                      dashArray = "3",
                      label = ~NAME,
                      fillOpacity = 0.7,
                      highlight = highlightOptions(
                                  weight = 2,
                                  color = "white",
                                  dashArray = "3",
                                  fillOpacity = 0.75,
                                  bringToFront = TRUE)
                    )    
 
 map

Fancy mapping packages are good fun. However, it is often more sensible to make a well-designed static image. Below, I make a similar map, relying only on trusty ggplot2.

library(ggplot2)
library(broom)

#Convert to plain dfs, as ggplot expects
state_df <- tidy(state_map, region = "STUSPS")
country_df <- tidy(country_map, region = "CNTRY_NAME")

#Plot
ggplot()+
  geom_polygon(data = country_df, aes(long,lat,group=group), fill = "white", col = "grey")+
  geom_polygon(data = state_df, aes(long,lat,group=group), fill = "white", col = "grey")+
  geom_polygon(data = state_shapes, aes(long,lat,group=group), fill="Dark Green", col = NA)+
  geom_polygon(data = country_shapes, aes(long,lat,group=group),fill="Orange", col = NA)+
  geom_jitter(data = places[localities,], aes(lng, lat), color = "Dark Blue")+
  theme_void()+
  theme(plot.background = element_rect(fill = "grey", color = NA))+
  coord_quickmap(xlim = c(-180,10), ylim =c(-60,80), clip = "on", expand = FALSE)

## Regions defined for each Polygons
## Regions defined for each Polygons

Why has Cash been where he has been?

As one can see from the map above, Johnny Cash has been plenty of places. He has not, however, been everywhere. This raises the question: why did Cash go where he went? Below, this question is tackled with a logistic regression.

The level of analysis is US counties. Excellent county-by-county data is available for the US. Additionally, the bulk of places named in the song can be pinpointed to US counties. However, this means that the model has to take into account locations from the song that fall outside the US, or locations that include several counties, i.e. states.
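A small sketch shows what a "County, State" matching key, like the one used in the model code below, does with such places. The rows are copied from the geocoding table above; the two entries in county_names only illustrate the key format and are an assumption about how the Geography column is written.

```r
# Rows copied from the geocoding results above
toy_places <- data.frame(
  locations = c("Reno", "Minnesota", "Toronto"),
  county    = c("Washoe County", NA, "Toronto Division"),
  state     = c("Nevada", "Minnesota", "Ontario"),
  stringsAsFactors = FALSE
)

# Assumed key format for US counties: "County, State"
county_names <- c("Washoe County, Nevada", "Hennepin County, Minnesota")

# Build one key per place, then flag counties that appear among the keys
keys <- paste0(toy_places$county, ", ", toy_places$state)
keys
visited <- ifelse(county_names %in% keys, 1, 0)
visited
```

State-level places produce keys such as "NA, Minnesota", and places outside the US produce keys that no US county carries, so both drop out of the visited flag without raising any error.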

The model uses three variables:

Population. It seems likely that Cash will go where there are people for whom to perform.
Percent of Population Employed in Cattle Ranching. Cash maintained an image as a cowboy. It seems likely he would want to be seen among real cowboys.
Percent of Population Incarcerated. Cash famously performed at prisons around the country. It seems likely that he would have been where the prisons were.

County-by-county population data is taken from the ACS. The number of ranchers is taken from the EEO survey. The number of incarcerated is taken from the census itself. Because the first two measures are estimates, they are not available for sparsely populated counties. We will only be considering counties for which we have data for all three variables (essentially, the largest third of US counties).

library(DescTools)

Population <-read.csv("./Population better.csv")
Ranchers <-read.csv("./Ranchers.csv")
Prisoners <-read.csv("./Prison Population.csv")

counties <- merge(Prisoners, merge(Population, Ranchers, by = 'Geography'), by = 'Geography')
names(counties)<-c('names','prison','pop','ranch')
counties$prison <- counties$prison/counties$pop
counties$ranch  <- counties$ranch/counties$pop

counties$visited <- ifelse(counties$names %in% paste0(places$county,", ",places$state), 1, 0)

model <- glm(visited ~ pop + prison + ranch, 
             data = counties, 
             family = binomial(link = "logit")
)
summary(model)

## 
## Call:
## glm(formula = visited ~ pop + prison + ranch, family = binomial(link = "logit"), 
##     data = counties)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3508  -0.3224  -0.3050  -0.2945   2.5861  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.233e+00  2.402e-01 -13.460  < 2e-16 ***
## pop          9.405e-07  2.188e-07   4.299 1.71e-05 ***
## prison      -3.397e+00  1.102e+01  -0.308    0.758    
## ranch        3.735e+01  5.677e+01   0.658    0.511    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 416.53  on 966  degrees of freedom
## Residual deviance: 392.84  on 963  degrees of freedom
## AIC: 400.84
## 
## Number of Fisher Scoring iterations: 6

DescTools::PseudoR2(model)

##   McFadden 
## 0.05686871

Looking at the distribution of residuals and the pseudo \(R^2\), it is clear this model has very little explanatory power. The McFadden pseudo \(R^2\) is particularly damning: only about six percent of the variation in the likelihood that Johnny Cash has been to a county is explained by the model’s chosen predictors. Additionally, only one of the predictors had a statistically significant effect: population. Note that while the population coefficient is significant, it is also tiny. I can say with confidence this model does not increase our understanding of where Johnny Cash has been.
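Both figures can be checked by hand from the printed output. The sketch below recomputes the McFadden pseudo R squared from the two deviances, which works here because for a 0/1 outcome the deviance equals minus twice the log-likelihood, and rescales the population coefficient to a change of 100,000 residents. The 100,000 scale is an arbitrary choice for illustration.

```r
# Values copied from the summary(model) output above
null_deviance     <- 416.53
residual_deviance <- 392.84
beta_pop          <- 9.405e-07

# McFadden: 1 - logLik(model) / logLik(null), using deviance = -2 * logLik
mcfadden <- 1 - residual_deviance / null_deviance
round(mcfadden, 4)

# Multiplicative change in the odds of a visit per 100,000 extra residents
odds_ratio_100k <- exp(beta_pop * 1e5)
round(odds_ratio_100k, 3)
```

The first value agrees with the PseudoR2() output up to rounding of the printed deviances, and the second works out to roughly a ten percent increase in the odds for every additional 100,000 residents.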

It makes sense that this model would have little explanatory power. Johnny Cash did not actually write the song. It’s a cover.