The objective of this document is to provide a sample of my coding as I perform a variety of data handling tasks.
The first step in the analysis is to extract location names from the unstructured text of the song lyrics.
To do this systematically and efficiently, I will rely on three common-sense assumptions about the lyrics: 1. Although other aspects of the song are repetitive, lines with place names are not repeated. 2. Unlike most other words in the lyrics, place names are capitalized. 3. Place names will not include generic stop words.
#I read in the lyrics from a text file as a character vector with a string for each line
setwd("~/Documents/Job Search/Code Sample")
lyrics <- readLines("./I've Been Everywhere")
#Let's extract ngrams from the song
#Eliminating blank or duplicated lines (first assumption)
lines <- lyrics[! lyrics == ""] #We can eliminate blank lines
lines <- lines[! lines %in% lines[duplicated(lines)]] #eliminate duplicate lines
#Pulling out capitalized phrases (second assumption)
library(stringr)
phrases <- str_extract_all(lines, pattern = "([A-Z][a-z'\\-]+ ?)+")
phrases <- unlist(phrases)
phrases <- trimws(phrases)
#Eliminating place names that include stop words (third assumption)
library(stopwords)
regex_stopwords <- paste(paste0(" ",paste(stopwords(), collapse = " | ")," "),
paste0("^",paste(stopwords(), collapse = " |^")," "),
paste0(" ",paste(stopwords(), collapse = "$| "),"$"),
paste0("^",paste(stopwords(), collapse = "$|^"),"$"),
sep = "|")
#^ these pastes make a long regex that will capture stopwords
places <- phrases[!grepl(regex_stopwords, tolower(phrases))] #removes phrases with stopwords (safe even if nothing matches)
#Let's look at our list
places <- places[!duplicated(places)] #keep only the first occurrence of each phrase
places
## [1] "Winnemucca" "Mack" "Listen" "Reno"
## [5] "Chicago" "Fargo" "Minnesota" "Buffalo"
## [9] "Toronto" "Winslow" "Sarasota" "Wichita"
## [13] "Tulsa" "Ottawa" "Oklahoma" "Tampa"
## [17] "Panama" "Mattawa" "La Paloma" "Bangor"
## [21] "Baltimore" "Salvador" "Amarillo" "Tocopilla"
## [25] "Barranquilla" "Padilla" "Boston" "Charleston"
## [29] "Dayton" "Louisiana" "Washington" "Houston"
## [33] "Kingston" "Texarkana" "Monterey" "Faraday"
## [37] "Santa Fe" "Tallapoosa" "Glen Rock" "Black Rock"
## [41] "Little Rock" "Oskaloosa" "Tennessee" "Hennessey"
## [45] "Chicopee" "Spirit Lake" "Grand Lake" "Devil's Lake"
## [49] "Crater Lake" "Pete's" "Louisville" "Nashville"
## [53] "Knoxville" "Ombabika" "Schefferville" "Jacksonville"
## [57] "Waterville" "Costa Rica" "Pittsfield" "Springfield"
## [61] "Bakersfield" "Shreveport" "Hackensack" "Cadillac"
## [65] "Fond" "Lac" "Davenport" "Idaho"
## [69] "Jellico" "Argentina" "Diamantina" "Pasadena"
## [73] "Catalina" "Pittsburgh" "Parkersburg" "Gravelbourg"
## [77] "Colorado" "Ellensburg" "Rexburg" "Vicksburg"
## [81] "El Dorado" "Larimore" "Admore" "Haverstraw"
## [85] "Chatanika" "Chaska" "Nebraska" "Alaska"
## [89] "Opelika" "Baraboo" "Waterloo" "Kalamazoo"
## [93] "Kansas City" "Sioux City" "Cedar City" "Dodge City"
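The three steps can be seen on a small scale in the sketch below. It uses base R only, a few made-up lyric lines, and a tiny hand-picked stopword list (all illustrative assumptions, not the song text or the list from the `stopwords` package). It also swaps the four-part pasted regex for a single word-boundary pattern, which should behave similarly on simple input.

```r
# Toy input: hypothetical lines, not the actual lyrics
toy_lines <- c("I've been to Reno, Chicago, Fargo",
               "I've been everywhere, man",
               "I've been everywhere, man",
               "Of travel I've had my share, Santa Fe")

# Assumption 1: drop every line that appears more than once
toy_lines <- toy_lines[!toy_lines %in% toy_lines[duplicated(toy_lines)]]

# Assumption 2: pull out runs of capitalized words
pattern <- "([A-Z][a-z'-]+ ?)+"
matches <- regmatches(toy_lines, gregexpr(pattern, toy_lines, perl = TRUE))
toy_phrases <- trimws(unlist(matches))

# Assumption 3: drop phrases containing a stopword (tiny illustrative list)
toy_stopwords <- c("i've", "of", "the", "a")
stop_pattern <- paste0("\\b(", paste(toy_stopwords, collapse = "|"), ")\\b")
toy_places <- toy_phrases[!grepl(stop_pattern, tolower(toy_phrases), perl = TRUE)]

# Remove any repeated phrases
toy_places <- unique(toy_places)
toy_places
```

The word boundaries matter here: without them, the stopword "a" would match inside "Chicago" or "Santa Fe" and wrongly discard them.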
It looks like a systematic handling of the song lyrics, aided by the stated assumptions, did a reasonable job of pulling place names out of the song lyrics. However, there were some errors: two false positives and a false negative. Such is the messy reality of text data.
“Listen” and “Pete’s” were retained. They fit my assumptions even though they are not place names. “Fond du Lac” was not included. It failed to meet my assumptions even though it is a place name. I manually correct these errors below.
#Removing false positives
places <- places[! places %in% c("Listen", "Pete's")]
#Adding the place name Fond du Lac
places <- places[! places %in% c("Fond","Lac")]
places <- c(places, "Fond du Lac")
Place names are not very helpful on their own. I want data about these places, so I will use the Google Maps geocoding API to pull location data for each place name extracted from the song lyrics. This is as if we were searching Google Maps for each location.
I will write my own geocoding function to do so. The ggmap package already has functions to interact with the Google Maps API; however, the function it includes is inflexible and its method of authentication is outdated.
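As a rough sketch of the request-building step, the snippet below assembles a query URL from a named list of parameters and percent-encodes each value with `utils::URLencode()`. The helper name `build_query_url` and the placeholder key are hypothetical. This is an alternative to replacing spaces with `+` via `gsub`, shown only to illustrate how place names with apostrophes or other special characters could be handled.

```r
# Hypothetical helper: build a GET URL from a base and a named list of parameters
build_query_url <- function(base, params) {
  # Percent-encode each value so spaces, apostrophes, '&', etc. cannot break the query
  encoded <- vapply(params,
                    function(p) utils::URLencode(as.character(p), reserved = TRUE),
                    character(1))
  paste0(base, "?", paste0(names(params), "=", encoded, collapse = "&"))
}

url <- build_query_url(
  "https://maps.googleapis.com/maps/api/geocode/json",
  list(address = "Devil's Lake", key = "YOUR_API_KEY")  # placeholder, not a real key
)
url
```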
library(jsonlite)
vector_fromJSON <- Vectorize(fromJSON, SIMPLIFY = FALSE)
#Now let's write a geocoding function that returns a dataframe
geocode <- function(locations,
regioncode = 'us',
APIkey){
json_locations <- gsub(' ', '+', locations)
response <- vector_fromJSON(paste0('https://maps.googleapis.com/maps/api/geocode/json?',
'address=',
json_locations,
'&regioncode=', regioncode,
'&key=', APIkey))
response <- unlist(response, recursive = F)
status <- response[grep('.status$',names(response))]
status <- unlist(status)
results <- response[grep('.results$',names(response))]
lat <- as.numeric(unlist(lapply(results, function(x){x$geometry$location$lat[1]})))
lng <- as.numeric(unlist(lapply(results, function(x){x$geometry$location$lng[1]})))
address <- lapply(results,
function(x){address = data.frame(x$address_components[[1]]$long_name,
unlist(lapply(x$address_components[[1]]$types,
paste,
collapse = ", ")),
stringsAsFactors = FALSE)})
address <- lapply(address,
function(x){names(x)<-c("comps", "types");x})
country <- unlist(lapply(address,
function(x){ifelse(length(x$comps[grep('country',
x$types)]) == 1,
x$comps[grep('country', x$types)],
NA)}))
state <- unlist(lapply(address,
function(x){ifelse(length(x$comps[grep('administrative_area_level_1',
x$types
)]) == 1,
x$comps[grep('administrative_area_level_1',
x$types)],
NA)}))
county <- unlist(lapply(address,
function(x){ifelse(length(x$comps[grep('administrative_area_level_2',
x$types)]) == 1,
x$comps[grep('administrative_area_level_2', x$types)],
NA)}))
locality <- unlist(lapply(address,
function(x){ifelse(length(x$comps[grep('locality',
x$types)]) == 1,
x$comps[grep('locality', x$types)],
NA)}))
df <- data.frame(locations,
lat,
lng,
locality,
county,
state,
country,
status,
stringsAsFactors = F)
names(df) <- c("locations",
"lat",
"lng",
"locality",
"county",
"state",
"country",
"geocode_status")
rownames(df) <- 1:nrow(df)
return(df)
}
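The four near-identical blocks for `country`, `state`, `county`, and `locality` follow one pattern: find the rows whose `types` mention a given component type, and return the matching name only when exactly one row matches. A minimal sketch of that pattern as a single helper is below. The name `get_component` and the toy `address` data frame are hypothetical; the data frame only mimics the `comps`/`types` layout built inside `geocode`.

```r
# Hypothetical helper: return the single component of a given type, else NA
get_component <- function(address, type) {
  hits <- address$comps[grepl(type, address$types, fixed = TRUE)]
  if (length(hits) == 1) hits else NA_character_
}

# Toy data frame in the same comps/types layout (values are illustrative)
address <- data.frame(
  comps = c("Reno", "Washoe County", "Nevada", "United States"),
  types = c("locality, political",
            "administrative_area_level_2, political",
            "administrative_area_level_1, political",
            "country, political"),
  stringsAsFactors = FALSE
)

get_component(address, "locality")
get_component(address, "administrative_area_level_1")
get_component(address, "postal_code")  # no matching row, so NA
```

Applied over the list of per-location data frames with something like `vapply(address, get_component, character(1), type = "country")`, a helper of this kind would return one value per location.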
Now I can run my function on all the places we extracted from the song.
places <- geocode(locations = places, APIkey = Noahs_APIkey)
View(places)
| locations | lat | lng | locality | county | state | country | geocode_status |
|---|---|---|---|---|---|---|---|
| Winnemucca | 40.972958 | -117.73568 | Winnemucca | Humboldt County | Nevada | United States | OK |
| Mack | 36.077470 | -96.05270 | Tulsa | Tulsa County | Oklahoma | United States | OK |
| Reno | 39.529633 | -119.81380 | Reno | Washoe County | Nevada | United States | OK |
| Chicago | 41.878114 | -87.62980 | Chicago | Cook County | Illinois | United States | OK |
| Fargo | 46.877186 | -96.78980 | Fargo | Cass County | North Dakota | United States | OK |
| Minnesota | 46.729553 | -94.68590 | NA | NA | Minnesota | United States | OK |
| Buffalo | 42.886447 | -78.87837 | Buffalo | Erie County | New York | United States | OK |
| Toronto | 43.653226 | -79.38318 | Toronto | Toronto Division | Ontario | Canada | OK |
| Winslow | 35.024187 | -110.69736 | Winslow | Navajo County | Arizona | United States | OK |
| Sarasota | 27.336435 | -82.53065 | Sarasota | Sarasota County | Florida | United States | OK |
| Wichita | 37.687176 | -97.33005 | Wichita | Sedgwick County | Kansas | United States | OK |
| Tulsa | 36.153982 | -95.99277 | Tulsa | Tulsa County | Oklahoma | United States | OK |
| Ottawa | 45.421530 | -75.69719 | Ottawa | Ottawa Division | Ontario | Canada | OK |
| Oklahoma | 35.467560 | -97.51643 | Oklahoma City | Oklahoma County | Oklahoma | United States | OK |
| Tampa | 27.950575 | -82.45718 | Tampa | Hillsborough County | Florida | United States | OK |
| Panama | 8.537981 | -80.78213 | NA | NA | NA | Panama | OK |
| Mattawa | 46.737910 | -119.90282 | Mattawa | Grant County | Washington | United States | OK |
| La Paloma | 32.314651 | -110.91615 | Tucson | Pima County | Arizona | United States | OK |
| Bangor | 44.801613 | -68.77123 | Bangor | Penobscot County | Maine | United States | OK |
| Baltimore | 39.290385 | -76.61219 | Baltimore | NA | Maryland | United States | OK |
| Salvador | -12.977749 | -38.50163 | NA | Salvador | State of Bahia | Brazil | OK |
| Amarillo | 35.221997 | -101.83130 | Amarillo | Potter County | Texas | United States | OK |
| Tocopilla | -22.088678 | -70.19605 | Tocopilla | Tocopilla Province | Antofagasta | Chile | OK |
| Barranquilla | 11.004107 | -74.80698 | Barranquilla | Barranquilla | Atlantico | Colombia | OK |
| Padilla | 26.544614 | -81.85391 | Fort Myers | Lee County | Florida | United States | OK |
| Boston | 42.360082 | -71.05888 | Boston | Suffolk County | Massachusetts | United States | OK |
| Charleston | 32.776475 | -79.93105 | Charleston | Charleston County | South Carolina | United States | OK |
| Dayton | 39.758948 | -84.19161 | Dayton | Montgomery County | Ohio | United States | OK |
| Louisiana | 30.984298 | -91.96233 | NA | NA | Louisiana | United States | OK |
| Washington | 47.751074 | -120.74014 | NA | NA | Washington | United States | OK |
| Houston | 29.760427 | -95.36980 | Houston | Harris County | Texas | United States | OK |
| Kingston | 41.927037 | -73.99736 | Kingston | Ulster County | New York | United States | OK |
| Texarkana | 33.425125 | -94.04769 | Texarkana | Bowie County | Texas | United States | OK |
| Monterey | 36.600238 | -121.89468 | Monterey | Monterey County | California | United States | OK |
| Faraday | 37.394787 | -121.92830 | San Jose | Santa Clara County | California | United States | OK |
| Santa Fe | 35.686975 | -105.93780 | Santa Fe | Santa Fe County | New Mexico | United States | OK |
| Tallapoosa | 33.744550 | -85.28801 | Tallapoosa | Haralson County | Georgia | United States | OK |
| Glen Rock | 40.962876 | -74.13292 | Glen Rock | Bergen County | New Jersey | United States | OK |
| Black Rock | 28.054965 | -82.50466 | Tampa | Hillsborough County | Florida | United States | OK |
| Little Rock | 34.746481 | -92.28959 | Little Rock | Pulaski County | Arkansas | United States | OK |
| Oskaloosa | 41.291673 | -92.64936 | Oskaloosa | Mahaska County | Iowa | United States | OK |
| Tennessee | 35.517491 | -86.58045 | NA | NA | Tennessee | United States | OK |
| Hennessey | 36.109205 | -97.89867 | Hennessey | Kingfisher County | Oklahoma | United States | OK |
| Chicopee | 42.148704 | -72.60787 | Chicopee | Hampden County | Massachusetts | United States | OK |
| Spirit Lake | 46.274303 | -122.13371 | NA | Skamania County | Washington | United States | OK |
| Grand Lake | 40.252207 | -105.82307 | Grand Lake | Grand County | Colorado | United States | OK |
| Devil’s Lake | 43.418397 | -89.73095 | NA | Sauk County | Wisconsin | United States | OK |
| Crater Lake | 42.944587 | -122.10900 | NA | Klamath County | Oregon | United States | OK |
| Louisville | 38.252665 | -85.75846 | Louisville | Jefferson County | Kentucky | United States | OK |
| Nashville | 36.162664 | -86.78160 | Nashville | Davidson County | Tennessee | United States | OK |
| Knoxville | 35.960638 | -83.92074 | Knoxville | Knox County | Tennessee | United States | OK |
| Ombabika | 50.233333 | -87.90000 | Ombabika | Thunder Bay District | Ontario | Canada | OK |
| Schefferville | 54.824559 | -66.81748 | Schefferville | Sept-Rivières—Caniapiscau | Quebec | Canada | OK |
| Jacksonville | 30.332184 | -81.65565 | Jacksonville | Duval County | Florida | United States | OK |
| Waterville | 43.965177 | -71.52791 | Waterville Valley | Grafton County | New Hampshire | United States | OK |
| Costa Rica | 9.748917 | -83.75343 | NA | NA | NA | Costa Rica | OK |
| Pittsfield | 42.450085 | -73.24538 | Pittsfield | Berkshire County | Massachusetts | United States | OK |
| Springfield | 37.208957 | -93.29230 | Springfield | Greene County | Missouri | United States | OK |
| Bakersfield | 35.373292 | -119.01871 | Bakersfield | Kern County | California | United States | OK |
| Shreveport | 32.525152 | -93.75018 | Shreveport | Caddo Parish | Louisiana | United States | OK |
| Hackensack | 40.885933 | -74.04347 | Hackensack | Bergen County | New Jersey | United States | OK |
| Cadillac | 37.053795 | -94.45353 | Joplin | Newton County | Missouri | United States | OK |
| Davenport | 41.523644 | -90.57764 | Davenport | Scott County | Iowa | United States | OK |
| Idaho | 44.068202 | -114.74204 | NA | NA | Idaho | United States | OK |
| Jellico | 36.587859 | -84.12687 | Jellico | Campbell County | Tennessee | United States | OK |
| Argentina | -38.416097 | -63.61667 | NA | NA | NA | Argentina | OK |
| Diamantina | -18.247454 | -43.60121 | Diamantina | Diamantina | Minas Gerais | Brazil | OK |
| Pasadena | 34.147785 | -118.14452 | Pasadena | Los Angeles County | California | United States | OK |
| Catalina | 33.387886 | -118.41631 | NA | Los Angeles County | California | United States | OK |
| Pittsburgh | 40.440625 | -79.99589 | Pittsburgh | Allegheny County | Pennsylvania | United States | OK |
| Parkersburg | 39.266742 | -81.56151 | Parkersburg | Wood County | West Virginia | United States | OK |
| Gravelbourg | 49.875676 | -106.55732 | Gravelbourg | Division No. 3 | Saskatchewan | Canada | OK |
| Colorado | 39.550051 | -105.78207 | NA | NA | Colorado | United States | OK |
| Ellensburg | 46.996514 | -120.54785 | Ellensburg | Kittitas County | Washington | United States | OK |
| Rexburg | 43.823110 | -111.79242 | Rexburg | Madison County | Idaho | United States | OK |
| Vicksburg | 32.352646 | -90.87788 | Vicksburg | Warren County | Mississippi | United States | OK |
| El Dorado | 38.809573 | -94.45718 | Raymore | Cass County | Missouri | United States | OK |
| Larimore | 47.906657 | -97.62675 | Larimore | Grand Forks County | North Dakota | United States | OK |
| Admore | 42.643147 | -82.85890 | Macomb | Macomb County | Michigan | United States | OK |
| Haverstraw | 41.197595 | -73.96458 | Haverstraw | Rockland County | New York | United States | OK |
| Chatanika | 65.111221 | -147.46539 | Chatanika | Fairbanks North Star | Alaska | United States | OK |
| Chaska | 44.789345 | -93.60184 | Chaska | Carver County | Minnesota | United States | OK |
| Nebraska | 41.492537 | -99.90181 | NA | NA | Nebraska | United States | OK |
| Alaska | 64.200841 | -149.49367 | NA | NA | Alaska | United States | OK |
| Opelika | 32.645412 | -85.37828 | Opelika | Lee County | Alabama | United States | OK |
| Baraboo | 43.471094 | -89.74429 | Baraboo | Sauk County | Wisconsin | United States | OK |
| Waterloo | 42.492786 | -92.34258 | Waterloo | Black Hawk County | Iowa | United States | OK |
| Kalamazoo | 42.291707 | -85.58723 | Kalamazoo | Kalamazoo County | Michigan | United States | OK |
| Kansas City | 39.099727 | -94.57857 | Kansas City | Jackson County | Missouri | United States | OK |
| Sioux City | 42.496342 | -96.40494 | Sioux City | Woodbury County | Iowa | United States | OK |
| Cedar City | 37.677477 | -113.06189 | Cedar City | Iron County | Utah | United States | OK |
| Dodge City | 37.752798 | -100.01708 | Dodge City | Ford County | Kansas | United States | OK |
| Fond du Lac | 43.773045 | -88.44705 | Fond du Lac | Fond du Lac County | Wisconsin | United States | OK |
R has some marvelous packages to create maps. Let’s map where Cash has been.
#Now let's use leaflet to plot
library(leaflet)
map <- leaflet(width = '100%',
options = leafletOptions())
map <- addProviderTiles(map,
providers$CartoDB.Positron)
map <- setView(map, lng = -102, lat = 30, zoom = 2)
map <- addMarkers(map,
data = places[localities,],
lng = ~lng,
lat = ~lat,
label = ~locations
)
map <- addPolygons(map,
data = country_shapes,
fillColor = "Orange",
weight = 2,
opacity = 1,
color = "Orange",
dashArray = "3",
fillOpacity = 0.7,
label = ~CNTRY_NAME,
highlight = highlightOptions(
weight = 2,
color = "white",
dashArray = "3",
fillOpacity = 0.75,
bringToFront = TRUE)
)
map <- addPolygons(map,
data = state_shapes,
fillColor = "Green",
weight = 2,
opacity = 1,
color = "Green",
dashArray = "3",
label = ~NAME,
fillOpacity = 0.7,
highlight = highlightOptions(
weight = 2,
color = "white",
dashArray = "3",
fillOpacity = 0.75,
bringToFront = TRUE)
)
map
Fancy mapping packages are good fun. However, it is often more sensible to make a well-designed static image. Below, I make a similar map, relying only on trusty ggplot2.
library(ggplot2)
library(broom)
#Convert to plain dfs, as ggplot expects
state_df <- tidy(state_map, region = "STUSPS")
country_df <- tidy(country_map, region = "CNTRY_NAME")
#Plot
ggplot()+
geom_polygon(data = country_df, aes(long,lat,group=group), fill = "white", col = "grey")+
geom_polygon(data = state_df, aes(long,lat,group=group), fill = "white", col = "grey")+
geom_polygon(data = state_shapes, aes(long,lat,group=group), fill="Dark Green", col = NA)+
geom_polygon(data = country_shapes, aes(long,lat,group=group),fill="Orange", col = NA)+
geom_jitter(data = places[localities,], aes(lng, lat), color = "Dark Blue")+
theme_void()+
theme(plot.background = element_rect(fill = "grey", color = NA))+
coord_quickmap(xlim = c(-180,10), ylim =c(-60,80), clip = "on", expand = FALSE)
## Regions defined for each Polygons
## Regions defined for each Polygons
As one can see from the map above, Johnny Cash has been plenty of places. He has not, however, been everywhere. This raises the question: why did Cash go where he went? Below, this question is tackled with a logistic regression.
The level of analysis is US counties. Excellent county-by-county data is available for the US. Additionally, the bulk of places named in the song can be pinpointed to US counties. However, this means that the model has to take into account locations from the song that fall outside the US, or locations that include several counties, i.e. states.
The model uses three variables:
Population. It seems likely that Cash will go where there are people to perform for.
Percent of Population Employed in Cattle Ranching. Cash maintained an image as a cowboy. It seems likely he would want to be seen among real cowboys.
Percent of Population Incarcerated. Cash famously performed at prisons around the country. It seems likely that he would have been where the prisons were.
County-by-county population data is taken from the ACS. The number of ranchers is taken from the EEO survey. The number of incarcerated people is taken from the census itself. Because the first two measures are estimates, they are not available for sparsely populated counties. We will only be considering counties for which we have data for all three variables. (Essentially, this is the largest third of US counties.)
library(DescTools)
Population <-read.csv("./Population better.csv")
Ranchers <-read.csv("./Ranchers.csv")
Prisoners <-read.csv("./Prison Population.csv")
counties <- merge(Prisoners, merge(Population, Ranchers, by = 'Geography'), by = 'Geography')
names(counties)<-c('names','prison','pop','ranch')
counties$prison <- counties$prison/counties$pop
counties$ranch <- counties$ranch/counties$pop
counties$visited <- ifelse(counties$names %in% paste0(places$county,", ",places$state), 1, 0)
model <- glm(visited ~ pop + prison + ranch,
data = counties,
family = binomial(link = "logit")
)
summary(model)
##
## Call:
## glm(formula = visited ~ pop + prison + ranch, family = binomial(link = "logit"),
## data = counties)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.3508 -0.3224 -0.3050 -0.2945 2.5861
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.233e+00 2.402e-01 -13.460 < 2e-16 ***
## pop 9.405e-07 2.188e-07 4.299 1.71e-05 ***
## prison -3.397e+00 1.102e+01 -0.308 0.758
## ranch 3.735e+01 5.677e+01 0.658 0.511
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 416.53 on 966 degrees of freedom
## Residual deviance: 392.84 on 963 degrees of freedom
## AIC: 400.84
##
## Number of Fisher Scoring iterations: 6
DescTools::PseudoR2(model)
## McFadden
## 0.05686871
Looking at the distribution of residuals and the pseudo \(R^2\), it is clear this model has very little explanatory power. The McFadden pseudo R squared is particularly damning: only about six percent of the variation in the likelihood that Johnny Cash has been to a county is explained by the model’s chosen predictors. Additionally, only one of the predictors had a statistically significant effect: population. Note that while the effect of population is significant, it is also tiny. I can say with confidence that this model does not increase our understanding of where Johnny Cash has been.
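Both numbers can be checked by hand from the printed summary. McFadden's pseudo R squared is one minus the ratio of the fitted model's log-likelihood to the null model's, which for a 0/1 outcome equals one minus the ratio of residual deviance to null deviance. The snippet below is a small sketch using values copied from the model output; the 100,000-person increment is an arbitrary choice for illustration.

```r
# Values copied from the summary(model) output
null_deviance     <- 416.53
residual_deviance <- 392.84
beta_pop          <- 9.405e-07

# McFadden pseudo R^2 = 1 - residual deviance / null deviance (binary outcome)
mcfadden <- 1 - residual_deviance / null_deviance
round(mcfadden, 4)

# Odds ratio for an extra 100,000 residents (illustrative increment)
odds_ratio_100k <- exp(beta_pop * 1e5)
round(odds_ratio_100k, 3)
```

Under this fit, an additional 100,000 residents multiplies the odds of a county appearing in the song by roughly 1.1, which is why the population effect is small per person even though it is statistically significant.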
It makes sense that this model would have little explanatory power. Johnny Cash did not actually write the song. It’s a cover.