This projects seeks to investigate the risk posed by non-extinct volcanoes on nearby centers of population. This idea for this project was sparked by a conversation with a friend who had recently visted a volcano in Hawaii – the investiation seemed to provide an interesting way to identify disparate datasets on the internet, link them, analyze the linked data, and create a useful visualization of the analysis.
In order to identify cities facing a risk of volcanic activity, the following sets of data are utilized:
Volcanoes: data about volcanoes’ locations and eruptions
Cities: data about cities’ location and populations
Countries: data about each country in the world
A MySQL schema with four tables is created to store the data. The near table will be used to house instances of cities located near volcanoes; the other tables align with the datasets outlined in the section above.
library(RMySQL)
volcano_db = dbConnect(MySQL(), user='root', password='password', host='localhost')
dbSendQuery(volcano_db, 'CREATE SCHEMA volcanoes;')
dbSendQuery(volcano_db, 'USE volcanoes;')
dbSendQuery(volcano_db,'CREATE TABLE volcanoes (
volc_id INT PRIMARY KEY,
name VARCHAR(50),
eruption INT,
latitude FLOAT,
longitude FLOAT;')
dbSendQuery(volcano_db,'CREATE TABLE cities (
geonameId INT PRIMARY KEY,
asciiname VARCHAR(50),
country_code CHAR(2),
latitude FLOAT,
longitude FLOAT,
population INT);')
dbSendQuery(volcano_db,'CREATE TABLE countries (
country_code CHAR(2) PRIMARY KEY,
Country VARCHAR(50) NOT NULL);')
dbSendQuery(volcano_db,'CREATE TABLE near (
near_id INT PRIMARY KEY,
geonameId INT NOT NULL,
volc_id INT NOT NULL,
distance FLOAT NOT NULL,
FOREIGN KEY (geonameId) REFERENCES cities(geonameId),
FOREIGN KEY (volc_id) REFERENCES volcanoes(volc_id));')The Smithsonian Institute’s Global Volcanism Project hosts the Volcanoes of the World database, which “describ[es] the physical characteristics of Holocene volcanoes and their eruptions.” This database is used to acquire volcano data.
By performing a search of the database with no parameters, an .xls file is downloaded containing all information for all volcanoes in the databse. The downloaded file is converted to a csv file following some minor cleanup. The first ten columns of this file is read into R.
GVP <- read.csv('https://github.com/dsmilo/DATA607/raw/master/Projects/Data/GVP_Volcano_List.csv',
stringsAsFactors = FALSE)[1:10]Observing the structure of the imported data, some cleansing is required. The columns of the GVP data frame are first renamed to make the cleansing cleaner.
Many volcano names included have names indicating a direct relationship to another volcano – these names include phrases such as “Cone of”, “Thermal of”, “Synonym of”, etc. These entities are removed in order to leave only the unique volcanoes.
Because the eruption years are stored as characters, they are converted to numbers. For volcanoes with ‘Unknown’ last eruptions, the eruption date is set to -100000 to make them very ancient but allow for downstream analysis by eliminating NAs.
Finally, non-ASCII characters are removed from the volcano names – these characters create issues when passing to the MySQL database.
library(dplyr)
names(GVP) <- c("Number", "Name", "Type", "Eruption", names(GVP)[5:10])
GVP <- GVP %>% filter(!grepl(" of ", Type))
GVP$Eruption <- as.numeric(GVP$Eruption)
GVP$Eruption <- ifelse(is.na(GVP$Eruption), -100000, GVP$Eruption)
GVP$Name <- iconv(GVP$Name, to = "ASCII", sub = "")The names of the GVP data frame are edited, relevant columns selected, and the results are written to the volcanoes table in the MySQL database.
library(stringr)
names(GVP) <- str_to_lower(names(GVP))
names(GVP)[1] <- "volc_id"
GVP <- GVP %>% select(volc_id, name, eruption, latitude, longitude)
dbWriteTable(volcano_db, "volcanoes", GVP, append = TRUE, row.names = FALSE)GeoNames is a hosted database that “contains over 10 million geographical names and consists of over 9 million unique features whereof 2.8 million populated places.” Its webservice allows querying of the database following user registration. This database is used to acquire city data.
A list of cities with population of at least 1000 is available on the GeoNames website as a compressed file. The txt file contained is extracted and read into R.
cities1000 <- read.table('https://raw.githubusercontent.com/dsmilo/DATA607/master/Projects/Data/cities1000.txt',
header = FALSE, sep = "\t", fill = TRUE, stringsAsFactors = FALSE, quote = "")Additionally, GeoNames hosts a list of country information. This file is read in in order to give the full country names.
countries <- read.table('http://download.geonames.org/export/dump/countryInfo.txt', header = TRUE,
skip = 50, sep = "\t", fill = TRUE, stringsAsFactors = FALSE,
quote = "", comment.char = "", check.names = FALSE)Names are added to the cities1000 columns, and numeric columns are converted to the appropriate types
names(cities1000) <- c("geonameId", "name", "asciiname", "alternatenames", "latitude", "longitude",
"feature_class", "feature_code", "country_code", "cc2",
"admin1_code", "admin2_code", "admin3_code", "admin4_code",
"population", "elevation", "dem", "timezone", "modification_date")
cities1000$geonameId <- as.integer(cities1000$geonameId)
cities1000$latitude <- as.double(cities1000$latitude)
cities1000$longitude <- as.double(cities1000$longitude)
cities1000$population <- as.integer(cities1000$population)The first column of countries is changed, as it is read in with a # symbol, which causes issues when attempting to access the column.
The country code for Namibia is NA – this causes an issue, as it is read in as NA rather than as the text “NA”.
names(countries)[1] <- "country_code"
countries[countries$Country == "Namibia", 1] <- "NA"The relevant columns from the cities1000 and countries data frames are selected and written to the cities and countries tables in the MySQL database.
cities1000 <- cities1000 %>% select(geonameId, asciiname, country_code, latitude, longitude, population)
dbWriteTable(volcano_db, "cities", cities1000, append = TRUE, row.names = FALSE)
countries <- countries %>% select(country_code, Country)
dbWriteTable(volcano_db, "countries", countries, append = TRUE, row.names = FALSE)With the datasets populated in MySQL tables, the identification of cities near volcanoes is completed. In order to identify these cities, the GeoNames findNearbyPlaceName API is utilized. This API returns information about places close to a location of interest. In addition to the required username, latitude, and longitude parameters, the API takes the following optional parameters:
lat: the latitude of interest
lng: the longitude of interest
maxRows: the maximum number of rows to return
radius: the radius (in km) within which to return populated places
style: the verbosity of the response (one of short, medium, long)
cities: the list of cities (cities1000, cities5000, cities15000) to restrict the search
To access the API, a wrapper is created to retrieve the information in JSON format. The geonames element of the response contains a dataframe with the relevant responses – this is stored in the places data frame. For locations with no cities within the given radius, an empty list is returned; this empty list is passed through the function to avoid errors and to allow for filtering by class in downstream analysis. For data frames returning results, the three columns of interest are returned.
library(jsonlite)
get.places <- function(lat, lng, maxRows = 10000, radius = 161) {
username <- getOption('geonamesUsername')
api_query <- paste0('http://api.geonames.org/findNearbyPlaceNameJSON?style=short&cities=cities1000
&username=', username, '&lat=', lat, '&lng=', lng, '&maxRows=', maxRows, '&radius=', radius)
places <- fromJSON(api_query)$geonames
if (class(places) == "data.frame"){
places <- places %>% select(geonameId, distance)
}
else{places <- list(NULL)}
places
}To identify cities that are near volcanoes, the location of each volcano in the GVP data frame is passed to the get.places function. Two columns are added to the resulting data frame: the id of the volcano of interest and a unique identifier near_id. The data frame, with columns rearranged, is written to the near table in the MySQL database. The data frame is then cleared and the process repeated for the next volcano.
near_count <- 0
for (i in 1:nrow(GVP)) {
nearby <- data.frame(NULL)
nearby <- get.places(GVP$latitude[i], GVP$longitude[i])
if(class(nearby) == "list") next
nearby$near_id <- c(seq(near_count + 1, near_count + nrow(nearby), 1))
nearby$volc_id <- rep(GVP$volc_id[i], nrow(nearby))
nearby <- nearby %>% select(near_id, geonameId, volc_id, distance)
dbWriteTable(volcano_db, "near", nearby, append = TRUE, row.names = FALSE)
near_count <- near_count + nrow(nearby)
}Once the loop has been run over all volcanoes, and the list of cities located near each volcano stored in the database, a query is executed to extract all instances of cities located near volcanoes, including information about both the city and the volcano.
near_query <- dbSendQuery(volcano_db,
'SELECT c.asciiname, co.Country, c.latitude, c.longitude, c.population,
v.eruption, n.distance
FROM near AS n
INNER JOIN cities AS c
ON c.geonameId = n.geonameId
INNER JOIN countries AS co
ON c.country_code = co.country_code
INNER JOIN volcanoes AS v
ON n.volc_id = v.volc_id;')
at_risk <- dbFetch(near_query, n = -1)
dbClearResult(near_query)
dbDisconnect(volcano_db)The data is modified in order to allow for the quantification of risk. Distances are converted to numbers and to miles from kilometers. The eruption date is converted to years ago (from 2016).
at_risk$distance <- as.numeric(at_risk$distance) * 0.621371
at_risk$eruption <- 2016 - at_risk$eruptionIn order to determine the risk posed by each volcano on each city, three factors are considered:
Population of the city
Last eruption of the volcano
Distance between city and volcano
It can reasonably be deduced that the potential impact of volcanic activity is affected by population based on order of magnitude rather than by the precise number of residents. As such, the population is transformed using a logarithm. Following similar reasoning, the time since last eruption is also log-transformed. In both cases, base 10 is used, as it is more intuitive for determining order of magnitude.
Ignoring any variations due to wind patterns, volcanic ash and debris will be distributed radially. Because of this, the risk will scale inversely with the squared distance between city and volcano. Using these assumptions, the following quantification of risk from volcanic activity is generated:
\[risk = \frac{\log_{10} (population)}{\log_{10} (eruption) \times distance^2}\]
The relationship between population and eruption in this equation means that the logarithm base does not matter, as long as they are the same.
In order to avoid distortion from the above calculation using small numbers, slight adjustments are made to the variables:
at_risk <- at_risk %>%
mutate(risk = log10(population + 2) / log10(eruption + 2) / ((distance + 1)^2))The cases in at_risk represent instances of an individual city and an individual volcano being located near one another. In order to calculate the risk on a city-by-city basis, the risk in each city from each volcano is aggregated. The individual values are added – if the risk is based on a representation of probability, the probability of one volcano or another erupting is the sum of the two probabilities assuming the probabilities are independent. To give the aggregated risk values a sense of scale, they are recalculated relative to the highest risk, which is assigned a value of 1000.
city_risk <- at_risk %>% mutate(City = paste(asciiname, Country, sep = ", ")) %>% group_by(City) %>%
summarise(lat = mean(latitude), lng = mean(longitude), pop = mean(population), risk = sum(risk)) %>%
arrange(desc(risk))
max_risk <- max(city_risk$risk)
city_risk$risk <- city_risk$risk / max_risk * 1000The city_risk data frame contains 33,599 rows, with each row representing a city with some level of risk from volanic activity
The table above shows that of the top ten at-risk cities, six are located in Central America, three are in southeast Asia, and one is located in Iceland. The most at-risk city, San Roque, Philippines, has a risk rating of 1000 (by design). The remainder of the top 15 at-risk cities range between 558 and 808; all other cities have risk ratings under 550. 14,998 of the 33,599 cities have a rating between 0 and 1.
Summary statistics of the risk ratings are calculated:
summary(city_risk$risk) Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0058 0.2350 0.9656 7.0330 4.2160 1000.0000
The summary statistics suggest an extremely right-skewed distribution, as the mean (7.03) is greater than the median (0.97), and even the third quartile(4.22) by a good deal – this is in line with the observations made from the table above.
To further investigate the distribution, a histogram is created using a bin width of 10. Due to the large spread of risk ratings, a histogram for the log (base 10) of the risk is also created.
The first plot shows that over 85% of observations have a risk rating between 0 and 10. This is further confirmed by the log graph. The log graph also indicates that about 12% of risk ratings lie between 10 and 100, with just over 1% lying between 100 and 1000. This is further verified with the following table:
round(table(floor(log10(city_risk$risk)))/nrow(city_risk)*100, 3)
-3 -2 -1 0 1 2 3
1.202 9.453 39.998 36.081 12.134 1.128 0.003
A map of the world is obtained using ggmap, and volcanoes and identified cities are plotted, with the size of cities’ points determined by the risk rating
The plot produced does not reveal a great level of detail due to the complex background. Due to issues with other map types and sources in ggmap, a map is instead created using polygons in ggplot2
This provides a more informative plot, however it still has issues. There are a large number of volcanoes, and a decent number of cities, in high latitudes, where there is significant distortion at higher altitudes. To attempt to lessen this, an example from R Psychologist utilizing transformed shapefiles from Natural Earth with the rgdal, raster, and ggplot2 packages is adapted to plot less-distorted maps.
This plot shows a significant concentration of at-risk cities in the following regions:
Pacific Coast of US
Central America
Southeast Asia
Japan
Mediterranean Sea
As outlined above, a significant poriton of the risk ratings are very low. In order to better represent the cities that face significant risk, another plot is created illustrating only those cities with a risk rating over 10.
This plot shows a similar patterns as the non-thinned map above. However, the number and severity of risk in cities on the Mediterranean Sea and the Pacific Coast of the US are far diminished. With this thinned dataset, it can be observed that while there is a high concentration of volcanoes on the Pacific Coast of both North and South America, the number of cities facing significant risk is low. It appears that the instances of at risk cities are most concentrated in Central America, Indonesia, the Philippines, and Japan. The median and mean risk ratings for cities identified in each of these countries is identified:
| Country | Median | Mean |
|---|---|---|
| Nicaragua | 23.4 | 85.3 |
| Guatemala | 42.0 | 66.4 |
| Costa Rica | 45.3 | 48.3 |
| Japan | 9.6 | 18.5 |
| Philippines | 4.1 | 13.0 |
| Honduras | 2.4 | 5.2 |
| Indonesia | 1.2 | 4.5 |
| Panama | 0.2 | 1.4 |
From this table, it is clear that the median risk ratings for all countries in the identified set, aside from Panama, are higher than the median for the population at large. Also, four of the countries (Nicaragua, Guatemala, Costa Rica, and Japan) have mean risk ratings higher than the population of identified cities.
From the data gathered and the quantification of risk performed, it is observed that most cities located within 100 miles of a non-extinct volcano face a very low level of risk from volcanic activity. Cities in Central America, Indonesia, the Philippines, and Japan, however, face more significant risk.
In seeking to apply the findings of this investigation, the assumptions underlying the quantification of risk to cities should first be investigated – if these assumptions (log relationship between population and last eruption, as well as inverse-squared with distance, to risk) are not valid, the number of cities and degree of risk may change. Finally, it should be noted that the risk ratings attempt to quantify the risk faced by cities based on past eruption information – they are not not meant to serve as an actual predictor of future volcanic activity.
Bivand, Roger et al (2016). rgdal: Bindings for the Geospatial Data Abstraction Library. R package version 1.1-8. https://CRAN.R-project.org/package=rgdal
Hijmans, Robert J. (2015). raster: Geographic Data Analysis and Modeling. R package version 2.5-2. https://CRAN.R-project.org/package=raster
Kahle, David and Wickham, Hadley (2013). ‘ggmap: Spatial Visualization with ggplot2.’ The R Journal, 5(1), 144-161. http://journal.r-project.org/archive/2013-1/kahle-wickham.pdf
Magnusson, Kristoffer (2013). ‘Working with shapefiles, projections and world maps in ggplot.’ R Psychologist. http://rpsychologist.com/working-with-shapefiles-projections-and-world-maps-in-ggplot
Ooms, Jeroen et al (2016). RMySQL: Database Interface and ‘MySQL’ Driver for R. R package version 0.10.8. https://CRAN.R-project.org/package=RMySQL
Smithsonian Institution National Museum of Natural History (2016). Global Volcanism Project. http://volcano.si.edu/
Wick, Marc (2016). GeoNames. http://www.geonames.org/
Wickham, Hadley (2009). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York.
Wickham, Hadley and Francois, Romain (2015). dplyr: A Grammar of Data Manipulation. R package version 0.4.3. https://CRAN.R-project.org/package=dplyr
Xie, Yihui (2015). DT: A Wrapper of the JavaScript Library ‘DataTables’. R package version 0.1. https://CRAN.R-project.org/package=DT