The ClinicalTrials.gov website is the front end of the huge database of clinical trials maintained by the National Library of Medicine (NLM), which is part of the National Institutes of Health (NIH). All studies registered with the US Food and Drug Administration (FDA) are available to the public there.
In this article, I will show you how to use the rclinicaltrials package and others to download, aggregate, and visualize this clinical trials data. More information about this data can be found in this article about the visualizations presented here.
In this case, we are interested in searching for inhalation therapies for respiratory conditions. The rclinicaltrials R package provides a beautiful interface to the ClinicalTrials.gov website. In the past, you would have needed to download and parse the XML data on your own. Here is a Stack Overflow question about XML parsing, where I posted the question and it was graciously answered. The few lines of code below are all that is needed to download the data and do the initial processing.
library(rclinicaltrials)
# Download the data with the search parameters. The search is meant to return fewer than 100 results; if there are more than 100, ClinicalTrials.gov will only give the first 100
a <- clinicaltrials_download(query = c('term=inhalation AND respiratory AND dose','recr=Open', 'type=Intr', 'cntry1=NA%3AUS'), count = 200, include_results = TRUE)
# We want to extract all the locations in the United States
b <- a[1]$study_information$locations
c <- b[which(b$address.country=='United States'), ]
# First six rows of the results (1500+ locations in total):
head(c)
##                                    name address.city address.state
## 1 St. Jude Children's Research Hospital      Memphis     Tennessee
## 2          Western Sky Medical Research      El Paso         Texas
## 3           Arizona Research Associates       Tucson       Arizona
## 4              VA Harbor Medical Center     New York      New York
## 5  St. Luke's-Roosevelt Hospital Center     New York      New York
## 6     Arthur F Gelb Medical Corporation     Lakewood    California
##   address.zip address.country      nct_id
## 1       38105   United States NCT00186927
## 2       79903   United States NCT00915538
## 3       85712   United States NCT00993707
## 4       10010   United States NCT00993707
## 5       10019   United States NCT00993707
## 6       90712   United States NCT01225913
# Of those results, we only need the address columns (City, State, and Country) since zip code is only provided by some.
d <- c[c(2, 3, 5)]
d$address <- paste(d$address.city, d$address.state, d$address.country,sep=",")
# We can summarize the results with a frequency count (how many trial locations each city has)
library(plyr)
e <- count(d, 'address')
# Then sort in descending order of frequency
e <- e[order(-e$freq),]
# The 20 most frequent locations from the search:
head(e, 20)
##                                      address freq
## 291              Miami,Florida,United States   27
## 85             Cincinnati,Ohio,United States   22
## 324          New York,New York,United States   22
## 96               Columbus,Ohio,United States   19
## 404          San Antonio,Texas,United States   18
## 441         St. Louis,Missouri,United States   18
## 340     Oklahoma City,Oklahoma,United States   16
## 386          Richmond,Virginia,United States   16
## 75    Charlotte,North Carolina,United States   15
## 367    Pittsburgh,Pennsylvania,United States   15
## 42          Birmingham,Alabama,United States   14
## 86          Clearwater,Florida,United States   14
## 363            Phoenix,Arizona,United States   14
## 405       San Diego,California,United States   14
## 375            Portland,Oregon,United States   13
## 431 Spartanburg,South Carolina,United States   13
## 73   Charleston,South Carolina,United States   12
## 197  Greenville,South Carolina,United States   12
## 219              Houston,Texas,United States   12
## 284             Medford,Oregon,United States   12
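The counting and sorting step can be tried without downloading anything. Here is a minimal sketch in base R, assuming a small made-up data frame (`toy` and its values are hypothetical, not taken from the search results). It uses `table()` as a stand-in for `plyr::count`, so it is an alternative to the code above rather than the same method.

```r
# Hypothetical toy data standing in for the `d` data frame above
toy <- data.frame(
  address = c("Miami,Florida,United States",
              "Columbus,Ohio,United States",
              "Miami,Florida,United States",
              "Miami,Florida,United States",
              "Columbus,Ohio,United States",
              "Tucson,Arizona,United States"),
  stringsAsFactors = FALSE
)

# Count occurrences of each address (base R alternative to plyr::count)
toy_freq <- as.data.frame(table(address = toy$address),
                          responseName = "freq",
                          stringsAsFactors = FALSE)

# Sort from most to least frequent
toy_freq <- toy_freq[order(-toy_freq$freq), ]
toy_freq
```

The result has the same two columns (`address` and `freq`) as the output of `count(d, 'address')`, so the rest of the workflow should not care which approach produced it.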
The data that we have downloaded and processed so far for the clinical trial locations does not include longitude and latitude coordinates, which are the numeric geolocation variables that we can plot on maps. The following code uses the Data Science Toolkit to map each address (in this case city, state, and country) to a specific longitude and latitude. The description of the geolocation code can be found in this Stack Overflow question and answer about geolocation, where again my question was graciously answered.
library(httr)
library(rjson)
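# Build a JSON array of the addresses to send to the geocoder
data <- paste0("[", paste(paste0("\"", d$address, "\""), collapse = ","), "]")
url <- "http://www.datasciencetoolkit.org/street2coordinates"
response <- POST(url, body = data)
# Parse the response, which is keyed by address
json <- fromJSON(content(response, as = "text"))
# One row per address; lapply keeps a list even when every address is matched
geocode <- as.data.frame(
  do.call(rbind, lapply(json,
    function(x) c(lon = x$longitude, lat = x$latitude))))
# The row names carry the address, so copy them into a column for merging
geocode$address <- rownames(geocode)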
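The list-to-data-frame conversion is the least obvious part of this step, so here is a small offline sketch of it. The `parsed` list is a hypothetical stand-in for the parsed JSON, shaped the way the code above expects (one entry per address, with `longitude` and `latitude` fields); the coordinates are rough illustrative values, and the `NULL` entry is my assumption about how an unmatched address would appear. Unlike the code above, this version keeps unmatched addresses as `NA` rows rather than dropping them.

```r
# Hypothetical stand-in for the parsed geocoder response
parsed <- list(
  "Memphis,Tennessee,United States" = list(longitude = -90.05, latitude = 35.15),
  "Nowhere,Nostate,United States"   = NULL,  # assumed shape of a failed lookup
  "Tucson,Arizona,United States"    = list(longitude = -110.93, latitude = 32.22)
)

# Build one named numeric vector per address, using NA for unmatched entries
rows <- lapply(parsed, function(x) {
  if (is.null(x)) {
    c(lon = NA_real_, lat = NA_real_)
  } else {
    c(lon = x$longitude, lat = x$latitude)
  }
})

# Stack the vectors into a matrix; the list names become the row names
toy_geocode <- as.data.frame(do.call(rbind, rows))

# Move the address from the row names into a regular column
toy_geocode$address <- rownames(toy_geocode)
rownames(toy_geocode) <- NULL
toy_geocode
```

Keeping the `NA` rows makes it easy to see which addresses failed with `toy_geocode[is.na(toy_geocode$lon), ]`.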
First, we need to create the data frames with the information to visualize on the maps. This is done by adding the latitude and longitude to the data frames that hold the relevant data:
g <- merge(e, geocode,by="address") # Use in density by location map
h <- merge(d, geocode,by="address") # Use for heat maps
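One thing worth knowing about `merge` is that by default it keeps only the rows whose `address` appears in both data frames. A minimal sketch with made-up data (the `toy_counts` and `toy_coords` data frames and their values are hypothetical) shows the difference between the default and `all.x = TRUE`:

```r
# Hypothetical frequency table with two addresses
toy_counts <- data.frame(
  address = c("A,X,United States", "B,Y,United States"),
  freq = c(3, 1),
  stringsAsFactors = FALSE
)

# Hypothetical geocoding result that only matched the first address
toy_coords <- data.frame(
  address = "A,X,United States",
  lon = -80.2,
  lat = 25.8,
  stringsAsFactors = FALSE
)

# Default merge: only addresses present in both data frames survive
inner <- merge(toy_counts, toy_coords, by = "address")

# all.x = TRUE: keep every row of toy_counts, with NA coordinates if unmatched
left <- merge(toy_counts, toy_coords, by = "address", all.x = TRUE)

nrow(inner)
nrow(left)
```

If the merged data frames end up with fewer rows than expected, comparing the two variants like this is a quick way to check whether some addresses were not geocoded.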
Then, load the required libraries and download the map image (using ggmap) for the United States:
# load the required libraries
library(ggplot2)
library(ggmap)
# download the map background images
map <- get_map(location = 'united states', zoom = 4, maptype = "terrain",
               source = 'google', color = 'color', force = TRUE)
Using ggmap to add the map layer, and ggplot2 to add the circles representing the number of studies by location (color and size), we create the map below. A good read about this method can be found here.
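# The legend switch is a layer argument, not an aesthetic, so it goes outside aes()
ggmap(map) + geom_point(
  aes(x = lon, y = lat, colour = freq),
  data = g, alpha = 0.5, na.rm = TRUE, size = g$freq * 0.8,
  show.legend = TRUE) +
  scale_color_gradient(low = "green", high = "red")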
The heat map is similar; however, it uses geom_density2d and stat_density2d on top of the ggmap layer to plot the density of locations as a heat map. The technique to produce this visualization and other similar visualizations can be found here.
ggmap(map) + geom_density2d(data = h, aes(x = lon, y = lat), size = 0.3)+
stat_density2d(data=h, aes(fill = ..level.., alpha = ..level..), geom="polygon", bins=15) +
scale_fill_gradient(low = "green", high = "red")+
scale_alpha(range = c(0.1, 0.3), guide = FALSE)
## Warning: Removed 3 rows containing non-finite values (stat_density2d).
## Warning: Removed 3 rows containing non-finite values (stat_density2d).
A more in-depth description of the potential uses of this information, as well as the background for this analysis, is provided in this Pulse article.