The Clinical Trials Data

The ClinicalTrials.gov website is the front end of the large clinical trials database maintained by the National Library of Medicine (NLM), which is part of the National Institutes of Health (NIH). All studies registered with the US Food and Drug Administration (FDA) are available to the public there.

In this article, I will show you how to use the rclinicaltrials package, along with a few others, to download, aggregate, and visualize this clinical trials data. More information about this data can be found in this article about the visualizations presented here.

The Search Case

In this case, we are interested in searching for inhalation therapies for respiratory conditions. The rclinicaltrials R package provides a clean interface to the ClinicalTrials.gov website. In the past, you would have needed to download and parse the XML data on your own. Here is a Stack Overflow question about XML parsing, where I posted the question and it was graciously answered. The few lines of code below are all that is needed to download the data and do the initial processing.

library(rclinicaltrials)
# With the search parameters, download the data.  The search is meant to download fewer than 100 results; if there are more than 100, ClinicalTrials.gov will only give the first 100
a <- clinicaltrials_download(query = c('term=inhalation AND respiratory AND dose','recr=Open', 'type=Intr', 'cntry1=NA%3AUS'), count = 200, include_results = TRUE)

# Extract all the locations in the United States
b <- a$study_information$locations
c <- b[which(b$address.country=='United States'), ]

# First six rows of the search results (1500+ locations in total):
head(c)
##                                    name address.city address.state
## 1 St. Jude Children's Research Hospital      Memphis     Tennessee
## 2          Western Sky Medical Research      El Paso         Texas
## 3           Arizona Research Associates       Tucson       Arizona
## 4              VA Harbor Medical Center     New York      New York
## 5  St. Luke's-Roosevelt Hospital Center     New York      New York
## 6     Arthur F Gelb Medical Corporation     Lakewood    California
##   address.zip address.country      nct_id
## 1       38105   United States NCT00186927
## 2       79903   United States NCT00915538
## 3       85712   United States NCT00993707
## 4       10010   United States NCT00993707
## 5       10019   United States NCT00993707
## 6       90712   United States NCT01225913
# Of those results, we only need the address columns (city, state, and country), since the zip code is only provided for some locations.
d <- c[c(2, 3, 5)]
d$address <- paste(d$address.city, d$address.state, d$address.country,sep=",")

# We can summarize the results as a frequency table (number of clinical trial locations per city)
library(plyr)
e <- count(d, 'address')
# then sort in descending order of frequency, just to list these results
e <- e[order(-e$freq),]

# The 20 most frequent locations from the search:
head(e, 20)
##                                      address freq
## 291              Miami,Florida,United States   27
## 85             Cincinnati,Ohio,United States   22
## 324          New York,New York,United States   22
## 96               Columbus,Ohio,United States   19
## 404          San Antonio,Texas,United States   18
## 441         St. Louis,Missouri,United States   18
## 340     Oklahoma City,Oklahoma,United States   16
## 386          Richmond,Virginia,United States   16
## 75    Charlotte,North Carolina,United States   15
## 367    Pittsburgh,Pennsylvania,United States   15
## 42          Birmingham,Alabama,United States   14
## 86          Clearwater,Florida,United States   14
## 363            Phoenix,Arizona,United States   14
## 405       San Diego,California,United States   14
## 375            Portland,Oregon,United States   13
## 431 Spartanburg,South Carolina,United States   13
## 73   Charleston,South Carolina,United States   12
## 197  Greenville,South Carolina,United States   12
## 219              Houston,Texas,United States   12
## 284             Medford,Oregon,United States   12
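The aggregation step above relies on plyr's count. As a minimal sketch, the same frequency table can also be built with base R only. The data frame d_toy and its values are made up for illustration and stand in for the d data frame built above.

```r
# Toy stand-in for the `d` data frame built above (illustrative values only).
d_toy <- data.frame(
  address = c("Miami,Florida,United States",
              "Columbus,Ohio,United States",
              "Miami,Florida,United States",
              "Miami,Florida,United States",
              "Columbus,Ohio,United States",
              "Tucson,Arizona,United States"),
  stringsAsFactors = FALSE
)

# table() counts how often each address occurs; as.data.frame() turns the
# counts into a two-column data frame (address, freq).
e_toy <- as.data.frame(table(address = d_toy$address),
                       responseName = "freq",
                       stringsAsFactors = FALSE)

# Sort in descending order of frequency, as in the plyr version.
e_toy <- e_toy[order(-e_toy$freq), ]
print(e_toy)
```

Either approach should give the same counts; the base R version simply avoids the extra package dependency.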

Geolocation of the Cities

The data that we have downloaded and processed so far for the clinical trial locations does not include longitude and latitude coordinates, which are the numeric geolocation variables that we can plot on maps. The following code uses the Data Science Toolkit to map each address (in this case city, state, and country) to a specific longitude and latitude. The description of the geolocation code can be found in this Stack Overflow question and answer about geolocation, where again my question was graciously answered.

library(httr)
library(rjson)

data <- paste0("[",paste(paste0("\"",d$address,"\""),collapse=","),"]")
url  <- "http://www.datasciencetoolkit.org/street2coordinates"
response <- POST(url,body=data)
json     <- fromJSON(content(response,type="text"))

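# Build one row per geocoded address; the addresses are the names of the
# JSON entries, so rbind() carries them over as row names.
geocode  <- as.data.frame(
  do.call(rbind,lapply(json,
    function(x) c(lon=x$longitude,lat=x$latitude))))

geocode$address <- rownames(geocode)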
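To make the two halves of the geocoding step easier to follow, here is a small offline sketch. It builds the JSON payload from a couple of addresses and then turns a hand-written stand-in for the parsed response into a data frame. The mock_json list, its approximate coordinates, and the unmatched address are assumptions for illustration; they are not real output from the Data Science Toolkit.

```r
# Two example addresses in the same "City,State,Country" form used above.
addresses <- c("Miami,Florida,United States",
               "Columbus,Ohio,United States")

# Same construction as in the article: quote each address, join with commas,
# and wrap in square brackets to form a JSON array.
payload <- paste0("[", paste(paste0("\"", addresses, "\""), collapse = ","), "]")
print(payload)

# Hypothetical parsed response: a named list keyed by address.
# Coordinates are rough, and the NULL entry mimics an address with no match.
mock_json <- list(
  "Miami,Florida,United States" = list(longitude = -80.19, latitude = 25.77),
  "Columbus,Ohio,United States" = list(longitude = -82.99, latitude = 39.96),
  "Nowhere,XX,United States"    = NULL
)

# Drop unmatched addresses explicitly, then bind the rest into a matrix
# whose row names are the addresses.
found  <- Filter(Negate(is.null), mock_json)
coords <- do.call(rbind, lapply(found, function(x) {
  c(lon = x$longitude, lat = x$latitude)
}))

geocode_toy <- as.data.frame(coords)
geocode_toy$address <- rownames(geocode_toy)
print(geocode_toy)
```

Filtering out the NULL entries before binding makes it explicit that addresses without a match never reach the coordinate table.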

Visualizing the Results on Maps

First we need to create the data frames with the information to visualize on the maps. This is done by adding the latitude and longitude to the data frames that hold the relevant data:

  1. g holds the data aggregated by the number of clinical trial locations in each city; it will be used for the map with location spots.
  2. h is the entire dataset, without aggregation; it will be used for the heatmap.
g <- merge(e, geocode,by="address") # Use in density by location map
h <- merge(d, geocode,by="address") # Use for heat maps
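One detail worth keeping in mind is that merge performs an inner join by default, so any address that could not be geocoded silently drops out of g and h. The sketch below, which uses made-up toy data frames (e_toy and geocode_toy are assumptions, not the real data), shows one way to check which addresses would be lost.

```r
# Toy frequency table: three addresses (illustrative values only).
e_toy <- data.frame(
  address = c("A,X,United States", "B,Y,United States", "C,Z,United States"),
  freq    = c(5L, 3L, 1L),
  stringsAsFactors = FALSE
)

# Toy geocoding result: only two of the three addresses were located.
geocode_toy <- data.frame(
  address = c("A,X,United States", "B,Y,United States"),
  lon     = c(-80.2, -83.0),
  lat     = c(25.8, 40.0),
  stringsAsFactors = FALSE
)

# Default merge: inner join, the unmatched address disappears.
g_inner <- merge(e_toy, geocode_toy, by = "address")

# all.x = TRUE keeps every row of e_toy and fills missing coordinates with NA,
# which makes the unmatched addresses easy to list.
g_left  <- merge(e_toy, geocode_toy, by = "address", all.x = TRUE)
missing <- g_left$address[is.na(g_left$lon)]

print(nrow(g_inner))
print(missing)
```

For the maps the inner join is fine, since points without coordinates cannot be drawn anyway; the left join is only useful as a sanity check.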

Then, load the required libraries and download the map image (using ggmap) for the United States:

# load the required libraries
library(ggplot2)
library(ggmap)
# download the map background images
map<-get_map(location='united states', zoom=4, maptype = "terrain",
             source='google',color='color', force=TRUE)

Plot the Density by Location Map for Inhalation Therapy Clinical Trials

Using ggmap to add the map layer, and ggplot2 to add the circles representing the number of studies by location (color and size), we create the map below. A good read about this method can be found here.

ggmap(map) + geom_point(
  aes(x=lon, y=lat, colour=freq), 
  data=g, alpha=.5, na.rm = TRUE, size = g$freq*0.8, show.legend = TRUE)  + 
  scale_color_gradient(low="green", high="red")

Plot the Heatmap for Inhalation Therapy Clinical Trials

The heatmap is similar; however, it uses geom_density2d and stat_density2d on top of the ggmap layer to plot the density of locations as a heat map. The technique to produce this visualization and other similar visualizations can be found here.

ggmap(map) + geom_density2d(data = h,  aes(x = lon, y = lat), size = 0.3)+
  stat_density2d(data=h, aes(fill = ..level.., alpha = ..level..), geom="polygon", bins=15) +
  scale_fill_gradient(low = "green", high = "red")+
  scale_alpha(range = c(0.1, 0.3), guide = FALSE)
## Warning: Removed 3 rows containing non-finite values (stat_density2d).

## Warning: Removed 3 rows containing non-finite values (stat_density2d).

A more in-depth description of the potential uses of this information, as well as the background for this analysis, is provided in this Pulse article.