FDA Form 483 Citation Text Mining and Visualization using R

Downloading 483 Citations for a particular Year

The citations recorded on FDA Form 483 are curated inspection observations issued to companies in FDA-regulated industries, including pharmaceuticals, biotechnology, and biomedical devices. The database is available as sets of Excel and CSV files on the FDA Citations database page.

Downloading the Data

The first set of instructions downloads and reads a CSV file. The URL corresponds to the 2016 data as available in September 2016. The second set of instructions filters the data by “Program Area”, keeping only the citations related to Drugs, which typically means pharmaceutical companies.

# download.file() is part of base R (utils), so no extra package is needed
# only download if the file does not exist yet
destFile <- "UCM518846.csv"
if (!file.exists(destFile)) {
  URL <- paste0("http://www.fda.gov/downloads/ICECI/Inspections/", destFile)
  download.file(URL, destFile)
}
# read and initial processing
a <- read.csv(destFile)
b <- a[a$Program.Area == "Drugs", ]       # limit the data to pharma
c <- as.data.frame(b[, 9, drop = FALSE])  # keep only column 9, the long citations to be analyzed
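The filter above picks the citation text by column position. As a minimal sketch of the same filtering step, the toy data frame below selects the text column by name instead. The column names `Firm.Name` and `Long.Description` and all the rows are assumptions made up for illustration; check `names(a)` against the real file before relying on them.

```r
# Toy stand-in for the citations file. The column names and rows are
# hypothetical; they only mimic what read.csv() produces from headers
# that contain spaces.
citations <- data.frame(
  Firm.Name        = c("Acme Pharma", "Beta Devices", "Gamma Labs"),
  Program.Area     = c("Drugs", "Devices", "Drugs"),
  Long.Description = c("Procedures are not in writing.",
                       "Complaint files are not maintained.",
                       "Laboratory records are incomplete."),
  stringsAsFactors = FALSE
)

# Keep only the drug-related rows, then pull the text column by name
drugs <- citations[citations$Program.Area == "Drugs", ]
long_text <- drugs[, "Long.Description", drop = FALSE]

print(long_text)
```

Selecting by name keeps the script working if the column order of a later yearly file differs, at the cost of depending on the header text.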

Text Mining basic manipulations

The next set of instructions uses the ‘tm’ (text mining) package to turn the text of all the citations into structured text (a corpus), and then into a term-document matrix, which records the incidence of each term in each document (a citation, in this case).

The term-document matrix is then simplified by reducing its sparsity, and the remaining variables needed for the visualizations are calculated.

library(tm)
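# these three functions create the term-document matrix
# (VectorSource is used because recent tm versions require doc_id/text columns for DataframeSource)
corpus <- Corpus(VectorSource(as.character(c[[1]])))
tdm <- TermDocumentMatrix(corpus, control = list(removePunctuation = TRUE,
                                                 removeNumbers = TRUE,
                                                 stopwords = TRUE))
# reduce sparsity and convert to matrix
tdm2 <- removeSparseTerms(tdm, sparse = 0.90)
tdm.matrix <- as.matrix(tdm2)
# binarize (term present or absent), then build the term-term co-occurrence matrix
tdm.matrix[tdm.matrix >= 1] <- 1
tdm.matrix <- tdm.matrix %*% t(tdm.matrix)
# terms kept after sparsity reduction, and how many other terms each correlates with (r > 0.5)
freqterms <- findFreqTerms(tdm2)
vtxcnt <- rowSums(cor(as.matrix(t(tdm[freqterms, ]))) > .5) - 1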
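To make the matrix steps above concrete, here is a small base-R sketch that builds a binary term-document matrix by hand, multiplies it by its transpose to get term co-occurrence counts, and counts correlated terms per row. The four toy documents are invented for illustration, and the whitespace tokenizer is a simplification, not what ‘tm’ does internally.

```r
# Four made-up "citations" (hypothetical text, already lower-cased)
docs <- c(d1 = "procedures not established",
          d2 = "procedures not followed",
          d3 = "records incomplete",
          d4 = "records not established")

# Split on spaces and collect the vocabulary
tokens <- strsplit(docs, " ", fixed = TRUE)
vocab  <- sort(unique(unlist(tokens)))

# Binary term-document matrix: rows are terms, columns are documents
toy_tdm <- vapply(tokens,
                  function(words) as.integer(vocab %in% words),
                  integer(length(vocab)))
rownames(toy_tdm) <- vocab

# Term-term co-occurrence: entry [i, j] counts documents containing both terms
cooc <- toy_tdm %*% t(toy_tdm)

# For each term, the number of other terms with correlation above 0.5
links <- rowSums(cor(t(toy_tdm)) > 0.5) - 1

print(cooc)
print(links)
```

The diagonal of `cooc` holds each term's document count, which is why the network section later removes self-loops before plotting.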

Visualizing as Flow Diagram

The first visualization produces a flow diagram based on the correlation between the most frequent terms. In this plot, it is possible to infer phrases from how each node (a rectangle) connects with the rest.

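# light-to-dark blue palette; darker nodes have more correlated terms
mycols <- c("#f7fbff", "#deebf7", "#c6dbef",
            "#9ecae1", "#6baed6", "#4292c6",
            "#2171b5", "#084594")
vc <- mycols[vtxcnt + 1]
names(vc) <- names(vtxcnt)
# plotting a term-document matrix requires the Rgraphviz package
pp <- plot(tdm,
           terms = freqterms,
           corThreshold = 0.26,
           nodeAttrs = list(fillcolor = vc))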
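The color lookup above indexes an eight-color palette with `vtxcnt + 1`, which returns `NA` for any term linked to more than seven others. A minimal sketch of one way to guard against that is to clamp the index to the palette length. The counts below are hypothetical, and `palette8`, `counts`, and `fill` are illustrative names that are not part of the original script.

```r
palette8 <- c("#f7fbff", "#deebf7", "#c6dbef",
              "#9ecae1", "#6baed6", "#4292c6",
              "#2171b5", "#084594")

# Hypothetical link counts per term; 12 would overflow an 8-color palette
counts <- c(alpha = 0, beta = 3, gamma = 12)

# Clamp the index so every term gets a valid color
idx  <- pmin(counts + 1, length(palette8))
fill <- setNames(palette8[idx], names(counts))

print(fill)
```

With clamping, every term above the palette range shares the darkest shade, which loses some resolution but avoids uncolored nodes.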

Visualizing as Cluster Dendrogram

Similar plots are made in phylogenetics to show relationships between genes or species. The dendrogram shows the distance between nodes and the hierarchy of these relationships.

# cluster Plot
library(cluster)   
d <- dist(t(tdm.matrix), method="euclidean")   
hc <- hclust(d=d, method="complete")
plot(hc, xlab="term", ylab="distance", hang=-1)
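As a small illustration of how the hierarchy can be read off programmatically, the sketch below clusters a made-up 4 x 4 co-occurrence matrix with the same `dist()` and `hclust()` settings and cuts the tree into two groups with `cutree()`. The terms and counts are invented, and cutting at `k = 2` is an arbitrary choice for the example.

```r
# Hypothetical co-occurrence counts between four terms
terms <- c("procedures", "written", "records", "laboratory")
toy <- matrix(c(5, 4, 0, 0,
                4, 5, 0, 1,
                0, 0, 6, 5,
                0, 1, 5, 6),
              nrow = 4, byrow = TRUE,
              dimnames = list(terms, terms))

# Same distance and linkage choices as in the script above
d_toy  <- dist(toy, method = "euclidean")
hc_toy <- hclust(d_toy, method = "complete")

# Cut the dendrogram into two clusters and list the membership of each term
groups <- cutree(hc_toy, k = 2)
print(groups)
```

Terms whose rows of co-occurrence counts look alike end up in the same group, which is the same information the dendrogram conveys through branch height.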

Visualizing as Network Diagram

Network diagrams have become popular because they can represent large amounts of interconnected data. All possible relationships are shown as lines connecting the terms, which are represented as nodes. In this case, nodes in gray with a small font are the least frequent terms, while nodes in bright green are the most frequent and most interconnected terms. In this representation, nodes with more connections migrate to the center of the plot, while nodes with fewer connections migrate to the periphery.

# Network Diagram Plot
library(igraph)
# build a graph from the above matrix
g <- graph_from_adjacency_matrix(tdm.matrix, weighted = TRUE, mode = "plus")
# remove loops
g <- simplify(g)
# set labels and degrees of vertices
V(g)$label <- V(g)$name
V(g)$degree <- degree(g)
# set seed to make the layout reproducible
set.seed(4321)
layout1 <- layout_with_fr(g, niter = 1500)
# scale the label size with the vertex degree
sizer <- 4.5 * V(g)$degree / max(V(g)$degree) - 2.7
V(g)$label.cex <- sizer
V(g)$label.font <- 2
V(g)$size <- 10
E(g)$curved <- 0.2
E(g)$width <- 1.2
# map the rescaled size onto a gray-blue-green color ramp
c_scale <- colorRamp(c('gray', 'blue', 'green'))
colorer <- (sizer - min(sizer)) / (max(sizer) - min(sizer))
V(g)$color <- apply(c_scale(colorer), 1, function(x) rgb(x[1]/255, x[2]/255, x[3]/255))
pg <- plot(g, layout = layout1, vertex.frame.color = 'black',
           vertex.label.dist = 0.6)
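The degree-to-color mapping used above can be tried in isolation without igraph. The sketch below, under the assumption of a tiny hand-written adjacency matrix, computes each node's degree with `rowSums()`, rescales it to the 0 to 1 range, and passes it through the same gray-blue-green `colorRamp()`. The terms and edges are hypothetical, and `adj`, `deg`, and `node_col` are illustrative names.

```r
# Hypothetical undirected graph as a 0/1 adjacency matrix (no self-loops)
terms <- c("procedures", "written", "records", "laboratory")
adj <- matrix(c(0, 1, 1, 1,
                1, 0, 1, 0,
                1, 1, 0, 0,
                1, 0, 0, 0),
              nrow = 4, byrow = TRUE,
              dimnames = list(terms, terms))

# Degree of each node: number of neighbours
deg <- rowSums(adj > 0)

# Rescale degree to [0, 1]; least connected -> 0, most connected -> 1
scaled <- (deg - min(deg)) / (max(deg) - min(deg))

# Same ramp as the script: gray for low, blue for middle, green for high
ramp     <- colorRamp(c("gray", "blue", "green"))
node_col <- rgb(ramp(scaled), maxColorValue = 255)
names(node_col) <- terms

print(node_col)
```

The rescaling divides by `max(deg) - min(deg)`, so it assumes the nodes do not all share the same degree; a graph where they do would need a separate fallback color.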

The article explaining these results and contrasting them with those of other years can be found in this Pulse article about text mining FDA 483 citation data.