Shaking the Biostat Cage, Someone Give Me Two Bananas

Susan Telke

Introduction:

To begin, recall an article whose tables gave you difficulty while reading. The article I chose for this tutorial is of personal interest to me because I like to think about the interface between medical research findings, physician understanding and implementation of those findings, and patient acceptance (or not) and understanding. I know a few of the authors of the paper, and the lead author is a mentor to me. I hear Jim has a strong preference for tables of numbers over displaying information in plots. Needless to say, my preference is not the same!

This tutorial will walk you through both the thought process behind choosing a good way to display the information in the tables and the R code needed to build that display.

The original article is quite old, but luckily available through the inter-webs. Be sure to log into the U of MN library system to gain access to the U of MN journal holdings. Neaton J, Wentworth D, Rhame F, Hogan C, Abrams D, Deyton, and CPCRA, (1994) Methods of Studying Interventions: Considerations in Choice of a Clinical Endpoint for AIDS Clinical Trials. Statistics in Medicine, 13: 2107-2125.

The overriding goal of the paper by Neaton and colleagues is not completely contained in tables I and II, but these were the tables I had difficulty following and thus good fodder for this tutorial.

Table I (Neaton et al.) and Table II (Neaton et al.)

The challenge:

The primary purpose of Table II was to express the subjective assessments of severity of disease progression by both physicians and patients. This is valuable because estimates of risk of death alone (Table I) ignore the intrinsic differences in quality of life associated with the different opportunistic events. Additionally, the table shows the degree of agreement or disagreement between physician and patient ratings. Finally, the table reports the rank correlation of each set of ratings with the relative risk (RR) of death, along with the associated p-values.

The tables were engaging; however, it is difficult to flip between Table I and Table II. Also, whether or not the patients and physicians agree overall on the severity rankings is not completely transparent (at least to me). Most importantly, the relationship between the severity rankings and the relative risk of death was lost between the tables, even though the authors listed the opportunistic infections in order of risk of death. The magnitude of the differences in relative risk between diseases, shown in Table I, was not carried over to Table II.

Our goal is to develop a plot that visualizes the agreement between the severity rankings by physicians and patients while indicating the magnitude of the relative risk of death for those with the event compared to those without. Additionally, by comparing the color scale and the position of the disease on the scatter-plot, a viewer may be alerted to a discrepancy between the severity ranking and the estimated risk of death. This may indicate an opportunistic disease that has serious quality of life implications or a potential disconnect between the risk of death for the event and the physicians' or patients' perception of risk of death.

The Data:

Recording Data

Raw data for medical journal articles is rarely available, but the data displayed in the tables is enough information to get started. Using a spreadsheet program, enter the column headings for the variables of interest. For this tutorial, these are Event, RR, text.RR, CI, Phys.score, and Patient.score.


Importing the Data

Importing data into R can be a little tricky. If you are using Excel as a spreadsheet, you may want to save the file as a .csv file. “.csv” stands for comma separated values. Another option is to copy and paste the data into a very basic text editor. You will need to know what separates the columns in your data file. Common separators are commas, white space, or tabs. For a file with a different delimiter, see the help page for read.table (type ?read.table in R).

# Reading data into a dataframe in R.
aids <- read.csv("C:/Users/telke/Desktop/PhD/Epsy8252/composite.endpoints.aids.csv")

Notice a few things in the code above.

______________________________________________________

Code Fragment    Meaning
aids <-          assigns the name aids to the working data set in R
read.csv         tells R to read in a comma separated value file

______________________________________________________

The remaining R code (“C:/Users/telke/Desktop/PhD/Epsy8252/composite.endpoints.aids.csv”) is the path to where the data is stored.
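For files that are not comma separated, read.table() with an explicit sep argument should do the same job. The sketch below is only an illustration: it writes a tiny tab-separated file to a temporary location and reads it back. The names tab_file, aids_tab and aids_tab2 are made up for this example, and the two rows are copied from the printed data shown in this tutorial.

```r
# Hypothetical example file: two rows of the tutorial data, tab separated,
# written to a temporary location.
tab_file <- tempfile(fileext = ".txt")
writeLines(c("Event\tRR\tPhys.score\tPatient.score",
             "PML\t18.3\t8.7\t9.0",
             "Lymphoma\t8.1\t9.7\t9.0"),
           tab_file)

# sep names the character that separates the columns;
# header = TRUE uses the first line as the column names.
aids_tab <- read.table(tab_file, header = TRUE, sep = "\t")

# read.delim() is a shortcut that uses a tab as its default separator.
aids_tab2 <- read.delim(tab_file)

aids_tab
```

For white-space separated files, leaving sep at its default in read.table() is usually enough.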

Once you have read the data into R, it does not automatically display. There are a few options to see the data.

# to view the data outside the R console
View(aids)


# to view the first several rows of your data within R console
head(aids)
##                    Event   RR   text.RR           CI Phys.score
## 1                    PML 18.3  RR: 18.3 (10.2, 32.9)        8.7
## 2               Lymphoma  8.1 RR:  8.1   (5.7, 11.4)        9.7
## 3      Kaposi's sarcoma   4.9 RR:  4.9    (3.4, 7.0)        6.7
## 4 AIDS Dementia Complex   4.6  RR:  4.6   (3.6, 6.0)        7.0
## 5           Toxolasmosis  4.0 RR:  4.0    (2.8, 5.8)        6.5
## 6         Histoplasmosis  3.2 RR:  3.2    (1.5, 6.6)        4.6
##   Patient.score
## 1           9.0
## 2           9.0
## 3           6.9
## 4           8.9
## 5           7.6
## 6           6.1
# to view the last several rows of your data within R console
tail(aids)
##                             Event  RR   text.RR          CI Phys.score
## 12                  Herpes zoster 2.3 RR:  2.3   (1.2, 4.3)        4.3
## 13                   Tuberculosis 2.1  RR:  2.1  (1.4, 3.2)        5.0
## 14 Other mycobacterial infections 2.0  RR:  2.0  (1.1, 3.4)        5.6
## 15                    Candidiasis 1.7  RR:  1.7  (1.4, 2.1)        2.6
## 16                 Herpes simplex 1.4 RR:  1.4   (0.7, 2.0)        4.7
## 17              Cryptosporidiosis 0.8  RR:  0.8 (0.5, 1.4)         7.5
##    Patient.score
## 12           5.1
## 13           7.4
## 14           7.5
## 15           4.0
## 16           5.4
## 17           6.6

Be sure to take a look at the data, so you know R has read the data as you thought it would.
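Beyond looking at the rows, it can help to check the column types, since a stray character in a spreadsheet cell can cause a numeric column to be read in as text. The following is a minimal sketch using a hand-typed stand-in for the data (aids_demo is a made-up name; its three rows are copied from the head() output above).

```r
# Stand-in for the aids data frame: the first three rows of the head() output.
aids_demo <- data.frame(
  Event         = c("PML", "Lymphoma", "Kaposi's sarcoma"),
  RR            = c(18.3, 8.1, 4.9),
  Phys.score    = c(8.7, 9.7, 6.7),
  Patient.score = c(9.0, 9.0, 6.9)
)

str(aids_demo)               # column names, types and the first few values
sapply(aids_demo, class)     # RR and the two scores should be "numeric"
colSums(is.na(aids_demo))    # number of missing values in each column
```

The same three calls can be run on the full aids data frame once it has been read in.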

The Plot:

Install and Load ggplot2 Package

ggplot2 is a package within R used for graphics displays. Install this package into R.

install.packages( "ggplot2", dependencies = TRUE )

Next, load the ggplot2 package into R with the library command. Each time R is opened the package will require loading. You will only need to install the package once.

# loading ggplot2
library(ggplot2)
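Because installation is needed only once but loading is needed every session, one possible pattern for the top of a script is to install the package only when it is missing. This is a sketch of that idea, not a required step.

```r
# Install ggplot2 only if it is not already available on this machine.
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}

# Load the package for the current R session.
library(ggplot2)
```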

Basic Scatter-Plot

One of the goals of the tables was to show the relationship between severity rank ordering of opportunistic infections as rated by 10 experienced physicians and 8 HIV positive patients. Both of these variables are quantitative, so a scatter-plot may be a natural way to express this relationship visually.

# plotting a scatter-plot using ggplot2
ggplot(data = aids, aes(x = Patient.score, y = Phys.score)) + geom_point()


ggplot() is the function that sets up the plot. Inside ggplot() you indicate the data set and the global aesthetics for the graph. Think of the global aesthetics as the overall theme. In this case, the horizontal axis indicates the patient score and the vertical axis the physician score. ggplot() works in layers. The call to ggplot() by itself does not indicate what to plot (points, box-plot, violin plot, etc.). To add the layer that puts the data-points into the scatter-plot, put a “+” sign at the end of the first line of code and use geom_point() on the next line. It is not necessary to put each piece of code joined by a “+” sign on its own line, but doing so helps explain the layers of code for plotting.

______________________________________________________

Code Fragment    Meaning
aes()            these are the global aesthetics for the plot
geom_point()     plots the points at x=Patient.score and y=Phys.score

______________________________________________________
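The layering idea can be seen by storing the plot in an object and adding to it step by step. This is a minimal sketch on a hand-typed stand-in for the data; the names aids_demo, base_plot and scatter are made up for the example, and the three rows are copied from the head() output earlier in the tutorial.

```r
library(ggplot2)

# Stand-in data: three rows from the head() output.
aids_demo <- data.frame(
  Event         = c("PML", "Lymphoma", "Kaposi's sarcoma"),
  Phys.score    = c(8.7, 9.7, 6.7),
  Patient.score = c(9.0, 9.0, 6.9)
)

# The ggplot() call only sets up the data and the axes; no points are drawn.
base_plot <- ggplot(data = aids_demo, aes(x = Patient.score, y = Phys.score))
base_plot

# Adding geom_point() adds the layer that draws the points.
scatter <- base_plot + geom_point()
scatter
```

Printing base_plot should show empty axes, while scatter shows the three points.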

Labeling Data in the Plot

The scatter-plot shows the relationship between the severity ranks of opportunistic disease by physicians and patients, but does not indicate which data point refers to which disease. ggplot2 allows the labeling of the data-values in the plot.

# labeling of data values
ggplot(data = aids, aes(x = Patient.score, y = Phys.score)) + 
geom_point() + geom_text(aes(label = Event, angle = 90, hjust = -0.05), size = 3)


______________________________________________________

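Code Fragment    Meaning
geom_text()      allows labeling of data with text
aes()            maps the Event column to the label; angle and hjust are also set here
angle            rotates the text label to the specified angle (in degrees)
hjust            horizontal justification of the label relative to the data point; with angle = 90 the shift runs along the rotated text, so the label moves up or down
size             sets the font size of the label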

______________________________________________________
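In the code above, angle and hjust sit inside aes(). That works, but aes() is intended for mapping columns of the data to visual properties, so a common alternative is to keep only label inside aes() and pass the fixed values as ordinary arguments. A sketch of that variant, using a hand-typed stand-in for the data (aids_demo and labelled are made-up names):

```r
library(ggplot2)

# Stand-in data: three rows from the head() output.
aids_demo <- data.frame(
  Event         = c("PML", "Lymphoma", "Kaposi's sarcoma"),
  Phys.score    = c(8.7, 9.7, 6.7),
  Patient.score = c(9.0, 9.0, 6.9)
)

labelled <- ggplot(data = aids_demo, aes(x = Patient.score, y = Phys.score)) +
  geom_point() +
  # label comes from a column, so it stays inside aes();
  # angle, hjust and size are fixed values, so they sit outside aes().
  geom_text(aes(label = Event), angle = 90, hjust = -0.05, size = 3)

labelled
```

The resulting plot should look the same as with the original call; the difference is only in how the code reads.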

Titles, Axis Labels, Background and Annotation


Notice that the plot lacks a title, the axis labels are the column names, and there is no information about the number of physicians and patients. Also, the background is a shaded gray, which may cause printing issues. There are several ways to remedy these; one way is shown below.

#adding titles, axis labels, background change and text
  ggplot(data = aids, aes(x = Patient.score, y = Phys.score)) +
  geom_text(aes(label=Event,angle=90,hjust=-.05), size=3)+
  geom_point()+

  theme_bw () + #makes background white instead of gray
  ggtitle("Relative Risk of Death for Those with Event vs. Not*\n and\n Severity Rank of Event by Physicians and Patients**") +
  ylab("Physician Severity Rank") +
  xlab("Patient Severity Rank") +
  annotate("text", label = "**Average severity scores assigned by N=10 Physicians, N=8 Patients", x = 7, y = 2.8, size = 3, colour = "navyblue") 



______________________________________________________

Code Fragment    Meaning
theme_bw()       changes background to white, adds line around the plot
ggtitle()        inserts a title; line breaks are achieved with “\n”
ylab()           inserts y axis label
xlab()           inserts x axis label
annotate()       inserts a text string centered at the given x, y location in the plot
size             allows a change of font size
colour           allows a change of color (the spelling color is also accepted)

______________________________________________________
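As one of the other ways to add a title and axis labels, labs() can set all three in a single call. The sketch below uses a hand-typed stand-in for the data and a shortened title; aids_demo and titled are made-up names.

```r
library(ggplot2)

# Stand-in data: three rows from the head() output.
aids_demo <- data.frame(
  Event         = c("PML", "Lymphoma", "Kaposi's sarcoma"),
  Phys.score    = c(8.7, 9.7, 6.7),
  Patient.score = c(9.0, 9.0, 6.9)
)

titled <- ggplot(data = aids_demo, aes(x = Patient.score, y = Phys.score)) +
  geom_point() +
  theme_bw() +
  # labs() sets the title and both axis labels in one place.
  labs(title = "Severity Rank of Event by Physicians and Patients",
       x = "Patient Severity Rank",
       y = "Physician Severity Rank")

titled
```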


Axis Tickmarks and Range

When R plots your data, there are default settings for the axis ticks and range of values. These defaults may not result in the clearest presentation of the information. R allows you to scale the axis to your specifications.

#further refinement
  ggplot(data = aids, aes(x = Patient.score, y = Phys.score)) +
  geom_text(aes(label=Event,angle=90,hjust=-.05), size=3)+
  geom_point()+
  theme_bw () + #makes background white instead of gray
  ggtitle("Relative Risk of Death for Those with Event vs. Not*\n and\n Severity Rank of Event by Physicians and Patients**") +

  scale_x_continuous(name="Patient Severity Rank\n \n(Higher Rank Indicates Higher Severity)",limits=c(4, 9),breaks=c(4,5,6,7,8,9)) +
  scale_y_continuous(name="Physician Severity Rank",limits=c(2, 11),breaks=c(3,4,5,6,7,8,9, 10))+
  annotate("text", label = "**Average severity scores assigned by N=10 Physicians, N=8 Patients", x = 7, y = 2.8, size = 3, colour = "navyblue") 



______________________________________________________

Code Fragment           Meaning
scale_y_continuous()    sets controls for tick marks, range and naming on y axis
scale_x_continuous()    sets controls for tick marks, range and naming on x axis

______________________________________________________

Note: xlab and ylab are removed from code because the scale function above has a naming option. Again, notice “\n” allowing for line breaks in the label for the horizontal axis.
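The tick positions do not have to be typed out one by one; the colon operator or seq() generates the same values. Keep in mind that points falling outside the limits given to a scale are dropped from the plot (with a warning), so it is worth checking that the limits cover the data. A minimal sketch on a hand-typed stand-in for the data (aids_demo, x_breaks, y_breaks and scaled are made-up names):

```r
library(ggplot2)

# Stand-in data: three rows from the head() output.
aids_demo <- data.frame(
  Event         = c("PML", "Lymphoma", "Kaposi's sarcoma"),
  Phys.score    = c(8.7, 9.7, 6.7),
  Patient.score = c(9.0, 9.0, 6.9)
)

# Same tick positions as c(4, 5, 6, 7, 8, 9) and c(3, 4, 5, 6, 7, 8, 9, 10).
x_breaks <- 4:9
y_breaks <- seq(3, 10, by = 1)

# Check that the chosen limits cover the data before plotting.
range(aids_demo$Patient.score)
range(aids_demo$Phys.score)

scaled <- ggplot(data = aids_demo, aes(x = Patient.score, y = Phys.score)) +
  geom_point() +
  scale_x_continuous(name = "Patient Severity Rank",
                     limits = c(4, 9), breaks = x_breaks) +
  scale_y_continuous(name = "Physician Severity Rank",
                     limits = c(2, 11), breaks = y_breaks)

scaled
```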

Dynamic Change of Color for Data in Scatterplot

The scatter-plot above is one way to visualize the agreement between rank severity ordering by physicians and patients. However, the plot does not yet indicate the magnitude of the relative risk of death for those with the event compared to those without, so a discrepancy between the severity ranking and the estimated risk of death cannot be seen. The geom_point() function allows the points to change size or color as a function of a data value. In this case, color changes in response to the magnitude of the relative risk. By default, ggplot2 chooses the color gradient itself; to control the colors, the scale_colour_gradientn() function is used.

# further refinement
ggplot(data = aids, aes(x = Patient.score, y = Phys.score)) +
  theme_bw() +
  ggtitle("Relative Risk of Death for Those with Event vs. Not*\n and\n Severity Rank of Event by Physicians and Patients**") +
  scale_x_continuous(name = "Patient Severity Rank\n \n(Higher Rank Indicates Higher Severity)",
                     limits = c(4, 9), breaks = c(4, 5, 6, 7, 8, 9)) +
  scale_y_continuous(name = "Physician Severity Rank",
                     limits = c(2, 11), breaks = c(3, 4, 5, 6, 7, 8, 9, 10)) +
  annotate("text",
           label = "**Average severity scores assigned by N=10 Physicians, N=8 Patients",
           x = 7, y = 2.8, size = 3) +
  geom_text(aes(label = Event, angle = 90, hjust = -0.05), size = 3) +
  geom_point(data = aids, aes(colour = RR, x = Patient.score, y = Phys.score),
             size = 3.5) +
  scale_colour_gradientn(limits = c(0, 20),
                         colours = c("green", "navyblue", "darkmagenta", "darkorange1"))


______________________________________________________

Code Fragment               Meaning
geom_point()                sets controls for data points
scale_colour_gradientn()    sets controls for gradient color change

______________________________________________________

Note: The generic geom_point() from the earlier code has been replaced with a geom_point() call that carries more specifications (colour mapped to RR and a larger point size).
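The same mechanism works for point size instead of color. As a sketch of that alternative (not the choice made in this tutorial), the relative risk can be mapped to size inside aes(). The data below is a hand-typed stand-in with three rows copied from the head() output; aids_demo and sized are made-up names.

```r
library(ggplot2)

# Stand-in data: three rows from the head() output.
aids_demo <- data.frame(
  Event         = c("PML", "Lymphoma", "Kaposi's sarcoma"),
  RR            = c(18.3, 8.1, 4.9),
  Phys.score    = c(8.7, 9.7, 6.7),
  Patient.score = c(9.0, 9.0, 6.9)
)

sized <- ggplot(data = aids_demo, aes(x = Patient.score, y = Phys.score)) +
  # size is mapped to a column, so events with a larger RR get larger points.
  geom_point(aes(size = RR)) +
  theme_bw()

sized
```

With 17 events close together, overlapping large points may hide labels, which is one reason to prefer color here.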

A Few More Labels

You can add additional text to the plot to help guide the viewer. Again, the text is located at the x and y coordinates specified.

# further refinement, inserting the RR and CI for Each Disease and More
# Text
ggplot(data = aids, aes(x = Patient.score, y = Phys.score)) +
  theme_bw() +
  ggtitle("Relative Risk of Death for Those with Event vs. Not*\n and\n Severity Rank of Event by Physicians and Patients**") +
  scale_x_continuous(name = "Patient Severity Rank\n \n(Higher Rank Indicates Higher Severity)",
                     limits = c(4, 9), breaks = c(4, 5, 6, 7, 8, 9)) +
  scale_y_continuous(name = "Physician Severity Rank",
                     limits = c(2, 11), breaks = c(3, 4, 5, 6, 7, 8, 9, 10)) +
  annotate("text",
           label = "**Average severity scores assigned by N=10 Physicians, N=8 Patients",
           x = 7.16, y = 2.8, size = 3) +
  geom_text(aes(label = Event, angle = 90, hjust = -0.05), size = 3) +
  geom_point(data = aids, aes(colour = RR, x = Patient.score, y = Phys.score),
             size = 3.5) +
  scale_colour_gradientn(limits = c(0, 20),
                         colours = c("green", "navyblue", "darkmagenta", "darkorange1")) +
  geom_text(aes(label = text.RR, angle = 90, vjust = 1.6, hjust = -0.1), size = 3) +
  geom_text(aes(label = CI, vjust = 2.6, angle = 90, hjust = -0.1), size = 3) +
  annotate("text",
           label = "*Ranking of clinical events indicative of HIV disease progression by risk of death:  results for 3382 CPCRA patients",
           x = 7.25, y = 3, size = 3) +
  annotate("text",
           label = "Rank correlation with relative risk of death:  Physicians=0.49 (p = 0.05), Patients=0.63 (p = 0.007)",
           x = 7.19, y = 2.2, size = 3, colour = "navyblue")
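The text.RR and CI columns used for these labels were typed into the spreadsheet, but label text of this form could also be built in R from numeric columns with sprintf(). The sketch below assumes the confidence limits are available as separate numbers; the names rr_values, ci_lower, ci_upper, rr_labels and ci_labels are made up for the example, and the values are copied from the first three rows of the head() output.

```r
# Values from the first three rows of the head() output.
rr_values <- c(18.3, 8.1, 4.9)
ci_lower  <- c(10.2, 5.7, 3.4)
ci_upper  <- c(32.9, 11.4, 7.0)

# %4.1f prints one decimal place and pads to a width of four characters,
# so single-digit and double-digit values line up.
rr_labels <- sprintf("RR: %4.1f", rr_values)

# %.1f keeps the trailing zero, e.g. 7.0 rather than 7.
ci_labels <- sprintf("(%.1f, %.1f)", ci_lower, ci_upper)

rr_labels
ci_labels
```

Columns built this way can be added to the data frame and passed to geom_text() through the label aesthetic, exactly as text.RR and CI are above.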


Final Thoughts

The text labels for the opportunistic infections are too small. They work OK in the html setting, but are terrible in print. If I could figure out how to do it, I would allow the user to hover over a data-point and have the disease, RR and CI appear. There are many ways in which interaction with the plot would allow improved understanding. I look forward to learning how to use those tools.
