The data used for this activity consists of de-identified Hepatitis C patient encounter records. Patients were screened at a local health facility within the Paso Del Norte region. The dataset contains 564 entries and 17 variables. A quick look at the data shows several questionable inputs, which will require re-coding to be more usable when creating a data visualization.
The race and ethnicity variables were adjusted by collapsing them into a single joint variable (e.g. White_non_His). In addition, gender (Male and Female) was re-coded as 0 and 1. Several of the entered values combine a numeric code and a text label (0-No / 1-Yes); these must be re-coded to numeric values only. All variable responses with similar characteristics should be handled the same way. As a final step in reducing ambiguity within the dataset, several levels were collapsed, limiting those variables to a positive or negative response only.
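The re-coding steps described above could be sketched in base R as follows. This is a minimal illustration on made-up data, not the original cleaning script: the data frame toy, the column names Race, Ethnicity, and Prior.Tx, and the mapping Male = 0 / Female = 1 are all assumptions made for the example.

```r
# Toy data standing in for the encounter records (all values are made up;
# the column names are hypothetical, not taken from the real dataset)
toy <- data.frame(
  Race      = c("White", "White", "Black", "Asian"),
  Ethnicity = c("Non-Hispanic", "Hispanic", "Non-Hispanic", "Non-Hispanic"),
  Gender    = c("Male", "Female", "Female", "Male"),
  Prior.Tx  = c("0-No", "1-Yes", "1-Yes", "0-No"),
  stringsAsFactors = FALSE
)

# Collapse race and ethnicity into one joint variable, e.g. "White_non_His"
eth_short <- ifelse(toy$Ethnicity == "Hispanic", "His", "non_His")
toy$Race_Eth <- paste(toy$Race, eth_short, sep = "_")

# Re-code gender (assumed mapping: Male = 0, Female = 1)
toy$Gender <- ifelse(toy$Gender == "Female", 1L, 0L)

# Keep only the leading digits of mixed responses such as "0-No" / "1-Yes"
toy$Prior.Tx <- as.integer(sub("^([0-9]+).*$", "\\1", toy$Prior.Tx))

print(toy)
```

Applying the same sub() pattern to every column that mixes a numeric code with a text label keeps the re-coding consistent across variables.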
A closer look at the full dataset also shows several missing values (NAs). To address this issue, all missing values were first counted. The Multivariate Imputation by Chained Equations (MICE) function was then used to replace the missing values (imputation). MICE populates each missing value (NA) by cross-referencing it with the observed values of the other variables in the dataset. Both steps can be done with the R code (R Code 1) provided below.
(R − Code 1)
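A minimal sketch of the missing-value count and the imputation step, assuming the mice package is available. The toy data frame, its column names, the number of imputations (m = 5), the predictive mean matching method ("pmm"), and the seed are illustrative assumptions, not settings taken from the original analysis.

```r
# Toy data with missing values (simulated, not the study data)
set.seed(1)
n <- 60
toy <- data.frame(
  Age      = round(rnorm(n, mean = 45, sd = 12)),
  Gender   = rbinom(n, 1, 0.5),
  RNA.Test = rbinom(n, 1, 0.4)
)
toy$Age[c(3, 17, 42)] <- NA
toy$RNA.Test[c(5, 28)] <- NA

# Count of missing values per variable and in total
na_per_column <- colSums(is.na(toy))
na_total <- sum(is.na(toy))
print(na_per_column)

# Impute only when the mice package is installed (assumed settings)
if (requireNamespace("mice", quietly = TRUE)) {
  imp <- mice::mice(toy, m = 5, method = "pmm", seed = 123, printFlag = FALSE)
  toy_complete <- mice::complete(imp, 1)  # first of the m completed datasets
  print(colSums(is.na(toy_complete)))
}
```

Comparing the NA counts before and after imputation is a quick check that every variable was filled in.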
As a low-level visualization, a heatmap was produced for this assignment. This plot provides a visual representation of each level, making any concerns or abnormalities in a variable visible. Any abnormalities noted at this point will be addressed, and the process restarts at step 1. From the heat plot below, we can see relationships between the variables.
plot(dat)
This assignment evaluated five health centers (N = 5) and the flow of patients who visited each site. Tufte’s principles will be used to produce a well-accepted data visualization (DataViz).
The story to be told: using the dataset presented above, the DataViz will focus on showing the association among patients treated at five local health centers over the first three years of the program’s implementation. The plots below show the association among HCV patients per health center. Specifically, they look at the association between health center, HCV RNA results, gender, and race.
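library(ggplot2)
# Treat gender and the RNA test result as categorical variables
dat1$Gender <- as.character(dat1$Gender)
dat1$RNA.Test <- as.factor(dat1$RNA.Test)
# Map race to the x-axis and the RNA test result to the y-axis
p <- ggplot(data = dat1, mapping = aes(x = Race, y = RNA.Test))
# Without a geom layer, printing p draws only the empty axes
p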
To achieve “Graphical Integrity” (GI), all variations and manipulations of the dataset have been documented and articulated in this publication/presentation. All data formatting steps are presented along with their corresponding code. In keeping with GI, all scales are clearly legible (i.e. margins are labeled and scales start from 0).
Specific attention was paid to the layering of the variables. In particular, several graphic devices were used to separate the categories. In addition, the layers were applied hierarchically to reduce confusion in the story being told. For this specific assignment, the graphs were layered to show HCV incidence per site in a hierarchical manner. Close attention was paid to the color schemes and scale, with extra emphasis on depicting small changes within the data. As the assignments progress, the focus will shift to the “Parallelism” of the DataViz. Specifically, isomorphism and visual juxtapositions that reveal connections will also be layered in.
To avoid “Chart-junk”, a cookbook approach to DataViz was first used to develop sound graphs. Once a graph was developed, a more inventive and liberal approach was applied, focusing specifically on reducing visual noise that may disrupt the story being told.
To improve on the heat plot, each gender was first color-coded. This allowed for visualization of the proportions of gender, race, and site screened.
p <- ggplot(data = dat1,
            mapping = aes(x = Race, y = RNA.Test, color = Gender))
p + geom_point(alpha = 0.2) +
  facet_grid(Site.Name ~ Gender) +
  scale_x_discrete(guide = guide_axis(n.dodge = 5)) +
  ggtitle("Association Plot") +
  theme(plot.title = element_text(lineheight = .8, face = "bold")) +
  xlab("Patient's Race") +
  ylab("RNA Test Result")
A Mosaic Plot provides a more aesthetically appealing visualization but becomes difficult to interpret once several variables are added.
mosaicplot(xtabs(~ Gender + RNA.Test, data = dat),
           col = c("firebrick", "goldenrod1"),
           xlab = "Gender",
           ylab = "HCV RNA Test Result",
           main = "Association Plot 1")
# Select the three variables directly instead of attaching dat1
dfm <- dat1[, c("Gender", "Site.Name", "RNA.Test")]
The association plot below remains more interpretable once several variables are added and includes the Pearson residuals for each association. While this is the ideal plot to deliver the envisioned story, the first ggplot provides the viewer with a clear visualization of the presented associations.
library(vcd)  # provides assoc()
assoc(dfm, shade = TRUE, legend = TRUE, main = "Association Plot Improved 2")
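To make the shading in the association plot easier to read, the Pearson residual for a cell of a two-way table can be computed by hand as (observed - expected) / sqrt(expected), where the expected count under independence is row total times column total divided by the grand total. The sketch below uses a made-up 2 x 2 table of counts, not the study data, and checks the result against the residuals reported by chisq.test().

```r
# Made-up 2 x 2 table of counts (not the study data)
tab <- matrix(c(30, 20, 10, 40), nrow = 2, byrow = TRUE,
              dimnames = list(Gender = c("0", "1"),
                              RNA.Test = c("Neg", "Pos")))

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)

# Pearson residual for each cell
pearson <- (tab - expected) / sqrt(expected)

# chisq.test() reports the same residuals (no continuity correction)
builtin <- chisq.test(tab, correct = FALSE)$residuals

print(round(pearson, 2))
```

Cells with large positive residuals have more observations than independence would predict, and cells with large negative residuals have fewer.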