The data used for this activity consist of de-identified Hepatitis C patient encounter records. Patients were screened at a local health facility within the Paso Del Norte region. The dataset comprises 564 entries and 17 variables. A quick look at the data shows several questionable inputs, which will require re-coding before they can be used in a data visualization.
Adjustments were made to the race and ethnicity variables. These were collapsed into a single joint variable (e.g. White_non_His). In addition, gender (Male and Female) was re-coded as 0 and 1. Several of the entered values combine an alphabetic and a numeric response (0-No / 1-Yes); these must be re-coded to numeric values only. All variables with similar response formats should be handled the same way. As a final step in addressing ambiguity within the dataset, several levels were collapsed so that each affected variable takes a positive or negative response only.
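The re-coding steps described above could look roughly like the following sketch. The data frame `dat_demo`, its column names, and its values are hypothetical stand-ins, since the real variable names are not listed at this point; only base R is used.

```r
# Toy data standing in for the real records; all names and values are assumed.
dat_demo <- data.frame(
  Race      = c("White", "White", "Black", "White"),
  Ethnicity = c("non_His", "His", "non_His", "His"),
  Gender    = c("Male", "Female", "Female", "Male"),
  RNA.Pos   = c("0-No", "1-Yes", "1-Yes", "0-No"),
  stringsAsFactors = FALSE
)

# Collapse race and ethnicity into one joint variable, e.g. "White_non_His".
dat_demo$Race_Eth <- paste(dat_demo$Race, dat_demo$Ethnicity, sep = "_")

# Re-code gender: Male = 0, Female = 1.
dat_demo$Gender_num <- ifelse(dat_demo$Gender == "Male", 0L, 1L)

# Keep only the leading digit of mixed responses such as "0-No" / "1-Yes".
dat_demo$RNA.Pos_num <- as.integer(sub("^([0-9]+).*$", "\\1", dat_demo$RNA.Pos))
```

Writing the re-coded values to new columns, rather than overwriting the originals, keeps each transformation easy to check against the raw input.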
A closer look at the full dataset also shows several missing values (NAs). To address this, all missing values were first counted. The missing values were then replaced (imputed) using the Multivariate Imputation by Chained Equations (“MICE”) function. MICE fills in each missing value (NA) with a value predicted from the other variables recorded in the dataset. Both steps can be done with R Code 1, provided below.
(R Code 1)
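The original code for this step is not reproduced here, so the following is only a minimal sketch of how the missing-value count and the imputation might be done. The toy data frame `dat_demo`, the choice of five imputations (`m = 5`), the predictive mean matching method (`"pmm"`), and the seed are all assumptions, not the author's settings. The imputation runs only if the `mice` package is installed.

```r
# Toy data with gaps; the real data frame and its columns are assumed.
set.seed(1)
n <- 30
dat_demo <- data.frame(
  Age    = round(rnorm(n, mean = 45, sd = 10)),
  Gender = rbinom(n, 1, 0.5),
  RNA    = rbinom(n, 1, 0.4)
)
dat_demo$Age[c(2, 9)]      <- NA
dat_demo$RNA[c(5, 17, 21)] <- NA

# Count missing values per variable and overall.
na_per_var <- colSums(is.na(dat_demo))
na_total   <- sum(is.na(dat_demo))

# Impute with MICE only if the package is available (settings are assumptions).
if (requireNamespace("mice", quietly = TRUE)) {
  library(mice)
  imp <- mice(dat_demo, m = 5, method = "pmm", seed = 123, printFlag = FALSE)
  dat_imputed <- complete(imp, 1)  # first of the five completed data sets
}
```

Counting the NAs before imputing gives a baseline, so the completed data can be checked to confirm that no missing values remain.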
As a low-level visualization, a heatmap was produced for this assignment. This plot provides a visual representation of each level and makes any concerns or abnormalities in a variable visible. Any abnormalities noted are addressed at this point, and the process restarts at step 1. From the heatmap below, we can see relationships between the variables.
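library(gplots)  # provides heatmap.2()
# Convert the data frame to a numeric matrix, as heatmap.2() requires
df <- data.matrix(dat1)
heatmap.2(df)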
This assignment evaluated five health centers (N=5) and the flow of patients who visited each site. Tufte’s principles are used to produce a well-accepted data visualization (DataViz).
The story to be told: with reference to the dataset described above, the DataViz focuses on showing the proportion of patients among the five local health centers over the first three years of the program’s implementation. The amended heat plot shows the proportion of HCV patients per health center. Here we see that the majority of the patients screened were male, White Hispanic, and seen at the Alameda health center.
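The per-site proportions behind this story can be tabulated directly before any plotting. This is a small sketch using base R; the `site` vector is toy data, and apart from "Alameda" the site names are placeholders rather than the actual health centers.

```r
# Toy site labels; only "Alameda" comes from the text, the rest are placeholders.
site <- c("Alameda", "Alameda", "Alameda", "Site_B", "Site_B", "Site_C")

# Count patients per site, then convert the counts to proportions.
site_counts <- table(site)
site_props  <- prop.table(site_counts)

# Proportions sum to 1; round for display.
round(site_props, 2)
```

A table like this gives a numeric check on what the plot appears to show, such as which site accounts for the largest share of patients.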
First
To achieve “Graphical Integrity” (GI), all variations/manipulations to the dataset have been documented and articulated in this publication/presentation. All data formatting steps are presented along with their corresponding code. In keeping with GI, all scales are clearly legible (i.e. margins are labeled and scales start from 0).
Second
Specific attention was paid to the layering of the variables. In particular, several graphic devices were used to separate the categories. In addition, the layers were applied hierarchically to reduce confusion in the story being told. For this assignment, the graphs were layered to show HCV incidence per site in a hierarchical manner. Close attention was paid to color schemes and scale, with extra emphasis on depicting small changes within the data. As the assignments progress, the focus will shift to the “Parallelism” of the DataViz. Specifically, isomorphism and visual juxtapositions that reveal connections will also be layered in.
To avoid “Chart-junk”, a cookbook approach to DataViz was first used to develop sound graphs. Once a graph was developed, a more inventive/liberal approach was applied, focusing specifically on reducing visual noise that may disrupt the story being told.
To improve on the heat plot, each gender was first assigned its own color. This allowed the proportion of patients screened to be visualized by gender, race, and site.
Visualizing Proportion Improvement (Heat Plot)
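library(reshape2)  # provides melt()
library(ggplot2)
dat_m <- melt(dat)
## Using Site.Name, Date, Race, Gender, RNA.Pos.1st.App., RNA.Pos..No.1st.App. as id variables
# One tile per site/race combination, filled by gender
ggplot(data = dat_m, aes(x = Site.Name, y = Race, fill = Gender)) +
  geom_tile()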
A stacked bar plot was developed as an alternative to the heat plot. While this plot provides more information, the heat plot is easier to read. For this reason, the best plot to show the proportion of patients at each site over the life of the program is the amended heat map.
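library(ggplot2)
library(viridis)     # provides scale_fill_viridis()
library(hrbrthemes)  # provides theme_ipsum()
# Stacked bars per health center, filled by race
ggplot(dat, aes(fill = Race, y = Gender, x = Site.Name)) +
  geom_bar(position = "stack", stat = "identity") +
  scale_fill_viridis(discrete = TRUE) +
  ggtitle("Proportion of Patients Per Health Center") +
  theme_ipsum() +
  xlab("Health Center") + ylab("Gender") +
  scale_y_discrete(guide = guide_axis(n.dodge = 3)) +
  coord_flip()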