The data used for this activity consist of de-identified Hepatitis C patient encounter records. Patients were screened at a local health facility within the Paso Del Norte region. The dataset comprises 564 entries and 17 variables. A quick look at the data shows several questionable inputs; these will require re-coding to be more usable when creating a data visualization.
Adjustments were made to the race and ethnicity variables. These variables were collapsed into one, resulting in a joint variable (e.g. White_non_His). In addition, gender (Male and Female) was re-coded as 0 and 1. Several of the entered values combine an alphabetic and a numeric response (0-No / 1-Yes); these must be re-coded to numeric values only. All variable responses with similar characteristics should be addressed in the same way. As a final step in reducing ambiguity within the dataset, several of the levels were collapsed, limiting those variables to a positive or negative response only.
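The re-coding steps described above could be sketched roughly as follows. This is a minimal illustration on a toy data frame, not the code used for the assignment: the column names (Race, Ethnicity, Gender, IV.Drug.Use) and their values are assumptions, and the Male = 0 / Female = 1 mapping is one reading of the text.

```r
# Toy stand-in for the screening data; all column names and values are hypothetical.
toy <- data.frame(
  Race        = c("White", "White", "Black", "Asian"),
  Ethnicity   = c("Non-Hispanic", "Hispanic", "Non-Hispanic", "Non-Hispanic"),
  Gender      = c("Male", "Female", "Female", "Male"),
  IV.Drug.Use = c("0-No", "1-Yes", "1-Yes", NA),
  stringsAsFactors = FALSE
)

# Collapse race and ethnicity into one joint variable (e.g. White_non_His).
eth_short <- ifelse(toy$Ethnicity == "Hispanic", "His", "non_His")
toy$Race_Eth <- paste(toy$Race, eth_short, sep = "_")

# Re-code gender as numeric (assumed mapping: Male = 0, Female = 1).
toy$Gender <- ifelse(toy$Gender == "Male", 0L, 1L)

# Keep only the leading digits of mixed responses such as "0-No" / "1-Yes".
# Missing responses stay NA.
toy$IV.Drug.Use <- as.integer(sub("^([0-9]+).*$", "\\1", toy$IV.Drug.Use))

print(toy)
```

The same `sub()` pattern can be applied to every column that mixes a numeric code with a text label, so that similar responses are handled the same way.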
A closer look at the full dataset also shows several missing values (NAs). To address this issue, all missing values are first counted. The missing values are then replaced (imputed) with the Multivariate Imputation by Chained Equations (MICE) function. MICE populates each missing value (NA) by predicting it from the other variables captured within the data set. Both steps can be done with the (R Code 1) provided below.
(R Code 1)
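A rough sketch of what such a missing-value count and imputation might look like is given here. It runs on a toy data frame with invented columns (age, alt, ast) and uses predictive mean matching with five imputations, which are assumptions and not settings taken from the assignment. The imputation step runs only if the mice package is installed.

```r
# Toy numeric data with a few missing values; the columns are hypothetical.
set.seed(123)
toy <- data.frame(
  age = round(rnorm(30, mean = 50, sd = 10)),
  alt = round(rnorm(30, mean = 40, sd = 8)),
  ast = round(rnorm(30, mean = 35, sd = 6))
)
toy$alt[c(3, 11, 20)] <- NA
toy$ast[c(7, 25)] <- NA

# Count missing values per variable and overall.
na_per_column <- colSums(is.na(toy))
na_total <- sum(is.na(toy))
print(na_per_column)

# Impute with mice, if the package is available.
if (requireNamespace("mice", quietly = TRUE)) {
  # m = 5 imputed data sets, predictive mean matching (assumed settings).
  imp <- mice::mice(toy, m = 5, method = "pmm", seed = 123, printFlag = FALSE)
  # Extract the first of the m completed data sets.
  toy_complete <- mice::complete(imp, 1)
  print(colSums(is.na(toy_complete)))
}
```

Counting first makes it possible to judge how much of each variable would be imputed before deciding whether imputation is appropriate for it.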
As a first, low-level check, I generally run the basic visualization and plot commands in R. These commands provide a visual representation of each level, making any concerns or abnormalities in a variable visible. Any noted abnormalities are addressed at this point, and the process restarts at step 1. From the boxplot below, we can see that the levels have some outliers, but this is acceptable.
library(ggplot2)
# Boxplot of client IDs grouped by site
ggplot(dat1, aes(x = Client.ID..Unique..., y = Site.Name)) +
  geom_boxplot()
This assignment evaluated five health centers (N=5) and the flow of patients that visited each site. Tufte’s principles will be used to produce a well-accepted data visualization (DataViz).
The story to be told: with reference to the data set presented above, the DataViz will focus on showing the distribution of patients among the five local health centers over the first three years of the program’s implementation. The bar plot provided shows the distribution of HCV patients per health center. Here we see that the majority of the patients were screened around the midpoint of the program (June 2019). While this is acceptable, more can be done to show the distribution of patients per site.
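A bar plot of this kind could be produced along the following lines. This is only a sketch on invented data: the site names, the Visit.Date column, and the split by calendar year are assumptions made for illustration, not the code behind the plot described above. The plotting step runs only if ggplot2 is installed.

```r
# Toy stand-in for the encounter data; site names and dates are invented.
toy <- data.frame(
  Site.Name  = c("Site A", "Site A", "Site B", "Site C", "Site C", "Site C"),
  Visit.Date = as.Date(c("2018-07-02", "2019-06-11", "2019-06-20",
                         "2019-06-05", "2020-01-15", "2020-09-30"))
)

# Patients screened per site, and per site and calendar year.
per_site <- table(toy$Site.Name)
toy$Year <- format(toy$Visit.Date, "%Y")
per_site_year <- table(toy$Site.Name, toy$Year)
print(per_site_year)

# Bar plot of counts per site, with one bar per year (needs ggplot2).
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  b <- ggplot(toy, aes(x = Site.Name, fill = Year)) +
    geom_bar(position = "dodge") +
    labs(title = "HCV Patients Screened per Health Center",
         x = "Health center", y = "Number of patients") +
    theme_classic()
  # Call print(b) to draw the plot.
}
```

Because `geom_bar()` counts rows and draws bars from zero, the bar heights stay proportional to the number of patients at each site.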
First
To achieve “Graphical Integrity” (GI), all variations/manipulations to the dataset have been documented and articulated in this publication/presentation. All data formatting steps are presented along with their corresponding code. In keeping with GI, all scales are clearly legible (i.e. margins are labeled and scales start from 0).
Second
Specific attention was paid to the layering of the variables. Specifically, several graphic devices were used to separate the categories. In addition, the layers were applied hierarchically to reduce confusion in the story being told. For this assignment, the graphs were layered to show HCV incidence per site in a hierarchical manner. Close attention was paid to the color schemes and scale, with extra emphasis on depicting small changes within the data. As the assignments progress, the focus will shift to the “Parallelism” of the DataViz. Specifically, isomorphism and visual juxtapositions that reveal connections will also be layered in.
To avoid “Chart-junk”, a cookbook approach to DataViz was first used to develop sound graphs. Once a graph was developed, a more inventive/liberal approach was applied, focusing specifically on reducing visual noise that may disrupt the story being told.
To improve on the boxplot, each site was first color-coded. A layer of jittered points was then added so that the distribution of the patients at each site is visible.
Visualizing Distribution Improvement (Box Plot)
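# Boxplot per site, filled by site, with outliers marked as red diamonds
d <- ggplot(dat1, aes(x = Client.ID..Unique..., y = Site.Name, fill = Site.Name)) +
  geom_boxplot(outlier.colour = "red", outlier.shape = 5, outlier.size = 2) +
  theme_classic() +
  labs(title = "Plot of Patient Flow Per Site", x = "Patients Seen at Clinic")
# Overlay the individual records as jittered points
d + geom_jitter(shape = 16, position = position_jitter(0.5))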
To develop this plot further, a violin layer was added to show how the distribution of patient flow changes over the life of the program. This final version is best suited to show the distribution of patients at each site over the life of the program.
# Jittered points, violins, and a narrow boxplot per site
v <- ggplot(dat1, aes(x = Client.ID..Unique..., y = Site.Name, fill = Site.Name)) +
  geom_jitter(shape = 16, position = position_jitter(0.5)) + geom_violin() +
  geom_boxplot(width = 0.1, color = "grey", alpha = 0.2, outlier.colour = "red", outlier.shape = 5, outlier.size = 2)
# Limit the x-axis and set the title and axis label
v + xlim(0, 600) + labs(title = "Plot of Patient Flow Per Site", x = "Patients Seen at Clinic")