The data used for this activity consist of de-identified Hepatitis C patient encounter records. Patients were screened at a local health facility within the Paso Del Norte region. The dataset contains 564 entries and 17 variables. A quick look at the data shows several questionable inputs, which must be re-coded before the data can be used for a data visualization (DataViz).
The race and ethnicity variables were collapsed into a single joint variable (e.g. White_non_His). In addition, gender (Male and Female) was re-coded as 0 and 1. Several variables contain values that mix a numeric code with a text label (0-No / 1-Yes); these must be re-coded to numeric values only, and all variables with similar responses should be handled the same way. As a final step to reduce ambiguity in the dataset, several levels were collapsed so that the affected variables take only a positive or a negative response.
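The re-coding steps described above can be sketched in base R as follows. This is only an illustration under stated assumptions: the data frame `toy`, the columns `Race`, `Ethnicity` and `Race_Eth`, and the level labels are hypothetical stand-ins, and the mapping Male = 0, Female = 1 simply follows the order in which the values are listed in the text.

```r
# Hypothetical toy records; column names and labels are assumptions, not the real data
toy <- data.frame(
  Race      = c("White", "White", "Black", "White"),
  Ethnicity = c("non_His", "His", "non_His", "non_His"),
  Gender    = c("Male", "Female", "Female", "Male"),
  RNA.Test  = c("0-No", "1-Yes", "1-Yes", "0-No"),
  stringsAsFactors = FALSE
)

# Collapse race and ethnicity into one joint variable, e.g. "White_non_His"
toy$Race_Eth <- paste(toy$Race, toy$Ethnicity, sep = "_")

# Re-code gender as 0/1 (assumed: Male = 0, Female = 1); unexpected values become NA
toy$Gender <- match(toy$Gender, c("Male", "Female")) - 1L

# Keep only the leading digits of mixed responses such as "0-No" / "1-Yes"
toy$RNA.Test <- as.integer(sub("^([0-9]+).*$", "\\1", toy$RNA.Test))

str(toy)
```

Using `match()` rather than `ifelse()` means a misspelled gender entry surfaces as `NA` instead of silently falling into one of the two codes.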
A closer look at the full dataset also shows several missing values (NAs). To address this, all missing values are first counted, and they are then replaced (imputed) using Multivariate Imputation by Chained Equations (MICE). MICE fills in each missing value (NA) with a value predicted from the other variables recorded in the dataset. Both steps can be carried out with R Code 1, provided below.
(R Code 1)
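As a rough illustration of the two steps (counting the NAs, then imputing them), a minimal sketch might look like the following. It is not the original R Code 1: the data frame `toy` and its columns are hypothetical, the imputation uses the `mice` package with its default methods, and the `seed` value is arbitrary. The imputation step is skipped if `mice` is not installed.

```r
# Hypothetical toy data with a few values deliberately set to NA
set.seed(1)
toy <- data.frame(
  age      = round(rnorm(40, mean = 50, sd = 10)),
  alt      = round(rnorm(40, mean = 60, sd = 15)),
  rna_test = rbinom(40, size = 1, prob = 0.4)
)
toy$age[c(3, 17)]   <- NA
toy$alt[c(5, 8, 21)] <- NA

# Step 1: count the missing values in each variable
na_counts <- colSums(is.na(toy))
print(na_counts)

# Step 2: impute with MICE (only if the mice package is available)
if (requireNamespace("mice", quietly = TRUE)) {
  imp <- mice::mice(toy, m = 5, seed = 123, printFlag = FALSE)
  toy_complete <- mice::complete(imp, 1)  # first of the five imputed datasets
  print(colSums(is.na(toy_complete)))     # check how many NAs remain
}
```

Note that `mice()` produces several imputed datasets (`m = 5` here); taking only the first one, as above, is a simplification for plotting purposes.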
As a low-level visualization, a heatmap was produced for this assignment. The plot gives a visual representation of each level of each variable, making any concerns or abnormalities in a variable easy to spot. Any abnormalities noted at this point are addressed, and the process restarts at step 1. The heatmap below shows the relationships between the variables.
plot(dat)
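One possible way to draw a heatmap of the relationships between variables is to plot their pairwise correlations with base R's `heatmap()`. The sketch below is an assumption about how such a plot could be built, not the method used above; the data frame `toy` and its columns are hypothetical, and it presumes all variables have already been re-coded to numeric values.

```r
# Hypothetical numeric toy data standing in for the re-coded dataset
set.seed(42)
toy <- data.frame(
  age      = rnorm(30, mean = 50, sd = 10),
  gender   = rbinom(30, size = 1, prob = 0.5),
  rna_test = rbinom(30, size = 1, prob = 0.4)
)
# A variable constructed to be related to rna_test
toy$antibody <- pmax(toy$rna_test, rbinom(30, size = 1, prob = 0.3))

# Pairwise correlations, ignoring missing values pair by pair
cor_mat <- cor(toy, use = "pairwise.complete.obs")

# Draw the matrix as a heatmap without reordering or rescaling
heatmap(cor_mat, Rowv = NA, Colv = NA, symm = TRUE, scale = "none",
        col = hcl.colors(20, "Blue-Red"),
        main = "Pairwise correlations (toy data)")
```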
This assignment evaluated patient flow at five health centers (N=5). Tufte’s principles will be applied to produce a clear and credible data visualization (DataViz).
The story to be told: using the dataset presented above, the DataViz focuses on the trends in patients treated over the first three years of the program’s implementation. Specifically, the plot shows patient flow from 2017 to 2019 and the trends in RNA diagnoses.
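Before plotting, the yearly trend can be checked numerically. The following is a minimal sketch, assuming a `Date` column and a 0/1 `RNA.Test` column as in the plotting code; the records in `toy` are invented for illustration only.

```r
# Hypothetical encounter records (dates and results are made up)
toy <- data.frame(
  Date     = as.Date(c("2017-03-14", "2017-09-02", "2018-01-20",
                       "2018-06-11", "2018-11-30", "2019-04-05")),
  RNA.Test = c(1, 0, 1, 1, 0, 1)
)

# Extract the calendar year of each encounter
toy$Year <- format(toy$Date, "%Y")

# Cross-tabulate encounters by year and RNA test result
visits_by_year <- table(Year = toy$Year, RNA.Test = toy$RNA.Test)
print(visits_by_year)

# Share of RNA-positive results within each year
positive_share <- prop.table(visits_by_year, margin = 1)[, "1"]
print(round(positive_share, 2))
```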
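library(ggplot2)
# Convert columns with the generic functions rather than calling S3 methods directly
dat1$Gender <- as.character(dat1$Gender)
dat1$RNA.Test <- as.factor(dat1$RNA.Test)
dat1$Date <- as.Date(dat1$Date)
# Point plot of client ID over time, coloured by RNA test result (shape 2 = open triangle)
p <- ggplot(dat1) +
  geom_point(aes(x = Date, y = Client.ID..Unique..., fill = RNA.Test, color = RNA.Test, group = Gender), shape = 2)
p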
To achieve “Graphical Integrity” (GI), all variations and manipulations applied to the dataset have been documented and articulated in this publication/presentation. All data formatting steps are presented along with their corresponding code. In keeping with GI, all scales are clearly legible (i.e. margins are labeled and scales start from 0).
Specific attention was paid to the layering of the variables: several graphic devices were used to separate the categories, and the layers were applied hierarchically to reduce confusion in the story being told. For this assignment, the graphs were layered to show HCV incidence per site in a hierarchical manner. Close attention was paid to the color schemes and scale, with extra emphasis on depicting small changes within the data. As the assignments progress, the focus will shift to the “Parallelism” of the DataViz. Specifically, isomorphism and visual juxtapositions that reveal connections will also be layered in.
To avoid “Chart-junk”, a cookbook approach to DataViz was first used to develop sound graphs. Once a graph was developed, a more inventive and liberal approach was applied, focusing specifically on reducing visual noise that may disrupt the story being told.
To improve on the point plot, each axis was first given a title. In addition, the scale on the y-axis was set to run from 0 to 600 to better fit the total population of 564 entries. This allowed the trends per year to be visualized properly.
p + ylim(0, 600) + xlab("Time (years)") + ylab("Patient Count")
Within the final plot, a grid was added that subdivides each year into halves (quarters 1-2 and quarters 3-4). A fitted linear trend line (dashed, with its confidence band) was added to the plot, making it possible to identify several outliers that lie far from the line. This final plot gives a better view of the dataset and a better representation of the outliers identified in the graph. Of the plots produced here, it is the best for visualizing the trends and the uncertain outliers in this dataset.
ps <- p + ylim(0, 600) + xlab("Time (years)") + ylab("Patient Count") + theme_bw()
ps + geom_smooth(mapping = aes(x = Date, y = Client.ID..Unique...), data = dat1, method = lm, formula = y ~ x,
                 se = TRUE, fullrange = TRUE, linetype = "dashed") +
  ggtitle("Visualizing Trends")
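The outliers seen by eye around the dashed trend line can also be flagged numerically. The sketch below fits the same kind of linear model with base R's `lm()` and marks points whose standardized residual exceeds 2 in absolute value. Both the toy data (including the column name `Client.ID`) and the cutoff of 2 are assumptions chosen for illustration, not values taken from the analysis above.

```r
# Hypothetical monthly records with a roughly linear trend plus noise
set.seed(7)
toy <- data.frame(
  Date = seq(as.Date("2017-01-01"), as.Date("2019-12-01"), by = "month")
)
toy$Client.ID <- seq_len(nrow(toy)) * 15 + rnorm(nrow(toy), sd = 5)

# Inject two artificial outliers so there is something to detect
toy$Client.ID[c(10, 25)] <- toy$Client.ID[c(10, 25)] + c(150, -150)

# Linear fit of client ID against time (dates converted to day counts)
fit <- lm(Client.ID ~ as.numeric(Date), data = toy)

# Flag points far from the fitted line (assumed cutoff: |standardized residual| > 2)
toy$std_resid <- rstandard(fit)
toy$outlier   <- abs(toy$std_resid) > 2

print(toy[toy$outlier, c("Date", "Client.ID", "std_resid")])
```

A different cutoff, or a robust fit, may be more appropriate for the real data; the point is only that the visual judgment can be backed by a reproducible rule.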