Goal: communicate the most frequently used words in the corpus to an audience of bloggers with no particular background in statistics. Ideally, the people who contributed to this corpus would be able to read this graph to learn more about the project they contributed to.
Draft 1: Shows frequencies of the 15 most frequently occurring words. The words do not appear in any order, and there isn’t a lot of aesthetic appeal.
Draft 2: The 15 most common words are now ordered from most to least common to improve readability. I added some color, changed the theme, and removed some gridlines to improve aesthetic appeal, and removed the space between the words and the bars to help with visual flow.
Final version: I added more words so that some of the words that appear only in the charts for RQ2 appear here as well, as a preview of the next visualizations. I also added age group to the variables presented here, to show how popular each word is among each age bracket, also as a nod to what will be shown in the visualizations for RQ2 and RQ3. We can see that people in their 20s are using common words more often than people in their teens (who make up a similar proportion of the corpus) and people over 30 (who make up a smaller proportion of the corpus).
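The Draft 2 changes can be sketched roughly as follows. This is a minimal illustration, not the code actually used: the data frame word_counts, its columns word and n, the counts, the fill color, and the theme are all assumptions made for the example.

```r
library(ggplot2)

# Hypothetical counts, for illustration only (not taken from the corpus)
word_counts <- data.frame(
  word = c("time", "like", "people", "day", "know"),
  n    = c(120, 310, 95, 150, 240)
)

# reorder() sorts the factor levels by n (ascending), so after
# coord_flip() the most common word ends up at the top of the chart
word_counts$word <- reorder(word_counts$word, word_counts$n)

p <- ggplot(word_counts, aes(x = word, y = n)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  # Start the bars at the axis, removing the gap between words and bars
  scale_y_continuous(expand = expansion(mult = c(0, 0.05))) +
  theme_minimal() +
  # Drop some gridlines to reduce clutter
  theme(panel.grid.major.y = element_blank(),
        panel.grid.minor   = element_blank()) +
  labs(x = NULL, y = "Frequency")
```

Printing p draws the chart. Ordering the factor levels, rather than sorting the rows, is what controls bar order in ggplot2.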
Goal: communicate the relationship between word count and blogger age, including words that are used by all groups and words specific to one or two age groups.
Draft 1: These were basic bar charts with flipped axes displaying word frequency by age group. Although they contained the most important information I wanted to include, there was a lot to improve upon, for example by adding scales (e.g. a fill by age group) and taking out a few under-informative words.
Draft 2: I began by creating new palettes in order to distinguish words used by all age groups, words used by only one age group, and words used by two age groups. I checked whether the palettes were colorblind safe, and the first and third plots seemed okay. However, I was left feeling that I could synthesize this information into a single plot with a better color palette, and decided it made the most sense to use a gradient of a single color for the final plot; this was also part of the feedback we received during our presentation.
Final version: For the final plot, I decided to combine the 3 age groups into a single figure (originally using ggarrange(), then switching to the patchwork package) while keeping the axes separate, rather than creating a single figure with the same y-axis for all age groups. I also checked whether this figure was colorblind safe, and it performed better than all previous figures for protanopia, deuteranopia, and desaturated (black/white) conditions. While combining the legends might have added cohesiveness, I had trouble doing this in a way that made sense, and ended up keeping an individual legend for each age group because I felt it reduced the cognitive load of interpreting fill within the individual figures. However, because the formatting for the dashboard didn’t turn out perfectly, I also saved a JPEG of the figure to our GitHub repository.
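A minimal sketch of this kind of layout with patchwork is shown below. The data, age-group labels, gradient colors, and the helper make_panel() are hypothetical and only illustrate the approach of one panel per age group with separate axes and legends.

```r
library(ggplot2)
library(patchwork)

# Hypothetical per-age-group word counts, for illustration only
counts <- data.frame(
  age_group = rep(c("Under 20", "20s", "30+"), each = 3),
  word      = c("school", "like", "time",
                "work", "like", "time",
                "kids", "work", "time"),
  n         = c(80, 60, 40, 90, 70, 50, 45, 40, 30)
)

# Build one panel per age group; each keeps its own axes and legend
make_panel <- function(group) {
  d <- counts[counts$age_group == group, ]
  ggplot(d, aes(x = reorder(word, n), y = n, fill = n)) +
    geom_col() +
    coord_flip() +
    # Gradient of a single hue, light to dark
    scale_fill_gradient(low = "#c6dbef", high = "#08306b") +
    labs(title = group, x = NULL, y = "Frequency", fill = "Count")
}

panels <- lapply(c("Under 20", "20s", "30+"), make_panel)

# wrap_plots() places the panels side by side; legends stay with
# their panels unless plot_layout(guides = "collect") is added
combined <- wrap_plots(panels, ncol = 3)

# Uncomment to save a JPEG copy (the file name is a placeholder)
# ggsave("word_frequency_by_age.jpeg", combined, width = 12, height = 4)
```

Because each panel is a separate ggplot object, the fill scale and axis range are computed per age group, which is what keeps the axes and legends independent.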
Goal: show the relative proportions of the most frequently addressed topics.
Draft 1: Created a pie chart using ggplot and coord_polar(theta = "y"), which showed the proportions relative to one another but did not include enough information (no percentages, and the labelling needed work).
Draft 2: I opted to use the pie() function because I knew how to add labels and percentages with this function, making the plot more informative and reducing cognitive load. I also changed the color palette based upon feedback from our peer review.
Final version: I used yet another package and set of functions, which elevated the plot aesthetically (although it prints a little funny, cutting off the far right text). I kept the color palette as it allows for the most contrast between groups, and believe this plot does the best job out of the 3 pie charts of displaying the most popular topics.
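As a rough sketch of how labels and percentages can be added with base R's pie(), assuming a named vector of topic counts (the topic names, counts, and palette here are placeholders, not values from the corpus):

```r
# Hypothetical topic counts, for illustration only
topic_counts <- c("Topic A" = 50, "Topic B" = 25, "Topic C" = 15, "Topic D" = 10)

# Convert counts to whole-number percentages of the total
pct <- round(100 * topic_counts / sum(topic_counts))

# Build slice labels such as "Topic A (50%)"
slice_labels <- paste0(names(topic_counts), " (", pct, "%)")

pie(topic_counts,
    labels = slice_labels,
    col    = hcl.colors(length(topic_counts), palette = "Dark 3"),
    main   = "Most common topics")
```

Putting the percentage in the label means readers do not have to estimate slice sizes by eye.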
Goal: communicate the relationship between date, word count, and blogger age (if any relationship exists) to an audience of people who might have contributed to the corpus.
Draft 1: Shows only total word count by date. This graph is somewhat misleading, because the word counts are affected by how many blog posts from the corpus were made on a given day: it is not necessarily that people suddenly started writing longer posts; there are just more posts being included after 2004. This visualization also does not have much visual appeal, and the y-axis is annoying to read.
Draft 2: I added age group to the plot to add some visual interest through color. The axes have also been changed to give some more spread to the data (without making the y-axis log10, the data are mostly squished down at the bottom, with only a few outliers reaching up to 10k or 100k words).
Final version: Since there was not much relationship between date posted and word count, I broke the visualization apart into three plots, one for each age group, and added age as a continuous variable within that to show more of the variety. What we can see now is that people in their 20s started making more blog posts at earlier dates (at least within this corpus) than people under 20 or over 30, suggesting that, if one were to look only at the early years of the dataset, 20-somethings would be overrepresented compared to the overall age breakdown of the dataset. I kept the trend lines for all three groups, although I think they are more informative in Draft 2, since it is harder to compare their slopes when they are not on the same plot.
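The effect of the log10 y-axis can be sketched as below. The data frame posts and its columns date, word_count, and age_group are hypothetical, with made-up values chosen so that a couple of very long posts would otherwise flatten the rest.

```r
library(ggplot2)

# Hypothetical posts, for illustration only (not taken from the corpus)
posts <- data.frame(
  date = as.Date(c("2003-06-01", "2003-11-12", "2004-02-03",
                   "2004-05-20", "2004-06-15", "2004-07-04",
                   "2004-07-22", "2004-08-10", "2004-08-11")),
  word_count = c(120, 45, 300, 15000, 800, 95, 60, 110000, 250),
  age_group  = rep(c("Under 20", "20s", "30+"), times = 3)
)

p <- ggplot(posts, aes(x = date, y = word_count, color = age_group)) +
  geom_point(alpha = 0.7) +
  # A log10 scale spreads out the many short posts instead of letting
  # a few 10k-100k-word outliers compress them against the axis
  scale_y_log10(labels = scales::label_comma()) +
  labs(x = "Date posted", y = "Word count (log10 scale)", color = "Age group")
```

A log scale requires strictly positive values, so any posts with a word count of zero would need to be filtered out or handled separately first.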