Review & Conclusion

Congratulations! You’ve officially completed this workshop on applied statistics and data visualization in spreadsheet software. Let’s check out a quick review of what you’ve learned.





Ch. 1: Review & Preparation

Recall the order of operations, or PEMDAS (Please Excuse My Dear Aunt Sally). Formulas in Excel and Sheets follow the rules of PEMDAS, too!


Data tables consist of tabular data, or data with:

  • Horizontal rows (i.e. observations, instances, cases, etc.)

  • Vertical columns (i.e. variables, fields, features, etc.).


Spreadsheet software is designed specifically for tabular data. You can specify rows (e.g. 10:20), columns (e.g. B:D), and specific cells (e.g. D4) and sections of tables, called ranges (e.g. B4:D10).


Use formulas to operate on multiple values at once, like B:B + 25. Functions can be used to perform mathematical or statistical operations on one or more cells (e.g. = AVERAGE(C2:C20)).



Dragging a cell’s “handle” to apply a formula to multiple cells.





Ch. 2: Data & Formatting

Quantitative data consist of quantifiable numbers and qualitative data consist of names, categories, and text.


In spreadsheet software, common kinds of data include:

  • Character data are qualitative text

  • Numeric data are quantitative numbers

  • Logical data represent binary TRUE or FALSE values

  • Date time data represent points in time


In statistics, common classifications of data include:

  • Discrete values can be counted but not divided or measured

  • Continuous values can be divided or measured, but not counted

  • Categorical data can be said to be categories, groups, or classifications

  • Ordinal data are categories or classifications with an inherent order


In spreadsheet software, you must format each column appropriately so that a program like Excel or Sheets understands how to use these data.



Mathematically operating on date time variables.





Ch. 3: Introductory Statistics

The most basic statistical analysis is the act of summarizing or interpreting data. These create summary or descriptive statistics. Common statistics include:

  • Average or Mean: The mid-point of combined values.

  • Median: The mid-point of combined observations.

  • Minimum and Maximum: The smallest and largest values.

  • First and Third Quartiles: Values at 25% and 75%, or the 25th and 75th percentiles.


These last five summary statistics create the five number summary and each has an associated function for finding them (e.g. = MEDIAN(A2:A20)).



The mean, or average, calculated with function AVERAGE().





Ch. 4: Data Visualization

Data visualization is the visual representation of data. We use it to explore new data, or exploratory analysis, as well as to explain data, or explanatory analysis.


Often, we must manipulate data to visualize them. This may involve cleaning, arranging, or summarizing with functions. Creating a pivot table makes this much easier.


We can use Excel or Sheets to automatically create visualizations, then we can customize or refine their chart elements for polish and clarity.



Automatically creating a visualization in Excel.





Ch. 5: Best Practices

The best visualization to use often depends on the type of data we want to show.

  • Histograms show the “spread” of a single numeric variable

  • Scatter plots are used to represent two numeric variables

  • Bar charts show one numeric and one categorical variable

  • Line graphs are best for numeric data over time


We can represent a third variable using aesthetics like color, size, or shape. Some aesthetics are objectively better than others.


It’s important to take the time to polish and refine visualizations.

  • Visual hierarchy ensures that data and elements do not compete for attention

  • Legibility ensures clean, easy-to-read text

  • Chartjunk is visual clutter that are unnecessary and obfuscate clarity

  • Meaning ensures that everything in your chart is significant to the viewer

  • Labeling provides contextualizing information to support your data overall

  • Conventions leverage “common sense” to make visualizations easier to interpret

  • Respect your audience’s time and intelligence to avoid visual deception

  • Annotation is added contextualizing information for a specific data point



Decluttering a visualization by removing chartjunk.





Final Thoughts

Before we conclude, let’s leave you with some data analytic musings to marinate.




Challenging Decisions

There’s something to be said about the data you’ve used to practice these techniques. Most analyses require some summary decision(s) on the part of the analyst, her- or himself, to process (i.e. clean and manipulate) data before looking for insights. This can have a significant effect on insights gleaned from an analysis.


For example, you may have noticed that in “The Largest Vocabulary in Hip Hop” (Daniels, 2019), Notorious B.I.G. isn’t in the data set, despite “Juicy” being the top-rated track in “The Greatest Hip-Hop Songs of All Time” (Brown, 2020). Why? Well, as the author notes:


Quite a few rappers don’t have enough official material to be included


“Biggie” is the first artist the author provides as an example. Due to the cutoff at 35,000 words, we exclude, arguably, one of the greatest hip hop artists of all time - likely because Christopher Wallace, a.k.a. Biggie Smalls, passed away at the age of 24.




Challenging Methods

In other cases, a “bag of words” analysis on lyrical diversity can have different shortcomings. Must lyrics be in English to “count” as unique? Or are out-of-dictionary words ignored? What about foreign words? For example:


Who’s the one you call Mr. Macho the head honcho?


…asked by The Notorious B.I.G. in “Dead Wrong”. Here, hanchō comes from the Japanese words for “group” (han) and “leader” (chō). Similarly, Afrika Bambaataa’s “Planet Rock” also uses Japanese:


Everybody say, ‘ichi, ni, san, shi’


That is, “one, two, three, four”. If, indeed, only English words are excluded, this makes hip hop artists like Immortal Technique all the more impressive, since a number of his tracks combine English and Spanish, e.g. “No Me Importa” from his debut album, Revolutionary Vol. 1.


Moreover, what of heavily modified words, alliteration, or references, frequently used by artists like Snoop Dogg or Das EFX, e.g. in “They Want EFX”?


Bum stiggedy bum stiggedy bum, hon, I got the old pa-rum-pum-pum-pum.


Are these nonsense words? Not really - as “pa-rum-pum-pum-pum” alludes to “Little Drummer Boy” - an alliteration repeated by Big Punisher in “Whatcha Gonna Do?”.


For better or worse, you can ply data analytics not only to challenge conclusions but to challenge the analysis and source data, as well. Feel free to find the potential flaws or shortcomings in our practice data and reach your own conclusions.




Key Takeaways

Before parting, it would be valuable to summarize some important takeaways.

  1. Beneath every great visualization is a robust, accurate, well-formatted data set

  2. Like visualization itself, statistics concisely summarize your data

  3. Great data visualization conveys one or more “big ideas”; discuss ideas, not visualizations

  4. It doesn’t hurt to be easy on the eyes, but visualization must be efficient - not pretty

  5. Just ten minutes of polish can vastly improve a visualization’s clarity

  6. Visualization is always persuading, regardless of how objective you try to remain


Lastly, remember that visualization not only represents data, it represents you, your work, and the degree of respect you have for your audience.