Congratulations! You’ve officially completed this workshop on applied statistics and data visualization in spreadsheet software. Let’s check out a quick review of what you’ve learned.
Recall the order of operations, or PEMDAS (Please Excuse My Dear Aunt Sally). Formulas in Excel and Sheets follow the rules of PEMDAS, too!
Data tables consist of tabular data, or data with:
Horizontal rows (i.e. observations, instances, cases, etc.)
Vertical columns (i.e. variables, fields, features, etc.).
Spreadsheet software is designed specifically for tabular data. You can specify rows (e.g. 10:20), columns (e.g. B:D), and specific cells (e.g. D4) and sections of tables, called ranges (e.g. B4:D10).
Use formulas to operate on multiple values at once, like B:B + 25. Functions can be used to perform mathematical or statistical operations on one or more cells (e.g. = AVERAGE(C2:C20)).
Dragging a cell’s “handle” to apply a formula to multiple cells.
Quantitative data consist of quantifiable numbers and qualitative data consist of names, categories, and text.
In spreadsheet software, common kinds of data include:
Character data are qualitative text
Numeric data are quantitative numbers
Logical data represent binary TRUE or FALSE values
Date time data represent points in time
In statistics, common classifications of data include:
Discrete values can be counted but not divided or measured
Continuous values can be divided or measured, but not counted
Categorical data can be said to be categories, groups, or classifications
Ordinal data are categories or classifications with an inherent order
In spreadsheet software, you must format each column appropriately so that a program like Excel or Sheets understands how to use these data.
Mathematically operating on date time variables.
The most basic statistical analysis is the act of summarizing or interpreting data. These create summary or descriptive statistics. Common statistics include:
Average or Mean: The mid-point of combined values.
Median: The mid-point of combined observations.
Minimum and Maximum: The smallest and largest values.
First and Third Quartiles: Values at 25% and 75%, or the 25th and 75th percentiles.
These last five summary statistics create the five number summary and each has an associated function for finding them (e.g. = MEDIAN(A2:A20)).
The mean, or average, calculated with function AVERAGE().
Data visualization is the visual representation of data. We use it to explore new data, or exploratory analysis, as well as to explain data, or explanatory analysis.
Often, we must manipulate data to visualize them. This may involve cleaning, arranging, or summarizing with functions. Creating a pivot table makes this much easier.
We can use Excel or Sheets to automatically create visualizations, then we can customize or refine their chart elements for polish and clarity.
Automatically creating a visualization in Excel.
The best visualization to use often depends on the type of data we want to show.
Histograms show the “spread” of a single numeric variable
Scatter plots are used to represent two numeric variables
Bar charts show one numeric and one categorical variable
Line graphs are best for numeric data over time
We can represent a third variable using aesthetics like color, size, or shape. Some aesthetics are objectively better than others.
It’s important to take the time to polish and refine visualizations.
Visual hierarchy ensures that data and elements do not compete for attention
Legibility ensures clean, easy-to-read text
Chartjunk is visual clutter that are unnecessary and obfuscate clarity
Meaning ensures that everything in your chart is significant to the viewer
Labeling provides contextualizing information to support your data overall
Conventions leverage “common sense” to make visualizations easier to interpret
Respect your audience’s time and intelligence to avoid visual deception
Annotation is added contextualizing information for a specific data point
Decluttering a visualization by removing chartjunk.
Before we conclude, let’s leave you with some data analytic musings to marinate.
There’s something to be said about the data you’ve used to practice these techniques. Most analyses require some summary decision(s) on the part of the analyst, her- or himself, to process (i.e. clean and manipulate) data before looking for insights. This can have a significant effect on insights gleaned from an analysis.
For example, you may have noticed that in “The Largest Vocabulary in Hip Hop” (Daniels, 2019), Notorious B.I.G. isn’t in the data set, despite “Juicy” being the top-rated track in “The Greatest Hip-Hop Songs of All Time” (Brown, 2020). Why? Well, as the author notes:
Quite a few rappers don’t have enough official material to be included
“Biggie” is the first artist the author provides as an example. Due to the cutoff at 35,000 words, we exclude, arguably, one of the greatest hip hop artists of all time - likely because Christopher Wallace, a.k.a. Biggie Smalls, passed away at the age of 24.
In other cases, a “bag of words” analysis on lyrical diversity can have different shortcomings. Must lyrics be in English to “count” as unique? Or are out-of-dictionary words ignored? What about foreign words? For example:
Who’s the one you call Mr. Macho the head honcho?
…asked by The Notorious B.I.G. in “Dead Wrong”. Here, hanchō comes from the Japanese words for “group” (han) and “leader” (chō). Similarly, Afrika Bambaataa’s “Planet Rock” also uses Japanese:
Everybody say, ‘ichi, ni, san, shi’
That is, “one, two, three, four”. If, indeed, only English words are excluded, this makes hip hop artists like Immortal Technique all the more impressive, since a number of his tracks combine English and Spanish, e.g. “No Me Importa” from his debut album, Revolutionary Vol. 1.
Moreover, what of heavily modified words, alliteration, or references, frequently used by artists like Snoop Dogg or Das EFX, e.g. in “They Want EFX”?
Bum stiggedy bum stiggedy bum, hon, I got the old pa-rum-pum-pum-pum.
Are these nonsense words? Not really - as “pa-rum-pum-pum-pum” alludes to “Little Drummer Boy” - an alliteration repeated by Big Punisher in “Whatcha Gonna Do?”.
For better or worse, you can ply data analytics not only to challenge conclusions but to challenge the analysis and source data, as well. Feel free to find the potential flaws or shortcomings in our practice data and reach your own conclusions.
Before parting, it would be valuable to summarize some important takeaways.
Beneath every great visualization is a robust, accurate, well-formatted data set
Like visualization itself, statistics concisely summarize your data
Great data visualization conveys one or more “big ideas”; discuss ideas, not visualizations
It doesn’t hurt to be easy on the eyes, but visualization must be efficient - not pretty
Just ten minutes of polish can vastly improve a visualization’s clarity
Visualization is always persuading, regardless of how objective you try to remain
Lastly, remember that visualization not only represents data, it represents you, your work, and the degree of respect you have for your audience.