Ch. 3: Introductory Statistics

In this chapter, you’ll learn how to perform basic statistical operations in Excel and Sheets.

Because this workshop focuses on practical applications, we’ll skip over a lot of formulas and theories, but know that there’s much more to unpack than we’ll see here.




Statistical Analysis

There are various kinds of analyses which increase in sophistication depending on your needs.

Simply reporting data does not make it an analysis. It requires some degree of summarization.



Types of analyses which increase in sophistication.

(Source: Leek, 2015)


We’ll focus on the most basic of analyses: Descriptive analysis.

Hence, the statistics we’ll derive from our data are called:

  • Descriptive Statistics, or

  • Summary Statistics




Summary Statistics

Summary or descriptive statistics are values that summarize or describe a larger body of data.


There are several summary statistics, but we’ll focus on the following:

  • Means or averages

  • Five-Number Summaries, or the five most fundamental descriptive statistics:

    • Minimum, or the smallest value in a data set

    • Lower Quartile, or the value at the 25th percentile

    • Median, or the value at the 50th percentile

    • Upper Quartile, or the value at the 75th percentile

    • Maximum, or the largest value in a data set


These statistics are called measures of dispersion because they describe how much your data are “spread out” or “dispersed”. See the diagram below.



A five-number summary shows the “spread” or “dispersion” of our data.

(Source: Siyavula)



Summarizing Unique Lyrics

Scenario: Imagine if we were to order our hip hop artists by their number of unique lyrics.

  • The mean number of unique lyrics is 4,351, putting Ice Cube right around average.

  • In ordering artists by their unique lyrics, we can find the five-number summary.


Artist Unique Lyrics Rank Percentile Statistic
Aesop Rock 7,392 1 100 % Maximum
Atmosphere & Tech N9ne 4,829.5 39 & 40 75 % Third Quartile
Biz Markie & Kevin Gates 4,315.5 79 & 80 50 % Median
Wiz Khalifa & Machine Gun Kelly 3,698.5 119 & 120 25 % First Quartile
NF 2,472 159 0 % Minimum



Interpreting Summaries

Armed with only the mean (i.e. average) and five-number summary, we can learn quite a bit about our data and both compare and draw conclusions about each artist. For example:


  • NF and Aesop Rock represent the least and most lyrically diverse rappers, respectively

    • This is determined by the minimum and maximum values, respectively


  • Atmosphere and Tech N9ne have more diverse vocabularies than 75% of artists (~119)

    • This is determined by the upper or third quartile, a.k.a. the 75th percentile


  • Wiz Khalifa and Machine Gun Kelly have less diverse vocabularies than 75% of artists (~119)

    • This is determined by the lower or first quartile, a.k.a. the 25th percentile


  • Biz Markie’s and Kevin Gates’ lyrics are no more or less diverse than 50% of artists (~80)


  • The difference between the upper and lower quartiles is 1,131 unique words

    • In other words, 50% of artists only differ in vocabulary size by 1,131 words, maximum

    • This measure, known as the interquartile range, or IQR, suggests a small “spread”


Means v. Medians

Interestingly, the mean or average number of unique lyrics (4,351) is slightly higher than the median number of unique lyrics (4,315.5).

This indicates that certain artists like Busdriver and Jedi Mind Tricks are markedly more diverse, enough so that they “pull up” or “raise” the average number of unique words.

For this same reason, typical worker compensation for any given company is reported with medians, rather than averages, since CEOs tend to earn significantly more. For example, Elon Musk earns 40,668 times more than the median wage of a Tesla employee (Klein, 2019). That might throw your average off a bit!


Got an average of being “excellent”; the median’s just “dope”.


Thanks, El-P. You and Killer Mike.




Statistics in Spreadsheets

Spreadsheet software like Microsoft Excel and Google Sheets have a bevy of statistical functions.

Fortunately, averages and five-number summaries are very easy to use.


Just like we used function SUM() in the introduction, we must:

  1. Select the cell in which to place the summary statistic

  2. Insert = and our statistical function in the formula bar

  3. Select the cell or range of values for which we desire the statistic



Averages

Function AVERAGE() accepts a range of values (e.g. C2:C16) in parentheses (). For example:



Let’s see it in action:



The mean, or average, calculated with function AVERAGE().



Minima & Maxima

Functions MIN() and MAX() accept a range of values (e.g. G2:G340) in parentheses () and return the minimum and maximum values, respectively. For example:



Let’s see these in action:



Minimum and maximum values with functions MIN() and MAX(), respectively.



Medians

Function MEDIAN() accepts a range of values (e.g. A6:A21) in parentheses () and returns the median. For example:


Let’s see it in action:



The median calculated with function MEDIAN().



Quartiles

The lower and upper quartiles are calculated with function QUARTILE().


Unlike other functions we’ve seen, QUARTILE() requires two arguments separated by commas:

  1. The range of values to input (e.g. C2:C16)

  2. A number, between 0 and 4, indicating which quartile to calculate


Because the lower quartile is the first quartile, this requires a “1”, like so:


Because the upper quartile is the third quartile, this requires a “3”, like so:


Let’s see these in action:



Function QUARTILE() requires both a range and specification of quartiles 1, 2, or 3.



Shortcuts

Both Microsoft Excel and Google Sheets have shortcuts for common statistical functions.

  • In Excel, these may be accessed in the “Editing” panel of the “Home” toolbar

  • In Google Sheets, these may be accessed in the right-most icon, “Functions”



Using the shortcut for function AVERAGE() in Microsoft Excel.



Using the shortcut for function AVERAGE() in Google Sheets.





Applied Practice

Practice Data: In early 2020, BBC Music polled over one hundred “critics, artists, and music industry folks from 15 countries” and asked them to vote for their top five hip hop tracks of all time. These data were made available on the data science hub, Kaggle (read more).


These data have been reformatted with new variable names and sorted according to:

  1. Points, or the combination of first, second, third, fourth, and fifth choices

  2. Year, in ascending order (from oldest to newest)

  3. Gender of artist(s), including Male, Female, and Mixed (Male and Female)

You can navigate between the differently-sorted data along the tabs at the bottom.


These data are available both in Microsoft Excel and Google Sheets.

  • Download the Excel workbook by clicking here

  • Open and copy the Google Sheets spreadsheet by clicking here


Instructions: Use the statistical functions you’ve just learned to answer the following.

  1. What was the maximum number of points earned by female artists?

  2. What was the average and median number of points earned by male artists?

  3. Which decade had the highest average number of points?

  4. Among all songs, between what two quartiles is Luniz’ “I Got 5 On It”?

  5. Do you have a favorite artist here? How does their best track compare to average points?


Challenges: Answer the following questions for a bit of added challenge.

Note: “G.O.A.T.” stands for “Greatest of All Time”

  1. On average, does Rakim perform better as a solo artist or when paired with Eric B.?

  2. Using median values, per these data, who was the true G.O.A.T.: 2Pac or The Notorious B.I.G.?

  3. Who has more greatest tracks: Machine Gun Kelly or Eminem?