Ch. 2: Data & Formatting

Virtually every bit of information is some kind of datum.

In this chapter, we’ll explore the various classes and types of data, how to format data accordingly in spreadsheets, and the implications regarding the kinds of data with which you’ll work.




Quantitative v. Qualitative

The broadest classifications may be summarized as quantitative or qualitative.


Quantitative data are data relating to quantities, i.e. numbers. For example:

  • An individual’s height or weight

  • The number or departures from an airport

  • The number of credits provided by a college course


Qualitative data are data relating to qualities, i.e. descriptive text. For example:

  • An individual’s first and last name

  • The airline company name for each departure from an airport

  • The entire curriculum provided by a college course


In our previous example of hip hop artists, the name of the artist is qualitative, descriptive text, while the number of unique lyrics is quantitative, or numeric and measurable. Check it out:



Names are qualitative, while numbers are typically quantitative.



Extracting the Qualitative

It is possible to transform qualitative data into quantitative data, and the reverse is also true. It does not necessarily alter the data, but it allows us to derive new variables from existing data.


For example, we can turn quantitative data into ranges or categories based on their numeric values. This is called discretization, because we take numbers on a continuous scale and slice them into new ranges or categories.


Below, we discretize unique vocabulary as ranges:

  • Small vocabularies are Unique Words less than 4,000

  • Medium vocabularies are Unique Words greater than 4,000 but less than 5,000

  • Large vocabularies are Unique Words greater than 5,000



We can extract qualitative data from quantitative data with discretization.



Extracting the Quantitative

Similarly, we can extract quantitative data from qualitative data.


For example, we can use function LEN() to tally the number of letters and spaces in each artist’s name - that is, the “length” (LEN()) of a text value.



We can extract quantitative data from qualitative data with text functions.



Potential Pitfalls

Not all data are as they seem. It’s important to avoid confusing quantitative and qualitative data.


For example, a ZIP code is made up of numbers, but we cannot operate on them mathematically.

If my ZIP code is “30307” and your ZIP code is “30301”, taking the average of these numbers, “30304”, does not give us a location exactly between where we both live. This is because, despite being made of numbers, ZIP codes are qualitative.


As the adage attributed to Miles Kington goes:


Knowledge is knowing that a tomato is a fruit.
Wisdom is not putting it in a fruit salad.


It’s the same with quantitative and qualitative data.




Programmatic Classes

Data comes in a wide variety of classes. The framework for naming them is not universal across spreadsheet software or programming languages, but we’ll explore some common data classes here.


In many cases, classes affect how which methods you use in analyzing data.
However, they are even more important when visualizing your data.


There are a number of data classes and more than one class can exist in a spreadsheet. The following are common, though names may differ:



Character

Character or text data are recognized by Excel and Sheets as alphabetical (all letters), alphanumeric (letters and numbers), or simply numeric values that are actually qualitative (e.g. ZIP codes).



Numeric

Numeric data are recognized as numbers on which Excel or Sheets may operate mathematically, like addition, subtraction, averages, etc.

  • Integers are a specific class of numeric data as they are only made of whole numbers.

  • For example, tallies of individual students are typically integer, since you can’t have a 50% of a person in a course.

  • Qualitative, character, or text values are often referred to as strings



Logical

Logical data are comprised of TRUE and FALSE values, also known as binary or Boolean.

  • Under the hood, these are actually numeric values, with “0” representing FALSE and “1” representing TRUE.

  • In Sheets, function INT() converts logical values TRUE and FALSE to numbers.



Function INT() is only available in Google Sheets and converts logical values to integers.



Dates & Times

Date and date time variables comprise, well, dates and times. Pretty straightforward.

  • However, date values may be formatted a variety of ways.

  • For example, June 1, 2020 may be formatted as 06/01/20 or 2020-06-01.

  • Lastly, like logical, date values are also numeric values under the hood.


In other words, we can use function INT() to convert class date variables into class numeric variables. Below, we concert a random assortment of dates into integers using function INT(), add 365, and convert all dates to one year later.



Even though dates aren’t in a conventional format, we can still treat them as numbers.




Statistical Types

Data types are more important in determining statistical methods of analysis, rather than data visualization, and are most often used to further classify qualitative variables (i.e. text or character data).




Discrete

Discrete data are comprised of numeric values that are whole, indivisible, and most notably, countable.

In our hip hop data, we can clearly count the number of unique words, but there are no partial words. Snoop Dogg has 3,974 unique words - not 3,974.3 unique words.

We count discrete values, but we do not measure them.



Continuous

Continuous data are comprised of numeric values that may or may not be whole numbers, exist along a spectrum, and are divisible into smaller values.

When you calculated the proportion of artists’ unique words out of 35,000, you measured, rather than counted this number.

For example, when comparing Dirty South rappers, 13.7314 % of Ludacris’ first 35,000 lyrics were unique, whereas André 3000 and Big Bi of Outlast have a slightly more varied vocabulary with 14.8914 % unique words. These proportions are continuous.

We measure continuous values, but we do not count them.



Categorical & Ordinal

Categorical, also called nominal data are values which may be defined as categories.

Although names are qualitative, they typically aren’t categorical.

We could categorize them based on shared modifiers, like “Li”:



A Li list.



One categorical variable might be “Era” or “Decade”. That is, when did the artist emerge?



Although “Year” is quantitative, “Era” or “Decade” is qualitative and categorical.



In fact, “Era” or “Decade” is not only categorical, but it’s also ordinal.

In other words, ordinal data are categories with a natural order or hierarchy. Here:

  1. 1980s

  2. 1990s

  3. 2000s

  4. 2010s


Non-ordinal categories, for example, may include sub genres of Dirty South hip hop:


Despite what some may claim, there is no inherent order to these categories.



Practice Exercises

Instructions: Identify the correct data type among the choices provided.


  1. Socioeconomic Status: “Low Income”, “Middle Income”, “High Income”
    1. Discrete
    2. Continuous
    3. Categorical
    4. Non-Ordinal Categorical


  1. Continent: “Africa”, “Asia”, “Antarctica”, “Australia”, “Europe”, “N. America”, “S. America”
    1. Discrete
    2. Continuous
    3. Ordinal Categorical
    4. Non-Ordinal Categorical


  1. Educational Attainment: “High School”, “Bachelors”, “Masters/JD”, “PhD/MD”
    1. Discrete
    2. Continuous
    3. Ordinal Categorical
    4. Non-Ordinal Categorical


  1. Eye Color: “Amber”, “Blue”, “Brown”, “Hazel”, “Green”
    1. Discrete
    2. Continuous
    3. Ordinal Categorical
    4. Non-Ordinal Categorical


  1. Boston Marathon Times: “03:08:21”, “03:23:48”, “03:33: 18”, etc.
    1. Discrete
    2. Continuous
    3. Ordinal Categorical
    4. Non-Ordinal Categorical


  1. Paired Dice Rolls: “2”, “3”, “4”, … “12”
    1. Discrete
    2. Continuous
    3. Ordinal Categorical
    4. Non-Ordinal Categorical


  1. Likert Scale Options: “Strongly Disagree”, “Agree”, “Neutral”, “Agree”, “Strongly Agree”
    1. Discrete
    2. Continuous
    3. Ordinal Categorical
    4. Non-Ordinal Categorical


  1. Enrolled Students: “0”, “1”, “2”, “3”, “4”, “5”, … etc.
    1. Discrete
    2. Continuous
    3. Ordinal Categorical
    4. Non-Ordinal Categorical


  1. NBA Games Played (GP): “0”, “1”, “2”, … etc.
    1. Discrete
    2. Continuous
    3. Ordinal Categorical
    4. Non-Ordinal Categorical


  1. Free Throw Percentage (FT%): “72.0 %”, “80.3 %”, “75.9 %”, “90.2 %”
    1. Discrete
    2. Continuous
    3. Ordinal Categorical
    4. Non-Ordinal Categorical




Formatting Data

Both Excel and Sheets allow you to format your data in a way that’s appropriate for its type and class.



Why Format?

Question: Can you spot what’s “wrong” with the following Connecticut ZIP code data?



Connecticut ZIP codes - almost.


What’s wrong?

Aren’t ZIP codes supposed to have five digits?

(Why, yes - yes, they are).


What happened?

Connecticut ZIP codes begin with a “0”. These are called leading zeroes.

Since ZIP codes are comprised of digits, Excel or Sheets will automatically interpret them as numbers and “drop” leading zeroes.


(Note I: Leading zeroes will not be on the final exam).

(Note II: There’s no final exam).


What’s This Got to Do with Formatting?

We format data, principally, for two reasons:

  1. Formatted data are easier to understand and interpret

  2. Formatted data explicitly tell our spreadsheet software how to interpret the data


If we format our ZIP codes as character or text data, our zeroes return:



This is - what analysts call - “the right way”.



How to Format

Formatting your data is more or less the same in both Excel and Sheets.

Note: Although we’re formatting our data to suit their class and/or type, “formatting” is used broadly to include fonts, colors, fills, borders, and other aesthetics. Here, we’re strictly describing formatting values.


Selecting Cells: In both Excel and Sheets, you must first select the cells you wish to format, including:

  • Clicking to highlight a single cell

  • Clicking and dragging to highlight a range of cells

  • Clicking a column letter (e.g. A, B, F) to highlight all cells in a column

  • Clicking and dragging to highlight multiple column headers



In order to format values, you must select their cells.


In Excel: Once cells are selected:

  1. Make sure you’re in the “Home” tab

  2. Click on the box next to “Number”

  3. In the “Number” tab, select the appropriate “Category”

  4. Specify any details, e.g. decimal places



Formatting “Unique Words” as a “Number” with commas and decimal positions in Excel.


In Sheets: Once cells are selected:

  1. Click “Format”

  2. Click “Number” in the drop-down menu

  3. Select the appropriate format



Formatting “Unique Words” as a “Number” in Google Sheets.



Formatting Shortcuts

Both Excel and Sheets have formatting shortcuts in their toolbars.


These shortcuts allow you to quickly format values with a quick click of the mouse, including:

  • Accounting formats in US dollars, UK pounds, EU Euros, Chinese Yuan, etc.

  • Percentage formats and thousands separators (commas)

  • Adding or removing decimal places



Formatting shortcuts in Microsoft Excel.



Formatting shortcuts in Google Sheets.





Applied Practice

Practice Data: Once again, you’ll practice by using The Pudding’s, “The Largest Vocabulary in Hip Hop” (Daniels, 2019).


This time, however, the formatting is slightly different and there are four new variables:

  • Era, indicating the decade containing the lyrics analyzed

  • Region, an incomplete list of each artist’s associated region

  • Wu-Tang, a logical TRUE or FALSE indicating membership in Wu-Tang Clan

  • Source, or the original of the analyzed lyrics


You can access these data in either Excel or Sheets:

  • Download the Excel workbook by clicking here and clicking “Enable Editing”

  • Open and copy the data in Sheets by clicking here


Instructions: Format each variable per the following recommendations.

  1. Era is ordinal (that is, ordered categorical) and should be “text”


  1. Unique Words is discrete and should be “number”

    1. Include a thousands separator (i.e. commas)

    2. Exclude any decimal places (i.e. “0.0” or “0.00”)


  1. Convert variable Wu-Tang to binary, or “0” for FALSE and “1” for TRUE

    1. Note: I’m not sure why all Wu-Tang members aren’t here - is The Pudding not down with ODB?

    2. This must be performed manually with a “Find & Replace”:

      1. Press the Ctrl and F keys at the same time

      2. Select “Replace”

      3. In the “Find what:” field, enter “FALSE”

      4. In the “Replace with:” field, enter “0”

      5. Click “Replace All”

      6. Repeat with “TRUE” and “1”

      7. Format the entire column as “Number” without decimal places


Challenges: A benefit of formatting in binary is speedy calculations.

  1. Use function SUM() on all values in variable Wu-Tang to count total members in our data.


  1. Use function AVERAGE() on all values in variable Wu-Tang to get the proportion of members.

    1. Format the result as a percent. What percentage of artists are members of Wu-Tang Clan?