Virtually every bit of information is some kind of datum.
In this chapter, we’ll explore the various classes and types of data, how to format data accordingly in spreadsheets, and the implications regarding the kinds of data with which you’ll work.
The broadest classifications may be summarized as quantitative or qualitative.
Quantitative data are data relating to quantities, i.e. numbers. For example:
An individual’s height or weight
The number or departures from an airport
The number of credits provided by a college course
Qualitative data are data relating to qualities, i.e. descriptive text. For example:
An individual’s first and last name
The airline company name for each departure from an airport
The entire curriculum provided by a college course
In our previous example of hip hop artists, the name of the artist is qualitative, descriptive text, while the number of unique lyrics is quantitative, or numeric and measurable. Check it out:
Names are qualitative, while numbers are typically quantitative.
It is possible to transform qualitative data into quantitative data, and the reverse is also true. It does not necessarily alter the data, but it allows us to derive new variables from existing data.
For example, we can turn quantitative data into ranges or categories based on their numeric values. This is called discretization, because we take numbers on a continuous scale and slice them into new ranges or categories.
Below, we discretize unique vocabulary as ranges:
Small vocabularies are Unique Words less than 4,000
Medium vocabularies are Unique Words greater than 4,000 but less than 5,000
Large vocabularies are Unique Words greater than 5,000
We can extract qualitative data from quantitative data with discretization.
Similarly, we can extract quantitative data from qualitative data.
For example, we can use function LEN() to tally the number of letters and spaces in each artist’s name - that is, the “length” (LEN()) of a text value.
We can extract quantitative data from qualitative data with text functions.
Not all data are as they seem. It’s important to avoid confusing quantitative and qualitative data.
For example, a ZIP code is made up of numbers, but we cannot operate on them mathematically.
If my ZIP code is “30307” and your ZIP code is “30301”, taking the average of these numbers, “30304”, does not give us a location exactly between where we both live. This is because, despite being made of numbers, ZIP codes are qualitative.
As the adage attributed to Miles Kington goes:
Knowledge is knowing that a tomato is a fruit.
Wisdom is not putting it in a fruit salad.
It’s the same with quantitative and qualitative data.
Data comes in a wide variety of classes. The framework for naming them is not universal across spreadsheet software or programming languages, but we’ll explore some common data classes here.
In many cases, classes affect how which methods you use in analyzing data.
However, they are even more important when visualizing your data.
There are a number of data classes and more than one class can exist in a spreadsheet. The following are common, though names may differ:
Character or text data are recognized by Excel and Sheets as alphabetical (all letters), alphanumeric (letters and numbers), or simply numeric values that are actually qualitative (e.g. ZIP codes).
Numeric data are recognized as numbers on which Excel or Sheets may operate mathematically, like addition, subtraction, averages, etc.
Integers are a specific class of numeric data as they are only made of whole numbers.
For example, tallies of individual students are typically integer, since you can’t have a 50% of a person in a course.
Qualitative, character, or text values are often referred to as strings
Logical data are comprised of TRUE and FALSE values, also known as binary or Boolean.
Under the hood, these are actually numeric values, with “0” representing FALSE and “1” representing TRUE.
In Sheets, function INT() converts logical values TRUE and FALSE to numbers.
Function INT() is only available in Google Sheets and converts logical values to integers.
Date and date time variables comprise, well, dates and times. Pretty straightforward.
However, date values may be formatted a variety of ways.
For example, June 1, 2020 may be formatted as 06/01/20 or 2020-06-01.
Lastly, like logical, date values are also numeric values under the hood.
In other words, we can use function INT() to convert class date variables into class numeric variables. Below, we concert a random assortment of dates into integers using function INT(), add 365, and convert all dates to one year later.
Even though dates aren’t in a conventional format, we can still treat them as numbers.
Data types are more important in determining statistical methods of analysis, rather than data visualization, and are most often used to further classify qualitative variables (i.e. text or character data).
Discrete data are comprised of numeric values that are whole, indivisible, and most notably, countable.
In our hip hop data, we can clearly count the number of unique words, but there are no partial words. Snoop Dogg has 3,974 unique words - not 3,974.3 unique words.
We count discrete values, but we do not measure them.
Continuous data are comprised of numeric values that may or may not be whole numbers, exist along a spectrum, and are divisible into smaller values.
When you calculated the proportion of artists’ unique words out of 35,000, you measured, rather than counted this number.
For example, when comparing Dirty South rappers, 13.7314 % of Ludacris’ first 35,000 lyrics were unique, whereas André 3000 and Big Bi of Outlast have a slightly more varied vocabulary with 14.8914 % unique words. These proportions are continuous.
We measure continuous values, but we do not count them.
Categorical, also called nominal data are values which may be defined as categories.
Although names are qualitative, they typically aren’t categorical.
We could categorize them based on shared modifiers, like “Li”:
A Li list.
One categorical variable might be “Era” or “Decade”. That is, when did the artist emerge?
Although “Year” is quantitative, “Era” or “Decade” is qualitative and categorical.
In fact, “Era” or “Decade” is not only categorical, but it’s also ordinal.
In other words, ordinal data are categories with a natural order or hierarchy. Here:
1980s
1990s
2000s
2010s
Non-ordinal categories, for example, may include sub genres of Dirty South hip hop:
Snap (Atlanta, GA)
Trap (Atlanta, GA)
Bounce (New Orleans, LA)
Trunk (Memphis, TN)
Chopped & Screwed (Houston, TX)
Booty Bass (Miami, FL)
Despite what some may claim, there is no inherent order to these categories.
Instructions: Identify the correct data type among the choices provided.
Both Excel and Sheets allow you to format your data in a way that’s appropriate for its type and class.
Question: Can you spot what’s “wrong” with the following Connecticut ZIP code data?
Connecticut ZIP codes - almost.
Aren’t ZIP codes supposed to have five digits?
(Why, yes - yes, they are).
Connecticut ZIP codes begin with a “0”. These are called leading zeroes.
Since ZIP codes are comprised of digits, Excel or Sheets will automatically interpret them as numbers and “drop” leading zeroes.
(Note I: Leading zeroes will not be on the final exam).
(Note II: There’s no final exam).
We format data, principally, for two reasons:
Formatted data are easier to understand and interpret
Formatted data explicitly tell our spreadsheet software how to interpret the data
If we format our ZIP codes as character or text data, our zeroes return:
This is - what analysts call - “the right way”.
Formatting your data is more or less the same in both Excel and Sheets.
Note: Although we’re formatting our data to suit their class and/or type, “formatting” is used broadly to include fonts, colors, fills, borders, and other aesthetics. Here, we’re strictly describing formatting values.
Selecting Cells: In both Excel and Sheets, you must first select the cells you wish to format, including:
Clicking to highlight a single cell
Clicking and dragging to highlight a range of cells
Clicking a column letter (e.g. A, B, F) to highlight all cells in a column
Clicking and dragging to highlight multiple column headers
In order to format values, you must select their cells.
In Excel: Once cells are selected:
Make sure you’re in the “Home” tab
Click on the box next to “Number”
In the “Number” tab, select the appropriate “Category”
Specify any details, e.g. decimal places
Formatting “Unique Words” as a “Number” with commas and decimal positions in Excel.
In Sheets: Once cells are selected:
Click “Format”
Click “Number” in the drop-down menu
Select the appropriate format
Formatting “Unique Words” as a “Number” in Google Sheets.
Both Excel and Sheets have formatting shortcuts in their toolbars.
These shortcuts allow you to quickly format values with a quick click of the mouse, including:
Accounting formats in US dollars, UK pounds, EU Euros, Chinese Yuan, etc.
Percentage formats and thousands separators (commas)
Adding or removing decimal places
Formatting shortcuts in Microsoft Excel.
Formatting shortcuts in Google Sheets.
Practice Data: Once again, you’ll practice by using The Pudding’s, “The Largest Vocabulary in Hip Hop” (Daniels, 2019).
This time, however, the formatting is slightly different and there are four new variables:
Era, indicating the decade containing the lyrics analyzed
Region, an incomplete list of each artist’s associated region
Wu-Tang, a logical TRUE or FALSE indicating membership in Wu-Tang Clan
Source, or the original of the analyzed lyrics
You can access these data in either Excel or Sheets:
Download the Excel workbook by clicking here and clicking “Enable Editing”
Open and copy the data in Sheets by clicking here
Instructions: Format each variable per the following recommendations.
Era is ordinal (that is, ordered categorical) and should be “text”Unique Words is discrete and should be “number”
Include a thousands separator (i.e. commas)
Exclude any decimal places (i.e. “0.0” or “0.00”)
Convert variable Wu-Tang to binary, or “0” for FALSE and “1” for TRUE
Note: I’m not sure why all Wu-Tang members aren’t here - is The Pudding not down with ODB?
This must be performed manually with a “Find & Replace”:
Press the Ctrl and F keys at the same time
Select “Replace”
In the “Find what:” field, enter “FALSE”
In the “Replace with:” field, enter “0”
Click “Replace All”
Repeat with “TRUE” and “1”
Format the entire column as “Number” without decimal places
Challenges: A benefit of formatting in binary is speedy calculations.
SUM() on all values in variable Wu-Tang to count total members in our data.Use function AVERAGE() on all values in variable Wu-Tang to get the proportion of members.