Learning Data Science with R:

Measures
Storing data in R
- Types of data in R
- Types of variables in R

Data is the result of measurement of physical or emotive quantities. We encode quantities like distance, clock cycles, rates of change, or feelings into measures. We then record repeated measures as data, which is grouped and labelled on a computer as variables.

How those quantities are encoded into measures impacts how we can analyze data.

Measures

Scales of Measurement

It is often useful when analyzing data to think of the data as being one of four main types, according to the typology proposed by Stevens.¹ This can help with selecting the analysis to perform and prevent basic analysis mistakes.

Nominal: Data identifying unique classifications or objects where the order of values is not meaningful. Examples include zip codes, gender, nationality, sports team names and multiple choice answers on most tests.
Ordinal: Data where the order is important but the difference or distance between items is not important or not measured. Examples include team rankings in sport (team A is better than team B, but how much better is open to debate), scales such as health (e.g. “healthy” to “sick”), ranges of opinion (e.g. Likert-type scales of “strongly agree” to “strongly disagree” or “on a scale of 1 to 10”) and Intelligence Quotient.
Interval: Numeric data identified by values where the degree of difference between items is significant and meaningful, but their ratio is not. Common examples are dates, e.g. 1000 CE is not 1/2 of 2000 CE in any meaningful way, and temperatures on the Celsius and Fahrenheit scales, where a difference of 10 degrees is meaningful, but 10 degrees is not twice as hot as 5 degrees.
Ratio: Numeric data where the ratio between numbers is meaningful. Usually, such scales have a meaningful “0.” Examples include length, mass, velocity, acceleration, voltage, power, duration, energy and Kelvin-scale temperature.
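The difference between interval and ratio scales can be checked numerically. The sketch below is illustrative only: the variable names are made up for this example, and it uses the two Celsius values from the text to show why a ratio is only meaningful once the scale has a true zero.

```r
# Two temperatures in degrees Celsius (interval scale)
celsius <- c(5, 10)

# Differences are meaningful on an interval scale
diff_c <- celsius[2] - celsius[1]        # 5 degrees

# The naive ratio suggests "twice as hot", which is misleading
naive_ratio <- celsius[2] / celsius[1]   # 2

# Converting to Kelvin (ratio scale, true zero) gives the physical ratio
kelvin <- celsius + 273.15
true_ratio <- kelvin[2] / kelvin[1]      # about 1.018

print(c(difference = diff_c, naive = naive_ratio, kelvin = true_ratio))
```

The difference is the same on both scales, but only the Kelvin ratio describes how much "hotter" one value is than the other.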

Statistics and Operations

Some of the appropriate statistics and mathematical operations for each type are summarized in the table below.

Scale Type	Statistics	Operations
Nominal	counts, mode, frequency, chi-squared, cluster analysis	\(=\), \(\neq\)
Ordinal	the above, plus: median, non-parametric tests, Kruskal-Wallis, rank-correlation	\(=\), \(\neq\), \(<\), \(>\)
Interval	the above, plus: arithmetic mean, some parametric tests (proportions, sometimes t-test), individuals control chart, correlation, regression, ANOVA (sometimes), factor analysis	\(=\), \(\neq\), \(<\), \(>\), \(+\), \(-\)
Ratio	the above, plus: geometric and harmonic mean, ANOVA, correlation coefficient	\(=\), \(\neq\), \(<\), \(>\), \(+\), \(-\), \(\times\), \(\div\)
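As a rough illustration of the table, the sketch below applies one statistic or operation from each row to small made-up vectors. All data and variable names here are invented for the example. The median of the ordinal data is taken on the integer codes of an ordered factor and mapped back to a label, which is one possible approach rather than the only one.

```r
# Nominal: only equality and counts make sense
teams <- factor(c("red", "blue", "red", "green"))
team_counts <- table(teams)
team_mode <- names(team_counts)[which.max(team_counts)]  # most frequent level

# Ordinal: an ordered factor also supports < and >
health_levels <- c("sick", "fair", "healthy")
health <- factor(c("fair", "sick", "healthy", "fair", "healthy"),
                 levels = health_levels, ordered = TRUE)
is_better <- health[3] > health[2]   # is "healthy" above "sick"?

# Median by rank (odd number of observations, so the middle rank is whole)
health_median <- health_levels[median(as.integer(health))]

# Interval and ratio: arithmetic on numeric vectors
lengths_mm <- c(12.1, 11.8, 12.4)
mean_length <- mean(lengths_mm)

print(team_mode)
print(is_better)
print(health_median)
print(mean_length)
```

Note that comparing elements of the unordered `teams` factor with `<` returns `NA` with a warning, which is R's way of enforcing the nominal row of the table.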

Exceptions

While this is a useful typology for most uses, and certainly for initial consideration, there are criticisms of Stevens’ typology, and there are alternatives. It doesn’t work as a hard-and-fast rule for all situations, so it is essential for the data scientist to understand the statistical methods being applied and their sensitivity to departures from underlying assumptions.

For example, percentages and count data have some characteristics of ratio-scale data, but with additional constraints. The average of the counts (2, 2, 1), for instance, may not be meaningful: can we talk meaningfully about \(1.66\ldots\) counts of something? Conversely, preference data collected on a Likert-type ordinal scale (e.g. on a scale of “not important” to “extremely important”) is clearly of ordinal type but may, in some cases, be safely treated as ratio scale data.

See Velleman and Wilkinson² for one such critique of Stevens’ typology, in particular focusing on the way in which it fails when using automated machine learning methods.

Storing data in R

Types of data in R

R recognizes at least fifteen different types of data. Several of these are related to identifying functions and other objects—most users don’t need to worry about most of them. The main types that data scientists will need to use are:

numeric: Real numbers. Also known as double, real and single (note that R stores all real numbers in double-precision). May be used for all scales of measurement, but is particularly suited to ratio scale measurements.
complex: Complex numbers can be manipulated directly as a data type using x <- 1 + 2i or x <- complex(real=1, imaginary=2). Like type numeric, may be used for all scales of measurement.
integer: Stores integers only, without any decimal point. Mainly used for ordinal or interval data, but may be used as ratio data, such as counts, with some caution.
logical: Stores Boolean values of TRUE or FALSE, typically used as nominal data.
character: Stores text strings and can be used as nominal or ordinal data.
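A quick way to see these types is to ask R for the storage type of a few literals with typeof(). This is a minimal sketch; the variable names are chosen only for this example.

```r
x_num <- 3.14      # numeric, stored as double
x_int <- 3L        # the L suffix requests an integer
x_cpx <- 1 + 2i    # complex
x_lgl <- TRUE      # logical
x_chr <- "three"   # character

types <- c(typeof(x_num), typeof(x_int), typeof(x_cpx),
           typeof(x_lgl), typeof(x_chr))
print(types)
# "double" "integer" "complex" "logical" "character"

# A bare whole number is still a double unless the L suffix is used
print(is.integer(3))    # FALSE
print(is.integer(3L))   # TRUE
```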

Types of variables in R

The above types of data can be stored in several types, or structures, of variables. The equivalent to a variable in Excel would be rows, columns or tables of data. The main ones that we will use are:

vector: Contains one or many elements, and behaves like a column or row of data. Vectors can contain any of the above types of data but each vector is stored, or encoded, as a single type. The vector c(1, 2, 1, 3, 4) is, by default, a numeric vector of type double, but c(1, 2, 1, 3, 4, "name") will be a character vector, or a vector where all data is stored as type character. In the latter case, the numbers will be stored as characters (i.e. on a nominal scale of measurement) rather than as numbers.
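The coercion described above can be observed directly. The sketch below uses the two vectors from the text; the variable names are illustrative.

```r
v_num <- c(1, 2, 1, 3, 4)
v_chr <- c(1, 2, 1, 3, 4, "name")

print(typeof(v_num))   # "double"
print(typeof(v_chr))   # "character": every element was converted to text

# Arithmetic works on the numeric vector only
v_sum <- sum(v_num)    # 11

# Converting back to numbers turns the non-numeric element into NA
# (as.numeric() warns about this; the warning is suppressed here)
v_back <- suppressWarnings(as.numeric(v_chr))
print(v_back)
```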
factor: A special type of vector for categorical data, where the text strings signify factor levels and are encoded internally as integer codes that index the set of levels. Factors can be treated as nominal data when the order does not matter, or as ordinal data when the order does matter.

factor(c("a", "b", "c", "a"), levels=c("a","b","c","d"))
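To see how the factor above is stored, the following sketch inspects its levels, its internal integer codes and the count of each level. The variable name `f` is chosen for this example.

```r
f <- factor(c("a", "b", "c", "a"), levels = c("a", "b", "c", "d"))

print(levels(f))       # "a" "b" "c" "d"
print(as.integer(f))   # 1 2 3 1: codes that index into the levels
print(typeof(f))       # "integer"

# table() counts occurrences per level, including the unused level "d"
level_counts <- table(f)
print(level_counts)
```

The level "d" is part of the factor even though it never occurs in the data, which is useful when a category is possible but was not observed.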

array: A generalization of vectors from one dimension to two or more dimensions. An array's dimensions must be defined when it is created, and an array can have any number of dimensions. Like vectors, all elements of an array must be of the same data type. (Note that the letters object used in the example below is a variable supplied by R that contains the letters a through z.)

# letters a - c in 2x4 array
array(data=letters[1:3], dim=c(2,4))
# numbers 1 - 3 in 2x4 array
array(data=1:3, dim=c(2,4))

matrix: A special type of array with the properties of a mathematical matrix. It may only be two-dimensional, having rows and columns, where all elements must have the same type of data and every column must have the same number of rows. R provides several functions specific to manipulating matrices, such as taking the transpose, performing matrix multiplication and calculating eigenvectors and eigenvalues.

matrix(data = rep(1:3, times=2), nrow=2, ncol=3)
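The matrix-specific functions mentioned above can be tried on the same matrix. This is a small sketch using base R's t(), %*% and eigen(); the variable names are illustrative.

```r
m <- matrix(data = rep(1:3, times = 2), nrow = 2, ncol = 3)

# Transpose: a 2x3 matrix becomes 3x2
m_t <- t(m)
print(dim(m_t))

# Matrix multiplication: (2x3) %*% (3x2) gives a symmetric 2x2 matrix
mm <- m %*% m_t
print(mm)

# Eigenvalues and eigenvectors of the square matrix
e <- eigen(mm)
print(e$values)   # 25 and 3
```

Note that R fills matrices column by column, so the first column of `m` is (1, 2), not (1, 2, 3).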

list: A vector whose elements are other R objects. Each element can be of a different data type, length and dimension from the others, so lists can store all other data types, including other lists.

list("text", "more", 2, c(1,2,3,2))
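Elements of a list are usually extracted with double brackets, while single brackets return a shorter list. The sketch below shows the difference on the list above; the variable names are illustrative.

```r
lst <- list("text", "more", 2, c(1, 2, 3, 2))

print(length(lst))   # 4 elements, regardless of their individual lengths

# [[ ]] extracts the element itself; [ ] returns a list containing it
fourth <- lst[[4]]
print(class(fourth))   # "numeric"
print(class(lst[4]))   # "list"

# Each element keeps its own type
element_types <- sapply(lst, typeof)
print(element_types)
```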

Data Frame: For industrial scientists, data frames (or the tidyverse extension of the data frame, the tibble) are the most widely useful type of variable. A data frame is the list analog to the matrix: it is an \(m \times n\) list in which every column is a vector with the same number of rows (as reported by NROW()). However, unlike matrices, different columns can contain different types of data, and each row and column must have a name. If not named explicitly, R automatically names rows by their row number and columns according to the data assigned to the column. Data frames are typically used to store the sort of data that industrial engineers and scientists most often work with, and are the closest analog in R to an Excel spreadsheet. Usually data frames are made up of one or more columns of factors and one or more columns of numeric data.

data.frame(rnorm(5), rnorm(5), rnorm(5))
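In practice, columns are usually named explicitly and hold different types. The sketch below builds a small data frame with a factor column, a logical column and a numeric column. The column names, values and seed are invented for this example.

```r
set.seed(1)   # arbitrary seed so the random column is reproducible

measurements <- data.frame(
  batch     = factor(c("A", "A", "B", "B", "C")),
  passed    = c(TRUE, TRUE, FALSE, TRUE, TRUE),
  thickness = rnorm(5, mean = 10, sd = 0.1)
)

print(dim(measurements))        # 5 rows, 3 columns
print(names(measurements))      # column names as given above
print(rownames(measurements))   # default row names "1" to "5"

# Each column keeps its own class
column_classes <- sapply(measurements, class)
print(column_classes)
```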

More generally, in R all variables are objects, and R distinguishes between objects by their internal storage type and by their class declaration, which are accessible via the functions typeof() and class(). Functions in R are also objects, and users can define new classes of objects to control the output from functions. This is how summary() and print() are able to work with a wide range of object types, from numeric vectors to data frames to ggplot2 graph objects. For more on objects, types and classes, see section 2 of the R Language Definition.
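The distinction between storage type and class, and the way print() adapts to a class, can be sketched as follows. The class name "reading" and its print method are hypothetical, invented purely for illustration, and use R's S3 dispatch mechanism.

```r
# A data frame is stored as a list but declared as class "data.frame"
d <- data.frame(x = 1:3)
print(typeof(d))   # "list"
print(class(d))    # "data.frame"

# Hypothetical class "reading": a list with a class attribute attached
reading <- structure(list(value = 12.1, unit = "mm"), class = "reading")

# print() dispatches to print.reading() for objects of this class
print.reading <- function(x, ...) {
  cat("Reading:", x$value, x$unit, "\n")
  invisible(x)
}

print(reading)

# Capture the printed text to confirm the custom method was used
printed <- capture.output(print(reading))
```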

¹ S. S. Stevens. On the theory of scales of measurement. Science, 103(2684):677–680, 1946.
² P. F. Velleman and L. Wilkinson. Nominal, Ordinal, Interval, and Ratio Typologies are Misleading. The American Statistician, 47(1):65–72, 1993.