Data

Data are facts and figures collected, analyzed and summarized for presentation and interpretation. A variable is a characteristic or trait of the data. For example, for a Major League Baseball (MLB) player, his age, batting average and hometown would be considered variables. An observation is a set of values corresponding to a set of variables. Here we have four observations each corresponding to one of the four players. We are concerned with how the value of a variable can vary. Variation is the difference in a variable measured over observations (time, customers, items, etc.).

Player Age BA Hometown
John 25 0.235 Kelowna
Bill 20 0.315 Penticton
Kevin 29 0.287 Vancouver
Steve 33 0.259 Victoria

The role of descriptive analytics is to collect and analyze data to gain a better understanding of variation. A quantity that is not known with any certainty is called a random variable. When we collect data, we are gathering past observed values. The goal is to learn more about the variation and how we may gain insight as a result of those observations.

Data Types

Data can be categorized in many ways based on the type of data and how the data are collected.

Populations versus Samples

In many cases, we cannot collect data from the entire population of interest. In those cases, we collect data from a subset of the population called a sample. Sample data must be representative of the population data or generalizations (inferences) cannot be made. Normally, a representative sample is fairly straight forward to collect. Dealing with population data versus sample data will introduce small differences in how we calculate and interpret summary statistics. However, there are a variety of reasons (such as data size and collection efficiency) that we prefer to collect samples versus population data.

Quantitative versus Categorical Data

Qualitative (or categorical) data is descriptive data. Quantitative data is numerical in nature and lends itself well to mathematical calculations. If arithmetic operations cannot be performed on the data, they are considered categorical data.

Cross-Sectional versus Time Series Data

For statistical analysis, it is important to distinguish between cross-sectional data and time series data. Cross-sectional data are collected from several entities at the same, or approximately the same, point in time. Time series data are collected over several time periods.

Experimental versus Observational Studies

Data can often be obtained with an appropriate study; such statistical studies can be classified as either experimental or observational. In an experimental study, a variable of interest is first identified. Then one or more other variables are identified and controlled or manipulated to obtain data about how these variables influence the variable of interest. Nonexperimental, or observational, studies make no attempt to control the variables of interest. Surveys tend to be observational in nature while clinical trials tend to be experimental.

Cleaning Data

The data in a data set are often said to be “dirty” and “raw” before they have been put into a form that is best suited for investigation, analysis, and modeling. Common tasks in data preparation include dealing with missing data, identifying outliers, and determining the best way to represent variables.

Legitimately Missing Data

Real world data sets will often include observations with missing values. In some cases, missing data naturally occur; these are called legitimately missing data. Generally no remedial action is taken for legitimately missing data.

For example, suppose we know the following are test scores and the NA values represent students who dropped the class or were exempt from the test: <100, 82, NA, 73, NA, 45, 56>. Finding the test average involves removing the NA values. The R command is.na(x) will return a vector indicating which values of vector \(x\) were missing.

x <- c(100,82,NA,73,NA,45,56)
is.na(x) # returns a vector (F,F,T,F,T,F,F)
## [1] FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE

Since some of the data were missing, calculating the mean of \(x\) will result in an NA.

mean(x) # returns NA
## [1] NA

We can, however, use the na.rm=TRUE command to remove missing values in the data.

mean(x,na.rm=TRUE)
## [1] 71.2

Illegitimately Missing Data

If missing data is not characterized as legitimately missing, we categorize the missing data as illegitimately missing data. These occur for many reasons, such as a respondent electing not to answer a question, a respondent dropping out of a study before its completion, or electronic sensor data collection equipment failing during a study. We should always consider remedial action in these types of cases. The primary options for addressing such missing data are

  1. to discard observations (rows) with any missing values,

  2. to discard any variable (column) with missing values,

  3. to fill in missing entries with estimated values, or

  4. to apply a data-mining algorithm (such as classification and regression trees) that can handle missing values.

Deciding on a strategy for dealing with missing data requires some understanding of why the data are missing and the potential impact these missing values might have on an analysis. Thus, a combination of strategic thinking and common sense must be used.

If there is no relationship between the missingness of the data and any values, observed or missing, the missing value is referred to as missing completely at random (MCAR).

However, the data may not be missing completely at random. If the missing observation is related to the value of some other variable(s) in the data, the missing value is called missing at random (MAR). For data that is MAR, the reason for the missing values may determine its importance.

If the missing variable is related to the value that is missing, the value is classified as missing not at random (MNAR).

For example

  1. Younger people are less likely to get their blood pressure checked at the doctor so their medical records often miss blood pressure data.
    • There is a relationship between age and a lack of records, so this is MAR.
  2. A scale has a limit of 30lbs, so shipments beyond that weight do not have a recorded weight.
    • All values that are missing are likely to exceed 30lbs. Thus, they are MNAR.
  3. School district 26 has missing test scores while all other districts are not missing their test scores.
    • Again, the group in question is identified based on location, so the data is MAR.
  4. Questionnaires were randomly lost in the mail.
    • Here, there is no group identified and no correlation between any factors, so the data is MCAR.
  5. Blood samples tested during the night shift were more likely to be contaminated creating missing lab results but no time of test was recorded.
    • We cannot identify a group of patients since the test time was not recorded, so here the data is MNAR.

Dealing With Missing Data

Understanding which of these three categories (MCAR, MAR, and MNAR) missing values belongs determines how to deal with the data.

  • If a variable has observations for which the missing values are MCAR or MAR and only a relatively small number of observations are missing values, the observations that are missing values can be ignored (discard the rows). We will certainly lose information if the observations that are missing values for the variable are ignored, but the results of an analysis of the data will not be biased by the missing values.

The complete.cases command helps us remove rows that have missing values. For example

df <- data.frame(col1 = c(1:3, NA),
                 col2 = c("this", NA,"is", "text"), 
                 col3 = c(TRUE, FALSE, TRUE, TRUE), 
                 col4 = c(2.5, 4.2, 3.2, NA),
                 stringsAsFactors = FALSE)
df
##   col1 col2  col3 col4
## 1    1 this  TRUE  2.5
## 2    2 <NA> FALSE  4.2
## 3    3   is  TRUE  3.2
## 4   NA text  TRUE   NA
# Identify the rows that are complete.
complete.cases(df) 
## [1]  TRUE FALSE  TRUE FALSE

If we want to only display rows with no missing values, we can look at the data frame in the following way.

# Keep the complete rows and display them.
df[complete.cases(df),] 
##   col1 col2 col3 col4
## 1    1 this TRUE  2.5
## 3    3   is TRUE  3.2

If a variable has observations with missing values that are MNAR, the observation with missing values cannot be ignored because any analysis that includes the variable with MNAR values will be biased. If the variable with MNAR values is thought to be redundant with another variable in the data for which there are few or no missing values, removing the MNAR variable from consideration may be an option. In particular, if the MNAR variable is highly correlated with another variable that is known for a majority of observations, the loss of information may be minimal.

For example, In a data set where we are trying to determine the height of a male based on the variables “hand size”, “foot size”, and “weight”, we notice that a number of observations are missing the foot size variable. Since there is a strong correlation between the hand size and foot size of a male, we may omit the “foot size” column with little loss of information.

No matter the type of missing values (MCAR, MAR, or MNAR), we should always attempt to determine the actual value that is missing. We can sometimes achieve this by examining the source of the data or logically determining the likely value that is missing. If the missing values cannot be determined and ignoring missing values or removing a variable with missing values from consideration is not an option, imputation (the systematic replacement of missing values with values that seems reasonable) may be useful. Options for replacing the missing entries for a variable include replacing the missing value with the variable’s mode, mean, or median. Imputing values in this manner is truly valid only if variable values are MCAR; otherwise, we may be introducing misleading information into the data. If missing values are particularly troublesome and MAR, it may be possible to build a model to predict a variable with missing values and then to use these predictions in place of the missing entries. How to deal with missing values is fairly subjective, and caution must be used to not induce bias by replacing missing values.

An Example

As an example, we consider the airquality data set. We want to determine which variables are missing and how we might impute the values.

# Load and view the data
aq <- airquality
head(aq,10)
##    Ozone Solar.R Wind Temp Month Day
## 1     41     190  7.4   67     5   1
## 2     36     118  8.0   72     5   2
## 3     12     149 12.6   74     5   3
## 4     18     313 11.5   62     5   4
## 5     NA      NA 14.3   56     5   5
## 6     28      NA 14.9   66     5   6
## 7     23     299  8.6   65     5   7
## 8     19      99 13.8   59     5   8
## 9      8      19 20.1   61     5   9
## 10    NA     194  8.6   69     5  10
# Determine the number of rows in the data.
nrow(aq)
## [1] 153
#  Determine the number of rows with missing values
nrow(aq[!complete.cases(aq),])
## [1] 42

We see that we are missing 42 cases out of a total of 153 (27.5%). We can determine how many missing data are in each column. We use the cbind() command to display the data in a table.

# Add up how many NA values in each column.
cbind(
  sum(is.na(aq$Ozone)),
  sum(is.na(aq$Solar.R)),
  sum(is.na(aq$Wind)),
  sum(is.na(aq$Temp)),
  sum(is.na(aq$Month)),
  sum(is.na(aq$Day)))
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]   37    7    0    0    0    0

Most of the missing data are from the Ozone column and the remaining is in the Solar.R. Thus, we can replace those missing values by the column means (ignoring all of the NA values in those columns).

#  Impute the NA values with the mean.
aq$Ozone[is.na(aq$Ozone)] <- mean(aq$Ozone, na.rm = TRUE)
aq$Solar.R[is.na(aq$Solar.R)] <- mean(aq$Solar.R, na.rm = TRUE)
head(aq,10)
##       Ozone  Solar.R Wind Temp Month Day
## 1  41.00000 190.0000  7.4   67     5   1
## 2  36.00000 118.0000  8.0   72     5   2
## 3  12.00000 149.0000 12.6   74     5   3
## 4  18.00000 313.0000 11.5   62     5   4
## 5  42.12931 185.9315 14.3   56     5   5
## 6  28.00000 185.9315 14.9   66     5   6
## 7  23.00000 299.0000  8.6   65     5   7
## 8  19.00000  99.0000 13.8   59     5   8
## 9   8.00000  19.0000 20.1   61     5   9
## 10 42.12931 194.0000  8.6   69     5  10

The missing values are now the average of the remaining values. Thus, we have taken a number of MCAR variables and replaced them with the average of the columns. We now have a full (though, not perfect) dataset.

Citations

Camm, Jeffrey D. Business Analytics. Third edition, Cengage, 2019.

Dealing with Missing Values - UC Business Analytics R Programming Guide. Accessed April 19, 2021. Available here.

The Analysis Factor. “How to Diagnose the Missing Data Mechanism,” May 20, 2013. Available here.