Data are facts and figures collected, analyzed, and summarized for presentation and interpretation. A variable is a characteristic or attribute of interest recorded for each entity in the data. For example, for a Major League Baseball (MLB) player, his age, batting average, and hometown would be considered variables. An observation is a set of values corresponding to a set of variables. In the table below we have four observations, each corresponding to one of the four players. We are concerned with how the value of a variable can vary: variation is the difference in a variable measured over observations (time, customers, items, etc.).
| Player | Age | Batting Average (BA) | Hometown |
|---|---|---|---|
| John | 25 | 0.235 | Kelowna |
| Bill | 20 | 0.315 | Penticton |
| Kevin | 29 | 0.287 | Vancouver |
| Steve | 33 | 0.259 | Victoria |
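As a sketch in R, we could store this small table as a data frame in which each row is an observation and each column is a variable.

# Each row is an observation (a player); each column is a variable.
players <- data.frame(Player   = c("John", "Bill", "Kevin", "Steve"),
                      Age      = c(25, 20, 29, 33),
                      BA       = c(0.235, 0.315, 0.287, 0.259),
                      Hometown = c("Kelowna", "Penticton", "Vancouver", "Victoria"))
str(players) # 4 observations; Age, BA, and Hometown are the variables of interest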
The role of descriptive analytics is to collect and analyze data to gain a better understanding of variation. A quantity whose value is not known with certainty is called a random variable. When we collect data, we are gathering past observed values of such quantities. The goal is to learn more about their variation and to gain insight from those observations.
Data can be categorized in many ways based on the type of data and how the data are collected.
In many cases, we cannot collect data from the entire population of interest. In those cases, we collect data from a subset of the population called a sample. Sample data must be representative of the population or generalizations (inferences) cannot be made. Normally, a representative sample is fairly straightforward to collect. Dealing with population data versus sample data introduces small differences in how we calculate and interpret summary statistics, but there are a variety of reasons (such as data size and collection efficiency) why we prefer to collect sample data rather than population data.
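As a small illustration (a sketch with simulated values, not real data), R's sample() function draws a random subset from a larger collection of population values.

set.seed(1)                                    # for reproducibility
population_ages <- round(runif(10000, 18, 65)) # hypothetical population of 10,000 ages
sample_ages <- sample(population_ages, 50)     # a simple random sample of 50
mean(population_ages)                          # population mean
mean(sample_ages)                              # sample mean estimates the population mean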
Qualitative (or categorical) data are descriptive. Quantitative data are numerical in nature and lend themselves well to mathematical calculations. If arithmetic operations cannot meaningfully be performed on the data, the data are considered categorical.
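Using the player data above as a sketch, arithmetic is meaningful for a quantitative variable such as Age, while a categorical variable such as Hometown is summarized by counts instead.

age <- c(25, 20, 29, 33)                                       # quantitative
hometown <- c("Kelowna", "Penticton", "Vancouver", "Victoria") # categorical
mean(age)        # arithmetic operations make sense for quantitative data
table(hometown)  # categorical data are summarized by counting categories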
For statistical analysis, it is important to distinguish between cross-sectional data and time series data. Cross-sectional data are collected from several entities at the same, or approximately the same, point in time. Time series data are collected over several time periods.
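As a sketch (with hypothetical batting averages for the time series), cross-sectional data record several players at one point in time, while time series data record one player over several seasons.

ba_2020 <- c(John = 0.235, Bill = 0.315, Kevin = 0.287, Steve = 0.259) # cross-sectional
ba_john <- ts(c(0.251, 0.244, 0.263, 0.235), start = 2017)             # time series, 2017-2020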
Data can often be obtained with an appropriate study; such statistical studies can be classified as either experimental or observational. In an experimental study, a variable of interest is first identified. Then one or more other variables are identified and controlled or manipulated to obtain data about how these variables influence the variable of interest. Nonexperimental, or observational, studies make no attempt to control the variables of interest. Surveys tend to be observational in nature while clinical trials tend to be experimental.
The data in a data set are often said to be “dirty” and “raw” before they have been put into a form that is best suited for investigation, analysis, and modeling. Common tasks in data preparation include dealing with missing data, identifying outliers, and determining the best way to represent variables.
Real world data sets will often include observations with missing values. In some cases, missing data naturally occur; these are called legitimately missing data. Generally no remedial action is taken for legitimately missing data.
For example, suppose we know the following are test scores and the NA values represent students who dropped the class or were exempt from the test: <100, 82, NA, 73, NA, 45, 56>. Finding the test average involves removing the NA values. The R command is.na(x) will return a vector indicating which values of vector \(x\) were missing.
x <- c(100,82,NA,73,NA,45,56)
is.na(x) # returns a vector (F,F,T,F,T,F,F)
## [1] FALSE FALSE TRUE FALSE TRUE FALSE FALSE
Since some of the data were missing, calculating the mean of \(x\) will result in an NA.
mean(x) # returns NA
## [1] NA
We can, however, use the na.rm=TRUE argument to remove the missing values from the calculation.
mean(x,na.rm=TRUE)
## [1] 71.2
If missing data are not characterized as legitimately missing, we categorize them as illegitimately missing data. These occur for many reasons, such as a respondent electing not to answer a question, a respondent dropping out of a study before its completion, or electronic sensor equipment failing during a study. We should always consider remedial action in these cases. The primary options for addressing such missing data are:
- to discard observations (rows) with any missing values,
- to discard any variable (column) with missing values,
- to fill in missing entries with estimated values, or
- to apply a data-mining algorithm (such as classification and regression trees) that can handle missing values.
Deciding on a strategy for dealing with missing data requires some understanding of why the data are missing and the potential impact these missing values might have on an analysis. Thus, a combination of strategic thinking and common sense must be used.
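As a rough sketch, the first three options above can be carried out on a small hypothetical data frame as follows.

dat <- data.frame(x = c(1, 2, NA, 4), y = c(10, NA, 30, 40), z = c(5, 6, 7, 8))
na.omit(dat)                                    # option 1: discard rows with any missing value
dat[, colSums(is.na(dat)) == 0, drop = FALSE]   # option 2: discard columns containing missing values
dat$x[is.na(dat$x)] <- mean(dat$x, na.rm = TRUE) # option 3: fill in missing entries with an estimate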
If there is no relationship between the missingness of the data and any values, observed or missing, the missing value is referred to as missing completely at random (MCAR).
However, the data may not be missing completely at random. If the missingness is related to the value of some other observed variable(s) in the data, the missing value is called missing at random (MAR). For data that are MAR, understanding the reason for the missingness may determine how important it is to address.
If the missingness is related to the value that is missing itself, the value is classified as missing not at random (MNAR).
For example, suppose we survey people about their income: values would be MCAR if questionnaires were lost at random, MAR if older respondents were less likely to report income (and age is recorded in the data), and MNAR if respondents with very high incomes declined to report them.
Understanding which of these three categories (MCAR, MAR, or MNAR) the missing values belong to determines how we should deal with the data.
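A toy simulation (hypothetical income and age values, purely illustrative) shows how the three mechanisms differ.

set.seed(2)
income <- rnorm(100, mean = 50, sd = 10)            # true values (in thousands)
age    <- rnorm(100, mean = 40, sd = 12)
income_mcar <- ifelse(runif(100) < 0.1, NA, income) # MCAR: missingness is pure chance
income_mar  <- ifelse(age > 55, NA, income)         # MAR: missingness depends on observed age
income_mnar <- ifelse(income > 65, NA, income)      # MNAR: missingness depends on income itself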
The complete.cases() function identifies the rows of a data frame that have no missing values, which we can then use to remove incomplete rows. For example,
df <- data.frame(col1 = c(1:3, NA),
                 col2 = c("this", NA, "is", "text"),
                 col3 = c(TRUE, FALSE, TRUE, TRUE),
                 col4 = c(2.5, 4.2, 3.2, NA),
                 stringsAsFactors = FALSE)
df
## col1 col2 col3 col4
## 1 1 this TRUE 2.5
## 2 2 <NA> FALSE 4.2
## 3 3 is TRUE 3.2
## 4 NA text TRUE NA
# Identify the rows that are complete.
complete.cases(df)
## [1] TRUE FALSE TRUE FALSE
If we want to display only the rows with no missing values, we can subset the data frame in the following way.
# Keep the complete rows and display them.
df[complete.cases(df),]
## col1 col2 col3 col4
## 1 1 this TRUE 2.5
## 3 3 is TRUE 3.2
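The built-in na.omit() function keeps the same complete rows and is a common shortcut.

na.omit(df)   # equivalent way to drop the incomplete rows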
If a variable has observations with missing values that are MNAR, the observation with missing values cannot be ignored because any analysis that includes the variable with MNAR values will be biased. If the variable with MNAR values is thought to be redundant with another variable in the data for which there are few or no missing values, removing the MNAR variable from consideration may be an option. In particular, if the MNAR variable is highly correlated with another variable that is known for a majority of observations, the loss of information may be minimal.
For example, in a data set where we are trying to determine the height of a male based on the variables “hand size”, “foot size”, and “weight”, we may notice that a number of observations are missing the foot size value. Since there is a strong correlation between a male’s hand size and foot size, we may omit the “foot size” column with little loss of information.
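A minimal sketch with hypothetical measurements (in centimetres) illustrates checking that correlation before dropping the column.

hand <- c(19.2, 20.5, 21.1, 18.7, 22.0, 20.0)   # hand size, known for all observations
foot <- c(26.0, 27.4, NA,   25.3, NA,   26.8)   # foot size, with missing (MNAR) values
cor(hand, foot, use = "complete.obs")           # strong correlation on the observed pairs
# A high correlation suggests foot size is largely redundant with hand size,
# so the foot-size column could be dropped with little loss of information.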
No matter the type of missing values (MCAR, MAR, or MNAR), we should always attempt to determine the actual value that is missing. We can sometimes achieve this by examining the source of the data or logically determining the likely value. If the missing values cannot be determined, and ignoring missing values or removing a variable from consideration is not an option, imputation (the systematic replacement of missing values with values that seem reasonable) may be useful. Options for replacing the missing entries for a variable include replacing them with the variable’s mode, mean, or median. Imputing values in this manner is truly valid only if the values are MCAR; otherwise, we may be introducing misleading information into the data. If missing values are particularly troublesome and MAR, it may be possible to build a model that predicts the variable with missing values and to use the model’s predictions in place of the missing entries. How to deal with missing values is fairly subjective, and caution must be used not to induce bias by replacing them.
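A brief sketch of these simple imputation choices on a hypothetical vector:

v <- c(4, 7, NA, 7, 10)
v_mean   <- replace(v, is.na(v), mean(v, na.rm = TRUE))                  # impute with the mean
v_median <- replace(v, is.na(v), median(v, na.rm = TRUE))                # impute with the median
v_mode   <- replace(v, is.na(v), as.numeric(names(which.max(table(v))))) # impute with the mode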
As an example, we consider the airquality data set. We want to determine which variables are missing and how we might impute the values.
# Load and view the data
aq <- airquality
head(aq,10)
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 NA NA 14.3 56 5 5
## 6 28 NA 14.9 66 5 6
## 7 23 299 8.6 65 5 7
## 8 19 99 13.8 59 5 8
## 9 8 19 20.1 61 5 9
## 10 NA 194 8.6 69 5 10
# Determine the number of rows in the data.
nrow(aq)
## [1] 153
# Determine the number of rows with missing values
nrow(aq[!complete.cases(aq),])
## [1] 42
We see that 42 of the 153 observations (27.5%) have at least one missing value. We can also determine how many missing values are in each column; here we use the cbind() function to display the counts in a single row.
# Add up how many NA values in each column.
cbind(sum(is.na(aq$Ozone)),
      sum(is.na(aq$Solar.R)),
      sum(is.na(aq$Wind)),
      sum(is.na(aq$Temp)),
      sum(is.na(aq$Month)),
      sum(is.na(aq$Day)))
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 37 7 0 0 0 0
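A more concise equivalent counts the NA values in every column at once.

colSums(is.na(aq))   # same counts as above, labelled by column name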
Most of the missing values are in the Ozone column and the rest are in Solar.R. We can replace these missing values with their column means (computed ignoring the NA values in those columns).
# Impute the NA values with the mean.
aq$Ozone[is.na(aq$Ozone)] <- mean(aq$Ozone, na.rm = TRUE)
aq$Solar.R[is.na(aq$Solar.R)] <- mean(aq$Solar.R, na.rm = TRUE)
head(aq,10)
## Ozone Solar.R Wind Temp Month Day
## 1 41.00000 190.0000 7.4 67 5 1
## 2 36.00000 118.0000 8.0 72 5 2
## 3 12.00000 149.0000 12.6 74 5 3
## 4 18.00000 313.0000 11.5 62 5 4
## 5 42.12931 185.9315 14.3 56 5 5
## 6 28.00000 185.9315 14.9 66 5 6
## 7 23.00000 299.0000 8.6 65 5 7
## 8 19.00000 99.0000 13.8 59 5 8
## 9 8.00000 19.0000 20.1 61 5 9
## 10 42.12931 194.0000 8.6 69 5 10
The missing values have now been replaced by the average of the observed values in each column. In other words, we have treated these missing values as MCAR and imputed them with the column means. We now have a complete (though not perfect) data set.
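If we instead suspected the missing Ozone values were MAR (related to observed variables such as temperature or wind), a model-based alternative would be to predict the missing values rather than use the overall mean. A minimal sketch:

aq2 <- airquality                              # start again from the raw data
fit <- lm(Ozone ~ Temp + Wind, data = aq2)     # fit on the rows where Ozone is observed
missing <- is.na(aq2$Ozone)
aq2$Ozone[missing] <- predict(fit, newdata = aq2[missing, ])
sum(is.na(aq2$Ozone))                          # all Ozone values are now filled in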