Data: Quality

Preliminary

We will use the DescTools and caret packages to address quality issues with our data. Both packages have very useful and comprehensive vignettes which can be accessed through the help documentation in RStudio. We load the packages for use in the R session using the library() function.

library(DescTools)
library(caret)

In the lesson that follows we will use the “carseats_v2.csv” file. This dataset contains sales of child car seats at 403 different stores. The variables in the dataset include:

CompPrice: Price charged by competitor at each location
Income: Community income level (in 1,000s of dollars)
Advertising: Local advertising budget for the company at each location (in 1,000s of dollars)
Population: Population size in region (in 1,000s)
Price: Price charged by company for car seats at each locations
ShelveLoc: Quality of shelving location for the car seats at each location (Bad, Medium, Good)
Age: Average age of the local population
Education: Education level at each location
Urban: Indicates if the location is urban (Yes) or rural (No)
US: Indicates if the location is in the US (Yes) or outside of the US (No)
Sales_Lev: Indicates the sales level of the location. High sales are those that are greater than 8 units (in 1,000s) and Low sales are those than are less than or equal to 8 units (in 1,000s)

We use the read.csv() function to import the CSV file into R as a dataframe named cs2. We set stringsAsFactors = FALSE to keep any character columns as-is. We use the na.strings argument to treat blank values ("“) and spaces (” ") in character columns as NA (missing) values file.

cs2 <- read.csv(file = "carseats_v2.csv",
                na.strings = c("", " "),
                stringsAsFactors = FALSE)

First, can prepare our variables. We set up variable name vectors for convenience to identify our numeric, unordered factors and ordered factors.

facs <- c("Urban", "US")
ords <- c("ShelveLoc", "Sales_Lev")
nums <- names(cs2)[!names(cs2) %in% c(facs, ords)]

Next, we convert our categorical (nominal and ordinal) variables.

# Unordered
cs2[ ,facs] <- lapply(X = cs2[ , facs], 
                     FUN = factor)
# Ordered
cs2$ShelveLoc <- factor(x = cs2$ShelveLoc, 
                       levels = c("Bad", "Medium", "Good"),
                       ordered = TRUE)

cs2$Sales_Lev <- factor(x = cs2$Sales_Lev,
                       levels = c("Low", "High"),
                       ordered = TRUE)

Before exploring and addressing the quality of the data, we can obtain some high-level information about the cs2 dataframe.

str(cs2) # structure

## 'data.frame':    403 obs. of  11 variables:
##  $ CompPrice  : int  138 111 113 138 117 141 124 115 136 132 ...
##  $ Income     : int  73 48 35 73 100 64 113 105 81 110 ...
##  $ Advertising: int  11 16 10 11 4 3 13 0 15 0 ...
##  $ Population : int  276 260 269 276 466 340 501 45 425 108 ...
##  $ Price      : int  120 83 80 120 97 128 72 108 120 124 ...
##  $ ShelveLoc  : Ord.factor w/ 3 levels "Bad"<"Medium"<..: 1 3 2 1 2 1 1 2 3 2 ...
##  $ Age        : int  42 65 59 42 55 38 78 71 67 76 ...
##  $ Education  : int  17 10 12 17 14 13 16 15 10 10 ...
##  $ Urban      : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 1 2 2 1 ...
##  $ US         : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 1 NA 1 ...
##  $ Sales_Lev  : Ord.factor w/ 2 levels "Low"<"High": 2 2 2 2 1 1 2 1 2 1 ...

Abstract(cs2) # dataframe overview (DescTools)

## ------------------------------------------------------------------------------ 
## cs2
## 
## data frame:  403 obs. of  11 variables
##      396 complete cases (98.3%)
## 
##   Nr  ColName      Class            NAs       Levels               
##   1   CompPrice    integer          .                              
##   2   Income       integer          2 (0.5%)                       
##   3   Advertising  integer          1 (0.2%)                       
##   4   Population   integer          1 (0.2%)                       
##   5   Price        integer          .                              
##   6   ShelveLoc    ordered, factor  2 (0.5%)  (3): 1-Bad, 2-Medium,
##                                               3-Good               
##   7   Age          integer          .                              
##   8   Education    integer          1 (0.2%)                       
##   9   Urban        factor           .         (2): 1-No, 2-Yes     
##   10  US           factor           1 (0.2%)  (2): 1-No, 2-Yes     
##   11  Sales_Lev    ordered, factor  .         (2): 1-Low, 2-High

summary(cs2) # summary

##    CompPrice       Income        Advertising       Population   
##  Min.   : 77   Min.   : 21.00   Min.   : 0.000   Min.   : 10.0  
##  1st Qu.:115   1st Qu.: 43.00   1st Qu.: 0.000   1st Qu.:139.2  
##  Median :125   Median : 69.00   Median : 5.000   Median :274.0  
##  Mean   :125   Mean   : 68.71   Mean   : 6.841   Mean   :264.7  
##  3rd Qu.:135   3rd Qu.: 91.00   3rd Qu.:12.000   3rd Qu.:397.5  
##  Max.   :175   Max.   :120.00   Max.   :75.000   Max.   :509.0  
##                NA's   :2        NA's   :1        NA's   :1      
##      Price        ShelveLoc        Age          Education     Urban    
##  Min.   : 24.0   Bad   : 96   Min.   :25.00   Min.   :10.00   No :118  
##  1st Qu.:100.0   Medium:220   1st Qu.:40.00   1st Qu.:12.00   Yes:285  
##  Median :118.0   Good  : 85   Median :54.00   Median :14.00            
##  Mean   :115.8   NA's  :  2   Mean   :53.32   Mean   :13.89            
##  3rd Qu.:131.0                3rd Qu.:66.00   3rd Qu.:16.00            
##  Max.   :191.0                Max.   :80.00   Max.   :18.00            
##                                               NA's   :1                
##     US      Sales_Lev 
##  No  :143   Low :237  
##  Yes :259   High:166  
##  NA's:  1             
##                       
##                       
##                       
##

Data Quality

Duplicate Observations

We use the duplicated() function to identify duplicate observations (the first matching row will be considered the unique observation and all others after are ‘duplicated’). The duplicated() function returns a boolean vector indicating if each row is duplicated (TRUE) or unique (FALSE). We use the output to create a subset (using indexing) of the duplicated observations.

cs2[duplicated(cs2), ]

##   CompPrice Income Advertising Population Price ShelveLoc Age Education Urban
## 4       138     73          11        276   120       Bad  42        17   Yes
##    US Sales_Lev
## 4 Yes      High

We can remove duplicates by keeping only those rows that are not duplicated (!duplicated()).

cs2 <- cs2[!duplicated(cs2), ]

Missing Values

Missing numeric variables are a special type of value in R, NA. By default, missing values in character variables in a dataframe will be "". The na.strings argument of read.csv() was used when importing the cs2 data to convert missing character values to NA values. Using the PlotMiss() function in the DescTools package, we can visualize the missing values in our data by Variable. The observation number is on the x-axis and the variable names are on the y-axis. This plot can help us to identify any variables that have a high percentage of missing values. These variables are candidates for elimination prior to analysis.

PlotMiss(x = cs2, 
         main = "Missing Values by Variable")

Complete cases are rows that do not contain any missing (NA) values. We can view rows without missing values using the complete.cases() function and we can view rows with missing values (not complete cases) using !complete.cases().

cs2[!complete.cases(cs2), ]

##     CompPrice Income Advertising Population Price ShelveLoc Age Education Urban
## 9         136     81          15        425   120      Good  67        10   Yes
## 19        120     NA          10        300   122    Medium  62        12   Yes
## 34        125     94          NA        447    89      Good  30        12   Yes
## 44        119     98           0         18   126      <NA>  73        17    No
## 46         77     NA           0         25    24    Medium  50        18   Yes
## 225       124     44           0         NA   107    Medium  80        11   Yes
## 256       133     97           0         70   117      <NA>  32        NA   Yes
##       US Sales_Lev
## 9   <NA>      High
## 19    No       Low
## 34    No      High
## 44    No       Low
## 46    No      High
## 225   No       Low
## 256   No      High

We can save the row names of the observations with missing values as a vector.

na_rows <- rownames(cs2)[!complete.cases(cs2)]

Remove Missing Values

To remove any observations that contain NA values, the na.omit() function can be used. We create a new dataframe, cs2_noNA, which is a subset of the cs2 dataframe, which omits rows with NA values.

cs2_noNA <- na.omit(cs2)
summary(cs2_noNA)

##    CompPrice         Income        Advertising       Population   
##  Min.   : 85.0   Min.   : 21.00   Min.   : 0.000   Min.   : 10.0  
##  1st Qu.:115.0   1st Qu.: 42.00   1st Qu.: 0.000   1st Qu.:141.0  
##  Median :125.0   Median : 69.00   Median : 5.000   Median :272.0  
##  Mean   :125.1   Mean   : 68.52   Mean   : 6.871   Mean   :265.5  
##  3rd Qu.:135.0   3rd Qu.: 90.00   3rd Qu.:12.000   3rd Qu.:397.0  
##  Max.   :175.0   Max.   :120.00   Max.   :75.000   Max.   :509.0  
##      Price        ShelveLoc        Age          Education     Urban    
##  Min.   : 49.0   Bad   : 95   Min.   :25.00   Min.   :10.00   No :117  
##  1st Qu.:100.0   Medium:217   1st Qu.:40.00   1st Qu.:12.00   Yes:278  
##  Median :118.0   Good  : 83   Median :54.00   Median :14.00            
##  Mean   :116.1                Mean   :53.29   Mean   :13.89            
##  3rd Qu.:131.0                3rd Qu.:65.50   3rd Qu.:16.00            
##  Max.   :191.0                Max.   :80.00   Max.   :18.00            
##    US      Sales_Lev 
##  No :137   Low :234  
##  Yes:258   High:161  
##                      
##                      
##                      
##

Impute Missing Values

Rather than removing any rows with missing values, we can perform imputation and replace the missing values. Typically, a measure of central tendency such as the mean, median or mode is used for replacement.

First, we will create a copy of the cs2 dataframe, cs2_copy, which we will perform the imputations on so that before and after imputation can be compared.

cs2_copy <- cs2

Single Variable Imputation

Mean Imputation

The mean can be used to replace missing numerical values on a per-column basis. We use the mean() function to obtain the average value for the variable and we set na.rm = TRUE to ignore the NA values when computing the mean. We assign the result of the mean() function to the observations where the Income variable is equal to NA.

cs2_copy$Income[is.na(cs2_copy$Income)] <- mean(cs2_copy$Income, 
                                      na.rm = TRUE)

We can compare the variable in the original cs2 dataframe and the imputed variable in the cs2_copy dataframe.

cbind(cs2[na_rows, "Income"], cs2_copy[na_rows, "Income"])

##      [,1]    [,2]
## [1,]   81 81.0000
## [2,]   NA 68.6975
## [3,]   94 94.0000
## [4,]   98 98.0000
## [5,]   NA 68.6975
## [6,]   44 44.0000
## [7,]   97 97.0000

Mode Imputation

For categorical (or discrete numerical with few unique values) variables, the most frequent value, the mode, can be used to replace missing values on a per-variable basis or a new category representing missingness can be added. If the variable has more than one mode, this may not be the best imputation method. The custom function modefun() can be created and used to identify the most frequent unique value of a variable.

modefun <- function(x){
  if(any(tabulate(match(x, unique(x))) > 1)){
    outp <- unique(x)[which(tabulate(match(x, unique(x))) == max(tabulate(match(x, unique(x)))))]
  } else {
    outp <- "No mode exists"}
  return(outp)
}

cs2_copy$ShelveLoc[is.na(cs2_copy$ShelveLoc)] <- modefun(x = cs2_copy$ShelveLoc)
cs2_copy$US[is.na(cs2_copy$US)] <- modefun(x = cs2_copy$US)

We can compare the variable before and after mode imputation using the cbind() function.

cbind(cs2[na_rows, c("US", "ShelveLoc")],
      cs2_copy[na_rows, c("US", "ShelveLoc")])

##       US ShelveLoc  US ShelveLoc
## 9   <NA>      Good Yes      Good
## 19    No    Medium  No    Medium
## 34    No      Good  No      Good
## 44    No      <NA>  No    Medium
## 46    No    Medium  No    Medium
## 225   No    Medium  No    Medium
## 256   No      <NA>  No    Medium

Multiple Variable Imputation

We can use the preProcess() function to perform imputation to more than one numeric variable.

Median Imputation

We set method = medianImpute in the preProcess() function to perform median imputation on all of the numeric columns in the dataframe identified in the x argument.

pp <- preProcess(x = cs2_copy,
                 method = "medianImpute")

To apply the imputation to the data, we use the predict() function. We overwrite the existing cs2_copy dataframe with the newly created dataframe including median imputed values.

cs2_copy <- predict(object = pp, 
                   newdata = cs2_copy)

We can view the data before and after applying the median imputation using the cbind() function.

cbind(cs2[na_rows, c("Advertising", "Population", "Education")],
      cs2_copy[na_rows, c("Advertising", "Population", "Education")])

##     Advertising Population Education Advertising Population Education
## 9            15        425        10          15        425        10
## 19           10        300        12          10        300        12
## 34           NA        447        12           5        447        12
## 44            0         18        17           0         18        17
## 46            0         25        18           0         25        18
## 225           0         NA        11           0        272        11
## 256           0         70        NA           0         70        14

Outliers

Two popular methods to identify outliers are using Box Plots or Z-Scores. In some cases, we should consider removing outliers.

Box Plot Method

We can use the Outlier() function from the DescTools package to identify outlier values as those values that extend beyond the whiskers of the plot.

Outlier(x = cs2_copy$CompPrice, method = "boxplot")

## [1]  77 175

We can set value = FALSE to identify the row index number of the outliers (instead of outlier values) and use indexing to view the observations.

cs2[Outlier(x = cs2_copy$CompPrice,
            method = "boxplot", value = FALSE),]

##     CompPrice Income Advertising Population Price ShelveLoc Age Education Urban
## 46         77     NA           0         25    24    Medium  50        18   Yes
## 314       175     65          29        419   166    Medium  53        12   Yes
##      US Sales_Lev
## 46   No      High
## 314 Yes      High

Z-Score Method

Z-Scores are produced by standardizing a numeric variable. Those Z-Scores that have a larger absolute value than 3 are typically considered to be outliers. We can use the scale() function, which by default centers and scales the variable, to obtain the Z-Scores. We can use conditional indexing to view those rows where the absolute value of the Z-Score for the variable is greater than 3.

cs2_copy[abs(scale(cs2_copy$CompPrice)) > 3, ]

##     CompPrice  Income Advertising Population Price ShelveLoc Age Education
## 46         77 68.6975           0         25    24    Medium  50        18
## 314       175 65.0000          29        419   166    Medium  53        12
##     Urban  US Sales_Lev
## 46    Yes  No      High
## 314   Yes Yes      High