We will use the DescTools and caret packages to address quality issues with our data. Both packages have very useful and comprehensive vignettes which can be accessed through the help documentation in RStudio. We load the packages for use in the R session using the library() function.
library(DescTools)
library(caret)
In the lesson that follows we will use the “carseats_v2.csv” file. This dataset contains sales of child car seats at 403 different stores. The variables in the dataset include:
CompPrice: Price charged by competitor at each locationIncome: Community income level (in 1,000s of dollars)Advertising: Local advertising budget for the company at each location (in 1,000s of dollars)Population: Population size in region (in 1,000s)Price: Price charged by company for car seats at each locationsShelveLoc: Quality of shelving location for the car seats at each location (Bad, Medium, Good)Age: Average age of the local populationEducation: Education level at each locationUrban: Indicates if the location is urban (Yes) or rural (No)US: Indicates if the location is in the US (Yes) or outside of the US (No)Sales_Lev: Indicates the sales level of the location. High sales are those that are greater than 8 units (in 1,000s) and Low sales are those than are less than or equal to 8 units (in 1,000s)We use the read.csv() function to import the CSV file into R as a dataframe named cs2. We set stringsAsFactors = FALSE to keep any character columns as-is. We use the na.strings argument to treat blank values ("“) and spaces (” ") in character columns as NA (missing) values file.
cs2 <- read.csv(file = "carseats_v2.csv",
na.strings = c("", " "),
stringsAsFactors = FALSE)
First, can prepare our variables. We set up variable name vectors for convenience to identify our numeric, unordered factors and ordered factors.
facs <- c("Urban", "US")
ords <- c("ShelveLoc", "Sales_Lev")
nums <- names(cs2)[!names(cs2) %in% c(facs, ords)]
Next, we convert our categorical (nominal and ordinal) variables.
# Unordered
cs2[ ,facs] <- lapply(X = cs2[ , facs],
FUN = factor)
# Ordered
cs2$ShelveLoc <- factor(x = cs2$ShelveLoc,
levels = c("Bad", "Medium", "Good"),
ordered = TRUE)
cs2$Sales_Lev <- factor(x = cs2$Sales_Lev,
levels = c("Low", "High"),
ordered = TRUE)
Before exploring and addressing the quality of the data, we can obtain some high-level information about the cs2 dataframe.
str(cs2) # structure
## 'data.frame': 403 obs. of 11 variables:
## $ CompPrice : int 138 111 113 138 117 141 124 115 136 132 ...
## $ Income : int 73 48 35 73 100 64 113 105 81 110 ...
## $ Advertising: int 11 16 10 11 4 3 13 0 15 0 ...
## $ Population : int 276 260 269 276 466 340 501 45 425 108 ...
## $ Price : int 120 83 80 120 97 128 72 108 120 124 ...
## $ ShelveLoc : Ord.factor w/ 3 levels "Bad"<"Medium"<..: 1 3 2 1 2 1 1 2 3 2 ...
## $ Age : int 42 65 59 42 55 38 78 71 67 76 ...
## $ Education : int 17 10 12 17 14 13 16 15 10 10 ...
## $ Urban : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 1 2 2 1 ...
## $ US : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 1 NA 1 ...
## $ Sales_Lev : Ord.factor w/ 2 levels "Low"<"High": 2 2 2 2 1 1 2 1 2 1 ...
Abstract(cs2) # dataframe overview (DescTools)
## ------------------------------------------------------------------------------
## cs2
##
## data frame: 403 obs. of 11 variables
## 396 complete cases (98.3%)
##
## Nr ColName Class NAs Levels
## 1 CompPrice integer .
## 2 Income integer 2 (0.5%)
## 3 Advertising integer 1 (0.2%)
## 4 Population integer 1 (0.2%)
## 5 Price integer .
## 6 ShelveLoc ordered, factor 2 (0.5%) (3): 1-Bad, 2-Medium,
## 3-Good
## 7 Age integer .
## 8 Education integer 1 (0.2%)
## 9 Urban factor . (2): 1-No, 2-Yes
## 10 US factor 1 (0.2%) (2): 1-No, 2-Yes
## 11 Sales_Lev ordered, factor . (2): 1-Low, 2-High
summary(cs2) # summary
## CompPrice Income Advertising Population
## Min. : 77 Min. : 21.00 Min. : 0.000 Min. : 10.0
## 1st Qu.:115 1st Qu.: 43.00 1st Qu.: 0.000 1st Qu.:139.2
## Median :125 Median : 69.00 Median : 5.000 Median :274.0
## Mean :125 Mean : 68.71 Mean : 6.841 Mean :264.7
## 3rd Qu.:135 3rd Qu.: 91.00 3rd Qu.:12.000 3rd Qu.:397.5
## Max. :175 Max. :120.00 Max. :75.000 Max. :509.0
## NA's :2 NA's :1 NA's :1
## Price ShelveLoc Age Education Urban
## Min. : 24.0 Bad : 96 Min. :25.00 Min. :10.00 No :118
## 1st Qu.:100.0 Medium:220 1st Qu.:40.00 1st Qu.:12.00 Yes:285
## Median :118.0 Good : 85 Median :54.00 Median :14.00
## Mean :115.8 NA's : 2 Mean :53.32 Mean :13.89
## 3rd Qu.:131.0 3rd Qu.:66.00 3rd Qu.:16.00
## Max. :191.0 Max. :80.00 Max. :18.00
## NA's :1
## US Sales_Lev
## No :143 Low :237
## Yes :259 High:166
## NA's: 1
##
##
##
##
We use the duplicated() function to identify duplicate observations (the first matching row will be considered the unique observation and all others after are ‘duplicated’). The duplicated() function returns a boolean vector indicating if each row is duplicated (TRUE) or unique (FALSE). We use the output to create a subset (using indexing) of the duplicated observations.
cs2[duplicated(cs2), ]
## CompPrice Income Advertising Population Price ShelveLoc Age Education Urban
## 4 138 73 11 276 120 Bad 42 17 Yes
## US Sales_Lev
## 4 Yes High
We can remove duplicates by keeping only those rows that are not duplicated (!duplicated()).
cs2 <- cs2[!duplicated(cs2), ]
Missing numeric variables are a special type of value in R, NA. By default, missing values in character variables in a dataframe will be "". The na.strings argument of read.csv() was used when importing the cs2 data to convert missing character values to NA values. Using the PlotMiss() function in the DescTools package, we can visualize the missing values in our data by Variable. The observation number is on the x-axis and the variable names are on the y-axis. This plot can help us to identify any variables that have a high percentage of missing values. These variables are candidates for elimination prior to analysis.
PlotMiss(x = cs2,
main = "Missing Values by Variable")
Complete cases are rows that do not contain any missing (NA) values. We can view rows without missing values using the complete.cases() function and we can view rows with missing values (not complete cases) using !complete.cases().
cs2[!complete.cases(cs2), ]
## CompPrice Income Advertising Population Price ShelveLoc Age Education Urban
## 9 136 81 15 425 120 Good 67 10 Yes
## 19 120 NA 10 300 122 Medium 62 12 Yes
## 34 125 94 NA 447 89 Good 30 12 Yes
## 44 119 98 0 18 126 <NA> 73 17 No
## 46 77 NA 0 25 24 Medium 50 18 Yes
## 225 124 44 0 NA 107 Medium 80 11 Yes
## 256 133 97 0 70 117 <NA> 32 NA Yes
## US Sales_Lev
## 9 <NA> High
## 19 No Low
## 34 No High
## 44 No Low
## 46 No High
## 225 No Low
## 256 No High
We can save the row names of the observations with missing values as a vector.
na_rows <- rownames(cs2)[!complete.cases(cs2)]
To remove any observations that contain NA values, the na.omit() function can be used. We create a new dataframe, cs2_noNA, which is a subset of the cs2 dataframe, which omits rows with NA values.
cs2_noNA <- na.omit(cs2)
summary(cs2_noNA)
## CompPrice Income Advertising Population
## Min. : 85.0 Min. : 21.00 Min. : 0.000 Min. : 10.0
## 1st Qu.:115.0 1st Qu.: 42.00 1st Qu.: 0.000 1st Qu.:141.0
## Median :125.0 Median : 69.00 Median : 5.000 Median :272.0
## Mean :125.1 Mean : 68.52 Mean : 6.871 Mean :265.5
## 3rd Qu.:135.0 3rd Qu.: 90.00 3rd Qu.:12.000 3rd Qu.:397.0
## Max. :175.0 Max. :120.00 Max. :75.000 Max. :509.0
## Price ShelveLoc Age Education Urban
## Min. : 49.0 Bad : 95 Min. :25.00 Min. :10.00 No :117
## 1st Qu.:100.0 Medium:217 1st Qu.:40.00 1st Qu.:12.00 Yes:278
## Median :118.0 Good : 83 Median :54.00 Median :14.00
## Mean :116.1 Mean :53.29 Mean :13.89
## 3rd Qu.:131.0 3rd Qu.:65.50 3rd Qu.:16.00
## Max. :191.0 Max. :80.00 Max. :18.00
## US Sales_Lev
## No :137 Low :234
## Yes:258 High:161
##
##
##
##
Rather than removing any rows with missing values, we can perform imputation and replace the missing values. Typically, a measure of central tendency such as the mean, median or mode is used for replacement.
First, we will create a copy of the cs2 dataframe, cs2_copy, which we will perform the imputations on so that before and after imputation can be compared.
cs2_copy <- cs2
The mean can be used to replace missing numerical values on a per-column basis. We use the mean() function to obtain the average value for the variable and we set na.rm = TRUE to ignore the NA values when computing the mean. We assign the result of the mean() function to the observations where the Income variable is equal to NA.
cs2_copy$Income[is.na(cs2_copy$Income)] <- mean(cs2_copy$Income,
na.rm = TRUE)
We can compare the variable in the original cs2 dataframe and the imputed variable in the cs2_copy dataframe.
cbind(cs2[na_rows, "Income"], cs2_copy[na_rows, "Income"])
## [,1] [,2]
## [1,] 81 81.0000
## [2,] NA 68.6975
## [3,] 94 94.0000
## [4,] 98 98.0000
## [5,] NA 68.6975
## [6,] 44 44.0000
## [7,] 97 97.0000
For categorical (or discrete numerical with few unique values) variables, the most frequent value, the mode, can be used to replace missing values on a per-variable basis or a new category representing missingness can be added. If the variable has more than one mode, this may not be the best imputation method. The custom function modefun() can be created and used to identify the most frequent unique value of a variable.
modefun <- function(x){
if(any(tabulate(match(x, unique(x))) > 1)){
outp <- unique(x)[which(tabulate(match(x, unique(x))) == max(tabulate(match(x, unique(x)))))]
} else {
outp <- "No mode exists"}
return(outp)
}
cs2_copy$ShelveLoc[is.na(cs2_copy$ShelveLoc)] <- modefun(x = cs2_copy$ShelveLoc)
cs2_copy$US[is.na(cs2_copy$US)] <- modefun(x = cs2_copy$US)
We can compare the variable before and after mode imputation using the cbind() function.
cbind(cs2[na_rows, c("US", "ShelveLoc")],
cs2_copy[na_rows, c("US", "ShelveLoc")])
## US ShelveLoc US ShelveLoc
## 9 <NA> Good Yes Good
## 19 No Medium No Medium
## 34 No Good No Good
## 44 No <NA> No Medium
## 46 No Medium No Medium
## 225 No Medium No Medium
## 256 No <NA> No Medium
We can use the preProcess() function to perform imputation to more than one numeric variable.
We set method = medianImpute in the preProcess() function to perform median imputation on all of the numeric columns in the dataframe identified in the x argument.
pp <- preProcess(x = cs2_copy,
method = "medianImpute")
To apply the imputation to the data, we use the predict() function. We overwrite the existing cs2_copy dataframe with the newly created dataframe including median imputed values.
cs2_copy <- predict(object = pp,
newdata = cs2_copy)
We can view the data before and after applying the median imputation using the cbind() function.
cbind(cs2[na_rows, c("Advertising", "Population", "Education")],
cs2_copy[na_rows, c("Advertising", "Population", "Education")])
## Advertising Population Education Advertising Population Education
## 9 15 425 10 15 425 10
## 19 10 300 12 10 300 12
## 34 NA 447 12 5 447 12
## 44 0 18 17 0 18 17
## 46 0 25 18 0 25 18
## 225 0 NA 11 0 272 11
## 256 0 70 NA 0 70 14
Two popular methods to identify outliers are using Box Plots or Z-Scores. In some cases, we should consider removing outliers.
We can use the Outlier() function from the DescTools package to identify outlier values as those values that extend beyond the whiskers of the plot.
Outlier(x = cs2_copy$CompPrice, method = "boxplot")
## [1] 77 175
We can set value = FALSE to identify the row index number of the outliers (instead of outlier values) and use indexing to view the observations.
cs2[Outlier(x = cs2_copy$CompPrice,
method = "boxplot", value = FALSE),]
## CompPrice Income Advertising Population Price ShelveLoc Age Education Urban
## 46 77 NA 0 25 24 Medium 50 18 Yes
## 314 175 65 29 419 166 Medium 53 12 Yes
## US Sales_Lev
## 46 No High
## 314 Yes High
Z-Scores are produced by standardizing a numeric variable. Those Z-Scores that have a larger absolute value than 3 are typically considered to be outliers. We can use the scale() function, which by default centers and scales the variable, to obtain the Z-Scores. We can use conditional indexing to view those rows where the absolute value of the Z-Score for the variable is greater than 3.
cs2_copy[abs(scale(cs2_copy$CompPrice)) > 3, ]
## CompPrice Income Advertising Population Price ShelveLoc Age Education
## 46 77 68.6975 0 25 24 Medium 50 18
## 314 175 65.0000 29 419 166 Medium 53 12
## Urban US Sales_Lev
## 46 Yes No High
## 314 Yes Yes High