This lesson introduces some ways to deal with missing values. It’s important that missing values are either removed or filled in with imputed values so that algorithms do not throw errors.
Load the dplyr and magrittr packages.
library(dplyr)
library(magrittr)
Make sure that this file and the jan17Items.csv file are in the same folder and that the working directory is set to that folder.
Read in the jan17Items data as j17i.
j17i <- read.csv('jan17Items.csv')
NA and NaN values are treated differently than numeric values. For
instance, when using the sum()
function, you have to
explicitly indicate that you want to remove NAs from the computation or
else it will cause an error.
v1 <- c(1,NA,3,9,15,NA)
sum(v1, na.rm = T)
[1] 28
Testing if a value is equal to NA is different from testing if a
value is equal to a numeric or string value. You must use is.na() rather
than ==
.
4 == NA # Returns NA
[1] NA
is.na(4) # Returns FALSE.
[1] FALSE
When exploring a new dataset, it’s worthwhile to identify the pattern
of missing values. The summary()
includes the number of
missing values for each column along with the summary statistics.
summary(j17i)
Time OperationType BarCode CashierName
Length:8899 Length:8899 Length:8899 Length:8899
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
LineItem Department Category CardholderName
Length:8899 Length:8899 Length:8899 Length:8899
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
RegisterName StoreNumber TransactionNumber CustomerCode
Length:8899 Length:8899 Length:8899 Length:8899
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
Cost Price Quantity Modifiers
Min. :-189.0400 Min. :-322.810 Min. : 1.000 Min. :-2.5200
1st Qu.: 0.1100 1st Qu.: 4.500 1st Qu.: 1.000 1st Qu.: 0.0100
Median : 0.1100 Median : 12.020 Median : 1.000 Median : 0.0100
Mean : 0.2108 Mean : 9.612 Mean : 1.177 Mean : 0.6548
3rd Qu.: 0.1100 3rd Qu.: 14.680 3rd Qu.: 1.000 3rd Qu.: 1.0800
Max. : 189.0300 Max. : 24.630 Max. :36.000 Max. :57.5500
NA's :562
Subtotal Discounts NetTotal Tax
Min. :-322.80 Min. : -0.0300 Min. :-322.77 Min. :-25.340
1st Qu.: 5.24 1st Qu.: -0.0300 1st Qu.: 4.87 1st Qu.: 0.410
Median : 13.10 Median : -0.0300 Median : 13.13 Median : 1.040
Mean : 15.10 Mean : 0.7626 Mean : 14.34 Mean : 1.131
3rd Qu.: 15.76 3rd Qu.: -0.0300 3rd Qu.: 15.79 3rd Qu.: 1.230
Max. :3328.12 Max. :951.8300 Max. :3328.15 Max. :261.260
NA's :341
TotalDue
Min. :-348.11
1st Qu.: 5.68
Median : 14.17
Mean : 15.88
3rd Qu.: 17.02
Max. :3589.41
NA's :341
Notice that the Price, Tax, and TotalDue columns have missing values. The others numeric columns do not. Interesting that the number of missing values for Tax and TotalDue are the same. A visual inspection confirms that these are associated with the same observations.
If a column has a large percentage of missing values, then you may
want to consider dropping that column all together. However, if you are
really interested in the effect of that column, then a better idea is to
remove the observations that have missing values. To indicate that you
want to keep observations for which the value is NOT NA, then use the
exclamation point to negate the is.na()
function.
j17i_2 <- j17i %>%
filter(!is.na(Price))
summary(j17i_2)
Time OperationType BarCode CashierName
Length:8337 Length:8337 Length:8337 Length:8337
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
LineItem Department Category CardholderName
Length:8337 Length:8337 Length:8337 Length:8337
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
RegisterName StoreNumber TransactionNumber CustomerCode
Length:8337 Length:8337 Length:8337 Length:8337
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
Cost Price Quantity Modifiers
Min. :-189.04000 Min. :-322.810 Min. : 1.000 Min. :-2.5200
1st Qu.: 0.11000 1st Qu.: 4.500 1st Qu.: 1.000 1st Qu.: 0.0100
Median : 0.11000 Median : 12.020 Median : 1.000 Median : 0.0100
Mean : 0.04306 Mean : 9.612 Mean : 1.186 Mean : 0.6982
3rd Qu.: 0.11000 3rd Qu.: 14.680 3rd Qu.: 1.000 3rd Qu.: 1.0800
Max. : 7.64000 Max. : 24.630 Max. :36.000 Max. :57.5500
Subtotal Discounts NetTotal Tax
Min. :-322.80 Min. : -0.0300 Min. :-322.77 Min. :-25.3400
1st Qu.: 4.84 1st Qu.: -0.0300 1st Qu.: 4.54 1st Qu.: 0.4000
Median : 12.61 Median : -0.0300 Median : 12.06 Median : 1.0000
Mean : 11.16 Mean : 0.3395 Mean : 10.82 Mean : 0.8769
3rd Qu.: 15.64 3rd Qu.: -0.0300 3rd Qu.: 15.50 3rd Qu.: 1.2200
Max. : 496.99 Max. :396.1100 Max. : 212.35 Max. : 16.6100
NA's :328
TotalDue
Min. :-348.11
1st Qu.: 5.50
Median : 13.70
Mean : 12.07
3rd Qu.: 16.89
Max. : 228.96
NA's :328
Notice that by removing the observations with missing values for Price must have included some observations for which Tax and TotalDue were missing because there are only 328 missing values instead of 341.
I would give a quick visual inspection to the observations that are missing values for Tax to see if I can identify a pattern. If the pattern isn’t meaningful, then I would export the transactions and go over them with the cashier or manager.
missingPrice <- j17i %>%
filter(is.na(Price))
There are also functions in other packages that help you visualize the pattern of missing values to help you get an idea of the pattern of missing values.
If you don’t want to lose the other information from the incomplete
observation, then you can fill in the missing values with the mean or
median value from that column. To do that, you can use the
ifelse()
function, which is identical to the
=IF()
function in Excel. We’ll do that within the mutate
function, and then calculate the value for TotalDue.
?ifelse
j17i_3 <- j17i %>%
mutate(
Tax = ifelse(is.na(Tax), mean(Tax, na.rm = T), Tax) # Could use median() instead of mean() if there are outliers
, TotalDue = NetTotal + Tax
)
summary(j17i_3)
Time OperationType BarCode CashierName
Length:8899 Length:8899 Length:8899 Length:8899
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
LineItem Department Category CardholderName
Length:8899 Length:8899 Length:8899 Length:8899
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
RegisterName StoreNumber TransactionNumber CustomerCode
Length:8899 Length:8899 Length:8899 Length:8899
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
Cost Price Quantity Modifiers
Min. :-189.0400 Min. :-322.810 Min. : 1.000 Min. :-2.5200
1st Qu.: 0.1100 1st Qu.: 4.500 1st Qu.: 1.000 1st Qu.: 0.0100
Median : 0.1100 Median : 12.020 Median : 1.000 Median : 0.0100
Mean : 0.2108 Mean : 9.612 Mean : 1.177 Mean : 0.6548
3rd Qu.: 0.1100 3rd Qu.: 14.680 3rd Qu.: 1.000 3rd Qu.: 1.0800
Max. : 189.0300 Max. : 24.630 Max. :36.000 Max. :57.5500
NA's :562
Subtotal Discounts NetTotal Tax
Min. :-322.80 Min. : -0.0300 Min. :-322.77 Min. :-25.340
1st Qu.: 5.24 1st Qu.: -0.0300 1st Qu.: 4.87 1st Qu.: 0.460
Median : 13.10 Median : -0.0300 Median : 13.13 Median : 1.040
Mean : 15.10 Mean : 0.7626 Mean : 14.34 Mean : 1.131
3rd Qu.: 15.76 3rd Qu.: -0.0300 3rd Qu.: 15.79 3rd Qu.: 1.230
Max. :3328.12 Max. :951.8300 Max. :3328.15 Max. :261.260
TotalDue
Min. :-348.11
1st Qu.: 5.25
Median : 14.17
Mean : 15.47
3rd Qu.: 17.02
Max. :3589.41
The ifelse
function is powerful, and you can use it in
many other ways. Conditional statements will be covered elsewhere in
more depth.
There are other ways for dealing with missing values. A model-based approach fills in missing values based on values from one or more other columns.
You can see that the easiest approach for dealing with missing values is to remove the columns or the observations; however, you may be throwing the baby out with the bathwater, especially with small datasets. Ultimately, you’ll have to use judgment for dealing with missing values. The important thing is to fully disclose what you do.
The best approach is to perform your analyses using both approaches and then identify if the results change based on what you do. If the results of your do not qualitatively change, then it’s not a big deal. If the results are different, then think carefully about which approach is best and justify the approach that you think is best.