Analysis of data is a process of inspecting, cleaning, transforming, and modeling data with the goal of highlighting useful information, suggesting conclusions and supporting decision making.
Early in a project we often spend hours handling problems with missing values, logical inconsistencies, or outliers in our datasets. In this tutorial we will go through the most popular techniques in data cleansing.
We will be working with a messy version of the iris dataset.
Originally published at UCI Machine Learning Repository: Iris Data Set,
this small dataset from 1936 is often used for testing out machine
learning algorithms and visualizations. Each row of the table represents
an iris flower, including its species and dimensions of its botanical
parts, sepal and petal, in centimeters.
Take a look at this dataset here:
mydata
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 6.4 3.2 4.500 1.5 versicolor
## 2 6.3 3.3 6.000 2.5 virginica
## 3 6.2 NA 5.400 2.3 virginica
## 4 5.0 3.4 1.600 0.4 setosa
## 5 5.7 2.6 3.500 1.0 versicolor
## 6 5.3 NA NA 0.2 setosa
## 7 6.4 2.7 5.300 NA virginica
## 8 5.9 3.0 5.100 1.8 virginica
## 9 5.8 2.7 4.100 1.0 versicolor
## 10 4.8 3.1 1.600 0.2 setosa
## 11 5.0 3.5 1.600 0.6 setosa
## 12 6.0 2.7 5.100 1.6 versicolor
## 13 6.0 3.0 4.800 NA virginica
## 14 6.8 2.8 4.800 1.4 versicolor
## 15 NA 3.9 1.700 0.4 setosa
## 16 5.0 -3.0 3.500 1.0 versicolor
## 17 5.5 NA 4.000 1.3 versicolor
## 18 4.7 3.2 1.300 0.2 setosa
## 19 NA 4.0 NA 0.2 setosa
## 20 5.6 NA 4.200 1.3 versicolor
## 21 4.9 3.6 NA 0.1 setosa
## 22 5.4 NA 4.500 1.5 versicolor
## 23 6.2 2.8 NA 1.8 virginica
## 24 6.7 3.3 5.700 2.5 virginica
## 25 NA 3.0 5.900 2.1 virginica
## 26 4.6 3.2 1.400 0.2 setosa
## 27 4.9 3.1 1.500 0.1 setosa
## 28 73.0 29.0 63.000 NA virginica
## 29 6.5 3.2 5.100 2.0 virginica
## 30 NA 2.8 0.820 1.3 versicolor
## 31 4.4 3.2 NA 0.2 setosa
## 32 5.9 3.2 4.800 NA versicolor
## 33 5.7 2.8 4.500 1.3 versicolor
## 34 6.2 2.9 NA 1.3 versicolor
## 35 6.6 2.9 23.000 1.3 versicolor
## 36 4.8 3.0 1.400 0.1 setosa
## 37 6.5 3.0 5.500 1.8 virginica
## 38 6.2 2.2 4.500 1.5 versicolor
## 39 6.7 2.5 5.800 1.8 virginica
## 40 5.0 3.0 1.600 0.2 setosa
## 41 5.0 NA 1.200 0.2 setosa
## 42 5.8 2.7 3.900 1.2 versicolor
## 43 0.0 NA 1.300 0.4 setosa
## 44 5.8 2.7 5.100 1.9 virginica
## 45 5.5 4.2 1.400 0.2 setosa
## 46 7.7 2.8 6.700 2.0 virginica
## 47 5.7 NA NA 0.4 setosa
## 48 7.0 3.2 4.700 1.4 versicolor
## 49 6.5 3.0 5.800 2.2 virginica
## 50 6.0 3.4 4.500 1.6 versicolor
## 51 5.5 2.6 4.400 1.2 versicolor
## 52 4.9 3.1 NA 0.2 setosa
## 53 5.2 2.7 3.900 1.4 versicolor
## 54 4.8 3.4 1.600 0.2 setosa
## 55 6.3 3.3 4.700 1.6 versicolor
## 56 7.7 3.8 6.700 2.2 virginica
## 57 5.1 3.8 1.500 0.3 setosa
## 58 NA 2.9 4.500 1.5 versicolor
## 59 6.4 2.8 5.600 NA virginica
## 60 6.4 2.8 5.600 2.1 virginica
## 61 5.0 2.3 3.300 NA versicolor
## 62 7.4 2.8 6.100 1.9 virginica
## 63 4.3 3.0 1.100 0.1 setosa
## 64 5.0 3.3 1.400 0.2 setosa
## 65 7.2 3.0 5.800 1.6 virginica
## 66 6.3 2.5 4.900 1.5 versicolor
## 67 5.1 2.5 NA 1.1 versicolor
## 68 NA 3.2 5.700 2.3 virginica
## 69 5.1 3.5 NA NA setosa
## 70 5.0 3.5 1.300 0.3 setosa
## 71 6.1 3.0 4.600 1.4 versicolor
## 72 6.9 3.1 5.100 2.3 virginica
## 73 5.1 3.5 1.400 0.3 setosa
## 74 6.5 NA 4.600 1.5 versicolor
## 75 5.6 2.8 4.900 2.0 virginica
## 76 4.9 2.5 4.500 NA virginica
## 77 5.5 3.5 1.300 0.2 setosa
## 78 7.6 3.0 6.600 2.1 virginica
## 79 5.1 3.8 0.000 0.2 setosa
## 80 7.9 3.8 6.400 2.0 virginica
## 81 6.1 2.6 5.600 1.4 virginica
## 82 5.4 3.4 1.700 0.2 setosa
## 83 6.1 2.9 4.700 1.4 versicolor
## 84 5.4 3.7 1.500 0.2 setosa
## 85 6.7 3.0 5.200 2.3 virginica
## 86 5.1 3.8 1.900 Inf setosa
## 87 6.4 2.9 4.300 1.3 versicolor
## 88 5.7 2.9 4.200 1.3 versicolor
## 89 4.4 2.9 1.400 0.2 setosa
## 90 6.3 2.5 5.000 1.9 virginica
## 91 7.2 3.2 6.000 1.8 virginica
## 92 4.9 NA 3.300 1.0 versicolor
## 93 5.2 3.4 1.400 0.2 setosa
## 94 5.8 2.7 5.100 1.9 virginica
## 95 6.0 2.2 5.000 1.5 virginica
## 96 6.9 3.1 NA 1.5 versicolor
## 97 5.5 2.3 4.000 1.3 versicolor
## 98 6.7 NA 5.000 1.7 versicolor
## 99 5.7 3.0 4.200 1.2 versicolor
## 100 6.3 2.8 5.100 1.5 virginica
## 101 5.4 3.4 1.500 0.4 setosa
## 102 7.2 3.6 NA 2.5 virginica
## 103 6.3 2.7 4.900 NA virginica
## 104 5.6 3.0 4.100 1.3 versicolor
## 105 5.1 3.7 NA 0.4 setosa
## 106 5.5 NA 0.925 1.0 versicolor
## 107 6.5 3.0 5.200 2.0 virginica
## 108 4.8 3.0 1.400 NA setosa
## 109 6.1 2.8 NA 1.3 versicolor
## 110 4.6 3.4 1.400 0.3 setosa
## 111 6.3 3.4 NA 2.4 virginica
## 112 5.0 3.4 1.500 0.2 setosa
## 113 5.1 3.4 1.500 0.2 setosa
## 114 NA 3.3 5.700 2.1 virginica
## 115 6.7 3.1 4.700 1.5 versicolor
## 116 7.7 2.6 6.900 2.3 virginica
## 117 6.3 NA 4.400 1.3 versicolor
## 118 4.6 3.1 1.500 0.2 setosa
## 119 NA 3.0 5.500 2.1 virginica
## 120 NA 2.8 4.700 1.2 versicolor
## 121 5.9 3.0 NA 1.5 versicolor
## 122 4.5 2.3 1.300 0.3 setosa
## 123 6.4 3.2 5.300 2.3 virginica
## 124 5.2 4.1 1.500 0.1 setosa
## 125 49.0 30.0 14.000 2.0 setosa
## 126 5.6 2.9 3.600 1.3 versicolor
## 127 6.8 3.2 5.900 2.3 virginica
## 128 5.8 NA 5.100 2.4 virginica
## 129 4.6 3.6 NA 0.2 setosa
## 130 5.7 0.0 1.700 0.3 setosa
## 131 5.6 2.5 3.900 1.1 versicolor
## 132 6.7 3.1 4.400 1.4 versicolor
## 133 4.8 NA 1.900 0.2 setosa
## 134 5.1 3.3 1.700 0.5 setosa
## 135 4.4 3.0 1.300 NA setosa
## 136 7.7 3.0 NA 2.3 virginica
## 137 4.7 3.2 1.600 0.2 setosa
## 138 NA 3.0 4.900 1.8 virginica
## 139 6.9 3.1 5.400 2.1 virginica
## 140 6.0 2.2 4.000 1.0 versicolor
## 141 5.0 NA 1.400 0.2 setosa
## 142 5.5 NA 3.800 1.1 versicolor
## 143 6.6 3.0 4.400 1.4 versicolor
## 144 6.3 2.9 5.600 1.8 virginica
## 145 5.7 2.5 5.000 2.0 virginica
## 146 6.7 3.1 5.600 2.4 virginica
## 147 5.6 3.0 4.500 1.5 versicolor
## 148 5.2 3.5 1.500 0.2 setosa
## 149 6.4 3.1 NA 1.8 virginica
## 150 5.8 2.6 4.000 NA versicolor
In all of our data analyses so far we implicitly assumed that we don’t have any missing values in our data. In practice, that is often not the case. While some statistical and machine learning methods work with missing data, many commonly used methods can’t, so it is important to learn how to deal with missing values.
If we judge by the imputation or data removing methods that are most commonly used in practice, we might conclude that missing data is a relatively simple problem that is secondary to the inference, predictive modelling, etc. that are the primary goal of our analysis. Unfortunately, that is not the case. Dealing with missing values is very challenging, often in itself a modelling problem.
The choice of an appropriate method is inseparable from our understanding of or assumptions about the process that generated the missing values (the missingness mechanism). Based on the characteristics of this process we typically characterize the missing data problem as one of these three cases:
MCAR (Missing Completely At Random): Whether or not a value is missing is independent of both the observed values and the missing (unobserved) values. For example, if we had temperature measuring devices at different locations and they occasionally, and at random intervals, stopped working. Or, in surveys, where respondents fail to respond with a certain probability, independent of the characteristics that we are surveying.
MAR (Missing At Random): Whether or not a value is missing is independent of the missing (unobserved) values but depends on the observed values. That is, there is a pattern to how the values are missing, but we could fully explain that pattern given only the observed data. For example, if our temperature measuring devices stopped working more often in certain locations than in others. Or, in surveys, if women are less likely to report their weight than men.
MNAR (Missing Not At Random): Whether or not a value is missing also depends on the missing (unobserved) values in a way that can’t be explained by the observed values. That is, there is a pattern to how the values are missing, but we wouldn’t be able to fully explain it without observing the values that are missing. For example, if our temperature measuring device had a tendency to stop working when the temperature is very low. Or, in surveys, if a person was less likely to report their salary if their salary was high.
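To make the three mechanisms concrete, here is a minimal, self-contained simulation sketch. The variable names, probabilities, and the temperature scenario are made up for illustration and are not part of the iris data:
set.seed(1)
n <- 1000
location <- sample(c("A", "B"), n, replace = TRUE)              # observed covariate
temp <- rnorm(n, mean = ifelse(location == "A", 5, 15), sd = 3)
# MCAR: missingness ignores both observed and unobserved values
miss_mcar <- runif(n) < 0.1
# MAR: missingness depends only on the observed covariate (location)
miss_mar <- runif(n) < ifelse(location == "A", 0.2, 0.05)
# MNAR: missingness depends on the (unobserved) temperature itself
miss_mnar <- runif(n) < plogis(-temp)                           # low temperatures go missing more often
temp_mcar <- replace(temp, miss_mcar, NA)
temp_mar  <- replace(temp, miss_mar,  NA)
temp_mnar <- replace(temp, miss_mnar, NA)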
Every variable in our data might have a different missingness mechanism. So, how do we determine whether it is MCAR, MAR, or MNAR?
Showing with a reasonable degree of certainty that the mechanism is not MCAR is equivalent to showing that the missingness (whether or not a value is missing) can be predicted from observed values. That is, it is a prediction problem and it is sufficient to show one way that missingness can be predicted. On the other hand, it is infeasible to show that the mechanism is MCAR, because that would require us to show that there is no way of predicting missingness from observed values. We can, however, rule out certain kinds of dependency (for example, linear dependency).
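For example, in our iris data we could try to predict whether Sepal.Width is missing from the other observed columns. A minimal sketch using logistic regression (only one of many possible models, and one that can only rule out the kinds of dependency it can represent):
miss_sw <- as.integer(is.na(mydata$Sepal.Width))
# rows with missing predictors are dropped by glm's default na.action (na.omit)
fit <- glm(miss_sw ~ Sepal.Length + Petal.Length + Species,
           data = mydata, family = binomial)
summary(fit)   # clearly non-zero coefficients would be evidence against MCAR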
For MNAR, the situation is even worse. In general, it is impossible to determine from the data the relationship between missingness and the value that is missing, because we don’t know what is missing. That is, unless we are able to somehow measure the values that are missing, we won’t be able to determine whether or not the missingness regime is MNAR. Getting our hands on the missing values, however, is in most cases impossible or infeasible.
To summarize, we’ll often be able to show that our missingness regime is not MCAR and never that it is MCAR. Subsequently, we’ll often know that the missingness regime is at least MAR, but we’ll rarely be able to determine whether it is MAR or MNAR, unless we can get our hands on the missing data. Therefore, it becomes very important to utilize not only data but also domain-specific background knowledge, when applicable. In particular, known relationships between the variables in our data and what caused the values to be missing.
Understanding the cause for missing data can often help us identify the missingness mechanism and avoid introducing a bias. In general, we can split the causes into two classes: intentional or unintentional.
Intentionally or systematically missing data are missing by design. For example:
Unintentionally missing data were not planned. For example:
and other reasons that were not planned by, and are outside the control of, the data collector.
There is no general rule and further conclusions can only be made on a case-by-case basis, using domain-specific knowledge. Measurement error can range from completely random to completely not-at-random, such as a temperature sensor breaking down at high temperatures. Subjects typically do not drop out of studies at random, but that is also a possibility. When not all measurements are made in order to reduce cost, they are often omitted completely at random, but sometimes a different design is used.
A missing value, represented by NA in R, is a placeholder for a datum whose type is known but whose value isn't. Therefore, it is generally impossible to perform statistical analysis on data in which one or more values are missing. One may choose either to omit elements that contain missing values from a dataset or to impute a value, but missingness is something to be dealt with prior to any analysis.
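A quick illustration of how NA propagates through computations and why it has to be dealt with explicitly:
x <- c(1.5, NA, 2.1)
mean(x)                # NA: the result is unknown because one input is unknown
mean(x, na.rm = TRUE)  # 1.8: computed after dropping the missing value
x[!is.na(x)]           # explicit removal of the missing element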
Can you see that many values in our dataset have the status NA (Not Available)? Count (or plot) how many (and what percentage) of all 150 rows are complete.
sum(complete.cases(mydata))
## [1] 96
nrow(mydata[complete.cases(mydata), ])/nrow(mydata)*100
## [1] 64
Does the data contain other special values? If it does, replace them with NA.
# Flag values that are NA, or (for numeric columns) NaN, Inf, or -Inf
is.special <- function(x){
  if (is.numeric(x)) !is.finite(x) else is.na(x)
}
sapply(mydata, is.special)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## [1,] FALSE FALSE FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE FALSE FALSE
## [3,] FALSE TRUE FALSE FALSE FALSE
## [4,] FALSE FALSE FALSE FALSE FALSE
## [5,] FALSE FALSE FALSE FALSE FALSE
## [6,] FALSE TRUE TRUE FALSE FALSE
## [7,] FALSE FALSE FALSE TRUE FALSE
## [8,] FALSE FALSE FALSE FALSE FALSE
## [9,] FALSE FALSE FALSE FALSE FALSE
## [10,] FALSE FALSE FALSE FALSE FALSE
## [11,] FALSE FALSE FALSE FALSE FALSE
## [12,] FALSE FALSE FALSE FALSE FALSE
## [13,] FALSE FALSE FALSE TRUE FALSE
## [14,] FALSE FALSE FALSE FALSE FALSE
## [15,] TRUE FALSE FALSE FALSE FALSE
## [16,] FALSE FALSE FALSE FALSE FALSE
## [17,] FALSE TRUE FALSE FALSE FALSE
## [18,] FALSE FALSE FALSE FALSE FALSE
## [19,] TRUE FALSE TRUE FALSE FALSE
## [20,] FALSE TRUE FALSE FALSE FALSE
## [21,] FALSE FALSE TRUE FALSE FALSE
## [22,] FALSE TRUE FALSE FALSE FALSE
## [23,] FALSE FALSE TRUE FALSE FALSE
## [24,] FALSE FALSE FALSE FALSE FALSE
## [25,] TRUE FALSE FALSE FALSE FALSE
## [26,] FALSE FALSE FALSE FALSE FALSE
## [27,] FALSE FALSE FALSE FALSE FALSE
## [28,] FALSE FALSE FALSE TRUE FALSE
## [29,] FALSE FALSE FALSE FALSE FALSE
## [30,] TRUE FALSE FALSE FALSE FALSE
## [31,] FALSE FALSE TRUE FALSE FALSE
## [32,] FALSE FALSE FALSE TRUE FALSE
## [33,] FALSE FALSE FALSE FALSE FALSE
## [34,] FALSE FALSE TRUE FALSE FALSE
## [35,] FALSE FALSE FALSE FALSE FALSE
## [36,] FALSE FALSE FALSE FALSE FALSE
## [37,] FALSE FALSE FALSE FALSE FALSE
## [38,] FALSE FALSE FALSE FALSE FALSE
## [39,] FALSE FALSE FALSE FALSE FALSE
## [40,] FALSE FALSE FALSE FALSE FALSE
## [41,] FALSE TRUE FALSE FALSE FALSE
## [42,] FALSE FALSE FALSE FALSE FALSE
## [43,] FALSE TRUE FALSE FALSE FALSE
## [44,] FALSE FALSE FALSE FALSE FALSE
## [45,] FALSE FALSE FALSE FALSE FALSE
## [46,] FALSE FALSE FALSE FALSE FALSE
## [47,] FALSE TRUE TRUE FALSE FALSE
## [48,] FALSE FALSE FALSE FALSE FALSE
## [49,] FALSE FALSE FALSE FALSE FALSE
## [50,] FALSE FALSE FALSE FALSE FALSE
## [51,] FALSE FALSE FALSE FALSE FALSE
## [52,] FALSE FALSE TRUE FALSE FALSE
## [53,] FALSE FALSE FALSE FALSE FALSE
## [54,] FALSE FALSE FALSE FALSE FALSE
## [55,] FALSE FALSE FALSE FALSE FALSE
## [56,] FALSE FALSE FALSE FALSE FALSE
## [57,] FALSE FALSE FALSE FALSE FALSE
## [58,] TRUE FALSE FALSE FALSE FALSE
## [59,] FALSE FALSE FALSE TRUE FALSE
## [60,] FALSE FALSE FALSE FALSE FALSE
## [61,] FALSE FALSE FALSE TRUE FALSE
## [62,] FALSE FALSE FALSE FALSE FALSE
## [63,] FALSE FALSE FALSE FALSE FALSE
## [64,] FALSE FALSE FALSE FALSE FALSE
## [65,] FALSE FALSE FALSE FALSE FALSE
## [66,] FALSE FALSE FALSE FALSE FALSE
## [67,] FALSE FALSE TRUE FALSE FALSE
## [68,] TRUE FALSE FALSE FALSE FALSE
## [69,] FALSE FALSE TRUE TRUE FALSE
## [70,] FALSE FALSE FALSE FALSE FALSE
## [71,] FALSE FALSE FALSE FALSE FALSE
## [72,] FALSE FALSE FALSE FALSE FALSE
## [73,] FALSE FALSE FALSE FALSE FALSE
## [74,] FALSE TRUE FALSE FALSE FALSE
## [75,] FALSE FALSE FALSE FALSE FALSE
## [76,] FALSE FALSE FALSE TRUE FALSE
## [77,] FALSE FALSE FALSE FALSE FALSE
## [78,] FALSE FALSE FALSE FALSE FALSE
## [79,] FALSE FALSE FALSE FALSE FALSE
## [80,] FALSE FALSE FALSE FALSE FALSE
## [81,] FALSE FALSE FALSE FALSE FALSE
## [82,] FALSE FALSE FALSE FALSE FALSE
## [83,] FALSE FALSE FALSE FALSE FALSE
## [84,] FALSE FALSE FALSE FALSE FALSE
## [85,] FALSE FALSE FALSE FALSE FALSE
## [86,] FALSE FALSE FALSE TRUE FALSE
## [87,] FALSE FALSE FALSE FALSE FALSE
## [88,] FALSE FALSE FALSE FALSE FALSE
## [89,] FALSE FALSE FALSE FALSE FALSE
## [90,] FALSE FALSE FALSE FALSE FALSE
## [91,] FALSE FALSE FALSE FALSE FALSE
## [92,] FALSE TRUE FALSE FALSE FALSE
## [93,] FALSE FALSE FALSE FALSE FALSE
## [94,] FALSE FALSE FALSE FALSE FALSE
## [95,] FALSE FALSE FALSE FALSE FALSE
## [96,] FALSE FALSE TRUE FALSE FALSE
## [97,] FALSE FALSE FALSE FALSE FALSE
## [98,] FALSE TRUE FALSE FALSE FALSE
## [99,] FALSE FALSE FALSE FALSE FALSE
## [100,] FALSE FALSE FALSE FALSE FALSE
## [101,] FALSE FALSE FALSE FALSE FALSE
## [102,] FALSE FALSE TRUE FALSE FALSE
## [103,] FALSE FALSE FALSE TRUE FALSE
## [104,] FALSE FALSE FALSE FALSE FALSE
## [105,] FALSE FALSE TRUE FALSE FALSE
## [106,] FALSE TRUE FALSE FALSE FALSE
## [107,] FALSE FALSE FALSE FALSE FALSE
## [108,] FALSE FALSE FALSE TRUE FALSE
## [109,] FALSE FALSE TRUE FALSE FALSE
## [110,] FALSE FALSE FALSE FALSE FALSE
## [111,] FALSE FALSE TRUE FALSE FALSE
## [112,] FALSE FALSE FALSE FALSE FALSE
## [113,] FALSE FALSE FALSE FALSE FALSE
## [114,] TRUE FALSE FALSE FALSE FALSE
## [115,] FALSE FALSE FALSE FALSE FALSE
## [116,] FALSE FALSE FALSE FALSE FALSE
## [117,] FALSE TRUE FALSE FALSE FALSE
## [118,] FALSE FALSE FALSE FALSE FALSE
## [119,] TRUE FALSE FALSE FALSE FALSE
## [120,] TRUE FALSE FALSE FALSE FALSE
## [121,] FALSE FALSE TRUE FALSE FALSE
## [122,] FALSE FALSE FALSE FALSE FALSE
## [123,] FALSE FALSE FALSE FALSE FALSE
## [124,] FALSE FALSE FALSE FALSE FALSE
## [125,] FALSE FALSE FALSE FALSE FALSE
## [126,] FALSE FALSE FALSE FALSE FALSE
## [127,] FALSE FALSE FALSE FALSE FALSE
## [128,] FALSE TRUE FALSE FALSE FALSE
## [129,] FALSE FALSE TRUE FALSE FALSE
## [130,] FALSE FALSE FALSE FALSE FALSE
## [131,] FALSE FALSE FALSE FALSE FALSE
## [132,] FALSE FALSE FALSE FALSE FALSE
## [133,] FALSE TRUE FALSE FALSE FALSE
## [134,] FALSE FALSE FALSE FALSE FALSE
## [135,] FALSE FALSE FALSE TRUE FALSE
## [136,] FALSE FALSE TRUE FALSE FALSE
## [137,] FALSE FALSE FALSE FALSE FALSE
## [138,] TRUE FALSE FALSE FALSE FALSE
## [139,] FALSE FALSE FALSE FALSE FALSE
## [140,] FALSE FALSE FALSE FALSE FALSE
## [141,] FALSE TRUE FALSE FALSE FALSE
## [142,] FALSE TRUE FALSE FALSE FALSE
## [143,] FALSE FALSE FALSE FALSE FALSE
## [144,] FALSE FALSE FALSE FALSE FALSE
## [145,] FALSE FALSE FALSE FALSE FALSE
## [146,] FALSE FALSE FALSE FALSE FALSE
## [147,] FALSE FALSE FALSE FALSE FALSE
## [148,] FALSE FALSE FALSE FALSE FALSE
## [149,] FALSE FALSE TRUE FALSE FALSE
## [150,] FALSE FALSE FALSE TRUE FALSE
# Recode every special value as NA, column by column
for (n in colnames(mydata)){
  is.na(mydata[[n]]) <- is.special(mydata[[n]])
}
summary(mydata)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. : 0.000 Min. :-3.000 Min. : 0.00 Min. :0.100
## 1st Qu.: 5.100 1st Qu.: 2.800 1st Qu.: 1.60 1st Qu.:0.300
## Median : 5.750 Median : 3.000 Median : 4.50 Median :1.300
## Mean : 6.559 Mean : 3.391 Mean : 4.45 Mean :1.207
## 3rd Qu.: 6.400 3rd Qu.: 3.300 3rd Qu.: 5.10 3rd Qu.:1.800
## Max. :73.000 Max. :30.000 Max. :63.00 Max. :2.500
## NA's :10 NA's :17 NA's :19 NA's :13
## Species
## Length:150
## Class :character
## Mode :character
##
##
##
##
OMG! So hard :-) It’s better to use visualizations, tests… or automatically detect NA’s with some function…
There are a variety of different plots to explore missing data
available in the naniar package. If you would like to know more about
the philosophy of the naniar package, you should read the
vignette Getting
Started with naniar.
A key point to remember with the visualisation tools in
naniar is that there is a way to get the data behind each plot
out of the visualisation.
One of the first plots that I recommend you start with when you are
first exploring your missing data, is the vis_miss() plot,
which is re-exported from visdat.
vis_miss(airquality)
This plot provides a specific visualisation of the amount of missing
data, showing in black the location of missing values, and also
providing information on the overall percentage of missing values
(in the legend) and within each variable.
An upset plot from the UpSetR package can be used to
visualise the patterns of missingness, or rather the combinations of
missingness across cases. To see combinations of missingness and
intersections of missingness amongst variables, use the
gg_miss_upset function:
gg_miss_upset(airquality)
We can explore this with more complex data, such as riskfactors:
gg_miss_upset(riskfactors)
There are a few different ways to explore different missing data
mechanisms and relationships. One way incorporates the method of
shifting missing values so that they can be visualised on the same axes
as the regular values, and then colours the missing and not missing
points. This is implemented with geom_miss_point().
# using geom_miss_point() to show missing values on the same axes
ggplot(airquality,
aes(x = Ozone,
y = Solar.R)) +
geom_miss_point()
This plot shows the number of missing values in each variable in a
dataset. It is powered by the miss_var_summary()
function.
gg_miss_var(airquality)
gg_miss_var(airquality) + labs(y = "Look at all the missing ones")
This plot shows the number of missing values in each case. It is
powered by the miss_case_summary() function.
gg_miss_case(airquality) + labs(x = "Number of Cases")
This plot shows the number of missing values in each column, broken down by
a categorical variable from the dataset. It is powered by a
dplyr::group_by() statement followed by
miss_var_summary().
gg_miss_fct(x = riskfactors, fct = marital)
naniar provides numerical summaries of missing data that follow a consistent naming convention beginning with miss_.
miss_var_summary(airquality)
## # A tibble: 6 × 3
## variable n_miss pct_miss
## <chr> <int> <dbl>
## 1 Ozone 37 24.2
## 2 Solar.R 7 4.58
## 3 Wind 0 0
## 4 Temp 0 0
## 5 Month 0 0
## 6 Day 0 0
You can also combine this with group_by() to work out the number of missing values in each variable within each level of a grouping variable.
airquality %>%
group_by(Month) %>%
miss_var_summary()
## # A tibble: 25 × 4
## # Groups: Month [5]
## Month variable n_miss pct_miss
## <int> <chr> <int> <dbl>
## 1 5 Ozone 5 16.1
## 2 5 Solar.R 4 12.9
## 3 5 Wind 0 0
## 4 5 Temp 0 0
## 5 5 Day 0 0
## 6 6 Ozone 21 70
## 7 6 Solar.R 0 0
## 8 6 Wind 0 0
## 9 6 Temp 0 0
## 10 6 Day 0 0
## # ℹ 15 more rows
Thanks to the rstatix library we can easily detect outliers using the boxplot method. Boxplots are a popular and easy method for identifying outliers.
# Convert the variable dose from a numeric to a factor variable
tooth <- as.data.frame(ToothGrowth)
tooth$dose <- as.factor(tooth$dose)
# Change box plot line colors by groups
p <- ggplot(tooth, aes(x = dose, y = len, color = dose)) +
geom_boxplot()
p
There are two categories of outlier: (1) outliers and (2) extreme points.
Values above Q3 + 1.5×IQR or below Q1 - 1.5×IQR are considered outliers. Values above Q3 + 3×IQR or below Q1 - 3×IQR are considered extreme points (or extreme outliers).
Q1 and Q3 are the first and third quartiles, respectively, and IQR is the interquartile range (IQR = Q3 - Q1).
Generally speaking, data points labelled as outliers (but not as extreme points) in boxplots are less troublesome and might even be ignored. Note that any NA and NaN values are automatically removed before the quantiles are computed.
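As a quick cross-check of these definitions, the fences can be computed by hand. A minimal sketch on the len column of the tooth data defined above (computed over the whole column, ignoring the dose groups):
q   <- quantile(tooth$len, c(0.25, 0.75), na.rm = TRUE)
iqr <- unname(q[2] - q[1])                               # interquartile range
c(lower = q[1] - 1.5 * iqr, upper = q[2] + 1.5 * iqr)    # outlier fences
c(lower = q[1] - 3   * iqr, upper = q[2] + 3   * iqr)    # extreme-point fences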
Example
tooth %>%
group_by(dose) %>%
identify_outliers(len)
## # A tibble: 1 × 5
## dose len supp is.outlier is.extreme
## <fct> <dbl> <fct> <lgl> <lgl>
## 1 0.5 21.5 OJ TRUE FALSE
See? One outlier! (the one that was visible on the boxplot)
Consistent data are technically correct data that are fit for statistical analysis. They are data in which missing values, special values, (obvious) errors and outliers are either removed, corrected or imputed. The data are consistent with constraints based on real-world knowledge about the subject that the data describe.
We have the following background knowledge:
Species should be one of the following values: setosa, versicolor or virginica.
All measured numerical properties of an iris should be positive.
The petal length of an iris is at least 2 times its petal width.
The sepal length of an iris cannot exceed 30 cm.
The sepals of an iris are longer than its petals.
Define these rules in a separate object ‘RULE’ and read them into R using editset (package ‘editrules’). Print the resulting constraint object.
RULE <- editset(c("Sepal.Length <= 30","Species %in% c('setosa','versicolor','virginica')"
, "Sepal.Length > 0", "Sepal.Width > 0", "Petal.Length > 0", "Petal.Width > 0",
"Petal.Length >= 2 * Petal.Width", "Sepal.Length>Petal.Length"))
RULE
##
## Data model:
## dat1 : Species %in% c('setosa', 'versicolor', 'virginica')
##
## Edit set:
## num1 : Sepal.Length <= 30
## num2 : 0 < Sepal.Length
## num3 : 0 < Sepal.Width
## num4 : 0 < Petal.Length
## num5 : 0 < Petal.Width
## num6 : 2*Petal.Width <= Petal.Length
## num7 : Petal.Length < Sepal.Length
Now we are ready to determine how often each rule is broken (violatedEdits). Also we can summarize and plot the result.
summary(violatedEdits(RULE, mydata))
## Edit violations, 150 observations, 0 completely missing (0%):
##
## editname freq rel
## num6 3 2%
## num1 2 1.3%
## num3 2 1.3%
## num7 2 1.3%
## num2 1 0.7%
## num4 1 0.7%
##
## Edit violations per record:
##
## errors freq rel
## 0 90 60%
## 1 17 11.3%
## 2 13 8.7%
## 3 25 16.7%
## 4 4 2.7%
## 5 1 0.7%
What percentage of the data has no errors?
violated <- violatedEdits(RULE, mydata)
summary(violated)
## Edit violations, 150 observations, 0 completely missing (0%):
##
## editname freq rel
## num6 3 2%
## num1 2 1.3%
## num3 2 1.3%
## num7 2 1.3%
## num2 1 0.7%
## num4 1 0.7%
##
## Edit violations per record:
##
## errors freq rel
## 0 90 60%
## 1 17 11.3%
## 2 13 8.7%
## 3 25 16.7%
## 4 4 2.7%
## 5 1 0.7%
plot(violated)
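To answer the question directly, the share of error-free records can also be computed from the violation matrix. A minimal sketch, assuming violated behaves like the underlying logical matrix and treating NA comparisons as non-violations:
no_errors <- rowSums(violated, na.rm = TRUE) == 0
mean(no_errors) * 100   # should match the 60% reported above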
Find out which observations have too long sepals using the result of violatedEdits.
# solution for the exercise 1 here ;-)
Find outliers in sepal length using boxplot and boxplot.stats. Retrieve the corresponding observations and look at the other values. Any ideas what might have happened? Set the outliers to NA (or to a value that you find more appropriate).
boxplot(mydata$Sepal.Length)
outliers <- boxplot.stats(mydata$Sepal.Length)$out
outliers_idx <- which(mydata$Sepal.Length %in% outliers)
mydata[outliers_idx,]
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 28 73 29 63.0 NA virginica
## 43 0 NA 1.3 0.4 setosa
## 125 49 30 14.0 2.0 setosa
# they all seem to be too big... maybe they were measured in mm instead of cm?
mydata[outliers_idx,1:4] <- mydata[outliers_idx,1:4]/10
summary(mydata)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :0.000 Min. :-3.000 Min. : 0.000 Min. :0.040
## 1st Qu.:5.100 1st Qu.: 2.800 1st Qu.: 1.600 1st Qu.:0.300
## Median :5.700 Median : 3.000 Median : 4.400 Median :1.300
## Mean :5.775 Mean : 2.992 Mean : 3.912 Mean :1.192
## 3rd Qu.:6.400 3rd Qu.: 3.300 3rd Qu.: 5.100 3rd Qu.:1.800
## Max. :7.900 Max. : 4.200 Max. :23.000 Max. :2.500
## NA's :10 NA's :17 NA's :19 NA's :13
## Species
## Length:150
## Class :character
## Mode :character
##
##
##
##
Note that a simple boxplot per species reveals an extra outlier!
boxplot(Sepal.Length ~ Species, data=mydata)
Replace non-positive values of Sepal.Width with NA using correctWithRules from the library 'deducorrect'.
cr <- correctionRules(expression(
if (!is.na(Sepal.Width) && Sepal.Width <=0 ) Sepal.Width = NA
))
correctWithRules(cr, mydata)
## $corrected
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 6.4 3.2 4.500 1.50 versicolor
## 2 6.3 3.3 6.000 2.50 virginica
## 3 6.2 NA 5.400 2.30 virginica
## 4 5.0 3.4 1.600 0.40 setosa
## 5 5.7 2.6 3.500 1.00 versicolor
## 6 5.3 NA NA 0.20 setosa
## 7 6.4 2.7 5.300 NA virginica
## 8 5.9 3.0 5.100 1.80 virginica
## 9 5.8 2.7 4.100 1.00 versicolor
## 10 4.8 3.1 1.600 0.20 setosa
## 11 5.0 3.5 1.600 0.60 setosa
## 12 6.0 2.7 5.100 1.60 versicolor
## 13 6.0 3.0 4.800 NA virginica
## 14 6.8 2.8 4.800 1.40 versicolor
## 15 NA 3.9 1.700 0.40 setosa
## 16 5.0 NA 3.500 1.00 versicolor
## 17 5.5 NA 4.000 1.30 versicolor
## 18 4.7 3.2 1.300 0.20 setosa
## 19 NA 4.0 NA 0.20 setosa
## 20 5.6 NA 4.200 1.30 versicolor
## 21 4.9 3.6 NA 0.10 setosa
## 22 5.4 NA 4.500 1.50 versicolor
## 23 6.2 2.8 NA 1.80 virginica
## 24 6.7 3.3 5.700 2.50 virginica
## 25 NA 3.0 5.900 2.10 virginica
## 26 4.6 3.2 1.400 0.20 setosa
## 27 4.9 3.1 1.500 0.10 setosa
## 28 7.3 2.9 6.300 NA virginica
## 29 6.5 3.2 5.100 2.00 virginica
## 30 NA 2.8 0.820 1.30 versicolor
## 31 4.4 3.2 NA 0.20 setosa
## 32 5.9 3.2 4.800 NA versicolor
## 33 5.7 2.8 4.500 1.30 versicolor
## 34 6.2 2.9 NA 1.30 versicolor
## 35 6.6 2.9 23.000 1.30 versicolor
## 36 4.8 3.0 1.400 0.10 setosa
## 37 6.5 3.0 5.500 1.80 virginica
## 38 6.2 2.2 4.500 1.50 versicolor
## 39 6.7 2.5 5.800 1.80 virginica
## 40 5.0 3.0 1.600 0.20 setosa
## 41 5.0 NA 1.200 0.20 setosa
## 42 5.8 2.7 3.900 1.20 versicolor
## 43 0.0 NA 0.130 0.04 setosa
## 44 5.8 2.7 5.100 1.90 virginica
## 45 5.5 4.2 1.400 0.20 setosa
## 46 7.7 2.8 6.700 2.00 virginica
## 47 5.7 NA NA 0.40 setosa
## 48 7.0 3.2 4.700 1.40 versicolor
## 49 6.5 3.0 5.800 2.20 virginica
## 50 6.0 3.4 4.500 1.60 versicolor
## 51 5.5 2.6 4.400 1.20 versicolor
## 52 4.9 3.1 NA 0.20 setosa
## 53 5.2 2.7 3.900 1.40 versicolor
## 54 4.8 3.4 1.600 0.20 setosa
## 55 6.3 3.3 4.700 1.60 versicolor
## 56 7.7 3.8 6.700 2.20 virginica
## 57 5.1 3.8 1.500 0.30 setosa
## 58 NA 2.9 4.500 1.50 versicolor
## 59 6.4 2.8 5.600 NA virginica
## 60 6.4 2.8 5.600 2.10 virginica
## 61 5.0 2.3 3.300 NA versicolor
## 62 7.4 2.8 6.100 1.90 virginica
## 63 4.3 3.0 1.100 0.10 setosa
## 64 5.0 3.3 1.400 0.20 setosa
## 65 7.2 3.0 5.800 1.60 virginica
## 66 6.3 2.5 4.900 1.50 versicolor
## 67 5.1 2.5 NA 1.10 versicolor
## 68 NA 3.2 5.700 2.30 virginica
## 69 5.1 3.5 NA NA setosa
## 70 5.0 3.5 1.300 0.30 setosa
## 71 6.1 3.0 4.600 1.40 versicolor
## 72 6.9 3.1 5.100 2.30 virginica
## 73 5.1 3.5 1.400 0.30 setosa
## 74 6.5 NA 4.600 1.50 versicolor
## 75 5.6 2.8 4.900 2.00 virginica
## 76 4.9 2.5 4.500 NA virginica
## 77 5.5 3.5 1.300 0.20 setosa
## 78 7.6 3.0 6.600 2.10 virginica
## 79 5.1 3.8 0.000 0.20 setosa
## 80 7.9 3.8 6.400 2.00 virginica
## 81 6.1 2.6 5.600 1.40 virginica
## 82 5.4 3.4 1.700 0.20 setosa
## 83 6.1 2.9 4.700 1.40 versicolor
## 84 5.4 3.7 1.500 0.20 setosa
## 85 6.7 3.0 5.200 2.30 virginica
## 86 5.1 3.8 1.900 NA setosa
## 87 6.4 2.9 4.300 1.30 versicolor
## 88 5.7 2.9 4.200 1.30 versicolor
## 89 4.4 2.9 1.400 0.20 setosa
## 90 6.3 2.5 5.000 1.90 virginica
## 91 7.2 3.2 6.000 1.80 virginica
## 92 4.9 NA 3.300 1.00 versicolor
## 93 5.2 3.4 1.400 0.20 setosa
## 94 5.8 2.7 5.100 1.90 virginica
## 95 6.0 2.2 5.000 1.50 virginica
## 96 6.9 3.1 NA 1.50 versicolor
## 97 5.5 2.3 4.000 1.30 versicolor
## 98 6.7 NA 5.000 1.70 versicolor
## 99 5.7 3.0 4.200 1.20 versicolor
## 100 6.3 2.8 5.100 1.50 virginica
## 101 5.4 3.4 1.500 0.40 setosa
## 102 7.2 3.6 NA 2.50 virginica
## 103 6.3 2.7 4.900 NA virginica
## 104 5.6 3.0 4.100 1.30 versicolor
## 105 5.1 3.7 NA 0.40 setosa
## 106 5.5 NA 0.925 1.00 versicolor
## 107 6.5 3.0 5.200 2.00 virginica
## 108 4.8 3.0 1.400 NA setosa
## 109 6.1 2.8 NA 1.30 versicolor
## 110 4.6 3.4 1.400 0.30 setosa
## 111 6.3 3.4 NA 2.40 virginica
## 112 5.0 3.4 1.500 0.20 setosa
## 113 5.1 3.4 1.500 0.20 setosa
## 114 NA 3.3 5.700 2.10 virginica
## 115 6.7 3.1 4.700 1.50 versicolor
## 116 7.7 2.6 6.900 2.30 virginica
## 117 6.3 NA 4.400 1.30 versicolor
## 118 4.6 3.1 1.500 0.20 setosa
## 119 NA 3.0 5.500 2.10 virginica
## 120 NA 2.8 4.700 1.20 versicolor
## 121 5.9 3.0 NA 1.50 versicolor
## 122 4.5 2.3 1.300 0.30 setosa
## 123 6.4 3.2 5.300 2.30 virginica
## 124 5.2 4.1 1.500 0.10 setosa
## 125 4.9 3.0 1.400 0.20 setosa
## 126 5.6 2.9 3.600 1.30 versicolor
## 127 6.8 3.2 5.900 2.30 virginica
## 128 5.8 NA 5.100 2.40 virginica
## 129 4.6 3.6 NA 0.20 setosa
## 130 5.7 NA 1.700 0.30 setosa
## 131 5.6 2.5 3.900 1.10 versicolor
## 132 6.7 3.1 4.400 1.40 versicolor
## 133 4.8 NA 1.900 0.20 setosa
## 134 5.1 3.3 1.700 0.50 setosa
## 135 4.4 3.0 1.300 NA setosa
## 136 7.7 3.0 NA 2.30 virginica
## 137 4.7 3.2 1.600 0.20 setosa
## 138 NA 3.0 4.900 1.80 virginica
## 139 6.9 3.1 5.400 2.10 virginica
## 140 6.0 2.2 4.000 1.00 versicolor
## 141 5.0 NA 1.400 0.20 setosa
## 142 5.5 NA 3.800 1.10 versicolor
## 143 6.6 3.0 4.400 1.40 versicolor
## 144 6.3 2.9 5.600 1.80 virginica
## 145 5.7 2.5 5.000 2.00 virginica
## 146 6.7 3.1 5.600 2.40 virginica
## 147 5.6 3.0 4.500 1.50 versicolor
## 148 5.2 3.5 1.500 0.20 setosa
## 149 6.4 3.1 NA 1.80 virginica
## 150 5.8 2.6 4.000 NA versicolor
##
## $corrections
## row variable old new
## 1 16 Sepal.Width -3 NA
## 2 130 Sepal.Width 0 NA
## how
## 1 if (!is.na(Sepal.Width) && Sepal.Width <= 0) Sepal.Width = NA
## 2 if (!is.na(Sepal.Width) && Sepal.Width <= 0) Sepal.Width = NA
Replace all erroneous values with NA using (the result of) localizeErrors:
mydata[localizeErrors(RULE, mydata)$adapt] <- NA
# anything violated?
any(violatedEdits(RULE,mydata), na.rm=TRUE)
## [1] FALSE
Well done! No errors!
After you have acquired the data, you should do the following:
The dlookr package makes these steps fast and easy:
Here we will introduce the data transformation methods
provided by the dlookr package. You will learn how to transform
tbl_df data (which inherits from data.frame) as well as plain
data.frame objects with functions provided by dlookr.
dlookr increases synergy with dplyr; particularly in
data transformation and data wrangling, it increases the efficiency of the
tidyverse package group.
To illustrate the basic use of data transformation in the dlookr
package, we use the Carseats dataset. Carseats in
the ISLR package is a simulated dataset on sales of
children's car seats at 400 stores. This data.frame was created
for the purpose of predicting sales volume.
str(Carseats)
## 'data.frame': 400 obs. of 11 variables:
## $ Sales : num 9.5 11.22 10.06 7.4 4.15 ...
## $ CompPrice : num 138 111 113 117 141 124 115 136 132 132 ...
## $ Income : num 73 48 35 100 64 113 105 81 110 113 ...
## $ Advertising: num 11 16 10 4 3 13 0 15 0 0 ...
## $ Population : num 276 260 269 466 340 501 45 425 108 131 ...
## $ Price : num 120 83 80 97 128 72 108 120 124 124 ...
## $ ShelveLoc : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
## $ Age : num 42 65 59 55 38 78 71 67 76 76 ...
## $ Education : num 17 10 12 14 13 16 15 10 10 17 ...
## $ Urban : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
## $ US : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...
The meaning of the individual variables is described on the ISLR::Carseats man page.
When performing data analysis we often encounter data containing
missing values. However, Carseats is complete data,
without any missing values. Therefore, missing values are generated artificially as follows,
and the result is stored in a data.frame object named carseats.
carseats <- ISLR::Carseats
suppressWarnings(RNGversion("3.5.0"))
set.seed(123)
carseats[sample(seq(NROW(carseats)), 20), "Income"] <- NA
suppressWarnings(RNGversion("3.5.0"))
set.seed(456)
carseats[sample(seq(NROW(carseats)), 10), "Urban"] <- NA
dlookr imputes missing values and outliers and resolves skewed data. It also provides the ability to bin continuous variables as categorical variables.
Here is a list of the data transformation functions provided by dlookr:
+ find_na() finds variables that contain missing values, and imputate_na() imputes the missing values.
+ find_outliers() finds variables that contain outliers, and imputate_outlier() imputes the outliers.
+ summary.imputation() and plot.imputation() provide information about and visualization of the imputed variables.
+ find_skewness() finds variables with skewed data, and transform() resolves the skewness.
+ transform() also performs standardization of numeric variables.
+ summary.transform() and plot.transform() provide information about and visualization of transformed variables.
+ binning() and binning_by() convert numerical data into categorical data.
+ print.bins() and summary.bins() show and summarize the binning results.
+ plot.bins() and plot.optimal_bins() provide visualization of the binning result.
+ transformation_report() performs the data transformation and reports the result.

imputate_na() imputes the missing values contained in a variable. The variable to impute can be either numeric or categorical, and several imputation methods are supported (the worked examples below use rpart and mice).
Imputation is the process of estimating or deriving values for fields where data is missing. There is a vast body of literature on imputation methods and it goes beyond the scope of this tutorial to discuss all of them.
There is no single best imputation method that works in all cases. The imputation model of choice depends on what auxiliary information is available and whether there are (multivariate) edit restrictions on the data to be imputed. The availability of R software for imputation under edit restrictions is, to the best of our knowledge, limited. However, a viable strategy for imputing numerical data is to first impute the missing values without restrictions and then minimally adjust the imputed values so that the restrictions are obeyed. Separately, these methods are available in R.
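To illustrate the idea of that two-step strategy, here is a toy base-R sketch on the iris data from the first part of this tutorial (mean imputation followed by a crude adjustment of the imputed values; in practice a dedicated imputation package would be preferable):
dat <- mydata
imp <- is.na(dat$Petal.Width)                        # remember which values we impute
# Step 1: impute without looking at the edit rules
dat$Petal.Width[imp] <- mean(dat$Petal.Width, na.rm = TRUE)
# Step 2: adjust the imputed values so that Petal.Length >= 2 * Petal.Width holds
fix <- imp & !is.na(dat$Petal.Length) & dat$Petal.Length < 2 * dat$Petal.Width
dat$Petal.Width[fix] <- dat$Petal.Length[fix] / 2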
Example
In the following example, imputate_na() imputes the
missing values of Income, a numeric variable of carseats,
using the "rpart" method. summary() summarizes the
missing-value imputation, and plot() visualizes the imputation.
income <- imputate_na(carseats, Income, US, method = "rpart")
# summary of imputation
summary(income)
## * Impute missing values based on Recursive Partitioning and Regression Trees
## - method : rpart
##
## * Information of Imputation (before vs after)
## Original Imputation
## described_variables "value" "value"
## n "380" "400"
## na "20" " 0"
## mean "68.86053" "69.05073"
## sd "28.09161" "27.57382"
## se_mean "1.441069" "1.378691"
## IQR "48.25" "46.00"
## skewness "0.04490600" "0.02935732"
## kurtosis "-1.089201" "-1.035086"
## p00 "21" "21"
## p01 "21.79" "21.99"
## p05 "26" "26"
## p10 "30.0" "30.9"
## p20 "39" "40"
## p25 "42.75" "44.00"
## p30 "48.00000" "51.58333"
## p40 "62" "63"
## p50 "69" "69"
## p60 "78.0" "77.4"
## p70 "86.3" "84.3"
## p75 "91" "90"
## p80 "96.2" "96.0"
## p90 "108.1" "106.1"
## p95 "115.05" "115.00"
## p99 "119.21" "119.01"
## p100 "120" "120"
# viz of imputation
plot(income)
Example
The following imputes the categorical variable Urban using
the "mice" method.
library(mice)
##
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
##
##     filter
## The following objects are masked from 'package:base':
##
##     cbind, rbind
urban <- imputate_na(carseats, Urban, US, method = "mice")
##
## iter imp variable
## 1 1 Income Urban
## 1 2 Income Urban
## 1 3 Income Urban
## 1 4 Income Urban
## 1 5 Income Urban
## 2 1 Income Urban
## 2 2 Income Urban
## 2 3 Income Urban
## 2 4 Income Urban
## 2 5 Income Urban
## 3 1 Income Urban
## 3 2 Income Urban
## 3 3 Income Urban
## 3 4 Income Urban
## 3 5 Income Urban
## 4 1 Income Urban
## 4 2 Income Urban
## 4 3 Income Urban
## 4 4 Income Urban
## 4 5 Income Urban
## 5 1 Income Urban
## 5 2 Income Urban
## 5 3 Income Urban
## 5 4 Income Urban
## 5 5 Income Urban
# result of imputation
urban
## [1] Yes Yes Yes Yes Yes No Yes Yes No No No Yes Yes Yes Yes No Yes Yes
## [19] No Yes Yes No Yes Yes Yes No No Yes Yes Yes Yes Yes Yes Yes Yes No
## [37] No Yes Yes No No Yes Yes Yes Yes Yes No Yes Yes Yes Yes Yes Yes Yes
## [55] No Yes Yes Yes Yes Yes Yes No Yes Yes No No Yes Yes Yes Yes Yes No
## [73] Yes No No No Yes No Yes Yes Yes Yes Yes Yes No No Yes No Yes No
## [91] No Yes Yes No Yes Yes No Yes No No No Yes No Yes Yes Yes No Yes
## [109] Yes No Yes Yes Yes Yes Yes Yes No Yes Yes Yes Yes Yes Yes No Yes No
## [127] Yes Yes Yes No Yes Yes Yes Yes Yes No No Yes Yes No Yes Yes Yes Yes
## [145] No Yes Yes No No Yes Yes No No No No Yes Yes No No No No No
## [163] Yes No No Yes Yes Yes Yes Yes Yes Yes Yes Yes No Yes No Yes No Yes
## [181] Yes Yes Yes Yes No Yes No Yes Yes No No Yes No Yes Yes Yes Yes Yes
## [199] Yes Yes No Yes No Yes Yes Yes Yes No Yes No No Yes Yes Yes Yes Yes
## [217] Yes No Yes Yes Yes Yes Yes Yes No Yes Yes Yes No No No No Yes No
## [235] No Yes Yes Yes Yes Yes Yes Yes No Yes Yes No Yes Yes Yes Yes Yes Yes
## [253] Yes No Yes Yes Yes Yes No No Yes Yes Yes Yes Yes Yes No No Yes Yes
## [271] Yes Yes Yes Yes Yes Yes Yes Yes No Yes Yes No Yes No No Yes No Yes
## [289] No Yes No No Yes Yes Yes No Yes Yes Yes No Yes Yes Yes Yes Yes Yes
## [307] Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes No No No Yes Yes Yes Yes
## [325] Yes Yes Yes Yes Yes Yes No Yes Yes Yes Yes Yes Yes Yes No Yes Yes No
## [343] No Yes No Yes No No Yes No No No Yes No Yes Yes Yes Yes Yes Yes
## [361] No No Yes Yes Yes No No Yes No Yes Yes Yes No Yes Yes Yes Yes No
## [379] Yes Yes Yes Yes Yes Yes Yes Yes Yes No Yes Yes Yes Yes Yes No Yes Yes
## [397] No Yes Yes Yes
## attr(,"var_type")
## [1] categorical
## attr(,"method")
## [1] mice
## attr(,"na_pos")
## [1] 33 36 84 94 113 132 151 292 313 339
## attr(,"seed")
## [1] 24283
## attr(,"type")
## [1] missing values
## attr(,"message")
## [1] complete imputation
## attr(,"success")
## [1] TRUE
## Levels: No Yes
# summary of imputation
summary(urban)
## * Impute missing values based on Multivariate Imputation by Chained Equations
## - method : mice
## - random seed : 24283
##
## * Information of Imputation (before vs after)
## original imputation original_percent imputation_percent
## No 115 119 28.75 29.75
## Yes 275 281 68.75 70.25
## <NA> 10 0 2.50 0.00
# viz of imputation
plot(urban)
In the following exercise, try to impute the missing values of the
Income variable and then calculate the arithmetic mean
for each level of US. In this case dplyr
should be used, and the result is easy to interpret when written with pipes!
# The mean before and after the imputation of the Income variable
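One possible solution, mirroring the Price example shown further below (this sketch uses the "rpart" method and assumes the carseats object created above):
carseats %>%
  mutate(Income_imp = imputate_na(carseats, Income, US, method = "rpart")) %>%
  group_by(US) %>%
  summarise(orig = mean(Income, na.rm = TRUE),
            imputation = mean(Income_imp, na.rm = TRUE))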
imputate_outlier() imputes outlier values. The variable to
impute must be numeric, and several imputation methods are supported
(the example below uses "capping").
Use the function imputate_outlier() to impute the
outliers of the numeric variable Price with the "capping"
method. Hint: summary() summarizes the outlier imputation
information, and plot() visualizes it.
# solution for the exercise 4 here ;-)
The following example imputes the outliers of the Price
variable and then calculates the arithmetic mean for each level of
US. In this case dplyr is used, and the result is
easy to interpret when written with pipes.
Example
# The mean before and after the imputation of the Price variable
carseats %>%
mutate(Price_imp = imputate_outlier(carseats, Price, method = "capping")) %>%
group_by(US) %>%
summarise(orig = mean(Price, na.rm = TRUE),
imputation = mean(Price_imp, na.rm = TRUE))
## # A tibble: 2 × 3
## US orig imputation
## <fct> <dbl> <dbl>
## 1 No 114. 114.
## 2 Yes 117. 117.
Finally, we sometimes encounter problems with skewed distributions, or we simply want to transform, recode, or discretize variables. Let's review some of the most popular transformation methods.
First, standardization (also known as normalization):
Z-score approach - standardization procedure, using the formula: \(z=\frac{x-\mu}{\sigma}\) where \(\mu\) = mean and \(\sigma\) = standard deviation. Z-scores are also known as standardized scores; they are scores (or data values) that have been given a common standard. This standard is a mean of zero and a standard deviation of 1.
minmax approach - An alternative to Z-score standardization is so-called MinMax scaling (often also simply called "normalization", a common cause of ambiguity). In this approach the data are scaled to a fixed range, usually 0 to 1. The cost of having this bounded range, in contrast to standardization, is that we end up with smaller standard deviations, which can suppress the effect of outliers. To perform MinMax scaling, simply subtract the minimum value and divide by the range: \(\frac{x-\min}{\max-\min}\)
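A minimal base-R illustration of both formulas, using the Income variable of carseats (assumed to be available from the earlier examples):
x  <- carseats$Income
z  <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)     # z-score standardization
mm <- (x - min(x, na.rm = TRUE)) /
      (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))         # min-max scaling to [0, 1]
c(mean_z = mean(z, na.rm = TRUE), sd_z = sd(z, na.rm = TRUE))   # approximately 0 and 1
range(mm, na.rm = TRUE)                                         # 0 to 1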
In order to solve problems with very skewed distributions we can also use several types of simple transformations (a short sketch follows the list below):
+ "log" : log transformation. log(x)
+ "log+1" : log transformation. log(x + 1). Used for values that contain 0.
+ "sqrt" : square root transformation.
+ "1/x" : 1 / x transformation
+ "x^2" : x square transformation
+ "x^3" : x^3 square transformation
Use the methods “zscore” and “minmax” to perform standardization.
carseats %>%
mutate(Income_minmax = transform(carseats$Income, method = "minmax"),
Sales_minmax = transform(carseats$Sales, method = "minmax")) %>%
select(Income_minmax, Sales_minmax) %>%
boxplot()
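The same pattern works for the "zscore" method mentioned in the exercise; a sketch:
carseats %>%
  mutate(Income_zscore = transform(carseats$Income, method = "zscore"),
         Sales_zscore  = transform(carseats$Sales,  method = "zscore")) %>%
  select(Income_zscore, Sales_zscore) %>%
  boxplot()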
Sometimes we would like to perform a so-called 'binning' procedure, so that we can analyze our data as categorical, compare several categorical variables, construct statistical models, and so on. Thanks to the binning() function we can transform quantitative variables into categorical ones using several methods:
binning() transforms a numeric variable into a
categorical variable by binning it. Several types of binning are
supported; the default type is "quantile", as used in the example below.
Here are some examples of how to bin Income using binning():
Example:
# Binning the Income variable. The default type argument is "quantile"
bin <- binning(carseats$Income)
# Print bins class object
bin
## binned type: quantile
## number of bins: 10
## x
## [21,30] (30,39] (39,48] (48,62] (62,69]
## 40 37 38 40 42
## (69,78] (78,86.56667] (86.56667,96.6] (96.6,108.6333] (108.6333,120]
## 33 36 38 38 38
## <NA>
## 20
# Summarize bins class object
summary(bin)
## levels freq rate
## 1 [21,30] 40 0.1000
## 2 (30,39] 37 0.0925
## 3 (39,48] 38 0.0950
## 4 (48,62] 40 0.1000
## 5 (62,69] 42 0.1050
## 6 (69,78] 33 0.0825
## 7 (78,86.56667] 36 0.0900
## 8 (86.56667,96.6] 38 0.0950
## 9 (96.6,108.6333] 38 0.0950
## 10 (108.6333,120] 38 0.0950
## 11 <NA> 20 0.0500
# Plot bins class object
plot(bin)
Using pipes & dplyr:
carseats %>%
mutate(Income_bin = binning(carseats$Income) %>%
extract()) %>%
group_by(ShelveLoc, Income_bin) %>%
summarise(freq = n()) %>%
arrange(desc(freq)) %>%
head(10)
## `summarise()` has grouped output by 'ShelveLoc'. You can override using the
## `.groups` argument.
## # A tibble: 10 × 3
## # Groups: ShelveLoc [1]
## ShelveLoc Income_bin freq
## <fct> <ord> <int>
## 1 Medium [21,30] 25
## 2 Medium (62,69] 24
## 3 Medium (48,62] 23
## 4 Medium (39,48] 21
## 5 Medium (30,39] 20
## 6 Medium (86.56667,96.6] 20
## 7 Medium (108.6333,120] 20
## 8 Medium (69,78] 18
## 9 Medium (96.6,108.6333] 18
## 10 Medium (78,86.56667] 17
Example: Recode the original distribution of Income into four bins and assign them labels.
income_fixed<- binning(carseats$Income, nbins = 4,
labels = c("low", "average", "high", "very high"))
summary(income_fixed)
## levels freq rate
## 1 low 95 0.2375
## 2 average 102 0.2550
## 3 high 89 0.2225
## 4 very high 94 0.2350
## 5 <NA> 20 0.0500
plot(income_fixed)
In the case of statistical modeling (for example, for credit scoring purposes) we need to be aware that an optimal discretization of the original distribution may be required. The binning_by() function helps here.
Example: Perform discretization of the variable ‘Advertising’ using optimal binning.
bin <- binning_by(carseats, "US", "Advertising")
## Warning in binning_by(carseats, "US", "Advertising"): The factor y has been changed to a numeric vector consisting of 0 and 1.
## 'Yes' changed to 1 (positive) and 'No' changed to 0 (negative).
summary(bin)
## ── Binning Table ──────────────────────── Several Metrics ──
## Bin CntRec CntPos CntNeg RatePos RateNeg Odds WoE IV JSD
## 1 [-1,0] 144 19 125 0.07364 0.88028 0.1520 -2.48101 2.00128 0.20093
## 2 (0,6] 69 54 15 0.20930 0.10563 3.6000 0.68380 0.07089 0.00869
## 3 (6,29] 187 185 2 0.71705 0.01408 92.5000 3.93008 2.76272 0.21861
## 4 Total 400 258 142 1.00000 1.00000 1.8169 NA 4.83489 0.42823
## AUC
## 1 0.03241
## 2 0.01883
## 3 0.00903
## 4 0.06028
##
## ── General Metrics ─────────────────────────────────────────
## • Gini index : -0.87944
## • IV (Jeffrey) : 4.83489
## • JS (Jensen-Shannon) Divergence : 0.42823
## • Kolmogorov-Smirnov Statistics : 0.80664
## • HHI (Herfindahl-Hirschman Index) : 0.37791
## • HHI (normalized) : 0.06687
## • Cramer's V : 0.81863
##
## ── Significance Tests ──────────────────── Chisquare Test ──
## Bin A Bin B statistics p_value
## 1 [-1,0] (0,6] 87.67064 7.731349e-21
## 2 (0,6] (6,29] 34.73349 3.780706e-09
plot(bin)
Please take a look once again at the final report of our binning procedure. Try to interpret the information criteria (Jeffrey's IV or the Kolmogorov-Smirnov statistic) and the plots.
Finally, we can print summary reports after all the cleansing procedures are done, in PDF format (a TeX installation is required first) or in HTML format. Please run this chunk in your own RStudio session in order to produce the reports.
carseats %>%
transformation_report(target = US, output_format = "html",
output_file = "transformation_carseats.html")
Use the quantile approach to perform binning of the variable 'Income' (4 equal-frequency categories).
# solution for the exercise 5 here ;-)
dlookr provides two automated data transformation reports:
transformation_web_report() creates a dynamic report for objects inheriting from data.frame (tbl_df, tbl, etc.) or for plain data.frame objects.
The contents of the report are as follows:
The following script creates a data transformation report for the tbl_df class object heartfailure:
heartfailure %>% transformation_web_report(target = "death_event", subtitle = "heartfailure", output_dir = "./", output_file = "transformation.html", theme = "blue")
transformation_paged_report() creates a static report for objects inheriting from data.frame (tbl_df, tbl, etc.) or for plain data.frame objects.
The contents of the report are as follows:
The following script creates a data transformation report for the data.frame class object heartfailure:
heartfailure %>% transformation_paged_report(target = "death_event", subtitle = "heartfailure", output_dir = "./", output_file = "transformation.pdf", theme = "blue")
diagnose_paged_report() creates a static data quality diagnosis report for objects inheriting from data.frame (tbl_df, tbl, etc.) or for plain data.frame objects.
The following script creates a quality diagnosis report for the tbl_df class object flights:
flights %>% diagnose_paged_report(subtitle = "flights", output_dir = "./", output_file = "Diagn.pdf", theme = "blue")
The cover of the report is shown in the following figure.
dlookr provides two automated EDA reports:
Web page-based dynamic reports can perform in-depth analysis through visualization and statistical tables.
Static reports generated as PDF or HTML files can be archived as output of the data analysis.
eda_web_report() creates a dynamic report for objects inheriting from data.frame (tbl_df, tbl, etc.) or for plain data.frame objects.
The following script creates an EDA report for the data.frame class object heartfailure:
heartfailure %>% eda_web_report(target = "death_event", subtitle = "heartfailure", output_dir = "./", output_file = "EDA.html", theme = "blue")
The dynamic contents of the report are shown in the following figure.
If we are to take away one thing from this TIDY WEEK it should be that dealing with missing data is not an easy problem. Data will rarely be missing in a nice (completely random) way, so if we want to resort to more simple removal or imputation techniques, we must put reasonable effort into determining whether or not the dependencies between observed data and missingness are a cause for concern. If strong dependencies exist and/or there is a lot of data missing, the missing data problem becomes a prediction problem (or a series of prediction problems). We should also be aware of the possibility that missingness depends on the missing value in a way that can’t be explained by observed variables. This can also cause bias in our analyses and we will not be able to detect it unless we get our hands on some of the missing values.
Note that we focused only on standard cross-sectional data. There are many other types of data, such as time-series data, spatial data, images, sound, graphs, etc. The basic principles remain unchanged: we must be aware of the missingness mechanism and of the risk of introducing bias. We can deal with missing values either by removing observations/variables or by imputing them. However, different, sometimes additional, models and techniques will be appropriate. For example, temporal and spatial data lend themselves to interpolation of missing values.
A gentle introduction from a practitioner's perspective: Blankers, M., Koeter, M. W., & Schippers, G. M. (2010). Missing data approaches in eHealth research: simulation study and a tutorial for nonmathematically inclined researchers. Journal of Medical Internet Research, 12(5), e54.
A great book on basic and some advanced techniques: Allison, P. D. (2001). Missing data (Vol. 136). Sage Publications.
Another great book with many examples and case-studies: Van Buuren, S. (2018). Flexible imputation of missing data. Chapman and Hall/CRC.
Understanding multiple imputation with chained equations (in R): Buuren, S. V., & Groothuis-Oudshoorn, K. (2010). mice: Multivariate imputation by chained equations in R. Journal of Statistical Software, 1-68.
When handling missing values and building prediction models separately, we should be aware that certain types of prediction models might work better with certain methods for handling missing values. A paper that illustrates this: Yadav, M. L., & Roychoudhury, B. (2018). Handling missing values: A study of popular imputation packages in R. Knowledge-Based Systems, 160, 104-118.
Data science students should work towards obtaining the knowledge and the skills that enable them to: