Missing values: detection and patterns of NA’s

Now we will work with a table containing data on the famous Chilean plebiscite held in 1988 to decide whether or not to extend the rule of Augusto Pinochet. See details about the variables here. We will load the table using a link:

df <- read.csv("http://math-info.hse.ru/f/2017-18/ps-ms/Chile.csv")

Let’s look at the data frame using the code:

View(df) # V is capital

And explore its structure:

str(df)
## 'data.frame':    2700 obs. of  9 variables:
##  $ X         : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ region    : Factor w/ 5 levels "C","M","N","S",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ population: int  175000 175000 175000 175000 175000 175000 175000 175000 175000 175000 ...
##  $ sex       : Factor w/ 2 levels "F","M": 2 2 1 1 1 1 2 1 1 2 ...
##  $ age       : int  65 29 38 49 23 28 26 24 41 41 ...
##  $ education : Factor w/ 3 levels "P","PS","S": 1 2 1 1 3 1 2 3 1 1 ...
##  $ income    : int  35000 7500 15000 35000 35000 7500 35000 15000 15000 15000 ...
##  $ statusquo : num  1.01 -1.3 1.23 -1.03 -1.1 ...
##  $ vote      : Factor w/ 4 levels "A","N","U","Y": 4 2 4 2 2 2 2 2 3 2 ...

We can request the number of rows or columns separately:

nrow(df)
## [1] 2700
ncol(df)
## [1] 9
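
By the way, both numbers can be obtained at once with the base function dim(), which returns the number of rows and the number of columns as a single vector:

dim(df) # rows and columns in one vector: 2700 and 9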

We can check whether rows contain missing values:

head(complete.cases(df)) # vector of TRUE/FALSE for all rows
## [1] TRUE TRUE TRUE TRUE TRUE TRUE

And count the number of complete cases, i.e. rows with no NA’s:

sum(complete.cases(df)) # rows without NA's
## [1] 2431

Although TRUE and FALSE values are logical, the sum() function works correctly since they are automatically converted into integers: 1 for TRUE and 0 for FALSE.
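
We can check this behaviour on a small vector:

sum(c(TRUE, FALSE, TRUE, TRUE)) # 1 + 0 + 1 + 1
## [1] 3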

And how do we count, vice versa, the number of rows with missing values? Simply by putting a negation sign before complete.cases(); in programming negation is usually written as !.

sum(!complete.cases(df)) # rows with NA's
## [1] 269
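
The ! sign simply flips logical values, turning TRUE into FALSE and vice versa:

!c(TRUE, FALSE, TRUE)
## [1] FALSE  TRUE FALSE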

If we want to look at the rows with NA’s, we can filter them from the whole data set. Filtering is usually done by writing conditions in square brackets: the first position holds the condition for rows, the second one for columns. Here we need to choose the rows that are not complete cases, and all the columns.

head(df[!complete.cases(df), ])
##       X region population sex age education income statusquo vote
## 13   13      N     175000   F  27        PS     NA   1.43448    Y
## 15   15      N     175000   M  36        PS  35000   1.49026 <NA>
## 28   28      N     175000   F  43         P     NA   0.15489    A
## 76   76      N     125000   F  32         S     NA  -0.85035    N
## 98   98      N     125000   F  34         P   2500   0.10807 <NA>
## 113 113      N     250000   F  46         S     NA   0.15489 <NA>

We left the second position blank: if we want to see all the columns, there is no need to specify them.
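
If we needed particular columns only, we could list their names in the second position instead, for example income and vote:

head(df[!complete.cases(df), c("income", "vote")]) # same rows, two columns only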

To take a closer look at these rows, we can save them into a separate data frame and then view them in a more convenient way:

with_na <- df[!complete.cases(df), ]
View(with_na)

Sometimes it is important to know the patterns of missing values so as to understand whether the lack of data is systematic or random. If it seems to be random, there should not be any problems, but if missing values are systematic, i.e. occur in certain columns together, it might be a sign of bias. For instance, if we see NA’s in the answers to three questions simultaneously, in the same rows, it can serve as evidence of poorly worded questions or questions on sensitive topics like money or politics.

To visualise the patterns we have to install two libraries: mice and VIM. All libraries (or packages) are installed via the function install.packages(). It is enough to install a library once; after that it can be loaded in any RStudio session.

# make sure your Internet connection is ok
install.packages("mice")
install.packages("VIM")

Once the libraries are installed, we can use their functions. However, we usually have to load a library first by referring to it via library(). This should be done in every new R session:

library(mice)
library(VIM)

Without these lines R will return the error could not find function "..." when we type commands from these libraries, as it will not know where to find them.

Now everything is ready to plot graphs for missing values.

aggr(df)

This graph consists of two parts. The left one illustrates the frequencies of NA’s in each column. Judging by this bar chart, most missing values are concentrated in the columns vote and income. However, the share of NA’s in vote is not very high (0.06, or 6%) given the size of our data set (2700 rows). The right graph shows the patterns of missing values, i.e. the combinations of variables with NA’s. From this graph we can see that people who did not indicate their income also said nothing about their intention to vote, and did not provide any information on their attitude towards the status quo (statusquo) either. Thus, missing values in this data set are not distributed randomly; there is a certain pattern.
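
The mice library loaded above offers a tabular view of the same information: md.pattern() lists the combinations of observed (1) and missing (0) values and counts how many rows fall into each pattern (the exact table depends on the data).

md.pattern(df) # patterns of missing values as a table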

matrixplot(df)

## 
## Click in a column to sort by the corresponding variable.
## To regain use of the VIM GUI and the R console, click outside the plot region.

Let’s look at another graph. It also illustrates patterns of missing values. The y-axis shows row numbers (indices of rows), the x-axis shows variables. Cell colours range from white to black, with darker shades corresponding to larger values. NA’s are represented by red cells. So, if we see big red rectangles, there are a lot of missing values. Besides, we can again see which columns contain NA’s simultaneously.

Now for simplicity we will delete rows with missing values so as to avoid problems with further calculations and graphs:

df <- na.omit(df)

We can check:

sum(!complete.cases(df)) # 0 NA's
## [1] 0
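
The number of rows has decreased accordingly, as only complete cases are left:

nrow(df) # 2431 rows instead of 2700
## [1] 2431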

Now let’s see the statistical summary, i.e. descriptive statistics, and proceed to visualisation of variables’ distribution.

summary(df)
##        X          region     population     sex           age       
##  Min.   :   1.0   C :548   Min.   :  3750   F:1250   Min.   :18.00  
##  1st Qu.: 636.5   M : 75   1st Qu.: 25000   M:1181   1st Qu.:25.00  
##  Median :1326.0   N :305   Median :175000            Median :36.00  
##  Mean   :1326.9   S :655   Mean   :151605            Mean   :38.29  
##  3rd Qu.:2006.5   SA:848   3rd Qu.:250000            3rd Qu.:49.00  
##  Max.   :2700.0            Max.   :250000            Max.   :70.00  
##  education     income         statusquo        vote   
##  P :1002   Min.   :  2500   Min.   :-1.72594   A:177  
##  PS: 419   1st Qu.:  7500   1st Qu.:-1.00974   N:867  
##  S :1010   Median : 15000   Median :-0.08924   U:551  
##            Mean   : 34020   Mean   :-0.01127   Y:836  
##            3rd Qu.: 35000   3rd Qu.: 0.96969          
##            Max.   :200000   Max.   : 1.71355

Nominal variables: visualising a distribution

First of all, for a nominal variable we can make a frequency table. For instance, we will take the variable vote and count the number of occurrences of each of its unique values. To choose a variable from df, we need the dollar sign ($):

table(df$vote)
## 
##   A   N   U   Y 
## 177 867 551 836

By default R computes absolute frequencies, but we can get relative ones by hand:

table(df$vote)/sum(table(df$vote)) # shares, fractions
## 
##          A          N          U          Y 
## 0.07280954 0.35664336 0.22665570 0.34389140
table(df$vote)/sum(table(df$vote)) * 100 # in percent (%)
## 
##         A         N         U         Y 
##  7.280954 35.664336 22.665570 34.389140
res <- table(df$vote)/sum(table(df$vote)) * 100 
round_res <- round(res, 2) # round to 2 digits after the decimal point
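
The same relative frequencies can be obtained with the built-in function prop.table(), so we do not have to divide by the total by hand:

round(prop.table(table(df$vote)) * 100, 2) # percentages rounded to 2 digits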

Secondly, we can plot a basic R bar chart for this variable:

barplot(table(df$vote))

A basic plot is grey and dull. Let’s change its colour:

barplot(table(df$vote), col = "red")
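
Standard graphical parameters work here as well; for instance, we can add a title and axis labels (the text is up to us):

barplot(table(df$vote), col = "red",
        main = "Vote in the 1988 plebiscite", # any title we like
        xlab = "answer", ylab = "number of respondents")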