1 Goal

The goal of this tutorial is to group the missing values to learn which missing values are isolated and which belongs to largue groups. This could lead to different treatment of the missing values according to different criteria.

2 Loading libraries

library(dplyr)

3 Data import

# In this tutorial we use the global power plant database
# http://datasets.wri.org/dataset/globalpowerplantdatabase

Dataset <- read.csv("global_power_plant_database.csv", header = TRUE, stringsAsFactors = FALSE)
head(Dataset)

##   country country_long
## 1     AFG  Afghanistan
## 2     AFG  Afghanistan
## 3     AFG  Afghanistan
## 4     AFG  Afghanistan
## 5     AFG  Afghanistan
## 6     AFG  Afghanistan
##                                                        name    gppd_idnr
## 1              Kajaki Hydroelectric Power Plant Afghanistan GEODB0040538
## 2             Mahipar Hydroelectric Power Plant Afghanistan GEODB0040541
## 3          Naghlu Dam Hydroelectric Power Plant Afghanistan GEODB0040534
## 4 Nangarhar (Darunta) Hydroelectric Power Plant Afghanistan GEODB0040536
## 5                   Northwest Kabul Power Plant Afghanistan GEODB0040540
## 6        Pul-e-Khumri Hydroelectric Power Plant Afghanistan GEODB0040537
##   capacity_mw latitude longitude fuel1 fuel2 fuel3 fuel4
## 1       33.00  32.3220   65.1190 Hydro                  
## 2       66.00  34.5560   69.4787 Hydro                  
## 3      100.00  34.6410   69.7170 Hydro                  
## 4       11.55  34.4847   70.3633 Hydro                  
## 5       42.00  34.5638   69.1134   Gas                  
## 6        6.00  35.9416   68.7100 Hydro                  
##   commissioning_year owner source                                url
## 1                 NA        GEODB http://globalenergyobservatory.org
## 2                 NA        GEODB http://globalenergyobservatory.org
## 3                 NA        GEODB http://globalenergyobservatory.org
## 4                 NA        GEODB http://globalenergyobservatory.org
## 5                 NA        GEODB http://globalenergyobservatory.org
## 6                 NA        GEODB http://globalenergyobservatory.org
##   geolocation_source year_of_capacity_data generation_gwh_2013
## 1              GEODB                  2017                  NA
## 2              GEODB                  2017                  NA
## 3              GEODB                  2017                  NA
## 4              GEODB                  2017                  NA
## 5              GEODB                  2017                  NA
## 6              GEODB                  2017                  NA
##   generation_gwh_2014 generation_gwh_2015 generation_gwh_2016
## 1                  NA                  NA                  NA
## 2                  NA                  NA                  NA
## 3                  NA                  NA                  NA
## 4                  NA                  NA                  NA
## 5                  NA                  NA                  NA
## 6                  NA                  NA                  NA
##   estimated_generation_gwh
## 1                       NA
## 2                       NA
## 3                       NA
## 4                       NA
## 5                       NA
## 6                       NA

4 Identify the missing values

4.1 First we get the index of the missing values for a column

# We are going to identify the missing values of the column generation_gwh_2016
# Percentage of missing values in this column is
round(length(which(is.na(Dataset$generation_gwh_2016)))/length(Dataset$generation_gwh_2016) * 100,2)

## [1] 70.95

# The index of the missing values on this column are
my_nas_index <- which(is.na(Dataset$generation_gwh_2016))

4.2 We define a function that group the missing values

# The function runs over the indexes and assign same group to consecutive indexes
group_my_nas <- function(my_na_vector){
  
  group_number <- 1
  group_vector <- c()
  for(i in 1:(length(my_na_vector)-1)){
    group_vector[i] <- group_number
  
    if((my_na_vector[i+1] - my_na_vector[i]) > 1){
      group_number <- group_number + 1
    }
  }
  if(last(group_vector) != group_number){
    group_vector[length(group_vector)+1] <- group_number
  }
  return(group_vector)
}

4.3 Now we apply the function and check the distribution of missing values

# We apply the function to the indexes of the missing values
# We get different groups according to how the nas are grouped
head(group_my_nas(my_nas_index))

## [1] 1 1 1 1 1 1

# We can check the size of those groups using the table function
# We see that most of the missing values are grouped in 5 big groups of thousands
table(group_my_nas(my_nas_index))

## 
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
##  760 7652 2352   13   49   10   36   17   25   11    8    5    1    2    4 
##   16   17   18   19   20   21   22   23   24   25   26   27   28   29   30 
##   13    2   13    6   39   12    7    1   45   23    8    2    9    5   19 
##   31   32   33   34   35   36   37   38   39   40   41   42   43   44   45 
##    1    1    1    1    2    7    9    8    2    2    8    1    6    1    7 
##   46   47   48   49   50   51   52   53   54   55   56   57   58   59   60 
##   23    2   23   18   15   27    1   32    2   49    1    1   44    3   39 
##   61   62   63   64   65   66   67   68   69   70   71   72   73   74   75 
##   28   42    8    2    2   11    4   34   83    2 1920    1  451   15    1 
##   76   77   78   79   80   81   82   83   84   85   86   87   88   89   90 
##   74    2    3    4    8    3    2    6    1    1    4 1571    2    5 4198 
##   91   92   93   94   95   96   97   98   99  100  101  102  103  104  105 
##    1    1    1    1    2    1    1    1    1    1    1    1    1    1    2 
##  106  107  108  109  110  111  112  113  114  115  116  117  118  119  120 
##    1    1    1    2    1    1    1    1    1    1    2    1    1    1    1 
##  121  122  123  124  125  126  127  128  129  130  131  132  133  134  135 
##    1    1    1    1    1    1    1    1    1    1    1    1    1    1    2 
##  136  137  138  139  140  141  142  143  144  145  146  147  148  149  150 
##    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1 
##  151  152  153  154  155  156  157  158  159  160  161  162  163  164  165 
##    1    1    1    1    1    1    1    1    1    1    1    1    1    1    2 
##  166  167  168  169  170  171  172  173  174  175  176  177  178  179  180 
##    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1 
##  181  182  183  184  185  186  187  188  189  190  191  192  193  194  195 
##    2    1    1    1    1    1    1    1    1    1    1    1    1    1    1 
##  196  197  198  199  200  201  202  203  204  205  206  207  208  209  210 
##    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1 
##  211  212  213  214  215  216  217  218  219  220  221  222  223  224  225 
##    1    1    2    1    1    1    1    1    1    1    1    1    1    1    1 
##  226  227  228  229  230  231  232  233  234  235  236  237  238  239  240 
##    2    1  131    2    2    1    1    4    4    1    1    1    2    4    2 
##  241  242  243  244  245  246  247  248  249  250  251  252  253  254  255 
##    3    3    1    1    5    1    1    3    1    9    1    1    1    1    2 
##  256  257  258  259 
##    1    1    2   23

5 Conclusion

In this tutorial we have learnt how to find the distribution of missing values. The best way to deal with this is creating a function that groups the missing values if they are consecutive.

How to group your NAs

Luis Serra @ Ubiqum Code Academy

1 Goal

2 Loading libraries

3 Data import

4 Identify the missing values

4.1 First we get the index of the missing values for a column

4.2 We define a function that group the missing values

4.3 Now we apply the function and check the distribution of missing values

5 Conclusion