0.1 Introduction

In exploratory data analysis, we often encounter data series with missing values. The challenge is deciding which time series to keep and how to score them. The simplest approach is to compute the total percentage of missing data. Its main flaw is that it cannot differentiate the quality of time series that have the same number of missing data points positioned differently. Let’s take a look at the following two data vectors: and .

The recovery rates for these two vectors are different. The first time series is usually considered easier to impute, that is, to estimate its missing values. Because of this difference, I have come up with another method to score the quality of a series: the porosity score. The concept is derived from environmental physics (https://www.classzone.com/books/earth_science/terc/content/investigations/es1401/es1401page04.cfm). It computes an adjusted porosity score for the time series vector by considering how the missing or bad data is positioned and how large each block of missing data is, whether the missing points are all discrete or positioned continuously every k indices, and it adjusts their impact on the overall dataset accordingly.

The porosity score penalizes each block of missing data by its size: the bigger a continuous hole is, the worse the data is.
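To make the limitation of the plain missing percentage concrete, here is a minimal sketch with two hypothetical vectors, `scattered` and `clustered`. They are invented for illustration and are not taken from any dataset. The helper names `missing_share` and `longest_gap` are also assumptions. Both vectors have the same share of missing values, but one has isolated gaps and the other has a single long hole.

```r
# Two hypothetical series of length 10 with four missing values each.
scattered <- c(1, NA, 3, NA, 5, NA, 7, NA, 9, 10)   # four isolated gaps
clustered <- c(1, 2, 3, NA, NA, NA, NA, 8, 9, 10)   # one hole of size 4

# Share of missing values: identical for both series.
missing_share <- function(x) mean(is.na(x))
missing_share(scattered)
missing_share(clustered)

# Longest run of consecutive missing values: this is what differs.
longest_gap <- function(x) {
  runs <- rle(is.na(x))                 # run-length encoding of the NA mask
  gaps <- runs$lengths[runs$values]     # keep only the runs of missing values
  if (length(gaps) == 0) 0L else max(gaps)
}
longest_gap(scattered)
longest_gap(clustered)
```

A score based only on the share of missing values treats these two series as equally good, which is the gap a block-aware score is meant to close.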

0.2 Define function

The function is defined below as PorosityScore. By default, the penalty option is turned on, and the score with penalty is the recommended metric. This means that each block of missing data is penalized differently. For example, the penalty weight for a missing block of size 4 is 4, while it is 1 for a block of size 1. This makes sense because the bigger the hole, the worse the data should be.
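As a worked illustration of the weighting, the sketch below reads the penalty as each block size multiplied by its own weight, which is how the function below behaves with the default tolerance of 0. The helper name `block_penalty` is hypothetical and only serves this example.

```r
# Hypothetical helper: a block of size b has weight b,
# so it contributes b * b to the penalty score.
block_penalty <- function(block_sizes) sum(block_sizes * block_sizes)

block_penalty(c(4))           # one hole of size 4 -> 16
block_penalty(c(1, 1, 1, 1))  # four isolated gaps, same amount missing -> 4
```

The same four missing points cost four times as much when they form one continuous hole.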

# This function is intended to compute the porosity of a time series vector.
# The computed porosity (completeness) can then be used to screen feature variables in a dataframe.
# This function finds the blocks of missing data and tracks the size of each block.
#
# Input: tsIn: time series vector
#        tolerance: default 0; missing blocks no larger than this size are left out of
#                   the adjusted score (e.g. 1 tolerates single discrete missing values)
#        missingValue: NA or 0 or user specified (e.g. -99999)
#        batch: when used in an apply function, set it to TRUE and only a single score is returned.
# Output: A list containing:
#         1. total.porosity.score (0-1)
#         2. adjusted.porosity.score (0-1)
#         3. PenaltyPorosity: score with penalty (recommended) (0 to length(tsIn)^2)
#         4. missing.blocksize
#         adjusted and penalty control which score is returned when batch = TRUE.
#
# e.g.
# > a <-  c(1,2,NA,3,NA,NA,4,5,6,7,8,NA,9,10,NA,NA)
# > result <- PorosityScore(a,tolerance = 2)
# > result$adjusted.porosity.score
# > result$total.porosity.score
#
# for dataframe usage. e.g. apply(dfIn,2,PorosityScore,tolerance=0,batch=TRUE, adjusted = FALSE, penalty = TRUE)

PorosityScore<- function(tsIn,tolerance =0,missingValue = NA,batch = FALSE, adjusted = FALSE, penalty = TRUE){
  #tsIn <- c(1,2,NA,3,NA,NA,4,5,6,7,8,NA,9,10,NA,NA)
  
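  # Sentinel that marks missing entries. NA is always treated as missing, so the
  # comparisons below never see NA, even when a custom missingValue is supplied.
  mVal <- if(is.na(missingValue)) -99999999.9999 else missingValue
  tsIn[is.na(tsIn)] <- mVal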
  idx <- which(tsIn == mVal )
  # Compute the total sparsity of the data
  totalPorosity <- length(idx) / length(tsIn)
  
  result <- list() 
  
  count <- 0
  i = 1
  while(i <= length(tsIn)) {
    if(tsIn[i] == mVal){
      count <-  count + 1
    }else{
      if(count !=0){
        result <- append(result,count)
        }
      count <- 0
    } 
    
    i <-  i +1
  } 
  
  if(count !=0) {
    result <- append(result,count)
  } 
  
  if(length(result) ==0){
      adjPorosity <- 0
      PenaltyPorosity <- 0
      blockSizeVec <- NA
      sprintf("The average porosity is: %5.1f.", mean(blockSizeVec))
      sprintf("The total and adjusted porosity score is:(%5.1f , %5.1f)", totalPorosity,adjPorosity)
      resultlist <-  list("total.porosity.score" =  totalPorosity ,"adjusted.porosity.score" = adjPorosity, 
                 "PenaltyPorosity"=PenaltyPorosity, "missing.blocksize" = blockSizeVec) 
  }else{
      # convert it to a vector
      blockSizeVec <- sapply(result,sum) # number of missing values in each missing block
      # If the missing data forms continuous runs (block size > 1), that is bad (e.g. [2,3,3,4,4,1,1,5,6,6])
      AvgPorosity <- mean(blockSizeVec)                            # The smaller, the better
      # adjusted porosity score
      resVecAdj <- blockSizeVec[blockSizeVec>tolerance] 
      adjPorosity <- sum(resVecAdj)/length(tsIn) 
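      # Weight each counted block by its own size. Using resVecAdj on both sides
      # avoids vector recycling when tolerance > 0 drops some blocks.
      PenaltyPorosity <- sum(resVecAdj*resVecAdj)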
      sprintf("The average porosity is: %5.1f.", mean(blockSizeVec))
      sprintf("The total and adjusted porosity score is:(%5.1f , %5.1f)", totalPorosity,adjPorosity)
      resultlist <-  list("total.porosity.score" =  totalPorosity ,"adjusted.porosity.score" = adjPorosity, 
                 "PenaltyPorosity"=PenaltyPorosity, "missing.blocksize" = blockSizeVec) 
  }
  
 if(batch) {
  # for using with apply function 
  # only return adjusted porosity since total porosity is too easy to compute
   if(adjusted){
     return(adjPorosity)
   }
   if(penalty){
     return(PenaltyPorosity)    
   }
 }else{
    return(resultlist) 
 }  
}
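The block-counting loop above can also be expressed with base R's `rle()`, which returns the lengths of consecutive runs. The sketch below is an alternative formulation, not the function above: it assumes missing values are coded as `NA`, it computes the penalty as the sum of squared sizes of the blocks larger than `tolerance`, and the name `porosity_rle` and its list element names are hypothetical.

```r
porosity_rle <- function(x, tolerance = 0) {
  runs   <- rle(is.na(x))                 # runs of missing / non-missing values
  blocks <- runs$lengths[runs$values]     # sizes of the missing blocks, in order
  kept   <- blocks[blocks > tolerance]    # blocks within tolerance are ignored
  list(
    total    = sum(blocks) / length(x),   # plain share of missing values
    adjusted = sum(kept) / length(x),     # share after applying the tolerance
    penalty  = sum(kept^2),               # each kept block weighted by its size
    blocks   = blocks
  )
}

a <- c(1, 2, NA, 3, NA, NA, 4, 5, 6, 7, 8, NA, 9, 10, NA, NA)
str(porosity_rle(a))
```

For a vector without any `NA`, `blocks` is empty and all three scores come out as 0, so no special case is needed.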

0.3 Example

Let’s look at an example.

# use it with single vector
 print("dataset one")
## [1] "dataset one"
 a <-  c(1,2,NA,3,NA,NA,4,5,6,7,8,NA,9,10,NA,NA)
 result <- PorosityScore(a)
 print(result)
## $total.porosity.score
## [1] 0.375
## 
## $adjusted.porosity.score
## [1] 0.375
## 
## $PenaltyPorosity
## [1] 10
## 
## $missing.blocksize
## [1] 1 2 1 2
 #print("data set one")
 #print(result$adjusted.porosity.score)
 #print(result$total.porosity.score)
 #print(result$PenaltyPorosity)
 print("dataset two")
## [1] "dataset two"
 a2 <-  c(1,NA,2,3,4,NA,4,NA,6,NA,8,NA,9,10,NA)
 result2 <- PorosityScore(a2)
 print(result2)
## $total.porosity.score
## [1] 0.4
## 
## $adjusted.porosity.score
## [1] 0.4
## 
## $PenaltyPorosity
## [1] 6
## 
## $missing.blocksize
## [1] 1 1 1 1 1 1
 #print(result2$adjusted.porosity.score)
 #print(result2$total.porosity.score)
 #print(result2$PenaltyPorosity)
# how to use it with a dataframe
#dfIn <- as.data.frame(matrix(5,5,2))
#results <- apply(dfIn,2,PorosityScore,tolerance=1,batch=TRUE, adjusted = FALSE, penalty = TRUE)

0.4 Conclusion

As we can see, the function successfully distinguishes time series with different missing patterns. In the above example, the first vector has a greater porosity score with penalty (10 versus 6), even though its total share of missing data is lower (0.375 versus 0.4). We can use this score to filter numeric features with missing data by ranking them.
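The ranking step could look like the following sketch. It is self-contained, so it uses a compact hypothetical helper, `penalty_score`, in place of PorosityScore, and assumes `NA` marks missing data. The toy data frame and the cutoff of 2 are invented for illustration.

```r
# Hypothetical penalty score for one column, assuming NA marks missing data.
penalty_score <- function(x) {
  runs   <- rle(is.na(x))
  blocks <- runs$lengths[runs$values]   # sizes of the missing blocks
  sum(blocks^2)                         # bigger holes weigh more
}

# Toy data frame invented for this example.
df <- data.frame(
  clean     = c(1, 2, 3, 4, 5, 6),
  scattered = c(1, NA, 3, NA, 5, 6),
  clustered = c(1, NA, NA, NA, 5, 6)
)

# sapply() walks the columns directly, so the data frame is not coerced to a matrix.
scores <- sapply(df, penalty_score)
sort(scores)                            # lowest (best) score first

# Keep only the columns at or below an arbitrary cutoff.
keep <- names(scores)[scores <= 2]
df[, keep, drop = FALSE]
```

In practice the cutoff would depend on the series length and on how well the chosen imputation method copes with long gaps.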