What is LOCF?

Missing data is a common issue in many real-world datasets, leading to challenges in analysis and model training. One technique to handle missing data, particularly in time series or longitudinal data, is the Last Observation Carried Forward (LOCF) method. LOCF fills missing observations with the last available non-missing value. While this method is straightforward and often effective in maintaining the temporal integrity of data, it assumes that the last observation is a reasonable substitute for the missing ones, which might not always be the case. This can potentially introduce bias if the missingness is not random or if the data shows significant trends or shifts over time. Nonetheless, LOCF remains a popular choice due to its simplicity and ease of implementation.
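To make the mechanics concrete, here is a minimal sketch of LOCF on a single vector, using na.locf() from the zoo package. The values in x are made up for illustration and are not taken from any dataset.

```r
library(zoo)

# Toy series (made-up values) with one leading gap and two interior gaps.
x <- c(NA, 2, NA, NA, 5, NA)

# na.rm = FALSE keeps the original length, so a leading NA stays NA
# because there is no earlier observation to carry forward.
filled <- na.locf(x, na.rm = FALSE)
print(filled)
## [1] NA  2  2  2  5  5
```

Each gap takes the most recent observed value, and the leading NA is left alone.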

For this tutorial on LOCF, I use the “Horse Colic” dataset from the UCI Machine Learning Repository. It contains various features related to horse colic cases and includes many missing values, which makes it a convenient dataset for demonstrating how LOCF handles missing data.

Filling in Missing Data with LOCF

library(tidyverse)
library(zoo)
library(data.table)
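# Read the Horse Colic data; "?" in the raw file is treated as a missing value
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/horse-colic/horse-colic.data"
colic_data = fread(url, na.strings = "?")

# Coerce every column to numeric so that all columns are summarized the same way
colic_data[] = lapply(colic_data, function(x) as.numeric(as.character(x)))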

summary(colic_data)
##        V1              V2             V3                V4       
##  Min.   :1.000   Min.   :1.00   Min.   : 518476   Min.   :35.40  
##  1st Qu.:1.000   1st Qu.:1.00   1st Qu.: 528904   1st Qu.:37.80  
##  Median :1.000   Median :1.00   Median : 530306   Median :38.20  
##  Mean   :1.398   Mean   :1.64   Mean   :1085889   Mean   :38.17  
##  3rd Qu.:2.000   3rd Qu.:1.00   3rd Qu.: 534728   3rd Qu.:38.50  
##  Max.   :2.000   Max.   :9.00   Max.   :5305629   Max.   :40.80  
##  NA's   :1                                        NA's   :60     
##        V5               V6              V7              V8       
##  Min.   : 30.00   Min.   : 8.00   Min.   :1.000   Min.   :1.000  
##  1st Qu.: 48.00   1st Qu.:18.50   1st Qu.:1.000   1st Qu.:1.000  
##  Median : 64.00   Median :24.50   Median :3.000   Median :2.000  
##  Mean   : 71.91   Mean   :30.42   Mean   :2.348   Mean   :2.017  
##  3rd Qu.: 88.00   3rd Qu.:36.00   3rd Qu.:3.000   3rd Qu.:3.000  
##  Max.   :184.00   Max.   :96.00   Max.   :4.000   Max.   :4.000  
##  NA's   :24       NA's   :58      NA's   :56      NA's   :69     
##        V9             V10             V11             V12       
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.:2.000   1st Qu.:3.000  
##  Median :3.000   Median :1.000   Median :3.000   Median :3.000  
##  Mean   :2.854   Mean   :1.306   Mean   :2.951   Mean   :2.918  
##  3rd Qu.:4.000   3rd Qu.:2.000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :6.000   Max.   :3.000   Max.   :5.000   Max.   :4.000  
##  NA's   :47      NA's   :32      NA's   :55      NA's   :44     
##       V13             V14             V15             V16       
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:3.000  
##  Median :2.000   Median :2.000   Median :1.000   Median :5.000  
##  Mean   :2.266   Mean   :1.755   Mean   :1.582   Mean   :4.708  
##  3rd Qu.:3.000   3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:6.500  
##  Max.   :4.000   Max.   :3.000   Max.   :3.000   Max.   :7.500  
##  NA's   :56      NA's   :104     NA's   :106     NA's   :247    
##       V17             V18             V19            V20             V21       
##  Min.   :1.000   Min.   :1.000   Min.   :23.0   Min.   : 3.30   Min.   :1.000  
##  1st Qu.:1.000   1st Qu.:2.000   1st Qu.:38.0   1st Qu.: 6.50   1st Qu.:1.000  
##  Median :3.000   Median :4.000   Median :45.0   Median : 7.50   Median :2.000  
##  Mean   :2.758   Mean   :3.692   Mean   :46.3   Mean   :24.46   Mean   :2.037  
##  3rd Qu.:4.000   3rd Qu.:5.000   3rd Qu.:52.0   3rd Qu.:57.00   3rd Qu.:3.000  
##  Max.   :4.000   Max.   :5.000   Max.   :75.0   Max.   :89.00   Max.   :3.000  
##  NA's   :102     NA's   :118     NA's   :29     NA's   :33      NA's   :165    
##       V22             V23             V24             V25       
##  Min.   : 0.10   Min.   :1.000   Min.   :1.000   Min.   :    0  
##  1st Qu.: 2.00   1st Qu.:1.000   1st Qu.:1.000   1st Qu.: 2112  
##  Median : 2.25   Median :1.000   Median :1.000   Median : 2674  
##  Mean   : 3.02   Mean   :1.552   Mean   :1.363   Mean   : 3658  
##  3rd Qu.: 3.90   3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.: 3209  
##  Max.   :10.10   Max.   :3.000   Max.   :2.000   Max.   :41110  
##  NA's   :198     NA's   :1                                      
##       V26               V27                V28      
##  Min.   :   0.00   Min.   :   0.000   Min.   :1.00  
##  1st Qu.:   0.00   1st Qu.:   0.000   1st Qu.:1.00  
##  Median :   0.00   Median :   0.000   Median :2.00  
##  Mean   :  90.23   Mean   :   7.363   Mean   :1.67  
##  3rd Qu.:   0.00   3rd Qu.:   0.000   3rd Qu.:2.00  
##  Max.   :7111.00   Max.   :2209.000   Max.   :2.00  
## 
missing_values = sapply(colic_data, function(x) sum(is.na(x)))
print(missing_values)
##  V1  V2  V3  V4  V5  V6  V7  V8  V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 
##   1   0   0  60  24  58  56  69  47  32  55  44  56 104 106 247 102 118  29  33 
## V21 V22 V23 V24 V25 V26 V27 V28 
## 165 198   1   0   0   0   0   0

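The summary statistics and the per-column NA counts show that most columns contain missing values, ranging from a single NA in V1 and V23 to 247 in V16.

LOCF is designed for longitudinal or time series data. Each row of this dataset records a separate colic case rather than repeated measurements, so for the purposes of this tutorial the row order is treated as the sequence along which values are carried forward. The filling itself is done with the na.locf() function from the zoo package, applied to each column.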

colic_data_locf = data.frame(lapply(colic_data, function(x) na.locf(x, na.rm = FALSE, fromLast = FALSE)))

summary(colic_data_locf)
##        V1            V2             V3                V4       
##  Min.   :1.0   Min.   :1.00   Min.   : 518476   Min.   :35.40  
##  1st Qu.:1.0   1st Qu.:1.00   1st Qu.: 528904   1st Qu.:37.80  
##  Median :1.0   Median :1.00   Median : 530306   Median :38.20  
##  Mean   :1.4   Mean   :1.64   Mean   :1085889   Mean   :38.17  
##  3rd Qu.:2.0   3rd Qu.:1.00   3rd Qu.: 534728   3rd Qu.:38.50  
##  Max.   :2.0   Max.   :9.00   Max.   :5305629   Max.   :40.80  
##                                                                
##        V5               V6              V7             V8             V9       
##  Min.   : 30.00   Min.   : 8.00   Min.   :1.00   Min.   :1.00   Min.   :1.000  
##  1st Qu.: 48.00   1st Qu.:18.00   1st Qu.:1.00   1st Qu.:1.00   1st Qu.:1.000  
##  Median : 64.00   Median :24.00   Median :3.00   Median :1.00   Median :3.000  
##  Mean   : 71.96   Mean   :30.49   Mean   :2.37   Mean   :2.01   Mean   :2.903  
##  3rd Qu.: 88.50   3rd Qu.:36.00   3rd Qu.:3.00   3rd Qu.:3.00   3rd Qu.:4.000  
##  Max.   :184.00   Max.   :96.00   Max.   :4.00   Max.   :4.00   Max.   :6.000  
##                                                                 NA's   :1      
##       V10             V11             V12             V13            V14       
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.00   Min.   :1.000  
##  1st Qu.:1.000   1st Qu.:2.000   1st Qu.:3.000   1st Qu.:1.00   1st Qu.:1.000  
##  Median :1.000   Median :3.000   Median :3.000   Median :2.00   Median :2.000  
##  Mean   :1.303   Mean   :2.947   Mean   :2.937   Mean   :2.24   Mean   :1.731  
##  3rd Qu.:2.000   3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:3.00   3rd Qu.:2.000  
##  Max.   :3.000   Max.   :5.000   Max.   :4.000   Max.   :4.00   Max.   :3.000  
##                                                                 NA's   :3      
##       V15            V16             V17            V18             V19       
##  Min.   :1.00   Min.   :1.000   Min.   :1.00   Min.   :1.000   Min.   :23.00  
##  1st Qu.:1.00   1st Qu.:3.000   1st Qu.:1.00   1st Qu.:3.000   1st Qu.:38.00  
##  Median :1.00   Median :5.000   Median :3.00   Median :4.000   Median :45.00  
##  Mean   :1.64   Mean   :4.804   Mean   :2.68   Mean   :3.717   Mean   :46.39  
##  3rd Qu.:2.00   3rd Qu.:6.500   3rd Qu.:4.00   3rd Qu.:5.000   3rd Qu.:52.00  
##  Max.   :3.00   Max.   :7.500   Max.   :4.00   Max.   :5.000   Max.   :75.00  
##  NA's   :3      NA's   :3                                                     
##       V20             V21            V22              V23       
##  Min.   : 3.30   Min.   :1.00   Min.   : 0.100   Min.   :1.000  
##  1st Qu.: 6.50   1st Qu.:1.00   1st Qu.: 2.000   1st Qu.:1.000  
##  Median : 7.50   Median :2.00   Median : 2.600   Median :1.000  
##  Mean   :24.47   Mean   :2.03   Mean   : 3.282   Mean   :1.553  
##  3rd Qu.:57.00   3rd Qu.:3.00   3rd Qu.: 4.500   3rd Qu.:2.000  
##  Max.   :89.00   Max.   :3.00   Max.   :10.100   Max.   :3.000  
##                  NA's   :1      NA's   :1                       
##       V24             V25             V26               V27          
##  Min.   :1.000   Min.   :    0   Min.   :   0.00   Min.   :   0.000  
##  1st Qu.:1.000   1st Qu.: 2112   1st Qu.:   0.00   1st Qu.:   0.000  
##  Median :1.000   Median : 2674   Median :   0.00   Median :   0.000  
##  Mean   :1.363   Mean   : 3658   Mean   :  90.23   Mean   :   7.363  
##  3rd Qu.:2.000   3rd Qu.: 3209   3rd Qu.:   0.00   3rd Qu.:   0.000  
##  Max.   :2.000   Max.   :41110   Max.   :7111.00   Max.   :2209.000  
##                                                                      
##       V28      
##  Min.   :1.00  
##  1st Qu.:1.00  
##  Median :2.00  
##  Mean   :1.67  
##  3rd Qu.:2.00  
##  Max.   :2.00  
## 
missing_values2 = sapply(colic_data_locf, function(x) sum(is.na(x)))
print(missing_values2)
##  V1  V2  V3  V4  V5  V6  V7  V8  V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 
##   0   0   0   0   0   0   0   0   1   0   0   0   0   3   3   3   0   0   0   0 
## V21 V22 V23 V24 V25 V26 V27 V28 
##   1   1   0   0   0   0   0   0

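Almost all of the missing values are now filled. A few remain in V9, V14, V15, V16, V21, and V22 (1 to 3 each): these are NAs at the very start of a column, where there is no earlier value to carry forward.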
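One way to check the result is to put the NA counts before and after LOCF side by side. The sketch below does this on a small made-up data frame (toy and its values are assumptions for illustration), using the same column-wise na.locf() call as above.

```r
library(zoo)

# Made-up data frame: column b starts with a missing value.
toy <- data.frame(
  a = c(1, NA, 3, NA),
  b = c(NA, 10, NA, 30)
)

# Column-wise LOCF, same call pattern as in the tutorial.
toy_locf <- data.frame(lapply(toy, function(x) na.locf(x, na.rm = FALSE)))

# NA counts before and after, one row per column.
na_compare <- data.frame(
  column = names(toy),
  before = unname(colSums(is.na(toy))),
  after  = unname(colSums(is.na(toy_locf)))
)
print(na_compare)
```

Column a ends with no NAs, while column b keeps one because its first value is missing. The same table built from colic_data and colic_data_locf would show which columns still need attention.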

The Impact of LOCF

Now, let's compare the dataset before and after applying LOCF, looking at the summary statistics and the NA counts.


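After applying LOCF, a handful of NAs remain in six columns (V9, V14, V15, V16, V21, and V22), and several summary statistics shift slightly (for example, the mean of V16 moves from 4.708 to 4.804). The following factors limit what LOCF can fill and how appropriate it is: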

Initial NAs:

If the first values in the sequence of a column are missing (NA), LOCF cannot fill these because there is no preceding value to carry forward.

Incorrect Application:

na.locf() fills every gap that follows an observed value, whether or not the row order means anything. If the data doesn’t have an inherent ordering that supports the assumption behind LOCF, the filled values are present but may not be meaningful.

Partial Effectiveness:

LOCF fills every NA that follows an observed value but leaves leading NAs untouched. Here the NA count in V16 drops from 247 to 3, and V14 and V15 also keep 3 NAs each, which means the first three rows of those columns are missing.

Data Characteristics:

The effectiveness of LOCF heavily depends on the data’s characteristics. In datasets where the order of data (like time series) is not meaningful or incorrectly assumed, LOCF may not be the appropriate method.
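The leading-NA limitation described above can be checked and, if desired, worked around. The sketch below is one possible approach, not part of the original analysis: count_leading_na() is a hypothetical helper, the values in x are made up, and the second pass with fromLast = TRUE is a different technique (next observation carried backward), used here only to fill what LOCF cannot.

```r
library(zoo)

# Hypothetical helper: number of NAs before the first observed value.
count_leading_na <- function(x) {
  first_obs <- which(!is.na(x))[1]
  if (is.na(first_obs)) length(x) else first_obs - 1
}

x <- c(NA, NA, 4, NA, 7)  # made-up values

n_leading <- count_leading_na(x)  # 2 leading NAs

# Forward pass (LOCF): interior gap is filled, leading NAs remain.
forward <- na.locf(x, na.rm = FALSE)

# Backward pass (next observation carried backward) fills the leading NAs.
both <- na.locf(forward, na.rm = FALSE, fromLast = TRUE)
print(both)
```

Whether a backward pass is acceptable depends on the data: it uses a later value to fill an earlier position, which makes the same kind of assumption as LOCF in the opposite direction.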

Conclusion

The use of LOCF in dealing with missing data in the Horse Colic dataset provides a practical illustration of both the utility and limitations of this method. While LOCF is useful for maintaining data consistency in scenarios where subsequent data points closely follow previous ones, its effectiveness is constrained in cases where data lacks an inherent sequential order or when the first entries are missing. Furthermore, this method does not account for the possibility that later observations might fundamentally differ from earlier ones, potentially leading to biased analyses. Therefore, while LOCF is an appealing choice due to its simplicity, researchers should carefully consider the nature of their data and possibly complement LOCF with other imputation methods to address its shortcomings effectively. This ensures a more robust approach to handling missing data, crucial for deriving reliable insights from analyses.