Missing data is a common issue in many real-world datasets, leading to challenges in analysis and model training. One technique to handle missing data, particularly in time series or longitudinal data, is the Last Observation Carried Forward (LOCF) method. LOCF fills missing observations with the last available non-missing value. While this method is straightforward and often effective in maintaining the temporal integrity of data, it assumes that the last observation is a reasonable substitute for the missing ones, which might not always be the case. This can potentially introduce bias if the missingness is not random or if the data shows significant trends or shifts over time. Nonetheless, LOCF remains a popular choice due to its simplicity and ease of implementation.
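To make the idea concrete before turning to real data, here is a minimal base-R sketch of LOCF on a single vector. The helper name `locf_vec` and the toy values are assumptions made for illustration; they are not part of any package.

```r
# Illustrative base-R helper (hypothetical name, not from zoo):
# carry the last observed value forward over missing entries.
locf_vec <- function(x) {
  # For each position, the index of the most recent non-missing value
  # (0 when nothing has been observed yet).
  last_seen <- cummax(seq_along(x) * !is.na(x))
  # Positions with no earlier observation must stay missing.
  last_seen[last_seen == 0] <- NA
  x[last_seen]
}

# Toy series with a leading gap and two interior gaps.
temps <- c(NA, 38.5, NA, NA, 39.2, NA)
filled <- locf_vec(temps)
print(filled)
```

The interior and trailing gaps are filled with 38.5 and 39.2 respectively, while the first element stays NA because there is nothing earlier to carry forward.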
For this tutorial on LOCF, I will use the “Horse Colic” dataset from the UCI Machine Learning Repository. It contains various features related to horse colic cases and includes many missing values, which makes it a good candidate for demonstrating how to handle missing data using LOCF.
library(tidyverse)
library(zoo)
library(data.table)
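# Horse Colic data from the UCI repository; "?" marks a missing value
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/horse-colic/horse-colic.data"
colic_data = fread(url, na.strings = "?")
# Coerce every column to numeric so summary() reports quantiles and NA counts
colic_data[] = lapply(colic_data, function(x) as.numeric(as.character(x)))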
summary(colic_data)
## V1 V2 V3 V4
## Min. :1.000 Min. :1.00 Min. : 518476 Min. :35.40
## 1st Qu.:1.000 1st Qu.:1.00 1st Qu.: 528904 1st Qu.:37.80
## Median :1.000 Median :1.00 Median : 530306 Median :38.20
## Mean :1.398 Mean :1.64 Mean :1085889 Mean :38.17
## 3rd Qu.:2.000 3rd Qu.:1.00 3rd Qu.: 534728 3rd Qu.:38.50
## Max. :2.000 Max. :9.00 Max. :5305629 Max. :40.80
## NA's :1 NA's :60
## V5 V6 V7 V8
## Min. : 30.00 Min. : 8.00 Min. :1.000 Min. :1.000
## 1st Qu.: 48.00 1st Qu.:18.50 1st Qu.:1.000 1st Qu.:1.000
## Median : 64.00 Median :24.50 Median :3.000 Median :2.000
## Mean : 71.91 Mean :30.42 Mean :2.348 Mean :2.017
## 3rd Qu.: 88.00 3rd Qu.:36.00 3rd Qu.:3.000 3rd Qu.:3.000
## Max. :184.00 Max. :96.00 Max. :4.000 Max. :4.000
## NA's :24 NA's :58 NA's :56 NA's :69
## V9 V10 V11 V12
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:2.000 1st Qu.:3.000
## Median :3.000 Median :1.000 Median :3.000 Median :3.000
## Mean :2.854 Mean :1.306 Mean :2.951 Mean :2.918
## 3rd Qu.:4.000 3rd Qu.:2.000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :6.000 Max. :3.000 Max. :5.000 Max. :4.000
## NA's :47 NA's :32 NA's :55 NA's :44
## V13 V14 V15 V16
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:3.000
## Median :2.000 Median :2.000 Median :1.000 Median :5.000
## Mean :2.266 Mean :1.755 Mean :1.582 Mean :4.708
## 3rd Qu.:3.000 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:6.500
## Max. :4.000 Max. :3.000 Max. :3.000 Max. :7.500
## NA's :56 NA's :104 NA's :106 NA's :247
## V17 V18 V19 V20 V21
## Min. :1.000 Min. :1.000 Min. :23.0 Min. : 3.30 Min. :1.000
## 1st Qu.:1.000 1st Qu.:2.000 1st Qu.:38.0 1st Qu.: 6.50 1st Qu.:1.000
## Median :3.000 Median :4.000 Median :45.0 Median : 7.50 Median :2.000
## Mean :2.758 Mean :3.692 Mean :46.3 Mean :24.46 Mean :2.037
## 3rd Qu.:4.000 3rd Qu.:5.000 3rd Qu.:52.0 3rd Qu.:57.00 3rd Qu.:3.000
## Max. :4.000 Max. :5.000 Max. :75.0 Max. :89.00 Max. :3.000
## NA's :102 NA's :118 NA's :29 NA's :33 NA's :165
## V22 V23 V24 V25
## Min. : 0.10 Min. :1.000 Min. :1.000 Min. : 0
## 1st Qu.: 2.00 1st Qu.:1.000 1st Qu.:1.000 1st Qu.: 2112
## Median : 2.25 Median :1.000 Median :1.000 Median : 2674
## Mean : 3.02 Mean :1.552 Mean :1.363 Mean : 3658
## 3rd Qu.: 3.90 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.: 3209
## Max. :10.10 Max. :3.000 Max. :2.000 Max. :41110
## NA's :198 NA's :1
## V26 V27 V28
## Min. : 0.00 Min. : 0.000 Min. :1.00
## 1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.:1.00
## Median : 0.00 Median : 0.000 Median :2.00
## Mean : 90.23 Mean : 7.363 Mean :1.67
## 3rd Qu.: 0.00 3rd Qu.: 0.000 3rd Qu.:2.00
## Max. :7111.00 Max. :2209.000 Max. :2.00
##
missing_values = sapply(colic_data, function(x) sum(is.na(x)))
print(missing_values)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20
## 1 0 0 60 24 58 56 69 47 32 55 44 56 104 106 247 102 118 29 33
## V21 V22 V23 V24 V25 V26 V27 V28
## 165 198 1 0 0 0 0 0
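The output above shows the summary statistics for this dataset, followed by the number of missing values in each column. As you can see, many columns contain missing data. Note that each row describes a separate colic case rather than repeated measurements of the same animal, so the data is not truly longitudinal; for demonstration purposes, I treat the row order as the sequence along which values are carried forward. To fill in the missing values, I apply the na.locf function from the zoo package to each column.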
colic_data_locf = data.frame(lapply(colic_data, function(x) na.locf(x, na.rm = FALSE, fromLast = FALSE)))
summary(colic_data_locf)
## V1 V2 V3 V4
## Min. :1.0 Min. :1.00 Min. : 518476 Min. :35.40
## 1st Qu.:1.0 1st Qu.:1.00 1st Qu.: 528904 1st Qu.:37.80
## Median :1.0 Median :1.00 Median : 530306 Median :38.20
## Mean :1.4 Mean :1.64 Mean :1085889 Mean :38.17
## 3rd Qu.:2.0 3rd Qu.:1.00 3rd Qu.: 534728 3rd Qu.:38.50
## Max. :2.0 Max. :9.00 Max. :5305629 Max. :40.80
##
## V5 V6 V7 V8 V9
## Min. : 30.00 Min. : 8.00 Min. :1.00 Min. :1.00 Min. :1.000
## 1st Qu.: 48.00 1st Qu.:18.00 1st Qu.:1.00 1st Qu.:1.00 1st Qu.:1.000
## Median : 64.00 Median :24.00 Median :3.00 Median :1.00 Median :3.000
## Mean : 71.96 Mean :30.49 Mean :2.37 Mean :2.01 Mean :2.903
## 3rd Qu.: 88.50 3rd Qu.:36.00 3rd Qu.:3.00 3rd Qu.:3.00 3rd Qu.:4.000
## Max. :184.00 Max. :96.00 Max. :4.00 Max. :4.00 Max. :6.000
## NA's :1
## V10 V11 V12 V13 V14
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.00 Min. :1.000
## 1st Qu.:1.000 1st Qu.:2.000 1st Qu.:3.000 1st Qu.:1.00 1st Qu.:1.000
## Median :1.000 Median :3.000 Median :3.000 Median :2.00 Median :2.000
## Mean :1.303 Mean :2.947 Mean :2.937 Mean :2.24 Mean :1.731
## 3rd Qu.:2.000 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:3.00 3rd Qu.:2.000
## Max. :3.000 Max. :5.000 Max. :4.000 Max. :4.00 Max. :3.000
## NA's :3
## V15 V16 V17 V18 V19
## Min. :1.00 Min. :1.000 Min. :1.00 Min. :1.000 Min. :23.00
## 1st Qu.:1.00 1st Qu.:3.000 1st Qu.:1.00 1st Qu.:3.000 1st Qu.:38.00
## Median :1.00 Median :5.000 Median :3.00 Median :4.000 Median :45.00
## Mean :1.64 Mean :4.804 Mean :2.68 Mean :3.717 Mean :46.39
## 3rd Qu.:2.00 3rd Qu.:6.500 3rd Qu.:4.00 3rd Qu.:5.000 3rd Qu.:52.00
## Max. :3.00 Max. :7.500 Max. :4.00 Max. :5.000 Max. :75.00
## NA's :3 NA's :3
## V20 V21 V22 V23
## Min. : 3.30 Min. :1.00 Min. : 0.100 Min. :1.000
## 1st Qu.: 6.50 1st Qu.:1.00 1st Qu.: 2.000 1st Qu.:1.000
## Median : 7.50 Median :2.00 Median : 2.600 Median :1.000
## Mean :24.47 Mean :2.03 Mean : 3.282 Mean :1.553
## 3rd Qu.:57.00 3rd Qu.:3.00 3rd Qu.: 4.500 3rd Qu.:2.000
## Max. :89.00 Max. :3.00 Max. :10.100 Max. :3.000
## NA's :1 NA's :1
## V24 V25 V26 V27
## Min. :1.000 Min. : 0 Min. : 0.00 Min. : 0.000
## 1st Qu.:1.000 1st Qu.: 2112 1st Qu.: 0.00 1st Qu.: 0.000
## Median :1.000 Median : 2674 Median : 0.00 Median : 0.000
## Mean :1.363 Mean : 3658 Mean : 90.23 Mean : 7.363
## 3rd Qu.:2.000 3rd Qu.: 3209 3rd Qu.: 0.00 3rd Qu.: 0.000
## Max. :2.000 Max. :41110 Max. :7111.00 Max. :2209.000
##
## V28
## Min. :1.00
## 1st Qu.:1.00
## Median :2.00
## Mean :1.67
## 3rd Qu.:2.00
## Max. :2.00
##
missing_values2 = sapply(colic_data_locf, function(x) sum(is.na(x)))
print(missing_values2)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20
## 0 0 0 0 0 0 0 0 1 0 0 0 0 3 3 3 0 0 0 0
## V21 V22 V23 V24 V25 V26 V27 V28
## 1 1 0 0 0 0 0 0
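Almost all of the missing values have now been filled. A few NAs remain in V9, V14, V15, V16, V21, and V22: these sit in the first rows of those columns, where there is no earlier value to carry forward.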
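Because LOCF with `na.rm = FALSE` fills every NA that has an observed value somewhere before it, any NA that survives must lie before the first observed value in its column. A small sketch of how that could be checked is shown below; the helper `leading_na` and the toy data frame are hypothetical and only serve to illustrate the idea.

```r
# Hypothetical helper: number of NAs before the first observed value.
leading_na <- function(x) {
  first_obs <- which(!is.na(x))[1]   # NA if the column is entirely missing
  if (is.na(first_obs)) length(x) else first_obs - 1L
}

# Toy data: column a starts with two NAs, column b starts with a value.
toy <- data.frame(a = c(NA, NA, 3, NA, 5), b = c(1, NA, NA, 4, NA))
lead_counts <- sapply(toy, leading_na)
print(lead_counts)
```

Applied to the tutorial data, `sapply(colic_data, leading_na)` should match the counts printed for `missing_values2`, since those leftover NAs are exactly the leading ones.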
Now, let's compare the dataset before and after applying LOCF:
summary(colic_data)
## V1 V2 V3 V4
## Min. :1.000 Min. :1.00 Min. : 518476 Min. :35.40
## 1st Qu.:1.000 1st Qu.:1.00 1st Qu.: 528904 1st Qu.:37.80
## Median :1.000 Median :1.00 Median : 530306 Median :38.20
## Mean :1.398 Mean :1.64 Mean :1085889 Mean :38.17
## 3rd Qu.:2.000 3rd Qu.:1.00 3rd Qu.: 534728 3rd Qu.:38.50
## Max. :2.000 Max. :9.00 Max. :5305629 Max. :40.80
## NA's :1 NA's :60
## V5 V6 V7 V8
## Min. : 30.00 Min. : 8.00 Min. :1.000 Min. :1.000
## 1st Qu.: 48.00 1st Qu.:18.50 1st Qu.:1.000 1st Qu.:1.000
## Median : 64.00 Median :24.50 Median :3.000 Median :2.000
## Mean : 71.91 Mean :30.42 Mean :2.348 Mean :2.017
## 3rd Qu.: 88.00 3rd Qu.:36.00 3rd Qu.:3.000 3rd Qu.:3.000
## Max. :184.00 Max. :96.00 Max. :4.000 Max. :4.000
## NA's :24 NA's :58 NA's :56 NA's :69
## V9 V10 V11 V12
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:2.000 1st Qu.:3.000
## Median :3.000 Median :1.000 Median :3.000 Median :3.000
## Mean :2.854 Mean :1.306 Mean :2.951 Mean :2.918
## 3rd Qu.:4.000 3rd Qu.:2.000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :6.000 Max. :3.000 Max. :5.000 Max. :4.000
## NA's :47 NA's :32 NA's :55 NA's :44
## V13 V14 V15 V16
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:3.000
## Median :2.000 Median :2.000 Median :1.000 Median :5.000
## Mean :2.266 Mean :1.755 Mean :1.582 Mean :4.708
## 3rd Qu.:3.000 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:6.500
## Max. :4.000 Max. :3.000 Max. :3.000 Max. :7.500
## NA's :56 NA's :104 NA's :106 NA's :247
## V17 V18 V19 V20 V21
## Min. :1.000 Min. :1.000 Min. :23.0 Min. : 3.30 Min. :1.000
## 1st Qu.:1.000 1st Qu.:2.000 1st Qu.:38.0 1st Qu.: 6.50 1st Qu.:1.000
## Median :3.000 Median :4.000 Median :45.0 Median : 7.50 Median :2.000
## Mean :2.758 Mean :3.692 Mean :46.3 Mean :24.46 Mean :2.037
## 3rd Qu.:4.000 3rd Qu.:5.000 3rd Qu.:52.0 3rd Qu.:57.00 3rd Qu.:3.000
## Max. :4.000 Max. :5.000 Max. :75.0 Max. :89.00 Max. :3.000
## NA's :102 NA's :118 NA's :29 NA's :33 NA's :165
## V22 V23 V24 V25
## Min. : 0.10 Min. :1.000 Min. :1.000 Min. : 0
## 1st Qu.: 2.00 1st Qu.:1.000 1st Qu.:1.000 1st Qu.: 2112
## Median : 2.25 Median :1.000 Median :1.000 Median : 2674
## Mean : 3.02 Mean :1.552 Mean :1.363 Mean : 3658
## 3rd Qu.: 3.90 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.: 3209
## Max. :10.10 Max. :3.000 Max. :2.000 Max. :41110
## NA's :198 NA's :1
## V26 V27 V28
## Min. : 0.00 Min. : 0.000 Min. :1.00
## 1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.:1.00
## Median : 0.00 Median : 0.000 Median :2.00
## Mean : 90.23 Mean : 7.363 Mean :1.67
## 3rd Qu.: 0.00 3rd Qu.: 0.000 3rd Qu.:2.00
## Max. :7111.00 Max. :2209.000 Max. :2.00
##
summary(colic_data_locf)
## V1 V2 V3 V4
## Min. :1.0 Min. :1.00 Min. : 518476 Min. :35.40
## 1st Qu.:1.0 1st Qu.:1.00 1st Qu.: 528904 1st Qu.:37.80
## Median :1.0 Median :1.00 Median : 530306 Median :38.20
## Mean :1.4 Mean :1.64 Mean :1085889 Mean :38.17
## 3rd Qu.:2.0 3rd Qu.:1.00 3rd Qu.: 534728 3rd Qu.:38.50
## Max. :2.0 Max. :9.00 Max. :5305629 Max. :40.80
##
## V5 V6 V7 V8 V9
## Min. : 30.00 Min. : 8.00 Min. :1.00 Min. :1.00 Min. :1.000
## 1st Qu.: 48.00 1st Qu.:18.00 1st Qu.:1.00 1st Qu.:1.00 1st Qu.:1.000
## Median : 64.00 Median :24.00 Median :3.00 Median :1.00 Median :3.000
## Mean : 71.96 Mean :30.49 Mean :2.37 Mean :2.01 Mean :2.903
## 3rd Qu.: 88.50 3rd Qu.:36.00 3rd Qu.:3.00 3rd Qu.:3.00 3rd Qu.:4.000
## Max. :184.00 Max. :96.00 Max. :4.00 Max. :4.00 Max. :6.000
## NA's :1
## V10 V11 V12 V13 V14
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.00 Min. :1.000
## 1st Qu.:1.000 1st Qu.:2.000 1st Qu.:3.000 1st Qu.:1.00 1st Qu.:1.000
## Median :1.000 Median :3.000 Median :3.000 Median :2.00 Median :2.000
## Mean :1.303 Mean :2.947 Mean :2.937 Mean :2.24 Mean :1.731
## 3rd Qu.:2.000 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:3.00 3rd Qu.:2.000
## Max. :3.000 Max. :5.000 Max. :4.000 Max. :4.00 Max. :3.000
## NA's :3
## V15 V16 V17 V18 V19
## Min. :1.00 Min. :1.000 Min. :1.00 Min. :1.000 Min. :23.00
## 1st Qu.:1.00 1st Qu.:3.000 1st Qu.:1.00 1st Qu.:3.000 1st Qu.:38.00
## Median :1.00 Median :5.000 Median :3.00 Median :4.000 Median :45.00
## Mean :1.64 Mean :4.804 Mean :2.68 Mean :3.717 Mean :46.39
## 3rd Qu.:2.00 3rd Qu.:6.500 3rd Qu.:4.00 3rd Qu.:5.000 3rd Qu.:52.00
## Max. :3.00 Max. :7.500 Max. :4.00 Max. :5.000 Max. :75.00
## NA's :3 NA's :3
## V20 V21 V22 V23
## Min. : 3.30 Min. :1.00 Min. : 0.100 Min. :1.000
## 1st Qu.: 6.50 1st Qu.:1.00 1st Qu.: 2.000 1st Qu.:1.000
## Median : 7.50 Median :2.00 Median : 2.600 Median :1.000
## Mean :24.47 Mean :2.03 Mean : 3.282 Mean :1.553
## 3rd Qu.:57.00 3rd Qu.:3.00 3rd Qu.: 4.500 3rd Qu.:2.000
## Max. :89.00 Max. :3.00 Max. :10.100 Max. :3.000
## NA's :1 NA's :1
## V24 V25 V26 V27
## Min. :1.000 Min. : 0 Min. : 0.00 Min. : 0.000
## 1st Qu.:1.000 1st Qu.: 2112 1st Qu.: 0.00 1st Qu.: 0.000
## Median :1.000 Median : 2674 Median : 0.00 Median : 0.000
## Mean :1.363 Mean : 3658 Mean : 90.23 Mean : 7.363
## 3rd Qu.:2.000 3rd Qu.: 3209 3rd Qu.: 0.00 3rd Qu.: 0.000
## Max. :2.000 Max. :41110 Max. :7111.00 Max. :2209.000
##
## V28
## Min. :1.00
## 1st Qu.:1.00
## Median :2.00
## Mean :1.67
## 3rd Qu.:2.00
## Max. :2.00
##
In every column the NA count drops after applying LOCF, but in six columns (V9, V14, V15, V16, V21, V22) it does not reach zero. The following points explain where LOCF falls short:
Initial NAs:
If the first values in the sequence of a column are missing (NA), LOCF cannot fill these because there is no preceding value to carry forward.
Incorrect Application:
LOCF may have been applied to data that does not support it. The method assumes an inherent ordering of the rows; if no such ordering exists, a carried-forward value has no particular relation to the missing one.
Partial Effectiveness:
In the affected columns, the NA count falls but does not reach zero. Once a first value is observed, every later NA is filled, while the leading NAs stay untouched.
Data Characteristics:
The effectiveness of LOCF heavily depends on the data’s characteristics. In datasets where the order of data (like time series) is not meaningful or incorrectly assumed, LOCF may not be the appropriate method.
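The dependence on row order described above can be seen in a small sketch. The data frame `obs`, its `time` column, and the helper `fill_forward` are made up for illustration; the point is only that the same values give different LOCF results depending on how the rows are sorted.

```r
# Illustrative base-R forward fill (hypothetical helper).
fill_forward <- function(x) {
  idx <- cummax(seq_along(x) * !is.na(x))  # index of last observed value so far
  idx[idx == 0] <- NA                      # nothing observed yet: stay NA
  x[idx]
}

# Toy observations stored out of time order.
obs <- data.frame(time = c(3, 1, 2, 4), value = c(NA, 10, NA, 40))

# LOCF on the rows as stored: the first row has nothing to inherit.
as_stored <- fill_forward(obs$value)

# LOCF after sorting by time: the gaps at times 2 and 3 inherit the value from time 1.
obs_sorted <- obs[order(obs$time), ]
by_time <- fill_forward(obs_sorted$value)

print(as_stored)
print(by_time)
```

In the stored order one NA survives and the value from time 1 is carried into a row that was actually measured later than time 2; after sorting, every gap is filled from its true predecessor. When no meaningful sort key exists, neither result is more defensible than the other.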
The use of LOCF in dealing with missing data in the Horse Colic dataset provides a practical illustration of both the utility and limitations of this method. While LOCF is useful for maintaining data consistency in scenarios where subsequent data points closely follow previous ones, its effectiveness is constrained in cases where data lacks an inherent sequential order or when the first entries are missing. Furthermore, this method does not account for the possibility that later observations might fundamentally differ from earlier ones, potentially leading to biased analyses. Therefore, while LOCF is an appealing choice due to its simplicity, researchers should carefully consider the nature of their data and possibly complement LOCF with other imputation methods to address its shortcomings effectively. This ensures a more robust approach to handling missing data, crucial for deriving reliable insights from analyses.
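One simple way to complement LOCF, offered here as a sketch rather than a recommendation, is to follow the forward fill with a backward fill (next observation carried backward) so that leading NAs are also covered. The helper names below are hypothetical and written in base R; with zoo, the backward step corresponds to calling `na.locf` with `fromLast = TRUE`.

```r
# Forward fill (LOCF), illustrative base-R helper.
fill_forward <- function(x) {
  idx <- cummax(seq_along(x) * !is.na(x))  # index of last observed value so far
  idx[idx == 0] <- NA                      # leading gaps have no source value
  x[idx]
}

# Backward fill: reverse the vector, fill forward, reverse again.
fill_backward <- function(x) rev(fill_forward(rev(x)))

# LOCF first, then backward fill for whatever is still missing at the start.
fill_both <- function(x) fill_backward(fill_forward(x))

pulse <- c(NA, NA, 60, NA, 72, NA)
pulse_filled <- fill_both(pulse)
print(pulse_filled)
```

The two leading NAs take the first observed value (60), and the remaining gaps are filled by ordinary LOCF. This removes all NAs whenever a column has at least one observed value, but it inherits the same ordering assumption as LOCF and can bias results in the same way.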