Instead of explicitly coding the visualization of missing values and their imputation as is done in the LMR text, here I use the dataset from the book to walk through similar visualization & modelling using the naniar & simputation libraries. There are two benefits to using packages like these: (1) reliability: home-baked code tends to have more errors than well-tested libraries; (2) readability: functionalized code is easier on the eyes, and the developers provide extensive documentation.
Finding missing values with naniar
dependencies:
library( dplyr )
library( ggplot2 )
library( gridExtra )
library( faraway )
library( naniar )
library( simputation )

We will be using the chmiss dataset, the same one that LMR uses to walk through missing data in chapter 13.
data( chmiss, package = 'faraway' )
glimpse( chmiss )

## Rows: 47
## Columns: 6
## $ race <dbl> 10.0, 22.2, 19.6, 17.3, 24.5, 54.0, 4.9, 7.1, 5.3, 21.5, 43.1…
## $ fire <dbl> 6.2, 9.5, 10.5, 7.7, 8.6, 34.1, 11.0, 6.9, 7.3, 15.1, 29.1, 2…
## $ theft <dbl> 29, 44, 36, 37, 53, 68, 75, 18, 31, NA, 34, 14, 11, 11, 22, N…
## $ age <dbl> 60.4, 76.5, NA, NA, 81.4, 52.6, 42.6, 78.5, 90.1, 89.8, 82.7,…
## $ involact <dbl> NA, 0.1, 1.2, 0.5, 0.7, 0.3, 0.0, 0.0, NA, 1.1, 1.9, 0.0, 0.0…
## $ income <dbl> 11.744, 9.323, 9.948, 10.656, 9.730, 8.231, 21.480, 11.104, 1…
tallies of missingness
naniar can be used to describe missingness on whole dataframes or on individual variables:
#perform counts on missing values
n_miss( chmiss )

## [1] 20

n_miss( chmiss$age )

## [1] 5
#perform counts on complete values
n_complete( chmiss )

## [1] 262

n_complete( chmiss$age )

## [1] 42
#find proportions of missing or complete
prop_miss( chmiss )

## [1] 0.07092199

prop_complete( chmiss$age )

## [1] 0.893617
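naniar also ships percentage versions of these helpers, if you prefer percentages to proportions. A quick sketch (output omitted):

#same quantities, expressed as percentages
pct_miss( chmiss )
pct_complete( chmiss$age )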
summaries of missingness
naniar can also return summaries of missingness:
#give a columnwise description of missingness
miss_var_summary( chmiss )

## # A tibble: 6 x 3
## variable n_miss pct_miss
## <chr> <int> <dbl>
## 1 age 5 10.6
## 2 race 4 8.51
## 3 theft 4 8.51
## 4 involact 3 6.38
## 5 fire 2 4.26
## 6 income 2 4.26
#give a rowwise description of missingness
miss_case_summary( chmiss )

## # A tibble: 47 x 3
## case n_miss pct_miss
## <int> <int> <dbl>
## 1 1 1 16.7
## 2 3 1 16.7
## 3 4 1 16.7
## 4 9 1 16.7
## 5 10 1 16.7
## 6 13 1 16.7
## 7 16 1 16.7
## 8 20 1 16.7
## 9 21 1 16.7
## 10 24 1 16.7
## # … with 37 more rows
tabulating missingness patterns
miss_var_table() returns a dataframe that describes, for each count of missing values, how many variables contain that many missings.
miss_var_table( chmiss )

## # A tibble: 4 x 3
## n_miss_in_var n_vars pct_vars
## * <int> <int> <dbl>
## 1 2 2 33.3
## 2 3 1 16.7
## 3 4 2 33.3
## 4 5 1 16.7
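miss_case_table() is the rowwise counterpart, tabulating how many cases contain 0, 1, 2, … missing values (output omitted):

#give the analogous tabulation over rows
miss_case_table( chmiss )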
visualizing missingness
visualizing missingness across variables
p1 <- vis_miss( chmiss )
p2 <- vis_miss( chmiss, cluster = TRUE )
grid.arrange( p1, p2, ncol = 2 )

visualizing missingness for both variables and cases
p1 <- gg_miss_var( chmiss )
p2 <- gg_miss_case( chmiss )
grid.arrange( p1, p2, ncol = 2 )

if you have factor feature variables, you can facet the missingness plots by variables or cases. There are also functions like gg_miss_fct() to visualize missingness across factor levels. chmiss is all numeric, so I don't explore that in depth, but a quick sketch with a synthetic factor is shown below.
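A minimal sketch of gg_miss_fct(), assuming we bin theft into a hypothetical three-level factor (theft_level is invented here for illustration; it is not part of chmiss):

#bin theft into a synthetic factor so there are levels to compare across
chmiss_fct <- chmiss %>%
  mutate( theft_level = cut( theft, breaks = 3, labels = c( 'low', 'mid', 'high' ) ) )
#percent missing in each variable, per level of the factor
gg_miss_fct( x = chmiss_fct, fct = theft_level )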
We can also visualize missingness for a variable across a given span of rows:

gg_miss_span( chmiss, age, span_every = 10 )

naniar also has a heaping ton of functions that can help deal with other types of missingness, such as mislabelled or incomplete entries. For example, you can parse out various mislabelled NA strings ('N/A', 'na', etc.) and replace them with true NAs.
#can scan for these common strings or create your own list
print( common_na_strings )

## [1] "NA" "N A" "N/A" "NA " " NA" "N /A" "N / A" " N / A"
## [9] "N / A " "na" "n a" "n/a" "na " " na" "n /a" "n / a"
## [17] " a / a" "n / a " "NULL" "null" "" "\\?" "\\*" "\\."
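A minimal sketch of both the scan and the replacement, on a made-up dataframe (messy is invented for illustration):

#a toy dataframe containing mislabelled missing values
messy <- data.frame( a = c( '1.2', 'N/A', '3.4' ),
                     b = c( 'na', '5.6', '' ) )

#count how many cells in each column match the common NA strings
miss_scan_count( data = messy, search = common_na_strings )

#replace any matching cell with a true NA
messy %>%
  replace_with_na_all( condition = ~.x %in% common_na_strings )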
geom_miss_point() creates a powerful visualization of missing values by plotting them in the margins of the panel. This is helpful for finding patterns in missingness.
ggplot( chmiss, aes( x = age, y = involact ) ) +
geom_miss_point()

imputation with simputation
imputing to the mean may be tolerable when a dataset is very large and little is missing. For smaller datasets, however, it has the unfortunate effect of pulling model coefficients towards 0 and shrinking the variance of the imputed variable (flattening the data).
chmiss_impmean <- chmiss %>%
  bind_shadow( ) %>%       #append *_NA shadow columns that track original missingness
  impute_mean_all() %>%    #replace every NA with its column mean
  mutate( plottedNA = if_else( age_NA == 'NA' | involact_NA == 'NA', TRUE, FALSE ) ) %>%
  add_label_shadow()       #add an any_missing label column
#glimpse( chmiss_impmean )
ggplot( chmiss_impmean, aes( x = age, y = involact, color = plottedNA ) ) +
  geom_point() +
  ggtitle( 'imputing to the mean' ) +
  geom_hline( yintercept = 0.6477273, color = 'gray', linetype = 'dashed' ) +  #mean of involact
  geom_vline( xintercept = 59.96905, color = 'gray', linetype = 'dashed' ) +   #mean of age
  theme_classic()

We can see from the above visualization that imputation by the mean simply replaces an NA value with the column mean. But we can use the linear relationships among the variables to better inform our imputation.
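We can also see the flattening numerically; a quick sketch (run it and compare the two values):

#variance of age among the originally observed values
var( chmiss$age, na.rm = TRUE )

#variance after mean imputation: strictly smaller, since every
#imputed point sits exactly at the center of the distribution
var( chmiss_impmean$age )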
using impute_lm() from simputation is a powerful way to do exactly that.
chmiss_lmimp <- bind_shadow( chmiss ) %>%
  add_label_shadow() %>%
  mutate( plottedNA = if_else( age_NA == 'NA' | involact_NA == 'NA', TRUE, FALSE ) ) %>%
  impute_lm( involact ~ race + fire + theft + age + income ) %>%  #impute involact from the other variables
  impute_lm( age ~ race + fire + theft + income )                 #then impute age from the remaining predictors
ggplot( chmiss_lmimp, aes( x = age, y = involact, color = plottedNA ) ) +
  geom_point() +
  ggtitle( 'linear regression imputation' ) +
  geom_hline( yintercept = 0.6477273, color = 'gray', linetype = 'dashed' ) +  #mean of involact
  geom_vline( xintercept = 59.96905, color = 'gray', linetype = 'dashed' ) +   #mean of age
  theme_classic()

The imputed values no longer fall on the lines representing the variables' mean values. Rather, they distribute in a way that more naturally reflects the distribution of the data.
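One caveat: as used above, impute_lm() is deterministic, so the imputed points fall exactly on the fitted regression surface. simputation can add noise to the predictions via its add_residual argument; a minimal sketch:

#same imputations, but with a random normal residual added to each prediction
chmiss_lmimp_noisy <- chmiss %>%
  impute_lm( involact ~ race + fire + theft + age + income, add_residual = 'normal' ) %>%
  impute_lm( age ~ race + fire + theft + income, add_residual = 'normal' )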
which imputation gives the best predictions with a linear model of the data?
#glimpse( chmiss_impmean )
chmiss_meanmod <- lm( data = chmiss_impmean, involact ~ race + fire + theft + age + income )
chmiss_meanmod_sum <- summary( chmiss_meanmod )
chmiss_meanmod_sum$adj.r.squared

## [1] 0.6221437
chmiss_lmmod <- lm( data = chmiss_lmimp, involact ~ race + fire + theft + age + income )
chmiss_lmmod_sum <- summary( chmiss_lmmod )
chmiss_lmmod_sum$adj.r.squared

## [1] 0.7067788
We get a higher adjusted \(R^2\) when we use regression to impute missing values, i.e. it gives our model higher explanatory power. Some of this gain is expected, though: the lm-imputed points fall exactly on a fitted regression surface, which flatters the in-sample fit.
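To connect this back to the earlier claim that mean imputation pulls coefficients towards 0, we can also compare the fitted coefficients side by side (a quick sketch; output omitted):

#coefficients under mean imputation vs. regression imputation
round( cbind( mean_imp = coef( chmiss_meanmod ),
              lm_imp = coef( chmiss_lmmod ) ), 4 )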