Week 2 Coding Practice R, Part 1: Missing Data


Step 1:

# Attach hflights package
library(hflights)

# Show summary of hflights data
summary(hflights)
##       Year          Month          DayofMonth      DayOfWeek        DepTime    
##  Min.   :2011   Min.   : 1.000   Min.   : 1.00   Min.   :1.000   Min.   :   1  
##  1st Qu.:2011   1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.:2.000   1st Qu.:1021  
##  Median :2011   Median : 7.000   Median :16.00   Median :4.000   Median :1416  
##  Mean   :2011   Mean   : 6.514   Mean   :15.74   Mean   :3.948   Mean   :1396  
##  3rd Qu.:2011   3rd Qu.: 9.000   3rd Qu.:23.00   3rd Qu.:6.000   3rd Qu.:1801  
##  Max.   :2011   Max.   :12.000   Max.   :31.00   Max.   :7.000   Max.   :2400  
##                                                                  NA's   :2905  
##     ArrTime     UniqueCarrier        FlightNum      TailNum         
##  Min.   :   1   Length:227496      Min.   :   1   Length:227496     
##  1st Qu.:1215   Class :character   1st Qu.: 855   Class :character  
##  Median :1617   Mode  :character   Median :1696   Mode  :character  
##  Mean   :1578                      Mean   :1962                     
##  3rd Qu.:1953                      3rd Qu.:2755                     
##  Max.   :2400                      Max.   :7290                     
##  NA's   :3066                                                       
##  ActualElapsedTime    AirTime         ArrDelay          DepDelay      
##  Min.   : 34.0     Min.   : 11.0   Min.   :-70.000   Min.   :-33.000  
##  1st Qu.: 77.0     1st Qu.: 58.0   1st Qu.: -8.000   1st Qu.: -3.000  
##  Median :128.0     Median :107.0   Median :  0.000   Median :  0.000  
##  Mean   :129.3     Mean   :108.1   Mean   :  7.094   Mean   :  9.445  
##  3rd Qu.:165.0     3rd Qu.:141.0   3rd Qu.: 11.000   3rd Qu.:  9.000  
##  Max.   :575.0     Max.   :549.0   Max.   :978.000   Max.   :981.000  
##  NA's   :3622      NA's   :3622    NA's   :3622      NA's   :2905     
##     Origin              Dest              Distance          TaxiIn       
##  Length:227496      Length:227496      Min.   :  79.0   Min.   :  1.000  
##  Class :character   Class :character   1st Qu.: 376.0   1st Qu.:  4.000  
##  Mode  :character   Mode  :character   Median : 809.0   Median :  5.000  
##                                        Mean   : 787.8   Mean   :  6.099  
##                                        3rd Qu.:1042.0   3rd Qu.:  7.000  
##                                        Max.   :3904.0   Max.   :165.000  
##                                                         NA's   :3066     
##     TaxiOut         Cancelled       CancellationCode      Diverted       
##  Min.   :  1.00   Min.   :0.00000   Length:227496      Min.   :0.000000  
##  1st Qu.: 10.00   1st Qu.:0.00000   Class :character   1st Qu.:0.000000  
##  Median : 14.00   Median :0.00000   Mode  :character   Median :0.000000  
##  Mean   : 15.09   Mean   :0.01307                      Mean   :0.002853  
##  3rd Qu.: 18.00   3rd Qu.:0.00000                      3rd Qu.:0.000000  
##  Max.   :163.00   Max.   :1.00000                      Max.   :1.000000  
##  NA's   :2947
# Return Total number of missing values in hflights
sum(is.na(hflights)==TRUE)
## [1] 25755
# Show number of rows that have complete data and number of rows that do not (e.g., contain NA values). 
table(complete.cases(hflights))
## 
##  FALSE   TRUE 
##   3622 223874
# Shows percentage of complete rows and % of incomplete rows (e.g., contain NA values)
prop.table(table(complete.cases(hflights))) * 100
## 
##     FALSE      TRUE 
##  1.592116 98.407884
# sapply applies function sum(is.na(x)) to each column in the dataset, returning the total count of missing values in hflights columns, sorted in ascending order.
sort(sapply(hflights, function(x) sum(is.na(x))))
##              Year             Month        DayofMonth         DayOfWeek 
##                 0                 0                 0                 0 
##     UniqueCarrier         FlightNum           TailNum            Origin 
##                 0                 0                 0                 0 
##              Dest          Distance         Cancelled  CancellationCode 
##                 0                 0                 0                 0 
##          Diverted           DepTime          DepDelay           TaxiOut 
##                 0              2905              2905              2947 
##           ArrTime            TaxiIn ActualElapsedTime           AirTime 
##              3066              3066              3622              3622 
##          ArrDelay 
##              3622

Step 2:

How do we start looking at missing data?

Functions for smaller datasets:

  • summary()
  • str()
  • skimr::skim
  • dplyr::glimpse()

For larger datasets, use visdat package

This provides visualization of an entire dataFrame at once.

2 main functions in visdat: * vis_dat * vis_miss

vis_dat

Visualizes the whole dataframe at onces, providing info. Provides infor about the class of the data input into R, and whether or not the data is missing.

library(visdat)
vis_dat(airquality)

vis_miss

Provides a summary of whether the data is missing or not. Also provides the amount of missings in each column. The following example shows that Ozone and Solar.R have the most missing data. Ozone with 24.2% and Solar.R with 4.6% missing data. No other variables have any missings.

vis_miss(airquality)

Exploring Missingness Relationships

Use scatter plots to explore relationships between 2 variables.

Note: ggplot does NOT handle missings by default, it removes them.

To visualize missing data, use:

  • ggobi
  • MANET

Which replaces NA values with values 10% lower than the minimum value in the variable.

Standard default ggplot scatter plot with missing values removed.

library(ggplot2)
ggplot(airquality, 
       aes(x = Solar.R, 
           y = Ozone)) + 
  geom_point()
## Warning: Removed 42 rows containing missing values or values outside the scale range
## (`geom_point()`).

Missing values replaced with values 10% lower than the minimum value in that variable.

library(naniar)
ggplot(airquality, 
       aes(x = Solar.R, 
           y = Ozone)) + 
  geom_miss_point()

Facets are supported for ggplot:

library(naniar)
ggplot(airquality, 
       aes(x = Solar.R, 
           y = Ozone)) + 
  geom_miss_point() + 
  facet_wrap(~Month)

Different Themes are supported for ggplot:

library(naniar)
ggplot(airquality, 
       aes(x = Solar.R, 
           y = Ozone)) + 
  geom_miss_point() + 
  facet_wrap(~Month) +
  theme_dark()

Visualising missings in variables

  • gg_miss_var() - Another way to visualize missings in a dataset. This function shows the number of missing values by variable, has customizable themes and labels, and supports faceting.

Default gg_miss_var():

gg_miss_var(airquality)

Customized Theme

gg_miss_var(airquality) + theme_bw() 

Customized label

gg_miss_var(airquality) + labs(y = "Look at all the missing ones")

Faceted by month

# Faceted by Month. 
gg_miss_var(airquality, facet = Month)

Note: There are other visualizations availabled in naniar pkg with prefix gg_miss_…(). For each of these visualizations, there is an acompanying function that returns a dataframe of the plot (i.e.: miss_var_summary())

Replacing Existing Values with NA

In some cases, you may want to replace values with NA. For instance, you have values “N/A”, “N A”, ” -99, or -1 which represent missing values.

Functions in tidyr and naniar complement each other.

  • tidyr::replace_na: Missing values turn into value (NA -> -99)
  • naniar::replace_with_na: Value becomes a missing value (-99 -> NA)

Tidy Missing Data: The Shadow Matrix

Keep track of missing values with a shadow matrix, which represents a missing data structure in tidy format.

The shadow matrix:

  • Has the same dimension as the data.
  • Consists of binary indicators of missingness of data values, where missing is represented as “NA” and not missing is represented as “!NA”, or 1 and 0, respectively.
  • The data matrix can be augmented to include the shadow matrix.
  • Another format can display it in long form, which facilitates heatmap style visualizations. This is good for an overview of which variables contain the most missingness.
  • Methods can be applied to rearrange rows and columns to find clusters, and ID other features of the data that may have been previously hidden or unclear.

Shadow Functions:

as_shadow(): Returns a new dataframe with the same set of columns as the data with the added suffix, _NA

as_shadow(airquality)
## # A tibble: 153 × 6
##    Ozone_NA Solar.R_NA Wind_NA Temp_NA Month_NA Day_NA
##    <fct>    <fct>      <fct>   <fct>   <fct>    <fct> 
##  1 !NA      !NA        !NA     !NA     !NA      !NA   
##  2 !NA      !NA        !NA     !NA     !NA      !NA   
##  3 !NA      !NA        !NA     !NA     !NA      !NA   
##  4 !NA      !NA        !NA     !NA     !NA      !NA   
##  5 NA       NA         !NA     !NA     !NA      !NA   
##  6 !NA      NA         !NA     !NA     !NA      !NA   
##  7 !NA      !NA        !NA     !NA     !NA      !NA   
##  8 !NA      !NA        !NA     !NA     !NA      !NA   
##  9 !NA      !NA        !NA     !NA     !NA      !NA   
## 10 NA       !NA        !NA     !NA     !NA      !NA   
## # ℹ 143 more rows

bind_shadow(): Attaches a shadow to the current dataframe (called “nabular” format for “NA tabular”). Could also use function: nabular() .

aq_shadow <- bind_shadow(airquality)
aq_nab <- nabular(airquality)

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
glimpse(aq_shadow)
## Rows: 153
## Columns: 12
## $ Ozone      <int> 41, 36, 12, 18, NA, 28, 23, 19, 8, NA, 7, 16, 11, 14, 18, 1…
## $ Solar.R    <int> 190, 118, 149, 313, NA, NA, 299, 99, 19, 194, NA, 256, 290,…
## $ Wind       <dbl> 7.4, 8.0, 12.6, 11.5, 14.3, 14.9, 8.6, 13.8, 20.1, 8.6, 6.9…
## $ Temp       <int> 67, 72, 74, 62, 56, 66, 65, 59, 61, 69, 74, 69, 66, 68, 58,…
## $ Month      <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,…
## $ Day        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ Ozone_NA   <fct> !NA, !NA, !NA, !NA, NA, !NA, !NA, !NA, !NA, NA, !NA, !NA, !…
## $ Solar.R_NA <fct> !NA, !NA, !NA, !NA, NA, NA, !NA, !NA, !NA, !NA, NA, !NA, !N…
## $ Wind_NA    <fct> !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA,…
## $ Temp_NA    <fct> !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA,…
## $ Month_NA   <fct> !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA,…
## $ Day_NA     <fct> !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA,…
glimpse(aq_nab)
## Rows: 153
## Columns: 12
## $ Ozone      <int> 41, 36, 12, 18, NA, 28, 23, 19, 8, NA, 7, 16, 11, 14, 18, 1…
## $ Solar.R    <int> 190, 118, 149, 313, NA, NA, 299, 99, 19, 194, NA, 256, 290,…
## $ Wind       <dbl> 7.4, 8.0, 12.6, 11.5, 14.3, 14.9, 8.6, 13.8, 20.1, 8.6, 6.9…
## $ Temp       <int> 67, 72, 74, 62, 56, 66, 65, 59, 61, 69, 74, 69, 66, 68, 58,…
## $ Month      <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,…
## $ Day        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ Ozone_NA   <fct> !NA, !NA, !NA, !NA, NA, !NA, !NA, !NA, !NA, NA, !NA, !NA, !…
## $ Solar.R_NA <fct> !NA, !NA, !NA, !NA, NA, NA, !NA, !NA, !NA, !NA, NA, !NA, !N…
## $ Wind_NA    <fct> !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA,…
## $ Temp_NA    <fct> !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA,…
## $ Month_NA   <fct> !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA,…
## $ Day_NA     <fct> !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA,…
all.equal(aq_shadow, aq_nab)
## [1] TRUE

Nabular data provides a useful pattern to explore missing values, grouping by the missing/complete of one variable and looking at the mean and other summary values.

Show the mean, sd, variance, and min and max values of Solar.R for when Ozone is present, and when it is missing.

airquality %>%
  bind_shadow() %>%
  group_by(Ozone_NA) %>%
  summarise_at(.vars = "Solar.R",
               .funs = c("mean", "sd", "var", "min", "max"),
               na.rm = TRUE)
## # A tibble: 2 × 6
##   Ozone_NA  mean    sd   var   min   max
##   <fct>    <dbl> <dbl> <dbl> <int> <int>
## 1 !NA       185.  91.2 8309.     7   334
## 2 NA        190.  87.7 7690.    31   332

A plot of the distribution of Temperature, plotting for values of temperature when Ozone is missing, and when it is not missing.

ggplot(aq_shadow,
       aes(x = Temp,
           colour = Ozone_NA)) + 
  geom_density()

Exploring the value of air temp an dhumidity based on missingness of each:

  oceanbuoys %>%
    bind_shadow() %>%
    ggplot(aes(x = air_temp_c,
               fill = humidity_NA)) +
        geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 81 rows containing non-finite outside the scale range
## (`stat_bin()`).

  oceanbuoys %>%
    bind_shadow() %>%
    ggplot(aes(x = humidity,
               fill = air_temp_c_NA)) +
        geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 93 rows containing non-finite outside the scale range
## (`stat_bin()`).

Visualizing Imputed Values

Binding the shadow is very useful especially when combined with imputation. Use the simputation package for easy imputuation.

Below, we can see a scatter plot of Ozone vs Temp, with no errors regarding missing observations because they are all imputed. However, we have no information about where the imputations are–they are in a way, invisible.

library(simputation)
## 
## Attaching package: 'simputation'
## The following object is masked from 'package:naniar':
## 
##     impute_median
airquality %>%
  impute_lm(Ozone ~ Temp + Wind) %>%
  ggplot(aes(x = Temp,
             y = Ozone)) + 
  geom_point()

Use a shadow matrix to keep track of where the missings are and track the imputations by coloring which was previously missing in Ozone.

aq_shadow %>%
  as.data.frame() %>% 
  impute_lm(Ozone ~ Temp + Wind) %>%
  ggplot(aes(x = Temp,
             y = Ozone,
             colour = Ozone_NA)) + 
  geom_point()

Numerical summaries of missing values

The naniar package also provides numerical summaries for missing data.

2 convenient counters of complete values and missings are:

  • n_miss()
  • n_complete() These work on both dataframes and vectors, similar to dplyr::n_distinct()
dplyr::n_distinct(airquality)
## [1] 153
dplyr::n_distinct(airquality$Ozone)
## [1] 68
n_miss(airquality)
## [1] 44
n_miss(airquality$Ozone)
## [1] 37
n_complete(airquality)
## [1] 874
n_complete(airquality$Ozone)
## [1] 116


Other syntax:

  • …miss_case() - Refers to cases
  • …miss_var() - Refers to variables

prop_miss_case() - Returns numeric value describing the proportion of missing values in the dataframe.

prop_miss_case(airquality)
## [1] 0.2745098

pct_miss_case() - Returns numeric value describing the percent of missing values in the dataframe.

pct_miss_case(airquality)
## [1] 27.45098

miss_case_summary() - Returns a numeric value describing the number of missings in a given case (aka row) and the % of missings in that row.

miss_case_summary(airquality)
## # A tibble: 153 × 3
##     case n_miss pct_miss
##    <int>  <int>    <dbl>
##  1     5      2     33.3
##  2    27      2     33.3
##  3     6      1     16.7
##  4    10      1     16.7
##  5    11      1     16.7
##  6    25      1     16.7
##  7    26      1     16.7
##  8    32      1     16.7
##  9    33      1     16.7
## 10    34      1     16.7
## # ℹ 143 more rows

miss_case_table() - Tabulates the number of missing values in a case/row. In the following example, there are:

  • 111 cases with 0 missings, equal to 72% of the data.
  • 40 cases with 1 value missing, 26% of the data
  • 2 cases with 2 values missing, 1% of the data
miss_case_table(airquality)
## # A tibble: 3 × 3
##   n_miss_in_case n_cases pct_cases
##            <int>   <int>     <dbl>
## 1              0     111     72.5 
## 2              1      40     26.1 
## 3              2       2      1.31

Working similarly but returning the percents and proportions of variables containing a missing value are:

  • pct_miss_var()
  • prop_miss_var()
  • miss_var_summary()
  • miss_var_table()
prop_miss_var(airquality)
## [1] 0.3333333
pct_miss_var(airquality)
## [1] 33.33333
miss_var_summary(airquality)
## # A tibble: 6 × 3
##   variable n_miss pct_miss
##   <chr>     <int>    <num>
## 1 Ozone        37    24.2 
## 2 Solar.R       7     4.58
## 3 Wind          0     0   
## 4 Temp          0     0   
## 5 Month         0     0   
## 6 Day           0     0
miss_var_table(airquality)
## # A tibble: 3 × 3
##   n_miss_in_var n_vars pct_vars
##           <int>  <int>    <dbl>
## 1             0      4     66.7
## 2             7      1     16.7
## 3            37      1     16.7

Explore Missings over a Particular Span or a Single Run:

miss_var_run(): Useful in time series data, allowing you to provide summaries for the # of missings or complete values in a single run. Returns a dataframe of the run length of missings and complete values.

miss_var_run(pedestrian, hourly_counts)
## # A tibble: 35 × 2
##    run_length is_na   
##         <int> <chr>   
##  1       6628 complete
##  2          1 missing 
##  3       5250 complete
##  4        624 missing 
##  5       3652 complete
##  6          1 missing 
##  7       1290 complete
##  8        744 missing 
##  9       7420 complete
## 10          1 missing 
## # ℹ 25 more rows

miss_var_span(): Determines the # of missings over a specified repeating span of rows in a variable of a dataframe. Specify the variable you want to explore and the size of the span.

miss_var_span(pedestrian, 
              hourly_counts,
              span_every=100)
## # A tibble: 377 × 6
##    span_counter n_miss n_complete prop_miss prop_complete n_in_span
##           <int>  <int>      <int>     <dbl>         <dbl>     <int>
##  1            1      0        100         0             1       100
##  2            2      0        100         0             1       100
##  3            3      0        100         0             1       100
##  4            4      0        100         0             1       100
##  5            5      0        100         0             1       100
##  6            6      0        100         0             1       100
##  7            7      0        100         0             1       100
##  8            8      0        100         0             1       100
##  9            9      0        100         0             1       100
## 10           10      0        100         0             1       100
## # ℹ 367 more rows


Using group_by with naniar

Every miss_*() summary function that returns a dataframe can be suesd with dplyr’s group_by() function.

Example: Look at the number of missing values for all variables of pedestrian data

In miss_var_summary, the data is in hourly_counts:

pedestrian %>% miss_var_summary()
## # A tibble: 9 × 3
##   variable      n_miss pct_miss
##   <chr>          <int>    <num>
## 1 hourly_counts   2548     6.76
## 2 date_time          0     0   
## 3 year               0     0   
## 4 month              0     0   
## 5 month_day          0     0   
## 6 week_day           0     0   
## 7 hour               0     0   
## 8 sensor_id          0     0   
## 9 sensor_name        0     0

Explore by month and filter by the variable, hourly_counts, because it’s the only one with missing values.

pedestrian %>%
 group_by(month) %>%
 miss_var_summary() %>%
 filter(variable == "hourly_counts")
## # A tibble: 12 × 4
## # Groups:   month [12]
##    month     variable      n_miss pct_miss
##    <ord>     <chr>          <int>    <num>
##  1 January   hourly_counts      0     0   
##  2 February  hourly_counts      0     0   
##  3 March     hourly_counts      0     0   
##  4 April     hourly_counts    552    19.2 
##  5 May       hourly_counts     72     2.42
##  6 June      hourly_counts      0     0   
##  7 July      hourly_counts      0     0   
##  8 August    hourly_counts    408    13.7 
##  9 September hourly_counts      0     0   
## 10 October   hourly_counts    412     7.44
## 11 November  hourly_counts    888    30.8 
## 12 December  hourly_counts    216     7.26


Modeling Missingness

One approach is to predict the proportion of missingness in a given caes, using all variables.

add_prop_miss(): A helper function to add a column with the proportion of cases, aka rows, missing.

airquality %>%
  add_prop_miss() %>%
  head()
##   Ozone Solar.R Wind Temp Month Day prop_miss_all
## 1    41     190  7.4   67     5   1     0.0000000
## 2    36     118  8.0   72     5   2     0.0000000
## 3    12     149 12.6   74     5   3     0.0000000
## 4    18     313 11.5   62     5   4     0.0000000
## 5    NA      NA 14.3   56     5   5     0.3333333
## 6    28      NA 14.9   66     5   6     0.1666667

We can use a model like decision trees to predict which variables and their values are important for predicting the proportion of missingness:

library(rpart)
library(rpart.plot)

airquality %>%
  add_prop_miss() %>%
  rpart(prop_miss_all ~ ., data = .) %>%
  prp(type = 4, extra = 101, prefix = "Prop. Miss = ")
## Warning: Cannot retrieve the data used to build the model (so cannot determine roundint and is.binary for the variables).
## To silence this warning:
##     Call prp with roundint=FALSE,
##     or rebuild the rpart model with model=TRUE.

Summary

Tools in the naniar package help us identify where missingness is, while maintaining a tidy workflow. These mechanisms are important because they help us understand potential mechanisms, such as equipment failures, and then identify possible solutions based on the evidence.