Dealing With Missing Data in R (DataCamp)

Ch. 1 - Why care about missing data?

Introduction to Missing Data

What are missing values?

Missing values are values that should have been recorded but were not. R represents them as NA (Not Available).

How do I check if I have missing values?

library(naniar) # provides any_na(), are_na(), n_miss(), prop_miss()
x <- c(1, NA, 3, NA, NA, 5)
any_na(x) # base R equivalent: anyNA(x)

## [1] TRUE

are_na(x) # is.na(x)

## [1] FALSE  TRUE FALSE  TRUE  TRUE FALSE

n_miss(x)

## [1] 3

prop_miss(x)

## [1] 0.5

Missing data gotchas

any_na(NaN)

## [1] TRUE

any_na(NULL)

## [1] FALSE

any_na(Inf)

## [1] FALSE
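
The three results above follow from how base R classifies special values. A minimal base-R sketch (using is.na(), is.nan(), and is.finite() instead of the naniar wrappers) makes the distinction visible:

```r
# NaN counts as missing for is.na(); Inf and -Inf do not.
special <- c(NA, NaN, Inf, -Inf, 0)

na_flags     <- is.na(special)      # TRUE for NA and NaN
nan_flags    <- is.nan(special)     # TRUE only for NaN
finite_flags <- is.finite(special)  # TRUE only for ordinary numbers

print(na_flags)
print(nan_flags)
print(finite_flags)

# NULL has length zero, so it contains nothing that could be missing.
null_has_na <- anyNA(NULL)
print(null_has_na)
```

If infinite values should also be treated as unusable, !is.finite() flags NA, NaN, Inf, and -Inf in one pass.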

Using and finding missing values

# Create x, a vector, with values NA, NaN, Inf, ".", and "missing"
x <- c(NA, NaN, Inf, ".", "missing")

# Use any_na() and are_na() on x to explore the missings
any_na(x) # anyNA(x)

## [1] TRUE

are_na(x) # is.na(x)

## [1]  TRUE FALSE FALSE FALSE FALSE
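
Only the first element is flagged because mixing strings into c() coerces the whole vector to character, so NaN and Inf become the strings "NaN" and "Inf". The sketch below shows the coercion and one possible clean-up in base R; the missing_codes vector is a hypothetical choice for illustration, not part of the course code:

```r
x <- c(NA, NaN, Inf, ".", "missing")

# The vector is character, so NaN and Inf are now ordinary strings.
print(class(x))
print(x)
print(is.na(x))

# Hypothetical list of strings that should count as missing.
missing_codes <- c(".", "missing", "NaN")

# Replace those codes with a real NA.
x_clean <- x
x_clean[x_clean %in% missing_codes] <- NA

n_missing_after <- sum(is.na(x_clean))
print(n_missing_after)
```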

How many missing values are there?

# Use n_miss() to count the total number of missing values in dat_hw
n_miss(dat_hw) # sum(is.na(dat_hw))

## [1] 30

# Use n_miss() on dat_hw$weight to count the total number of missing values
n_miss(dat_hw$weight) # sum(is.na(dat_hw$weight))

## [1] 15

# Use n_complete() on dat_hw to count the total number of complete values
n_complete(dat_hw) # sum(!is.na(dat_hw))

## [1] 170

# Use n_complete() on dat_hw$weight to count the total number of complete values
n_complete(dat_hw$weight) # sum(!is.na(dat_hw$weight))

## [1] 85

# Use prop_miss() and prop_complete() on dat_hw to compute the overall proportion of missing and complete values
prop_miss(dat_hw) # mean(is.na(dat_hw))

## [1] 0.15

prop_complete(dat_hw) # mean(!is.na(dat_hw))

## [1] 0.85
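
On a data frame, these functions count cells rather than rows. The dat_hw data is not reproduced in these notes, so the sketch below builds a hypothetical stand-in (dat_demo, with an assumed height column) that has the same shape: 100 rows, 2 columns, and 15 missing values per column.

```r
# Hypothetical stand-in for dat_hw: 100 rows, 2 columns.
set.seed(1)
dat_demo <- data.frame(
  weight = rnorm(100, mean = 80, sd = 10),
  height = rnorm(100, mean = 1.7, sd = 0.1)
)

# Blank out 15 distinct rows in each column.
dat_demo$weight[sample(100, 15)] <- NA
dat_demo$height[sample(100, 15)] <- NA

total_missing  <- sum(is.na(dat_demo))    # counts cells, not rows
total_complete <- sum(!is.na(dat_demo))
share_missing  <- mean(is.na(dat_demo))   # missing cells / all cells

print(c(missing = total_missing, complete = total_complete))
print(share_missing)
```

The two counts always add up to nrow(dat_demo) * ncol(dat_demo), which is why 30 missing and 170 complete values imply 200 cells.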

Working with missing values

R stores missing values as NA, which has some special behavior. Now that you can define missing data and understand how R stores missing values, can you predict what will happen when we operate on some missing values?

What is the output of the following four commands in R? Try them out in the code console to test them before you submit your answer.

1 + NA
NA + NA
NA | TRUE
NA | FALSE

NA, NA, NA, NA
1, NA, TRUE, FALSE
NA, NA, NA, FALSE
[*] NA, NA, TRUE, NA
NA, NA, TRUE, FALSE
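
The marked answer follows from treating NA as "unknown": arithmetic with an unknown value is unknown, but a logical expression returns a definite result whenever the unknown operand cannot change the outcome. A short sketch, with the & cases added for comparison:

```r
results <- list(
  "1 + NA"     = 1 + NA,      # unknown number
  "NA + NA"    = NA + NA,     # unknown number
  "NA | TRUE"  = NA | TRUE,   # TRUE whatever NA stands for
  "NA | FALSE" = NA | FALSE,  # depends on NA, so unknown
  "NA & FALSE" = NA & FALSE,  # FALSE whatever NA stands for
  "NA & TRUE"  = NA & TRUE    # depends on NA, so unknown
)

str(results)
```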

Why care about missing values?

Introduction to missingness summaries

Basic summaries of missingness:
* n_miss
* n_complete

Dataframe summaries of missingness:
* miss_var_summary
* miss_case_summary

These summary functions also work with dplyr's group_by().

Missing data summaries: Variables

miss_var_summary(airquality)

# A tibble: 6 x 3
  variable n_miss pct_miss 
  <chr>     <int>    <dbl> 
1 Ozone        37    24.2  
2 Solar.R       7     4.58 
3 Wind          0     0    
4 Temp          0     0    
5 Month         0     0    
6 Day           0     0
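
For comparison, the same per-variable counts can be sketched in base R with colSums() and colMeans(). This is an illustration of the idea, not naniar's implementation, and it returns a plain data frame rather than a tibble:

```r
# Missing values per column of the built-in airquality data.
miss_flags <- is.na(airquality)

var_summary <- data.frame(
  variable = colnames(miss_flags),
  n_miss   = unname(colSums(miss_flags)),
  pct_miss = unname(round(100 * colMeans(miss_flags), 2)),
  stringsAsFactors = FALSE
)

# Sort so the variables with the most missing values come first.
var_summary <- var_summary[order(-var_summary$n_miss), ]
print(var_summary)
```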

Missing data summaries: Cases

miss_case_summary(airquality)

# A tibble: 153 x 3
    case n_miss pct_miss
   <int>  <int>    <dbl>
 1     5      2     33.3
 2    27      2     33.3
 3     6      1     16.7
 4    10      1     16.7
 5    11      1     16.7
 6    25      1     16.7
 7    26      1     16.7
 8    32      1     16.7
 9    33      1     16.7
10    34      1     16.7
# ... with 143 more rows
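
The per-case view is the same calculation taken across rows instead of columns. A base-R sketch with rowSums(), again only an approximation of what miss_case_summary() reports:

```r
# Missing values per row (case) of airquality.
n_miss_by_case <- unname(rowSums(is.na(airquality)))

case_summary <- data.frame(
  case     = seq_len(nrow(airquality)),
  n_miss   = n_miss_by_case,
  pct_miss = round(100 * n_miss_by_case / ncol(airquality), 1)
)

# Cases with the most missing values first.
case_summary <- case_summary[order(-case_summary$n_miss), ]
print(head(case_summary, 10))
```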

Missing data tabulations

miss_var_table(airquality)

# A tibble: 3 x 3
  n_miss_in_var n_vars pct_var
          <int>  <int>    <dbl>
1             0      4     66.7
2             7      1     16.7
3            37      1     16.7

miss_case_table(airquality)

# A tibble: 3 x 3
  n_miss_in_case n_cases pct_case
           <int>   <int>    <dbl>
1              0     111    72.5 
2              1      40    26.1 
3              2       2     1.31
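
Both tabulations amount to counting how often each per-variable or per-case total occurs. In base R that is table() applied to the column or row sums, as in this sketch:

```r
# How many variables have 0, 7, 37, ... missing values?
var_table <- table(colSums(is.na(airquality)))

# How many cases have 0, 1, 2 missing values?
case_table <- table(rowSums(is.na(airquality)))

print(var_table)
print(case_table)

# Percentage of cases at each level of missingness.
case_pct <- round(100 * prop.table(case_table), 2)
print(case_pct)
```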

Missing data summaries: Spans of missing data

miss_var_span(pedestrian, 
              var = hourly_counts,
              span_every = 4000)

# A tibble: 10 x 5
   span_counter n_miss n_complete prop_miss prop_complete
          <int>  <int>      <dbl>     <dbl>         <dbl>
 1            1      0       4000   0               1    
 2            2      1       3999   0.00025         1.000
 3            3    121       3879   0.0302          0.970
 4            4    503       3497   0.126           0.874
 5            5    745       3255   0.186           0.814
 6            6      0       4000   0               1    
 7            7      1       3999   0.00025         1.000
 8            8      0       4000   0               1    
 9            9    745       3255   0.186           0.814
10           10    432       3568   0.108           0.892
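
A span summary cuts the variable into consecutive blocks of span_every observations and counts the missing values in each block. The sketch below reproduces that idea in base R on a small hypothetical vector (counts, with a span of 5), since the pedestrian data ships with naniar:

```r
# Hypothetical series of 20 hourly counts with a few gaps.
set.seed(42)
counts <- rpois(20, lambda = 50)
counts[c(3, 9, 10, 11, 18)] <- NA

span_every <- 5

# Label each observation with the span it falls in: 1,1,1,1,1,2,2,...
span_id <- ceiling(seq_along(counts) / span_every)

n_miss_by_span     <- tapply(is.na(counts), span_id, sum)
n_complete_by_span <- tapply(!is.na(counts), span_id, sum)

span_summary <- data.frame(
  span_counter = as.integer(names(n_miss_by_span)),
  n_miss       = as.integer(n_miss_by_span),
  n_complete   = as.integer(n_complete_by_span)
)
span_summary$prop_miss <- span_summary$n_miss / span_every

print(span_summary)
```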

Missing data summaries: Runs of missing data

miss_var_run(pedestrian,
             hourly_counts)

# A tibble: 35 x 2
   run_length is_na   
        <int> <chr>   
 1       6628 complete
 2          1 missing 
 3       5250 complete
 4        624 missing 
 5       3652 complete
 6          1 missing 
 7       1290 complete
 8        744 missing 
 9       7420 complete
10          1 missing 
# ... with 25 more rows
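
Runs of missing and complete values can be sketched in base R with rle() applied to is.na(). The counts vector here is a hypothetical example, not the pedestrian data:

```r
# Hypothetical series with two gaps.
counts <- c(12, 7, NA, NA, NA, 3, 9, NA, 5, 5)

# Run-length encoding of the missingness indicator.
runs <- rle(is.na(counts))

run_summary <- data.frame(
  run_length = runs$lengths,
  is_na      = ifelse(runs$values, "missing", "complete"),
  stringsAsFactors = FALSE
)

print(run_summary)
```

Long "missing" runs are a hint that a sensor or collection process was down for a stretch, rather than values dropping out at random.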

Using summaries with group_by

library(dplyr) # provides %>% and group_by()
airquality %>%
  group_by(Month) %>%
  miss_var_summary()

# A tibble: 25 x 4
   Month variable n_miss pct_miss 
   <int> <chr>     <int>    <dbl> 
 1     5 Ozone         5     16.1 
 2     5 Solar.R       4     12.9 
 3     5 Wind          0      0   
 4     5 Temp          0      0   
 5     5 Day           0      0   
 6     6 Ozone        21     70   
 7     6 Solar.R       0      0   
 8     6 Wind          0      0   
 9     6 Temp          0      0   
10     6 Day           0      0   
# ... with 15 more rows
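
The grouped summary can be cross-checked for a single variable in base R with tapply(). This sketch covers only Ozone and is not a replacement for the grouped naniar call:

```r
# Missing Ozone readings per month.
ozone_is_na <- is.na(airquality$Ozone)

ozone_miss_by_month <- tapply(ozone_is_na, airquality$Month, sum)
ozone_pct_by_month  <- round(100 * tapply(ozone_is_na, airquality$Month, mean), 1)

print(ozone_miss_by_month)
print(ozone_pct_by_month)
```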

Summarizing missingness

Tabulating Missingness

Other summaries of missingness

How do we visualize missing values?

Your first missing data visualizations

Visualizing missing cases and variables

Visualising missingness patterns

Ch. 2 - Wrangling and tidying up missing values

Searching for and replacing missing values

Using miss_scan_count

Using replace_with_na

Using replace_with_na scoped variants

Filling down missing values

Fix implicit missings using complete()

Fix explicit missings using fill()

Using complete() and fill() together

Missing Data dependence

Differences between MCAR and MAR

Exploring missingness dependence

Further exploring missingness dependence

Ch. 3 - Testing missing relationships

Tools to explore missing data dependence

Creating shadow matrix data

Performing grouped summaries of missingness

Further exploring more combinations of missingness

Visualizing missingness across one variable

Nabular data and filling by missingness

Nabular data and summarising by missingness

Explore variation by missingness: boxplots

Visualizing missingness across two variables

Exploring missing data with scatterplots

Faceting to explore missingness (multiple plots)

Ch. 4 - Connecting the dots (Imputation)

Filling in the blanks

Impute data below range with nabular data

Visualise imputed values in a scatterplot

Create histogram of imputed data

What makes a good imputation

Evaluating bad imputations

Evaluating imputations: The scale

Evaluating imputations: Across many variables

Performing imputations

Using simputation to impute data

Evaluating and comparing imputations

Evaluating imputations (many models & variables)

Evaluating imputations and models

Combining and comparing many imputation models

Evaluating the different parameters in the model

Final Lesson

About Michael Mallari

Michael is a hybrid thinker and doer—a byproduct of being a CliftonStrengths “Learner” over time. With 20+ years of engineering, design, and product experience, he helps organizations identify market needs, mobilize internal and external resources, and deliver delightful digital customer experiences that align with business goals. He has been entrusted with problem-solving for brands—ranging from Fortune 500 companies to early-stage startups to not-for-profit organizations.

Michael earned his BS in Computer Science from New York Institute of Technology and his MBA from the University of Maryland, College Park. He is also a candidate to receive his MS in Applied Analytics from Columbia University.

LinkedIn | Twitter | www.michaelmallari.com/data | www.columbia.edu/~mm5470