What are missing values?
Missing values are values that should have been recorded but were not. R represents them as NA ("Not Available").
How do I check if I have missing values?
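library(naniar) # provides any_na(), are_na(), n_miss(), and prop_miss()
x <- c(1, NA, 3, NA, NA, 5)
any_na(x) # base R equivalent: anyNA(x)
## [1] TRUE
are_na(x) # base R equivalent: is.na(x)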
## [1] FALSE TRUE FALSE TRUE TRUE FALSE
n_miss(x)
## [1] 3
prop_miss(x)
## [1] 0.5
Missing data gotchas
any_na(NaN)
## [1] TRUE
any_na(NULL)
## [1] FALSE
any_na(Inf)
## [1] FALSE
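The same gotchas can be checked in base R. This is a minimal sketch that assumes nothing beyond base R; the variable names are illustrative. It shows that is.na() flags both NA and NaN, while is.nan() and is.finite() separate the special values.

```r
# Base R only; variable names are illustrative.
vals <- c(NA, NaN, Inf, -Inf, 0)

na_flags     <- is.na(vals)      # TRUE for NA and for NaN
nan_flags    <- is.nan(vals)     # TRUE only for NaN
finite_flags <- is.finite(vals)  # FALSE for NA, NaN, Inf, and -Inf

# NULL has length zero, so there is nothing in it that could be missing.
null_has_na <- anyNA(NULL)
```

If infinite values should also be treated as unusable, filtering on is.finite() is one option.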
# Create x, a vector, with values NA, NaN, Inf, ".", and "missing"
x <- c(NA, NaN, Inf, ".", "missing")
# Use any_na() and are_na() on x to explore the missings
any_na(x) # anyNA(x)
## [1] TRUE
are_na(x) # is.na(x)
## [1] TRUE FALSE FALSE FALSE FALSE
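Only the first element is flagged because mixing numbers with strings coerces the whole vector to character, so NaN and Inf become the strings "NaN" and "Inf". A minimal base R sketch of recoding such placeholders follows; the list of codes in missing_codes is an assumption for illustration, not something the data defines.

```r
x <- c(NA, NaN, Inf, ".", "missing")
x_class <- class(x)   # "character": NaN and Inf are now plain strings

# Hypothetical list of strings that should count as missing.
missing_codes <- c(".", "missing", "NaN")

x_clean <- x
x_clean[x_clean %in% missing_codes] <- NA

n_before <- sum(is.na(x))        # only the real NA is counted
n_after  <- sum(is.na(x_clean))  # placeholders are now counted too
```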
# Use n_miss() to count the total number of missing values in dat_hw
n_miss(dat_hw) # sum(is.na(dat_hw))
## [1] 30
# Use n_miss() on dat_hw$weight to count the total number of missing values
n_miss(dat_hw$weight) # sum(is.na(dat_hw$weight))
## [1] 15
# Use n_complete() on dat_hw to count the total number of complete values
n_complete(dat_hw) # sum(!is.na(dat_hw))
## [1] 170
# Use n_complete() on dat_hw$weight to count the total number of complete values
n_complete(dat_hw$weight) # sum(!is.na(dat_hw$weight))
## [1] 85
# Use prop_miss() and prop_complete() on dat_hw to compute the overall proportion of missing and complete values
prop_miss(dat_hw) # mean(is.na(dat_hw))
## [1] 0.15
prop_complete(dat_hw) # mean(!is.na(dat_hw))
## [1] 0.85
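The counts and proportions above are tied together: missing plus complete values equal the number of cells, and the two proportions sum to 1. Since dat_hw is not included here, the sketch below uses a small made-up data frame (dat_toy, hypothetical) and the base R equivalents noted in the comments above.

```r
# Hypothetical stand-in for dat_hw; the values are made up.
dat_toy <- data.frame(
  height = c(1.7, NA, 1.6, 1.8),
  weight = c(65, 70, NA, NA)
)

n_missing       <- sum(is.na(dat_toy))    # like n_miss(dat_toy)
n_complete_vals <- sum(!is.na(dat_toy))   # like n_complete(dat_toy)
n_cells         <- nrow(dat_toy) * ncol(dat_toy)

prop_missing       <- mean(is.na(dat_toy))   # like prop_miss(dat_toy)
prop_complete_vals <- mean(!is.na(dat_toy))  # like prop_complete(dat_toy)
```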
R stores missing values as NA, which has some special behavior. Now that you can define missing data and understand how R stores missing values, can you predict what will happen when we operate on some missing values?
What is the output of the following four commands in R? Try them out in the code console to test them before you submit your answer.
1 + NA
NA + NA
NA | TRUE
NA | FALSE
Possible answers:
* NA, NA, NA, NA
* 1, NA, TRUE, FALSE
* NA, NA, NA, FALSE
* NA, NA, TRUE, NA
* NA, NA, TRUE, FALSE
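One way to reason about the four commands: arithmetic with an unknown value is unknown, but a logical operation returns a definite answer whenever the result does not depend on the unknown operand. The sketch below stores each result so it can be inspected; the variable names are illustrative.

```r
# Arithmetic: an unknown operand makes the result unknown.
r1 <- 1 + NA
r2 <- NA + NA

# Logical OR: TRUE wins regardless of the unknown value,
# whereas with FALSE the result depends on the unknown value.
r3 <- NA | TRUE
r4 <- NA | FALSE

# The mirror image for AND: FALSE wins regardless of the unknown value.
r5 <- NA & FALSE
```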
Introduction to missingness summaries
Basic summaries of missingness:
* n_miss
* n_complete
Dataframe summaries of missingness:
* miss_var_summary
* miss_case_summary
These functions work with group_by
Missing data summaries: Variables
miss_var_summary(airquality)
# A tibble: 6 x 3
  variable n_miss pct_miss
  <chr>     <int>    <dbl>
1 Ozone        37    24.2
2 Solar.R       7     4.58
3 Wind          0     0
4 Temp          0     0
5 Month         0     0
6 Day           0     0
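The per-variable summary can be approximated in base R with column-wise sums and means of is.na(). This is a rough sketch of the idea, not naniar's implementation; airquality ships with R's datasets package, and the name var_summary is illustrative.

```r
# Column-wise counts and percentages of missing values.
n_miss_by_var   <- colSums(is.na(airquality))
pct_miss_by_var <- 100 * colMeans(is.na(airquality))

var_summary <- data.frame(
  variable = names(n_miss_by_var),
  n_miss   = as.integer(n_miss_by_var),
  pct_miss = as.numeric(pct_miss_by_var)
)

# Sort so the variables with the most missing values come first.
var_summary <- var_summary[order(-var_summary$n_miss), ]
```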
Missing data summaries: Cases
miss_case_summary(airquality)
# A tibble: 153 x 3
    case n_miss pct_miss
   <int>  <int>    <dbl>
 1     5      2     33.3
 2    27      2     33.3
 3     6      1     16.7
 4    10      1     16.7
 5    11      1     16.7
 6    25      1     16.7
 7    26      1     16.7
 8    32      1     16.7
 9    33      1     16.7
10    34      1     16.7
# ... with 143 more rows
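The per-case summary is the row-wise counterpart. A base R sketch under the same caveat (it mirrors the idea, not naniar's code; case_summary is an illustrative name):

```r
# Row-wise counts of missing values.
n_miss_by_case <- rowSums(is.na(airquality))

case_summary <- data.frame(
  case     = seq_len(nrow(airquality)),
  n_miss   = as.integer(n_miss_by_case),
  pct_miss = 100 * n_miss_by_case / ncol(airquality)
)

# Rows with the most missing values first; ties keep their original order.
case_summary <- case_summary[order(-case_summary$n_miss), ]
```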
Missing data tabulations
miss_var_table(airquality)
# A tibble: 3 x 3
  n_miss_in_var n_vars pct_var
          <int>  <int>   <dbl>
1             0      4    66.7
2             7      1    16.7
3            37      1    16.7
miss_case_table(airquality)
# A tibble: 3 x 3
  n_miss_in_case n_cases pct_case
           <int>   <int>    <dbl>
1              0     111    72.5
2              1      40    26.1
3              2       2     1.31
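The tabulations count how many variables (or cases) share each number of missing values. In base R this can be sketched by passing the column-wise or row-wise counts to table(); the object names are illustrative.

```r
# How many cases have 0, 1, 2, ... missing values?
case_table <- table(rowSums(is.na(airquality)))
pct_case   <- 100 * prop.table(case_table)

# How many variables have a given number of missing values?
var_table <- table(colSums(is.na(airquality)))
```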
Missing data summaries: Spans of missing data
miss_var_span(pedestrian,
              var = hourly_counts,
              span_every = 4000)
# A tibble: 10 x 5
   span_counter n_miss n_complete prop_miss prop_complete
          <int>  <int>      <dbl>     <dbl>         <dbl>
 1            1      0       4000   0               1
 2            2      1       3999   0.00025         1.000
 3            3    121       3879   0.0302          0.970
 4            4    503       3497   0.126           0.874
 5            5    745       3255   0.186           0.814
 6            6      0       4000   0               1
 7            7      1       3999   0.00025         1.000
 8            8      0       4000   0               1
 9            9    745       3255   0.186           0.814
10           10    432       3568   0.108           0.892
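A span summary splits a variable into consecutive blocks of a fixed size and counts the missing values in each block. The sketch below illustrates the idea in base R on a short made-up vector (counts is hypothetical, standing in for pedestrian$hourly_counts); it is not naniar's implementation.

```r
# Made-up data; span_every = 4 gives blocks of four observations.
counts     <- c(5, NA, 7, 8, NA, NA, 3, 2, 9, NA)
span_every <- 4

# Assign each observation to a block: 1 1 1 1 2 2 2 2 3 3
span_id <- ceiling(seq_along(counts) / span_every)

n_miss_span     <- as.integer(tapply(is.na(counts), span_id, sum))
n_complete_span <- as.integer(tapply(!is.na(counts), span_id, sum))

span_summary <- data.frame(
  span_counter = sort(unique(span_id)),
  n_miss       = n_miss_span,
  n_complete   = n_complete_span,
  prop_miss    = n_miss_span / (n_miss_span + n_complete_span)
)
```

Note that the last block may be shorter than span_every, so its proportions are based on fewer observations.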
Missing data summaries: Runs of missing data
miss_var_run(pedestrian,
             hourly_counts)
# A tibble: 35 x 2
   run_length is_na
        <int> <chr>
 1       6628 complete
 2          1 missing
 3       5250 complete
 4        624 missing
 5       3652 complete
 6          1 missing
 7       1290 complete
 8        744 missing
 9       7420 complete
10          1 missing
# ... with 25 more rows
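A run summary reports the lengths of consecutive stretches of complete and missing values. Base R's rle() (run-length encoding) applied to is.na() gives the same kind of result. The sketch uses a short made-up vector (counts is hypothetical) rather than the pedestrian data.

```r
# Made-up data with runs of complete and missing values.
counts <- c(5, 6, NA, NA, NA, 7, NA, 8, 9)

# Run-length encode the missingness indicator.
runs <- rle(is.na(counts))

run_summary <- data.frame(
  run_length = runs$lengths,
  is_na      = ifelse(runs$values, "missing", "complete")
)
```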
Using summaries with group_by
library(dplyr) # for %>% and group_by()
airquality %>%
  group_by(Month) %>%
  miss_var_summary()
# A tibble: 25 x 4
   Month variable n_miss pct_miss
   <int> <chr>     <int>    <dbl>
 1     5 Ozone         5     16.1
 2     5 Solar.R       4     12.9
 3     5 Wind          0      0
 4     5 Temp          0      0
 5     5 Day           0      0
 6     6 Ozone        21     70
 7     6 Solar.R       0      0
 8     6 Wind          0      0
 9     6 Temp          0      0
10     6 Day           0      0
# ... with 15 more rows
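Grouping simply repeats the summary within each level of the grouping variable. For a single variable, the idea can be sketched in base R with tapply(), shown here for Ozone by Month; the object names are illustrative and this is not how naniar or dplyr implement it.

```r
# Count and percentage of missing Ozone values within each month.
ozone_n_miss   <- tapply(is.na(airquality$Ozone), airquality$Month, sum)
ozone_pct_miss <- 100 * tapply(is.na(airquality$Ozone), airquality$Month, mean)

# The per-month counts add up to the overall count for the variable.
ozone_total <- sum(ozone_n_miss)
```

Looking at missingness by group can reveal patterns the overall summary hides: June accounts for more than half of the missing Ozone values.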
Michael is a hybrid thinker and doer—a byproduct of being a CliftonStrengths “Learner” over time. With 20+ years of engineering, design, and product experience, he helps organizations identify market needs, mobilize internal and external resources, and deliver delightful digital customer experiences that align with business goals. He has been entrusted with problem-solving for brands—ranging from Fortune 500 companies to early-stage startups to not-for-profit organizations.
Michael earned his BS in Computer Science from New York Institute of Technology and his MBA from the University of Maryland, College Park. He is also a candidate to receive his MS in Applied Analytics from Columbia University.
LinkedIn | Twitter | www.michaelmallari.com/data | www.columbia.edu/~mm5470