This report is a summary of lesson by Data camp

1. Why care about missing data?

Introduction to missing data

1) Missing value

  • NA: Not Available, missing values are values that should have been recorded but were not
  • NaN: Not a number, R에서는 missing value로 취급

2) Not missing value

  • NULL: NOT missing value
  • Inf

3) Function to relative

(1) Basic summaries of missingness: return a single number
  • any_na: 전체 포함 여부, TRUE/FALSE 반환
  • are_na: 개별 여부
  • n_miss: 전체 NA 개수
    • n_complete: 전체 NA가 아닌 개수
  • prop_miss: 전체 NA 비율
    • prop_complete: 전체 `NA``가 아닌 비율
(2) Dataframe summaries of missingness: return a dataframe

group_by와 같이 사용하면 좋음

  • miss_var_summary : 변수 기준(종 방향)

  • miss_case_summary : 관측치 기준(행 방향)

  • miss_var_table

  • miss_case_table

  • miss_var_span(data, col = ..., span_every = ...) : 기간 별 missing value

  • miss_case_span

  • miss_var_run : complete와 missing 연속 길이를 반환하며 missing pattern을 찾는데 유용

4) Working with missing data

  • NA + anything = NA
  • NA|TRUE = TRUE
  • NA|FALSE = NA
  • NA + NaN = NA
  • NaN + NA = NaN

Summarizing missingness

airquality %>% group_by(Month) %>% miss_var_summary()
## # A tibble: 25 × 4
## # Groups:   Month [5]
##    Month variable n_miss pct_miss
##    <int> <chr>     <int>    <num>
##  1     5 Ozone         5     16.1
##  2     5 Solar.R       4     12.9
##  3     5 Wind          0      0  
##  4     5 Temp          0      0  
##  5     5 Day           0      0  
##  6     6 Ozone        21     70  
##  7     6 Solar.R       0      0  
##  8     6 Wind          0      0  
##  9     6 Temp          0      0  
## 10     6 Day           0      0  
## # ℹ 15 more rows
airquality %>% group_by(Month) %>% miss_case_summary()
## # A tibble: 153 × 4
## # Groups:   Month [5]
##    Month  case n_miss pct_miss
##    <int> <int>  <int>    <dbl>
##  1     5     5      2       40
##  2     5    27      2       40
##  3     5     6      1       20
##  4     5    10      1       20
##  5     5    11      1       20
##  6     5    25      1       20
##  7     5    26      1       20
##  8     5     1      0        0
##  9     5     2      0        0
## 10     5     3      0        0
## # ℹ 143 more rows

Tabulating Missingness

airquality %>% group_by(Month) %>% miss_var_table()
## # A tibble: 12 × 4
## # Groups:   Month [5]
##    Month n_miss_in_var n_vars pct_vars
##    <int>         <int>  <int>    <dbl>
##  1     5             0      3       60
##  2     5             4      1       20
##  3     5             5      1       20
##  4     6             0      4       80
##  5     6            21      1       20
##  6     7             0      4       80
##  7     7             5      1       20
##  8     8             0      3       60
##  9     8             3      1       20
## 10     8             5      1       20
## 11     9             0      4       80
## 12     9             1      1       20
airquality %>% group_by(Month) %>% miss_case_table()
## # A tibble: 11 × 4
## # Groups:   Month [5]
##    Month n_miss_in_case n_cases pct_cases
##    <int>          <int>   <int>     <dbl>
##  1     5              0      24     77.4 
##  2     5              1       5     16.1 
##  3     5              2       2      6.45
##  4     6              0       9     30   
##  5     6              1      21     70   
##  6     7              0      26     83.9 
##  7     7              1       5     16.1 
##  8     8              0      23     74.2 
##  9     8              1       8     25.8 
## 10     9              0      29     96.7 
## 11     9              1       1      3.33

Other summaries of missingness

pedestrian %>% group_by(month) %>% miss_var_run(var = hourly_counts)
## # A tibble: 51 × 3
## # Groups:   month [12]
##    month    run_length is_na   
##    <ord>         <int> <chr>   
##  1 January        2976 complete
##  2 February       2784 complete
##  3 March          2976 complete
##  4 April           888 complete
##  5 April           552 missing 
##  6 April          1440 complete
##  7 May             744 complete
##  8 May              72 missing 
##  9 May            2160 complete
## 10 June           2880 complete
## # ℹ 41 more rows
pedestrian %>% group_by(month) %>% miss_var_span(var = hourly_counts, span_every = 2000)
## # A tibble: 25 × 7
## # Groups:   month [12]
##    month    span_counter n_miss n_complete prop_miss prop_complete n_in_span
##    <ord>           <int>  <int>      <int>     <dbl>         <dbl>     <int>
##  1 January             1      0       2000     0             1          2000
##  2 January             2      0        976     0             1           976
##  3 February            1      0       2000     0             1          2000
##  4 February            2      0        784     0             1           784
##  5 March               1      0       2000     0             1          2000
##  6 March               2      0        976     0             1           976
##  7 April               1    552       1448     0.276         0.724      2000
##  8 April               2      0        880     0             1           880
##  9 May                 1     72       1928     0.036         0.964      2000
## 10 May                 2      0        976     0             1           976
## # ℹ 15 more rows

How do we visualize missing values?

  • vis_miss(..., cluster = T/F, sort_miss = T/F) : heatmap of the missingness
  • gg_miss_var(..., order_cases = T/F, facet = ...): 각 열의 NA 개수 그래프
  • gg_miss_case: 각 행의 NA 개수 그래프
  • gg_miss_upset: show the number of combinations of missing values that co-occur
  • gg_miss_fct: factor 변수에 대한 NA 그래프
  • gg_miss_span
vis_miss(airquality, sort_miss = T, cluster = T)

gg_miss_case(airquality)

gg_miss_var(airquality)

gg_miss_upset(airquality)

gg_miss_span(pedestrian, var = hourly_counts, span_every = 3000)

gg_miss_span(pedestrian, var = hourly_counts, span_every = 1000, facet = month)

2. Wrangling and tidying up missing values

Searching for and replacing different kinds of missing values

ex) “missing” “NA” “N A” “N/A” “#N/A” “NA” …..

  • miss_scan_count(search = list(...)
  • replace_with_na(replace = list(...)
  • replace_with_na_at(.vars = ...): A subset of selected variables
  • replace_with_na_if(.predicate = ...): A subset of variables that fulfill some condition
  • replace_with_na_all(condition = ~.x = ...): All variables
# Use `replace_with_na_at()` to replace with NA
# replace_with_na_at(pacman,
#                    .vars = c("year", "month", "day"), 
#                    ~.x %in% c("N/A", "missing", "na", " "))
# 
# Use `replace_with_na_if()` to replace with NA the character values using `is.character`
# replace_with_na_if(pacman,
#                    .predicate = is.character, 
#                    ~.x %in% c("N/A", "missing", "na", " "))
# 
# Use `replace_with_na_all()` to replace with NA
# replace_with_na_all(pacman, ~.x %in% c("N/A", "missing", "na", " "))

Filling down missing values

  • complete
  • fill

해당 함수는 이미 tidyr에서 배웠기 때문에 생략

Missing Data dependence

  • MCAR: Missing completely at random
    • 특정 데이터와 관련성이 없이 완전 무작위로 발생하는 유형으로 제거해도 bias가 발생하지 않음
  • MAR: Missing at random
    • 특정 데이터와 관련성이 있지만 자체적인 이유로 발생하지는 않는 유형
  • MNAR: Missing not at random
    • 결측 자체가 특정 이유로 발생하며 제거할 경우 심각한 bias가 발생할 수 있음

위 3가지 type을 구분하는 것은 중요하지만 쉽지 않다.

gg_miss_var(oceanbuoys, facet = year)

year 변수 별로 humidityair_temp_cclustering을 보이기에 oceanbuoysMAR 데이터이다.

3. Testing missing relationship

Shadow matrix

shadow matrix는 1과 0의 데이터프레임으로, 각각 값이 누락되었는지(1) 아닌지(0)를 보여준다.

  • as_shadow
  • bind_shadow(only_miss = T/F): 기존 데이터프레임과 shadow 데이터프레임을 조인하여 nabular 생성
  • add_label_shadow(): missing value 존재 여부 칼럼 생성
oceanbuoys %>% 
  bind_shadow(only_miss = T) %>% 
  group_by(air_temp_c_NA) %>% 
  summarise(
    wind_ew_mean = mean(wind_ew),
    wind_ew_sd = sd(wind_ew),
    n_obs = n()
  )
## # A tibble: 2 × 4
##   air_temp_c_NA wind_ew_mean wind_ew_sd n_obs
##   <fct>                <dbl>      <dbl> <int>
## 1 !NA                  -3.91       1.85   655
## 2 NA                   -2.17       2.14    81

Visualizing missingness across one variables

  • Explore visualization:
    • densities
    • box plots
    • different methods of splitting the visualization
airquality %>% 
  bind_shadow() %>% 
  ggplot(aes(x = Temp, color = Ozone_NA)) +
  geom_density() +
  labs(title = "Using geom_density")

airquality %>% 
  bind_shadow() %>% 
  ggplot(aes(x = Ozone_NA, y = Temp)) +
  geom_boxplot() +
  labs(title = "Using geom_boxplot")

airquality %>% 
  bind_shadow() %>% 
  ggplot(aes(x = Temp, y = Wind)) +
  geom_point() +
  facet_wrap(~ Ozone_NA) +
  labs(title = "Using geom_point")

oceanbuoys %>%
  bind_shadow() %>%
  ggplot(aes(x = air_temp_c_NA,
             y = wind_ew)) + 
  geom_boxplot() +
  facet_wrap(~humidity_NA)

Visualizing missingness across two variables

  • geom_miss_point(): geom_point와 다르게 NA도 표시
ggplot(airquality,
       aes(x = Ozone, y = Solar.R)) +
  # NA를 여백에 표시
  geom_miss_point()

ggplot(airquality, 
       aes(x = Wind, y = Ozone)) +
  geom_miss_point() +
  facet_wrap(~Month)

bind_shadow(oceanbuoys) %>%
  ggplot(aes(x = wind_ew,
             y = air_temp_c)) + 
  geom_miss_point() + 
  facet_grid(humidity_NA~year)

4. Connecting the dots(Imputation)

Filling in the blanks

  • impute_below() : 결측치들을 관측치보다 작은 값으로 filling
    • impute_below_if()
    • impute_below_at()
    • impute_below_all()
ocean_imp_track <- oceanbuoys %>% 
  bind_shadow() %>% 
  add_label_shadow() %>% 
  impute_below_all()

ocean_imp_track
## # A tibble: 736 × 17
##     year latitude longitude sea_temp_c air_temp_c humidity wind_ew wind_ns
##    <dbl>    <dbl>     <dbl>      <dbl>      <dbl>    <dbl>   <dbl>   <dbl>
##  1  1997        0      -110       27.6       27.1     79.6   -6.40    5.40
##  2  1997        0      -110       27.5       27.0     75.8   -5.30    5.30
##  3  1997        0      -110       27.6       27       76.5   -5.10    4.5 
##  4  1997        0      -110       27.6       26.9     76.2   -4.90    2.5 
##  5  1997        0      -110       27.6       26.8     76.4   -3.5     4.10
##  6  1997        0      -110       27.8       26.9     76.7   -4.40    1.60
##  7  1997        0      -110       28.0       27.0     76.5   -2       3.5 
##  8  1997        0      -110       28.0       27.1     78.3   -3.70    4.5 
##  9  1997        0      -110       28.0       27.2     78.6   -4.20    5   
## 10  1997        0      -110       28.0       27.2     76.9   -3.60    3.5 
## # ℹ 726 more rows
## # ℹ 9 more variables: year_NA <fct>, latitude_NA <fct>, longitude_NA <fct>,
## #   sea_temp_c_NA <fct>, air_temp_c_NA <fct>, humidity_NA <fct>,
## #   wind_ew_NA <fct>, wind_ns_NA <fct>, any_missing <chr>
ggplot(ocean_imp_track, aes(x = air_temp_c, fill = air_temp_c_NA)) +
  geom_histogram() +
  facet_wrap(~year)

What makes a good imputation

unlikely values… ex) pregnant males, or winter temperatures on a warm summers day

The mean

  • impute_mean
    • impute_mean_if
    • impute_mean_at
    • impute_mean_all
ocean_imp_mean <- bind_shadow(oceanbuoys) %>% 
  impute_mean_all() %>% 
  add_label_shadow()

ocean_imp_mean
## # A tibble: 736 × 17
##     year latitude longitude sea_temp_c air_temp_c humidity wind_ew wind_ns
##    <dbl>    <dbl>     <dbl>      <dbl>      <dbl>    <dbl>   <dbl>   <dbl>
##  1  1997        0      -110       27.6       27.1     79.6   -6.40    5.40
##  2  1997        0      -110       27.5       27.0     75.8   -5.30    5.30
##  3  1997        0      -110       27.6       27       76.5   -5.10    4.5 
##  4  1997        0      -110       27.6       26.9     76.2   -4.90    2.5 
##  5  1997        0      -110       27.6       26.8     76.4   -3.5     4.10
##  6  1997        0      -110       27.8       26.9     76.7   -4.40    1.60
##  7  1997        0      -110       28.0       27.0     76.5   -2       3.5 
##  8  1997        0      -110       28.0       27.1     78.3   -3.70    4.5 
##  9  1997        0      -110       28.0       27.2     78.6   -4.20    5   
## 10  1997        0      -110       28.0       27.2     76.9   -3.60    3.5 
## # ℹ 726 more rows
## # ℹ 9 more variables: year_NA <fct>, latitude_NA <fct>, longitude_NA <fct>,
## #   sea_temp_c_NA <fct>, air_temp_c_NA <fct>, humidity_NA <fct>,
## #   wind_ew_NA <fct>, wind_ns_NA <fct>, any_missing <chr>
ggplot(ocean_imp_mean, aes(x = humidity_NA, y = humidity)) +
  geom_boxplot()

One way to evaluate the appropriateness of the scale of the imputations is to use a scatter plot to explore whether or not the values are appropriate.

ggplot(ocean_imp_mean, 
       aes(x = air_temp_c, y = humidity, color = any_missing)) + 
  geom_point() +
  facet_wrap(~year)

Across many variables?

  • shadow_long()
ocean_imp_mean_gather <- shadow_long(ocean_imp_mean,
                                     humidity,
                                     air_temp_c)
ocean_imp_mean_gather$value <- as.double(ocean_imp_mean_gather$value)
ocean_imp_mean_gather
## # A tibble: 1,472 × 4
##    variable   value variable_NA   value_NA
##    <chr>      <dbl> <chr>         <fct>   
##  1 air_temp_c  27.1 air_temp_c_NA !NA     
##  2 humidity    79.6 humidity_NA   !NA     
##  3 air_temp_c  27.0 air_temp_c_NA !NA     
##  4 humidity    75.8 humidity_NA   !NA     
##  5 air_temp_c  27   air_temp_c_NA !NA     
##  6 humidity    76.5 humidity_NA   !NA     
##  7 air_temp_c  26.9 air_temp_c_NA !NA     
##  8 humidity    76.2 humidity_NA   !NA     
##  9 air_temp_c  26.8 air_temp_c_NA !NA     
## 10 humidity    76.4 humidity_NA   !NA     
## # ℹ 1,462 more rows
ggplot(ocean_imp_mean_gather, 
       aes(x = value, fill = value_NA)) + 
  geom_histogram() + 
  facet_wrap(~variable)

Performing imputations with different models

방금 설명한 평균을 사용한 impuation은 인위적으로 평균을 증가시키고 분산을 감소시키기 때문에 bad imputation이다.

  • impute_lm in simputation package
## mean imputation
ocean_imp_mean <- bind_shadow(oceanbuoys) %>% 
  impute_mean_all() %>% 
  add_label_shadow()
## lm imputation(2 vars)
ocean_imp_lm_wind <- bind_shadow(oceanbuoys) %>% 
  impute_lm(air_temp_c ~ wind_ew + wind_ns) %>% 
  impute_lm(humidity ~ wind_ew + wind_ns) %>% 
  add_label_shadow()
## lm imputation(3 vars)
ocean_imp_lm_wind_year <- bind_shadow(oceanbuoys) %>% 
  impute_lm(air_temp_c ~ wind_ew + wind_ns + year) %>% 
  impute_lm(humidity ~ wind_ew + wind_ns + year) %>% 
  add_label_shadow()
## bind rows
bound_models <- bind_rows(mean = ocean_imp_mean,
                          lm_wind = ocean_imp_lm_wind,
                          lm_wind_year = ocean_imp_lm_wind_year,
                          .id = "imp_model")
## ggplot
ggplot(bound_models, aes(x = air_temp_c, y = humidity, color = any_missing)) +
  geom_point() +
  facet_wrap(~imp_model)

Evaluating imputations and models

lm(Temp ~ Ozone + Solar.R + Wind + Month + day, data = airquality)

다음과 같이 regression을 실시할 때 우리는 missing values를 어떻게 처리해야 할까?
2가지 방법이 있다.

  1. Complete case analysis, where we remove all rows that contain a missing value
  2. Imputing data using the linear model imputation - more better!!
# Create the model summary for each dataset
model_summary <- bound_models %>% 
  group_by(imp_model) %>% 
  nest() %>% 
  mutate(mod = map(data, ~lm(sea_temp_c ~ air_temp_c + humidity + year, data = .)),
         res = map(mod, residuals),
         pred = map(mod, predict),
         tidy = map(mod, tidy))
# Explore the coefficients in the model
model_summary %>% 
  select(imp_model, tidy) %>% 
  unnest(tidy)
## # A tibble: 12 × 6
## # Groups:   imp_model [3]
##    imp_model    term          estimate std.error statistic   p.value
##    <chr>        <chr>            <dbl>     <dbl>     <dbl>     <dbl>
##  1 mean         (Intercept) -1589.      56.7        -28.0  6.84e-118
##  2 mean         air_temp_c      0.441    0.0285      15.5  5.33e- 47
##  3 mean         humidity        0.0161   0.00675      2.39 1.71e-  2
##  4 mean         year            0.803    0.0286      28.1  3.51e-118
##  5 lm_wind      (Intercept) -1742.      56.1        -31.0  1.83e-135
##  6 lm_wind      air_temp_c      0.365    0.0279      13.1  2.73e- 35
##  7 lm_wind      humidity        0.0225   0.00690      3.26 1.17e-  3
##  8 lm_wind      year            0.880    0.0283      31.1  6.79e-136
##  9 lm_wind_year (Intercept)  -614.      48.5        -12.7  2.49e- 33
## 10 lm_wind_year air_temp_c      0.927    0.0235      39.5  3.25e-183
## 11 lm_wind_year humidity        0.0221   0.00427      5.17 3.05e-  7
## 12 lm_wind_year year            0.308    0.0245      12.6  7.14e- 33

linear model을 적용한 회귀식이 가장 best하다.

추천 패키지 및 책

  • visdat package
  • mice package
  • Stefan van Buuren’s Flexible imputation of missing data