This report is a summary of lesson by Data camp
NA: Not Available, missing values are values that
should have been recorded but were notNaN: Not a number, R에서는 missing value로 취급NULL: NOT missing valueInfany_na: 전체 포함 여부,
TRUE/FALSE 반환are_na: 개별 여부n_miss: 전체 NA 개수
n_complete: 전체 NA가 아닌 개수prop_miss: 전체 NA 비율
prop_complete: 전체 `NA``가 아닌 비율group_by와 같이 사용하면 좋음
miss_var_summary : 변수 기준(종 방향)
miss_case_summary : 관측치 기준(행 방향)
miss_var_table
miss_case_table
miss_var_span(data, col = ..., span_every = ...) :
기간 별 missing value
miss_case_span
miss_var_run : complete와 missing 연속 길이를
반환하며 missing pattern을 찾는데 유용
NA + anything = NANA|TRUE = TRUENA|FALSE = NANA + NaN = NANaN + NA = NaNairquality %>% group_by(Month) %>% miss_var_summary()
## # A tibble: 25 × 4
## # Groups: Month [5]
## Month variable n_miss pct_miss
## <int> <chr> <int> <num>
## 1 5 Ozone 5 16.1
## 2 5 Solar.R 4 12.9
## 3 5 Wind 0 0
## 4 5 Temp 0 0
## 5 5 Day 0 0
## 6 6 Ozone 21 70
## 7 6 Solar.R 0 0
## 8 6 Wind 0 0
## 9 6 Temp 0 0
## 10 6 Day 0 0
## # ℹ 15 more rows
airquality %>% group_by(Month) %>% miss_case_summary()
## # A tibble: 153 × 4
## # Groups: Month [5]
## Month case n_miss pct_miss
## <int> <int> <int> <dbl>
## 1 5 5 2 40
## 2 5 27 2 40
## 3 5 6 1 20
## 4 5 10 1 20
## 5 5 11 1 20
## 6 5 25 1 20
## 7 5 26 1 20
## 8 5 1 0 0
## 9 5 2 0 0
## 10 5 3 0 0
## # ℹ 143 more rows
airquality %>% group_by(Month) %>% miss_var_table()
## # A tibble: 12 × 4
## # Groups: Month [5]
## Month n_miss_in_var n_vars pct_vars
## <int> <int> <int> <dbl>
## 1 5 0 3 60
## 2 5 4 1 20
## 3 5 5 1 20
## 4 6 0 4 80
## 5 6 21 1 20
## 6 7 0 4 80
## 7 7 5 1 20
## 8 8 0 3 60
## 9 8 3 1 20
## 10 8 5 1 20
## 11 9 0 4 80
## 12 9 1 1 20
airquality %>% group_by(Month) %>% miss_case_table()
## # A tibble: 11 × 4
## # Groups: Month [5]
## Month n_miss_in_case n_cases pct_cases
## <int> <int> <int> <dbl>
## 1 5 0 24 77.4
## 2 5 1 5 16.1
## 3 5 2 2 6.45
## 4 6 0 9 30
## 5 6 1 21 70
## 6 7 0 26 83.9
## 7 7 1 5 16.1
## 8 8 0 23 74.2
## 9 8 1 8 25.8
## 10 9 0 29 96.7
## 11 9 1 1 3.33
pedestrian %>% group_by(month) %>% miss_var_run(var = hourly_counts)
## # A tibble: 51 × 3
## # Groups: month [12]
## month run_length is_na
## <ord> <int> <chr>
## 1 January 2976 complete
## 2 February 2784 complete
## 3 March 2976 complete
## 4 April 888 complete
## 5 April 552 missing
## 6 April 1440 complete
## 7 May 744 complete
## 8 May 72 missing
## 9 May 2160 complete
## 10 June 2880 complete
## # ℹ 41 more rows
pedestrian %>% group_by(month) %>% miss_var_span(var = hourly_counts, span_every = 2000)
## # A tibble: 25 × 7
## # Groups: month [12]
## month span_counter n_miss n_complete prop_miss prop_complete n_in_span
## <ord> <int> <int> <int> <dbl> <dbl> <int>
## 1 January 1 0 2000 0 1 2000
## 2 January 2 0 976 0 1 976
## 3 February 1 0 2000 0 1 2000
## 4 February 2 0 784 0 1 784
## 5 March 1 0 2000 0 1 2000
## 6 March 2 0 976 0 1 976
## 7 April 1 552 1448 0.276 0.724 2000
## 8 April 2 0 880 0 1 880
## 9 May 1 72 1928 0.036 0.964 2000
## 10 May 2 0 976 0 1 976
## # ℹ 15 more rows
vis_miss(..., cluster = T/F, sort_miss = T/F) : heatmap
of the missingnessgg_miss_var(..., order_cases = T/F, facet = ...): 각
열의 NA 개수 그래프gg_miss_case: 각 행의 NA 개수 그래프gg_miss_upset: show the number of combinations of
missing values that co-occurgg_miss_fct: factor 변수에 대한 NA 그래프gg_miss_spanvis_miss(airquality, sort_miss = T, cluster = T)
gg_miss_case(airquality)
gg_miss_var(airquality)
gg_miss_upset(airquality)
gg_miss_span(pedestrian, var = hourly_counts, span_every = 3000)
gg_miss_span(pedestrian, var = hourly_counts, span_every = 1000, facet = month)
ex) “missing” “NA” “N A” “N/A” “#N/A” “NA” …..
miss_scan_count(search = list(...)replace_with_na(replace = list(...)replace_with_na_at(.vars = ...): A subset of selected
variablesreplace_with_na_if(.predicate = ...): A subset of
variables that fulfill some conditionreplace_with_na_all(condition = ~.x = ...): All
variables# Use `replace_with_na_at()` to replace with NA
# replace_with_na_at(pacman,
# .vars = c("year", "month", "day"),
# ~.x %in% c("N/A", "missing", "na", " "))
#
# Use `replace_with_na_if()` to replace with NA the character values using `is.character`
# replace_with_na_if(pacman,
# .predicate = is.character,
# ~.x %in% c("N/A", "missing", "na", " "))
#
# Use `replace_with_na_all()` to replace with NA
# replace_with_na_all(pacman, ~.x %in% c("N/A", "missing", "na", " "))
completefill해당 함수는 이미 tidyr에서 배웠기 때문에 생략
MCAR: Missing completely at random
MAR: Missing at random
MNAR: Missing not at random
위 3가지 type을 구분하는 것은 중요하지만 쉽지 않다.
gg_miss_var(oceanbuoys, facet = year)
year 변수 별로 humidity와
air_temp_c가 clustering을 보이기에
oceanbuoys는 MAR 데이터이다.
shadow matrix는 1과 0의 데이터프레임으로, 각각 값이
누락되었는지(1) 아닌지(0)를 보여준다.
as_shadowbind_shadow(only_miss = T/F): 기존 데이터프레임과
shadow 데이터프레임을 조인하여 nabular 생성add_label_shadow(): missing value 존재 여부 칼럼
생성oceanbuoys %>%
bind_shadow(only_miss = T) %>%
group_by(air_temp_c_NA) %>%
summarise(
wind_ew_mean = mean(wind_ew),
wind_ew_sd = sd(wind_ew),
n_obs = n()
)
## # A tibble: 2 × 4
## air_temp_c_NA wind_ew_mean wind_ew_sd n_obs
## <fct> <dbl> <dbl> <int>
## 1 !NA -3.91 1.85 655
## 2 NA -2.17 2.14 81
airquality %>%
bind_shadow() %>%
ggplot(aes(x = Temp, color = Ozone_NA)) +
geom_density() +
labs(title = "Using geom_density")
airquality %>%
bind_shadow() %>%
ggplot(aes(x = Ozone_NA, y = Temp)) +
geom_boxplot() +
labs(title = "Using geom_boxplot")
airquality %>%
bind_shadow() %>%
ggplot(aes(x = Temp, y = Wind)) +
geom_point() +
facet_wrap(~ Ozone_NA) +
labs(title = "Using geom_point")
oceanbuoys %>%
bind_shadow() %>%
ggplot(aes(x = air_temp_c_NA,
y = wind_ew)) +
geom_boxplot() +
facet_wrap(~humidity_NA)
geom_miss_point(): geom_point와 다르게
NA도 표시ggplot(airquality,
aes(x = Ozone, y = Solar.R)) +
# NA를 여백에 표시
geom_miss_point()
ggplot(airquality,
aes(x = Wind, y = Ozone)) +
geom_miss_point() +
facet_wrap(~Month)
bind_shadow(oceanbuoys) %>%
ggplot(aes(x = wind_ew,
y = air_temp_c)) +
geom_miss_point() +
facet_grid(humidity_NA~year)
impute_below() : 결측치들을 관측치보다 작은
값으로 filling
impute_below_if()impute_below_at()impute_below_all()ocean_imp_track <- oceanbuoys %>%
bind_shadow() %>%
add_label_shadow() %>%
impute_below_all()
ocean_imp_track
## # A tibble: 736 × 17
## year latitude longitude sea_temp_c air_temp_c humidity wind_ew wind_ns
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1997 0 -110 27.6 27.1 79.6 -6.40 5.40
## 2 1997 0 -110 27.5 27.0 75.8 -5.30 5.30
## 3 1997 0 -110 27.6 27 76.5 -5.10 4.5
## 4 1997 0 -110 27.6 26.9 76.2 -4.90 2.5
## 5 1997 0 -110 27.6 26.8 76.4 -3.5 4.10
## 6 1997 0 -110 27.8 26.9 76.7 -4.40 1.60
## 7 1997 0 -110 28.0 27.0 76.5 -2 3.5
## 8 1997 0 -110 28.0 27.1 78.3 -3.70 4.5
## 9 1997 0 -110 28.0 27.2 78.6 -4.20 5
## 10 1997 0 -110 28.0 27.2 76.9 -3.60 3.5
## # ℹ 726 more rows
## # ℹ 9 more variables: year_NA <fct>, latitude_NA <fct>, longitude_NA <fct>,
## # sea_temp_c_NA <fct>, air_temp_c_NA <fct>, humidity_NA <fct>,
## # wind_ew_NA <fct>, wind_ns_NA <fct>, any_missing <chr>
ggplot(ocean_imp_track, aes(x = air_temp_c, fill = air_temp_c_NA)) +
geom_histogram() +
facet_wrap(~year)
unlikely values… ex) pregnant males, or winter temperatures on a warm summers day
impute_mean
impute_mean_ifimpute_mean_atimpute_mean_allocean_imp_mean <- bind_shadow(oceanbuoys) %>%
impute_mean_all() %>%
add_label_shadow()
ocean_imp_mean
## # A tibble: 736 × 17
## year latitude longitude sea_temp_c air_temp_c humidity wind_ew wind_ns
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1997 0 -110 27.6 27.1 79.6 -6.40 5.40
## 2 1997 0 -110 27.5 27.0 75.8 -5.30 5.30
## 3 1997 0 -110 27.6 27 76.5 -5.10 4.5
## 4 1997 0 -110 27.6 26.9 76.2 -4.90 2.5
## 5 1997 0 -110 27.6 26.8 76.4 -3.5 4.10
## 6 1997 0 -110 27.8 26.9 76.7 -4.40 1.60
## 7 1997 0 -110 28.0 27.0 76.5 -2 3.5
## 8 1997 0 -110 28.0 27.1 78.3 -3.70 4.5
## 9 1997 0 -110 28.0 27.2 78.6 -4.20 5
## 10 1997 0 -110 28.0 27.2 76.9 -3.60 3.5
## # ℹ 726 more rows
## # ℹ 9 more variables: year_NA <fct>, latitude_NA <fct>, longitude_NA <fct>,
## # sea_temp_c_NA <fct>, air_temp_c_NA <fct>, humidity_NA <fct>,
## # wind_ew_NA <fct>, wind_ns_NA <fct>, any_missing <chr>
ggplot(ocean_imp_mean, aes(x = humidity_NA, y = humidity)) +
geom_boxplot()
One way to evaluate the appropriateness of the scale of the imputations is to use a scatter plot to explore whether or not the values are appropriate.
ggplot(ocean_imp_mean,
aes(x = air_temp_c, y = humidity, color = any_missing)) +
geom_point() +
facet_wrap(~year)
Across many variables?
shadow_long()ocean_imp_mean_gather <- shadow_long(ocean_imp_mean,
humidity,
air_temp_c)
ocean_imp_mean_gather$value <- as.double(ocean_imp_mean_gather$value)
ocean_imp_mean_gather
## # A tibble: 1,472 × 4
## variable value variable_NA value_NA
## <chr> <dbl> <chr> <fct>
## 1 air_temp_c 27.1 air_temp_c_NA !NA
## 2 humidity 79.6 humidity_NA !NA
## 3 air_temp_c 27.0 air_temp_c_NA !NA
## 4 humidity 75.8 humidity_NA !NA
## 5 air_temp_c 27 air_temp_c_NA !NA
## 6 humidity 76.5 humidity_NA !NA
## 7 air_temp_c 26.9 air_temp_c_NA !NA
## 8 humidity 76.2 humidity_NA !NA
## 9 air_temp_c 26.8 air_temp_c_NA !NA
## 10 humidity 76.4 humidity_NA !NA
## # ℹ 1,462 more rows
ggplot(ocean_imp_mean_gather,
aes(x = value, fill = value_NA)) +
geom_histogram() +
facet_wrap(~variable)
방금 설명한 평균을 사용한 impuation은 인위적으로 평균을 증가시키고 분산을 감소시키기 때문에 bad imputation이다.
impute_lm in simputation package## mean imputation
ocean_imp_mean <- bind_shadow(oceanbuoys) %>%
impute_mean_all() %>%
add_label_shadow()
## lm imputation(2 vars)
ocean_imp_lm_wind <- bind_shadow(oceanbuoys) %>%
impute_lm(air_temp_c ~ wind_ew + wind_ns) %>%
impute_lm(humidity ~ wind_ew + wind_ns) %>%
add_label_shadow()
## lm imputation(3 vars)
ocean_imp_lm_wind_year <- bind_shadow(oceanbuoys) %>%
impute_lm(air_temp_c ~ wind_ew + wind_ns + year) %>%
impute_lm(humidity ~ wind_ew + wind_ns + year) %>%
add_label_shadow()
## bind rows
bound_models <- bind_rows(mean = ocean_imp_mean,
lm_wind = ocean_imp_lm_wind,
lm_wind_year = ocean_imp_lm_wind_year,
.id = "imp_model")
## ggplot
ggplot(bound_models, aes(x = air_temp_c, y = humidity, color = any_missing)) +
geom_point() +
facet_wrap(~imp_model)
lm(Temp ~ Ozone + Solar.R + Wind + Month + day, data = airquality)
다음과 같이 regression을 실시할 때 우리는 missing values를 어떻게
처리해야 할까?
2가지 방법이 있다.
# Create the model summary for each dataset
model_summary <- bound_models %>%
group_by(imp_model) %>%
nest() %>%
mutate(mod = map(data, ~lm(sea_temp_c ~ air_temp_c + humidity + year, data = .)),
res = map(mod, residuals),
pred = map(mod, predict),
tidy = map(mod, tidy))
# Explore the coefficients in the model
model_summary %>%
select(imp_model, tidy) %>%
unnest(tidy)
## # A tibble: 12 × 6
## # Groups: imp_model [3]
## imp_model term estimate std.error statistic p.value
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 mean (Intercept) -1589. 56.7 -28.0 6.84e-118
## 2 mean air_temp_c 0.441 0.0285 15.5 5.33e- 47
## 3 mean humidity 0.0161 0.00675 2.39 1.71e- 2
## 4 mean year 0.803 0.0286 28.1 3.51e-118
## 5 lm_wind (Intercept) -1742. 56.1 -31.0 1.83e-135
## 6 lm_wind air_temp_c 0.365 0.0279 13.1 2.73e- 35
## 7 lm_wind humidity 0.0225 0.00690 3.26 1.17e- 3
## 8 lm_wind year 0.880 0.0283 31.1 6.79e-136
## 9 lm_wind_year (Intercept) -614. 48.5 -12.7 2.49e- 33
## 10 lm_wind_year air_temp_c 0.927 0.0235 39.5 3.25e-183
## 11 lm_wind_year humidity 0.0221 0.00427 5.17 3.05e- 7
## 12 lm_wind_year year 0.308 0.0245 12.6 7.14e- 33
linear model을 적용한 회귀식이 가장 best하다.
추천 패키지 및 책
visdat packagemice packageFlexible imputation of missing data