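Exploratory data analysis (EDA) is the first step of any data analysis, and arguably the most important one. It shows us what state the data is in, surfaces data problems, and guides the data processing that follows.
EDA starts from the raw data and uses summaries and visualization to examine the overall shape of the data, the type and distribution of each variable, the relationships between variables, and common data problems (missing values, invalid values, outliers, value ranges, units, and so on). The goal is to understand the data as fully as possible and to safeguard data quality.
DataExplorer is an R package for EDA that makes this work faster and more efficient.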
if(!require(DataExplorer)){
install.packages("DataExplorer")
require(DataExplorer)
}
## Loading required package: DataExplorer
library(pacman)
p_load(DataExplorer,tidyverse,funModeling,nycflights13)
To quickly visualize the structure of all the tables, preview each one and then pass them to plot_str together:
airlines %>% head()
## # A tibble: 6 x 2
## carrier name
## <chr> <chr>
## 1 9E Endeavor Air Inc.
## 2 AA American Airlines Inc.
## 3 AS Alaska Airlines Inc.
## 4 B6 JetBlue Airways
## 5 DL Delta Air Lines Inc.
## 6 EV ExpressJet Airlines Inc.
airports %>% head()
## # A tibble: 6 x 8
## faa name lat lon alt tz dst tzone
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A America/New_Y~
## 2 06A Moton Field Municipal Airp~ 32.5 -85.7 264 -6 A America/Chica~
## 3 06C Schaumburg Regional 42.0 -88.1 801 -6 A America/Chica~
## 4 06N Randall Airport 41.4 -74.4 523 -5 A America/New_Y~
## 5 09J Jekyll Island Airport 31.1 -81.4 11 -5 A America/New_Y~
## 6 0A9 Elizabethton Municipal Air~ 36.4 -82.2 1593 -5 A America/New_Y~
flights %>% head()
## # A tibble: 6 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## # ... with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
planes %>% head()
## # A tibble: 6 x 9
## tailnum year type manufacturer model engines seats speed engine
## <chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr>
## 1 N10156 2004 Fixed wing mu~ EMBRAER EMB-1~ 2 55 NA Turbo-~
## 2 N102UW 1998 Fixed wing mu~ AIRBUS INDUST~ A320-~ 2 182 NA Turbo-~
## 3 N103US 1999 Fixed wing mu~ AIRBUS INDUST~ A320-~ 2 182 NA Turbo-~
## 4 N104UW 1999 Fixed wing mu~ AIRBUS INDUST~ A320-~ 2 182 NA Turbo-~
## 5 N10575 2002 Fixed wing mu~ EMBRAER EMB-1~ 2 55 NA Turbo-~
## 6 N105UW 1999 Fixed wing mu~ AIRBUS INDUST~ A320-~ 2 182 NA Turbo-~
weather %>% head()
## # A tibble: 6 x 15
## origin year month day hour temp dewp humid wind_dir wind_speed wind_gust
## <chr> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 EWR 2013 1 1 1 39.0 26.1 59.4 270 10.4 NA
## 2 EWR 2013 1 1 2 39.0 27.0 61.6 250 8.06 NA
## 3 EWR 2013 1 1 3 39.0 28.0 64.4 240 11.5 NA
## 4 EWR 2013 1 1 4 39.9 28.0 62.2 250 12.7 NA
## 5 EWR 2013 1 1 5 39.0 28.0 64.4 260 12.7 NA
## 6 EWR 2013 1 1 6 37.9 28.0 67.2 240 11.5 NA
## # ... with 4 more variables: precip <dbl>, pressure <dbl>, visib <dbl>,
## # time_hour <dttm>
data_list <- list(airlines = airlines, airports = airports, flights = flights, planes = planes,weather = weather)
plot_str(data_list)
map(data_list,colnames)
## $airlines
## [1] "carrier" "name"
##
## $airports
## [1] "faa" "name" "lat" "lon" "alt" "tz" "dst" "tzone"
##
## $flights
## [1] "year" "month" "day" "dep_time"
## [5] "sched_dep_time" "dep_delay" "arr_time" "sched_arr_time"
## [9] "arr_delay" "carrier" "flight" "tailnum"
## [13] "origin" "dest" "air_time" "distance"
## [17] "hour" "minute" "time_hour"
##
## $planes
## [1] "tailnum" "year" "type" "manufacturer" "model"
## [6] "engines" "seats" "speed" "engine"
##
## $weather
## [1] "origin" "year" "month" "day" "hour"
## [6] "temp" "dewp" "humid" "wind_dir" "wind_speed"
## [11] "wind_gust" "precip" "pressure" "visib" "time_hour"
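The column listings above are what suggest the join keys used in the merges. As a rough sketch in base R, the shared column names can also be found programmatically. The toy tables below are illustrative stand-ins, not the real nycflights13 data, and `shared_with_flights` is a name made up for this example.

```r
# Toy stand-ins for three of the tables; only the column names matter here.
tables <- list(
  flights = data.frame(carrier = "AA", tailnum = "N10156",
                       origin = "EWR", dest = "IAH"),
  airlines = data.frame(carrier = "AA", name = "American Airlines Inc."),
  planes = data.frame(tailnum = "N10156", year = 2004L)
)

# For every table other than `flights`, list the columns it shares with it.
shared_with_flights <- lapply(
  tables[setdiff(names(tables), "flights")],
  function(tbl) intersect(names(tables$flights), names(tbl))
)
print(shared_with_flights)
```

A shared name is only a hint. In the real data, flights and planes both have a year column with different meanings, and airports is joined through faa against origin and dest, which no name comparison would reveal.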
merge_airlines <- merge(flights, airlines, by = "carrier", all.x = TRUE)
merge_planes <- merge(merge_airlines, planes, by = "tailnum", all.x = TRUE, suffixes = c("_flights", "_planes"))
merge_airports_origin <- merge(merge_planes, airports, by.x = "origin", by.y = "faa", all.x = TRUE, suffixes = c("_carrier", "_origin"))
final_data <- merge(merge_airports_origin, airports, by.x = "dest", by.y = "faa", all.x = TRUE, suffixes = c("_origin", "_dest"))
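To see what `all.x = TRUE` and `suffixes` do in the merges above, here is a minimal sketch on two made-up tables (`flights_toy` and `planes_toy` are hypothetical, not part of nycflights13). Every row of the left table is kept, and the clashing `year` columns are renamed with the given suffixes.

```r
# Left table: three flights, one of them with a plane not listed on the right.
flights_toy <- data.frame(
  tailnum = c("N1", "N2", "N3"),
  year = c(2013L, 2013L, 2013L),
  stringsAsFactors = FALSE
)

# Right table: year here means year of manufacture, not year of the flight.
planes_toy <- data.frame(
  tailnum = c("N1", "N2"),
  year = c(2004L, 1998L),
  stringsAsFactors = FALSE
)

# Left join: all.x = TRUE keeps unmatched rows from flights_toy and fills
# the right-hand columns with NA; suffixes disambiguate the two `year` columns.
merged_toy <- merge(
  flights_toy, planes_toy,
  by = "tailnum", all.x = TRUE,
  suffixes = c("_flights", "_planes")
)
print(merged_toy)
```

The row for N3 survives with `year_planes` set to NA, which is one way left joins can introduce missing values into a merged data set.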
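Exploratory data analysis is the process of getting to know your data so that you can generate and test hypotheses. Visualization techniques are usually applied.
Introduce the newly created data set: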
DataExplorer::introduce(final_data) %>% t() %>% knitr::kable(col.names = "introduce")
| | introduce |
|---|---|
| rows | 336776 |
| columns | 42 |
| discrete_columns | 16 |
| continuous_columns | 26 |
| all_missing_columns | 0 |
| total_missing_values | 809170 |
| complete_rows | 906 |
| total_observations | 14144592 |
| memory_usage | 97254656 |
Visualize the table above:
plot_intro(final_data) + theme(plot.title = element_text(hjust = 0.5))
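You should immediately notice a few surprises: only 906 of the 336,776 rows (about 0.3%) are complete, and 809,170 of the 14,144,592 observations (about 5.7%) are missing.
Real-world data is messy. The plot_missing function visualizes the share of missing values for each feature.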
DataExplorer::plot_missing(final_data)
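The chart shows that the speed variable is mostly missing and probably carries no information. It looks like we have found the culprit behind the 0.3% complete rows. Let's drop it: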
final_data <- drop_columns(final_data, "speed")
DataExplorer::plot_missing(final_data)
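The same check can be approximated without any package. The sketch below uses a small invented data frame (`toy`) and an arbitrary 50% threshold chosen only for illustration. It computes the share of missing values per column, drops the columns above the threshold, and compares the share of complete rows before and after.

```r
# Invented data: `speed` is mostly missing, `seats` has one gap.
toy <- data.frame(
  speed = c(NA, NA, NA, 90),
  seats = c(55, 182, NA, 182),
  engines = c(2, 2, 2, 2)
)

# Fraction of missing values per column, sorted from worst to best.
missing_share <- sort(colMeans(is.na(toy)), decreasing = TRUE)
print(missing_share)

# Drop columns above an arbitrary, illustrative threshold of 50% missing.
threshold <- 0.5
keep <- names(missing_share)[missing_share <= threshold]
cleaned <- toy[, keep, drop = FALSE]

# Share of complete rows before and after dropping the sparse column.
print(mean(complete.cases(toy)))
print(mean(complete.cases(cleaned)))
```

A single sparse column can make almost every row incomplete, which is why removing it raises the share of complete rows so sharply.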
Visualize the frequency distribution of all discrete features:
plot_bar(final_data)
## 5 columns ignored with more than 50 categories.
## dest: 105 categories
## tailnum: 4044 categories
## time_hour: 6936 categories
## model: 128 categories
## name: 102 categories
Looking closely at the manufacturer variable, it is easy to spot duplicated names, which are merged as follows:
final_data[which(final_data$manufacturer == "AIRBUS INDUSTRIE"),]$manufacturer <- "AIRBUS"
final_data[which(final_data$manufacturer == "CANADAIR LTD"),]$manufacturer <- "CANADAIR"
final_data[which(final_data$manufacturer %in% c("MCDONNELL DOUGLAS AIRCRAFT CO", "MCDONNELL DOUGLAS CORPORATION")),]$manufacturer <- "MCDONNELL DOUGLAS"
plot_bar(final_data$manufacturer)
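When the list of spellings to merge grows, one alternative to repeating an assignment per name is a lookup vector. This is only a sketch on a made-up character vector, not the author's method. The mappings are the same ones used above, and `canonical` and `needs_fix` are names introduced for this example.

```r
# Made-up sample of manufacturer values, including a missing one.
manufacturer <- c("AIRBUS INDUSTRIE", "AIRBUS", "CANADAIR LTD",
                  "MCDONNELL DOUGLAS AIRCRAFT CO", "BOEING", NA)

# Named vector: old spelling -> canonical name (same mappings as in the text).
canonical <- c(
  "AIRBUS INDUSTRIE" = "AIRBUS",
  "CANADAIR LTD" = "CANADAIR",
  "MCDONNELL DOUGLAS AIRCRAFT CO" = "MCDONNELL DOUGLAS",
  "MCDONNELL DOUGLAS CORPORATION" = "MCDONNELL DOUGLAS"
)

# Replace only the values that have an entry in the lookup; others stay as-is.
needs_fix <- manufacturer %in% names(canonical)
manufacturer[needs_fix] <- unname(canonical[manufacturer[needs_fix]])
print(table(manufacturer, useNA = "ifany"))
```

Values without an entry in the lookup, including NA, pass through unchanged.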
Visualize the distribution of all continuous features:
plot_histogram(final_data)
## 4 Multivariate analysis
plot_correlation(final_data,
type = "continuous")
## Warning in cor(x = structure(list(year_flights = c(2013L, 2013L, 2013L, : the
## standard deviation is zero
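The warning above is what `cor()` reports when a column has zero standard deviation. A likely cause here is a constant column such as year_flights, which shows 2013 for every row in the warning text. As a hedged sketch on invented data (the `toy` frame and `is_constant` helper are assumptions for illustration), constant columns can be filtered out before computing correlations:

```r
# Invented data: `year` never varies, like a data set covering a single year.
toy <- data.frame(
  year = c(2013, 2013, 2013, 2013),
  dep_delay = c(2, 4, -1, -6),
  arr_delay = c(11, 20, -18, -25)
)

# Flag columns that take at most one distinct non-missing value.
is_constant <- vapply(
  toy,
  function(col) length(unique(col[!is.na(col)])) <= 1,
  logical(1)
)
print(names(toy)[is_constant])

# Correlate only the columns that actually vary.
cor_matrix <- cor(toy[, !is_constant, drop = FALSE],
                  use = "pairwise.complete.obs")
print(round(cor_matrix, 2))
```

Whether to drop such columns before calling plot_correlation is a judgment call. The warning itself is harmless, but constant columns add empty rows and columns to the heatmap.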
## 5 Generating an EDA report
create_report(mpg)
This creates an exploratory data analysis report in HTML format that gives you a comprehensive view of the data.
Tips: