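Exploratory data analysis (EDA) is the first step of any data analysis, and also the most important one. EDA shows us the state of the data and exposes its problems, which guides the data cleaning and processing that follow.

EDA starts from the raw data and uses data summaries and visualization to examine the overall shape of the data, the variable types, the distribution of each variable, the relationships between variables, and common data problems (missing values, invalid values, outliers, value ranges, units, and so on). The goal is to understand the data as fully as possible and to safeguard its quality.

DataExplorer is an R package for EDA. It helps us carry out EDA more quickly and efficiently.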


1 Installing and loading the DataExplorer package

if(!require(DataExplorer)){
  install.packages("DataExplorer")
  require(DataExplorer)
}
## Loading required package: DataExplorer

2 Loading the datasets

library(pacman)
p_load(DataExplorer,tidyverse,funModeling,nycflights13)

To quickly visualize the structure of all the datasets, proceed as follows:

2.1 The plot_str function

airlines %>% head()
## # A tibble: 6 x 2
##   carrier name                    
##   <chr>   <chr>                   
## 1 9E      Endeavor Air Inc.       
## 2 AA      American Airlines Inc.  
## 3 AS      Alaska Airlines Inc.    
## 4 B6      JetBlue Airways         
## 5 DL      Delta Air Lines Inc.    
## 6 EV      ExpressJet Airlines Inc.
airports %>% head()
## # A tibble: 6 x 8
##   faa   name                          lat   lon   alt    tz dst   tzone         
##   <chr> <chr>                       <dbl> <dbl> <dbl> <dbl> <chr> <chr>         
## 1 04G   Lansdowne Airport            41.1 -80.6  1044    -5 A     America/New_Y~
## 2 06A   Moton Field Municipal Airp~  32.5 -85.7   264    -6 A     America/Chica~
## 3 06C   Schaumburg Regional          42.0 -88.1   801    -6 A     America/Chica~
## 4 06N   Randall Airport              41.4 -74.4   523    -5 A     America/New_Y~
## 5 09J   Jekyll Island Airport        31.1 -81.4    11    -5 A     America/New_Y~
## 6 0A9   Elizabethton Municipal Air~  36.4 -82.2  1593    -5 A     America/New_Y~
flights %>% head()
## # A tibble: 6 x 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013     1     1      517            515         2      830            819
## 2  2013     1     1      533            529         4      850            830
## 3  2013     1     1      542            540         2      923            850
## 4  2013     1     1      544            545        -1     1004           1022
## 5  2013     1     1      554            600        -6      812            837
## 6  2013     1     1      554            558        -4      740            728
## # ... with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
planes %>% head()
## # A tibble: 6 x 9
##   tailnum  year type           manufacturer   model  engines seats speed engine 
##   <chr>   <int> <chr>          <chr>          <chr>    <int> <int> <int> <chr>  
## 1 N10156   2004 Fixed wing mu~ EMBRAER        EMB-1~       2    55    NA Turbo-~
## 2 N102UW   1998 Fixed wing mu~ AIRBUS INDUST~ A320-~       2   182    NA Turbo-~
## 3 N103US   1999 Fixed wing mu~ AIRBUS INDUST~ A320-~       2   182    NA Turbo-~
## 4 N104UW   1999 Fixed wing mu~ AIRBUS INDUST~ A320-~       2   182    NA Turbo-~
## 5 N10575   2002 Fixed wing mu~ EMBRAER        EMB-1~       2    55    NA Turbo-~
## 6 N105UW   1999 Fixed wing mu~ AIRBUS INDUST~ A320-~       2   182    NA Turbo-~
weather %>% head()
## # A tibble: 6 x 15
##   origin  year month   day  hour  temp  dewp humid wind_dir wind_speed wind_gust
##   <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>     <dbl>
## 1 EWR     2013     1     1     1  39.0  26.1  59.4      270      10.4         NA
## 2 EWR     2013     1     1     2  39.0  27.0  61.6      250       8.06        NA
## 3 EWR     2013     1     1     3  39.0  28.0  64.4      240      11.5         NA
## 4 EWR     2013     1     1     4  39.9  28.0  62.2      250      12.7         NA
## 5 EWR     2013     1     1     5  39.0  28.0  64.4      260      12.7         NA
## 6 EWR     2013     1     1     6  37.9  28.0  67.2      240      11.5         NA
## # ... with 4 more variables: precip <dbl>, pressure <dbl>, visib <dbl>,
## #   time_hour <dttm>
data_list <- list(airlines = airlines, airports = airports, flights = flights, planes = planes,weather = weather)
plot_str(data_list)

2.2 Column names of each dataset

map(data_list,colnames)
## $airlines
## [1] "carrier" "name"   
## 
## $airports
## [1] "faa"   "name"  "lat"   "lon"   "alt"   "tz"    "dst"   "tzone"
## 
## $flights
##  [1] "year"           "month"          "day"            "dep_time"      
##  [5] "sched_dep_time" "dep_delay"      "arr_time"       "sched_arr_time"
##  [9] "arr_delay"      "carrier"        "flight"         "tailnum"       
## [13] "origin"         "dest"           "air_time"       "distance"      
## [17] "hour"           "minute"         "time_hour"     
## 
## $planes
## [1] "tailnum"      "year"         "type"         "manufacturer" "model"       
## [6] "engines"      "seats"        "speed"        "engine"      
## 
## $weather
##  [1] "origin"     "year"       "month"      "day"        "hour"      
##  [6] "temp"       "dewp"       "humid"      "wind_dir"   "wind_speed"
## [11] "wind_gust"  "precip"     "pressure"   "visib"      "time_hour"
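
Column listings like the one above are mostly useful for spotting join keys and name clashes before merging. The following is a minimal sketch in base R, using two small made-up data frames (`toy_flights` and `toy_planes` are hypothetical stand-ins, not the real nycflights13 tables):

```r
# Hypothetical toy tables that mimic a few columns of flights and planes.
toy_flights <- data.frame(
  tailnum = c("N1", "N2"),
  year    = c(2013L, 2013L),
  carrier = c("AA", "DL")
)
toy_planes <- data.frame(
  tailnum = c("N1", "N2"),
  year    = c(1999L, 2004L),
  seats   = c(182L, 55L)
)

# Columns present in both tables: one of them is the join key,
# the others will clash and need suffixes when merging.
shared_cols <- intersect(colnames(toy_flights), colnames(toy_planes))
print(shared_cols)
```

Here `tailnum` would serve as the key, while `year` means something different in each table (flight year versus year of manufacture), so it has to be disambiguated in the merge.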

2.3 Merging the data

merge_airlines <- merge(flights, airlines, by = "carrier", all.x = TRUE)
merge_planes <- merge(merge_airlines, planes, by = "tailnum", all.x = TRUE, suffixes = c("_flights", "_planes"))
merge_airports_origin <- merge(merge_planes, airports, by.x = "origin", by.y = "faa", all.x = TRUE, suffixes = c("_carrier", "_origin"))
final_data <- merge(merge_airports_origin, airports, by.x = "dest", by.y = "faa", all.x = TRUE, suffixes = c("_origin", "_dest"))
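
To see what `all.x = TRUE` and `suffixes` do in the calls above, here is a small sketch on made-up data (the toy tables are assumptions for illustration only). It performs a left join and renames the clashing `year` column on each side:

```r
# Hypothetical toy tables; N3 has no matching row in toy_planes.
toy_flights <- data.frame(
  tailnum = c("N1", "N2", "N3"),
  year    = c(2013L, 2013L, 2013L)
)
toy_planes <- data.frame(
  tailnum = c("N1", "N2"),
  year    = c(1999L, 2004L)
)

# all.x = TRUE keeps every row of the left table (a left join);
# suffixes renames the non-key columns that exist in both tables.
merged <- merge(toy_flights, toy_planes,
                by = "tailnum", all.x = TRUE,
                suffixes = c("_flights", "_planes"))
print(merged)
```

The unmatched row is kept with `NA` in `year_planes`, which is one reason the merged flights data ends up with many missing values.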

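3 Exploratory data analysis

Exploratory data analysis is the process of getting to know the data so that you can generate and test hypotheses. It usually relies on visualization techniques.

3.1 The introduce function

Get a basic overview of the newly created dataset: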

DataExplorer::introduce(final_data) %>% t() %>% knitr::kable(col.names = "introduce")
                       introduce
rows                      336776
columns                       42
discrete_columns              16
continuous_columns            26
all_missing_columns            0
total_missing_values      809170
complete_rows                906
total_observations      14144592
memory_usage            97254656

3.2 The plot_intro function

Visualize the table above:

plot_intro(final_data) + theme(plot.title = element_text(hjust = 0.5))

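You should immediately notice some surprises:

  • 0.3% complete rows: only 0.3% of all rows contain no missing values at all!
  • 5.7% missing observations: although only 0.3% of rows are complete, just 5.7% of all observations are missing, so the missing values must be concentrated in a few columns.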
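
The two percentages can be checked by hand from the introduce() output in section 3.1. The sketch below only redoes that arithmetic; the variable names are chosen here for illustration:

```r
# Values copied from the introduce() table in section 3.1.
rows                 <- 336776
columns              <- 42
complete_rows        <- 906
total_missing_values <- 809170
total_observations   <- 14144592

# Sanity check: observations = rows x columns.
stopifnot(rows * columns == total_observations)

# Share of rows without any missing value, in percent.
pct_complete_rows <- 100 * complete_rows / rows
# Share of individual cells that are missing, in percent.
pct_missing_obs <- 100 * total_missing_values / total_observations

print(round(pct_complete_rows, 1))
print(round(pct_missing_obs, 1))
```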

3.3 The plot_missing function

Real-world data is messy. The plot_missing function visualizes the share of missing values in each feature.

DataExplorer::plot_missing(final_data)

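The chart shows that the speed variable is mostly missing and probably carries no information. It looks like we have found the culprit behind the 0.3% complete rows. Let's drop it: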

final_data <- drop_columns(final_data, "speed")
DataExplorer::plot_missing(final_data)
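
The quantity that plot_missing displays, the share of missing values per column, can also be computed directly in base R. This is a sketch on a made-up data frame; the 0.5 cut-off is an arbitrary assumption for illustration, not a rule from DataExplorer:

```r
# Hypothetical toy data: speed is mostly missing, as in the merged data.
toy <- data.frame(
  speed   = c(NA, NA, NA, 90),
  seats   = c(55, 182, NA, 182),
  carrier = c("AA", "DL", "UA", "AA")
)

# is.na() gives a logical matrix; column means are the missing shares.
missing_share <- colMeans(is.na(toy))
print(sort(missing_share, decreasing = TRUE))

# Keep only columns whose missing share is at most 0.5 (assumed threshold).
keep <- names(missing_share)[missing_share <= 0.5]
toy_clean <- toy[, keep, drop = FALSE]
print(colnames(toy_clean))
```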

3.4 The plot_bar function

Visualize the frequency distribution of all discrete features:

plot_bar(final_data)
## 5 columns ignored with more than 50 categories.
## dest: 105 categories
## tailnum: 4044 categories
## time_hour: 6936 categories
## model: 128 categories
## name: 102 categories

A closer look at the manufacturer variable reveals the following duplicates:

  • AIRBUS and AIRBUS INDUSTRIE
  • CANADAIR and CANADAIR LTD
  • MCDONNELL DOUGLAS, MCDONNELL DOUGLAS AIRCRAFT CO and MCDONNELL DOUGLAS CORPORATION

# Collapse each group of duplicate spellings into a single manufacturer name
final_data[which(final_data$manufacturer == "AIRBUS INDUSTRIE"),]$manufacturer <- "AIRBUS"
final_data[which(final_data$manufacturer == "CANADAIR LTD"),]$manufacturer <- "CANADAIR"
final_data[which(final_data$manufacturer %in% c("MCDONNELL DOUGLAS AIRCRAFT CO", "MCDONNELL DOUGLAS CORPORATION")),]$manufacturer <- "MCDONNELL DOUGLAS"

plot_bar(final_data$manufacturer)
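
When there are many spellings to fix, a named lookup vector can be easier to maintain than one assignment per value. The following is an alternative sketch on a made-up character vector, not the method used above; `canonical` is a hypothetical name:

```r
# Hypothetical sample of manufacturer values, including an NA.
manufacturer <- c("AIRBUS INDUSTRIE", "AIRBUS", "CANADAIR LTD",
                  "MCDONNELL DOUGLAS CORPORATION", "BOEING", NA)

# Lookup table: names are the variants, values are the canonical spelling.
canonical <- c(
  "AIRBUS INDUSTRIE"              = "AIRBUS",
  "CANADAIR LTD"                  = "CANADAIR",
  "MCDONNELL DOUGLAS AIRCRAFT CO" = "MCDONNELL DOUGLAS",
  "MCDONNELL DOUGLAS CORPORATION" = "MCDONNELL DOUGLAS"
)

# Replace only the values that appear in the lookup; others stay untouched.
needs_fix <- manufacturer %in% names(canonical)
manufacturer[needs_fix] <- unname(canonical[manufacturer[needs_fix]])
print(manufacturer)
```

Values that are not in the lookup, including `NA`, pass through unchanged.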

3.5 The plot_histogram function

Visualize the distribution of all continuous features:

plot_histogram(final_data)

4 Multivariate analysis

plot_correlation(final_data, 
                 type = "continuous")
## Warning in cor(x = structure(list(year_flights = c(2013L, 2013L, 2013L, : the
## standard deviation is zero
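
A "standard deviation is zero" warning from cor() typically means that at least one column is constant, so its correlation with anything else is undefined. In this data, year_flights is a likely candidate, since every flight shown is from 2013. The sketch below uses a made-up data frame to show one way of dropping constant columns before computing correlations; it is an illustration, not part of DataExplorer:

```r
# Hypothetical toy data: year never changes, the delays do.
toy <- data.frame(
  year      = c(2013L, 2013L, 2013L, 2013L),
  dep_delay = c(2, 4, -1, -6),
  arr_delay = c(11, 20, -18, -25)
)

# A column is constant if it has at most one distinct non-missing value.
is_constant <- vapply(
  toy,
  function(col) length(unique(col[!is.na(col)])) <= 1,
  logical(1)
)
print(names(is_constant)[is_constant])

# Correlate only the columns that actually vary.
toy_var <- toy[, !is_constant, drop = FALSE]
cor_matrix <- cor(toy_var, use = "pairwise.complete.obs")
print(round(cor_matrix, 2))
```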

5 Generating an EDA report

create_report(mpg)

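This creates an EDA report in HTML format that gives you a full picture of the data. The example uses mpg, a dataset from ggplot2 that is loaded with tidyverse; pass final_data instead to report on the merged flights data.

Tips:

  • Only by understanding the data thoroughly can you put it to good use.
  • EDA starts from the raw data and builds a profile of it, to make sure the data is usable and valuable.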