skimr package

skimr

显示汇总统计信息,处理不同的数据类型并返回一个对象,且可以包含在管道中使用。

  • 提供比summary更多的统计信息。

  • 分别报告每种数据类型

  • 处理日期、逻辑和各种其他类型数据。

安装

存储库:https://github.com/ropensci/skimr/

主页:https://github.com/ropensci/skimr/ # 结果呈现 ## 采用紧凑的水平格式

library(skimr)
skim(iris)
Data summary
Name iris
Number of rows 150
Number of columns 5
_______________________
Column type frequency:
factor 1
numeric 4
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Species 0 1 FALSE 3 set: 50, ver: 50, vir: 50

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Sepal.Length 0 1 5.84 0.83 4.3 5.1 5.80 6.4 7.9 ▆▇▇▅▂
Sepal.Width 0 1 3.06 0.44 2.0 2.8 3.00 3.3 4.4 ▁▆▇▂▁
Petal.Length 0 1 3.76 1.77 1.0 1.6 4.35 5.1 6.9 ▇▁▆▇▂
Petal.Width 0 1 1.20 0.76 0.1 0.3 1.30 1.8 2.5 ▇▁▇▅▃

支持字符串、列表和其他

skim(dplyr::starwars)
Data summary
Name dplyr::starwars
Number of rows 87
Number of columns 14
_______________________
Column type frequency:
character 8
list 3
numeric 3
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
name 0 1.00 3 21 0 87 0
hair_color 5 0.94 4 13 0 12 0
skin_color 0 1.00 3 19 0 31 0
eye_color 0 1.00 3 13 0 15 0
sex 4 0.95 4 14 0 4 0
gender 4 0.95 8 9 0 2 0
homeworld 10 0.89 4 14 0 48 0
species 4 0.95 3 14 0 37 0

Variable type: list

skim_variable n_missing complete_rate n_unique min_length max_length
films 0 1 24 1 7
vehicles 0 1 11 0 2
starships 0 1 17 0 5

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
height 6 0.93 174.36 34.77 66 167.0 180 191.0 264 ▁▁▇▅▁
mass 28 0.68 97.31 169.46 15 55.6 79 84.5 1358 ▇▁▁▁▁
birth_year 44 0.49 87.57 154.69 8 35.0 52 72.0 896 ▇▁▁▁▁

汇总功能

skim(iris) %>% 
  summary()
Data summary
Name iris
Number of rows 150
Number of columns 5
_______________________
Column type frequency:
factor 1
numeric 4
________________________
Group variables None

选择单个列

skim(iris,Sepal.Length,Petal.Length)
Data summary
Name iris
Number of rows 150
Number of columns 5
_______________________
Column type frequency:
numeric 2
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Sepal.Length 0 1 5.84 0.83 4.3 5.1 5.80 6.4 7.9 ▆▇▇▅▂
Petal.Length 0 1 3.76 1.77 1.0 1.6 4.35 5.1 6.9 ▇▁▆▇▂

处理分组数据

skim()可以处理已使用 进行分组的数据。dplyr::group_by()

iris %>% 
  dplyr::group_by(Species) %>% 
  skim()
Data summary
Name Piped data
Number of rows 150
Number of columns 5
_______________________
Column type frequency:
numeric 4
________________________
Group variables Species

Variable type: numeric

skim_variable Species n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Sepal.Length setosa 0 1 5.01 0.35 4.3 4.80 5.00 5.20 5.8 ▃▃▇▅▁
Sepal.Length versicolor 0 1 5.94 0.52 4.9 5.60 5.90 6.30 7.0 ▂▇▆▃▃
Sepal.Length virginica 0 1 6.59 0.64 4.9 6.23 6.50 6.90 7.9 ▁▃▇▃▂
Sepal.Width setosa 0 1 3.43 0.38 2.3 3.20 3.40 3.68 4.4 ▁▃▇▅▂
Sepal.Width versicolor 0 1 2.77 0.31 2.0 2.52 2.80 3.00 3.4 ▁▅▆▇▂
Sepal.Width virginica 0 1 2.97 0.32 2.2 2.80 3.00 3.18 3.8 ▂▆▇▅▁
Petal.Length setosa 0 1 1.46 0.17 1.0 1.40 1.50 1.58 1.9 ▁▃▇▃▁
Petal.Length versicolor 0 1 4.26 0.47 3.0 4.00 4.35 4.60 5.1 ▂▂▇▇▆
Petal.Length virginica 0 1 5.55 0.55 4.5 5.10 5.55 5.88 6.9 ▃▇▇▃▂
Petal.Width setosa 0 1 0.25 0.11 0.1 0.20 0.20 0.30 0.6 ▇▂▂▁▁
Petal.Width versicolor 0 1 1.33 0.20 1.0 1.20 1.30 1.50 1.8 ▅▇▃▆▁
Petal.Width virginica 0 1 2.03 0.27 1.4 1.80 2.00 2.30 2.5 ▂▇▆▅▇

管道操作中使用

iris %>% 
  skim() %>% 
  dplyr::filter(numeric.sd>1)
Data summary
Name Piped data
Number of rows 150
Number of columns 5
_______________________
Column type frequency:
numeric 1
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Petal.Length 0 1 3.76 1.77 1 1.6 4.35 5.1 6.9 ▇▁▆▇▂

自定义

虽然提供了默认值,也可以定制统计信息,更改结果格式

my_skim <- skim_with(numeric=sfl(mad))
my_skim(iris,Sepal.Length)
Data summary
Name iris
Number of rows 150
Number of columns 5
_______________________
Column type frequency:
numeric 1
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist mad
Sepal.Length 0 1 5.84 0.83 4.3 5.1 5.8 6.4 7.9 ▆▇▇▅▂ 1.04

其他对象

  • numeric(包括和doubleinteger)

  • character

  • factor

  • logical

  • complex

  • Date

  • POSIXct

  • ts

  • AsIs