1.1 导入数据

在可视化数据之前,必须先将其输入到 r 中。 这包括从外部源导入数据并将其转换为有用的格式。

R 可以从几乎任何来源导入数据,包括文本文件、 excel 电子表格、统计数据包和数据库管理系统。

我们将使用薪资数据集来说明这些技术,该数据集包含2008-2009年单个机构中大学教授9个月的学术薪资。

1.1.1 文本文件

readr 包提供了将带分隔符的文本文件导入 r 数据帧的函数。

library(readr)

# import data from a comma delimited file
Salaries <- read_csv("salaries.csv")

# import data from a tab delimited file
Salaries <- read_tsv("salaries.txt")

这些函数假设数据的第一行包含变量名,值分别用逗号或制表符分隔,缺少的数据用空格表示。 例如,逗号分隔文件的前几行如下所示。

"rank","discipline","yrs.since.phd","yrs.service","sex","salary"
"Prof","B",19,18,"Male",139750
"Prof","B",20,16,"Male",173200
"AsstProf","B",4,3,"Male",79750
"Prof","B",45,39,"Male",115000
"Prof","B",40,41,"Male",141500
"AssocProf","B",6,6,"Male",97000

1.1.2 Excel 电子表格

readxl 包可以从 Excel 工作簿中导入数据,支持 xls 和 xlsx 格式。

library(readxl)

# import data from an Excel workbook
Salaries <- read_excel("salaries.xlsx", sheet=1)

1.1.3 统计软件包

haven软件包提供的功能从各种统计软件包导入数据。

library(haven)

# import data from Stata
Salaries <- read_dta("salaries.dta")

# import data from SPSS
Salaries <- read_sav("salaries.sav")

# import data from SAS
Salaries <- read_sas("salaries.sas7bdat")

1.1.4 数据库

从数据库导入数据时需要额外的步骤,这超出了本书的范围。根据包含数据的数据库上,下面的包可以帮助:RODBCRMySQLROracleRPostgreSQLRSQLite和RMongo。在RStudio的最新版本,可以使用连接窗格快速访问存储在数据库管理系统中的数据。

1.2 清洗数据

清理你的数据的方法可以是任意数据分析中最耗时的部分。最重要的步骤被视为下方。虽然有许多方法,使用dplyrtidyr包都是一些最快捷,最容易学习的。

函数 功能
dplyr select 选择变量/列
dplyr filter 选择意见/行
dplyr mutate 转换或重新编码的变量
dplyr summarise 汇总数据
dplyr group_by 识别用于进一步处理子组
tidyr gather 转换宽数据集为长格式
tidyr spread 长格式数据集转换为宽格式

本节中的示例将使用来自 dplyr 包的 starwars 数据集。 数据集提供了13个变量的87个来自星球大战宇宙的字符的描述。 (实际上我更喜欢《星际迷航》 ,但是我们用我们现有的东西工作。)

1.2.1 选择变量

library(tidyverse)
library(DT)

starwars %>% datatable()
starwars %>% glimpse()
## Observations: 87
## Variables: 13
## $ name       <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia...
## $ height     <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180...
## $ mass       <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, ...
## $ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown"...
## $ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light"...
## $ eye_color  <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blu...
## $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57....
## $ gender     <chr> "male", NA, NA, "male", "female", "male", "female", NA, ...
## $ homeworld  <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan",...
## $ species    <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "H...
## $ films      <list> [<"Revenge of the Sith", "Return of the Jedi", "The Emp...
## $ vehicles   <list> [<"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, ...
## $ starships  <list> [<"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced ...
# keep the variables name, height, and gender
newdata <- select(starwars, name, height, gender)

# keep the variables name and all variables between mass and species inclusive
newdata <- select(starwars, name, mass:species)

# keep all variables except birth_year and gender
newdata <- select(starwars, -birth_year, -gender)

filter函数允许您将数据集限制为满足特定条件的观察(行)。 多个条件可以与 & (AND)| (OR)符号组合在一起。

library(dplyr)

# select females
newdata <- filter(starwars, 
                  gender == "female")

# select females that are from Alderaan
newdata <- filter(starwars, 
                  gender == "female" & 
                  homeworld == "Alderaan")


# select individuals that are from Alderaan, Coruscant, or Endor
newdata <- filter(starwars, 
                  homeworld == "Alderaan" | 
                  homeworld == "Coruscant" | 
                  homeworld == "Endor")

# this can be written more succinctly as
newdata <- filter(starwars, 
                  homeworld %in% c("Alderaan", "Coruscant", "Endor"))

1.2.3 创建 / 重新编码变量

# convert height in centimeters to inches, and mass in kilograms to pounds
newdata <- mutate(starwars, 
                  height = height * 0.394,
                  mass   = mass   * 2.205)

Ifelse 函数(基 r 的一部分)可用于重新编码数据。 格式是 ifelse (测试,返回 TRUE,返回 FALSE)

# if height is greater than 180 then heightcat = "tall", otherwise heightcat = "short"

newdata <- mutate(starwars, 
                  heightcat = ifelse(height > 180, 
                                     "tall", 
                                     "short"))
                  
# convert any eye color that is not black, blue or brown, to other
newdata <- mutate(starwars, 
                  eye_color = ifelse(eye_color %in% c("black", "blue", "brown"),
                                     eye_color,
                                     "other"))
                  
# set heights greater than 200 or less than 75 to missing
newdata <- mutate(starwars, 
                  height = ifelse(height < 75 | height > 200,
                                     NA,
                                     height))

1.2.4 总结数据

汇总函数summarise可用于将多个值减少到单个值(例如 mean)。 它通常与按组函数group_by一起使用,以便按组计算统计数据。 在下面的代码中,na.rm TRUE 选项用于在计算平均值之前删除缺少的值。

# calculate mean height and mass
newdata <- summarize(starwars, 
                     mean_ht = mean(height, na.rm=TRUE), 
                     mean_mass = mean(mass, na.rm=TRUE))
newdata
## # A tibble: 1 x 2
##   mean_ht mean_mass
##     <dbl>     <dbl>
## 1    174.      97.3
# calculate mean height and weight by gender
starwars %>% 
  group_by(gender) %>% 
  summarise(mean_ht = mean(height,na.rm = TRUE),
            mean_wt = mean(mass,na.rm = TRUE))
## # A tibble: 5 x 3
##   gender        mean_ht mean_wt
##   <chr>           <dbl>   <dbl>
## 1 female           165.    54.0
## 2 hermaphrodite    175   1358  
## 3 male             179.    81.0
## 4 none             200    140  
## 5 <NA>             120     46.3

1.2.5 使用管道

dplyrtidyr 这样的包允许您使用管道% 操作符以紧凑的格式编写代码。 下面是一个例子。

# calculate the mean height for women by species
newdata <- filter(starwars, 
                  gender == "female")
newdata <- group_by(newdata,species)
newdata <- summarize(newdata, 
                     mean_ht = mean(height, na.rm = TRUE))

# this can be written as
newdata <- starwars %>%
  filter(gender == "female") %>%   # 性别
  group_by(species) %>%            # 分组
  summarize(mean_ht = mean(height, na.rm = TRUE))  # 总结数据

1.2.6 重新整合数据

有些图需要宽格式的数据,而有些图需要长格式的数据。

id name sex age income
01 Bill Male 22 55000
02 Bob Male 25 75000
03 Mary Female 18 90000
library(tidyr)
wide_data <- tibble(id = c("01","02","03"),
                    name = c("Bill","Bob","Mary"),
                    sex = c("Male","Male","Female"),
                    age = c(22,25,18),
                    income = c(55000,75000,90000))
long_data <- gather(wide_data, 
                    key="variable", 
                    value="value", 
                    sex:income)
long_data
## # A tibble: 9 x 4
##   id    name  variable value 
##   <chr> <chr> <chr>    <chr> 
## 1 01    Bill  sex      Male  
## 2 02    Bob   sex      Male  
## 3 03    Mary  sex      Female
## 4 01    Bill  age      22    
## 5 02    Bob   age      25    
## 6 03    Mary  age      18    
## 7 01    Bill  income   55000 
## 8 02    Bob   income   75000 
## 9 03    Mary  income   90000

相反,您可以使用以下方法将长数据集转换为宽数据集

wide_data <- spread(long_data, variable, value)
wide_data
## # A tibble: 3 x 5
##   id    name  age   income sex   
##   <chr> <chr> <chr> <chr>  <chr> 
## 1 01    Bill  22    55000  Male  
## 2 02    Bob   25    75000  Male  
## 3 03    Mary  18    90000  Female

1.2.7 缺失数据

真实的数据可能包含缺失的值。 处理缺失数据有三种基本方法: 特征选择列表删除填充。 让我们看看它们是如何从ggplot2包应用到 msleep 数据集的。 Msleep 数据集描述了哺乳动物的睡眠习惯,并包含了几个变量的缺失值。

1.2.7.1 Feature selection

在选择 n 个特性时,可以删除包含太多缺失值的变量(列)。

data(msleep, package="ggplot2")

# what is the proportion of missing data for each variable?
pctmiss <- colSums(is.na(msleep))/nrow(msleep)
round(pctmiss, 2)
##         name        genus         vore        order conservation  sleep_total 
##         0.00         0.00         0.08         0.00         0.35         0.00 
##    sleep_rem  sleep_cycle        awake      brainwt       bodywt 
##         0.27         0.61         0.00         0.33         0.00

百分之六十一的睡眠周期值丢失。你可以决定放弃它

1.2.7.2 Listwise删除

Listwise删除:删除包含任何感兴趣的变量上的缺失值的观察值(行)。

# Create a dataset containing genus, vore, and conservation.
# Delete any rows containing missing data.
newdata <- select(msleep, genus, vore, conservation)
newdata <- na.omit(newdata)

1.2.7.3 Imputation

估算涉及用”合理”的猜测取代缺失值,即如果没有缺失值,这些值将是什么。 有几种方法,详细介绍了 VIMmiceAmeliamissForest 等软件包。 在这里,我们将使用 VIM 包中的 kNN 函数用输入值替换缺失值。

# Impute missing values using the 5 nearest neighbors
library(pacman)
p_load(VIM)
newdata <- kNN(msleep, k=5)
## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA

## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA

## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA

## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA

## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA

## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA

## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA

## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA

## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA

## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA

## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA

## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA

## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA

## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA

## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA

## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA

## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA

## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA

## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA

## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA

## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA

## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA

## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA

## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA

## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA

## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA

## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA

## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA

## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA

## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA

## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA

## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA

## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA

## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA

## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA

## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA

## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA

## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA

## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA

## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA

## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA

## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA

## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA

## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA

## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA

## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA

基本上,对于每个缺失值的情况,都会选择 k 个最相似的没有缺失值的情况。 如果缺失的值是数值,则使用这些 k 值的平均值作为估算值。 如果缺失值是离散值,则使用 k 情况下最常见的值。 过程遍历案例和变量,直到结果收敛(变得稳定)。 这有点过于简单化了——有关实际细节,请参阅 r Package VIM 的 Imputation。

重要的注意事项: 缺失的数值可能影响研究结果(有时严重)。如果您有大量的缺失数据,在删除cases或填补缺失值之前咨询统计学家或数据科学家可能是一个好主意。