在可视化数据之前,必须先将其输入到 r 中。 这包括从外部源导入数据并将其转换为有用的格式。
R 可以从几乎任何来源导入数据,包括文本文件、 excel 电子表格、统计数据包和数据库管理系统。
我们将使用薪资数据集来说明这些技术,该数据集包含2008-2009年单个机构中大学教授9个月的学术薪资。
readr 包提供了将带分隔符的文本文件导入 r 数据帧的函数。
library(readr)
# import data from a comma delimited file
Salaries <- read_csv("salaries.csv")
# import data from a tab delimited file
Salaries <- read_tsv("salaries.txt")
这些函数假设数据的第一行包含变量名,值分别用逗号或制表符分隔,缺少的数据用空格表示。 例如,逗号分隔文件的前几行如下所示。
"rank","discipline","yrs.since.phd","yrs.service","sex","salary"
"Prof","B",19,18,"Male",139750
"Prof","B",20,16,"Male",173200
"AsstProf","B",4,3,"Male",79750
"Prof","B",45,39,"Male",115000
"Prof","B",40,41,"Male",141500
"AssocProf","B",6,6,"Male",97000
readxl 包可以从 Excel 工作簿中导入数据,支持 xls 和 xlsx 格式。
library(readxl)
# import data from an Excel workbook
Salaries <- read_excel("salaries.xlsx", sheet=1)
该haven软件包提供的功能从各种统计软件包导入数据。
library(haven)
# import data from Stata
Salaries <- read_dta("salaries.dta")
# import data from SPSS
Salaries <- read_sav("salaries.sav")
# import data from SAS
Salaries <- read_sas("salaries.sas7bdat")
从数据库导入数据时需要额外的步骤,这超出了本书的范围。根据包含数据的数据库上,下面的包可以帮助:RODBC,RMySQL,ROracle,RPostgreSQL,RSQLite,和RMongo。在RStudio的最新版本,可以使用连接窗格快速访问存储在数据库管理系统中的数据。
清理你的数据的方法可以是任意数据分析中最耗时的部分。最重要的步骤被视为下方。虽然有许多方法,使用dplyr和tidyr包都是一些最快捷,最容易学习的。
| 包 | 函数 | 功能 |
|---|---|---|
| dplyr | select | 选择变量/列 |
| dplyr | filter | 选择意见/行 |
| dplyr | mutate | 转换或重新编码的变量 |
| dplyr | summarise | 汇总数据 |
| dplyr | group_by | 识别用于进一步处理子组 |
| tidyr | gather | 转换宽数据集为长格式 |
| tidyr | spread | 长格式数据集转换为宽格式 |
本节中的示例将使用来自 dplyr 包的 starwars 数据集。 数据集提供了13个变量的87个来自星球大战宇宙的字符的描述。 (实际上我更喜欢《星际迷航》 ,但是我们用我们现有的东西工作。)
library(tidyverse)
library(DT)
starwars %>% datatable()
starwars %>% glimpse()
## Observations: 87
## Variables: 13
## $ name <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia...
## $ height <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180...
## $ mass <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, ...
## $ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown"...
## $ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light"...
## $ eye_color <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blu...
## $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57....
## $ gender <chr> "male", NA, NA, "male", "female", "male", "female", NA, ...
## $ homeworld <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan",...
## $ species <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "H...
## $ films <list> [<"Revenge of the Sith", "Return of the Jedi", "The Emp...
## $ vehicles <list> [<"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, ...
## $ starships <list> [<"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced ...
# keep the variables name, height, and gender
newdata <- select(starwars, name, height, gender)
# keep the variables name and all variables between mass and species inclusive
newdata <- select(starwars, name, mass:species)
# keep all variables except birth_year and gender
newdata <- select(starwars, -birth_year, -gender)
filter函数允许您将数据集限制为满足特定条件的观察(行)。 多个条件可以与 & (AND)和 | (OR)符号组合在一起。
library(dplyr)
# select females
newdata <- filter(starwars,
gender == "female")
# select females that are from Alderaan
newdata <- filter(starwars,
gender == "female" &
homeworld == "Alderaan")
# select individuals that are from Alderaan, Coruscant, or Endor
newdata <- filter(starwars,
homeworld == "Alderaan" |
homeworld == "Coruscant" |
homeworld == "Endor")
# this can be written more succinctly as
newdata <- filter(starwars,
homeworld %in% c("Alderaan", "Coruscant", "Endor"))
# convert height in centimeters to inches, and mass in kilograms to pounds
newdata <- mutate(starwars,
height = height * 0.394,
mass = mass * 2.205)
Ifelse 函数(基 r 的一部分)可用于重新编码数据。 格式是 ifelse (测试,返回 TRUE,返回 FALSE)
# if height is greater than 180 then heightcat = "tall", otherwise heightcat = "short"
newdata <- mutate(starwars,
heightcat = ifelse(height > 180,
"tall",
"short"))
# convert any eye color that is not black, blue or brown, to other
newdata <- mutate(starwars,
eye_color = ifelse(eye_color %in% c("black", "blue", "brown"),
eye_color,
"other"))
# set heights greater than 200 or less than 75 to missing
newdata <- mutate(starwars,
height = ifelse(height < 75 | height > 200,
NA,
height))
汇总函数summarise可用于将多个值减少到单个值(例如 mean)。 它通常与按组函数group_by一起使用,以便按组计算统计数据。 在下面的代码中,na.rm TRUE 选项用于在计算平均值之前删除缺少的值。
# calculate mean height and mass
newdata <- summarize(starwars,
mean_ht = mean(height, na.rm=TRUE),
mean_mass = mean(mass, na.rm=TRUE))
newdata
## # A tibble: 1 x 2
## mean_ht mean_mass
## <dbl> <dbl>
## 1 174. 97.3
# calculate mean height and weight by gender
starwars %>%
group_by(gender) %>%
summarise(mean_ht = mean(height,na.rm = TRUE),
mean_wt = mean(mass,na.rm = TRUE))
## # A tibble: 5 x 3
## gender mean_ht mean_wt
## <chr> <dbl> <dbl>
## 1 female 165. 54.0
## 2 hermaphrodite 175 1358
## 3 male 179. 81.0
## 4 none 200 140
## 5 <NA> 120 46.3
像 dplyr 和 tidyr 这样的包允许您使用管道% 操作符以紧凑的格式编写代码。 下面是一个例子。
# calculate the mean height for women by species
newdata <- filter(starwars,
gender == "female")
newdata <- group_by(newdata,species)
newdata <- summarize(newdata,
mean_ht = mean(height, na.rm = TRUE))
# this can be written as
newdata <- starwars %>%
filter(gender == "female") %>% # 性别
group_by(species) %>% # 分组
summarize(mean_ht = mean(height, na.rm = TRUE)) # 总结数据
有些图需要宽格式的数据,而有些图需要长格式的数据。
| id | name | sex | age | income |
|---|---|---|---|---|
| 01 | Bill | Male | 22 | 55000 |
| 02 | Bob | Male | 25 | 75000 |
| 03 | Mary | Female | 18 | 90000 |
library(tidyr)
wide_data <- tibble(id = c("01","02","03"),
name = c("Bill","Bob","Mary"),
sex = c("Male","Male","Female"),
age = c(22,25,18),
income = c(55000,75000,90000))
long_data <- gather(wide_data,
key="variable",
value="value",
sex:income)
long_data
## # A tibble: 9 x 4
## id name variable value
## <chr> <chr> <chr> <chr>
## 1 01 Bill sex Male
## 2 02 Bob sex Male
## 3 03 Mary sex Female
## 4 01 Bill age 22
## 5 02 Bob age 25
## 6 03 Mary age 18
## 7 01 Bill income 55000
## 8 02 Bob income 75000
## 9 03 Mary income 90000
相反,您可以使用以下方法将长数据集转换为宽数据集
wide_data <- spread(long_data, variable, value)
wide_data
## # A tibble: 3 x 5
## id name age income sex
## <chr> <chr> <chr> <chr> <chr>
## 1 01 Bill 22 55000 Male
## 2 02 Bob 25 75000 Male
## 3 03 Mary 18 90000 Female
真实的数据可能包含缺失的值。 处理缺失数据有三种基本方法: 特征选择、列表删除和填充。 让我们看看它们是如何从ggplot2包应用到 msleep 数据集的。 Msleep 数据集描述了哺乳动物的睡眠习惯,并包含了几个变量的缺失值。
在选择 n 个特性时,可以删除包含太多缺失值的变量(列)。
data(msleep, package="ggplot2")
# what is the proportion of missing data for each variable?
pctmiss <- colSums(is.na(msleep))/nrow(msleep)
round(pctmiss, 2)
## name genus vore order conservation sleep_total
## 0.00 0.00 0.08 0.00 0.35 0.00
## sleep_rem sleep_cycle awake brainwt bodywt
## 0.27 0.61 0.00 0.33 0.00
百分之六十一的睡眠周期值丢失。你可以决定放弃它
Listwise删除:删除包含任何感兴趣的变量上的缺失值的观察值(行)。
# Create a dataset containing genus, vore, and conservation.
# Delete any rows containing missing data.
newdata <- select(msleep, genus, vore, conservation)
newdata <- na.omit(newdata)
估算涉及用”合理”的猜测取代缺失值,即如果没有缺失值,这些值将是什么。 有几种方法,详细介绍了 VIM、 mice、 Amelia 和 missForest 等软件包。 在这里,我们将使用 VIM 包中的 kNN 函数用输入值替换缺失值。
# Impute missing values using the 5 nearest neighbors
library(pacman)
p_load(VIM)
newdata <- kNN(msleep, k=5)
## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA
## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA
## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA
## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA
## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA
## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA
## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA
## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA
## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA
## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA
## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA
## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA
## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA
## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA
## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA
## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA
## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA
## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA
## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA
## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA
## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA
## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA
## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA
## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA
## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA
## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA
## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA
## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA
## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA
## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA
## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA
## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA
## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA
## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA
## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA
## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA
## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA
## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA
## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA
## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA
## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA
## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA
## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA
## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA
## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA
## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx, numericalX, :
## 强制改变过程中产生了NA
基本上,对于每个缺失值的情况,都会选择 k 个最相似的没有缺失值的情况。 如果缺失的值是数值,则使用这些 k 值的平均值作为估算值。 如果缺失值是离散值,则使用 k 情况下最常见的值。 过程遍历案例和变量,直到结果收敛(变得稳定)。 这有点过于简单化了——有关实际细节,请参阅 r Package VIM 的 Imputation。
重要的注意事项: 缺失的数值可能影响研究结果(有时严重)。如果您有大量的缺失数据,在删除cases或填补缺失值之前咨询统计学家或数据科学家可能是一个好主意。