1. Working with categorical data in R

We create a character variable of 100 words drawn at random from “yes”, “no”, “maybe”, and then look at its structure.

x <- sample(c("yes", "no", "maybe"), size = 100, replace = TRUE)
str(x)

##  chr [1:100] "maybe" "maybe" "yes" "maybe" "no" "yes" ...

Now convert the variable from character to factor:

x.factor <- factor(x)
str(x.factor)

##  Factor w/ 3 levels "maybe","no","yes": 1 1 3 1 2 3 3 3 3 2 ...
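By default factor() orders the levels alphabetically, which is why maybe comes first in the output above. The following is a minimal sketch of how one might inspect the levels and counts and set an explicit level order. The seed and the names answers, answers.factor, and answers.ordered are illustrative assumptions, not part of the original session.

```r
# Illustrative data; the seed is an assumption added for reproducibility
set.seed(1)
answers <- sample(c("yes", "no", "maybe"), size = 100, replace = TRUE)

answers.factor <- factor(answers)
levels(answers.factor)   # alphabetical order by default
table(answers.factor)    # number of observations per level

# Same data, but with an explicit level order
answers.ordered <- factor(answers, levels = c("no", "maybe", "yes"))
levels(answers.ordered)
```

Setting the order at creation time also fixes which level acts as the first (base) one in later steps.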

Plots

We can draw a bar chart of the level counts. With base R:

plot(x.factor, main = "Do you like cheese?", xlab = "Mouse's answer", 
    ylab = "Number of mice")


The same bar chart with the ggplot2 package. ggplot2 works with data frames, so we first create a data frame h with the column x.factor.

library(ggplot2)
h <- data.frame(x.factor)
ggplot(h, aes(x = x.factor)) + geom_bar() + labs(x = "Mouse's answer", y = "Number of mice", 
    title = "Do you like cheese?")


The vcd package offers many plots for several categorical variables at once. For example, a mosaic plot:

library(vcd)

## Loading required package: grid

g <- Titanic
mosaic(~Class + Sex + Survived, data = g, shade = TRUE)

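A mosaic plot draws tiles whose areas are proportional to the cell counts of a contingency table. The sketch below uses only base R to collapse the built-in Titanic table over Age, which gives the Class, Sex, and Survived counts that such a plot visualizes. The name by.class is an illustrative assumption.

```r
# Titanic is a 4-way table: Class x Sex x Age x Survived
dimnames(Titanic)

# Collapse over Age to get the Class x Sex x Survived counts
by.class <- margin.table(Titanic, margin = c(1, 2, 4))
ftable(by.class)

# Share of survivors within each class (rows sum to 1)
prop.table(margin.table(Titanic, margin = c(1, 4)), margin = 1)
```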

TODO: add parallel coordinates

TODO: add a circular plot

Install the alluvial package from GitHub:

library(devtools)
install_github("mbojan/alluvial")

Load alluvial:

require(alluvial)

## Loading required package: alluvial

tit <- as.data.frame(Titanic)
alluvial(tit[, 1:4], freq = tit$Freq, border = NA, hide = tit$Freq < quantile(tit$Freq, 
    0.5), col = ifelse(tit$Survived == "No", "red", "gray"))


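Regressions and changing the base category

Now we can fit regressions, and R will automatically create the required number of dummy variables: one fewer than the number of levels, with the first level (here maybe) serving as the base category.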

y <- rnorm(100)
model1 <- lm(y ~ x.factor)
summary(model1)

## 
## Call:
## lm(formula = y ~ x.factor)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.2294 -0.6717  0.0801  0.5928  2.4052 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    0.056      0.190    0.30     0.77
## x.factorno     0.109      0.231    0.47     0.64
## x.factoryes   -0.140      0.252   -0.56     0.58
## 
## Residual standard error: 0.909 on 97 degrees of freedom
## Multiple R-squared:  0.014,  Adjusted R-squared:  -0.00636 
## F-statistic: 0.687 on 2 and 97 DF,  p-value: 0.505
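To see the dummy variables that lm() builds from a factor, one can inspect the design matrix. This is a small sketch on a toy factor, assuming R's default treatment contrasts. The names f and X are hypothetical and used only here.

```r
# Toy factor with three levels; f is a hypothetical name
f <- factor(c("maybe", "no", "yes", "no", "maybe"))

# Design matrix under the default treatment contrasts:
# an intercept plus one 0/1 column per non-base level
X <- model.matrix(~ f)
colnames(X)
X
```

The base level maybe has no column of its own: its effect is absorbed into the intercept.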

We can easily make no the base category:

x.factor <- relevel(x.factor, ref = "no")
model2 <- lm(y ~ x.factor)
summary(model2)

## 
## Call:
## lm(formula = y ~ x.factor)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.2294 -0.6717  0.0801  0.5928  2.4052 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)      0.165      0.133    1.24     0.22
## x.factormaybe   -0.109      0.231   -0.47     0.64
## x.factoryes     -0.249      0.212   -1.17     0.24
## 
## Residual standard error: 0.909 on 97 degrees of freedom
## Multiple R-squared:  0.014,  Adjusted R-squared:  -0.00636 
## F-statistic: 0.687 on 2 and 97 DF,  p-value: 0.505
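Changing the base category only reparameterizes the model: the residuals, R-squared, and F-statistic in the two summaries above are identical, and each coefficient becomes a difference from the new base. A sketch on simulated data illustrates this. The seed and the names g, z, m1, and m2 are assumptions made for the example.

```r
set.seed(42)  # assumed seed, only for reproducibility
g <- factor(sample(c("yes", "no", "maybe"), size = 100, replace = TRUE))
z <- rnorm(100)

m1 <- lm(z ~ g)                       # base category: "maybe"
m2 <- lm(z ~ relevel(g, ref = "no"))  # base category: "no"

# The intercept is the mean of z in the base category
coef(m1)["(Intercept)"]
mean(z[g == "maybe"])

# Same fitted values, different parameterization
all.equal(fitted(m1), fitted(m2))
```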

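For some purposes the factor can be converted to a numeric variable. The result holds the integer code of each level in the current level order (after relevel(): no = 1, maybe = 2, yes = 3):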

x.num <- as.numeric(x.factor)
str(x.num)

##  num [1:100] 2 2 3 2 1 3 3 3 3 1 ...
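Note that as.numeric() returns the internal level codes, not the labels. This matters when the labels themselves look like numbers. A short sketch with a hypothetical factor d:

```r
# Hypothetical factor whose labels look like numbers
d <- factor(c("10", "5", "20", "5"))

as.numeric(d)                # level codes, not the original numbers
as.numeric(as.character(d))  # the original numbers: 10 5 20 5
as.numeric(levels(d))[d]     # same result, converting each level only once
```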

Further reading: