Like many tidyverse packages, the idea is to simplify and unify the interfaces.
5/4/2018
Factors are used to model categorical variables.
x1 <- c("Dec", "Abr", "Ene", "Mar")
x1
## [1] "Dec" "Abr" "Ene" "Mar"
Ordering: strings sort alphabetically, not in calendar order:
sort(x1)
## [1] "Abr" "Dec" "Ene" "Mar"
Typos: nothing flags a misspelled value:
x2 <- c("Dec", "Abr", "Ene", "Marzo")
Factors avoid these problems.
niveles <- c("Ene", "Feb", "Mar", "Abr",
"May", "Jun", "Jul", "Ag",
"Set", "Oct", "Nov", "Dec")
x1
## [1] "Dec" "Abr" "Ene" "Mar"
meses_factor <- factor(x1, levels=niveles)
sort(meses_factor)
## [1] Ene Mar Abr Dec
## Levels: Ene Feb Mar Abr May Jun Jul Ag Set Oct Nov Dec
If there is a typo, the unmatched value becomes NA:
x2
## [1] "Dec" "Abr" "Ene" "Marzo"
meses_factor_2 <- factor(x2, levels=niveles)
meses_factor_2
## [1] Dec Abr Ene <NA>
## Levels: Ene Feb Mar Abr May Jun Jul Ag Set Oct Nov Dec
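Because factor() turns unknown values into NA without a warning, it can help to check the input against the allowed levels first. A minimal sketch in base R, reusing the toy vectors from above (the names invalid and n_na are only illustrative):

```r
# Toy data copied from the example above.
niveles <- c("Ene", "Feb", "Mar", "Abr",
             "May", "Jun", "Jul", "Ag",
             "Set", "Oct", "Nov", "Dec")
x2 <- c("Dec", "Abr", "Ene", "Marzo")

# Values that are not valid levels; factor() would turn these into NA.
invalid <- setdiff(x2, niveles)
invalid

# Number of NA values introduced by the conversion.
n_na <- sum(is.na(factor(x2, levels = niveles)))
n_na
```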
library(dplyr)
library(forcats)
head(gss_cat)
## # A tibble: 6 x 9
##    year marital         age race   rincome        party… relig denom tvho…
##   <int> <fctr>        <int> <fctr> <fctr>         <fctr> <fct> <fct> <int>
## 1  2000 Never married    26 White  $8000 to 9999  Ind,n… Prot… Sout…    12
## 2  2000 Divorced         48 White  $8000 to 9999  Not s… Prot… Bapt…    NA
## 3  2000 Widowed          67 White  Not applicable Indep… Prot… No d…     2
## 4  2000 Never married    39 White  Not applicable Ind,n… Orth… Not …     4
## 5  2000 Divorced         25 White  Not applicable Not s… None  Not …     1
## 6  2000 Married          25 White  $20000 - 24999 Stron… Prot… Sout…    NA
We will use ggplot to plot the data, although tables would work just as well.
library(ggplot2)
ggplot(gss_cat, aes(x=race)) + geom_bar()
ggplot(gss_cat, aes(x=race)) + geom_bar() +
scale_x_discrete(drop=FALSE)
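The second plot uses drop=FALSE so that levels with no observations still appear on the axis. The same distinction shows up in tables. A small base R sketch with a made-up factor (f, counts_all and counts_used are illustrative names, not from the slides):

```r
# Toy factor: level "c" is declared but never observed.
f <- factor(c("a", "b", "a"), levels = c("a", "b", "c"))

# table() keeps every declared level, including the empty one.
counts_all <- table(f)
counts_all

# droplevels() removes unused levels before counting.
counts_used <- table(droplevels(f))
counts_used
```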
Explore the distribution of rincome:
ggplot(gss_cat, aes(x=rincome)) + geom_bar() +
theme(axis.text.x=element_text(angle=90))
relig <- gss_cat %>%
group_by(relig) %>%
summarize(age=mean(age, na.rm=TRUE),
tvhours=mean(tvhours, na.rm=TRUE),
n=n())
ggplot(relig, aes(tvhours, relig)) + geom_point()
Since relig is a factor, we can use fct_reorder. It takes a factor and a numeric vector, and returns the same factor with its levels reordered by that vector (by default, by the median of the values in each level).
relig <- relig %>%
mutate(relig_factor = fct_reorder(relig, tvhours))
ggplot(relig, aes(tvhours, relig_factor)) + geom_point()
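To see what fct_reorder changes, here is a minimal sketch on a made-up three-level factor (f, x and reordered are illustrative names). Only the order of the levels changes; the values stay the same.

```r
library(forcats)

# Toy factor and one numeric value per observation.
f <- factor(c("a", "b", "c"))
x <- c(3, 1, 2)

# Levels are sorted by the (median) value of x within each level, ascending.
reordered <- fct_reorder(f, x)

levels(f)          # original, alphabetical order
levels(reordered)  # order driven by x
as.character(reordered)  # the data itself is unchanged
```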
As an aside, the book writes it like this:
ggplot(relig, aes(tvhours, fct_reorder(relig, tvhours))) +
geom_point()
rincome <- gss_cat %>%
group_by(rincome) %>%
summarize(age = mean(age, na.rm=TRUE),
tvhours = mean(tvhours, na.rm=TRUE),
n=n())
ggplot(rincome, aes(tvhours, rincome)) +
geom_point()
It makes no sense to reorder these levels by another variable as in the previous case, because the income brackets already have a natural order. However, we may want to move some levels to the front to highlight them:
ggplot(rincome, aes(tvhours, relevel(rincome, "Not applicable"))) +
geom_point()
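The call above uses base R's relevel, which moves a single reference level to the front. forcats offers fct_relevel for the same purpose, and it also accepts several levels at once. A sketch on a made-up factor (income, with_forcats and with_base are illustrative names):

```r
library(forcats)

# Toy factor with an explicit level order.
income <- factor(c("low", "high", "n/a", "low"),
                 levels = c("low", "high", "n/a"))

# forcats: move "n/a" to the front of the levels.
with_forcats <- fct_relevel(income, "n/a")

# Base R equivalent for a single level.
with_base <- relevel(income, ref = "n/a")

levels(with_forcats)
levels(with_base)
```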
fct_infreq orders the levels by decreasing frequency:
gss_cat %>%
mutate(marital = marital %>% fct_infreq() ) %>%
ggplot(aes(marital)) + geom_bar()
fct_rev reverses the order:
gss_cat %>%
  mutate(marital = marital %>% fct_infreq() %>% fct_rev()) %>%
  ggplot(aes(marital)) + geom_bar()
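The effect of both functions on the level order can be checked without a plot. A minimal sketch with a made-up factor (f, by_freq and reversed are illustrative names):

```r
library(forcats)

# Toy factor: "c" appears three times, "b" twice, "a" once.
f <- factor(c("a", "b", "b", "c", "c", "c"))

# Most frequent level first.
by_freq <- fct_infreq(f)
levels(by_freq)

# Same levels in the opposite order, least frequent first.
reversed <- fct_rev(by_freq)
levels(reversed)
```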
gss_cat %>% count(partyid)
## # A tibble: 10 x 2
##    partyid                n
##    <fctr>             <int>
##  1 No answer            154
##  2 Don't know             1
##  3 Other party          393
##  4 Strong republican   2314
##  5 Not str republican  3032
##  6 Ind,near rep        1791
##  7 Independent         4119
##  8 Ind,near dem        2499
##  9 Not str democrat    3690
## 10 Strong democrat     3490
fct_recode changes the names of the levels (new name on the left, old name on the right):
gss_cat %>%
mutate(partyid = fct_recode(partyid,
"Republican, strong" = "Strong republican",
"Republican, weak" = "Not str republican",
"Independent, near rep" = "Ind,near rep",
"Independent, near dem" = "Ind,near dem",
"Democrat, weak" = "Not str democrat",
"Democrat, strong" = "Strong democrat")) %>%
count(partyid)
## # A tibble: 10 x 2
##    partyid                   n
##    <fctr>                <int>
##  1 No answer               154
##  2 Don't know                1
##  3 Other party             393
##  4 Republican, strong     2314
##  5 Republican, weak       3032
##  6 Independent, near rep  1791
##  7 Independent            4119
##  8 Independent, near dem  2499
##  9 Democrat, weak         3690
## 10 Democrat, strong       3490
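As the output shows, levels that are not mentioned in fct_recode are left untouched. A minimal sketch on a made-up factor (f and recoded are illustrative names):

```r
library(forcats)

# Toy factor with three levels.
f <- factor(c("a", "b", "c"))

# Rename two levels; "c" is not mentioned and keeps its name.
recoded <- fct_recode(f, "Apple" = "a", "Banana" = "b")

levels(recoded)
```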
fct_collapse merges several levels into one:
gss_cat %>%
mutate(partyid = fct_collapse(partyid,
other = c("No answer", "Don't know", "Other party"),
rep = c("Strong republican", "Not str republican"),
ind = c("Ind,near rep", "Independent", "Ind,near dem"),
dem = c("Not str democrat", "Strong democrat"))) %>%
count(partyid)
## # A tibble: 4 x 2
##   partyid     n
##   <fctr>  <int>
## 1 other     548
## 2 rep      5346
## 3 ind      8409
## 4 dem      7180
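The same idea on a made-up factor, where each new level is given as a character vector of the old levels it absorbs (f and collapsed are illustrative names):

```r
library(forcats)

# Toy factor with four levels.
f <- factor(c("a", "b", "c", "d", "a"))

# Merge "a" and "b" into "ab", and "c" and "d" into "cd".
collapsed <- fct_collapse(f,
                          ab = c("a", "b"),
                          cd = c("c", "d"))

levels(collapsed)
table(collapsed)
```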
fct_lump collapses the least frequent levels into an "other" level (named "otros" here).
gss_cat %>%
  mutate(relig = fct_lump(relig, n = 10, other_level="otros")) %>%
  count(relig, sort = TRUE)
## # A tibble: 11 x 2
##    relig                       n
##    <fctr>                  <int>
##  1 Protestant              10846
##  2 Catholic                 5124
##  3 None                     3523
##  4 Christian                 689
##  5 Jewish                    388
##  6 otros                     234
##  7 Other                     224
##  8 Buddhism                  147
##  9 Inter-nondenominational   109
## 10 Moslem/islam              104
## 11 Orthodox-christian         95
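Note that in the output above "otros" is the new lumped level, while "Other" is a level that already existed in the data. A minimal sketch of the lumping itself on a made-up factor (f and lumped are illustrative names), keeping only the single most frequent level:

```r
library(forcats)

# Toy factor: "a" is frequent, "b" and "c" are rare.
f <- factor(c("a", "a", "a", "b", "c"))

# Keep the n = 1 most frequent level, lump the rest into "otros".
lumped <- fct_lump(f, n = 1, other_level = "otros")

levels(lumped)
table(lumped)
```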
Calculate how many companies there are in each free trade zone and make a dot plot.
Calculate the aggregate VBP by zone, keeping only the "main" zones. Do the same with employees.
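One possible way to approach these exercises, sketched on made-up data since the free trade zone dataset is not shown here. The data frame empresas, its columns zona and vbp, and every value in it are hypothetical placeholders; the real column names will differ.

```r
library(dplyr)
library(forcats)
library(ggplot2)

# Hypothetical data: one row per company (values invented for illustration).
empresas <- data.frame(
  zona = c("Zona A", "Zona A", "Zona A", "Zona B", "Zona B", "Zona C"),
  vbp = c(10, 20, 30, 5, 15, 1),
  stringsAsFactors = FALSE
)

# Exercise 1: companies per zone, with levels ordered by the count.
por_zona <- empresas %>%
  count(zona, name = "n_empresas") %>%
  mutate(zona = fct_reorder(zona, n_empresas))

dot_plot <- ggplot(por_zona, aes(n_empresas, zona)) + geom_point()

# Exercise 2: keep the 2 zones with the largest total VBP (weights via w),
# lump the rest into "otras", then aggregate.
principales <- empresas %>%
  mutate(zona = fct_lump(zona, n = 2, w = vbp, other_level = "otras")) %>%
  group_by(zona) %>%
  summarize(vbp = sum(vbp))

principales
```

The employees variant would follow the same pattern, replacing vbp with the corresponding column.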