Workshop: Using dplyr in Statistical Computing

\[\color{blue}{Using~dplyr}\] Using the function set.seed(20__), with the last two digits of your ID number in the blanks, complete the following activities:

library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.5     v dplyr   1.0.7
## v tidyr   1.1.4     v stringr 1.4.0
## v readr   2.0.2     v forcats 0.5.1

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

set.seed(2077)

\[\color{blue}{Activity~1}\] Generate a sample of size 120 from the normal distribution with mean 5 and standard deviation 0.85. Use only two decimal places (rnorm()):

set.seed(2077)
biomasa <- rnorm(120, 5, 0.85)
head(biomasa)

## [1] 5.298067 4.391265 5.742773 5.525798 5.733124 4.740585
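
The prompt asks for two decimal places, which rnorm() alone does not provide. A minimal sketch, assuming that rounding at generation time is acceptable; the name `biomasa_2d` is hypothetical and is used so that the `biomasa` vector above stays untouched:

```r
set.seed(2077)
# round() keeps two decimal places; biomasa_2d is an illustrative name
biomasa_2d <- round(rnorm(120, mean = 5, sd = 0.85), 2)
head(biomasa_2d)
```

Rounding changes the stored values, so any statistic computed later would differ slightly from one computed on the unrounded sample.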

Generate a sample of size 120 from the binomial distribution with parameters 0.8 (probability) and 20 (independent trials) (rbinom()):

set.seed(2077)
flores.r <- rbinom(120, 20, 0.8)
flores.r

##   [1] 15 17 17 16 14 16 15 17 14 15 17 17 16 12 19 13 14 15 17 17 16 17 18 14 17
##  [26] 16 17 19 18 16 17 16 19 15 17 16 12 13 13 12 15 17 11 14 15 16 18 17 16 16
##  [51] 16 15 17 15 14 18 14 14 17 14 16 16 14 12 14 13 13 16 14 19 14 14 12 17 18
##  [76] 15 15 17 15 17 15 15 16 19 16 17 19 12 19 16 18 17 16 17 16 16 13 19 12 15
## [101] 14 18 18 12 18 14 17 14 17 19 16 13 13 15 17 16 16 16 17 17

Generate a sample of size 120 from the Poisson distribution with parameter 10.5 (mean) (rpois()):

set.seed(2077)
flores.d <- rpois(120, 10.5)
flores.d

##   [1] 11  8 10  8  9  9  3 11  9  9 14 11  5  9 17  7 11 13 12  7  8 11 16 14 10
##  [26]  4 16  7  9  9 12  9  6  8 16 17 13  6 13 13  4 15  9 10  8 18 13  8 10 11
##  [51] 11  8 10 10  9 13  4 10 13  9 12 10  7  6  8  6 10 11 15 18  7 13  8  7  8
##  [76]  4 11  4  8 12 12  6  9 10  9 10 11 14 12  9 11 10  9 13  8 10  5  6  6  7
## [101]  9 16 10 12 11 11  6 10 11 13  9 10  7  7 20 10 12 12 13 11

Generate a sample with replacement of size 120 from a sequence of 300 numbers (sample.int()):

set.seed(2077)
hojas.d <- sample.int(300, 120, replace = TRUE)
hojas.d

##   [1] 280 169 277  56  73 153 265  72 155 100  44 153 116   4 112 136  79 142
##  [19]  16  24 232 235 106 185 289 280 208 225 251 111 237 145 106 200 127 134
##  [37] 243 148  11  10 286 152 160 197 158  38  33 230 156  14 289 275 116  43
##  [55] 121  90   4 298  28  47  63 141  33 243  72 154  74 217  94  79 120 124
##  [73] 188  51  58 189 174 257 230 205 142 144  30 222 155 243  93 208 282 141
##  [91] 161 101  66 202 121  58 170  87  36 284 240 155 277  37  48 155 112  47
## [109] 132 109  77  33 257  99 276 125 133 201 104  23

Using the purrr library, generate a sample of size 120 from the Bernoulli distribution with parameter 0.75 (probability) (rbernoulli()), and replace FALSE with "Ausente" (absent) and TRUE with "Presente" (present):

library(purrr)
plaga <- rbernoulli(120, 0.75)
plaga <- ifelse(plaga, "Presente", "Ausente")
head(plaga)

## [1] "Presente" "Ausente"  "Presente" "Presente" "Presente" "Presente"
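
rbernoulli() is deprecated in purrr 1.0.0 and later. The sketch below shows an equivalent draw in base R, treating a Bernoulli(0.75) variable as a binomial with a single trial. The name `plaga_alt` is hypothetical, and the values drawn will not match those printed above:

```r
set.seed(2077)
# A Bernoulli(p) draw is a binomial draw with size = 1
plaga_alt <- rbinom(120, size = 1, prob = 0.75)
# Map 1 -> "Presente" and 0 -> "Ausente"
plaga_alt <- ifelse(plaga_alt == 1, "Presente", "Ausente")
table(plaga_alt)
```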

Generate a factor with three levels, each with 40 observations, labeled (S) for healthy plants, (PA) for partially affected plants, and (MA) for very affected plants. Use the function (gl()):

estatus <- gl(3, 40, labels = c("S","PA", "MA"))
estatus

##   [1] S  S  S  S  S  S  S  S  S  S  S  S  S  S  S  S  S  S  S  S  S  S  S  S  S 
##  [26] S  S  S  S  S  S  S  S  S  S  S  S  S  S  S  PA PA PA PA PA PA PA PA PA PA
##  [51] PA PA PA PA PA PA PA PA PA PA PA PA PA PA PA PA PA PA PA PA PA PA PA PA PA
##  [76] PA PA PA PA PA MA MA MA MA MA MA MA MA MA MA MA MA MA MA MA MA MA MA MA MA
## [101] MA MA MA MA MA MA MA MA MA MA MA MA MA MA MA MA MA MA MA MA
## Levels: S PA MA

Generate a two-level factor from the uniform distribution with parameters 0 and 1.2: for each of the 120 generated values, label it (FO), for organic fertilization, if the value is less than 0.5; otherwise label it (FI), for inorganic fertilization. Use the function (runif()). If you wish, use (ifelse()) or (if_else()) for the condition:

set.seed(2077)
datos <- runif(120, 0, 1.2)
fertilizacion <- ifelse(datos < 0.5, "FO", "FI")
fertilizacion

##   [1] "FI" "FO" "FO" "FI" "FI" "FI" "FI" "FO" "FI" "FI" "FO" "FO" "FI" "FI" "FO"
##  [16] "FI" "FI" "FI" "FO" "FO" "FI" "FO" "FO" "FI" "FO" "FI" "FO" "FO" "FO" "FI"
##  [31] "FO" "FI" "FO" "FI" "FO" "FI" "FI" "FI" "FI" "FI" "FI" "FO" "FI" "FI" "FI"
##  [46] "FI" "FO" "FO" "FI" "FI" "FI" "FI" "FO" "FI" "FI" "FO" "FI" "FI" "FO" "FI"
##  [61] "FI" "FI" "FI" "FI" "FI" "FI" "FI" "FI" "FI" "FO" "FI" "FI" "FI" "FO" "FO"
##  [76] "FI" "FI" "FO" "FI" "FO" "FI" "FI" "FI" "FO" "FI" "FO" "FO" "FI" "FO" "FI"
##  [91] "FO" "FO" "FI" "FO" "FI" "FI" "FI" "FO" "FI" "FI" "FI" "FO" "FO" "FI" "FO"
## [106] "FI" "FO" "FI" "FO" "FO" "FI" "FI" "FI" "FI" "FO" "FI" "FI" "FI" "FO" "FO"
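
The prompt asks for a factor, while ifelse() returns a character vector. A sketch of the conversion, assuming the level order FO, FI; the name `fertilizacion_f` is hypothetical:

```r
set.seed(2077)
datos <- runif(120, min = 0, max = 1.2)
# factor() turns the character labels into a two-level factor
fertilizacion_f <- factor(ifelse(datos < 0.5, "FO", "FI"),
                          levels = c("FO", "FI"))
levels(fertilizacion_f)
table(fertilizacion_f)
```

Storing the labels as a factor fixes the set of allowed levels, which matters later for grouping and for modelling functions that treat factors and character vectors differently.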

Using the data generated in Activity 1, complete the following activity:

\[\color{blue}{Activity~2}\] Build a data frame (data.frame()) or a tibble (tibble()) with all the variables generated above, and name the variables respectively: Biomasa (grams), Flores.r (count of flowers on three branches), Flores.d (count of fallen flowers), Hojas.d (count of fallen leaves), Plaga, Estatus, and Fertilización:

tib.c <- data.frame(biomasa, flores.r, flores.d, hojas.d, plaga, estatus, fertilizacion)
head(tib.c)

##    biomasa flores.r flores.d hojas.d    plaga estatus fertilizacion
## 1 5.298067       15       11     280 Presente       S            FI
## 2 4.391265       17        8     169  Ausente       S            FO
## 3 5.742773       17       10     277 Presente       S            FO
## 4 5.525798       16        8      56 Presente       S            FI
## 5 5.733124       14        9      73 Presente       S            FI
## 6 4.740585       16        9     153 Presente       S            FI

Check the object's dimensions (dim()), its structure with (str()) or (glimpse()), its class (class()), its variable names (names()), and the presence of missing values (is.na()):

dim(tib.c)

## [1] 120   7

str(tib.c)

## 'data.frame':    120 obs. of  7 variables:
##  $ biomasa      : num  5.3 4.39 5.74 5.53 5.73 ...
##  $ flores.r     : int  15 17 17 16 14 16 15 17 14 15 ...
##  $ flores.d     : int  11 8 10 8 9 9 3 11 9 9 ...
##  $ hojas.d      : int  280 169 277 56 73 153 265 72 155 100 ...
##  $ plaga        : chr  "Presente" "Ausente" "Presente" "Presente" ...
##  $ estatus      : Factor w/ 3 levels "S","PA","MA": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fertilizacion: chr  "FI" "FO" "FO" "FI" ...

class(tib.c)

## [1] "data.frame"

names(tib.c)

## [1] "biomasa"       "flores.r"      "flores.d"      "hojas.d"      
## [5] "plaga"         "estatus"       "fertilizacion"
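
The is.na() check requested in the prompt is not part of the run above. A minimal sketch on a toy data frame (`toy` and `na_por_columna` are hypothetical names with illustrative values); the same calls would apply to tib.c:

```r
# Toy data frame with one missing value in each numeric column
toy <- data.frame(x = c(1, NA, 3), y = c(NA, 5, 6), g = c("a", "b", "c"))
# anyNA() answers whether any cell at all is missing
anyNA(toy)
# is.na() gives a logical matrix; colSums() counts the NAs per column
na_por_columna <- colSums(is.na(toy))
na_por_columna
```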

Select a subset (75% of the rows, all columns) of the whole data frame or tibble. Use the function (sample_n()) and assign a missing value NA to any two quantitative variables of the sampled data set:

tib.i <- sample_n(tib.c, 120*0.75)
dim(tib.i)

## [1] 90  7

tib.i[60,3]<-NA
tib.i[15,4]<-NA
head(tib.i)

##    biomasa flores.r flores.d hojas.d    plaga estatus fertilizacion
## 1 4.658924       13        7     136  Ausente       S            FI
## 2 5.310972       13        5     170 Presente      MA            FI
## 3 4.538076       16        9     111  Ausente       S            FI
## 4 3.285837       14       11      79 Presente       S            FI
## 5 4.226020       17       13     202  Ausente      MA            FO
## 6 4.387129       15        7     284 Presente      MA            FI
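
In dplyr 1.0.0 and later, sample_n() is superseded by slice_sample(). The sketch below shows the same idea on a stand-in data frame; `df`, `muestra`, and the column names are hypothetical:

```r
library(dplyr)
set.seed(2077)
df <- data.frame(id = 1:120, valor = rnorm(120))
# prop = 0.75 keeps 75% of the rows (90 of 120), sampled without replacement
muestra <- slice_sample(df, prop = 0.75)
# Assigning NA by column name is less fragile than using a column position
muestra$valor[15] <- NA
dim(muestra)
```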

We now have two data sets: the complete one, saved as tib.c, and the incomplete one, saved as tib.i. We can now use some of the dplyr functions covered in class.

\[\color{blue}{Activity~3}\] Select any one variable with (select()) from tib.c:

tib.c_fert<-tib.c %>%
  select(fertilizacion)
head(tib.c_fert)

##   fertilizacion
## 1            FI
## 2            FO
## 3            FO
## 4            FI
## 5            FI
## 6            FI

Select the third through sixth variables with (select(:)) from tib.c:

tib.c_3_6<-tib.c %>%
  select(3:6)
head(tib.c_3_6)

##   flores.d hojas.d    plaga estatus
## 1       11     280 Presente       S
## 2        8     169  Ausente       S
## 3       10     277 Presente       S
## 4        8      56 Presente       S
## 5        9      73 Presente       S
## 6        9     153 Presente       S

Exclude the third through sixth variables with (select(!(:))) from tib.c:

tib.c_sin_3_6<-select(tib.c,!(3:6))
head(tib.c_sin_3_6)

##    biomasa flores.r fertilizacion
## 1 5.298067       15            FI
## 2 4.391265       17            FO
## 3 5.742773       17            FO
## 4 5.525798       16            FI
## 5 5.733124       14            FI
## 6 4.740585       16            FI

Select the variables that do not end in .d using (select(!ends_with())):

sin_d <- select(tib.c, !ends_with(".d"))
head(sin_d)

##    biomasa flores.r    plaga estatus fertilizacion
## 1 5.298067       15 Presente       S            FI
## 2 4.391265       17  Ausente       S            FO
## 3 5.742773       17 Presente       S            FO
## 4 5.525798       16 Presente       S            FI
## 5 5.733124       14 Presente       S            FI
## 6 4.740585       16 Presente       S            FI

Select the variables that start with Fl using (select(starts_with())):

with_fl<-select(tib.c, starts_with("fl"))
head(with_fl)

##   flores.r flores.d
## 1       15       11
## 2       17        8
## 3       17       10
## 4       16        8
## 5       14        9
## 6       16        9

Select the variables that start with F and end with .d using (select(starts_with() & ends_with())):

start.f_and_end_d<-select(tib.c, starts_with("F")& ends_with(".d"))
head(start.f_and_end_d)

##   flores.d
## 1       11
## 2        8
## 3       10
## 4        8
## 5        9
## 6        9
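
Selection helpers can be combined with the Boolean operators &, | and !. A small sketch on a toy one-row data frame (`toy`, `ambas`, `cualquiera`, and `ninguna` are hypothetical names):

```r
library(dplyr)
toy <- data.frame(biomasa = 1, flores.r = 2, flores.d = 3, hojas.d = 4)
# & keeps the columns matched by both helpers
ambas <- names(select(toy, starts_with("f") & ends_with(".d")))
# | keeps the columns matched by either helper
cualquiera <- names(select(toy, starts_with("f") | ends_with(".d")))
# ! keeps the columns the helper does not match
ninguna <- names(select(toy, !ends_with(".d")))
ambas
cualquiera
ninguna
```

Note that starts_with() and ends_with() ignore case by default, which is why a pattern such as "F" matches the lowercase column names used in this workshop.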

Select any one variable with (select()) from tib.c and group it by Estatus using (group_by()).

Save the previous result in the variable var.estatus and sort it from largest to smallest by the same variable using (arrange(desc(), .by_group = TRUE)) for tib.c:

var.estatus <- tib.c %>%
  group_by(estatus) %>%
  select(flores.r)

## Adding missing grouping variables: `estatus`

head(var.estatus)

## # A tibble: 6 x 2
## # Groups:   estatus [1]
##   estatus flores.r
##   <fct>      <int>
## 1 S             15
## 2 S             17
## 3 S             17
## 4 S             16
## 5 S             14
## 6 S             16

var.estatus <- var.estatus %>%
  arrange(desc(flores.r), .by_group = TRUE)
head(var.estatus)
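
The argument is spelled `.by_group`, with a leading dot, and it only has an effect on grouped data. A sketch of the difference on a toy data frame (`toy`, `global`, and `por_grupo` are hypothetical names):

```r
library(dplyr)
toy <- data.frame(g = c("a", "b", "a", "b"), x = c(1, 4, 3, 2))
# Without .by_group, a grouped tibble is still sorted by x alone
global <- toy %>% group_by(g) %>% arrange(desc(x))
# With .by_group = TRUE, rows are sorted by group first, then by x within each group
por_grupo <- toy %>% group_by(g) %>% arrange(desc(x), .by_group = TRUE)
global$x
por_grupo$x
```

An argument written as `by_group = TRUE`, without the dot, is not recognized as the option; arrange() treats it as one more sorting expression.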

Filter the variables that start with Flores for the very affected status (filter(,)):

flores_MA <- tib.c %>%
  filter(estatus == "MA") %>%          # filter while estatus is still a column
  select(starts_with("flores"))
head(flores_MA)

##   flores.r flores.d
## 1       15       12
## 2       15        6
## 3       16        9
## 4       19       10
## 5       16        9
## 6       17       10

Filter the data for which biomass is greater than 5 grams:

bio5<-tib.c %>%
  filter(biomasa > 5)
head(bio5)

##    biomasa flores.r flores.d hojas.d    plaga estatus fertilizacion
## 1 5.298067       15       11     280 Presente       S            FI
## 2 5.742773       17       10     277 Presente       S            FO
## 3 5.525798       16        8      56 Presente       S            FI
## 4 5.733124       14        9      73 Presente       S            FI
## 5 5.834968       14        9     155 Presente       S            FI
## 6 5.155649       17       14      44 Presente       S            FO

Filter the data to keep only the plants that are partially affected and were treated with organic fertilization:

Plants_PA_FO<-tib.c %>%
  filter(estatus == "PA", fertilizacion == "FO")
head(Plants_PA_FO)

##    biomasa flores.r flores.d hojas.d    plaga estatus fertilizacion
## 1 5.236966       17       15     152 Presente      PA            FO
## 2 5.035616       18       13      33  Ausente      PA            FO
## 3 5.002382       17        8     230 Presente      PA            FO
## 4 4.223415       17       10     116 Presente      PA            FO
## 5 4.960302       18       13      90 Presente      PA            FO
## 6 5.115705       17       13      28 Presente      PA            FO

Filter the data to keep only the plants that are partially affected or were treated with inorganic fertilization:

plants_PA_FI<-tib.c %>%
  filter(estatus == "PA"|fertilizacion == "FI")
head(plants_PA_FI)

##    biomasa flores.r flores.d hojas.d    plaga estatus fertilizacion
## 1 5.298067       15       11     280 Presente       S            FI
## 2 5.525798       16        8      56 Presente       S            FI
## 3 5.733124       14        9      73 Presente       S            FI
## 4 4.740585       16        9     153 Presente       S            FI
## 5 4.844750       15        3     265 Presente       S            FI
## 6 5.834968       14        9     155 Presente       S            FI

median(flores.d)

## [1] 10

Filter the fallen-flower data by presence or absence of the pest for the cases where the number of fallen flowers is greater than its median:

Pre_au_plaga<-tib.c %>%
  select(flores.d, plaga) %>%
  filter(flores.d > median(flores.d)) %>%
  arrange(plaga)
head(Pre_au_plaga)

##   flores.d   plaga
## 1       11 Ausente
## 2       11 Ausente
## 3       16 Ausente
## 4       17 Ausente
## 5       13 Ausente
## 6       13 Ausente

Filter the data of one of the variables with missing values by presence or absence of the pest for the cases where the value of that variable is greater than its median. Compare the results in the two data sets (complete and sampled). If you find differences in the median, use median(, na.rm = TRUE):

Pre_au_plaga2 <- tib.i %>%
  select(flores.r, plaga) %>%
  filter(flores.r > median(flores.r, na.rm = TRUE))
head(Pre_au_plaga2)

m.c <- median(tib.c$flores.r)
m.i <- median(tib.i$flores.r, na.rm = TRUE) 
data.frame(m.c,m.i)

##   m.c m.i
## 1  16  16

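Note: With na.rm = TRUE, the medians of flores.r in the two data sets are equal (both 16). The missing values in tib.i were assigned to flores.d and hojas.d, so flores.r itself contains no NA.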
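
The effect of na.rm is easiest to see on a vector that does contain a missing value. A minimal sketch with illustrative numbers (`x`, `m_con_na`, and `m_sin_na` are hypothetical names):

```r
x <- c(7, 11, NA, 9, 13)
# A single missing value makes the result NA
m_con_na <- median(x)
# na.rm = TRUE drops the NA and uses the values 7, 9, 11 and 13
m_sin_na <- median(x, na.rm = TRUE)
m_con_na
m_sin_na
```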

Since we have already used several functions, the following exercises omit the functions evaluated earlier and mention only those not yet used.

\[\color{blue}{Activity~4}\]

Select two quantitative variables, assign them to a vector with c(,), and name it v1. In another vector, put two numeric values to serve as the condition for each variable, for example the lower quartile for one and the upper quartile for the other, and call it v2. Now, using pipes, call the complete data.frame and filter using the .data pronoun to select the rows in which each variable exceeds its condition value. Use (v1 = c(); v2 = c(); tib.c %>% filter(.data[[v1[[1]]]] > v2[[1]], .data[[v1[[2]]]] > v2[[2]])):

v1 <- c("flores.r", "flores.d")          # column names, stored as strings
c1 <- quantile(tib.c$flores.r, 0.75)     # upper quartile of flores.r
c2 <- quantile(tib.c$flores.d, 0.25)     # lower quartile of flores.d
v2 <- c(c1, c2)
tib.c %>%
  filter(.data[[v1[[1]]]] > v2[[1]], .data[[v1[[2]]]] > v2[[2]])
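
The .data pronoun looks up a column by a name held in a string, which is what makes the v1/v2 pattern work. A sketch on a toy data frame, where `toy`, `res`, and the thresholds are hypothetical:

```r
library(dplyr)
toy <- data.frame(a = c(1, 5, 9), b = c(2, 8, 1))
v1 <- c("a", "b")   # column names stored as strings
v2 <- c(3, 3)       # one threshold per column
# .data[[name]] retrieves the column called `name` from the data frame
res <- toy %>%
  filter(.data[[v1[[1]]]] > v2[[1]], .data[[v1[[2]]]] > v2[[2]])
res
```

If v1 held the column values instead of the column names, v1[[1]] would be a single number, and the comparison would no longer be evaluated row by row.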

Create a data frame or tibble with all the continuous quantitative variables standardized with the (z-score) and the discrete ones with min-max standardization, and call it tib.e. Use (mutate()):

mean(tib.c$biomasa)

## [1] 5.100295

sd(tib.c$biomasa)

## [1] 0.8639438

tib.c_zcore<-tib.c %>% 
  mutate(zscore = (biomasa - mean(biomasa))/sd(biomasa))
head(tib.c_zcore)

##    biomasa flores.r flores.d hojas.d    plaga estatus fertilizacion     zscore
## 1 5.298067       15       11     280 Presente       S            FI  0.2289177
## 2 4.391265       17        8     169  Ausente       S            FO -0.8206905
## 3 5.742773       17       10     277 Presente       S            FO  0.7436569
## 4 5.525798       16        8      56 Presente       S            FI  0.4925125
## 5 5.733124       14        9      73 Presente       S            FI  0.7324888
## 6 4.740585       16        9     153 Presente       S            FI -0.4163579

min_max_norm <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}
tib.c_norm <- as.data.frame(lapply(tib.c[2:4], min_max_norm))
head(tib.c_norm)

##   flores.r  flores.d   hojas.d
## 1    0.500 0.4705882 0.9387755
## 2    0.750 0.2941176 0.5612245
## 3    0.750 0.4117647 0.9285714
## 4    0.625 0.2941176 0.1768707
## 5    0.375 0.3529412 0.2346939
## 6    0.625 0.3529412 0.5068027

tib.e <-data.frame(tib.c_zcore,tib.c_norm)
head(tib.e)

##    biomasa flores.r flores.d hojas.d    plaga estatus fertilizacion     zscore
## 1 5.298067       15       11     280 Presente       S            FI  0.2289177
## 2 4.391265       17        8     169  Ausente       S            FO -0.8206905
## 3 5.742773       17       10     277 Presente       S            FO  0.7436569
## 4 5.525798       16        8      56 Presente       S            FI  0.4925125
## 5 5.733124       14        9      73 Presente       S            FI  0.7324888
## 6 4.740585       16        9     153 Presente       S            FI -0.4163579
##   flores.r.1 flores.d.1 hojas.d.1
## 1      0.500  0.4705882 0.9387755
## 2      0.750  0.2941176 0.5612245
## 3      0.750  0.4117647 0.9285714
## 4      0.625  0.2941176 0.1768707
## 5      0.375  0.3529412 0.2346939
## 6      0.625  0.3529412 0.5068027
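
Both standardizations can also be written in a single mutate() call with across(), which overwrites the columns in place instead of appending copies. A sketch on a toy data frame, assuming dplyr 1.0.0 or later; `toy`, `toy_e`, `z_score`, and `min_max` are hypothetical names:

```r
library(dplyr)
toy <- data.frame(peso = c(4.2, 5.1, 6.3), conteo = c(10L, 14L, 18L))
z_score <- function(x) (x - mean(x)) / sd(x)
min_max <- function(x) (x - min(x)) / (max(x) - min(x))
toy_e <- toy %>%
  mutate(across(where(is.double), z_score),    # continuous columns
         across(where(is.integer), min_max))   # discrete counts
toy_e
```

Selecting by type keeps the call short, but it relies on the counts being stored as integers; columns could be listed by name instead when that does not hold.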

Create a new variable in tib.c that divides the number of flowers on the branches by the number of fallen flowers.

tib.c_div<-tib.c %>% 
  mutate(div_flores = flores.r/flores.d)
head(tib.c_div)

##    biomasa flores.r flores.d hojas.d    plaga estatus fertilizacion div_flores
## 1 5.298067       15       11     280 Presente       S            FI   1.363636
## 2 4.391265       17        8     169  Ausente       S            FO   2.125000
## 3 5.742773       17       10     277 Presente       S            FO   1.700000
## 4 5.525798       16        8      56 Presente       S            FI   2.000000
## 5 5.733124       14        9      73 Presente       S            FI   1.555556
## 6 4.740585       16        9     153 Presente       S            FI   1.777778

Select only the quotient variable above, grouped by plaga. Create a new variable that gives the minimum rank within each group. Call the variable rangomin. Use (mutate(rangomin = min_rank(desc()))):

tib.c_divrang <- tib.c_div %>%
  group_by(plaga) %>%
  mutate(rangomin = min_rank(desc(div_flores)))
head(tib.c_divrang)

## # A tibble: 6 x 9
## # Groups:   plaga [2]
##   biomasa flores.r flores.d hojas.d plaga    estatus fertilizacion div_flores
##     <dbl>    <int>    <int>   <int> <chr>    <fct>   <chr>              <dbl>
## 1    5.30       15       11     280 Presente S       FI                  1.36
## 2    4.39       17        8     169 Ausente  S       FO                  2.12
## 3    5.74       17       10     277 Presente S       FO                  1.7 
## 4    5.53       16        8      56 Presente S       FI                  2   
## 5    5.73       14        9      73 Presente S       FI                  1.56
## 6    4.74       16        9     153 Presente S       FI                  1.78
## # ... with 1 more variable: rangomin <int>

Rename the flower-related variables as you like. Use (rename(tib.e,)):

names(tib.e)

##  [1] "biomasa"       "flores.r"      "flores.d"      "hojas.d"      
##  [5] "plaga"         "estatus"       "fertilizacion" "zscore"       
##  [9] "flores.r.1"    "flores.d.1"    "hojas.d.1"

renamed.tib.e<-rename(tib.e,flores.rojas=flores.r,
                      flores.azules=flores.d,
                      flores.rojas.marchitas=flores.r.1,
                      flores.azules.suculentas=flores.d.1)
names(renamed.tib.e)

##  [1] "biomasa"                  "flores.rojas"            
##  [3] "flores.azules"            "hojas.d"                 
##  [5] "plaga"                    "estatus"                 
##  [7] "fertilizacion"            "zscore"                  
##  [9] "flores.rojas.marchitas"   "flores.azules.suculentas"
## [11] "hojas.d.1"

Convert all variable names in any tibble to uppercase. Use (rename_with(tib.e, toupper)):

tib.e_MAYUS<-(rename_with(tib.e,toupper))
head(tib.e_MAYUS)

##    BIOMASA FLORES.R FLORES.D HOJAS.D    PLAGA ESTATUS FERTILIZACION     ZSCORE
## 1 5.298067       15       11     280 Presente       S            FI  0.2289177
## 2 4.391265       17        8     169  Ausente       S            FO -0.8206905
## 3 5.742773       17       10     277 Presente       S            FO  0.7436569
## 4 5.525798       16        8      56 Presente       S            FI  0.4925125
## 5 5.733124       14        9      73 Presente       S            FI  0.7324888
## 6 4.740585       16        9     153 Presente       S            FI -0.4163579
##   FLORES.R.1 FLORES.D.1 HOJAS.D.1
## 1      0.500  0.4705882 0.9387755
## 2      0.750  0.2941176 0.5612245
## 3      0.750  0.4117647 0.9285714
## 4      0.625  0.2941176 0.1768707
## 5      0.375  0.3529412 0.2346939
## 6      0.625  0.3529412 0.5068027

Convert all variable names to lowercase and, while you are at it, replace the dots with underscores, so that .d becomes _d. Use (rename_with(tib.e, ~tolower(gsub(".", "_", .x, fixed = TRUE)))):

tib.e_rename_with <- rename_with(tib.e, ~ tolower(gsub(".", "_", .x, fixed = TRUE)))
head(tib.e_rename_with)

##    biomasa flores_r flores_d hojas_d    plaga estatus fertilizacion     zscore
## 1 5.298067       15       11     280 Presente       S            FI  0.2289177
## 2 4.391265       17        8     169  Ausente       S            FO -0.8206905
## 3 5.742773       17       10     277 Presente       S            FO  0.7436569
## 4 5.525798       16        8      56 Presente       S            FI  0.4925125
## 5 5.733124       14        9      73 Presente       S            FI  0.7324888
## 6 4.740585       16        9     153 Presente       S            FI -0.4163579
##   flores_r_1 flores_d_1 hojas_d_1
## 1      0.500  0.4705882 0.9387755
## 2      0.750  0.2941176 0.5612245
## 3      0.750  0.4117647 0.9285714
## 4      0.625  0.2941176 0.1768707
## 5      0.375  0.3529412 0.2346939
## 6      0.625  0.3529412 0.5068027

Among the useful options for computing descriptive statistics is the function summarise(), which we use in the following activity:

\[\color{blue}{Activity~5}\]

Select the variable biomasa from the tibble with missing values and use summarise() to obtain its mean and number of observations:

tib.i%>%
  summarise(mean(biomasa),N_datos=n())

##   mean(biomasa) N_datos
## 1      5.105746      90

Select the variable biomasa from the tibble with missing values and use summarise() to obtain the mean and number of observations by fertilization type:

tib.i%>%
 group_by(fertilizacion) %>%
 summarise(mean(biomasa),N_datos=n())

## # A tibble: 2 x 3
##   fertilizacion `mean(biomasa)` N_datos
##   <chr>                   <dbl>   <int>
## 1 FI                       5.05      56
## 2 FO                       5.20      34

Select the variable biomasa from the tibble with missing values and use summarise() to obtain the 0.10, 0.20, 0.30, 0.40, and 0.50 quantiles by fertilization type:

tib.i %>%
  group_by(fertilizacion) %>%
  # Bare column names are evaluated within each group
  summarise(Q.10 = quantile(biomasa, 0.10),
            Q.20 = quantile(biomasa, 0.20),
            Q.30 = quantile(biomasa, 0.30),
            Q.40 = quantile(biomasa, 0.40),
            Q.50 = quantile(biomasa, 0.50))
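
Inside a grouped summarise(), a bare column name refers to the rows of the current group, whereas an expression such as `df$col` always refers to the whole column and gives the same value in every group. A sketch on a toy data frame (`toy` and `cuantiles` are hypothetical names with illustrative values):

```r
library(dplyr)
toy <- data.frame(grupo = rep(c("FI", "FO"), each = 5),
                  valor = c(1, 2, 3, 4, 5, 10, 20, 30, 40, 50))
# valor is evaluated per group; toy$valor would use all ten values each time
cuantiles <- toy %>%
  group_by(grupo) %>%
  summarise(Q.10 = quantile(valor, 0.10),
            Q.50 = quantile(valor, 0.50))
cuantiles
```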

Select the variable biomasa from the tibble with missing values and use summarise() to obtain the mean, median, maximum, minimum, standard deviation, mean absolute deviation, trimmed mean, and variance by fertilization type and pest:

tib.i%>%
 group_by(fertilizacion,plaga) %>%
 summarise(meanbio=mean(biomasa),
           medbio=median(biomasa),
           maxbio=max(biomasa),
           minbio=min(biomasa),
           destipica=sd(biomasa),
           desmedia=mean(abs(biomasa-mean(biomasa))),
           medtru5=mean(biomasa,trim=5/100),
           medtru10=mean(biomasa,trim=10/100),
           varbio=var(biomasa))

## `summarise()` has grouped output by 'fertilizacion'. You can override using the `.groups` argument.

## # A tibble: 4 x 11
## # Groups:   fertilizacion [2]
##   fertilizacion plaga    meanbio medbio maxbio minbio destipica desmedia medtru5
##   <chr>         <chr>      <dbl>  <dbl>  <dbl>  <dbl>     <dbl>    <dbl>   <dbl>
## 1 FI            Ausente     5.25   4.79   7.18   3.97     1.07     0.867    5.25
## 2 FI            Presente    5.01   5.07   6.74   2.54     0.833    0.603    5.03
## 3 FO            Ausente     5.48   5.72   7.26   3.08     1.11     0.882    5.48
## 4 FO            Presente    5.00   5.04   6.07   3.76     0.667    0.511    5.01
## # ... with 2 more variables: medtru10 <dbl>, varbio <dbl>

tib.i_filS <- tib.i %>%
  filter(estatus == "S") %>%
  select(biomasa, fertilizacion, plaga)   # keep biomasa for the summaries below
head(tib.i_filS)

tib.i_filS %>%
  group_by(fertilizacion, plaga) %>%
  summarise(Ndatos = n(),
            meanbio = mean(biomasa),
            medbio = median(biomasa),
            maxbio = max(biomasa),
            minbio = min(biomasa),
            destipica = sd(biomasa),
            desmedia = mean(abs(biomasa - mean(biomasa))),
            medtru5 = mean(biomasa, trim = 5/100),
            medtru10 = mean(biomasa, trim = 10/100),
            varbio = var(biomasa))
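
One pitfall worth keeping in mind with pipelines like this one: if select() drops a column, a later summarise() that names it can silently pick up an object of the same name from the global environment, and every group then shows identical statistics. A sketch of the effect on toy data (`toy`, `valor`, `enmascarado`, and `correcto` are hypothetical names):

```r
library(dplyr)
valor <- c(100, 200)   # global vector sharing its name with a column
toy <- data.frame(g = c("a", "a", "b"), valor = c(1, 2, 3))
# After select() drops the column, mean(valor) uses the global vector
enmascarado <- toy %>%
  select(g) %>%
  group_by(g) %>%
  summarise(m = mean(valor))
# Keeping the column gives the per-group means
correcto <- toy %>%
  group_by(g) %>%
  summarise(m = mean(valor))
enmascarado$m
correcto$m
```

Writing `.data$valor` inside summarise() is one way to guard against this, since the pronoun only looks in the data and raises an error when the column is missing.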

With the tibble with missing values, use the function drop_na() to remove the missing values and compare the statistics obtained in the previous item with and without missing values:

tib.i_sinNA <-drop_na(tib.i)
head(tib.i_sinNA)

##    biomasa flores.r flores.d hojas.d    plaga estatus fertilizacion
## 1 4.658924       13        7     136  Ausente       S            FI
## 2 5.310972       13        5     170 Presente      MA            FI
## 3 4.538076       16        9     111  Ausente       S            FI
## 4 3.285837       14       11      79 Presente       S            FI
## 5 4.226020       17       13     202  Ausente      MA            FO
## 6 4.387129       15        7     284 Presente      MA            FI

tib.i_filSsinNA <- tib.i_sinNA %>%
  filter(estatus == "S") %>%
  select(biomasa, fertilizacion, plaga)   # keep biomasa for the summaries below
head(tib.i_filSsinNA)

tib.i_filSsinNA %>%
  group_by(fertilizacion, plaga) %>%
  summarise(Ndatos = n(),
            meanbio = mean(biomasa),
            medbio = median(biomasa),
            maxbio = max(biomasa),
            minbio = min(biomasa),
            destipica = sd(biomasa),
            desmedia = mean(abs(biomasa - mean(biomasa))),
            medtru5 = mean(biomasa, trim = 5/100),
            medtru10 = mean(biomasa, trim = 10/100),
            varbio = var(biomasa))

Note: drop_na() removes only the two rows in which flores.d or hojas.d is missing. Since biomasa has no missing values, these statistics can differ from the previous ones only in the groups that lose one of those rows.

Filter the data, selecting only the partially affected or very affected plants. Use the operator %in%:

plantas_PA_MA <- filter(tib.i, estatus %in% c("PA", "MA"))
head(plantas_PA_MA)

##    biomasa flores.r flores.d hojas.d    plaga estatus fertilizacion
## 1 5.310972       13        5     170 Presente      MA            FI
## 2 4.226020       17       13     202  Ausente      MA            FO
## 3 4.387129       15        7     284 Presente      MA            FI
## 4 7.176602       16        9      66  Ausente      MA            FI
## 5 6.056599       19       18      79  Ausente      PA            FO
## 6 3.972527       14       10      47  Ausente      MA            FI

Select the tibble with missing values and use complete.cases(.) to leave out the missing values:

SinNA_complet.cases <- tib.i %>%
  filter(complete.cases(.))
head(SinNA_complet.cases)

##    biomasa flores.r flores.d hojas.d    plaga estatus fertilizacion
## 1 4.658924       13        7     136  Ausente       S            FI
## 2 5.310972       13        5     170 Presente      MA            FI
## 3 4.538076       16        9     111  Ausente       S            FI
## 4 3.285837       14       11      79 Presente       S            FI
## 5 4.226020       17       13     202  Ausente      MA            FO
## 6 4.387129       15        7     284 Presente      MA            FI

Remove the columns associated with the flower counts from any tibble:

remove_tib.i<-mutate(tib.i,flores.r=NULL,flores.d=NULL)
head(remove_tib.i)

##    biomasa hojas.d    plaga estatus fertilizacion
## 1 4.658924     136  Ausente       S            FI
## 2 5.310972     170 Presente      MA            FI
## 3 4.538076     111  Ausente       S            FI
## 4 3.285837      79 Presente       S            FI
## 5 4.226020     202  Ausente      MA            FO
## 6 4.387129     284 Presente      MA            FI

Select from any tibble the variables that contain the letter d. Use select(contains()):

d_contain<-tib.i%>%
select(contains("d"))
head(d_contain)

##   flores.d hojas.d
## 1        7     136
## 2        5     170
## 3        9     111
## 4       11      79
## 5       13     202
## 6        7     284

Reorder a tibble using select(, everything()), placing the flower counts first:

R_everyth<-tib.i%>%
select(flores.r,flores.d,everything())
head(R_everyth)

##   flores.r flores.d  biomasa hojas.d    plaga estatus fertilizacion
## 1       13        7 4.658924     136  Ausente       S            FI
## 2       13        5 5.310972     170 Presente      MA            FI
## 3       16        9 4.538076     111  Ausente       S            FI
## 4       14       11 3.285837      79 Presente       S            FI
## 5       17       13 4.226020     202  Ausente      MA            FO
## 6       15        7 4.387129     284 Presente      MA            FI
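
dplyr 1.0.0 and later also provides relocate() for this kind of reordering. A sketch on a toy one-row data frame (`toy` and `reordenado` are hypothetical names):

```r
library(dplyr)
toy <- data.frame(biomasa = 1, flores.r = 2, flores.d = 3, plaga = "Presente")
# relocate() moves the selected columns to the front by default
reordenado <- toy %>% relocate(starts_with("flores"))
names(reordenado)
```

Unlike select(, everything()), relocate() never drops columns, and its `.before` and `.after` arguments allow placing the moved columns somewhere other than the front.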

\[\color{blue}{End~of~the~exercise}\]

Leider Andrés Tombé Morales

1/11/2021