Descriptive Statistics

Now we are ready to describe some properties of our data using descriptive statistics, and presenting them in tables.

Calculating statistics by group

We are interested in calculating some statistics on our Melocactus data set: number of values, maximum and minimum values, mean, and standard deviation for measurements on the size of the individuals, and frequencies of the status, but we want the calculations for each year.

First, descriptive statistics for values of plant height, but we want also to have the values of the vegetative part of the plant, substracting the longitude of the inflorescence from the total height of the plant:

library(tidyverse)
# descriptive statistics plant_height
descrip.stats.plht <- melo.select %>% 
  group_by(year) %>%
  summarise(n = length(plant_height),
                         Max = max(plant_height),
                         min = min(plant_height),
                         median = median(plant_height),
                         mean = mean(plant_height),
                         st.err=sd(plant_height)/sqrt(length(plant_height)))
descrip.stats.plht
## # A tibble: 2 × 7
##    year     n   Max   min median  mean st.err
##   <int> <int> <int> <int>  <int> <dbl>  <dbl>
## 1  2019   125    66     1      9 14.2   1.24 
## 2  2020   153    61     1      4  9.41  0.962
# new column with the height of the vegetative part - veg_height
melo.select.veg <- melo.select %>% 
  mutate(veg_height = plant_height - long.inflo)
# descriptive statistics veg_height
descrip.stats.vght <- melo.select.veg %>% 
  group_by(year) %>%
  summarise(n = length(veg_height),
                         Max = max(veg_height, na.rm = T),
                         min = min(veg_height, na.rm = T),
                         median = median(veg_height, na.rm = T),
                         mean = mean(veg_height, na.rm = T),
                         st.err=sd(veg_height, na.rm = T)/sqrt(n))
descrip.stats.vght
## # A tibble: 2 × 7
##    year     n   Max   min median  mean st.err
##   <int> <int> <int> <int>  <dbl> <dbl>  <dbl>
## 1  2019   125    43     0      7 10.8   0.856
## 2  2020   153    45     1      4  8.09  0.730

And now, descriptive statistics for the longitude of the inflorescence, eliminating zeroes (when the inflorescence is not present):

# filtering non-zero values for long.inflo
melo.nonzero <- melo.select %>% 
  filter(long.inflo > 0)
# descriptive statistics long.inflo - without zero values
descrip.stats.infl <- melo.nonzero %>% 
  group_by(year) %>%
  summarise(n = length(long.inflo),
                         Max = max(long.inflo, na.rm = T),
                         min = min(long.inflo, na.rm = T),
                         median = median(long.inflo, na.rm = T),
                         mean = mean(long.inflo, na.rm = T),
                         st.err=sd(long.inflo, na.rm = T)/sqrt(n))
descrip.stats.infl
## # A tibble: 2 × 7
##    year     n   Max   min median  mean st.err
##   <int> <int> <int> <int>  <dbl> <dbl>  <dbl>
## 1  2019    32    32     1     11  13.4   1.49
## 2  2020    17    26     1     11  10.6   1.62

Creating tables for publications

Now we are going to use the above results to build more aesthetic tables for reports and publications. We will use the package gt. First a simple table, just using the previous results and the command gt:

library(gt)
gt_basic <- descrip.stats.vght %>% 
  gt() 
gtsave(gt_basic, "gt_basic.png")

We can make the table a bit more elaborated:

table1 <- descrip.stats.vght %>% 
  gt() %>%
# title for the table
  tab_header(title = md("Table 1.  Descriptive statistics for vegetative part height (cm) of individuals of _M. intortus_, in a population of the Guánica Dry Forest, Puerto Rico")) %>%
# subtitle for the columns
  tab_spanner(
    label = "Descriptive Statistics",
    columns = c(2:7)) %>% 
# renaming the columns
  cols_label(
    year = md("__Year__"),
    n = md("__n__"),
    Max = md("__Maximum__"),
    min = md("__Minimum__"),
    median = md("__Median__"),
    mean = md("__Mean__"),
    st.err = md("__Standard Error__")) %>% 
# formatting the columns
  cols_align(
    align = "center") %>% 
# formatting the data values
  fmt_number(
    columns = c(6:7),
    decimals = 2
  )
gtsave(table1,"table_1.png")

Creating a frequency table for categorical variables

The variable status is a categorical variable and we want to create a table with the frequencies for each category. The package janitor is one alternative to do this.

library(janitor)
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
# removing missing values
melo.select.narm <- filter(melo.select, status == "E" | status == "S" | status == "X")
# frequency and % table 
tabl_freq <- 
  tabyl(melo.select.narm, year, status) %>% 
  adorn_percentages("row") %>% 
  adorn_pct_formatting(affix_sign = FALSE) %>% 
  adorn_ns()
tabl_freq
##  year         E         S         X
##  2019 25.6 (32) 52.0 (65) 22.4 (28)
##  2020 13.8 (21) 48.0 (73) 38.2 (58)
# a more polished table
table2 <- tabl_freq %>% 
  gt() %>% 
  # title for the table
  tab_header(title = md("Table 2.  Percentage (count) of individuals in each category of status and sampling year")) %>%
# subtitle for the columns
  tab_spanner(
    label = "Status",
    columns = c(2:4)) %>% 
# renaming the columns
  cols_label(
    year = md("__Year__"),
    E = md("__Sick__"),
    S = md("__Healthy__"),
    X = md("__Dead__")) %>% 
# formatting the columns
  cols_align(
    align = "center")
gtsave(table2, "table_2.png")