Summary Statistics

In R, you can obtain summary statistics for numerical variables using several functions, depending on the level of detail and customization you need. These functions can help you calculate common descriptive statistics such as the mean, median, standard deviation, minimum, maximum, range, and quantiles.

# Set working directory
setwd("E:/Desktop/ProyectoR")

# Load libraries
library(readxl)
library(moments)
library(psych)
library(skimr)
library(fBasics)
library(Hmisc)

# Load data
datos <- read_excel("Atriplex.xlsx")

Before computing any statistics, inspect the dataset with the functions covered in previous sessions (class(), names(), str(), head(), tail()) to confirm that it was read correctly.

class(datos)

## [1] "tbl_df"     "tbl"        "data.frame"

names(datos)

## [1] "Tamaño"    "Episperma" "PG"        "PN"        "PS"        "Bloque"

str(datos)

## tibble [27 × 6] (S3: tbl_df/tbl/data.frame)
##  $ Tamaño   : chr [1:27] "chicas" "chicas" "chicas" "chicas" ...
##  $ Episperma: chr [1:27] "claro" "claro" "claro" "rojizo" ...
##  $ PG       : num [1:27] 60 73 73 93 66 60 20 26 20 87 ...
##  $ PN       : num [1:27] 47 33 60 7 33 20 0 13 7 54 ...
##  $ PS       : num [1:27] 0.003 0.003 0.0031 0.003 0.0026 0.003 0.003 0.0027 0.003 0.0033 ...
##  $ Bloque   : num [1:27] 1 2 3 1 2 3 1 2 3 1 ...

# Convert Bloque to an ordered factor with levels 1 < 2 < 3
datos$Bloque <- factor(datos$Bloque,
                       levels = c("1", "2", "3"),
                       ordered = TRUE)
head(datos)

## # A tibble: 6 × 6
##   Tamaño Episperma    PG    PN     PS Bloque
##   <chr>  <chr>     <dbl> <dbl>  <dbl> <ord> 
## 1 chicas claro        60    47 0.003  1     
## 2 chicas claro        73    33 0.003  2     
## 3 chicas claro        73    60 0.0031 3     
## 4 chicas rojizo       93     7 0.003  1     
## 5 chicas rojizo       66    33 0.0026 2     
## 6 chicas rojizo       60    20 0.003  3

tail(datos)

## # A tibble: 6 × 6
##   Tamaño  Episperma    PG    PN     PS Bloque
##   <chr>   <chr>     <dbl> <dbl>  <dbl> <ord> 
## 1 grandes rojizo       87    47 0.004  1     
## 2 grandes rojizo       87    27 0.0038 2     
## 3 grandes rojizo       80    67 0.0036 3     
## 4 grandes oscuro       53    47 0.0033 1     
## 5 grandes oscuro       40    27 0.003  2     
## 6 grandes oscuro       53    47 0.0042 3

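# attach() places the columns of datos on the search path, so PG, PN and PS
# can be referenced directly below; datos$PG or with(datos, ...) is safer in longer scripts
attach(datos)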

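summary() gives a quick overview. For numeric variables it returns the minimum, first quartile, median, mean, third quartile and maximum; for factors it returns the count per level; for character variables it reports only length, class and mode.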

summary(datos)

##     Tamaño           Episperma               PG               PN       
##  Length:27          Length:27          Min.   : 13.00   Min.   : 0.00  
##  Class :character   Class :character   1st Qu.: 46.50   1st Qu.:23.50  
##  Mode  :character   Mode  :character   Median : 73.00   Median :47.00  
##                                        Mean   : 65.56   Mean   :42.15  
##                                        3rd Qu.: 87.00   3rd Qu.:57.00  
##                                        Max.   :100.00   Max.   :87.00  
##        PS           Bloque
##  Min.   :0.002600   1:9   
##  1st Qu.:0.003000   2:9   
##  Median :0.003200   3:9   
##  Mean   :0.003407         
##  3rd Qu.:0.003800         
##  Max.   :0.004900
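
In the output above, the character columns Tamaño and Episperma show only length, class and mode. As a minimal sketch, assuming a short character vector `tam` made up for illustration, converting to a factor makes summary() return one count per level:

```r
# Hypothetical toy vector; the category labels are the ones found in Tamaño
tam <- c("chicas", "chicas", "medianas", "grandes", "grandes", "grandes")

summary(tam)                    # character: length, class and mode only
conteo <- summary(factor(tam))  # factor: one count per level
conteo
```

The same idea applies to a data frame column, for example by converting it with factor() before calling summary().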

Individual measures for the PG variable. The argument na.rm = TRUE drops missing values before each calculation.

min(PG,na.rm = TRUE)

## [1] 13

max(PG,na.rm = TRUE)

## [1] 100

range(PG,na.rm = TRUE)

## [1]  13 100

mean(PG,na.rm = TRUE)

## [1] 65.55556

sd(PG,na.rm = TRUE)

## [1] 26.93487

var(PG,na.rm = TRUE)

## [1] 725.4872

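# Coefficient of variation (%), with na.rm = TRUE for consistency with the calls above
cv_PG <- (sd(PG, na.rm = TRUE) / mean(PG, na.rm = TRUE)) * 100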
cv_PG

## [1] 41.08709
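
The coefficient of variation is the standard deviation expressed as a percentage of the mean. A minimal sketch of a reusable helper, where the function name `cv` is hypothetical and the sample values are the first six PG observations shown by head(datos):

```r
# Hypothetical helper: coefficient of variation in percent
cv <- function(x, na.rm = TRUE) {
  100 * sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm)
}

pg6 <- c(60, 73, 73, 93, 66, 60)  # first six PG values from head(datos)
cv(pg6)
cv(c(pg6, NA))  # missing values are dropped by default, so the result is the same
```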

Spearman rank correlation, first between PG and PS and then among the three numeric variables.

cor(PG, PS, method = "spearman")

## [1] 0.5356887

cor(datos[ ,3:5], method = "spearman")

##           PG        PN        PS
## PG 1.0000000 0.7084620 0.5356887
## PN 0.7084620 1.0000000 0.6351845
## PS 0.5356887 0.6351845 1.0000000
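
Spearman's coefficient is the Pearson correlation computed on the ranks of the data, which is why it measures monotonic rather than strictly linear association. A small sketch of that equivalence, using the first six PG and PS values from head(datos):

```r
pg6 <- c(60, 73, 73, 93, 66, 60)
ps6 <- c(0.003, 0.003, 0.0031, 0.003, 0.0026, 0.003)

rho_spearman <- cor(pg6, ps6, method = "spearman")
rho_ranks    <- cor(rank(pg6), rank(ps6))  # Pearson correlation on average ranks

all.equal(rho_spearman, rho_ranks)
```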

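Tukey's five-number summary returns the minimum, lower hinge, median, upper hinge and maximum. The hinges are close to Q1 and Q3 but are computed differently from quantile(), so the values can differ slightly for some sample sizes.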

fivenum(PG)

## [1]  13.0  46.5  73.0  87.0 100.0
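
For PG the hinges from fivenum() coincide with the quartiles from quantile(), but that is not guaranteed. Roughly speaking, the hinges are medians of the lower and upper halves of the sorted data, while quantile() interpolates (type 7 by default). A sketch with a small made-up vector shows the difference:

```r
x <- c(1, 2, 3, 4)  # hypothetical data, not from Atriplex.xlsx

fivenum(x)                          # hinges: 1.5 and 3.5
quantile(x, probs = c(0.25, 0.75))  # default type 7: 1.75 and 3.25
```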

Quantiles and percentiles

cuantiles <- quantile(PG,
                      probs = c(0.25, 0.5, 0.75))
cuantiles

##  25%  50%  75% 
## 46.5 73.0 87.0

percentiles <- quantile(PG,
                        probs = c(0.05, 0.10, 0.95))
percentiles

##   5%  10%  95% 
## 20.0 23.6 93.0

IQR(PG)

## [1] 40.5
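
IQR() is the distance between the third and first quartiles; with the quartiles reported above, 87.0 - 46.5 = 40.5. A short check of that identity on the first six PG values from head(datos):

```r
pg6 <- c(60, 73, 73, 93, 66, 60)

q <- quantile(pg6, probs = c(0.25, 0.75))
unname(q[2] - q[1])  # third quartile minus first quartile
IQR(pg6)             # same value
```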

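Skewness and kurtosis. Both moments and fBasics (through timeDate) define functions named skewness() and kurtosis(). Because fBasics was loaded after moments, the unqualified calls below appear to use the timeDate versions: the "method" attribute in the output ("moment", "excess") is what those versions print, and the kurtosis is excess kurtosis (0 for a normal distribution). To use the moments versions, call moments::skewness() and moments::kurtosis() explicitly; note that moments::kurtosis() reports Pearson kurtosis (3 for a normal distribution).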

skewness(PG)

## [1] -0.5750775
## attr(,"method")
## [1] "moment"

kurtosis(PG)

## [1] -1.150249
## attr(,"method")
## [1] "excess"

skewness(datos[ ,3:5])

##         PG         PN         PS 
## -0.5750775  0.1476275  0.8761652 
## attr(,"method")
## [1] "moment"

kurtosis(datos[ ,3:5])

##         PG         PN         PS 
## -1.1502494 -1.1972202 -0.1624687 
## attr(,"method")
## [1] "excess"
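
To make the definitions concrete, here is a base R sketch of moment-type skewness and excess kurtosis. It assumes the sample variance from var() in the denominator, which is one common convention; packages and method options differ in the exact denominators, so small numerical differences are expected. The function names are hypothetical.

```r
# Hypothetical helpers, not the package implementations
skew_moment <- function(x) {
  x <- x[!is.na(x)]
  mean((x - mean(x))^3) / sd(x)^3
}

kurt_excess <- function(x) {
  x <- x[!is.na(x)]
  mean((x - mean(x))^4) / var(x)^2 - 3  # subtract 3 so the normal distribution is the reference at 0
}

x <- c(1, 2, 3, 4, 5)  # symmetric toy data
skew_moment(x)  # 0 for symmetric data
kurt_excess(x)  # negative: lighter tails than a normal distribution
```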

psych::describe() returns many statistics in a single call: n, mean, standard deviation, median, trimmed mean, MAD, minimum, maximum, range, skewness, kurtosis and standard error.

The notation psych::describe(PG) calls describe() from the psych package explicitly. This matters here because psych and Hmisc both define a function named describe(); without the prefix, R uses the version from whichever package was loaded last, which masks the other. The prefix avoids that ambiguity and keeps the results consistent.

psych::describe(PG)

##    vars  n  mean    sd median trimmed   mad min max range  skew kurtosis   se
## X1    1 27 65.56 26.93     73   67.13 29.65  13 100    87 -0.58    -1.15 5.18
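
The masking problem can be reproduced with base R alone. In this sketch a user-defined sd() (hypothetical, for illustration only) hides stats::sd(), and the :: prefix still reaches the original:

```r
# Hypothetical local definition that masks stats::sd()
sd <- function(x) "masked version"

r_local <- sd(c(1, 2, 3))         # calls the local definition
r_stats <- stats::sd(c(1, 2, 3))  # explicitly calls the stats version

r_local
r_stats

rm(sd)  # remove the local definition to restore normal behaviour
```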

psych::describe(datos[ ,3:5])

##    vars  n  mean    sd median trimmed   mad min max range  skew kurtosis   se
## PG    1 27 65.56 26.93     73   67.13 29.65  13 100    87 -0.58    -1.15 5.18
## PN    2 27 42.15 25.90     47   41.91 29.65   0  87    87  0.15    -1.20 4.99
## PS    3 27  0.00  0.00      0    0.00  0.00   0   0     0  0.88    -0.16 0.00
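
The PS row appears as zeros because the table is printed with two decimals and the PS values are on the order of 0.003; the statistics themselves are not zero. A base R sketch of the effect, using the PS mean reported by summary(datos):

```r
ps_mean <- 0.003407  # mean of PS as reported by summary(datos)

round(ps_mean, 2)   # two decimals: prints 0
signif(ps_mean, 3)  # three significant digits: 0.00341
```

Printing the psych table with more digits, for example print(psych::describe(datos[ ,3:5]), digits = 4), should reveal the values, assuming the print method for psych objects accepts a digits argument as its documentation describes.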

skim(), from the skimr package, produces a formatted table with missing-value counts, completion rate, mean, standard deviation, percentiles and an inline histogram.

skim(PG)

Data summary
Name                     PG
Number of rows           27
Number of columns        1
_______________________
Column type frequency:
  numeric                1
________________________
Group variables          None

Variable type: numeric

skim_variable  n_missing  complete_rate   mean     sd  p0   p25  p50  p75  p100  hist
data                   0              1  65.56  26.93  13  46.5   73   87   100  ▃▂▃▅▇

skim(datos)

Data summary
Name                     datos
Number of rows           27
Number of columns        6
_______________________
Column type frequency:
  character              2
  factor                 1
  numeric                3
________________________
Group variables          None

Variable type: character

skim_variable  n_missing  complete_rate  min  max  empty  n_unique  whitespace
Tamaño                 0              1    6    8      0         3           0
Episperma              0              1    5    6      0         3           0

Variable type: factor

skim_variable  n_missing  complete_rate  ordered  n_unique  top_counts
Bloque                 0              1  TRUE            3  1: 9, 2: 9, 3: 9

Variable type: numeric

skim_variable  complete_rate   mean     sd  p0   p25  p50  p75  p100  hist
PG                         1  65.56  26.93  13  46.5   73   87   100  ▃▂▃▅▇
PN                         1  42.15  25.90   0  23.5   47   57    87  ▅▇▃▅▅
PS                         1   0.00   0.00   0   0.0    0    0     0  ▇▆▂▃▁

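basicStats(), from the fBasics package, returns a column of descriptive statistics per variable, including the standard error of the mean and the lower and upper confidence limits of the mean (LCL Mean, UCL Mean).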

basicStats(PG)

##                      PG
## nobs          27.000000
## NAs            0.000000
## Minimum       13.000000
## Maximum      100.000000
## 1. Quartile   46.500000
## 3. Quartile   87.000000
## Mean          65.555556
## Median        73.000000
## Sum         1770.000000
## SE Mean        5.183618
## LCL Mean      54.900476
## UCL Mean      76.210635
## Variance     725.487179
## Stdev         26.934869
## Skewness      -0.575078
## Kurtosis      -1.150249

basicStats(datos[ ,3:5])

##                      PG          PN        PS
## nobs          27.000000   27.000000 27.000000
## NAs            0.000000    0.000000  0.000000
## Minimum       13.000000    0.000000  0.002600
## Maximum      100.000000   87.000000  0.004900
## 1. Quartile   46.500000   23.500000  0.003000
## 3. Quartile   87.000000   57.000000  0.003800
## Mean          65.555556   42.148148  0.003407
## Median        73.000000   47.000000  0.003200
## Sum         1770.000000 1138.000000  0.092000
## SE Mean        5.183618    4.985364  0.000109
## LCL Mean      54.900476   31.900585  0.003184
## UCL Mean      76.210635   52.395711  0.003631
## Variance     725.487179  671.054131  0.000000
## Stdev         26.934869   25.904713  0.000566
## Skewness      -0.575078    0.147628  0.876165
## Kurtosis      -1.150249   -1.197220 -0.162469
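
The SE Mean, LCL Mean and UCL Mean rows are consistent with the usual standard error and a 95% t-based confidence interval for the mean. A sketch that recomputes them from the n, mean and standard deviation of PG printed above (the 95% level is inferred from the output here, not taken from the fBasics documentation):

```r
n     <- 27
media <- 65.555556  # Mean of PG from basicStats()
desv  <- 26.934869  # Stdev of PG from basicStats()

se  <- desv / sqrt(n)         # standard error of the mean
t95 <- qt(0.975, df = n - 1)  # two-sided 95% quantile of Student's t
ic  <- c(LCL = media - t95 * se, UCL = media + t95 * se)

round(se, 6)
round(ic, 4)
```

The results can be compared with the SE Mean, LCL Mean and UCL Mean rows in the PG column.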

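Hmisc::describe() reports the number of observations, missing and distinct values, the mean, selected quantiles and a frequency table of the distinct values. The Hmisc:: prefix distinguishes it from psych::describe().

Hmisc::describe(PG)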

## PG 
##        n  missing distinct     Info     Mean  pMedian      Gmd      .05 
##       27        0       13    0.988    65.56     66.5    30.53     20.0 
##      .10      .25      .50      .75      .90      .95 
##     23.6     46.5     73.0     87.0     93.0     93.0 
##                                                                             
## Value         13    20    26    33    40    53    60    66    73    80    87
## Frequency      1     2     1     2     1     2     2     1     2     3     5
## Proportion 0.037 0.074 0.037 0.074 0.037 0.074 0.074 0.037 0.074 0.111 0.185
##                       
## Value         93   100
## Frequency      4     1
## Proportion 0.148 0.037
## 
## For the frequency table, variable is rounded to the nearest 0

Hmisc::describe(datos)

## datos 
## 
##  6  Variables      27  Observations
## --------------------------------------------------------------------------------
## Tamaño 
##        n  missing distinct 
##       27        0        3 
##                                      
## Value        chicas  grandes medianas
## Frequency         9        9        9
## Proportion    0.333    0.333    0.333
## --------------------------------------------------------------------------------
## Episperma 
##        n  missing distinct 
##       27        0        3 
##                                
## Value       claro oscuro rojizo
## Frequency       9      9      9
## Proportion  0.333  0.333  0.333
## --------------------------------------------------------------------------------
## PG 
##        n  missing distinct     Info     Mean  pMedian      Gmd      .05 
##       27        0       13    0.988    65.56     66.5    30.53     20.0 
##      .10      .25      .50      .75      .90      .95 
##     23.6     46.5     73.0     87.0     93.0     93.0 
##                                                                             
## Value         13    20    26    33    40    53    60    66    73    80    87
## Frequency      1     2     1     2     1     2     2     1     2     3     5
## Proportion 0.037 0.074 0.037 0.074 0.037 0.074 0.074 0.037 0.074 0.111 0.185
##                       
## Value         93   100
## Frequency      4     1
## Proportion 0.148 0.037
## 
## For the frequency table, variable is rounded to the nearest 0
## --------------------------------------------------------------------------------
## PN 
##        n  missing distinct     Info     Mean  pMedian      Gmd      .05 
##       27        0       12    0.989    42.15       42    30.13      7.0 
##      .10      .25      .50      .75      .90      .95 
##      7.0     23.5     47.0     57.0     80.0     80.0 
##                                                                             
## Value          0     7    13    20    27    33    47    54    60    67    80
## Frequency      1     3     1     2     3     3     4     3     1     1     4
## Proportion 0.037 0.111 0.037 0.074 0.111 0.111 0.148 0.111 0.037 0.037 0.148
##                 
## Value         87
## Frequency      1
## Proportion 0.037
## 
## For the frequency table, variable is rounded to the nearest 0
## --------------------------------------------------------------------------------
## PS 
##         n   missing  distinct      Info      Mean   pMedian       Gmd       .05 
##        27         0        13     0.972  0.003407   0.00335 0.0006177   0.00279 
##       .10       .25       .50       .75       .90       .95 
##   0.00300   0.00300   0.00320   0.00380   0.00424   0.00430 
##                                                                          
## Value      0.0026 0.0027 0.0030 0.0031 0.0032 0.0033 0.0034 0.0036 0.0038
## Frequency       1      1      8      2      3      2      1      1      2
## Proportion  0.037  0.037  0.296  0.074  0.111  0.074  0.037  0.037  0.074
##                                       
## Value      0.0040 0.0042 0.0043 0.0049
## Frequency       2      1      2      1
## Proportion  0.074  0.037  0.074  0.037
## 
## For the frequency table, variable is rounded to the nearest 0
## --------------------------------------------------------------------------------
## Bloque 
##        n  missing distinct 
##       27        0        3 
##                             
## Value          1     2     3
## Frequency      9     9     9
## Proportion 0.333 0.333 0.333
## --------------------------------------------------------------------------------

Statistics by category (grouped summaries). tapply() applies a function, here mean and var, to PG within each level of Tamaño.

tapply(datos$PG,datos$Tamaño,mean)

##   chicas  grandes medianas 
## 54.55556 73.33333 68.77778

tapply(datos$PG,datos$Tamaño,var)

##    chicas   grandes  medianas 
##  694.0278  371.7500 1076.1944
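
tapply() can also return several statistics per group when the function itself returns a vector. A base R sketch on a toy data frame built from the PG values shown by head(datos) and tail(datos); the object name `mini` and the column names `tamano` and `pg` are ASCII stand-ins chosen for this example:

```r
# Toy data: PG values from head(datos) (chicas) and tail(datos) (grandes)
mini <- data.frame(
  tamano = rep(c("chicas", "grandes"), each = 6),
  pg     = c(60, 73, 73, 93, 66, 60, 87, 87, 80, 53, 40, 53)
)

medias <- tapply(mini$pg, mini$tamano, mean)
medias

# Several statistics at once: each group returns a named vector
resumen <- tapply(mini$pg, mini$tamano,
                  function(x) c(n = length(x), mean = mean(x), sd = sd(x)))
tabla <- do.call(rbind, resumen)  # one row per group
tabla
```

Because these are only six rows per group, the means differ from the full-data results above.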

In the following sections, I will incorporate grouped summary statistics using the aggregate() function and the dplyr package. These methods will allow for a more detailed comparison of numeric variables across the levels of categorical factors.

Julie Ramirez

2025-04-17