In R, you can obtain summary statistics for numerical variables using several functions, depending on the level of detail and customization you need. These functions can help you calculate common descriptive statistics such as the mean, median, standard deviation, minimum, maximum, range, and quantiles.
# Set working directory
setwd("E:/Desktop/ProyectoR")
# Load libraries
library(readxl)
library(moments)
library(psych)
library(skimr)
library(fBasics)
library(Hmisc)
# Load data
datos <- read_excel("Atriplex.xlsx")
Before computing any statistics, inspect the dataset with the functions covered in previous sessions to confirm its class, variable names, and structure.
class(datos)
## [1] "tbl_df" "tbl" "data.frame"
names(datos)
## [1] "Tamaño" "Episperma" "PG" "PN" "PS" "Bloque"
str(datos)
## tibble [27 × 6] (S3: tbl_df/tbl/data.frame)
## $ Tamaño : chr [1:27] "chicas" "chicas" "chicas" "chicas" ...
## $ Episperma: chr [1:27] "claro" "claro" "claro" "rojizo" ...
## $ PG : num [1:27] 60 73 73 93 66 60 20 26 20 87 ...
## $ PN : num [1:27] 47 33 60 7 33 20 0 13 7 54 ...
## $ PS : num [1:27] 0.003 0.003 0.0031 0.003 0.0026 0.003 0.003 0.0027 0.003 0.0033 ...
## $ Bloque : num [1:27] 1 2 3 1 2 3 1 2 3 1 ...
# Convert Bloque from numeric to an ordered factor
datos$Bloque <- factor(datos$Bloque,
                       levels = c("1", "2", "3"),
                       ordered = TRUE)
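As a small illustration of what this conversion does, the sketch below builds an ordered factor from a hypothetical numeric vector, bloque_demo, which stands in for the Bloque column and is not part of the dataset. It uses base R only.

```r
# Hypothetical block codes standing in for datos$Bloque
bloque_demo <- c(1, 2, 3, 1, 2, 3)

# Numeric codes are matched against the character levels via as.character()
bloque_ord <- factor(bloque_demo, levels = c("1", "2", "3"), ordered = TRUE)

levels(bloque_ord)       # the three block labels, in order
is.ordered(bloque_ord)   # TRUE for an ordered factor
table(bloque_ord)        # counts per block
bloque_ord < "3"         # comparisons work because the levels are ordered
```

Because the factor is ordered, head() and str() report the column type as ord rather than fct.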
head(datos)
## # A tibble: 6 × 6
## Tamaño Episperma PG PN PS Bloque
## <chr> <chr> <dbl> <dbl> <dbl> <ord>
## 1 chicas claro 60 47 0.003 1
## 2 chicas claro 73 33 0.003 2
## 3 chicas claro 73 60 0.0031 3
## 4 chicas rojizo 93 7 0.003 1
## 5 chicas rojizo 66 33 0.0026 2
## 6 chicas rojizo 60 20 0.003 3
tail(datos)
## # A tibble: 6 × 6
## Tamaño Episperma PG PN PS Bloque
## <chr> <chr> <dbl> <dbl> <dbl> <ord>
## 1 grandes rojizo 87 47 0.004 1
## 2 grandes rojizo 87 27 0.0038 2
## 3 grandes rojizo 80 67 0.0036 3
## 4 grandes oscuro 53 47 0.0033 1
## 5 grandes oscuro 40 27 0.003 2
## 6 grandes oscuro 53 47 0.0042 3
attach(datos)
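attach() makes the columns available by name, but attached columns can be hidden by workspace objects with the same name. A minimal sketch of two alternatives that leave the search path untouched, using a hypothetical data frame demo built from the first six PG and PN values printed by head(datos):

```r
# Hypothetical data frame with the first six PG and PN values from head(datos)
demo <- data.frame(PG = c(60, 73, 73, 93, 66, 60),
                   PN = c(47, 33, 60, 7, 33, 20))

mean_dollar <- mean(demo$PG, na.rm = TRUE)          # explicit column access
mean_with   <- with(demo, mean(PG, na.rm = TRUE))   # evaluate inside the data frame

identical(mean_dollar, mean_with)   # both approaches give the same value
```

The rest of this section keeps using the attached column names, so either form can be substituted wherever PG, PN, or PS appears.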
Quick overview with summary(). For numeric variables it gives the minimum, first quartile, median, mean, third quartile, and maximum. For factors it gives the count per level, and for character columns only the length, class, and mode.
summary(datos)
## Tamaño Episperma PG PN
## Length:27 Length:27 Min. : 13.00 Min. : 0.00
## Class :character Class :character 1st Qu.: 46.50 1st Qu.:23.50
## Mode :character Mode :character Median : 73.00 Median :47.00
## Mean : 65.56 Mean :42.15
## 3rd Qu.: 87.00 3rd Qu.:57.00
## Max. :100.00 Max. :87.00
## PS Bloque
## Min. :0.002600 1:9
## 1st Qu.:0.003000 2:9
## Median :0.003200 3:9
## Mean :0.003407
## 3rd Qu.:0.003800
## Max. :0.004900
Individual measures for the PG variable. Setting na.rm = TRUE drops missing values before each calculation.
min(PG,na.rm = TRUE)
## [1] 13
max(PG,na.rm = TRUE)
## [1] 100
range(PG,na.rm = TRUE)
## [1] 13 100
mean(PG,na.rm = TRUE)
## [1] 65.55556
sd(PG,na.rm = TRUE)
## [1] 26.93487
var(PG,na.rm = TRUE)
## [1] 725.4872
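# Coefficient of variation (%), with the same NA handling as above
cv_PG <- (sd(PG, na.rm = TRUE) / mean(PG, na.rm = TRUE)) * 100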
cv_PG
## [1] 41.08709
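The coefficient of variation is the standard deviation expressed as a percentage of the mean. Since base R has no built-in function for it, one option is a small helper. The sketch below defines a hypothetical function cv() and applies it to the first ten PG values printed by str(datos), plus one artificial missing value added only to show the effect of na.rm.

```r
# Hypothetical helper: coefficient of variation as a percentage
cv <- function(x, na.rm = TRUE) {
  100 * sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm)
}

# First ten PG values as printed by str(datos), plus one artificial NA
pg_demo <- c(60, 73, 73, 93, 66, 60, 20, 26, 20, 87, NA)

cv(pg_demo)                  # the NA is dropped before computing
cv(pg_demo, na.rm = FALSE)   # returns NA because of the missing value
```

The helper can then be reused for other columns, for example cv(PN) or cv(PS).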
# Spearman rank correlation: first one pair, then all numeric columns (3 to 5)
cor(PG, PS, method = "spearman")
## [1] 0.5356887
cor(datos[ ,3:5], method = "spearman")
## PG PN PS
## PG 1.0000000 0.7084620 0.5356887
## PN 0.7084620 1.0000000 0.6351845
## PS 0.5356887 0.6351845 1.0000000
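Spearman's coefficient is Pearson's correlation computed on the ranks of the data, which is why it captures monotonic rather than strictly linear association. A minimal sketch of that equivalence, using only the first ten PG and PS values printed by str(datos), so the result differs from the full-data value above:

```r
# First ten PG and PS values as printed by str(datos)
pg_demo <- c(60, 73, 73, 93, 66, 60, 20, 26, 20, 87)
ps_demo <- c(0.003, 0.003, 0.0031, 0.003, 0.0026,
             0.003, 0.003, 0.0027, 0.003, 0.0033)

rho_spearman <- cor(pg_demo, ps_demo, method = "spearman")

# Same result from Pearson's correlation applied to the ranks
# (rank() assigns average ranks to ties by default)
rho_on_ranks <- cor(rank(pg_demo), rank(ps_demo), method = "pearson")

all.equal(rho_spearman, rho_on_ranks)
```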
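Tukey's five-number summary with fivenum(). It returns the minimum, lower hinge, median, upper hinge, and maximum. The hinges approximate the first and third quartiles; for PG they coincide with the quartiles reported by summary().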
fivenum(PG)
## [1] 13.0 46.5 73.0 87.0 100.0
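Hinges and quartiles are computed by different rules, so they do not always agree. The sketch below uses the integers 1 to 10 (an artificial vector, not the dataset) to show a case where fivenum() and the default quantile() differ.

```r
x <- 1:10

fivenum(x)
# 1.0  3.0  5.5  8.0 10.0

quantile(x)   # default algorithm (type = 7)
#    0%   25%   50%   75%  100%
#  1.00  3.25  5.50  7.75 10.00
```

The minimum, median, and maximum are the same in both outputs; only the hinges (3 and 8) differ from the quartiles (3.25 and 7.75).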
Percentiles and quantiles with quantile(), and the interquartile range with IQR().
cuantiles=quantile(PG,
probs = c(0.25, 0.5, 0.75))
cuantiles
## 25% 50% 75%
## 46.5 73.0 87.0
percentiles=quantile(PG,
probs = c(0.05, 0.10, 0.95))
percentiles
## 5% 10% 95%
## 20.0 23.6 93.0
IQR(PG)
## [1] 40.5
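The interquartile range is the difference between the 75% and 25% quantiles, and quantile() offers several estimation algorithms through its type argument (type = 7 is the default). A short sketch on the first ten PG values printed by str(datos):

```r
# First ten PG values as printed by str(datos)
pg_demo <- c(60, 73, 73, 93, 66, 60, 20, 26, 20, 87)

q <- quantile(pg_demo, probs = c(0.25, 0.75))   # default type = 7

iqr_manual <- unname(q["75%"] - q["25%"])       # Q3 - Q1 by hand
iqr_base   <- IQR(pg_demo)                      # built-in equivalent

all.equal(iqr_manual, iqr_base)

# A different algorithm can give slightly different quartiles in small samples
quantile(pg_demo, probs = c(0.25, 0.75), type = 6)
```

When comparing results with other software, it helps to check which quantile definition that software uses.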
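Skewness and kurtosis. Both moments and fBasics make functions named skewness() and kurtosis() available. The method attributes in the output below ("moment" and "excess") indicate that the versions loaded with fBasics are the ones being called, because fBasics was loaded after moments and masks it. Kurtosis is therefore reported as excess kurtosis, for which a normal distribution gives 0. To call the moments versions explicitly, use moments::skewness() and moments::kurtosis(); the latter returns non-excess kurtosis, for which a normal distribution gives 3.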
skewness(PG)
## [1] -0.5750775
## attr(,"method")
## [1] "moment"
kurtosis(PG)
## [1] -1.150249
## attr(,"method")
## [1] "excess"
skewness(datos[ ,3:5])
## PG PN PS
## -0.5750775 0.1476275 0.8761652
## attr(,"method")
## [1] "moment"
kurtosis(datos[ ,3:5])
## PG PN PS
## -1.1502494 -1.1972202 -0.1624687
## attr(,"method")
## [1] "excess"
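To make the definitions concrete, the sketch below computes moment-based skewness and excess kurtosis by hand in base R, using the first ten PG values printed by str(datos). It assumes the population (divide by n) form of the central moments. Packages differ in details such as using n or n - 1 in the variance, so these values need not match a given package to the last digit.

```r
# First ten PG values as printed by str(datos)
pg_demo <- c(60, 73, 73, 93, 66, 60, 20, 26, 20, 87)

n <- length(pg_demo)
d <- pg_demo - mean(pg_demo)   # deviations from the mean

m2 <- sum(d^2) / n             # second central moment
m3 <- sum(d^3) / n             # third central moment
m4 <- sum(d^4) / n             # fourth central moment

skew_manual        <- m3 / m2^(3 / 2)   # negative means a longer left tail
kurt_excess_manual <- m4 / m2^2 - 3     # 0 corresponds to a normal distribution

c(skewness = skew_manual, excess_kurtosis = kurt_excess_manual)
```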
psych::describe() is useful when you want many statistics in one call: mean, sd, median, trimmed mean, mad, min, max, range, skewness, kurtosis, and standard error. The notation psych::describe(PG) explicitly calls the describe() function from the psych package. This matters here because Hmisc, which was loaded after psych, also defines describe(), so an unqualified call would run the Hmisc version. The package::function notation avoids this masking and ensures consistent results.
psych::describe(PG)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 27 65.56 26.93 73 67.13 29.65 13 100 87 -0.58 -1.15 5.18
psych::describe(datos[ ,3:5])
## vars n mean sd median trimmed mad min max range skew kurtosis se
## PG 1 27 65.56 26.93 73 67.13 29.65 13 100 87 -0.58 -1.15 5.18
## PN 2 27 42.15 25.90 47 41.91 29.65 0 87 87 0.15 -1.20 4.99
## PS 3 27 0.00 0.00 0 0.00 0.00 0 0 0 0.88 -0.16 0.00
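The PS row shows zeros only because the table is rounded to two decimals, not because the values are zero. A minimal sketch of the effect, using the PS mean reported by summary(datos) above; the rescaling factor of 1000 is just an illustrative choice:

```r
ps_mean <- 0.003407    # mean of PS as reported by summary(datos)

round(ps_mean, 2)      # 0: two decimals hide the value
signif(ps_mean, 3)     # 0.00341: three significant digits keep it
ps_mean * 1000         # 3.407: rescaling before summarising also works
```

Applying the same idea to the data, a call such as psych::describe(PS * 1000) would report the statistics on the rescaled values.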
skim() from the skimr package gives a nicely formatted table with the number of missing values, completeness rate, mean, sd, quantiles (p0 to p100), and a small histogram for each variable.
skim(PG)
| Name | PG |
| Number of rows | 27 |
| Number of columns | 1 |
| _______________________ | |
| Column type frequency: | |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| data | 0 | 1 | 65.56 | 26.93 | 13 | 46.5 | 73 | 87 | 100 | ▃▂▃▅▇ |
skim(datos)
| Name | datos |
| Number of rows | 27 |
| Number of columns | 6 |
| _______________________ | |
| Column type frequency: | |
| character | 2 |
| factor | 1 |
| numeric | 3 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Tamaño | 0 | 1 | 6 | 8 | 0 | 3 | 0 |
| Episperma | 0 | 1 | 5 | 6 | 0 | 3 | 0 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| Bloque | 0 | 1 | TRUE | 3 | 1: 9, 2: 9, 3: 9 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| PG | 0 | 1 | 65.56 | 26.93 | 13 | 46.5 | 73 | 87 | 100 | ▃▂▃▅▇ |
| PN | 0 | 1 | 42.15 | 25.90 | 0 | 23.5 | 47 | 57 | 87 | ▅▇▃▅▅ |
| PS | 0 | 1 | 0.00 | 0.00 | 0 | 0.0 | 0 | 0 | 0 | ▇▆▂▃▁ |
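basicStats() from the fBasics package adds the sum, the standard error of the mean (SE Mean), and the lower and upper confidence limits of the mean (LCL Mean, UCL Mean) to the usual measures.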
basicStats(PG)
## PG
## nobs 27.000000
## NAs 0.000000
## Minimum 13.000000
## Maximum 100.000000
## 1. Quartile 46.500000
## 3. Quartile 87.000000
## Mean 65.555556
## Median 73.000000
## Sum 1770.000000
## SE Mean 5.183618
## LCL Mean 54.900476
## UCL Mean 76.210635
## Variance 725.487179
## Stdev 26.934869
## Skewness -0.575078
## Kurtosis -1.150249
basicStats(datos[ ,3:5])
## PG PN PS
## nobs 27.000000 27.000000 27.000000
## NAs 0.000000 0.000000 0.000000
## Minimum 13.000000 0.000000 0.002600
## Maximum 100.000000 87.000000 0.004900
## 1. Quartile 46.500000 23.500000 0.003000
## 3. Quartile 87.000000 57.000000 0.003800
## Mean 65.555556 42.148148 0.003407
## Median 73.000000 47.000000 0.003200
## Sum 1770.000000 1138.000000 0.092000
## SE Mean 5.183618 4.985364 0.000109
## LCL Mean 54.900476 31.900585 0.003184
## UCL Mean 76.210635 52.395711 0.003631
## Variance 725.487179 671.054131 0.000000
## Stdev 26.934869 25.904713 0.000566
## Skewness -0.575078 0.147628 0.876165
## Kurtosis -1.150249 -1.197220 -0.162469
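The SE Mean, LCL Mean, and UCL Mean rows can be checked by hand. The sketch below assumes a t-based 95% confidence interval, which is consistent with the numbers printed for PG, and uses only the n, mean, and standard deviation shown in the output above.

```r
# Values reported for PG in the basicStats() output
n      <- 27
mean_x <- 65.555556
sd_x   <- 26.934869

se_mean <- sd_x / sqrt(n)           # standard error of the mean
t_crit  <- qt(0.975, df = n - 1)    # two-sided 95% critical value
lcl     <- mean_x - t_crit * se_mean
ucl     <- mean_x + t_crit * se_mean

round(c(SE = se_mean, LCL = lcl, UCL = ucl), 4)
```

The three results agree with the SE Mean, LCL Mean, and UCL Mean rows for PG up to rounding.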
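Hmisc::describe() reports the number of observations, missing and distinct values, the mean, selected quantiles, and a frequency table of the values.
Hmisc::describe(PG)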
## PG
## n missing distinct Info Mean pMedian Gmd .05
## 27 0 13 0.988 65.56 66.5 30.53 20.0
## .10 .25 .50 .75 .90 .95
## 23.6 46.5 73.0 87.0 93.0 93.0
##
## Value 13 20 26 33 40 53 60 66 73 80 87
## Frequency 1 2 1 2 1 2 2 1 2 3 5
## Proportion 0.037 0.074 0.037 0.074 0.037 0.074 0.074 0.037 0.074 0.111 0.185
##
## Value 93 100
## Frequency 4 1
## Proportion 0.148 0.037
##
## For the frequency table, variable is rounded to the nearest 0
Hmisc::describe(datos)
## datos
##
## 6 Variables 27 Observations
## --------------------------------------------------------------------------------
## Tamaño
## n missing distinct
## 27 0 3
##
## Value chicas grandes medianas
## Frequency 9 9 9
## Proportion 0.333 0.333 0.333
## --------------------------------------------------------------------------------
## Episperma
## n missing distinct
## 27 0 3
##
## Value claro oscuro rojizo
## Frequency 9 9 9
## Proportion 0.333 0.333 0.333
## --------------------------------------------------------------------------------
## PG
## n missing distinct Info Mean pMedian Gmd .05
## 27 0 13 0.988 65.56 66.5 30.53 20.0
## .10 .25 .50 .75 .90 .95
## 23.6 46.5 73.0 87.0 93.0 93.0
##
## Value 13 20 26 33 40 53 60 66 73 80 87
## Frequency 1 2 1 2 1 2 2 1 2 3 5
## Proportion 0.037 0.074 0.037 0.074 0.037 0.074 0.074 0.037 0.074 0.111 0.185
##
## Value 93 100
## Frequency 4 1
## Proportion 0.148 0.037
##
## For the frequency table, variable is rounded to the nearest 0
## --------------------------------------------------------------------------------
## PN
## n missing distinct Info Mean pMedian Gmd .05
## 27 0 12 0.989 42.15 42 30.13 7.0
## .10 .25 .50 .75 .90 .95
## 7.0 23.5 47.0 57.0 80.0 80.0
##
## Value 0 7 13 20 27 33 47 54 60 67 80
## Frequency 1 3 1 2 3 3 4 3 1 1 4
## Proportion 0.037 0.111 0.037 0.074 0.111 0.111 0.148 0.111 0.037 0.037 0.148
##
## Value 87
## Frequency 1
## Proportion 0.037
##
## For the frequency table, variable is rounded to the nearest 0
## --------------------------------------------------------------------------------
## PS
## n missing distinct Info Mean pMedian Gmd .05
## 27 0 13 0.972 0.003407 0.00335 0.0006177 0.00279
## .10 .25 .50 .75 .90 .95
## 0.00300 0.00300 0.00320 0.00380 0.00424 0.00430
##
## Value 0.0026 0.0027 0.0030 0.0031 0.0032 0.0033 0.0034 0.0036 0.0038
## Frequency 1 1 8 2 3 2 1 1 2
## Proportion 0.037 0.037 0.296 0.074 0.111 0.074 0.037 0.037 0.074
##
## Value 0.0040 0.0042 0.0043 0.0049
## Frequency 2 1 2 1
## Proportion 0.074 0.037 0.074 0.037
##
## For the frequency table, variable is rounded to the nearest 0
## --------------------------------------------------------------------------------
## Bloque
## n missing distinct
## 27 0 3
##
## Value 1 2 3
## Frequency 9 9 9
## Proportion 0.333 0.333 0.333
## --------------------------------------------------------------------------------
Statistics by category (grouped summaries) with tapply(), here the mean and variance of PG for each level of Tamaño.
tapply(datos$PG,datos$Tamaño,mean)
## chicas grandes medianas
## 54.55556 73.33333 68.77778
tapply(datos$PG,datos$Tamaño,var)
## chicas grandes medianas
## 694.0278 371.7500 1076.1944
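tapply() applies one function per call. To get several statistics per group at once in base R, one option is to split the vector by group and apply a function that returns a named vector. The sketch below uses only the twelve PG values printed by head(datos) and tail(datos), stored in the hypothetical vectors pg_rows and size_rows, so its group means differ from the full-data results above.

```r
# PG values and size labels from the rows printed by head() and tail()
pg_rows   <- c(60, 73, 73, 93, 66, 60, 87, 87, 80, 53, 40, 53)
size_rows <- rep(c("chicas", "grandes"), each = 6)

# One column per group, one row per statistic
stats_by_size <- sapply(split(pg_rows, size_rows),
                        function(x) c(n = length(x), mean = mean(x), sd = sd(x)))
stats_by_size

# The mean row matches what tapply() returns for the same data
tapply(pg_rows, size_rows, mean)
```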
In the following sections, I will add grouped summary statistics using the aggregate() function and the dplyr package. These methods allow a more detailed comparison of numeric variables across the levels of categorical factors.