❶. Check file: R Markdown codes to learn about some of the basic codes used to produce this document, especially the use of inline R codes.
❷. As a general recommendation, use code round() ONLY when you are ready to present the data, never when you are performing your calculations or creating objects.
❸. In this document, some values are presented in gray highlight, e.g., 32.456. These values were all produced by using inline R codes enclosed in a pair of backticks. An inline R code will look like this in the original Rmd file used to produce this HTML document: ``r code``.
❹. All R chunks used to create this document, and their corresponding codes, are presented. In some cases, the R chunk {r} settings are included using a #; for example, the library and data sets R chunk contains the additional settings: {r librariesData, message=FALSE, warning=FALSE}. The information included in R chunk headings is explained in the R Markdown file.
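To illustrate recommendation ❷, here is a minimal sketch of how rounding too early can change a result. The vector x holds hypothetical values that are not part of this document's data sets.

```r
# Hypothetical values, used only for this illustration
x = c(1.004, 2.004, 3.004)

# Rounding at the end: calculate with full precision, then round once
late = round(sum(x), 2)

# Rounding too early: every value loses precision before the sum
early = sum(round(x, 2))

late   # 6.01
early  # 6
```

The two results differ only because of where round() was applied, which is why it is safer to keep unrounded values in your objects.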
# {r librariesData, message=FALSE, warning=FALSE}
# Libraries used in this document
library(tidyverse)
library(gridExtra) # For grid.arrange()
library(grid) # For grid tables
library(DT) # For data tables
library(knitr) # For kable() tables
library(modeest)
library(pander)
library(psych)
# Data sets used in this document
data("faithful")
data("mpg")
data("iris")
data("mtcars")
set1 = c(11,15,16,16,18,18,20,20,22,24,25,27,30,30,30,31,31,32,33,33,34,35,35,36,36,37,37,40,40,
40,41,41,42,43,44,44,44,44,44,44,45,46,46,46,46,46,47,52,52,53,54,54,54,56,56,57,57,59,
59,59,60,60,62,62,63,64,65,66,66,67,68,69,69,70,70,71,72,72,73,75,76,76,77,77,78,78,78,
79,79,79,79,81,81,82,83,85,91,92,92,94,96,97,98,98,99,101,101,102,103,103,104,106,106,
108,109,109,109,110,110,111,111,111,112,113,114,115,116,117,117,119,119,120,120,120,121,
122,123,123,124,125)

| Basic descriptive statistics |
The values of basic descriptive statistics that we will explore in this document are the following:
| The N value |
✻ The N value indicates how many observations your data has. N is used for the population parameter, and n is used for the sample statistic.
One way to compute the N value in R, from a numerical vector, is by using code length(set1). Try it in your console using vector set1; the answer you obtain should be 140.
From a data set, such as faithful, which contains columns and rows, the strategy is to check the number of rows. In R, use code nrow(dataset name); it counts the rows without including the headers (the names of the variables). In this case, the code will be: nrow(faithful).
nrow(faithful): Data set faithful contains 272 observations.
nrow(mpg): Data set mpg contains 234 observations.
nrow(iris): Data set iris contains 150 observations.
If you use length() on a data set, the outcome will be the same as in ncol(), the number of columns/variables.
nrow(mtcars): Data set mtcars contains 32 observations.
ncol(mtcars): Data set mtcars contains 11 columns/variables.
length(mtcars): Data set mtcars contains 11 columns/variables.
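The difference between these codes can be checked with a small sketch. The vector scores and the data frame toy are hypothetical objects, created only for this illustration.

```r
# Hypothetical toy data, used only for this illustration
scores = c(10, 20, 30, 40)
toy = data.frame(id = 1:3, height = c(1.6, 1.7, 1.8))

n_vector = length(scores)  # number of elements in the vector: 4
n_rows   = nrow(toy)       # number of observations (rows): 3
n_cols   = ncol(toy)       # number of variables (columns): 2
n_length = length(toy)     # a data frame is a list of columns, so this is also 2
```

For a data set, nrow() is therefore the code that returns the N value.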
| Measures of central tendency |
✻ Three of the most common measures of central tendency are the mean, median and mode. Check the R chunk below, where three objects were created to save these values.
The codes are quite simple:
For mean = mean()
For median = median()
For mode = modeest::mfv()
Observe: Notice the code modeest::mfv(). The syntax is: library::code. Sometimes a code can cause conflicts in R if the same code name is present in different libraries; in that case, it is important to tell R which library you want to use. Other times, even if there are no conflicts, you want to remind yourself, or tell others, which library that code belongs to.
set1_mean = mean(set1)
set1_median = median(set1)
set1_mode = modeest::mfv(set1) # mfv stands for most frequent value
Present the results using inline R codes.
The mean of set1 is: 69.64
The median of set1 is: 67.5
The mode of set1 is: 44
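If the modeest library is not available, the mode can also be approximated in base R. The following is only a sketch under that assumption, using a hypothetical vector x; it is not the method used to produce the values above.

```r
x = c(3, 5, 5, 7, 7, 7, 9)    # hypothetical values

x_mean   = mean(x)
x_median = stats::median(x)   # library::code syntax, here with a base library

# Base R sketch of a mode: count each value, keep the most frequent one(s)
counts = table(x)
x_mode = as.numeric(names(counts)[counts == max(counts)])

x_median  # 7
x_mode    # 7
```

When several values are tied for the highest count, this sketch returns all of them.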
From the data set
Let’s use inline R codes to obtain these values.
Important: It is critical to remember the basic way to extract information from a data set. Observe what the data set mtcars looks like:
| | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Mazda RX4 | 21.0 | 6 | 160 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
| Mazda RX4 Wag | 21.0 | 6 | 160 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
| Datsun 710 | 22.8 | 4 | 108 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
| Hornet 4 Drive | 21.4 | 6 | 258 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
| Hornet Sportabout | 18.7 | 8 | 360 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
| Valiant | 18.1 | 6 | 225 | 105 | 2.76 | 3.460 | 20.22 | 1 | 0 | 3 | 1 |
If you are interested in the efficiency of these cars, measured in miles per gallon (column mpg), the way to tell R that you want to extract that information is the following:
First mention the name of the data set, add the dollar symbol $, then add the name of the variable, in this case: mtcars$mpg. Easy, right?
Let’s check these values using inline R codes.
The mean efficiency of all cars is: ``r round(mean(mtcars$mpg),2)`` = 20.09 mpg.
The median efficiency of all cars is: ``r median(mtcars$mpg)`` = 19.2 mpg.
The mean gross horsepower of all cars is: ``r round(mean(mtcars$hp),2)`` = 146.69 hp.
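As a quick check of the $ syntax, the sketch below compares it with the double-bracket form, which is an equivalent base R alternative not used elsewhere in this document. The object names by_dollar and by_bracket are chosen only for this illustration.

```r
data("mtcars")                      # built-in data set

by_dollar  = mtcars$mpg             # data set name, $, variable name
by_bracket = mtcars[["mpg"]]        # equivalent double-bracket form

identical(by_dollar, by_bracket)    # TRUE
round(mean(by_dollar), 2)           # 20.09, as reported above
```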
| Measures of Dispersion |
✻ Some of the commonly used measures of dispersion include the variance, standard deviation, and range.
From the vector set1
The variance of set1.
``r round(var(set1),2)`` = 1023.41.
The standard deviation of set1.
``r round(sd(set1),2)`` = 31.99.
Given the variance, calculate the standard deviation.
``r round(sqrt(var(set1)),2)`` = 31.99.
Given the standard deviation, calculate the variance.
``r round(sd(set1)^2,2)`` = 1023.41.
The range of set1.
``r (max(set1)-min(set1))`` = 114.
For range = max() - min()
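To see how these measures relate to each other, the sketch below recomputes the sample variance and standard deviation by hand and compares them with var() and sd(). The vector x is a hypothetical example.

```r
x = c(2, 4, 4, 4, 5, 5, 7, 9)   # hypothetical values, mean = 5

n = length(x)
var_manual = sum((x - mean(x))^2) / (n - 1)   # sample variance: 32 / 7
sd_manual  = sqrt(var_manual)                 # sd is the square root of the variance

all.equal(var_manual, var(x))   # TRUE
all.equal(sd_manual, sd(x))     # TRUE

# range() returns the minimum and the maximum; diff() turns them into one number
x_range = diff(range(x))        # 9 - 2 = 7
```

Note that var() and sd() divide by n - 1, so they return the sample versions of these statistics.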
From the data set faithful
Take a quick look at the data set using codes glimpse() and grid.table().
## Rows: 272
## Columns: 2
## $ eruptions <dbl> 3.600, 1.800, 3.333, 2.283, 4.533, 2.883, 4.700, 3.600, 1.95~
## $ waiting <dbl> 79, 54, 74, 62, 85, 55, 88, 85, 51, 85, 54, 84, 78, 47, 83, ~
Using an R chunk, let’s prepare a table to display the standard deviation, variance and range, including both variables waiting time and eruption time.
er_sd = sd(faithful$eruptions)
er_var = var(faithful$eruptions)
er_rg = (max(faithful$eruptions)-min(faithful$eruptions))
wg_sd = sd(faithful$waiting)
wg_var = var(faithful$waiting)
wg_rg = (max(faithful$waiting)-min(faithful$waiting))
rownames = c("Standard Deviation", "Variance", "Range")
er_values = round(c(er_sd, er_var, er_rg),2)
wg_values = round(c(wg_sd, wg_var, wg_rg),2)
faith_values = tibble(rownames, wg_values, er_values)
colnames(faith_values) = c("", "Waiting", "Eruption")
faith_values %>% knitr::kable()

| | Waiting | Eruption |
|---|---|---|
| Standard Deviation | 13.59 | 1.14 |
| Variance | 184.82 | 1.30 |
| Range | 53.00 | 3.50 |
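The same three values can be obtained with less repetition. The sketch below is an alternative, not the chunk used above: it assumes a hypothetical helper named dispersion() and applies it to every column with sapply().

```r
data("faithful")

# Hypothetical helper: three measures of dispersion for one numeric vector
dispersion = function(x) {
  c("Standard Deviation" = sd(x),
    "Variance"           = var(x),
    "Range"              = max(x) - min(x))
}

# sapply() applies the helper to each column and binds the results into a matrix
faith_matrix = round(sapply(faithful, dispersion), 2)
faith_matrix
```

The result is a matrix with one column per variable (eruptions, waiting), which can also be passed to knitr::kable() for presentation.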
| Measures of Position |
For the minimum value: min(set1) = 11
Another way to obtain the minimum value, or 0% quantile: quantile(set1, 0) = 11
For the 25% quantile: quantile(set1, 0.25) = 44
For the 50% quantile: quantile(set1, 0.5) = 67.5
For the 75% quantile: quantile(set1, 0.75) = 99.5
You can get any quantile value, e.g., the 80% quantile: quantile(set1, 0.8) = 106
For the maximum value: max(set1) = 125
Quartiles divide the data into four equal portions, 1/4 each.
In R, we still use the quantile() code to obtain all quartiles at once; the only rule is: do not give a probability value as you did above.
For better presentation, let’s transform quantiles data into a data frame, then present it as a kable.
| | . |
|---|---|
| 0% | 11.0 |
| 25% | 44.0 |
| 50% | 67.5 |
| 75% | 99.5 |
| 100% | 125.0 |
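The chunk that produced the table above is not shown in this document. A minimal base R sketch of the same idea, using a hypothetical vector x and the assumed column names Quantile and Value, could look like this:

```r
x = 1:9                 # hypothetical values

# No probability value given: quantile() returns 0%, 25%, 50%, 75% and 100%
q = quantile(x)

# Turn the named vector into a data frame for presentation
q_df = data.frame(Quantile = names(q), Value = unname(q))
q_df

# With the knitr library loaded, knitr::kable(q_df) would render it as a table
```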
From the data set
The minimum value of mpg: ``r min(mtcars$mpg)`` = 10.4.
The maximum value of mpg: ``r max(mtcars$mpg)`` = 33.9.
The 65% quantile of mpg: ``r quantile(mtcars$mpg, 0.65)`` = 21.4.
Another measure of position is the z score. Also known as the standard score, it describes the position of a value in terms of standard deviations from the mean (Bluman, 2017; Glen, n.d.).
For example, in a data set with mean 50 and standard deviation 5, the value 60 is located 2 standard deviations above the mean, so its z score is +2; the value 35 is located 3 standard deviations below the mean, so its z score is -3.
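The arithmetic of this example can be verified with a short sketch. The object names ex_mean, ex_sd and ex_values are chosen only for this illustration.

```r
ex_mean   = 50
ex_sd     = 5
ex_values = c(60, 35)

# Distance from the mean, expressed in standard deviations
ex_z = (ex_values - ex_mean) / ex_sd
ex_z   # 2 -3
```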
To obtain the z scores in R, we need to calculate the mean, the standard deviation, and then apply the following formula: z = (x - mean) / standard deviation.
Using the data set mtcars:
- Let’s obtain and store the mean and standard deviation of variable mpg.
- Let’s obtain some x values from the same variable.
- Let’s calculate the Z scores of those values.
- And let’s use inline R codes to present the results.
mpg_mean = mean(mtcars$mpg)
mpg_sd = sd(mtcars$mpg)
value1 = quantile(mtcars$mpg, 0.20)
value2 = min(mtcars$mpg)
value3 = quantile(mtcars$mpg, 0.85)
value4 = max(mtcars$mpg)
z_value1 = ((value1 - mpg_mean)/mpg_sd)
z_value2 = ((value2 - mpg_mean)/mpg_sd)
z_value3 = ((value3 - mpg_mean)/mpg_sd)
z_value4 = ((value4 - mpg_mean)/mpg_sd)

The mean of mtcars$mpg is: 20.09
The sd of mtcars$mpg is: 6.03
Value 1 = 15.2 — Z score = -0.81
Value 2 = 10.4 — Z score = -1.61
Value 3 = 26.45 — Z score = 1.06
Value 4 = 33.9 — Z score = 2.29
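To obtain the z scores of every value in a variable at once, base R also offers scale(). The sketch below is an alternative to the step-by-step approach above and checks that both give the same result; the object names z_manual and z_scale are chosen only for this illustration.

```r
data("mtcars")

# Manual z scores for the whole variable (R subtracts and divides element by element)
z_manual = (mtcars$mpg - mean(mtcars$mpg)) / sd(mtcars$mpg)

# scale() centers by the mean and divides by the sd; it returns a one-column matrix
z_scale = as.numeric(scale(mtcars$mpg))

all.equal(z_manual, z_scale)   # TRUE
round(range(z_manual), 2)      # -1.61  2.29, matching the minimum and maximum above
```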
| A shortcut to obtain basic descriptive statistics |
An easy way to obtain a lot of descriptive statistic values from a data set is by using the code: psych::describe().
First, let’s take a glimpse() of the data set iris.
## Rows: 150
## Columns: 5
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.~
## $ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.~
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.~
## $ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.~
## $ Species <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, s~
As we can see, there are four numerical variables and one categorical (fct) variable (Species).
In the R chunk below, observe the use of “pipes” %>% to organize the data analysis process.
Since we need descriptive statistics from the numerical variables, use code dplyr::select() to isolate the numerical variables only.
Use code psych::describe() to obtain multiple descriptive statistic values.
Round the values to two decimal places using code round(2).
Flip the columns to rows positions by using the transpose code t().
Then use code pander() to present the table.
In the final table, observe all the values of descriptive statistics we obtained by following this strategy.
iris %>%
  dplyr::select(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) %>%
  psych::describe() %>%
  round(2) %>%
  t() %>%
  pander()

| | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width |
|---|---|---|---|---|
| vars | 1 | 2 | 3 | 4 |
| n | 150 | 150 | 150 | 150 |
| mean | 5.84 | 3.06 | 3.76 | 1.2 |
| sd | 0.83 | 0.44 | 1.77 | 0.76 |
| median | 5.8 | 3 | 4.35 | 1.3 |
| trimmed | 5.81 | 3.04 | 3.76 | 1.18 |
| mad | 1.04 | 0.44 | 1.85 | 1.04 |
| min | 4.3 | 2 | 1 | 0.1 |
| max | 7.9 | 4.4 | 6.9 | 2.5 |
| range | 3.6 | 2.4 | 5.9 | 2.4 |
| skew | 0.31 | 0.31 | -0.27 | -0.1 |
| kurtosis | -0.61 | 0.14 | -1.42 | -1.36 |
| se | 0.07 | 0.04 | 0.14 | 0.06 |
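If the psych library is not installed, some of these values can be reproduced in base R. The sketch below assumes a hypothetical helper named basic_stats() and covers only a few of the statistics in the table; it is not a replacement for psych::describe().

```r
data("iris")

# Keep only the numerical columns (this drops the factor Species)
iris_num = iris[sapply(iris, is.numeric)]

# Hypothetical helper with a few of the statistics reported above
basic_stats = function(x) {
  c(n = length(x), mean = mean(x), sd = sd(x),
    median = median(x), min = min(x), max = max(x))
}

# One column per variable, one row per statistic
iris_stats = round(sapply(iris_num, basic_stats), 2)
iris_stats
```

Because sapply() returns the statistics as rows, no transpose step with t() is needed here.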
References and additional recommended reading materials:
Disclaimer: This short series manual project is a work in progress. Unless clearly stated otherwise, these files are considered draft versions.