
Dee Chiluiza, PhD
Northeastern University
Introduction to data analysis using R, R Studio and R Markdown
Short manual series: Basic descriptive statistics

Important Notes

❶. Check the file "R Markdown codes" to learn about some of the basic codes used to produce this document, especially the use of inline R codes.
❷. As a general recommendation, use code round() ONLY when you are ready to present the data, never when you are performing your calculations or creating objects.
❸. In this document, some values are presented with a gray highlight, e.g., 32.456. These values were all produced by using inline R codes with double backticks. An inline R code looks like this in the original Rmd file used to produce this HTML document: ``r code``.
❹. All R chunks used to create this document, and their corresponding codes, are presented. In some cases, the R chunk {r} settings are included using a #; for example, the library and data sets R chunk contains the additional settings: {r librariesData, message=FALSE, warning=FALSE}. The information included in R chunk headings is explained in the R Markdown file.
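
Note ❷ can be illustrated with a minimal sketch. It assumes the built-in faithful data set; the object names var_full, sd_full and sd_early are illustrative and are not used elsewhere in this document.

```r
# Keep full precision while calculating; round only when presenting.
var_full = var(faithful$waiting)       # stored without rounding

sd_full  = sqrt(var_full)              # calculated from the unrounded value
sd_early = sqrt(round(var_full, 0))    # calculated from a value rounded too early

# Round at the very end, when the values are presented
round(sd_full, 2)
round(sd_early, 2)

# The early rounding changed the final result
sd_full - sd_early
```

The two presented values differ in the second decimal, which is why rounding belongs in the last step only.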

Table of contents
	• Basic descriptive statistics
	• The N value
	• Measures of central tendency
	• Measures of Dispersion
	• Measures of Position
	• Quartiles
	• Z scores
	• A shortcut to obtain basic descriptive statistics
	• References and additional recommended reading materials

As always, I start my R Markdown files by listing all libraries and data sets in the first R chunk.

# {r librariesData, message=FALSE, warning=FALSE}

# Libraries used in this document
library(tidyverse) 
library(gridExtra)   # For grid.arrange()
library(grid)        # For grid tables
library(DT)          # For data tables
library(knitr)       # For kable() tables
library(modeest)
library(pander)
library(psych)

# Data sets used in this document
data("faithful")
data("mpg")
data("iris")
data("mtcars")
set1 = c(11,15,16,16,18,18,20,20,22,24,25,27,30,30,30,31,31,32,33,33,34,35,35,36,36,37,37,40,40,
         40,41,41,42,43,44,44,44,44,44,44,45,46,46,46,46,46,47,52,52,53,54,54,54,56,56,57,57,59,
         59,59,60,60,62,62,63,64,65,66,66,67,68,69,69,70,70,71,72,72,73,75,76,76,77,77,78,78,78,
         79,79,79,79,81,81,82,83,85,91,92,92,94,96,97,98,98,99,101,101,102,103,103,104,106,106,
         108,109,109,109,110,110,111,111,111,112,113,114,115,116,117,117,119,119,120,120,120,121,
         122,123,123,124,125)

Basic descriptive statistics


The values of basic descriptive statistics that we will explore in this document are the following:

The N value


✻
The N value indicates how many observations your data has. N is used for the population size, and n is used for the sample size.
One way to compute the N value in R, from a numerical vector, is by using the code length(set1). Try it in your console using vector set1; the answer you obtain should be 140.

From a data set, such as faithful, which contains columns and rows, the strategy is to count the number of rows. In R, use the code nrow(dataset name); it counts the rows without including the headers (the names of the variables). In this case, the code is: nrow(faithful).

nrow(faithful): Data set faithful contains 272 observations.

nrow(mpg): Data set mpg contains 234 observations.

nrow(iris): Data set iris contains 150 observations.

If you use length() on a data set, the outcome will be the same as in ncol(), the number of columns/variables.

nrow(mtcars): Data set mtcars contains 32 observations.

ncol(mtcars): Data set mtcars contains 11 columns/variables.

length(mtcars): Data set mtcars contains 11 columns/variables.
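
The difference between length(), nrow() and ncol() can be checked with a short sketch. It uses the built-in faithful data set; the vector x is a made-up example.

```r
# A numeric vector: length() counts its elements
x = c(4, 8, 15, 16, 23, 42)
length(x)          # 6

# A data frame: nrow() counts observations, ncol() counts variables
nrow(faithful)     # 272
ncol(faithful)     # 2
dim(faithful)      # rows and columns together

# length() on a data frame counts columns, not rows
length(faithful) == ncol(faithful)   # TRUE
```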

Measures of central tendency


✻
Three of the most common measures of central tendency are the mean, median and mode. Check the R chunk below, where three objects are created to store these values.

The codes are quite simple:
For mean = mean()
For median = median()
For mode = modeest::mfv()

Observe: Notice the code modeest::mfv(). The syntax is: library::code. If the same code exists in more than one loaded library, R may not use the one you intended; in that case, it is important to tell R which library you want to use. Other times, even when there are no conflicts, you may want to remind yourself, or tell others, which library a code belongs to.

set1_mean = mean(set1)
set1_median = median(set1)
set1_mode = modeest::mfv(set1) # mfv stands for most frequent value

Present the results using inline R codes.
The mean of set1 is: 69.64
The median of set1 is: 67.5
The mode of set1 is: 44
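
If the modeest library is not available, the mode can also be found with base R only. The sketch below is one possible approach, not the method used in this document; mode_base() is a hypothetical helper and x is a made-up vector.

```r
# Hypothetical helper: most frequent value(s) using only base R
mode_base = function(x) {
  counts = table(x)                                  # frequency of each distinct value
  as.numeric(names(counts)[counts == max(counts)])   # keep every value tied for the top count
}

x = c(30, 44, 44, 44, 46, 46, 52)
mean(x)
median(x)      # 44
mode_base(x)   # 44
```

When several values are tied for the highest frequency, the helper returns all of them.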

From the data set
Let’s use inline R codes to obtain these values.

Important: It is critical to remember the basic way to extract information from a data set. Observe what the data set mtcars looks like:

	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
Mazda RX4	21.0	6	160	110	3.90	2.620	16.46	0	1	4	4
Mazda RX4 Wag	21.0	6	160	110	3.90	2.875	17.02	0	1	4	4
Datsun 710	22.8	4	108	93	3.85	2.320	18.61	1	1	4	1
Hornet 4 Drive	21.4	6	258	110	3.08	3.215	19.44	1	0	3	1
Hornet Sportabout	18.7	8	360	175	3.15	3.440	17.02	0	0	3	2
Valiant	18.1	6	225	105	2.76	3.460	20.22	1	0	3	1

If you are interested in the efficiency of these cars, measured in miles per gallon (column mpg), this is how you tell R that you want to extract that information:

First write the name of the data set, add the dollar symbol $, then add the name of the variable; in this case: mtcars$mpg. Easy, right?
Let’s check these values using inline R codes.

The mean efficiency of all cars is: ``r round(mean(mtcars$mpg),2)`` = 20.09 mpg.

The median efficiency of all cars is: ``r median(mtcars$mpg)`` = 19.2 mpg.

The mean gross horsepower of all cars is: ``r round(mean(mtcars$hp),2)`` = 146.69 hp.
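
The $ syntax is one of several equivalent ways to extract a column. The sketch below compares it with two base R alternatives; the object names are illustrative.

```r
# Three equivalent ways to pull one column out of a data frame as a vector
by_dollar   = mtcars$mpg        # data set, $, variable name
by_brackets = mtcars[["mpg"]]   # same column, selected by name as a string
by_position = mtcars[, 1]       # same column, selected by position (mpg is column 1)

identical(by_dollar, by_brackets)   # TRUE
identical(by_dollar, by_position)   # TRUE

# The extracted vector can go straight into a summary code
round(mean(by_dollar), 2)           # 20.09
```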

Measures of Dispersion


✻ Some of the commonly used measures of dispersion include the variance, standard deviation, and range.

From the vector set1

The variance of set1.
``r round(var(set1),2)`` = 1023.41.

The standard deviation of set1.
``r round(sd(set1),2)`` = 31.99.

Given the variance, calculate the standard deviation.
``r round(sqrt(var(set1)),2)`` = 31.99.

Given the standard deviation, calculate the variance.
``r round(sd(set1)^2,2)`` = 1023.41.

The range of set1.
``r (max(set1)-min(set1))`` = 114.

For range = max() - min()
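
To see what var() and sd() compute, the sample variance can be reproduced by hand. This is a minimal sketch using faithful$waiting; var_manual and sd_manual are illustrative names.

```r
x = faithful$waiting
n = length(x)

# Sample variance: sum of squared deviations from the mean, divided by n - 1
var_manual = sum((x - mean(x))^2) / (n - 1)
sd_manual  = sqrt(var_manual)

all.equal(var_manual, var(x))   # TRUE
all.equal(sd_manual, sd(x))     # TRUE

# range() returns the minimum and maximum; diff() gives the distance between them
diff(range(x))                  # 53
```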

From the data set faithful

Take a quick look at the data set using codes glimpse() and grid.table().

glimpse(faithful)

## Rows: 272
## Columns: 2
## $ eruptions <dbl> 3.600, 1.800, 3.333, 2.283, 4.533, 2.883, 4.700, 3.600, 1.95~
## $ waiting   <dbl> 79, 54, 74, 62, 85, 55, 88, 85, 51, 85, 54, 84, 78, 47, 83, ~

faithful %>% 
  rename(Eruption=eruptions, Waiting = waiting) %>%
  head(10) %>%
  gridExtra::grid.table()

Using an R chunk, let’s prepare a table to display the standard deviation, variance and range, including both variables waiting time and eruption time.

er_sd = sd(faithful$eruptions)
er_var = var(faithful$eruptions)
er_rg = (max(faithful$eruptions)-min(faithful$eruptions))
wg_sd = sd(faithful$waiting)
wg_var = var(faithful$waiting)
wg_rg = (max(faithful$waiting)-min(faithful$waiting))

rownames = c("Standard Deviation", "Variance", "Range")
er_values = round(c(er_sd, er_var, er_rg),2)
wg_values = round(c(wg_sd, wg_var, wg_rg),2)

faith_values = tibble(rownames, wg_values, er_values)
colnames(faith_values) = c("", "Waiting", "Eruption")

faith_values %>% knitr::kable()

	Waiting	Eruption
Standard Deviation	13.59	1.14
Variance	184.82	1.30
Range	53.00	3.50
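
The same table can be built with less repetition by applying one function to every column. This is an alternative sketch, not the chunk used above; dispersion() is a hypothetical helper.

```r
# Hypothetical helper that returns the three dispersion measures for one vector
dispersion = function(x) {
  c("Standard Deviation" = sd(x),
    "Variance"           = var(x),
    "Range"              = max(x) - min(x))
}

# Apply the helper to every column of faithful;
# the result is a matrix with one column per variable
faith_table = round(sapply(faithful, dispersion), 2)
faith_table
```

The resulting matrix can be passed to knitr::kable() in the same way as the tibble above.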

Measures of Position


For the minimum value: min(set1) = 11

Another way to obtain the minimum value, or 0% quantile: quantile(set1, 0) = 11

For the 25% quantile: quantile(set1, 0.25) = 44

For the 50% quantile: quantile(set1, 0.5) = 67.5

For the 75% quantile: quantile(set1, 0.75) = 99.5

You can get any quantile value, e.g., the 80% quantile: quantile(set1, 0.8) = 106

For the maximum value: max(set1) = 125
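
Quantiles often fall between two observations, so quantile() interpolates. The sketch below reproduces the default method (type 7 in the quantile() help page) by hand for the 85% quantile of mtcars$mpg; all object names are illustrative.

```r
x = sort(mtcars$mpg)
n = length(x)
p = 0.85

h     = (n - 1) * p + 1   # fractional position of the quantile in the sorted data
lower = floor(h)          # position of the observation just below
frac  = h - lower         # how far to move toward the next observation

q_manual = x[lower] + frac * (x[lower + 1] - x[lower])

all.equal(q_manual, unname(quantile(mtcars$mpg, p)))   # TRUE
```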

Quartiles

Quartiles divide the sorted data into four equal portions, each containing 1/4 of the observations.

In R, we still use the quantile() code to obtain all quartiles at once; the only rule is to leave out the probability value you defined above.
For better presentation, let’s transform the quantile data into a data frame, then present it as a kable() table.

quantile(set1) %>%
  as.data.frame() %>%
  knitr::kable()

	.
0%	11.0
25%	44.0
50%	67.5
75%	99.5
100%	125.0
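
The quartiles returned by quantile() can be reused directly. As a small sketch on mtcars$mpg, the distance between the 75% and 25% quartiles is the interquartile range, which base R also provides as IQR(); iqr_manual is an illustrative name.

```r
q = quantile(mtcars$mpg)   # no probability value: returns 0%, 25%, 50%, 75%, 100%
names(q)

# The interquartile range is the distance between the 75% and 25% quartiles
iqr_manual = unname(q["75%"] - q["25%"])
all.equal(iqr_manual, IQR(mtcars$mpg))              # TRUE

# The 50% quartile is the median
all.equal(unname(q["50%"]), median(mtcars$mpg))     # TRUE
```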

From the data set

The minimum value of mpg: ``r min(mtcars$mpg)`` = 10.4.

The maximum value of mpg: ``r max(mtcars$mpg)`` = 33.9.

The 65% quantile of mpg: ``r quantile(mtcars$mpg, 0.65)`` = 21.4.

Z scores

Another measure of position is the z score. Also known as the standard score, it describes the position of a value in terms of standard deviations from the mean (Bluman, 2017; Glen, n.d.).
For example, in a data set with mean 50 and standard deviation 5, the value 60 is located 2 standard deviations above the mean, so its z score is +2; the value 35 is located 3 standard deviations below the mean, so its z score is -3.
To obtain z scores in R, we need to calculate the mean and the standard deviation, and then apply the following formula:

$$z = \frac{x - \bar{x}}{s}$$

where $x$ is the observed value, $\bar{x}$ is the mean and $s$ is the standard deviation.

Using the data set mtcars:
- Let’s obtain and store the mean and standard deviation of variable mpg.
- Let’s obtain some x values from the same variable.
- Let’s calculate the Z scores of those values.
- And let’s use inline R codes to present the results.

mpg_mean = mean(mtcars$mpg)
mpg_sd = sd(mtcars$mpg)
value1 = quantile(mtcars$mpg, 0.20)
value2 = min(mtcars$mpg)
value3 = quantile(mtcars$mpg, 0.85)
value4 = max(mtcars$mpg)

z_value1 = ((value1 - mpg_mean)/mpg_sd)
z_value2 = ((value2 - mpg_mean)/mpg_sd)
z_value3 = ((value3 - mpg_mean)/mpg_sd)
z_value4 = ((value4 - mpg_mean)/mpg_sd)

The mean of mtcars$mpg is: 20.09
The sd of mtcars$mpg is: 6.03

Value 1 = 15.2 — Z score = -0.81
Value 2 = 10.4 — Z score = -1.61
Value 3 = 26.45 — Z score = 1.06
Value 4 = 33.9 — Z score = 2.29
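
The same calculation can be applied to a whole column at once. The sketch below first checks the worked example from the text, then compares the manual formula with the base R code scale(); z_manual and z_scale are illustrative names.

```r
# Worked example from the text: mean 50, standard deviation 5
(60 - 50) / 5    #  2
(35 - 50) / 5    # -3

# z scores for every car at once
z_manual = (mtcars$mpg - mean(mtcars$mpg)) / sd(mtcars$mpg)
z_scale  = as.numeric(scale(mtcars$mpg))   # by default, scale() centers on the mean and divides by the sd

all.equal(z_manual, z_scale)   # TRUE
round(range(z_manual), 2)      # -1.61  2.29
```

The smallest and largest z scores match Value 2 and Value 4 above, since those are the minimum and maximum of mpg.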

A shortcut to obtain basic descriptive statistics


An easy way to obtain a lot of descriptive statistic values from a data set is by using the code: psych::describe().

First, let’s take a glimpse() of the data set iris.

glimpse(iris)

## Rows: 150
## Columns: 5
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.~
## $ Sepal.Width  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.~
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.~
## $ Petal.Width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.~
## $ Species      <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, s~

As we can see, there are four numerical variables and one categorical (fct) variable (Species).

In the R chunk below, observe the use of “pipes” %>% to organize the data analysis process.
Since we need descriptive statistics from the numerical variables, use code dplyr::select() to isolate the numerical variables only.
Use code psych::describe() to obtain multiple descriptive statistic values.
Round the values to two decimal places using code round(2).
Swap rows and columns by using the transpose code t().
Then use code pander() to present the table.
In the final table, observe all the values of descriptive statistics we obtained by following this strategy.

iris %>%
  dplyr::select(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) %>%
  psych::describe() %>%
  round(2) %>% 
  t() %>%
  pander()

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width
vars	1	2	3	4
n	150	150	150	150
mean	5.84	3.06	3.76	1.2
sd	0.83	0.44	1.77	0.76
median	5.8	3	4.35	1.3
trimmed	5.81	3.04	3.76	1.18
mad	1.04	0.44	1.85	1.04
min	4.3	2	1	0.1
max	7.9	4.4	6.9	2.5
range	3.6	2.4	5.9	2.4
skew	0.31	0.31	-0.27	-0.1
kurtosis	-0.61	0.14	-1.42	-1.36
se	0.07	0.04	0.14	0.06
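
Some of these values can also be reproduced with base R only, which helps to see where they come from. The sketch below is an assumption-labeled alternative, not a replacement for psych::describe(); basic_stats() is a hypothetical helper that covers only a few of the rows shown above.

```r
# Keep only the numeric columns (this drops Species)
iris_num = iris[sapply(iris, is.numeric)]

# Hypothetical helper returning a few of the statistics shown above
basic_stats = function(x) {
  c(n = length(x), mean = mean(x), sd = sd(x), median = median(x),
    min = min(x), max = max(x), range = max(x) - min(x))
}

# One column of results per variable, rounded only for presentation
stats_table = round(sapply(iris_num, basic_stats), 2)
stats_table
```

Selecting columns with is.numeric avoids typing every variable name, which is convenient for data sets with many columns.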

References and additional recommended reading materials:


Glen, S. (n.d.). Z-Score: Definition, Formula and Calculation. StatisticsHowTo.com: Elementary Statistics for the rest of us! https://www.statisticshowto.com/probability-and-statistics/z-score/

Disclaimer: This short manual series is a work in progress. Unless clearly stated otherwise, these files are considered draft versions.

Dee Chiluiza, PhD
06 August, 2021
Boston, Massachusetts, USA
Bruno Dog