

Dee Chiluiza, PhD
Northeastern University
Introduction to data analysis using R, RStudio and R Markdown
Short manual series: Basic descriptive statistics


Important Notes

- See the file R Markdown codes to learn about some of the basic codes used to produce this document, especially the use of inline R codes.
- As a general recommendation, use round() ONLY when you are ready to present the data, never while you are performing calculations or creating objects.
- In this document, some values are presented with gray highlighting, e.g., 32.456. These values were all produced using inline R codes enclosed in backticks. In the original Rmd file used to produce this HTML document, an inline R code looks like this: ``r code``.
- All R chunks used to create this document, and their corresponding codes, are presented. In some cases, the R chunk {r} settings are included as a comment starting with #; for example, the library and data sets R chunk contains the additional settings: {r librariesData, message=FALSE, warning=FALSE}. The information included in R chunk headings is explained in the R Markdown file.
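
To illustrate the recommendation about round(), here is a minimal sketch using a small hypothetical vector (values is not part of this document's data sets). It compares rounding before a calculation with rounding only at presentation time.

```r
# Hypothetical measurements used only for illustration
values = c(2.345, 2.345, 2.345)

# Rounding too early: each value loses precision before the sum
rounded_early = sum(round(values, 1))

# Rounding at presentation time: full precision is kept during the calculation
rounded_late = round(sum(values), 1)

print(c(rounded_early = rounded_early, rounded_late = rounded_late))
```

The two results differ (6.9 versus 7), which is why rounding is best left for the final presentation step.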

 


Table of contents

Basic descriptive statistics
The N value
Measures of central tendency
Measures of Dispersion
Measures of Position
Quartiles
Z scores
A shortcut to obtain basic descriptive statistics
References and additional recommended reading materials


As always, I start my R Markdown files by listing all libraries and data sets in the first R chunk.

# {r librariesData, message=FALSE, warning=FALSE}

# Libraries used in this document
library(tidyverse) 
library(gridExtra)   # For grid.arrange()
library(grid)        # For grid tables
library(DT)          # For data tables
library(knitr)       # For kable() tables
library(modeest)     # For mfv()
library(pander)      # For pander() tables
library(psych)       # For describe()

# Data sets used in this document
data("faithful")
data("mpg")
data("iris")
data("mtcars")
set1 = c(11,15,16,16,18,18,20,20,22,24,25,27,30,30,30,31,31,32,33,33,34,35,35,36,36,37,37,40,40,
         40,41,41,42,43,44,44,44,44,44,44,45,46,46,46,46,46,47,52,52,53,54,54,54,56,56,57,57,59,
         59,59,60,60,62,62,63,64,65,66,66,67,68,69,69,70,70,71,72,72,73,75,76,76,77,77,78,78,78,
         79,79,79,79,81,81,82,83,85,91,92,92,94,96,97,98,98,99,101,101,102,103,103,104,106,106,
         108,109,109,109,110,110,111,111,111,112,113,114,115,116,117,117,119,119,120,120,120,121,
         122,123,123,124,125)


 

Basic descriptive statistics


This document explores the following basic descriptive statistics: the N value, measures of central tendency, measures of dispersion, and measures of position.


 

The N value


The N value indicates how many observations your data has. N denotes the size of a population, and n denotes the size of a sample.
One way to compute the N value of a numerical vector in R is with length(). Try length(set1) in your console; the answer should be 140.

For a data set such as faithful, which contains columns and rows, the strategy is to count the rows. In R, use nrow(dataset name); it counts the rows without including the header (the names of the variables). In this case, the code is: nrow(faithful).


nrow(faithful): Data set faithful contains 272 observations.

nrow(mpg): Data set mpg contains 234 observations.

nrow(iris): Data set iris contains 150 observations.


If you use length() on a data set, the outcome will be the same as in ncol(), the number of columns/variables.


nrow(mtcars): Data set mtcars contains 32 observations.

ncol(mtcars): Data set mtcars contains 11 columns/variables.

length(mtcars): Data set mtcars contains 11 columns/variables.
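
The counts above can be reproduced in one short chunk. This is a minimal sketch using only base R; the vector scores is a hypothetical example, and dim() is shown as one possible way to get rows and columns at once.

```r
# Hypothetical example vector (not one of the document's data sets)
scores = c(4, 8, 15, 16, 23, 42)
n_vector = length(scores)    # number of elements in a vector

# mtcars ships with base R
n_rows = nrow(mtcars)        # observations (rows)
n_cols = ncol(mtcars)        # variables (columns)
n_len  = length(mtcars)      # for a data frame, same as ncol()
dims   = dim(mtcars)         # rows and columns at once

print(c(n_vector = n_vector, n_rows = n_rows, n_cols = n_cols, n_len = n_len))
print(dims)
```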


 

Measures of central tendency


Three of the most common measures of central tendency are the mean, median and mode. In the R chunk below, three objects are created to store these values.


The codes are quite simple:
For mean = mean()
For median = median()
For mode = modeest::mfv()


Observe: Notice the code modeest::mfv(). The syntax is: library::code. A code can cause conflicts in R if the same code name is present in different libraries; in that case, it is important to tell R which library you want to use. Other times, even if there are no conflicts, you may want to remind yourself, or tell others, which library that code belongs to.
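
As a small illustration of the library::code syntax, the sketch below uses only packages that come with base R (stats and utils), so it runs without installing anything. The vector x is a hypothetical example.

```r
# Hypothetical data used only for illustration
x = c(3, 1, 4, 1, 5, 9, 2, 6)

# Calling a function with and without its library prefix
qualified   = stats::median(x)   # explicit: median() from the stats package
unqualified = median(x)          # implicit: R finds median() on the search path

# The prefix also works for functions from other packages, e.g. utils::head()
first_rows = utils::head(mtcars, 3)

print(c(qualified = qualified, unqualified = unqualified))
print(first_rows)
```

Both calls return the same value here because there is no conflict; the prefix only makes the origin of the code explicit.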

set1_mean = mean(set1)
set1_median = median(set1)
set1_mode = modeest::mfv(set1) # mfv stands for most frequent value


Present the results using inline R codes.
The mean of set1 is: 69.64
The median of set1 is: 67.5
The mode of set1 is: 44
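
If the modeest library is not available, the mode can also be obtained with base R. The following is a minimal sketch under that assumption, using a hypothetical vector x and the base function table(); it is an alternative to mfv(), not the method used in this document.

```r
# Hypothetical data used only for illustration
x = c(2, 3, 3, 5, 7, 7, 7, 9)

x_mean   = mean(x)
x_median = median(x)

# Count how often each value occurs, then keep the most frequent value(s)
freq   = table(x)
x_mode = as.numeric(names(freq)[freq == max(freq)])

print(c(mean = x_mean, median = x_median, mode = x_mode))
```

If several values share the highest frequency, x_mode contains all of them.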

From the data set
Let’s use inline R codes to obtain these values.

Important: It is critical to remember the basic way to extract information from a data set. Observe what the data set mtcars looks like:

                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

If you are interested in the efficiency of these cars, measured in miles per gallon (column mpg), this is how you tell R that you want to extract that information:

First mention the name of the data set, add the dollar symbol $, then add the name of the variable, in this case: mtcars$mpg. Easy, right?
Let’s check these values using inline R codes.

The mean efficiency of all cars is: ``r round(mean(mtcars$mpg),2)`` = 20.09 mpg.

The median efficiency of all cars is: ``r median(mtcars$mpg)`` = 19.2 mpg.

The mean gross horsepower of all cars is: ``r round(mean(mtcars$hp),2)`` = 146.69 hp.
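
The same extraction can be tried directly in a chunk or in the console. This is a minimal sketch; the object names are hypothetical, and the [[ ]] form is shown only as an equivalent alternative to the $ syntax.

```r
# Extract one column from a data set with the dollar symbol
mpg_values = mtcars$mpg

# An equivalent way to extract the same column by name
same_values = mtcars[["mpg"]]

mpg_mean   = round(mean(mpg_values), 2)
mpg_median = median(mpg_values)
hp_mean    = round(mean(mtcars$hp), 2)

print(identical(mpg_values, same_values))
print(c(mpg_mean = mpg_mean, mpg_median = mpg_median, hp_mean = hp_mean))
```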


 

Measures of Dispersion


Some of the commonly used measures of dispersion include the variance, standard deviation, and range.

From the vector set1

The variance of set1.
``r round(var(set1),2)`` = 1023.41.

The standard deviation of set1.
``r round(sd(set1),2)`` = 31.99.

Given the variance, calculate the standard deviation.
``r round(sqrt(var(set1)),2)`` = 31.99.

Given the standard deviation, calculate the variance.
``r round(sd(set1)^2,2)`` = 1023.41.

The range of set1.
``r (max(set1)-min(set1))`` = 114.

For range = max() - min()
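
One detail worth knowing: the base R function range() returns the minimum and the maximum, not their difference. The sketch below, on a hypothetical vector x, shows how the measures of dispersion relate to each other.

```r
# Hypothetical data used only for illustration
x = c(11, 18, 25, 40, 125)

x_var = var(x)   # sample variance
x_sd  = sd(x)    # sample standard deviation

# range() gives c(min, max); diff() turns that pair into a single number
x_limits = range(x)
x_range  = diff(range(x))

print(x_limits)
print(c(variance = x_var, sd = x_sd, range = x_range))
```

Here diff(range(x)) gives the same result as max(x) - min(x), and sqrt(x_var) gives the same result as x_sd.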


From the data set faithful

Take a quick look at the data set using glimpse() and grid.table().

glimpse(faithful)
## Rows: 272
## Columns: 2
## $ eruptions <dbl> 3.600, 1.800, 3.333, 2.283, 4.533, 2.883, 4.700, 3.600, 1.95~
## $ waiting   <dbl> 79, 54, 74, 62, 85, 55, 88, 85, 51, 85, 54, 84, 78, 47, 83, ~

faithful %>%
  rename(Eruption=eruptions, Waiting = waiting) %>%
  head(10) %>%
  gridExtra::grid.table()


Using an R chunk, let’s prepare a table to display the standard deviation, variance and range, including both variables waiting time and eruption time.

er_sd = sd(faithful$eruptions)
er_var = var(faithful$eruptions)
er_rg = (max(faithful$eruptions)-min(faithful$eruptions))
wg_sd = sd(faithful$waiting)
wg_var = var(faithful$waiting)
wg_rg = (max(faithful$waiting)-min(faithful$waiting))

rownames = c("Standard Deviation", "Variance", "Range")
er_values = round(c(er_sd, er_var, er_rg),2)
wg_values = round(c(wg_sd, wg_var, wg_rg),2)

faith_values = tibble(rownames, wg_values, er_values)
colnames(faith_values) = c("", "Waiting", "Eruption")

faith_values %>% knitr::kable()

                     Waiting   Eruption
Standard Deviation     13.59       1.14
Variance              184.82       1.30
Range                  53.00       3.50


 

Measures of Position

For the minimum value: min(set1) = 11

Another way to obtain the minimum value, or 0% quantile: quantile(set1, 0) = 11

For the 25% quantile: quantile(set1, 0.25) = 44

For the 50% quantile: quantile(set1, 0.5) = 67.5

For the 75% quantile: quantile(set1, 0.75) = 99.5

You can get any quantile value, e.g., the 80% quantile: quantile(set1, 0.8) = 106

For the maximum value: max(set1) = 125

Quartiles

Quartiles divide the ordered data into four equal portions, each containing 1/4 of the observations.

In R, we still use quantile() to obtain all quartiles at once; the only rule is: do not specify a probability value as you did above.
For better presentation, let’s transform the quantile output into a data frame, then present it with kable().

quantile(set1) %>%
  as.data.frame() %>%
  knitr::kable()

            .
0%       11.0
25%      44.0
50%      67.5
75%      99.5
100%    125.0
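
To see why omitting the probability value returns the quartiles, the sketch below compares the default call with an explicit one on a hypothetical vector x. It also shows that several quantiles can be requested at once through the probs argument of quantile().

```r
# Hypothetical data used only for illustration
x = c(11, 20, 30, 44, 67, 68, 99, 106, 125)

# Default call: returns the 0%, 25%, 50%, 75% and 100% quantiles
quartiles = quantile(x)

# The same result with the probabilities written out explicitly
explicit = quantile(x, probs = seq(0, 1, 0.25))

# Any set of quantiles can be requested in one call
tails = quantile(x, probs = c(0.10, 0.90))

print(quartiles)
print(tails)
```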

From the data set

The minimum value of mpg: ``r min(mtcars$mpg)`` = 10.4.

The maximum value of mpg: ``r max(mtcars$mpg)`` = 33.9.

The 65% quantile of mpg: ``r quantile(mtcars$mpg, 0.65)`` = 21.4.

Z scores

Another measure of position is the z score. Also known as the standard score, it describes the position of a value in terms of how many standard deviations it lies from the mean (Bluman, 2017; Glen, n.d.).
For example, in a data set with mean 50 and standard deviation 5, the value 60 lies 2 standard deviations above the mean, so its z score is +2; the value 35 lies 3 standard deviations below the mean, so its z score is -3.
To obtain z scores in R, we calculate the mean and the standard deviation, and then apply the following formula:

$$z = \frac{x - \bar{x}}{s}$$

where $x$ is the value of interest, $\bar{x}$ is the mean, and $s$ is the standard deviation.

Using the data set mtcars:
- Let’s obtain and store the mean and standard deviation of variable mpg.
- Let’s obtain some x values from the same variable.
- Let’s calculate the Z scores of those values.
- And let’s use inline R codes to present the results.

mpg_mean = mean(mtcars$mpg)
mpg_sd = sd(mtcars$mpg)
value1 = quantile(mtcars$mpg, 0.20)
value2 = min(mtcars$mpg)
value3 = quantile(mtcars$mpg, 0.85)
value4 = max(mtcars$mpg)

z_value1 = ((value1 - mpg_mean)/mpg_sd)
z_value2 = ((value2 - mpg_mean)/mpg_sd)
z_value3 = ((value3 - mpg_mean)/mpg_sd)
z_value4 = ((value4 - mpg_mean)/mpg_sd)

The mean of mtcars$mpg is: 20.09
The sd of mtcars$mpg is: 6.03

Value 1 = 15.2, Z score = -0.81
Value 2 = 10.4, Z score = -1.61
Value 3 = 26.45, Z score = 1.06
Value 4 = 33.9, Z score = 2.29
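
The z score calculation can also be wrapped in a small function and applied to a whole variable. This is a minimal sketch: z_score() is a hypothetical helper written for illustration, and the base R function scale() is shown as one possible shortcut for standardizing every value at once.

```r
# Hypothetical helper: z score of x given a mean (center) and standard deviation (spread)
z_score = function(x, center, spread) {
  (x - center) / spread
}

# The worked example from the text: mean 50, standard deviation 5
z_60 = z_score(60, 50, 5)
z_35 = z_score(35, 50, 5)

# Z scores for every car in mtcars$mpg, computed manually and with scale()
manual_z = z_score(mtcars$mpg, mean(mtcars$mpg), sd(mtcars$mpg))
scaled_z = as.numeric(scale(mtcars$mpg))   # scale() returns a matrix, so convert it

print(c(z_60 = z_60, z_35 = z_35))
print(round(head(manual_z), 2))
```

After standardizing, the z scores have a mean of 0 and a standard deviation of 1.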


 

A shortcut to obtain basic descriptive statistics



An easy way to obtain many descriptive statistics from a data set is by using the code: psych::describe().

First let’s take a glimpse() of the data set iris.

glimpse(iris)
## Rows: 150
## Columns: 5
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.~
## $ Sepal.Width  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.~
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.~
## $ Petal.Width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.~
## $ Species      <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, s~

As we can see, there are 4 numerical variables and one categorical (fct) variable (Species).

iris %>%
  dplyr::select(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width)%>%
  psych::describe() %>%
  round(2) %>% 
  t() %>%
  pander()

           Sepal.Length   Sepal.Width   Petal.Length   Petal.Width
vars                  1             2              3             4
n                   150           150            150           150
mean               5.84          3.06           3.76           1.2
sd                 0.83          0.44           1.77          0.76
median              5.8             3           4.35           1.3
trimmed            5.81          3.04           3.76          1.18
mad                1.04          0.44           1.85          1.04
min                 4.3             2              1           0.1
max                 7.9           4.4            6.9           2.5
range               3.6           2.4            5.9           2.4
skew               0.31          0.31          -0.27          -0.1
kurtosis          -0.61          0.14          -1.42         -1.36
se                 0.07          0.04           0.14          0.06
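
If the psych library is not installed, a smaller version of this table can be built with base R. The sketch below is an assumption-based alternative, not the method used above: it selects the numerical columns of iris and applies a handful of functions to each one with sapply(). The object names are hypothetical.

```r
# Keep only the numerical columns of iris (drops the factor Species)
numeric_iris = iris[sapply(iris, is.numeric)]

# Apply several descriptive functions to every column at once
basic_stats = sapply(numeric_iris, function(x) {
  c(n = length(x), mean = mean(x), sd = sd(x),
    median = median(x), min = min(x), max = max(x))
})

print(round(basic_stats, 2))
```

The result is a matrix with one row per statistic and one column per variable, matching the layout of the transposed describe() table.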


References and additional recommended reading materials

Disclaimer: This short manual series is a work in progress. Unless stated otherwise, these files are considered draft versions.


Dee Chiluiza, PhD
06 August, 2021
Boston, Massachusetts, USA
