Learning Objectives:

At the end of the session, the participants are expected to:

  • understand the basic concepts and use of descriptive statistics;
  • learn how to use the basic functions in R for generating measures of central tendency and variability, and frequency tables.

Introduction

“Statistical Thinking will one day be as necessary for efficient citizenship as the ability to read and write”.
— H.G. Wells
  • With the rise of big data and the increasing interest in artificial intelligence, we believe that the day Wells was referring to has already arrived.

  • We are surrounded by data, and we can take advantage of this by learning the concepts of Statistics and Computing.

Descriptive statistics for a single group

Here, we’ll use the built-in R data set named iris.

# Store the data in the variable my_data
my_data <- iris

You can inspect your data using the functions head() and tail(), which display the first and the last part of the data, respectively.

# Print the first 6 rows
head(my_data, 6)

Measures of central tendency

Roughly speaking, the central tendency measures the “average” or the “middle” of your data. The most commonly used measures include:

  • the mean: the average value; it's sensitive to outliers.
  • the median: the middle value; it's a robust alternative to the mean.
  • the mode: the most frequent value.

The mathematical formula for the mean is \[\bar{X}=\frac{1}{n}\sum_{i=1}^{n} X_i \]

The median, on the other hand, is the value at position \(\frac{n+1}{2}\) in the ordered data when \(n\) is odd. When \(n\) is even, we

  • find the value at position \(\frac{n}{2}\),
  • find the value at position \(\frac{n}{2}+1\),
  • and take the average of those two values as the median (see the short sketch below).
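As a quick illustration of the even-\(n\) case, here is a short sketch using a small made-up vector (not the iris data):

# Median of an even-length vector, step by step (made-up data)
x <- c(2, 4, 7, 10, 12, 15)      # n = 6, already sorted
n <- length(x)
(x[n / 2] + x[n / 2 + 1]) / 2    # average of positions 3 and 4 -> 8.5
median(x)                        # the built-in function gives the same result
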
# Compute the mean value
mean(my_data$Sepal.Length)
## [1] 5.843333
# Compute the median value
median(my_data$Sepal.Length)
## [1] 5.8
# Compute the modal value (mode)
# Create the function.
getmode <- function(v) {
   uniqv <- unique(v)
   uniqv[which.max(tabulate(match(v, uniqv)))]
}

# Use it in the calculation
getmode(my_data$Sepal.Length)
## [1] 5
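
A custom function is needed here because base R has no built-in function for the statistical mode; the base function mode() reports an object's storage mode instead:

# mode() in base R returns the storage type, not the most frequent value
mode(my_data$Sepal.Length)
## [1] "numeric"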

Measures of variability

Measures of variability describe how “spread out” the data are.

The Range corresponds to the biggest value minus the smallest value. It gives you the full spread of the data.

The formula for the range is \[ Range= maximum - minimum \]

# Compute the minimum value
min(my_data$Sepal.Length)
## [1] 4.3
# Compute the maximum value
max(my_data$Sepal.Length)
## [1] 7.9
# Range
range(my_data$Sepal.Length)
## [1] 4.3 7.9
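
Note that range() in R returns the minimum and maximum themselves rather than their difference; to obtain the range as a single number, take the difference of that pair:

# Range as a single number: maximum minus minimum
diff(range(my_data$Sepal.Length))
## [1] 3.6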

The Interquartile Range corresponds to the difference between the third and first quartiles, when the data set is divided into four equal parts (quarters). It is sometimes used as a robust alternative to the standard deviation.

\[ IQR = Q_3 - Q_1 \]

where

  • \(Q_3\) is the positional value at \(\frac{3(n+1)}{4}\)
  • \(Q_1\) is the positional value at \(\frac{(n+1)}{4}\)

when the observations are arranged in ascending order.
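
The positional rule above can be applied by hand, assuming linear interpolation between the two neighbouring ordered observations. R's quantile() implements this particular rule when type = 6; its default (type 7) uses a slightly different formula, so hand computations may not match the default output exactly.

# Q1 by the (n+1)/4 positional rule, with linear interpolation
x   <- sort(my_data$Sepal.Length)
n   <- length(x)                     # n = 150, so the position is 37.75
pos <- (n + 1) / 4
lo  <- floor(pos)
x[lo] + (pos - lo) * (x[lo + 1] - x[lo])

# quantile() with type = 6 uses the same (n+1)p rule
quantile(my_data$Sepal.Length, 0.25, type = 6)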

quantile(x, probs = seq(0, 1, 0.25))
  • x: numeric vector whose sample quantiles are wanted.
  • probs: numeric vector of probabilities with values in [0,1].
quantile(my_data$Sepal.Length)
##   0%  25%  50%  75% 100% 
##  4.3  5.1  5.8  6.4  7.9

By default, the function returns the minimum, the maximum and the three quartiles (the 0.25, 0.50 and 0.75 quantiles).

To compute deciles (0.1, 0.2, 0.3, …, 0.9), use this:

quantile(my_data$Sepal.Length, seq(0, 1, 0.1))
##   0%  10%  20%  30%  40%  50%  60%  70%  80%  90% 100% 
## 4.30 4.80 5.00 5.27 5.60 5.80 6.10 6.30 6.52 6.90 7.90

To compute the interquartile range, type this:

IQR(my_data$Sepal.Length)
## [1] 1.3

Variance and standard deviation

The variance represents the average squared deviation from the mean. The standard deviation is the square root of the variance; roughly speaking, it measures the typical deviation of the values in the data from the mean.

  • Sample Variance

\[ S^2 =\frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2\]

  • \(S^2\) is the sample variance
  • \(X_i\) is the value of the ith observation
  • \(\bar{X}\) is the mean of all observations
  • \(n\) is the number of observations
# Compute the variance
var(my_data$Sepal.Length)
## [1] 0.6856935
# Compute the standard deviation
sd(my_data$Sepal.Length)
## [1] 0.8280661
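
As a quick check, the sample-variance formula above can be applied directly; it should reproduce the values returned by var() and sd():

# Apply the sample-variance formula directly
x <- my_data$Sepal.Length
n <- length(x)
sum((x - mean(x))^2) / (n - 1)          # sample variance
sqrt(sum((x - mean(x))^2) / (n - 1))    # standard deviation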

Median absolute deviation

The median absolute deviation (MAD) measures the deviation of the values in the data from the median value.

\[ MAD = median\{|X_i - median(X)|\} \]

# Compute the median
median(my_data$Sepal.Length)
## [1] 5.8
# Compute the median absolute deviation
mad(my_data$Sepal.Length)
## [1] 1.03782
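
Note that, by default, mad() in R multiplies the raw median absolute deviation by a constant (approximately 1.4826) so that it is a consistent estimator of the standard deviation for normally distributed data. To get the raw value defined by the formula above, set constant = 1:

# Raw MAD, without the default scaling constant of 1.4826
mad(my_data$Sepal.Length, constant = 1)   # 1.03782 / 1.4826 = 0.7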

Which measure to use?

  • Range. It’s not often used because it’s very sensitive to outliers.
  • Interquartile range. It’s pretty robust to outliers. It’s used a lot in combination with the median.
  • Variance. It's hard to interpret directly because it's expressed in squared units rather than the units of the data. It's mostly used as a mathematical tool rather than reported on its own.
  • Standard deviation. This is the square root of the variance. It’s expressed in the same unit as the data. The standard deviation is often used in the situation where the mean is the measure of central tendency.
  • Median absolute deviation. It’s a robust way to estimate the standard deviation, for data with outliers.

Computing an overall summary of all variables

In this case, the function summary() is applied to the whole data frame and automatically summarizes each column. The format of the result depends on the type of data contained in the column. For example:

  • If the column is a numeric variable, mean, median, min, max and quartiles are returned.
  • If the column is a factor variable, the number of observations in each group is returned.
summary(my_data, digits = 1)
##   Sepal.Length  Sepal.Width  Petal.Length  Petal.Width        Species  
##  Min.   :4     Min.   :2    Min.   :1     Min.   :0.1   setosa    :50  
##  1st Qu.:5     1st Qu.:3    1st Qu.:2     1st Qu.:0.3   versicolor:50  
##  Median :6     Median :3    Median :4     Median :1.3   virginica :50  
##  Mean   :6     Mean   :3    Mean   :4     Mean   :1.2                  
##  3rd Qu.:6     3rd Qu.:3    3rd Qu.:5     3rd Qu.:1.8                  
##  Max.   :8     Max.   :4    Max.   :7     Max.   :2.5

sapply() function

It’s also possible to use the function sapply() to apply a particular function over a list or vector. For instance, we can use it, to compute for each column in a data frame, the mean, sd, var, min, quantile, …

# Compute the mean of each column
sapply(my_data[, -5], mean)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##     5.843333     3.057333     3.758000     1.199333
# Compute quartiles
sapply(my_data[, -5], quantile)
##      Sepal.Length Sepal.Width Petal.Length Petal.Width
## 0%            4.3         2.0         1.00         0.1
## 25%           5.1         2.8         1.60         0.3
## 50%           5.8         3.0         4.35         1.3
## 75%           6.4         3.3         5.10         1.8
## 100%          7.9         4.4         6.90         2.5
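
For example, the standard deviation of each numeric column can be computed in the same way (Species, the factor column, is excluded):

# Compute the standard deviation of each numeric column
sapply(my_data[, -5], sd)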

Case of missing values

Note that, when the data contains missing values, some R functions will return errors or NA even if just a single value is missing.

For example, the mean() function will return NA if even a single value is missing from a vector. This can be avoided using the argument na.rm = TRUE, which tells the function to remove any NAs before the calculation. An example using the mean function is as follows:

mean(my_data$Sepal.Length, na.rm = TRUE)
## [1] 5.843333
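
As a small illustration, consider a made-up vector with one missing value:

# A made-up vector containing one missing value
y <- c(2.1, 3.5, NA, 4.0)
mean(y)                  # returns NA because of the missing value
mean(y, na.rm = TRUE)    # drops the NA first and returns 3.2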

Descriptive statistics by groups

To compute summary statistics by groups, the functions group_by() and summarise() [in the dplyr package] can be used.

We want to group the data by Species and then:

  • compute the number of elements in each group. R function: n()
  • compute the mean. R function mean()
  • and the standard deviation. R function sd()

Install dplyr as follows:

install.packages("dplyr")
  • We use %>% to chain operations.
library(dplyr)
group_by(my_data, Species) %>%
  summarise(
    count = n(),
    mean = mean(Sepal.Length, na.rm = TRUE),
    sd = sd(Sepal.Length, na.rm = TRUE)
  )

Frequency Tables

A frequency table (or contingency table) is used to describe categorical variables. It contains the counts at each combination of factor levels.

R function to generate tables: table()

We use data on 110 corn farmers. Note that this is not a real data set.

# Load the data
library(readxl)
Socio <- read_excel("Socio.xlsx")
head(Socio)
Sex <- Socio$Sex
Organic_Pref <- Socio$Organic
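
If the Socio.xlsx file is not available, a rough stand-in can be simulated so that the commands below still run. The column names Sex, Status and Organic are assumed to match the file, and the simulated counts will not reproduce the outputs shown in this handout:

# Optional: simulate a hypothetical stand-in for Socio.xlsx
# (column names are assumed; counts will differ from the outputs shown below)
set.seed(123)
Socio <- data.frame(
  Sex     = sample(c("Female", "Male"), 110, replace = TRUE),
  Status  = sample(c("Married", "Single", "Widow"), 110, replace = TRUE),
  Organic = sample(c("No", "Yes"), 110, replace = TRUE)
)
Sex <- Socio$Sex
Organic_Pref <- Socio$Organic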

Simple frequency distribution: one variable

# Frequency distribution of Sex of Farmers
table(Sex)
## Sex
## Female   Male 
##     61     49
# Frequency distribution of Organic farming preference
table(Organic_Pref)
## Organic_Pref
##  No Yes 
##  65  45

Two-way contingency table:

tbl2 <- table(Sex , Organic_Pref)
tbl2
##         Organic_Pref
## Sex      No Yes
##   Female 38  23
##   Male   27  22

It’s also possible to use the function xtabs(), which will create cross tabulation of data frames with a formula interface.

xtabs(~ Organic_Pref + Sex, data = Socio)
##             Sex
## Organic_Pref Female Male
##          No      38   27
##          Yes     23   22

Multiway tables: More than two categorical variables

# Marital Status and Sex distributions by Organic preference using xtabs()
xtabs(~Status + Sex + Organic_Pref, data = Socio)
## , , Organic_Pref = No
## 
##          Sex
## Status    Female Male
##   Married     18   12
##   Single       6    8
##   Widow       14    7
## 
## , , Organic_Pref = Yes
## 
##          Sex
## Status    Female Male
##   Married      9   11
##   Single       9    0
##   Widow        5   11

You can also use the function ftable() [for flat contingency tables]. It returns a more compact output than xtabs() when you have more than two variables:

ftable(Sex + Status ~ Organic_Pref, data = Socio)
##              Sex     Female                 Male             
##              Status Married Single Widow Married Single Widow
## Organic_Pref                                                 
## No                       18      6    14      12      8     7
## Yes                       9      9     5      11      0    11

Compute table margins and relative frequency

Table margins correspond to the sums of counts along rows or columns of the table. Relative frequencies express table entries as proportions of the row totals, column totals, or grand total.

The functions margin.table() and prop.table() can be used to compute table margins and relative frequencies, respectively.

margin.table(x, margin = NULL)
prop.table(x, margin = NULL)
  • x: table
  • margin: index number (1 for rows and 2 for columns)
Status <- Socio$Status
he.tbl <- table(Sex, Status)
he.tbl
##         Status
## Sex      Married Single Widow
##   Female      27     15    19
##   Male        23      8    18
# Margin of rows
margin.table(he.tbl, 1)
## Sex
## Female   Male 
##     61     49
# Margin of columns
margin.table(he.tbl, 2)
## Status
## Married  Single   Widow 
##      50      23      37
# Frequencies relative to row total
prop.table(he.tbl, 1)
##         Status
## Sex        Married    Single     Widow
##   Female 0.4426230 0.2459016 0.3114754
##   Male   0.4693878 0.1632653 0.3673469
# Table of percentages
round(prop.table(he.tbl, 1), 2)*100
##         Status
## Sex      Married Single Widow
##   Female      44     25    31
##   Male        47     16    37
# Frequencies relative to column total
prop.table(he.tbl, 2)
##         Status
## Sex        Married    Single     Widow
##   Female 0.5400000 0.6521739 0.5135135
##   Male   0.4600000 0.3478261 0.4864865
# Table of percentages
round(prop.table(he.tbl, 2), 2)*100
##         Status
## Sex      Married Single Widow
##   Female      54     65    51
##   Male        46     35    49

To express the frequencies relative to the grand total, use this (prop.table(he.tbl) with no margin argument gives the same result):

he.tbl/sum(he.tbl)
##         Status
## Sex         Married     Single      Widow
##   Female 0.24545455 0.13636364 0.17272727
##   Male   0.20909091 0.07272727 0.16363636
# Table of percentages
round(he.tbl/sum(he.tbl), 2)*100
##         Status
## Sex      Married Single Widow
##   Female      25     14    17
##   Male        21      7    16