At the end of the session, the participants are expected to:
As H.G. Wells put it, “Statistical Thinking will one day be as necessary for efficient citizenship as the ability to read and write.”
With the rise of big data and the growing interest in artificial intelligence, we believe that the day Wells was referring to has already arrived.
We are surrounded by data, and we can take advantage of this by learning the concepts of statistics and computing.
Here, we’ll use the built-in R data set named iris.
You can inspect your data using the functions head() and tail(), which display the first and the last parts of the data, respectively.
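A minimal sketch of loading and inspecting the data; the object name my_data is an assumption carried over from the calls used later in this section:

# Load the built-in iris data set and store a copy as my_data
data("iris")
my_data <- iris

# Display the first and last rows
head(my_data)
tail(my_data)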
Roughly speaking, the central tendency measures the “average” or the “middle” of your data. The most commonly used measures are the mean, the median and the mode.
The mathematical formula for the mean is \[\bar{x}=\frac{1}{n}\sum_{i=1}^{n} X_i \]
while the median is the value at position \(\frac{n+1}{2}\) when \(n\) is odd; when \(n\) is even, it is the average of the values at positions \(\frac{n}{2}\) and \(\frac{n}{2}+1\).
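In R, the mean and the median can be computed directly; a quick sketch, assuming the data are stored in my_data as above:

# Mean and median of Sepal.Length
mean(my_data$Sepal.Length)
median(my_data$Sepal.Length)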
## [1] 5.843333
## [1] 5.8
# Compute the modal value (mode)
# Create the function: the mode is the most frequently occurring value.
getmode <- function(v) {
  uniqv <- unique(v)                             # distinct values
  uniqv[which.max(tabulate(match(v, uniqv)))]    # value with the largest count
}
# Use it in the calculation
getmode(my_data$Sepal.Length)
## [1] 5
Measures of variability describe how “spread out” the data are.
The Range corresponds to the biggest value minus the smallest value. It gives you the full spread of the data.
The formula for the range is \[ \text{Range} = \text{maximum} - \text{minimum} \]
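A sketch of the corresponding R calls, again on Sepal.Length:

min(my_data$Sepal.Length)     # smallest value
max(my_data$Sepal.Length)     # biggest value
range(my_data$Sepal.Length)   # both at once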
## [1] 4.3
## [1] 7.9
## [1] 4.3 7.9
The Interquartile Range corresponds to the difference between the first and third quartiles, when the data set is divided into four equal parts (quarters). It is sometimes used as a robust alternative to the standard deviation.
\[ IQR = Q_3 - Q_1 \]
where \(Q_1\) and \(Q_3\) are the first and third quartiles, obtained when the observations are arranged in ascending order.
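The quartiles can be obtained with quantile(); a minimal sketch:

quantile(my_data$Sepal.Length)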
## 0% 25% 50% 75% 100%
## 4.3 5.1 5.8 6.4 7.9
By default, the function returns the minimum, the maximum and the three quartiles (the 0.25, 0.50 and 0.75 quantiles).
To compute deciles (0.1, 0.2, 0.3, …, 0.9), use this:
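One way to do this is with the probs argument of quantile():

quantile(my_data$Sepal.Length, probs = seq(0, 1, 0.1))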
## 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
## 4.30 4.80 5.00 5.27 5.60 5.80 6.10 6.30 6.52 6.90 7.90
To compute the interquartile range, type this:
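For example, using the base R function IQR():

IQR(my_data$Sepal.Length)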
## [1] 1.3
Variance and standard deviation
The variance represents the average squared deviation from the mean. The standard deviation is the square root of the variance; it measures how far, on average, the values in the data lie from the mean.
\[ S^2 =\frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2\]
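In R, var() and sd() compute the variance and the standard deviation \( S = \sqrt{S^2} \); a quick sketch:

var(my_data$Sepal.Length)   # variance
sd(my_data$Sepal.Length)    # standard deviation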
## [1] 0.6856935
## [1] 0.8280661
Median absolute deviation
The median absolute deviation (MAD) measures the deviation of the values in the data from the median value.
\[ MAD = median\{|X_i - median(X)|\} \]
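In R, these are median() and mad(); note that mad() multiplies the raw median absolute deviation by a consistency constant (1.4826 by default), which is why the value below is larger than the unscaled formula would give. A sketch:

median(my_data$Sepal.Length)
mad(my_data$Sepal.Length)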
## [1] 5.8
## [1] 1.03782
Which measure to use?
The function summary() can be applied to an entire data frame; it is automatically applied to each column, and the format of the result depends on the type of the data contained in the column. For example:
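A sketch; the reduced number of displayed digits in the output below suggests that the digits argument was lowered, for example:

summary(my_data, digits = 1)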
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## Min. :4 Min. :2 Min. :1 Min. :0.1 setosa :50
## 1st Qu.:5 1st Qu.:3 1st Qu.:2 1st Qu.:0.3 versicolor:50
## Median :6 Median :3 Median :4 Median :1.3 virginica :50
## Mean :6 Mean :3 Mean :4 Mean :1.2
## 3rd Qu.:6 3rd Qu.:3 3rd Qu.:5 3rd Qu.:1.8
## Max. :8 Max. :4 Max. :7 Max. :2.5
sapply() function
It’s also possible to use the function sapply() to apply a particular function over a list or vector. For instance, we can use it to compute, for each numeric column of a data frame, the mean, sd, var, min, quantile, and so on.
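A sketch reproducing the outputs below; the Species factor column (column 5) is dropped before applying the numeric summaries:

# Column means of the four numeric variables
sapply(my_data[, -5], mean)

# Quartiles of each numeric variable
sapply(my_data[, -5], quantile)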
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 5.843333 3.057333 3.758000 1.199333
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 0% 4.3 2.0 1.00 0.1
## 25% 5.1 2.8 1.60 0.3
## 50% 5.8 3.0 4.35 1.3
## 75% 6.4 3.3 5.10 1.8
## 100% 7.9 4.4 6.90 2.5
Case of missing values
Note that, when the data contains missing values, some R functions will return errors or NA even if just a single value is missing.
For example, the mean() function will return NA if even only one value is missing in a vector. This can be avoided using the argument na.rm = TRUE, which tells the function to remove any NAs before the calculation. An example using the mean function is as follows:
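A sketch, using a hypothetical vector with one missing value appended:

x <- c(my_data$Sepal.Length, NA)   # hypothetical vector containing an NA
# mean(x) would return NA; removing the NA first gives the usual mean
mean(x, na.rm = TRUE)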
## [1] 5.843333
To compute summary statistics by groups, the functions group_by() and summarise() [in the dplyr package] can be used.
We want to group the data by Species and then, within each group, compute the number of observations (count) and the mean and standard deviation (sd) of Sepal.Length.
If needed, install dplyr with install.packages("dplyr"), then load it and compute the grouped summaries as follows:
library(dplyr)
group_by(my_data, Species) %>%
  summarise(
    count = n(),
    mean = mean(Sepal.Length, na.rm = TRUE),
    sd = sd(Sepal.Length, na.rm = TRUE)
  )

A frequency table (or contingency table) is used to describe categorical variables. It contains the counts at each combination of factor levels.
R function to generate tables: table()
We use data from 110 corn farmers. Note that these are not real data.
# Load the data
library(readxl)
Socio <- read_excel("C:/Users/acer/Dropbox/ROEL/R Training/Socio.xlsx")
head(Socio)
## Sex
## Female Male
## 61 49
## Organic_Pref
## No Yes
## 65 45
## Organic_Pref
## Sex No Yes
## Female 38 23
## Male 27 22
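The one-way and two-way counts above can be reproduced with table(); a sketch, assuming the columns Sex and Organic_Pref shown by head(Socio):

with(Socio, table(Sex))                   # distribution of Sex
with(Socio, table(Organic_Pref))          # distribution of organic preference
with(Socio, table(Sex, Organic_Pref))     # two-way table: Sex by Organic_Pref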
It’s also possible to use the function xtabs(), which creates a cross-tabulation of data frames with a formula interface.
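A sketch matching the orientation of the output below:

xtabs(~ Organic_Pref + Sex, data = Socio)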
## Sex
## Organic_Pref Female Male
## No 38 27
## Yes 23 22
# Marital Status and Sex distributions by Organic preference using xtabs()
xtabs(~Status + Sex + Organic_Pref, data = Socio)
## , , Organic_Pref = No
##
## Sex
## Status Female Male
## Married 18 12
## Single 6 8
## Widow 14 7
##
## , , Organic_Pref = Yes
##
## Sex
## Status Female Male
## Married 9 11
## Single 9 0
## Widow 5 11
You can also use the function ftable() [for flat contingency tables]. It produces a more compact output than xtabs() when you have more than two variables:
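A sketch using the formula interface of ftable(), where the left-hand side gives the column variables and the right-hand side the row variable, which matches the layout shown below:

ftable(Sex + Status ~ Organic_Pref, data = Socio)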
## Sex Female Male
## Status Married Single Widow Married Single Widow
## Organic_Pref
## No 18 6 14 12 8 7
## Yes 9 9 5 11 0 11
Table margins correspond to the sums of counts along rows or columns of the table. Relative frequencies express table entries as proportions of table margins (i.e., row or column totals).
The functions margin.table() and prop.table() can be used to compute table margins and relative frequencies, respectively.
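A sketch reproducing the sequence of outputs below; the object name tbl is an assumption, since the original calls are not shown:

# Two-way contingency table of Sex by Status
tbl <- with(Socio, table(Sex, Status))
tbl

# Table margins
margin.table(tbl, 1)    # totals by Sex (rows)
margin.table(tbl, 2)    # totals by Status (columns)

# Relative frequencies
prop.table(tbl, 1)                    # row proportions
round(prop.table(tbl, 1), 2) * 100    # row percentages
prop.table(tbl, 2)                    # column proportions
round(prop.table(tbl, 2), 2) * 100    # column percentages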
## Status
## Sex Married Single Widow
## Female 27 15 19
## Male 23 8 18
## Sex
## Female Male
## 61 49
## Status
## Married Single Widow
## 50 23 37
## Status
## Sex Married Single Widow
## Female 0.4426230 0.2459016 0.3114754
## Male 0.4693878 0.1632653 0.3673469
## Status
## Sex Married Single Widow
## Female 44 25 31
## Male 47 16 37
## Status
## Sex Married Single Widow
## Female 0.5400000 0.6521739 0.5135135
## Male 0.4600000 0.3478261 0.4864865
## Status
## Sex Married Single Widow
## Female 54 65 51
## Male 46 35 49
To express the frequencies relative to the grand total, use this:
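For example, continuing with the same tbl object:

prop.table(tbl)                    # proportions of the grand total
round(prop.table(tbl), 2) * 100    # percentages of the grand total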
## Status
## Sex Married Single Widow
## Female 0.24545455 0.13636364 0.17272727
## Male 0.20909091 0.07272727 0.16363636
## Status
## Sex Married Single Widow
## Female 25 14 17
## Male 21 7 16