This tutorial covers descriptive statistics. Descriptive statistics summarize and organize the characteristics of a dataset, providing insight into its distribution, central tendency, and variability. R offers built-in functions and packages for computing them efficiently. In this tutorial, we explore several ways to compute descriptive statistics in R.

1. Getting Started with R

Before we begin, ensure that R is installed on your system. You can also use RStudio for an enhanced coding experience.

To start, install the required packages if they are not already installed:

install.packages("dplyr")

install.packages("psych")
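To avoid reinstalling packages that are already present, one option is to check for them first. The following is only a sketch: the helper name missing_packages is hypothetical and not part of the original tutorial, and the install call is left commented out so that nothing is downloaded by accident.

```r
# Hypothetical helper: return the names of packages that are not installed.
missing_packages <- function(pkgs) {
  # requireNamespace() returns FALSE (without an error) if a package is absent
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Packages used in this tutorial (moments is needed for skewness and kurtosis)
to_install <- missing_packages(c("dplyr", "psych", "moments"))
to_install

# Uncomment to install only what is missing:
# if (length(to_install) > 0) install.packages(to_install)
```

An empty result means every listed package is already available.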

Load the necessary packages

library(dplyr)
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(psych)

2. Creating a Sample Dataset

You can create a sample dataset using R’s built-in data.frame() function and then print it to view its contents:

data <- data.frame(
  ID = 1:10,
  Age = c(25, 30, 35, 40, 29, 33, 31, 27, 26, 32),
  Salary = c(40000, 50000, 60000, 70000, 45000, 55000, 52000, 47000, 43000, 51000)
)

print(data)
##    ID Age Salary
## 1   1  25  40000
## 2   2  30  50000
## 3   3  35  60000
## 4   4  40  70000
## 5   5  29  45000
## 6   6  33  55000
## 7   7  31  52000
## 8   8  27  47000
## 9   9  26  43000
## 10 10  32  51000

3. Measures of Central Tendency

Central tendency describes the center point of a dataset. Common measures include:

Mean (Average)

mean_age <- mean(data$Age)

mean_salary <- mean(data$Salary)

Median (Middle Value)

median_age <- median(data$Age)

median_salary <- median(data$Salary)

Mode (Most Frequent Value)

mode_function <- function(x) {
  uniq_x <- unique(x)
  uniq_x[which.max(tabulate(match(x, uniq_x)))]
}

mode_age <- mode_function(data$Age)

Display the results. Because every age in this sample occurs exactly once, mode_function() falls back to the first value (25); the mode is only informative when values repeat.

mean_age; median_age; mode_age
## [1] 30.8
## [1] 30.5
## [1] 25
mean_salary; median_salary
## [1] 51300
## [1] 50500
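When several values share the highest frequency, a function that returns only one of them can mislead. As a minimal sketch (the helper name all_modes is hypothetical, not part of the original tutorial), the same tabulate() approach can be adapted to return every tied value:

```r
# Hypothetical helper: return every value that attains the highest frequency.
all_modes <- function(x) {
  uniq_x <- unique(x)
  counts <- tabulate(match(x, uniq_x))  # frequency of each unique value
  uniq_x[counts == max(counts)]         # keep all values tied for the maximum
}

ages <- c(25, 30, 35, 40, 29, 33, 31, 27, 26, 32)

all_modes(ages)              # every age occurs once, so all ten values come back
all_modes(c(1, 2, 2, 3, 3))  # 2 and 3 tie for most frequent
```

If the result is as long as the input, the data have no meaningful mode.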

4. Measures of Dispersion

Dispersion indicates how data values spread around the central value.

Standard Deviation

sd_age <- sd(data$Age)

sd_salary <- sd(data$Salary)

Variance

var_age <- var(data$Age)

var_salary <- var(data$Salary)

Range

range_age <- range(data$Age)

range_salary <- range(data$Salary)

Interquartile Range (IQR)

iqr_age <- IQR(data$Age)

iqr_salary <- IQR(data$Salary)

Display the results for age, then for salary:

sd_age; var_age; range_age; iqr_age
## [1] 4.516636
## [1] 20.4
## [1] 25 40
## [1] 5.25
sd_salary; var_salary; range_salary; iqr_salary
## [1] 8794.569
## [1] 77344444
## [1] 40000 70000
## [1] 8750
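To see what var() and sd() compute, the sample variance can be reproduced by hand: sum the squared deviations from the mean and divide by n - 1. The sketch below repeats the Age values from the sample dataset so that it runs on its own; the variable names are illustrative.

```r
ages <- c(25, 30, 35, 40, 29, 33, 31, 27, 26, 32)

n <- length(ages)
deviations <- ages - mean(ages)            # distance of each value from the mean
manual_var <- sum(deviations^2) / (n - 1)  # sample variance (n - 1 denominator)
manual_sd <- sqrt(manual_var)              # standard deviation is its square root

manual_var                        # 20.4, matching var_age
all.equal(manual_var, var(ages))  # TRUE
all.equal(manual_sd, sd(ages))    # TRUE

# range() returns the minimum and maximum; diff() turns them into one number
diff(range(ages))                 # 15
```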

5. Summary Statistics Using Built-in Functions

R provides built-in functions for quick summaries. The summary() function reports the minimum, quartiles, median, mean, and maximum of each column:

summary(data)
##        ID             Age            Salary     
##  Min.   : 1.00   Min.   :25.00   Min.   :40000  
##  1st Qu.: 3.25   1st Qu.:27.50   1st Qu.:45500  
##  Median : 5.50   Median :30.50   Median :50500  
##  Mean   : 5.50   Mean   :30.80   Mean   :51300  
##  3rd Qu.: 7.75   3rd Qu.:32.75   3rd Qu.:54250  
##  Max.   :10.00   Max.   :40.00   Max.   :70000
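The quartiles reported by summary() come from quantile(), which by default interpolates linearly between sorted values (R's "type 7" rule). The following sketch reproduces the Age quartiles and checks the first one by hand; it repeats the Age values so that it is self-contained.

```r
ages <- c(25, 30, 35, 40, 29, 33, 31, 27, 26, 32)

q <- quantile(ages, probs = c(0.25, 0.75))
q                # 27.5 and 32.75, as in the summary() output
unname(diff(q))  # 5.25, the same value IQR(ages) returns

# Manual check of the first quartile under the default (type 7) rule
sorted <- sort(ages)
pos <- 1 + 0.25 * (length(sorted) - 1)  # fractional position 3.25
lower <- sorted[floor(pos)]             # 3rd smallest value
upper <- sorted[ceiling(pos)]           # 4th smallest value
q1_manual <- lower + (pos - floor(pos)) * (upper - lower)
q1_manual                               # 27.5
```

Other software may use a different quantile rule, so small differences in quartiles between tools are expected.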

6. Using psych Package for Enhanced Descriptive Statistics

The describe() function from the psych package provides detailed statistics:

describe(data)
##        vars  n    mean      sd  median  trimmed     mad   min   max range skew
## ID        1 10     5.5    3.03     5.5     5.50    3.71     1    10     9 0.00
## Age       2 10    30.8    4.52    30.5    30.38    4.45    25    40    15 0.54
## Salary    3 10 51300.0 8794.57 50500.0 50375.00 7413.00 40000 70000 30000 0.72
##        kurtosis      se
## ID        -1.56    0.96
## Age       -0.75    1.43
## Salary    -0.47 2781.09
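Several of the describe() columns can be reproduced with base R, which helps clarify what they mean. The sketch below assumes describe() was called with its default settings (in particular a 10% trimmed mean) and recomputes the se, trimmed, and mad values for Age.

```r
ages <- c(25, 30, 35, 40, 29, 33, 31, 27, 26, 32)

se_age <- sd(ages) / sqrt(length(ages))  # standard error of the mean
trimmed_age <- mean(ages, trim = 0.1)    # mean after dropping 10% from each end
mad_age <- mad(ages)                     # median absolute deviation (scaled by 1.4826)

# Compare with the Age row of the describe() output
round(c(se = se_age, trimmed = trimmed_age, mad = mad_age), 2)
```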

7. Visualizing Data

Visualizations help you see how the data are distributed.

Histogram

hist(data$Age, main="Age Distribution", xlab="Age", col="blue", breaks=5)
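Note that a single number passed to breaks is only a suggestion: R chooses "pretty" bin edges close to that count. As a sketch, calling hist() with plot = FALSE shows the bins that were actually used, and passing a vector of edges forces specific bins. The edge values below are an arbitrary choice for illustration.

```r
ages <- c(25, 30, 35, 40, 29, 33, 31, 27, 26, 32)

# plot = FALSE returns the binning without drawing anything
h <- hist(ages, breaks = 5, plot = FALSE)
h$breaks   # bin edges R actually chose
h$counts   # number of observations in each bin

# To force specific bins, pass the edges explicitly (here: width-3 bins)
h_fixed <- hist(ages, breaks = seq(25, 40, by = 3), plot = FALSE)
h_fixed$counts
```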

Boxplot

boxplot(data$Salary, main="Salary Distribution", ylab="Salary", col="red")

Descriptive statistics provide essential insights into datasets before conducting further analysis. R offers various functions and packages to compute these statistics efficiently. By understanding central tendency, dispersion, and visualization techniques, you can effectively summarize and interpret data.

8. Descriptive Statistics on the mtcars Dataset

Now, let us apply the same descriptive statistics to the built-in mtcars dataset.

Load the built-in dataset and assign it to a variable:

data("mtcars")

df <- mtcars

9. Overview of the Dataset

To understand the dataset structure, start by viewing the first few rows:

head(df)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Check the structure

str(df)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

Summary statistics

summary(df)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000

10. Measures of Central Tendency

These include the mean, median, and mode.

Mean of miles per gallon

mean(df$mpg)   
## [1] 20.09062

Median of miles per gallon

median(df$mpg) 
## [1] 19.2

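Mode (custom function: R's built-in mode() returns an object's storage mode, not the most frequent value). Several mpg values tie for the highest frequency, and this function returns the first of them.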

mode_func <- function(x) {
  uniq_vals <- unique(x)
  uniq_vals[which.max(tabulate(match(x, uniq_vals)))]
}

mode_func(df$mpg)
## [1] 21

11. Measures of Dispersion

These include the range, variance, standard deviation, and interquartile range (IQR).

Range

range(df$mpg)
## [1] 10.4 33.9

Variance

var(df$mpg)
## [1] 36.3241

Standard deviation

sd(df$mpg)
## [1] 6.026948

Interquartile Range (IQR)

IQR(df$mpg)
## [1] 7.375

12. Skewness and Kurtosis

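Skewness measures the asymmetry of a distribution, and kurtosis measures its “tailedness”. The kurtosis() function in the moments package reports Pearson kurtosis, for which a normal distribution scores 3, whereas describe() from psych reports excess kurtosis (Pearson kurtosis minus 3).

Load the moments package (install it first with install.packages("moments") if needed)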

library(moments)

Skewness

skewness(df$mpg)
## [1] 0.6404399

Kurtosis

kurtosis(df$mpg)
## [1] 2.799467
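Both statistics are ratios of central moments, so they can be checked without any package. The sketch below uses the moment-based definitions, with each moment divided by n rather than n - 1; the variable names are illustrative.

```r
x <- mtcars$mpg
n <- length(x)
d <- x - mean(x)     # deviations from the mean

m2 <- sum(d^2) / n   # second central moment (divides by n, not n - 1)
m3 <- sum(d^3) / n   # third central moment
m4 <- sum(d^4) / n   # fourth central moment

skew_manual <- m3 / m2^1.5      # positive: longer right tail
kurt_manual <- m4 / m2^2        # Pearson kurtosis; a normal distribution gives 3
excess_kurt <- kurt_manual - 3  # excess kurtosis; a normal distribution gives 0

c(skewness = skew_manual, kurtosis = kurt_manual, excess = excess_kurt)
```

A positive skewness for mpg indicates a longer right tail, and a Pearson kurtosis slightly below 3 indicates tails a little lighter than those of a normal distribution.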

13. Frequency Distribution (Categorical Data)

For categorical data, we use tables and proportions.

Convert a numerical variable into a categorical one

df$cyl_factor <- as.factor(df$cyl)

Frequency table

table(df$cyl_factor)
## 
##  4  6  8 
## 11  7 14

Proportion table

prop.table(table(df$cyl_factor))
## 
##       4       6       8 
## 0.34375 0.21875 0.43750
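Counts and proportions are often easier to read side by side. As a sketch, they can be combined into one small data frame; the column names n and percent are arbitrary choices for illustration.

```r
counts <- table(mtcars$cyl)   # frequency of each cylinder count

freq <- data.frame(
  cyl = names(counts),                                     # category labels
  n = as.vector(counts),                                   # absolute frequencies
  percent = round(100 * as.vector(prop.table(counts)), 1)  # relative frequencies in %
)
freq
```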

14. Visualizing Data

Histograms

hist(df$mpg, main="Histogram of MPG", xlab="Miles per Gallon", col="blue")

Boxplot

boxplot(df$mpg, main="Boxplot of MPG", ylab="Miles per Gallon", col="red")

Bar Chart (for categorical data)

barplot(table(df$cyl_factor), main="Cylinder Count", col="green")

15. Summary Statistics for Multiple Columns

Summary statistics for all columns (the cyl_factor column added above is summarized as counts per level)

summary(df)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb       cyl_factor
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000   4:11      
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000   6: 7      
##  Median :0.0000   Median :4.000   Median :2.000   8:14      
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812             
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000             
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000

Apply a function across multiple columns, for example the mean of every numeric column:

sapply(df[, sapply(df, is.numeric)], mean)
##        mpg        cyl       disp         hp       drat         wt       qsec 
##  20.090625   6.187500 230.721875 146.687500   3.596563   3.217250  17.848750 
##         vs         am       gear       carb 
##   0.437500   0.406250   3.687500   2.812500
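The same pattern extends to several statistics at once: if the function passed to sapply() returns a named vector, the result is a matrix with one column per variable. The following is a minimal sketch using mtcars directly; the choice of statistics is only an example.

```r
# Keep only the numeric columns
num_cols <- mtcars[, sapply(mtcars, is.numeric)]

# Each call returns a named vector, so sapply() builds a 4 x 11 matrix
stats <- sapply(num_cols, function(x) {
  c(mean = mean(x), median = median(x), sd = sd(x), IQR = IQR(x))
})

round(t(stats), 2)   # transpose so that each variable is a row
```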

Conclusion

This tutorial introduced key descriptive statistics in R, including measures of central tendency, dispersion, skewness, and visualizations. These techniques help summarize and understand data effectively.