Descriptive Statistics in R

Author: Renato A. Folledo, Jr.

Affiliation: Isabela State University

Introduction

Descriptive statistics help researchers describe the central tendency (mean, median, mode), the spread or dispersion (range, variance, and standard deviation), and the shape of the distribution of data. They also involve making graphs and summary tables to help visualize and understand the data. The main objective of descriptive statistics is to summarize and describe the characteristics of a dataset, providing an overview of the data and helping to identify patterns and relationships between variables. They provide a useful starting point in analyzing data because they help to identify outliers and to choose the appropriate statistical technique for inferential analysis.

Objectives

At the end of this exercise, the students should be able to:

Use R in descriptive statistics computations.
Check for any irregularity in the data from the descriptive statistics outputs, such as typographical errors during data entry, the presence of outliers, etc.
Generate summary tables and graphs to better understand the data.
Describe the usefulness of the different descriptive statistics outputs.

Expected outputs

The sample R code in this exercise examines the relationship between the Gender and Height of participants. You then need to replace Height with Weight in the same code to generate the descriptive summary statistics, box plot, and summary tables. Aside from deriving the descriptive statistics for your data, you also need to answer the guide questions and incorporate your answers in your report.

A. Measures of central tendency and dispersion

1. Central tendency

Measures of central tendency include the mean, median, and mode. The mean is applicable only to quantitative data. The median is applicable to ordinal and quantitative data, but not to nominal data. The mode can be computed for any type of data, but it is most useful for qualitative data, and it is the only one of the three measures that applies to nominal data.

This activity is an exercise on descriptive statistical analysis of the heights of male and female participants, using the Gender and Height variables from the student data. Download your “Stat 2024/QUIZ STAT A2.xlsx” file.

# open the Excel file using the read_xlsx function from the readxl package
dt <- readxl::read_xlsx("C:/jun/FirstSem24-25/Stat 2024/QUIZ STAT A2.xlsx")

# convert to data.table
dt <- data.table::data.table(dt)
class(dt)

[1] "data.table" "data.frame"

Copy the variables Gender and Height to a new table named HtGender

# We copy all the rows, denoted by `dt[,` 
# which has no (blank) parameter before the comma `,`
# whereas if we use `dt[Gender == "Male"`, 
# only the male participants are copied.

# The parameter `list(Gender, Height)` says that only column 
# variables Height and Gender are copied to `HtGender` table
HtGender <- dt[, list(Gender, Height)]

## display a summary of all the variables in HtGender
summary(HtGender)

    Gender              Height     
 Length:39          Min.   :143.0  
 Class :character   1st Qu.:154.4  
 Mode  :character   Median :160.4  
                    Mean   :162.2  
                    3rd Qu.:169.6  
                    Max.   :186.0

a. Mean

The mean function in R computes the mean, or average, of a vector. It can be run on its own or within a data.table object, as illustrated in the code below.

# display the Height variable as a vector; the dollar symbol uses
# the `data.frame` functionality to extract the Height variable
HtGender$Height

 [1] 150.40 161.02 158.48 166.48 152.56 154.40 160.40 173.10 175.64 180.34
[11] 178.26 168.02 143.40 162.10 157.10 169.10 172.10 157.64 174.80 155.48
[21] 157.48 169.18 169.02 151.94 156.00 159.00 166.00 170.00 150.48 168.00
[31] 173.00 146.40 154.02 154.48 148.48 186.00 160.00 143.00 174.00

# compute the mean of the extracted vector and display it right away
mean(HtGender$Height)

[1] 162.2385

# equivalent to HtGender$Height, but this time it uses
# the `data.table` syntax to extract the Height vector
mean(HtGender[, Height])

[1] 162.2385

# use data.table functionality to compute the mean
HtGender[, mean(Height)]  # mean height of all participants

[1] 162.2385

# It is easier to extract selected rows in a data.table
HtGender[Gender=="Male", mean(Height)]  # mean height of Male

[1] 163.4911

HtGender[Gender=="Female", mean(Height)]  # mean height of Female

[1] 161.1648

# use data.table functionality to 
# create a summary table of mean height of participants
# grouped by male and female
HtGender[,list(mean=mean(Height)), by = "Gender"]

   Gender     mean
   <char>    <num>
1: Female 161.1648
2:   Male 163.4911
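
As a check on what mean() does, the average can be reproduced by summing the values and dividing by the number of observations. The sketch below uses base R only and a small hypothetical vector named ht (an assumption for illustration, not the class data), so it runs without the Excel file.

```r
# hypothetical heights in cm (illustration only, not the class data)
ht <- c(150, 160, 155, 170, 165)

# mean by definition: sum of the values divided by the number of values
manual_mean <- sum(ht) / length(ht)

manual_mean   # 160
mean(ht)      # the built-in function gives the same result
```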

b. Median

# compute median
# display the median right away
median(HtGender[,Height] )

[1] 160.4

# use data.table functionality
HtGender[,median(Height)]                  # median height of all participants

[1] 160.4

HtGender[Gender=="Male",  median(Height)]  # median height of Male

[1] 161.56

HtGender[Gender=="Female",median(Height)]  # median height of Female

[1] 158.48

# summary table of median height grouped by male and female
HtGender[,list(median = median(Height)), by = "Gender"]

   Gender median
   <char>  <num>
1: Female 158.48
2:   Male 161.56

# summary table of mean and median heights by gender
HtGender[,list(mean=mean(Height), median = median(Height)), by = "Gender"]

   Gender     mean median
   <char>    <num>  <num>
1: Female 161.1648 158.48
2:   Male 163.4911 161.56

# create a summary table of mean and median heights, 
# including the number of observations per gender
HtGender[,list(mean=mean(Height), 
         median = median(Height), 
         n = .N), by = "Gender"]

   Gender     mean median     n
   <char>    <num>  <num> <int>
1: Female 161.1648 158.48    21
2:   Male 163.4911 161.56    18
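
Comparing the mean and the median is also a quick way to spot data-entry errors. The following is a minimal sketch in base R, assuming a hypothetical vector ht and a made-up typo (170 entered as 1700); neither comes from the class data.

```r
# hypothetical heights in cm (illustration only, not the class data)
ht <- c(150, 160, 155, 170, 165)

sort(ht)      # 150 155 160 165 170
median(ht)    # middle value of the sorted data: 160

# simulate a data-entry error: 170 typed as 1700
ht_typo <- c(150, 160, 155, 1700, 165)

mean(ht_typo)     # 466, pulled far upward by the single bad value
median(ht_typo)   # still 160
```

A mean that sits far from the median is therefore a hint to inspect the minimum and maximum values for typing errors.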

Guide Question # 1

What are the mean and median heights of females, males, and the whole sample (males and females combined)?

c. Mode

The mode is most meaningful for qualitative data. There is no built-in function in R that computes the statistical mode, so we use two user-defined functions that I downloaded, Mode and Modes. Mode returns only the first mode of the data. If there are multiple modes, we need to use the Modes function instead.

# define function Mode
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# define the 2nd function Modes that can get multiple modes if any
Modes <- function(x) {
  ux <- unique(x)
  tab <- tabulate(match(x, ux))
  ux[tab == max(tab)]
}

# display the mode of qualitative data
Mode(HtGender[,Gender])

[1] "Female"

Modes(HtGender[,Gender])

[1] "Female"

# attempt to compute the mode of quantitative data
Modes(HtGender[,Height])

 [1] 150.40 161.02 158.48 166.48 152.56 154.40 160.40 173.10 175.64 180.34
[11] 178.26 168.02 143.40 162.10 157.10 169.10 172.10 157.64 174.80 155.48
[21] 157.48 169.18 169.02 151.94 156.00 159.00 166.00 170.00 150.48 168.00
[31] 173.00 146.40 154.02 154.48 148.48 186.00 160.00 143.00 174.00
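
The difference between Mode and Modes only shows up when there is a tie. The sketch below repeats the two function definitions so that it runs on its own, and applies them to a hypothetical vector named blood (an assumption for illustration, not a variable in the class data).

```r
# same user-defined functions as above, repeated so this sketch runs on its own
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
Modes <- function(x) {
  ux <- unique(x)
  tab <- tabulate(match(x, ux))
  ux[tab == max(tab)]
}

# hypothetical qualitative data with a tie: "A" and "B" each occur twice
blood <- c("A", "B", "A", "O", "B")

Mode(blood)    # returns only the first mode: "A"
Modes(blood)   # returns both modes: "A" "B"
table(blood)   # frequency count of each category
```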

Guide Question # 2

What is the mode of the Gender variable? Why is it that the mode of the Height variable is not meaningful?

B. Measures of spread/dispersion

1. minimum, maximum, and range

# use built-in functions and display results immediately
min(HtGender[, Height]) # minimum

[1] 143

max(HtGender[, Height]) # maximum

[1] 186

max(HtGender[, Height]) - min(HtGender[, Height]) # range

[1] 43

# create summary table using data.table capability
HtGender[, list(minHt=min(Height),
                maxHT=max(Height),
                rangeHt = max(Height) - min(Height))]

   minHt maxHT rangeHt
   <num> <num>   <num>
1:   143   186      43

# separate by gender
HtGender[, list(minHt=min(Height),
                maxHT=max(Height),
                rangeHt = max(Height) - min(Height)),
         by="Gender"]

   Gender minHt maxHT rangeHt
   <char> <num> <num>   <num>
1: Female 143.4   174    30.6
2:   Male 143.0   186    43.0

2. Quartiles

Compute quartile 1 (Q1), the median (quartile 2, Q2), quartile 3 (Q3), and the interquartile range (IQR).

Quartile 1: 25% of the observations are below it, while 75% are above it.
Quartile 2: this is also the median; 50% of the observations are below it, while 50% are above it.
Quartile 3: 75% of the observations are below it, while 25% are above it.
IQR = Q3 - Q1
IQR minimum = Q1 - 1.5(IQR)
IQR maximum = Q3 + 1.5(IQR)

The IQR minimum and maximum are useful in detecting outliers. Any observation with a value above the IQR maximum or below the IQR minimum is called an outlier; outliers are shown as asterisks (*) in Figure 1.

Figure 1. Location of quartile values and IQR in a box plot

Use the code below to compute the different quartile values.

# display output right away
quantile(HtGender[, Height], 0.25)  # quartile 1

   25% 
154.44

quantile(HtGender[, Height], 0.5)   # median is also quartile 2

  50% 
160.4

quantile(HtGender[, Height], 0.75)  # quartile 3

   75% 
169.59

Instead of computing the quartiles individually, we can combine them with the other statistics in a single summary table.

# create a summary table for all the observations
HtGender[, list(minHt=min(Height),
                maxHT=max(Height),
                rangeHt = max(Height) - min(Height),
                Q1=quantile(Height, 0.25),
                median=median(Height),
                Q3 = quantile(Height, 0.75))]

   minHt maxHT rangeHt     Q1 median     Q3
   <num> <num>   <num>  <num>  <num>  <num>
1:   143   186      43 154.44  160.4 169.59

# create a summary table by gender
# note that we added `by="Gender"` parameter
HtGender[, list(minHt=min(Height),
                maxHT=max(Height),
                rangeHt = max(Height) - min(Height),
                Q1=quantile(Height, 0.25),
                median=median(Height),
                Q3 = quantile(Height, 0.75)),
         by="Gender"]

   Gender minHt maxHT rangeHt     Q1 median     Q3
   <char> <num> <num>   <num>  <num>  <num>  <num>
1: Female 143.4   174    30.6 154.48 158.48 168.02
2:   Male 143.0   186    43.0 153.23 161.56 173.60

Compute the interquartile range (IQR)

# compute the interquartile range of all the observations
IQR(HtGender[,Height])

[1] 15.15

# include IQR on the summary table
HtGender[, list(minHt=min(Height),
                maxHT=max(Height),
                rangeHt = max(Height) - min(Height),
                Q1=quantile(Height, 0.25),
                median=median(Height),
                Q3 = quantile(Height, 0.75),
                IQR = IQR(Height),
                minIQR = quantile(Height, 0.25) - 1.5*IQR(Height),
                maxIQR = quantile(Height, 0.75) + 1.5*IQR(Height)),
         by="Gender"]

   Gender minHt maxHT rangeHt     Q1 median     Q3   IQR  minIQR  maxIQR
   <char> <num> <num>   <num>  <num>  <num>  <num> <num>   <num>   <num>
1: Female 143.4   174    30.6 154.48 158.48 168.02 13.54 134.170 188.330
2:   Male 143.0   186    43.0 153.23 161.56 173.60 20.37 122.675 204.155
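
The two IQR limits can also be used to list the outlying observations directly instead of scanning the data by eye. This is a minimal sketch in base R, assuming a hypothetical vector ht with one deliberately extreme value; the names q1, q3, minIQR, and maxIQR are chosen here only for illustration.

```r
# hypothetical heights in cm with one extreme value (not the class data)
ht <- c(150, 152, 155, 158, 160, 162, 165, 168, 170, 210)

q1  <- quantile(ht, 0.25, names = FALSE)   # quartile 1
q3  <- quantile(ht, 0.75, names = FALSE)   # quartile 3
iqr <- IQR(ht)                             # Q3 - Q1

minIQR <- q1 - 1.5 * iqr   # lower limit
maxIQR <- q3 + 1.5 * iqr   # upper limit

# keep only the observations outside the two limits
ht[ht < minIQR | ht > maxIQR]
```

For this made-up vector only the value 210 falls outside the limits.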

Guide Question # 3

Did you see any observation whose Height is less than minIQR or greater than maxIQR?

What do we call the observations that are less than minimum IQR or greater than maximum IQR?

3. Variance and standard deviation

Variance is a numerical measure of how the data values are dispersed around the mean; it is computed from the squared deviations of the values from the mean. Standard deviation is simply the square root of the variance. Unlike variance, which is expressed in squared units, standard deviation has the same unit of measure as the variable being studied, e.g. if the height variable is in cm, the standard deviation is also in cm.
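
To make the relationship between the two measures concrete, the sketch below computes the sample variance by hand, dividing the sum of squared deviations by n - 1 as R's var function does, and then takes its square root. The vector ht is hypothetical (not the class data), and the names manual_var and manual_sd are used only for illustration.

```r
# hypothetical heights in cm (illustration only, not the class data)
ht <- c(150, 160, 155, 170, 165)
n  <- length(ht)

# sample variance: sum of squared deviations from the mean, divided by n - 1
manual_var <- sum((ht - mean(ht))^2) / (n - 1)
manual_sd  <- sqrt(manual_var)   # standard deviation, back in cm

manual_var   # 62.5
var(ht)      # the built-in function gives the same value
manual_sd
sd(ht)       # same as manual_sd
```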

# display height variance
var(HtGender[, Height])

[1] 112.5726

# standard deviation of height variable
sd(HtGender[, Height])

[1] 10.61002

# standard deviation of male participants
sd(HtGender[Gender=="Male", Height])

[1] 12.62961

# standard deviation of female participants
sd(HtGender[Gender=="Female", Height])

[1] 8.699673

# add standard deviation on the summary table and
# save the summary table to sumTable object
sumTable <- HtGender[, list(meanHt=mean(Height), 
                            medianHt = median(Height),
                            minHt=min(Height),
                            maxHT=max(Height),
                            rangeHt = max(Height) - min(Height),
                            Q1=quantile(Height, 0.25),
                            Q2=median(Height),
                            Q3 = quantile(Height, 0.75),
                            IQR = IQR(Height),
                            stdev = sd(Height),
                            obs = .N),
                     by="Gender"]

# display the summary table
sumTable

   Gender   meanHt medianHt minHt maxHT rangeHt     Q1     Q2     Q3   IQR
   <char>    <num>    <num> <num> <num>   <num>  <num>  <num>  <num> <num>
1: Female 161.1648   158.48 143.4   174    30.6 154.48 158.48 168.02 13.54
2:   Male 163.4911   161.56 143.0   186    43.0 153.23 161.56 173.60 20.37
       stdev   obs
       <num> <int>
1:  8.699673    21
2: 12.629607    18

4. Additional outputs

a. Generate a box plot

The box plot displays the spread of the height data along the y-axis, grouped by Gender along the x-axis.

# Height is the variable plotted; Gender defines the groups
boxplot(Height ~ Gender, data = HtGender, xlab = "Gender", ylab = "Height (cm)")

Figure 2. Box plot comparing the heights of male and female participants
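
When describing a box plot, it can help to look at the numbers that it draws. As a sketch, calling boxplot with plot = FALSE returns those statistics without drawing anything. The data frame d below is hypothetical (not the class data) and is kept very small so the output is easy to read.

```r
# hypothetical data (illustration only, not the class data)
d <- data.frame(
  Gender = c("Female", "Female", "Female", "Male", "Male", "Male"),
  Height = c(150, 155, 160, 165, 170, 175)
)

# plot = FALSE returns the statistics instead of drawing the plot
bp <- boxplot(Height ~ Gender, data = d, plot = FALSE)

bp$names   # group labels, in the order of the columns of bp$stats
bp$stats   # rows: lower whisker, lower hinge, median, upper hinge, upper whisker
bp$out     # values drawn as individual outlier points, if any
```

Note that the hinges reported by boxplot may differ slightly from the quartiles returned by quantile, because the two functions use different rules for interpolating between observations.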

b. Export the summary table to a comma-delimited format

We can export our summary table to a format that is compatible with most programs. In the example below, we export the summary table to a comma-delimited (CSV) file for use in MS Excel or Word.

write.csv(sumTable, "C:/stat/SumTable.csv", row.names = FALSE)
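
To confirm that an export worked, the file can be read back into R. The sketch below is self-contained: it assumes a hypothetical two-row table named tab with made-up values, and it writes to a temporary file created by tempfile instead of a fixed folder, so it runs on any computer.

```r
# hypothetical summary table with made-up values (illustration only)
tab <- data.frame(Gender = c("Female", "Male"), meanHt = c(161.2, 163.5))

# tempfile() gives a writable path on any system; replace with your own folder
out_file <- tempfile(fileext = ".csv")
write.csv(tab, out_file, row.names = FALSE)

file.exists(out_file)   # TRUE if the export succeeded
read.csv(out_file)      # read the file back to inspect its contents
```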

Laboratory report requirement

For your exercise, use the Weight variable to generate the same descriptive statistics as in the R code above:

Generate a summary table of the measures of central tendency and dispersion. Copy and paste it as a table in MS Word format.
Describe the resulting box plot.
Summary/reflection/lesson learned from this exercise