RMarkdown: Descriptive Statistics

This RMarkdown document aims to provide a comprehensive understanding of descriptive statistics. We will cover:

Measures of Frequency (Count, Percent, Cumulative Percent)
Measures of Central Tendency (Mean, Median, Mode)
Measures of Dispersion (Range, Variance, Standard Deviation)

First, let’s load the libraries we will use.

library(tidyverse) # Includes dplyr, so a separate library(dplyr) call is not needed
library(knitr) # kable()
library(kableExtra) # kable_styling()

Next, let’s create a dataset representing scores of 50 students across three different subjects (Math, History, English) and two terms (Term1, Term2).

set.seed(123)  # Set the seed for reproducibility of random numbers

# Create 'student_data' data frame with Student IDs and random scores for Math, History, and English for two terms
student_data <- data.frame(
  StudentID = seq(1, 50),  # Generate a sequence of Student IDs from 1 to 50
  Math_Term1 = round(runif(50, 50, 100)),  # Generate 50 random scores between 50 and 100 (for various subjects and terms)
  Math_Term2 = round(runif(50, 50, 100)),  
  History_Term1 = round(runif(50, 50, 100)), 
  History_Term2 = round(runif(50, 50, 100)),
  English_Term1 = round(runif(50, 50, 100)), 
  English_Term2 = round(runif(50, 50, 100))
) %>%
  # Add the 'Math_Term1_Band' column, which classifies Math Term1 scores into letter bands
  mutate(Math_Term1_Band = as.character(cut(Math_Term1, 
                               breaks = c(0, 59, 69, 79, 89, 100),  # Right-closed intervals: (0,59], (59,69], ..., (89,100]
                               labels = c("F", "D", "C", "B", "A"))), # Assign a letter grade to each interval
         Gender = sample(c("Female", "Male"), 
                         size = nrow(.), 
                         replace = TRUE)) # Randomly assign a gender to each row; nrow(.) gives the number of rows, and sampling with replacement lets each gender be drawn many times

head(student_data, n = 10) # Display the first 10 rows of the 'student_data' data frame, which allows us to examine what the data looks like

##    StudentID Math_Term1 Math_Term2 History_Term1 History_Term2 English_Term1
## 1          1         64         52            80            92            62
## 2          2         89         72            67            75            98
## 3          3         70         90            74            69            80
## 4          4         94         56            98            62            76
## 5          5         97         78            74            56            70
## 6          6         52         60            95            69            94
## 7          7         76         56            96            79            68
## 8          8         95         88            80            61            64
## 9          9         78         95            71            72            59
## 10        10         73         69            57            61            59
##    English_Term2 Math_Term1_Band Gender
## 1             63               D   Male
## 2             61               B   Male
## 3             80               C Female
## 4             63               A   Male
## 5             77               A Female
## 6             89               F Female
## 7             58               C   Male
## 8             70               A Female
## 9             74               C Female
## 10            93               C Female
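The banding step can be checked on its own. The sketch below applies the same breaks and labels to a few boundary scores; the `scores` vector is made up for illustration and is not part of `student_data`.

```r
# Hypothetical boundary scores, chosen to sit on the edges of each band
scores <- c(59, 60, 69, 70, 89, 90, 100)

# cut() builds right-closed intervals by default: (0,59], (59,69], ..., (89,100]
bands <- cut(scores,
             breaks = c(0, 59, 69, 79, 89, 100),
             labels = c("F", "D", "C", "B", "A"))

# Pair each score with its band
data.frame(Score = scores, Band = as.character(bands))
```

Because the lowest interval is open on the left, a score of exactly 0 would become NA with these settings; passing `include.lowest = TRUE` to `cut()` would place it in the F band.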

Measures of Frequency

Count

The Count refers to the number of occurrences of each category in your data. It provides a straightforward way to understand the distribution of your dataset.

# Count the number of students by Gender
gender_count <- student_data %>% 
  group_by(Gender) %>%  # 'group_by' function groups the data frame by the 'Gender' column
  summarise(Count = n())  # 'summarise' calculates summary statistics for each group. 'n()' counts the number of rows in each group

kable(gender_count, format = "html", booktabs = TRUE, 
      digits = 2, escape = F, row.names = FALSE) %>%
  kable_styling() # 'kable' from the 'knitr' library and 'kable_styling' from 'kableExtra' library are used to format and display the dataset neatly. No need to worry about these; they're just for presentation.

Gender	Count
Female	29
Male	21

Important Notes:

Difference between summarise() and mutate(): summarise() collapses the rows of each group into a single summary row; mutate(), on the other hand, adds or modifies columns without changing the number of rows. If you used mutate() here, the group count would be repeated on every row of the existing data frame, which is not what you want when counting males and females. summarise() instead reduces the data to a new, smaller data frame that contains only the count for each gender.
Importance of group_by(): group_by() splits the data frame into subsets based on a categorical variable like Gender, allowing for targeted analysis. After grouping, calculating statistics such as counts or averages for each group becomes straightforward, providing deeper insights into the data.
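The difference between the two verbs is easiest to see side by side. The following is a minimal sketch on a made-up four-row data frame (`toy` and its values are assumptions for illustration only).

```r
library(dplyr)

# Hypothetical data: three female students and one male student
toy <- data.frame(Gender = c("Female", "Female", "Male", "Female"),
                  Score  = c(80, 90, 70, 60))

# summarise(): collapses each group to one row
toy_summarised <- toy %>%
  group_by(Gender) %>%
  summarise(Count = n())

# mutate(): keeps all four rows and repeats the group count on each one
toy_mutated <- toy %>%
  group_by(Gender) %>%
  mutate(Count = n()) %>%
  ungroup()

nrow(toy_summarised)  # 2 rows, one per gender
nrow(toy_mutated)     # 4 rows, same as the input
```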

Percent

The Percent tells you what share of the total each category represents. It is calculated by dividing the count of each category by the total number of observations and then multiplying by 100. For example, 29 of the 50 students are female, so 29 / 50 × 100 = 58%.

# Calculate the percentage of students by Gender
gender_percent <- gender_count %>% 
  mutate(Percent = (Count / sum(Count)) * 100)  # Calculate percentage: for each row, divide the value in the Count column by the sum of all values in the Count column, then multiply by 100

kable(gender_percent, format = "html", booktabs = TRUE, 
      digits = 2, escape = F, row.names = FALSE) %>%
  kable_styling()

Gender	Count	Percent
Female	29	58
Male	21	42

Cumulative Percent

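The Cumulative Percent goes a step further by adding up the percentages as you move down the rows of the table. It is most informative for ordered categories (such as letter bands), where it shows what portion of the data falls at or below a given level; for an unordered category like Gender it mainly confirms that the percentages sum to 100.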

# Calculate the cumulative percentage of students by Gender
gender_cumulative_percent <- gender_percent %>% 
  mutate(Cumulative_Percent = cumsum(Percent))  # Calculate cumulative percentage: The cumsum() function calculates the cumulative sum of the Percent column. Starting from the first row and moving downwards, it keeps adding the value in the Percent column to the sum of all the previous rows' Percent values.

kable(gender_cumulative_percent, format = "html", booktabs = TRUE, 
      digits = 2, escape = F, row.names = FALSE) %>%
  kable_styling()

Gender	Count	Percent	Cumulative_Percent
Female	29	58	58
Male	21	42	100

This table shows the running total of the gender percentages; the last row always reaches 100.
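Cumulative percentages are easier to interpret when the categories have a natural order. As a rough illustration, the base-R sketch below builds the same three columns for a made-up vector of letter grades (`grades` is hypothetical, not drawn from `student_data`).

```r
# Hypothetical letter grades for ten students
grades <- c("A", "C", "B", "F", "C", "D", "B", "C", "A", "C")

# Order the levels from lowest to highest band so the running total is meaningful
grades <- factor(grades, levels = c("F", "D", "C", "B", "A"))

counts  <- table(grades)                        # frequency of each band
percent <- as.numeric(prop.table(counts)) * 100 # share of each band

freq <- data.frame(Band = names(counts),
                   Count = as.integer(counts),
                   Percent = percent,
                   Cumulative_Percent = cumsum(percent))
freq
```

In this toy data the Cumulative_Percent for band C is 60, meaning 60% of the students scored a C or lower.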

Complete Code for Measures of Frequency

To summarize, you can create a single table with Count, Percent, and Cumulative Percent to get a comprehensive view.

# Count, Percent, and Cumulative Percent by Gender
measures_of_frequency <- student_data %>% 
  group_by(Gender) %>% 
  summarise(Count = n()) %>%  # Count of each gender
  mutate(Percent = (Count / sum(Count)) * 100,  # Calculate percent
         Cumulative_Percent = cumsum(Percent))  # Calculate cumulative percent

kable(measures_of_frequency, format = "html", booktabs = TRUE, 
      digits = 2, escape = F, row.names = FALSE) %>%
  kable_styling()

Gender	Count	Percent	Cumulative_Percent
Female	29	58	58
Male	21	42	100

This final table combines all three measures, giving you a complete picture of how genders are distributed in your dataset.
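dplyr also offers count() as a shorthand for the group_by() plus summarise(n()) pattern. The sketch below, on a made-up data frame named `toy`, suggests that both routes give the same frequency table.

```r
library(dplyr)

# Hypothetical data frame standing in for student_data
toy <- data.frame(Gender = c("Female", "Male", "Female", "Female", "Male"))

# Route 1: count() groups and counts in one step
freq_count <- toy %>%
  count(Gender, name = "Count") %>%
  mutate(Percent = Count / sum(Count) * 100,
         Cumulative_Percent = cumsum(Percent))

# Route 2: explicit group_by() + summarise()
freq_group <- toy %>%
  group_by(Gender) %>%
  summarise(Count = n()) %>%
  mutate(Percent = Count / sum(Count) * 100,
         Cumulative_Percent = cumsum(Percent))

# Compare the two results as plain data frames
all.equal(as.data.frame(freq_count), as.data.frame(freq_group))
```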

Measures of Central Tendency

Mean

The mean is the average of all numbers and is computed as the sum of all the numbers divided by the total number of items.
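As a quick check of the definition, the sketch below computes the mean by hand and compares it with mean(). The five scores are made up for illustration.

```r
# Hypothetical scores for five students
scores <- c(64, 89, 70, 94, 97)

manual_mean  <- sum(scores) / length(scores)  # sum divided by the number of items
builtin_mean <- mean(scores)

manual_mean   # 82.8
builtin_mean  # 82.8

# A missing value makes the mean NA unless it is dropped explicitly
mean(c(scores, NA))                # NA
mean(c(scores, NA), na.rm = TRUE)  # 82.8
```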

Mean for Single-variable

We start by calculating the mean of a single variable, Math_Term1, within each letter band.

# Calculate mean for Math_Term1
mean_math_term1 <- student_data %>% 
  group_by(Math_Term1_Band) %>%
  summarise(mean_Math_Term1 = mean(Math_Term1)) %>%  # Using 'summarise' to calculate the mean of Math_Term1
  arrange(Math_Term1_Band)

kable(mean_math_term1, format = "html", booktabs = TRUE, 
      digits = 2, escape = F, row.names = FALSE) %>%
  kable_styling() # Display the result

Math_Term1_Band	mean_Math_Term1
A	95.58
B	84.56
C	74.09
D	63.80
F	54.88

Means for Multiple-variables

To extend our understanding, let’s calculate the mean across multiple columns.

# Calculate means for Math_Term1, Math_Term2, History_Term1, and History_Term2
mean_multiple_columns <- student_data %>% 
  group_by(Gender) %>%
  summarise(across(c(Math_Term1, Math_Term2, History_Term1, History_Term2), mean))  # The 'across()' function is used to apply the same function across multiple columns. Here, the columns are Math_Term1, Math_Term2, History_Term1, and History_Term2. The 'mean()' function is applied to each of these columns to calculate their average

kable(mean_multiple_columns, format = "html", booktabs = TRUE, 
      digits = 2, escape = F, row.names = FALSE) %>%
  kable_styling() # Display the result

Gender	Math_Term1	Math_Term2	History_Term1	History_Term2
Female	75.24	73.69	74.62	75.34
Male	77.05	74.19	77.38	76.00
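across() can also apply several functions at once when given a named list. The following is a minimal sketch on a made-up data frame (`toy` and its values are assumptions); the `.names` argument controls how the output columns are labelled.

```r
library(dplyr)

# Hypothetical scores for four students
toy <- data.frame(Gender     = c("Female", "Female", "Male", "Male"),
                  Math_Term1 = c(70, 80, 60, 90),
                  Math_Term2 = c(75, 85, 65, 95))

toy_stats <- toy %>%
  group_by(Gender) %>%
  summarise(across(c(Math_Term1, Math_Term2),
                   list(mean = mean, median = median),  # named list of functions
                   .names = "{.col}_{.fn}"))            # e.g. Math_Term1_mean

names(toy_stats)
```

Each input column yields one output column per function, so two columns and two functions give four summary columns plus the grouping column.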

Median

The median is the value that separates the higher half from the lower half of a data sample. For a data set, it may be thought of as the “middle” value.
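How the "middle" value is chosen depends on whether the number of observations is odd or even. The sketch below uses two made-up score vectors to illustrate both cases.

```r
# Hypothetical scores
odd_scores  <- c(52, 97, 64, 89, 70)      # five values
even_scores <- c(52, 97, 64, 89, 70, 94)  # six values

# Odd count: the middle value after sorting (52 64 70 89 97)
median(odd_scores)   # 70

# Even count: the mean of the two middle values (70 and 89)
median(even_scores)  # 79.5
```

This is why the median of whole-number scores can end in .5.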

Median for Single-variable

Let’s start by finding the median of the Math_Term1 scores within each letter band.

# Calculate median for Math_Term1
median_math_term1 <- student_data %>% 
  group_by(Math_Term1_Band) %>%
  summarise(median_Math_Term1 = median(Math_Term1))  # Using 'summarise' to calculate the median of Math_Term1

kable(median_math_term1, format = "html", booktabs = TRUE, 
      digits = 2, escape = F, row.names = FALSE) %>%
  kable_styling() # Display the result

Math_Term1_Band	median_Math_Term1
A	95.0
B	85.0
C	73.0
D	63.5
F	56.0

Medians for Multiple-variables

Now, let’s find the median for multiple columns.

# Calculate medians for Math_Term1, Math_Term2, History_Term1, and History_Term2
median_multiple_columns <- student_data %>% 
  group_by(Gender) %>%
  summarise(across(c(Math_Term1, Math_Term2, History_Term1, History_Term2), median))  # Apply 'median()' to multiple columns

kable(median_multiple_columns, format = "html", booktabs = TRUE, 
      digits = 2, escape = F, row.names = FALSE) %>%
  kable_styling() # Display the result

Gender	Math_Term1	Math_Term2	History_Term1	History_Term2
Female	73	72	74	73
Male	80	73	80	76

Mode

The mode is the most commonly occurring value in a dataset. Unlike the mean and median, the mode need not be unique: a dataset can have more than one.

Mode for Single-variable

We start by finding the mode of Math_Term1 within each letter band.

# Calculate mode for Math_Term1 within each letter band
# install.packages("modeest")  # Run once if the package is not yet installed
library(modeest)

mode_math_term1 <- student_data %>% 
  group_by(Math_Term1_Band) %>%
  reframe(mode_math_term1_by_band = mfv(Math_Term1))  # mfv() returns every most frequent value, so a band can produce several rows; reframe() (dplyr >= 1.1.0) allows this, whereas summarise() raises a deprecation warning

kable(mode_math_term1, format = "html", booktabs = TRUE, 
      digits = 2, escape = F, row.names = FALSE) %>%
  kable_styling() # Display the result

Math_Term1_Band	mode_math_term1_by_band
A	95
A	98
B	85
C	73
D	62
F	57

mfv() (short for "most frequent value") from the modeest package returns the value or values that occur most often. When several values tie for the highest frequency, all of them are returned, which is why band A has two rows (95 and 98).
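The same idea can be sketched in base R without any extra package. The helper `all_modes()` below is hypothetical (it is not part of modeest or base R) and the input vectors are made up; it counts each value with table() and keeps every value tied for the top count.

```r
# Hypothetical helper: return every value tied for the highest frequency
all_modes <- function(x) {
  counts <- table(x)                              # frequency of each unique value
  is_top <- as.vector(counts) == max(counts)      # TRUE for the most frequent value(s)
  as.numeric(names(counts)[is_top])               # convert the names back to numbers
}

all_modes(c(95, 98, 95, 98, 91))  # two modes: 95 and 98
all_modes(c(73, 73, 70))          # one mode: 73
```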

Mode for Multiple-variables

# Calculate modes for Math_Term1, Math_Term2, and History_Term1
mode_multiple_columns <- student_data %>% 
  group_by(Gender) %>%
  reframe(across(c(Math_Term1, Math_Term2, History_Term1), mfv))  # Apply 'mfv' to multiple columns; reframe() (dplyr >= 1.1.0) permits more than one row per group

kable(mode_multiple_columns, format = "html", booktabs = TRUE, 
      digits = 2, escape = F, row.names = FALSE) %>%
  kable_styling() # Display the result

Gender	Math_Term1	Math_Term2	History_Term1
Female	73	55	57
Female	73	55	62
Female	73	55	65
Female	73	55	74
Female	73	55	87
Female	73	55	99
Male	71	56	98
Male	95	83	98
Male	98	91	98

Note that each gender gets as many rows as its column with the most modes, and a column with a single mode has that value repeated to fill the rows. For example, female students have one mode for Math_Term1 (73) and one for Math_Term2 (55) but six for History_Term1, so 73 and 55 appear six times. Values that share a row are not otherwise related to each other.

Measures of Dispersion

Measures of dispersion help you understand the spread of your data points. This is useful in determining how much individual data points deviate from the mean.

Range (Max, Min)

The range gives you an idea of how spread out the values in a data set are. It’s calculated as the difference between the maximum and minimum values.

# Calculate range for Math_Term1
range_math_term1 <- student_data %>% 
  group_by(Gender) %>%
  summarise(Max_Math_Term1 = max(Math_Term1),
            Min_Math_Term1 = min(Math_Term1),
            Range_Math_Term1 = Max_Math_Term1 - Min_Math_Term1)  # Using 'summarise' to calculate the min, the max, and the range

kable(range_math_term1, format = "html", booktabs = TRUE, 
      digits = 2, escape = F, row.names = FALSE) %>%
  kable_styling()  # Display the result

Gender	Max_Math_Term1	Min_Math_Term1	Range_Math_Term1
Female	100	51	49
Male	98	52	46
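Base R also provides range(), which returns the minimum and maximum together. The sketch below, on a made-up vector of scores, shows two equivalent ways to get the spread between them.

```r
# Hypothetical scores
scores <- c(64, 89, 70, 94, 97, 52)

range(scores)                           # minimum and maximum: 52 97
range_by_diff   <- diff(range(scores))  # max minus min via diff()
range_by_minmax <- max(scores) - min(scores)

range_by_diff    # 45
range_by_minmax  # 45
```

Because the range depends only on the two most extreme values, a single unusually high or low score can change it substantially.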

Variance and SD

The variance measures how far the values in a set lie from their mean. It is the average of the squared differences from the mean; R's var() computes the sample variance, which divides the sum of squared differences by n - 1 rather than n. The standard deviation is the square root of the variance, so it expresses the spread in the same units as the original scores.

# Calculate variance and standard deviation for Math_Term1
variance_sd_math_term1 <- student_data %>% 
  group_by(Gender) %>%
  summarise(Variance_Math_Term1 = var(Math_Term1), # Calculate variance
            Std_Dev_Math_Term1 =  sd(Math_Term1))  # Calculate sd

kable(variance_sd_math_term1, format = "html", booktabs = TRUE, 
      digits = 2, escape = F, row.names = FALSE) %>%
  kable_styling()  # Display the result

Gender	Variance_Math_Term1	Std_Dev_Math_Term1
Female	210.05	14.49
Male	238.75	15.45
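To make the calculation concrete, the sketch below computes the variance by hand on a made-up vector of scores and compares it with var() and sd(). It also shows the population version (dividing by n) for contrast; that variant is included only as an illustration and is not used elsewhere in this document.

```r
# Hypothetical scores for five students
scores <- c(64, 89, 70, 94, 97)
n <- length(scores)

squared_dev <- (scores - mean(scores))^2  # squared differences from the mean

sample_var <- sum(squared_dev) / (n - 1)  # what var() computes
pop_var    <- sum(squared_dev) / n        # population variance, for comparison
sample_sd  <- sqrt(sample_var)            # what sd() computes

all.equal(sample_var, var(scores))  # TRUE
all.equal(sample_sd, sd(scores))    # TRUE
```

With only five values the two denominators give noticeably different results (220.7 versus 176.56); the gap shrinks as n grows.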