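This RMarkdown document aims to provide a comprehensive understanding of descriptive statistics. We will cover measures of frequency (count, percent, and cumulative percent), measures of central tendency (mean, median, and mode), and measures of dispersion (range, variance, and standard deviation).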
First of all, let’s load the libraries that will be in use.
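library(tidyverse) # attaches dplyr along with the other core tidyverse packages
library(dplyr) # redundant after tidyverse, but harmless to load explicitly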
library(knitr) # kable()
library(kableExtra) # kable_styling()
Next, let’s create a dataset representing scores of 50 students across three different subjects (Math, History, English) and two terms (Term1, Term2).
set.seed(123) # Set the seed for reproducibility of random numbers
# Create 'student_data' data frame with Student IDs and random scores for Math, History, and English for two terms
student_data <- data.frame(
StudentID = seq(1, 50), # Generate a sequence of Student IDs from 1 to 50
Math_Term1 = round(runif(50, 50, 100)), # Generate 50 random scores between 50 and 100 (for various subjects and terms)
Math_Term2 = round(runif(50, 50, 100)),
History_Term1 = round(runif(50, 50, 100)),
History_Term2 = round(runif(50, 50, 100)),
English_Term1 = round(runif(50, 50, 100)),
English_Term2 = round(runif(50, 50, 100))
) %>%
# Add 'Math_Term1_Band', a new column that classifies Math Term1 scores into letter bands
mutate(Math_Term1_Band = as.character(cut(Math_Term1,
breaks = c(0, 59, 69, 79, 89, 100), # Define numeric intervals for categorization (0-59, 60-69, etc.)
labels = c("F", "D", "C", "B", "A"))), # Assign letter grades to each interval
Gender = sample(c("Female", "Male"),
size = nrow(.),
replace = TRUE)) # Randomly assign "Female" or "Male" to each row; nrow(.) supplies the number of rows, and replace = TRUE allows each value to be drawn more than once
head(student_data, n = 10) # Display the first 10 rows of the 'student_data' data frame, which allows us to examine what the data looks like
## StudentID Math_Term1 Math_Term2 History_Term1 History_Term2 English_Term1
## 1 1 64 52 80 92 62
## 2 2 89 72 67 75 98
## 3 3 70 90 74 69 80
## 4 4 94 56 98 62 76
## 5 5 97 78 74 56 70
## 6 6 52 60 95 69 94
## 7 7 76 56 96 79 68
## 8 8 95 88 80 61 64
## 9 9 78 95 71 72 59
## 10 10 73 69 57 61 59
## English_Term2 Math_Term1_Band Gender
## 1 63 D Male
## 2 61 B Male
## 3 80 C Female
## 4 63 A Male
## 5 77 A Female
## 6 89 F Female
## 7 58 C Male
## 8 70 A Female
## 9 74 C Female
## 10 93 C Female
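The band boundaries deserve a quick check, because cut() treats each interval as open on the left and closed on the right by default. The sketch below applies the same breaks and labels to a few boundary scores chosen for illustration (they are not taken from student_data).

```r
# Boundary scores chosen for illustration only
boundary_scores <- c(59, 60, 69, 70, 89, 90, 100)

# Same breaks and labels as in the data-creation step
bands <- cut(boundary_scores,
             breaks = c(0, 59, 69, 79, 89, 100),
             labels = c("F", "D", "C", "B", "A"))

# 59 falls in F, 60 and 69 in D, 70 in C, 89 in B, 90 and 100 in A
data.frame(Score = boundary_scores, Band = as.character(bands))

# A score of exactly 0 lies outside the first interval and becomes NA
zero_band <- cut(0, breaks = c(0, 59, 69, 79, 89, 100),
                 labels = c("F", "D", "C", "B", "A"))
```

Because the simulated scores lie between 50 and 100, the missing value at 0 never arises here; with real data that can contain 0, include.lowest = TRUE would close the first interval on the left.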
The Count refers to the number of occurrences of each category in your data. It provides a straightforward way to understand the distribution of your dataset.
# Count the number of students by Gender
gender_count <- student_data %>%
group_by(Gender) %>% # 'group_by' function groups the data frame by the 'Gender' column
summarise(Count = n()) # 'summarise' calculates summary statistics for each group. 'n()' counts the number of rows in each group
kable(gender_count, format = "html", booktabs = TRUE,
digits = 2, escape = F, row.names = FALSE) %>%
kable_styling() # 'kable' from the 'knitr' library and 'kable_styling' from 'kableExtra' library are used to format and display the dataset neatly. No need to worry about these; they're just for presentation.
Gender | Count |
---|---|
Female | 29 |
Male | 21 |
Important Notes:
Difference between summarise() and mutate(): summarise() is used for collapsing multiple rows into a single summary row; mutate(), on the other hand, is used for modifying or adding new columns without changing the number of rows. If you used mutate() here, you would add a new column to each row of your existing data frame, which is not what you want when you are trying to count the number of males and females. summarise() will reduce the data into a new, smaller data frame that contains only the counts of each gender.
Importance of group_by(): group_by() splits the data frame into subsets based on a categorical variable like Gender, allowing for targeted analysis. After grouping, calculating statistics such as counts or averages for each group becomes straightforward, providing deeper insights into the data.
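The contrast between summarise() and mutate() is easiest to see side by side. The sketch below uses a hypothetical five-row data frame named toy (an assumption for illustration, not part of student_data) and applies the same n() call with each verb.

```r
library(dplyr)

# Hypothetical five-row data frame, used only for illustration
toy <- data.frame(Gender = c("Female", "Male", "Female", "Female", "Male"))

# summarise() collapses each group to a single row
toy_summarised <- toy %>%
  group_by(Gender) %>%
  summarise(Count = n())

# mutate() keeps all five rows and repeats the group count on each row
toy_mutated <- toy %>%
  group_by(Gender) %>%
  mutate(Count = n()) %>%
  ungroup()

nrow(toy_summarised) # 2 rows: one per gender
nrow(toy_mutated)    # 5 rows: one per original row
```

For a frequency table, the two-row result from summarise() is the one you want.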
Moving on to the Percent, it tells you what fraction of the total each category represents. This is calculated by dividing the count of each category by the total number of observations, and then multiplying by 100.
# Calculate the percentage of students by Gender
gender_percent <- gender_count %>%
mutate(Percent = (Count / sum(Count)) * 100) # Calculate percentage: For each row, it divides the value in the Count column by the sum of all values in the Count column. Essentially, this computes the proportion of each gender in the data.
kable(gender_percent, format = "html", booktabs = TRUE,
digits = 2, escape = F, row.names = FALSE) %>%
kable_styling()
Gender | Count | Percent |
---|---|---|
Female | 29 | 58 |
Male | 21 | 42 |
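As a cross-check, the same percentages can be reproduced in base R with table() and prop.table(). The sketch below rebuilds a gender vector from the counts reported above (29 and 21) rather than from student_data itself.

```r
# Rebuild a gender vector from the counts in the table above
gender <- c(rep("Female", 29), rep("Male", 21))

counts  <- table(gender)              # frequency of each category
percent <- prop.table(counts) * 100   # share of the total, as a percentage

round(percent, 2)
```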
The Cumulative Percent goes a step further by adding up the percentages as you move down the categories. Each value shows what portion of the data falls in that category or any category listed above it, which is most informative when the categories have a natural order.
# Calculate the cumulative percentage of students by Gender
gender_cumulative_percent <- gender_percent %>%
mutate(Cumulative_Percent = cumsum(Percent)) # Calculate cumulative percentage: The cumsum() function calculates the cumulative sum of the Percent column. Starting from the first row and moving downwards, it keeps adding the value in the Percent column to the sum of all the previous rows' Percent values.
kable(gender_cumulative_percent, format = "html", booktabs = TRUE,
digits = 2, escape = F, row.names = FALSE) %>%
kable_styling()
Gender | Count | Percent | Cumulative_Percent |
---|---|---|---|
Female | 29 | 58 | 58 |
Male | 21 | 42 | 100 |
This table shows the cumulative percentage of students by gender. The final row always reaches 100, because by then every student has been counted.
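Cumulative percent is easier to interpret when the categories are ordered, as the grade bands are. The sketch below is a minimal illustration on ten hypothetical band labels (not drawn from student_data); converting Band to a factor with explicit levels is what makes count() list the bands from F up to A.

```r
library(dplyr)

# Hypothetical band labels for ten students (illustration only)
bands <- data.frame(Band = c("F", "D", "D", "C", "C", "C", "B", "B", "A", "A"))

band_cumulative <- bands %>%
  mutate(Band = factor(Band, levels = c("F", "D", "C", "B", "A"))) %>% # order from lowest to highest band
  count(Band, name = "Count") %>%                                      # one row per band, in level order
  mutate(Percent = Count / sum(Count) * 100,
         Cumulative_Percent = cumsum(Percent))

band_cumulative
```

In this toy table, the Cumulative_Percent for band C is 60, meaning 60 percent of the students scored in band C or lower.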
To summarize, you can create a single table with Count, Percent, and Cumulative Percent to get a comprehensive view.
# Count, Percent, and Cumulative Percent by Gender
measures_of_frequency <- student_data %>%
group_by(Gender) %>%
summarise(Count = n()) %>% # Count of each gender
mutate(Percent = (Count / sum(Count)) * 100, # Calculate percent
Cumulative_Percent = cumsum(Percent)) # Calculate cumulative percent
kable(measures_of_frequency, format = "html", booktabs = TRUE,
digits = 2, escape = F, row.names = FALSE) %>%
kable_styling()
Gender | Count | Percent | Cumulative_Percent |
---|---|---|---|
Female | 29 | 58 | 58 |
Male | 21 | 42 | 100 |
This final table combines all three measures, giving you a complete picture of how genders are distributed in your dataset.
The mean is the average of all numbers and is computed as the sum of all the numbers divided by the total number of items.
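To make the definition concrete, the sketch below computes the mean by hand and compares it with mean(). The five values are the first five Math_Term1 scores shown in the output earlier in this document.

```r
# First five Math_Term1 scores from the displayed data
scores <- c(64, 89, 70, 94, 97)

manual_mean <- sum(scores) / length(scores) # sum of values divided by the number of values
manual_mean                                 # 82.8
mean(scores)                                # same result from the built-in function
```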
We start by calculating the mean of just one variable, Math_Term1, within each grade band.
# Calculate mean for Math_Term1
mean_math_term1 <- student_data %>%
group_by(Math_Term1_Band) %>%
summarise(mean_Math_Term1 = mean(Math_Term1)) %>% # Using 'summarise' to calculate the mean of Math_Term1
arrange(Math_Term1_Band)
kable(mean_math_term1, format = "html", booktabs = TRUE,
digits = 2, escape = F, row.names = FALSE) %>%
kable_styling() # Display the result
Math_Term1_Band | mean_Math_Term1 |
---|---|
A | 95.58 |
B | 84.56 |
C | 74.09 |
D | 63.80 |
F | 54.88 |
To extend our understanding, let’s calculate the mean across multiple columns.
# Calculate means for Math_Term1, Math_Term2, History_Term1, and History_Term2
mean_multiple_columns <- student_data %>%
group_by(Gender) %>%
summarise(across(c(Math_Term1, Math_Term2, History_Term1, History_Term2), mean)) # The 'across()' function is used to apply the same function across multiple columns. Here, the columns are Math_Term1, Math_Term2, History_Term1, and History_Term2. The 'mean()' function is applied to each of these columns to calculate their average
kable(mean_multiple_columns, format = "html", booktabs = TRUE,
digits = 2, escape = F, row.names = FALSE) %>%
kable_styling() # Display the result
Gender | Math_Term1 | Math_Term2 | History_Term1 | History_Term2 |
---|---|---|---|---|
Female | 75.24 | 73.69 | 74.62 | 75.34 |
Male | 77.05 | 74.19 | 77.38 | 76.00 |
The median is the value that separates the higher half from the lower half of a data sample. For a data set, it may be thought of as the “middle” value.
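A short sketch shows how the "middle" value is chosen. The score vectors are hypothetical and chosen for illustration: with an odd number of values the median is the middle one, with an even number it is the average of the two middle ones, and an extreme value barely moves it.

```r
# Hypothetical scores, for illustration only
odd_scores  <- c(52, 64, 70, 89, 97)
even_scores <- c(52, 64, 70, 89)

median(odd_scores)  # 70: the middle of five sorted values
median(even_scores) # 67: the average of the two middle values, 64 and 70

# Replacing the largest score with an extreme value leaves the median unchanged
with_outlier <- c(52, 64, 70, 89, 1000)
median(with_outlier) # still 70
mean(with_outlier)   # pulled upward by the extreme value
```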
Let’s start by finding the median of the Math_Term1 scores within each grade band.
# Calculate median for Math_Term1
median_math_term1 <- student_data %>%
group_by(Math_Term1_Band) %>%
summarise(median_Math_Term1 = median(Math_Term1)) # Using 'summarise' to calculate the median of Math_Term1
kable(median_math_term1, format = "html", booktabs = TRUE,
digits = 2, escape = F, row.names = FALSE) %>%
kable_styling() # Display the result
Math_Term1_Band | median_Math_Term1 |
---|---|
A | 95.0 |
B | 85.0 |
C | 73.0 |
D | 63.5 |
F | 56.0 |
Now, let’s find the median for multiple columns.
# Calculate medians for Math_Term1, Math_Term2, History_Term1, and History_Term2
median_multiple_columns <- student_data %>%
group_by(Gender) %>%
summarise(across(c(Math_Term1, Math_Term2, History_Term1, History_Term2), median)) # Apply 'median()' to multiple columns
kable(median_multiple_columns, format = "html", booktabs = TRUE,
digits = 2, escape = F, row.names = FALSE) %>%
kable_styling() # Display the result
Gender | Math_Term1 | Math_Term2 | History_Term1 | History_Term2 |
---|---|---|---|---|
Female | 73 | 72 | 74 | 73 |
Male | 80 | 73 | 80 | 76 |
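The mean and median tables can also be produced in a single call, since across() accepts a named list of functions. The sketch below is one possible approach on a hypothetical six-row data frame (toy_scores is an assumption, not the tutorial's data); the .names argument controls how the output columns are labelled.

```r
library(dplyr)

# Hypothetical scores for six students (illustration only)
toy_scores <- data.frame(
  Gender     = c("Female", "Female", "Female", "Male", "Male", "Male"),
  Math_Term1 = c(60, 70, 95, 50, 80, 83),
  Math_Term2 = c(65, 75, 70, 90, 60, 72)
)

center_summary <- toy_scores %>%
  group_by(Gender) %>%
  summarise(across(c(Math_Term1, Math_Term2),
                   list(mean = mean, median = median), # apply both functions to each column
                   .names = "{.col}_{.fn}"))           # e.g. Math_Term1_mean, Math_Term1_median

center_summary
```

Placing the two statistics next to each other makes it easy to spot columns where they diverge, which hints at a skewed distribution.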
The mode is the most commonly occurring value in a dataset. Unlike the mean and median, the mode need not be unique: a dataset can have more than one.
We start by finding the mode for Math_Term1.
# Calculate mode for Math_Term1
# install.packages("modeest") # Run this once if the package is not yet installed
library(modeest) # provides mfv(), which returns the most frequent value(s)
mode_math_term1 <- student_data %>%
group_by(Math_Term1_Band) %>%
reframe(mode_math_term1_by_band = mfv(Math_Term1)) # 'reframe' (dplyr 1.1.0 or later) allows more than one row per group, which is needed when several values tie for the mode
kable(mode_math_term1, format = "html", booktabs = TRUE,
digits = 2, escape = F, row.names = FALSE) %>%
kable_styling() # Display the result
Math_Term1_Band | mode_math_term1_by_band |
---|---|
A | 95 |
A | 98 |
B | 85 |
C | 73 |
D | 62 |
F | 57 |
Here, mfv() from the modeest package returns the most frequent value of Math_Term1 in each band. When several values tie for the highest frequency, it returns all of them, which is why band A appears twice in the table (95 and 98).
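A mode can also be found with dplyr alone: count how often each value occurs, then keep the value or values with the highest count. The sketch below is a minimal illustration on a hypothetical six-score data frame, not the tutorial's own method.

```r
library(dplyr)

# Hypothetical scores, for illustration only
scores <- data.frame(Math_Term1 = c(73, 95, 73, 98, 95, 60))

mode_by_count <- scores %>%
  count(Math_Term1, name = "Frequency") %>% # frequency of each unique score
  filter(Frequency == max(Frequency)) %>%   # keep every value tied for the top frequency
  arrange(Math_Term1)

mode_by_count
```

Filtering on the maximum frequency, rather than taking only the first row after sorting, keeps all tied modes; here both 73 and 95 occur twice.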
# Calculate modes for Math_Term1, Math_Term2, and History_Term1
mode_multiple_columns <- student_data %>%
group_by(Gender) %>%
reframe(across(c(Math_Term1, Math_Term2, History_Term1), mfv)) # Apply 'mfv' to multiple columns. A column with a single mode is repeated to match the column with the most tied modes, so values in the same row are not linked to each other
kable(mode_multiple_columns, format = "html", booktabs = TRUE,
digits = 2, escape = F, row.names = FALSE) %>%
kable_styling() # Display the result
Gender | Math_Term1 | Math_Term2 | History_Term1 |
---|---|---|---|
Female | 73 | 55 | 57 |
Female | 73 | 55 | 62 |
Female | 73 | 55 | 65 |
Female | 73 | 55 | 74 |
Female | 73 | 55 | 87 |
Female | 73 | 55 | 99 |
Male | 71 | 56 | 98 |
Male | 95 | 83 | 98 |
Male | 98 | 91 | 98 |
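In the table above, a column with a single mode is repeated to fill the rows needed by the column with the most ties, which can be hard to read. One possible alternative is to collapse tied modes into a single text value so that each group keeps one row. The sketch below assumes a hypothetical helper, modes_as_text(), and a toy data frame; neither is part of the original analysis or of modeest.

```r
library(dplyr)

# Hypothetical helper: returns all tied modes as one comma-separated string
modes_as_text <- function(x) {
  freq <- table(x)                                        # frequency of each unique value
  paste(names(freq)[freq == max(freq)], collapse = ", ")  # keep every value with the top frequency
}

# Hypothetical scores for six students (illustration only)
toy_scores <- data.frame(
  Gender     = c("Female", "Female", "Female", "Male", "Male", "Male"),
  Math_Term1 = c(73, 73, 60, 95, 98, 71),
  Math_Term2 = c(55, 55, 80, 56, 56, 91)
)

mode_one_row_per_group <- toy_scores %>%
  group_by(Gender) %>%
  summarise(across(c(Math_Term1, Math_Term2), modes_as_text))

mode_one_row_per_group
```

The trade-off is that the result columns are character rather than numeric, so this form suits display more than further calculation.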
Measures of dispersion help you understand the spread of your data points. This is useful in determining how much individual data points deviate from the mean.
The range gives you an idea of how spread out the values in a data set are. It’s calculated as the difference between the maximum and minimum values.
# Calculate range for Math_Term1
range_math_term1 <- student_data %>%
group_by(Gender) %>%
summarise(Max_Math_Term1 = max(Math_Term1),
Min_Math_Term1 = min(Math_Term1),
Range_Math_Term1 = Max_Math_Term1 - Min_Math_Term1) # Using 'summarise' to calculate the min, the max, and the range
kable(range_math_term1, format = "html", booktabs = TRUE,
digits = 2, escape = F, row.names = FALSE) %>%
kable_styling() # Display the result
Gender | Max_Math_Term1 | Min_Math_Term1 | Range_Math_Term1 |
---|---|---|---|
Female | 100 | 51 | 49 |
Male | 98 | 52 | 46 |
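Base R offers a shortcut for the same calculation: range() returns the minimum and maximum together, and diff() subtracts one from the other. The sketch below uses the first five Math_Term1 scores shown in the output earlier in this document.

```r
# First five Math_Term1 scores from the displayed data
scores <- c(64, 89, 70, 94, 97)

range(scores)                          # minimum and maximum: 64 and 97
score_range <- diff(range(scores))     # maximum minus minimum
score_range                            # 33
max(scores) - min(scores)              # same result, written out explicitly
```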
The variance measures how far the values in a set lie from the mean and is calculated by averaging the squared differences from the mean. R's var() computes the sample variance, which divides the sum of squared differences by n - 1 rather than n. The standard deviation is the square root of the variance, so it expresses the spread in the same units as the original scores.
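The calculation can be traced step by step. The sketch below works through the variance by hand on the first five Math_Term1 scores shown earlier and compares the result with var() and sd(); the variable names are illustrative.

```r
# First five Math_Term1 scores from the displayed data
scores <- c(64, 89, 70, 94, 97)
n <- length(scores)

squared_deviations <- (scores - mean(scores))^2 # squared distance of each score from the mean

sample_variance     <- sum(squared_deviations) / (n - 1) # what var() computes
population_variance <- sum(squared_deviations) / n       # divides by n instead

all.equal(sample_variance, var(scores))      # TRUE
all.equal(sqrt(sample_variance), sd(scores)) # TRUE: sd() is the square root of var()
```

The population version is always slightly smaller than the sample version, and the gap shrinks as n grows.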
# Calculate variance and standard deviation for Math_Term1
variance_sd_math_term1 <- student_data %>%
group_by(Gender) %>%
summarise(Variance_Math_Term1 = var(Math_Term1), # Calculate variance
Std_Dev_Math_Term1 = sd(Math_Term1)) # Calculate sd
kable(variance_sd_math_term1, format = "html", booktabs = TRUE,
digits = 2, escape = F, row.names = FALSE) %>%
kable_styling() # Display the result
Gender | Variance_Math_Term1 | Std_Dev_Math_Term1 |
---|---|---|
Female | 210.05 | 14.49 |
Male | 238.75 | 15.45 |