DATA LOADING AND FAMILIARIZATION
Cohort Data (cohort_data.csv) – This dataset tracks cohort-based learning programs, including cohort sizes, timelines, and linked opportunities, allowing for participation and completion analysis.
The original dataset contained five columns: cohort_id, cohort_code, start_date, end_date, and size. The cohort_id column appears as “Cohort#” in the rows shown, so it serves as a generic label rather than a unique identifier. The cohort_code column holds a unique seven-character alphanumeric code for each cohort (for example, B456514 or B0VCB0F). The start_date and end_date columns indicate when each cohort began and ended, respectively, while size gives the number of participants in each cohort. However, the start_date and end_date values were stored as large numeric timestamps (on the order of 1.68e+12) rather than in a human-readable or standard date format. To address this, readable versions of the dates were added as two new columns, sdate and edate. This transformation was carried out in Excel before importing the data into R for analysis, which is why the loaded data has seven columns.
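The same conversion could also be done directly in R. The sketch below assumes that the raw values are Unix timestamps in milliseconds (UTC), which is consistent with their magnitude but is not confirmed by the source; epoch_ms_to_date is a hypothetical helper name and the input value is illustrative only.

```r
# Assumption: raw values are Unix timestamps in milliseconds (UTC).
# epoch_ms_to_date is a hypothetical helper, not part of the original analysis.
epoch_ms_to_date <- function(ms) {
  # Divide by 1000 to get seconds, then convert to a calendar date in UTC
  as.Date(as.POSIXct(ms / 1000, origin = "1970-01-01", tz = "UTC"), tz = "UTC")
}

# Illustrative value: 1680480000000 ms is midnight UTC on 2023-04-03
example_ms <- 1680480000000
epoch_ms_to_date(example_ms)
```

If the assumption holds, applying the helper to data$start_date and data$end_date should reproduce sdate and edate without the Excel step.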
#data loading and inspection
library(readr)
data<-read.csv("C:\\Users\\PC\\Downloads\\CohortRaw(in).csv")
head(data)
## cohort_id cohort_code sdate start_date edate end_date size
## 1 Cohort# B456514 2023-04-03 1.68e+12 2023-05-01 1.68e+12 1500
## 2 Cohort# B328821 2023-01-16 1.67e+12 2023-02-20 1.68e+12 1000
## 3 Cohort# B289256 2022-11-14 1.67e+12 2022-12-15 1.67e+12 100000
## 4 Cohort# B0VCB0F 2022-09-22 1.66e+12 2022-09-22 1.66e+12 40
## 5 Cohort# B908347 2023-01-09 1.67e+12 2023-02-13 1.68e+12 100000
## 6 Cohort# B306047 2023-01-09 1.67e+12 2023-02-13 1.68e+12 10000
tail(data)
## cohort_id cohort_code sdate start_date edate end_date size
## 634 Cohort# B1Z43A9 2025-04-13 1.74e+12 2025-06-07 1.75e+12 1500
## 635 Cohort# BX9N6FW 2025-04-01 1.74e+12 2025-04-23 1.75e+12 1500
## 636 Cohort# BM81DEC 2025-01-06 1.74e+12 2025-09-22 1.76e+12 1600
## 637 Cohort# BK4VONV 2025-03-17 1.74e+12 2025-04-15 1.74e+12 800
## 638 Cohort# BDFSXJ0 2025-08-07 1.75e+12 2025-10-07 1.76e+12 1500
## 639 Cohort# BLD0LQ9 2025-05-12 1.75e+12 2025-06-10 1.75e+12 800
To better understand the types of variables in the dataset, the structure of the data was examined. The dataset contains 639 observations across 7 variables. The cohort_id is of character data type, and cohort_code is also stored as a character (chr). Both sdate and edate, which represent the readable start and end dates, are in character format. On the other hand, the original start_date and end_date are stored as numeric values. Finally, the size variable, representing the number of participants in each cohort, is of integer type.
#structure of the data set
str(data)
## 'data.frame': 639 obs. of 7 variables:
## $ cohort_id : chr "Cohort#" "Cohort#" "Cohort#" "Cohort#" ...
## $ cohort_code: chr "B456514" "B328821" "B289256" "B0VCB0F" ...
## $ sdate : chr "2023-04-03" "2023-01-16" "2022-11-14" "2022-09-22" ...
## $ start_date : num 1.68e+12 1.67e+12 1.67e+12 1.66e+12 1.67e+12 ...
## $ edate : chr "2023-05-01" "2023-02-20" "2022-12-15" "2022-09-22" ...
## $ end_date : num 1.68e+12 1.68e+12 1.67e+12 1.66e+12 1.68e+12 ...
## $ size : int 1500 1000 100000 40 100000 10000 1500 100000 100000 10000 ...
DATA CLEANING
Since sdate and edate were initially read as character data, they were converted to the Date type to enable accurate date-based analysis. Summary statistics show that the earliest start date is 2022-06-09 and the latest is 2025-08-07. Cohort sizes range from a minimum of 3 to a maximum of 100,000, with a mean of 5,741 and a median of 800; the large gap between mean and median points to a strongly right-skewed distribution. The first quartile (Q1) is 500 and the third quartile (Q3) is 1,500, giving an interquartile range (IQR) of 1,000. Under the 1.5 × IQR rule, sizes below 500 - 1,500 = -1,000 or above 1,500 + 1,500 = 3,000 are flagged as outliers, so in practice only cohorts larger than 3,000 qualify.
#changing datatype for date column to date
data$sdate <- as.Date(data$sdate, format = "%Y-%m-%d")
data$edate <- as.Date(data$edate, format = "%Y-%m-%d")
#confirmation
str(data)
## 'data.frame': 639 obs. of 7 variables:
## $ cohort_id : chr "Cohort#" "Cohort#" "Cohort#" "Cohort#" ...
## $ cohort_code: chr "B456514" "B328821" "B289256" "B0VCB0F" ...
## $ sdate : Date, format: "2023-04-03" "2023-01-16" ...
## $ start_date : num 1.68e+12 1.67e+12 1.67e+12 1.66e+12 1.67e+12 ...
## $ edate : Date, format: "2023-05-01" "2023-02-20" ...
## $ end_date : num 1.68e+12 1.68e+12 1.67e+12 1.66e+12 1.68e+12 ...
## $ size : int 1500 1000 100000 40 100000 10000 1500 100000 100000 10000 ...
summary(data)
## cohort_id cohort_code sdate start_date
## Length:639 Length:639 Min. :2022-06-09 Min. :1.650e+12
## Class :character Class :character 1st Qu.:2023-05-03 1st Qu.:1.680e+12
## Mode :character Mode :character Median :2024-03-03 Median :1.710e+12
## Mean :2024-01-26 Mean :1.706e+12
## 3rd Qu.:2024-10-21 3rd Qu.:1.730e+12
## Max. :2025-08-07 Max. :1.750e+12
## edate end_date size
## Min. :2022-06-10 Min. :1.650e+12 Min. : 3
## 1st Qu.:2023-06-09 1st Qu.:1.690e+12 1st Qu.: 500
## Median :2024-04-16 Median :1.710e+12 Median : 800
## Mean :2024-03-23 Mean :1.712e+12 Mean : 5741
## 3rd Qu.:2024-12-15 3rd Qu.:1.730e+12 3rd Qu.: 1500
## Max. :2026-03-06 Max. :1.770e+12 Max. :100000
The data set contained no missing values.
#checking for missing values
sum(is.na(data))
## [1] 0
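A single total can hide where gaps occur, and empty strings in character columns are not counted as NA. The sketch below shows a per-column check on a small toy data frame; the values are hypothetical and only illustrate the technique.

```r
# Toy data frame (hypothetical values) with one NA and one empty string
toy <- data.frame(
  cohort_code = c("B000001", "", "B000003"),
  size = c(100L, NA, 250L),
  stringsAsFactors = FALSE
)

# Number of NA values in each column
na_counts <- colSums(is.na(toy))

# Number of empty strings in each character column (0 for other types)
blank_counts <- sapply(toy, function(col) {
  if (is.character(col)) sum(col == "", na.rm = TRUE) else 0L
})

na_counts
blank_counts
```

On the report's data, toy would be replaced by data.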
Checking for outliers using the IQR rule.
#checking for outliers
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Apply IQR rule to all numeric columns
numeric_cols <- select_if(data, is.numeric)
# Function to detect outliers
find_outliers <- function(x) {
  q1 <- quantile(x, 0.25, na.rm = TRUE)
  q3 <- quantile(x, 0.75, na.rm = TRUE)
  iqr <- q3 - q1
  # Return the row positions of values outside the 1.5 * IQR fences
  which(x < (q1 - 1.5 * iqr) | x > (q3 + 1.5 * iqr))
}
# Check for outliers per column
lapply(numeric_cols, find_outliers)
## $start_date
## integer(0)
##
## $end_date
## integer(0)
##
## $size
## [1] 3 5 6 8 9 10 12 17 22 28 33 35 39 43 47 49 50 57 58
## [20] 61 66 70 71 73 74 77 78 83 84 85 88 89 94 95 96 104 105 114
## [39] 172 265 290 313 381 419 504 523 530
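The numbers printed for size are row positions returned by which(), not cohort sizes; 47 rows are flagged. As a worked check of the thresholds behind that result, the fences can be computed from the quartiles reported by summary(data) (Q1 = 500, Q3 = 1,500):

```r
# Quartiles of size as reported by summary(data)
q1 <- 500
q3 <- 1500

iqr <- q3 - q1                   # 1000
lower_fence <- q1 - 1.5 * iqr    # -1000
upper_fence <- q3 + 1.5 * iqr    # 3000

c(lower = lower_fence, upper = upper_fence)
```

Because the smallest cohort has 3 participants, nothing falls below the lower fence, so every flagged row is a cohort with more than 3,000 participants.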
The box plot below indicates the presence of outliers in the data, consistent with the IQR rule: cohort sizes above the upper fence of 3,000 (Q3 + 1.5 × IQR) are drawn as individual points beyond the whisker.
library(ggplot2)
ggplot(data, aes(y = size)) +
  geom_boxplot(fill = "orange") +
  theme_minimal() +
  ggtitle("Boxplot of Size (Outlier Check)")
Since the cause of the outliers is unknown, they were retained, and a natural-log transformation was applied to the size column (stored as log_size) to reduce their impact on the analysis and improve the interpretability of the data.
#data transformation to reduce the impact of outliers
data$log_size <- log(data$size)
head(data)
## cohort_id cohort_code sdate start_date edate end_date size
## 1 Cohort# B456514 2023-04-03 1.68e+12 2023-05-01 1.68e+12 1500
## 2 Cohort# B328821 2023-01-16 1.67e+12 2023-02-20 1.68e+12 1000
## 3 Cohort# B289256 2022-11-14 1.67e+12 2022-12-15 1.67e+12 100000
## 4 Cohort# B0VCB0F 2022-09-22 1.66e+12 2022-09-22 1.66e+12 40
## 5 Cohort# B908347 2023-01-09 1.67e+12 2023-02-13 1.68e+12 100000
## 6 Cohort# B306047 2023-01-09 1.67e+12 2023-02-13 1.68e+12 10000
## log_size
## 1 7.313220
## 2 6.907755
## 3 11.512925
## 4 3.688879
## 5 11.512925
## 6 9.210340
The data set has no duplicates, either in cohort_code or across entire rows.
#checking for duplicates
sum(duplicated(data$cohort_code))
## [1] 0
sum(duplicated(data))
## [1] 0
EXPLORATORY DATA ANALYSIS (EDA)
DISTRIBUTION OF COHORT SIZES
The log transformation makes the distribution of cohort sizes easier to read, but the data remain multimodal, suggesting distinct groups or types of cohorts. The histogram and density plot of log_size illustrate this.
# Histogram for size
ggplot(data, aes(x = log_size)) +
  geom_histogram(binwidth = 3, fill = "skyblue", color = "black") +
  theme_minimal() +
  ggtitle("Histogram of log(Size)")
# Density plot for size
ggplot(data, aes(x = log_size)) +
  geom_density(fill = "lightblue") +
  theme_minimal() +
  ggtitle("Density Plot of log(Size)")
DISTRIBUTION OF COHORT START DATES OVER THE YEARS
The histogram below displays the distribution of cohort start dates over time. Most cohorts began between 2023 and 2025, which suggests a period of growth or expansion, possibly reflecting increased demand or a deliberate scale-up of the program. There were noticeably fewer cohorts in 2022, which may reflect the early stages of the program or a pilot phase. Because each bin spans 300 days, the final bar can extend visually into 2026 even though the latest start date in the data is 2025-08-07; the low count in that bar most likely reflects incomplete data for the most recent period rather than cohorts starting in 2026. Overall, cohort activity rises clearly after 2022 and is highest in the following years.
# Histogram of start date
ggplot(data, aes(x = sdate)) +
  geom_histogram(binwidth = 300, fill = "lightgreen", color = "black") +  # binwidth is in days
  theme_minimal() +
  ggtitle("Distribution of Start Dates")
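Wide histogram bins do not line up with calendar years, so a direct count per year is a useful cross-check on the reading of the plot. The sketch below uses a handful of hypothetical start dates purely to illustrate the technique.

```r
# Hypothetical start dates, for illustration only
toy_dates <- as.Date(c("2022-06-09", "2023-01-16", "2023-04-03",
                       "2024-03-03", "2025-08-07"))

# Extract the calendar year and count cohorts per year
year_counts <- table(format(toy_dates, "%Y"))
year_counts
```

On the report's data, toy_dates would be replaced by data$sdate.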
PAIR PLOT FOR VISUALIZING MULTIPLE RELATIONSHIPS
The pair plot shows the relationships among three variables: size, sdate_num (start date as days since the earliest start date), and edate_num (end date on the same scale). There is a strong positive correlation between sdate_num and edate_num (r = 0.941), indicating that start and end dates move closely together: cohorts that start later also tend to end later, which is expected. The correlations between size and both sdate_num (r = -0.313) and edate_num (r = -0.309) are negative and weak to moderate, suggesting that cohort sizes have tended to decrease over time. All three coefficients are flagged as statistically significant in the plot. The scatter plots also show cohort sizes concentrated at lower values, with only a few outliers reaching very large sizes. Overall, the analysis points to a temporal pattern in which more recent cohorts are smaller, while start and end dates remain closely aligned.
data$sdate <- as.Date(data$sdate)
data$edate <- as.Date(data$edate)
# Optional: convert dates to numeric days since min date
min_date <- min(data$sdate, na.rm = TRUE)
data$sdate_num <- as.numeric(data$sdate - min_date)
data$edate_num <- as.numeric(data$edate - min_date)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
ggpairs(data[, c("size", "sdate_num", "edate_num")])
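The coefficients shown in the pair plot are Pearson correlations, which can be sensitive to a few very large values. As a minimal sketch on hypothetical toy data (not the report's data), the same coefficient can be computed with cor(), alongside a rank-based Spearman coefficient for comparison:

```r
# Hypothetical toy values, for illustration only
toy <- data.frame(
  size      = c(100000, 10000, 1500, 800, 500, 40),
  sdate_num = c(10, 150, 300, 620, 900, 1100)
)

# Pearson correlation (the default, as reported by ggpairs)
pearson <- cor(toy$size, toy$sdate_num)

# Spearman rank correlation, less influenced by extreme sizes
spearman <- cor(toy$size, toy$sdate_num, method = "spearman")

c(pearson = pearson, spearman = spearman)
```

Comparing the two on the actual data would indicate how much of the negative relationship depends on the largest cohorts.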
data$month <- format(data$sdate, "%B")
data$month_num <- as.numeric(format(data$sdate, "%m"))
library(dplyr)
monthly_sizes <- data %>%
  group_by(month, month_num) %>%
  summarise(avg_size = mean(log_size, na.rm = TRUE),  # mean of log(size)
            total_size = sum(size, na.rm = TRUE),     # total participants (raw scale)
            .groups = "drop") %>%
  arrange(month_num)
DISTRIBUTION OF COHORT SIZES BY MONTH
The bar chart titled “Average Cohort Size by Month” shows how cohort sizes vary through the year. The bars are averages of log_size, not of raw size, so they compare typical cohort sizes on the log scale. February has the lowest average, indicating a dip in enrollment or participation during this month. October has the highest average, suggesting a peak in interest or enrollment. Cohort sizes fluctuate from month to month, with a general increase towards the end of the year. This pattern could be due to seasonal enrollment cycles, heightened marketing and recruitment efforts, or the popularity of certain programs during specific months, although the data alone cannot distinguish between these explanations.
library(ggplot2)
ggplot(monthly_sizes, aes(x = reorder(month, month_num), y = avg_size)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  theme_minimal() +
  labs(title = "Average Cohort Size by Month",
       x = "Month",
       y = "Mean log(Cohort Size)")
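Because the monthly averages are taken on the log scale, exponentiating them gives a geometric mean in participants rather than the ordinary arithmetic mean. The sketch below illustrates the difference on three hypothetical cohort sizes.

```r
# Hypothetical cohort sizes, for illustration only
sizes <- c(40, 1000, 100000)

mean_log   <- mean(log(sizes))   # average on the log scale
geo_mean   <- exp(mean_log)      # back-transformed: geometric mean
arith_mean <- mean(sizes)        # ordinary mean, dominated by the largest cohort

c(geometric = geo_mean, arithmetic = arith_mean)
```

The geometric mean is much less affected by the single very large cohort, which is one reason monthly rankings based on log_size can differ from rankings based on raw size.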