DATA LOADING AND FAMILIARIZATION

Cohort Data (cohort_data.csv): This dataset tracks cohort-based learning programs, including cohort sizes, timelines, and linked opportunities, which allows participation and completion to be analysed.

The original dataset contained five columns: cohort_id, cohort_code, start_date, end_date, and size. The cohort_id column holds the same value, “Cohort#”, in every row, so it acts as a generic label rather than a unique identifier. The cohort_code column holds a unique seven-character alphanumeric code for each cohort (for example, B0VCB0F). The start_date and end_date columns indicate when each cohort began and ended, while size is the number of participants in each cohort. The start_date and end_date values were stored as large numbers (about 1.68e+12), which is consistent with Unix timestamps in milliseconds, rather than in a human-readable date format. To address this, readable versions of the dates were added as two new columns, sdate and edate. This transformation was carried out in Excel before the data was imported into R, so the imported file has seven columns.

# Data loading and inspection
# read.csv() is part of base R, so no extra package needs to be loaded here
data <- read.csv("C:\\Users\\PC\\Downloads\\CohortRaw(in).csv")
head(data)
##   cohort_id cohort_code      sdate start_date      edate end_date   size
## 1   Cohort#     B456514 2023-04-03   1.68e+12 2023-05-01 1.68e+12   1500
## 2   Cohort#     B328821 2023-01-16   1.67e+12 2023-02-20 1.68e+12   1000
## 3   Cohort#     B289256 2022-11-14   1.67e+12 2022-12-15 1.67e+12 100000
## 4   Cohort#     B0VCB0F 2022-09-22   1.66e+12 2022-09-22 1.66e+12     40
## 5   Cohort#     B908347 2023-01-09   1.67e+12 2023-02-13 1.68e+12 100000
## 6   Cohort#     B306047 2023-01-09   1.67e+12 2023-02-13 1.68e+12  10000
tail(data)
##     cohort_id cohort_code      sdate start_date      edate end_date size
## 634   Cohort#     B1Z43A9 2025-04-13   1.74e+12 2025-06-07 1.75e+12 1500
## 635   Cohort#     BX9N6FW 2025-04-01   1.74e+12 2025-04-23 1.75e+12 1500
## 636   Cohort#     BM81DEC 2025-01-06   1.74e+12 2025-09-22 1.76e+12 1600
## 637   Cohort#     BK4VONV 2025-03-17   1.74e+12 2025-04-15 1.74e+12  800
## 638   Cohort#     BDFSXJ0 2025-08-07   1.75e+12 2025-10-07 1.76e+12 1500
## 639   Cohort#     BLD0LQ9 2025-05-12   1.75e+12 2025-06-10 1.75e+12  800
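
The raw start_date and end_date values shown above (about 1.68e+12) look like Unix timestamps in milliseconds, although the source does not state this. Under that assumption, the conversion that was done in Excel could also be sketched directly in R. The helper name ms_to_date and the sample values are illustrative, not part of the original analysis.

```r
# Assumption: the raw columns hold milliseconds since 1970-01-01 00:00:00 UTC.
# ms_to_date is a hypothetical helper name used only for this sketch.
ms_to_date <- function(ms) {
  # as.POSIXct() expects seconds, so divide the millisecond values by 1000
  as.Date(as.POSIXct(ms / 1000, origin = "1970-01-01", tz = "UTC"))
}

# Illustrative timestamps, not copied from the dataset
example_ms <- c(1680480000000, 1673827200000)
ms_to_date(example_ms)
```

If the assumption holds, applying the same helper to data$start_date and data$end_date would keep the whole workflow reproducible inside R.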

To better understand the variable types, the structure of the data was examined. The dataset contains 639 observations of 7 variables. Both cohort_id and cohort_code are stored as character (chr). The readable start and end dates, sdate and edate, are also stored as character at this stage. The original start_date and end_date are numeric (num), and size, the number of participants in each cohort, is an integer (int).

#structure of the data set
str(data)
## 'data.frame':    639 obs. of  7 variables:
##  $ cohort_id  : chr  "Cohort#" "Cohort#" "Cohort#" "Cohort#" ...
##  $ cohort_code: chr  "B456514" "B328821" "B289256" "B0VCB0F" ...
##  $ sdate      : chr  "2023-04-03" "2023-01-16" "2022-11-14" "2022-09-22" ...
##  $ start_date : num  1.68e+12 1.67e+12 1.67e+12 1.66e+12 1.67e+12 ...
##  $ edate      : chr  "2023-05-01" "2023-02-20" "2022-12-15" "2022-09-22" ...
##  $ end_date   : num  1.68e+12 1.68e+12 1.67e+12 1.66e+12 1.68e+12 ...
##  $ size       : int  1500 1000 100000 40 100000 10000 1500 100000 100000 10000 ...

DATA CLEANING

Because sdate and edate were read in as character strings, they were converted to the Date class so that they sort correctly and support date arithmetic. Summary statistics show that start dates range from 2022-06-09 to 2025-08-07. Cohort sizes range from 3 to 100,000, with a mean of 5,741 and a median of 800; the large gap between the two indicates a strongly right-skewed distribution. The first quartile (Q1) is 500 and the third quartile (Q3) is 1,500, giving an interquartile range (IQR) of 1,000. Under the 1.5 × IQR rule the lower fence is 500 - 1,500 = -1,000 and the upper fence is 1,500 + 1,500 = 3,000, so cohorts larger than 3,000 are treated as outliers, and no cohort can fall below the lower fence.

#changing datatype for date column to date 
data$sdate <- as.Date(data$sdate, format = "%Y-%m-%d")
data$edate <- as.Date(data$edate, format = "%Y-%m-%d")
#confirmation
str(data)
## 'data.frame':    639 obs. of  7 variables:
##  $ cohort_id  : chr  "Cohort#" "Cohort#" "Cohort#" "Cohort#" ...
##  $ cohort_code: chr  "B456514" "B328821" "B289256" "B0VCB0F" ...
##  $ sdate      : Date, format: "2023-04-03" "2023-01-16" ...
##  $ start_date : num  1.68e+12 1.67e+12 1.67e+12 1.66e+12 1.67e+12 ...
##  $ edate      : Date, format: "2023-05-01" "2023-02-20" ...
##  $ end_date   : num  1.68e+12 1.68e+12 1.67e+12 1.66e+12 1.68e+12 ...
##  $ size       : int  1500 1000 100000 40 100000 10000 1500 100000 100000 10000 ...
summary(data)
##   cohort_id         cohort_code            sdate              start_date       
##  Length:639         Length:639         Min.   :2022-06-09   Min.   :1.650e+12  
##  Class :character   Class :character   1st Qu.:2023-05-03   1st Qu.:1.680e+12  
##  Mode  :character   Mode  :character   Median :2024-03-03   Median :1.710e+12  
##                                        Mean   :2024-01-26   Mean   :1.706e+12  
##                                        3rd Qu.:2024-10-21   3rd Qu.:1.730e+12  
##                                        Max.   :2025-08-07   Max.   :1.750e+12  
##      edate               end_date              size       
##  Min.   :2022-06-10   Min.   :1.650e+12   Min.   :     3  
##  1st Qu.:2023-06-09   1st Qu.:1.690e+12   1st Qu.:   500  
##  Median :2024-04-16   Median :1.710e+12   Median :   800  
##  Mean   :2024-03-23   Mean   :1.712e+12   Mean   :  5741  
##  3rd Qu.:2024-12-15   3rd Qu.:1.730e+12   3rd Qu.:  1500  
##  Max.   :2026-03-06   Max.   :1.770e+12   Max.   :100000

The data set contains no missing values.

#checking for missing values
sum(is.na(data))
## [1] 0

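Outliers were checked with the 1.5 × IQR rule applied to every numeric column. The function returns row positions rather than values: start_date and end_date have no outliers, while 47 rows are flagged for size.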

#checking for outliers
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Apply IQR rule to all numeric columns
numeric_cols <- select_if(data, is.numeric)

# Function to detect outliers
find_outliers <- function(x) {
  q1 <- quantile(x, 0.25, na.rm = TRUE)
  q3 <- quantile(x, 0.75, na.rm = TRUE)
  iqr <- q3 - q1
  which(x < (q1 - 1.5 * iqr) | x > (q3 + 1.5 * iqr))
}

# Check for outliers per column
lapply(numeric_cols, find_outliers)
## $start_date
## integer(0)
## 
## $end_date
## integer(0)
## 
## $size
##  [1]   3   5   6   8   9  10  12  17  22  28  33  35  39  43  47  49  50  57  58
## [20]  61  66  70  71  73  74  77  78  83  84  85  88  89  94  95  96 104 105 114
## [39] 172 265 290 313 381 419 504 523 530
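
The cutoffs behind the output above can be worked out by hand from the quartiles that summary(data) reports for size. The sketch below does that arithmetic and then, on a small made-up vector, returns the flagged values themselves instead of their row positions.

```r
# Quartiles of size as reported by summary(data)
q1 <- 500
q3 <- 1500
iqr <- q3 - q1                     # interquartile range

lower_fence <- q1 - 1.5 * iqr      # values below this are flagged
upper_fence <- q3 + 1.5 * iqr      # values above this are flagged

# Toy vector (illustrative only, not the real size column)
sizes <- c(3, 40, 800, 1500, 10000, 100000)

# Index the vector with the logical condition to get values, not positions
flagged <- sizes[sizes < lower_fence | sizes > upper_fence]
flagged
```

Because the lower fence is negative and sizes are counts, only large cohorts can be flagged by this rule.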

The box plot below confirms the presence of outliers in size. The points plotted above the upper whisker correspond to cohorts larger than the upper fence of about 3,000 (Q3 + 1.5 × IQR), not merely larger than Q3 (1,500).

library(ggplot2)

ggplot(data, aes(y = size)) +
  geom_boxplot(fill = "orange") +
  theme_minimal() +
  ggtitle("Boxplot of Size (Outlier Check)")

Since the cause of the outliers is unknown, a log transformation was applied to the size column to reduce their impact on the analysis and improve the interpretability of the data.

#data transformation to reduce the impact of outliers 
data$log_size <- log(data$size)
head(data)
##   cohort_id cohort_code      sdate start_date      edate end_date   size
## 1   Cohort#     B456514 2023-04-03   1.68e+12 2023-05-01 1.68e+12   1500
## 2   Cohort#     B328821 2023-01-16   1.67e+12 2023-02-20 1.68e+12   1000
## 3   Cohort#     B289256 2022-11-14   1.67e+12 2022-12-15 1.67e+12 100000
## 4   Cohort#     B0VCB0F 2022-09-22   1.66e+12 2022-09-22 1.66e+12     40
## 5   Cohort#     B908347 2023-01-09   1.67e+12 2023-02-13 1.68e+12 100000
## 6   Cohort#     B306047 2023-01-09   1.67e+12 2023-02-13 1.68e+12  10000
##    log_size
## 1  7.313220
## 2  6.907755
## 3 11.512925
## 4  3.688879
## 5 11.512925
## 6  9.210340
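
As a small illustration of what the transformation does, the sketch below applies the natural log to a few made-up sizes within the observed range of 3 to 100,000 and then reverses it. The mention of log1p() is an aside for data that could contain zeros; the source reports a minimum size of 3, so it is not needed here.

```r
# Illustrative sizes within the observed range (not rows from the dataset)
sizes <- c(3, 40, 1500, 100000)

# Natural log, the same function used to create log_size
log_sizes <- log(sizes)
round(log_sizes, 2)

# exp() reverses the transformation, so summaries computed on the log scale
# can be reported back in numbers of participants
back <- exp(log_sizes)

# log(0) is -Inf; log1p(x) computes log(1 + x) and is one common workaround
# if a size of zero were ever possible
log1p(0)
```

The transformed values span roughly 1 to 11.5, compared with 3 to 100,000 on the raw scale, which is why the largest cohorts dominate plots and means much less after the transformation.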

The data set has no duplicates, either in cohort_code or across entire rows.

#checking for duplicates 
sum(duplicated(data$cohort_code))
## [1] 0
sum(duplicated(data))
## [1] 0

EXPLORATORY DATA ANALYSIS (EDA)

DISTRIBUTION OF COHORT SIZES

# Histogram of log-transformed size (binwidth is in log units)
ggplot(data, aes(x = log_size)) + 
  geom_histogram(binwidth = 3, fill = "skyblue", color = "black") +
  theme_minimal() +
  ggtitle("Histogram of log(Size)")

# Density plot of log-transformed size
ggplot(data, aes(x = log_size)) + 
  geom_density(fill = "lightblue") +
  theme_minimal() +
  ggtitle("Density Plot of log(Size)")

DISTRIBUTION OF COHORT START DATES OVER THE YEARS

The histogram below displays the distribution of cohort start dates over time. Most cohorts began between 2023 and 2025: the first quartile of sdate is 2023-05-03, so about three quarters of all cohorts started on or after that date. This suggests a period of growth during these years, possibly reflecting increased demand or a deliberate scale-up of the program. In contrast, far fewer cohorts started in 2022 (the earliest start date is 2022-06-09), which may indicate an early or pilot phase. No cohort in the data starts in 2026, since the latest start date is 2025-08-07. Any low bar at the right edge of the histogram therefore more likely reflects the wide 300-day bins and incomplete coverage of the most recent period than a real decline. Overall, cohort activity rises clearly after 2022.

# Histogram of start date
ggplot(data, aes(x = sdate)) + 
  geom_histogram(binwidth = 300, fill = "lightgreen", color = "black") +
  theme_minimal() +
  ggtitle("Distribution of Start Dates")
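
Because a 300-day bin does not line up with calendar years, a plain count of cohorts per start year can be a useful cross-check on the histogram. The sketch below uses a handful of made-up dates; with the real data the same two lines would be applied to data$sdate.

```r
# Illustrative start dates (not the real data)
sdate <- as.Date(c("2022-09-22", "2023-01-16", "2023-04-03",
                   "2024-03-03", "2024-10-21", "2025-08-07"))

# Extract the calendar year as text and count cohorts per year
start_year <- format(sdate, "%Y")
cohorts_per_year <- table(start_year)
cohorts_per_year
```

A table like this makes it easy to see whether a given year actually contains any cohorts, independently of how the histogram bins happen to fall.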

PAIR PLOT FOR VISUALIZING MULTIPLE RELATIONSHIPS

The pair plot shows the relationships among three variables: size, sdate_num (start date in days since the earliest start date), and edate_num (end date on the same scale). There is a strong positive correlation (r = 0.941***) between sdate_num and edate_num: cohorts that start later also end later, which is expected. Size has weak-to-moderate negative correlations with both sdate_num (r = -0.313***) and edate_num (r = -0.309***), suggesting that more recent cohorts tend to be smaller. These are Pearson correlations on the raw size values, so the few very large cohorts (up to 100,000 participants) may account for much of this pattern. The scatter plots also show that most cohort sizes are concentrated at low values, with only a few very large outliers.

data$sdate <- as.Date(data$sdate)
data$edate <- as.Date(data$edate)
# Convert dates to numeric days since the earliest start date (needed for the correlations)
min_date <- min(data$sdate, na.rm = TRUE)

data$sdate_num <- as.numeric(data$sdate - min_date)
data$edate_num <- as.numeric(data$edate - min_date)

library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
ggpairs(data[, c("size", "sdate_num", "edate_num")])
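
Since size contains extreme values, a rank-based correlation is one way to check whether the negative relationship with time depends on a few very large cohorts. The sketch below is a suggested robustness check, not something the original analysis ran. It compares Pearson and Spearman correlations on made-up data using base R's cor().

```r
# Toy data (illustrative only): one very large early cohort, then smaller ones
size      <- c(100000, 1500, 1200, 1000, 800, 500)
sdate_num <- c(0, 100, 200, 300, 400, 500)

# Pearson uses the raw values, so the 100000 cohort has a large influence
pearson <- cor(size, sdate_num)

# Spearman uses ranks only, so the magnitude of the extreme value is ignored
spearman <- cor(size, sdate_num, method = "spearman")

c(pearson = pearson, spearman = spearman)
```

On the real data, the same comparison could be made with data$size (or data$log_size) and data$sdate_num. If the two coefficients are similar, that would support the reading that the downward trend in size is not driven by the outliers alone.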

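# Month name and month number of each cohort's start date
data$month <- format(data$sdate, "%B")  
data$month_num <- as.numeric(format(data$sdate, "%m"))
library(dplyr)

# Monthly summaries: avg_size is the mean of log(size), so it is on the log scale;
# total_size sums the raw sizes, because a sum of logs is not a participant total
monthly_sizes <- data %>%
  group_by(month, month_num) %>%
  summarise(avg_size = mean(log_size, na.rm = TRUE),
            total_size = sum(size, na.rm = TRUE),
            .groups = 'drop') %>%
  arrange(month_num)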

DISTRIBUTION OF COHORT SIZES BY MONTH

The bar chart titled “Average Cohort Size by Month” shows how cohort size varies with the month in which a cohort starts. The plotted values are monthly means of the log-transformed size, so the bar heights are on the log scale rather than in numbers of participants. February has the lowest average, indicating a dip in enrollment or participation for cohorts starting in that month, while October has the highest, suggesting a peak in enrollment. Sizes fluctuate across the year, with a general increase towards the end of the year. This pattern could reflect seasonal enrollment cycles, stronger marketing and recruitment efforts, or the popularity of particular programs in specific months, although the data alone cannot distinguish between these explanations.

library(ggplot2)

ggplot(monthly_sizes, aes(x = reorder(month, month_num), y = avg_size)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  theme_minimal() +
  labs(title = "Average Cohort Size by Month",
       x = "Month",
       y = "Mean log(Cohort Size)")
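
Because avg_size is a mean of logged values, one way to report it in participants is to exponentiate it, which gives the geometric mean of size. The sketch below shows the idea on made-up data with base R only. The data frame toy and its values are illustrative, and the geometric mean is offered as an option rather than as the method used in the original analysis.

```r
# Toy data (illustrative only)
toy <- data.frame(
  month = c("January", "January", "October", "October"),
  size  = c(100, 10000, 1000, 1000)
)

# Order months by calendar position rather than alphabetically
toy$month <- factor(toy$month, levels = month.name)

# Geometric mean per month: exp(mean(log(size))), expressed in participants
geo_mean <- tapply(toy$size, toy$month, function(x) exp(mean(log(x))))
geo_mean <- geo_mean[!is.na(geo_mean)]   # drop months absent from the toy data
geo_mean

# Arithmetic mean for comparison: the single 10000 cohort pulls January upward
arith_mean <- tapply(toy$size, toy$month, mean)
arith_mean[!is.na(arith_mean)]
```

In this toy example both months have the same geometric mean, while the arithmetic mean for January is several times larger. This shows how much a single very large cohort can shift a monthly average on the raw scale.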