library(tidyverse)
library(magrittr)
library(cluster)
library(factoextra)
library(pscl)
library(boot)
library(dbscan)
library(dendextend)
library(fpc)
course_signal <- googledrive::drive_read_raw("Course_Signals.csv") %>%
readr::read_csv(., col_names = TRUE)
# There are random dates scattered along, dropping these.
cleaned_df <- course_signal[as.factor(substr(course_signal$COURSE, 4, 4)) != "-", ]
mod_df <- cleaned_df %>%
reshape2::melt() %>%
mutate(course_type = as.factor(substr(COURSE, 1, 3)),
course_level = as.factor(substr(COURSE, 4, 4))) %>%
rename(full_course_name = COURSE,
semester = variable,
n = value,
course_type = course_type,
course_level = course_level
)
# 60% of the values are 0
round(100*sum(mod_df$n == 0)/nrow(mod_df),2)
## [1] 60.35
# 0s are more common in upper-level courses
mod_df %>%
filter(n == 0) %>%
ggplot(., aes(x = semester)) +
geom_bar() +
facet_wrap(~course_level) +
theme_light() +
ggtitle("Distribution of 0s Across Course Levels")
mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
# Median number of students over 4 years is 18 years, mean is ~63 (highly skewed)
median_enrollment <- course_signal %>%
mutate(total_n_per_across_all_semesters = rowSums(course_signal[,-1])) %>%
select(total_n_per_across_all_semesters) %>%
unlist() %>%
as.vector() %>%
median()
course_signal %>%
mutate(total_n_per_across_all_semesters = rowSums(course_signal[,-1])) %>%
ggplot(., aes(total_n_per_across_all_semesters)) +
geom_density() +
geom_vline(xintercept = median_enrollment) +
xlim(0,20)
# Least number of 0 classes: ACG
# Highest number of 0 classes: MGF
class_type_freq <- mod_df %>%
filter(n == 0) %>%
select(course_type) %>%
mode() %>%
rownames_to_column() %>%
rename(count = rowname) %>%
mutate(across(count, as.numeric))
head(class_type_freq)
## # A tibble: 6 × 2
## count course_type
## <dbl> <fct>
## 1 1 ACG
## 2 13 ADV
## 3 19 AFH
## 4 20 AFR
## 5 31 AMH
## 6 41 AML
tail(class_type_freq)
## # A tibble: 6 × 2
## count course_type
## <dbl> <fct>
## 1 3039 EIN
## 2 3099 ESC
## 3 3168 GEA
## 4 3664 PEL
## 5 9010 EDF
## 6 12175 MGF
# Highly skewed data (ZIP?)
ggplot(mod_df, aes(n)) +
geom_histogram() +
theme_light()
# Highly skewed ECDF
ggplot(mod_df, aes(n)) +
stat_ecdf(geom = "point") +
theme_light()
Upper level courses are indicated by higher enrollment values in later semesters. Conversely, lower level courses are indicated by higher enrollment values in earlier semesters. Using row_sum = 12 as the cutoff allows all students an opportunity to have taken the course. However, in some upper level electives, this cutoff point is likely to result in inappropriate exclusions, as such, I will consider a cutoff value of row_sum = 4. The foregoing cutoff would better be able to include upper-level nieche electives that would otherwise the excluded. Also the pdf of rowSums seems to be centered around row_sum = 4.
Alternatively, I will also consider row_sum = 16, as that is the median of the distribution of rowSums.
# *row_sum = 4*
c_s_course_signals_4 <-
cleaned_df[rowSums(cleaned_df[,-1]) >= 4,] %>%
mutate(round(across(where(is.numeric))/rowSums(across(where(is.numeric))),5)) %>%
mutate(course_type = as.factor(substr(COURSE, 1, 3)),
course_level = as.factor(substr(COURSE, 4, 4))) %>%
mutate(across(where(is.numeric), scale)) %>%
select(-COURSE)
# *row_sum = 16*
c_s_course_signals_16 <-
cleaned_df[rowSums(cleaned_df[,-1]) >= 16,] %>%
mutate(round(across(where(is.numeric))/rowSums(across(where(is.numeric))),5)) %>%
mutate(course_type = as.factor(substr(COURSE, 1, 3)),
course_level = as.factor(substr(COURSE, 4, 4))) %>%
mutate(across(where(is.numeric), scale)) %>%
select(-COURSE)
Because we would like to incorporate features that are categorical, k-means clustering may not be appropriate. Will consider hierarchical clustering.
Method of single linkage or nearest neighbour. Proximity between two clusters is the proximity between their two closest objects. This value is one of values of the input matrix. The conceptual metaphor of this build of cluster, its archetype, is spectrum or chain. Chains could be straight or curvilinear, or could be like “snowflake” or “amoeba” view. Two most dissimilar cluster members can happen to be very much dissimilar in comparison to two most similar. Single linkage method controls only nearest neighbours similarity.
Method of complete linkage or farthest neighbour. Proximity between two clusters is the proximity between their two most distant objects. This value is one of values of the input matrix. The metaphor of this build of cluster is circle (in the sense, by hobby or plot) where two most distant from each other members cannot be much more dissimilar than other quite dissimilar pairs (as in circle). Such clusters are “compact” contours by their borders, but they are not necessarily compact inside.
Simple average, Proximity between two clusters is the arithmetic mean of all the proximities between the objects of one, on one side, and the objects of the other, on the other side; while the subclusters of which each of these two clusters were merged recently have equalized influence on that proximity – even if the subclusters differed in the number of objects.
Ward’s method, or minimal increase of sum-of-squares (MISSQ), sometimes incorrectly called “minimum variance” method. Proximity between two clusters is the magnitude by which the summed square in their joint cluster will be greater than the combined summed square in these two clusters: (Between two singleton objects this quantity = squared euclidean distance / 2.) The metaphor of this build of cluster is type. Intuitively, a type is a cloud more dense and more concentric towards its middle, whereas marginal points are few and could be scattered relatively freely.
# methods to assess
m <- c("average", "single", "complete", "ward")
names(m) <- c("average", "single", "complete", "ward")
# function to compute coefficient
ac <- function(x) {
agnes(c_s_course_signals_4, method = x)$ac
}
# get agglomerative coefficient for each linkage method
# The agglomerative coefficient measures the dissimilarity of an object to the first cluster it joins, divided by the dissimilarity of the final merger in the cluster analysis, averaged across all samples
# Low values reflect tight clustering of objects, larger values indicate less well-formed clusters.
purrr::map_dbl(m, ac)
## average single complete ward
## 0.9784912 0.7911173 0.9881440 0.9991327
# Ward gave highest AC, will proceed with single.
hc_4 <- agnes(c_s_course_signals_4, method = "single")
hc_4$ac
## [1] 0.7911173
pltree(hc_4, cex = 0.6, hang = -1, main = "Dendrogram of agnes")
rect.hclust(hc_4, k = 4)
sub_grp_4<- cutree(hc_4, k = 4)
table(sub_grp_4)
## sub_grp_4
## 1 2 3 4
## 1413 1 1 1
# methods to assess
m <- c("average", "single", "complete", "ward")
names(m) <- c("average", "single", "complete", "ward")
# function to compute coefficient
ac <- function(x) {
agnes(c_s_course_signals_16, method = x)$ac
}
# get agglomerative coefficient for each linkage method
purrr::map_dbl(m, ac)
## average single complete ward
## 0.9757546 0.7830484 0.9870114 0.9988339
# Ward gave highest AC, will proceed with single.
hc_16 <- agnes(c_s_course_signals_16, method = "single")
hc_16$ac
## [1] 0.7830484
pltree(hc_16, cex = 0.6, hang = -1, main = "Dendrogram of agnes")
rect.hclust(hc_16, k = 4)
sub_grp_16 <- cutree(hc_16, k = 4)
table(sub_grp_16)
## sub_grp_16
## 1 2 3 4
## 917 1 1 1
Letting the cutoff be \(\geq 16\) may have been too aggressive, clustering behavior and relative size seems similar to \(\geq 4\). As such, will proceed with \(\geq 4\) as the cutoff point.
sub_grp_4 <- cutree(hc_4, k = 4)
table(sub_grp_4)
## sub_grp_4
## 1 2 3 4
## 1413 1 1 1
labeled_data <-
cleaned_df[rowSums(cleaned_df[,-1]) >= 4,] %>%
mutate(course_type = as.factor(substr(COURSE, 1, 3)),
course_level = as.factor(substr(COURSE, 4, 4))) %>%
mutate(cluster = sub_grp_4)
labeled_data %>%
dplyr::select(-'cluster', -'COURSE', -'course_type', -'course_level') %>%
aggregate(., by=list(cluster=labeled_data$cluster), mean)
## # A tibble: 4 × 13
## cluster `1` `2` `3` `4` `5` `6` `7` `8` `9` `10` `11`
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 12.7 11.8 2.33 9.51 8.69 2.54 7.68 7.16 2.83 6.15 4.95
## 2 2 0 2 3 0 0 0 0 0 0 0 0
## 3 3 0 0 7 0 0 6 0 0 3 2 0
## 4 4 6 0 0 0 0 0 0 0 0 0 0
## # … with 1 more variable: `12` <dbl>
labeled_data %>%
ggplot(., aes(x = course_level)) +
geom_bar() +
facet_wrap(~cluster) +
ggtitle("Course level distribution among clusters")
labeled_data %>%
select(2:13, 16) %>%
mutate(cluster = as.factor(cluster)) %>%
reshape2::melt(value.name = "enrollment") %>%
select(-"variable") %>%
ggplot(., aes(x = enrollment)) +
geom_histogram(bins = 10, color = "white") +
xlim(0,10) +
ylim(0,300) +
facet_wrap(~cluster) +
ggtitle("Course enrollment distribution among clusters")
temp_df <- labeled_data %>%
mutate(total_enrollment = rowSums(labeled_data[2:13])) %>%
group_split(cluster)
labeled_data %>%
group_by(cluster) %>%
summarise(most_freq_course_type = mode(course_type),
most_freq_course_level = mode(course_level)) %>%
cbind(., mean_enrollment_4_years =
sapply(temp_df, function(x) mean(x$total_enrollment)),
median_enrollment_4_years =
sapply(temp_df, function(x) median(x$total_enrollment)))
## # A tibble: 4 × 5
## cluster most_freq_course_type most_freq_course_level mean_enrollment…¹ media…²
## <int> <fct> <fct> <dbl> <dbl>
## 1 1 ANT 4 77.9 26
## 2 2 GER 1 5 5
## 3 3 MUG 2 19 19
## 4 4 PEM 2 6 6
## # … with abbreviated variable names ¹mean_enrollment_4_years,
## # ²median_enrollment_4_years
analysis_script <- function(.x) {
course_signal <- googledrive::drive_read_raw(.x) %>%
readr::read_csv(., col_names = TRUE, show_col_types = FALSE)
cleaned_df <- course_signal[as.factor(substr(course_signal$COURSE, 4, 4)) != "-", ]
c_s_course_signals_4 <-
cleaned_df[rowSums(cleaned_df[,-1]) >= 4,] %>%
mutate(round(across(where(is.numeric))/rowSums(across(where(is.numeric))),5)) %>%
mutate(course_type = as.factor(substr(COURSE, 1, 3)),
course_level = as.factor(substr(COURSE, 4, 4))) %>%
mutate(across(where(is.numeric), scale)) %>%
select(-COURSE)
hc_4 <- agnes(c_s_course_signals_4, method = "single")
sub_grp_4 <- cutree(hc_4, k = 4)
labeled_data <-
cleaned_df[rowSums(cleaned_df[,-1]) >= 4,] %>%
mutate(course_type = as.factor(substr(COURSE, 1, 3)),
course_level = as.factor(substr(COURSE, 4, 4))) %>%
mutate(cluster = sub_grp_4)
temp_df <- labeled_data %>%
mutate(total_enrollment = rowSums(labeled_data[2:13])) %>%
group_split(cluster)
# Stability Analysis of Clusters
cboot.hclust <- clusterboot(c_s_course_signals_4[1:12],
clustermethod = hclustCBI,
method = "single",
k = 4,
showplots=FALSE)
out_df <- labeled_data %>%
group_by(cluster) %>%
summarise(most_freq_course_type = mode(course_type),
most_freq_course_level = mode(course_level)) %>%
cbind(., mean_enrollment_4_years =
sapply(temp_df, function(x) mean(x$total_enrollment)),
median_enrollment_4_years =
sapply(temp_df, function(x) median(x$total_enrollment))) %>%
mutate(college_name = .x) %>%
mutate(stability = cboot.hclust$bootmean )
pltree(hc_4, cex = 0.6, hang = -1, main = paste("Dendrogram of agnes", .x))
rect.hclust(hc_4, k = 4)
return(out_df)
}
# Names of data files
names <- c("Course_Signals_UKCOH.csv",
"Course_Signals_CEPS.csv",
"Course_Signals_COB.csv",
"Course_Signals_HMCSE.csv",
"Course_Signals_CASSH.csv"
)
analysis_across_schools <- purrr::map(names, analysis_script) %>%
lapply(.,as_tibble)
## boot 1
## boot 2
## boot 3
## boot 4
## boot 5
## boot 6
## boot 7
## boot 8
## boot 9
## boot 10
## boot 11
## boot 12
## boot 13
## boot 14
## boot 15
## boot 16
## boot 17
## boot 18
## boot 19
## boot 20
## boot 21
## boot 22
## boot 23
## boot 24
## boot 25
## boot 26
## boot 27
## boot 28
## boot 29
## boot 30
## boot 31
## boot 32
## boot 33
## boot 34
## boot 35
## boot 36
## boot 37
## boot 38
## boot 39
## boot 40
## boot 41
## boot 42
## boot 43
## boot 44
## boot 45
## boot 46
## boot 47
## boot 48
## boot 49
## boot 50
## boot 51
## boot 52
## boot 53
## boot 54
## boot 55
## boot 56
## boot 57
## boot 58
## boot 59
## boot 60
## boot 61
## boot 62
## boot 63
## boot 64
## boot 65
## boot 66
## boot 67
## boot 68
## boot 69
## boot 70
## boot 71
## boot 72
## boot 73
## boot 74
## boot 75
## boot 76
## boot 77
## boot 78
## boot 79
## boot 80
## boot 81
## boot 82
## boot 83
## boot 84
## boot 85
## boot 86
## boot 87
## boot 88
## boot 89
## boot 90
## boot 91
## boot 92
## boot 93
## boot 94
## boot 95
## boot 96
## boot 97
## boot 98
## boot 99
## boot 100
## boot 1
## boot 2
## boot 3
## boot 4
## boot 5
## boot 6
## boot 7
## boot 8
## boot 9
## boot 10
## boot 11
## boot 12
## boot 13
## boot 14
## boot 15
## boot 16
## boot 17
## boot 18
## boot 19
## boot 20
## boot 21
## boot 22
## boot 23
## boot 24
## boot 25
## boot 26
## boot 27
## boot 28
## boot 29
## boot 30
## boot 31
## boot 32
## boot 33
## boot 34
## boot 35
## boot 36
## boot 37
## boot 38
## boot 39
## boot 40
## boot 41
## boot 42
## boot 43
## boot 44
## boot 45
## boot 46
## boot 47
## boot 48
## boot 49
## boot 50
## boot 51
## boot 52
## boot 53
## boot 54
## boot 55
## boot 56
## boot 57
## boot 58
## boot 59
## boot 60
## boot 61
## boot 62
## boot 63
## boot 64
## boot 65
## boot 66
## boot 67
## boot 68
## boot 69
## boot 70
## boot 71
## boot 72
## boot 73
## boot 74
## boot 75
## boot 76
## boot 77
## boot 78
## boot 79
## boot 80
## boot 81
## boot 82
## boot 83
## boot 84
## boot 85
## boot 86
## boot 87
## boot 88
## boot 89
## boot 90
## boot 91
## boot 92
## boot 93
## boot 94
## boot 95
## boot 96
## boot 97
## boot 98
## boot 99
## boot 100
## boot 1
## boot 2
## boot 3
## boot 4
## boot 5
## boot 6
## boot 7
## boot 8
## boot 9
## boot 10
## boot 11
## boot 12
## boot 13
## boot 14
## boot 15
## boot 16
## boot 17
## boot 18
## boot 19
## boot 20
## boot 21
## boot 22
## boot 23
## boot 24
## boot 25
## boot 26
## boot 27
## boot 28
## boot 29
## boot 30
## boot 31
## boot 32
## boot 33
## boot 34
## boot 35
## boot 36
## boot 37
## boot 38
## boot 39
## boot 40
## boot 41
## boot 42
## boot 43
## boot 44
## boot 45
## boot 46
## boot 47
## boot 48
## boot 49
## boot 50
## boot 51
## boot 52
## boot 53
## boot 54
## boot 55
## boot 56
## boot 57
## boot 58
## boot 59
## boot 60
## boot 61
## boot 62
## boot 63
## boot 64
## boot 65
## boot 66
## boot 67
## boot 68
## boot 69
## boot 70
## boot 71
## boot 72
## boot 73
## boot 74
## boot 75
## boot 76
## boot 77
## boot 78
## boot 79
## boot 80
## boot 81
## boot 82
## boot 83
## boot 84
## boot 85
## boot 86
## boot 87
## boot 88
## boot 89
## boot 90
## boot 91
## boot 92
## boot 93
## boot 94
## boot 95
## boot 96
## boot 97
## boot 98
## boot 99
## boot 100
## boot 1
## boot 2
## boot 3
## boot 4
## boot 5
## boot 6
## boot 7
## boot 8
## boot 9
## boot 10
## boot 11
## boot 12
## boot 13
## boot 14
## boot 15
## boot 16
## boot 17
## boot 18
## boot 19
## boot 20
## boot 21
## boot 22
## boot 23
## boot 24
## boot 25
## boot 26
## boot 27
## boot 28
## boot 29
## boot 30
## boot 31
## boot 32
## boot 33
## boot 34
## boot 35
## boot 36
## boot 37
## boot 38
## boot 39
## boot 40
## boot 41
## boot 42
## boot 43
## boot 44
## boot 45
## boot 46
## boot 47
## boot 48
## boot 49
## boot 50
## boot 51
## boot 52
## boot 53
## boot 54
## boot 55
## boot 56
## boot 57
## boot 58
## boot 59
## boot 60
## boot 61
## boot 62
## boot 63
## boot 64
## boot 65
## boot 66
## boot 67
## boot 68
## boot 69
## boot 70
## boot 71
## boot 72
## boot 73
## boot 74
## boot 75
## boot 76
## boot 77
## boot 78
## boot 79
## boot 80
## boot 81
## boot 82
## boot 83
## boot 84
## boot 85
## boot 86
## boot 87
## boot 88
## boot 89
## boot 90
## boot 91
## boot 92
## boot 93
## boot 94
## boot 95
## boot 96
## boot 97
## boot 98
## boot 99
## boot 100
## boot 1
## boot 2
## boot 3
## boot 4
## boot 5
## boot 6
## boot 7
## boot 8
## boot 9
## boot 10
## boot 11
## boot 12
## boot 13
## boot 14
## boot 15
## boot 16
## boot 17
## boot 18
## boot 19
## boot 20
## boot 21
## boot 22
## boot 23
## boot 24
## boot 25
## boot 26
## boot 27
## boot 28
## boot 29
## boot 30
## boot 31
## boot 32
## boot 33
## boot 34
## boot 35
## boot 36
## boot 37
## boot 38
## boot 39
## boot 40
## boot 41
## boot 42
## boot 43
## boot 44
## boot 45
## boot 46
## boot 47
## boot 48
## boot 49
## boot 50
## boot 51
## boot 52
## boot 53
## boot 54
## boot 55
## boot 56
## boot 57
## boot 58
## boot 59
## boot 60
## boot 61
## boot 62
## boot 63
## boot 64
## boot 65
## boot 66
## boot 67
## boot 68
## boot 69
## boot 70
## boot 71
## boot 72
## boot 73
## boot 74
## boot 75
## boot 76
## boot 77
## boot 78
## boot 79
## boot 80
## boot 81
## boot 82
## boot 83
## boot 84
## boot 85
## boot 86
## boot 87
## boot 88
## boot 89
## boot 90
## boot 91
## boot 92
## boot 93
## boot 94
## boot 95
## boot 96
## boot 97
## boot 98
## boot 99
## boot 100
print(analysis_across_schools)
## [[1]]
## # A tibble: 4 × 7
## cluster most_freq_course_type most_freq_cour…¹ mean_…² media…³ colle…⁴ stabi…⁵
## <int> <fct> <fct> <dbl> <dbl> <chr> <dbl>
## 1 1 HSC 4 40.5 13 Course… 0.990
## 2 2 ATR 4 18 18 Course… 0.391
## 3 3 ECO 4 4 4 Course… 0.600
## 4 4 PEM 2 4 4 Course… 0.544
## # … with abbreviated variable names ¹most_freq_course_level,
## # ²mean_enrollment_4_years, ³median_enrollment_4_years, ⁴college_name,
## # ⁵stability
##
## [[2]]
## # A tibble: 4 × 7
## cluster most_freq_course_type most_freq_cour…¹ mean_…² media…³ colle…⁴ stabi…⁵
## <int> <fct> <fct> <dbl> <dbl> <chr> <dbl>
## 1 1 PLA 3 23.2 11 Course… 0.993
## 2 2 FIN 3 4 4 Course… 0.781
## 3 3 MSL 4 6 6 Course… 0.440
## 4 4 SYP 3 4 4 Course… 0.65
## # … with abbreviated variable names ¹most_freq_course_level,
## # ²mean_enrollment_4_years, ³median_enrollment_4_years, ⁴college_name,
## # ⁵stability
##
## [[3]]
## # A tibble: 4 × 7
## cluster most_freq_course_type most_freq_cour…¹ mean_…² media…³ colle…⁴ stabi…⁵
## <int> <fct> <fct> <dbl> <dbl> <chr> <dbl>
## 1 1 HFT 4 33.6 10 Course… 0.984
## 2 2 IDH 1 13 13 Course… 0.157
## 3 3 ISM 3 8 8 Course… 0.531
## 4 4 PEM 1 4 4 Course… 0.64
## # … with abbreviated variable names ¹most_freq_course_level,
## # ²mean_enrollment_4_years, ³median_enrollment_4_years, ⁴college_name,
## # ⁵stability
##
## [[4]]
## # A tibble: 4 × 7
## cluster most_freq_course_type most_freq_cour…¹ mean_…² media…³ colle…⁴ stabi…⁵
## <int> <fct> <fct> <dbl> <dbl> <chr> <dbl>
## 1 1 PCB 4 48.8 14 Course… 0.995
## 2 2 AMH 4 4 4 Course… 0.101
## 3 3 JPN 3 6 6 Course… 0.69
## 4 4 MSL 1 10 10 Course… 0.69
## # … with abbreviated variable names ¹most_freq_course_level,
## # ²mean_enrollment_4_years, ³median_enrollment_4_years, ⁴college_name,
## # ⁵stability
##
## [[5]]
## # A tibble: 4 × 7
## cluster most_freq_course_type most_freq_cour…¹ mean_…² media…³ colle…⁴ stabi…⁵
## <int> <fct> <fct> <dbl> <dbl> <chr> <dbl>
## 1 1 ANT 4 19.1 10 Course… 0.996
## 2 2 GER 1 4.5 4.5 Course… 0.600
## 3 3 MCB 1 4 4 Course… 0.605
## 4 4 RTV 3 6 6 Course… 0.495
## # … with abbreviated variable names ¹most_freq_course_level,
## # ²mean_enrollment_4_years, ³median_enrollment_4_years, ⁴college_name,
## # ⁵stability