We will use the oulad data-set. The static data-set is the following
student_Info <- read.csv("studentInfo.csv")
names(student_Info)
## [1] "code_module" "code_presentation" "id_student"
## [4] "gender" "region" "highest_education"
## [7] "imd_band" "age_band" "num_of_prev_attempts"
## [10] "studied_credits" "disability" "final_result"
summary(student_Info)
## code_module code_presentation id_student gender
## AAA: 748 2013B: 4684 Min. : 3733 F:14718
## BBB:7909 2013J: 8845 1st Qu.: 508573 M:17875
## CCC:4434 2014B: 7804 Median : 590310
## DDD:6272 2014J:11260 Mean : 706688
## EEE:2934 3rd Qu.: 644453
## FFF:7762 Max. :2716795
## GGG:2534
## region highest_education
## Scotland : 3446 A Level or Equivalent :14045
## East Anglian Region : 3340 HE Qualification : 4730
## London Region : 3216 Lower Than A Level :13158
## South Region : 3092 No Formal quals : 347
## North Western Region: 2906 Post Graduate Qualification: 313
## West Midlands Region: 2582
## (Other) :14011
## imd_band age_band num_of_prev_attempts studied_credits
## 20-30% : 3654 0-35 :22944 Min. :0.0000 Min. : 30.00
## 30-40% : 3539 35-55: 9433 1st Qu.:0.0000 1st Qu.: 60.00
## 10-20 : 3516 55<= : 216 Median :0.0000 Median : 60.00
## 0-10% : 3311 Mean :0.1632 Mean : 79.76
## 40-50% : 3256 3rd Qu.:0.0000 3rd Qu.:120.00
## 50-60% : 3124 Max. :6.0000 Max. :655.00
## (Other):12193
## disability final_result
## N:29429 Distinction: 3024
## Y: 3164 Fail : 7052
## Pass :12361
## Withdrawn :10156
##
##
##
The dynamical dataset is the following
student_vle <- read.csv("studentVle.csv")
summary(student_vle)
## code_module code_presentation id_student id_site
## AAA: 350298 2013B:1886868 Min. : 6516 Min. : 526721
## BBB:1567564 2013J:2988784 1st Qu.: 507743 1st Qu.: 673519
## CCC:1207827 2014B:2160176 Median : 588236 Median : 730069
## DDD:2166486 2014J:3619452 Mean : 733334 Mean : 738323
## EEE: 961433 3rd Qu.: 646484 3rd Qu.: 877030
## FFF:4014499 Max. :2698588 Max. :1049562
## GGG: 387173
## date sum_click
## Min. :-25.00 Min. : 1.000
## 1st Qu.: 25.00 1st Qu.: 1.000
## Median : 86.00 Median : 2.000
## Mean : 95.17 Mean : 3.717
## 3rd Qu.:156.00 3rd Qu.: 3.000
## Max. :269.00 Max. :6977.000
##
The dynamical data set reflects the dynamical involvement of a student with the online content of the vle during two semesters, for two academic years.
In this part we will investigate differences between groups of students. We can address questions like do male or female interact more often with the vle and is there a difference in their performance? Which demographic segment is best performing? Is the vle more appealing to a particular group of Education? The answers will help us to understand the vle better and to target students effectively, and to evaluate the effectiveness of the vle and the grading process.
We are not focused only to differences between people, similar kind of questions can be targeted on many other groups of data. We might group data by the geographical region and ask questions about these groups. We may also consider time periods and involvement. Did the performance increase in relation to the interaction of the student with the vle? In such cases we compare groups of data to another to identify effects and get valuable insight.
For our student in the vle data, we are interested in how measures such as level of education and gender vary for different segments of the data set. With this insight, the educational “firm” can develop tailored educational material for the segment or engage in different ways to reach and activate the student.
An ad hoc way to do this is with data frame indexing: find the rows that match some criterion, and then take the mean (or some other statistic) for the matching observations on a variable of interest. For example, to examine the final result for the “AAA” and “BBB” course modules:
AAA_st_info <- student_Info[student_Info$code_module == "AAA", ]
summary(AAA_st_info$final_result)
## Distinction Fail Pass Withdrawn
## 44 91 487 126
BBB_st_info <- student_Info[student_Info$code_module == "BBB", ]
summary(BBB_st_info$final_result)
## Distinction Fail Pass Withdrawn
## 677 1767 3077 2388
This says ‘from the student_info observations take only the rows from the column ’code_module’ that have values ‘AAA’ and ‘BBB’ for the two cases respectively.
We can perform the same analysis for all the groups in the data-set by using the by command
by(student_Info$final_result, student_Info$code_module, summary)
## student_Info$code_module: AAA
## Distinction Fail Pass Withdrawn
## 44 91 487 126
## --------------------------------------------------------
## student_Info$code_module: BBB
## Distinction Fail Pass Withdrawn
## 677 1767 3077 2388
## --------------------------------------------------------
## student_Info$code_module: CCC
## Distinction Fail Pass Withdrawn
## 498 781 1180 1975
## --------------------------------------------------------
## student_Info$code_module: DDD
## Distinction Fail Pass Withdrawn
## 383 1412 2227 2250
## --------------------------------------------------------
## student_Info$code_module: EEE
## Distinction Fail Pass Withdrawn
## 356 562 1294 722
## --------------------------------------------------------
## student_Info$code_module: FFF
## Distinction Fail Pass Withdrawn
## 670 1711 2978 2403
## --------------------------------------------------------
## student_Info$code_module: GGG
## Distinction Fail Pass Withdrawn
## 396 728 1118 292
We can brake up fearther the data by using combinatios of groups. For example we might check the final results for each given gender for all the modules
SA <- by(student_Info$final_result, list(student_Info$code_module, student_Info$gender), summary)
SA
## : AAA
## : F
## Distinction Fail Pass Withdrawn
## 13 39 195 67
## --------------------------------------------------------
## : BBB
## : F
## Distinction Fail Pass Withdrawn
## 610 1530 2728 2123
## --------------------------------------------------------
## : CCC
## : F
## Distinction Fail Pass Withdrawn
## 143 145 311 502
## --------------------------------------------------------
## : DDD
## : F
## Distinction Fail Pass Withdrawn
## 122 508 893 995
## --------------------------------------------------------
## : EEE
## : F
## Distinction Fail Pass Withdrawn
## 38 48 152 99
## --------------------------------------------------------
## : FFF
## : F
## Distinction Fail Pass Withdrawn
## 148 252 540 474
## --------------------------------------------------------
## : GGG
## : F
## Distinction Fail Pass Withdrawn
## 320 581 916 226
## --------------------------------------------------------
## : AAA
## : M
## Distinction Fail Pass Withdrawn
## 31 52 292 59
## --------------------------------------------------------
## : BBB
## : M
## Distinction Fail Pass Withdrawn
## 67 237 349 265
## --------------------------------------------------------
## : CCC
## : M
## Distinction Fail Pass Withdrawn
## 355 636 869 1473
## --------------------------------------------------------
## : DDD
## : M
## Distinction Fail Pass Withdrawn
## 261 904 1334 1255
## --------------------------------------------------------
## : EEE
## : M
## Distinction Fail Pass Withdrawn
## 318 514 1142 623
## --------------------------------------------------------
## : FFF
## : M
## Distinction Fail Pass Withdrawn
## 522 1459 2438 1929
## --------------------------------------------------------
## : GGG
## : M
## Distinction Fail Pass Withdrawn
## 76 147 202 66
Similar operations we can perform by using the aggregate command For example we can explore the mean credits earned from each course module
aggregate(studied_credits ∼ code_module, data=student_Info, mean)
## code_module studied_credits
## 1 AAA 84.43850
## 2 BBB 84.07258
## 3 CCC 77.48760
## 4 DDD 86.64780
## 5 EEE 65.43626
## 6 FFF 90.84514
## 7 GGG 34.45935
An other useful technich is cross-tabulating, separating students into groups according to two or more factors.
aggregate(studied_credits ∼ code_presentation + gender, data=student_Info, mean)
## code_presentation gender studied_credits
## 1 2013B F 89.35329
## 2 2013J F 76.24286
## 3 2014B F 75.32809
## 4 2014J F 78.64209
## 5 2013B M 88.91285
## 6 2013J M 80.56620
## 7 2014B M 76.07191
## 8 2014J M 80.32467
We now have a separate group for each combination of Semester and Gender and can begin to see how studied Credits are related to both variables. A formula can be extended to include as many grouping variables as needed. The aggregate command allows us to compute functions of continuous variables, for any combination of factors.
We might also want to explore the frequency with which different combinations of variables occur. The table command allows us to obtain one-way or multiway counts
with(student_Info, table(code_presentation, gender))
## gender
## code_presentation F M
## 2013B 2389 2295
## 2013J 4200 4645
## 2014B 3368 4436
## 2014J 4761 6499
In the above table why see the distributions of gender in the four given semesters. Suppose we want to explore the educational level compared to the region of the students.
with(student_Info, table( highest_education, region))
## region
## highest_education East Anglian Region East Midlands Region
## A Level or Equivalent 1455 1068
## HE Qualification 356 198
## Lower Than A Level 1466 1081
## No Formal quals 53 11
## Post Graduate Qualification 10 7
## region
## highest_education Ireland London Region North Region
## A Level or Equivalent 481 1387 716
## HE Qualification 183 490 282
## Lower Than A Level 481 1215 711
## No Formal quals 0 94 26
## Post Graduate Qualification 39 30 88
## region
## highest_education North Western Region Scotland
## A Level or Equivalent 1230 1202
## HE Qualification 323 1233
## Lower Than A Level 1314 910
## No Formal quals 39 5
## Post Graduate Qualification 0 96
## region
## highest_education South East Region South Region
## A Level or Equivalent 957 1417
## HE Qualification 245 387
## Lower Than A Level 893 1245
## No Formal quals 11 11
## Post Graduate Qualification 5 32
## region
## highest_education South West Region Wales West Midlands Region
## A Level or Equivalent 1179 811 1224
## HE Qualification 226 364 230
## Lower Than A Level 1012 883 1099
## No Formal quals 19 22 29
## Post Graduate Qualification 0 6 0
## region
## highest_education Yorkshire Region
## A Level or Equivalent 918
## HE Qualification 213
## Lower Than A Level 848
## No Formal quals 27
## Post Graduate Qualification 0
and we can inspect the counts for the highest education for each region sererately.
Also, we can further explore count variables. For example we can retrieve the total number of studied_credits reported in each region
xtabs(studied_credits ~ region , data = student_Info)
## region
## East Anglian Region East Midlands Region Ireland
## 260380 191370 92390
## London Region North Region North Western Region
## 266575 146685 234410
## Scotland South East Region South Region
## 265090 167005 240380
## South West Region Wales West Midlands Region
## 195385 170220 209120
## Yorkshire Region
## 160565
We can take the same result also by using the apply, colsums or aggregate commands. For example if we want to store the output as a data.frame it is better to use the aggregate command
aggregate(studied_credits ~ region , data = student_Info, sum)
## region studied_credits
## 1 East Anglian Region 260380
## 2 East Midlands Region 191370
## 3 Ireland 92390
## 4 London Region 266575
## 5 North Region 146685
## 6 North Western Region 234410
## 7 Scotland 265090
## 8 South East Region 167005
## 9 South Region 240380
## 10 South West Region 195385
## 11 Wales 170220
## 12 West Midlands Region 209120
## 13 Yorkshire Region 160565
We can generate the wanted data (for example by using the table command) and plot histograms, barplots etc. However, the lattice package understands simple formulas and generates quality plots.
library(lattice)
histogram(∼gender | highest_education, data=student_Info)
In the above figure, we see that men students have a hightest level of eduation than female studemts. While this data doesn’t tell us why that might be, it does suggest that the educational company might investigate and perhaps improve the vle experience to make it more appealing to women with higher education.
The default in histogram() is to plot proportions within each group so that the values are relative to the group size. If we want actual counts
histogram(∼gender | highest_education, data=student_Info,type="count",layout=c(5,1), col=c("burlywood", "darkolivegreen") )
We add more than one factors. We can plot educational level for each gender among all regions
histogram(∼gender | highest_education + code_module, data=student_Info, )
In the previous section we saw how to plot counts and proportions. What about continuous data? How would we plot sum of clicks by course in our data? A simple way is to use aggregate() to find the mean sum_click, and then use barchart() from the lattice package to plot the computed values:
sum_click_mean <- aggregate(sum_click ~ code_module, data = student_vle, mean)
barchart(sum_click ~ code_module, data = sum_click_mean , col = "grey")
This bar chart depictes the mean interaction of a student with the vle environment for each course.
How do we split this out further by semester? First we have to aggregate the data to include both factors in the formula. Then we tell barchart() to use semester as a grouping variable by adding the argument groups=factor. Doing that, and also adding a simpleTheme option to improve the chart colors, we have:
sum_click_mean_sem <- aggregate(sum_click ~ code_module + code_presentation, data = student_vle, mean)
barchart(sum_click ~ code_module, data = sum_click_mean_sem, groups=code_presentation, auto.key=TRUE, par.settings = simpleTheme(col=terrain.colors(4)))
A more informative plot for comparing values of continuous data, like income for different groups is a box-and-whiskers plot or boxplot. A boxplot is better than a barchart because it shows more about the distributions of values.
bwplot(log(sum_click) ~ code_module, data=student_vle, horizontal=FALSE)
We can break out code_presentation as a conditioning variable using “ | code_presentation” in the formula:
bwplot(log(sum_click) ~ code_module | code_presentation, data=student_vle, horizontal=FALSE)
From the above plot it is evident that in the data set the semesters of 2014 include all the courses and the interaction with the vle increased. Due to the more complete picture of the 2014 academic year in the following we might consider the two years seperately.
In this section, we tacle a canonical educational research problem: finding, assesing, and predicting student segments. From the analysis of our dataset we can assess relationships in the data, compare groups, and assess complex multivariable models. In a real segmentation project, one would use use those methods to ensure that data has appropriate multivariate structure, and then begin segmentation analysis. Segmentation is not a well-defined process and analysts vary in their definitions of segmentation as well as their approaches and philosophy.
The general goal of segmentation in vle is to find groups of students that differ in important ways assiciate with interest in educational material, participation in the environment, or response to the vle. By understanding the differences among groups, the educator can make better strategic choices about planning, material definition, and course evolution and can engage in more effective educational procedure.
It is not difficult to find groups within student data. What is difficult is to find meaningful segmentattions for particular educational needs. There are many techniques in segmentation and many of them bilong to the field of statistical learning. In this project we will focus on clustering analysis and we will demonstrate techniques that can be used to retrieve valuable insight for optimizing the vle.
Clustering belongs to the field of unsupervised learning. In usupervised learning we do not know the outcome groupings but we attempt to discover them from structure in the data. For instance, we might explore a particulare corse module and ask, “Are there groups that differ in how and when they respond to educational material? If so, what are the characteristics of those groups?” We use the term clustering for this approach.
Clustering is useful in segmentation projects. Educators can use segmentation for discovering groups in the data in order to derive new insight about students. This obniously suggests clustering approaches because the possible student groups are unknown. Still, classification approaches are also useful in such projects and are always complementary to clustering but we will consider them in a future work.
The dynamical dataset has a huge amount of information, namely 10.655.280 rows. This restricts the use of clustering algorithms due to the high computational complexity that most of them have. For our purposes we will retrieve the total number of clicks for a given course module for each student. This is costly and it must be done with an efficient algorithm. Computational complexity plays a huge roll for this type of analyses since big data are involved in many vle environments.
student_vle_sum <- aggregate(sum_click ∼ code_module + code_presentation + id_student, data=student_vle, sum)
We will merge this relult retrieved from the dynamical data set with the static data set. We will use an inner join of the data sets and this will give us only the students that actualy participated in the vle.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
student_inj <- inner_join(student_Info, student_vle_sum, by = c("id_student", "code_module", "code_presentation"))
The students that where in the student and finally did not interacted at all with the vle where
(dim(student_Info)[1]-dim(student_vle_sum)[1])/dim(student_Info)[1]
## [1] 0.103243
about 10% of the initial population. Keep in mind that we can still evaluate the characteristics of these potential students and consider strategies to make the vle more appealing to them. Let’s see the resulting data set from the inner join
head(student_inj)
## code_module code_presentation id_student gender region
## 1 AAA 2013J 11391 M East Anglian Region
## 2 AAA 2013J 28400 F Scotland
## 3 AAA 2013J 30268 F North Western Region
## 4 AAA 2013J 31604 F South East Region
## 5 AAA 2013J 32885 F West Midlands Region
## 6 AAA 2013J 38053 M Wales
## highest_education imd_band age_band num_of_prev_attempts
## 1 HE Qualification 90-100% 55<= 0
## 2 HE Qualification 20-30% 35-55 0
## 3 A Level or Equivalent 30-40% 35-55 0
## 4 A Level or Equivalent 50-60% 35-55 0
## 5 Lower Than A Level 50-60% 0-35 0
## 6 A Level or Equivalent 80-90% 35-55 0
## studied_credits disability final_result sum_click
## 1 240 N Pass 934
## 2 60 N Pass 1435
## 3 60 Y Withdrawn 281
## 4 60 N Pass 2158
## 5 60 N Pass 1034
## 6 60 N Pass 2445
The data set has categorical variables as well as continuous. We can convert some variables to numerical without losing information since there is a hierarchy within this variable.
levels(student_inj$highest_education) <- c(3,4,2,1,5)
“No Formal quals” < “Lower Than A Level” < “A Level or Equivalent” < “HE Qualification” (HE is Higher Education) <“Post Graduate Qualification”
levels(student_inj$age_band) <- c(1,2,3)
0-35 < 35-55 < 55+
levels(student_inj$final_result) <- c(3,1,2,0)
“Withdrawn” < “Fail” < “Pass” < “Distinction”
The Index of Multiple Deprivation is a UK government qualitative study of deprived areas in English local councils. Lets see its levels and quantify the missing values
levels(student_inj$imd_band)
## [1] "?" "0-10%" "10-20" "20-30%" "30-40%" "40-50%" "50-60%"
## [8] "60-70%" "70-80%" "80-90%" "90-100%"
sum(student_inj$imd_band == "?")/length(student_inj$imd_band)
## [1] 0.03606131
Only 3.5% of the data are missing values of imd_band. We omit these observations
miis_emd <-which(student_inj$imd_band == "?")
student_inj <- student_inj[-miis_emd,]
levels(student_inj$imd_band) <- 0:10
For male or female M=1 Female=2. For disability N=1, Y=2
levels(student_inj$gender) <- c(2,1)
levels(student_inj$disability) <- 1:2
student_inj$highest_education <- as.numeric(student_inj$highest_education)
student_inj$age_band <- as.numeric(student_inj$age_band)
student_inj$final_result <- as.numeric(student_inj$final_result)
student_inj$imd_band <- as.numeric(student_inj$imd_band)
student_inj$gender <- as.numeric(student_inj$gender)
student_inj$disability <- as.numeric(student_inj$disability)
With this merged data set we will proceed to demonstarte somecluster analysis. Finaly we “paste” the two first columns and they will repsresent a particular course in a particular semester
There may be a time in which we would like to combine the values of two variables. The unite() function is a convenience function to paste together multiple variable values into one. In essence, it combines two variables of a single observation into one variable.
library(tidyr)
stud_unit <- unite(student_inj,course, code_module, code_presentation, sep = "_", remove = TRUE)
head(stud_unit)
## course id_student gender region highest_education
## 1 AAA_2013J 11391 2 East Anglian Region 2
## 2 AAA_2013J 28400 1 Scotland 2
## 3 AAA_2013J 30268 1 North Western Region 1
## 4 AAA_2013J 31604 1 South East Region 1
## 5 AAA_2013J 32885 1 West Midlands Region 3
## 6 AAA_2013J 38053 2 Wales 1
## imd_band age_band num_of_prev_attempts studied_credits disability
## 1 11 3 0 240 1
## 2 4 2 0 60 1
## 3 5 2 0 60 2
## 4 7 2 0 60 1
## 5 7 1 0 60 1
## 6 10 2 0 60 1
## final_result sum_click
## 1 3 934
## 2 3 1435
## 3 4 281
## 4 3 2158
## 5 3 1034
## 6 3 2445
Clustering analysis requires two stages: finding a proposed cluster solution and evaluating that solution for the educational needs. For each method we go through the following steps:
• Transform the data if needed for a particular clustering method; for instance, some methods require all numeric data (e.g., kmeans(), mclust()) or all categorical data (e.g., poLCA()).
• Compute a distance matrix if needed; some methods require a precomputed matrix of similarity in order to group observations (e.g., hclust()).
• Apply the clustering method and save its result to an object. For some methods this requires specifying the number (K) of groups desired (e.g., kmeans(), poLCA()).
• For some methods, further parse the object to obtain a solution with K groups (e.g., hclust()).
• Examine the solution in the model object with regard to the underlying data, and consider whether it answers an educational question.
As we’ve already argued, the most difficult part of that process is the last step: establishing whether a proposed statistical solution answers an educational need. Ultimately, a cluster solution is largely just a vector of purported group assignments for each observation, such as “1, 1, 4, 3, 2, 3, 2, 2, 4, 1, 4 . . ..” It is up to the educator to figure out whether that tells a meaningful story for the vle and the students involved.
We recommend that you think hard about how you would know whether the solution—assignments of observations to groups—that is proposed by a clustering method is useful for the educational problem. Just because some grouping is proposed by an algorithm does not mean that it will help the improvement of the learning experience. One way we often approach this is to write a simple function that summarizes the data and allows quick inspection of the high-level differences between groups
A segment inspection function may be complex depending on the educational need and might even include plotting as well as data summarization. For purposes here we use a simple function that reports the mean by group. We use mean here instead of a more robust metric such as median because we have several binary variables and mean() easily shows the mixture proportion for them (i.e., 1.5 means a 50% mix of 1 and 2). A very simple function is:
seg_summa <- function(data, groups) {aggregate(data, list(groups), function(x) mean(as.numeric(x)))}
seg_summa(stud_unit[, -c(1,2,4,8,9,10)], stud_unit$course)
## Group.1 gender highest_education imd_band age_band final_result
## 1 AAA_2013J 1.606061 1.743802 7.454545 1.589532 2.925620
## 2 AAA_2014J 1.553009 1.656160 7.234957 1.584527 2.905444
## 3 BBB_2013B 1.115942 2.054677 5.775362 1.326746 2.739789
## 4 BBB_2013J 1.106085 2.054389 5.984383 1.346257 2.735595
## 5 BBB_2014B 1.111371 2.014019 5.924455 1.333333 2.622274
## 6 BBB_2014J 1.122117 2.047694 5.905136 1.355870 2.831761
## 7 CCC_2014B 1.769912 1.900126 6.515803 1.277497 2.984829
## 8 CCC_2014J 1.743756 1.846438 6.548566 1.317299 2.970860
## 9 DDD_2013B 1.599827 1.971478 6.386344 1.295592 2.929127
## 10 DDD_2013J 1.591017 1.908983 6.507092 1.250000 2.956856
## 11 DDD_2014B 1.610377 1.925472 6.417925 1.256604 2.923585
## 12 DDD_2014J 1.592214 1.860881 6.498405 1.259094 2.957881
## 13 EEE_2013J 1.904918 1.956284 6.684153 1.248087 2.709290
## 14 EEE_2014B 1.887789 1.973597 6.658416 1.252475 2.702970
## 15 EEE_2014J 1.861217 1.896388 6.555133 1.235741 2.770913
## 16 FFF_2013B 1.802113 2.006338 6.338732 1.293662 2.771831
## 17 FFF_2013J 1.832168 2.005994 6.212787 1.244256 2.827173
## 18 FFF_2014B 1.830757 2.023184 6.154560 1.243431 2.821484
## 19 FFF_2014J 1.815491 2.018747 6.205723 1.261470 2.880118
## 20 GGG_2013J 1.177130 2.289238 6.048206 1.376682 2.446188
## 21 GGG_2014B 1.202597 2.293506 5.893506 1.385714 2.459740
## 22 GGG_2014J 1.193687 2.364419 5.804878 1.416069 2.540890
## sum_click
## 1 1701.9642
## 2 1694.1891
## 3 863.0922
## 4 736.4620
## 5 642.6472
## 6 868.8459
## 7 1095.4349
## 8 1176.7627
## 9 1112.3371
## 10 981.5012
## 11 820.3425
## 12 853.0019
## 13 1565.1213
## 14 1302.3878
## 15 1450.5808
## 16 2748.5345
## 17 2408.4076
## 18 2155.0000
## 19 2457.5284
## 20 569.0213
## 21 550.6610
## 22 571.7676
This simple function will help us to inspect cluster solutions efficiently. It is not intended to be a substitute for detailed analysis—and it takes shortcuts such as treating categorical variables as numbers, which is inadvisable except for analysts who understand what they’re doing—yet it provides a quick first check of whether there is something interesting (or uninteresting) occurring in a solution.
With a summary function of this kind we are easily able to answer the following questions related to the business value of a proposed solution:
• Are there obvious differences in group means?
• Does the differentiation point to some underlying story to tell?
• Do we see immediately odd results such as a mean equal to the value of one data level?
Hierarchical clustering is a popular method that groups observations according to their similarity. The hclust() method is one way to perform this analysis in R. hclust() is a distance-based algorithm that operates on a dissimilarity matrix, an N-by-N matrix that reports a metric for the distance between each pair of observations.
The hierarchical clustering method begins with each observation in its own cluster. It then successively joins neighboring observations or clusters one at a time according to their distances from one another, and continues this until all observations are linked. This process of repeatedly joining observations and groups is known as an agglomerative method. However, the method hasa high computational cost.
The hierarchical clustering is efficient for numerical as well as categorical data. In R we can compute the distance table with the daisy() function from the cluster package. For example, in case of the BBB course at the 2014J semester, the clustering can be done as follows:
library(cluster)
stud_BBB_2014J <- student_Info[student_Info$code_module == "BBB"
& student_Info$code_presentation == "2014J"
& student_Info$final_result == "Fail" , -3]
stud_inf_dis <- daisy(stud_BBB_2014J)
as.matrix(stud_inf_dis)[1:5,1:5]
## 6367 6375 6392 6399 6402
## 6367 0.0000000 0.5386364 0.3750000 0.4659091 0.3750000
## 6375 0.5386364 0.0000000 0.2772727 0.2545455 0.5272727
## 6392 0.3750000 0.2772727 0.0000000 0.2954545 0.3863636
## 6399 0.4659091 0.2545455 0.2954545 0.0000000 0.3636364
## 6402 0.3750000 0.5272727 0.3863636 0.3636364 0.0000000
seg.hc <- hclust(stud_inf_dis, method="average")
plot(seg.hc)
The anove figure is difficult to read, so it is helpful to zoom in on one section of the chart. We can cut it at a specified location and plot just one branch as follows. We coerce it to a dendrogram object (as.dendrogram(. . . )), cut it at a certain height (h=. . . ), and select the resulting branch that we want (. . . $lower[[1]]).
plot(cut(as.dendrogram(seg.hc), h=0.3)$lower[[5]])
We cann see how students are grouped together progressively for the particular course that we chose. We can check the similarity of observations by selecting a few rows listed in the above figure. Observations 6711 and 7544 looks quite similar because they are linked at a low height as are observations 6556 and 7906. On the other hand, observations 8495 and 7729 are only linked at a highest level and sould be relatively dissimilar. Let’s check those directly
stud_BBB_2014J[c("6711", "7544"),] ##similar
## code_module code_presentation gender region highest_education
## 6711 BBB 2014J F Scotland HE Qualification
## 7544 BBB 2014J F Scotland HE Qualification
## imd_band age_band num_of_prev_attempts studied_credits disability
## 6711 0-10% 0-35 1 120 N
## 7544 60-70% 0-35 0 120 N
## final_result
## 6711 Fail
## 7544 Fail
stud_BBB_2014J[c("6556", "7906"),] ##similar
## code_module code_presentation gender region highest_education
## 6556 BBB 2014J F Scotland HE Qualification
## 7906 BBB 2014J F Scotland HE Qualification
## imd_band age_band num_of_prev_attempts studied_credits disability
## 6556 30-40% 0-35 2 120 Y
## 7906 50-60% 0-35 0 120 Y
## final_result
## 6556 Fail
## 7906 Fail
stud_BBB_2014J[c("8495", "7729"), ] ##less similar
## code_module code_presentation gender region highest_education
## 8495 BBB 2014J F Scotland HE Qualification
## 7729 BBB 2014J F Scotland HE Qualification
## imd_band age_band num_of_prev_attempts studied_credits disability
## 8495 0-10% 35-55 1 150 Y
## 7729 20-30% 0-35 0 60 N
## final_result
## 8495 Fail
## 7729 Fail
Finally, we might check one of the goodness-of-fit metrics for a hierarchical cluster solution. One method is the cophenetic correlation coefficient (CPCC), which assesses how well a dendrogram (in this case seg.hc) matches the true distance metric (stud_inf_dis). We use cophenetic() to get the distances from the dendrogram, and compare it to the dist() metrics with cor():
cor(cophenetic(seg.hc), stud_inf_dis)
## [1] 0.7026734
In this case, CPCC > 0.7 indicates a relatively strong fit, meaning that the hierarchical tree represents the distances between students that failed in the exams well.
How do we get specific segment assignments? A dendrogram can be cut into clusters at any height desired, resulting in different numbers of groups. Because a dendrogram can be cut at any point, the analyst must specify the number of groups desired. We can see where the dendrogram would be cut by overlaying its plot() with rect.hclust(), specifying the number of groups we want (k=. . . ):
plot(seg.hc)
rect.hclust(seg.hc, k=4, border="red")
We obtain the assignment vector for observations using cutree():
seg.hc.segment <- cutree(seg.hc, k=4) # membership vector for 4 groups
table(seg.hc.segment)
## seg.hc.segment
## 1 2 3 4
## 39 339 8 5
We see that group 2 dominate the assignment. Indicating that there is a large group of students that failed the exams and they share some common features. Note that the class labels (1, 2, 3, 4) are in arbitrary order and are not meaningful in themselves. seg.hc.segment is the vector of group assignments.
We use our custom summary function seg.summ(), defined above, to inspect the variables in seg.df with reference to the four clusters:
seg_summa(stud_BBB_2014J, seg.hc.segment)
## Group.1 code_module code_presentation gender region
## 1 1 2 4 2.000000 6.564103
## 2 2 2 4 1.014749 6.946903
## 3 3 2 4 1.125000 5.000000
## 4 4 2 4 2.000000 6.400000
## highest_education imd_band age_band num_of_prev_attempts studied_credits
## 1 1.948718 5.666667 1.461538 0.2051282 73.84615
## 2 2.315634 5.324484 1.289086 0.2536873 75.35398
## 3 1.000000 6.000000 1.000000 0.1250000 67.50000
## 4 3.000000 4.800000 1.600000 0.0000000 96.00000
## disability final_result
## 1 1.051282 2
## 2 1.070796 2
## 3 2.000000 2
## 4 2.000000 2
From the above table we see that gender is the most ‘homogeneous’ variable of the 2nd cluster. Also, most of the students that failed lie between the range 30-50% of the imd band and had the highest mean value of previous attempts. We can further study the statistical properties of other variables and further analyze the characteristics of the second group. Obniously the second group has many common characteristics and this can help the educator to organize a strategy to approach the students.
K-means clustering attempts to find groups that are most compact, in terms of the mean sum-of-squares deviation of each observation from the multivariate center (centroid) of its assigned group. Because it explicitly computes a mean deviation, k-means clustering relies on Euclidean distance. Thus it is only appropriate for numeric data or data that can be reasonably coerced to numeric. We will use stud_unit data.frame that we constructed in the previous. We will again focus our attention to the BBB course of 2014J semester for the students that failed.
stud_unit_numeric <- stud_unit[stud_unit$course == "BBB_2014J" & stud_unit$final_result == 1, ]
stud_unit_numeric <- stud_unit_numeric[, -c(1,2,4,11)] ##removing non-numerical variables
and now se can use the k-means clustering for 4 centroids
set.seed(1234)
seg_stud_km <- kmeans(stud_unit_numeric, centers=4)
We use our custom function seg.summ() to do a quick check of the data by proposed group, where cluster assignments are found in the $cluster vector inside the seg_stud_km model:
seg_summa(stud_unit_numeric, seg_stud_km$cluster)
## Group.1 gender highest_education imd_band age_band
## 1 1 1.136670 2.097209 5.665063 1.271415
## 2 2 1.192308 2.153846 5.769231 1.769231
## 3 3 1.057143 1.933333 6.500000 1.561905
## 4 4 1.116904 2.000000 6.107425 1.409163
## num_of_prev_attempts studied_credits disability final_result sum_click
## 1 0.19249278 80.71704 1.101059 3.024062 262.693
## 2 0.03846154 64.61538 1.153846 2.423077 5061.654
## 3 0.05238095 72.78571 1.061905 2.538095 2421.205
## 4 0.14533965 75.85308 1.064771 2.630332 1176.562
We now see some interesting differences; the groups appear to vary by gender, highest education, imd and age band, sum clicks and final results.
For example, we can visually check the distribution of income according to segment (which kmeans() stored in seg.k$cluster) using boxplot():
bwplot(stud_unit_numeric$sum_click ∼ seg_stud_km$cluster, ylab="Sum of Clicks", xlab="Cluster", horizontal =F)
The resulting Figure shows substantial differences in number of clicks by segment. Note that in clustering models, the group labels are in arbitrary order, so don’t worry if your solution shows the same pattern with different labels.
We visualize the clusters by plotting them against a dimensional plot. clusplot() will perform dimensional reduction with principal components or multidimensional scaling as the data warrant, and then plot the observations with cluster membership identified.
We use clusplot from the cluster package with arguments to color the groups, shade the ellipses for group membership, label only the groups (not the individual points) with labels=4, and omit distance lines between groups (lines=0):
library(cluster)
clusplot(stud_unit_numeric, seg_stud_km$cluster, color=TRUE, shade=TRUE, labels=4, lines=0, main="K-means cluster plot")
Overall, this is a far more interesting cluster solution for our segmentation data than the hclust() proposal. The groups here are clearly differentiated on key variables such as age and income.With this information, an analyst might cross-reference the group membership with key variables and then look at the relative differentiation of the groups as it is depected in the previous plot.
This may suggest an educational strategy to help students pass. In the present case, for instance, we see that group 3 is modestly well differentiated, and has the lowest education. That may make it a good target for potential extra material of basic knowledge. Many other strategies are possible, too; the key point is that the analysis provides interesting options to consider.
We’ve covered two statistical methods to identify potential groups of observations in a data set. There are two points that are crucial for success in segmentation projects:
• Different methods are likely to yield different solutions, and in general there is no absolute “right” answer. We recommend to try multiple clustering methods with different potential numbers of clusters.
• The results of segmentation are primarily about the value of the vle, and solutions should be evaluated in terms of both model fit and educational utility. Although model fit is an important criterion and should not be overlooked, it is ultimately necessary that an answer can be communicated to and used by educators.