We will use the oulad data-set. The static data-set is the following

student_Info <- read.csv("studentInfo.csv")
names(student_Info)
##  [1] "code_module"          "code_presentation"    "id_student"          
##  [4] "gender"               "region"               "highest_education"   
##  [7] "imd_band"             "age_band"             "num_of_prev_attempts"
## [10] "studied_credits"      "disability"           "final_result"
summary(student_Info)
##  code_module code_presentation   id_student      gender   
##  AAA: 748    2013B: 4684       Min.   :   3733   F:14718  
##  BBB:7909    2013J: 8845       1st Qu.: 508573   M:17875  
##  CCC:4434    2014B: 7804       Median : 590310            
##  DDD:6272    2014J:11260       Mean   : 706688            
##  EEE:2934                      3rd Qu.: 644453            
##  FFF:7762                      Max.   :2716795            
##  GGG:2534                                                 
##                   region                        highest_education
##  Scotland            : 3446   A Level or Equivalent      :14045  
##  East Anglian Region : 3340   HE Qualification           : 4730  
##  London Region       : 3216   Lower Than A Level         :13158  
##  South Region        : 3092   No Formal quals            :  347  
##  North Western Region: 2906   Post Graduate Qualification:  313  
##  West Midlands Region: 2582                                      
##  (Other)             :14011                                      
##     imd_band      age_band     num_of_prev_attempts studied_credits 
##  20-30% : 3654   0-35 :22944   Min.   :0.0000       Min.   : 30.00  
##  30-40% : 3539   35-55: 9433   1st Qu.:0.0000       1st Qu.: 60.00  
##  10-20  : 3516   55<= :  216   Median :0.0000       Median : 60.00  
##  0-10%  : 3311                 Mean   :0.1632       Mean   : 79.76  
##  40-50% : 3256                 3rd Qu.:0.0000       3rd Qu.:120.00  
##  50-60% : 3124                 Max.   :6.0000       Max.   :655.00  
##  (Other):12193                                                      
##  disability      final_result  
##  N:29429    Distinction: 3024  
##  Y: 3164    Fail       : 7052  
##             Pass       :12361  
##             Withdrawn  :10156  
##                                
##                                
## 

The dynamical dataset is the following

student_vle <- read.csv("studentVle.csv")
summary(student_vle)
##  code_module   code_presentation   id_student         id_site       
##  AAA: 350298   2013B:1886868     Min.   :   6516   Min.   : 526721  
##  BBB:1567564   2013J:2988784     1st Qu.: 507743   1st Qu.: 673519  
##  CCC:1207827   2014B:2160176     Median : 588236   Median : 730069  
##  DDD:2166486   2014J:3619452     Mean   : 733334   Mean   : 738323  
##  EEE: 961433                     3rd Qu.: 646484   3rd Qu.: 877030  
##  FFF:4014499                     Max.   :2698588   Max.   :1049562  
##  GGG: 387173                                                        
##       date          sum_click       
##  Min.   :-25.00   Min.   :   1.000  
##  1st Qu.: 25.00   1st Qu.:   1.000  
##  Median : 86.00   Median :   2.000  
##  Mean   : 95.17   Mean   :   3.717  
##  3rd Qu.:156.00   3rd Qu.:   3.000  
##  Max.   :269.00   Max.   :6977.000  
## 

The dynamical data set reflects the dynamical involvement of a student with the online content of the vle during two semesters, for two academic years.

Comparing Groups

In this part we will investigate differences between groups of students. We can address questions like do male or female interact more often with the vle and is there a difference in their performance? Which demographic segment is best performing? Is the vle more appealing to a particular group of Education? The answers will help us to understand the vle better and to target students effectively, and to evaluate the effectiveness of the vle and the grading process.

We are not focused only to differences between people, similar kind of questions can be targeted on many other groups of data. We might group data by the geographical region and ask questions about these groups. We may also consider time periods and involvement. Did the performance increase in relation to the interaction of the student with the vle? In such cases we compare groups of data to another to identify effects and get valuable insight.

Finding descriptives by group

For our student in the vle data, we are interested in how measures such as level of education and gender vary for different segments of the data set. With this insight, the educational “firm” can develop tailored educational material for the segment or engage in different ways to reach and activate the student.

An ad hoc way to do this is with data frame indexing: find the rows that match some criterion, and then take the mean (or some other statistic) for the matching observations on a variable of interest. For example, to examine the final result for the “AAA” and “BBB” course modules:

AAA_st_info <- student_Info[student_Info$code_module == "AAA", ]
summary(AAA_st_info$final_result)
## Distinction        Fail        Pass   Withdrawn 
##          44          91         487         126
BBB_st_info <- student_Info[student_Info$code_module == "BBB", ]
summary(BBB_st_info$final_result)
## Distinction        Fail        Pass   Withdrawn 
##         677        1767        3077        2388

This says ‘from the student_info observations take only the rows from the column ’code_module’ that have values ‘AAA’ and ‘BBB’ for the two cases respectively.

We can perform the same analysis for all the groups in the data-set by using the by command

by(student_Info$final_result, student_Info$code_module, summary)
## student_Info$code_module: AAA
## Distinction        Fail        Pass   Withdrawn 
##          44          91         487         126 
## -------------------------------------------------------- 
## student_Info$code_module: BBB
## Distinction        Fail        Pass   Withdrawn 
##         677        1767        3077        2388 
## -------------------------------------------------------- 
## student_Info$code_module: CCC
## Distinction        Fail        Pass   Withdrawn 
##         498         781        1180        1975 
## -------------------------------------------------------- 
## student_Info$code_module: DDD
## Distinction        Fail        Pass   Withdrawn 
##         383        1412        2227        2250 
## -------------------------------------------------------- 
## student_Info$code_module: EEE
## Distinction        Fail        Pass   Withdrawn 
##         356         562        1294         722 
## -------------------------------------------------------- 
## student_Info$code_module: FFF
## Distinction        Fail        Pass   Withdrawn 
##         670        1711        2978        2403 
## -------------------------------------------------------- 
## student_Info$code_module: GGG
## Distinction        Fail        Pass   Withdrawn 
##         396         728        1118         292

We can brake up fearther the data by using combinatios of groups. For example we might check the final results for each given gender for all the modules

SA <- by(student_Info$final_result, list(student_Info$code_module, student_Info$gender), summary)
SA
## : AAA
## : F
## Distinction        Fail        Pass   Withdrawn 
##          13          39         195          67 
## -------------------------------------------------------- 
## : BBB
## : F
## Distinction        Fail        Pass   Withdrawn 
##         610        1530        2728        2123 
## -------------------------------------------------------- 
## : CCC
## : F
## Distinction        Fail        Pass   Withdrawn 
##         143         145         311         502 
## -------------------------------------------------------- 
## : DDD
## : F
## Distinction        Fail        Pass   Withdrawn 
##         122         508         893         995 
## -------------------------------------------------------- 
## : EEE
## : F
## Distinction        Fail        Pass   Withdrawn 
##          38          48         152          99 
## -------------------------------------------------------- 
## : FFF
## : F
## Distinction        Fail        Pass   Withdrawn 
##         148         252         540         474 
## -------------------------------------------------------- 
## : GGG
## : F
## Distinction        Fail        Pass   Withdrawn 
##         320         581         916         226 
## -------------------------------------------------------- 
## : AAA
## : M
## Distinction        Fail        Pass   Withdrawn 
##          31          52         292          59 
## -------------------------------------------------------- 
## : BBB
## : M
## Distinction        Fail        Pass   Withdrawn 
##          67         237         349         265 
## -------------------------------------------------------- 
## : CCC
## : M
## Distinction        Fail        Pass   Withdrawn 
##         355         636         869        1473 
## -------------------------------------------------------- 
## : DDD
## : M
## Distinction        Fail        Pass   Withdrawn 
##         261         904        1334        1255 
## -------------------------------------------------------- 
## : EEE
## : M
## Distinction        Fail        Pass   Withdrawn 
##         318         514        1142         623 
## -------------------------------------------------------- 
## : FFF
## : M
## Distinction        Fail        Pass   Withdrawn 
##         522        1459        2438        1929 
## -------------------------------------------------------- 
## : GGG
## : M
## Distinction        Fail        Pass   Withdrawn 
##          76         147         202          66

Similar operations we can perform by using the aggregate command For example we can explore the mean credits earned from each course module

aggregate(studied_credits ∼ code_module, data=student_Info, mean)
##   code_module studied_credits
## 1         AAA        84.43850
## 2         BBB        84.07258
## 3         CCC        77.48760
## 4         DDD        86.64780
## 5         EEE        65.43626
## 6         FFF        90.84514
## 7         GGG        34.45935

An other useful technich is cross-tabulating, separating students into groups according to two or more factors.

aggregate(studied_credits ∼ code_presentation + gender, data=student_Info, mean)
##   code_presentation gender studied_credits
## 1             2013B      F        89.35329
## 2             2013J      F        76.24286
## 3             2014B      F        75.32809
## 4             2014J      F        78.64209
## 5             2013B      M        88.91285
## 6             2013J      M        80.56620
## 7             2014B      M        76.07191
## 8             2014J      M        80.32467

We now have a separate group for each combination of Semester and Gender and can begin to see how studied Credits are related to both variables. A formula can be extended to include as many grouping variables as needed. The aggregate command allows us to compute functions of continuous variables, for any combination of factors.

We might also want to explore the frequency with which different combinations of variables occur. The table command allows us to obtain one-way or multiway counts

with(student_Info, table(code_presentation, gender))
##                  gender
## code_presentation    F    M
##             2013B 2389 2295
##             2013J 4200 4645
##             2014B 3368 4436
##             2014J 4761 6499

In the above table why see the distributions of gender in the four given semesters. Suppose we want to explore the educational level compared to the region of the students.

with(student_Info, table( highest_education, region))
##                              region
## highest_education             East Anglian Region East Midlands Region
##   A Level or Equivalent                      1455                 1068
##   HE Qualification                            356                  198
##   Lower Than A Level                         1466                 1081
##   No Formal quals                              53                   11
##   Post Graduate Qualification                  10                    7
##                              region
## highest_education             Ireland London Region North Region
##   A Level or Equivalent           481          1387          716
##   HE Qualification                183           490          282
##   Lower Than A Level              481          1215          711
##   No Formal quals                   0            94           26
##   Post Graduate Qualification      39            30           88
##                              region
## highest_education             North Western Region Scotland
##   A Level or Equivalent                       1230     1202
##   HE Qualification                             323     1233
##   Lower Than A Level                          1314      910
##   No Formal quals                               39        5
##   Post Graduate Qualification                    0       96
##                              region
## highest_education             South East Region South Region
##   A Level or Equivalent                     957         1417
##   HE Qualification                          245          387
##   Lower Than A Level                        893         1245
##   No Formal quals                            11           11
##   Post Graduate Qualification                 5           32
##                              region
## highest_education             South West Region Wales West Midlands Region
##   A Level or Equivalent                    1179   811                 1224
##   HE Qualification                          226   364                  230
##   Lower Than A Level                       1012   883                 1099
##   No Formal quals                            19    22                   29
##   Post Graduate Qualification                 0     6                    0
##                              region
## highest_education             Yorkshire Region
##   A Level or Equivalent                    918
##   HE Qualification                         213
##   Lower Than A Level                       848
##   No Formal quals                           27
##   Post Graduate Qualification                0

and we can inspect the counts for the highest education for each region sererately.

Also, we can further explore count variables. For example we can retrieve the total number of studied_credits reported in each region

xtabs(studied_credits ~ region , data = student_Info)
## region
##  East Anglian Region East Midlands Region              Ireland 
##               260380               191370                92390 
##        London Region         North Region North Western Region 
##               266575               146685               234410 
##             Scotland    South East Region         South Region 
##               265090               167005               240380 
##    South West Region                Wales West Midlands Region 
##               195385               170220               209120 
##     Yorkshire Region 
##               160565

We can take the same result also by using the apply, colsums or aggregate commands. For example if we want to store the output as a data.frame it is better to use the aggregate command

aggregate(studied_credits ~ region , data = student_Info, sum)
##                  region studied_credits
## 1   East Anglian Region          260380
## 2  East Midlands Region          191370
## 3               Ireland           92390
## 4         London Region          266575
## 5          North Region          146685
## 6  North Western Region          234410
## 7              Scotland          265090
## 8     South East Region          167005
## 9          South Region          240380
## 10    South West Region          195385
## 11                Wales          170220
## 12 West Midlands Region          209120
## 13     Yorkshire Region          160565

Visualization of frequencies and proportions

We can generate the wanted data (for example by using the table command) and plot histograms, barplots etc. However, the lattice package understands simple formulas and generates quality plots.

library(lattice)
histogram(∼gender | highest_education, data=student_Info)

In the above figure, we see that men students have a hightest level of eduation than female studemts. While this data doesn’t tell us why that might be, it does suggest that the educational company might investigate and perhaps improve the vle experience to make it more appealing to women with higher education.

The default in histogram() is to plot proportions within each group so that the values are relative to the group size. If we want actual counts

histogram(∼gender | highest_education, data=student_Info,type="count",layout=c(5,1), col=c("burlywood", "darkolivegreen") )

We add more than one factors. We can plot educational level for each gender among all regions

histogram(∼gender | highest_education + code_module,  data=student_Info, )

Visualization by Group: Continuous Data

In the previous section we saw how to plot counts and proportions. What about continuous data? How would we plot sum of clicks by course in our data? A simple way is to use aggregate() to find the mean sum_click, and then use barchart() from the lattice package to plot the computed values:

sum_click_mean <- aggregate(sum_click ~ code_module, data = student_vle, mean)
barchart(sum_click ~ code_module, data = sum_click_mean , col = "grey")

This bar chart depictes the mean interaction of a student with the vle environment for each course.

How do we split this out further by semester? First we have to aggregate the data to include both factors in the formula. Then we tell barchart() to use semester as a grouping variable by adding the argument groups=factor. Doing that, and also adding a simpleTheme option to improve the chart colors, we have:

sum_click_mean_sem <- aggregate(sum_click ~ code_module + code_presentation, data = student_vle, mean)
barchart(sum_click ~ code_module, data = sum_click_mean_sem, groups=code_presentation, auto.key=TRUE, par.settings = simpleTheme(col=terrain.colors(4)))

A more informative plot for comparing values of continuous data, like income for different groups is a box-and-whiskers plot or boxplot. A boxplot is better than a barchart because it shows more about the distributions of values.

bwplot(log(sum_click) ~ code_module, data=student_vle, horizontal=FALSE)

We can break out code_presentation as a conditioning variable using “ | code_presentation” in the formula:

bwplot(log(sum_click) ~ code_module | code_presentation, data=student_vle, horizontal=FALSE)

From the above plot it is evident that in the data set the semesters of 2014 include all the courses and the interaction with the vle increased. Due to the more complete picture of the 2014 academic year in the following we might consider the two years seperately.

Segmentaion: Clustering of Students

In this section, we tacle a canonical educational research problem: finding, assesing, and predicting student segments. From the analysis of our dataset we can assess relationships in the data, compare groups, and assess complex multivariable models. In a real segmentation project, one would use use those methods to ensure that data has appropriate multivariate structure, and then begin segmentation analysis. Segmentation is not a well-defined process and analysts vary in their definitions of segmentation as well as their approaches and philosophy.

The general goal of segmentation in vle is to find groups of students that differ in important ways assiciate with interest in educational material, participation in the environment, or response to the vle. By understanding the differences among groups, the educator can make better strategic choices about planning, material definition, and course evolution and can engage in more effective educational procedure.

It is not difficult to find groups within student data. What is difficult is to find meaningful segmentattions for particular educational needs. There are many techniques in segmentation and many of them bilong to the field of statistical learning. In this project we will focus on clustering analysis and we will demonstrate techniques that can be used to retrieve valuable insight for optimizing the vle.

Clustering belongs to the field of unsupervised learning. In usupervised learning we do not know the outcome groupings but we attempt to discover them from structure in the data. For instance, we might explore a particulare corse module and ask, “Are there groups that differ in how and when they respond to educational material? If so, what are the characteristics of those groups?” We use the term clustering for this approach.

Clustering is useful in segmentation projects. Educators can use segmentation for discovering groups in the data in order to derive new insight about students. This obniously suggests clustering approaches because the possible student groups are unknown. Still, classification approaches are also useful in such projects and are always complementary to clustering but we will consider them in a future work.

The data set

The dynamical dataset has a huge amount of information, namely 10.655.280 rows. This restricts the use of clustering algorithms due to the high computational complexity that most of them have. For our purposes we will retrieve the total number of clicks for a given course module for each student. This is costly and it must be done with an efficient algorithm. Computational complexity plays a huge roll for this type of analyses since big data are involved in many vle environments.

student_vle_sum <- aggregate(sum_click ∼ code_module + code_presentation + id_student, data=student_vle, sum)

We will merge this relult retrieved from the dynamical data set with the static data set. We will use an inner join of the data sets and this will give us only the students that actualy participated in the vle.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
student_inj <- inner_join(student_Info, student_vle_sum, by = c("id_student", "code_module", "code_presentation"))

The students that where in the student and finally did not interacted at all with the vle where

(dim(student_Info)[1]-dim(student_vle_sum)[1])/dim(student_Info)[1]
## [1] 0.103243

about 10% of the initial population. Keep in mind that we can still evaluate the characteristics of these potential students and consider strategies to make the vle more appealing to them. Let’s see the resulting data set from the inner join

head(student_inj)
##   code_module code_presentation id_student gender               region
## 1         AAA             2013J      11391      M  East Anglian Region
## 2         AAA             2013J      28400      F             Scotland
## 3         AAA             2013J      30268      F North Western Region
## 4         AAA             2013J      31604      F    South East Region
## 5         AAA             2013J      32885      F West Midlands Region
## 6         AAA             2013J      38053      M                Wales
##       highest_education imd_band age_band num_of_prev_attempts
## 1      HE Qualification  90-100%     55<=                    0
## 2      HE Qualification   20-30%    35-55                    0
## 3 A Level or Equivalent   30-40%    35-55                    0
## 4 A Level or Equivalent   50-60%    35-55                    0
## 5    Lower Than A Level   50-60%     0-35                    0
## 6 A Level or Equivalent   80-90%    35-55                    0
##   studied_credits disability final_result sum_click
## 1             240          N         Pass       934
## 2              60          N         Pass      1435
## 3              60          Y    Withdrawn       281
## 4              60          N         Pass      2158
## 5              60          N         Pass      1034
## 6              60          N         Pass      2445

The data set has categorical variables as well as continuous. We can convert some variables to numerical without losing information since there is a hierarchy within this variable.

levels(student_inj$highest_education) <- c(3,4,2,1,5)

“No Formal quals” < “Lower Than A Level” < “A Level or Equivalent” < “HE Qualification” (HE is Higher Education) <“Post Graduate Qualification”

levels(student_inj$age_band) <- c(1,2,3)

0-35 < 35-55 < 55+

levels(student_inj$final_result) <- c(3,1,2,0)

“Withdrawn” < “Fail” < “Pass” < “Distinction”

The Index of Multiple Deprivation is a UK government qualitative study of deprived areas in English local councils. Lets see its levels and quantify the missing values

levels(student_inj$imd_band)
##  [1] "?"       "0-10%"   "10-20"   "20-30%"  "30-40%"  "40-50%"  "50-60%" 
##  [8] "60-70%"  "70-80%"  "80-90%"  "90-100%"
sum(student_inj$imd_band == "?")/length(student_inj$imd_band)
## [1] 0.03606131

Only 3.5% of the data are missing values of imd_band. We omit these observations

miis_emd <-which(student_inj$imd_band == "?")
student_inj <- student_inj[-miis_emd,]
levels(student_inj$imd_band) <- 0:10

For male or female M=1 Female=2. For disability N=1, Y=2

levels(student_inj$gender) <- c(2,1)
levels(student_inj$disability) <- 1:2
student_inj$highest_education <- as.numeric(student_inj$highest_education)
student_inj$age_band <- as.numeric(student_inj$age_band)
student_inj$final_result <- as.numeric(student_inj$final_result)
student_inj$imd_band <- as.numeric(student_inj$imd_band)
student_inj$gender <- as.numeric(student_inj$gender)
student_inj$disability <- as.numeric(student_inj$disability)

With this merged data set we will proceed to demonstarte somecluster analysis. Finaly we “paste” the two first columns and they will repsresent a particular course in a particular semester

There may be a time in which we would like to combine the values of two variables. The unite() function is a convenience function to paste together multiple variable values into one. In essence, it combines two variables of a single observation into one variable.

library(tidyr)
stud_unit <- unite(student_inj,course, code_module, code_presentation, sep = "_", remove = TRUE)
head(stud_unit)
##      course id_student gender               region highest_education
## 1 AAA_2013J      11391      2  East Anglian Region                 2
## 2 AAA_2013J      28400      1             Scotland                 2
## 3 AAA_2013J      30268      1 North Western Region                 1
## 4 AAA_2013J      31604      1    South East Region                 1
## 5 AAA_2013J      32885      1 West Midlands Region                 3
## 6 AAA_2013J      38053      2                Wales                 1
##   imd_band age_band num_of_prev_attempts studied_credits disability
## 1       11        3                    0             240          1
## 2        4        2                    0              60          1
## 3        5        2                    0              60          2
## 4        7        2                    0              60          1
## 5        7        1                    0              60          1
## 6       10        2                    0              60          1
##   final_result sum_click
## 1            3       934
## 2            3      1435
## 3            4       281
## 4            3      2158
## 5            3      1034
## 6            3      2445

Steps of Clustering

Clustering analysis requires two stages: finding a proposed cluster solution and evaluating that solution for the educational needs. For each method we go through the following steps:

• Transform the data if needed for a particular clustering method; for instance, some methods require all numeric data (e.g., kmeans(), mclust()) or all categorical data (e.g., poLCA()).

• Compute a distance matrix if needed; some methods require a precomputed matrix of similarity in order to group observations (e.g., hclust()).

• Apply the clustering method and save its result to an object. For some methods this requires specifying the number (K) of groups desired (e.g., kmeans(), poLCA()).

• For some methods, further parse the object to obtain a solution with K groups (e.g., hclust()).

• Examine the solution in the model object with regard to the underlying data, and consider whether it answers an educational question.

As we’ve already argued, the most difficult part of that process is the last step: establishing whether a proposed statistical solution answers an educational need. Ultimately, a cluster solution is largely just a vector of purported group assignments for each observation, such as “1, 1, 4, 3, 2, 3, 2, 2, 4, 1, 4 . . ..” It is up to the educator to figure out whether that tells a meaningful story for the vle and the students involved.

A Simple Check function

We recommend that you think hard about how you would know whether the solution—assignments of observations to groups—that is proposed by a clustering method is useful for the educational problem. Just because some grouping is proposed by an algorithm does not mean that it will help the improvement of the learning experience. One way we often approach this is to write a simple function that summarizes the data and allows quick inspection of the high-level differences between groups

A segment inspection function may be complex depending on the educational need and might even include plotting as well as data summarization. For purposes here we use a simple function that reports the mean by group. We use mean here instead of a more robust metric such as median because we have several binary variables and mean() easily shows the mixture proportion for them (i.e., 1.5 means a 50% mix of 1 and 2). A very simple function is:

seg_summa <- function(data, groups) {aggregate(data, list(groups), function(x) mean(as.numeric(x)))}
  
seg_summa(stud_unit[, -c(1,2,4,8,9,10)], stud_unit$course)
##      Group.1   gender highest_education imd_band age_band final_result
## 1  AAA_2013J 1.606061          1.743802 7.454545 1.589532     2.925620
## 2  AAA_2014J 1.553009          1.656160 7.234957 1.584527     2.905444
## 3  BBB_2013B 1.115942          2.054677 5.775362 1.326746     2.739789
## 4  BBB_2013J 1.106085          2.054389 5.984383 1.346257     2.735595
## 5  BBB_2014B 1.111371          2.014019 5.924455 1.333333     2.622274
## 6  BBB_2014J 1.122117          2.047694 5.905136 1.355870     2.831761
## 7  CCC_2014B 1.769912          1.900126 6.515803 1.277497     2.984829
## 8  CCC_2014J 1.743756          1.846438 6.548566 1.317299     2.970860
## 9  DDD_2013B 1.599827          1.971478 6.386344 1.295592     2.929127
## 10 DDD_2013J 1.591017          1.908983 6.507092 1.250000     2.956856
## 11 DDD_2014B 1.610377          1.925472 6.417925 1.256604     2.923585
## 12 DDD_2014J 1.592214          1.860881 6.498405 1.259094     2.957881
## 13 EEE_2013J 1.904918          1.956284 6.684153 1.248087     2.709290
## 14 EEE_2014B 1.887789          1.973597 6.658416 1.252475     2.702970
## 15 EEE_2014J 1.861217          1.896388 6.555133 1.235741     2.770913
## 16 FFF_2013B 1.802113          2.006338 6.338732 1.293662     2.771831
## 17 FFF_2013J 1.832168          2.005994 6.212787 1.244256     2.827173
## 18 FFF_2014B 1.830757          2.023184 6.154560 1.243431     2.821484
## 19 FFF_2014J 1.815491          2.018747 6.205723 1.261470     2.880118
## 20 GGG_2013J 1.177130          2.289238 6.048206 1.376682     2.446188
## 21 GGG_2014B 1.202597          2.293506 5.893506 1.385714     2.459740
## 22 GGG_2014J 1.193687          2.364419 5.804878 1.416069     2.540890
##    sum_click
## 1  1701.9642
## 2  1694.1891
## 3   863.0922
## 4   736.4620
## 5   642.6472
## 6   868.8459
## 7  1095.4349
## 8  1176.7627
## 9  1112.3371
## 10  981.5012
## 11  820.3425
## 12  853.0019
## 13 1565.1213
## 14 1302.3878
## 15 1450.5808
## 16 2748.5345
## 17 2408.4076
## 18 2155.0000
## 19 2457.5284
## 20  569.0213
## 21  550.6610
## 22  571.7676

This simple function will help us to inspect cluster solutions efficiently. It is not intended to be a substitute for detailed analysis—and it takes shortcuts such as treating categorical variables as numbers, which is inadvisable except for analysts who understand what they’re doing—yet it provides a quick first check of whether there is something interesting (or uninteresting) occurring in a solution.

With a summary function of this kind we are easily able to answer the following questions related to the business value of a proposed solution:

• Are there obvious differences in group means?

• Does the differentiation point to some underlying story to tell?

• Do we see immediately odd results such as a mean equal to the value of one data level?

Hierarchical Clustering

Hierarchical clustering is a popular method that groups observations according to their similarity. The hclust() method is one way to perform this analysis in R. hclust() is a distance-based algorithm that operates on a dissimilarity matrix, an N-by-N matrix that reports a metric for the distance between each pair of observations.

The hierarchical clustering method begins with each observation in its own cluster. It then successively joins neighboring observations or clusters one at a time according to their distances from one another, and continues this until all observations are linked. This process of repeatedly joining observations and groups is known as an agglomerative method. However, the method hasa high computational cost.

The hierarchical clustering is efficient for numerical as well as categorical data. In R we can compute the distance table with the daisy() function from the cluster package. For example, in case of the BBB course at the 2014J semester, the clustering can be done as follows:

library(cluster)

stud_BBB_2014J <- student_Info[student_Info$code_module == "BBB" 
                                   & student_Info$code_presentation == "2014J" 
                                      & student_Info$final_result == "Fail" , -3]
stud_inf_dis <- daisy(stud_BBB_2014J)
as.matrix(stud_inf_dis)[1:5,1:5]
##           6367      6375      6392      6399      6402
## 6367 0.0000000 0.5386364 0.3750000 0.4659091 0.3750000
## 6375 0.5386364 0.0000000 0.2772727 0.2545455 0.5272727
## 6392 0.3750000 0.2772727 0.0000000 0.2954545 0.3863636
## 6399 0.4659091 0.2545455 0.2954545 0.0000000 0.3636364
## 6402 0.3750000 0.5272727 0.3863636 0.3636364 0.0000000
seg.hc <- hclust(stud_inf_dis, method="average")
plot(seg.hc)

The anove figure is difficult to read, so it is helpful to zoom in on one section of the chart. We can cut it at a specified location and plot just one branch as follows. We coerce it to a dendrogram object (as.dendrogram(. . . )), cut it at a certain height (h=. . . ), and select the resulting branch that we want (. . . $lower[[1]]).

plot(cut(as.dendrogram(seg.hc), h=0.3)$lower[[5]])

We cann see how students are grouped together progressively for the particular course that we chose. We can check the similarity of observations by selecting a few rows listed in the above figure. Observations 6711 and 7544 looks quite similar because they are linked at a low height as are observations 6556 and 7906. On the other hand, observations 8495 and 7729 are only linked at a highest level and sould be relatively dissimilar. Let’s check those directly

stud_BBB_2014J[c("6711", "7544"),] ##similar
##      code_module code_presentation gender   region highest_education
## 6711         BBB             2014J      F Scotland  HE Qualification
## 7544         BBB             2014J      F Scotland  HE Qualification
##      imd_band age_band num_of_prev_attempts studied_credits disability
## 6711    0-10%     0-35                    1             120          N
## 7544   60-70%     0-35                    0             120          N
##      final_result
## 6711         Fail
## 7544         Fail
stud_BBB_2014J[c("6556", "7906"),] ##similar
##      code_module code_presentation gender   region highest_education
## 6556         BBB             2014J      F Scotland  HE Qualification
## 7906         BBB             2014J      F Scotland  HE Qualification
##      imd_band age_band num_of_prev_attempts studied_credits disability
## 6556   30-40%     0-35                    2             120          Y
## 7906   50-60%     0-35                    0             120          Y
##      final_result
## 6556         Fail
## 7906         Fail
stud_BBB_2014J[c("8495", "7729"), ] ##less similar
##      code_module code_presentation gender   region highest_education
## 8495         BBB             2014J      F Scotland  HE Qualification
## 7729         BBB             2014J      F Scotland  HE Qualification
##      imd_band age_band num_of_prev_attempts studied_credits disability
## 8495    0-10%    35-55                    1             150          Y
## 7729   20-30%     0-35                    0              60          N
##      final_result
## 8495         Fail
## 7729         Fail

Finally, we might check one of the goodness-of-fit metrics for a hierarchical cluster solution. One method is the cophenetic correlation coefficient (CPCC), which assesses how well a dendrogram (in this case seg.hc) matches the true distance metric (stud_inf_dis). We use cophenetic() to get the distances from the dendrogram, and compare it to the dist() metrics with cor():

cor(cophenetic(seg.hc), stud_inf_dis)
## [1] 0.7026734

In this case, CPCC > 0.7 indicates a relatively strong fit, meaning that the hierarchical tree represents the distances between students that failed in the exams well.

How do we get specific segment assignments? A dendrogram can be cut into clusters at any height desired, resulting in different numbers of groups. Because a dendrogram can be cut at any point, the analyst must specify the number of groups desired. We can see where the dendrogram would be cut by overlaying its plot() with rect.hclust(), specifying the number of groups we want (k=. . . ):

plot(seg.hc)
rect.hclust(seg.hc, k=4, border="red")

We obtain the assignment vector for observations using cutree():

seg.hc.segment <- cutree(seg.hc, k=4) # membership vector for 4 groups
table(seg.hc.segment)
## seg.hc.segment
##   1   2   3   4 
##  39 339   8   5

We see that group 2 dominate the assignment. Indicating that there is a large group of students that failed the exams and they share some common features. Note that the class labels (1, 2, 3, 4) are in arbitrary order and are not meaningful in themselves. seg.hc.segment is the vector of group assignments.

We use our custom summary function seg.summ(), defined above, to inspect the variables in seg.df with reference to the four clusters:

seg_summa(stud_BBB_2014J, seg.hc.segment)
##   Group.1 code_module code_presentation   gender   region
## 1       1           2                 4 2.000000 6.564103
## 2       2           2                 4 1.014749 6.946903
## 3       3           2                 4 1.125000 5.000000
## 4       4           2                 4 2.000000 6.400000
##   highest_education imd_band age_band num_of_prev_attempts studied_credits
## 1          1.948718 5.666667 1.461538            0.2051282        73.84615
## 2          2.315634 5.324484 1.289086            0.2536873        75.35398
## 3          1.000000 6.000000 1.000000            0.1250000        67.50000
## 4          3.000000 4.800000 1.600000            0.0000000        96.00000
##   disability final_result
## 1   1.051282            2
## 2   1.070796            2
## 3   2.000000            2
## 4   2.000000            2

From the above table we see that gender is the most ‘homogeneous’ variable of the 2nd cluster. Also, most of the students that failed lie between the range 30-50% of the imd band and had the highest mean value of previous attempts. We can further study the statistical properties of other variables and further analyze the characteristics of the second group. Obniously the second group has many common characteristics and this can help the educator to organize a strategy to approach the students.

Mean-Based Clustering: kmeans()

K-means clustering attempts to find groups that are most compact, in terms of the mean sum-of-squares deviation of each observation from the multivariate center (centroid) of its assigned group. Because it explicitly computes a mean deviation, k-means clustering relies on Euclidean distance. Thus it is only appropriate for numeric data or data that can be reasonably coerced to numeric. We will use stud_unit data.frame that we constructed in the previous. We will again focus our attention to the BBB course of 2014J semester for the students that failed.

stud_unit_numeric <- stud_unit[stud_unit$course == "BBB_2014J" & stud_unit$final_result == 1, ]
stud_unit_numeric <- stud_unit_numeric[, -c(1,2,4,11)] ##removing non-numerical variables

and now se can use the k-means clustering for 4 centroids

set.seed(1234)
seg_stud_km <- kmeans(stud_unit_numeric, centers=4)

We use our custom function seg.summ() to do a quick check of the data by proposed group, where cluster assignments are found in the $cluster vector inside the seg_stud_km model:

seg_summa(stud_unit_numeric, seg_stud_km$cluster)
##   Group.1   gender highest_education imd_band age_band
## 1       1 1.136670          2.097209 5.665063 1.271415
## 2       2 1.192308          2.153846 5.769231 1.769231
## 3       3 1.057143          1.933333 6.500000 1.561905
## 4       4 1.116904          2.000000 6.107425 1.409163
##   num_of_prev_attempts studied_credits disability final_result sum_click
## 1           0.19249278        80.71704   1.101059     3.024062   262.693
## 2           0.03846154        64.61538   1.153846     2.423077  5061.654
## 3           0.05238095        72.78571   1.061905     2.538095  2421.205
## 4           0.14533965        75.85308   1.064771     2.630332  1176.562

We now see some interesting differences; the groups appear to vary by gender, highest education, imd and age band, sum clicks and final results.

For example, we can visually check the distribution of income according to segment (which kmeans() stored in seg.k$cluster) using boxplot():

bwplot(stud_unit_numeric$sum_click ∼ seg_stud_km$cluster, ylab="Sum of Clicks", xlab="Cluster", horizontal =F)

The resulting Figure shows substantial differences in number of clicks by segment. Note that in clustering models, the group labels are in arbitrary order, so don’t worry if your solution shows the same pattern with different labels.

We visualize the clusters by plotting them against a dimensional plot. clusplot() will perform dimensional reduction with principal components or multidimensional scaling as the data warrant, and then plot the observations with cluster membership identified.

We use clusplot from the cluster package with arguments to color the groups, shade the ellipses for group membership, label only the groups (not the individual points) with labels=4, and omit distance lines between groups (lines=0):

library(cluster)
clusplot(stud_unit_numeric, seg_stud_km$cluster, color=TRUE, shade=TRUE, labels=4, lines=0, main="K-means cluster plot")

Overall, this is a far more interesting cluster solution for our segmentation data than the hclust() proposal. The groups here are clearly differentiated on key variables such as age and income.With this information, an analyst might cross-reference the group membership with key variables and then look at the relative differentiation of the groups as it is depected in the previous plot.

This may suggest an educational strategy to help students pass. In the present case, for instance, we see that group 3 is modestly well differentiated, and has the lowest education. That may make it a good target for potential extra material of basic knowledge. Many other strategies are possible, too; the key point is that the analysis provides interesting options to consider.

Recap of Clustering

We’ve covered two statistical methods to identify potential groups of observations in a data set. There are two points that are crucial for success in segmentation projects:

• Different methods are likely to yield different solutions, and in general there is no absolute “right” answer. We recommend to try multiple clustering methods with different potential numbers of clusters.

• The results of segmentation are primarily about the value of the vle, and solutions should be evaluated in terms of both model fit and educational utility. Although model fit is an important criterion and should not be overlooked, it is ultimately necessary that an answer can be communicated to and used by educators.