Before professor Tomasz Kopczewski's Microeconomics class at the University of Warsaw, every student completes a survey about their self-assessed IT knowledge, economics knowledge and attitude towards economics. They also declare their preferred teaching methods and desired group size. For this assignment on clustering, I use his data from four years: 2018, 2019, 2020 and 2021. I will cluster the students' qualities with several methods, choose the most stable one, and analyze the resulting clusters in order to propose recommendations for the teaching methods in his future classes.
#inspect the dataset
str(df21)
## 'data.frame': 69 obs. of 20 variables:
## $ TreatmentID : int 1 1 1 1 1 1 1 1 1 1 ...
## $ PeriodType : chr "payment" "payment" "payment" "payment" ...
## $ PeriodNo : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Id : int 20 41 2 26 39 44 35 42 14 29 ...
## $ Group : chr "ALL" "ALL" "ALL" "ALL" ...
## $ Role : logi NA NA NA NA NA NA ...
## $ Profit : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ProfitTotal : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Payment : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PaymentTotal: int 0 0 0 0 0 0 0 0 0 0 ...
## $ nickname : chr "Nomin" "Ania_39" "mateuszc" "LeBron" ...
## $ sex : chr "female" "female" "male" "female" ...
## $ BA : chr "No" "Yes" "Yes" "No" ...
## $ Attitude : int 100 71 72 63 100 64 72 74 67 75 ...
## $ Varian : int 43 50 72 49 49 85 81 28 50 89 ...
## $ IT_lit : int 50 42 64 56 100 81 78 78 8 63 ...
## $ theory : int 30 20 20 20 40 5 28 20 30 45 ...
## $ exper : int 40 65 60 50 40 45 34 35 30 20 ...
## $ quan : int 30 15 20 30 20 40 38 45 50 35 ...
## $ team : chr "3" "3" "2" "3" ...
str(df20)
## 'data.frame': 97 obs. of 21 variables:
## $ Year : int 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 ...
## $ TreatmentID : int 1 1 1 1 1 1 1 1 1 1 ...
## $ PeriodType : chr "payment" "payment" "payment" "payment" ...
## $ PeriodNo : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Id : int 108 30 49 12 7 77 15 26 75 43 ...
## $ Group : chr "ALL" "ALL" "ALL" "ALL" ...
## $ Role : logi NA NA NA NA NA NA ...
## $ Profit : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ProfitTotal : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Payment : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PaymentTotal: int 0 0 0 0 0 0 0 0 0 0 ...
## $ nickname : chr "bubu" "jguy" "X_DE_X" "shah" ...
## $ sex : chr "male" "male" "female" "male" ...
## $ BA : chr "No" "Yes" "Yes" "No" ...
## $ Attitude : int 76 73 32 100 32 84 67 58 88 64 ...
## $ Varian : int 17 58 42 53 37 56 51 51 51 53 ...
## $ IT_lit : int 49 62 15 67 24 22 39 35 71 38 ...
## $ theory : int 20 20 30 10 30 40 20 20 35 20 ...
## $ exper : int 50 40 40 50 35 30 50 30 35 20 ...
## $ quan : int 30 40 30 40 35 30 30 40 30 40 ...
## $ team : chr "1" "3" "2" "4 to 5" ...
str(df19)
## 'data.frame': 55 obs. of 21 variables:
## $ Year : int 2019 2019 2019 2019 2019 2019 2019 2019 2019 2019 ...
## $ TreatmentID : int 1 1 1 1 1 1 1 1 1 1 ...
## $ PeriodType : chr "payment" "payment" "payment" "payment" ...
## $ PeriodNo : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Id : int 57 52 53 54 55 56 1 2 3 4 ...
## $ Group : chr "ALL" "ALL" "ALL" "ALL" ...
## $ Role : logi NA NA NA NA NA NA ...
## $ Profit : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ProfitTotal : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Payment : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PaymentTotal: int 0 0 0 0 0 0 0 0 0 0 ...
## $ nickname : chr "Boodie" "Venri" "Olia_Ni" "Vxy" ...
## $ sex : chr "male" "male" "female" "male" ...
## $ BA : chr "Yes" "No" "No" "Yes" ...
## $ Attitude : int 58 52 51 61 69 48 70 84 30 78 ...
## $ Varian : int 46 32 50 49 46 30 94 50 61 50 ...
## $ IT_lit : int 73 69 10 83 4 41 90 80 49 31 ...
## $ theory : int 20 20 50 40 60 20 20 33 10 30 ...
## $ exper : int 40 40 50 20 30 40 50 33 70 30 ...
## $ quan : int 40 60 50 40 10 40 30 33 20 40 ...
## $ team : chr "2" "3" "2" "2" ...
str(df18)
## 'data.frame': 51 obs. of 21 variables:
## $ Year : int 2018 2018 2018 2018 2018 2018 2018 2018 2018 2018 ...
## $ TreatmentID : int 1 1 1 1 1 1 1 1 1 1 ...
## $ PeriodType : chr "payment" "payment" "payment" "payment" ...
## $ PeriodNo : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Id : int 41 5 6 7 8 9 10 11 12 13 ...
## $ Group : chr "ALL" "ALL" "ALL" "ALL" ...
## $ Role : logi NA NA NA NA NA NA ...
## $ Profit : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ProfitTotal : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Payment : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PaymentTotal: int 0 0 0 0 0 0 0 0 0 0 ...
## $ nickname : chr "no1" "no2" "no3" "no4" ...
## $ sex : chr "female" "female" "male" "male" ...
## $ BA : chr "Yes" "No" "Yes" "Yes" ...
## $ Attitude : int 63 59 78 73 46 100 14 66 48 74 ...
## $ Varian : int 48 27 57 79 61 87 50 44 31 24 ...
## $ IT_lit : int 62 27 85 69 64 71 78 43 47 95 ...
## $ theory : int 20 25 20 30 35 40 15 20 70 20 ...
## $ exper : int 40 45 60 30 15 30 35 50 20 40 ...
## $ quan : int 40 30 20 40 50 30 50 30 10 40 ...
## $ team : chr "3" "4 to 5" "1" "2" ...
As there is a slight difference between the 2021 dataset and the earlier years, I adjust the 2021 dataset and then merge all four datasets together.
#change format of df21 to match the other 3 years
df21$Year = 2021
df21 <- df21[,c(21,1:20)]
#merge 4 datasets
df <- rbind(df21, df20, df19, df18)
main <- df[,c("Year", "sex", "BA", "Attitude", "Varian", "IT_lit", "theory", "exper", "quan", "team")]
Below are the explanations of each variable in the dataset:
Year: The year of the class in which the student data was recorded
sex: Gender of the student
BA: 'Yes' if the student previously completed a BA at WNE, 'No' otherwise
Varian: Economics quality of a student, based on how easily they understand Varian's textbook; the higher the score, the better
IT_lit: IT quality of a student; the highest score indicates proficiency in programming languages, the lowest indicates familiarity only with the office suite
Attitude: Positiveness of the attitude towards economics; the higher the score, the more meaningful the student finds economics
quan: The percentage of quantitative methods the student wants in the course
exper: The percentage of experiments the student wants in the course
theory: The percentage of pure theory the student wants in the course
team: The preferred team size for homework
Then I clean the data, convert the team variable from categorical to numeric, and select the numerical variables for clustering.
#clean data
main = na.omit(main)
#for the ideal-group-size variable, convert the character value "4 to 5" to the numerical value 4.5
main[main$team == "4 to 5",]$team = 4.5
#convert the "team" variable from character to numeric
main$team <- as.numeric(main$team)
#select the continuous variables
focus <- main[c("Varian","IT_lit","Attitude","quan","exper","theory", "team")]
After that, I check the correlations among variables.
#check relations between variables
focus_matrix <- data.matrix(focus, rownames.force = NA)
M <- cor(focus_matrix)
corrplot(M, method = "number", number.cex = 0.75)
The variables appear to be only weakly correlated. 'Experiment' and 'Quant method' show a slight relationship, since one can partly substitute for the other, but the correlation is not strong.
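To back up this visual impression, the pairwise correlations can also be tested for significance. A minimal sketch, assuming the corrplot package loaded above; insignificant correlations are marked on the plot:
#test significance of pairwise correlations
p_values <- cor.mtest(focus_matrix)$p
corrplot(M, method = "number", number.cex = 0.75, p.mat = p_values, sig.level = 0.05)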
After that, I standardize the data and pick three variables for clustering: Attitude, Varian and IT_lit. These three variables capture the students' background knowledge, which is a critical factor in learning (Fisher, D., & Frey, N. (2009). Background knowledge: The missing piece of the comprehension puzzle).
# z-score standardization below
focus_z <- as.data.frame(lapply(focus, scale))
#create variables for clustering
focus_c <- focus_z[,c("Varian", "IT_lit","Attitude")]
dim(focus_c)
## [1] 272 3
plot(focus_c)
First, I use the Hopkins statistic to assess whether the data is worth clustering. Since there are 272 observations, I draw a random sample of about 10% (30 observations) to compute the statistic.
#Hopkins stat
get_clust_tendency(focus_c, 30 , graph=TRUE, gradient=list(low="blue", high="white"))
## $hopkins_stat
## [1] 0.6236326
##
## $plot
The score is 0.62, which is above 0.5, so the dataset shows some tendency to cluster.
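Because the statistic is computed from one random draw of 30 observations, its value varies from sample to sample. A small sketch (assuming factoextra is already loaded) to check that 0.62 is not an artifact of a single draw:
#repeat the Hopkins statistic over several random samples of 30 observations
set.seed(123)
hopkins_reps <- replicate(10, get_clust_tendency(focus_c, 30, graph = FALSE)$hopkins_stat)
summary(hopkins_reps)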
K-means clustering
The first method I try on the dataset is k-means.
# Silhouette method
fviz_nbclust(focus_c, kmeans, method = "silhouette")+
labs(subtitle = "Silhouette method")
From the graph, we can see that the silhouette score is fairly stable from 2 to 10 clusters. Therefore, to keep the solution simple, I choose 2 as the number of clusters for k-means.
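As a cross-check on this choice, other criteria can be consulted as well. A sketch using the same fviz_nbclust helper with the elbow (within-cluster sum of squares) and gap-statistic criteria; nboot is kept small here purely for speed:
# Elbow method
fviz_nbclust(focus_c, kmeans, method = "wss")+
labs(subtitle = "Elbow method")
# Gap statistic
set.seed(123)
fviz_nbclust(focus_c, kmeans, method = "gap_stat", nboot = 50)+
labs(subtitle = "Gap statistic method")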
# Run KMeans
focus_c_kmeans <- eclust(focus_c, k=2 , FUNcluster="kmeans")
# Check silhouette score
fviz_silhouette(focus_c_kmeans)
## cluster size ave.sil.width
## 1 1 106 0.19
## 2 2 166 0.25
# Check centers
focus_c_kmeans$centers
## Varian IT_lit Attitude
## 1 -0.5024998 -0.13752539 -0.8836715
## 2 0.3208734 0.08781742 0.5642721
Secondly, I try PAM (partitioning around medoids) to see how it performs.
#Trying with PAM
# Silhouette method
fviz_nbclust(focus_c, cluster::pam, method = "silhouette")+
labs(subtitle = "Silhouette method")
The silhouette criterion again suggests 2 clusters, so I follow that suggestion.
# Run PAM
focus_c_pam <- eclust(focus_c, k=2 , FUNcluster="pam")
# Check silhouette score
fviz_silhouette(focus_c_pam)
## cluster size ave.sil.width
## 1 1 174 0.28
## 2 2 98 0.18
# Check medoids
focus_c_pam$medoids
## Varian IT_lit Attitude
## [1,] 0.09339652 0.4296595 0.1483769
## [2,] -0.44815852 -0.6956111 -0.5034431
According to the silhouette scores, the quality of PAM is roughly similar to that of k-means. However, when the clusters are projected onto two dimensions, they overlap more for PAM than for k-means.
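To quantify how much the two partitions agree, they can be cross-tabulated. A minimal sketch, assuming both eclust results expose the assignments in their $cluster element (as used for k-means later on):
#cross-tabulate the k-means and PAM assignments
table(kmeans = focus_c_kmeans$cluster, pam = focus_c_pam$cluster)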
Thirdly, I check the performance of hierarchical clustering on this dataset. There are two approaches to hierarchical clustering: agglomerative and divisive. In this assignment, I choose the agglomerative approach.
Because there is no clear theoretical guidance on which linkage to use, I go with Ward's method.
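One empirical way to sanity-check the linkage choice is to compare the agglomerative coefficient of several linkages; higher values indicate a stronger clustering structure. A sketch using cluster::agnes:
#compare agglomerative coefficients across linkage methods
library(cluster)
linkages <- c("average", "single", "complete", "ward")
sapply(linkages, function(m) agnes(focus_c, method = m)$ac)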
#Do Hierarchical method
focus_c_hierarchy <- eclust(focus_c, k = 2, FUNcluster="hclust", hc_metric="euclidean", hc_method="ward.D")
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
plot(focus_c_hierarchy, cex=0.5, hang=-1)
rect.hclust(focus_c_hierarchy, k=2, border='blue')
After trying these three options, I check which one should be used for further analysis by means of a stability comparison.
There are a few metrics used to measure stability:
The average proportion of non-overlap (APN)
The average distance (AD)
The average distance between means (ADM)
The figure of merit (FOM)
For all four measures, smaller values indicate a more stable clustering.
#Check the most suitable method
clmethods <- c("hierarchical","kmeans","pam")
st <- clValid(focus_c, nClust=2:6, clMethods=clmethods, validation="stability", method="ward")
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
## Warning in clValid(focus_c, nClust = 2:6, clMethods = clmethods, validation =
## "stability", : rownames for data not specified, using 1:nrow(data)
optimalScores(st)
## Score Method Clusters
## APN 0.2930690 kmeans 2
## AD 1.7653648 pam 6
## ADM 0.7056026 kmeans 2
## FOM 0.9945678 kmeans 3
The stability measures mostly favor k-means (three of the four criteria), and both APN and ADM point to 2 clusters, so I continue with k-means and 2 clusters.
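To inspect the full table of stability scores for every method and cluster count, rather than only the optima, the clValid object can be summarized:
#full stability table across methods and numbers of clusters
summary(st)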
First, I merge the 2-cluster k-means assignment into the main data, then split the data into two subsets containing cluster 1 and cluster 2 respectively.
#Merging the cluster variable to the main data and split the main to clustered parts
main$cluster <- focus_c_kmeans$cluster
main1 <- main[main$cluster == 1,]
main2 <- main[main$cluster == 2,]
I then create boxplots for the total sample, cluster 1 and cluster 2 in order to see how each variable behaves.
Total box plots
#Total
total_Attitude <- boxplot(main$Attitude, main = "Total - Attitude")
total_Varian <- boxplot(main$Varian, main = "Total - Economics quality")
total_IT <- boxplot(main$IT_lit, main = "Total - IT quality")
total_quan <- boxplot(main$quan, main = "Total - Quant method")
total_exper <- boxplot(main$exper, main = "Total - Experiment")
total_theory <- boxplot(main$theory, main = "Total - Theory")
total_team <- boxplot(main$team, main = "Total - Team size")
Cluster 1 box plots
#Cluster1
Cluster1_Attitude <- boxplot(main1$Attitude, main = "Cluster1 - Attitude")
Cluster1_Varian <- boxplot(main1$Varian, main = "Cluster1 - Economics quality")
Cluster1_IT <- boxplot(main1$IT_lit, main = "Cluster1 - IT quality")
Cluster1_quan <- boxplot(main1$quan, main = "Cluster1 - Quant method")
Cluster1_exper <- boxplot(main1$exper, main = "Cluster1 - Experiment")
Cluster1_theory <- boxplot(main1$theory, main = "Cluster1 - Theory")
Cluster1_team <- boxplot(main1$team, main = "Cluster1 - Team size")
Cluster 2 box plots
#Cluster2
Cluster2_Attitude <- boxplot(main2$Attitude, main = "Cluster2 - Attitude")
Cluster2_Varian <- boxplot(main2$Varian, main = "Cluster2 - Economics quality")
Cluster2_IT <- boxplot(main2$IT_lit, main = "Cluster2 - IT quality")
Cluster2_quan <- boxplot(main2$quan, main = "Cluster2 - Quant method")
Cluster2_exper <- boxplot(main2$exper, main = "Cluster2 - Experiment")
Cluster2_theory <- boxplot(main2$theory, main = "Cluster2 - Theory")
Cluster2_team <- boxplot(main2$team, main = "Cluster2 - Team size")
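As an alternative presentation (a sketch, not part of the original analysis), each variable could also be shown as a single boxplot grouped by cluster, which makes the two clusters easier to compare side by side:
#one boxplot per variable, grouped by cluster
par(mfrow = c(2, 4))
for (v in c("Attitude", "Varian", "IT_lit", "quan", "exper", "theory", "team")) {
  boxplot(main[[v]] ~ main$cluster, main = v, xlab = "cluster", ylab = v)
}
par(mfrow = c(1, 1))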
After that, there is one variable in the main data that I want to examine more deeply: "BA". I assume that having previously studied economics at the faculty has a strong impact on a student's quality and preferred teaching methods.
Below, I create summary tables of mean values for the total dataset, cluster 1 and cluster 2, respectively, each split into BA and non-BA students.
#Prepare the data in splitting BA and Non BA for analysis
#Total
Total <- data.frame(c(mean(main$Attitude),
mean(main$Varian),
mean(main$IT_lit),
mean(main$quan),
mean(main$exper),
mean(main$theory),
mean(main$team)),
c(mean(main[main$BA == "Yes",]$Attitude),
mean(main[main$BA == "Yes",]$Varian),
mean(main[main$BA == "Yes",]$IT_lit),
mean(main[main$BA == "Yes",]$quan),
mean(main[main$BA == "Yes",]$exper),
mean(main[main$BA == "Yes",]$theory),
mean(main[main$BA == "Yes",]$team)),
c(mean(main[main$BA == "No",]$Attitude),
mean(main[main$BA == "No",]$Varian),
mean(main[main$BA == "No",]$IT_lit),
mean(main[main$BA == "No",]$quan),
mean(main[main$BA == "No",]$exper),
mean(main[main$BA == "No",]$theory),
mean(main[main$BA == "No",]$team)))
col_name <- c("Total", "BA student", "Non BA student")
row_name <- c("Attitude", "Economics quality", "IT quality", "Quant Method","Experiment","Theory", "Size")
colnames(Total) <- col_name
rownames(Total) <- row_name
#Cluster 1
Cluster1 <- data.frame(c(mean(main1$Attitude),
mean(main1$Varian),
mean(main1$IT_lit),
mean(main1$quan),
mean(main1$exper),
mean(main1$theory),
mean(main1$team)),
c(mean(main1[main1$BA == "Yes",]$Attitude),
mean(main1[main1$BA == "Yes",]$Varian),
mean(main1[main1$BA == "Yes",]$IT_lit),
mean(main1[main1$BA == "Yes",]$quan),
mean(main1[main1$BA == "Yes",]$exper),
mean(main1[main1$BA == "Yes",]$theory),
mean(main1[main1$BA == "Yes",]$team)),
c(mean(main1[main1$BA == "No",]$Attitude),
mean(main1[main1$BA == "No",]$Varian),
mean(main1[main1$BA == "No",]$IT_lit),
mean(main1[main1$BA == "No",]$quan),
mean(main1[main1$BA == "No",]$exper),
mean(main1[main1$BA == "No",]$theory),
mean(main1[main1$BA == "No",]$team)))
colnames(Cluster1) <- col_name
rownames(Cluster1) <- row_name
#Cluster 2
Cluster2 <- data.frame(c(mean(main2$Attitude),
mean(main2$Varian),
mean(main2$IT_lit),
mean(main2$quan),
mean(main2$exper),
mean(main2$theory),
mean(main2$team)),
c(mean(main2[main2$BA == "Yes",]$Attitude),
mean(main2[main2$BA == "Yes",]$Varian),
mean(main2[main2$BA == "Yes",]$IT_lit),
mean(main2[main2$BA == "Yes",]$quan),
mean(main2[main2$BA == "Yes",]$exper),
mean(main2[main2$BA == "Yes",]$theory),
mean(main2[main2$BA == "Yes",]$team)),
c(mean(main2[main2$BA == "No",]$Attitude),
mean(main2[main2$BA == "No",]$Varian),
mean(main2[main2$BA == "No",]$IT_lit),
mean(main2[main2$BA == "No",]$quan),
mean(main2[main2$BA == "No",]$exper),
mean(main2[main2$BA == "No",]$theory),
mean(main2[main2$BA == "No",]$team)))
colnames(Cluster2) <- col_name
rownames(Cluster2) <- row_name
Total
## Total BA student Non BA student
## Attitude 72.268382 69.252033 74.758389
## Economics quality 50.102941 57.471545 44.020134
## IT quality 56.308824 56.317073 56.302013
## Quant Method 35.433824 34.780488 35.973154
## Experiment 41.911765 42.333333 41.563758
## Theory 25.294118 23.699187 26.610738
## Size 2.382353 2.276423 2.469799
Cluster1
## Total BA student Non BA student
## Attitude 56.000000 50.068182 60.209677
## Economics quality 39.896226 48.500000 33.790323
## IT quality 52.886792 50.840909 54.338710
## Quant Method 34.594340 34.500000 34.661290
## Experiment 42.641509 41.318182 43.580645
## Theory 24.084906 23.954545 24.177419
## Size 2.367925 2.113636 2.548387
Cluster2
## Total BA student Non BA student
## Attitude 82.656627 79.936709 85.126437
## Economics quality 56.620482 62.468354 51.310345
## IT quality 58.493976 59.367089 57.701149
## Quant Method 35.969880 34.936709 36.908046
## Experiment 41.445783 42.898734 40.126437
## Theory 26.066265 23.556962 28.344828
## Size 2.391566 2.367089 2.413793
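The three tables above could also be built more compactly with a small helper function; a sketch, assuming the same main, main1 and main2 data frames and column names as used throughout:
#compute the mean of each numeric variable, overall and split by BA status
vars <- c("Attitude", "Varian", "IT_lit", "quan", "exper", "theory", "team")
summarize_means <- function(d) {
  data.frame(
    Total = colMeans(d[, vars]),
    `BA student` = colMeans(d[d$BA == "Yes", vars]),
    `Non BA student` = colMeans(d[d$BA == "No", vars]),
    check.names = FALSE
  )
}
#row names keep the raw variable names; rename with row_name if desired
Total <- summarize_means(main)
Cluster1 <- summarize_means(main1)
Cluster2 <- summarize_means(main2)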
Attitude, economics quality and IT quality are the three variables used for clustering, so I present them again as bar charts, broken down into BA and non-BA students.
#Chart for Attitude, Economics Quality and IT Quality
#Total
AEI_total <- t(Total[1:3,])
AEI_total_chart <- barplot(as.matrix(AEI_total), main="Total", ylab= "Total", beside=TRUE, col=rainbow(3), ylim = c(0, 100))
legend(6,100, c("Total", "BA Student", "Non BA Student"), cex=0.6, bty="n", fill=rainbow(3));
#Cluster 1
AEI_Cluster1 <- t(Cluster1[1:3,])
AEI_Cluster1_chart <- barplot(as.matrix(AEI_Cluster1), main="Cluster1", ylab= "Cluster1", beside=TRUE, col=rainbow(3), ylim = c(0, 100))
legend(6,100, c("Total", "BA Student", "Non BA Student"), cex=0.6, bty= "n", fill=rainbow(3));
#Cluster 2
AEI_Cluster2 <- t(Cluster2[1:3,])
AEI_Cluster2_chart <- barplot(as.matrix(AEI_Cluster2), main="Cluster2", ylab= "Cluster2", beside=TRUE, col=rainbow(3), ylim = c(0, 100))
legend(6,100, c("Total", "BA Student", "Non BA Student"), cex=0.6, bty="n", fill=rainbow(3));
At first glance, in the total sample, IT quality does not differ between BA and non-BA students. Students who previously completed a BA at WNE have higher economics quality, but their attitude towards economics is slightly less positive than that of the other students.
This pattern holds in both cluster 1 and cluster 2, although students in cluster 2 have higher qualities and a more positive attitude than those in cluster 1.
Secondly, I investigate how students prefer to study in the microeconomics class.
#Chart for preferred learning methods
#Total
lm_total <- Total[4:6,]
lm_total_total <- pie(lm_total[,"Total"], labels = round(lm_total[,"Total"]), main = "Total - Total student")
legend(1,1, c("Quant Method","Experiment","Theory"),
fill = c("white", "lightblue", "mistyrose"))
lm_total_BA <- pie(lm_total[,"BA student"], labels = round(lm_total[,"BA student"]), main = "Total - BA student")
legend(1,1, c("Quant Method","Experiment","Theory"),
fill = c("white", "lightblue", "mistyrose"))
lm_total_NoBA <- pie(lm_total[,"Non BA student"], labels = round(lm_total[,"Non BA student"]), main = "Total - Non BA student")
legend(1,1, c("Quant Method","Experiment","Theory"),
fill = c("white", "lightblue", "mistyrose"))
#Cluster1
lm_Cluster1 <- Cluster1[4:6,]
lm_Cluster1_total <- pie(lm_Cluster1[,"Total"], labels = round(lm_Cluster1[,"Total"]), main = "Cluster1 - Total student")
legend(1,1, c("Quant Method","Experiment","Theory"),
fill = c("white", "lightblue", "mistyrose"))
lm_Cluster1_BA <- pie(lm_Cluster1[,"BA student"], labels = round(lm_Cluster1[,"BA student"]), main = "Cluster1 - BA student")
legend(1,1, c("Quant Method","Experiment","Theory"),
fill = c("white", "lightblue", "mistyrose"))
lm_Cluster1_NoBA <- pie(lm_Cluster1[,"Non BA student"], labels = round(lm_Cluster1[,"Non BA student"]), main = "Cluster1 - Non BA student")
legend(1,1, c("Quant Method","Experiment","Theory"),
fill = c("white", "lightblue", "mistyrose"))
#Cluster2
lm_Cluster2 <- Cluster2[4:6,]
lm_Cluster2_total <- pie(lm_Cluster2[,"Total"], labels = round(lm_Cluster2[,"Total"]), main = "Cluster2 - Total student")
legend(1,1, c("Quant Method","Experiment","Theory"),
fill = c("white", "lightblue", "mistyrose"))
lm_Cluster2_BA <- pie(lm_Cluster2[,"BA student"], labels = round(lm_Cluster2[,"BA student"]), main = "Cluster2 - BA student")
legend(1,1, c("Quant Method","Experiment","Theory"),
fill = c("white", "lightblue", "mistyrose"))
lm_Cluster2_NoBA <- pie(lm_Cluster2[,"Non BA student"], labels = round(lm_Cluster2[,"Non BA student"]), main = "Cluster2 - Non BA student")
legend(1,1, c("Quant Method","Experiment","Theory"),
fill = c("white", "lightblue", "mistyrose"))
Looking at the total sample, students want experiments the most, followed by quantitative methods, while pure theory is the least preferred. BA students prefer theory even less than non-BA students, leaning slightly more towards experiments instead.
In general, there is not much difference between the totals of cluster 1, cluster 2 and the overall sample; the preferences are consistent. One noticeable difference is that non-BA students in cluster 2 prefer more theory and quantitative methods, and fewer experiments, than non-BA students in cluster 1.
Finally, I look at the preferred group size across clusters, and between BA and non-BA students within each cluster.
#Chart for ideal group size
#Total
gs_Total <- t(Total[7,])
gs_Total_chart <- barplot(as.matrix(gs_Total), main="Total", ylab= "Total", beside=TRUE, col=rainbow(3), ylim = c(0, 5))
legend(1,5, c("Total", "BA Student", "Non BA Student"), cex=0.6, bty="n", fill=rainbow(3));
#Cluster1
gs_Cluster1 <- t(Cluster1[7,])
gs_Cluster1_chart <- barplot(as.matrix(gs_Cluster1), main="Cluster1", ylab= "Cluster1", beside=TRUE, col=rainbow(3),ylim = c(0, 5))
legend(1,5, c("Total", "BA Student", "Non BA Student"), cex=0.6, bty="n", fill=rainbow(3));
#Cluster2
gs_Cluster2 <- t(Cluster2[7,])
gs_Cluster2_chart <- barplot(as.matrix(gs_Cluster2), main="Cluster2", ylab= "Cluster2",beside=TRUE, col=rainbow(3), ylim = c(0, 5))
legend(1,5, c("Total", "BA Student", "Non BA Student"), cex=0.6, bty="n", fill=rainbow(3));
Looking at group size, students prefer groups of 2 to 3 members, and the pattern is the same across all groups. Since the averages are closer to 2 than to 3, a group size of 2 seems the safest choice if only one option must be picked.
In my opinion, the data is fairly homogeneous across profiles and across clusters: the patterns observed in the total sample generally hold within each cluster. The largest difference between the two clusters is that cluster 2 has higher quality and a more positive attitude than cluster 1.
Thanks to this homogeneity, I believe that a teaching mix chosen by professor Tomasz Kopczewski along the lines indicated in the pie charts will not cause difficulty for any group of students.