Before professor Tomasz Kopczewski's Microeconomics class at the University of Warsaw, every student completes a survey about their self-assessed IT knowledge, economics knowledge and attitude towards economics. They also declare their preferred teaching methods and desired group size. For this assignment on clustering, I use his data from four years: 2018, 2019, 2020 and 2021. I will cluster the students' qualities with several methods, choose the most stable one, and analyze the resulting clusters in order to propose recommendations for the teaching methods in his future classes.
#inspect the dataset
str(df21)
## 'data.frame': 69 obs. of 20 variables:
## $ TreatmentID : int 1 1 1 1 1 1 1 1 1 1 ...
## $ PeriodType : chr "payment" "payment" "payment" "payment" ...
## $ PeriodNo : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Id : int 20 41 2 26 39 44 35 42 14 29 ...
## $ Group : chr "ALL" "ALL" "ALL" "ALL" ...
## $ Role : logi NA NA NA NA NA NA ...
## $ Profit : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ProfitTotal : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Payment : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PaymentTotal: int 0 0 0 0 0 0 0 0 0 0 ...
## $ nickname : chr "Nomin" "Ania_39" "mateuszc" "LeBron" ...
## $ sex : chr "female" "female" "male" "female" ...
## $ BA : chr "No" "Yes" "Yes" "No" ...
## $ Attitude : int 100 71 72 63 100 64 72 74 67 75 ...
## $ Varian : int 43 50 72 49 49 85 81 28 50 89 ...
## $ IT_lit : int 50 42 64 56 100 81 78 78 8 63 ...
## $ theory : int 30 20 20 20 40 5 28 20 30 45 ...
## $ exper : int 40 65 60 50 40 45 34 35 30 20 ...
## $ quan : int 30 15 20 30 20 40 38 45 50 35 ...
## $ team : chr "3" "3" "2" "3" ...
str(df20)
## 'data.frame': 97 obs. of 21 variables:
## $ Year : int 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 ...
## $ TreatmentID : int 1 1 1 1 1 1 1 1 1 1 ...
## $ PeriodType : chr "payment" "payment" "payment" "payment" ...
## $ PeriodNo : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Id : int 108 30 49 12 7 77 15 26 75 43 ...
## $ Group : chr "ALL" "ALL" "ALL" "ALL" ...
## $ Role : logi NA NA NA NA NA NA ...
## $ Profit : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ProfitTotal : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Payment : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PaymentTotal: int 0 0 0 0 0 0 0 0 0 0 ...
## $ nickname : chr "bubu" "jguy" "X_DE_X" "shah" ...
## $ sex : chr "male" "male" "female" "male" ...
## $ BA : chr "No" "Yes" "Yes" "No" ...
## $ Attitude : int 76 73 32 100 32 84 67 58 88 64 ...
## $ Varian : int 17 58 42 53 37 56 51 51 51 53 ...
## $ IT_lit : int 49 62 15 67 24 22 39 35 71 38 ...
## $ theory : int 20 20 30 10 30 40 20 20 35 20 ...
## $ exper : int 50 40 40 50 35 30 50 30 35 20 ...
## $ quan : int 30 40 30 40 35 30 30 40 30 40 ...
## $ team : chr "1" "3" "2" "4 to 5" ...
str(df19)
## 'data.frame': 55 obs. of 21 variables:
## $ Year : int 2019 2019 2019 2019 2019 2019 2019 2019 2019 2019 ...
## $ TreatmentID : int 1 1 1 1 1 1 1 1 1 1 ...
## $ PeriodType : chr "payment" "payment" "payment" "payment" ...
## $ PeriodNo : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Id : int 57 52 53 54 55 56 1 2 3 4 ...
## $ Group : chr "ALL" "ALL" "ALL" "ALL" ...
## $ Role : logi NA NA NA NA NA NA ...
## $ Profit : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ProfitTotal : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Payment : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PaymentTotal: int 0 0 0 0 0 0 0 0 0 0 ...
## $ nickname : chr "Boodie" "Venri" "Olia_Ni" "Vxy" ...
## $ sex : chr "male" "male" "female" "male" ...
## $ BA : chr "Yes" "No" "No" "Yes" ...
## $ Attitude : int 58 52 51 61 69 48 70 84 30 78 ...
## $ Varian : int 46 32 50 49 46 30 94 50 61 50 ...
## $ IT_lit : int 73 69 10 83 4 41 90 80 49 31 ...
## $ theory : int 20 20 50 40 60 20 20 33 10 30 ...
## $ exper : int 40 40 50 20 30 40 50 33 70 30 ...
## $ quan : int 40 60 50 40 10 40 30 33 20 40 ...
## $ team : chr "2" "3" "2" "2" ...
str(df18)
## 'data.frame': 51 obs. of 21 variables:
## $ Year : int 2018 2018 2018 2018 2018 2018 2018 2018 2018 2018 ...
## $ TreatmentID : int 1 1 1 1 1 1 1 1 1 1 ...
## $ PeriodType : chr "payment" "payment" "payment" "payment" ...
## $ PeriodNo : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Id : int 41 5 6 7 8 9 10 11 12 13 ...
## $ Group : chr "ALL" "ALL" "ALL" "ALL" ...
## $ Role : logi NA NA NA NA NA NA ...
## $ Profit : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ProfitTotal : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Payment : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PaymentTotal: int 0 0 0 0 0 0 0 0 0 0 ...
## $ nickname : chr "no1" "no2" "no3" "no4" ...
## $ sex : chr "female" "female" "male" "male" ...
## $ BA : chr "Yes" "No" "Yes" "Yes" ...
## $ Attitude : int 63 59 78 73 46 100 14 66 48 74 ...
## $ Varian : int 48 27 57 79 61 87 50 44 31 24 ...
## $ IT_lit : int 62 27 85 69 64 71 78 43 47 95 ...
## $ theory : int 20 25 20 30 35 40 15 20 70 20 ...
## $ exper : int 40 45 60 30 15 30 35 50 20 40 ...
## $ quan : int 40 30 20 40 50 30 50 30 10 40 ...
## $ team : chr "3" "4 to 5" "1" "2" ...
As there is a slight difference between the 2021 dataset and the earlier years, I adjust the 2021 dataset and then merge all four datasets together.
#change format of df21 to match the other 3 years
df21$Year = 2021
df21 <- df21[,c(21,1:20)]
#merge 4 datasets
df <- rbind(df21, df20, df19, df18)
main <- df[,c("Year", "sex", "BA", "Attitude", "Varian", "IT_lit", "theory", "exper", "quan", "team")]
Below are the explanations of each variable in the dataset:
Year: The year of the class in which the student data was recorded
sex: Gender of the student
BA: 'Yes' if the student previously completed a BA at WNE, 'No' otherwise
Varian: Economics quality of a student, based on how easily they understand Varian's textbook; the higher the score, the better
IT_lit: IT quality of a student; the highest score indicates proficiency in programming languages, the lowest indicates familiarity only with the office suite
Attitude: Positiveness of the attitude towards economics; the higher the score, the more meaningful the student finds economics
quan: The percentage of quantitative methods the student wants in the course
exper: The percentage of experiments the student wants in the course
theory: The percentage of pure theory the student wants in the course
team: The preferred team size for homework
Then I clean the data, convert the team variable from categorical to numeric, and select the numerical variables for clustering.
#clean data
main = na.omit(main)
#for the ideal-group-size variable, convert the character value "4 to 5" to the numerical value 4.5
main[main$team == "4 to 5",]$team = 4.5
#convert the "team" variable from character to numeric
main$team <- as.numeric(main$team)
#select the continuous variables
focus <- main[c("Varian","IT_lit","Attitude","quan","exper","theory", "team")]
After that, I check the correlations among variables.
#check relations between variables
focus_matrix <- data.matrix(focus, rownames.force = NA)
M <- cor(focus_matrix)
corrplot(M, method = "number", number.cex = 0.75)
The variables appear to be only weakly correlated. 'Experiment' and 'Quant method' show a slight relationship, since one can partly substitute for the other, but the correlation is not strong.
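To back up this visual impression, the pairwise correlations can also be tested for significance. A minimal sketch, assuming the corrplot package loaded above; insignificant correlations are marked on the plot:
#test significance of pairwise correlations
p_values <- cor.mtest(focus_matrix)$p
corrplot(M, method = "number", number.cex = 0.75, p.mat = p_values, sig.level = 0.05)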
After that, I standardize the data and pick three variables for clustering: Attitude, Varian and IT_lit. These three variables capture the students' background knowledge, which is a critical factor in learning (Fisher, D., & Frey, N. (2009). Background knowledge: The missing piece of the comprehension puzzle).
# z-score standardization below
focus_z <- as.data.frame(lapply(focus, scale))
#create variables for clustering
focus_c <- focus_z[,c("Varian", "IT_lit","Attitude")]
dim(focus_c)
## [1] 272 3
plot(focus_c)
First, I use the Hopkins statistic to assess whether the data is worth clustering. Since there are 272 observations, I draw a random sample of about 10% (30 observations) to compute the statistic.
#Hopkins stat
get_clust_tendency(focus_c, 30 , graph=TRUE, gradient=list(low="blue", high="white"))
## $hopkins_stat
## [1] 0.6236326
##
## $plot
The score is 0.62, which is above 0.5, so the dataset shows some tendency to cluster.
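Because the statistic is computed from one random draw of 30 observations, its value varies from sample to sample. A small sketch (assuming factoextra is already loaded) to check that 0.62 is not an artifact of a single draw:
#repeat the Hopkins statistic over several random samples of 30 observations
set.seed(123)
hopkins_reps <- replicate(10, get_clust_tendency(focus_c, 30, graph = FALSE)$hopkins_stat)
summary(hopkins_reps)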
K-means clustering
The first method I try on the dataset is k-means.
# Silhouette method
fviz_nbclust(focus_c, kmeans, method = "silhouette")+
labs(subtitle = "Silhouette method")
From the graph, we can see that the silhouette score is fairly stable from 2 to 10 clusters. Therefore, to keep the solution simple, I choose 2 as the number of clusters for k-means.
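As a cross-check on this choice, other criteria can be consulted as well. A sketch using the same fviz_nbclust helper with the elbow (within-cluster sum of squares) and gap-statistic criteria; nboot is kept small here purely for speed:
# Elbow method
fviz_nbclust(focus_c, kmeans, method = "wss")+
labs(subtitle = "Elbow method")
# Gap statistic
set.seed(123)
fviz_nbclust(focus_c, kmeans, method = "gap_stat", nboot = 50)+
labs(subtitle = "Gap statistic method")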
# Run KMeans
focus_c_kmeans <- eclust(focus_c, k=2 , FUNcluster="kmeans")
# Check silhouette score
fviz_silhouette(focus_c_kmeans)
## cluster size ave.sil.width
## 1 1 106 0.19
## 2 2 166 0.25
# Check centers
focus_c_kmeans$centers
## Varian IT_lit Attitude
## 1 -0.5024998 -0.13752539 -0.8836715
## 2 0.3208734 0.08781742 0.5642721
Secondly, I try PAM (partitioning around medoids) to see how it performs.
#Trying with PAM
# Silhouette method
fviz_nbclust(focus_c, cluster::pam, method = "silhouette")+
labs(subtitle = "Silhouette method")
The silhouette criterion again suggests 2 clusters, so I follow that suggestion.
# Run PAM
focus_c_pam <- eclust(focus_c, k=2 , FUNcluster="pam")
# Check silhouette score
fviz_silhouette(focus_c_pam)
## cluster size ave.sil.width
## 1 1 174 0.28
## 2 2 98 0.18
# Check medoids
focus_c_pam$medoids
## Varian IT_lit Attitude
## [1,] 0.09339652 0.4296595 0.1483769
## [2,] -0.44815852 -0.6956111 -0.5034431
According to the silhouette scores, the quality of PAM is roughly similar to that of k-means. However, when the clusters are projected onto two dimensions, they overlap more for PAM than for k-means.
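To quantify how much the two partitions agree, they can be cross-tabulated. A minimal sketch, assuming both eclust results expose the assignments in their $cluster element (as used for k-means later on):
#cross-tabulate the k-means and PAM assignments
table(kmeans = focus_c_kmeans$cluster, pam = focus_c_pam$cluster)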
Thirdly, I check the performance of hierarchical clustering on this dataset. There are two approaches to hierarchical clustering: agglomerative and divisive. In this assignment, I choose the agglomerative approach.
Because there is no clear theoretical guidance on which linkage to use, I go with Ward's method.
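One empirical way to sanity-check the linkage choice is to compare the agglomerative coefficient of several linkages; higher values indicate a stronger clustering structure. A sketch using cluster::agnes:
#compare agglomerative coefficients across linkage methods
library(cluster)
linkages <- c("average", "single", "complete", "ward")
sapply(linkages, function(m) agnes(focus_c, method = m)$ac)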
#Do Hierarchical method
focus_c_hierarchy <- eclust(focus_c, k = 2, FUNcluster="hclust", hc_metric="euclidean", hc_method="ward.D")
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
plot(focus_c_hierarchy, cex=0.5, hang=-1)
rect.hclust(focus_c_hierarchy, k=2, border='blue')
After trying these three options, I check which one should be used for further analysis by means of a stability comparison.
There are a few metrics used to measure stability:
The average proportion of non-overlap (APN)
The average distance (AD)
The average distance between means (ADM)
The figure of merit (FOM)
For all four measures, smaller values indicate a more stable clustering.
#Check the most suitable method
clmethods <- c("hierarchical","kmeans","pam")
st <- clValid(focus_c, nClust=2:6, clMethods=clmethods, validation="stability", method="ward")
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
## Warning in clValid(focus_c, nClust = 2:6, clMethods = clmethods, validation =
## "stability", : rownames for data not specified, using 1:nrow(data)
optimalScores(st)
## Score Method Clusters
## APN 0.2930690 kmeans 2
## AD 1.7653648 pam 6
## ADM 0.7056026 kmeans 2
## FOM 0.9945678 kmeans 3
The stability measures mostly favor k-means (three of the four criteria), and both APN and ADM point to 2 clusters, so I continue with k-means and 2 clusters.
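To inspect the full table of stability scores for every method and cluster count, rather than only the optima, the clValid object can be summarized:
#full stability table across methods and numbers of clusters
summary(st)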
First, I merge the 2-cluster k-means assignment into the main data, then split the data into two subsets containing cluster 1 and cluster 2 respectively.
#Merging the cluster variable to the main data and split the main to clustered parts
main$cluster <- focus_c_kmeans$cluster
main1 <- main[main$cluster == 1,]
main2 <- main[main$cluster == 2,]
I then create boxplots for the total sample, cluster 1 and cluster 2 in order to see how each variable behaves.
Total box plots
#Total
total_Attitude <- boxplot(main$Attitude, main = "Total - Attitude")
total_Varian <- boxplot(main$Varian, main = "Total - Economics quality")
total_IT <- boxplot(main$IT_lit, main = "Total - IT quality")
total_quan <- boxplot(main$quan, main = "Total - Quant method")
total_exper <- boxplot(main$exper, main = "Total - Experiment")
total_theory <- boxplot(main$theory, main = "Total - Theory")
total_team <- boxplot(main$team, main = "Total - Team size")
Cluster 1 box plots
#Cluster1
Cluster1_Attitude <- boxplot(main1$Attitude, main = "Cluster1 - Attitude")
Cluster1_Varian <- boxplot(main1$Varian, main = "Cluster1 - Economics quality")
Cluster1_IT <- boxplot(main1$IT_lit, main = "Cluster1 - IT quality")
Cluster1_quan <- boxplot(main1$quan, main = "Cluster1 - Quant method")
Cluster1_exper <- boxplot(main1$exper, main = "Cluster1 - Experiment")
Cluster1_theory <- boxplot(main1$theory, main = "Cluster1 - Theory")
Cluster1_team <- boxplot(main1$team, main = "Cluster1 - Team size")
Cluster 2 box plots
#Cluster2
Cluster2_Attitude <- boxplot(main2$Attitude, main = "Cluster2 - Attitude")
Cluster2_Varian <- boxplot(main2$Varian, main = "Cluster2 - Economics quality")
Cluster2_IT <- boxplot(main2$IT_lit, main = "Cluster2 - IT quality")
Cluster2_quan <- boxplot(main2$quan, main = "Cluster2 - Quant method")
Cluster2_exper <- boxplot(main2$exper, main = "Cluster2 - Experiment")
Cluster2_theory <- boxplot(main2$theory, main = "Cluster2 - Theory")
Cluster2_team <- boxplot(main2$team, main = "Cluster2 - Team size")
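As an alternative presentation (a sketch, not part of the original analysis), each variable could also be shown as a single boxplot grouped by cluster, which makes the two clusters easier to compare side by side:
#one boxplot per variable, grouped by cluster
par(mfrow = c(2, 4))
for (v in c("Attitude", "Varian", "IT_lit", "quan", "exper", "theory", "team")) {
  boxplot(main[[v]] ~ main$cluster, main = v, xlab = "cluster", ylab = v)
}
par(mfrow = c(1, 1))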
After that, there is one variable in the main data that I want to examine more deeply: "BA". I assume that having previously studied economics at the faculty has a strong impact on a student's quality and preferred teaching methods.
Below, I create summary tables of mean values for the total dataset, cluster 1 and cluster 2, respectively, each split into BA and non-BA students.
#Prepare the data in splitting BA and Non BA for analysis
#Total
Total <- data.frame(c(mean(main$Attitude),
mean(main$Varian),
mean(main$IT_lit),
mean(main$quan),
mean(main$exper),
mean(main$theory),
mean(main$team)),
c(mean(main[main$BA == "Yes",]$Attitude),
mean(main[main$BA == "Yes",]$Varian),
mean(main[main$BA == "Yes",]$IT_lit),
mean(main[main$BA == "Yes",]$quan),
mean(main[main$BA == "Yes",]$exper),
mean(main[main$BA == "Yes",]$theory),
mean(main[main$BA == "Yes",]$team)),
c(mean(main[main$BA == "No",]$Attitude),
mean(main[main$BA == "No",]$Varian),
mean(main[main$BA == "No",]$IT_lit),
mean(main[main$BA == "No",]$quan),
mean(main[main$BA == "No",]$exper),
mean(main[main$BA == "No",]$theory),
mean(main[main$BA == "No",]$team)))
col_name <- c("Total", "BA student", "Non BA student")
row_name <- c("Attitude", "Economics quality", "IT quality", "Quant Method","Experiment","Theory", "Size")
colnames(Total) <- col_name
rownames(Total) <- row_name
#Cluster 1
Cluster1 <- data.frame(c(mean(main1$Attitude),
mean(main1$Varian),
mean(main1$IT_lit),
mean(main1$quan),
mean(main1$exper),
mean(main1$theory),
mean(main1$team)),
c(mean(main1[main1$BA == "Yes",]$Attitude),
mean(main1[main1$BA == "Yes",]$Varian),
mean(main1[main1$BA == "Yes",]$IT_lit),
mean(main1[main1$BA == "Yes",]$quan),
mean(main1[main1$BA == "Yes",]$exper),
mean(main1[main1$BA == "Yes",]$theory),
mean(main1[main1$BA == "Yes",]$team)),
c(mean(main1[main1$BA == "No",]$Attitude),
mean(main1[main1$BA == "No",]$Varian),
mean(main1[main1$BA == "No",]$IT_lit),
mean(main1[main1$BA == "No",]$quan),
mean(main1[main1$BA == "No",]$exper),
mean(main1[main1$BA == "No",]$theory),
mean(main1[main1$BA == "No",]$team)))
colnames(Cluster1) <- col_name
rownames(Cluster1) <- row_name
#Cluster 2
Cluster2 <- data.frame(c(mean(main2$Attitude),
mean(main2$Varian),
mean(main2$IT_lit),
mean(main2$quan),
mean(main2$exper),
mean(main2$theory),
mean(main2$team)),
c(mean(main2[main2$BA == "Yes",]$Attitude),
mean(main2[main2$BA == "Yes",]$Varian),
mean(main2[main2$BA == "Yes",]$IT_lit),
mean(main2[main2$BA == "Yes",]$quan),
mean(main2[main2$BA == "Yes",]$exper),
mean(main2[main2$BA == "Yes",]$theory),
mean(main2[main2$BA == "Yes",]$team)),
c(mean(main2[main2$BA == "No",]$Attitude),
mean(main2[main2$BA == "No",]$Varian),
mean(main2[main2$BA == "No",]$IT_lit),
mean(main2[main2$BA == "No",]$quan),
mean(main2[main2$BA == "No",]$exper),
mean(main2[main2$BA == "No",]$theory),
mean(main2[main2$BA == "No",]$team)))
colnames(Cluster2) <- col_name
rownames(Cluster2) <- row_name
Total
## Total BA student Non BA student
## Attitude 72.268382 69.252033 74.758389
## Economics quality 50.102941 57.471545 44.020134
## IT quality 56.308824 56.317073 56.302013
## Quant Method 35.433824 34.780488 35.973154
## Experiment 41.911765 42.333333 41.563758
## Theory 25.294118 23.699187 26.610738
## Size 2.382353 2.276423 2.469799
Cluster1
## Total BA student Non BA student
## Attitude 56.000000 50.068182 60.209677
## Economics quality 39.896226 48.500000 33.790323
## IT quality 52.886792 50.840909 54.338710
## Quant Method 34.594340 34.500000 34.661290
## Experiment 42.641509 41.318182 43.580645
## Theory 24.084906 23.954545 24.177419
## Size 2.367925 2.113636 2.548387
Cluster2
## Total BA student Non BA student
## Attitude 82.656627 79.936709 85.126437
## Economics quality 56.620482 62.468354 51.310345
## IT quality 58.493976 59.367089 57.701149
## Quant Method 35.969880 34.936709 36.908046
## Experiment 41.445783 42.898734 40.126437
## Theory 26.066265 23.556962 28.344828
## Size 2.391566 2.367089 2.413793
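The three tables above could also be built more compactly with a small helper function; a sketch, assuming the same main, main1 and main2 data frames and column names as used throughout:
#compute the mean of each numeric variable, overall and split by BA status
vars <- c("Attitude", "Varian", "IT_lit", "quan", "exper", "theory", "team")
summarize_means <- function(d) {
  data.frame(
    Total = colMeans(d[, vars]),
    `BA student` = colMeans(d[d$BA == "Yes", vars]),
    `Non BA student` = colMeans(d[d$BA == "No", vars]),
    check.names = FALSE
  )
}
#row names keep the raw variable names; rename with row_name if desired
Total <- summarize_means(main)
Cluster1 <- summarize_means(main1)
Cluster2 <- summarize_means(main2)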
Attitude, economics quality and IT quality are the three variables used for clustering, so I present them again as bar charts, broken down into BA and non-BA students.
#Chart for Attitude, Economics Quality and IT Quality
#Total
AEI_total <- t(Total[1:3,])
AEI_total_chart <- barplot(as.matrix(AEI_total), main="Total", ylab= "Total", beside=TRUE, col=rainbow(3), ylim = c(0, 100))
legend(6,100, c("Total", "BA Student", "Non BA Student"), cex=0.6, bty="n", fill=rainbow(3));
#Cluster 1
AEI_Cluster1 <- t(Cluster1[1:3,])
AEI_Cluster1_chart <- barplot(as.matrix(AEI_Cluster1), main="Cluster1", ylab= "Cluster1", beside=TRUE, col=rainbow(3), ylim = c(0, 100))
legend(6,100, c("Total", "BA Student", "Non BA Student"), cex=0.6, bty= "n", fill=rainbow(3));
#Cluster 2
AEI_Cluster2 <- t(Cluster2[1:3,])
AEI_Cluster2_chart <- barplot(as.matrix(AEI_Cluster2), main="Cluster2", ylab= "Cluster2", beside=TRUE, col=rainbow(3), ylim = c(0, 100))
legend(6,100, c("Total", "BA Student", "Non BA Student"), cex=0.6, bty="n", fill=rainbow(3));
At first glance, in the total sample, IT quality does not differ between BA and non-BA students. Students who previously completed a BA at WNE have higher economics quality, but their attitude towards economics is slightly less positive than that of the other students.
This pattern holds in both cluster 1 and cluster 2, although students in cluster 2 have higher qualities and a more positive attitude than those in cluster 1.
Secondly, I investigate how students prefer to study in the microeconomics class.
#Chart for preferred learning methods
#Total
lm_total <- Total[4:6,]
lm_total_total <- pie(lm_total[,"Total"], labels = round(lm_total[,"Total"]), main = "Total - Total student")
legend(1,1, c("Quant Method","Experiment","Theory"),
fill = c("white", "lightblue", "mistyrose"))
lm_total_BA <- pie(lm_total[,"BA student"], labels = round(lm_total[,"BA student"]), main = "Total - BA student")
legend(1,1, c("Quant Method","Experiment","Theory"),
fill = c("white", "lightblue", "mistyrose"))
lm_total_NoBA <- pie(lm_total[,"Non BA student"], labels = round(lm_total[,"Non BA student"]), main = "Total - Non BA student")
legend(1,1, c("Quant Method","Experiment","Theory"),
fill = c("white", "lightblue", "mistyrose"))
#Cluster1
lm_Cluster1 <- Cluster1[4:6,]
lm_Cluster1_total <- pie(lm_Cluster1[,"Total"], labels = round(lm_Cluster1[,"Total"]), main = "Cluster1 - Total student")
legend(1,1, c("Quant Method","Experiment","Theory"),
fill = c("white", "lightblue", "mistyrose"))
lm_Cluster1_BA <- pie(lm_Cluster1[,"BA student"], labels = round(lm_Cluster1[,"BA student"]), main = "Cluster1 - BA student")
legend(1,1, c("Quant Method","Experiment","Theory"),
fill = c("white", "lightblue", "mistyrose"))
lm_Cluster1_NoBA <- pie(lm_Cluster1[,"Non BA student"], labels = round(lm_Cluster1[,"Non BA student"]), main = "Cluster1 - Non BA student")
legend(1,1, c("Quant Method","Experiment","Theory"),
fill = c("white", "lightblue", "mistyrose"))
#Cluster2
lm_Cluster2 <- Cluster2[4:6,]
lm_Cluster2_total <- pie(lm_Cluster2[,"Total"], labels = round(lm_Cluster2[,"Total"]), main = "Cluster2 - Total student")
legend(1,1, c("Quant Method","Experiment","Theory"),
fill = c("white", "lightblue", "mistyrose"))
lm_Cluster2_BA <- pie(lm_Cluster2[,"BA student"], labels = round(lm_Cluster2[,"BA student"]), main = "Cluster2 - BA student")
legend(1,1, c("Quant Method","Experiment","Theory"),
fill = c("white", "lightblue", "mistyrose"))
lm_Cluster2_NoBA <- pie(lm_Cluster2[,"Non BA student"], labels = round(lm_Cluster2[,"Non BA student"]), main = "Cluster2 - Non BA student")
legend(1,1, c("Quant Method","Experiment","Theory"),
fill = c("white", "lightblue", "mistyrose"))
Looking at the total sample, students want experiments the most, followed by quantitative methods, while pure theory is the least preferred. BA students prefer theory even less than non-BA students, leaning slightly more towards experiments instead.
In general, there is not much difference between the totals of cluster 1, cluster 2 and the overall sample; the preferences are consistent. One noticeable difference is that non-BA students in cluster 2 prefer more theory and quantitative methods, and fewer experiments, than non-BA students in cluster 1.
Finally, I look at the preferred group size across clusters, and between BA and non-BA students within each cluster.
#Chart for ideal group size
#Total
gs_Total <- t(Total[7,])
gs_Total_chart <- barplot(as.matrix(gs_Total), main="Total", ylab= "Total", beside=TRUE, col=rainbow(3), ylim = c(0, 5))
legend(1,5, c("Total", "BA Student", "Non BA Student"), cex=0.6, bty="n", fill=rainbow(3));
#Cluster1
gs_Cluster1 <- t(Cluster1[7,])
gs_Cluster1_chart <- barplot(as.matrix(gs_Cluster1), main="Cluster1", ylab= "Cluster1", beside=TRUE, col=rainbow(3),ylim = c(0, 5))
legend(1,5, c("Total", "BA Student", "Non BA Student"), cex=0.6, bty="n", fill=rainbow(3));
#Cluster2
gs_Cluster2 <- t(Cluster2[7,])
gs_Cluster2_chart <- barplot(as.matrix(gs_Cluster2), main="Cluster2", ylab= "Cluster2",beside=TRUE, col=rainbow(3), ylim = c(0, 5))
legend(1,5, c("Total", "BA Student", "Non BA Student"), cex=0.6, bty="n", fill=rainbow(3));
Looking at group size, students prefer groups of 2 to 3 members, and the pattern is the same across all groups. Since the averages are closer to 2 than to 3, a group size of 2 seems the safest choice if only one option must be picked.
In my opinion, the data is fairly homogeneous across profiles and across clusters: the patterns observed in the total sample generally hold within each cluster. The largest difference between the two clusters is that cluster 2 has higher quality and a more positive attitude than cluster 1.
Thanks to this homogeneity, I believe that a teaching mix chosen by professor Tomasz Kopczewski along the lines indicated in the pie charts will not cause difficulty for any group of students.