This analysis uses hierarchical clustering, K-means clustering, and PCA to segment respondents based on their perceptions of Pearson vs. McGraw Hill. Because the survey responses are mostly categorical, they are converted into dummy variables before clustering.
The dataset used is: Group 2 Survey Cleaned.csv
mydata <- read.csv("Group 2 Survey Cleaned.csv", stringsAsFactors = FALSE)
str(mydata)
## 'data.frame': 28 obs. of 7 variables:
## $ Q1: chr "standing" "Senior" "Senior" "Senior" ...
## $ Q2: chr "electrontic_paper" "electronic" "electronic" "electronic" ...
## $ Q3: chr "platform" "McGraw Hill" "Pearson" "McGraw Hill" ...
## $ Q4: chr "preference" "McGraw Hill" "I don't have a preference" "Pearson" ...
## $ Q5: chr "other_platform" "N/A" "N/A" "n/a" ...
## $ Q6: chr "spending" "$50-$75" "less than $50" "less than $50" ...
## $ Q7: chr "expensive" "They are about the same" "I don't pay attention to textbook prices" "Pearson is more expensive" ...
summary(mydata)
## Q1 Q2 Q3 Q4
## Length:28 Length:28 Length:28 Length:28
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## Q5 Q6 Q7
## Length:28 Length:28 Length:28
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
head(mydata)
## Q1 Q2 Q3 Q4
## 1 standing electrontic_paper platform preference
## 2 Senior electronic McGraw Hill McGraw Hill
## 3 Senior electronic Pearson I don't have a preference
## 4 Senior electronic McGraw Hill Pearson
## 5 Senior electronic Pearson I don't have a preference
## 6 Senior electronic McGraw Hill McGraw Hill
## Q5 Q6 Q7
## 1 other_platform spending expensive
## 2 N/A $50-$75 They are about the same
## 3 N/A less than $50 I don't pay attention to textbook prices
## 4 n/a less than $50 Pearson is more expensive
## 5 $75-$100 They are about the same
## 6 N/A $50-$75 They are about the same
The first row contains labels such as “standing” and is not an actual respondent, so it is removed.
# Remove first row because it contains variable labels, not survey data
mydata <- mydata[-1, ]
# Trim whitespace in every column
mydata[] <- lapply(mydata, function(x) trimws(as.character(x)))
# Keep Q5 in the original data, but exclude it from clustering because it is open-ended text
cluster_vars <- mydata[, c("Q1", "Q2", "Q3", "Q4", "Q6", "Q7")]
# Convert clustering variables to factors
cluster_vars[] <- lapply(cluster_vars, as.factor)
# Convert every column of the original data to a factor for easier summaries
mydata[] <- lapply(mydata, as.factor)
str(cluster_vars)
## 'data.frame': 27 obs. of 6 variables:
## $ Q1: Factor w/ 4 levels "Grad student",..: 3 3 3 3 3 2 3 2 3 2 ...
## $ Q2: Factor w/ 3 levels "electronic","I have no preference",..: 1 1 1 1 1 1 1 3 3 1 ...
## $ Q3: Factor w/ 2 levels "McGraw Hill",..: 1 2 1 2 1 2 2 1 2 1 ...
## $ Q4: Factor w/ 4 levels "I don't have a preference",..: 3 1 4 1 3 1 1 3 2 1 ...
## $ Q6: Factor w/ 4 levels "$100 or more",..: 2 4 4 3 2 4 2 4 3 2 ...
## $ Q7: Factor w/ 4 levels "I don't pay attention to textbook prices",..: 4 1 3 4 4 2 4 1 4 4 ...
summary(cluster_vars)
## Q1 Q2 Q3
## Grad student: 2 electronic :17 McGraw Hill:12
## Junior : 6 I have no preference: 4 Pearson :15
## Senior :18 paper : 6
## Sophmore : 1
## Q4 Q6
## I don't have a preference :10 $100 or more : 2
## I prefer a different platform: 4 $50-$75 : 7
## McGraw Hill : 5 $75-$100 : 5
## Pearson : 8 less than $50:13
## Q7
## I don't pay attention to textbook prices: 4
## McGraw Hill is more expensive : 5
## Pearson is more expensive : 3
## They are about the same :15
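The cleaning step above trims whitespace, but the Q5 column still mixes several spellings of a missing answer ("N/A", "n/a", and blank strings are all visible in the head() output). Q5 is excluded from clustering, so this does not affect the results; the sketch below only shows one way the variants could be unified if Q5 were summarized later. The helper name normalize_missing and the toy values are assumptions for illustration, not part of the original analysis.

```r
# Hypothetical helper: treat blank strings and any casing of "N/A" as missing.
normalize_missing <- function(x) {
  x <- trimws(as.character(x))
  x[x == "" | toupper(x) %in% c("N/A", "NA")] <- NA_character_
  x
}

# Toy values mirroring the variants visible in Q5 (not the real responses)
q5_example <- c("N/A", "n/a", "", " N/A ", "some other platform")
q5_clean <- normalize_missing(q5_example)
print(q5_clean)
```

Applied to the survey, this would be something like mydata$Q5 <- normalize_missing(mydata$Q5), run before the columns are converted to factors.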
Because clustering and PCA require numeric inputs, the categorical survey variables are converted to dummy variables.
survey_matrix <- model.matrix(~ . - 1, data = cluster_vars)
head(survey_matrix)
## Q1Grad student Q1Junior Q1Senior Q1Sophmore Q2I have no preference Q2paper
## 2 0 0 1 0 0 0
## 3 0 0 1 0 0 0
## 4 0 0 1 0 0 0
## 5 0 0 1 0 0 0
## 6 0 0 1 0 0 0
## 7 0 1 0 0 0 0
## Q3Pearson Q4I prefer a different platform Q4McGraw Hill Q4Pearson Q6$50-$75
## 2 0 0 1 0 1
## 3 1 0 0 0 0
## 4 0 0 0 1 0
## 5 1 0 0 0 0
## 6 0 0 1 0 1
## 7 1 0 0 0 0
## Q6$75-$100 Q6less than $50 Q7McGraw Hill is more expensive
## 2 0 0 0
## 3 0 1 0
## 4 0 1 0
## 5 1 0 0
## 6 0 0 0
## 7 0 1 1
## Q7Pearson is more expensive Q7They are about the same
## 2 0 1
## 3 0 0
## 4 1 0
## 5 0 1
## 6 0 1
## 7 0 0
dim(survey_matrix)
## [1] 27 16
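The column layout of survey_matrix follows from how model.matrix() handles a formula without an intercept: the first factor keeps an indicator for every level, while each later factor drops its first (reference) level. A minimal sketch on toy data (the column names standing and format are made up for illustration):

```r
# Toy data frame (hypothetical values, not the survey data)
toy <- data.frame(
  standing = factor(c("Junior", "Senior", "Senior", "Junior")),
  format   = factor(c("electronic", "paper", "electronic", "paper"))
)

# Without an intercept, the first factor keeps every level;
# later factors drop their first (reference) level.
toy_matrix <- model.matrix(~ . - 1, data = toy)
print(colnames(toy_matrix))
```

This is why Q1 contributes four columns above while Q2 contributes only two: "electronic" is the reference level of Q2, so a respondent who prefers electronic textbooks has zeros in both Q2 columns. The 16 columns reported by dim() are 4 + 2 + 1 + 3 + 3 + 3.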
use <- scale(survey_matrix, center = TRUE, scale = TRUE)
head(use)
## Q1Grad student Q1Junior Q1Senior Q1Sophmore Q2I have no preference
## 2 -0.2775555 -0.5245305 0.6938887 -0.1924501 -0.4092332
## 3 -0.2775555 -0.5245305 0.6938887 -0.1924501 -0.4092332
## 4 -0.2775555 -0.5245305 0.6938887 -0.1924501 -0.4092332
## 5 -0.2775555 -0.5245305 0.6938887 -0.1924501 -0.4092332
## 6 -0.2775555 -0.5245305 0.6938887 -0.1924501 -0.4092332
## 7 -0.2775555 1.8358568 -1.3877773 -0.1924501 -0.4092332
## Q2paper Q3Pearson Q4I prefer a different platform Q4McGraw Hill
## 2 -0.5245305 -1.0971343 -0.4092332 2.0584064
## 3 -0.5245305 0.8777075 -0.4092332 -0.4678196
## 4 -0.5245305 -1.0971343 -0.4092332 -0.4678196
## 5 -0.5245305 0.8777075 -0.4092332 -0.4678196
## 6 -0.5245305 -1.0971343 -0.4092332 2.0584064
## 7 -0.5245305 0.8777075 -0.4092332 -0.4678196
## Q4Pearson Q6$50-$75 Q6$75-$100 Q6less than $50
## 2 -0.6367559 1.6587112 -0.4678196 -0.9456109
## 3 -0.6367559 -0.5805489 -0.4678196 1.0183502
## 4 1.5122953 -0.5805489 -0.4678196 1.0183502
## 5 -0.6367559 -0.5805489 2.0584064 -0.9456109
## 6 -0.6367559 1.6587112 -0.4678196 -0.9456109
## 7 -0.6367559 -0.5805489 -0.4678196 1.0183502
## Q7McGraw Hill is more expensive Q7Pearson is more expensive
## 2 -0.4678196 -0.3469443
## 3 -0.4678196 -0.3469443
## 4 -0.4678196 2.7755547
## 5 -0.4678196 -0.3469443
## 6 -0.4678196 -0.3469443
## 7 2.0584064 -0.3469443
## Q7They are about the same
## 2 0.8777075
## 3 -1.0971343
## 4 -1.0971343
## 5 0.8777075
## 6 0.8777075
## 7 -1.0971343
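Each column of use is a z-score: the dummy value minus the column mean, divided by the column's sample standard deviation. As a check, the sketch below rebuilds the Q3Pearson dummy from the counts in summary(cluster_vars) (15 Pearson, 12 McGraw Hill) and standardizes it by hand; the variable names are illustrative only.

```r
# Rebuild the Q3Pearson dummy from the counts in summary(cluster_vars):
# 15 respondents coded 1 (Pearson) and 12 coded 0 (McGraw Hill).
q3_pearson <- c(rep(1, 15), rep(0, 12))

# Manual z-score: subtract the mean, divide by the sample standard deviation
z_manual <- (q3_pearson - mean(q3_pearson)) / sd(q3_pearson)

# scale() returns a one-column matrix; as.numeric() drops the attributes
z_scale <- as.numeric(scale(q3_pearson))

print(round(unique(z_manual), 7))
print(all.equal(z_manual, z_scale))
```

The two distinct values should match the Q3Pearson column shown above (0.8777075 for Pearson, -1.0971343 for McGraw Hill). Note that rare categories get large positive z-scores, so a level chosen by only a few respondents can have a strong pull on the distances.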
dist_matrix <- dist(use)
seg.hclust <- hclust(dist_matrix, method = "complete")
plot(seg.hclust, main = "Hierarchical Clustering Dendrogram: Pearson vs McGraw Hill")
groups.3 <- cutree(seg.hclust, k = 3)
table(groups.3)
## groups.3
## 1 2 3
## 18 8 1
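With k = 3, one cluster contains a single respondent, so it may be worth checking how the sizes change with other values of k or another linkage rule. The following is a rough sketch on simulated data (not the survey responses) of how that comparison could be run; with the real data, toy_dist would be replaced by dist_matrix.

```r
set.seed(1)  # toy data only; not the survey responses
toy <- scale(matrix(rnorm(27 * 4), nrow = 27))
toy_dist <- dist(toy)

# Compare cluster sizes under two linkage rules for several values of k
for (m in c("complete", "average")) {
  hc <- hclust(toy_dist, method = m)
  for (k in 2:4) {
    sizes <- table(cutree(hc, k = k))
    cat(m, "k =", k, "sizes:", paste(sizes, collapse = " "), "\n")
  }
}

# Cophenetic correlation: how well the dendrogram heights preserve
# the original pairwise distances (closer to 1 is better)
hc_complete <- hclust(toy_dist, method = "complete")
coph_cor <- cor(as.vector(cophenetic(hc_complete)), as.vector(toy_dist))
print(coph_cor)
```

A singleton cluster is not necessarily wrong (it may be a genuine outlier), but it cannot be profiled as a segment, so the choice of k deserves a look at the dendrogram alongside these counts.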
length(groups.3)
## [1] 27
nrow(mydata)
## [1] 27
# Add cluster labels back to the data
mydata$Cluster <- groups.3
cluster_vars$Cluster <- groups.3
head(mydata)
## Q1 Q2 Q3 Q4 Q5 Q6
## 2 Senior electronic McGraw Hill McGraw Hill N/A $50-$75
## 3 Senior electronic Pearson I don't have a preference N/A less than $50
## 4 Senior electronic McGraw Hill Pearson n/a less than $50
## 5 Senior electronic Pearson I don't have a preference $75-$100
## 6 Senior electronic McGraw Hill McGraw Hill N/A $50-$75
## 7 Junior electronic Pearson I don't have a preference less than $50
## Q7 Cluster
## 2 They are about the same 1
## 3 I don't pay attention to textbook prices 1
## 4 Pearson is more expensive 1
## 5 They are about the same 1
## 6 They are about the same 1
## 7 McGraw Hill is more expensive 2
Because these variables are categorical, the mode is more useful than the mean or median.
# Return the most frequent value of x (the first value encountered wins ties)
get_mode <- function(x) {
  x <- na.omit(x)
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
library(dplyr)  # provides %>%, group_by(), summarise(), and across()
cluster_profiles <- cluster_vars %>%
  group_by(Cluster) %>%
  summarise(across(everything(), get_mode))
cluster_profiles
## # A tibble: 3 × 7
## Cluster Q1 Q2 Q3 Q4 Q6 Q7
## <int> <fct> <fct> <fct> <fct> <fct> <fct>
## 1 1 Senior electronic McGraw Hill I don't have a … less… They…
## 2 2 Junior electronic Pearson Pearson less… They…
## 3 3 Sophmore I have no preference McGraw Hill I don't have a … $75-… They…
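The mode reports only the single most common answer, so it hides ties and near-ties, and for cluster 3 (one respondent) the "profile" is simply that person's answers. A complementary view is the full distribution of answers within each cluster. A minimal sketch on toy data (the values below are invented for illustration):

```r
# Toy data (hypothetical): cluster labels and one categorical answer
toy <- data.frame(
  Cluster = c(1, 1, 1, 1, 2, 2, 2),
  Q4 = c("Pearson", "Pearson", "McGraw Hill", "McGraw Hill",
         "Pearson", "Pearson", "No preference")
)

# Counts of each answer within each cluster
counts <- table(toy$Cluster, toy$Q4)

# Row proportions: share of each answer within a cluster
shares <- round(prop.table(counts, margin = 1), 2)
print(counts)
print(shares)
```

In the toy cluster 1, Pearson and McGraw Hill are tied, which a mode-only table would not reveal. With the survey objects, the analogous call would be table(cluster_vars$Cluster, cluster_vars$Q4).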
cluster_counts <- as.data.frame(table(groups.3))
names(cluster_counts) <- c("Cluster", "Count")
cluster_counts
## Cluster Count
## 1 1 18
## 2 2 8
## 3 3 1
write.csv(mydata, "clustered_survey_data.csv", row.names = FALSE)
write.csv(cluster_profiles, "cluster_profiles.csv", row.names = FALSE)
write.csv(cluster_counts, "cluster_counts.csv", row.names = FALSE)
How many observations do we have in each cluster? Answer: table(groups.3) shows 18 respondents in cluster 1, 8 in cluster 2, and 1 in cluster 3.
Why is it important to look at the common features of the variables in each cluster? Answer: It helps identify the defining characteristics of each segment and makes the clusters interpretable.
Should mean or median be used when analyzing the differences among clusters? Why? Answer: Because these variables are categorical, neither mean nor median is ideal. The mode is the most appropriate summary.
What summary measures of each cluster are appropriate for building a targeting strategy? Answer: Cluster size, most common textbook format preference, most common platform preference, spending pattern, and perceptions of price.
What are the major differences between K-means clustering and hierarchical clustering? Which one do you prefer, and why? Answer: Hierarchical clustering shows how observations group together step by step in a dendrogram, while K-means places observations into a fixed number of clusters. Hierarchical clustering is useful for exploration, while K-means is useful for simpler partitioning.
Should we use mydata or mydata[, -1] with the aggregate() function? Why? Answer: In this case, aggregate() with mean or median is not ideal because the variables are categorical. It is better to summarize the original survey variables using the mode within each cluster.
set.seed(123)
fit <- kmeans(use, centers = 3, iter.max = 1000, nstart = 25)
table(fit$cluster)
##
## 1 2 3
## 4 15 8
barplot(table(fit$cluster), main = "Cluster Sizes (K-Means)")
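The hierarchical solution (18, 8, 1) and the K-means solution (4, 15, 8) have different sizes, so the two methods are not producing the same segments. One way to see where they agree is a cross-tabulation of the two sets of labels. The sketch below uses simulated data rather than the survey responses.

```r
set.seed(123)  # toy data only; not the survey responses
toy <- scale(matrix(rnorm(27 * 4), nrow = 27))

hc_groups <- cutree(hclust(dist(toy), method = "complete"), k = 3)
km_groups <- kmeans(toy, centers = 3, nstart = 25)$cluster

# Rows: hierarchical clusters; columns: K-means clusters.
# Cluster numbers are arbitrary labels, so look at how rows map to columns,
# not at whether the numbers match.
agreement <- table(Hierarchical = hc_groups, KMeans = km_groups)
print(agreement)
```

With the objects from this analysis, the equivalent call would be table(groups.3, fit$cluster). Rows concentrated in a single column indicate segments that both methods recover.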
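PCA is performed on the standardized dummy-coded matrix (use). Because the columns were already centered and scaled, prcomp() is called with scale. = FALSE. The points in the PCA plot are colored by the K-means cluster, not the hierarchical cluster.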
pca <- prcomp(use, scale. = FALSE)
pca_data <- as.data.frame(pca$x[, 1:2])
pca_data$Cluster <- factor(fit$cluster)
head(pca_data)
## PC1 PC2 Cluster
## 2 -1.1841219 -1.2656964 2
## 3 -0.7702527 0.6846993 2
## 4 -1.1401446 1.0662215 2
## 5 0.5633162 -1.7254825 1
## 6 -1.1841219 -1.2656964 2
## 7 0.2748862 2.3856181 3
library(ggplot2)
ggplot(pca_data, aes(x = PC1, y = PC2, fill = Cluster)) +
  geom_point(size = 3, color = "gray40", shape = 21) +
  theme_bw() +
  labs(title = "PCA Plot with Cluster Membership",
       fill = "Cluster")
library(ggfortify)  # supplies the autoplot() method for kmeans objects
autoplot(fit, data = as.data.frame(use), frame = TRUE, frame.type = "norm")
summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.6375 1.5590 1.4359 1.3190 1.2990 1.19644 1.02596
## Proportion of Variance 0.1676 0.1519 0.1289 0.1087 0.1055 0.08947 0.06579
## Cumulative Proportion 0.1676 0.3195 0.4484 0.5571 0.6625 0.75202 0.81781
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.89336 0.76411 0.68032 0.65331 0.52791 0.43256 0.35904
## Proportion of Variance 0.04988 0.03649 0.02893 0.02668 0.01742 0.01169 0.00806
## Cumulative Proportion 0.86769 0.90418 0.93311 0.95978 0.97720 0.98889 0.99695
## PC15 PC16
## Standard deviation 0.22084 1.659e-16
## Proportion of Variance 0.00305 0.000e+00
## Cumulative Proportion 1.00000 1.000e+00
pca$rotation
## PC1 PC2 PC3
## Q1Grad student 0.32568698 0.21643171 0.26785359
## Q1Junior 0.15468572 0.25124808 -0.38839146
## Q1Senior -0.42688606 -0.21713090 0.22986570
## Q1Sophmore 0.27340059 -0.31124236 -0.09022132
## Q2I have no preference 0.40819394 -0.14416354 -0.03423190
## Q2paper -0.24024579 -0.03564046 0.16264147
## Q3Pearson 0.16811392 0.25128341 0.21980906
## Q4I prefer a different platform -0.13691323 -0.23885536 0.31074349
## Q4McGraw Hill -0.25451698 -0.09808482 -0.30436436
## Q4Pearson 0.28461816 0.33421139 0.17793641
## Q6$50-$75 -0.04822318 0.03149312 -0.45330250
## Q6$75-$100 0.26303585 -0.44861156 0.08525572
## Q6less than $50 -0.07812663 0.32341833 0.36963181
## Q7McGraw Hill is more expensive -0.08257913 0.25962917 -0.19080484
## Q7Pearson is more expensive -0.20802329 0.05109041 0.15273436
## Q7They are about the same 0.26110607 -0.32494077 0.10253839
## PC4 PC5 PC6
## Q1Grad student -0.003052781 -0.28149456 -0.05624096
## Q1Junior 0.309611746 0.16364223 0.06977410
## Q1Senior -0.330954266 -0.01589885 -0.12812551
## Q1Sophmore 0.148766502 0.06980694 0.24421222
## Q2I have no preference 0.038309847 0.17114825 0.14291507
## Q2paper 0.384140424 0.13926048 -0.39021338
## Q3Pearson -0.339656693 0.26630006 -0.34002435
## Q4I prefer a different platform 0.078838228 0.39474990 0.04765889
## Q4McGraw Hill 0.216796620 -0.36278202 -0.17308132
## Q4Pearson -0.202586835 -0.05024727 -0.08047681
## Q6$50-$75 -0.469778177 -0.04992580 -0.04288601
## Q6$75-$100 0.034850472 0.25012919 -0.06752058
## Q6less than $50 0.365654968 -0.18822561 0.16417339
## Q7McGraw Hill is more expensive 0.039122867 0.48132091 -0.22617622
## Q7Pearson is more expensive -0.227014812 0.03941590 0.63965439
## Q7They are about the same -0.071674493 -0.38155307 -0.30860237
## PC7 PC8 PC9
## Q1Grad student 0.2856225762 -0.318456022 -0.251555691
## Q1Junior -0.4403357952 -0.200402553 0.133763272
## Q1Senior 0.1141566972 0.130152280 0.043316092
## Q1Sophmore 0.2883148200 0.557905127 -0.053744744
## Q2I have no preference 0.4731599960 -0.227142956 0.039861098
## Q2paper 0.1087207699 0.053807474 0.364061822
## Q3Pearson -0.1668493166 0.197919068 -0.242622752
## Q4I prefer a different platform 0.0408642514 -0.577147032 -0.061401236
## Q4McGraw Hill 0.2840951913 -0.128444684 0.140913514
## Q4Pearson 0.1067598835 0.061659855 0.724038215
## Q6$50-$75 0.1035415527 -0.227274000 0.022952745
## Q6$75-$100 -0.2699139423 0.001155188 0.204118362
## Q6less than $50 0.0123048740 0.075525673 -0.099490682
## Q7McGraw Hill is more expensive 0.4107327724 0.047508466 -0.001168334
## Q7Pearson is more expensive -0.0002037391 -0.071498153 0.322186312
## Q7They are about the same -0.1364652842 -0.135790518 0.131581309
## PC10 PC11 PC12
## Q1Grad student 0.096683090 -0.47097367 0.07706397
## Q1Junior 0.029282273 0.06324235 -0.03371134
## Q1Senior -0.191345530 0.22729655 -0.13573597
## Q1Sophmore 0.279090050 -0.05346509 0.30616104
## Q2I have no preference -0.083487180 0.29839052 -0.57369402
## Q2paper 0.568989617 -0.18351953 -0.28634297
## Q3Pearson 0.038936012 -0.21169880 -0.14842498
## Q4I prefer a different platform 0.071901622 0.09819773 0.32947830
## Q4McGraw Hill -0.365986953 -0.31512639 0.03406514
## Q4Pearson -0.160466665 0.11695309 0.14612002
## Q6$50-$75 0.458136069 0.12299727 0.10334636
## Q6$75-$100 -0.289069047 -0.28620239 0.06178057
## Q6less than $50 0.009792366 0.38673803 0.14892664
## Q7McGraw Hill is more expensive -0.174942870 0.02073642 0.42823439
## Q7Pearson is more expensive 0.145574360 -0.34146633 0.02021993
## Q7They are about the same 0.181237543 0.25664784 0.30996985
## PC13 PC14 PC15
## Q1Grad student 0.233417146 0.03097752 -0.12940233
## Q1Junior -0.107641663 -0.06804004 0.11949576
## Q1Senior 0.056378690 -0.03057717 -0.06218821
## Q1Sophmore -0.227458899 0.18315037 0.07162216
## Q2I have no preference -0.129456964 -0.20024651 0.03443613
## Q2paper 0.054601320 -0.04548636 -0.06006636
## Q3Pearson -0.571017899 -0.13615094 0.13979287
## Q4I prefer a different platform -0.264455268 0.33320913 0.14302046
## Q4McGraw Hill -0.520204600 -0.01361966 0.04603827
## Q4Pearson -0.046387492 0.35087252 0.01087446
## Q6$50-$75 -0.178637658 0.02873373 -0.48460352
## Q6$75-$100 0.009782356 -0.17126902 -0.58370846
## Q6less than $50 -0.315063304 -0.23578957 -0.46273850
## Q7McGraw Hill is more expensive 0.222888333 -0.41142298 0.02613787
## Q7Pearson is more expensive -0.096538206 -0.43388600 0.15442956
## Q7They are about the same -0.015482529 -0.47076584 0.31422199
## PC16
## Q1Grad student -3.706247e-01
## Q1Junior -5.883484e-01
## Q1Senior -6.671244e-01
## Q1Sophmore -2.672612e-01
## Q2I have no preference 9.869896e-19
## Q2paper 2.337685e-18
## Q3Pearson 6.092334e-17
## Q4I prefer a different platform -1.277235e-16
## Q4McGraw Hill -3.013942e-16
## Q4Pearson -1.421318e-17
## Q6$50-$75 -1.685719e-16
## Q6$75-$100 -7.952409e-17
## Q6less than $50 1.199781e-18
## Q7McGraw Hill is more expensive -2.401293e-17
## Q7Pearson is more expensive -1.020155e-16
## Q7They are about the same -1.034841e-17
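PC16 has a standard deviation of essentially zero (1.659e-16) and loads only on the four Q1 columns. A likely explanation is that Q1 is coded with a full set of indicators, so the four Q1 columns always sum to 1 and one direction in the data carries no variation. The sketch below reproduces the effect on a toy factor (the values are invented for illustration).

```r
# Toy factor with three levels (hypothetical values)
standing <- factor(c("Junior", "Senior", "Senior", "Grad", "Junior", "Senior"))

# Full set of indicator columns: every row sums to exactly 1
full_dummies <- model.matrix(~ standing - 1)
print(rowSums(full_dummies))

# After standardizing, one direction still has no variance, so the last
# principal component has a standard deviation of (numerically) zero
toy_pca <- prcomp(scale(full_dummies))
print(toy_pca$sdev)
```

The zero-variance component is harmless for the plots, which use only PC1 and PC2, but it means the 16 columns carry at most 15 dimensions of information.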
biplot(pca, scale = 0)
# The sign of a principal component is arbitrary; flip loadings and scores together
pca$rotation <- -pca$rotation
pca$x <- -pca$x
biplot(pca, scale = 0)
pca.var <- pca$sdev^2
pve <- pca.var / sum(pca.var)
plot(pve,
xlab = "Principal Component",
ylab = "Proportion of Variance Explained",
ylim = c(0, 1),
type = "b")
plot(cumsum(pve),
xlab = "Principal Component",
ylab = "Cumulative Proportion of Variance Explained",
ylim = c(0, 1),
type = "b")
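The cumulative curve can also be read numerically. The sketch below copies the cumulative proportions from the summary(pca) output and finds the first component at which a chosen share of variance is reached; the helper name components_needed and the 80% and 90% thresholds are assumptions for illustration.

```r
# Cumulative proportions copied from the summary(pca) output above
cum_pve <- c(0.1676, 0.3195, 0.4484, 0.5571, 0.6625, 0.75202, 0.81781,
             0.86769, 0.90418, 0.93311, 0.95978, 0.97720, 0.98889, 0.99695,
             1.00000, 1.00000)

# Hypothetical helper: first component at which the cumulative share
# of variance reaches a chosen threshold
components_needed <- function(cum, threshold) {
  which(cum >= threshold)[1]
}

print(components_needed(cum_pve, 0.80))  # 7
print(components_needed(cum_pve, 0.90))  # 9
```

The first two components together explain about 32% of the variance, so the two-dimensional PCA plots show only part of the structure in the responses. With the live objects, cumsum(pve) can be passed in place of cum_pve.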
write.csv(pca_data, "pca_data.csv", row.names = FALSE)
This cluster analysis groups students based on class standing, textbook format preference, platform used, platform preference, spending behavior, and price perception. Because the survey questions are categorical, the analysis converts them into dummy variables before clustering. The open-ended question (Q5) was excluded from clustering because it does not work well in this type of numeric distance-based analysis.
The results can help identify different types of textbook users, such as students who prefer Pearson, students who prefer McGraw Hill, and students who are more neutral. In this survey, 17 of 27 respondents prefer electronic textbooks, and 15 of 27 view Pearson and McGraw Hill as about the same in price. This suggests that differences among students may be driven more by platform familiarity and format preference than by price alone.
From a marketing perspective, these findings suggest that Pearson and McGraw Hill should not treat all students as one single segment. Instead, each company could tailor messages based on the preferences of each cluster. For example, students who prefer digital formats may respond well to convenience and access messaging, while more neutral students may be more influenced by ease of use or value.
Because the sample is small, the results should be interpreted as exploratory rather than definitive.
Think about at least one question you could answer using this result. Answer: One question is whether students with similar digital textbook preferences also share similar publisher preferences.
Interpret the PCA graphs according to the required reading. Answer: Respondents that appear close together on the PCA plot have similar response patterns, while respondents that appear farther apart differ more in their perceptions and preferences.