Cluster Analysis and PCA: Pearson vs McGraw Hill Survey

Objective

This analysis uses hierarchical clustering, K-means clustering, and principal component analysis (PCA) to segment respondents based on their perceptions of Pearson vs McGraw Hill. Because the survey responses are categorical, the data are converted into dummy variables before clustering.

Dataset

The dataset used is: Group 2 Survey Cleaned.csv

Importing data

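# Packages used later in the analysis
library(dplyr)      # %>%, group_by(), summarise(), across()
library(ggplot2)    # ggplot()
library(ggfortify)  # autoplot() method for kmeans objects

mydata <- read.csv("Group 2 Survey Cleaned.csv", stringsAsFactors = FALSE)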

str(mydata)

## 'data.frame':    28 obs. of  7 variables:
##  $ Q1: chr  "standing" "Senior" "Senior" "Senior" ...
##  $ Q2: chr  "electrontic_paper" "electronic" "electronic" "electronic" ...
##  $ Q3: chr  "platform" "McGraw Hill" "Pearson" "McGraw Hill" ...
##  $ Q4: chr  "preference" "McGraw Hill" "I don't have a preference" "Pearson" ...
##  $ Q5: chr  "other_platform" "N/A" "N/A" "n/a" ...
##  $ Q6: chr  "spending" "$50-$75" "less than $50" "less than $50" ...
##  $ Q7: chr  "expensive" "They are about the same" "I don't pay attention to textbook prices" "Pearson is more expensive" ...

summary(mydata)

##       Q1                 Q2                 Q3                 Q4           
##  Length:28          Length:28          Length:28          Length:28         
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##       Q5                 Q6                 Q7           
##  Length:28          Length:28          Length:28         
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character

head(mydata)

##         Q1                Q2          Q3                        Q4
## 1 standing electrontic_paper    platform                preference
## 2   Senior        electronic McGraw Hill               McGraw Hill
## 3   Senior        electronic     Pearson I don't have a preference
## 4   Senior        electronic McGraw Hill                   Pearson
## 5   Senior        electronic     Pearson I don't have a preference
## 6   Senior        electronic McGraw Hill               McGraw Hill
##               Q5            Q6                                       Q7
## 1 other_platform      spending                                expensive
## 2            N/A       $50-$75                  They are about the same
## 3            N/A less than $50 I don't pay attention to textbook prices
## 4            n/a less than $50                Pearson is more expensive
## 5                     $75-$100                  They are about the same
## 6            N/A       $50-$75                  They are about the same

Data Cleaning

The first data row contains short variable labels (for example, "standing" for Q1 and "spending" for Q6) rather than a respondent's answers, so it is removed. This leaves 27 respondents.

# Remove first row because it contains variable labels, not survey data
mydata <- mydata[-1, ]

# Trim whitespace in every column
mydata[] <- lapply(mydata, function(x) trimws(as.character(x)))

# Keep Q5 in the original data, but exclude it from clustering because it is open-ended text
cluster_vars <- mydata[, c("Q1", "Q2", "Q3", "Q4", "Q6", "Q7")]

# Convert clustering variables to factors
cluster_vars[] <- lapply(cluster_vars, as.factor)

# Convert all columns of the original data to factors for easier summaries
mydata[] <- lapply(mydata, as.factor)

str(cluster_vars)

## 'data.frame':    27 obs. of  6 variables:
##  $ Q1: Factor w/ 4 levels "Grad student",..: 3 3 3 3 3 2 3 2 3 2 ...
##  $ Q2: Factor w/ 3 levels "electronic","I have no preference",..: 1 1 1 1 1 1 1 3 3 1 ...
##  $ Q3: Factor w/ 2 levels "McGraw Hill",..: 1 2 1 2 1 2 2 1 2 1 ...
##  $ Q4: Factor w/ 4 levels "I don't have a preference",..: 3 1 4 1 3 1 1 3 2 1 ...
##  $ Q6: Factor w/ 4 levels "$100 or more",..: 2 4 4 3 2 4 2 4 3 2 ...
##  $ Q7: Factor w/ 4 levels "I don't pay attention to textbook prices",..: 4 1 3 4 4 2 4 1 4 4 ...

summary(cluster_vars)

##             Q1                        Q2               Q3    
##  Grad student: 2   electronic          :17   McGraw Hill:12  
##  Junior      : 6   I have no preference: 4   Pearson    :15  
##  Senior      :18   paper               : 6                   
##  Sophmore    : 1                                             
##                              Q4                 Q6    
##  I don't have a preference    :10   $100 or more : 2  
##  I prefer a different platform: 4   $50-$75      : 7  
##  McGraw Hill                  : 5   $75-$100     : 5  
##  Pearson                      : 8   less than $50:13  
##                                         Q7    
##  I don't pay attention to textbook prices: 4  
##  McGraw Hill is more expensive           : 5  
##  Pearson is more expensive               : 3  
##  They are about the same                 :15
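The preview of Q5 shows the same non-answer recorded in several ways ("N/A", "n/a", and an empty string). Q5 is excluded from clustering, but a small cleanup along the following lines could make any summary of it more consistent. This is only a sketch: the helper name normalize_missing, the placeholder list, and the example values are assumptions for illustration, not part of the original analysis.

```r
# Hypothetical helper: map common "no answer" placeholders to NA.
# The placeholder list is an assumption based on the values seen in head(mydata).
normalize_missing <- function(x, placeholders = c("", "n/a", "na", "none")) {
  x <- trimws(as.character(x))
  x[tolower(x) %in% placeholders] <- NA_character_
  x
}

# Made-up example values that mimic the variants seen in Q5
q5_example <- c("N/A", "n/a", "", "Some other platform ", "N/A")
q5_clean <- normalize_missing(q5_example)

q5_clean
table(q5_clean, useNA = "ifany")
```

Applied to the survey, the same call would be normalize_missing(mydata$Q5), run before the columns are converted to factors.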

Converting categorical survey responses to numeric form

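Because clustering and PCA require numeric inputs, the categorical survey variables are converted to 0/1 dummy variables. With the intercept removed (- 1), model.matrix() keeps every level of the first factor (Q1) but drops the first level of each remaining factor as a reference category, which gives 16 columns.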

survey_matrix <- model.matrix(~ . - 1, data = cluster_vars)

head(survey_matrix)

##   Q1Grad student Q1Junior Q1Senior Q1Sophmore Q2I have no preference Q2paper
## 2              0        0        1          0                      0       0
## 3              0        0        1          0                      0       0
## 4              0        0        1          0                      0       0
## 5              0        0        1          0                      0       0
## 6              0        0        1          0                      0       0
## 7              0        1        0          0                      0       0
##   Q3Pearson Q4I prefer a different platform Q4McGraw Hill Q4Pearson Q6$50-$75
## 2         0                               0             1         0         1
## 3         1                               0             0         0         0
## 4         0                               0             0         1         0
## 5         1                               0             0         0         0
## 6         0                               0             1         0         1
## 7         1                               0             0         0         0
##   Q6$75-$100 Q6less than $50 Q7McGraw Hill is more expensive
## 2          0               0                               0
## 3          0               1                               0
## 4          0               1                               0
## 5          1               0                               0
## 6          0               0                               0
## 7          0               1                               1
##   Q7Pearson is more expensive Q7They are about the same
## 2                           0                         1
## 3                           0                         0
## 4                           1                         0
## 5                           0                         1
## 6                           0                         1
## 7                           0                         0

dim(survey_matrix)

## [1] 27 16
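The dummy coding above treats Q1 differently from the other questions: Q1 keeps all four levels, while the first level of every other factor is dropped. The sketch below reproduces that behavior on a toy data frame and shows one way to request a full set of indicator columns for every factor through the contrasts.arg argument. The toy data and object names are assumptions for illustration only.

```r
# Toy data (made up) with two factors
toy <- data.frame(
  A = factor(c("x", "y", "z", "x")),
  B = factor(c("p", "q", "p", "q"))
)

# Default coding without an intercept: all levels of A, but B drops level "p"
m_default <- model.matrix(~ . - 1, data = toy)
colnames(m_default)

# Full indicator coding: ask for identity contrasts for every factor
m_full <- model.matrix(
  ~ . - 1,
  data = toy,
  contrasts.arg = lapply(toy, contrasts, contrasts = FALSE)
)
colnames(m_full)
```

Either coding can be used for distance-based clustering. The full coding treats every answer option symmetrically, at the cost of more redundant columns.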

Standardizing the data

# Center each dummy column to mean 0 and scale it to standard deviation 1
use <- scale(survey_matrix, center = TRUE, scale = TRUE)
head(use)

##   Q1Grad student   Q1Junior   Q1Senior Q1Sophmore Q2I have no preference
## 2     -0.2775555 -0.5245305  0.6938887 -0.1924501             -0.4092332
## 3     -0.2775555 -0.5245305  0.6938887 -0.1924501             -0.4092332
## 4     -0.2775555 -0.5245305  0.6938887 -0.1924501             -0.4092332
## 5     -0.2775555 -0.5245305  0.6938887 -0.1924501             -0.4092332
## 6     -0.2775555 -0.5245305  0.6938887 -0.1924501             -0.4092332
## 7     -0.2775555  1.8358568 -1.3877773 -0.1924501             -0.4092332
##      Q2paper  Q3Pearson Q4I prefer a different platform Q4McGraw Hill
## 2 -0.5245305 -1.0971343                      -0.4092332     2.0584064
## 3 -0.5245305  0.8777075                      -0.4092332    -0.4678196
## 4 -0.5245305 -1.0971343                      -0.4092332    -0.4678196
## 5 -0.5245305  0.8777075                      -0.4092332    -0.4678196
## 6 -0.5245305 -1.0971343                      -0.4092332     2.0584064
## 7 -0.5245305  0.8777075                      -0.4092332    -0.4678196
##    Q4Pearson  Q6$50-$75 Q6$75-$100 Q6less than $50
## 2 -0.6367559  1.6587112 -0.4678196      -0.9456109
## 3 -0.6367559 -0.5805489 -0.4678196       1.0183502
## 4  1.5122953 -0.5805489 -0.4678196       1.0183502
## 5 -0.6367559 -0.5805489  2.0584064      -0.9456109
## 6 -0.6367559  1.6587112 -0.4678196      -0.9456109
## 7 -0.6367559 -0.5805489 -0.4678196       1.0183502
##   Q7McGraw Hill is more expensive Q7Pearson is more expensive
## 2                      -0.4678196                  -0.3469443
## 3                      -0.4678196                  -0.3469443
## 4                      -0.4678196                   2.7755547
## 5                      -0.4678196                  -0.3469443
## 6                      -0.4678196                  -0.3469443
## 7                       2.0584064                  -0.3469443
##   Q7They are about the same
## 2                 0.8777075
## 3                -1.0971343
## 4                -1.0971343
## 5                 0.8777075
## 6                 0.8777075
## 7                -1.0971343
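Standardizing a 0/1 column gives rare answers much larger z-scores than common ones, which matters for distance-based clustering. The sketch below rebuilds two columns from their counts alone: Q1Sophmore (1 of 27 respondents) and Q3Pearson (15 of 27). The helper name z_for_dummy is an assumption for illustration.

```r
# Hypothetical helper: z-scores taken by a 0/1 column with n_ones ones
# out of n_total rows, using the same scale() call as the analysis
z_for_dummy <- function(n_ones, n_total) {
  x <- c(rep(1, n_ones), rep(0, n_total - n_ones))
  z <- as.numeric(scale(x, center = TRUE, scale = TRUE))
  c(z_one = z[1], z_zero = z[n_total])
}

z_sophomore <- z_for_dummy(1, 27)   # rare answer
z_pearson   <- z_for_dummy(15, 27)  # common answer

round(z_sophomore, 4)
round(z_pearson, 4)
```

The z_zero values match the printed output above (-0.1924501 for Q1Sophmore, -1.0971343 for Q3Pearson). The single respondent who answered "Sophmore" sits about five standard deviations from everyone else on that column, which helps explain why a one-person cluster can appear.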

Building the distance matrix and plotting the dendrogram

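# Euclidean distance (the dist() default) between standardized respondents
dist_matrix <- dist(use)
# Complete linkage merges clusters based on their farthest pair of members
seg.hclust <- hclust(dist_matrix, method = "complete")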

plot(seg.hclust, main = "Hierarchical Clustering Dendrogram: Pearson vs McGraw Hill")
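The analysis above uses Euclidean distance on standardized dummy variables. A commonly used alternative for categorical data is Gower dissimilarity, which for factor columns is the share of questions on which two respondents gave different answers. The sketch below is not the method used in this report; it assumes the cluster package is installed, and the toy data are made up.

```r
library(cluster)  # assumption: the cluster package is available

# Toy categorical data (made up) with the same kind of answers as the survey
toy <- data.frame(
  standing = factor(c("Senior", "Senior", "Junior", "Junior", "Senior", "Grad student")),
  format   = factor(c("electronic", "electronic", "paper", "paper", "electronic", "paper"))
)

# Gower dissimilarity works directly on factors, so no dummy coding is needed
gower_dist <- daisy(toy, metric = "gower")

# The result can be passed to hclust() in the same way as dist() output
hc_gower <- hclust(gower_dist, method = "complete")
groups_gower <- cutree(hc_gower, k = 2)
groups_gower
```

Swapping daisy() in for dist(use) would leave the rest of the hierarchical workflow unchanged, although the resulting clusters may differ.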

Identifying cluster memberships

groups.3 <- cutree(seg.hclust, k = 3)

table(groups.3)

## groups.3
##  1  2  3 
## 18  8  1

length(groups.3)

## [1] 27

nrow(mydata)

## [1] 27

# Add cluster labels back to the data
mydata$Cluster <- groups.3
cluster_vars$Cluster <- groups.3

head(mydata)

##       Q1         Q2          Q3                        Q4  Q5            Q6
## 2 Senior electronic McGraw Hill               McGraw Hill N/A       $50-$75
## 3 Senior electronic     Pearson I don't have a preference N/A less than $50
## 4 Senior electronic McGraw Hill                   Pearson n/a less than $50
## 5 Senior electronic     Pearson I don't have a preference          $75-$100
## 6 Senior electronic McGraw Hill               McGraw Hill N/A       $50-$75
## 7 Junior electronic     Pearson I don't have a preference     less than $50
##                                         Q7 Cluster
## 2                  They are about the same       1
## 3 I don't pay attention to textbook prices       1
## 4                Pearson is more expensive       1
## 5                  They are about the same       1
## 6                  They are about the same       1
## 7            McGraw Hill is more expensive       2

Identifying common features of each cluster

Because these variables are categorical, the mode is more useful than the mean or median.

# Return the most frequent value of x (the first value encountered wins ties)
get_mode <- function(x) {
  x <- na.omit(x)
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

cluster_profiles <- cluster_vars %>%
  group_by(Cluster) %>%
  summarise(across(everything(), get_mode))

cluster_profiles

## # A tibble: 3 × 7
##   Cluster Q1       Q2                   Q3          Q4               Q6    Q7   
##     <int> <fct>    <fct>                <fct>       <fct>            <fct> <fct>
## 1       1 Senior   electronic           McGraw Hill I don't have a … less… They…
## 2       2 Junior   electronic           Pearson     Pearson          less… They…
## 3       3 Sophmore I have no preference McGraw Hill I don't have a … $75-… They…
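The mode reports only the single most common answer, so it can hide how mixed a cluster is. A within-cluster share table is one way to see the full distribution. The sketch below uses a made-up toy data frame; with the real data the same idea would be table(cluster_vars$Cluster, cluster_vars$Q4).

```r
# Toy data (made up): cluster labels and one categorical answer
toy <- data.frame(
  Cluster = c(1, 1, 1, 1, 2, 2, 2),
  Q4 = c("Pearson", "McGraw Hill", "I don't have a preference",
         "I don't have a preference", "Pearson", "Pearson", "McGraw Hill")
)

# Counts of each answer within each cluster
counts <- table(toy$Cluster, toy$Q4, dnn = c("Cluster", "Q4"))

# Row-wise proportions: each row sums to 1
shares <- prop.table(counts, margin = 1)
round(shares, 2)
```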

Cluster counts

cluster_counts <- as.data.frame(table(groups.3))
names(cluster_counts) <- c("Cluster", "Count")
cluster_counts

##   Cluster Count
## 1       1    18
## 2       2     8
## 3       3     1

Exporting cluster analysis results

write.csv(mydata, "clustered_survey_data.csv", row.names = FALSE)
write.csv(cluster_profiles, "cluster_profiles.csv", row.names = FALSE)
write.csv(cluster_counts, "cluster_counts.csv", row.names = FALSE)

Discussion Questions

How many observations do we have in each cluster? Answer: table(groups.3) shows 18 respondents in cluster 1, 8 in cluster 2, and 1 in cluster 3.
Why is it important to look at the common features of the variables in each cluster? Answer: It helps identify the defining characteristics of each segment and makes the clusters interpretable.
Should mean or median be used when analyzing the differences among clusters? Why? Answer: Because these variables are categorical, neither mean nor median is ideal. The mode is the most appropriate summary.
What summary measures of each cluster are appropriate for building a targeting strategy? Answer: Cluster size, most common textbook format preference, most common platform preference, spending pattern, and perceptions of price.
What are the major differences between K-means clustering and hierarchical clustering? Which one do you prefer, and why? Answer: Hierarchical clustering merges observations step by step and displays the result as a dendrogram, so the number of clusters can be chosen after looking at the tree. K-means assigns observations to a number of clusters that must be fixed in advance, and its result depends on random starting centers. For a small exploratory sample such as this one, hierarchical clustering is the more natural fit, while K-means is useful for simpler partitioning.

Advanced Question

Should we use mydata or mydata[, -1] with the aggregate() function? Why? Answer: In this case, aggregate() with mean or median is not ideal because the variables are categorical. It is better to summarize the original survey variables using the mode within each cluster.
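If aggregate() is still wanted, it can be combined with a mode function instead of mean or median. The following is a minimal sketch on made-up character data. It repeats the get_mode() definition so that it runs on its own, and it passes only the answer columns to aggregate() so that the grouping column is not summarized along with them.

```r
# Same mode helper as in the analysis
get_mode <- function(x) {
  x <- na.omit(x)
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Toy data (made up), stored as character columns
toy <- data.frame(
  Cluster = c(1, 1, 1, 2, 2),
  Q2 = c("electronic", "electronic", "paper", "paper", "paper"),
  Q3 = c("Pearson", "McGraw Hill", "McGraw Hill", "Pearson", "Pearson"),
  stringsAsFactors = FALSE
)

# Summarize only the answer columns, grouped by cluster
mode_by_cluster <- aggregate(
  toy[, c("Q2", "Q3")],
  by = list(Cluster = toy$Cluster),
  FUN = get_mode
)
mode_by_cluster
```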

K-Means Clustering

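# kmeans() starts from random centers, so fix the seed for reproducibility
set.seed(123)

# Three clusters with 25 random starts; the best of the 25 solutions is kept
fit <- kmeans(use, centers = 3, iter.max = 1000, nstart = 25)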

table(fit$cluster)

## 
##  1  2  3 
##  4 15  8

barplot(table(fit$cluster), main = "Cluster Sizes (K-Means)")
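The K-means cluster sizes (4, 15, 8) differ from the hierarchical ones (18, 8, 1), and cluster numbers are arbitrary labels in both methods, so the two solutions are easiest to compare with a cross-tabulation. The sketch below shows the idea on a small made-up matrix with two obvious groups; with the survey data the equivalent call would be table(groups.3, fit$cluster).

```r
# Toy numeric data (made up): two well-separated groups of three points
x <- rbind(
  c(0.0, 0.0), c(0.1, 0.2), c(0.2, 0.1),
  c(10.0, 10.0), c(10.1, 10.2), c(10.2, 10.1)
)

# Hierarchical clustering, cut into two groups
hc_groups <- cutree(hclust(dist(x), method = "complete"), k = 2)

# K-means with two centers
set.seed(123)
km_groups <- kmeans(x, centers = 2, nstart = 10)$cluster

# Rows: hierarchical clusters; columns: K-means clusters
agreement <- table(Hierarchical = hc_groups, KMeans = km_groups)
agreement
```

When the two methods agree, each row has a single non-zero cell. Counts spread across a row show where the methods split respondents differently.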

Principal Component Analysis (PCA)

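PCA is performed on the standardized dummy-coded matrix (use), so no further scaling is applied inside prcomp(). The scores on the first two components are then labeled with the K-means cluster assignments.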

pca <- prcomp(use, scale. = FALSE)

pca_data <- as.data.frame(pca$x[, 1:2])
pca_data$Cluster <- factor(fit$cluster)

head(pca_data)

##          PC1        PC2 Cluster
## 2 -1.1841219 -1.2656964       2
## 3 -0.7702527  0.6846993       2
## 4 -1.1401446  1.0662215       2
## 5  0.5633162 -1.7254825       1
## 6 -1.1841219 -1.2656964       2
## 7  0.2748862  2.3856181       3

PCA Plot

# Points are filled by K-means cluster (fit$cluster), not by the hierarchical clusters
ggplot(pca_data, aes(x = PC1, y = PC2, fill = Cluster)) +
  geom_point(size = 3, color = "gray40", shape = 21) +
  theme_bw() +
  labs(title = "PCA Plot with K-Means Cluster Membership",
       fill = "Cluster")

K-Means Plot

autoplot(fit, data = as.data.frame(use), frame = TRUE, frame.type = "norm")

PCA Outputs

summary(pca)

## Importance of components:
##                           PC1    PC2    PC3    PC4    PC5     PC6     PC7
## Standard deviation     1.6375 1.5590 1.4359 1.3190 1.2990 1.19644 1.02596
## Proportion of Variance 0.1676 0.1519 0.1289 0.1087 0.1055 0.08947 0.06579
## Cumulative Proportion  0.1676 0.3195 0.4484 0.5571 0.6625 0.75202 0.81781
##                            PC8     PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     0.89336 0.76411 0.68032 0.65331 0.52791 0.43256 0.35904
## Proportion of Variance 0.04988 0.03649 0.02893 0.02668 0.01742 0.01169 0.00806
## Cumulative Proportion  0.86769 0.90418 0.93311 0.95978 0.97720 0.98889 0.99695
##                           PC15      PC16
## Standard deviation     0.22084 1.659e-16
## Proportion of Variance 0.00305 0.000e+00
## Cumulative Proportion  1.00000 1.000e+00

pca$rotation

##                                         PC1         PC2         PC3
## Q1Grad student                   0.32568698  0.21643171  0.26785359
## Q1Junior                         0.15468572  0.25124808 -0.38839146
## Q1Senior                        -0.42688606 -0.21713090  0.22986570
## Q1Sophmore                       0.27340059 -0.31124236 -0.09022132
## Q2I have no preference           0.40819394 -0.14416354 -0.03423190
## Q2paper                         -0.24024579 -0.03564046  0.16264147
## Q3Pearson                        0.16811392  0.25128341  0.21980906
## Q4I prefer a different platform -0.13691323 -0.23885536  0.31074349
## Q4McGraw Hill                   -0.25451698 -0.09808482 -0.30436436
## Q4Pearson                        0.28461816  0.33421139  0.17793641
## Q6$50-$75                       -0.04822318  0.03149312 -0.45330250
## Q6$75-$100                       0.26303585 -0.44861156  0.08525572
## Q6less than $50                 -0.07812663  0.32341833  0.36963181
## Q7McGraw Hill is more expensive -0.08257913  0.25962917 -0.19080484
## Q7Pearson is more expensive     -0.20802329  0.05109041  0.15273436
## Q7They are about the same        0.26110607 -0.32494077  0.10253839
##                                          PC4         PC5         PC6
## Q1Grad student                  -0.003052781 -0.28149456 -0.05624096
## Q1Junior                         0.309611746  0.16364223  0.06977410
## Q1Senior                        -0.330954266 -0.01589885 -0.12812551
## Q1Sophmore                       0.148766502  0.06980694  0.24421222
## Q2I have no preference           0.038309847  0.17114825  0.14291507
## Q2paper                          0.384140424  0.13926048 -0.39021338
## Q3Pearson                       -0.339656693  0.26630006 -0.34002435
## Q4I prefer a different platform  0.078838228  0.39474990  0.04765889
## Q4McGraw Hill                    0.216796620 -0.36278202 -0.17308132
## Q4Pearson                       -0.202586835 -0.05024727 -0.08047681
## Q6$50-$75                       -0.469778177 -0.04992580 -0.04288601
## Q6$75-$100                       0.034850472  0.25012919 -0.06752058
## Q6less than $50                  0.365654968 -0.18822561  0.16417339
## Q7McGraw Hill is more expensive  0.039122867  0.48132091 -0.22617622
## Q7Pearson is more expensive     -0.227014812  0.03941590  0.63965439
## Q7They are about the same       -0.071674493 -0.38155307 -0.30860237
##                                           PC7          PC8          PC9
## Q1Grad student                   0.2856225762 -0.318456022 -0.251555691
## Q1Junior                        -0.4403357952 -0.200402553  0.133763272
## Q1Senior                         0.1141566972  0.130152280  0.043316092
## Q1Sophmore                       0.2883148200  0.557905127 -0.053744744
## Q2I have no preference           0.4731599960 -0.227142956  0.039861098
## Q2paper                          0.1087207699  0.053807474  0.364061822
## Q3Pearson                       -0.1668493166  0.197919068 -0.242622752
## Q4I prefer a different platform  0.0408642514 -0.577147032 -0.061401236
## Q4McGraw Hill                    0.2840951913 -0.128444684  0.140913514
## Q4Pearson                        0.1067598835  0.061659855  0.724038215
## Q6$50-$75                        0.1035415527 -0.227274000  0.022952745
## Q6$75-$100                      -0.2699139423  0.001155188  0.204118362
## Q6less than $50                  0.0123048740  0.075525673 -0.099490682
## Q7McGraw Hill is more expensive  0.4107327724  0.047508466 -0.001168334
## Q7Pearson is more expensive     -0.0002037391 -0.071498153  0.322186312
## Q7They are about the same       -0.1364652842 -0.135790518  0.131581309
##                                         PC10        PC11        PC12
## Q1Grad student                   0.096683090 -0.47097367  0.07706397
## Q1Junior                         0.029282273  0.06324235 -0.03371134
## Q1Senior                        -0.191345530  0.22729655 -0.13573597
## Q1Sophmore                       0.279090050 -0.05346509  0.30616104
## Q2I have no preference          -0.083487180  0.29839052 -0.57369402
## Q2paper                          0.568989617 -0.18351953 -0.28634297
## Q3Pearson                        0.038936012 -0.21169880 -0.14842498
## Q4I prefer a different platform  0.071901622  0.09819773  0.32947830
## Q4McGraw Hill                   -0.365986953 -0.31512639  0.03406514
## Q4Pearson                       -0.160466665  0.11695309  0.14612002
## Q6$50-$75                        0.458136069  0.12299727  0.10334636
## Q6$75-$100                      -0.289069047 -0.28620239  0.06178057
## Q6less than $50                  0.009792366  0.38673803  0.14892664
## Q7McGraw Hill is more expensive -0.174942870  0.02073642  0.42823439
## Q7Pearson is more expensive      0.145574360 -0.34146633  0.02021993
## Q7They are about the same        0.181237543  0.25664784  0.30996985
##                                         PC13        PC14        PC15
## Q1Grad student                   0.233417146  0.03097752 -0.12940233
## Q1Junior                        -0.107641663 -0.06804004  0.11949576
## Q1Senior                         0.056378690 -0.03057717 -0.06218821
## Q1Sophmore                      -0.227458899  0.18315037  0.07162216
## Q2I have no preference          -0.129456964 -0.20024651  0.03443613
## Q2paper                          0.054601320 -0.04548636 -0.06006636
## Q3Pearson                       -0.571017899 -0.13615094  0.13979287
## Q4I prefer a different platform -0.264455268  0.33320913  0.14302046
## Q4McGraw Hill                   -0.520204600 -0.01361966  0.04603827
## Q4Pearson                       -0.046387492  0.35087252  0.01087446
## Q6$50-$75                       -0.178637658  0.02873373 -0.48460352
## Q6$75-$100                       0.009782356 -0.17126902 -0.58370846
## Q6less than $50                 -0.315063304 -0.23578957 -0.46273850
## Q7McGraw Hill is more expensive  0.222888333 -0.41142298  0.02613787
## Q7Pearson is more expensive     -0.096538206 -0.43388600  0.15442956
## Q7They are about the same       -0.015482529 -0.47076584  0.31422199
##                                          PC16
## Q1Grad student                  -3.706247e-01
## Q1Junior                        -5.883484e-01
## Q1Senior                        -6.671244e-01
## Q1Sophmore                      -2.672612e-01
## Q2I have no preference           9.869896e-19
## Q2paper                          2.337685e-18
## Q3Pearson                        6.092334e-17
## Q4I prefer a different platform -1.277235e-16
## Q4McGraw Hill                   -3.013942e-16
## Q4Pearson                       -1.421318e-17
## Q6$50-$75                       -1.685719e-16
## Q6$75-$100                      -7.952409e-17
## Q6less than $50                  1.199781e-18
## Q7McGraw Hill is more expensive -2.401293e-17
## Q7Pearson is more expensive     -1.020155e-16
## Q7They are about the same       -1.034841e-17

biplot(pca, scale = 0)
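The full rotation matrix is hard to read at a glance. One way to interpret a component is to list the variables with the largest absolute loadings. The sketch below uses five of the PC1 loadings printed above; the helper name top_loadings is an assumption for illustration. With the fitted object, the input would be pca$rotation[, "PC1"].

```r
# A subset of the PC1 loadings copied from the pca$rotation output above
pc1 <- c(
  "Q1Grad student"         =  0.32568698,
  "Q1Senior"               = -0.42688606,
  "Q2I have no preference" =  0.40819394,
  "Q6$50-$75"              = -0.04822318,
  "Q4Pearson"              =  0.28461816
)

# Hypothetical helper: the n loadings with the largest absolute value
top_loadings <- function(loadings, n = 3) {
  loadings[order(abs(loadings), decreasing = TRUE)][seq_len(n)]
}

top_pc1 <- top_loadings(pc1)
top_pc1
```

In the full PC1 column these same three variables (Q1Senior, Q2I have no preference, Q1Grad student) have the largest absolute loadings, with Q1Senior carrying the opposite sign to the other two.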

Reverse PCA Signs for Alternative View

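# The sign of each principal component is arbitrary, so flipping both the
# loadings and the scores gives an equivalent solution mirrored about the origin
pca$rotation <- -pca$rotation
pca$x <- -pca$x
biplot(pca, scale = 0)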
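Flipping the signs changes only the orientation of the plot. The following sketch checks this on R's built-in USArrests data (used here only because it is always available, not because it relates to the survey): the data rebuilt from scores and loadings, and the variance of each component, are the same before and after the flip.

```r
# PCA on a built-in dataset, standardized inside prcomp()
p <- prcomp(USArrests, scale. = TRUE)

# Rebuild the standardized data from scores and loadings, before and after flipping
recon_original <- p$x %*% t(p$rotation)
recon_flipped  <- (-p$x) %*% t(-p$rotation)
same_reconstruction <- isTRUE(all.equal(recon_original, recon_flipped))

# Component variances are unaffected by the sign of the scores
same_variance <- isTRUE(all.equal(p$sdev^2, unname(apply(-p$x, 2, var))))

c(same_reconstruction = same_reconstruction, same_variance = same_variance)
```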

Variance Explained

pca.var <- pca$sdev^2
pve <- pca.var / sum(pca.var)

plot(pve,
     xlab = "Principal Component",
     ylab = "Proportion of Variance Explained",
     ylim = c(0, 1),
     type = "b")

plot(cumsum(pve),
     xlab = "Principal Component",
     ylab = "Cumulative Proportion of Variance Explained",
     ylim = c(0, 1),
     type = "b")
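The cumulative curve can also be read off numerically. The sketch below re-enters the standard deviations printed by summary(pca) and finds the first component at which the cumulative proportion of variance reaches a chosen threshold. The 80% threshold is an assumption for illustration, not a rule taken from the source.

```r
# Standard deviations copied from the summary(pca) output above
sdev <- c(1.6375, 1.5590, 1.4359, 1.3190, 1.2990, 1.19644, 1.02596, 0.89336,
          0.76411, 0.68032, 0.65331, 0.52791, 0.43256, 0.35904, 0.22084, 1.659e-16)

# Proportion of variance explained by each component, and the running total
pve_printed <- sdev^2 / sum(sdev^2)
cum_pve <- cumsum(pve_printed)

# Assumed threshold for illustration
threshold <- 0.80
n_needed <- which(cum_pve >= threshold)[1]

round(cum_pve[1:8], 4)
n_needed
```

With these values seven components are needed to pass 80%, which matches the printed cumulative proportions (0.75202 at PC6 and 0.81781 at PC7).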

Export PCA Results

write.csv(pca_data, "pca_data.csv", row.names = FALSE)

Brief Analysis

This cluster analysis groups students based on class standing (Q1), textbook format preference (Q2), platform used (Q3), platform preference (Q4), spending behavior (Q6), and price perception (Q7). Because the survey questions are categorical, the analysis converts them into dummy variables before clustering. The open-ended question (Q5) was excluded from clustering because free text does not work well in a numeric, distance-based analysis.

The results can help identify different types of textbook users, such as students who prefer Pearson, students who prefer McGraw Hill, and students who are more neutral. In this survey, electronic textbooks are the most common format preference (17 of 27 respondents), and 15 of 27 respondents view Pearson and McGraw Hill as about the same price. This suggests that differences among students may be driven more by platform familiarity and format preference than by price alone.

From a marketing perspective, these findings suggest that Pearson and McGraw Hill should not treat all students as one single segment. Instead, each company could tailor messages based on the preferences of each cluster. For example, students who prefer digital formats may respond well to convenience and access messaging, while more neutral students may be more influenced by ease of use or value.

Because the sample is small (27 respondents) and one hierarchical cluster contains a single respondent, the results should be interpreted as exploratory rather than definitive.

PCA Discussion Questions

Think about at least one question you could answer using this result. Answer: One question is whether students with similar digital textbook preferences also share similar publisher preferences.
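Interpret the PCA graphs according to the required reading. Answer: Respondents that appear close together on the PCA plot have similar response patterns, while respondents that appear farther apart differ more in their perceptions and preferences. Because the first two components explain only about 32% of the total variance, the two-dimensional plot gives a partial view of those differences.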

References

James, G., Witten, D., Hastie, T., & Tibshirani, R. An Introduction to Statistical Learning with Applications in R.
Penn State STAT 505 PCA notes.