Principal Component Anlysis (PCA)- This data set hosts survey responses from 1950 individuals rating 9 adjectives on a scale from 1 to 7. The survey aims to describe sentiment for 4 different brands.
We’ll explore the data and then run a PCA to see if anything stands out about the data.
Load Packages
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
## ✓ tibble 3.0.4 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(janitor)
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(corrplot)
## corrplot 0.84 loaded
library(gplots)
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
library(RColorBrewer)
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
library(skimr)
library(lattice)
Read in data
prst.df <- read.csv("~/Desktop/Github/PCA/prst.csv") %>%
clean_names()
str(prst.df)
## 'data.frame': 1950 obs. of 10 variables:
## $ adaptable : int 5 3 5 4 5 4 5 5 6 6 ...
## $ best_value : int 5 4 4 2 4 4 6 6 5 4 ...
## $ cutting_edge: int 3 2 3 2 4 5 4 4 4 5 ...
## $ delightful : int 3 3 3 4 4 5 5 5 3 4 ...
## $ exciting : int 4 3 5 2 4 5 4 5 4 5 ...
## $ friendly : int 6 3 3 5 3 4 5 4 7 5 ...
## $ generous : int 4 4 4 4 4 5 5 2 3 5 ...
## $ helpful : int 5 4 3 3 3 4 4 2 5 4 ...
## $ intuitive : int 5 4 3 5 2 3 3 3 6 5 ...
## $ brand : chr "Romeo" "Romeo" "Sierra" "Papa" ...
head(prst.df)
## adaptable best_value cutting_edge delightful exciting friendly generous
## 1 5 5 3 3 4 6 4
## 2 3 4 2 3 3 3 4
## 3 5 4 3 3 5 3 4
## 4 4 2 2 4 2 5 4
## 5 5 4 4 4 4 3 4
## 6 4 4 5 5 5 4 5
## helpful intuitive brand
## 1 5 5 Romeo
## 2 4 4 Romeo
## 3 3 3 Sierra
## 4 3 5 Papa
## 5 3 2 Sierra
## 6 4 3 Sierra
describe(prst.df)
## vars n mean sd median trimmed mad min max range skew
## adaptable 1 1950 4.29 0.96 4 4.27 1.48 1 7 6 -0.01
## best_value 2 1950 3.87 1.04 4 3.89 1.48 1 7 6 -0.02
## cutting_edge 3 1950 4.05 1.03 4 4.05 1.48 1 7 6 -0.03
## delightful 4 1950 3.90 0.99 4 3.92 1.48 1 7 6 -0.03
## exciting 5 1950 3.87 1.02 4 3.89 1.48 1 7 6 -0.03
## friendly 6 1950 4.24 1.00 4 4.20 1.48 1 7 6 0.14
## generous 7 1950 4.02 0.94 4 4.02 1.48 1 7 6 0.02
## helpful 8 1950 3.87 1.05 4 3.89 1.48 1 7 6 -0.01
## intuitive 9 1950 3.68 0.94 4 3.70 1.48 1 7 6 -0.10
## brand* 10 1950 2.43 0.87 2 2.42 1.48 1 4 3 0.03
## kurtosis se
## adaptable 0.06 0.02
## best_value 0.12 0.02
## cutting_edge -0.10 0.02
## delightful 0.05 0.02
## exciting -0.04 0.02
## friendly -0.08 0.02
## generous -0.03 0.02
## helpful -0.20 0.02
## intuitive 0.09 0.02
## brand* -0.70 0.02
skim(prst.df)
Data summary
| Name |
prst.df |
| Number of rows |
1950 |
| Number of columns |
10 |
| _______________________ |
|
| Column type frequency: |
|
| character |
1 |
| numeric |
9 |
| ________________________ |
|
| Group variables |
None |
Variable type: character
Variable type: numeric
| adaptable |
0 |
1 |
4.29 |
0.96 |
1 |
4 |
4 |
5 |
7 |
▁▃▇▆▂ |
| best_value |
0 |
1 |
3.87 |
1.04 |
1 |
3 |
4 |
5 |
7 |
▂▆▇▅▁ |
| cutting_edge |
0 |
1 |
4.05 |
1.03 |
1 |
3 |
4 |
5 |
7 |
▁▅▇▆▂ |
| delightful |
0 |
1 |
3.90 |
0.99 |
1 |
3 |
4 |
5 |
7 |
▂▅▇▅▁ |
| exciting |
0 |
1 |
3.87 |
1.02 |
1 |
3 |
4 |
5 |
7 |
▂▆▇▅▁ |
| friendly |
0 |
1 |
4.24 |
1.00 |
1 |
4 |
4 |
5 |
7 |
▁▃▇▆▂ |
| generous |
0 |
1 |
4.02 |
0.94 |
1 |
3 |
4 |
5 |
7 |
▁▅▇▅▁ |
| helpful |
0 |
1 |
3.87 |
1.05 |
1 |
3 |
4 |
5 |
7 |
▂▆▇▅▁ |
| intuitive |
0 |
1 |
3.68 |
0.94 |
1 |
3 |
4 |
4 |
7 |
▂▆▇▃▁ |
2 - Generate a correltion plot using the adjective ratings. How many groups of adjectives does this plot suggest there might be? Rationale provided below
cor(prst.df[, 1:9])
## adaptable best_value cutting_edge delightful
## adaptable 1.0000000000 0.380970661 0.0087593237 -0.0006843708
## best_value 0.3809706611 1.000000000 -0.0092000399 0.0198595003
## cutting_edge 0.0087593237 -0.009200040 1.0000000000 0.2900501053
## delightful -0.0006843708 0.019859500 0.2900501053 1.0000000000
## exciting 0.0088167117 -0.004825184 0.4150407073 0.2899300974
## friendly 0.2450568565 0.018336296 0.0375276189 0.0264161517
## generous -0.0094708960 -0.036970767 0.2699132134 0.1378107490
## helpful 0.1963087169 0.042590445 0.0205042979 0.0309901751
## intuitive 0.2436255356 -0.007144723 -0.0005556566 0.0061830017
## exciting friendly generous helpful intuitive
## adaptable 0.008816712 0.24505686 -0.009470896 0.19630872 0.2436255356
## best_value -0.004825184 0.01833630 -0.036970767 0.04259045 -0.0071447233
## cutting_edge 0.415040707 0.03752762 0.269913213 0.02050430 -0.0005556566
## delightful 0.289930097 0.02641615 0.137810749 0.03099018 0.0061830017
## exciting 1.000000000 0.02391905 0.229731805 0.00949948 0.0185476892
## friendly 0.023919054 1.00000000 -0.034218597 0.30257369 0.3835271954
## generous 0.229731805 -0.03421860 1.000000000 -0.02293925 -0.0343708701
## helpful 0.009499480 0.30257369 -0.022939250 1.00000000 0.3557581923
## intuitive 0.018547689 0.38352720 -0.034370870 0.35575819 1.0000000000
# corrplot to define associations among adjectives in survey data
corrplot.mixed(corr = cor(prst.df[,-10]),
upper.col = colorpanel(50, "red4", "gray70", "blue4"),
lower.col = colorpanel(50, "red4", "gray70", "blue4"),
upper = "ellipse", tl.pos = "lt", tl.col = "black",
tl.cex = .9)

# There are 3 distinct groupings which I identify as the User Experience Choice, the Early Adopters Choice, and the Bargain Hunters Choice.
# - The user experience choice (intuitive, helpful, friendly, adaptable),
# - The early adopters choice (exciting, delightful, and cutting edge),
# - The bargain hunters choice (best value, generous)
3 - Aggregate the mean of each adjective rating by brand. Plot a heatmap for the mean ratings by brand. How many groupings of adjectives does this plot suggest there might be? Provide a rationale for your answer.
# compute mean for heatmap interpretation
adj.mean <- aggregate(.~brand, data = prst.df, mean)
rownames(adj.mean) <- adj.mean[,1]
adj.mean <- adj.mean[, -1]
adj.mean
## adaptable best_value cutting_edge delightful exciting friendly generous
## Papa 4.362069 3.617241 3.434483 3.482759 3.324138 4.693103 3.662069
## Romeo 4.323410 3.855210 3.891746 3.810555 3.729364 4.293640 3.905277
## Sierra 4.265248 3.977305 4.280851 4.056738 4.079433 4.095035 4.177305
## Tango 4.162037 3.939815 4.643519 4.268519 4.444444 3.916667 4.370370
## helpful intuitive
## Papa 4.237931 4.131034
## Romeo 3.928281 3.803789
## Sierra 3.787234 3.495035
## Tango 3.425926 3.236111
# generate a heatmap of mean ratings (heatmap.2 in gplots)
heatmap.2(as.matrix(adj.mean[,]), col = brewer.pal(9, "GnBu"),
trace = "none", key = F, dend = "none", mar = c(6, 18) + .1,
main = "\n\n\n\n\nBrand Attributes")

# The heatmap almost encourages a 4th grouping in the adjectives that may not have been as obvious in the corrplot. The heatmap ultimately makes an argument for splitting the User Experience Choice into 2 different groups where you have intuitive and helpful grouped together, and then friendly and adaptable grouped together. In light of this, I would still group as 3 groups for the sake of explanation, and due to the fact the differentiation of those groups is weak for half of the brands represented.
4 - Compute the principal components for the adjectives using the original data. Make a scree plot. How many principal components are needed to explain the majority of the variance in the data using the scree plot method?
# Compute Principal Component Analysis
brand.pc <- prcomp(prst.df[, -10], scale = T)
summary(brand.pc)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.3889 1.3525 1.1068 0.93107 0.85339 0.8341 0.78533
## Proportion of Variance 0.2143 0.2033 0.1361 0.09632 0.08092 0.0773 0.06853
## Cumulative Proportion 0.2143 0.4176 0.5537 0.65003 0.73095 0.8083 0.87678
## PC8 PC9
## Standard deviation 0.75674 0.73236
## Proportion of Variance 0.06363 0.05959
## Cumulative Proportion 0.94041 1.00000
# Generate scree plot to determine PCA cutoff.
plot(brand.pc, type = "l")

# Based on the summary of cumulative proportion and the bend in the scree plot, I would choose the first 2 components. There is a strong argument to be made for stopping at 3 components since the scree plot has a bit of a double 'kink', or even a triple 'kink' at the 5th component, but the summary shows that the 3rd component drops below 20% of the explanation, so I will stop at 2.
5 - Use the principal components for the mean adjective ratings to generate a biplot of the adjectives loadings and brand scores for the first two principal components. How many groupings of adjectives does the biplot suggest?
# We have already computed the mean back in step 3, and we will use it now to rerun our analysis
# perform PCA for means
brand.mean.pc <- prcomp(adj.mean, scale = T)
summary(brand.mean.pc)
## Importance of components:
## PC1 PC2 PC3 PC4
## Standard deviation 2.9456 0.54780 0.15193 5.482e-15
## Proportion of Variance 0.9641 0.03334 0.00256 0.000e+00
## Cumulative Proportion 0.9641 0.99744 1.00000 1.000e+00
# perform PCA for means
biplot(brand.mean.pc, main = "Brand Positioning", cex = c(1.5, 1))

# The biplot suggests strongly for 2 groupings. One could place "best value" as a third category though it is very close to delightful, cutting_edge, and exciting
6 - Suppose you are the brand manager for Sierra, and you wish to change your brand position to become more competitive with the market leader Tango. What are some strategies you could pursue based on the PCA positions?
# As the brand manager for Sierra, I would recommend that we pursue more of the initiatives that are deemed to be more exciting and cutting edge by the customer base. Assuming that differentiation through being the best value in the market is not a core pillar of our brand, leaning more on the cutting edge and exciting may edge us closer to Tango.