Principal Component Anlysis (PCA)- This data set hosts survey responses from 1950 individuals rating 9 adjectives on a scale from 1 to 7. The survey aims to describe sentiment for 4 different brands.

We’ll explore the data and then run a PCA to see if anything stands out about the data.

Load Packages

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.4     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.0

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(janitor)

## 
## Attaching package: 'janitor'

## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

library(corrplot)

## corrplot 0.84 loaded

library(gplots)

## 
## Attaching package: 'gplots'

## The following object is masked from 'package:stats':
## 
##     lowess

library(RColorBrewer)
library(psych)

## 
## Attaching package: 'psych'

## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

library(skimr)
library(lattice)

Read in data

prst.df <- read.csv("~/Desktop/Github/PCA/prst.csv") %>%
  clean_names()
str(prst.df)

## 'data.frame':    1950 obs. of  10 variables:
##  $ adaptable   : int  5 3 5 4 5 4 5 5 6 6 ...
##  $ best_value  : int  5 4 4 2 4 4 6 6 5 4 ...
##  $ cutting_edge: int  3 2 3 2 4 5 4 4 4 5 ...
##  $ delightful  : int  3 3 3 4 4 5 5 5 3 4 ...
##  $ exciting    : int  4 3 5 2 4 5 4 5 4 5 ...
##  $ friendly    : int  6 3 3 5 3 4 5 4 7 5 ...
##  $ generous    : int  4 4 4 4 4 5 5 2 3 5 ...
##  $ helpful     : int  5 4 3 3 3 4 4 2 5 4 ...
##  $ intuitive   : int  5 4 3 5 2 3 3 3 6 5 ...
##  $ brand       : chr  "Romeo" "Romeo" "Sierra" "Papa" ...

head(prst.df)

##   adaptable best_value cutting_edge delightful exciting friendly generous
## 1         5          5            3          3        4        6        4
## 2         3          4            2          3        3        3        4
## 3         5          4            3          3        5        3        4
## 4         4          2            2          4        2        5        4
## 5         5          4            4          4        4        3        4
## 6         4          4            5          5        5        4        5
##   helpful intuitive  brand
## 1       5         5  Romeo
## 2       4         4  Romeo
## 3       3         3 Sierra
## 4       3         5   Papa
## 5       3         2 Sierra
## 6       4         3 Sierra

describe(prst.df)

##              vars    n mean   sd median trimmed  mad min max range  skew
## adaptable       1 1950 4.29 0.96      4    4.27 1.48   1   7     6 -0.01
## best_value      2 1950 3.87 1.04      4    3.89 1.48   1   7     6 -0.02
## cutting_edge    3 1950 4.05 1.03      4    4.05 1.48   1   7     6 -0.03
## delightful      4 1950 3.90 0.99      4    3.92 1.48   1   7     6 -0.03
## exciting        5 1950 3.87 1.02      4    3.89 1.48   1   7     6 -0.03
## friendly        6 1950 4.24 1.00      4    4.20 1.48   1   7     6  0.14
## generous        7 1950 4.02 0.94      4    4.02 1.48   1   7     6  0.02
## helpful         8 1950 3.87 1.05      4    3.89 1.48   1   7     6 -0.01
## intuitive       9 1950 3.68 0.94      4    3.70 1.48   1   7     6 -0.10
## brand*         10 1950 2.43 0.87      2    2.42 1.48   1   4     3  0.03
##              kurtosis   se
## adaptable        0.06 0.02
## best_value       0.12 0.02
## cutting_edge    -0.10 0.02
## delightful       0.05 0.02
## exciting        -0.04 0.02
## friendly        -0.08 0.02
## generous        -0.03 0.02
## helpful         -0.20 0.02
## intuitive        0.09 0.02
## brand*          -0.70 0.02

skim(prst.df)

Data summary
Name	prst.df
Number of rows	1950
Number of columns	10
_______________________
Column type frequency:
character	1
numeric	9
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
brand	0	1	4	6	0	4	0

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
adaptable	1	4.29	0.96	1	4	4	5	7	▁▃▇▆▂
best_value	1	3.87	1.04	1	3	4	5	7	▂▆▇▅▁
cutting_edge	1	4.05	1.03	1	3	4	5	7	▁▅▇▆▂
delightful	1	3.90	0.99	1	3	4	5	7	▂▅▇▅▁
exciting	1	3.87	1.02	1	3	4	5	7	▂▆▇▅▁
friendly	1	4.24	1.00	1	4	4	5	7	▁▃▇▆▂
generous	1	4.02	0.94	1	3	4	5	7	▁▅▇▅▁
helpful	1	3.87	1.05	1	3	4	5	7	▂▆▇▅▁
intuitive	1	3.68	0.94	1	3	4	4	7	▂▆▇▃▁

2 - Generate a correltion plot using the adjective ratings. How many groups of adjectives does this plot suggest there might be? Rationale provided below

cor(prst.df[, 1:9])

##                  adaptable   best_value  cutting_edge    delightful
## adaptable     1.0000000000  0.380970661  0.0087593237 -0.0006843708
## best_value    0.3809706611  1.000000000 -0.0092000399  0.0198595003
## cutting_edge  0.0087593237 -0.009200040  1.0000000000  0.2900501053
## delightful   -0.0006843708  0.019859500  0.2900501053  1.0000000000
## exciting      0.0088167117 -0.004825184  0.4150407073  0.2899300974
## friendly      0.2450568565  0.018336296  0.0375276189  0.0264161517
## generous     -0.0094708960 -0.036970767  0.2699132134  0.1378107490
## helpful       0.1963087169  0.042590445  0.0205042979  0.0309901751
## intuitive     0.2436255356 -0.007144723 -0.0005556566  0.0061830017
##                  exciting    friendly     generous     helpful     intuitive
## adaptable     0.008816712  0.24505686 -0.009470896  0.19630872  0.2436255356
## best_value   -0.004825184  0.01833630 -0.036970767  0.04259045 -0.0071447233
## cutting_edge  0.415040707  0.03752762  0.269913213  0.02050430 -0.0005556566
## delightful    0.289930097  0.02641615  0.137810749  0.03099018  0.0061830017
## exciting      1.000000000  0.02391905  0.229731805  0.00949948  0.0185476892
## friendly      0.023919054  1.00000000 -0.034218597  0.30257369  0.3835271954
## generous      0.229731805 -0.03421860  1.000000000 -0.02293925 -0.0343708701
## helpful       0.009499480  0.30257369 -0.022939250  1.00000000  0.3557581923
## intuitive     0.018547689  0.38352720 -0.034370870  0.35575819  1.0000000000

# corrplot to define associations among adjectives in survey data
corrplot.mixed(corr = cor(prst.df[,-10]),
               upper.col = colorpanel(50, "red4", "gray70", "blue4"),
               lower.col = colorpanel(50, "red4", "gray70", "blue4"),
               upper = "ellipse", tl.pos = "lt", tl.col = "black",
               tl.cex = .9)

# There are 3 distinct groupings which I identify as the User Experience Choice, the Early Adopters Choice, and the Bargain Hunters Choice.  
# - The user experience choice (intuitive, helpful, friendly, adaptable), 
# - The early adopters choice (exciting, delightful, and cutting edge), 
# - The bargain hunters choice (best value, generous)

3 - Aggregate the mean of each adjective rating by brand. Plot a heatmap for the mean ratings by brand. How many groupings of adjectives does this plot suggest there might be? Provide a rationale for your answer.

# compute mean for heatmap interpretation
adj.mean <- aggregate(.~brand, data = prst.df, mean)
rownames(adj.mean) <- adj.mean[,1]
adj.mean <- adj.mean[, -1]
adj.mean

##        adaptable best_value cutting_edge delightful exciting friendly generous
## Papa    4.362069   3.617241     3.434483   3.482759 3.324138 4.693103 3.662069
## Romeo   4.323410   3.855210     3.891746   3.810555 3.729364 4.293640 3.905277
## Sierra  4.265248   3.977305     4.280851   4.056738 4.079433 4.095035 4.177305
## Tango   4.162037   3.939815     4.643519   4.268519 4.444444 3.916667 4.370370
##         helpful intuitive
## Papa   4.237931  4.131034
## Romeo  3.928281  3.803789
## Sierra 3.787234  3.495035
## Tango  3.425926  3.236111

# generate a heatmap of mean ratings (heatmap.2 in gplots)
heatmap.2(as.matrix(adj.mean[,]), col = brewer.pal(9, "GnBu"),
          trace = "none", key = F, dend = "none", mar = c(6, 18) + .1,
          main = "\n\n\n\n\nBrand Attributes")

# The heatmap almost encourages a 4th grouping in the adjectives that may not have been as obvious in the corrplot.  The heatmap ultimately makes an argument for splitting the User Experience Choice into 2 different groups where you have intuitive and helpful grouped together, and then friendly and adaptable grouped together.  In light of this, I would still group as 3 groups for the sake of explanation, and due to the fact the differentiation of those groups is weak for half of the brands represented.

4 - Compute the principal components for the adjectives using the original data. Make a scree plot. How many principal components are needed to explain the majority of the variance in the data using the scree plot method?

# Compute Principal Component Analysis
brand.pc <- prcomp(prst.df[, -10], scale = T)
summary(brand.pc)

## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5    PC6     PC7
## Standard deviation     1.3889 1.3525 1.1068 0.93107 0.85339 0.8341 0.78533
## Proportion of Variance 0.2143 0.2033 0.1361 0.09632 0.08092 0.0773 0.06853
## Cumulative Proportion  0.2143 0.4176 0.5537 0.65003 0.73095 0.8083 0.87678
##                            PC8     PC9
## Standard deviation     0.75674 0.73236
## Proportion of Variance 0.06363 0.05959
## Cumulative Proportion  0.94041 1.00000

# Generate scree plot to determine PCA cutoff.
plot(brand.pc, type = "l")

# Based on the summary of cumulative proportion and the bend in the scree plot, I would choose the first 2 components.  There is a strong argument to be made for stopping at 3 components since the scree plot has a bit of a double 'kink', or even a triple 'kink' at the 5th component, but the summary shows that the 3rd component drops below 20% of the explanation, so I will stop at 2.

5 - Use the principal components for the mean adjective ratings to generate a biplot of the adjectives loadings and brand scores for the first two principal components. How many groupings of adjectives does the biplot suggest?

# We have already computed the mean back in step 3, and we will use it now to rerun our analysis 
# perform PCA for means
brand.mean.pc <- prcomp(adj.mean, scale = T)
summary(brand.mean.pc)

## Importance of components:
##                           PC1     PC2     PC3       PC4
## Standard deviation     2.9456 0.54780 0.15193 5.482e-15
## Proportion of Variance 0.9641 0.03334 0.00256 0.000e+00
## Cumulative Proportion  0.9641 0.99744 1.00000 1.000e+00

# perform PCA for means
biplot(brand.mean.pc, main = "Brand Positioning", cex = c(1.5, 1))

# The biplot suggests strongly for 2 groupings.  One could place "best value" as a third category though it is very close to delightful, cutting_edge, and exciting

6 - Suppose you are the brand manager for Sierra, and you wish to change your brand position to become more competitive with the market leader Tango. What are some strategies you could pursue based on the PCA positions?

# As the brand manager for Sierra, I would recommend that we pursue more of the initiatives that are deemed to be more exciting and cutting edge by the customer base. Assuming that differentiation through being the best value in the market is not a core pillar of our brand, leaning more on the cutting edge and exciting may edge us closer to Tango.