The dataset provides detailed insights into Starbucks customer demographics, purchasing behaviors, satisfaction ratings, and loyalty patterns. It includes variables such as age, income, visit frequency, time spent, product preferences, and ratings for aspects like product quality, pricing, ambiance, and service. Promotional influence through various channels (e.g., app, social media, email) is also captured, offering valuable data to assess marketing effectiveness. This information can be used to segment customers, identify high-value demographics, and improve targeted marketing campaigns or product offerings.
By analyzing satisfaction metrics and loyalty indicators, Starbucks can pinpoint areas for improvement, such as pricing perception or Wi-Fi quality, while leveraging high-performing aspects like ambiance or product quality. The data supports predictive modeling to understand customer loyalty and spending patterns, enabling tailored loyalty programs and promotions. Additionally, the dataset can guide strategic decisions on store locations, operational enhancements, and resource allocation for marketing efforts, driving improved customer experiences and business growth.
We will be using our data set to conduct a principal component analysis to:
Variable Descriptions:
As our data arrived pre-cleaned, we will take a quick look at the summary statistics of each column to see what we are working with.
## Id gender age status
## Min. : 1.00 Min. :0.0000 Min. :0.000 Min. :0.000
## 1st Qu.: 29.00 1st Qu.:0.0000 1st Qu.:1.000 1st Qu.:0.000
## Median : 60.00 Median :1.0000 Median :1.000 Median :2.000
## Mean : 60.15 Mean :0.5221 Mean :1.186 Mean :1.221
## 3rd Qu.: 90.00 3rd Qu.:1.0000 3rd Qu.:1.000 3rd Qu.:2.000
## Max. :122.00 Max. :1.0000 Max. :3.000 Max. :3.000
## income visitNo method timeSpend
## Min. :0.0000 Min. :0.000 Min. :0.000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:2.000 1st Qu.:0.000 1st Qu.:0.0000
## Median :0.0000 Median :3.000 Median :1.000 Median :0.0000
## Mean :0.7611 Mean :2.558 Mean :1.071 Mean :0.6106
## 3rd Qu.:1.0000 3rd Qu.:3.000 3rd Qu.:2.000 3rd Qu.:1.0000
## Max. :4.0000 Max. :3.000 Max. :5.000 Max. :4.0000
## location membershipCard itemPurchaseCoffee itempurchaseCold
## Min. :0.000 Min. :0.000 Min. :1 Min. :1
## 1st Qu.:1.000 1st Qu.:0.000 1st Qu.:1 1st Qu.:1
## Median :1.000 Median :0.000 Median :1 Median :1
## Mean :1.274 Mean :0.469 Mean :1 Mean :1
## 3rd Qu.:2.000 3rd Qu.:1.000 3rd Qu.:1 3rd Qu.:1
## Max. :2.000 Max. :1.000 Max. :1 Max. :1
## itemPurchasePastries itemPurchaseJuices itemPurchaseSandwiches
## Min. :1 Min. :1 Min. :1
## 1st Qu.:1 1st Qu.:1 1st Qu.:1
## Median :1 Median :1 Median :1
## Mean :1 Mean :1 Mean :1
## 3rd Qu.:1 3rd Qu.:1 3rd Qu.:1
## Max. :1 Max. :1 Max. :1
## itemPurchaseOthers spendPurchase productRate priceRate
## Min. :1 Min. :0.000 Min. :1.000 Min. :1.000
## 1st Qu.:1 1st Qu.:1.000 1st Qu.:3.000 1st Qu.:2.000
## Median :1 Median :1.000 Median :4.000 Median :3.000
## Mean :1 Mean :1.478 Mean :3.743 Mean :2.929
## 3rd Qu.:1 3rd Qu.:2.000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1 Max. :3.000 Max. :5.000 Max. :5.000
## promoRate ambianceRate wifiRate serviceRate chooseRate
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :2.000 Min. :1.00
## 1st Qu.:3.000 1st Qu.:3.000 1st Qu.:3.000 1st Qu.:3.000 1st Qu.:3.00
## Median :4.000 Median :4.000 Median :3.000 Median :4.000 Median :4.00
## Mean :3.876 Mean :3.841 Mean :3.283 Mean :3.814 Mean :3.54
## 3rd Qu.:5.000 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.00
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.00
## promoMethodApp promoMethodSoc promoMethodEmail promoMethodDeal
## Min. :1 Min. :1 Min. :1 Min. :1
## 1st Qu.:1 1st Qu.:1 1st Qu.:1 1st Qu.:1
## Median :1 Median :1 Median :1 Median :1
## Mean :1 Mean :1 Mean :1 Mean :1
## 3rd Qu.:1 3rd Qu.:1 3rd Qu.:1 3rd Qu.:1
## Max. :1 Max. :1 Max. :1 Max. :1
## promoMethodFriend promoMethodDisplay promoMethodBillboard promoMethodOthers
## Min. :1 Min. :1 Min. :1 Min. :0.0000
## 1st Qu.:1 1st Qu.:1 1st Qu.:1 1st Qu.:1.0000
## Median :1 Median :1 Median :1 Median :1.0000
## Mean :1 Mean :1 Mean :1 Mean :0.9912
## 3rd Qu.:1 3rd Qu.:1 3rd Qu.:1 3rd Qu.:1.0000
## Max. :1 Max. :1 Max. :1 Max. :1.0000
## loyal
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.2035
## 3rd Qu.:0.0000
## Max. :1.0000
Upon initial inspection, we can see a few unnecessary columns in our data set that would negatively affect our principal component analysis. Columns like “Id” do not give us any insight into the purchasing habits or demographics of our sample, so we will exclude them.
Our aim with our PCA is to create 2 holistic variables that capture as much of the variation in our data as possible and produce meaningful insight for our stakeholders. With that said, we will remove any columns that carry redundant information or no genuine variation (in other words, columns whose values are all the same), such as promoMethodApp and itemPurchasePastries. We also drop promoMethodOthers, which is nearly constant (its mean is 0.9912).
So let's continue and remove these columns, then look at a summary table and visualization of the resulting principal components.
check_min_max_equal <- function(data) {
  # Initialize an empty vector to store column names
  equal_cols <- c()
  # Iterate through each column in the data
  for (col_name in names(data)) {
    # Check if the column is numeric or integer
    if (is.numeric(data[[col_name]]) || is.integer(data[[col_name]])) {
      # Check if the min and max are equal
      if (min(data[[col_name]], na.rm = TRUE) == max(data[[col_name]], na.rm = TRUE)) {
        equal_cols <- c(equal_cols, col_name)
      }
    }
  }
  # Return the list of columns with min == max
  return(equal_cols)
}
check_min_max_equal(starbs_data)
## [1] "itemPurchaseCoffee" "itempurchaseCold" "itemPurchasePastries"
## [4] "itemPurchaseJuices" "itemPurchaseSandwiches" "itemPurchaseOthers"
## [7] "promoMethodApp" "promoMethodSoc" "promoMethodEmail"
## [10] "promoMethodDeal" "promoMethodFriend" "promoMethodDisplay"
## [13] "promoMethodBillboard"
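The same check can be sketched in a vectorized form that also flags near-constant columns such as promoMethodOthers. This is a minimal sketch on a hypothetical toy data frame: the column names and the 0.8 cutoff are assumptions made for illustration and are not taken from the original analysis.

```r
# Hypothetical toy data; column names are illustrative only.
toy <- data.frame(
  id         = 1:6,
  constant   = rep(1, 6),
  near_const = c(1, 1, 1, 1, 1, 0),
  rating     = c(3, 4, 5, 2, 4, 3)
)

# A numeric column is constant when it has exactly one distinct non-missing value.
is_constant <- vapply(
  toy,
  function(x) is.numeric(x) && length(unique(x[!is.na(x)])) == 1,
  logical(1)
)
constant_cols <- names(toy)[is_constant]

# Share of rows taken by the most common value in each column.
top_share <- vapply(toy, function(x) max(table(x)) / length(x), numeric(1))

# Assumed cutoff: treat a column as near-constant when one value covers >= 80% of rows.
near_constant_cols <- names(toy)[top_share >= 0.8 & !is_constant]

constant_cols
near_constant_cols
```

Applied to the real data, the cutoff would need to be chosen with the sample size in mind, since a rare category can still be informative.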
# Feature Engineering:
pca_table <- starbs_data %>%
  select(-Id, -itemPurchaseCoffee, -itempurchaseCold, -itemPurchasePastries, -itemPurchaseJuices, -itemPurchaseSandwiches, -itemPurchaseOthers,
         -promoMethodApp, -promoMethodSoc, -promoMethodEmail, -promoMethodDeal, -promoMethodFriend, -promoMethodDisplay, -promoMethodBillboard, -promoMethodOthers)
# Initial Call for PCA
pca_result = prcomp(pca_table, scale. = TRUE)
summary(pca_result)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.9330 1.6427 1.30238 1.2304 1.03516 0.93399 0.92294
## Proportion of Variance 0.2076 0.1499 0.09423 0.0841 0.05953 0.04846 0.04732
## Cumulative Proportion 0.2076 0.3575 0.45173 0.5358 0.59537 0.64383 0.69115
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.89038 0.82395 0.80877 0.80098 0.7336 0.69799 0.65709
## Proportion of Variance 0.04404 0.03772 0.03634 0.03564 0.0299 0.02707 0.02399
## Cumulative Proportion 0.73519 0.77291 0.80925 0.84489 0.8748 0.90185 0.92584
## PC15 PC16 PC17 PC18
## Standard deviation 0.63694 0.57776 0.55054 0.54061
## Proportion of Variance 0.02254 0.01854 0.01684 0.01624
## Cumulative Proportion 0.94838 0.96692 0.98376 1.00000
fviz_eig(pca_result, addlabels = TRUE, main = "Explained Variance by PCA Components")+
geom_bar(stat = "identity", fill = "#006241", color = "black") +
geom_line(aes(group = 1), color = "black")+
theme_minimal()
Our initial PCA shows that the first two principal components explain about 36% of the total variance, and the first eight explain about 74%.
For this reason, we will visualize the data with respect to the first two components, while drawing additional analysis from all eight.
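The proportions in the summary table follow directly from the component standard deviations: each eigenvalue is a squared standard deviation, and its share of the total is the proportion of variance. The sketch below reuses the standard deviations printed above (rounded as shown there), so the results should match the table only up to rounding.

```r
# Standard deviations copied from the summary(pca_result) output above.
sdev <- c(1.9330, 1.6427, 1.30238, 1.2304, 1.03516, 0.93399, 0.92294,
          0.89038, 0.82395, 0.80877, 0.80098, 0.7336, 0.69799, 0.65709,
          0.63694, 0.57776, 0.55054, 0.54061)

# Eigenvalues are the squared standard deviations of the components.
eigenvalues <- sdev^2

# With scaled inputs, the eigenvalues sum to the number of variables (18 here).
prop_var <- eigenvalues / sum(eigenvalues)
cum_var  <- cumsum(prop_var)

# Cumulative share explained by the first 2 and the first 8 components.
round(cum_var[c(2, 8)], 3)
```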
We can directly observe how each variable contributes to the components to make initial guesses as to what our 2 factors of interest could be:
p1 = fviz_contrib(pca_result, choice = "var", axes = 1) +
ggtitle("Contributions to PC1")+
geom_bar(stat = "identity", fill = "#006241", color = "black") +
theme_minimal() + theme(
axis.text.x = element_text(angle = 45, hjust = 1, size = 10)
)+
labs(x = "Column Name")
p2 = fviz_contrib(pca_result, choice = "var", axes = 2) +
ggtitle("Contributions to PC2")+
geom_bar(stat = "identity", fill = "#006241", color = "black") +
theme_minimal() + theme(
axis.text.x = element_text(angle = 45, hjust = 1, size = 10)
)+
labs(x = "Column Name")
grid.arrange(p1, p2, ncol = 2)
PC1 (Left): Key contributors are ProductRate, ChooseRate, AmbianceRate, and ServiceRate, highlighting customer satisfaction. Variables like Income, Method, and Gender have minimal impact.
PC2 (Right): Dominated by Income, WifiRate, and SpendPurchase, reflecting economic and spending behavior. Factors like PriceRate and ChooseRate contribute less.
This suggests PC1 focuses on satisfaction metrics, while PC2 emphasizes spending patterns. We can continue with a factor analysis to see if our assumptions are supported.
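For readers who want to see where contribution percentages like those in the bar charts come from, here is a minimal sketch using only base R. It assumes the usual definition, in which a variable's contribution to one component is its squared loading expressed as a percentage, and it uses the built-in `USArrests` data rather than the Starbucks data, which is not reproduced here.

```r
# Demo PCA on a built-in data set (not the survey data from this report).
pca_demo <- prcomp(USArrests, scale. = TRUE)

# Loadings (eigenvector entries) for the first component.
loadings_pc1 <- pca_demo$rotation[, "PC1"]

# Contribution of each variable to PC1, in percent.
# Each eigenvector has unit length, so these sum to 100.
contrib_pc1 <- 100 * loadings_pc1^2

# Reference level if every variable contributed equally.
expected <- 100 / length(contrib_pc1)

sort(contrib_pc1, decreasing = TRUE)
names(contrib_pc1)[contrib_pc1 > expected]
```

Variables above the equal-contribution level are the ones usually read as defining the component.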
##
## Call:
## factanal(x = pca_table, factors = 2, scores = "Bartlett", rotation = "varimax")
##
## Uniquenesses:
## gender age status income visitNo
## 0.877 0.780 0.874 0.662 0.723
## method timeSpend location membershipCard spendPurchase
## 0.968 0.913 0.952 0.746 0.574
## productRate priceRate promoRate ambianceRate wifiRate
## 0.549 0.630 0.852 0.413 0.540
## serviceRate chooseRate loyal
## 0.475 0.626 0.707
##
## Loadings:
## Factor1 Factor2
## gender -0.339
## age 0.469
## status 0.178 0.307
## income 0.578
## visitNo -0.223 -0.477
## method -0.178
## timeSpend 0.294
## location -0.209
## membershipCard -0.310 -0.397
## spendPurchase 0.648
## productRate 0.644 0.190
## priceRate 0.570 0.213
## promoRate 0.337 -0.186
## ambianceRate 0.755 -0.132
## wifiRate 0.592 -0.332
## serviceRate 0.715 -0.118
## chooseRate 0.536 0.294
## loyal -0.407 -0.357
##
## Factor1 Factor2
## SS loadings 2.976 2.163
## Proportion Var 0.165 0.120
## Cumulative Var 0.165 0.286
##
## Test of the hypothesis that 2 factors are sufficient.
## The chi square statistic is 187.74 on 118 degrees of freedom.
## The p-value is 4.66e-05
Factor 1:
AmbianceRate (0.755): Ambiance plays a significant role in Factor 1.
ServiceRate (0.715): Service quality is also strongly associated with this factor.
ProductRate (0.644): Customer perception of the product has a significant impact.
WifiRate (0.592): The quality of Wi-Fi influences this factor.
PriceRate (0.570): Price rating contributes moderately to this factor.
Interpretation:
F-1 seems to represent customer satisfaction or experience based on service, ambiance, products, and Wi-Fi quality. Businesses looking to optimize these features should focus on this factor to improve overall customer satisfaction.
Factor 2:
SpendPurchase (0.648): Spending on purchases is the largest contributor to this factor.
Income (0.578): Customer income is the second-largest contributor.
VisitNo (-0.477): Number of visits is negatively associated, suggesting repeat visits may differ across income levels.
MembershipCard (-0.397): Membership card usage is negatively correlated.
Interpretation:
F-2 seems to represent economic factors or spending behavior, as it is strongly influenced by income and spending patterns. Customers with higher incomes likely spend more but are less associated with frequent visits or loyalty programs.
Note that the chi-square test reported above (p-value 4.66e-05) rejects the hypothesis that 2 factors are sufficient, so these two factors summarize only part of the structure in the data.
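The uniquenesses and loadings in the factor analysis output are linked: for each variable, the uniqueness is one minus the sum of its squared loadings (its communality). The sketch below checks this for four of the satisfaction ratings, using the loadings printed above; small differences from the printed uniquenesses are expected because the loadings are rounded to three decimals.

```r
# Loadings copied from the factanal output above (Factor1, Factor2).
loadings <- rbind(
  ambianceRate = c(0.755, -0.132),
  serviceRate  = c(0.715, -0.118),
  productRate  = c(0.644,  0.190),
  wifiRate     = c(0.592, -0.332)
)
colnames(loadings) <- c("Factor1", "Factor2")

# Communality: share of each variable's variance explained by the two factors.
communality <- rowSums(loadings^2)

# Uniqueness: the share left unexplained.
uniqueness <- 1 - communality

round(uniqueness, 3)
```

A uniqueness near 1 (for example method at 0.968 in the output above) means the two factors explain almost none of that variable's variance.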
We can draw a variable bi-plot to show the direction in which each variable pulls points within our PCA plane, with color highlighting each variable's contribution level:
fviz_pca_var(pca_result, col.var = "contrib")+
labs(title = 'PCA Variable Bi-Plot')+
scale_color_gradient(low="pink", high="#006241") +
theme_minimal()
We can add to this bi-plot by also plotting our surveyed individuals and seeing where they land on our newly created 2-dimensional plane.
fviz_pca_biplot(pca_result,
ind.var = 'cos2',
col.var = "contrib",
gradient.cols = c(low="pink", high="#006241"),
repel = TRUE, # Avoid overlapping labels
geom = "point", # Display individuals as points
labelsize = 3
)
We were able to draw some strong conclusions from our first two principal components, but the PCA suggests there is valuable insight in the other components as well, so we will examine them here, starting with the eigenvalues:
# initializing of PCA for this section
starbs.pca = PCA(pca_table, scale.unit = TRUE, ncp = 6, graph = FALSE)
var = get_pca_var(starbs.pca)
eig.val = get_eigenvalue(starbs.pca)
eig.val
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 3.7363775 20.757653 20.75765
## Dim.2 2.6986167 14.992315 35.74997
## Dim.3 1.6962049 9.423360 45.17333
## Dim.4 1.5138111 8.410062 53.58339
## Dim.5 1.0715643 5.953135 59.53653
## Dim.6 0.8723396 4.846331 64.38286
## Dim.7 0.8518112 4.732284 69.11514
## Dim.8 0.7927714 4.404286 73.51943
## Dim.9 0.6788901 3.771612 77.29104
## Dim.10 0.6541047 3.633915 80.92495
## Dim.11 0.6415652 3.564251 84.48920
## Dim.12 0.5381204 2.989558 87.47876
## Dim.13 0.4871961 2.706645 90.18541
## Dim.14 0.4317701 2.398723 92.58413
## Dim.15 0.4056930 2.253850 94.83798
## Dim.16 0.3338046 1.854470 96.69245
## Dim.17 0.3030972 1.683873 98.37632
## Dim.18 0.2922617 1.623676 100.00000
The first two components (Dim.1 and Dim.2) explain 35.75% of the variance, while the first four components capture 53.58%, making them suitable for dimensionality reduction. The variance explained diminishes significantly after Dim.4 or Dim.5, suggesting an elbow point. Using the first 4 to 6 components balances dimensional reduction with retaining meaningful information.
Let us move forward by looking at the correlations between our variables and the retained dimensions, shown here as a table and a heatmap:
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## gender -0.06 0.49 -0.39 0.00 -0.29 0.19
## age 0.23 -0.54 -0.20 0.30 0.43 -0.09
## status 0.35 -0.26 -0.48 0.41 0.07 0.11
## income 0.21 -0.69 -0.14 0.31 0.12 -0.25
## visitNo -0.44 0.37 -0.02 0.31 -0.03 -0.44
## method -0.08 0.28 -0.71 -0.17 0.24 0.27
## timeSpend 0.11 -0.42 0.57 0.34 -0.24 0.28
## location -0.29 -0.07 0.40 -0.23 0.59 0.16
## membershipCard -0.52 0.27 0.22 -0.16 0.39 0.13
## spendPurchase 0.37 -0.56 -0.16 -0.34 0.02 0.23
## productRate 0.70 0.12 0.20 -0.20 -0.15 -0.03
## priceRate 0.66 0.10 0.15 -0.35 -0.01 -0.22
## promoRate 0.28 0.34 0.16 0.48 -0.02 0.49
## ambianceRate 0.65 0.43 0.16 0.15 0.07 -0.04
## wifiRate 0.43 0.59 0.06 0.09 0.30 -0.09
## serviceRate 0.64 0.41 -0.06 0.31 0.21 -0.10
## chooseRate 0.67 -0.04 0.18 -0.08 0.10 0.06
## loyal -0.59 0.10 0.20 0.44 0.05 0.02
The correlation table shows that Dim.1 captures customer satisfaction and experience, driven by metrics like ProductRate, ChooseRate, PriceRate, and AmbianceRate, while it is negatively associated with Loyal and MembershipCard, indicating satisfaction and loyalty may not always align. Dim.2, on the other hand, emphasizes economic and spending behaviors: Income and SpendPurchase both correlate negatively with it, opposite to WifiRate, which correlates positively and indicates that service quality is also relevant here.
Higher dimensions (Dim.3 and beyond) uncover more specific behaviors, such as Method and TimeSpend (Dim.3) and Location (Dim.5), suggesting niche patterns in customer interactions or service use. Overall, the first two dimensions explain broad themes of satisfaction and spending, while subsequent dimensions capture finer, context-dependent details.
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## gender 0.00 0.24 0.15 0.00 0.08 0.04
## age 0.05 0.29 0.04 0.09 0.19 0.01
## status 0.12 0.07 0.23 0.17 0.00 0.01
## income 0.04 0.47 0.02 0.10 0.01 0.06
## visitNo 0.20 0.14 0.00 0.09 0.00 0.19
## method 0.01 0.08 0.50 0.03 0.06 0.08
## timeSpend 0.01 0.18 0.33 0.12 0.06 0.08
## location 0.08 0.01 0.16 0.05 0.35 0.03
## membershipCard 0.27 0.07 0.05 0.02 0.15 0.02
## spendPurchase 0.14 0.31 0.02 0.11 0.00 0.05
## productRate 0.49 0.01 0.04 0.04 0.02 0.00
## priceRate 0.43 0.01 0.02 0.12 0.00 0.05
## promoRate 0.08 0.11 0.02 0.23 0.00 0.24
## ambianceRate 0.42 0.18 0.02 0.02 0.00 0.00
## wifiRate 0.18 0.35 0.00 0.01 0.09 0.01
## serviceRate 0.41 0.17 0.00 0.10 0.05 0.01
## chooseRate 0.45 0.00 0.03 0.01 0.01 0.00
## loyal 0.35 0.01 0.04 0.19 0.00 0.00
The Cos2 values show that Dim.1 captures customer satisfaction metrics like ProductRate, AmbianceRate, and ServiceRate, while Dim.2 focuses on economic factors, driven by Income and WifiRate. Dim.3 reflects operational behaviors such as Method and TimeSpend, while higher dimensions, like Dim.5, capture niche patterns like Location influence. Overall, the first three dimensions account for key satisfaction, spending, and behavioral trends, with higher dimensions offering finer, less dominant details.
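The two tables above are two views of the same quantity: for a variable in a scaled PCA, cos2 on a dimension is the square of its correlation with that dimension. A minimal sketch, using a few entries copied from the tables (both rounded to two decimals, so agreement is approximate):

```r
# Variable-dimension correlations copied from the correlation table above.
corr <- c(productRate_Dim1 = 0.70,
          income_Dim2      = -0.69,
          method_Dim3      = -0.71,
          location_Dim5    = 0.59)

# Matching entries copied from the cos2 table above.
cos2_printed <- c(productRate_Dim1 = 0.49,
                  income_Dim2      = 0.47,
                  method_Dim3      = 0.50,
                  location_Dim5    = 0.35)

# Squaring removes the sign, so cos2 measures strength of association only.
cos2_from_corr <- corr^2

round(cbind(corr, cos2_from_corr, cos2_printed), 2)
```

Because the sign is lost, the correlation table is the one to consult for direction, and the cos2 table for how well a dimension represents a variable.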
set.seed(123)
ind = get_pca_ind(starbs.pca)
res.km = kmeans(ind$coord, centers = 4, nstart = 25)
grp2 = as.factor(res.km$cluster)
fviz_pca_ind(starbs.pca,
geom.ind = "point", # display only points (no text)
col.ind = grp2, # color by groups
palette = c("red", "blue", "green", "orange"),
addEllipses = TRUE, # concentration ellipses
legend.title = "Groups"
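)
The plot shows the four k-means clusters projected onto Dim1 (20.8%) and Dim2 (15%), which together explain 35.8% of the variance. Because the clusters were computed on all six retained component coordinates (ncp = 6), groups may overlap in this two-dimensional view.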
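The choice of four clusters is not justified in the text, and one common way to sanity-check it is an elbow plot of the total within-cluster sum of squares over a range of k. The sketch below illustrates the idea on simulated coordinates (a hypothetical stand-in for `ind$coord`, with four well-separated groups by construction); it is not a result from the survey data.

```r
set.seed(123)

# Simulated 2-D coordinates with four groups; a stand-in for PCA coordinates.
centers <- rbind(c(-3, -3), c(-3, 3), c(3, -3), c(3, 3))
coords <- do.call(rbind, lapply(1:4, function(i) {
  cbind(rnorm(30, mean = centers[i, 1]), rnorm(30, mean = centers[i, 2]))
}))

# Total within-cluster sum of squares for k = 1..6.
wss <- vapply(1:6, function(k) {
  kmeans(coords, centers = k, nstart = 25)$tot.withinss
}, numeric(1))

# The "elbow" is where adding another cluster stops reducing wss by much.
plot(1:6, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

Running the same loop on the actual component coordinates would show whether the drop in within-cluster variation levels off near k = 4.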
Focus on enhancing ambiance, service quality, and product offerings, as these are key drivers of satisfaction. Use surveys and feedback to prioritize changes, train staff to improve service, and ensure consistent product quality.
Segment customers by income and spending potential to tailor offerings. High-income groups could benefit from premium services, while promotions and value-based offers can engage lower-income segments.
Reevaluating Loyalty Programs:
Offer personalized perks and communicate benefits clearly. Redesign programs to better incentivize spending and satisfaction, and track their impact through retention and engagement metrics.
By aligning services, spending strategies, and loyalty programs with customer needs, businesses can enhance satisfaction and drive revenue.