Abstract

The dataset provides detailed insights into Starbucks customer demographics, purchasing behaviors, satisfaction ratings, and loyalty patterns. It includes variables such as age, income, visit frequency, time spent, product preferences, and ratings for aspects like product quality, pricing, ambiance, and service. Promotional influence through various channels (e.g., app, social media, email) is also captured, offering valuable data to assess marketing effectiveness. This information can be used to segment customers, identify high-value demographics, and improve targeted marketing campaigns or product offerings.

By analyzing satisfaction metrics and loyalty indicators, Starbucks can pinpoint areas for improvement, such as pricing perception or Wi-Fi quality, while leveraging high-performing aspects like ambiance or product quality. The data supports predictive modeling to understand customer loyalty and spending patterns, enabling tailored loyalty programs and promotions. Additionally, the dataset can guide strategic decisions on store locations, operational enhancements, and resource allocation for marketing efforts, driving improved customer experiences and business growth.

We will be using our data set to conduct a principal component analysis to:

  • Focus on the most impactful variables for decision-making.
  • Reduce dimensionality for streamlined machine learning or clustering tasks.
  • Generate actionable insights for enhancing customer experience and boosting loyalty.

Breakdown of Variables:

Variable Descriptions:

  • Id: Unique identifier for each survey respondent.
  • gender: Encoded value indicating the respondent’s gender (e.g., 0 = Female, 1 = Male).
  • age: Encoded value representing age group.
  • status: Encoded marital or employment status of the respondent.
  • income: Encoded income level.
  • visitNo: Number of visits to Starbucks within a certain period.
  • method: Encoded value indicating the preferred method of visiting Starbucks (e.g., in-store, drive-thru, etc.).
  • timeSpend: Encoded time spent per visit.
  • location: Encoded Starbucks location type (e.g., urban, suburban, or rural).
  • membershipCard: Encoded value indicating if the respondent owns a Starbucks membership card (e.g., 0 = No, 1 = Yes).
  • itempurchase(…): Encoded value indicating if specified item was purchased.
  • spendPurchase: Amount spent per purchase, encoded.
  • (…)Rate: Rating of Starbucks in specified category (e.g., taste, quality).
  • serviceRate: Rating of customer service quality.
  • chooseRate: Overall satisfaction rating indicating the likelihood of choosing Starbucks again.
  • promoMethod(…): Encoded value indicating promotional influence via the specified Starbucks promotional channel.
  • loyal: Encoded value indicating loyalty (e.g., 1 = Loyal customer, 0 = Non-loyal customer).

PCA

As our data has already arrived pre-cleaned, we will take a quick look at the statistical spread of our columns to see what we are working with.

##        Id             gender            age            status     
##  Min.   :  1.00   Min.   :0.0000   Min.   :0.000   Min.   :0.000  
##  1st Qu.: 29.00   1st Qu.:0.0000   1st Qu.:1.000   1st Qu.:0.000  
##  Median : 60.00   Median :1.0000   Median :1.000   Median :2.000  
##  Mean   : 60.15   Mean   :0.5221   Mean   :1.186   Mean   :1.221  
##  3rd Qu.: 90.00   3rd Qu.:1.0000   3rd Qu.:1.000   3rd Qu.:2.000  
##  Max.   :122.00   Max.   :1.0000   Max.   :3.000   Max.   :3.000  
##      income          visitNo          method        timeSpend     
##  Min.   :0.0000   Min.   :0.000   Min.   :0.000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:2.000   1st Qu.:0.000   1st Qu.:0.0000  
##  Median :0.0000   Median :3.000   Median :1.000   Median :0.0000  
##  Mean   :0.7611   Mean   :2.558   Mean   :1.071   Mean   :0.6106  
##  3rd Qu.:1.0000   3rd Qu.:3.000   3rd Qu.:2.000   3rd Qu.:1.0000  
##  Max.   :4.0000   Max.   :3.000   Max.   :5.000   Max.   :4.0000  
##     location     membershipCard  itemPurchaseCoffee itempurchaseCold
##  Min.   :0.000   Min.   :0.000   Min.   :1          Min.   :1       
##  1st Qu.:1.000   1st Qu.:0.000   1st Qu.:1          1st Qu.:1       
##  Median :1.000   Median :0.000   Median :1          Median :1       
##  Mean   :1.274   Mean   :0.469   Mean   :1          Mean   :1       
##  3rd Qu.:2.000   3rd Qu.:1.000   3rd Qu.:1          3rd Qu.:1       
##  Max.   :2.000   Max.   :1.000   Max.   :1          Max.   :1       
##  itemPurchasePastries itemPurchaseJuices itemPurchaseSandwiches
##  Min.   :1            Min.   :1          Min.   :1             
##  1st Qu.:1            1st Qu.:1          1st Qu.:1             
##  Median :1            Median :1          Median :1             
##  Mean   :1            Mean   :1          Mean   :1             
##  3rd Qu.:1            3rd Qu.:1          3rd Qu.:1             
##  Max.   :1            Max.   :1          Max.   :1             
##  itemPurchaseOthers spendPurchase    productRate      priceRate    
##  Min.   :1          Min.   :0.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:1          1st Qu.:1.000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :1          Median :1.000   Median :4.000   Median :3.000  
##  Mean   :1          Mean   :1.478   Mean   :3.743   Mean   :2.929  
##  3rd Qu.:1          3rd Qu.:2.000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1          Max.   :3.000   Max.   :5.000   Max.   :5.000  
##    promoRate      ambianceRate      wifiRate      serviceRate      chooseRate  
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :2.000   Min.   :1.00  
##  1st Qu.:3.000   1st Qu.:3.000   1st Qu.:3.000   1st Qu.:3.000   1st Qu.:3.00  
##  Median :4.000   Median :4.000   Median :3.000   Median :4.000   Median :4.00  
##  Mean   :3.876   Mean   :3.841   Mean   :3.283   Mean   :3.814   Mean   :3.54  
##  3rd Qu.:5.000   3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:4.00  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.00  
##  promoMethodApp promoMethodSoc promoMethodEmail promoMethodDeal
##  Min.   :1      Min.   :1      Min.   :1        Min.   :1      
##  1st Qu.:1      1st Qu.:1      1st Qu.:1        1st Qu.:1      
##  Median :1      Median :1      Median :1        Median :1      
##  Mean   :1      Mean   :1      Mean   :1        Mean   :1      
##  3rd Qu.:1      3rd Qu.:1      3rd Qu.:1        3rd Qu.:1      
##  Max.   :1      Max.   :1      Max.   :1        Max.   :1      
##  promoMethodFriend promoMethodDisplay promoMethodBillboard promoMethodOthers
##  Min.   :1         Min.   :1          Min.   :1            Min.   :0.0000   
##  1st Qu.:1         1st Qu.:1          1st Qu.:1            1st Qu.:1.0000   
##  Median :1         Median :1          Median :1            Median :1.0000   
##  Mean   :1         Mean   :1          Mean   :1            Mean   :0.9912   
##  3rd Qu.:1         3rd Qu.:1          3rd Qu.:1            3rd Qu.:1.0000   
##  Max.   :1         Max.   :1          Max.   :1            Max.   :1.0000   
##      loyal       
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.2035  
##  3rd Qu.:0.0000  
##  Max.   :1.0000

Feature Engineering:

Upon initial inspection, we can see there are a few unnecessary columns in our filtered data set that would negatively affect our principal component analysis. Columns like “Id” do not give us any insight into the purchasing habits or demographics of our sample, and thus we will exclude them.

Our aim with our PCA is to create 2 holistic variables that capture as much of the variation present in our data as possible, to produce meaningful insight for our stakeholders. With that said, we will remove any columns with redundant information or information that doesn't give us any genuine variation (in other words, columns whose values are all the same). These would be columns such as promoMethodApp, itemPurchasePastries, etc. We also drop promoMethodOthers, which is nearly constant (mean 0.9912).
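To illustrate why constant columns have to go before a scaled PCA, here is a minimal sketch on a small made-up data frame (`toy`, `rating`, `spend`, and `always_one` are hypothetical names, not columns of the survey). Scaling divides each column by its standard deviation, which is zero for a constant column, so `prcomp()` with `scale. = TRUE` stops with an error.

```r
# Hypothetical toy data: two informative columns and one constant column.
set.seed(1)
toy <- data.frame(
  rating     = sample(1:5, 20, replace = TRUE),
  spend      = sample(0:3, 20, replace = TRUE),
  always_one = rep(1, 20)
)

# A constant column has zero standard deviation, so it cannot be rescaled.
sds <- sapply(toy, sd)

# Capture the error message instead of halting the script.
with_constant <- tryCatch(
  prcomp(toy, scale. = TRUE),
  error = function(e) conditionMessage(e)
)

# Dropping zero-variance columns lets the scaled PCA run.
toy_clean <- toy[, sds > 0, drop = FALSE]
pca_clean <- prcomp(toy_clean, scale. = TRUE)

print(with_constant)
print(summary(pca_clean))
```

A standard-deviation filter like this gives the same answer as a min/max comparison for exactly constant columns; nearly constant columns still need a judgment call.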

So let's continue with our report and remove these columns, then see a summarized table and visualization of our generated principal components.

check_min_max_equal <- function(data) {
  # Initialize an empty vector to store column names
  equal_cols <- c()
  
  # Iterate through each column in the data
  for (col_name in names(data)) {
    # Check if the column is numeric or integer
    if (is.numeric(data[[col_name]]) || is.integer(data[[col_name]])) {
      # Check if the min and max are equal
      if (min(data[[col_name]], na.rm = TRUE) == max(data[[col_name]], na.rm = TRUE)) {
        equal_cols <- c(equal_cols, col_name)
      }
    }
  }
  
  # Return the list of columns with min == max
  return(equal_cols)
}
  
check_min_max_equal(starbs_data)
##  [1] "itemPurchaseCoffee"     "itempurchaseCold"       "itemPurchasePastries"  
##  [4] "itemPurchaseJuices"     "itemPurchaseSandwiches" "itemPurchaseOthers"    
##  [7] "promoMethodApp"         "promoMethodSoc"         "promoMethodEmail"      
## [10] "promoMethodDeal"        "promoMethodFriend"      "promoMethodDisplay"    
## [13] "promoMethodBillboard"
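# Packages used below: dplyr (select, %>%), factoextra and ggplot2 (plots), gridExtra (grid.arrange)
library(dplyr); library(factoextra); library(ggplot2); library(gridExtra)
# Feature Engineering: drop the identifier and the constant or near-constant columns
pca_table <- starbs_data %>%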
  select(-Id, -itemPurchaseCoffee, -itempurchaseCold, -itemPurchasePastries, -itemPurchaseJuices, -itemPurchaseSandwiches, -itemPurchaseOthers,
         - promoMethodApp, -promoMethodSoc, -promoMethodEmail, -promoMethodDeal, -promoMethodFriend, -promoMethodDisplay, -promoMethodBillboard, -promoMethodOthers)

# Initial Call for PCA
pca_result = prcomp(pca_table, scale. = TRUE)
summary(pca_result)
## Importance of components:
##                           PC1    PC2     PC3    PC4     PC5     PC6     PC7
## Standard deviation     1.9330 1.6427 1.30238 1.2304 1.03516 0.93399 0.92294
## Proportion of Variance 0.2076 0.1499 0.09423 0.0841 0.05953 0.04846 0.04732
## Cumulative Proportion  0.2076 0.3575 0.45173 0.5358 0.59537 0.64383 0.69115
##                            PC8     PC9    PC10    PC11   PC12    PC13    PC14
## Standard deviation     0.89038 0.82395 0.80877 0.80098 0.7336 0.69799 0.65709
## Proportion of Variance 0.04404 0.03772 0.03634 0.03564 0.0299 0.02707 0.02399
## Cumulative Proportion  0.73519 0.77291 0.80925 0.84489 0.8748 0.90185 0.92584
##                           PC15    PC16    PC17    PC18
## Standard deviation     0.63694 0.57776 0.55054 0.54061
## Proportion of Variance 0.02254 0.01854 0.01684 0.01624
## Cumulative Proportion  0.94838 0.96692 0.98376 1.00000
fviz_eig(pca_result, addlabels = TRUE, main = "Explained Variance by PCA Components")+
  geom_bar(stat = "identity", fill = "#006241", color = "black") +
  geom_line(aes(group = 1), color = "black")+
  theme_minimal()

Our initial PCA shows that the first two principal components explain a cumulative 35.8% of the variance, and the first 8 explain 73.5%.

For this reason, we will continue by visualizing the data with respect to our first two components, while drawing additional analysis from the higher components.
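The proportions in the summary follow directly from the component standard deviations: each component's variance is its squared standard deviation, divided by the total. A minimal sketch, assuming only the values printed by `summary(pca_result)` above (the names `sdev`, `eigenvalues`, `prop_var`, and `cum_var` are illustrative):

```r
# Standard deviations of the 18 components, copied from summary(pca_result).
sdev <- c(1.9330, 1.6427, 1.30238, 1.2304, 1.03516, 0.93399, 0.92294,
          0.89038, 0.82395, 0.80877, 0.80098, 0.7336, 0.69799, 0.65709,
          0.63694, 0.57776, 0.55054, 0.54061)

eigenvalues <- sdev^2                          # variance carried by each component
prop_var    <- eigenvalues / sum(eigenvalues)  # proportion of variance
cum_var     <- cumsum(prop_var)                # cumulative proportion

# With 18 standardized variables the eigenvalues should sum to about 18.
print(sum(eigenvalues))

# Cumulative share after the first two and the first eight components.
print(round(cum_var[c(2, 8)], 4))
```

Because the inputs are rounded printouts, the results agree with the summary table only up to rounding.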

We can directly observe how each of our variables contributes to our PCA to make initial guesses as to what our 2 factors of interest could be:

Working with Principal components:

p1 = fviz_contrib(pca_result, choice = "var", axes = 1) + 
  ggtitle("Contributions to PC1")+
  geom_bar(stat = "identity", fill = "#006241", color = "black") +
  theme_minimal() + theme(
    axis.text.x = element_text(angle = 45, hjust = 1, size = 10)
    )+
      labs(x = "Column Name")


p2  = fviz_contrib(pca_result, choice = "var", axes = 2) + 
  ggtitle("Contributions to PC2")+
  geom_bar(stat = "identity", fill = "#006241", color = "black") +
  theme_minimal() + theme(
    axis.text.x = element_text(angle = 45, hjust = 1, size = 10)
    )+
      labs(x = "Column Name")

grid.arrange(p1, p2, ncol = 2)

PC1 (Left): Key contributors are ProductRate, ChooseRate, AmbianceRate, and ServiceRate, highlighting customer satisfaction. Variables like Income, Method, and Gender have minimal impact.

PC2 (Right): Dominated by Income, WifiRate, and SpendPurchase, reflecting economic and spending behavior. Factors like PriceRate and ChooseRate contribute less.

This suggests PC1 focuses on satisfaction metrics, while PC2 emphasizes spending patterns. We can continue with a factor analysis to see if our assumptions are supported.

x.f <- factanal(pca_table, 2, scores="Bartlett", rotation="varimax")
x.f
## 
## Call:
## factanal(x = pca_table, factors = 2, scores = "Bartlett", rotation = "varimax")
## 
## Uniquenesses:
##         gender            age         status         income        visitNo 
##          0.877          0.780          0.874          0.662          0.723 
##         method      timeSpend       location membershipCard  spendPurchase 
##          0.968          0.913          0.952          0.746          0.574 
##    productRate      priceRate      promoRate   ambianceRate       wifiRate 
##          0.549          0.630          0.852          0.413          0.540 
##    serviceRate     chooseRate          loyal 
##          0.475          0.626          0.707 
## 
## Loadings:
##                Factor1 Factor2
## gender                 -0.339 
## age                     0.469 
## status          0.178   0.307 
## income                  0.578 
## visitNo        -0.223  -0.477 
## method                 -0.178 
## timeSpend               0.294 
## location       -0.209         
## membershipCard -0.310  -0.397 
## spendPurchase           0.648 
## productRate     0.644   0.190 
## priceRate       0.570   0.213 
## promoRate       0.337  -0.186 
## ambianceRate    0.755  -0.132 
## wifiRate        0.592  -0.332 
## serviceRate     0.715  -0.118 
## chooseRate      0.536   0.294 
## loyal          -0.407  -0.357 
## 
##                Factor1 Factor2
## SS loadings      2.976   2.163
## Proportion Var   0.165   0.120
## Cumulative Var   0.165   0.286
## 
## Test of the hypothesis that 2 factors are sufficient.
## The chi square statistic is 187.74 on 118 degrees of freedom.
## The p-value is 4.66e-05
Factor 1: Key contributors:

  • AmbianceRate (0.755): Ambiance plays a significant role in Factor 1.
  • ServiceRate (0.715): Service quality is also strongly associated with this factor.
  • ProductRate (0.644): Customer perception of the product has a significant impact.
  • WifiRate (0.592): The quality of Wi-Fi influences this factor.
  • PriceRate (0.570): Price rating contributes moderately to this factor.

Interpretation: F-1 seems to represent customer satisfaction or experience based on service, ambiance, products, and Wi-Fi quality. Businesses looking to optimize these features should focus on this factor to improve overall customer satisfaction.

Factor 2: Key contributors:

  • SpendPurchase (0.648): Spending on purchases is the largest contributor to this factor.
  • Income (0.578): Customer income is the second-largest contributor.
  • VisitNo (-0.477): Number of visits is negatively associated, suggesting repeat visits may differ across income levels.
  • MembershipCard (-0.397): Membership card ownership is negatively associated.

Interpretation: F-2 seems to represent economic factors or spending behavior, as it is strongly influenced by income and spending patterns. Customers with higher incomes likely spend more but are less associated with frequent visits or loyalty programs.

Note that the chi-square test in the output above (p = 4.66e-05) rejects the hypothesis that 2 factors are sufficient, and the two factors together account for only 28.6% of the variance, so this two-factor reading is a simplification rather than a complete description of the data.
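The numbers in the `factanal()` output can be cross-checked by hand. A minimal sketch, assuming only the printed loadings for variables where both loadings are shown: a variable's uniqueness is one minus the sum of its squared loadings, and the degrees of freedom of the sufficiency test follow from the number of variables and factors (the names `loadings_tab`, `communality`, and `uniqueness` are illustrative).

```r
# Loadings copied from the factanal() output (rows with both loadings printed).
loadings_tab <- rbind(
  productRate  = c(0.644,  0.190),
  priceRate    = c(0.570,  0.213),
  ambianceRate = c(0.755, -0.132),
  wifiRate     = c(0.592, -0.332),
  serviceRate  = c(0.715, -0.118),
  chooseRate   = c(0.536,  0.294)
)
colnames(loadings_tab) <- c("Factor1", "Factor2")

communality <- rowSums(loadings_tab^2)  # variance shared with the two factors
uniqueness  <- 1 - communality          # variance left unexplained
print(round(uniqueness, 3))

# Degrees of freedom of the sufficiency test for p variables and k factors.
p <- 18
k <- 2
df <- ((p - k)^2 - (p + k)) / 2

# Upper-tail probability of the reported chi-square statistic.
p_value <- pchisq(187.74, df = df, lower.tail = FALSE)
print(c(df = df, p_value = p_value))
```

The recomputed uniquenesses match the printed ones up to rounding of the loadings, and the degrees of freedom come out at the 118 reported by the test.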

We can plot the variables on the first two components to show the direction in which each variable pulls points within our PCA graph, with color highlighting the contribution levels:

fviz_pca_var(pca_result, col.var = "contrib")+
  labs(title = 'PCA Variable Bi-Plot')+
  scale_color_gradient(low="pink", high="#006241") +
  theme_minimal()

We can add onto this bi-plot by also outputting our surveyed individuals onto a graph and seeing how they land on our newly created 2-dimensional plane.

fviz_pca_biplot(pca_result,
  col.ind = "cos2",       # color individuals by quality of representation
  col.var = "contrib",    # color variables by contribution
  gradient.cols = c(low = "pink", high = "#006241"),
  repel = TRUE,           # avoid overlapping labels
  geom = "point",         # display individuals as points
  labelsize = 3
)

Insight from other Principal Components:

We were able to draw some strong conclusions from our first two principal components, but the PCA suggests there is valuable insight to be gathered from the other components as well, so we will work with them here, starting with the eigenvalues:

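# Initialize the PCA for this section (PCA() comes from FactoMineR; corrplot is used further below)
library(FactoMineR); library(corrplot)
starbs.pca = PCA(pca_table, scale.unit = TRUE, ncp = 6, graph = FALSE)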
var = get_pca_var(starbs.pca)

eig.val = get_eigenvalue(starbs.pca)
eig.val
##        eigenvalue variance.percent cumulative.variance.percent
## Dim.1   3.7363775        20.757653                    20.75765
## Dim.2   2.6986167        14.992315                    35.74997
## Dim.3   1.6962049         9.423360                    45.17333
## Dim.4   1.5138111         8.410062                    53.58339
## Dim.5   1.0715643         5.953135                    59.53653
## Dim.6   0.8723396         4.846331                    64.38286
## Dim.7   0.8518112         4.732284                    69.11514
## Dim.8   0.7927714         4.404286                    73.51943
## Dim.9   0.6788901         3.771612                    77.29104
## Dim.10  0.6541047         3.633915                    80.92495
## Dim.11  0.6415652         3.564251                    84.48920
## Dim.12  0.5381204         2.989558                    87.47876
## Dim.13  0.4871961         2.706645                    90.18541
## Dim.14  0.4317701         2.398723                    92.58413
## Dim.15  0.4056930         2.253850                    94.83798
## Dim.16  0.3338046         1.854470                    96.69245
## Dim.17  0.3030972         1.683873                    98.37632
## Dim.18  0.2922617         1.623676                   100.00000

The first two components (Dim.1 and Dim.2) explain 35.75% of the variance, while the first four components capture 53.58%, making them suitable for dimensionality reduction. The variance explained diminishes significantly after Dim.4 or Dim.5, suggesting an elbow point. Using the first 4 to 6 components balances dimensional reduction with retaining meaningful information.
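One way to make the elbow reading less subjective is the Kaiser heuristic, which keeps components whose eigenvalue exceeds 1 (that is, components carrying more variance than a single standardized variable). This is a common rule of thumb rather than the method used in this report; the sketch below applies it to the eigenvalues printed above, with illustrative names `eig`, `kaiser_keep`, and `drops`.

```r
# Eigenvalues copied from the get_eigenvalue() table above.
eig <- c(3.7363775, 2.6986167, 1.6962049, 1.5138111, 1.0715643, 0.8723396,
         0.8518112, 0.7927714, 0.6788901, 0.6541047, 0.6415652, 0.5381204,
         0.4871961, 0.4317701, 0.4056930, 0.3338046, 0.3030972, 0.2922617)

# Kaiser heuristic: keep components with eigenvalue > 1.
kaiser_keep <- which(eig > 1)

# Drop in eigenvalue from each component to the next, to eyeball the elbow.
drops <- -diff(eig)

print(kaiser_keep)
print(round(drops[1:6], 3))
```

On these values the heuristic retains the first five components, which is in line with the elbow around Dim.4 or Dim.5 described above.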

Let us move forward with looking at correlation levels with our variables and dimensions in the form of a graph and table.

Corr by Dimension:

  • This shows the correlation coefficients between the original variables and the principal components (dimensions).
  • High correlations (positive or negative) indicate that the variable contributes significantly to that principal component.
  • These correlations are useful to interpret the meaning of each principal component.

We will observe these correlations in the form of a table and a heatmap:

round(var$cor,2)
##                Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## gender         -0.06  0.49 -0.39  0.00 -0.29  0.19
## age             0.23 -0.54 -0.20  0.30  0.43 -0.09
## status          0.35 -0.26 -0.48  0.41  0.07  0.11
## income          0.21 -0.69 -0.14  0.31  0.12 -0.25
## visitNo        -0.44  0.37 -0.02  0.31 -0.03 -0.44
## method         -0.08  0.28 -0.71 -0.17  0.24  0.27
## timeSpend       0.11 -0.42  0.57  0.34 -0.24  0.28
## location       -0.29 -0.07  0.40 -0.23  0.59  0.16
## membershipCard -0.52  0.27  0.22 -0.16  0.39  0.13
## spendPurchase   0.37 -0.56 -0.16 -0.34  0.02  0.23
## productRate     0.70  0.12  0.20 -0.20 -0.15 -0.03
## priceRate       0.66  0.10  0.15 -0.35 -0.01 -0.22
## promoRate       0.28  0.34  0.16  0.48 -0.02  0.49
## ambianceRate    0.65  0.43  0.16  0.15  0.07 -0.04
## wifiRate        0.43  0.59  0.06  0.09  0.30 -0.09
## serviceRate     0.64  0.41 -0.06  0.31  0.21 -0.10
## chooseRate      0.67 -0.04  0.18 -0.08  0.10  0.06
## loyal          -0.59  0.10  0.20  0.44  0.05  0.02
corrplot(var$cor, is.corr = FALSE)

The correlation matrix shows that Dim.1 captures customer satisfaction and experience, driven by metrics like ProductRate, ChooseRate, PriceRate, and AmbianceRate, while it is negatively associated with Loyal and MembershipCard, indicating satisfaction and loyalty may not always align. Dim.2, on the other hand, emphasizes economic and spending behaviors: Income and SpendPurchase both correlate negatively with it, while WifiRate correlates positively, so high Dim.2 scores point to lower-income, lower-spending respondents who rate the Wi-Fi well.

Higher dimensions (Dim.3 and beyond) uncover more specific behaviors, such as Method and TimeSpend (Dim.3) and Location (Dim.5), suggesting niche patterns in customer interactions or service use. Overall, the first two dimensions explain broad themes of satisfaction and spending, while subsequent dimensions capture finer, context-dependent details.
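A quick way to read a correlation table like this is to ask, for each variable, which dimension it correlates with most strongly in absolute value. A minimal sketch, assuming only a handful of rows copied from `round(var$cor, 2)` above (`cor_tab` and `dominant` are illustrative names):

```r
# A few rows copied from round(var$cor, 2).
cor_tab <- rbind(
  income      = c( 0.21, -0.69, -0.14,  0.31,  0.12, -0.25),
  method      = c(-0.08,  0.28, -0.71, -0.17,  0.24,  0.27),
  timeSpend   = c( 0.11, -0.42,  0.57,  0.34, -0.24,  0.28),
  location    = c(-0.29, -0.07,  0.40, -0.23,  0.59,  0.16),
  productRate = c( 0.70,  0.12,  0.20, -0.20, -0.15, -0.03),
  promoRate   = c( 0.28,  0.34,  0.16,  0.48, -0.02,  0.49)
)
colnames(cor_tab) <- paste0("Dim.", 1:6)

# For each variable, the dimension with the largest absolute correlation.
dominant <- colnames(cor_tab)[apply(abs(cor_tab), 1, which.max)]
names(dominant) <- rownames(cor_tab)
print(dominant)
```

The sign of a correlation is dropped here on purpose, since the orientation of a principal component is arbitrary; only the strength of the association decides where a variable is best read.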

Cos2 by Dimension:

  • Cos2 (squared cosine) measures the quality of the representation of a variable on the PCA components.
  • It is the squared correlation between the variable and a given principal component.
  • A high Cos2 value means the variable is well-represented by that principal component (it contributes significantly to its variance).
round(var$cos2,2)
##                Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## gender          0.00  0.24  0.15  0.00  0.08  0.04
## age             0.05  0.29  0.04  0.09  0.19  0.01
## status          0.12  0.07  0.23  0.17  0.00  0.01
## income          0.04  0.47  0.02  0.10  0.01  0.06
## visitNo         0.20  0.14  0.00  0.09  0.00  0.19
## method          0.01  0.08  0.50  0.03  0.06  0.08
## timeSpend       0.01  0.18  0.33  0.12  0.06  0.08
## location        0.08  0.01  0.16  0.05  0.35  0.03
## membershipCard  0.27  0.07  0.05  0.02  0.15  0.02
## spendPurchase   0.14  0.31  0.02  0.11  0.00  0.05
## productRate     0.49  0.01  0.04  0.04  0.02  0.00
## priceRate       0.43  0.01  0.02  0.12  0.00  0.05
## promoRate       0.08  0.11  0.02  0.23  0.00  0.24
## ambianceRate    0.42  0.18  0.02  0.02  0.00  0.00
## wifiRate        0.18  0.35  0.00  0.01  0.09  0.01
## serviceRate     0.41  0.17  0.00  0.10  0.05  0.01
## chooseRate      0.45  0.00  0.03  0.01  0.01  0.00
## loyal           0.35  0.01  0.04  0.19  0.00  0.00
corrplot(var$cos2, is.corr = FALSE)

The Cos2 values show that Dim.1 captures customer satisfaction metrics like ProductRate, ChooseRate, and AmbianceRate, while Dim.2 focuses on economic factors, driven by Income and WifiRate. Dim.3 reflects operational behaviors such as Method and TimeSpend, while higher dimensions capture niche patterns, such as Location on Dim.5 and PromoRate and VisitNo on Dim.6. Overall, the first three dimensions account for key satisfaction, spending, and behavioral trends, with higher dimensions offering finer, less dominant details.
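The correlation, Cos2, and contribution tables are three views of the same quantities. The sketch below shows the relationship using base R's `prcomp()` on simulated stand-in data (the data frame `toy` and its columns `v1` to `v5` are hypothetical, not the survey data): the correlation is the loading times the component's standard deviation, Cos2 is its square, and the contribution is Cos2 expressed as a percentage of the component's total.

```r
# Hypothetical stand-in data: 100 respondents, 5 numeric columns.
set.seed(42)
toy <- as.data.frame(matrix(rnorm(500), ncol = 5))
names(toy) <- paste0("v", 1:5)
toy$v2 <- toy$v1 + rnorm(100, sd = 0.5)  # make two columns correlated

pca <- prcomp(toy, scale. = TRUE)

# Correlation of each variable with each component:
# loading (eigenvector entry) times the component's standard deviation.
var_cor  <- sweep(pca$rotation, 2, pca$sdev, "*")
var_cos2 <- var_cor^2                                          # quality of representation
contrib  <- 100 * sweep(var_cos2, 2, colSums(var_cos2), "/")   # percent share per component

# Check against correlations computed directly from the component scores.
direct_cor <- cor(scale(toy), pca$x)
print(max(abs(var_cor - direct_cor)))

print(round(rowSums(var_cos2), 3))  # each variable sums to 1 over all components
print(round(colSums(contrib), 3))   # each component's contributions sum to 100
```

Since each variable's Cos2 values sum to 1 across all components, a variable with low Cos2 on the first few dimensions is simply one whose variation lives in the later ones.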

Clustering

set.seed(123)
ind = get_pca_ind(starbs.pca)
res.km = kmeans(ind$coord, centers = 4, nstart = 25)
grp2 = as.factor(res.km$cluster)
fviz_pca_ind(starbs.pca,
             geom.ind = "point", # display only points (no text)
             col.ind = grp2, # color by groups
             palette = c("red", "blue", "green", "orange"),
             addEllipses = TRUE, # concentration ellipses
             legend.title = "Groups"
)

The plot divides individuals into four clusters, spread across Dim1 (20.8%) and Dim2 (15%), which together explain 35.8% of the variance.

  • Cluster 1 (Red) is distinct, located on the negative side of Dim1 and Dim2. Given the correlations above, this likely represents individuals with lower satisfaction ratings but higher income and spending, since Income and SpendPurchase correlate negatively with Dim2.
  • Cluster 2 (Blue) is on the positive side of both dimensions, indicating higher satisfaction ratings together with lower income and spending.
  • Cluster 3 (Green) emphasizes economic behavior.
  • Cluster 4 (Orange) represents a balanced, average group.
  • Overlap between clusters suggests nuanced differences, especially among Clusters 2, 3, and 4. Part of the overlap arises because k-means was run on all six retained dimensions, while the plot shows only the first two.
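Cluster labels read off a scatter plot can be checked against the original variables by averaging each variable within each cluster. A minimal sketch on simulated stand-in data, using base R only (the data frame `toy` and its columns `satisfaction`, `income`, and `visits` are hypothetical, and clustering is done here on the first two components rather than six):

```r
# Hypothetical stand-in data with two planted groups.
set.seed(123)
n <- 120
toy <- data.frame(
  satisfaction = c(rnorm(60, 4, 0.5), rnorm(60, 2.5, 0.5)),
  income       = c(rnorm(60, 1, 0.5), rnorm(60, 3, 0.5)),
  visits       = rnorm(n, 2, 1)
)

pca    <- prcomp(toy, scale. = TRUE)
scores <- pca$x[, 1:2]                       # coordinates used for clustering
km     <- kmeans(scores, centers = 4, nstart = 25)

# Mean of each original variable per cluster, plus cluster sizes.
profile   <- aggregate(toy, by = list(cluster = km$cluster), FUN = mean)
profile$n <- as.vector(table(km$cluster))
print(profile)
```

Applied to the survey data, a table like this would show directly whether a cluster sits high or low on income, spending, and the rating columns, instead of inferring it from the signs of the component axes.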

Actionable Insights

Improving Customer Satisfaction:

Focus on enhancing ambiance, service quality, and product offerings, as these are key drivers of satisfaction. Use surveys and feedback to prioritize changes, train staff to improve service, and ensure consistent product quality.

Targeting Spending Behavior:

Segment customers by income and spending potential to tailor offerings. High-income groups could benefit from premium services, while promotions and value-based offers can engage lower-income segments.

Reevaluating Loyalty Programs:

Align loyalty rewards with customer preferences by offering personalized perks and communicating benefits clearly. Redesign programs to better incentivize spending and satisfaction while tracking their impact through retention and engagement metrics.

By aligning services, spending strategies, and loyalty programs with customer needs, businesses can enhance satisfaction and drive revenue.