Introduction

This analysis explores a dataset containing information on over 8,000 beauty products offered by Sephora, a major cosmetics retailer, on their online store as of March 2023. The data was scraped from the Sephora website and is available on Kaggle.

The dataset includes rich details about each product, such as brand and product names, prices, ingredients, customer ratings, review counts, and a “loves” count indicating customer favoritism. It also provides insights into product attributes like limited edition, newly launched, online exclusivity, stock availability, and whether a product belongs to a Sephora exclusive brand.

By analyzing this comprehensive dataset, we aim to uncover valuable insights into market trends, customer preferences, and Sephora’s overall product strategy. Key areas explored include:

  1. Brand Comparison: Examining the top brands by number of products offered, which indicates the relative size of each brand’s assortment on the site (product counts are not a measure of market share).

  2. Product Type Distribution: Evaluating the strategic mix of Sephora exclusive, online-only, new, and limited-edition products, which can drive customer loyalty and differentiation.

  3. Customer Favoritism: Utilizing the “loves” count metric to understand which product categories resonate most strongly with consumers and identify potential star products.

  4. Stock Availability: Investigating out-of-stock products across categories, shedding light on demand patterns, supply chain challenges, and opportunities for inventory optimization.

  5. Pricing Strategies: Analyzing the price distribution for each product category through cumulative distribution functions (CDFs), revealing pricing positioning and market segmentation approaches.

Through this multi-faceted analysis, we aim to provide valuable insights into Sephora’s online beauty business, enabling data-driven decisions on product assortment, pricing, inventory management, and customer engagement strategies.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from collections import Counter
from collections import defaultdict

path = "/Users/janellnapper/Desktop/DS 736/Python Project/"

filename = 'product_info.csv'

df = pd.read_csv(path + filename)

# Data Cleaning
# Fill missing numeric data (0 here marks "no rating or reviews recorded").
# Assign the result back instead of calling fillna(inplace=True) on a column
# selection, which newer pandas versions warn about.
df['rating'] = df['rating'].fillna(0)
df['reviews'] = df['reviews'].fillna(0)

# Fill missing categorical data
categorical_columns = ['size', 'variation_type', 'variation_value', 'variation_desc', 
                       'ingredients', 'highlights', 'secondary_category', 'tertiary_category']
df[categorical_columns] = df[categorical_columns].fillna('Unknown')

# Drop columns with a large number of missing values
df = df.drop(columns=['value_price_usd', 'sale_price_usd', 'child_max_price', 'child_min_price'])


# Correcting data types
df['reviews'] = df['reviews'].astype(int)
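
The cleaning step drops columns "with a large number of missing values" without showing how they were identified. A minimal sketch of one way to check this is given here; the helper name `columns_above_missing_share`, the 50% threshold, and the toy values are assumptions for illustration, not part of the original analysis.

```python
import pandas as pd

def columns_above_missing_share(frame, threshold=0.5):
    """Return the names of columns whose share of missing values exceeds `threshold`."""
    missing_share = frame.isna().mean()  # fraction of NaN values per column
    return missing_share[missing_share > threshold].index.tolist()

# Toy frame standing in for product_info.csv (values are made up).
toy = pd.DataFrame({
    "price_usd": [10.0, 25.0, 40.0, 18.0],
    "sale_price_usd": [None, None, None, 15.0],  # 75% missing
    "rating": [4.5, None, 3.9, 4.1],             # 25% missing
})

sparse_columns = columns_above_missing_share(toy, threshold=0.5)
print(sparse_columns)
```

Run against the real `df` before any columns are dropped, a check like this would make the choice of columns to remove reproducible.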

Brand Comparison

The graph below ranks the top 10 brands by the number of products they offer. Each horizontal bar is labeled with its count, which makes the assortments easy to compare across brands.

top_brands = df['brand_name'].value_counts().head(10)
fig, ax = plt.subplots(figsize=(12, 8)) 
barplot = sns.barplot(x=top_brands.values, y=top_brands.index, color='magenta', ax=ax)
ax.set_title('Top 10 Brands by Number of Products')
ax.set_xlabel('Number of Products')
ax.set_ylabel('Brand Name')
ax.set_yticklabels(ax.get_yticklabels(), rotation=45, ha='right')

# Use enumerate to iterate over bars and their indices
for row_counter, p in enumerate(barplot.patches):
    value_at_row_counter = top_brands.values[row_counter]  # Explicitly storing the value
    ax.text(p.get_width() + 1, p.get_y() + p.get_height() / 2, '{:1.0f}'.format(value_at_row_counter), ha='left', va='center', fontsize=10)

 
plt.show()

At the top of the chart, SEPHORA COLLECTION stands out with a commanding lead, boasting a remarkable 352 products in its assortment. This bar is significantly longer than the rest, nearly doubling the length of the second-highest bar, CLINIQUE, which offers 179 products. The substantial gap between these two brands highlights SEPHORA COLLECTION’s dominant position in terms of product variety.

Moving down the chart, Dior follows with 136 products, 43 fewer than CLINIQUE. The next group of brands, tarte (131 products), NEST New York (115 products), bumble and bumble (110 products), Kérastase (108 products), and TOM FORD (100 products), shows a more gradual decline in the number of products offered.

Charlotte Tilbury (99 products) and Anastasia Beverly Hills (95 products) round out the top 10. Their bars are closely spaced, showing that assortment sizes at the lower end of the ranking differ by only a few products.

Overall, the graph shows a steep drop from SEPHORA COLLECTION to the rest of the field, followed by a tightly packed group of brands whose assortments are similar in size. Product counts describe assortment breadth only; they say nothing about sales or popularity.
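
The comparisons above ("nearly double", closely spaced bars) can be checked with a little arithmetic. The sketch below uses only the product counts quoted in the text; the variable names are illustrative.

```python
# Product counts for the top 10 brands, as quoted in the text above.
top_brand_counts = {
    "SEPHORA COLLECTION": 352,
    "CLINIQUE": 179,
    "Dior": 136,
    "tarte": 131,
    "NEST New York": 115,
    "bumble and bumble": 110,
    "Kérastase": 108,
    "TOM FORD": 100,
    "Charlotte Tilbury": 99,
    "Anastasia Beverly Hills": 95,
}

# Rank brands from largest to smallest assortment.
ranked = sorted(top_brand_counts.items(), key=lambda item: item[1], reverse=True)

# Ratio of the largest assortment to the second largest.
lead_ratio = ranked[0][1] / ranked[1][1]

# Gap in product count between each brand and the brand ranked just above it.
gap_to_previous = {
    ranked[i + 1][0]: ranked[i][1] - ranked[i + 1][1]
    for i in range(len(ranked) - 1)
}

print(round(lead_ratio, 2))
print(gap_to_previous)
```

The ratio of about 1.97 supports "nearly double", and the gaps make it easy to see which neighboring brands are separated by only a handful of products.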

Product Type Distribution

The pie chart below shows how four product flags (online only, limited edition, new, and Sephora exclusive) are distributed at Sephora. Each slice is that flag’s share of all flags counted. A product can carry more than one flag, and products with none of the four are not represented.

product_types = ['online_only', 'limited_edition', 'new', 'sephora_exclusive']
new_labels = ['Online Only', 'Limited Edition', 'New', 'Sephora Exclusive']
# Count flagged products per column (avoid shadowing the built-in `type`)
type_counts = [df[col].sum() for col in product_types]

# Define a pastel color palette
pastel_colors = ['#ffb3e6', '#c2c2f0', '#ffcc99', '#baffc9']  # Pastel pink, blue, orange, and green

plt.figure(figsize=(10, 6))
plt.pie(type_counts, labels=new_labels, colors=pastel_colors, autopct='%1.1f%%', startangle=140)
plt.title('Proportion of Different Product Types')
plt.show()

The largest slice is “Sephora Exclusive,” at 43.6% of all flags counted. Because the slices are shares of flag counts rather than shares of the catalog, this does not mean that nearly half of Sephora’s products are exclusive. It does show that exclusivity is the most common of the four attributes, consistent with a strategy of offering items that customers cannot find at other retailers.

The next largest slice is “Online Only,” at 34.2% of flags. Online-only status is therefore the second most common of the four attributes, reflecting the importance of e-commerce in Sephora’s business model.

“New” products account for 11.2% of the pie, a smaller but notable share that points to a steady flow of recent additions to the assortment.

Finally, “Limited Edition” products represent 11.0%, the smallest slice by a narrow margin. Limited runs can create a sense of urgency and exclusivity that encourages immediate purchases.

Overall, the pie chart shows that, among these four attributes, exclusive and online-only products dominate, a mix that can enhance customer loyalty and differentiate the product line in a competitive market.
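
Because a product can carry more than one of these flags, the pie chart’s percentages are shares of flag counts, not shares of the catalog. The sketch below contrasts the two measures and counts multi-flag products. The column names match the dataset; the helper `flag_summary` and the toy data are assumptions for illustration.

```python
import pandas as pd

FLAG_COLUMNS = ["online_only", "limited_edition", "new", "sephora_exclusive"]

def flag_summary(frame, flags=FLAG_COLUMNS):
    """Compare each flag's share of all flags with its share of all products."""
    counts = frame[flags].sum()
    return pd.DataFrame({
        "share_of_flags": counts / counts.sum(),   # what the pie chart displays
        "share_of_catalog": frame[flags].mean(),   # fraction of products with the flag
    })

# Toy data: five products, some carrying several flags, one carrying none.
toy = pd.DataFrame({
    "online_only":       [1, 0, 0, 1, 0],
    "limited_edition":   [0, 0, 1, 0, 0],
    "new":               [0, 0, 1, 0, 0],
    "sephora_exclusive": [1, 1, 0, 1, 0],
})

summary = flag_summary(toy)
multi_flag_products = int((toy[FLAG_COLUMNS].sum(axis=1) > 1).sum())

print(summary)
print(multi_flag_products)
```

In the toy data, "sephora_exclusive" is about 43% of all flags but 60% of the catalog, which shows how far the two readings can diverge. A bar chart of `share_of_catalog` would answer "what fraction of products are exclusive?" directly.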

Customer Favoritism

Understanding that the ‘loves count’ represents the number of individuals who have marked a product as a favorite gives us valuable insight into consumer preferences and the relative appeal of products within different categories. The boxplot below displays the distribution of consumer favorites within various product categories on a logarithmic scale to account for wide-ranging values.

# Boxplot of 'loves_count' across primary categories, drawn on a log-scaled y-axis
plt.figure(figsize=(12, 8))
sns.boxplot(x='primary_category', y='loves_count', data=df)

# Using strip plot for better handling of large data
sns.stripplot(x='primary_category', y='loves_count', data=df, color='black', alpha=0.5, size=2)

# Use a log-scaled y-axis (the data itself is not transformed);
# products with a loves_count of 0 cannot be drawn on a log axis
plt.yscale('log')


plt.title('Distribution of Loves Count Across Primary Categories (Log Scale)')
plt.xlabel('Primary Category')
plt.ylabel('Loves Count (Log Scale)')
plt.show()

The chart reveals several key observations. First, the “Skincare” and “Makeup” categories show a high median ‘loves count’, suggesting that products in these categories are frequently marked as favorites. The wide spread and numerous outliers, particularly in “Makeup”, indicate that some exceptionally popular products stand far above the rest. The same spread also means that favoritism varies substantially from product to product within these categories.

In contrast, categories such as “Men” and “Gifts” display a more condensed interquartile range with a lower median, indicating that products in these categories tend to have fewer marks as favorites. This could point to a more niche market or suggest that these categories contain products that are less commonly marked as favorites by consumers. Additionally, “Men” and “Gifts” show fewer outliers, suggesting that favoritism in these categories is more evenly distributed across products.

The “Fragrance” category shows a wide range but with a median ‘loves count’ that is not as high as “Makeup” or “Skincare”. This may reflect a consistent but more varied interest in fragrance products, with no single item dominating in popularity.

The graph indicates that categories like “Makeup” and “Skincare” contain products that are very frequently marked as favorites, while other categories may benefit from strategies to increase their visibility or consumer engagement. The dataset contains no sales figures, so ‘loves count’ is only a proxy for demand. Even so, it can inform inventory and stocking decisions, as products with a higher ‘loves count’ might require more stock to meet consumer interest.
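
The medians and interquartile ranges read off the boxplot can also be computed directly, which avoids judging them by eye on a log axis. The following is a minimal sketch. The column names `primary_category` and `loves_count` come from the dataset; the helper `loves_summary` and the toy values are assumptions for illustration.

```python
import pandas as pd

def loves_summary(frame):
    """Per-category quartiles of loves_count, plus a count of zero-love products."""
    by_category = frame.groupby("primary_category")["loves_count"]
    quartiles = by_category.quantile([0.25, 0.5, 0.75]).unstack()
    quartiles.columns = ["q1", "median", "q3"]
    quartiles["iqr"] = quartiles["q3"] - quartiles["q1"]
    # Zero values cannot be drawn on a log axis, so report how many there are.
    is_zero = frame["loves_count"].eq(0)
    quartiles["zero_loves"] = is_zero.groupby(frame["primary_category"]).sum()
    return quartiles.sort_values("median", ascending=False)

# Toy data with made-up loves counts.
toy = pd.DataFrame({
    "primary_category": ["Makeup", "Makeup", "Makeup", "Men", "Men", "Men"],
    "loves_count": [100, 1000, 10000, 0, 10, 20],
})

summary = loves_summary(toy)
print(summary)
```

The `zero_loves` column is worth checking on the real data, because any product with a count of 0 is not visible in a log-scaled plot.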

Stock Availability

The graph below is a bubble chart depicting the number of out-of-stock products categorized by their primary category. The size of each bubble corresponds to the quantity of out-of-stock items within that particular category.

# Filter out-of-stock products
out_of_stock_df = df[df['out_of_stock'] == 1]

# Count the number of out-of-stock products in each primary category
category_counts = out_of_stock_df['primary_category'].value_counts()

# Create a bubble chart
plt.figure(figsize=(12, 10))
for i, (category, count) in enumerate(category_counts.items()):
    plt.scatter(x=i, y=count, s=count*10, alpha=0.5)  
   

plt.xticks(range(len(category_counts)), category_counts.index)
plt.title('Bubble Chart of Out of Stock Products by Primary Category')
plt.xlabel('Primary Category')
plt.ylabel('Number of Out of Stock Products')
plt.show()

The most notable observation is the significantly larger bubble for the “Makeup” category, situated at the top-left, indicating it has the highest number of out-of-stock products, surpassing 250. This could mean that makeup products are very popular and sell out quickly, that supply chain issues are keeping stock levels low, or simply that “Makeup” is one of the largest categories and so has more products that can be out of stock.

Following “Makeup,” the “Fragrance” and “Skincare” categories also show a substantial number of out-of-stock items, with bubbles situated in the mid-range of the chart, around 150 and 140 out-of-stock products respectively. Their sizes are relatively close, which could imply that these categories share a similar level of demand or stocking challenges.

The “Hair” category is represented by a smaller bubble, meaning fewer out-of-stock products than the previous categories. This may reflect better stock availability, lower demand, or a smaller assortment; raw counts alone cannot separate these explanations.

The remaining categories, including “Mini Size,” “Bath & Body,” “Men,” and “Tools & Brushes,” are represented with even smaller bubbles clustered towards the bottom right of the chart. These categories have the fewest out-of-stock products, which may suggest lower demand, better stocking, or simply fewer products listed.

In summary, the bubble chart effectively illustrates the disparity in stock availability among different product categories, with “Makeup” standing out as the category most prone to stock shortages.
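
Raw out-of-stock counts grow with the number of products in a category, so a rate is a fairer basis for comparing categories. The sketch below computes one. The column names `primary_category` and `out_of_stock` come from the dataset; the helper `out_of_stock_rates` and the toy data are assumptions for illustration.

```python
import pandas as pd

def out_of_stock_rates(frame):
    """Out-of-stock count, product count, and out-of-stock rate per category."""
    grouped = frame.groupby("primary_category")["out_of_stock"]
    result = grouped.agg(out_of_stock="sum", products="count")
    result["rate"] = result["out_of_stock"] / result["products"]
    return result.sort_values("rate", ascending=False)

# Toy data: Makeup has more out-of-stock items in total, Hair has a higher rate.
toy = pd.DataFrame({
    "primary_category": ["Makeup"] * 10 + ["Hair"] * 4,
    "out_of_stock": [1, 1] + [0] * 8 + [1] + [0] * 3,
})

rates = out_of_stock_rates(toy)
print(rates)
```

In the toy data, "Makeup" has the larger count (2 versus 1) but the lower rate (20% versus 25%), so the ranking by rate can differ from the ranking the bubble chart shows.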

Pricing Strategies

The graph below shows the empirical cumulative distribution function (CDF) of prices for each product category. The x-axis represents price points in USD, while the y-axis shows the cumulative proportion of products priced at or below those points within each category. Prices at or above each category’s 99th percentile are excluded before plotting, so each curve describes roughly the lower 99% of that category’s prices.

categories = df['primary_category'].unique()

plt.figure(figsize=(12, 8))

# Loop through each category and plot the CDF
for category in categories:
    # Filter the DataFrame for the category and drop NA values
    category_prices = df[df['primary_category'] == category]['price_usd'].dropna()
    
    # Calculate the 99th percentile of the category's prices
    category_price_threshold = category_prices.quantile(0.99)
    
    # Keep only prices strictly below that threshold, so each curve is the CDF
    # of the category's lower ~99% of prices and extreme outliers are excluded
    filtered_category_prices = category_prices[category_prices < category_price_threshold]
    
    # Sort prices for CDF
    sorted_category_prices = np.sort(filtered_category_prices)
    
    # Calculate the CDF
    cumulative_category = np.arange(1, len(sorted_category_prices)+1) / len(sorted_category_prices)
    
    # Plot the CDF
    plt.plot(sorted_category_prices, cumulative_category, marker='.',markersize=4,  linestyle='none', label=category)


# Customizing the plot
plt.title('Cumulative Distribution Function of Prices per Product Category')
plt.xlabel('Price (USD)')
plt.ylabel('CDF - Proportion of Total')
plt.legend()
plt.grid(True)
plt.xlim(0, df['price_usd'].quantile(0.99))  # Set a common x-axis limit for all categories
plt.show()

Fragrance Category: This category starts to rise steeply at the lower price end and continues to gradually increase through the mid-price range, suggesting a wide range of prices. It reaches near-total proportion at higher prices, indicating some premium products.

Bath & Body, Mini Size, Hair: These categories show a rapid rise in the CDF at lower price points, suggesting that a significant proportion of products are more affordable.

Makeup: The makeup category’s CDF line suggests a moderate range of prices, with a steady ascent, indicating a more even distribution of price points.

Skincare: Skincare prices are more spread out, as the line rises more gradually, indicating a significant presence of mid-range to high-end products.

Tools & Brushes: This category seems to have a balance of price points with a consistent gradual rise in its CDF line.

Men: The CDF line for men’s products rises sharply at first but then levels off, suggesting that while there are affordable options, the category lacks higher-priced items compared to others.

Gifts: The gifts category appears to have a very steep initial rise, indicating a large number of lower-priced items, before it becomes more gradual.

Overall, the graph illustrates the diversity in pricing among the different product categories. Categories like Fragrance, Skincare, and Makeup span a broad range that caters to various market segments, including premium options. In contrast, categories like Bath & Body, Mini Size, and Hair concentrate at lower price points. The separation between the CDF curves shows that price positioning differs by category, though the plot alone cannot show how effective that positioning is as a segmentation strategy.
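
The curves described above are empirical CDFs: sort the prices, then assign each one its rank divided by the number of prices. The sketch below shows that calculation and how to read a single point off the curve. The function names `ecdf` and `share_at_or_below` and the toy prices are assumptions for illustration, and no percentile trimming is applied.

```python
import numpy as np

def ecdf(values):
    """Return sorted values and the cumulative proportion at each one."""
    x = np.sort(np.asarray(values, dtype=float))
    y = np.arange(1, x.size + 1) / x.size
    return x, y

def share_at_or_below(values, price):
    """Proportion of values that are less than or equal to `price`."""
    x = np.sort(np.asarray(values, dtype=float))
    return np.searchsorted(x, price, side="right") / x.size

# Toy prices in USD (made up).
toy_prices = [10, 20, 20, 30, 50]

x, y = ecdf(toy_prices)
share_under_25 = share_at_or_below(toy_prices, 25)

print(x, y)
print(share_under_25)
```

Reading specific points this way (for example, the share of a category priced at or below a given amount) gives concrete numbers to back up statements such as "a significant proportion of products are more affordable".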

Conclusion

This analysis examined Sephora’s online catalog from five angles: brand assortment, product attributes, customer favorites, stock availability, and pricing. SEPHORA COLLECTION has by far the largest assortment, nearly double that of Clinique, while the remaining top brands, from Dior down, offer assortments of broadly similar size.

The pie chart shows that, among the four product flags examined, exclusive and online-only offerings are the most common, pointing to a retail approach built on exclusivity and convenience. The ‘loves count’ analysis highlights Skincare and Makeup as the categories with the highest consumer engagement and also the widest variation between products. The out-of-stock analysis shows that Makeup has the most out-of-stock products, which may signal high turnover, stock management opportunities, or simply the size of the category.

The pricing strategy, articulated through the CDF plots, has laid bare the pricing diversity within and across categories, illustrating how Sephora caters to a broad spectrum of consumer spending capabilities. While some categories offer a gamut of price points, from budget to luxury, others maintain a focus on affordability.

Taken together, these results describe how product variety, exclusivity, and pricing are balanced across Sephora’s online assortment. They can inform decisions on product offerings, inventory, and customer engagement, with the caveat that the dataset is a single snapshot from March 2023 and contains no sales data.