SPS_Data607_Discussion11_DC

Author

David Chen

Analyze an existing recommender system

You should:

Perform a Scenario Design analysis as described below. Consider whether it makes sense for your selected recommender system to perform scenario design twice, once for the organization (e.g. Amazon.com) and once for the organization’s customers.
Attempt to reverse engineer what you can about the site, from the site interface and any available information that you can find on the Internet or elsewhere.
Include specific recommendations about how to improve the site’s recommendation capabilities going forward.
Create your report using a Quarto (.qmd) file, and create a discussion thread with a link to the GitHub repo where your Markdown file notebook resides.

Amazon.com Key Challenges in Recommendations

According to the industry report, e-commerce recommendation algorithms must operate under the following difficult conditions:

Massive Data Volume: Retailers may have tens of millions of customers and millions of distinct catalog items.
Real-time Requirements: Systems must return high-quality results in no more than half a second (500ms).
Data Volatility: The algorithm must respond immediately to new information, as every interaction provides valuable data.
Information Imbalance:
- New Customers: Often have extremely limited information, sometimes based on just a few purchases or ratings.
- Older Customers: Can have a “glut” of information, with thousands of purchases and ratings that must be processed.

Three question framework

Who are the target users?

The target users are all customers who visit the site, ranging from brand-new visitors to long-time “power users”. The report emphasizes that the system must work for “new customers” with extremely limited data as well as “older customers” who have a history of thousands of purchases.

What are their goals?

The goal for these users is to discover relevant and interesting items within a massive catalog of millions of products. Users need to find these items quickly: the system aims to return high-quality recommendations in no more than half a second. The user’s intent is to navigate the “huge amounts of data” to find products that match their specific interests.

What is the solution?

The solution is the item-to-item collaborative filtering algorithm. Instead of the computationally expensive task of searching for similar customers, the solution:

Matches each of the user’s purchased and rated items to similar items using a pre-computed table.
Integrates these recommendations directly into high-traffic areas like the homepage and the shopping cart.
Provides a “targeted marketing tool” that reacts immediately to a user’s changing data to show them exactly what they are looking for.

Limitation of common approaches

Traditional collaborative filtering faces severe performance and scaling issues in large systems because generating a recommendation typically requires scanning a large share of the customer base at request time. The algorithm’s complexity is \(O(MN)\) in the worst case, where \(M\) is the number of customers and \(N\) is the number of items; because most customers’ purchase vectors are sparse, it is closer to \(O(M+N)\) in practice, which is still too slow for very large data sets. Retailers can reduce the data size by randomly sampling customers, discarding customers with few purchases, or discarding very popular or unpopular items, but these methods degrade recommendation quality. For example, item-space partitioning restricts suggestions to a specific product category, while discarding low-frequency items means they never appear as recommendations, so customers cannot discover niche products they might enjoy.
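
To make the scaling problem concrete, here is a minimal Python sketch of user-based collaborative filtering. The customer names, items, and the `recommend_user_based` helper are hypothetical, and cosine similarity over binary purchase sets is an assumed choice rather than a detail confirmed for any production system. The point is that the neighbor search loops over every customer at request time.

```python
import math

# Hypothetical toy data: customer -> set of purchased item IDs.
purchases = {
    "alice": {"book", "lamp", "pen"},
    "bob": {"book", "pen", "desk"},
    "carol": {"lamp", "rug"},
    "dave": {"desk", "chair"},
}

def cosine(a, b):
    """Cosine similarity between two binary purchase vectors stored as sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

def recommend_user_based(target, purchases, k=2):
    """Scan every other customer online, then score items from the k nearest."""
    mine = purchases[target]
    # This comparison touches all M customers at request time,
    # which is the scaling bottleneck described above.
    neighbors = sorted(
        ((cosine(mine, items), other)
         for other, items in purchases.items() if other != target),
        reverse=True,
    )[:k]
    scores = {}
    for sim, other in neighbors:
        for item in purchases[other] - mine:
            scores[item] = scores.get(item, 0.0) + sim
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda i: (-scores[i], i))

print(recommend_user_based("alice", purchases))
```

With four customers the scan is trivial, but the same loop over tens of millions of customers would have to finish within the half-second budget.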

Cluster models attempt to solve the scaling problem by dividing the customer base into segments and treating recommendations as a classification problem. This approach has better online performance because it compares a user to a controlled number of segments rather than the entire customer base. However, the quality of these recommendations is relatively poor. Because cluster models group numerous customers together and consider everyone in that segment similar, the resulting suggestions are less relevant to an individual user’s specific interests. Increasing the number of segments to improve accuracy eventually makes the online classification process nearly as expensive as traditional filtering.
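
The cluster approach can be sketched as follows. This is an illustration under stated assumptions: the `segments` dictionary is hypothetical and stands in for the output of an offline clustering step, and each segment is summarized by the fraction of its members who bought each item.

```python
import math

# Hypothetical segments, assumed to be produced offline by a clustering step.
# Each segment maps item -> fraction of its members who bought that item.
segments = {
    "home_office": {"desk": 0.8, "chair": 0.7, "lamp": 0.5},
    "readers": {"book": 0.9, "pen": 0.4, "lamp": 0.3},
}

def similarity(user_items, centroid):
    """Cosine similarity between a binary user vector and a segment centroid."""
    dot = sum(centroid.get(item, 0.0) for item in user_items)
    norm = math.sqrt(len(user_items)) * math.sqrt(
        sum(w * w for w in centroid.values())
    )
    return dot / norm if norm else 0.0

def recommend_from_segment(user_items, segments):
    """Classify the user into one segment, then suggest that segment's items."""
    # Only len(segments) comparisons are needed online, not one per customer.
    best = max(segments, key=lambda name: similarity(user_items, segments[name]))
    centroid = segments[best]
    ranked = sorted(centroid, key=lambda i: (-centroid[i], i))
    return best, [item for item in ranked if item not in user_items]

print(recommend_from_segment({"book"}, segments))
```

Every user assigned to `readers` receives the same ranked list, which illustrates why the suggestions are less tailored to any one individual.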

Search-based methods, or content-based methods, treat the problem as a search for related items based on keywords, authors, or subjects. These systems scale and perform well if a user has very few purchases, but they become impractical for older customers who have thousands of ratings. In such cases, the algorithm must use only a subset of the data, which reduces the quality of the results. Furthermore, search-based recommendations often fail to help customers discover new interests because they provide results that are either too broad, such as general best-seller lists, or too narrow, such as only listing other books by an author the user has already read.
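
A minimal sketch of the search-based idea, assuming a hypothetical `catalog` that maps each item to a few attribute keywords (author, subject). The query is simply the union of keywords from the user’s purchases.

```python
# Hypothetical catalog: item -> attribute keywords (author, subject).
catalog = {
    "dune": {"herbert", "scifi"},
    "dune_messiah": {"herbert", "scifi"},
    "neuromancer": {"gibson", "scifi"},
    "cookbook": {"child", "cooking"},
}

def recommend_by_search(purchased, catalog):
    """Build a keyword query from purchased items and rank others by overlap."""
    query = set()
    for item in purchased:
        query |= catalog[item]
    # Keep only unpurchased items that share at least one keyword with the query.
    scores = {
        item: len(attrs & query)
        for item, attrs in catalog.items()
        if item not in purchased and attrs & query
    }
    return sorted(scores, key=lambda i: (-scores[i], i))

print(recommend_by_search({"dune"}, catalog))
```

The results stay inside the same author and genre, which is the “too narrow” behavior described above. With thousands of purchases the query would grow to cover most of the catalog and become “too broad”.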

Amazon’s solution: Item-to-Item Collaborative Filtering

(Reverse Engineering)

To address these issues, Amazon developed item-to-item collaborative filtering.

The key idea in this approach is moving the most expensive calculations offline:

Offline Computation: The system builds a “similar-items” table by finding products that customers tend to purchase together.

Build Item Similarity Table (Offline)

- For each item, find all customers who bought it.
- Look at the other items those customers bought.
- Count co-occurrences.
- Compute a similarity score for each item pair (e.g., cosine similarity).
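
The offline steps can be sketched in Python as follows. This is a minimal illustration, not Amazon’s actual implementation: the purchase data and the `build_similarity_table` helper are hypothetical. The loop runs over customers rather than items, which produces the same co-occurrence counts, and similarity is the cosine between binary purchase vectors.

```python
import math
from collections import defaultdict
from itertools import combinations

# Hypothetical toy data: customer -> set of purchased item IDs.
purchases = {
    "alice": {"book", "pen"},
    "bob": {"book", "pen", "desk"},
    "carol": {"desk", "chair"},
}

def build_similarity_table(purchases):
    """Return item -> {other_item: cosine similarity} from co-purchase counts."""
    buyers = defaultdict(int)     # item -> number of customers who bought it
    co_counts = defaultdict(int)  # (item_a, item_b) -> customers who bought both
    for items in purchases.values():
        for item in items:
            buyers[item] += 1
        # Sorting keeps each unordered pair under a single key.
        for a, b in combinations(sorted(items), 2):
            co_counts[(a, b)] += 1
    table = defaultdict(dict)
    for (a, b), both in co_counts.items():
        # Cosine similarity for binary vectors: |A and B| / sqrt(|A| * |B|).
        sim = both / math.sqrt(buyers[a] * buyers[b])
        table[a][b] = sim
        table[b][a] = sim
    return table

similar_items = build_similarity_table(purchases)
print(similar_items["book"])
```

Items that were never bought together, such as `book` and `chair` here, get no entry in the table, which keeps it sparse.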

Online Scalability: Because the heavy lifting is done beforehand, the online component only needs to look up similar items for the user’s specific purchases.

Generate Recommendations (Online)

- For a user, take the items they purchased.
- Look up similar items in the precomputed table.
- Aggregate and rank the candidates.
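
A minimal sketch of the online step, assuming a small hypothetical `similar_items` table has already been computed offline. Summing similarity scores across the user’s purchases is one plausible way to aggregate and is an assumption here, not a confirmed detail of Amazon’s system.

```python
# Hypothetical precomputed table: item -> {similar_item: similarity score}.
similar_items = {
    "book": {"pen": 1.0, "desk": 0.5},
    "pen": {"book": 1.0, "desk": 0.5},
    "desk": {"book": 0.5, "pen": 0.5, "chair": 0.7},
    "chair": {"desk": 0.7},
}

def recommend(purchased, similar_items, top_n=3):
    """Look up neighbors of each purchased item, sum scores, and rank."""
    scores = {}
    for item in purchased:
        for other, sim in similar_items.get(item, {}).items():
            if other not in purchased:  # do not recommend what the user owns
                scores[other] = scores.get(other, 0.0) + sim
    # Highest score first; ties broken alphabetically for a stable order.
    ranked = sorted(scores, key=lambda i: (-scores[i], i))
    return ranked[:top_n]

print(recommend({"book", "desk"}, similar_items))
```

The work done here depends only on how many items the user has purchased and how many neighbors each item has, not on the number of customers.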

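Independence: Because the online step only looks up precomputed similar items, response time scales independently of the total number of customers and the size of the product catalog; it depends only on how many items the user has purchased or rated.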