
Enhancing Listing Integrity with AI-driven De-duplication at a Leading E-commerce Giant
Project Background
Fastwheel.ai was approached by a major multinational e-commerce platform to address a critical issue of duplicate product listings as they transitioned to a marketplace model. The presence of duplicate listings, resulting from varied supplier practices and motivations, was complicating the shopping experience and affecting the platform's operational efficiency.
Objectives
Automate Duplicate Detection: Reduce reliance on manual processes for identifying duplicate listings, streamlining operations and improving listing accuracy.
Scalable and Efficient System: Ensure the system scales with the continuous addition of new listings and categories without incurring significant computing costs.
Minimize Human Intervention: Develop a model that automatically adapts to new categories without the need for retraining or significant human oversight.
Solution Design
1. Dataset Construction and Annotation
Objective: Create a robust training and validation dataset, known as the "golden dataset," consisting of pairs of product descriptions annotated with a similarity level (high, medium, or low).
Methodology:
- Utilize stratified sampling to extract representative pairs of SKU descriptions from the platform's existing data.
- Engage the platform's data science team as expert annotators so that labels are accurate, consistent, and reflect the nuances of the e-commerce inventory.
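The sampling step can be illustrated with a minimal sketch. It assumes each SKU is available as a (sku_id, category, description) tuple and that category is the stratification key; the function name, the tuple layout, and the per-category quota are hypothetical and are not taken from the project itself.

```python
import random
from collections import defaultdict


def stratified_pair_sample(skus, pairs_per_category, seed=0):
    """Sample SKU description pairs so that every category is represented.

    `skus` is a list of (sku_id, category, description) tuples. The tuple
    layout and the fixed per-category quota are illustrative assumptions.
    """
    rng = random.Random(seed)

    # Group SKUs by category (the stratum).
    by_category = defaultdict(list)
    for sku_id, category, description in skus:
        by_category[category].append((sku_id, description))

    sampled = []
    for category in sorted(by_category):
        items = by_category[category]
        # Enumerate all unordered pairs within the category. This is fine for
        # a sketch; a large catalogue would need a cheaper candidate generator.
        candidates = [(a, b) for i, a in enumerate(items) for b in items[i + 1:]]
        k = min(pairs_per_category, len(candidates))
        for left, right in rng.sample(candidates, k):
            # `label` is left empty for the annotators to fill in.
            sampled.append(
                {"category": category, "left": left, "right": right, "label": None}
            )
    return sampled
```

Each returned record carries an empty label field, which is where an annotator would record the high, medium, or low similarity level.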
2. Data Augmentation and Preprocessing
Augmentation: Employ synthetic data generation techniques, such as GANs, to enhance the diversity and volume of the training set, aiming to improve the model's ability to generalize across different product categories.
Preprocessing: Standardize the product descriptions using text normalization techniques, then vectorize them to prepare the data for effective machine learning processing.
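As an illustration of the preprocessing step, the sketch below normalizes a description and turns it into a simple token-count vector. The specific normalization rules (Unicode folding, lowercasing, punctuation removal) and the bag-of-words representation are assumptions chosen for clarity; the case study does not state which rules or vectorizer were used.

```python
import re
import unicodedata
from collections import Counter


def normalize(text):
    """Return a lowercased, punctuation-free, single-spaced description."""
    # Fold compatibility characters (e.g. full-width letters) and lowercase.
    text = unicodedata.normalize("NFKC", text).lower()
    # Replace anything that is not a word character or whitespace with a space.
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", text).strip()


def vectorize(text):
    """Map a description to a sparse token-count vector (bag of words)."""
    return Counter(normalize(text).split())
```

For example, normalize("  Apple iPhone-13,  128GB (Blue)! ") yields "apple iphone 13 128gb blue", so listings that differ only in punctuation or casing map to the same vector.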
3. Model Development and Iteration
Baseline Assessment: Evaluate existing systems and explore various embedding techniques (e.g., BERT, GloVe) and distance metrics to establish a performance baseline.
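A baseline of this kind can be sketched as an embedding function plus a distance metric. In the sketch below, a token-count vector stands in for a real embedding model such as BERT or GloVe, cosine similarity is the metric, and the thresholds that map a score to the high, medium, and low levels are hypothetical values that would need tuning on the golden dataset.

```python
import math
from collections import Counter


def embed(text):
    """Placeholder embedding: token counts. A real baseline would call an
    embedding model here; this stand-in keeps the sketch self-contained."""
    return Counter(text.lower().split())


def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(u[token] * v[token] for token in u.keys() & v.keys())
    norm_u = math.sqrt(sum(count * count for count in u.values()))
    norm_v = math.sqrt(sum(count * count for count in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_level(left, right, high=0.8, medium=0.5):
    """Bucket a pair into high / medium / low; thresholds are assumptions."""
    score = cosine_similarity(embed(left), embed(right))
    if score >= high:
        return "high"
    if score >= medium:
        return "medium"
    return "low"
```

Swapping embed for a different model or cosine_similarity for another metric leaves the rest of the comparison unchanged, which makes it straightforward to compare candidate baselines on the same annotated pairs.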
Advanced Model Training: If initial models underperform, proceed to fine-tune transformer-based architectures using the golden dataset to optimize detection accuracy.
Innovative Techniques: Implement chain-of-thought prompting combined with Retrieval-Augmented Generation (RAG) to enhance the model's capacity to discern subtle similarities and discrepancies among listings.
4. Operational Implementation
Integration: Deploy the AI model into the platform's existing infrastructure, allowing for real-time duplicate detection and reporting.
User Interface: Develop an intuitive interface for category managers to review potential duplicates and oversee the automated processes, ensuring transparency and control over the automated decisions.
Performance Evaluation
Metrics: Use precision, recall, and F1-score to assess the accuracy of the duplicate detection system. During fine-tuning, use a contrastive loss as the training objective to sharpen the model's ability to distinguish nuanced differences.
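For reference, the quantities named above can be written out directly. The sketch below computes precision, recall, and F1 from binary duplicate flags, and shows one common pairwise form of contrastive loss; the margin value and the reduction of similarity levels to a binary duplicate flag are assumptions made for illustration.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for binary duplicate / not-duplicate flags."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for truth, pred in pairs if truth and pred)
    fp = sum(1 for truth, pred in pairs if not truth and pred)
    fn = sum(1 for truth, pred in pairs if truth and not pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def contrastive_loss(distance, is_duplicate, margin=1.0):
    """Pairwise contrastive loss for a single pair of embeddings.

    Duplicates are penalized by their squared distance; non-duplicates are
    penalized only when they sit closer than `margin` (an assumed value).
    """
    if is_duplicate:
        return 0.5 * distance ** 2
    return 0.5 * max(0.0, margin - distance) ** 2
```

Precision reflects how many flagged pairs are true duplicates, while recall reflects how many true duplicates were flagged; F1 balances the two when both false alarms and missed duplicates carry a cost.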
Iterative Refinement: Establish a continuous feedback loop with active learning mechanisms to incrementally refine the model based on real-world performance and user feedback.
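One simple way to drive such a feedback loop is uncertainty sampling: the pairs whose predicted duplicate probability sits closest to the decision threshold are sent to category managers first. The sketch below assumes the model exposes a probability per pair; the function name, the threshold, and the review budget are hypothetical.

```python
def select_for_review(scored_pairs, budget, threshold=0.5):
    """Pick the `budget` pairs the model is least certain about.

    `scored_pairs` is a list of (pair_id, duplicate_probability) tuples.
    Uncertainty is measured as distance from the decision threshold.
    """
    ranked = sorted(scored_pairs, key=lambda item: abs(item[1] - threshold))
    return [pair_id for pair_id, _ in ranked[:budget]]
```

The labels collected for the selected pairs can then be added to the golden dataset before the next round of fine-tuning.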
Results and Impact
- Efficiency Gains: Significant reduction in manual labor required for listing verification, allowing category managers to focus on strategic initiatives.
- Cost Reduction: The system's scalability and efficiency kept computing costs below the projected increase, even as the number of listings and categories grew.
- Enhanced User Experience: Improved accuracy of product listings enhanced the shopping experience, reducing customer confusion and support queries related to duplicate listings.
Conclusion
The AI-driven de-duplication system implemented at this major e-commerce platform showcases Fastwheel.ai's ability to deliver AI solutions tailored to the complex challenges of modern e-commerce. The project improved operational efficiency, reduced costs, and enhanced the overall customer experience by ensuring listing integrity on the platform.
