The increasing number of entities created online raises the problem of integrating and relating entities from different sources. In this work, we focus on product entities. In E-commerce, product data pose several important problems: entity resolution, the difficult problem of matching products from one domain (e.g., Amazon) to another (e.g., eBay); link prediction and personalized recommendation (which product should be targeted to which user); and product canonicalization (clustering products and materializing a prototype for each cluster and sub-cluster).
Tackling product databases for tasks like link prediction and entity resolution presents several challenges. The first, which mostly arises when data are scraped from external sources like schema.org markup on webpages, is poor data quality: inconsistent spellings, missing values, and ambiguity. This makes traditional pairwise distance measures less effective, since both content and context are noisy. The second challenge is the one-to-many and many-to-many relations between entities. For instance, in product entity resolution, a product may be associated with many prices (regular or discounted), and each manufacturer is associated with many products. This heterogeneity of relationships poses an additional challenge for collective approaches such as Probabilistic Soft Logic: determining which kind of relationship is best suited for resolving a particular type of entity.
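To illustrate why noisy content strains pairwise distance measures, here is a minimal sketch using Python's standard-library `difflib`. The product titles are hypothetical; the point is that surface-level string similarity degrades gracefully but unpredictably under inconsistent spellings and formatting.

```python
from difflib import SequenceMatcher

def title_similarity(a: str, b: str) -> float:
    """Normalized pairwise similarity between two product titles (0 to 1)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Two listings for the same product, with noisy spelling and formatting,
# plus an unrelated product (all titles are made up for illustration).
clean = "Apple iPhone 12 64GB Black"
noisy = "apple i-phone12 64 gb blk"
unrelated = "Samsung Galaxy S21 128GB"

print(title_similarity(clean, noisy))      # higher, despite the noise
print(title_similarity(clean, unrelated))  # lower
```

A pure string measure still ranks the matching pair above the non-matching one here, but the margin shrinks as noise grows, which is why collective approaches that also exploit relational context become attractive.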
Historically, these different tasks (such as entity resolution and recommendation) were tackled piecemeal by researchers from different communities, with little sharing of algorithmic methodology between them (notwithstanding the common use of machine learning and statistical algorithms). With the advent of neural networks, however, including not only deep nets but also shallower architectures like ‘skip-gram’, representation learning has emerged as a powerful unifying methodology within machine learning, one that takes the guesswork out of important steps like feature engineering. Many (though not all) representation learning algorithms work in an unsupervised fashion on big datasets, taking the dataset itself as input (e.g., a corpus of sentences, or a large social network or graph, which is the closest analog to the problem we are tackling) and returning continuous, low-dimensional vectors for elements in the dataset.
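The graph case can be sketched with standard-library Python alone. The snippet below runs DeepWalk-style random walks over a hypothetical toy product graph and collects windowed co-occurrence counts, which are exactly the statistics a skip-gram model would be trained on; a real system would then compress them into low-dimensional vectors (e.g., via skip-gram training or matrix factorization). The graph, node names, and parameters are illustrative assumptions, not part of the project's actual pipeline.

```python
import math
import random
from collections import defaultdict

# Hypothetical toy graph: products linked to shared brand/category nodes.
edges = [
    ("iphone_12", "apple"), ("iphone_13", "apple"),
    ("iphone_12", "phones"), ("iphone_13", "phones"),
    ("galaxy_s21", "samsung"), ("galaxy_s21", "phones"),
    ("macbook_air", "apple"), ("macbook_air", "laptops"),
]
graph = defaultdict(list)
for a, b in edges:
    graph[a].append(b)
    graph[b].append(a)

def random_walks(graph, num_walks=200, walk_len=6, seed=0):
    """DeepWalk-style uniform random walks starting from every node."""
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        for start in list(graph):
            node, walk = start, [start]
            for _ in range(walk_len - 1):
                node = rng.choice(graph[node])
                walk.append(node)
            walks.append(walk)
    return walks

def cooccurrence_vectors(walks, window=2):
    """Windowed co-occurrence counts: the raw signal skip-gram learns from."""
    vecs = defaultdict(lambda: defaultdict(float))
    for walk in walks:
        for i, u in enumerate(walk):
            for j in range(max(0, i - window), min(len(walk), i + window + 1)):
                if i != j:
                    vecs[u][walk[j]] += 1.0
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v.get(k, 0.0) for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv)

vecs = cooccurrence_vectors(random_walks(graph))
# Products sharing brand and category end up with similar context profiles.
print(cosine(vecs["iphone_12"], vecs["iphone_13"]))
print(cosine(vecs["iphone_12"], vecs["galaxy_s21"]))
```

The two products that share both a brand and a category node score higher than the pair sharing only a category, which is the intuition behind using learned graph representations for entity resolution and recommendation.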
The E-Commerce project explores how best to learn such representations on product and E-commerce graphs in order to support tasks like entity resolution, product categorization, and recommendation. Beyond algorithmic research, the project also focuses on gathering resources to support open innovation in E-commerce knowledge graphs. We anticipate releasing a range of resources, including datasets compiled and cleaned from sources like the Web Common Crawl, ground-truth datasets built on public sources, and open-source codebases on GitHub exposing graph-modeling and representation-learning best practices for product and E-commerce graphs.