Big ideas from big data
The Dahu EDGE platform is a set of technologies for the assimilation, aggregation and enrichment of content from a variety of structured and unstructured sources, including, crucially, automatic deep web crawls of unrelated web sites. We developed our own toolset to automate the extraction and modelling of complex information from largely unstructured, diverse web data. The distilled, high-value content is aggregated and further enriched using domain-specific subject ontologies and expert databases, and finally emitted in a format that can easily be ingested by standard search platforms. Automation of the whole process makes it cost-effective for our customers to analyse, and potentially monetise, large aggregated datasets. Possible revenue models include subscription, advertising, classifieds, lead generation and data cleansing, to name just a few.
We have chosen to build EDGE using widely-used open source technologies to complement the proprietary components that have been developed in-house by Dahu's leading search experts. Flexibility is core to our approach, and EDGE is search-engine agnostic by design. We have integrations with Apache Lucene/Solr, Elasticsearch, Exalead CloudView and MarkLogic, and we can quickly modify the formats to feed into other leading engines.
Starting with the ability to mine content from different web sites, EDGE includes a new, self-controlled web crawler. Rather than spending all of its time randomly jumping from site to site looking for the good stuff, the EDGE crawler is a goal-driven tool that initialises itself from a starting set of web sites but is able to jump to unrelated sites that it finds to have interesting content.
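The goal-driven scheduling described above can be sketched as a priority-queue frontier that ranks candidate URLs by relevance to the crawl goal. This is a minimal illustration only: the keyword-based scoring function and the example URLs are assumptions for the sketch, not EDGE's actual heuristics.

```python
import heapq

def relevance(url, keywords):
    """Score a URL by how many goal keywords appear in it (toy heuristic)."""
    return sum(1 for kw in keywords if kw in url.lower())

class Frontier:
    """Priority queue of URLs: the most relevant URL is crawled first."""
    def __init__(self, keywords):
        self.keywords = keywords
        self.heap = []          # entries: (negated score, insertion order, url)
        self.seen = set()       # avoids re-queueing a URL twice
        self.counter = 0

    def add(self, url):
        if url in self.seen:
            return
        self.seen.add(url)
        score = relevance(url, self.keywords)
        heapq.heappush(self.heap, (-score, self.counter, url))
        self.counter += 1

    def next_url(self):
        return heapq.heappop(self.heap)[2] if self.heap else None

# Seed the frontier; note the highest-scoring URL is on an unrelated site.
frontier = Frontier(keywords=["property", "listing"])
for seed in ["https://example.com/about",
             "https://other-site.example/property/listings",
             "https://example.com/property/42"]:
    frontier.add(seed)

print(frontier.next_url())  # the listings URL ranks highest
```

Because relevance, not site boundaries, drives the ordering, a link onto an unrelated site that scores well is fetched before low-value pages on the seed sites.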
On any given web site, the crawler will encounter a range of different types of web page: landing pages, site maps, help pages, product listings, product details pages, company details pages and so on. To help cut down the amount of work it needs to do, EDGE classifies these into general types, so it can decide which ones it can safely ignore.
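A page-type classification step of this kind might look like the following. The rules here (URL substrings, a count of linked list items) are illustrative assumptions standing in for EDGE's actual model; the point is only that a coarse type is assigned first, and low-value types are skipped before any extraction work is done.

```python
import re

# Page types the crawler can safely skip (an assumption for this sketch).
IGNORE_TYPES = {"help", "sitemap"}

def classify(url, html):
    """Return a coarse page type from simple URL and markup hints."""
    path = url.lower()
    if "sitemap" in path:
        return "sitemap"
    if "/help" in path or "/faq" in path:
        return "help"
    # Many <li> items that each contain a link usually indicate a listing page.
    if len(re.findall(r"<li[ >].*?<a ", html, re.S)) >= 5:
        return "listing"
    return "details"

def should_crawl(url, html):
    """Only spend extraction effort on page types we care about."""
    return classify(url, html) not in IGNORE_TYPES

print(classify("https://shop.example/sitemap.xml", ""))  # prints "sitemap"
```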
One standard model for EDGE is to crawl and seek listings of similar items, for example, houses or holiday listings. Most web sites present listing pages, each containing multiple items in an HTML list, which link through to product details pages that show fuller information for a given product. EDGE handles these two cases very differently.
For product listing pages, EDGE looks to find one main list among the random background noise of a page, noise that may include several other lists on the same page. It can do this using hints from the HTML page structure, or from semantic clues coming from the text on the page, or indeed from graphical inferences, for example where non-contiguous elements are displayed vertically aligned, perhaps sharing common visual clues such as font size and colour.
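The structural route can be sketched with the standard library alone: group sibling elements by their position, tag and class, and treat the largest such group as the page's main list. The heuristic (largest group of same-signature siblings wins) is an assumption for the sketch, not EDGE's actual algorithm, which the text notes also uses semantic and graphical evidence.

```python
from collections import Counter
from html.parser import HTMLParser

class SiblingCounter(HTMLParser):
    """Count groups of elements sharing (parent path, tag, class)."""
    def __init__(self):
        super().__init__()
        self.stack = []           # path of currently open tags
        self.groups = Counter()   # (parent path, tag, class) -> count

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class", "")
        self.groups[(tuple(self.stack), tag, cls)] += 1
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

def main_list_signature(html):
    """Return (tag, class, count) of the biggest group of matching siblings."""
    parser = SiblingCounter()
    parser.feed(html)
    (path, tag, cls), count = parser.groups.most_common(1)[0]
    return tag, cls, count

# A navigation list of 2 items vs. a results list of 8: the results win.
html = ('<div><ul class="nav"><li>a</li><li>b</li></ul>'
        '<ul class="results">' + '<li class="item">x</li>' * 8 + '</ul></div>')
print(main_list_signature(html))  # prints ('li', 'item', 8)
```

In practice this structural signal would be combined with the textual and visual clues mentioned above, since the dominant repeated structure is not always the list of interest.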
On a product details page, EDGE uses a variety of approaches for entity extraction and model validation to build up a complex data record from the information on the page. The data record is potentially further enriched by reference to domain-specific ontologies and databases, before being written into a standard database. In our case, we use the highly scalable, flexible NoSQL database MongoDB. The database of extracted, enriched content has to be updated and maintained over time, allowing for changes both in the existing data record sets and in the data landscape as web sites are updated or removed, or new ones launched. We also take into account data-cleansing tasks such as de-duplication and auto-expiration.
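The enrichment and de-duplication steps can be sketched as follows. The field names, the toy ontology and the choice of natural key are all illustrative assumptions; an in-memory dict stands in for the MongoDB collection, with the equivalent pymongo call shown in a comment.

```python
import datetime
import hashlib

# Toy domain ontology mapping raw site vocabulary to canonical terms
# (illustrative only, not EDGE's actual ontologies).
ONTOLOGY = {"flat": "apartment", "semi": "semi-detached house"}

def build_record(raw):
    """Normalise an extracted record and enrich it from the ontology."""
    record = dict(raw)
    record["property_type"] = ONTOLOGY.get(raw.get("property_type"),
                                           raw.get("property_type"))
    record["last_seen"] = datetime.datetime.now(datetime.timezone.utc).isoformat()
    # De-duplication key: a hash of the fields identifying the same item
    # across sites (a design choice assumed for this sketch).
    natural_key = f"{raw.get('address', '')}|{raw.get('postcode', '')}".lower()
    record["_id"] = hashlib.sha256(natural_key.encode()).hexdigest()
    return record

store = {}  # stand-in for a MongoDB collection

def upsert(record):
    """Insert or replace by _id; with pymongo this would be
    collection.replace_one({'_id': record['_id']}, record, upsert=True)."""
    store[record["_id"]] = record

upsert(build_record({"address": "1 High St", "postcode": "AB1 2CD",
                     "property_type": "flat"}))
upsert(build_record({"address": "1 High St", "postcode": "AB1 2CD",
                     "property_type": "flat"}))
print(len(store))  # prints 1: duplicates collapse to one record
```

Re-crawling a page simply refreshes the existing record (updating `last_seen`), which also gives a natural hook for auto-expiration: records not seen for some period can be aged out.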
By automating the vast majority of these tasks, EDGE makes it cost-effective to aggregate very large volumes of data across large numbers of web sites and other data repositories. By integrating the data and managing the platform, Dahu EDGE empowers our customers to focus on leveraging the huge potential value to be found just below the surface of the web.