Data Granularity and the Web
Work continues on our new engine, EDGE (the Enhanced Data Granularity Engine), so we thought it might be a good idea to spend a little time explaining what we mean by data granularity.
Pretty much everyone is aware of “Big Data” and the challenges it brings. Exploding data volumes, and an inability to address that data in a meaningful (or affordable) way, mean that people are now looking to other technologies for their big data needs. A classic example that highlights the problem is smart-metering. Whereas in the old days someone might visit your property and take a reading of your electricity meter once a quarter (or simply ask you to do it yourself), smart meters can take a reading at whatever frequency the company chooses.
One example we looked at some time back took readings every 30 minutes. These readings are gathered automatically and stored centrally and, clearly, allow the business to monitor its operations in a much more granular way. Except, of course, they don’t. Think about it: moving from a quarterly reading to a half-hourly one is roughly a four-thousand-fold increase in data from every property. Despite the processing, storage and access headaches, the data is of huge potential value when viewed as a whole. A single 30-minute reading may be of minuscule value on its own, but taken together the readings reveal real insight into real-time electricity usage, allowing providers to create packages aimed at specific consumer types and giving them a genuine tool for fighting fraudulent use of electricity.
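For the record, the back-of-the-envelope arithmetic behind that four-thousand-fold figure is simple enough to sketch:

```python
# Rough arithmetic behind the "four-thousand-fold" claim.
readings_per_day = 24 * 2                      # one reading every 30 minutes
half_hourly_per_year = readings_per_day * 365  # 17,520 readings a year
quarterly_per_year = 4                         # one manual reading per quarter

increase = half_hourly_per_year / quarterly_per_year
print(f"{half_hourly_per_year} vs {quarterly_per_year} readings a year: "
      f"a {increase:,.0f}x increase per property")   # a 4,380x increase
```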
Now, the cost of storing this additional data in a traditional RDBMS doesn’t fall per item when you scale up massively like this, even given some extreme licensing curves from your favourite RDBMS vendor. At these kinds of volumes the systems don’t necessarily behave linearly, and the conflict between secure transaction processing and information retrieval exacerbates that non-linearity. There are a number of alternative approaches, including non-relational databases (typified by the NoSQL movement) and Search-Based Applications, which use search technologies to handle the retrieval operations.
In our smart-meter example, what often happens is that the data is collected and stored but never really made available in any meaningful way. In effect, the data remains coarse-grained: the system itself is fine-grained (it holds a lot of detail), but the cost of actually utilising that fine-grained detail is too high for most businesses to bear. Put another way, for the data to be usable the cost-per-item needs to decrease massively as the volumes increase.
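The flip side of the earlier arithmetic makes the point: if the spend per property stays flat, the affordable cost-per-reading has to fall by the same factor that the volume grew. The figures below are illustrative only:

```python
# If the annual data budget per property is fixed, cost-per-reading must
# shrink in proportion to the growth in readings. Figures are illustrative.
budget = 1.0                 # arbitrary fixed annual spend per property
old_cost = budget / 4        # four quarterly readings
new_cost = budget / 17_520   # half-hourly readings, as computed above
print(f"cost per reading must fall {old_cost / new_cost:,.0f}x")  # 4,380x
```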
While we have experience in using search technologies to tackle problems like this, Dahu’s own specific area of interest, and the focus of our products and services, is mining data from the web. Many businesses want to make use of content from the web, data from social media and inferences drawn from the two, and there is a wealth of freely available content out there if only you can get at it. Much of it exists at a sub web-page level, meaning that a single page may contain many distinct pieces of information or data.
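To make “sub web-page level” concrete, the toy sketch below pulls three distinct data items out of a single page using only the Python standard library. The markup is invented for the example; real pages are, of course, far messier:

```python
# One page, many items: a toy illustration of sub web-page granularity.
from html.parser import HTMLParser

PAGE = """
<ul>
  <li class="item" data-name="Widget A" data-price="9.99"></li>
  <li class="item" data-name="Widget B" data-price="4.50"></li>
  <li class="item" data-name="Widget C" data-price="12.00"></li>
</ul>
"""

class ItemExtractor(HTMLParser):
    """Collect every element on the page marked up as an item."""
    def __init__(self):
        super().__init__()
        self.items = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if attrs.get("class") == "item":
            self.items.append((attrs["data-name"], float(attrs["data-price"])))

extractor = ItemExtractor()
extractor.feed(PAGE)
print(extractor.items)  # three separate data items from a single web page
```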
Gathering data from a multitude of web-sites is perfectly possible using a wide variety of techniques, but up to now these techniques have been quite labour-intensive, requiring considerable set-up time and ongoing management. That is fine if you are trying to gather data from one or two web-sites and can keep on top of any changes to those sites, but if you need to gather content from thousands of sites, you are going to struggle to manage the process in a cost-effective way.
Dahu EDGE is a set of tools and services we are developing to automate this process. It allows us to focus on a specific kind of content across a large range of web-sites, automatically find the pages of interest, locate the items on each page and extract the pertinent attributes we need. The process is designed to be as automatic as possible: it does need to be taught a few things about the kind of content we are looking for, but importantly, it does not need to be taught anything about the particular web-sites where that content might be located.
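We are not going to publish EDGE’s internals here, but the overall shape of such a pipeline can be sketched. Everything below is hypothetical and illustrative, not a real Dahu API; the key point is that the content model is taught once and nothing downstream is site-specific:

```python
# A hypothetical sketch of an EDGE-style pipeline; all names are
# illustrative and the real crawling/extraction logic is elided.
from dataclasses import dataclass

@dataclass
class ContentModel:
    """What the system is taught: the kind of content, not the sites."""
    item_type: str     # e.g. "used car listing"
    attributes: list   # e.g. ["make", "model", "price"]

def find_pages_of_interest(site, model):
    """Crawl a site, keeping only pages likely to hold the content type."""
    return []          # real page classification elided

def extract_items(page, model):
    """Locate the items on a page and pull out the pertinent attributes."""
    return []          # real item detection and extraction elided

def harvest(sites, model):
    """One content model, many sites: nothing here is site-specific."""
    items = []
    for site in sites:
        for page in find_pages_of_interest(site, model):
            items.extend(extract_items(page, model))
    return items
```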
Using EDGE, we are able to extract millions of items from thousands of web-sites at a cost within the reach of most businesses and entrepreneurial start-ups in need of data.