Welcome to the Dahu blog



Using libcurl on OSX

Getting started with libcurl

As part of the ongoing development of our Edge platform, our web crawler needs to repeatedly download web pages on multiple threads. Oddly, when looking around, there are remarkably few packages and libraries for C++ to handle HTTP. HTTP is not overly complex, and it’s a simple enough matter to use a framework like BOOST::ASIO and code it yourself, but really, we just need a simple class that will get a URL for us.

After an initial sweep, we settled on two main contenders: cpp-netlib and libcurl. They represent very different approaches to the task. cpp-netlib is/was being developed as a potential candidate for inclusion in BOOST and uses a nice modern template metaprogramming model.

I had a good play with cpp-netlib and got it doing most of what I wanted. It requires BOOST which is not a big concern for us as we are using a fair amount of the BOOST libraries anyway, although if you are not, you will need to get BOOST installed (easy enough using macports on OSX or indeed with a download and build). It also has its own build procedure that is somewhat at odds with the traditional BOOST build methods for BOOST candidate libraries.

In the end, for me, cpp-netlib was missing one or two specific features that we need and was perhaps a little too new and evolving. It’s also designed as an abstraction to support many network protocols, not just HTTP, which perhaps makes it a little complex for those who simply want to download some web content. The team is very responsive and it’s very much a live project right now. It’s certainly something we will be keeping an eye on and re-visiting from time to time. If you need an HTTP server, it’s going to be a fine choice. If you just want to get some content down, others might be easier.

Libcurl is well-established (for well-established, read ‘old as the hills’). There is a ton of information out there and the code is nice and stable, supported with bindings for every possible language and also actively being developed. It underlies the ‘curl’ tool that is available on most systems (it’s installed by default on OSX). One issue is that for C++ we are talking about an old-school ‘C’ style API. It uses a model of callbacks to gather the content and header information that doesn’t sit too nicely with a class model in C++ – it’s certainly at the other end of the modernity spectrum compared with cpp-netlib.

While there is a wealth of info out there on libcurl, there is not much concise getting-started information that covers installing it and building your first app to do the most common thing – get a URL’s content and headers in a nice, simple class. The typical examples will download the content and/or headers and dump them on stdout or write them to files. What we want to do is get the content programmatically so we can do stuff with it in our programs. For the rest of this post, we will explain how to do just that.

Getting started with libcurl on OSX

I’m going to explain this specifically for OSX (Lion) as that’s where we do the bulk of our development – although the build process is the same for other platforms assuming you have the development environment you need.

libcurl can be downloaded as a set of binaries (pre-built libraries) or you can get the latest stable source and build it yourself. I would suggest building it yourself – you don’t need any pre-requisites other than a development environment. You do need to have installed Xcode on OSX to get the developer tools (make, GCC etc.) but I’m guessing you already have Xcode.

Follow these steps to get and build libcurl:

  1. Go to http://curl.haxx.se/download.html and download the latest ‘source archive’ in its ‘tar’ version. At the time of writing, this is curl-7.24.0.tar
  2. When it’s downloaded, move the tar file from your downloads folder to a temporary folder of your choice (e.g. ~/Desktop/libcurl)
  3. Open a terminal window and change directory to the temporary folder
  4. Expand the tar file:
     tar -xvf curl-7.24.0.tar
  5. The expansion will have created a sub-directory. cd into curl-7.24.0 and then issue the following commands:
    ./configure
    make
    make test
    sudo make install
  6. ‘configure’ will set up the build environment for the platform concerned (OSX in this case), ‘make’ will do the actual build, ‘make test’ will run the unit tests to ensure everything is working ok (note that these tests might take five minutes to complete) and ‘sudo make install’ will move the various components into sensible places for the environment. Because the install copies files to system locations, it has to run as root, hence the sudo – give it your administration password when asked.

That’s it – the install is all done. The ‘make install’ will have put our header files in:

/usr/local/include/curl

and the libraries in

/usr/local/lib/libcurl.a
/usr/local/lib/libcurl.dylib

So now, a simple example of a class to retrieve the content, status code and headers from a site:

Simple libcurl class

Again, we are assuming you are doing this on OSX in Xcode – although it should be fine on pretty much any platform (note that on Windows there may be additional initialisation that you need to do – we haven’t tested it).

Create a new Xcode command-line tool application, then add a new ‘h’ file and a ‘cpp’ file and include the code below. Copy the code in ‘main’ below into the main.cpp that Xcode will have created for you. You shouldn’t need to set any include paths as Xcode should be looking in /usr/local/include for headers and /usr/local/lib for libraries. If your installation isn’t, then you can set ‘Header Search Paths’ and ‘Library Search Paths’ in the ‘Search Paths’ section of the Build Settings.

You will however need to add the library to your project so the linker knows to link against it. In the Build Settings, in the ‘Linking’ section, add -lcurl to the ‘Other Linker Flags’ item.

curly.h:

#ifndef curly_h
#define curly_h

#include <string>
#include <vector>
#include <curl/curl.h>

class Curly{
private:
    std::string               mContent;
    std::string               mType;
    std::vector<std::string>  mHeaders;
    unsigned int              mHttpStatus;
    CURL*                     pCurlHandle;
    static size_t      HttpContent(void* ptr, size_t size,
                        size_t nmemb, void* stream);
    static size_t      HttpHeader(void* ptr, size_t size,
                        size_t nmemb, void* stream);

    // not copyable: the libcurl handle must have a single owner
    Curly(const Curly&);
    Curly& operator=(const Curly&);

public:
    // constructor: create the libcurl easy handle
    Curly():mHttpStatus(0),pCurlHandle(curl_easy_init()){}
    // destructor: release the resources libcurl has used
    ~Curly(){ if (pCurlHandle) curl_easy_cleanup(pCurlHandle); }
    CURLcode    Fetch (std::string);

    inline std::string   Content()    const { return mContent; }
    inline std::string   Type()       const { return mType; }
    inline unsigned int  HttpStatus() const { return mHttpStatus; }
    inline std::vector<std::string>   Headers()    const { return mHeaders; }
};

#endif

curly.cpp:

#include "curly.h"

CURLcode Curly::Fetch(std::string url){

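    // without a valid handle (curl_easy_init can fail) there is nothing to do
    if (pCurlHandle == NULL)
        return CURLE_FAILED_INIT;

    // clear things ready for our 'fetch'
    mHttpStatus = 0;
    mType.clear();
    mContent.clear();
    mHeaders.clear();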

    // set our callbacks
    curl_easy_setopt(pCurlHandle , CURLOPT_WRITEFUNCTION, HttpContent);
    curl_easy_setopt(pCurlHandle, CURLOPT_HEADERFUNCTION, HttpHeader);
    curl_easy_setopt(pCurlHandle, CURLOPT_WRITEDATA, this);
    curl_easy_setopt(pCurlHandle, CURLOPT_WRITEHEADER, this);

    // set the URL we want
    curl_easy_setopt(pCurlHandle, CURLOPT_URL, url.c_str());

    //  go get 'em, tiger
    CURLcode curlErr = curl_easy_perform(pCurlHandle);
    if (curlErr == CURLE_OK){

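        // assuming everything is ok, get the content type and status code
        char* content_type = NULL;
        // the call can succeed and still leave the pointer NULL
        // (e.g. the server sent no Content-Type header)
        if ((curl_easy_getinfo(pCurlHandle, CURLINFO_CONTENT_TYPE,
              &content_type)) == CURLE_OK && content_type != NULL)
            mType = std::string(content_type);

        // CURLINFO_RESPONSE_CODE expects a pointer to a long
        long http_code = 0;
        if((curl_easy_getinfo (pCurlHandle, CURLINFO_RESPONSE_CODE,
              &http_code)) == CURLE_OK)
            mHttpStatus = static_cast<unsigned int>(http_code);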

    }
    return curlErr;
}

size_t Curly::HttpContent(void* ptr, size_t size,
                            size_t nmemb, void* stream) {

    Curly* handle = (Curly*)stream;
    size_t data_size = size*nmemb;
    if (handle != NULL){
        handle->mContent.append((char *)ptr,data_size);
    }
    return data_size;
}

size_t Curly::HttpHeader(void* ptr, size_t size,
                            size_t nmemb, void* stream) {

    Curly* handle = (Curly*)stream;
    size_t data_size = size*nmemb;
    if (handle != NULL){
        std::string header_line((char *)ptr,data_size);
        handle->mHeaders.push_back(header_line);
    }
    return data_size;
}

main.cpp:

#include <iostream>
#include "curly.h"

int main (int argc, const char * argv[])
{

    Curly curly;

    if (curly.Fetch("http://www.dahu.co.uk") == CURLE_OK){

        std::cout << "status: " << curly.HttpStatus() << std::endl;
        std::cout << "type: " << curly.Type() << std::endl;
        std::vector<std::string> headers = curly.Headers();

        for(std::vector<std::string>::iterator it = headers.begin();
                it != headers.end(); it++)
            std::cout << "Header: " << (*it) << std::endl;

        std::cout << "Content:\n" << curly.Content() << std::endl;
    }

    return 0;
}

A brief explanation

The constructor for the ‘curly’ class calls

curl_easy_init()

to initialise curl, and importantly, the destructor calls

curl_easy_cleanup(handle)

to release the resources libcurl has used. Curl provides a wealth of calls to do pretty much everything you might ever want to do with HTTP, including handling redirects, cookies and setting custom headers and also provides a set of ‘easy’ calls to do the most common things with little or no fuss, which is what we are using in this example.

Once we have an instance created, we can call ‘Fetch’ and pass it a URL to get. Note again that we are not providing any parsing or validation of the URL for this simple example – you will need to ensure that yourself before you call Fetch.
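As a rough illustration of the kind of check you might run before calling Fetch, here is a minimal sketch. LooksLikeHttpUrl is a hypothetical helper of our own, not part of libcurl, and it only checks for an http or https scheme followed by something that could be a host. It is deliberately case-sensitive and is nowhere near a full URL parser.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of libcurl): a very rough sanity check
// that a string starts with http:// or https:// and has a host after it.
bool LooksLikeHttpUrl(const std::string& url) {
    const std::string schemes[] = { "http://", "https://" };
    for (std::size_t i = 0; i < 2; ++i) {
        const std::string& scheme = schemes[i];
        if (url.compare(0, scheme.size(), scheme) == 0) {
            // require at least one character of host after the scheme
            return url.size() > scheme.size() && url[scheme.size()] != '/';
        }
    }
    return false;
}
```

You would call it as `if (LooksLikeHttpUrl(url)) curly.Fetch(url);` and decide for yourself what to do with anything it rejects.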

In Fetch, first we clean down our members for the list of headers, the content and the status code (we want to be able to repeatedly call Fetch on our instance for multiple URLs after all). Then we set up the all-important callbacks for processing the received content and header information. When we instruct libcurl to go get content for us by calling

curl_easy_perform(handle)

the library will call the callback we specify with each content block that it gets. It might very well get called multiple times (in fact the header callback will get called once for each header entry). This means that the callbacks need to handle the accumulation of the data.

The callbacks need a static function – we can’t simply pass a member function of our class – and of course, if we define the functions in our class as static, they are unaware of our specific instance. Hence we tell the library, with the CURLOPT_WRITEDATA and CURLOPT_WRITEHEADER options, to pass the ‘this’ pointer to our static functions as user data, and then we can cast that pointer back to get a handle on our instance and update the member variables. Simples.
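The static-function-plus-‘this’ trick is easier to see with libcurl taken out of the picture. The sketch below is ours, not libcurl code: DeliverInChunks is a made-up stand-in for a C library that only knows about a plain function pointer and an opaque user-data pointer, and Collector is a hypothetical class that plays the role of Curly.

```cpp
#include <cstddef>
#include <string>

// Callback type with the same shape as the libcurl write callback.
typedef size_t (*ChunkCallback)(void* ptr, size_t size, size_t nmemb, void* userdata);

// Hypothetical stand-in for a C library: it knows nothing about our class and
// simply hands each chunk, plus the opaque user-data pointer, to the callback.
static void DeliverInChunks(const char* data, size_t length, size_t chunkSize,
                            ChunkCallback callback, void* userdata) {
    if (chunkSize == 0) return;
    for (size_t offset = 0; offset < length; offset += chunkSize) {
        size_t n = (length - offset < chunkSize) ? (length - offset) : chunkSize;
        callback(const_cast<char*>(data + offset), 1, n, userdata);
    }
}

class Collector {
public:
    const std::string& Content() const { return mContent; }

    void Receive(const std::string& data, size_t chunkSize) {
        mContent.clear();
        // hand over the static function and 'this' as the user data
        DeliverInChunks(data.data(), data.size(), chunkSize, &Collector::OnChunk, this);
    }

private:
    std::string mContent;

    // static, so it has an ordinary function-pointer type the C side can call
    static size_t OnChunk(void* ptr, size_t size, size_t nmemb, void* userdata) {
        Collector* self = static_cast<Collector*>(userdata);  // recover the instance
        size_t bytes = size * nmemb;
        if (self != NULL) self->mContent.append(static_cast<char*>(ptr), bytes);
        return bytes;
    }
};
```

Calling `Receive("hello, world", 5)` on a Collector invokes the callback three times, and Content() ends up holding the whole string again, which is the same accumulation our Curly callbacks do.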

The callbacks are defined like this:

static size_t func(void* ptr, size_t size, size_t nmemb, void* stream);

This at first looks quite confusing. The reason that we have both a ‘size’ and an ‘nmemb’ (number-of-members) is that typical examples in the past used fwrite to write the content to a file, and the parameters mirror those of fwrite, making that job nice and easy. For our example, we simply multiply the number of chunks by the size of each chunk and then process that amount of data. Note that the data ptr points to is not null-terminated – so if you are creating or appending to a string as we are, you will need to use one of the overloaded methods that let you specify a size.
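To make the point about sizes concrete, here is a small sketch of the append step on its own. AppendChunk is a hypothetical free function of ours; it takes the same ptr, size and nmemb that the callback receives and uses the std::string::append overload that takes an explicit length, so it never goes looking for a terminating zero.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: appends exactly size * nmemb bytes from ptr to out.
// It never relies on the buffer being null-terminated.
size_t AppendChunk(std::string& out, const void* ptr, size_t size, size_t nmemb) {
    size_t bytes = size * nmemb;
    out.append(static_cast<const char*>(ptr), bytes);
    // a write callback should return the number of bytes it handled;
    // libcurl treats any other value as an error and aborts the transfer
    return bytes;
}
```

Constructing a std::string from the bare pointer instead (with no length) would read past the end of the chunk, which is the trap the paragraph above warns about.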

We create a vector of strings for our header entries and a simple string for the content, appending to it with each invocation.
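Each entry in the header vector is a raw line such as "Content-Type: text/html" with its trailing CR/LF still attached, and the status line and the blank line that ends the headers arrive as entries too. If you want names and values, something along these lines would do as a starting point. SplitHeader is a hypothetical helper of ours, and it makes no attempt to handle folded or otherwise unusual headers.

```cpp
#include <string>
#include <utility>

// Hypothetical helper: splits "Name: value\r\n" into (name, value).
// Lines without a colon (the status line, the closing blank line)
// come back with an empty name.
std::pair<std::string, std::string> SplitHeader(const std::string& line) {
    // strip the trailing CR/LF
    std::string::size_type end = line.find_last_not_of("\r\n");
    std::string trimmed = (end == std::string::npos) ? std::string()
                                                     : line.substr(0, end + 1);

    std::string::size_type colon = trimmed.find(':');
    if (colon == std::string::npos)
        return std::make_pair(std::string(), trimmed);

    // skip the whitespace that follows the colon
    std::string::size_type start = trimmed.find_first_not_of(" \t", colon + 1);
    std::string value = (start == std::string::npos) ? std::string()
                                                     : trimmed.substr(start);
    return std::make_pair(trimmed.substr(0, colon), value);
}
```

Remember that header names are case-insensitive in HTTP, so compare them accordingly if you go looking for a particular one.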

Note that a handle to an easy libcurl instance, like the one encapsulated in our example above, is not thread safe. You can of course have multiple instances of the class (and hence libcurl handles), each used on a single thread – but don’t cross the streams and start sharing the handles. There is a multi interface that should allow you much more flexibility, but that is way beyond the scope of this post.

Hopefully that gives you enough information to get started – we don’t claim to be experts so please feel free to point out any howlers or glaring omissions we might have made.

 

 

Posted in C++

Dahu in the Community

As a responsible, green-thinking, forward-looking technology company, Dahu is always looking for ways in which we can demonstrate our commitment to support the local community. We are proud to announce our sponsorship of the BLC for 2012. This is a truly great organisation that gives under-privileged sales executives a chance to interact with and learn from technical experts in a friendly and social setting. This year, the BLC are taking senior sales executives and technical account managers to Vallorcine in the French Alps, where they will have a chance to experience and learn for themselves how to cook the wonderful local cuisine, whilst learning essential life-skills such as cleaning, washing, and running up and down hills to the shops. Money can’t buy experience like this.

Watch this space for more Dahu philanthropy in the future.

Posted in Uncategorized

Aurasma – What Autonomy did next…

Here at Dahu Towers, we are always on the lookout for the next big thing, or even the next little thing if it manages to amuse us for long enough. And our attention was recently drawn to an intriguing new app for augmented virtual reality, namely Aurasma, the latest plaything from Autonomy, themselves the latest plaything of HP.

We first saw the Aurasma visual browser at the recent London AdTech show where it was introduced by Matt Mills from Autonomy. In an impressive live demonstration, he photographed a page from a glossy magazine, which he then uploaded to a central server along with a short video taken of the audience. Within seconds, he was able to point an iPad running the Aurasma app at the original magazine photo and have it recognise the image. Presumably the app was sending a series of live input images to a central service which searched over its library of tagged images in real-time. Once it matched the newly uploaded photo, it immediately linked, downloaded and ran the video clip showing the awed AdTech audience in place of the original magazine advert.

Very impressive on several levels. Firstly, just to run a live demo at a large show over a relatively slow 3G link, that involves uploading two multimedia files to a remote server, shows an impressive level of confidence in the entire technology stack, from iPhone through dongle to Aurasma’s back-end servers and back to the iPad running the app. Big respect to Matt for having the cojones to attempt that!

One also has to be impressed at the speed with which the uploaded image was received, processed, indexed and made available for searching by the back-end services supporting the application. Presumably, the search is based on Autonomy’s IDOL platform, which, in our experience, has not always proven to be the most scalable or performant search engine in the marketplace. One does wonder how it will scale as the number of adopters of the free downloadable app increases. For now though, it works impressively well.

We were left wondering as to the uses for this technology. Autonomy’s Mike Lynch claims this is the idea that will make augmented reality go mainstream. As a consumer app, it does seem to offer huge potential for fun and amusement: you can comment on, tag and share your view of real-world objects and places with your friends (although the latest comments on the iTunes App Store from early users would suggest some further work is still required on usability). From a commercial perspective, which we think is where Autonomy expects this technology to go, we are less convinced. The idea is that static, 2-D advertising can come to life; point your smart phone at a newspaper and see the latest news headlines in place of the outdated printed copy; link the label of a jar of pasta sauce to a recipe; take a photo of a public building like a major gallery and see what events are happening inside right now, with links to buy tickets.

One can see committed geo-cachers might appreciate the new world of possibilities they now have to hunt for virtual satisfaction. For the rest of us, do we really want to see the world through the lens of a smartphone?

More worryingly, we don’t see how this would work in practice. At the AdTech show, Mr Mills appeared to select a page at random from a glossy magazine. From memory, the page showed a large and presumably expensively prepared advert for a perfume. I don’t recall any permission being sought from the owner of the image rights before Autonomy tagged their prime advertising space with an amateur video of the AdTech audience. I’ve had a quick look around the Aurasma web site and I do not see any options for disgruntled victims of Aurasma graffiti to object and have images removed. I’m sure the lawyers at Autonomy will have considered this scenario before launching the product, and clearly there must be protection available to the holders of image rights and copyright, but from the information available on the web site, I do not see how it is intended to be controlled.

Next, one imagines the most popular images may well be tagged as “auras” multiple times, from different users or companies. Is there to be a market for the top position in an Aurasma image search? If so, how is this to work, and how is it to be governed? Are we opening up a whole new world of SEO for images? A new market for auctioning images and faces on which to stamp messages and content by the highest bidder?

Imagine a competitor taking a photograph of the entrance to a corporate HQ, for example Autonomy’s in Cambridge. They can upload their own messaging to appear in connection with that building. Imagine an anti-capitalist organisation tagging the Palace of Westminster with anti-globalisation messages, presumably to remain associated with the seat of Parliament for all time? We understand “auras” will soon be possible based on a face. Imagine if your face were to be tagged with a link to offensive or even pornographic material. How is one to demand instant retraction of offending images? Is there to be a hotline to Autonomy to report and remove offensive, libellous or illegal tags? As I say, I’m sure the Autonomy lawyers have thought this through, but their preferred remedy is not clearly identifiable on the Aurasma site.

Interesting, but for now, we just don’t get it.

Posted in Uncategorized

Data Granularity – what’s *that* all about?

Data Granularity and the Web

Work continues on our new engine – EDGE (Enhanced Data Granularity Engine), so we thought it might be a good idea to spend a little time explaining what we mean by data granularity.

Big Data

Pretty much everyone is aware of “Big Data” and the challenges it brings. Exploding data volumes and an inability to address the data in a meaningful (or affordable) way mean that people are now looking to other technologies to address their big data needs. A classic example that highlights the problem is smart-metering. Whereas in the old days someone might visit your property and take a reading of your electricity meter once a quarter (or indeed simply ask you to do it yourself), smart meters are able to take a reading at a frequency set by the company.

One example we looked at some time back was taking readings every 30 minutes. These readings are gathered automatically and stored centrally and, clearly, allow the company to monitor its business in a much more granular way. Except of course they don’t. Think about it – moving from a quarterly reading to a half-hourly reading is roughly a four-thousand-fold increase in data from every property. Despite the processing, storage and access headaches, the data is of huge potential value when viewed as a whole. The individual value of a single 30 minute reading might be minuscule, but taken together the readings can reveal real insight into the real-time usage of electricity, allowing the providers to create specific packages aimed at consumer types and also providing a real tool to fight fraudulent use of electricity.
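For anyone who wants to check the four-thousand figure, the arithmetic is short enough to write down. This is only a back-of-envelope sketch: ReadingsPerQuarter is a name we made up, and the 91-day quarter is our assumption (365 divided by 4, rounded).

```cpp
// Back-of-envelope: how many readings replace one quarterly reading?
// Assumes a 91-day quarter (365 / 4, rounded), which is our own approximation.
long ReadingsPerQuarter(long minutesBetweenReadings) {
    const long daysPerQuarter = 91;
    const long minutesPerDay = 24 * 60;   // 1440
    return daysPerQuarter * minutesPerDay / minutesBetweenReadings;
}
```

With a reading every 30 minutes this gives 91 × 1440 / 30 = 4368 readings where there used to be one, hence “roughly four thousand”.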

Now the cost of storing this additional data using traditional RDBMS systems doesn’t decrease when you extend massively like this, even given some extreme licensing curves from your favourite RDBMS vendor. At these kinds of volumes the systems don’t necessarily behave linearly and the conflicts between secure transaction processing and information retrieval exacerbate this non-linearity. There are a number of alternative approaches including the use of non-relational databases (typified by the NoSQL movement) and also Search-Based Applications – the use of search technologies to handle the retrieval operations.

In our smart-meter example, what often happens is that the data is collected and stored but not really made available in any meaningful way. The granularity of the data can be said to be coarse-grained; while the system is itself fine-grained (having a lot of detail) the cost of really utilising the fine-grained detail is too high for most businesses to bear. Putting this another way, to be able to make use of the data the cost-per-item needs to decrease massively as the volumes increase.

Dahu Edge

While we have experience in using search technologies to tackle problems like this, Dahu’s own specific area of interest and the focus of our products and services is mining data from the web. Many businesses want to make use of content from the web, data from social media and inferences drawn from the two, and there is a wealth of freely available content out there if only you can get at it. It exists at a sub web-page level (meaning that a single page may contain many distinct pieces of information or data).

Gathering data from a multitude of web-sites is perfectly possible using a wide variety of techniques, but up to now, these techniques have been quite intensive, requiring considerable set-up time and on-going management. This is fine if you are trying to gather data from perhaps one or two web sites and can keep on top of any changes in those sites – but if you need to gather content from perhaps thousands of sites, then you are going to struggle to manage the process in a cost-effective way.

Dahu EDGE is a set of tools and services we are developing to automate this process. It allows us to focus on a specific kind of content from a large range of web-sites, automatically find the pages of interest and automatically find the items on the page and extract the pertinent attributes that we need. The process is designed to be as automatic as possible. It does need to be taught a few things about the kind of content we are looking for, but importantly, it does not need to be taught anything about the particular web-sites where the content might be located.

Using EDGE, we are able to extract millions of items from thousands of web-sites at a cost achievable by most businesses and entrepreneurial start-ups in need of data.

Posted in Uncategorized

open-source vs closed-source – “Is this a five minute argument or the full half-hour?”

open-source vs closed-source – how to make a choice.


Nothing is likely to raise the blood pressure in a collection of search experts quicker than a lively debate over the virtues of open versus closed source search. Here at Dahu Towers, we of course have our own views and thought a gentle discussion might be a nice jolly way of starting off our new blog. Our view, without wishing to be too trite, is that the decision to use open source search depends largely on whether you have an open source problem or not. The traditional view is that open source is for those wishing to avoid a licence fee and that are happy to roll up their sleeves and get down-and-dirty with the code. As with all things in life, it’s really not that simple. Firstly, consider what open source and indeed closed-source looks like at the moment. Open source search is good. Very good. Great in fact. There, I’ve said it. It’s feature-rich, very scalable and very reliable. We at Dahu Towers are using open-source for our own product development. We also consult on closed-source search however.

The reality is that the periphery of open-source solutions (Solr, Lucene, Elasticsearch et al) is much more sparse than that of a typical closed-source solution. Open source search tends to solve the technical core issues of search but doesn’t really provide all the additional features and facilities that the closed source vendors tend to provide. It would be interesting to look at a typical closed-source vendor’s engineering department and see how much of their time is spent engineering in the kernel or core of their product and how much time is spent in the periphery.

So back to the question – “do you have an open source problem?”. The traditional answer (from the vendors at least) has always been that open source is for a) solving simple problems and b) tech-savvy organisations. ‘a’ is simply not the case – some of the largest and most complex solutions are based on open source – and ‘b’ is somewhat of a myth. It is true however that for open source you will be gathering a number of components together and for anything but the simplest requirement, you will be doing (or having done for you) some customisations. There is now quite an ecosystem out there of tools for content gathering, processing, search User Interface creation, taxonomy and ontology management (including our own deep-content offerings). So a primary question needs to be “is what you are doing a specialisation that exists in the closed-source world or are you breaking new ground?”. It’s no surprise that most startups that are using search in innovative ways will opt for open-source. If you are going to have to take the back off and start tinkering about on something truly novel, then often, closed-source will not offer you any particular advantage.

If however your problem is a specialisation that is already catered for in the closed-source world then it is worth looking at the vendors and doing a thorough investigation of the costs (and we mean all the costs – the total cost of ownership not just the licence cost or lack of it). The vendors of closed-source solutions all have their specialisations. Some are particularly strong in true enterprise search, offering business-level management tools, connectors for all the content sources you are likely to need, and easily customisable User Interfaces. Some are particularly strong in E-commerce, providing campaign management tools and support for managing catalogues with real-time price updates. Some specialise in specific domains like Legal and Compliance and provide tools and methods aimed specifically at these verticals. The way that open-source software is developed tends towards generic search facilities. Anything that is specific to a particular solution or domain is unlikely to make it into the main trunk of an open-source search product unless it can be abstracted and made available as a feature of universal application. This keeps the core of the open-source engines powerful, but devoid of any vertical or domain bias. It is possible that in the future we might start to see flavours of search aimed at, say, the compliance market, although right now, we doubt it.

The closed-source vendors identify their specific markets and pursue them, often with almost religious zeal, without having the same sort of agnostic constraints. This is often seen in the way the specific content is connected to, filtered, processed and presented, often out-of-the-box. Innovation on the other hand can be hindered. As we mentioned, many startups with a search bias will choose open-source precisely because of the low-level flexibility.

Another important consideration is the rate of change in your application. For example, if you are a large enterprise looking to deploy a number of search-based solutions at a global and departmental level, then a closed-source vendor with a soup-to-nuts solution will be cheaper in the long run. If you are looking to build something relatively static for a point-solution, then open-source will be cheaper. Large-scale enterprises tend to standardise on solutions to keep costs down. A single technology platform with a full-featured solution requires one set of skills, and while this may enforce some (potentially unpopular) constraints, it ultimately lowers support costs. In controlled environments, continual development using a wide range of tools and modules can be seen as anarchy – it can certainly cause difficulty in control, and introduces potential duplication of effort and skill-sets.

In conclusion, we feel that open-source is not just for the tech-savvy, the frugal and those with simple problems. Closed source is not just for those who are technology-averse and those with deep pockets. We expect to see more innovation based on open source, and we expect to see a lot more specialisation in the closed-source community.

 

Posted in Uncategorized

It’s Alive!! (welcome to the Dahu blog)

Welcome to the new Dahu blog. We intend to blog on aspects of our work in search and provide some insight into how the development of the Dahu Edge deep-content framework is proceeding.

Posted in Uncategorized