Complex network measurement with link queries

> By Fabien Tarissan, Matthieu Latapy and Christophe Prieur

Download

In many cases, complex networks are not directly available: one has to conduct an expensive measurement procedure in order to collect information on their structure. One then obtains a sample of the network under concern, which is in general only a small part of the whole network.

We present here such a situation. The basic measurement operation consists in asking whether a link is present between two given nodes, which we call a link query; a measurement of the network then consists in a series of link queries, between chosen pairs of nodes. We suppose that the list of all nodes is initially known (but no edges).

Such a typical situation is the case of interactions between proteins: one considers a set of proteins, and wants to study which proteins react with each other (which defines a link between proteins). One may check that a given protein reacts with another one by conducting an expensive chemical experiment (which basically consists in putting these proteins into contact and observe what happens). This experiment is nothing but a link query, and the number of link queries to perform should be as small as possible while discovering as many interactions as possible.

It must be clear that many strategies are possible for choosing pairs of nodes used in link queries. One may for instance choose random pairs (there is limited other choices at the beginning, when we know nothing about the network). But most pairs of nodes are not connected in general, and so random link queries will most often give negative answers. However, when some links have been discovered (using the random method, for instance), one may rely on this knowledge to perform link queries with a high success probability.

For instance, one may notice that if a node has a high degree (i.e. a large number of links) then the random strategy has a higher chance to discover one of its links, simply because there are many possibilities to do so. For instance, if a node has 1000 links and another one has only 1, then making random link queries may lead to the discovery of a link of the first node 1000 times more probably than for the second node.

Based on this remark, one may guess that extremities of links discovered during the random strategy probably have a high-degree. There is therefore a good chance that a link from these nodes to any other one exists. As a consequence, one may expect to discover many links by performing link queries for pairs of nodes composed of these extremities and all other nodes. If the extremity has a high degree, then these queries will often be successful.

Going further, one may perform link queries on pairs of nodes which are both extremities of links discovered with the random approach. As both probably have a high degree, the probability that they are linked together is very high, and so such queries have an even higher probability to be successful.

The video above displays such a measurement. For a given set of 476 nodes (displayed twice with the same layout), we first perform random queries until a given number of links have been discovered (200 in this simulation). Then, we switch to one of the two strategies sketched above: for the layout on the left, we search for all possible links from extremities of discovered links; for the layout on the right, we search for links between pairs of extremities of previously discovered links.

This video shows clearly that, in both cases, links are discovered much faster after the random search, thus confirming that the more subtle approaches are more efficient (they discover more links with less link queries). Moreover, the video shows that patterns of link discoveries are different depending on the strategy: star-like discoveries in the left (as we discover all links of high-degree nodes), and clustered links in the right (as we query pairs of nodes already well connected).

Going further, one may notice that, although the obtained samples finally contain the same number of links, discovered links are not the same depending on the measurement strategy. More importantly, properties of the obtained samples are not the same. This shows that, although some measurement strategies lead to the discovery of more links, they may induce a bias in the obtained sample.

See our paper published in NetSciCom 2009 for more details.

This entry was posted in Videos and tagged