Probabilistic k-swap method for uniform graph generation beyond the configuration model

Lionel Tabourier, Julien Karadayi

In Journal of Complex Networks, 2024, 12 (1)

DOI: 10.1093/comnet/cnae002

Generating graphs with realistic structural characteristics is an important challenge for complex networks analysis, as these graphs are the null models that allow to describe and understand the properties of real-world networks. However, the field lacks systematic Generating graphs with realistic structural characteristics is an important challenge for complex networks analysis, as these graphs are the null models that allow to describe and understand the properties of real-world networks. However, the field lacks systematic means to generate samples of graphs with predefined structural properties, because it is difficult to devise a method that is both flexible and guarantees to get a uniform sample, i.e., where any graph of the target set has the same probability to be represented in the sample. In practice, it limits the experimental investigation to a handful of models, including the well-known Erdős-Rényi graphs or the configuration model. The aim of this paper is to provide such a method: we design and implement a Monte-Carlo Markov Chain process which is both flexible and satisfies the uniformity condition. Its assumptions are that: 1) the graphs are simple, 2) their degree sequence is fixed, 3) the user has at least one graph of the set available. Within these limitations, we prove that it is possible to generate a uniform sample of any set of such graphs. We provide an implementation in python and extensive experiments to show that this method is practically operational in several relevant cases. We use it with five specific set of constraints and verify that the samples obtained are consistent with existing methods when such a method is available. In those cases, we report that state-of-the-art methods are usually faster, as our method favors versatility at the cost of a lower efficiency. Also, the implementation provided has been designed so that users may adapt it to relevant constraints for their own field of work.


Link Streams as a Generalization of Graphs and Time Series

Esteban Bautista, Matthieu Latapy

In « Temporal Network Theory », Holme, P. and Saramäki, J. (eds), Springer, 2023

DOI: 10.1007/978-3-031-30399-9_22œ

A link stream is a set of possibly weighted triplets (t, u, v) modeling that u and v interacted at time t. Link streams offer an effective model for datasets containing both temporal and relational information, making their proper analysis crucial in many applications. They are commonly regarded as sequences of graphs or collections of time series. Yet, a recent seminal work demonstrated that link streams are more general objects of which graphs are only particular cases. It therefore started the construction of a dedicated formalism for link streams by extending graph theory. In this work, we contribute to the development of this formalism by showing that link streams also generalize time series. In particular, we show that a link stream corresponds to a time-series extended to a relational dimension, which opens the door to also extend the framework of signal processing to link streams. We therefore develop extensions of numerous signal concepts to link streams: from elementary ones like energy, correlation, and differentiation, to more advanced ones like Fourier transform and filters.


Computing Betweenness Centrality in Link Stream

Frédéric Simard, Clémence Magnien, Matthieu Latapy

Journal of Graph Algorithms and Applications 27:3, 2023 DOI: 10.7155/jgaa.00620

Betweenness centrality is one of the most important concepts in graph analysis. It was recently extended to link streams, a graph generalization where links arrive over time. However, its computation raises non-trivial issues, due in particular to the fact that time is considered as continuous. We provide here the first algorithms to compute this generalized betweenness centrality, as well as several companion algorithms that have their own interest. They work in polynomial time and space, we illustrate them on typical examples, and we provide an implementation.


Code available here

A Frequency-Structure Approach for Link Stream Analysis

Esteban Bautista, Matthieu Latapy

Fifth IEEE International Conference on Cognitive Machine Intelligence (CogMI), 2023

A link stream is a set of triplets (t,u,v) indicating that u and v interacted at time t. Link streams model numerous datasets and their proper study is crucial in many applications. In practice, raw link streams are often aggregated or transformed into time series or graphs where decisions are made. Yet, it remains unclear how the dynamical and structural information of a raw link stream carries into the transformed object. This work shows that it is possible to shed light into this question by studying link streams via algebraically linear graph and signal operators, for which we introduce a novel linear matrix framework for the analysis of link streams. We show that, due to their linearity, most methods in signal processing can be easily adopted by our framework to analyze the time/frequency information of link streams. However, the availability of linear graph methods to analyze relational/structural information is limited. We address this limitation by developing (i) a new basis for graphs that allow us to decompose them into structures at different resolution levels; and (ii) filters for graphs that allow us to change their structural information in a controlled manner. By plugging-in these developments and their time-domain counterpart into our framework, we are able to (i) obtain a new basis for link streams that allow us to represent them in a frequency-structure domain; and (ii) show that many interesting transformations to link streams, like the aggregation of interactions or their embedding into a euclidean space, can be seen as simple filters in our frequency-structure domain.


LSCPM: Communities in Massive Real-World Link Streams by Clique Percolation Method

Alexis Baudin, Lionel Tabourier, Clémence Magnien

30th International Symposium on Temporal Representation and Reasoning (TIME 2023)

Community detection is a popular approach to understand the organization of interactions in static networks. For that purpose, the Clique Percolation Method (CPM), which involves the percolation of k-cliques, is a well-studied technique that offers several advantages. Besides, studying interactions that occur over time is useful in various contexts, which can be modeled by the link stream formalism. The Dynamic Clique Percolation Method (DCPM) has been proposed for extending CPM to temporal networks.
However, existing implementations are unable to handle massive datasets. We present a novel algorithm that adapts CPM to link streams, which has the advantage that it allows us to speed up the computation time with respect to the existing DCPM method. We evaluate it experimentally on real datasets and show that it scales to massive link streams. For example, it allows to obtain a complete set of communities in under twenty-five minutes for a dataset with thirty million links, what the state of the art fails to achieve even after a week of computation. We further show that our method provides communities similar to DCPM, but slightly more aggregated. We exhibit the relevance of the obtained communities in real world cases, and show that they provide information on the importance of vertices in the link streams.


Faster maximal clique enumeration in large real-world link streams

Alexis Baudin, Clémence Magnien, Lionel Tabourier

arXiv preprint arXiv:2302.00360

Link streams offer a good model for representing interactions over time. They consist of links (b,e,u,v), where u and v are vertices interacting during the whole time interval [b,e]. In this paper, we deal with the problem of enumerating maximal cliques in link streams. A clique is a pair (C,[t0,t1]), where C is a set of vertices that all interact pairwise during the full interval [t0,t1]. It is maximal when neither its set of vertices nor its time interval can be increased. Some of the main works solving this problem are based on the famous Bron-Kerbosch algorithm for enumerating maximal cliques in graphs. We take this idea as a starting point to propose a new algorithm which matches the cliques of the instantaneous graphs formed by links existing at a given time t to the maximal cliques of the link stream. We prove its validity and compute its complexity, which is better than the state-of-the art ones in many cases of interest. We also study the output-sensitive complexity, which is close to the output size, thereby showing that our algorithm is efficient. To confirm this, we perform experiments on link streams used in the state of the art, and on massive link streams, up to 100 million links. In all cases our algorithm is faster, mostly by a factor of at least 10 and up to a factor of 10**4. Moreover, it scales to massive link streams for which the existing algorithms are not able to provide the solution.


Énumération efficace des cliques maximales dans les flots de liens réels massifs

Alexis Baudin, Clémence Magnien, Lionel Tabourier

EGC 2023, vol. RNTI-E-39, pp.139-150

Les flots de liens offrent un formalisme de description d’interactions au cours du temps. Un lien correspond à deux sommets qui interagissent sur un intervalle de temps. Une clique est un ensemble de sommets associé à un intervalle de temps durant lequel ils sont tous connectés. Elle est maximale si ni son ensemble de sommets ni son intervalle de temps ne peuvent être augmentés. Les algorithmes existants pour énumérer ces structures ne permettent pas de traiter des jeux de données réels de plus de quelques centaines de milliers d’interactions. Or, l’accès à des données toujours plus massives demande d’adapter les outils à de plus grandes échelles. Nous proposons alors un algorithme qui énumère les cliques maximales sur des réseaux temporels réels et massifs atteignant jusqu’à plus de 100 millions de liens. Nous montrons expérimentalement qu’il améliore l’état de l’art de plusieurs ordres de grandeur.


Tailored vertex ordering for faster triangle listing in large graphs

Fabrice Lécuyer, Louis Jachiet, Clémence Magnien, Lionel Tabourier


Listing triangles is a fundamental graph problem with many applications, and large graphs require fast algorithms. Vertex ordering allows the orientation of edges from lower to higher vertex indices, and state-of-the-art triangle listing algorithms use this to accelerate their execution and to bound their time complexity. Yet, only basic orderings have been tested. In this paper, we show that studying the precise cost of algorithms instead of their bounded complexity leads to faster solutions. We introduce cost functions that link ordering properties with the running time of a given algorithm. We prove that their minimization is NP-hard and propose heuristics to obtain new orderings with different trade-offs between cost reduction and ordering time. Using datasets with up to two billion edges, we show that our heuristics accelerate the listing of triangles by an average of 38% when the ordering is already given as an input, and 16% when the ordering time is included.


A combinatorial link between labelled graphs and increasingly labelled Schröder trees

Antoine Genitrini, Mehdi Naima, Olivier Bodini

The 15th Latin American Theoretical Informatics Symposium (LATIN 2022)

In this paper we study a model of Schr ̈oder trees whose labelling is increasing along the branches. Such tree family is useful in the context of phylogenetic. The tree nodes are of arbitrary arity (i.e. out-degree) and the node labels can be repeated throughout different branches of the tree. Once a formal construction of the trees is formalized, we then turn to the enumeration of the trees inspired by a renormalisation due to Stanley on acyclic orientations of graphs. We thus exhibit links between our tree model and labelled graphs and prove a one-to-one correspondence between a subclass of our trees and labelled graphs. As a by-product we obtain a new natural combinatorial interpretation of Stanley’s renormalising factor. We then study different combinatorial characteristics of our tree model and finally, we design an efficient uniform random sampler for our tree model which allows to generate uniformly Erdös-Renyi graph with a constant number of rejections on


Compressing bipartite graphs with a dual reordering scheme

Maximilien Danisch, Ioannis Panagiotas, Lionel Tabourier


In order to manage massive graphs in practice, it is often necessary to resort to graph compression, which aims at reducing the memory used when storing and processing the graph. Efficient compression methods have been proposed in the literature, especially for web graphs. In most cases, they are combined with a vertex reordering pre-processing step which significantly improves the compression rate. However, these techniques are not as efficient when considering other kinds of graphs. In this paper, we focus on the class of bipartite graphs and adapt the vertex reordering phase to their specific structure by proposing a dual reordering scheme. By reordering each group of vertices in the purpose of minimizing a specific score, we show that we can reach better compression rates. We also suggest that this approach can be further refined to make the node orderings more adapted to the compression phase that follows the ordering phase.


A Fast Algorithm for Ranking Users by their Influence in Online Social Platforms

Nouamane Arhachoui, Esteban Bautista, Maximilien Danisch, Anastasios Giovanidis


Abstract—Measuring the influence of users in social networks is key for numerous applications. A recently proposed influence metric, coined as $\psi$-score, allows to go beyond traditional centrality metrics, which only assess structural graph importance, by further incorporating the rich information provided by the posting and re-posting activity of users. The $\psi$-score is shown in fact to generalize PageRank for non-homogeneous node activity. Despite its significance, it scales poorly to large datasets; for a network of N users it requires to solve N linear systems of equations of size N. To address this problem, this work introduces a novel scalable algorithm for the fast approximation of $\psi$-score, named Power-$\psi$. The proposed algorithm is based on a novel equation indicating that it suffices to solve one system of equations of size N to compute the $\psi$-score. Then, our algorithm exploits the fact that such system can be recursively and distributedly approximated to any desired error. This permits the $\psi$-score, summarizing both structural and behavioral information for the nodes, to run as fast as PageRank. We validate the effectiveness of the proposed algorithm on several real-world datasets.


A local updating algorithm for Personalized PageRank via Chebyshev Polynomials

Esteban Bautista, Matthieu Latapy

In Social Network Analysis and Mining, 2022, vol. 12, no 1, p. 1-11.

The personalized PageRank algorithm is one of the most versatile tools for the analysis of networks. In spite of its ubiquity, maintaining personalized PageRank vectors when the underlying network constantly evolves is still a challenging task. To address this limitation, this work proposes a novel distributed algorithm to locally update personalized PageRank vectors when the graph topology changes. The proposed algorithm is based on the use of Chebyshev polynomials and a novel update equation that encompasses a large family of PageRank-based methods. In particular, the algorithm has the following advantages: (i) it has faster convergence speed than state-of-the-art alternatives for local PageRank updating; and (ii) it can update the solution of recent generalizations of PageRank for which no updating algorithms have been developed. Experiments in a real-world temporal network of an autonomous system validate the effectiveness of the proposed algorithm.


Clique percolation method: memory efficient almost exact communities

Alexis Baudin, Maximilien Danisch, Sergey Kirgizov, Clémence Magnien

International Conference on Advanced Data Mining and Applications (ADMA), 2021

Automatic detection of relevant groups of nodes in large real-world graphs, i.e. community detection, has applications in many fields and has received a lot of attention in the last twenty years. The most popular method designed to find overlapping communities (where a node can belong to several communities) is perhaps the clique percolation method (CPM). This method formalizes the notion of community as a maximal union of k-cliques that can be reached from each other through a series of adjacent k-cliques, where two cliques are adjacent if and only if they overlap on k-1 nodes. Despite much effort CPM has not been scalable to large graphs for medium values of k. Recent work has shown that it is possible to efficiently list all k-cliques in very large real-world graphs for medium values of k. We build on top of this work and scale up CPM. In cases where this first algorithm faces memory limitations, we propose another algorithm, CPMZ, that provides a solution close to the exact one, using more time but less memory.


A logical approach for temporal and multiplex networks analysis

Esteban Bautista, Matthieu Latapy

In 10th International Conference on Complex Networks and their Applications, Madrid (Spain), December 2021 (Poster)

Many systems generate data as a set of triplets (a,b,c): they may represent that user a
called b at time c or that customer a purchased product b in store c. These datasets are
traditionally studied as networks with an extra dimension (time or layer), for which the
fields of temporal and multiplex networks have extended graph theory to account for
the new dimension [1]. However, such frameworks detach one variable from the others
and allow to extend one same concept in many ways, making it hard to capture pat-
terns across all dimensions and to identify the best definitions for a given dataset. This
work overrides this vision and proposes a direct processing of the set of triplets. While
[2] also approaches triplets directly, it focuses on specific patterns and applications.
Our work shows that a more general analysis is possible by partitioning the data and
building categorical propositions (CPs) that encode informative patterns. We show that
several concepts from graph theory can be framed under this formalism and leverage
such insights to extend the concepts to data triplets. Lastly, we propose an algorithm to
list CPs satisfying specific constraints and apply it to a real world dataset.


Shared-memory implementation of the Karp-Sipser kernelization process

Johannes Langguth, Ioannis Panagiotas, Bora Uçar

28th edition of the IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC 2021), Dec 2021, Bangalore, India

We investigate the parallelization of the Karp-Sipser kernelization technique, which constitutes the central part of the well-known Karp-Sipser heuristic for the maximum cardinality matching problem. The technique reduces a given problem instance to a smaller but equivalent one, by repeated applications of two operations: vertex removal, and merging two vertices. The operation of merging two vertices poses the principal challenge in parallelizing the technique. We describe an algorithm that minimizes the need for synchronization and present an efficient shared-memory parallel implementation of the kernelization technique for bipartite graphs. Using extensive experiments on a variety of multicore CPUs, we show that our implementation scales well up to 32 cores on one socket.


Full Bitcoin Blockchain Data Made Easy

Jules Azad Emery, Matthieu Latapy

ASONAM, 2021

Despite the fact that it is publicly available, collecting and processing the full bitcoin blockchain data is not trivial. Its mere size, history, and other features indeed raise quite specific challenges, that we address in this paper. The strengths of our approach are the following: it relies on very basic and standard tools, which makes the procedure reliable and easily reproducible; it is a purely lossless procedure ensuring that we catch and preserve all existing data; it provides additional indexing that makes it easy to further process the whole data and select appropriate subsets of it. We present our procedure in details and illustrate its added value on large-scale use cases, like address clustering. We provide an implementation online, as well as the obtained dataset.


Assessing conservation of alternative splicing with evolutionary splicing graphs

Diego Javier Zea, Sofya Laskina, Alexis Baudin, Hugues Richard and Elodie Laine

Genome Research, 2021

Understanding how protein function has evolved and diversified is of great importance for human genetics and medicine. Here, we tackle the problem of describing the whole transcript variability observed in several species by generalising the definition of splicing graph. We provide a practical solution to construct parsimonious evolutionary splicing graphs where each node is a minimal transcript building block defined across species. We show a clear link between the functional relevance, tissue-regulation and conservation of alternative transcripts on a set of 50 genes. By scaling up to the whole human protein-coding genome, we identify a few thousands of genes where alternative splicing modulates the number and composition of pseudo-repeats. We have implemented our approach in ThorAxe, an efficient, versatile, robust and freely available computational tool.


Link weights recovery in heterogeneous information networks

Hông-Lan Botterman, Robin Lamarche-Perrin

In Computational Social Network, 8 (15), 2021

Socio-technical systems usually consists of many intertwined networks, each connecting different types of objects (or actors) through a variety of means. As these networks are co-dependent, one can take advantage of this entangled structure to study interaction patterns in a particular network from the information provided by other related networks. A method is hence proposed and tested to recover the weights of missing or unobserved links in heterogeneous information networks (HIN) – abstract representations of systems composed of multiple types of entities and their relations. Given a pair of nodes in a HIN, this work aims at recovering the exact weight of the incident link to these two nodes, knowing some other links present in the HIN. To do so, probability distributions resulting from path-constrained random walks i.e., random walks where the walker is forced to follow only a specific sequence of node types and edge types, capable to capture specific semantics and commonly called a meta-path, are combined in a linearly fashion in order to approximate the desired result. This method is general enough to compute the link weight between any types of nodes. Experiments on Twitter and bibliographic data show the applicability of the method.


[Re] Speedup Graph Processing by Graph Ordering

Fabrice Lécuyer, Maximilien Danisch, Lionel Tabourier

In ReScience C 7, 1 (3), 2021

Cache systems keep data close to the processor to access it faster than main memory would. Graph algorithms benefit from this when a cache line contains highly related nodes. Hao Wei extitet al. propose to reorder the nodes of a graph to optimise the proximity of nodes on a cache line. Their contribution, Gorder, creates such an ordering with a greedy procedure. In this replication, we implement ten different orderings and measure the execution time of nine standard graph algorithms on nine real-world datasets. We monitor cache performances to show that runtime variations are caused by cache management. We confirm that Gorder leads to the fastest execution in most cases due to cache-miss reductions. Our results show that simpler procedures are yet almost as efficient and much quicker to compute. This replication validates the initial results but highlights that generating a complex ordering like Gorder is time-consuming.


Ranking Online Social Users by their Influence

Anastasios Giovanidis, Bruno Baynat, Clémence Magnien, and Antoine Vendeville

IEEE Transactions on Networking, 2021

We introduce an original mathematical model to analyse the diffusion of posts within a generic online social platform. The main novelty is that each user is not simply considered as a node on the social graph, but is further equipped with his/her own Wall and Newsfeed, and has his/her own individual self-posting and re-posting activity. As a main result using our developed model, we derive in closed form the probabilities that posts originating from a given user are found on the Wall and Newsfeed of any other. These are the solution of a linear system of equations, which can be resolved iteratively. In fact, our model is very flexible with respect to the modelling assumptions. Using the probabilities derived from the solution, we define a new measure of per-user influence over the entire network, the Ψ-score, which combines the user position on the graph with user (re-)posting activity. In the homogeneous case where all users (re-)post with the same rate, it is shown that a variant of the Ψ-score is equal to PageRank. Furthermore, we compare the new model and its Ψ-score against the empirical influence measured from very large data traces (Twitter, Weibo). The results illustrate that these new tools can accurately rank influencers with asymmetric (re-)posting activity for such real world applications.