Esteban Bautista, Matthieu Latapy
In 10th International Conference on Complex Networks and their Applications, Madrid (Spain), December 2021 (Poster)
Many systems generate data as a set of triplets (a,b,c): they may represent that user a
called b at time c or that customer a purchased product b in store c. These datasets are
traditionally studied as networks with an extra dimension (time or layer), for which the
fields of temporal and multiplex networks have extended graph theory to account for
the new dimension . However, such frameworks detach one variable from the others
and allow to extend one same concept in many ways, making it hard to capture pat-
terns across all dimensions and to identify the best definitions for a given dataset. This
work overrides this vision and proposes a direct processing of the set of triplets. While
 also approaches triplets directly, it focuses on specific patterns and applications.
Our work shows that a more general analysis is possible by partitioning the data and
building categorical propositions (CPs) that encode informative patterns. We show that
several concepts from graph theory can be framed under this formalism and leverage
such insights to extend the concepts to data triplets. Lastly, we propose an algorithm to
list CPs satisfying specific constraints and apply it to a real world dataset.
Esteban Bautista, Matthieu Latapy
In Social Network Analysis and Mining, 2022, vol. 12, no 1, p. 1-11.
The personalized PageRank algorithm is one of the
most versatile tools for the analysis of networks. In spite of its
ubiquity, maintaining personalized PageRank vectors when the
underlying network constantly evolves is still a challenging task.
To address this limitation, this work proposes a novel distributed
algorithm to locally update personalized PageRank vectors when
the graph topology changes. The proposed algorithm is based on
the use of Chebyshev polynomials and a novel update equation
that encompasses a large family of PageRank-based methods.
In particular, the algorithm has the following advantages: (i) it
has faster convergence speed than state-of-the-art alternatives
for local PageRank updating; and (ii) it can update the solution
of recent generalizations of PageRank for which no updating
algorithms have been developed. Experiments in a real-world
temporal network of an autonomous system validate the effec-
tiveness of the proposed algorithm.
Johannes Langguth, Ioannis Panagiotas, Bora Uçar
28th edition of the IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC 2021), Dec 2021, Bangalore, India
We investigate the parallelization of the Karp-Sipser kernelization technique, which constitutes the central part of the well-known Karp-Sipser heuristic for the maximum cardinality matching problem. The technique reduces a given problem instance to a smaller but equivalent one, by repeated applications of two operations: vertex removal, and merging two vertices. The operation of merging two vertices poses the principal challenge in parallelizing the technique. We describe an algorithm that minimizes the need for synchronization and present an efficient shared-memory parallel implementation of the kernelization technique for bipartite graphs. Using extensive experiments on a variety of multicore CPUs, we show that our implementation scales well up to 32 cores on one socket.
Jules Azad Emery, Matthieu Latapy
Despite the fact that it is publicly available, collecting and processing the full bitcoin blockchain data is not trivial. Its mere size, history, and other features indeed raise quite specific challenges, that we address in this paper. The strengths of our approach are the following: it relies on very basic and standard tools, which makes the procedure reliable and easily reproducible; it is a purely lossless procedure ensuring that we catch and preserve all existing data; it provides additional indexing that makes it easy to further process the whole data and select appropriate subsets of it. We present our procedure in details and illustrate its added value on large-scale use cases, like address clustering. We provide an implementation online, as well as the obtained dataset.
Alexis Baudin, Maximilien Danisch, Sergey Kirgizov, Clémence Magnien
International Conference on Advanced Data Mining and Applications (ADMA), 2021
Automatic detection of relevant groups of nodes in large real-world graphs,
i.e. community detection, has applications in many fields and has received a
lot of attention in the last twenty years. The most popular method designed to
find overlapping communities (where a node can belong to several communities)
is perhaps the clique percolation method (CPM). This method formalizes the
notion of community as a maximal union of $k$-cliques that can be reached from
each other through a series of adjacent $k$-cliques, where two cliques are
adjacent if and only if they overlap on $k-1$ nodes. Despite much effort CPM
has not been scalable to large graphs for medium values of $k$.
Recent work has shown that it is possible to efficiently list all $k$-cliques
in very large real-world graphs for medium values of $k$. We build on top of
this work and scale up CPM. In cases where this first algorithm faces memory
limitations, we propose another algorithm, CPMZ, that provides a solution close
to the exact one, using more time but less memory.
Diego Javier Zea, Sofya Laskina, Alexis Baudin, Hugues Richard and Elodie Laine
Genome Research, 2021
Understanding how protein function has evolved and diversified is of great importance for human genetics and medicine. Here, we tackle the problem of describing the whole transcript variability observed in several species by generalising the definition of splicing graph. We provide a practical solution to construct parsimonious evolutionary splicing graphs where each node is a minimal transcript building block defined across species. We show a clear link between the functional relevance, tissue-regulation and conservation of alternative transcripts on a set of 50 genes. By scaling up to the whole human protein-coding genome, we identify a few thousands of genes where alternative splicing modulates the number and composition of pseudo-repeats. We have implemented our approach in ThorAxe, an efficient, versatile, robust and freely available computational tool.
Hông-Lan Botterman, Robin Lamarche-Perrin
In Computational Social Network, 8 (15), 2021
Socio-technical systems usually consists of many intertwined networks, each connecting different types of objects (or actors) through a variety of means. As these networks are co-dependent, one can take advantage of this entangled structure to study interaction patterns in a particular network from the information provided by other related networks. A method is hence proposed and tested to recover the weights of missing or unobserved links in heterogeneous information networks (HIN) – abstract representations of systems composed of multiple types of entities and their relations. Given a pair of nodes in a HIN, this work aims at recovering the exact weight of the incident link to these two nodes, knowing some other links present in the HIN. To do so, probability distributions resulting from path-constrained random walks i.e., random walks where the walker is forced to follow only a specific sequence of node types and edge types, capable to capture specific semantics and commonly called a meta-path, are combined in a linearly fashion in order to approximate the desired result. This method is general enough to compute the link weight between any types of nodes. Experiments on Twitter and bibliographic data show the applicability of the method.
Fabrice Lécuyer, Maximilien Danisch, Lionel Tabourier
In ReScience C 7, 1 (3), 2021
Cache systems keep data close to the processor to access it faster than main memory would. Graph algorithms benefit from this when a cache line contains highly related nodes. Hao Wei extitet al. propose to reorder the nodes of a graph to optimise the proximity of nodes on a cache line. Their contribution, Gorder, creates such an ordering with a greedy procedure. In this replication, we implement ten different orderings and measure the execution time of nine standard graph algorithms on nine real-world datasets. We monitor cache performances to show that runtime variations are caused by cache management. We confirm that Gorder leads to the fastest execution in most cases due to cache-miss reductions. Our results show that simpler procedures are yet almost as efficient and much quicker to compute. This replication validates the initial results but highlights that generating a complex ordering like Gorder is time-consuming.
Matthieu Latapy, Clémence Magnien, Tiphaine Viard
In Temporal Network Theory, Holme P., Saramäki J. (eds), Computational Social Sciences, Springer, 2019
We recently introduced a formalism for the modeling of temporal networks, that we call stream graphs. It emphasizes the streaming nature of data and allows rigorous definitions of many important concepts generalizing classical graphs. This includes in particular size, density, clique, neighborhood, degree, clustering coefficient, and transitivity. In this contribution, we show that, like graphs, stream graphs may be extended to cope with bipartite structures, with node and link weights, or with link directions. We review the main bipartite, weighted or directed graph concepts proposed in the literature, we generalize them to the cases of bipartite, weighted, or directed stream graphs, and we show that obtained concepts are consistent with graph and stream graph ones. This provides a formal ground for an accurate modeling of the many temporal networks that have one or several of these features.
Anastasios Giovanidis, Bruno Baynat, Clémence Magnien, and Antoine Vendeville
IEEE Transactions on Networking, 2020
We introduce an original mathematical model to analyse the diffusion of posts within a generic online social platform. The main novelty is that each user is not simply considered as a node on the social graph, but is further equipped with his/her own Wall and Newsfeed, and has his/her own individual self-posting and re-posting activity. As a main result using our developed model, we derive in closed form the probabilities that posts originating from a given user are found on the Wall and Newsfeed of any other. These are the solution of a linear system of equations, which can be resolved iteratively. In fact, our model is very flexible with respect to the modelling assumptions. Using the probabilities derived from the solution, we define a new measure of per-user influence over the entire network, the Ψ-score, which combines the user position on the graph with user (re-)posting activity. In the homogeneous case where all users (re-)post with the same rate, it is shown that a variant of the Ψ-score is equal to PageRank. Furthermore, we compare the new model and its Ψ-score against the empirical influence measured from very large data traces (Twitter, Weibo). The results illustrate that these new tools can accurately rank influencers with asymmetric (re-)posting activity for such real world applications.
EDITE de Paris, LIP6, Thalès SIX – ThereSIS
Keywords:stream graphs, temporal networks, time-varying graphs, dynamic graphs,dynamic networks, interactions, graphs, networks, connected components, temporalpaths, algorithms, link streamsFor a long time, structured data and temporal data have been analysed separately. Many real world complex networks have a temporal dimension, such as contacts between individuals or financial transactions. Graph theory provides a wide set of tools to model and analyze static connections between entities. Unfortunately, this approach does not take into account the temporal nature of interactions. Stream graph theory is a formalism to model highly dynamic networks in which nodes and/or links arrive and/or leave over time. The number of applications of stream graph theory has risen rapidly, along with the number of theoretical concepts and algorithms to compute them. Several theoretical concepts such as connected components and temporal paths in stream graphs were defined recently, but no algorithm was provided to compute them. Moreover, the algorithmic complexities of these problems are unknown, as well as the insight they may shed on real-world stream graphs of interest. In this thesis, we present several solutions to compute notions of connectivity and path concepts in stream graphs. We also present alternative representations – data structures designed to facilitate specific computations – of stream graphs. We provide implementations and experimentally compare our methods in a wide range of practical cases. We show that these concepts indeed give much insight on features of large-scale datasets. Straph, a python library, was developed in order to have a reliable library for manipulating, analysing and visualising stream graphs, to design algorithms and models, and to rapidly evaluate them.
Pedro Ramaciotti Morales , Robin Lamarche-Perrin, Raphaël Fournier-S’Niehotta, Rémy Poulain, Lionel Tabourier, Fabien Tarissan
In Theoretical Computer Science, 859, pp 80-115, 2021
Diversity is a concept relevant to numerous domains of research varying from ecology, to information theory, andto economics, to cite a few. It is a notion that is steadily gaining attention in the information retrieval, networkanalysis, and artificial neural networks communities. While the use of diversity measures in network-structured datacounts a growing number of applications, no clear and comprehensive description is available for the different waysin which diversities can be measured. In this article, we develop a formal framework for the application of a largefamily of diversity measures to heterogeneous information networks (HINs), a flexible, widely-used network dataformalism. This extends the application of diversity measures, from systems of classifications and apportionments,to more complex relations that can be better modeled by networks. In doing so, we not only provide an effectiveorganization of multiple practices from different domains, but also unearth new observables in systems modeled byheterogeneous information networks. We illustrate the pertinence of our approach by developing different applicationsrelated to various domains concerned by both diversity and networks. In particular, we illustrate the usefulness of thesenew proposed observables in the domains of recommender systems and social media studies, among other fields.
Ayan Kumar Bhowmick, Koushik Meneni, Maximilien Danisch, Jean-Loup Guillaume and Bivas Mitra
In Proceedings of the 13th ACM International WSDM conference, 2020
Network embedding, that aims to learn low-dimensional vector representation of nodes such that the network structure is preserved, has gained significant research attention in recent years. However, most state-of-the-art network embedding methods are computationally expensive and hence unsuitable for representing nodes in billion-scale networks. In this paper, we present LouvainNE, a hierarchical clustering approach to network embedding. Precisely, we employ Louvain, an extremely fast and accurate community detection method, to build a hierarchy of successively smaller subgraphs. We obtain representations of individual nodes in the original graph at different levels of the hierarchy, then we aggregate these representations to learn the final embedding vectors. Our theoretical analysis shows that our proposed algorithm has quasi-linear run-time and memory complexity. Our extensive experimental evaluation, carried out on multiple real-world networks of different scales, demonstrates both (i) the scalability of our proposed approach that can handle graphs containing tens of billions of edges, as well as (ii) its effectiveness in performing downstream network mining tasks such as network reconstruction and node classification.
Bintao Sun, Maximilien Danisch, T-H. Hubert Chan and Mauro Sozio
In Proceedings of the VLDB Endowment, 2020
The problem of finding densest subgraphs has received increasing attention in recent years finding applications in biology, finance, as well as social network analysis. The k-clique densest subgraph problem is a generalization of the densest subgraph problem, where the objective is to find a subgraph maximizing the ratio between the number of k-cliques in the subgraph and its number of nodes. It includes as a special case the problem of finding subgraphs with largest average number of triangles (k=3), which plays an important role in social network analysis. Moreover, algorithms that deal with larger values of k can effectively find quasi-cliques. The densest subgraph problem can be solved in polynomial time with algorithms based on maximum flow, linear programming or a recent approach based on convex optimization. In particular, the latter approach can scale to graphs containing tens of billions of edges. While finding a densest subgraph in large graphs is no longer a bottleneck, the k-clique densest subgraph remains challenging even when k=3. Our work aims at developing near-optimal and exact algorithms for the k-clique densest subgraph problem on large real-world graphs. We give a surprisingly simple procedure that can be employed to find the maximal k-clique densest subgraph in large-real world graphs. By leveraging appealing properties of existing results, we combine it with a recent approach for listing all k-cliques in a graph and a sampling scheme, obtaining the state-of-the-art approaches for the aforementioned problem. Our theoretical results are complemented with an extensive experimental evaluation showing the effectiveness of our approach in large real-world graphs.
Léo Rannou, Clémence Magnien, and Matthieu Latapy
In Proceedings of the 9th International Conference on Complex Networks and their Applications, 2020
Stream graphs model highly dynamic networks in which nodes and/or links arrive and/or leave over time. Strongly connected components in stream graphs were defined recently, but no algorithm was provided to compute them. We present here several solutions with polynomial time and space complexities, each with its own strengths and weaknesses. We provide an implementation and experimentally compare the algorithms in a wide variety of practical cases. In addition, we propose an approximation scheme that significantly reduces computation costs, and gives even more insight on the dataset.
Pedro Ramaciotti Morales, Lionel Tabourier, Raphaël Fournier-S’niehotta
In Proceedings of the IEEE/ACM International Conference on Advances in Social Network Analysis and Mining (ASONAM), 2020 (virtual)
The Heterogeneous Information Network (HIN) formalism is very flexible and enables complex recommendations models. We evaluate the effect of different parts of a HIN on the accuracy and the diversity of recommendations, then investigate if these effects are only due to the semantic content encoded in the network. We use recently-proposed diversity measures which are based on the network structure and better suited to the HIN formalism. Finally, we randomly shuffle the edges of some parts of the HIN, to empty the network from its semantic content, while leaving its structure relatively unaffected. We show that the semantic content encoded in the network data has a limited importance for the performance of a recommender system and that structure is crucial.
Pierluigi Crescenzi, Clémence Magnien and Andrea Marino
In Algorithms, 13 (9), 211, 2020
The harmonic closeness centrality measure associates, to each node of a graph, the average of the inverse of its distances from all the other nodes (by assuming that unreachable nodes are at infinite distance). This notion has been adapted to temporal graphs (that is, graphs in which edges can appear and disappear during time) and in this paper we address the question of finding the top-k nodes for this metric. Computing the temporal closeness for one node can be done in O(m) time, where m is the number of temporal edges. Therefore computing exactly the closeness for all nodes, in order to find the ones with top closeness, would require O(nm) time, where n is the number of nodes. This time complexity is intractable for large temporal graphs. Instead, we show how this measure can be efficiently approximated by using a “backward” temporal breadth-first search algorithm and a classical sampling technique. Our experimental results show that the approximation is excellent for nodes with high closeness, allowing us to detect them in practice in a fraction of the time needed for computing the exact closeness of all nodes. We validate our approach with an extensive set of experiments.
Nicolas Gensollen, Matthieu Latapy
In Applied Network Science, 5 (25), 2020
We study the interplay between social ties and financial transactions made through a recent cryptocurrency called G1. It has the particularity of combining the usual transaction record with a reliable network of identified users. This gives the opportunity to observe exactly who sent money to whom over a social network. This social network is a key piece of this cryptocurrency, which therefore puts much effort in ensuring that nodes correspond to unique, well identified, real living human users, linked together only if they met at least once in real world. Using this data, we study how social ties impact the structure of transactions and conversely. We show that users make transactions almost exclusively with people they are connected with in the social network. Instead, they tend to build social connections with people they will never make transactions with.
Thibaud Arnoux, Lionel Tabourier, Matthieu Latapy
In Journal of Interdisciplinary Methodologies and Issues in Sciences, 2019
Capturing both structural and temporal features of interactions is crucial in many real-world situations like studies of contact between individuals. Using the link stream formalism to model data, we address here the activity prediction problem: we predict the number of links that will occur during a given time period between each pair of nodes. To do this, we take benefit from the temporal and structural information captured by link streams. We design and implement a modular supervised learning method to make prediction, and we study the key elements influencing its performances. We then introduce classes of node pairs, which improves prediction quality and increases diversity
Fabien Tarissan, Lionel Tabourier
In Proceedings of the 8th International Conference on Complex Networks and their Applications, Lisbonne, Portugal, 2019