A local updating algorithm for Personalized PageRank via Chebyshev Polynomials

Esteban Bautista, Matthieu Latapy

In Social Network Analysis and Mining, 2022, vol. 12, no 1, p. 1-11.

The personalized PageRank algorithm is one of the most versatile tools for the analysis of networks. In spite of its ubiquity, maintaining personalized PageRank vectors when the underlying network constantly evolves is still a challenging task. To address this limitation, this work proposes a novel distributed algorithm to locally update personalized PageRank vectors when the graph topology changes. The proposed algorithm is based on the use of Chebyshev polynomials and a novel update equation that encompasses a large family of PageRank-based methods. In particular, the algorithm has the following advantages: (i) it has faster convergence speed than state-of-the-art alternatives for local PageRank updating; and (ii) it can update the solution of recent generalizations of PageRank for which no updating algorithms have been developed. Experiments in a real-world temporal network of an autonomous system validate the effectiveness of the proposed algorithm.

Download

The Role of Network Topology on Community-aware Centrality Measures

Stephany Rajeh

April 4th, 2022, 11am
Room : 24-25/405

While there is a great deal of work on designing centrality measures, the mainstream does not exploit the network’s community structure. Nevertheless, communities are pervasive in many real-world networks. A community is generally apprehended as a group of nodes densely connected between each other and sparsely connected with other nodes. As communities play a significant role in understanding how nodes behave in networks, a research area concerned with the relation between community structure and the importance of nodes has recently emerged in network science. These works have shown that incorporating community structure information allows designing more effective centrality measures. We refer to them as “community-aware” centrality measures. In this talk, we shed light on how classical (i.e., community-agnostic) centrality measures relate to community-aware centrality measures given a network’s macroscopic and mesoscopic topology. Then, we show the subtility of using these measures in different dynamic models, namely the Susceptible-Infected-Recovered (SIR) model and the Linear Threshold (LT) model. Additionally, as there are plenty of works to detect overlapping communities, few scientists make use of the overlapping community structure to identify critical nodes. Indeed, nodes may belong to several communities in many situations, indicating an overlapping community structure. We propose a framework to target influential nodes in networks with an overlapping community structure inspired by the concept of vitality. Finally, ascribable to the significance of communities in real-world networks, we present a backbone extraction method that maintains the network’s modularity while essentially reducing its  original size.

Randomized reference models for complex networks

Christian Lyngby Vestergaard, PhD

March 25th, 2022, 11am
Room : 26-00/332

Video: https://nuage.lip6.fr/s/db5bsrFMssEAoQD

Many dynamical systems can be successfully analyzed by representing them as networks. Empirically measured networks and dynamic processes that take place in these situations show heterogeneous, non-Markovian, and intrinsically correlated topologies and dynamics. This makes their analysis particularly challenging. Randomized reference models (RRMs) have emerged as a general and versatile toolbox for studying such systems. Defined as random networks with given features constrained to match those of an input (empirical) network, they may for example be used to identify important features of empirical networks and their effects on dynamical processes unfolding in the network. RRMs are typically implemented as procedures that reshuffle an empirical network, making them very generally applicable. However, the effects of most shuffling procedures on network features remain poorly understood, rendering their use non-trivial and susceptible to misinterpretation. I will describe a unified framework for the important class of RRMs generated by uniform shuffling procedures, which we by analogy to statistical physics will name microcanonical RRMs (MRRMs). MRRMs constrain chosen features to take exactly the same value as in the empirical network but are otherwise maximally random. Our framework lets us build a taxonomy of MRRMs that orders them and deduces their effects on important network features. It additionally tells us how we may generate new MRRMs by composition of existing ones. I will show examples of how the framework can be applied to unravel the influence of different features of an empirical network of mobile-phone calls on the spread of information and to characterize structural circuit motifs in the brain wiring diagrams (connectomes) of small animals. If time permits, I will finally discuss how we may use graph compression techniques to alleviate the statistical problems associated to classical null hypothesis testing for network motif discovery.
Reference: hal.archives-ouvertes.fr/hal-01817633v4

Complex Systems approach to the emergence of socio-economic phenomena

Yérali Gandica

February 3rd 2022, 2pm; Room 26-00/332

Slides

In this talk, I will present some of my publications using Complex Systems approaches to understand the emergence of socio-economic phenomena. Complex System science aims to study the phenomena that emerge from the interactions between the constituents and, thus, cannot be understood by studying a singular, isolated component. The field has incorporated concepts and methods derived from many areas, ranging from statistical physics and biology to economics and sociology, which, in a constant process of cross-fertilization, have given rise to new types of questions framed into the field of Complex Systems. The study of interacting particle systems has, for along time, been an essential subject of physics. The use of statistical methods has allowed for significant advances in this area by providing a bridge between the microscopic interactions and the sizeable collective behaviour of the system. The application of the methods of statistical physics to social phenomena, where the interacting particles are now interacting human beings, has proven to be very fruitful in allowing for the understanding of many features of social collective behaviour. The success of the statistical physics approaches to explain social data is currently attracting much interest, as demonstrated by the rapidly increasing number of publications in the field from both natural and social scientists. Some of the works that I will present are based on simulations, while some others are based on real data. In some cases, the comparison between simulations and real data is achieved. I will also explain some works in progress and projects for the incoming medium and long term to give a flavour about the scientific path that I plan, to illustrate the coherence of my scientific line of research.

Random Spanning Forests on Graphs for Fast Laplacian-Based Computations

Nicolas Tremblay

lundi 17 janvier 2022, LIP6, Sorbonne Université

Graphs are ubiquitous tools to represent networks, may they be networks modeling data from neurosciences, sociology, molecular biology, chemistry, etc. A cornerstone of the analysis of graphs is the Laplacian matrix L that encodes their structure. From a linear algebra point-of-view, the analysis of L offers fundamental insights on key properties of the graph: the diffusion speed of an information or a disease on a network, the vulnerability of a network to targeted or random attacks, the redundancy of certain parts of the network, the network’s structure in more or less independent modules, are all examples of characteristics of a network one may extract from the Laplacian matrix.

In this work, we concentrate on two specific problems that often arise in the context of graph-based data: i/ compute inverse traces of the form Tr( (L+qI)^(-1) ), ii/ compute smoothing operations of the form (L+qI)^(-1) y where q>0 and y some vector defined over the nodes of the graph. These two problems arise in many well-known graph-based algorithms, such as semi-supervised learning, label propagation, graph Tikhonov regularization, graph inpainting, etc.

In the context of large graphs, the required inverse which scales as O(n^3) in the worst-case, is often too expensive in practice. Many approaches have been developed in the state-of-the-art to circumvent this problem: polynomial approximation and (preconditioned) conjugate gradient are the two most well-known.

In this work, we develop a new class of techniques based on random spanning forests. We show that these forests are natural candidates to provide original, efficient, and easy-to-implement estimators.

This is joint work with Pierre-Olivier Amblard, Luca Avena, Simon Barthelmé, Alexandre Gaudillière and Yusuf Yigit Pilavci

Clique percolation method: memory efficient almost exact communities

Alexis Baudin, Maximilien Danisch, Sergey Kirgizov, Clémence Magnien

International Conference on Advanced Data Mining and Applications (ADMA), 2021

Automatic detection of relevant groups of nodes in large real-world graphs, i.e. community detection, has applications in many fields and has received a lot of attention in the last twenty years. The most popular method designed to find overlapping communities (where a node can belong to several communities) is perhaps the clique percolation method (CPM). This method formalizes the notion of community as a maximal union of k-cliques that can be reached from each other through a series of adjacent k-cliques, where two cliques are adjacent if and only if they overlap on k-1 nodes. Despite much effort CPM has not been scalable to large graphs for medium values of k. Recent work has shown that it is possible to efficiently list all k-cliques in very large real-world graphs for medium values of k. We build on top of this work and scale up CPM. In cases where this first algorithm faces memory limitations, we propose another algorithm, CPMZ, that provides a solution close to the exact one, using more time but less memory.

Download

A logical approach for temporal and multiplex networks analysis

Esteban Bautista, Matthieu Latapy

In 10th International Conference on Complex Networks and their Applications, Madrid (Spain), December 2021 (Poster)

Many systems generate data as a set of triplets (a,b,c): they may represent that user a
called b at time c or that customer a purchased product b in store c. These datasets are
traditionally studied as networks with an extra dimension (time or layer), for which the
fields of temporal and multiplex networks have extended graph theory to account for
the new dimension [1]. However, such frameworks detach one variable from the others
and allow to extend one same concept in many ways, making it hard to capture pat-
terns across all dimensions and to identify the best definitions for a given dataset. This
work overrides this vision and proposes a direct processing of the set of triplets. While
[2] also approaches triplets directly, it focuses on specific patterns and applications.
Our work shows that a more general analysis is possible by partitioning the data and
building categorical propositions (CPs) that encode informative patterns. We show that
several concepts from graph theory can be framed under this formalism and leverage
such insights to extend the concepts to data triplets. Lastly, we propose an algorithm to
list CPs satisfying specific constraints and apply it to a real world dataset.

Download

Shared-memory implementation of the Karp-Sipser kernelization process

Johannes Langguth, Ioannis Panagiotas, Bora Uçar

28th edition of the IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC 2021), Dec 2021, Bangalore, India

We investigate the parallelization of the Karp-Sipser kernelization technique, which constitutes the central part of the well-known Karp-Sipser heuristic for the maximum cardinality matching problem. The technique reduces a given problem instance to a smaller but equivalent one, by repeated applications of two operations: vertex removal, and merging two vertices. The operation of merging two vertices poses the principal challenge in parallelizing the technique. We describe an algorithm that minimizes the need for synchronization and present an efficient shared-memory parallel implementation of the kernelization technique for bipartite graphs. Using extensive experiments on a variety of multicore CPUs, we show that our implementation scales well up to 32 cores on one socket.

Download

Full Bitcoin Blockchain Data Made Easy

Jules Azad Emery, Matthieu Latapy

ASONAM, 2021

Despite the fact that it is publicly available, collecting and processing the full bitcoin blockchain data is not trivial. Its mere size, history, and other features indeed raise quite specific challenges, that we address in this paper. The strengths of our approach are the following: it relies on very basic and standard tools, which makes the procedure reliable and easily reproducible; it is a purely lossless procedure ensuring that we catch and preserve all existing data; it provides additional indexing that makes it easy to further process the whole data and select appropriate subsets of it. We present our procedure in details and illustrate its added value on large-scale use cases, like address clustering. We provide an implementation online, as well as the obtained dataset.

Download

Assessing conservation of alternative splicing with evolutionary splicing graphs

Diego Javier Zea, Sofya Laskina, Alexis Baudin, Hugues Richard and Elodie Laine

Genome Research, 2021

Understanding how protein function has evolved and diversified is of great importance for human genetics and medicine. Here, we tackle the problem of describing the whole transcript variability observed in several species by generalising the definition of splicing graph. We provide a practical solution to construct parsimonious evolutionary splicing graphs where each node is a minimal transcript building block defined across species. We show a clear link between the functional relevance, tissue-regulation and conservation of alternative transcripts on a set of 50 genes. By scaling up to the whole human protein-coding genome, we identify a few thousands of genes where alternative splicing modulates the number and composition of pseudo-repeats. We have implemented our approach in ThorAxe, an efficient, versatile, robust and freely available computational tool.

Download

Link weights recovery in heterogeneous information networks

Hông-Lan Botterman, Robin Lamarche-Perrin

In Computational Social Network, 8 (15), 2021

Socio-technical systems usually consists of many intertwined networks, each connecting different types of objects (or actors) through a variety of means. As these networks are co-dependent, one can take advantage of this entangled structure to study interaction patterns in a particular network from the information provided by other related networks. A method is hence proposed and tested to recover the weights of missing or unobserved links in heterogeneous information networks (HIN) – abstract representations of systems composed of multiple types of entities and their relations. Given a pair of nodes in a HIN, this work aims at recovering the exact weight of the incident link to these two nodes, knowing some other links present in the HIN. To do so, probability distributions resulting from path-constrained random walks i.e., random walks where the walker is forced to follow only a specific sequence of node types and edge types, capable to capture specific semantics and commonly called a meta-path, are combined in a linearly fashion in order to approximate the desired result. This method is general enough to compute the link weight between any types of nodes. Experiments on Twitter and bibliographic data show the applicability of the method.

Download

[Re] Speedup Graph Processing by Graph Ordering

Fabrice Lécuyer, Maximilien Danisch, Lionel Tabourier

In ReScience C 7, 1 (3), 2021

Cache systems keep data close to the processor to access it faster than main memory would. Graph algorithms benefit from this when a cache line contains highly related nodes. Hao Wei extitet al. propose to reorder the nodes of a graph to optimise the proximity of nodes on a cache line. Their contribution, Gorder, creates such an ordering with a greedy procedure. In this replication, we implement ten different orderings and measure the execution time of nine standard graph algorithms on nine real-world datasets. We monitor cache performances to show that runtime variations are caused by cache management. We confirm that Gorder leads to the fastest execution in most cases due to cache-miss reductions. Our results show that simpler procedures are yet almost as efficient and much quicker to compute. This replication validates the initial results but highlights that generating a complex ordering like Gorder is time-consuming.

Download

Ranking Online Social Users by their Influence

Anastasios Giovanidis, Bruno Baynat, Clémence Magnien, and Antoine Vendeville

IEEE Transactions on Networking, 2021

We introduce an original mathematical model to analyse the diffusion of posts within a generic online social platform. The main novelty is that each user is not simply considered as a node on the social graph, but is further equipped with his/her own Wall and Newsfeed, and has his/her own individual self-posting and re-posting activity. As a main result using our developed model, we derive in closed form the probabilities that posts originating from a given user are found on the Wall and Newsfeed of any other. These are the solution of a linear system of equations, which can be resolved iteratively. In fact, our model is very flexible with respect to the modelling assumptions. Using the probabilities derived from the solution, we define a new measure of per-user influence over the entire network, the Ψ-score, which combines the user position on the graph with user (re-)posting activity. In the homogeneous case where all users (re-)post with the same rate, it is shown that a variant of the Ψ-score is equal to PageRank. Furthermore, we compare the new model and its Ψ-score against the empirical influence measured from very large data traces (Twitter, Weibo). The results illustrate that these new tools can accurately rank influencers with asymmetric (re-)posting activity for such real world applications.

Temporal Connectivity and Path Computation for Stream Graph

Léo Rannou

EDITE de Paris, LIP6, Thalès SIX – ThereSIS

Keywords:stream graphs, temporal networks, time-varying graphs, dynamic graphs,dynamic networks, interactions, graphs, networks, connected components, temporalpaths, algorithms, link streamsFor a long time, structured data and temporal data have been analysed separately. Many real world complex networks have a temporal dimension, such as contacts between individuals or financial transactions. Graph theory provides a wide set of tools to model and analyze static connections between entities. Unfortunately, this approach does not take into account the temporal nature of interactions.  Stream graph theory is a formalism to model highly dynamic networks in which nodes and/or links arrive and/or leave over time.  The number of applications of stream graph theory has risen rapidly, along with the number of theoretical concepts and algorithms to compute them. Several theoretical concepts such as connected components and temporal paths in stream graphs were defined recently, but no algorithm was provided to compute them.  Moreover, the algorithmic complexities of these problems are unknown, as well as the insight they may shed on real-world stream graphs of interest. In this thesis, we present several solutions to compute notions of connectivity and path concepts in stream graphs. We also present alternative representations – data structures designed to facilitate specific computations – of stream graphs. We provide implementations and experimentally compare our methods in a wide range of practical cases. We show that these concepts indeed give much insight on features of large-scale datasets. Straph, a python library, was developed in order to have a reliable library for manipulating, analysing and visualising stream graphs, to design algorithms and models, and to rapidly evaluate them.

Download

Measuring Diversity in Heterogeneous Information Networks

Pedro Ramaciotti Morales , Robin Lamarche-Perrin, Raphaël Fournier-S’Niehotta, Rémy Poulain, Lionel Tabourier,  Fabien Tarissan

In Theoretical Computer Science, 859, pp 80-115, 2021 

Diversity is a concept relevant to numerous domains of research varying from ecology, to information theory, andto economics, to cite a few. It is a notion that is steadily gaining attention in the information retrieval, networkanalysis, and artificial neural networks communities. While the use of diversity measures in network-structured datacounts a growing number of applications, no clear and comprehensive description is available for the different waysin which diversities can be measured. In this article, we develop a formal framework for the application of a largefamily of diversity measures to heterogeneous information networks (HINs), a flexible, widely-used network dataformalism. This extends the application of diversity measures, from systems of classifications and apportionments,to more complex relations that can be better modeled by networks. In doing so, we not only provide an effectiveorganization of multiple practices from different domains, but also unearth new observables in systems modeled byheterogeneous information networks. We illustrate the pertinence of our approach by developing different applicationsrelated to various domains concerned by both diversity and networks. In particular, we illustrate the usefulness of thesenew proposed observables in the domains of recommender systems and social media studies, among other fields.

Download

Presentation by Esteban Bautista on Fractional PageRank

Esteban Bautista

Lundi 25 Janvier 2021, à 14h Présentiel: salle 25-26-105,  Revoir La Présentation:

Slides

Graph-Based Semi-Supervised Learning (G-SSL) techniques learn from both labelled and unlabelled data to build better classifiers. This classification paradigm has received considerable attention since modern applications allow to collect large amounts of unlabelled but structured data, naturally encoded by a graph, in a relatively easy and inexpensive manner, while annotated data is expensive to obtain.  Despite successes, the performance of G-SSL techniques can still be improved, particularly in challenging data settings or unbalanced scenarios. To address such limitations, I will present a novel G-SSL method based on the positive γ -th powers of the graph Laplacian, referred to as the Lγ-PageRank method. I will present a theoretical analysis of the new method, showing that (i) for γ < 1, it extends the standard PageRank algorithm to Lévy processes: where random walkers can now perform far-distant jumps in a single step; and (ii) for γ > 1, it classifies data on signed graphs: where nodes belonging to one same class are more likely to share positive edges while nodes from different classes are more likely to be connected with negative edges. I will also show the existence of an optimal Laplacian power maximizing performance, for which I will propose an algorithm for its automatic estimation. Lastly, I will show the significant classification improvements allowed by the proposed approach on several real-world datasets commonly used for classification.  

How to Teach the Undecidability of Malware Detection Problem and Halting Problem

Matthieu Journault, Pascal Lafourcade, Malika More, Rémy Poulain, Léo Robert

WISE13: The 13th World Conference on Information Security Education, 2020

Malware detection is a term that is often associated to Computer Science Security. The underlying main problem is called Virus detection and consists in answering the following question: Is there a program that can always decide if a program is a virus or not? On the other hand, the undecidability of some problems is an important notion in Computer Science : an undecidable problem is a problem for which no algorithm exists to solve it. We propose an activity that demonstrates that virus detection is an undecidable problem. Hence we prove that the answer to the above question is no. We follow the proof given by Cohen in his PhD in 1983. The proof is close to the proof given by Turing in 1936 of the undecidability of the Halting problem. We also give an activity to prove the undecidability of the Halting problem. These proofs allow us to introduce two important ways of proving theorems in Computer Science : proof by contradiction and proof by case disjunction. We propose a simple way to present these notions to students using a maze. Our activity is unplugged, i.e. we use only a paper based model of computer, and is designed for high-school students. This is the reason why we use Scratch to write our « programs ».

Download

GraKeL: A Graph Kernel Library in Python

Giannis Siglidis, Giannis Nikolentzos, Stratis Limnios, Christos Giatsidis, Konstantinos Skianis, Michalis Vazirgiannis

Journal of Machine Learning Research 21, 2020

The problem of accurately measuring the similarity between graphs is at the core of many applications in a variety of disciplines. Graph kernels have recently emerged as a promising approach to this problem. There are now many kernels, each focusing on different structural aspects of graphs. Here, we present GraKeL, a library that unifies several graph kernels into a common framework. The library is written in Python and adheres to the scikit-learn interface. It is simple to use and can be naturally combined with scikit-learn’s modules to build a complete machine learning pipeline for tasks such as graph classification and clustering. The code is BSD licensed and is available at: https://github.com/ysig/GraKeL

Download

LouvainNE: Hierarchical Louvain Method for High Quality and Scalable Network Embedding

Ayan Kumar Bhowmick, Koushik Meneni, Maximilien Danisch, Jean-Loup Guillaume and Bivas Mitra

In Proceedings of the 13th ACM International WSDM conference, 2020

Network embedding, that aims to learn low-dimensional vector representation of nodes such that the network structure is preserved, has gained significant research attention in recent years. However, most state-of-the-art network embedding methods are computationally expensive and hence unsuitable for representing nodes in billion-scale networks. In this paper, we present LouvainNE, a hierarchical clustering approach to network embedding. Precisely, we employ Louvain, an extremely fast and accurate community detection method, to build a hierarchy of successively smaller subgraphs. We obtain representations of individual nodes in the original graph at different levels of the hierarchy, then we aggregate these representations to learn the final embedding vectors. Our theoretical analysis shows that our proposed algorithm has quasi-linear run-time and memory complexity. Our extensive experimental evaluation, carried out on multiple real-world networks of different scales, demonstrates both (i) the scalability of our proposed approach that can handle graphs containing tens of billions of edges, as well as (ii) its effectiveness in performing downstream network mining tasks such as network reconstruction and node classification.

Download