Measuring Diversity in Heterogeneous Information Networks

Pedro Ramaciotti Morales , Robin Lamarche-Perrin, Raphaël Fournier-S’Niehotta, Rémy Poulain, Lionel Tabourier,  Fabien Tarissan

In Theoretical Computer Science, 859, pp 80-115, 2021 

Diversity is a concept relevant to numerous domains of research varying from ecology, to information theory, andto economics, to cite a few. It is a notion that is steadily gaining attention in the information retrieval, networkanalysis, and artificial neural networks communities. While the use of diversity measures in network-structured datacounts a growing number of applications, no clear and comprehensive description is available for the different waysin which diversities can be measured. In this article, we develop a formal framework for the application of a largefamily of diversity measures to heterogeneous information networks (HINs), a flexible, widely-used network dataformalism. This extends the application of diversity measures, from systems of classifications and apportionments,to more complex relations that can be better modeled by networks. In doing so, we not only provide an effectiveorganization of multiple practices from different domains, but also unearth new observables in systems modeled byheterogeneous information networks. We illustrate the pertinence of our approach by developing different applicationsrelated to various domains concerned by both diversity and networks. In particular, we illustrate the usefulness of thesenew proposed observables in the domains of recommender systems and social media studies, among other fields.


How to Teach the Undecidability of Malware Detection Problem and Halting Problem

Matthieu Journault, Pascal Lafourcade, Malika More, Rémy Poulain, Léo Robert

WISE13: The 13th World Conference on Information Security Education, 2020

Malware detection is a term that is often associated to Computer Science Security. The underlying main problem is called Virus detection and consists in answering the following question: Is there a program that can always decide if a program is a virus or not? On the other hand, the undecidability of some problems is an important notion in Computer Science : an undecidable problem is a problem for which no algorithm exists to solve it. We propose an activity that demonstrates that virus detection is an undecidable problem. Hence we prove that the answer to the above question is no. We follow the proof given by Cohen in his PhD in 1983. The proof is close to the proof given by Turing in 1936 of the undecidability of the Halting problem. We also give an activity to prove the undecidability of the Halting problem. These proofs allow us to introduce two important ways of proving theorems in Computer Science : proof by contradiction and proof by case disjunction. We propose a simple way to present these notions to students using a maze. Our activity is unplugged, i.e. we use only a paper based model of computer, and is designed for high-school students. This is the reason why we use Scratch to write our « programs ».


GraKeL: A Graph Kernel Library in Python

Giannis Siglidis, Giannis Nikolentzos, Stratis Limnios, Christos Giatsidis, Konstantinos Skianis, Michalis Vazirgiannis

Journal of Machine Learning Research 21, 2020

The problem of accurately measuring the similarity between graphs is at the core of many applications in a variety of disciplines. Graph kernels have recently emerged as a promising approach to this problem. There are now many kernels, each focusing on different structural aspects of graphs. Here, we present GraKeL, a library that unifies several graph kernels into a common framework. The library is written in Python and adheres to the scikit-learn interface. It is simple to use and can be naturally combined with scikit-learn’s modules to build a complete machine learning pipeline for tasks such as graph classification and clustering. The code is BSD licensed and is available at:


LouvainNE: Hierarchical Louvain Method for High Quality and Scalable Network Embedding

Ayan Kumar Bhowmick, Koushik Meneni, Maximilien Danisch, Jean-Loup Guillaume and Bivas Mitra

In Proceedings of the 13th ACM International WSDM conference, 2020

Network embedding, that aims to learn low-dimensional vector representation of nodes such that the network structure is preserved, has gained significant research attention in recent years. However, most state-of-the-art network embedding methods are computationally expensive and hence unsuitable for representing nodes in billion-scale networks. In this paper, we present LouvainNE, a hierarchical clustering approach to network embedding. Precisely, we employ Louvain, an extremely fast and accurate community detection method, to build a hierarchy of successively smaller subgraphs. We obtain representations of individual nodes in the original graph at different levels of the hierarchy, then we aggregate these representations to learn the final embedding vectors. Our theoretical analysis shows that our proposed algorithm has quasi-linear run-time and memory complexity. Our extensive experimental evaluation, carried out on multiple real-world networks of different scales, demonstrates both (i) the scalability of our proposed approach that can handle graphs containing tens of billions of edges, as well as (ii) its effectiveness in performing downstream network mining tasks such as network reconstruction and node classification.


KClist++: A Simple Algorithm for Finding k-Clique Densest Subgraphs in Large Graphs

Bintao Sun, Maximilien Danisch, T-H. Hubert Chan and Mauro Sozio

In Proceedings of the VLDB Endowment, 2020

The problem of finding densest subgraphs has received increasing attention in recent years finding applications in biology, finance, as well as social network analysis. The k-clique densest subgraph problem is a generalization of the densest subgraph problem, where the objective is to find a subgraph maximizing the ratio between the number of k-cliques in the subgraph and its number of nodes. It includes as a special case the problem of finding subgraphs with largest average number of triangles (k=3), which plays an important role in social network analysis. Moreover, algorithms that deal with larger values of k can effectively find quasi-cliques. The densest subgraph problem can be solved in polynomial time with algorithms based on maximum flow, linear programming or a recent approach based on convex optimization. In particular, the latter approach can scale to graphs containing tens of billions of edges. While finding a densest subgraph in large graphs is no longer a bottleneck, the k-clique densest subgraph remains challenging even when k=3. Our work aims at developing near-optimal and exact algorithms for the k-clique densest subgraph problem on large real-world graphs. We give a surprisingly simple procedure that can be employed to find the maximal k-clique densest subgraph in large-real world graphs. By leveraging appealing properties of existing results, we combine it with a recent approach for listing all k-cliques in a graph and a sampling scheme, obtaining the state-of-the-art approaches for the aforementioned problem. Our theoretical results are complemented with an extensive experimental evaluation showing the effectiveness of our approach in large real-world graphs.


Strongly Connected Components in StreamGraphs: Computation and Experimentations

Léo Rannou, Clémence Magnien, and Matthieu Latapy

In Proceedings of the 9th International Conference on Complex Networks and their Applications, 2020

Stream graphs model highly dynamic networks in which nodes and/or links arrive and/or leave over time. Strongly connected components in stream graphs were defined recently, but no algorithm was provided to compute them. We present here several solutions with polynomial time and space complexities, each with its own strengths and weaknesses. We provide an implementation and experimentally compare the algorithms in a wide variety of practical cases. In addition, we propose an approximation scheme that significantly reduces computation costs, and gives even more insight on the dataset.


Testing the Impact of Semantics and Structure on Recommendation Accuracy and Diversity

Pedro Ramaciotti Morales, Lionel Tabourier, Raphaël Fournier-S’niehotta

In Proceedings of the IEEE/ACM International Conference on Advances in Social Network Analysis and Mining (ASONAM), 2020 (virtual)

The Heterogeneous Information Network (HIN) formalism is very flexible and enables complex recommendations models. We evaluate the effect of different parts of a HIN on the accuracy and the diversity of recommendations, then investigate if these effects are only due to the semantic content encoded in the network. We use recently-proposed diversity measures which are based on the network structure and better suited to the HIN formalism. Finally, we randomly shuffle the edges of some parts of the HIN, to empty the network from its semantic content, while leaving its structure relatively unaffected. We show that the semantic content encoded in the network data has a limited importance for the performance of a recommender system and that structure is crucial.


Finding Top-k Nodes for Temporal Closeness in Large Temporal Graphs

Pierluigi Crescenzi, Clémence Magnien and Andrea Marino

In Algorithms, 13 (9), 211, 2020

The harmonic closeness centrality measure associates, to each node of a graph, the average of the inverse of its distances from all the other nodes (by assuming that unreachable nodes are at infinite distance). This notion has been adapted to temporal graphs (that is, graphs in which edges can appear and disappear during time) and in this paper we address the question of finding the top-k nodes for this metric. Computing the temporal closeness for one node can be done in O(m) time, where m is the number of temporal edges. Therefore computing exactly the closeness for all nodes, in order to find the ones with top closeness, would require O(nm) time, where n is the number of nodes. This time complexity is intractable for large temporal graphs. Instead, we show how this measure can be efficiently approximated by using a “backward” temporal breadth-first search algorithm and a classical sampling technique. Our experimental results show that the approximation is excellent for nodes with high closeness, allowing us to detect them in practice in a fraction of the time needed for computing the exact closeness of all nodes. We validate our approach with an extensive set of experiments.


Do you trade with your friends or become friends with your trading partners? A case study in the G1 cryptocurrency

Nicolas Gensollen, Matthieu Latapy

In Applied Network Science, 5 (25), 2020

We study the interplay between social ties and financial transactions made through a recent cryptocurrency called G1. It has the particularity of combining the usual transaction record with a reliable network of identified users. This gives the opportunity to observe exactly who sent money to whom over a social network. This social network is a key piece of this cryptocurrency, which therefore puts much effort in ensuring that nodes correspond to unique, well identified, real living human users, linked together only if they met at least once in real world. Using this data, we study how social ties impact the structure of transactions and conversely. We show that users make transactions almost exclusively with people they are connected with in the social network. Instead, they tend to build social connections with people they will never make transactions with.


Investigating the Lack of Diversity in User Behavior: the Case of Musical Content on Online Platforms.

Rémy Poulain, Fabien Tarissan

In Information Processing & Management, 57(2), Elsevier, 2020

Whether to deal with issues related to information ranking (e.g. search engines) or content recommendation (on social networks, for instance), algorithms are at the core of processes that select which information is made visible. Such algorithmic choices have a strong impact on users’ activity de facto, and therefore on their access to information. This raises the question of how to measure the quality of the choices algorithms make and their impact on users. As a first step in that direction, this paper presents a framework with which to analyze the diversity of information accessed by users in the context of musical content. The approach adopted centers on the representation of user activity through a tripartite graph that maps users to products and products to categories. In turn, conducting random walks in this structure makes it possible to analyze how categories catch users’ attention and how this attention is distributed. Building upon this distribution, we propose a new index referred to as the (calibrated) herfindahl diversity, which is aimed at quantifying the extent to which this distribution is diverse and representative of existing categories. To the best of our knowledge, this paper is the first to connect the output of random walks on graphs with diversity indexes. We demonstrate the benefit of such an approach by applying our index to two datasets that record user activity on online platforms involving musical content. The results are threefold. First, we show that our index can discriminate between different user behaviors. Second, we shed some light on a saturation phenomenon in the diversity of users’ attention. Finally, we show that the lack of diversity observed in the datasets derives from exogenous factors related to the heterogeneous popularity of music styles, as opposed to internal factors such as recurrent user behaviors.

Predicting interactions between individuals with structural and dynamical information

Thibaud Arnoux, Lionel Tabourier, Matthieu Latapy

In Journal of Interdisciplinary Methodologies and Issues in Sciences, 2019

Capturing both structural and temporal features of interactions is crucial in many real-world situations like studies of contact between individuals. Using the link stream formalism to model data, we address here the activity prediction problem: we predict the number of links that will occur during a given time period between each pair of nodes. To do this, we take benefit from the temporal and structural information captured by link streams. We design and implement a modular supervised learning method to make prediction, and we study the key elements influencing its performances. We then introduce classes of node pairs, which improves prediction quality and increases diversity

Weighted, Bipartite, or Directed Stream Graphs for the Modeling of Temporal Networks

Matthieu Latapy, Clémence Magnien, Tiphaine Viard

In Temporal Network Theory, Holme P., Saramäki J. (eds), Computational Social Sciences, Springer, 2019

We recently introduced a formalism for the modeling of temporal networks, that we call stream graphs. It emphasizes the streaming nature of data and allows rigorous definitions of many important concepts generalizing classical graphs. This includes in particular size, density, clique, neighborhood, degree, clustering coefficient, and transitivity. In this contribution, we show that, like graphs, stream graphs may be extended to cope with bipartite structures, with node and link weights, or with link directions. We review the main bipartite, weighted or directed graph concepts proposed in the literature, we generalize them to the cases of bipartite, weighted, or directed stream graphs, and we show that obtained concepts are consistent with graph and stream graph ones. This provides a formal ground for an accurate modeling of the many temporal networks that have one or several of these features.

Role of the Website Structure in the Diversity of Browsing Behaviors

Pedro Ramaciotti Morales, Lionel Tabourier, Sylvain Ung, and Christophe Prieur.

In Proceedings of the 30th ACM Conference on Hypertext and Social Media, pp. 133-142. ACM, 2019.

The quantitative measurement of the diversity of information consumption has emerged as a prominent tool in the examination of relevant phenomena such as filter bubbles. This paper proposes an analysis of the diversity of the navigation of users inside a website through the analysis of server log files. The methodology, guided and illustrated by a case study, but easily applicable to other cases, establishes relations between types of users’ behavior, site structure, and diversity of web browsing. Using the navigation paths of sessions reconstructed from the log file, the proposed methodology offers three main insights: 1) it reveals diversification patterns associated with the page network structure, 2) it relates human browsing characteristics (such as multi-tabbing or click frequency) with the degree of diversity, and 3) it helps identifying diversification patterns specific to subsets of users. These results are in turn useful in the analysis of recommender systems and in the design of websites when there are diversity-related goals or constrains.


Approximating the Temporal Neighbourhood Function of Large Temporal Graphs

Pierluigi Crescenzi, Clémence Magnien and Andrea Marino

Algorithms 2019, 12(10), 211 (Special Issue Algorithm Engineering: Towards Practically Efficient Solutions to Combinatorial Problems)

Temporal networks are graphs in which edges have temporal labels, specifying their starting times and their traversal times. Several notions of distances between two nodes in a temporal network can be analyzed, by referring, for example, to the earliest arrival time or to the latest starting time of a temporal path connecting the two nodes. In this paper we mostly refer to the notion of temporal reachability by using the earliest arrival time. In particular, we first show how the sketch approach, which has been already used in the case of classical graphs, can be applied to the case of temporal networks in order to approximately compute the sizes of the temporal cones of a temporal network. By making use of this approach, we subsequently show how we can approximate the temporal neighborhood function (that is, the number of pairs of nodes reachable from one another in a given time interval) of large temporal networks in a few seconds. Finally, we apply our algorithm in order to analyze and compare the behavior of 25 public transportation temporal networks. Our results can be easily adapted to the case in which we want to refer to the notion of distance based on the latest starting time.


Qualified personalities: Sociology of the French Media Government from Cinema to the Digital Era

Olivier Alexandre

Chapter in Reconceptualising Film Policies, 2017

The nature of French audiovisual sector is determined by a layering of policies, created at various periods of time. A public policy system has been continuously developed and adapted since the 1950s, mostly focusing on the support to and defence of the artistic and moral quality of film and television programmes. This institutional system has relied on ‘qualified personalities’ emanating from diverse sectors such as cinema, television, arts, culture, education, administration and the political world. The chapter presents a sociological analysis of the French model matrix. It focuses on the revolving-door system and the policy-making personnel that have enforced a stable regulatory frame for audiovisual industries. The rise of digital operators and executives – more internationalised and engineering-solution oriented – is currently destabilising this ecosystem. There is an important generational, cultural, ideological and linguistic gap between the French ‘Media government’ and the management teams of the new players.


A general graph-based framework for top-N recommendation using content, temporal and trust information

Armel Jacques Nzekon Nzeko’o, Maurice Tchuente and Matthieu Latapy

Journal of Interdisciplinary Methodologies and Issues in Sciences, 2019

Recommending appropriate items to users is crucial in many e-commerce platforms that containimplicit data as users’ browsing, purchasing and streaming history. One common approach con-sists in selecting the N most relevant items to each user, for a given N, which is called top-Nrecommendation. To do so, recommender systems rely on various kinds of information, like itemand user features, past interest of users for items, browsing history and trust between users. How-ever, they often use only one or two such pieces of information, which limits their performance.In this paper, we design and implement GraFC2T2, a general graph-based framework to easilycombine and compare various kinds of side information for top-N recommendation. It encodescontent-based features, temporal and trust information into a complex graph, and uses personal-ized PageRank on this graph to perform recommendation. We conduct experiments on Epinionsand Ciao datasets, and compare obtained performances using F1-score, Hit ratio and MAP eval-uation metrics, to systems based on matrix factorization and deep learning. This shows that ourframework is convenient for such explorations, and that combining different kinds of informationindeed improves recommendation in general.


Spreading dynamics in a cattle trade network: Size, speed, typical profile and consequences on epidemic control strategies

Aurore Payen, Lionel Tabourier and Matthieu Latapy

PLOS ONE, 2019

Infections can spread among livestock notably because infected animals can be brought to uncontaminated holdings, therefore exposing a new group of susceptible animals to the dis- ease. As a consequence, the structure and dynamics of animal trade networks is a major focus of interest to control zoonosis. We investigate the impact of the chronology of animal trades on the dynamics of the process. Precisely, in the context of a basic SI model spread- ing, we measure on the French database of bovine transfers to what extent a snapshot- based analysis of the cattle trade networks overestimates the epidemic risks. We bring into light that an analysis taking into account the chronology of interactions would give a much more accurate assessment of both the size and speed of the process. For this purpose, we model data as a temporal network that we analyze using the link stream formalism in order to mix structural and temporal aspects. We also show that in this dataset, a basic SI spread- ing comes down in most cases to a simple two-phases scenario: a waiting period, with few contacts and low activity, followed by a linear growth of the number of infected holdings. Using this portrait of the spreading process, we identify efficient strategies to control a potential outbreak, based on the identification of specific elements of the link stream which have a higher probability to be involved in a spreading process.


Degree-based Outlier Detection within IP Traffic Modelled as a Link Stream (extended version)

Audrey Wilmet, Tiphaine Viard, Matthieu Latapy and Robin Lamarche-Perrin

Computer Networks, 2019

This paper aims at precisely detecting and identifying anomalous events in IP traffic. To this end, we adopt the link stream formalism which properly captures temporal and structural features of the data. Within this framework, we focus on finding anomalous behaviours with respect to the degree of IP addresses over time. Due to diversity in IP profiles, this feature is typically distributed heterogeneously, preventing us to directly find anomalies. To deal with this challenge, we design a method to detect outliers as well as precisely identify their cause in a sequence of similar heterogeneous distributions. We apply it to several MAWI captures of IP traffic and we show that it succeeds in detecting relevant patterns in terms of anomalous network activity.