> By Raphaël Fournier and Matthieu Latapy
P2P systems are known to host a large amount of paedophile activity. Quantifying the number of paedophile users on a P2P system is therefore crucial for several reasons: easy access to such content is a major societal concern, policy making and law-enforcement budgeting rely on this figure, and the spread of online paedophilia may influence real-world behaviors.
However, this issue is very challenging to address. One must obtain and process large-scale data and cope with the high dynamicity of users in the system: they leave shortly after they arrive. Moreover, identifying users is a hard task in itself: several users may share the same computer or IP address, and a single user may use several computers.
We focused our study on the eDonkey system. We performed a ten-week measurement on a server to build a dataset of 127 million queries submitted by users to the search engine. We then carefully designed and evaluated a paedophile query detection tool. We consider a user to be paedophile as soon as a paedophile query originates from their identifier. We assess here two methods of identifying users: the first is based on IP addresses only, whereas the second uses both the IP address and the port number.
The plot shows the ratios of paedophile queries, paedophile IP addresses and paedophile (IP, port) pairs discovered since the beginning of the measurement. The ratio of paedophile IP addresses (red plot) clearly grows with the measurement duration. This reveals a pollution phenomenon: since an IP address may host different users over the course of the measurement, and since a single paedophile user is sufficient to mark an IP address as paedophile, the probability that any given IP address is considered paedophile grows with measurement time. In the limit, all IP addresses may eventually be considered paedophile. This confirms that using IP addresses only is misleading in this case. Conversely, using both the IP address and the port number (green plot) gives a very different picture: the ratio rapidly reaches a steady regime, very similar to the fraction of paedophile queries (blue plot). This shows that pollution due to dynamic allocation of addresses and ports, and to changes of users on the same computer, has a negligible impact in a measurement of this scale and duration.
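The three ratios discussed above can be illustrated with a minimal Python sketch. It is not the tool used in the study: the record layout `(ip, port, is_flagged)`, the function name `running_ratios` and the toy data are all assumptions made for illustration, and the flag is taken as given rather than computed by a detection tool.

```python
from typing import Iterable, List, Tuple

# Hypothetical record layout: (ip, port, is_flagged), where is_flagged is the
# verdict of some query detection tool (not implemented here).
Query = Tuple[str, int, bool]


def running_ratios(queries: Iterable[Query]) -> List[Tuple[float, float, float]]:
    """Return, after each query, the cumulative ratios of flagged queries,
    flagged IP addresses and flagged (IP, port) pairs seen so far."""
    seen_ips, flagged_ips = set(), set()
    seen_pairs, flagged_pairs = set(), set()
    total = 0
    flagged_queries = 0
    ratios = []
    for ip, port, is_flagged in queries:
        total += 1
        seen_ips.add(ip)
        seen_pairs.add((ip, port))
        if is_flagged:
            # One flagged query is enough to mark the whole identifier.
            flagged_queries += 1
            flagged_ips.add(ip)
            flagged_pairs.add((ip, port))
        ratios.append((
            flagged_queries / total,
            len(flagged_ips) / len(seen_ips),
            len(flagged_pairs) / len(seen_pairs),
        ))
    return ratios


# Toy data (assumption): three distinct (IP, port) pairs share one IP address.
toy_queries = [
    ("10.0.0.1", 4662, False),
    ("10.0.0.1", 5000, True),
    ("10.0.0.2", 4662, False),
    ("10.0.0.1", 6000, False),
]
final_query_ratio, final_ip_ratio, final_pair_ratio = running_ratios(toy_queries)[-1]
```

In this toy stream, a single flagged query marks the shared address `10.0.0.1`, so the IP-based ratio ends at 0.5 while the query ratio and the (IP, port) ratio both end at 0.25. This mirrors, on a very small scale, the pollution effect described above.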
We may then conclude that distinguishing users by their IP address and port is sufficient in our measurement, whereas the IP address alone is clearly not.
Kim C., From Fantasy to Reality: The Link Between Viewing Child Pornography and Molesting Children, Prosecutor 39(2): 17-18, 20, 47, 2005.