XRootD Popularity on Hadoop Clusters


2016 (San Francisco, Oct 2016)


Primary authors: Marco Meoni (Universita di Pisa & INFN (IT))

Co-authors: Tommaso Boccali (Universita di Pisa & INFN (IT)), Nicolo Magini (Fermi National Accelerator Lab. (US)), Luca Menichetti (CERN), Domenico Giordano (CERN)




The CMS experiment has implemented a computing model in which distributed monitoring infrastructures collect a wide range of data and metadata about the performance of computing operations. This data can be probed further with Big Data analytics approaches to discover patterns and correlations that can improve the throughput and efficiency of the computing model.

CMS has already begun to store a large set of operational data (user activities, job submissions, resources, file transfers, site efficiencies, software releases, network traffic, machine logs) in a Hadoop cluster. This makes it possible to run fast, arbitrary queries on the data and to test several MapReduce-based computing frameworks.

In this work we analyze the XRootD logs collected in Hadoop through Gled and Flume, and we benchmark their aggregation at the dataset level for the popularity queries used in monitoring, demonstrating how dashboard and monitoring systems can benefit from Hadoop parallelism. The processing time of XRootD time-series logs on the existing Oracle DBMS does not scale linearly with data volume. Big Data architectures, by contrast, do, which makes re-processing any user-defined time interval very effective. The entire set of existing Oracle queries is replicated in the Hadoop data store, and the results are validated accordingly.
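As an illustration of dataset-level aggregation over a user-defined time interval, the following is a minimal map/reduce-style sketch in plain Python. It is not the code used in this work: the record layout (timestamp, user, dataset, bytes read), the sample values, and the function names are all assumptions made for the example, and a real job would run on the Hadoop cluster rather than in a single process.

```python
from datetime import datetime, timezone

# Hypothetical access records: (unix_timestamp, user, dataset, bytes_read).
# The real XRootD log schema is not given in the abstract.
RECORDS = [
    (1475280000, "alice", "/A/B/AOD", 100),
    (1475283600, "bob", "/A/B/AOD", 50),
    (1475366400, "alice", "/C/D/RAW", 70),
    (1475370000, "alice", "/A/B/AOD", 30),
]


def map_record(record):
    """Map step: emit a (day, dataset) key and a partial aggregate."""
    ts, user, dataset, nbytes = record
    day = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d")
    return (day, dataset), (1, nbytes, {user})


def reduce_values(a, b):
    """Reduce step: merge two partial aggregates for the same key."""
    return (a[0] + b[0], a[1] + b[1], a[2] | b[2])


def popularity(records, start, end):
    """Aggregate accesses per (day, dataset) for start <= timestamp < end."""
    acc = {}
    for record in records:
        if not (start <= record[0] < end):
            continue
        key, value = map_record(record)
        acc[key] = reduce_values(acc[key], value) if key in acc else value
    return {
        key: {"accesses": n, "bytes_read": nbytes, "users": len(users)}
        for key, (n, nbytes, users) in acc.items()
    }


if __name__ == "__main__":
    for key, stats in sorted(popularity(RECORDS, 1475280000, 1475452800).items()):
        print(key, stats)
```

Because the reduce step is associative and commutative, partial aggregates can be merged in any order, which is the property that lets this kind of query be spread across many workers and re-run for any chosen interval.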

These results constitute the set of features on top of which a mining platform is designed to predict the popularity of a new dataset, the best location for replicas, or the proper amount of CPU and storage in future timeframes. Learning techniques applied to Big Data architectures are extensively explored to study the correlations between aggregated data and to look for patterns in the CMS computing ecosystem. Examples of such data are primarily operational information, like file access statistics or dataset attributes, organised in samples suitable for feeding several classifiers.
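One way to picture how access statistics could be organised into samples for a classifier is sketched below. Everything here is hypothetical: the weekly counts, the sliding-window features, and the rule that labels a dataset "popular" when the following week reaches a threshold are assumptions for illustration, not the feature set or labelling used in this work.

```python
# Hypothetical weekly access counts per dataset, oldest week first.
WEEKLY_ACCESSES = {
    "/A/B/AOD": [12, 30, 45, 60],
    "/C/D/RAW": [5, 2, 0, 1],
}


def make_samples(weekly, window=3, threshold=10):
    """Build (dataset, features, label) samples with a sliding window.

    features: access counts of `window` consecutive weeks.
    label: 1 if the following week has at least `threshold` accesses
           (an assumed definition of "popular"), else 0.
    """
    samples = []
    for dataset, counts in sorted(weekly.items()):
        for i in range(len(counts) - window):
            features = counts[i:i + window]
            label = int(counts[i + window] >= threshold)
            samples.append((dataset, features, label))
    return samples


if __name__ == "__main__":
    for sample in make_samples(WEEKLY_ACCESSES):
        print(sample)
```

The feature lists and labels produced this way could then be passed to any classifier that accepts numeric vectors; dataset attributes would be appended to the same feature vector.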

(CMS abstract, submitted here because two authors are from IT)
