2 New Internal Mobility Opportunities (CM-ST)
Computing Engineer (IT-CM-MM, Monitoring and Messaging)
As part of the monitoring team, the candidate would be involved in designing, developing and deploying a new unified monitoring and alarms infrastructure that is progressively replacing the IT- and WLCG-specific solutions. This unified monitoring covers all phases of the dataflow, including collecting metrics and logs, live online data streaming, large-scale analytics and comprehensive alarms. The common dataflow is based on established open-source solutions such as Collectd, Kafka, Spark, Elasticsearch, InfluxDB and Grafana.
This position offers the opportunity to learn state-of-the-art monitoring, streaming and storage technologies and to apply them to provide reliable and scalable monitoring solutions for CERN and the WLCG. Proven experience in software development, knowledge of the core technologies and a commitment to working in an Agile development environment are important requirements for this position.
Storage operations and support engineer (IT-ST-FDO, File and Disk Operations)
The FDO section currently manages more than 1,500 servers with more than 60,000 disks, representing about 300 petabytes of raw storage, most notably the EOS and Ceph services for CERN and the Worldwide LHC Computing Grid (WLCG) community.
The successful applicant will manage critical services and drive their evolution to handle future growth in scale. The position requires experience in the following areas:
- Linux system administration (Red Hat or CentOS in particular)
- DevOps-style operations (most notably Configuration Management and Infrastructure Automation) for large storage services
- Diagnosis and optimisation of Linux file systems (XFS and/or ZFS)
- Troubleshooting anomalies in complex C++ applications (gdb, network tools, …)
- Service management, most notably proactive management of critical situations that impact user activities
EOS is at the centre of physics computing (notably LHC data taking), and Ceph provides the backbone of the IT cloud and support for mission-critical services (e.g. Puppet, GitLab, CERNBox and others). FDO section members may therefore need to carry out interventions outside working hours in response to critical tickets.