1 New Internal Mobility Opportunity (IT-CF)
IT-CF is looking for an experienced computing engineer to join our data centre operations team and work on data centre infrastructure management (DCIM). The efficient management of our rack-level electrical distribution and cooling infrastructure is crucial for meeting the demanding needs of the LHC, and in particular for preparing the huge capacity increase anticipated for Run 3 (2021).
To that end we are looking for a skilled software developer and architect, ideally with some experience in telemetry and control of large computing hardware infrastructure and rack-level power distribution devices, to review the existing tools and automation and propose a functional overview of a new approach.
In a second phase the successful candidate shall also investigate, together with OpenStack experts in IT-CM, the possibility of automating workload placement with respect to electrical power availability, e.g. in geographical zones where a significant amount of persistent stranded power has been detected. In parallel to this project, the candidate shall perform a technology review of commercial DCIM solutions, including testing of the more promising products against our existing infrastructure limitations and needs, and produce a detailed report on how to proceed, based on evolution of the existing tools and/or integration of a commercial DCIM system. The successful candidate will then define, implement, test and maintain the specific controls and monitoring infrastructure (whether home-grown or commercial) of the data centres. In addition, he/she shall develop quality assurance and device management methods and procedures to improve the overall efficiency of running the data centres.
The ideal candidate shall be able to work independently through deep dives and expert analyses.
- Architecture and management of complex monitoring and orchestration systems
- Coordination with service managers and related software projects in the department
- Development of tools and systems in close collaboration with the data centre operations team as well as the groups responsible for facility monitoring and the building management system
- Together with other members of the operations team, develop (or purchase) a graphical real-time electrical power and environmental (temperature) map of the data centre, with granularity down to the individual device (e.g. power supply)
- Support data centre controls and monitoring applications including device (e.g. PDU) management
- Own the problem management process following rack or row level power incidents
- At least two years of experience developing software for monitoring or orchestration systems
- Advanced experience with the Linux operating system, shell scripting and the Python programming language
- Working experience with agile software development processes and tools, e.g. Scrum, GitLab, Jira, Jenkins
- Working experience with container-based technologies, orchestration and platforms (e.g. Docker, Kubernetes, OpenShift) is a plus
- Experience with device communication protocols such as IPMI, SNMP and Modbus is an asset