High performance computing systems consume and dissipate large amounts of power. Excessive heat dissipation requires aggressive cooling and extra space, both of which add to power consumption and infrastructure cost. Moreover, as system size and system temperature increase rapidly, high failure rates are observed. Support for fault detection and management is therefore a feature of interest when scheduling scientific applications in such environments. This characterizes the quality aspect of the time-to-solution.
A solution to the problem of application-level resilience to faults must meet the following requirements: (i) efficiency, without compromising performance; (ii) user-controlled reliability, where greater reliability incurs a higher cost (in terms of resources, CPU time, energy consumption, or allocation price); and (iii) minimal code changes in the application. Scheduling algorithms that can detect and manage faults are called fault tolerant (or resilient to faults). The most common fault tolerance strategies are task replication (via double or triple modular redundancy) and application checkpointing. However, it is unclear which of the existing solutions will scale to the size of the exascale computing systems expected by the beginning of the next decade.
We are looking for candidates who are highly motivated to conduct quality research, publish in top venues, and pursue a doctoral degree in Computer Science, with a focus on High Performance Computing. Applicants must have:
A Master’s degree (or equivalent) in Computer Science, Computer Engineering, or Mathematics;
Very good programming skills (C, C++, Java);
Very good knowledge of operating systems, in particular Linux;
Fluency in spoken and written English (knowledge of German is not required but is a plus);
Strong team-working abilities; and
Good analytical skills.
Experience in carrying out research projects and writing scientific articles will be considered a plus. Knowledge of hardware component specifications and of computing system monitoring is also a plus.