In Proceedings of the 22nd international symposium on High-performance parallel and distributed computing (HPDC ’13). ACM, New York, NY, USA, 215-226.
Offloading processing to Intel’s newest manycore coprocessor, the Xeon Phi, is remarkably easy: it supports a popular ISA (x86-based), a popular OS (Linux), and a popular programming model (OpenMP). This portability is drawing programmers to tune many applications for high performance. But Linux also makes it easy for different users to share a Xeon Phi coprocessor, and multiprocessing inefficiencies can readily offset the gains made by individual programmers. Our experiments on a production, high-performance Xeon server with multiple Xeon Phi coprocessors show that coprocessor multiprocessing not only slows processes down but also introduces unreliability: some processes crash unexpectedly.
We propose COSMIC, a new user-level middleware that improves the performance and reliability of multiprocessing on coprocessors like the Xeon Phi. COSMIC fits seamlessly into the existing Xeon Phi software stack and is transparent to programmers. It manages Xeon Phi processes that execute parallel regions offloaded to the coprocessors. Offloads typically carry programmer-specified performance directives such as thread and affinity requirements. COSMIC schedules both processes and offloads fairly, taking into account the conflicting requirements of offloads that belong to different processes. This yields two benefits. First, it improves multiprocessing performance by preventing thread and memory oversubscription, avoiding inter-offload interference, and reducing load imbalance across coprocessors and cores. Second, it increases multiprocessing reliability by using programmer-specified per-process coprocessor memory requirements to completely avoid memory oversubscription and crashes. Our experiments on several representative Xeon Phi workloads show that, in a multiprocessing environment, COSMIC improves average core utilization by up to 3 times, reduces makespan by up to 52%, reduces average process latency (turnaround time) by 70%, and completely eliminates process crashes.
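The idea of avoiding oversubscription through declared requirements can be illustrated with a minimal admission-control sketch. This is not COSMIC's actual algorithm, which the abstract does not spell out; the names `Coprocessor`, `Process`, `place_processes`, and `try_start_offload` are hypothetical, and the policy shown (queue order, most free memory first) is an assumption chosen only for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Coprocessor:
    name: str
    mem_mb: int            # memory available to user processes
    hw_threads: int        # hardware threads available to offloads
    mem_used: int = 0
    threads_used: int = 0


@dataclass
class Process:
    pid: int
    mem_mb: int            # programmer-declared memory requirement


def place_processes(queue, devices):
    """Assign each process to a device whose free memory covers its
    declared requirement. Processes that fit nowhere are returned in
    a waiting queue instead of being started (assumed policy)."""
    placement = {}
    waiting = deque()
    for proc in queue:
        fits = [d for d in devices if d.mem_mb - d.mem_used >= proc.mem_mb]
        if not fits:
            waiting.append(proc)   # never oversubscribe memory
            continue
        # Assumed tie-break: the device with the most free memory.
        target = max(fits, key=lambda d: d.mem_mb - d.mem_used)
        target.mem_used += proc.mem_mb
        placement[proc.pid] = target.name
    return placement, waiting


def try_start_offload(device, threads):
    """Admit an offload only if its requested thread count fits in the
    device's remaining hardware threads; otherwise the caller defers it."""
    if device.threads_used + threads > device.hw_threads:
        return False
    device.threads_used += threads
    return True
```

With two devices of 8000 MB each (an illustrative figure), processes declaring 6000, 5000, 4000, and 2000 MB would see the third held back until memory is released, while the fourth still starts. A real scheduler would also have to release resources when processes and offloads finish and enforce fairness so that waiting processes are not starved, both of which this sketch omits.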