Title | Application Productivity and Performance Evaluation of Transparent Locality-aware One-sided Communication Primitives |
Publication Type | Journal Article |
Year of Publication | 2017 |
Authors | Zhou, H, Gracia, J |
Secondary Title | International Journal of Networking and Computing |
Volume | 7 |
Publisher | International Journal of Networking and Computing |
Place Published | International Journal of Networking and Computing |
ISSN Number | 2185-2839 |
Keywords | blocking, DART, DART-MPI, Locality-awareness, MPI, non-blocking, one-sided communication, programmer productivity |
Abstract | Nowadays, the individual nodes of a distributed parallel computer consist of multi- or many-core processors, which allow more than one process to run per node. The large difference in communication speed within a node through shared memory versus across nodes through the network interconnect requires locality-aware communication schemes for any efficient distributed application. However, writing efficient locality-aware MPI code is complex and error-prone, because the developer has to use very different APIs for communication within and across nodes and must also manage inter-process synchronization. In this paper, we analyze and enhance a recent one-sided communication model, namely DART-MPI, which is implemented on top of MPI-3. In this runtime system, the complexities of handling locality-awareness for MPI memory access operations, whether remote or local, and the related synchronization calls are hidden inside the DART-MPI interfaces, resulting in concise code and improved application and developer productivity. We have carried out an in-depth evaluation of our DART-MPI system. First, a micro-benchmark helps to understand the primary performance overhead of the API implementation in the DART-MPI system, which is small and becomes negligible as message sizes grow. We then compare the performance of DART-MPI and flat MPI without locality awareness, in particular for blocking and non-blocking memory operations, using a realistic scientific application on a large-scale supercomputer. The comparison demonstrates that in most cases the DART-MPI version of this application performs better than the flat MPI version. Further, we compare the DART-MPI version to a functionally equivalent MPI version, which thus includes code to deal with data locality, and show that DART-MPI realizes almost the full potential of highly optimized MPI while maintaining high productivity for non-expert programmers. |
URL | http://www.ijnc.org/index.php/ijnc/article/view/147 |
Citation Key | 13375 |