Authors: AYŞE YILMAZER
Abstract: GPUs employ simple coherence mechanisms and require explicit use of costly synchronization operations to preserve data integrity. Local-scoped synchronization can lower the performance penalty of synchronization when sharing is confined to a subgroup of threads. Unfortunately, asymmetric sharing, an important dynamic sharing pattern, requires global-scoped synchronization because of possible accesses by remote sharers. Remote Scope Promotion (RSP) was introduced to take advantage of local-scoped synchronization for regular accesses while using scope promotion for occasional remote accesses. The first implementation of RSP takes a simple approach that performs costly cache operations on all L1 data caches when implementing scope promotion, and it therefore performs poorly on large-scale GPU systems. We present nRSP, which uses a static naming mechanism to identify the regularly accessing agent in asymmetric sharing and avoids applying costly coherence actions to every L1 data cache when implementing scope promotion. We evaluate nRSP using the timing-detailed Gem5-APU simulator, modeling a GPU system with 128 Compute Units, and show that it substantially lowers remote synchronization overhead, yielding a speedup of around 28% on average.
Keywords: Asymmetric synchronization, GPUs, remote scope promotion, work-stealing