Figure 1: The Tesla V100 Accelerator with Volta GV100 GPU. SXM2 Form Factor.

NVIDIA GPUs execute groups of threads known as warps in SIMT (Single Instruction, Multiple Thread) fashion. Many CUDA programs achieve high performance by taking advantage of warp execution. In this blog post we show how to use primitives introduced in CUDA 9 to make your warp-level programming safe and effective.

Warp-level Primitives

NVIDIA GPUs and the CUDA programming model employ an execution model called SIMT (Single Instruction, Multiple Thread). SIMT extends Flynn’s Taxonomy of computer architectures, which describes four classes of architectures in terms of their numbers of instruction and data streams. One of Flynn’s four classes, SIMD (Single Instruction, Multiple Data), is commonly used to describe architectures like GPUs. But there is a subtle but important difference between SIMD and SIMT. In a SIMD architecture, each instruction applies the same operation in parallel across many data elements. SIMD is typically implemented using processors with vector registers and execution units; a scalar thread issues vector instructions that execute in SIMD fashion. In a SIMT architecture, rather than a single thread issuing vector instructions applied to data vectors, multiple threads issue common instructions to arbitrary data.

The benefits of SIMT for programmability led NVIDIA’s GPU architects to coin a new name for this architecture, rather than describing it as SIMD. NVIDIA GPUs execute warps of 32 parallel threads using SIMT, which enables each thread to access its own registers, to load and store from divergent addresses, and to follow divergent control flow paths. The CUDA compiler and the GPU work together to ensure the threads of a warp execute the same instruction sequences together as frequently as possible to maximize performance.

While the high performance obtained by warp execution happens behind the scenes, many CUDA programs can achieve even higher performance by using explicit warp-level programming. Parallel programs often use collective communication operations, such as parallel reductions and scans. CUDA C++ supports such collective operations by providing warp-level primitives and Cooperative Groups collectives. The Cooperative Groups collectives (described in this previous post) are implemented on top of the warp primitives, on which this article focuses.

Part of a warp-level parallel reduction using shfl_down_sync().

Listing 1 shows an example of using warp-level primitives. It uses __shfl_down_sync() to perform a tree-reduction to compute the sum of the val variable held by each thread in a warp. At the end of the loop, val of the first thread in the warp contains the sum.

#define FULL_MASK 0xffffffff
for (int offset = 16; offset > 0; offset /= 2)
    val += __shfl_down_sync(FULL_MASK, val, offset);

A warp comprises 32 lanes, with each thread occupying one lane. For a thread at lane X in the warp, __shfl_down_sync(FULL_MASK, val, offset) gets the value of the val variable from the thread at lane X+offset of the same warp. The data exchange is performed between registers, and is more efficient than going through shared memory, which requires a load, a store, and an extra register to hold the address.
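
To show where Listing 1 typically sits, here is a minimal sketch of a complete kernel. The kernel name, the use of atomicAdd() to combine per-warp sums, and the assumption that blockDim.x is a multiple of 32 (so every warp is full and FULL_MASK is valid) are choices of this example, not part of the original listing.

__global__ void warp_reduce_sum(const int *input, int *total, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int val = (i < n) ? input[i] : 0;   // every lane participates, so FULL_MASK is valid

    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(FULL_MASK, val, offset);

    if (threadIdx.x % 32 == 0)          // lane 0 holds the warp's partial sum
        atomicAdd(total, val);
}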

CUDA 9 introduced three categories of new or updated warp-level primitives.

  1. Synchronized data exchange: exchange data between threads in warp.
    • __all_sync, __any_sync, __uni_sync, __ballot_sync
    • __shfl_sync, __shfl_up_sync, __shfl_down_sync, __shfl_xor_sync
    • __match_any_sync, __match_all_sync
  2. Active mask query: returns a 32-bit mask indicating which threads in a warp are active with the currently executing thread.
    • __activemask
  3. Thread synchronization: synchronize threads in a warp and provide a memory fence.
    • __syncwarp

Please see the CUDA Programming Guide for detailed descriptions of these primitives.
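
The vote and ballot primitives in the first category do not appear in the later listings, so here is a minimal sketch of their use. The kernel name and the single-warp launch (blockDim.x == 32) are assumptions of this example.

__global__ void vote_demo(const int *data, int *any_negative, unsigned *odd_lanes)
{
    int v = data[threadIdx.x];

    // __any_sync() returns a non-zero value to every lane if any lane's
    // predicate is true; here one lane records the warp-wide answer.
    if (__any_sync(FULL_MASK, v < 0) && threadIdx.x == 0)
        *any_negative = 1;

    // __ballot_sync() sets bit i of the result if lane i passed a non-zero
    // predicate; for the odd lanes of a full warp this is 0xaaaaaaaa.
    unsigned ballot = __ballot_sync(FULL_MASK, threadIdx.x % 2);
    if (threadIdx.x == 0)
        *odd_lanes = ballot;
}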

Synchronized Data Exchange

Each of the “synchronized data exchange” primitives performs a collective operation among a set of threads in a warp. For example, Listing 2 shows three of these. Each thread that calls __shfl_sync() or __shfl_down_sync() receives data from a thread in the same warp, and each thread that calls __ballot_sync() receives a bit mask representing all the threads in the warp that pass a true value for the predicate argument.

int __shfl_sync(unsigned mask, int var, int srcLane, int width=warpSize);
int __shfl_down_sync(unsigned mask, int var, unsigned delta,
                     int width=warpSize);
int __ballot_sync(unsigned mask, int predicate);

The set of threads that participates in invoking each primitive is specified using a 32-bit mask, which is the first argument of these primitives. All the participating threads must be synchronized for the collective operation to work correctly. Therefore, these primitives first synchronize the threads if they are not already synchronized.

A frequently asked question is “what should I use for the mask argument?”. You can consider the mask to mean the set of threads in the warp that should participate in the collective operation. This set of threads is determined by the program logic, and can usually be computed by some branch condition earlier in the program flow. Take the reduction code in Listing 1 as an example. Assume we want to compute the sum of all the elements of an array input[], whose size NUM_ELEMENTS is less than the number of threads in the thread block. We can use the method in Listing 3.

unsigned mask = __ballot_sync(FULL_MASK, threadIdx.x < NUM_ELEMENTS);
if (threadIdx.x < NUM_ELEMENTS) { 
    val = input[threadIdx.x]; 
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(mask, val, offset);
    …
}

The code uses the condition threadIdx.x < NUM_ELEMENTS to determine whether or not a thread will participate in the reduction. __ballot_sync() is used to compute the membership mask for the __shfl_down_sync() operation. __ballot_sync() itself uses FULL_MASK (0xffffffff for 32 threads) because we assume all threads will execute it.

On Volta and later GPU architectures, the data exchange primitives can be used in thread-divergent branches: branches where some threads in the warp take a different path than the others. Listing 4 shows an example where all the threads in a warp get the value of val from the thread at lane 0. The even- and odd-numbered threads take different branches of an if statement.

if (threadIdx.x % 2) {
    val += __shfl_sync(FULL_MASK, val, 0);
    …
}
else {
    val += __shfl_sync(FULL_MASK, val, 0);
    …
}

On the latest Volta (and future) GPUs, you can run library functions that use warp synchronous primitives without worrying whether the function is called in a thread-divergent branch.
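
For example, a library might expose a helper like the hypothetical warp_broadcast() below (a sketch, not an existing API). Because __shfl_sync() names its participants explicitly via FULL_MASK, on Volta and later the helper behaves the same whether it is called from converged code or from inside a divergent branch, as in Listing 4.

__device__ int warp_broadcast(int v, int src_lane) {
    // FULL_MASK names all 32 lanes, so the primitive itself reconverges
    // and synchronizes the warp before exchanging data.
    return __shfl_sync(FULL_MASK, v, src_lane);
}

if (threadIdx.x % 2) {
    val = warp_broadcast(val, 0);   // called from the odd-thread branch
} else {
    val = warp_broadcast(val, 0);   // called from the even-thread branch
}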

Active Mask Query

__activemask() returns a 32-bit unsigned int mask of all currently active threads in the calling warp. In other words, it shows the calling thread which threads in its warp are also executing the same __activemask(). This is useful for the “opportunistic warp-level programming” technique we explain later, as well as for debugging and understanding program behavior.

However, it’s important to use __activemask() correctly. Listing 5 illustrates an incorrect use. The code tries to perform the same sum reduction shown in Listing 3, but instead of using __ballot_sync() to compute the mask before the branch, it uses __activemask() inside the branch. This is incorrect, as it would result in partial sums instead of a total sum. The CUDA execution model does not guarantee that all threads taking the branch together will execute the __activemask() together. Implicit lock-step execution is not guaranteed, as we will explain. A corrected version follows the listing below.

//
// Incorrect use of __activemask()
//
if (threadIdx.x < NUM_ELEMENTS) { 
    unsigned mask = __activemask(); 
    val = input[threadIdx.x]; 
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(mask, val, offset);
    …
}
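
The fix is the pattern already shown in Listing 3: compute the membership mask with __ballot_sync() before entering the branch, and pass that mask to the shuffles inside it. It is restated here for comparison with the incorrect version above.

//
// Correct: compute the mask before the branch
//
unsigned mask = __ballot_sync(FULL_MASK, threadIdx.x < NUM_ELEMENTS);
if (threadIdx.x < NUM_ELEMENTS) {
    val = input[threadIdx.x];
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(mask, val, offset);
    …
}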

Warp Synchronization

When threads in a warp need to perform more complicated communications or collective operations than what the data exchange primitives provide, you can use the __syncwarp() primitive to synchronize threads in a warp. It is similar to the __syncthreads() primitive (which synchronizes all threads in the thread block) but at finer granularity.

void __syncwarp(unsigned mask=FULL_MASK);

The __syncwarp() primitive causes the executing thread to wait until all threads specified in mask have executed a __syncwarp() (with the same mask) before resuming execution. It also provides a memory fence to allow threads to communicate via memory before and after calling the primitive.

Listing 6 shows an example of shuffling the ownership of matrix elements among threads in a warp.

float val = get_value(…);
__shared__ float smem[4][8];
 
//   0  1  2  3  4  5  6  7 
//   8  9 10 11 12 13 14 15 
//  16 17 18 19 20 21 22 23
//  24 25 26 27 28 29 30 31
int x1 = threadIdx.x % 8;
int y1 = threadIdx.x / 8;
 
//   0  4  8 12 16 20 24 28
//   1  5  9 13 17 21 25 29
//   2  6 10 14 18 22 26 30
//   3  7 11 15 19 23 27 31
int x2 = threadIdx.x / 4;
int y2 = threadIdx.x % 4;
 
smem[y1][x1] = val;
__syncwarp();
val = smem[y2][x2];
 
use(val);

Assume a 1-D thread block is used (i.e. threadIdx.y is always 0). At the beginning of the code, each thread in a warp owns one element of a 4×8 matrix with row-major indexing. In other words, lane 0 owns [0][0] and lane 1 owns [0][1]. Each thread stores its value into the corresponding position of a 4×8 array in shared memory. Then __syncwarp() is used to ensure all threads have done the store, before each thread reads from a transposed position in the array. In the end, each thread in the warp owns one element of the matrix with column-major indexing: lane 0 owns [0][0] and lane 1 owns [1][0].

Make sure that __syncwarp() separates shared memory reads and writes to avoid race conditions. Listing 7 illustrates an incorrect use in a tree sum reduction in shared memory. There is a shared memory read followed by a shared memory write between every two __syncwarp() calls. The CUDA programming model does not guarantee that all the reads will be performed before all the writes, so there is a race condition.

unsigned tid = threadIdx.x;

// Incorrect use of __syncwarp()
shmem[tid] += shmem[tid+16]; __syncwarp();
shmem[tid] += shmem[tid+8];  __syncwarp();
shmem[tid] += shmem[tid+4];  __syncwarp();
shmem[tid] += shmem[tid+2];  __syncwarp();
shmem[tid] += shmem[tid+1];  __syncwarp();

Listing 8 fixes the race condition by inserting extra __syncwarp() calls. The CUDA compiler may elide some of these synchronization instructions in the final generated code depending on the target architecture (e.g. on pre-Volta architectures).

unsigned tid = threadIdx.x;
int v = 0;

v += shmem[tid+16]; __syncwarp();
shmem[tid] = v;     __syncwarp();
v += shmem[tid+8];  __syncwarp();
shmem[tid] = v;     __syncwarp();
v += shmem[tid+4];  __syncwarp();
shmem[tid] = v;     __syncwarp();
v += shmem[tid+2];  __syncwarp();
shmem[tid] = v;     __syncwarp();
v += shmem[tid+1];  __syncwarp();
shmem[tid] = v;

On the latest Volta (and future) GPUs, you can also use __syncwarp() in thread-divergent branches to synchronize threads from both branches. But once they return from the primitive, the threads will become divergent again. See Listing 13 for such an example.
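
As a minimal sketch of such cross-branch synchronization (assuming Volta or later, a single full warp, and placeholder values odd_value and even_value that are not from the original post):

__shared__ int s[32];
int val;
if (threadIdx.x % 2) {
    s[threadIdx.x] = odd_value;    // placeholder value for this sketch
    __syncwarp();                  // pairs with the __syncwarp() in the else branch
    val = s[threadIdx.x ^ 1];      // read the neighboring even lane's element
} else {
    s[threadIdx.x] = even_value;   // placeholder value for this sketch
    __syncwarp();                  // pairs with the __syncwarp() in the if branch
    val = s[threadIdx.x ^ 1];      // read the neighboring odd lane's element
}

Because both calls use the default FULL_MASK, the odd and even threads synchronize with each other at the __syncwarp(), and its memory fence makes the stores visible before the loads; the threads then diverge again after the call returns.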

Opportunistic Warp-level Programming

As we showed in the Synchronized Data Exchange section, the membership mask used in the synchronized data exchange primitives is often computed before a branch condition in the program flow. In many cases, the program needs to pass the mask along the program flow; for example, as a function argument when warp-level primitives are used inside a function. This may be difficult if you want to use warp-level programming inside a library function but you cannot change the function interface.

Some computations can use whatever threads happen to be executing together. We can use a technique called opportunistic warp-level programming, as the following example illustrates. (See this post on warp-aggregated atomics for more information on the algorithm, and this post for discussion of how Cooperative Groups makes the implementation much simpler.)

// increment the value at ptr by 1 and return the old value
__device__ int atomicAggInc(int *ptr) {
    // lane_id() is assumed to be a helper that returns the calling thread's
    // lane index within its warp (e.g. threadIdx.x & 31).
    unsigned mask = __match_any_sync(__activemask(), (unsigned long long)ptr);
    int leader = __ffs(mask) - 1;            // select the lowest active lane as leader
    int res;
    if (lane_id() == leader)                 // leader does the update
        res = atomicAdd(ptr, __popc(mask));
    res = __shfl_sync(mask, res, leader);    // get leader's old value
    return res + __popc(mask & ((1u << lane_id()) - 1)); // compute this thread's old value
}

atomicAggInc() atomically increments the value pointed to by ptr by 1 and returns the old value. It uses the atomicAdd() function, which may incur contention. To reduce contention, atomicAggInc() replaces the per-thread atomicAdd() operation with a per-warp atomicAdd(). The __activemask() call finds the set of threads in the warp that are about to perform the atomic operation. __match_any_sync() returns the bit mask of the threads that have the same value of ptr, partitioning the incoming threads into groups whose members have the same ptr value. Each group elects a leader thread, which performs the atomicAdd() for the whole group. Every thread then gets the leader’s old value via __shfl_sync(). The final line computes and returns the old value the current thread would have received if it had called atomicAdd(ptr, 1) itself instead of atomicAggInc().
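
A typical use is warp-aggregated output allocation, sketched below. The kernel name and the filtering predicate are illustrative, not from the original post.

// Keep only the positive inputs, appending them to out[] in arbitrary order.
__global__ void filter_positive(const int *in, int *out, int *count, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && in[i] > 0) {
        int idx = atomicAggInc(count);   // one atomicAdd per group of threads
        out[idx] = in[i];                // instead of one per thread
    }
}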

Implicit Warp-Synchronous Programming is Unsafe

CUDA toolkits prior to version 9.0 provided a (now legacy) version of warp-level primitives. Compared with the CUDA 9 primitives, the legacy primitives do not accept a mask argument. For example, int __any(int predicate) is the legacy version of int __any_sync(unsigned mask, int predicate).

The mask argument, as explained previously, specifies the set of threads in a warp that must participate in the primitives. The new primitives perform intra-warp thread-level synchronization if the threads specified by the mask are not already synchronized during execution.

The legacy warp-level primitives do not allow programmers to specify the required threads and do not perform synchronization. Therefore, the threads that must participate in the warp-level operation are not explicitly expressed by the CUDA program. The correctness of such a program depends on implicit warp-synchronous behavior, which may change from one hardware architecture to another, from one CUDA toolkit release to another (due to changes in compiler optimizations, for example), or even from one run-time execution to another. Such implicit warp-synchronous programming is unsafe and may not work correctly.

For example, in the following code (Listing 10), let’s assume all 32 threads in a warp execute line 2 together. The if statement at line 4 causes the threads to diverge, with the odd threads calling foo() at line 5 and the even threads calling bar() at line 8.

// Assuming all 32 threads in a warp execute line 2 (this assert) together.
assert(__ballot(1) == FULL_MASK);
int result;
if (threadIdx.x % 2) {
    result = foo();
}
else {
    result = bar();
}
unsigned ballot_result = __ballot(result);

The CUDA compiler and the hardware will try to re-converge the threads at line 10 for better performance. But this re-convergence is not guaranteed. Therefore, the ballot_result may not contain the ballot result from all 32 threads.

Calling the new __syncwarp() primitive at line 10 before __ballot(), as illustrated in Listing 11, does not fix the problem either. This is again implicit warp-synchronous programming. It assumes that threads in the same warp that are once synchronized will stay synchronized until the next thread-divergent branch. Although it is often true, it is not guaranteed in the CUDA programming model.

__syncwarp();
unsigned ballot_result = __ballot(result);

The correct fix is to use __ballot_sync() as in Listing 12.

unsigned ballot_result = __ballot_sync(FULL_MASK, result);

A common mistake is to assume that calling __syncwarp() before and/or after a legacy warp-level primitive is functionally equivalent to calling the sync version of the primitive. For example, is __syncwarp(); v = __shfl(v, 0); __syncwarp(); the same as v = __shfl_sync(FULL_MASK, v, 0)? The answer is no, for two reasons. First, if the sequence is used in a thread-divergent branch, then __shfl(v, 0) won’t be executed by all threads together. Listing 13 shows an example. The __syncwarp() calls at line 3 and line 7 would ensure foo() is called by all threads in the warp before line 4 or line 8 is executed. Once threads leave the __syncwarp(), the odd threads and the even threads become divergent again. Therefore, the __shfl(v, 0) at line 4 will get an undefined value because lane 0 is inactive when line 4 is executed. __shfl_sync(FULL_MASK, v, 0) can be used in thread-divergent branches without this problem.

v = foo();
if (threadIdx.x % 2) {
    __syncwarp();
    v = __shfl(v, 0);    // line 4 gets an undefined result because lane 0
    __syncwarp();        // is not active when line 4 executes; lines 4 and 8
} else {                 // execute divergently
    __syncwarp();
    v = __shfl(v, 0);
    __syncwarp();
}

Second, even when the sequence is called by all the threads together, the CUDA execution model does not guarantee threads will stay convergent after leaving __syncwarp(), as Listing 14 shows. Implicit lock-step execution is not guaranteed. Remember, thread convergence is guaranteed only within explicitly synchronous warp-level primitives.

assert(__activemask() == FULL_MASK); // assume this is true
__syncwarp();
assert(__activemask() == FULL_MASK); // this may fail

Because using them can lead to unsafe programs, the legacy warp-level primitives are deprecated starting in CUDA 9.0.

Update Legacy Warp-Level Programming

If your program uses legacy warp-level primitives or any form of implicit warp-synchronous programming (such as communicating between threads of a warp without synchronization), you should update the code to use the sync version of the primitives. You may also want to restructure your code to use Cooperative Groups, which provides a higher level of abstraction as well as new features such as multi-block synchronization.

The trickiest part of using the warp-level primitives is figuring out the membership mask to be used. We hope the above sections give you a good idea of where to start and what to look out for. Here is a list of suggestions:

  1. Don’t just use FULL_MASK (i.e. 0xffffffff for 32 threads) as the mask value. If not all threads in the warp can reach the primitive according to the program logic, then using FULL_MASK may cause the program to hang.
  2. Don’t just use __activemask() as the mask value. __activemask() tells you what threads happen to be convergent when the function is called, which can be different from what you want to be in the collective operation.
  3. Do analyze the program logic and understand the membership requirements. Compute the mask ahead based on your program logic.
  4. If your program does opportunistic warp-synchronous programming, use “detective” functions such as __activemask() and __match_all_sync() to find the right mask.
  5. Use __syncwarp() to separate operations with intra-warp dependences. Do not assume lock-step execution.

One last trick. If your existing CUDA program gives a different result on Volta architecture GPUs, and you suspect the difference is caused by Volta’s new independent thread scheduling (which can change warp synchronous behavior), you may want to recompile your program with the nvcc options -arch=compute_60 -code=sm_70. Programs compiled this way opt in to Pascal’s thread scheduling. When used selectively, this can help pin down the culprit module more quickly, allowing you to update the code to avoid implicit warp-synchronous programming.

Volta independent thread scheduling enables interleaved execution of statements from divergent branches. This enables execution of fine-grain parallel algorithms where threads within a warp may synchronize and communicate.