
Using Shared Memory in CUDA C/C++


In the previous post, I looked at how global memory accesses by a group of threads can be coalesced into a single transaction, and how alignment and stride affect coalescing for various generations of CUDA hardware. For recent versions of CUDA hardware, misaligned data accesses are not a big issue. However, striding through global memory is problematic regardless of the generation of the CUDA hardware, and would seem to be unavoidable in many cases, such as when accessing elements in a multidimensional array along the second and higher dimensions. However, it is possible to coalesce memory access in such cases if we use shared memory. Before I show you how to avoid striding through global memory in the next post, first I need to describe shared memory in some detail.

Shared Memory

Because it is on-chip, shared memory is much faster than local and global memory. In fact, shared memory latency is roughly 100x lower than uncached global memory latency (provided that there are no bank conflicts between the threads, which we will examine later in this post). Shared memory is allocated per thread block, so all threads in the block have access to the same shared memory. Threads can access data in shared memory loaded from global memory by other threads within the same thread block. This capability (combined with thread synchronization) has a number of uses, such as user-managed data caches, high-performance cooperative parallel algorithms (parallel reductions, for example), and to facilitate global memory coalescing in cases where it would otherwise not be possible.

Thread Synchronization

When sharing data between threads, we need to be careful to avoid race conditions, because while threads in a block run logically in parallel, not all threads can execute physically at the same time. Let’s say that two threads A and B each load a data element from global memory and store it to shared memory. Then, thread A wants to read B’s element from shared memory, and vice versa. Let’s assume that A and B are threads in two different warps. If B has not finished writing its element before A tries to read it, we have a race condition, which can lead to undefined behavior and incorrect results.

To ensure correct results when parallel threads cooperate, we must synchronize the threads. CUDA provides a simple barrier synchronization primitive, __syncthreads(). A thread’s execution can only proceed past a __syncthreads() after all threads in its block have executed the __syncthreads(). Thus, we can avoid the race condition described above by calling __syncthreads() after the store to shared memory and before any threads load from shared memory. It’s important to be aware that calling __syncthreads() in divergent code is undefined and can lead to deadlock—all threads within a thread block must call __syncthreads() at the same point.
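To make the ordering concrete, here is a minimal sketch of the store/barrier/load pattern (the kernel name and the neighbor-swap access are illustrative only, not part of the example that follows); the comment at the end shows the divergent-branch pitfall to avoid.

__global__ void swapWithNeighbor(int *d)
{
  __shared__ int s[256];   // assumes blockDim.x <= 256
  int t = threadIdx.x;

  s[t] = d[t];             // each thread stores its own element
  __syncthreads();         // barrier: all stores are now visible to the block

  d[t] = s[t ^ 1];         // safe to read an element written by another thread

  // Pitfall: calling __syncthreads() in divergent code, e.g.
  //   if (t < 32) __syncthreads();
  // is undefined and can deadlock the block.
}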

Shared Memory Example

Declare shared memory in CUDA C/C++ device code using the __shared__ variable declaration specifier. There are multiple ways to declare shared memory inside a kernel, depending on whether the amount of memory is known at compile time or at run time. The following complete code (available on GitHub) illustrates various methods of using shared memory.

#include <stdio.h>

__global__ void staticReverse(int *d, int n)
{
  __shared__ int s[64];
  int t = threadIdx.x;
  int tr = n-t-1;
  s[t] = d[t];
  __syncthreads();
  d[t] = s[tr];
}

__global__ void dynamicReverse(int *d, int n)
{
  extern __shared__ int s[];
  int t = threadIdx.x;
  int tr = n-t-1;
  s[t] = d[t];
  __syncthreads();
  d[t] = s[tr];
}

int main(void)
{
  const int n = 64;
  int a[n], r[n], d[n];

  for (int i = 0; i < n; i++) {
    a[i] = i;
    r[i] = n-i-1;
    d[i] = 0;
  }

  int *d_d;
  cudaMalloc(&d_d, n * sizeof(int)); 

  // run version with static shared memory
  cudaMemcpy(d_d, a, n*sizeof(int), cudaMemcpyHostToDevice);
  staticReverse<<<1,n>>>(d_d, n);
  cudaMemcpy(d, d_d, n*sizeof(int), cudaMemcpyDeviceToHost);
  for (int i = 0; i < n; i++) 
    if (d[i] != r[i]) printf("Error: d[%d]!=r[%d] (%d, %d)\n", i, i, d[i], r[i]);

  // run dynamic shared memory version
  cudaMemcpy(d_d, a, n*sizeof(int), cudaMemcpyHostToDevice);
  dynamicReverse<<<1,n,n*sizeof(int)>>>(d_d, n);
  cudaMemcpy(d, d_d, n * sizeof(int), cudaMemcpyDeviceToHost);
  for (int i = 0; i < n; i++) 
    if (d[i] != r[i]) printf("Error: d[%d]!=r[%d] (%d, %d)\n", i, i, d[i], r[i]);
}

This code reverses the data in a 64-element array using shared memory. The two kernels are very similar, differing only in how the shared memory arrays are declared and how the kernels are invoked.

Static Shared Memory

If the shared memory array size is known at compile time, as in the staticReverse kernel, then we can explicitly declare an array of that size, as we do with the array s.

__global__ void staticReverse(int *d, int n)
{
  __shared__ int s[64];
  int t = threadIdx.x;
  int tr = n-t-1;
  s[t] = d[t];
  __syncthreads();
  d[t] = s[tr];
}

In this kernel, t and tr are the two indices representing the original and reverse order, respectively. Threads copy the data from global memory to shared memory with the statement s[t] = d[t], and the reversal is done two lines later with the statement d[t] = s[tr]. But before executing this final line, in which each thread accesses data in shared memory that was written by another thread, we must make sure all threads have completed their loads to shared memory by calling __syncthreads().

The reason shared memory is used in this example is to facilitate global memory coalescing on older CUDA devices (Compute Capability 1.1 or earlier). Optimal global memory coalescing is achieved for both reads and writes because global memory is always accessed through the linear, aligned index t. The reversed index tr is only used to access shared memory, which does not have the sequential access restrictions of global memory for optimal performance. The only performance issue with shared memory is bank conflicts, which we will discuss later. (Note that on devices of Compute Capability 1.2 or later, the memory system can fully coalesce even the reversed index stores to global memory. But this technique is still useful for other access patterns, as I’ll show in the next post.)

Dynamic Shared Memory

The second kernel in this example uses dynamically allocated shared memory, which is needed when the amount of shared memory is not known at compile time. In this case the shared memory allocation size per thread block must be specified (in bytes) using an optional third execution configuration parameter, as in the following excerpt.

dynamicReverse<<<1, n, n*sizeof(int)>>>(d_d, n);

The dynamic shared memory kernel, dynamicReverse(), declares the shared memory array using an unsized extern array syntax, extern __shared__ int s[] (note the empty brackets and use of the extern specifier). The size is implicitly determined from the third execution configuration parameter when the kernel is launched. The remainder of the kernel code is identical to the staticReverse() kernel.

What if you need multiple dynamically sized arrays in a single kernel? You must declare a single extern unsized array as before, and use pointers into it to divide it into multiple arrays, as in the following excerpt.

extern __shared__ int s[];
int *integerData = s;                        // nI ints
float *floatData = (float*)&integerData[nI]; // nF floats
char *charData = (char*)&floatData[nF];      // nC chars

In the kernel launch, specify the total shared memory needed, as in the following.

myKernel<<<gridSize, blockSize, nI*sizeof(int)+nF*sizeof(float)+nC*sizeof(char)>>>(...);
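As a complete (if hypothetical) illustration, the following kernel sketch carves the three arrays out of a single dynamic allocation; the kernel name, parameters, and writes are placeholders. Note that the arrays are ordered from larger to smaller element alignment (int and float before char), so each derived pointer is naturally aligned for its type.

__global__ void myKernel(int nI, int nF, int nC)
{
  extern __shared__ int s[];
  int   *integerData = s;                        // nI ints
  float *floatData   = (float*)&integerData[nI]; // nF floats
  char  *charData    = (char*)&floatData[nF];    // nC chars

  int t = threadIdx.x;
  if (t < nI) integerData[t] = t;     // illustrative writes only
  if (t < nF) floatData[t]   = 1.0f;
  if (t < nC) charData[t]    = 'a';
}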

Shared memory bank conflicts

To achieve high memory bandwidth for concurrent accesses, shared memory is divided into equally sized memory modules (banks) that can be accessed simultaneously. Therefore, any memory load or store of n addresses that spans b distinct memory banks can be serviced simultaneously, yielding an effective bandwidth that is b times as high as the bandwidth of a single bank.

However, if multiple threads’ requested addresses map to the same memory bank, the accesses are serialized. The hardware splits a conflicting memory request into as many separate conflict-free requests as necessary, decreasing the effective bandwidth by a factor equal to the number of colliding memory requests. An exception is the case where all threads in a warp address the same shared memory address, resulting in a broadcast. Devices of compute capability 2.0 and higher have the additional ability to multicast shared memory accesses, meaning that multiple accesses to the same location by any number of threads within a warp are served simultaneously.
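As a hypothetical illustration (not part of the array-reversal example), the stride of a shared memory index determines whether the threads of a warp conflict: with a stride of 1, each thread of a warp maps to a distinct bank; with a stride of 2 on a 32-bank device, pairs of threads collide (a 2-way conflict); with a stride of 32, all 32 threads hit the same bank and the request is fully serialized.

__global__ void stridedSharedAccess(int *out, int stride)
{
  __shared__ int s[32 * 32];
  int t = threadIdx.x;

  s[t * stride] = t;       // stride 1: conflict-free; stride 2: 2-way conflict;
  __syncthreads();         // stride 32: 32-way conflict (fully serialized)
  out[t] = s[t * stride];
}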

To minimize bank conflicts, it is important to understand how memory addresses map to memory banks. Shared memory banks are organized such that successive 32-bit words are assigned to successive banks and the bandwidth is 32 bits per bank per clock cycle. For devices of compute capability 1.x, the warp size is 32 threads and the number of banks is 16. A shared memory request for a warp is split into one request for the first half of the warp and one request for the second half of the warp. Note that no bank conflict occurs if only one memory location per bank is accessed by a half warp of threads.

For devices of compute capability 2.0, the warp size is 32 threads and the number of banks is also 32. A shared memory request for a warp is not split as with devices of compute capability 1.x, meaning that bank conflicts can occur between threads in the first half of a warp and threads in the second half of the same warp.

Devices of compute capability 3.x have configurable bank size, which can be set using cudaDeviceSetSharedMemConfig() to either four bytes (cudaSharedMemBankSizeFourByte, the default) or eight bytes (cudaSharedMemBankSizeEightByte). Setting the bank size to eight bytes can help avoid shared memory bank conflicts when accessing double precision data.
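A minimal host-side sketch of this configuration follows; doubleKernel is a placeholder for any kernel that stages double-precision data in shared memory.

#include <cuda_runtime.h>

void configureBanksForDoubles()
{
  // Kernels launched after this call see eight-byte shared memory banks,
  // so consecutive doubles fall into consecutive banks.
  cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);
  // doubleKernel<<<grid, block>>>(...);

  // Restore the default four-byte banks afterwards if desired.
  cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeFourByte);
}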

Configuring the amount of shared memory

On devices of compute capability 2.x and 3.x, each multiprocessor has 64KB of on-chip memory that can be partitioned between L1 cache and shared memory. For devices of compute capability 2.x, there are two settings: 48KB shared memory / 16KB L1 cache, and 16KB shared memory / 48KB L1 cache. By default the 48KB shared memory setting is used. This can be configured at run time from the host, either for all kernels using cudaDeviceSetCacheConfig() or on a per-kernel basis using cudaFuncSetCacheConfig(). These accept one of three options: cudaFuncCachePreferNone, cudaFuncCachePreferShared, and cudaFuncCachePreferL1. The driver will honor the specified preference except when a kernel requires more shared memory per thread block than is available in the specified configuration. Devices of compute capability 3.x allow a third setting of 32KB shared memory / 32KB L1 cache, which can be selected with the option cudaFuncCachePreferEqual.
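The following host-side sketch shows both forms; someKernel is a placeholder for any kernel you want to give a per-kernel preference.

#include <cuda_runtime.h>

__global__ void someKernel(float *data) { /* ... */ }

void setCachePreferences()
{
  // Device-wide default for all subsequent kernel launches:
  // prefer 48KB shared memory / 16KB L1.
  cudaDeviceSetCacheConfig(cudaFuncCachePreferShared);

  // Per-kernel override: prefer a larger L1 cache for this kernel only.
  cudaFuncSetCacheConfig(someKernel, cudaFuncCachePreferL1);
}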

Summary

Shared memory is a powerful feature for writing well optimized CUDA code. Access to shared memory is much faster than global memory access because it is located on chip. Because shared memory is shared by threads in a thread block, it provides a mechanism for threads to cooperate. One way to use shared memory that leverages such thread cooperation is to enable global memory coalescing, as demonstrated by the array reversal in this post. By reversing the array using shared memory we are able to have all global memory reads and writes performed with unit stride, achieving full coalescing on any CUDA GPU. In the next post I will continue our discussion of shared memory by using it to optimize a matrix transpose.
